The Common Crawl WAT files contain a lot of XML/HTML character entities which should be unescaped. For links/URLs, more than 10% of the values are affected. Examples (HTML snippet + WAT extract):
<img src="https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001" alt="Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;">
{
"path": "IMG@/src",
"alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
"url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001"
},
- Note that the problem applies to all kinds of XML/HTML character entities:
<a href="https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&amp;MSRSMAG=FI">
EU Customer Service
</a>
{
"path": "A@/href",
"text": "EU Customer Service",
"url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&amp;MSRSMAG=FI"
},
},
<a class="pdb-meta-link" href="http://www.madsack.de/"
target="_blank" rel="nofollow"
>© Verlagsgesellschaft Madsack GmbH & Co. KG</a>
{
"path": "A@/href",
"rel": "nofollow",
"text": "© Verlagsgesellschaft Madsack GmbH & Co. KG",
"url": "http://www.madsack.de/",
"target": "_blank"
},
<meta property="og:description" content="As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&#39;s defence" >
{
"property": "og:description",
"content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&#39;s defence"
},
The WAT extractor should replace the character entities with the corresponding characters to simplify the processing of the WAT files.
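Until the extractor does this, consumers of WAT files can decode the entities themselves. The following is a minimal Python sketch of that workaround, not the extractor's implementation: `unescape_record` is a hypothetical helper, and the sample records are modelled on the examples above, with the entities written as they would appear in the raw HTML.

```python
import html


def unescape_record(record):
    """Return a copy of a WAT link/meta record with character entities decoded.

    Hypothetical helper for illustration; only string values are touched.
    """
    return {
        key: html.unescape(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


# Sample records modelled on the WAT extracts above (entities still escaped).
link = {
    "path": "A@/href",
    "text": "EU Customer Service",
    "url": "https://secure.customersvc.com/wes/servlet/Show"
           "?WESPAGE=iam/pages/home.jsp&amp;MSRSMAG=FI",
}
meta = {
    "property": "og:description",
    "content": "who would transform Jurgen Klopp&#39;s defence",
}

clean_link = unescape_record(link)
clean_meta = unescape_record(meta)
print(clean_link["url"])
print(clean_meta["content"])
```

The decoding should be applied exactly once per value. Running it again on an already clean URL can corrupt query parameters whose names happen to match an entity name, which is another reason to do the unescaping in one place, in the extractor.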