WAT: unescape XML/HTML character entities #14

@sebastian-nagel

The Common Crawl WAT files contain a lot of XML/HTML character entities which should be unescaped. For links/URLs, more than 10% of the values are affected. Examples (HTML snippet + WAT extract):

<img src="https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001" alt="Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;">

{
  "path": "IMG@/src",
  "alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
  "url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001"
},
  • note that the problem applies to all kinds of XML/HTML character entities:
<a href="https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI">
  EU Customer Service
</a>

{
  "path": "A@/href",
  "text": "EU Customer Service",
  "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI"
},
  • in text
<a class="pdb-meta-link" href="http://www.madsack.de/"
   target="_blank" rel="nofollow"
   >© Verlagsgesellschaft Madsack GmbH &amp; Co. KG</a>

{
  "path": "A@/href",
  "rel": "nofollow",
  "text": "© Verlagsgesellschaft Madsack GmbH &amp; Co. KG",
  "url": "http://www.madsack.de/",
  "target": "_blank"
},
  • and attribute values
<meta property="og:description" content="As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&amp;#39;s defence" >

{
  "property": "og:description",
  "content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&amp;#39;s defence"
},

The WAT extractor should replace character entities with the corresponding characters to simplify downstream processing of the WAT files.
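
Until the extractor does this, a consumer of WAT files can unescape the values on their side. The following is a minimal sketch in Python, not the extractor's implementation: the helper name `unescape_record` is hypothetical, and it assumes the WAT JSON has already been parsed into dicts, lists and strings. It relies on `html.unescape` from the standard library, which handles named, decimal and hexadecimal character references in a single pass.

```python
import html


def unescape_record(value):
    """Recursively unescape character entities in every string value.

    Hypothetical helper for post-processing parsed WAT JSON; dict keys
    are left unchanged, only values are unescaped.
    """
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_record(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_record(v) for k, v in value.items()}
    return value  # numbers, booleans, None


# Two of the WAT extracts quoted in this issue.
link = {
    "path": "A@/href",
    "text": "EU Customer Service",
    "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI",
}
meta = {
    "property": "og:description",
    "content": "who would transform Jurgen Klopp&amp;#39;s defence",
}

fixed_link = unescape_record(link)
fixed_meta = unescape_record(meta)

print(fixed_link["url"])
print(fixed_meta["content"])
```

Note that the `og:description` value is double-escaped in the page source itself (`&amp;#39;`), so a single unescaping pass yields `Klopp&#39;s`, which is the literal attribute value the page author wrote. Whether to apply a second pass to such values is a separate decision and is not something a single, correct unescape should do implicitly.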
