Fix URL canonicalization to handle non-UTF-8 encoded characters. Fixes #6 #28

Closed
tfmorris wants to merge 3 commits into commoncrawl:master from tfmorris:6-surt-non-utf-8-encoding

Conversation

@tfmorris

Fixes #6

Fixes an issue where percent signs (%) were double-escaped in percent-encoded (hex) characters that use an encoding other than UTF-8.

There is a separate issue with the hex characters being lower case instead of upper case as recommended by both Google canonicalization guidelines (V2) and RFC 3986, which this patch does NOT address.
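
To illustrate the class of bug described above, here is a minimal Python sketch. It is not the code touched by this PR: the function names `canonicalize_buggy` and `canonicalize_fixed` are hypothetical, and the decode-then-re-encode strategy is an assumption made only to show how a `%` can end up escaped twice when the percent-encoded bytes are not valid UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes


def canonicalize_buggy(path: str) -> str:
    """Hypothetical canonicalizer that assumes every escape is UTF-8."""
    raw = unquote_to_bytes(path)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: escape the ORIGINAL string. The '%' of each existing
        # escape is itself escaped, so '%E9' becomes '%25E9'.
        return quote(path, safe="/")
    return quote(text, safe="/")


def canonicalize_fixed(path: str) -> str:
    """Hypothetical fix: re-encode the raw bytes, whatever their charset."""
    return quote(unquote_to_bytes(path), safe="/")


if __name__ == "__main__":
    latin1 = "/caf%E9"      # 'é' encoded as ISO-8859-1, invalid as UTF-8
    utf8 = "/caf%C3%A9"     # 'é' encoded as UTF-8
    print(canonicalize_buggy(latin1))  # /caf%25E9 (double escaped)
    print(canonicalize_fixed(latin1))  # /caf%E9
    print(canonicalize_buggy(utf8))    # /caf%C3%A9
    print(canonicalize_fixed(utf8))    # /caf%C3%A9
```

The sketch ignores reserved characters such as an encoded slash (`%2F`), which a real canonicalizer would need to treat separately.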

@sebastian-nagel

Thanks, @tfmorris! We'll have a look.

@sebastian-nagel

Just for reference - the discussion in CC's Google group: https://groups.google.com/g/common-crawl/c/ek5bme_RIuM

Opened PR upstream: iipc#102

@sebastian-nagel

Closing this in favor of the upstream PR iipc#102, which has been integrated.

The fix was already in use for the January 2025 crawl. Verification of the solution:

  1. verified that the problem exists in CC-MAIN-2024-51:
    • run Athena query
      select count(*) as count, url_host_tld
      from ccindex
      where crawl = 'CC-MAIN-2024-51'
        and regexp_like(url_surtkey, '%25[0-9a-zA-Z][0-9a-zA-Z]')
        and not (regexp_like(url_path, '%25[0-9a-zA-Z][0-9a-zA-Z]')
                 or regexp_like(url_query, '%25[0-9a-zA-Z][0-9a-zA-Z]'))
      group by url_host_tld
      order by count desc;
    • the query looks for %25 followed by two alphanumeric characters (intended to match a hex-encoded byte) in url_surtkey, where no such sequence is present in the original URL (path or query)
    • top results
       count   url_host_tld
       272603  com
       145292  jp
       100009  ru
       45287   net
       43662   cn
       36144   org
       35896   pl
       29894   de
       18380   kr
      
    • TLDs where charsets different from UTF-8 are still frequently used are well represented
    • 918,754 matches in total
  2. run the same query but for CC-MAIN-2025-05
    • zero results 🚀
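
The per-row condition used in the Athena query above can be sketched in Python for a single index record. This is only an illustration: the function name `is_double_escaped` and the sample values are hypothetical, and the sketch assumes all three fields are plain strings (it does not model SQL NULL handling).

```python
import re

# Same pattern as in the query: '%25' followed by two alphanumeric characters.
ESCAPED_PERCENT = re.compile(r"%25[0-9a-zA-Z][0-9a-zA-Z]")


def is_double_escaped(url_surtkey: str, url_path: str, url_query: str) -> bool:
    """True if the SURT key contains an escaped percent sequence that
    appears in neither the original path nor the original query."""
    in_surt = ESCAPED_PERCENT.search(url_surtkey) is not None
    in_path = ESCAPED_PERCENT.search(url_path) is not None
    in_query = ESCAPED_PERCENT.search(url_query) is not None
    return in_surt and not (in_path or in_query)


if __name__ == "__main__":
    # Affected record: '%25e9' exists only in the SURT key.
    print(is_double_escaped("com,example)/caf%25e9", "/caf%E9", ""))   # True
    # The original URL already contains '%2541', so it is not counted.
    print(is_double_escaped("com,example)/a%2541", "/a%2541", ""))     # False
    # Correct key, nothing to report.
    print(is_double_escaped("com,example)/caf%e9", "/caf%E9", ""))     # False
```

Excluding records whose original path or query already matches the pattern keeps legitimately escaped percent signs out of the count.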

See also: https://commoncrawl.org/errata/surt-urls-do-not-properly-encode-non-utf-8-percent-encoded-characters

Thanks again, @tfmorris!

@tfmorris deleted the 6-surt-non-utf-8-encoding branch on February 12, 2025 19:17

Successfully merging this pull request may close these issues.

WaybackURLKeyMaker to keep non-utf8 percent encodings