Fix URL canonicalization to handle non-UTF-8 encoded characters. Fixes #6 by tfmorris · Pull Request #28 · commoncrawl/ia-web-commons

tfmorris · 2023-08-27T17:23:22Z

Fixes #6

Fixes an issue where the percent sign (%) of a percent-encoded (hex-encoded) character was escaped a second time, as %25, when the character uses an encoding other than UTF-8.

There is a separate issue with the hex characters being lower case instead of upper case as recommended by both Google canonicalization guidelines (V2) and RFC 3986, which this patch does NOT address.
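To illustrate the double escaping described above, here is a minimal Python sketch. It is not the ia-web-commons code (which is Java); the function names and the example path `/caf%E9` (an ISO-8859-1 encoded "é") are made up for illustration, and the buggy strategy shown is only an assumption about how such a double escape can arise.

```python
from urllib.parse import quote, unquote_to_bytes


def naive_canonicalize(path: str) -> str:
    """Hypothetical buggy approach: unescape only if the bytes are valid UTF-8."""
    raw = unquote_to_bytes(path)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the still-escaped input string.
        text = path
    # Re-escaping now also escapes the literal '%' of "%E9" as "%25".
    return quote(text, safe="/")


def bytewise_canonicalize(path: str) -> str:
    """Sketch of a fix: unescape to raw bytes and re-escape the bytes as-is."""
    return quote(unquote_to_bytes(path), safe="/")


latin1_path = "/caf%E9"     # "é" encoded as ISO-8859-1
utf8_path = "/caf%C3%A9"    # "é" encoded as UTF-8

print(naive_canonicalize(latin1_path))     # /caf%25E9  (double-escaped)
print(bytewise_canonicalize(latin1_path))  # /caf%E9
print(naive_canonicalize(utf8_path))       # /caf%C3%A9
print(bytewise_canonicalize(utf8_path))    # /caf%C3%A9
```

Python's `quote` emits upper-case hex digits, so the sketch does not reproduce the separate lower-case issue mentioned above.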

sebastian-nagel · 2023-08-28T13:34:38Z

Thanks, @tfmorris! We'll have a look.

sebastian-nagel · 2024-12-05T18:00:14Z

Just for reference - the discussion in CC's Google group: https://groups.google.com/g/common-crawl/c/ek5bme_RIuM

Opened PR upstream: iipc#102

sebastian-nagel · 2025-02-12T19:08:47Z

Closing this in favor of the upstream PR iipc#102, which is integrated into CCF's Nutch fork in commoncrawl/nutch@6b2d9ea and into ia-web-commons in 1446d35 / 3907d24 / f7be47b.

The fix was already in use for the January 2025 crawl. Verification of the solution:

verified that the problem exists in CC-MAIN-2024-51 by running this Athena query:

select count(*) as count, url_host_tld
from ccindex
where crawl = 'CC-MAIN-2024-51'
  and regexp_like(url_surtkey, '%25[0-9a-zA-Z][0-9a-zA-Z]')
  and not (regexp_like(url_path, '%25[0-9a-zA-Z][0-9a-zA-Z]')
           or regexp_like(url_query, '%25[0-9a-zA-Z][0-9a-zA-Z]'))
group by url_host_tld
order by count desc;

looks for matches of %25 followed by two alphanumeric characters (the pattern [0-9a-zA-Z] is looser than strict hex digits) in url_surtkey that are not present in the original URL (path or query)

top results

 count   url_host_tld
 272603  com
 145292  jp
 100009  ru
  45287  net
  43662  cn
  36144  org
  35896  pl
  29894  de
  18380  kr

TLDs where charsets different from UTF-8 are still frequently used are well represented
918,754 matches in total

run the same query but for CC-MAIN-2025-05
- zero results 🚀
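
The per-record condition used in the Athena query above can be sketched in Python as follows. This is only an illustration: the function name and the sample values are hypothetical, the arguments are assumed to be plain strings (an empty string when there is no query), and SQL NULL handling is not modeled.

```python
import re

# Same pattern as in the query: "%25" followed by two alphanumeric characters.
DOUBLE_ESCAPE = re.compile(r"%25[0-9a-zA-Z][0-9a-zA-Z]")


def looks_double_escaped(url_surtkey: str, url_path: str, url_query: str = "") -> bool:
    """True if the SURT key contains the pattern but the original URL does not."""
    in_surtkey = DOUBLE_ESCAPE.search(url_surtkey) is not None
    in_original = any(DOUBLE_ESCAPE.search(part) for part in (url_path, url_query))
    return in_surtkey and not in_original


# Made-up records, not taken from the index.
print(looks_double_escaped("com,example)/caf%25e9", "/caf%E9"))        # True
print(looks_double_escaped("com,example)/caf%e9", "/caf%E9"))          # False
print(looks_double_escaped("com,example)/100%25off", "/100%25off"))    # False
```

The last record shows why the query excludes URLs whose original path or query already contains the pattern: there the %25 is legitimate and not the result of double escaping.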

See also: https://commoncrawl.org/errata/surt-urls-do-not-properly-encode-non-utf-8-percent-encoded-characters

Thanks again, @tfmorris!

tfmorris added 3 commits August 26, 2023 20:05

Add failing test from Sebastian's issue

2bda97a

Add non-UTF-8 encoded test from mailing list

fe04b99

Handle non-UTF-8 encoded characters. Fixes commoncrawl#6

83e1699

sebastian-nagel mentioned this pull request Aug 28, 2023

Upgrade webarchive-commons dependency to include fix of SURT maker / URL canonicalizer commoncrawl/nutch#24

Closed

sebastian-nagel mentioned this pull request Dec 5, 2024

SURT URL canonicalization to handle non-UTF-8 percent-encoded characters iipc/webarchive-commons#102

Merged

sebastian-nagel closed this Feb 12, 2025

tfmorris deleted the 6-surt-non-utf-8-encoding branch February 12, 2025 19:17

tfmorris wants to merge 3 commits into commoncrawl:master from tfmorris:6-surt-non-utf-8-encoding