SURT URL canonicalization to handle non-UTF-8 percent-encoded characters #102
|
Thanks @sebastian-nagel. I'm not sure if it's important, but note that the issue reference in f7be47b resolves incorrectly in this new context. It's actually a reference to commoncrawl#6 |
|
That's unfortunate.
Technically, this is the reference implementation that the Python surt module is supposed to be a port of, and this is a breaking change for existing CDX files generated by the original Java tools. On the other hand, as OpenWayback is no longer updated and many organisations are moving to pywb, it may indeed be pragmatically better to follow the Python and JavaScript implementations. |
|
Numbers from Nov. 2024: in a sample of 10 million URLs, 4k (0.04%) percent-encode non-ASCII characters using an encoding other than UTF-8. JP and RU are frequent top-level domains of such URLs, but they are found practically everywhere (83 different TLDs in the sample). |
That has the added advantage of being correct and conforming to the spec. I think a bigger question is how to phase it in with the least impact on the ecosystem. The CDX spec doesn't include a version number or any information on the writer of the CDX file, making it difficult for readers to know how to interpret any given file. |
|
I'm going to wait a few more days for comments, and if no objections are raised I will merge this. |
WaybackURLKeyMaker.makeKey(url) replaces percent signs by %25 in percent-encoded URLs whose bytes do not represent valid UTF-8 encoded characters (URLs from before RFC 3986):
http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5
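To make the affected class of URLs concrete, here is a minimal Python sketch, not the code of either implementation. The helper names `has_non_utf8_escapes` and `escape_percent_signs` are hypothetical. The first checks whether the percent-decoded bytes of a whole URL are valid UTF-8; the second only mimics the path and query part of the keys shown above (lowercasing, then replacing `%` with `%25`), under the assumption that nothing else in that part is rewritten.

```python
from urllib.parse import unquote_to_bytes


def has_non_utf8_escapes(url: str) -> bool:
    """Return True if the percent-decoded bytes of `url` are not valid UTF-8.

    Hypothetical helper: it checks the URL as a whole, not each run of
    escapes separately.
    """
    try:
        unquote_to_bytes(url).decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def escape_percent_signs(url_part: str) -> str:
    """Mimic the key output reported above for a path or query string.

    Hypothetical helper: lowercases the input, then replaces every '%'
    with '%25'. It does not perform host reversal or any other SURT step.
    """
    return url_part.lower().replace("%", "%25")


if __name__ == "__main__":
    for url in (
        "http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm",
        "https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5",
    ):
        print(url, has_non_utf8_escapes(url))
    print(escape_percent_signs("/tags/%C3%CE%CA%C7%D1%E5%C7.htm"))
```

Both example URLs are flagged by the check because their escaped byte sequences start a multi-byte UTF-8 character without valid continuation bytes.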
Python's surt module behaves differently, which breaks lookup in CDX files for such URLs:
Notes: