SURT URL canonicalization to handle non-UTF-8 percent-encoded characters #102

Merged: ato merged 3 commits into iipc:master from sebastian-nagel:surt-non-utf-8-encoding on Dec 14, 2024
@sebastian-nagel

Collaborator

WaybackURLKeyMaker.makeKey(url) replaces percent signs with %25 in percent-encoded URLs whose bytes do not represent valid UTF-8 encoded characters (pre-RFC 3986 encodings):

```
http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5
```
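
For readers who want to check whether a given URL falls into this category, here is a minimal Python sketch using only the standard library. The helper name `has_non_utf8_escapes` is hypothetical and not part of webarchive-commons or surt, and the `example.com` URL is a made-up control case. The helper only tests whether the percent-decoded bytes form valid UTF-8; it does not reproduce any canonicalizer.

```python
from urllib.parse import unquote_to_bytes


def has_non_utf8_escapes(url: str) -> bool:
    """Return True if the percent-decoded bytes of `url` are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        # Decode %XX escapes to raw bytes, then try a strict UTF-8 decode.
        unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The two URLs from this PR description contain non-UTF-8 byte sequences.
print(has_non_utf8_escapes("http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm"))
print(has_non_utf8_escapes("https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5"))
# Made-up control: %C3%80 is a complete two-byte UTF-8 sequence.
print(has_non_utf8_escapes("http://example.com/%C3%80"))
```
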

Python's surt module behaves differently, which breaks look-up in CDX files for such URLs:

```
$> pip3 show surt
Name: surt
Version: 0.3.1
Summary: Sort-friendly URI Reordering Transform (SURT) python package.
...

$> python3
>>> from surt import surt
>>> surt("http://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5")
'ua,1kr)/newslist.html?tag=%e4%ee%f8%ea%ee%eb%fc%ed%ee%e5'
>>> surt("https://www.insbase.ac/xoops2/modules/xpwiki/?%A4%D5%A4%AF%A4%AA%A4%AB%B8%A9%A4%AA%A4%AA%A4%CE%A4%B8%A4%E7%A4%A6%BB%D4")
'ac,insbase)/xoops2/modules/xpwiki?%a4%d5%a4%af%a4%aa%a4%ab%b8%a9%a4%aa%a4%aa%a4%ce%a4%b8%a4%e7%a4%a6%bb%d4'
```

Notes:

@tfmorris

tfmorris commented Dec 5, 2024

Contributor

Thanks @sebastian-nagel. I'm not sure if it's important, but note that the issue reference in f7be47b resolves incorrectly in this new context. It's actually a reference to commoncrawl#6

@ato

ato commented Dec 5, 2024

Member

That's unfortunate.

| Implementation | "%C3" | "%C3%23" | "%C3%80" |
| --- | --- | --- | --- |
| jwarc | %25c3 | %25c3%23 | %c3%80 |
| OutbackCDX | %25c3 | %25c3%23 | %c3%80 |
| urlcanon (java) | %ef%bf%bd | %ef%bf%bd%23 | %c3%80 |
| urlcanon (python) | %c3 | %c3%23 | %c3%80 |
| surt (python) | %c3 | %c3%23 | %c3%80 |
| warcio.js | %c3 | %c3%23 | %c3%80 |
| webarchive-commons | %25c3 | %25c3%23 | %c3%80 |
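
The three outcome patterns in the table can be reproduced with a short Python sketch. This is not the code of any implementation listed above. The function names `escape_percent`, `replace_invalid` and `keep_bytes` are hypothetical, and the decode-then-re-encode model is only an assumption that happens to match the three test inputs.

```python
import codecs
import re
from urllib.parse import quote, unquote_to_bytes


def _keep_as_text(err):
    # Error handler: put undecodable bytes back as literal "%xx" text.
    bad = err.object[err.start:err.end]
    return "".join("%{:02x}".format(b) for b in bad), err.end


codecs.register_error("keep-percent", _keep_as_text)


def _reencode(text):
    # Percent-encode everything outside the unreserved set, lowercase the escapes.
    encoded = quote(text, safe="")
    return re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), encoded)


def escape_percent(s):
    # Pattern 1: invalid bytes survive as text, so their "%" becomes "%25".
    return _reencode(unquote_to_bytes(s).decode("utf-8", errors="keep-percent"))


def replace_invalid(s):
    # Pattern 2: invalid bytes become U+FFFD, re-encoded as %ef%bf%bd.
    return _reencode(unquote_to_bytes(s).decode("utf-8", errors="replace"))


def keep_bytes(s):
    # Pattern 3: never decode; only lowercase the existing escapes.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), s)


for sample in ("%C3", "%C3%23", "%C3%80"):
    print(sample, escape_percent(sample), replace_invalid(sample), keep_bytes(sample))
```

Only the third pattern leaves the original bytes recoverable from the key without changing the meaning of `%`, which is the behaviour the Python and JavaScript rows show.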

Technically, this library is the reference implementation that the Python surt module is supposed to be a port of, and the change proposed here is a breaking change for existing CDX files generated by the original Java tools.

On the other hand, as OpenWayback is no longer updated and many organisations are moving to pywb, it may indeed pragmatically be better to follow the Python and JavaScript implementations.

@sebastian-nagel

Collaborator Author

Numbers from Nov. 2024: in a sample of 10 million URLs, 4k (0.04%) encode non-ASCII characters not using UTF-8. JP and RU are frequent top-level domains of such URLs, but they're found practically everywhere (83 different TLDs in the sample).

@tfmorris

tfmorris commented Dec 6, 2024

Contributor

> it may indeed pragmatically be better to follow the Python and JavaScript implementations.

That has the added advantage of being correct and conforming to the spec.

I think a bigger question is how to phase it in with the least impact on the ecosystem. The CDX spec doesn't include a version number or any information on the writer of the CDX file, making it difficult for readers to know how to interpret any given file.

@ato

ato commented Dec 9, 2024

Member

I'm going to wait a few more days for comments, and if no objections are raised I will merge this.
