URLParser and WaybackURLKeyMaker fail on URLs with IPv6 address hostname #100

Merged
ato merged 1 commit into iipc:master from sebastian-nagel:surt-ipv6 on Nov 27, 2024

@sebastian-nagel

Collaborator

URLParser fails to parse URLs/URIs whose host is an IPv6 address. Consequently, WaybackURLKeyMaker fails to make the SURT key:

2024-11-26 11:07:55,243 ERROR o.c.u.WarcCdxWriter [pool-6-thread-1] Failed to make SURT for https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt: java.net.URISyntaxException: bad port 1f18:200d:fb00:2b74:867c:ab0c:150a]: https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt
        at org.archive.url.URLParser.parse(URLParser.java:257)
        at org.archive.url.WaybackURLKeyMaker.makeKey(WaybackURLKeyMaker.java:60)
        at org.commoncrawl.util.WarcCdxWriter.writeCdxLine(WarcCdxWriter.java:141)

This PR fixes the parser failure. The enclosing [ and ] are stripped from IPv6 hosts so that the resulting keys stay compatible with SURT keys generated by the Python surt module:

>>> from surt import surt
>>> surt("https://34.203.211.192/robots.txt")
'192,211,203,34)/robots.txt'
>>> surt("https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt")
'2600:1f18:200d:fb00:2b74:867c:ab0c:150a)/robots.txt'
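
For illustration, the key shape shown above can be reproduced with the Python standard library alone. This is a rough sketch, not the surt module's implementation and not the Java change in this PR: `simple_surt_key` is a hypothetical helper that ignores ports, queries, `www` stripping and all other canonicalization. It relies on `urlsplit(...).hostname`, which returns an IPv6 host without the enclosing brackets.

```python
from urllib.parse import urlsplit


def simple_surt_key(url):
    """Build a simplified SURT-like key (hypothetical helper, see lead-in)."""
    parts = urlsplit(url)
    # .hostname is lowercased and, for IPv6 literals, has the [ ] removed.
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literal: keep it as is, without brackets.
        key_host = host
    else:
        # Dotted host (name or IPv4): reverse the dot-separated parts,
        # matching the IPv4 output of surt shown above.
        key_host = ",".join(reversed(host.split(".")))
    return key_host + ")" + (parts.path or "/")


print(simple_surt_key("https://34.203.211.192/robots.txt"))
print(simple_surt_key("https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt"))
```

For the two URLs above this prints the same strings as the surt session; any other input may differ from what surt produces.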

@ato merged commit d589dd9 into iipc:master on Nov 27, 2024
@ato

ato commented Nov 27, 2024

Member

Thanks. Released as 1.1.11.
