WaybackURLKeyMaker mangles URLs with IPv4-mapped IPv6 addresses #104

@sebastian-nagel

Description

IPv4-mapped / IPv4-compatible IPv6 addresses (e.g., ::ffff:192.0.2.128) in URLs are mangled by WaybackURLKeyMaker: the enclosing square brackets are not removed. Instead, they stay attached to the first and last dot-separated parts of the host and are moved around with them when the parts are reversed:

jshell> import org.archive.url.WaybackURLKeyMaker;
jshell> var km = new WaybackURLKeyMaker();
jshell> km.makeKey("http://[::ffff:123.123.87.87]:8080/index.html")
$3 ==> "87],87,123,[::ffff:123:8080)/index.html"

For comparison, the Python surt module removes the square brackets before splitting at dots and reversing the parts:

$> pip3 show surt
Name: surt
Version: 0.3.1
Summary: Sort-friendly URI Reordering Transform (SURT) python package.

$> python3
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
>>> from surt import surt
>>> surt("http://[::ffff:123.123.87.87]:8080/index.html")
'87,87,123,::ffff:123:8080)/index.html'
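
The difference between the two outputs can be reproduced with a toy model. This is only a sketch of the suspected mechanism, not the actual code of WaybackURLKeyMaker or surt: `toy_key` is a hypothetical helper that assumes host, port and path have already been separated, and that the host is simply split at dots and reversed.

```python
def toy_key(host, port, path, strip_brackets):
    """Hypothetical helper: reverse the dot-separated host parts, then append port and path."""
    if strip_brackets:
        # Drop the square brackets of an IPv6 literal before splitting at dots.
        host = host.strip("[]")
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}:{port}){path}"


host, port, path = "[::ffff:123.123.87.87]", 8080, "/index.html"

# Brackets kept: they travel with the first and last dot-separated parts.
mangled = toy_key(host, port, path, strip_brackets=False)
# Brackets removed first: the key matches the surt output shown above.
cleaned = toy_key(host, port, path, strip_brackets=True)

print(mangled)  # 87],87,123,[::ffff:123:8080)/index.html
print(cleaned)  # 87,87,123,::ffff:123:8080)/index.html
```

Under these assumptions, stripping the brackets before the split is enough to go from the WaybackURLKeyMaker result to the surt result.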

I'm not sure what the best representation is:

  • normalize the IPv4-mapped representation - ::ffff:123.123.87.87 becomes
    • ::ffff:7b7b:5757
    • or 123.123.87.87
  • the double use of the colon, both inside IPv6 addresses and as the port separator, is troublesome, but maybe not an issue, because SURT keys are recall-oriented and some ambiguity is acceptable. It would also be a separate issue.
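
To make the two normalization options concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name `ipv4_mapped_forms` is made up for illustration, and the sketch only covers IPv4-mapped addresses (`::ffff:a.b.c.d`); IPv4-compatible addresses (`::a.b.c.d`) would need separate handling.

```python
import ipaddress


def ipv4_mapped_forms(host):
    """Return (hex_form, dotted_ipv4) for an IPv4-mapped IPv6 host, else None.

    Hypothetical helper, not part of surt or WaybackURLKeyMaker.
    """
    literal = host.strip("[]")  # accept the bracketed URL form as well
    try:
        addr = ipaddress.IPv6Address(literal)
    except ValueError:
        return None  # not an IPv6 literal (e.g. a hostname or plain IPv4)
    v4 = addr.ipv4_mapped
    if v4 is None:
        return None  # IPv6, but not of the ::ffff:a.b.c.d kind
    n = int(v4)
    # Build the pure-hex form by hand so the result does not depend on how
    # a given Python version prints IPv4-mapped addresses.
    hex_form = f"::ffff:{(n >> 16):x}:{(n & 0xFFFF):x}"
    return hex_form, str(v4)


print(ipv4_mapped_forms("[::ffff:123.123.87.87]"))
# ('::ffff:7b7b:5757', '123.123.87.87')
```

The hex form contains no dots, so it would pass through the dot-splitting step untouched; the dotted IPv4 form would be reversed like any other IPv4 host. Which of the two is preferable for SURT keys is the open question.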
