This bug (internetarchive/surt#28) reported against the Python SURT module applies to the URL canonicalization here as well.
The following URLs are all incorrectly canonicalized to the same SURT, "com)/", apparently because the entire first host label ("www" followed by digits) is stripped as a prefix:
1. https://www1355544.com/
2. https://www3288.com/
3. https://www504778.com/
4. https://www556798.com/
5. https://www57912.com/
There's also a difference in the handling of these prefixes between the two packages: the Java package removes ALL leading matching prefixes while the Python package only removes the first one. I think the less aggressive approach of the Python package might be preferable.
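
To make the two behaviors concrete, here is a minimal Python sketch. It does not use the code of either package: the regular expression and the function names are my own assumptions, chosen only to mimic the behavior described above. The last function shows one possible guard (keep the host unchanged if stripping would leave no dot), which is just an idea, not a proposed patch.

```python
import re

# Assumed prefix pattern: "www", optional digits, then a dot.
WWW_PREFIX = re.compile(r"^www\d*\.")


def strip_first(host: str) -> str:
    """Remove only the first matching prefix (Python-package-like behavior)."""
    return WWW_PREFIX.sub("", host, count=1)


def strip_all(host: str) -> str:
    """Remove every leading matching prefix (Java-package-like behavior)."""
    while True:
        stripped = WWW_PREFIX.sub("", host, count=1)
        if stripped == host:
            return host
        host = stripped


def strip_first_guarded(host: str) -> str:
    """Hypothetical guard: skip stripping if only a bare TLD would remain."""
    stripped = strip_first(host)
    return stripped if "." in stripped else host


for host in ("www1355544.com", "www.www2.example.com", "www.example.com"):
    print(host, strip_first(host), strip_all(host), strip_first_guarded(host))
```

With this sketch, "www1355544.com" collapses to "com" under both stripping variants, while the guarded variant leaves it intact; "www.www2.example.com" becomes "www2.example.com" when only the first prefix is removed and "example.com" when all are removed.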