Fix URL decoding for robots.txt #58
Pull request overview
This PR aims to prevent empty/null hostnames when parsing malformed robots.txt-style URLs in WARC target URIs by normalizing malformed http(s) scheme slashes and adding regression tests.
Changes:
- Normalize malformed `http(s):////...` inputs before parsing in `WarcUri`.
- Add JUnit tests covering malformed scheme slashes and extra path slashes.
- Ignore IntelliJ project metadata via `.gitignore`.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `src/main/java/org/commoncrawl/net/WarcUri.java` | Normalizes malformed http(s) URLs before parsing; adds warning log on parse failure. |
| `src/test/java/org/commoncrawl/net/WarcUriTest.java` | Adds regression tests for hostname extraction from malformed/oddly-slashed URLs. |
| `.gitignore` | Adds IntelliJ IDEA project directory ignore entry. |
sebastian-nagel
left a comment
Thanks, @lfoppiano. See inline comments.
@sebastian-nagel I've tightened the normalization so that it is applied only if the hostname is blank. There is a comment in test
I meant: If there's an assertion for a string to be "xyz", it cannot be empty. One assertion is enough, which means less code but not fewer tests.
sebastian-nagel
left a comment
Thanks. Looks good.
This PR addresses the problem found in the April Crawl when processing `robots.txt` URLs that resulted in an empty/null hostname.
Background information: The target URLs of robots.txt redirects are not passed to URL filters before the redirects are followed. RFC 9309 only specifies that at least five consecutive redirects should be followed and does not require or even recommend normalization of redirect targets. As a consequence, the robots.txt URLs stemming from redirects may have an unexpected form.
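For illustration, the behavior discussed in this PR (re-parse with normalized scheme slashes only if the first parse yields a blank hostname) could be sketched as follows. This is a minimal sketch and not the actual `WarcUri` code: the class name `SchemeSlashSketch`, the method names, and the choice to collapse three or more slashes after the scheme are assumptions made for the example. It relies only on `java.net.URI` and `java.util.regex.Pattern` from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: illustrates the fallback idea only.
class SchemeSlashSketch {

    // Matches "http:" or "https:" followed by three or more slashes
    // (assumption: any run of 3+ slashes after the scheme is malformed).
    private static final Pattern EXTRA_SCHEME_SLASHES =
            Pattern.compile("^(https?:)/{3,}", Pattern.CASE_INSENSITIVE);

    // Collapse the extra slashes after the scheme to exactly two.
    static String normalizeSchemeSlashes(String url) {
        return EXTRA_SCHEME_SLASHES.matcher(url).replaceFirst("$1//");
    }

    // Return the hostname, normalizing the URL only if the first parse
    // yields a null or empty host.
    static String hostOf(String url) {
        String host = parseHost(url);
        if (host == null || host.isEmpty()) {
            host = parseHost(normalizeSchemeSlashes(url));
        }
        return host;
    }

    // Parse with java.net.URI; return null if the URL is not parseable.
    private static String parseHost(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Because the normalization runs only as a fallback, a URL whose host parses on the first attempt is left untouched, including any extra slashes in its path.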