Fix host decoding with punycodes in URLs#59
Conversation
|
@sebastian-nagel after a second though we could make those methods (that I duplicated) public and use them directly from the crawler-commons. |
There was a problem hiding this comment.
Pull request overview
This PR improves HostName handling for hostnames containing non-ASCII characters (including percent-encoded Unicode), adding a normalization fallback to avoid hostName == null when IDN.toASCII fails, and introduces targeted regression/unit tests for the observed crawl failure.
Changes:
- Add per-label normalization helpers (
isAscii,asciiConvert,normalizeName) and use them as a fallback when decoding/IDN conversion fails. - Update URL decoding to use the
URLDecoder.decode(String, Charset)overload (removing the checkedUnsupportedEncodingExceptionpath). - Add new JUnit tests covering ASCII checks, punycode normalization, and end-to-end
setHostNamebehavior for emoji/unicode and percent-encoded input.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/main/java/org/commoncrawl/net/HostName.java | Adds normalization helpers and fallback logic for URL decode / IDN conversion failures. |
| src/test/java/org/commoncrawl/net/HostNameTest.java | Adds unit/regression coverage for ASCII detection, normalization, and end-to-end hostname processing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sebastian-nagel
left a comment
There was a problem hiding this comment.
Thanks @lfoppiano, the code could be simplified a lot, see the inline comments.
|
@sebastian-nagel I've revised the code according to your comments. I also added a test that will generate the hostname to be empty: |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Thanks, @lfoppiano! Looks good.
Could simplify handling the exceptions, see inline comment.
Merge conflicts need to be resolved.
This PR addresses the issue emerged during the April Crawl where Unicode host names, e.g.
🧠 .s.country(IDN2008) failed to decode when creating the index table.