8000 Fix host decoding with punycodes in URLs by lfoppiano · Pull Request #59 · commoncrawl/cc-index-table · GitHub
Skip to content

Fix host decoding with punycodes in URLs#59

Merged
lfoppiano merged 12 commits into
mainfrom
bugfix/punycodes-url
May 5, 2026
Merged

Fix host decoding with punycodes in URLs#59
lfoppiano merged 12 commits into
mainfrom
bugfix/punycodes-url

Conversation

@lfoppiano

@lfoppiano lfoppiano commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

This PR addresses the issue emerged during the April Crawl where Unicode host names, e.g. 🧠 .s.country (IDN2008) failed to decode when creating the index table.

@lfoppiano

Copy link
Copy Markdown
Contributor Author

@sebastian-nagel after a second though we could make those methods (that I duplicated) public and use them directly from the crawler-commons.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves HostName handling for hostnames containing non-ASCII characters (including percent-encoded Unicode), adding a normalization fallback to avoid hostName == null when IDN.toASCII fails, and introduces targeted regression/unit tests for the observed crawl failure.

Changes:

  • Add per-label normalization helpers (isAscii, asciiConvert, normalizeName) and use them as a fallback when decoding/IDN conversion fails.
  • Update URL decoding to use the URLDecoder.decode(String, Charset) overload (removing the checked UnsupportedEncodingException path).
  • Add new JUnit tests covering ASCII checks, punycode normalization, and end-to-end setHostName behavior for emoji/unicode and percent-encoded input.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/main/java/org/commoncrawl/net/HostName.java Adds normalization helpers and fallback logic for URL decode / IDN conversion failures.
src/test/java/org/commoncrawl/net/HostNameTest.java Adds unit/regression coverage for ASCII detection, normalization, and end-to-end hostname processing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/test/java/org/commoncrawl/net/HostNameTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/HostNameTest.java

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lfoppiano, the code could be simplified a lot, see the inline comments.

Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/main/java/org/commoncrawl/net/HostName.java
Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
Comment thread src/test/java/org/commoncrawl/net/HostNameTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/HostNameTest.java Outdated
Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
@lfoppiano lfoppiano requested a review from sebastian-nagel May 2, 2026 20:59
@lfoppiano

Copy link
Copy Markdown
Contributor Author

@sebastian-nagel I've revised the code according to your comments. I also added a test that will generate the hostname to be empty: normalizeNameNoAsciiInHostname in HostnameTest. Please double check that this is the expected behaviour.

Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
@lfoppiano lfoppiano requested a review from sebastian-nagel May 5, 2026 06:48

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @lfoppiano! Looks good.

Could simplify handling the exceptions, see inline comment.

Merge conflicts need to be resolved.

Comment thread src/main/java/org/commoncrawl/net/HostName.java Outdated
@lfoppiano lfoppiano merged commit 0782fa6 into main May 5, 2026
7 checks passed
@lfoppiano lfoppiano deleted the bugfix/punycodes-url branch May 5, 2026 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants