Adjust exception chain from invalid URLs in URLCleaner / URLNormalizers #57
lfoppiano wants to merge 4 commits into
Pull request overview
This PR hardens Nutch’s IDNA2008 hostname conversion so that unchecked ICU exceptions (e.g., from UTS46/Punycode) are converted into MalformedURLException, allowing callers like BasicURLNormalizer/URLCleaner to reject bad URLs instead of crashing a task.
Changes:
- Wrap ICU `IDNA.nameToASCII`/`nameToUnicode` in `URLUtil.convertIDNA2008` with a `try`/`catch` to translate unchecked ICU/UTS46/Punycode exceptions into `MalformedURLException` (with cause attached).
- Add a regression test intended to cover an invalid host case derived from a recent crawl sample.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/java/org/apache/nutch/util/URLUtil.java` | Converts unchecked ICU exceptions during IDNA2008 conversion into `MalformedURLException` to prevent task crashes. |
| `src/test/org/apache/nutch/util/TestURLUtil.java` | Adds a regression test for invalid-host handling in `convertIDNA2008`. |
sebastian-nagel left a comment
Thanks, @lfoppiano! Good catch!
Would you mind pushing this fix upstream? As usual, this needs a Jira issue and a PR. For the upstream test, a "neutral" dummy URL would be preferable. Or even better: multiple URLs to test for the various runtime exceptions thrown by ICU. Thanks!
While processing the WAT file, the URLCleaner crashed on an invalid URL:
Exception:
The exception `com.ibm.icu.util.ICUInputTooLongException: input too long: 1255 UTF-16 code units` was thrown by ICU `Punycode.encode` ← `URLUtil.convertIDNA2008` ← `BasicURLNormalizer.normalizeHostName` ← `UrlCleaner.map`.

Reason: `ICUInputTooLongException` is unchecked. `convertIDNA2008` handled only the soft `idnaInfo.hasErrors()` path, and the mapper caught only `MalformedURLException`, so the exception escaped → task died → 4 retries → job FAILED. (The neighboring `convertIDNA2003` already guarded against `IllegalArgumentException`/`IndexOutOfBoundsException`; the 2008 variant did not.)

Fix: We wrapped the `idna.nameTo*` calls in `convertIDNA2008` and now convert ICU's unchecked exceptions to `MalformedURLException`, with `initCause` and `LOG.debug`. The UTS46 + Punycode path throws three unrelated unchecked types, so we catch all of them: `catch (ICUException | IllegalArgumentException | IllegalStateException e)`. `ICUException` covers `ICUInputTooLongException`; `IllegalArgumentException` comes from `UTS46` and `IllegalStateException` from `Punycode`; the referenced `StringPrepParseException` is checked and cannot escape. The mapper then rejects the URL (increments the rejected counter) and the job survives.
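
For readers without the diff at hand, the translation pattern described above can be sketched roughly as follows. This is not the code from the PR: `IdnaGuard`, `toAsciiHost`, and `countRejected` are hypothetical names, and `java.net.IDN` is used only as a stand-in for ICU's `IDNA` so that the sketch runs without ICU4J on the classpath. The stand-in throws `IllegalArgumentException` for bad input; the actual change additionally catches `ICUException`, which is omitted here because it would require ICU4J.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the exception translation; not the Nutch code. */
final class IdnaGuard {

  /**
   * Converts a host name to its ASCII form. java.net.IDN is an assumption
   * standing in for the ICU IDNA call; it reports bad input with an
   * unchecked IllegalArgumentException.
   */
  static String toAsciiHost(String host) throws MalformedURLException {
    try {
      return IDN.toASCII(host);
    } catch (IllegalArgumentException | IllegalStateException e) {
      // Translate the unchecked exception into the checked type that callers
      // already handle, keeping the original exception as the cause.
      MalformedURLException mue =
          new MalformedURLException("Invalid host name: " + e.getMessage());
      mue.initCause(e);
      throw mue;
    }
  }

  /** Mimics a mapper that rejects bad hosts instead of failing the task. */
  static int countRejected(List<String> hosts) {
    int rejected = 0;
    for (String host : hosts) {
      try {
        toAsciiHost(host);
      } catch (MalformedURLException e) {
        rejected++; // stands in for incrementing a "rejected" counter
      }
    }
    return rejected;
  }

  public static void main(String[] args) {
    // A single label longer than 63 characters is not a valid host name.
    char[] label = new char[64];
    Arrays.fill(label, 'a');
    String tooLong = new String(label) + ".example";
    System.out.println(countRejected(Arrays.asList("example.org", tooLong)));
  }
}
```

Because the caller only ever sees `MalformedURLException`, the rejection logic does not need to know which unchecked type the underlying library happened to throw, and the original exception remains available through `getCause()` for debugging.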