Integrate Apache Nutch upstream improvements#55

Merged
sebastian-nagel merged 8 commits into cc from cc-integrate-upstream-improvements on May 18, 2026
sebastian-nagel commented May 15, 2026

Integrate the following improvements from upstream Nutch:

Most other upstream improvements would require one of

  • Hadoop 3.4.x or 3.5.0
  • Java 17
  • JUnit 6

For now the fork needs to stay on Hadoop 3.3.6 and Java 11.

lewismc and others added 8 commits May 15, 2026 18:37
…apache#910)

Javadoc on the FAILED ParseStatus constant in src/java/org/apache/nutch/parse/ParseStatus.java read 'Parsing failed. An Exception occured'. Doc-only change.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
Co-authored-by: SAY-5 <SAY-5@users.noreply.github.com>
- URLUtil:
  - make IDNA2008 the default for the methods toASCII and toUNICODE
  - provide methods to convert host names both for IDNA2003 and IDNA2008
  - also lowercase the host (if not already lowercased)
- urlnormalizer-basic:
  - convert host names using IDNA2008 if the property
    urlnormalizer.basic.host.idna2008 is true
- refactor to share methods between URLUtil and urlnormalizer-basic
- refactor calls of URLDecoder and pass Charset instead of String
  (since Java 10)
- reset reprUrl in FetcherThread after fetch is finished
- report idle threads properly
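
As a rough illustration of two of the changes listed above (passing a `Charset` to `URLDecoder`, and lowercasing a host before IDN conversion), here is a minimal sketch using only the JDK. The class `UrlSketch` and its helper methods are hypothetical and are not taken from the Nutch code. Note that the JDK's `java.net.IDN` implements IDNA2003 only; an IDNA2008 conversion as described in the commit message would need another library, which is not shown here.

```java
import java.net.IDN;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class, not part of Nutch.
public class UrlSketch {

  // Percent-decode using the Charset overload (available since Java 10).
  // Unlike the String-charset overload, it does not throw the checked
  // UnsupportedEncodingException.
  static String decodeUtf8(String s) {
    return URLDecoder.decode(s, StandardCharsets.UTF_8);
  }

  // Lowercase the host, then convert it to its ASCII (Punycode) form.
  // Assumption: java.net.IDN is used here, which follows IDNA2003.
  static String hostToAscii(String host) {
    return IDN.toASCII(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(decodeUtf8("caf%C3%A9"));
    System.out.println(hostToAscii("B\u00dcCHER.example")); // xn--bcher-kva.example
  }
}
```

Using `Locale.ROOT` for lowercasing avoids locale-dependent surprises (for example the Turkish dotless i) when normalizing host names.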
sebastian-nagel merged commit 262516f into cc May 18, 2026
1 check passed
sebastian-nagel deleted the cc-integrate-upstream-improvements branch May 18, 2026 21:08

4 participants