Merge cherry-picked commits from upstream 1.22 #46

Merged: sebastian-nagel merged 27 commits into cc from merge-upstream-master-1.22 on Feb 27, 2026

@sebastian-nagel

This PR merges cherry-picked commits from the upstream Apache Nutch 1.22 release (apache@release-1.22) with the exclusion of:

  • the upgrade to Hadoop 3.4.2
  • the upgrade to Tika 3.2.3

For now, the crawls are run on Hadoop 3.3.6, which requires commons-io 2.8.0.

Notable inclusions are:

  • NUTCH-3139 - protocol-okhttp: add support for zstd content-encoding
  • NUTCH-1564 - AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
  • a large overhaul of Nutch counters and metrics
  • multiple improvements and fixes to the build system and CI builds

lewismc and others added 24 commits February 25, 2026 17:12
Apply metrics naming conventions to CCF-specific classes and extensions.
Robots.txt parser: use URL objects in newly introduced
methods to avoid the unnecessary parsing of URLs.
Update URLUtil test to adapt to a change in the public suffix list
- upgrade to OkHttp 5.3.2
- enable support for zstd content-encoding
- adapt unit tests to changes introduced in
  crawler-commons/crawler-commons#478
- test for example given in Javadoc of getDomainSuffix
In setFetchSchedule, make sure 'refTime' is not in the past.

Add unit test to reproduce the situation described in Jira.

Unrelated fix in FetcherThread
Convert the fraction of the delta to a ratio of max interval, to avoid
next fetchTime in the past.

Add unit tests for different scenarios.
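
The two AdaptiveFetchSchedule commits above deal with a next fetch time that ends up in the past. The following is a minimal sketch of how that can happen, using a simplified model rather than the actual Nutch code. The class, method, and parameter names are hypothetical, and the clamp shown is only one possible guard, assumed for illustration:

```java
final class SyncDeltaSketch {

    /**
     * Simplified model (NOT the Nutch implementation): returns the next
     * fetch time in milliseconds. All names here are hypothetical.
     */
    static long nextFetchTime(long fetchTime, long modifiedTime,
                              double syncDeltaRate, long maxIntervalSec,
                              boolean clampRefTime) {
        // Age of the document at fetch time, in seconds.
        long deltaSec = (fetchTime - modifiedTime) / 1000L;
        // The interval grows with the age of the document but is capped.
        long intervalSec = Math.min(deltaSec, maxIntervalSec);
        // Shift the reference time back by a fraction of the delta.
        long refTime = fetchTime - Math.round(deltaSec * syncDeltaRate * 1000);
        if (clampRefTime && refTime < fetchTime) {
            // Assumed guard: do not let the reference time fall before
            // the time of the fetch that just happened.
            refTime = fetchTime;
        }
        return refTime + intervalSec * 1000L;
    }

    public static void main(String[] args) {
        long dayMs = 86_400_000L;
        long fetchTime = 2000 * dayMs;      // illustrative values only
        long maxIntervalSec = 365L * 86_400L;
        long unclamped = nextFetchTime(fetchTime, 0L, 0.3, maxIntervalSec, false);
        long clamped = nextFetchTime(fetchTime, 0L, 0.3, maxIntervalSec, true);
        System.out.println("unclamped in the past: " + (unclamped < fetchTime));
        System.out.println("clamped in the past:   " + (clamped < fetchTime));
    }
}
```

In this model the problem only shows up for very old documents: once the fraction of the delta exceeds the capped interval, the unclamped result falls before the fetch time and the page would be refetched immediately.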
Add TestCrawlDbStatesExtended (was TODOTestCrawlDbStates)
Integrate Ivy cache in Common Crawl specific workflow.
@sebastian-nagel

Author

As a side effect, #43 is fixed. All unit tests are now run, see workflow log.

I'm currently testing this PR on a single-node Hadoop cluster.

Apply metrics naming conventions to CCF-specific classes and extensions:
lower-case counter names of sitemap types in SitemapInjector.
Apply metrics naming conventions to WARC writer counters.
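
As an illustration of lower-casing counter names, here is a minimal sketch. The helper and the type names are hypothetical and not taken from SitemapInjector or the WARC writer. The remaining rules of the naming convention are not stated in this PR and are not modelled here:

```java
import java.util.Locale;

final class CounterNameSketch {

    // Hypothetical helper, not part of Nutch: lower-cases a counter name
    // independently of the JVM's default locale.
    static String lowerCaseCounterName(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "INDEX" and "RSS" are illustrative type names only.
        for (String type : new String[] {"INDEX", "RSS"}) {
            System.out.println(lowerCaseCounterName(type));
        }
    }
}
```

Passing `Locale.ROOT` keeps the result stable across environments. With a Turkish default locale, for example, a plain `toLowerCase()` would map `I` to a dotless `ı`.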
@sebastian-nagel

Author

Testing finished.

The new metrics naming convention still required some adjustments, especially in the WARC writer.

lfoppiano linked an issue on Feb 26, 2026 that may be closed by this pull request
sebastian-nagel merged commit 2a9a6ab into cc on Feb 27, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Some tests are ignored when running through ant