Merge cherry-picked commits from upstream 1.22 #46
Merged
Apply metrics naming conventions to CCF-specific classes and extensions.
…rser, and Indexer (apache#876)
Robots.txt parser: use URL objects in newly introduced methods to avoid the unnecessary parsing of URLs.
Update URLUtil test to adapt to a change in the public suffix list
- upgrade to OkHttp 5.3.2
- enable support for zstd content-encoding
- adapt unit tests to changes introduced in crawler-commons/crawler-commons#478
- test for example given in Javadoc of getDomainSuffix
In setFetchSchedule, make sure 'refTime' is not in the past. Add unit test to reproduce the situation described in Jira. Unrelated fix in FetcherThread
Convert the fraction of the delta to a ratio of max interval, to avoid next fetchTime in the past. Add unit tests for different scenarios.
Add TestCrawlDbStatesExtended (was TODOTestCrawlDbStates)
Integrate Ivy cache in Common Crawl specific workflow.
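The fetch-schedule commits listed above say that 'refTime' must not lie in the past, so that the next fetchTime cannot end up in the past either. A minimal sketch of that idea follows. It is not the actual Nutch code: the class name, method names, and parameters are hypothetical, and clamping to the current time is an assumption about how "not in the past" is enforced.

```java
// Hypothetical sketch only; names and the clamping rule are assumptions,
// not taken from Nutch's fetch schedule implementation.
final class FetchTimeSketch {
    private FetchTimeSketch() {}

    /** Returns refTimeMs unchanged, or nowMs if refTimeMs lies in the past. */
    static long clampRefTime(long refTimeMs, long nowMs) {
        return Math.max(refTimeMs, nowMs);
    }

    /** Next fetch time: the clamped reference time plus the interval (in seconds). */
    static long nextFetchTime(long refTimeMs, long nowMs, long intervalSec) {
        return clampRefTime(refTimeMs, nowMs) + intervalSec * 1000L;
    }
}
```

With a reference time five seconds in the past and a 60-second interval, the sketch schedules the next fetch 60 seconds after the current time rather than 55 seconds after it.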
As a side effect, #43 is fixed. All unit tests are now run; see the workflow log. I'm currently testing this PR on a single-node Hadoop cluster.
Apply metrics naming conventions to CCF-specific classes and extensions: lower-case counter names of sitemap types in SitemapInjector.
Apply metrics naming conventions to WARC writer counters.
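The lower-casing of counter names mentioned in these commits could look roughly like the sketch below. The helper class and method are hypothetical and are not part of SitemapInjector or the WARC writer. The only point illustrated is that lower-casing with `Locale.ROOT` keeps counter names stable regardless of the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical helper; not the actual CCF or Nutch code.
final class CounterNameSketch {
    private CounterNameSketch() {}

    /** Lower-cases a counter name independently of the default locale. */
    static String lowerCaseCounterName(String rawName) {
        return rawName.toLowerCase(Locale.ROOT);
    }
}
```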
Testing finished. The new metrics naming convention still required some adjustments, especially to the WARC writer.
This PR merges cherry-picked commits from the upstream Apache Nutch 1.22 release (apache@release-1.22) with the exclusion of:
For now, the crawls are run on Hadoop 3.3.6 which requires commons-io 2.8.0.
Notable inclusions are: