Detect canonical links in Fetcher #36
Merged
- Add lazy extractor for canonical links in HTTP header and HTML
- Stub call in Fetcher
- put canonical link into CrawlDatum metadata
- put a null value if no canonical link was found, so that updates can overwrite existing values
- document property `fetcher.detect.canonical.link`
sebastian-nagel added a commit that referenced this pull request on Jan 8, 2026
- cf. #36 for lazy canonical link detection
- delay revisits on pages with a canonical link pointing to a different URL
- the delay is configurable via the property `scoring.adaptive.penalty.non_canonical`, applied as a penalty on the generator sort value
- fix typos in documentation
Canonical links (1, 2, 3) can be used by webmasters to signal that a page is a duplicate of another one.
Evaluation on a random set of WARC files showed that about 60% of web pages include a canonical link. This observation is similar to those reported in 2 and 4.
The requirement is a lazy extraction of canonical links in a Nutch workflow where content is not parsed but only written into WARC files. The links are stored in the CrawlDb and are planned to be used later to reduce the amount of duplicated content.
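To illustrate what lazy extraction without a full parse can look like, here is a minimal sketch that checks the HTTP `Link` header first and only then scans the HTML for a `<link rel="canonical">` element using regular expressions. The class `CanonicalLinkSketch`, its method names, and the header-before-HTML order are assumptions for illustration; they are not taken from the code in this pull request.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the extractor added in this pull request. */
final class CanonicalLinkSketch {

  // One link-value of an HTTP Link header: <URI> followed by its parameters.
  private static final Pattern HEADER_LINK = Pattern.compile("<([^>]*)>([^<]*)");
  private static final Pattern HEADER_REL = Pattern.compile(
      "\\brel\\s*=\\s*\"?[^\";,]*\\bcanonical\\b", Pattern.CASE_INSENSITIVE);

  // <link ...> elements in HTML and the two attributes of interest.
  private static final Pattern HTML_LINK = Pattern.compile(
      "<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
  private static final Pattern HTML_REL = Pattern.compile(
      "\\srel\\s*=\\s*[\"']?[^\"'>]*\\bcanonical\\b", Pattern.CASE_INSENSITIVE);
  private static final Pattern HTML_HREF = Pattern.compile(
      "\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
      Pattern.CASE_INSENSITIVE);

  private CanonicalLinkSketch() {}

  /** Returns the canonical target from a Link header value, or null. */
  static String fromLinkHeader(String headerValue) {
    if (headerValue == null) return null;
    Matcher m = HEADER_LINK.matcher(headerValue);
    while (m.find()) {
      // group(1) is the URI, group(2) the parameters up to the next link-value
      if (HEADER_REL.matcher(m.group(2)).find()) return m.group(1).trim();
    }
    return null;
  }

  /** Returns the href of the first link element with rel=canonical, or null. */
  static String fromHtml(CharSequence html) {
    if (html == null) return null;
    Matcher tag = HTML_LINK.matcher(html);
    while (tag.find()) {
      String t = tag.group();
      if (!HTML_REL.matcher(t).find()) continue;
      Matcher href = HTML_HREF.matcher(t);
      if (href.find()) {
        // exactly one of the three groups (double-quoted, single-quoted,
        // unquoted) is non-null
        for (int g = 1; g <= 3; g++) {
          if (href.group(g) != null) return href.group(g).trim();
        }
      }
    }
    return null;
  }

  /** Header first; the HTML is scanned only if the header has no match. */
  static String detect(String linkHeader, CharSequence htmlPrefix) {
    String fromHeader = fromLinkHeader(linkHeader);
    return fromHeader != null ? fromHeader : fromHtml(htmlPrefix);
  }
}
```

A `null` result means no canonical link was found. The sketch does not resolve relative URLs or HTML entities, which a real extractor would likely need to handle.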
Design decisions:
The decisions are rooted in an experimental setup where the canonical link detection was tested on WARC input. Below are the evaluation metrics for
CC-MAIN-20240907095856-20240907125856-00058.warc.gz, with the following parameters:
Scanning only the first 64 kiB finds 96% of the canonical links found in the first 512 kiB, while requiring only 57% of the computation time. Using a `ByteArrayCharSequence` instead of a Java `String` improves performance significantly. The full evaluation metrics: canonical-link-evaluation-CC-MAIN-20240907095856-20240907125856-0005.tsv
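The idea of matching directly on bytes can be sketched as a `CharSequence` view over the first 64 kiB of the content, so that `java.util.regex` can run without first decoding the whole buffer into a `String`. The class below is a hypothetical stand-in named `ByteSeqSketch`; the actual `ByteArrayCharSequence` used in the evaluation may be implemented differently. It assumes each byte maps to one char (ISO-8859-1), which is enough to locate ASCII markup.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a byte-array-backed CharSequence. */
final class ByteSeqSketch implements CharSequence {

  private final byte[] data;
  private final int offset;
  private final int length;

  ByteSeqSketch(byte[] data, int offset, int length) {
    if (offset < 0 || length < 0 || offset + length > data.length) {
      throw new IndexOutOfBoundsException();
    }
    this.data = data;     // shared, not copied
    this.offset = offset;
    this.length = length;
  }

  /** View of at most maxBytes from the start of the content. */
  static ByteSeqSketch prefix(byte[] content, int maxBytes) {
    return new ByteSeqSketch(content, 0, Math.min(content.length, maxBytes));
  }

  @Override
  public int length() {
    return length;
  }

  @Override
  public char charAt(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException();
    }
    // Assumption: one byte is one char (ISO-8859-1 mapping).
    return (char) (data[offset + index] & 0xFF);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    if (start < 0 || end > length || start > end) {
      throw new IndexOutOfBoundsException();
    }
    return new ByteSeqSketch(data, offset + start, end - start);
  }

  @Override
  public String toString() {
    // Only called on the slices that are actually needed, e.g. a matched URL.
    return new String(data, offset, length, StandardCharsets.ISO_8859_1);
  }
}
```

A view created with `ByteSeqSketch.prefix(content, 64 * 1024)` can be passed to `Pattern.matcher(...)` like any other `CharSequence`. Under the one-byte-per-char assumption, a matched URL containing non-ASCII UTF-8 bytes would need to be re-decoded from the underlying bytes.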
The CrawlDb is expected to grow by 40–60% if it includes canonical links. The 60% growth is in line with the 60% adoption rate of canonical links. In practice, the number is lower because items in the CrawlDb which failed to fetch naturally do not have a canonical link.
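The growth estimate can be read as a rough back-of-envelope model. The symbols are assumptions introduced here for illustration, not measured values: $f$ is the fraction of CrawlDb records that were fetched successfully, $a \approx 0.6$ is the share of fetched pages with a canonical link, and $r$ is the size of a stored canonical link relative to an average record.

$$
\text{growth} \approx f \cdot a \cdot r
$$

With $f = 1$ and $r = 1$ this gives the 60% upper end of the range. Any $f < 1$ lowers the result, for example $f \cdot r \approx 0.67$ yields about 40%.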
Example of a CrawlDb record with canonical link: