Detect canonical links in Fetcher by sebastian-nagel · Pull Request #36 · commoncrawl/nutch

sebastian-nagel · 2025-12-03T17:38:39Z

Canonical links (1, 2, 3) can be used by webmasters to signal that a page is a duplicate of another one.

Evaluation on a random set of WARC files showed that about 60% of web pages include a canonical link. This observation is similar to those reported in 2 and 4.

The requirement is a lazy extraction of canonical links in a Nutch workflow where content is not parsed but only written into WARC files. The links are stored in the CrawlDb and are planned to be used later to reduce the amount of duplicated content.

Design decisions:

Use an HTTP parser to extract canonical links from HTTP headers.
Scan only the first 64 kiB of HTML content for canonical links. This is not sufficient to extract all canonical links; it is a compromise between speed and recall.
Avoid converting binary HTML content to a Java String; scan the content byte array directly, assuming ASCII / ISO-8859-1 encoding.
Pass found links to the CrawlDb by adding them to the metadata of the fetch CrawlDatum item.
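
The byte-array scanning decision can be illustrated with a minimal sketch. It is not the code from this pull request: the class `CanonicalLinkSketch`, the constant `MAX_SCAN_BYTES`, and the regular expressions are assumptions made for illustration, and the `ByteArrayCharSequence` shown here is only a guess at what a class of that name might look like. The idea is to expose the raw bytes as a `CharSequence` (one byte per char, as in ISO-8859-1) so that a regex can run without first copying the content into a Java String.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Read-only CharSequence view over a byte array; each byte maps to one ISO-8859-1 char. */
final class ByteArrayCharSequence implements CharSequence {
    private final byte[] data;
    private final int start;
    private final int end;

    ByteArrayCharSequence(byte[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return (char) (data[start + index] & 0xFF); }

    @Override public CharSequence subSequence(int from, int to) {
        // No copy: the sub-view shares the same backing array.
        return new ByteArrayCharSequence(data, start + from, start + to);
    }

    @Override public String toString() {
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }
}

/** Hypothetical helper, not the CanonicalLinkDetector from the pull request. */
final class CanonicalLinkSketch {
    /** Scan limit taken from the design decisions above (64 kiB). */
    static final int MAX_SCAN_BYTES = 64 * 1024;

    // Assumed patterns; real-world HTML needs more care (e.g. unquoted href values).
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\s[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL =
        Pattern.compile("\\brel\\s*=\\s*[\"']?canonical[\"'\\s/>]", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"'>]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the first canonical link found in the scanned prefix, or null. */
    static String findCanonical(byte[] content) {
        int limit = Math.min(content.length, MAX_SCAN_BYTES);
        CharSequence view = new ByteArrayCharSequence(content, 0, limit);
        Matcher tags = LINK_TAG.matcher(view);
        while (tags.find()) {
            CharSequence tag = view.subSequence(tags.start(), tags.end());
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1); // stop after the first link found
                }
            }
        }
        return null;
    }
}
```

Only the matched `href` value is materialized as a String; everything else is read directly from the byte array. A link that starts beyond the first 64 kiB is not found, which is the recall cost of the scan limit.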

The decisions are rooted in an experimental setup where the canonical link detection was tested on WARC input. Below are the evaluation metrics for CC-MAIN-20240907095856-20240907125856-00058.warc.gz; the parameters are listed below the table:

scanned   links   time (ms)   time (ms)
  (kiB)   found    String        BACS
      1    7468       273         214
      2   11967       407         372
      4   15034       657         546
      8   16617       980         899
     16   18047      1528        1371
     32   19761      2618        2093
     64   20566      3923        2823
    128   21009      5481        3722
    256   21255      6691        4887
    512   21381      7502        4996

                   N = 31498  (documents processed)
max. canonical links =     1  (stop after first link found)

Scanning only the first 64 kiB finds 96% of the canonical links found in the first 512 kiB while requiring only 57% of the computation time (ByteArrayCharSequence timings). Using the ByteArrayCharSequence ("BACS" in the table) instead of a Java String is a significant performance improvement. The full evaluation metrics: canonical-link-evaluation-CC-MAIN-20240907095856-20240907125856-0005.tsv
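
The two percentages can be recomputed from the 64 kiB and 512 kiB rows of the table. The snippet below is only a worked check of that arithmetic; the class name `ScanLimitTradeoff` is made up for illustration and the numbers are copied from the table above.

```java
/** Recomputes the recall and time ratios from the evaluation table (illustrative only). */
final class ScanLimitTradeoff {
    /** Returns part/whole as a percentage. */
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        long found64 = 20566, found512 = 21381;   // links found at 64 kiB and 512 kiB
        long bacs64 = 2823, bacs512 = 4996;       // ms, ByteArrayCharSequence
        long string64 = 3923;                     // ms, Java String at 64 kiB

        // Share of the 512 kiB links already found within 64 kiB (about 96%).
        System.out.printf("recall vs. 512 kiB:    %.1f%%%n", percent(found64, found512));
        // Share of the 512 kiB scan time needed for 64 kiB (about 57%).
        System.out.printf("time vs. 512 kiB:      %.1f%%%n", percent(bacs64, bacs512));
        // Time of the byte-array scan relative to the String scan at 64 kiB.
        System.out.printf("BACS vs. String, 64k:  %.1f%%%n", percent(bacs64, string64));
    }
}
```

At 64 kiB the byte-array scan takes roughly 72% of the time of the String-based scan, which is the improvement referred to above.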

The CrawlDb is expected to grow by 40–60% if it includes canonical links. The 60% growth is in line with the 60% adoption rate of canonical links. In practice, the number is lower because items in the CrawlDb that failed to fetch naturally do not have a canonical link.

Example of a CrawlDb record with canonical link:

https://blog.commoncrawl.org/ccbot      Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 10 10:31:58 CET 2025
Modified time: Wed Dec 03 10:31:58 CET 2025
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: 1097648fedeebc1db8dbe62354b698e8
Metadata: 
        _lst_=29412568
        _pst_=success(1), lastModified=1764367820000
        _rs_=28
        Content-Type=text/html
        nutch.protocol.code=200
        canonical.link=https://commoncrawl.org/ccbot

- cf. #36 for lazy canonical link detection
- delay revisits on pages with a canonical link pointing to a different URL
- the delay is configurable via the property `scoring.adaptive.penalty.non_canonical` as a penalty on the generator sort value
- fix typos in documentation

sebastian-nagel added 5 commits December 4, 2025 13:14

Upgrade crawler-commons to 1.6 / 1.7-SNAPSHOT

40c7dca

Extract canonical links in Fetcher

5dd2215

- Add lazy extractor for canonical links in HTTP header and HTML
- Stub call in Fetcher

CanonicalLinkDetector: catch HTTP header parse exception

b37880b

Extract canonical links in Fetcher

eaf94c6

- put canonical link into CrawlDatum metadata
- put null value if no canonical link was found so that updates can overwrite existing values

Extract canonical links in Fetcher

6a23e6e

- document property `fetcher.detect.canonical.link`
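
The commit note about storing a null value when no canonical link was found can be illustrated with a small sketch. It uses a plain `java.util.Map` in place of the CrawlDatum metadata and assumes that a CrawlDb update copies the fetch metadata over the stored metadata key by key; both the class `MetadataOverwriteSketch` and that merge behaviour are assumptions for illustration, not code from this pull request.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrates why an explicit "no link" entry is needed to clear a stale value. */
final class MetadataOverwriteSketch {
    static final String KEY = "canonical.link";

    /** Assumed update semantics: fetch metadata is copied over the stored metadata. */
    static Map<String, String> update(Map<String, String> stored, Map<String, String> fetched) {
        Map<String, String> merged = new HashMap<>(stored);
        merged.putAll(fetched);
        return merged;
    }

    /** Builds fetch metadata; optionally writes a null marker when no link was found. */
    static Map<String, String> fetchMetadata(String canonicalOrNull, boolean writeNullMarker) {
        Map<String, String> meta = new HashMap<>();
        if (canonicalOrNull != null || writeNullMarker) {
            meta.put(KEY, canonicalOrNull);
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put(KEY, "https://commoncrawl.org/ccbot");

        // Key omitted on refetch: the old canonical link survives the update.
        System.out.println(update(stored, fetchMetadata(null, false)).get(KEY));
        // Null marker written: the old canonical link is overwritten.
        System.out.println(update(stored, fetchMetadata(null, true)).get(KEY));
    }
}
```

Without the marker, a page that dropped its canonical link would keep the outdated value in the CrawlDb indefinitely.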

sebastian-nagel force-pushed the canonical-links branch from 3e783fd to 6a23e6e on December 4, 2025 12:15

sebastian-nagel merged commit 994316e into cc Dec 19, 2025
1 check passed

sebastian-nagel deleted the canonical-links branch December 19, 2025 12:36

sebastian-nagel mentioned this pull request Jan 8, 2026

AdaptiveScoringFilter: Delay revisits of non-canonical pages #37

Merged
