
Detect canonical links in Fetcher #36

Merged: sebastian-nagel merged 5 commits into cc from canonical-links on Dec 19, 2025

@sebastian-nagel

Canonical links (1, 2, 3) can be used by webmasters to signal that a page is a duplicate of another one.

An evaluation on a random set of WARC files showed that about 60% of web pages include a canonical link. This observation is similar to those reported in 2 and 4.

The requirement is a lazy extraction of canonical links in a Nutch workflow where content is not parsed but only written to WARC files. The links are stored in the CrawlDb and are planned to be used later to reduce the amount of duplicated content.

Design decisions:

  • Use an HTTP parser to extract canonical links from HTTP headers.
  • Scan only the first 64 kiB of HTML content for canonical links. This is not sufficient to extract all canonical links; it is a compromise between speed and recall.
  • Avoid converting binary HTML content to a Java String; scan the content byte array directly, assuming ASCII / ISO-8859-1 encoding.
  • Pass found links to CrawlDb by adding them to the metadata of the fetch CrawlDatum item.
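
The byte-array scanning described in the list above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the class `CanonicalLinkSketch`, the `ByteView` wrapper, the method `findCanonical`, and the simplified regular expressions are all hypothetical names and assumptions made for illustration. It maps each byte to one char (ISO-8859-1), so no String copy of the content is created, and it stops after the first canonical link found.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CanonicalLinkSketch {

  /** Read-only CharSequence view over a byte array, one byte per char (ISO-8859-1). */
  static final class ByteView implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    ByteView(byte[] data, int offset, int length) {
      this.data = data;
      this.offset = offset;
      this.length = length;
    }

    @Override
    public int length() {
      return length;
    }

    @Override
    public char charAt(int index) {
      // Mask to avoid sign extension for bytes >= 0x80.
      return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
      // Shares the underlying array, no copy.
      return new ByteView(data, offset + start, end - start);
    }

    @Override
    public String toString() {
      return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }
  }

  // Simplified patterns (assumption): real-world HTML needs more lenient matching.
  private static final Pattern LINK_TAG =
      Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
  private static final Pattern REL_CANONICAL =
      Pattern.compile("\\brel\\s*=\\s*[\"']?canonical\\b", Pattern.CASE_INSENSITIVE);
  private static final Pattern HREF =
      Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

  /** Returns the first canonical link in the first maxBytes of content, or null. */
  static String findCanonical(byte[] content, int maxBytes) {
    int limit = Math.min(content.length, maxBytes);
    CharSequence view = new ByteView(content, 0, limit);
    Matcher tags = LINK_TAG.matcher(view);
    while (tags.find()) {
      CharSequence tag = view.subSequence(tags.start(), tags.end());
      if (REL_CANONICAL.matcher(tag).find()) {
        Matcher href = HREF.matcher(tag);
        if (href.find()) {
          return href.group(1); // only the matched URL is copied into a String
        }
      }
    }
    return null;
  }
}
```

A call such as `CanonicalLinkSketch.findCanonical(content, 64 * 1024)` would correspond to the 64 kiB limit chosen above; a `null` result means no canonical link was found within the scanned prefix.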

The decisions are rooted in an experimental setup where the canonical link detection was tested on WARC input. Below are the evaluation metrics for CC-MAIN-20240907095856-20240907125856-00058.warc.gz, followed by the parameters used:

kiB scanned   links found   String (ms)   BACS (ms)
          1          7468           273         214
          2         11967           407         372
          4         15034           657         546
          8         16617           980         899
         16         18047          1528        1371
         32         19761          2618        2093
         64         20566          3923        2823
        128         21009          5481        3722
        256         21255          6691        4887
        512         21381          7502        4996

BACS = ByteArrayCharSequence

                   N = 31498  (documents processed)
max. canonical links =     1  (stop after first link found)

Scanning the first 64 kiB finds 96% of the canonical links found in the first 512 kiB while requiring only 57% of the computation time. Using the ByteArrayCharSequence instead of a Java String is a significant performance improvement. The full evaluation metrics: canonical-link-evaluation-CC-MAIN-20240907095856-20240907125856-0005.tsv
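
For reference, the percentages can be reproduced from the 64 kiB and 512 kiB rows of the table. The 57% figure matches the ByteArrayCharSequence timings; the last ratio (the time saved by ByteArrayCharSequence relative to String at 64 kiB) is not stated in the text and is derived here only as an illustration of the improvement.

```latex
\frac{20566}{21381} \approx 0.962 \quad (\text{links found, 64 kiB vs. 512 kiB})
\qquad
\frac{2823}{4996} \approx 0.565 \quad (\text{ByteArrayCharSequence time, 64 kiB vs. 512 kiB})
\qquad
\frac{2823}{3923} \approx 0.720 \quad (\text{ByteArrayCharSequence vs. String time at 64 kiB})
```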

The CrawlDb is expected to grow by 40–60% if it includes canonical links. The 60% growth is in line with the 60% adoption rate of canonical links. In practice, the number is lower because items in the CrawlDb that failed to fetch naturally do not have a canonical link.

Example of a CrawlDb record with canonical link:

https://blog.commoncrawl.org/ccbot      Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 10 10:31:58 CET 2025
Modified time: Wed Dec 03 10:31:58 CET 2025
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: 1097648fedeebc1db8dbe62354b698e8
Metadata: 
        _lst_=29412568
        _pst_=success(1), lastModified=1764367820000
        _rs_=28
        Content-Type=text/html
        nutch.protocol.code=200
        canonical.link=https://commoncrawl.org/ccbot

- Add lazy extractor for canonical links
  in HTTP header and HTML
- Stub call in Fetcher
- put canonical link into CrawlDatum metadata
- put null value if no canonical link was found
  to allow that updates can overwrite existing
  values
- document property `fetcher.detect.canonical.link`
sebastian-nagel merged commit 994316e into cc on Dec 19, 2025
1 check passed
sebastian-nagel deleted the canonical-links branch on December 19, 2025 12:36
sebastian-nagel added a commit that referenced this pull request on Jan 8, 2026
- cf. #36 for lazy canonical link detection
- delay revisits on pages with a canonical link pointing to
  a different URL
- the delay is configurable per property
    scoring.adaptive.penalty.non_canonical
  as a penalty on the generator sort value
- fix typos in documentation