Host-link extraction: preserve www. prefix #56

Description

@sebastian-nagel

The CCF host-level web graphs (since 2017) were created with the leading www. stripped from the host name. Unlike SURT normalization, the prefix is stripped only under the following conditions:

  • At least two dot-separated segments remain after stripping (www.com is kept intact).
  • The prefix is exactly www. (www1. prefixes are not stripped).
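
The stripping rule described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the graphs: the function name `strip_leading_www` is hypothetical, and it assumes host names are already lowercased.

```python
def strip_leading_www(host: str) -> str:
    """Strip a leading 'www.' as described in this issue (illustrative sketch).

    Assumption: `host` is a lowercased host name without port or trailing dot.
    """
    prefix = "www."
    # Only the exact prefix 'www.' qualifies; 'www1.' does not match.
    if host.startswith(prefix):
        rest = host[len(prefix):]
        segments = rest.split(".")
        # Keep the host intact unless at least two non-empty segments remain.
        if len(segments) >= 2 and all(segments):
            return rest
    return host


if __name__ == "__main__":
    for name in ("www.example.com", "www.com", "www1.example.com", "example.com"):
        print(name, "->", strip_leading_www(name))
```

With this sketch, `www.example.com` becomes `example.com`, while `www.com` and `www1.example.com` are returned unchanged.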

The reason for the stripping is the reduced size of the host-level web graphs:

  • Back in 2017 this made the graphs approx. 10% smaller.
  • This figure has dropped over the years, and the storage savings are now only about 5%.

While the storage benefit has shrunk, the stripping has several disadvantages:

  1. Joining web graph data with other host-level data is more difficult.
  2. Fetching the homepage of a stripped host name requires following a redirect, and the request may even fail or return a different result.
  3. Extra documentation is required, in addition to the reverse domain name notation.
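
To illustrate the first point: when joining a graph built with stripping against other host-level data, a vertex name may stand for the host with or without the www. prefix. The sketch below converts between a host name and the reverse domain name notation, then lists the host names a stripped vertex could correspond to. The function names are hypothetical, and the candidate logic only assumes the stripping rule stated in this issue.

```python
def reverse_host(name: str) -> str:
    """Convert between 'www.example.com' and 'com.example.www' (self-inverse)."""
    return ".".join(reversed(name.split(".")))


def candidate_hosts(reversed_name: str) -> list[str]:
    """List host names that a vertex of a www-stripped graph may stand for.

    Assumption: a leading 'www.' was stripped only if at least two
    dot-separated segments remained, so single-segment names have
    no 'www.' variant.
    """
    host = reverse_host(reversed_name)
    candidates = [host]
    if host.count(".") >= 1:
        # The original host may have carried the stripped prefix.
        candidates.append("www." + host)
    return candidates


if __name__ == "__main__":
    print(candidate_hosts("com.example"))
```

A join against other host-level data would then need to try every candidate, which is the extra effort that preserving the prefix avoids.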

Starting with the first web graph in 2026 (cc-main-2025-26-nov-dec-jan-host), the leading www. in a host name will be preserved.
