feat: Add per-segment drift notebooks (classifier score + MIME type) by malteos · Pull Request #21 · commoncrawl/crawl-openathena

malteos · 2026-05-28T09:12:44Z

Adds two sibling notebooks under notebooks/ that surface per-segment drift in the focus crawl CC-SUPPLEMENTAL-2026-22. Both auto-discover the 14-digit YYYYMMDDHHMMSS segment token from filenames / directory names, sort oldest-left, and drop probe / aborted segments via a MIN_RECORDS floor.
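The discovery step described above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the helper names `segment_token` and `select_segments`, the example file names, and the assumption that the token is never adjacent to other digits are all hypothetical.

```python
import re
from typing import Optional

# Assumption: the segment token is exactly 14 digits (YYYYMMDDHHMMSS)
# and is not directly adjacent to other digits in the name.
SEGMENT_TOKEN = re.compile(r"(?<!\d)(\d{14})(?!\d)")
MIN_RECORDS = 1000  # floor quoted for the classifier-score notebook


def segment_token(name: str) -> Optional[str]:
    """Return the 14-digit token found in a file or directory name, if any."""
    match = SEGMENT_TOKEN.search(name)
    return match.group(1) if match else None


def select_segments(record_counts: dict, min_records: int = MIN_RECORDS) -> list:
    """Map {name: record count} to the tokens that pass the floor, oldest first."""
    kept = []
    for name, n in record_counts.items():
        token = segment_token(name)
        if token is not None and n >= min_records:
            kept.append(token)
    # A fixed-width timestamp sorts chronologically as a plain string.
    return sorted(kept)
```

Because the token is fixed-width, a plain string sort is enough to get the oldest-left ordering without parsing dates.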

`compare_classifier_scores_segment_drift.ipynb`

Loads the per-segment classify-warc output CSVs and plots how the classifier score evolves over the lifetime of the focus crawl:

- mean ± SEM (sized so real mean shifts are distinguishable from sampling noise; raw stdev is uninformative for the bimodal score distribution),
- median with IQR (p25-p75) and outer (p10-p90) percentile bands,
- high-confidence rate (P(score >= 0.5) and P(score >= 0.9)),
- mean and median on shared axes as a consolidated view.

MIN_RECORDS = 1000 (drops the early probe segment with n < 1000 from plots while keeping it in the per-segment table for auditability).
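For reference, the per-segment statistics listed above could be computed along these lines. This is a standard-library sketch under stated assumptions: the function names and the linear-interpolation percentile are illustrative choices, and the notebook may well use pandas or NumPy instead.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def segment_summary(scores):
    """Summarise one segment's classifier scores (requires at least 2 scores)."""
    n = len(scores)
    ordered = sorted(scores)
    return {
        "n": n,
        "mean": statistics.fmean(scores),
        # SEM = sample standard deviation / sqrt(n); it shrinks as n grows,
        # which is what makes small mean shifts visible on large segments.
        "sem": statistics.stdev(scores) / math.sqrt(n),
        "median": percentile(ordered, 50),
        "p10": percentile(ordered, 10),
        "p25": percentile(ordered, 25),
        "p75": percentile(ordered, 75),
        "p90": percentile(ordered, 90),
        "rate_ge_0.5": sum(s >= 0.5 for s in scores) / n,
        "rate_ge_0.9": sum(s >= 0.9 for s in scores) / n,
    }
```

A summary like this would be computed for every segment, including the ones below the floor, so the table stays complete while only the plots filter on `n`.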

`compare_mime_type_drift.ipynb`

Parses the CDX index files (*.cdx.gz) for each segment under cc-focus-tools/data/CC-SUPPLEMENTAL-2026-22/segments/*/cdx/warc/. To avoid re-parsing ~44M JSON records on every notebook run, the first execution materialises a per-CDX cache of (warc_filename, offset, mime_detected) triples under data/cache/CC-SUPPLEMENTAL-2026-22/segments/... (mirrors the source layout, gitignored via the existing data/ rule). Subsequent runs hit the cache in seconds.
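The caching idea can be illustrated with a small sketch. It assumes the Common Crawl CDXJ line layout (`urlkey timestamp {json}`) with `filename`, `offset` and `mime-detected` keys in the JSON part; the function names and the tab-separated cache format are hypothetical and may differ from what the notebook writes.

```python
import gzip
import json
from pathlib import Path


def parse_cdx_line(line):
    """Return (warc_filename, offset, mime_detected) from one CDXJ line."""
    record = json.loads(line[line.index("{"):])
    return record["filename"], int(record["offset"]), record.get("mime-detected", "")


def load_triples(cdx_path, source_root, cache_root):
    """Parse one *.cdx.gz file, reusing a cache that mirrors the source layout."""
    cdx_path = Path(cdx_path)
    # e.g. <cache_root>/segments/<token>/cdx/warc/part.cdx.tsv (hypothetical name)
    cache_path = Path(cache_root) / cdx_path.relative_to(source_root).with_suffix(".tsv")
    if cache_path.exists():
        with cache_path.open(encoding="utf-8") as fh:
            rows = [line.rstrip("\n").split("\t") for line in fh]
        return [(name, int(offset), mime) for name, offset, mime in rows]

    # Cache miss: do the expensive JSON parse once, then persist the result.
    with gzip.open(cdx_path, "rt", encoding="utf-8") as fh:
        triples = [parse_cdx_line(line) for line in fh if line.strip()]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as fh:
        for name, offset, mime in triples:
            fh.write(f"{name}\t{offset}\t{mime}\n")
    return triples
```

Keeping one cache file per CDX file means a newly added segment only costs the parse for its own files; existing caches are reused untouched.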

MIME source is the CDX mime-detected field (content-sniffed, more comparable across origins than the raw Content-Type header), lower-cased and stripped of parameters. The long tail (~360 minor MIME types) is bucketed into "other" via a global top-N ranking so the legend stays stable across segments.
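A minimal sketch of the normalisation and the global top-N bucketing follows. The function names and the `top_n` parameter are assumptions for illustration; the PR does not state the actual value of N.

```python
from collections import Counter


def normalize_mime(raw):
    """Lower-case a MIME value and strip parameters such as '; charset=...'."""
    return raw.split(";", 1)[0].strip().lower()


def bucket_long_tail(per_segment, top_n):
    """Fold everything outside the global top-N MIME types into 'other'.

    per_segment maps segment token -> Counter(mime -> record count).
    The ranking is computed over all segments together, so every segment
    ends up with the same set of legend entries.
    """
    total = Counter()
    for counts in per_segment.values():
        total.update(counts)
    keep = {mime for mime, _ in total.most_common(top_n)}

    bucketed = {}
    for segment, counts in per_segment.items():
        out = Counter()
        for mime, n in counts.items():
            out[mime if mime in keep else "other"] += n
        bucketed[segment] = out
    return bucketed
```

Ranking globally rather than per segment is what keeps the legend stable: a type that is rare in one segment but common overall still gets its own colour everywhere.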

Plots:

Plot 1 — stacked bar of absolute record counts per MIME (composition + segment size).
Plot 1b — same, but with text/html excluded so the long tail (application/pdf, application/xhtml+xml, text/plain, ...) becomes readable on a linear y-axis.
Plot 1c — total record count per segment (fetch volume only).
Plot 2 — stacked bar of normalized share per segment (the headline composition-drift plot).
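The normalized-share view in Plot 2 amounts to dividing each segment's counts by that segment's total, so segments of very different size become comparable. A small sketch, with a hypothetical helper name:

```python
def normalized_shares(counts):
    """Convert {mime: record count} for one segment into {mime: share of total}."""
    total = sum(counts.values())
    if total == 0:
        # An empty segment has no composition; report zero shares.
        return {mime: 0.0 for mime in counts}
    return {mime: n / total for mime, n in counts.items()}
```

The absolute-count plots answer "how much was fetched", while these shares answer "what was it made of", which is why both views are kept.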

MIN_RECORDS = 100_000 (CDX records vastly outnumber score records, so the floor is higher than in the sibling notebook).

Adds `notebooks/compare_classifier_scores_segment_drift.ipynb`, a sibling to the existing baseline-vs-focus comparison notebook. Loads the per-segment `classify-warc` output CSVs for `CC-SUPPLEMENTAL-2026-22`, parses the `YYYYMMDDHHMMSS` token from each filename, and plots how the classifier score evolves over the lifetime of the focus crawl:

- mean ± SEM (sized so real mean shifts are distinguishable from sampling noise; raw stdev is uninformative for the bimodal score distribution),
- median with IQR (p25-p75) and outer (p10-p90) percentile bands,
- high-confidence rate (P(score >= 0.5) and P(score >= 0.9)),
- mean and median on shared axes as a consolidated view.

Auto-discovers segment files and drops under-sized segments (n < 1000) from the plots while keeping them in the per-segment table for auditability.

Adds notebooks/compare_mime_type_drift.ipynb, a sibling to the classifier score drift notebook. Parses CDX index files (*.cdx.gz) for the focus crawl segments, caches (warc_filename, offset, mime_detected) per CDX file under data/cache/ (mirroring the source layout, gitignored), and visualises composition drift with four bar plots: absolute counts, absolute counts excluding text/html, total records per segment, and normalized share. MIME values come from CDX mime-detected (content sniffed), with the long tail bucketed into "other" via a global top-N ranking so the legend stays stable across segments.

lfoppiano · 2026-05-28T13:53:28Z

@malteos This notebook runs on the WARC classifier output, right? Could you provide the input data somewhere so that I can try to run it locally for testing?

malteos · 2026-06-01T13:22:06Z

The files are in our internal S3 bucket. I will update the notebooks as soon as the files are moved to the public bucket.

malteos requested a review from lfoppiano May 28, 2026 09:12

malteos changed the title ~~feat: Add notebook for per-segment classifier score drift~~ feat: Add per-segment drift notebooks (classifier score + MIME type) May 28, 2026

malteos merged commit 42d1af2 into main Jun 1, 2026
1 check passed

malteos deleted the feat/classifier-score-drift-notebook branch June 1, 2026 16:08

malteos merged 2 commits into main from feat/classifier-score-drift-notebook
