feat: Add per-segment drift notebooks (classifier score + MIME type) #21

Merged
malteos merged 2 commits into main from feat/classifier-score-drift-notebook
Jun 1, 2026

@malteos malteos commented May 28, 2026

Collaborator

Adds two sibling notebooks under notebooks/ that surface per-segment drift in the focus crawl CC-SUPPLEMENTAL-2026-22. Both auto-discover the 14-digit YYYYMMDDHHMMSS segment token from filenames / directory names, sort oldest-left, and drop probe / aborted segments via a MIN_RECORDS floor.
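
The discovery step described above could look roughly like the following. This is a minimal sketch, not the notebooks' actual code: the helper names (`segment_token`, `select_segments`) and the example filenames are hypothetical, and it assumes the per-path record counts are already known.

```python
import re

# Exactly 14 digits (YYYYMMDDHHMMSS), not embedded in a longer digit run.
SEGMENT_TOKEN = re.compile(r"(?<!\d)(\d{14})(?!\d)")


def segment_token(path):
    """Return the first 14-digit segment token found in a path, or None."""
    match = SEGMENT_TOKEN.search(str(path))
    return match.group(1) if match else None


def select_segments(record_counts, min_records):
    """Split {path: n_records} into (kept, dropped) token lists, oldest first.

    Paths without a token are ignored; counts for paths that share a token
    are summed before the MIN_RECORDS-style floor is applied.
    """
    by_token = {}
    for path, n in record_counts.items():
        token = segment_token(path)
        if token is not None:
            by_token[token] = by_token.get(token, 0) + n
    # Fixed-width digit strings sort chronologically as plain strings.
    ordered = sorted(by_token)
    kept = [t for t in ordered if by_token[t] >= min_records]
    dropped = [t for t in ordered if by_token[t] < min_records]
    return kept, dropped
```

Returning the dropped tokens separately keeps under-sized segments available for a per-segment table even though they are excluded from plots.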

compare_classifier_scores_segment_drift.ipynb

Loads the per-segment classify-warc output CSVs and plots how the classifier score evolves over the lifetime of the focus crawl:

  • mean ± SEM (sized so real mean shifts are distinguishable from sampling noise; raw stdev is uninformative for the bimodal score distribution),
  • median with IQR (p25-p75) and outer (p10-p90) percentile bands,
  • high-confidence rate (P(score >= 0.5) and P(score >= 0.9)),
  • mean and median on shared axes as a consolidated view.
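
As an illustration of the statistics listed above, a per-segment summary could be computed along these lines. The function name `summarize_scores` and the dictionary keys are assumptions for this sketch; the notebook itself may compute these differently (for example with pandas).

```python
import math
import statistics


def summarize_scores(scores):
    """Summarise one segment's classifier scores (needs at least 2 values)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample stdev divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(scores, n=20, method="inclusive")
    return {
        "n": n,
        "mean": mean,
        "sem": sem,
        "p10": cuts[1],
        "p25": cuts[4],
        "median": cuts[9],
        "p75": cuts[14],
        "p90": cuts[17],
        # Share of records at or above each confidence threshold.
        "rate_ge_0.5": sum(s >= 0.5 for s in scores) / n,
        "rate_ge_0.9": sum(s >= 0.9 for s in scores) / n,
    }
```

Because the SEM shrinks with the square root of the segment size, it stays narrow on large segments even when the score distribution is bimodal and the raw standard deviation is large.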

MIN_RECORDS = 1000 (drops the early probe segment with n < 1000 from plots while keeping it in the per-segment table for auditability).

compare_mime_type_drift.ipynb

Parses the CDX index files (*.cdx.gz) for each segment under cc-focus-tools/data/CC-SUPPLEMENTAL-2026-22/segments/*/cdx/warc/. To avoid re-parsing ~44M JSON records on every notebook run, the first execution materialises a per-CDX cache of (warc_filename, offset, mime_detected) triples under data/cache/CC-SUPPLEMENTAL-2026-22/segments/... (mirrors the source layout, gitignored via the existing data/ rule). Subsequent runs hit the cache in seconds.
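
The caching idea can be sketched as follows. Several details here are assumptions rather than the notebook's actual implementation: the CDX lines are assumed to be CDXJ (`urlkey timestamp {json}`) with `filename`, `offset`, and `mime-detected` keys, the cache is assumed to be a gzipped TSV, and the names `parse_cdx` and `load_triples` are hypothetical.

```python
import csv
import gzip
import json
from pathlib import Path


def parse_cdx(cdx_path):
    """Yield (warc_filename, offset, mime_detected) from one CDXJ file."""
    with gzip.open(cdx_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split(" ", 2)
            if len(parts) < 3:
                continue  # skip lines without a JSON payload
            rec = json.loads(parts[2])
            yield (
                rec.get("filename", ""),
                int(rec.get("offset", 0)),
                rec.get("mime-detected", ""),
            )


def load_triples(cdx_path, source_root, cache_root):
    """Return the triples for one CDX file, building its cache on first use."""
    rel = Path(cdx_path).relative_to(Path(source_root))
    # The cache path mirrors the source layout under cache_root.
    cache_path = Path(cache_root) / rel.parent / (rel.name + ".tsv.gz")
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        tmp_path = cache_path.with_name(cache_path.name + ".tmp")
        with gzip.open(tmp_path, "wt", encoding="utf-8", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(parse_cdx(cdx_path))
        # Rename only after a complete write so a crash leaves no partial cache.
        tmp_path.replace(cache_path)
    with gzip.open(cache_path, "rt", encoding="utf-8", newline="") as fh:
        return [(f, int(o), m) for f, o, m in csv.reader(fh, delimiter="\t")]
```

Caching per CDX file rather than per crawl means a newly added segment only costs the parse time of its own files.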

MIME source is the CDX mime-detected field (content-sniffed, more comparable across origins than the raw Content-Type header), lower-cased and stripped of parameters. The long tail (~360 minor MIME types) is bucketed into "other" via a global top-N ranking so the legend stays stable across segments.
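
A minimal sketch of the normalisation and global top-N bucketing described above; the function names and the `top_n` parameter are illustrative assumptions, not the notebook's API.

```python
from collections import Counter


def normalize_mime(value):
    """Lower-case a MIME value and strip parameters such as '; charset=...'."""
    return value.split(";", 1)[0].strip().lower()


def bucket_by_global_top_n(per_segment, top_n):
    """Fold MIME types outside the global top-N into 'other'.

    per_segment maps segment -> Counter of MIME type -> record count.
    The ranking uses totals across all segments, so every segment shares
    the same set of named categories.
    """
    totals = Counter()
    for counts in per_segment.values():
        totals.update(counts)
    keep = {mime for mime, _ in totals.most_common(top_n)}
    bucketed = {}
    for segment, counts in per_segment.items():
        out = Counter()
        for mime, n in counts.items():
            out[mime if mime in keep else "other"] += n
        bucketed[segment] = out
    return bucketed
```

Ranking per segment instead would let a category appear in one bar and vanish into "other" in the next, which is what the global ranking avoids.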

Plots:

  • Plot 1 — stacked bar of absolute record counts per MIME (composition + segment size).
  • Plot 1b — same, but with text/html excluded so the long tail (application/pdf, application/xhtml+xml, text/plain, ...) becomes readable on a linear y-axis.
  • Plot 1c — total record count per segment (fetch volume only).
  • Plot 2 — stacked bar of normalized share per segment (the headline composition-drift plot).
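
The data behind Plot 1b and Plot 2 can be derived from the per-segment counts roughly as below. This is only a sketch with hypothetical helper names (`drop_types`, `normalized_share`); plotting itself is omitted.

```python
def drop_types(counts, excluded):
    """Return the counts without the excluded MIME types (as in Plot 1b)."""
    return {mime: n for mime, n in counts.items() if mime not in excluded}


def normalized_share(counts):
    """Convert absolute counts to shares that sum to 1 (as in Plot 2)."""
    total = sum(counts.values())
    if total == 0:
        return {mime: 0.0 for mime in counts}
    return {mime: n / total for mime, n in counts.items()}
```

Normalising per segment removes the effect of segment size, so changes in the bars reflect composition rather than fetch volume.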

MIN_RECORDS = 100_000 (CDX records vastly outnumber score records, so the floor is higher than in the sibling notebook).

Adds `notebooks/compare_classifier_scores_segment_drift.ipynb`, a sibling
to the existing baseline-vs-focus comparison notebook. Loads the
per-segment `classify-warc` output CSVs for `CC-SUPPLEMENTAL-2026-22`,
parses the `YYYYMMDDHHMMSS` token from each filename, and plots how the
classifier score evolves over the lifetime of the focus crawl:

- mean ± SEM (sized so real mean shifts are distinguishable from
  sampling noise; raw stdev is uninformative for the bimodal score
  distribution),
- median with IQR (p25-p75) and outer (p10-p90) percentile bands,
- high-confidence rate (P(score >= 0.5) and P(score >= 0.9)),
- mean and median on shared axes as a consolidated view.

Auto-discovers segment files and drops under-sized segments
(n < 1000) from the plots while keeping them in the per-segment table
for auditability.
@malteos malteos requested a review from lfoppiano May 28, 2026 09:12
Adds notebooks/compare_mime_type_drift.ipynb, a sibling to the classifier
score drift notebook. Parses CDX index files (*.cdx.gz) for the focus
crawl segments, caches (warc_filename, offset, mime_detected) per CDX
file under data/cache/ (mirroring the source layout, gitignored), and
visualises composition drift with four bar plots: absolute counts,
absolute counts excluding text/html, total records per segment, and
normalized share. MIME values come from CDX mime-detected (content
sniffed), with the long tail bucketed into "other" via a global top-N
ranking so the legend stays stable across segments.
@malteos malteos changed the title from "feat: Add notebook for per-segment classifier score drift" to "feat: Add per-segment drift notebooks (classifier score + MIME type)" May 28, 2026
@lfoppiano

Collaborator

@malteos This notebook runs on the WARC classifier output, right? Could you provide the input data somewhere so that I can run it locally for testing?

malteos commented Jun 1, 2026

Collaborator Author

The files are in our internal S3 bucket. I will update the notebooks as soon as the files are moved to the public bucket.

@malteos malteos merged commit 42d1af2 into main Jun 1, 2026
1 check passed
@malteos malteos deleted the feat/classifier-score-drift-notebook branch June 1, 2026 16:08

2 participants