feat: multi-model and multi-label classify-warc output by malteos · Pull Request #22 · commoncrawl/crawl-openathena

malteos · 2026-05-29T08:08:26Z

Summary

classify-warc now scores each record against N fasttext models × M labels per model in one pass, via parallel-list --model-repo / --model-file / --labels flags (with * = all labels of that model, the default).
Output column naming: score_<label> for a single model, score_m<idx>_<label> for multiple — so two models can share a label name (e.g. both GneissWeb classifiers emit __label__cc) without collision.
New --overwrite flag replaces an existing output instead of failing fast.
--resume-from-output validates that the prior CSV's header byte-matches the new run's schema, with a structured added/removed-columns diff on mismatch.
Per-label score stats land in the sidecar as score.<column>.<stat>; the sidecar's arg.score_columns row records the resolved label vocabulary so a run is reproducible from the summary alone.
Notebook parameterized on SCORE_COL (auto-detected from score_* columns, with a prediction_score fallback for legacy CSVs).
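
The column-naming rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `score_column_names` and its input shape (one list of labels per model, in flag order) are assumptions made for the example.

```python
from typing import List, Sequence


def score_column_names(labels_per_model: Sequence[Sequence[str]]) -> List[str]:
    """Build score column names for one or more models (hypothetical helper).

    A single model yields score_<label>; several models yield
    score_m<idx>_<label>, so a label shared by two models stays unique.
    """
    if len(labels_per_model) == 1:
        return [f"score_{label}" for label in labels_per_model[0]]
    return [
        f"score_m{idx}_{label}"
        for idx, labels in enumerate(labels_per_model)
        for label in labels
    ]


# Single model: no model index in the column name.
single = score_column_names([["__label__science", "__label__cc"]])

# Two models that both emit __label__cc: the m<idx> prefix keeps them apart.
multi = score_column_names([
    ["__label__cc", "__label__science"],
    ["__label__hq", "__label__cc"],
])
```

Because fasttext labels already start with `__label__`, the single-model column for `__label__science` is `score___label__science`, with three underscores in a row.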

Test plan

make check — ruff lint, ruff format, pytest (63 pass, 1 skipped real-model)
Real-model end-to-end (pytest --run-real) — language-id classifier round-trip with new column naming
Manual: science model alone, default labels → header has score___label__science and score___label__cc, per-row sum ≈ 1.0, quantum-mechanics doc → 1.0 science
Manual: sci + quality together with both defaults — produces score_m0___label__cc, score_m0___label__science, score_m1___label__hq, score_m1___label__cc (auto-namespaced, no collision)
Manual: --resume-from-output against a CSV with different columns errors with a structured added/removed diff
Manual: --overwrite warns and rewrites; without it, an existing output still errors with a message pointing at the new flag
Notebook executes end-to-end against existing legacy CSVs via the prediction_score fallback path
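
The --resume-from-output check exercised above could look roughly like this. It is a sketch under assumptions: `SchemaMismatch`, `render_header`, `check_resume_header`, and the `url` column are hypothetical names for illustration, and the real implementation may serialize the header differently.

```python
import csv
import io
from typing import List


class SchemaMismatch(ValueError):
    """Raised when the prior CSV header differs from the new run's schema."""

    def __init__(self, added: List[str], removed: List[str]):
        self.added = added
        self.removed = removed
        super().__init__(f"header mismatch: added={added} removed={removed}")


def render_header(columns: List[str]) -> bytes:
    """Serialize a header row the same way the writer would (assumed: UTF-8, \\n)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(columns)
    return buf.getvalue().encode("utf-8")


def check_resume_header(prior_first_line: bytes, new_columns: List[str]) -> None:
    """Accept only a byte-equal header; otherwise report the column diff."""
    if prior_first_line == render_header(new_columns):
        return
    prior_columns = next(csv.reader([prior_first_line.decode("utf-8")]))
    added = [c for c in new_columns if c not in prior_columns]
    removed = [c for c in prior_columns if c not in new_columns]
    raise SchemaMismatch(added, removed)


new_columns = ["url", "score_m0___label__cc", "score_m1___label__cc"]
legacy_header = render_header(["url", "prediction_score"])
```

Comparing bytes rather than column sets means a reordered header is also rejected; in that case both diff lists are empty and only the ordering differs.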

`classify-warc` now scores each record against any number of fasttext models in one pass, and any number of labels per model, instead of one label from one model. `--model-repo`/`--model-file`/`--labels` are parallel-list flags; `*` (the default for `--labels`) expands to every label of that model via `model.get_labels()`. Output columns are `score_<label>` for a single model and `score_m<idx>_<label>` for multiple, so two models can share a label name (e.g. both GneissWeb classifiers emit `__label__cc`) without colliding. Also: `--overwrite` to replace an existing output instead of failing; `--resume-from-output` validates the prior CSV's header is byte-equal to the new run's schema (with a structured added/removed diff on mismatch); per-label stats land in the sidecar as `score.<column>.*`.
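
As an illustration of the `*` expansion described above, here is a minimal sketch. The helper `resolve_labels` is hypothetical, and `available` stands in for whatever `model.get_labels()` returns for that model. Rejecting unknown labels with an error is an assumption; the PR text does not say how they are handled.

```python
from typing import List, Sequence


def resolve_labels(requested: Sequence[str], available: Sequence[str]) -> List[str]:
    """Resolve one model's --labels value against its label vocabulary.

    `available` is assumed to be the list returned by model.get_labels().
    """
    if list(requested) == ["*"]:
        # Default: score every label the model knows about.
        return list(available)
    unknown = [label for label in requested if label not in available]
    if unknown:
        # Assumption: fail loudly rather than emit an always-empty column.
        raise ValueError(f"labels not in model vocabulary: {unknown}")
    return list(requested)


vocab = ["__label__science", "__label__cc"]
all_labels = resolve_labels(["*"], vocab)
one_label = resolve_labels(["__label__science"], vocab)
```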

malteos merged commit ae2b007 into main May 29, 2026
1 check passed

malteos merged 1 commit into main from feat/multi-labels