feat: multi-model and multi-label classify-warc output #22

Merged
malteos merged 1 commit into main from feat/multi-labels
May 29, 2026
malteos (Collaborator) commented May 29, 2026

Summary

  • classify-warc now scores each record against N fasttext models × M labels per model in one pass, via parallel-list --model-repo / --model-file / --labels flags (with * = all labels of that model, the default).
  • Output column naming: score_<label> for a single model, score_m<idx>_<label> for multiple — so two models can share a label name (e.g. both GneissWeb classifiers emit __label__cc) without collision.
  • New --overwrite flag replaces an existing output instead of failing fast.
  • --resume-from-output validates that the prior CSV's header byte-matches the new run's schema, with a structured added/removed-columns diff on mismatch.
  • Per-label score stats land in the sidecar as score.<column>.<stat>; the sidecar's arg.score_columns row records the resolved label vocabulary so a run is reproducible from the summary alone.
  • Notebook parameterized on SCORE_COL (auto-detected from score_* columns, with a prediction_score fallback for legacy CSVs).
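
The column-naming rule above can be sketched as follows. This is a minimal illustration rather than the code in this PR: `score_columns` is a hypothetical helper name, and it assumes each model's label list has already been resolved (with `*` expanded).

```python
def score_columns(labels_per_model):
    """Build output column names from one resolved label list per model.

    Hypothetical helper: a single model yields score_<label>; several
    models yield score_m<idx>_<label>, so shared label names stay distinct.
    """
    multi = len(labels_per_model) > 1
    columns = []
    for idx, labels in enumerate(labels_per_model):
        prefix = f"score_m{idx}_" if multi else "score_"
        for label in labels:
            columns.append(prefix + label)
    return columns


# Single model: no model index in the column name.
print(score_columns([["__label__science", "__label__cc"]]))

# Two models sharing __label__cc: the m<idx> namespace prevents a collision.
print(score_columns([
    ["__label__cc", "__label__science"],
    ["__label__hq", "__label__cc"],
]))
```

Indexing by position rather than by repo name keeps the columns short, at the cost of making the flag order part of the schema.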

Test plan

  • make check — ruff lint, ruff format, pytest (63 pass, 1 skipped real-model)
  • Real-model end-to-end (pytest --run-real) — language-id classifier round-trip with new column naming
  • Manual: science model alone, default labels → header has score___label__science and score___label__cc, per-row sum ≈ 1.0, quantum-mechanics doc → 1.0 science
  • Manual: sci + quality together with both defaults — produces score_m0___label__cc, score_m0___label__science, score_m1___label__hq, score_m1___label__cc (auto-namespaced, no collision)
  • Manual: --resume-from-output against a CSV with different columns errors with a structured added/removed diff
  • Manual: --overwrite warns + rewrites; without it, an existing output still errors with a message pointing at the new flag
  • Notebook executes end-to-end against existing legacy CSVs via the prediction_score fallback path
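
For the `--resume-from-output` check exercised above, a rough sketch of a byte-equality header comparison with an added/removed diff might look like this. The function names, the error wording, and the `url` column are assumptions for illustration; the real implementation may parse and report the header differently.

```python
def header_diff(prior_header, expected_header):
    """Return (added, removed) column lists between two CSV header lines.

    Assumes a plain comma-separated header with no quoted commas.
    """
    prior = prior_header.rstrip("\r\n").split(",")
    expected = expected_header.rstrip("\r\n").split(",")
    added = [c for c in expected if c not in prior]
    removed = [c for c in prior if c not in expected]
    return added, removed


def check_resume_header(prior_header, expected_header):
    """Raise if the prior CSV header is not identical to the new schema."""
    if prior_header == expected_header:
        return
    added, removed = header_diff(prior_header, expected_header)
    # Both lists can be empty when only the column order differs.
    raise ValueError(
        f"cannot resume: header mismatch (added={added}, removed={removed})"
    )


old = "url,score___label__cc\n"  # hypothetical prior single-model run
new = "url,score_m0___label__cc,score_m1___label__hq\n"
try:
    check_resume_header(old, new)
except ValueError as err:
    print(err)
```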

`classify-warc` now scores each record against any number of fasttext
models in one pass, and any number of labels per model, instead of one
label from one model. `--model-repo`/`--model-file`/`--labels` are
parallel-list flags; `*` (the default for `--labels`) expands to every
label of that model via `model.get_labels()`. Output columns are
`score_<label>` for a single model and `score_m<idx>_<label>` for
multiple, so two models can share a label name (e.g. both GneissWeb
classifiers emit `__label__cc`) without colliding.
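
As a hedged sketch of the per-record scoring described above: the function below expands `*` through `model.get_labels()` and reads every label's probability from one `model.predict(text, k=-1)` call, both of which are part of the fasttext Python API. The name `score_record` and the choice to report 0.0 for a label missing from the prediction are assumptions, not necessarily what `classify-warc` does.

```python
def score_record(model, text, labels="*"):
    """Score one record against the requested labels of one model.

    `model` is expected to behave like a loaded fasttext model.
    `labels` is "*" (all labels of the model) or an explicit list.
    """
    wanted = list(model.get_labels()) if labels == "*" else list(labels)
    # fasttext predicts one line at a time and rejects embedded newlines.
    line = text.replace("\n", " ")
    predicted, probs = model.predict(line, k=-1)  # k=-1 returns all labels
    by_label = {label: float(p) for label, p in zip(predicted, probs)}
    # Assumption: a label absent from the prediction is reported as 0.0.
    return {label: by_label.get(label, 0.0) for label in wanted}
```

With several models, the same function would be called once per model and the resulting dictionaries merged under the `score_m<idx>_<label>` column names.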

Also: `--overwrite` to replace an existing output instead of failing;
`--resume-from-output` validates the prior CSV's header is byte-equal
to the new run's schema (with a structured added/removed diff on
mismatch); per-label stats land in the sidecar as `score.<column>.*`.
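
The `score.<column>.*` sidecar keys could be produced along these lines. This is only a sketch: the stat names (`min`, `max`, `mean`) and the `sidecar_stats` helper are assumptions, since the PR text does not list which statistics are recorded.

```python
from statistics import mean


def sidecar_stats(rows, columns):
    """Flatten per-column score statistics into score.<column>.<stat> keys.

    `rows` is a non-empty list of dicts as produced by csv.DictReader.
    The stat names used here are illustrative.
    """
    stats = {}
    for col in columns:
        values = [float(row[col]) for row in rows]
        stats[f"score.{col}.min"] = min(values)
        stats[f"score.{col}.max"] = max(values)
        stats[f"score.{col}.mean"] = mean(values)
    return stats


rows = [
    {"score___label__science": "0.25"},
    {"score___label__science": "0.75"},
]
for key, value in sidecar_stats(rows, ["score___label__science"]).items():
    print(key, value)
```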
malteos merged commit ae2b007 into main on May 29, 2026
1 check passed