feat: multi-model multi-label classify-warc output by malteos · Pull Request #31 · commoncrawl/crawl-openathena

malteos · 2026-06-01T13:23:12Z

`classify-warc` now scores each record against any number of fasttext models in one pass, and any number of labels per model, instead of one label from one model. `--model-repo`/`--model-file`/`--labels` are parallel-list flags; `*` (the default for `--labels`) expands to every label of that model via `model.get_labels()`. Output columns are `score_<label>` for a single model and `score_m<idx>_<label>` for multiple, so two models can share a label name (e.g. both GneissWeb classifiers emit `__label__cc`) without colliding.
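The column naming described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `expand_labels`, `score_columns`, and `FakeModel` are hypothetical names, the model index is assumed to be 0-based, and the label is assumed to be used verbatim (including any `__label__` prefix). The real implementation may differ on both points. The only fasttext call relied on is `get_labels()`, which the stand-in model imitates so the sketch runs without a model file.

```python
from typing import List, Sequence


class FakeModel:
    """Stand-in for a loaded fasttext model; only get_labels() is imitated."""

    def __init__(self, labels: Sequence[str]) -> None:
        self._labels = list(labels)

    def get_labels(self) -> List[str]:
        return list(self._labels)


def expand_labels(model, requested: Sequence[str]) -> List[str]:
    """Expand the '*' wildcard to every label the model knows."""
    if list(requested) == ["*"]:
        return list(model.get_labels())
    return list(requested)


def score_columns(models: Sequence, labels_per_model: Sequence[Sequence[str]]) -> List[str]:
    """Build output column names for parallel lists of models and label selections."""
    if len(models) != len(labels_per_model):
        raise ValueError("models and labels must be parallel lists of equal length")
    multi = len(models) > 1
    columns: List[str] = []
    for idx, (model, requested) in enumerate(zip(models, labels_per_model)):
        # Assumption: 0-based model index, label kept verbatim in the column name.
        prefix = f"score_m{idx}_" if multi else "score_"
        for label in expand_labels(model, requested):
            columns.append(prefix + label)
    return columns


if __name__ == "__main__":
    a = FakeModel(["__label__cc", "__label__hq"])
    b = FakeModel(["__label__cc"])
    print(score_columns([a], [["*"]]))
    print(score_columns([a, b], [["*"], ["__label__cc"]]))
```

With two models that both emit `__label__cc`, the per-model prefix keeps the two columns distinct, which is the collision the description refers to.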

Also: `--overwrite` to replace an existing output instead of failing; `--resume-from-output` validates the prior CSV's header is byte-equal to the new run's schema (with a structured added/removed diff on mismatch); per-label stats land in the sidecar as `score.<column>.*`.
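One way the `--resume-from-output` header check could work is sketched below. Everything here is an assumption for illustration: the helper names `header_bytes` and `check_resume_header`, the `url` column, the UTF-8 encoding, the `\n` line terminator, and the shape of the returned diff are not taken from the PR.

```python
import csv
import io
from typing import Dict, List, Optional, Sequence


def header_bytes(columns: Sequence[str]) -> bytes:
    """Serialize a header row the way this sketch assumes the writer does."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(columns)
    return buf.getvalue().encode("utf-8")


def check_resume_header(
    prior_first_line: bytes, expected_columns: Sequence[str]
) -> Optional[Dict[str, List[str]]]:
    """Return None when the headers are byte-equal, else an added/removed diff."""
    if prior_first_line == header_bytes(expected_columns):
        return None
    prior_columns = next(csv.reader([prior_first_line.decode("utf-8")]), [])
    return {
        # Columns the new run would write that the prior file lacks.
        "added": [c for c in expected_columns if c not in prior_columns],
        # Columns the prior file has that the new run would not write.
        "removed": [c for c in prior_columns if c not in expected_columns],
    }


if __name__ == "__main__":
    prior = header_bytes(["url", "score___label__cc"])
    print(check_resume_header(prior, ["url", "score_m0___label__cc"]))
```

Because the comparison is on bytes, a header with the same columns in a different order also fails the check; in this sketch that case yields a diff with empty `added` and `removed` lists.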


malteos added 3 commits May 29, 2026 10:08

added notebook for multi label eval

9d3cf9c

fixed notebook

6a67460

malteos merged commit c15b2c5 into main Jun 1, 2026
1 check passed

malteos deleted the feat/multi-labels branch June 1, 2026 16:08

malteos merged 3 commits into main from feat/multi-labels
