feat: multi-model multi-label classify-warc output #31

Merged: malteos merged 3 commits into main from feat/multi-labels on Jun 1, 2026

@malteos malteos commented Jun 1, 2026

Collaborator

`classify-warc` now scores each record against any number of fasttext models in one pass, and any number of labels per model, instead of one label from one model. `--model-repo`/`--model-file`/`--labels` are parallel-list flags; `*` (the default for `--labels`) expands to every label of that model via `model.get_labels()`. Output columns are `score_<label>` for a single model and `score_m<idx>_<label>` for multiple, so two models can share a label name (e.g. both GneissWeb classifiers emit `__label__cc`) without colliding.
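
To illustrate the column naming described above, here is a minimal sketch of how `*` expansion and the two column schemes could fit together. It is not the code from this PR: the helper names `expand_labels` and `score_columns` are hypothetical, the label lists stand in for what `model.get_labels()` would return, `__label__hq` and `__label__other` are invented labels, and the sketch assumes `<idx>` is zero-based and that labels appear in column names verbatim.

```python
from typing import Sequence


def expand_labels(requested: Sequence[str], model_labels: Sequence[str]) -> list[str]:
    """Expand a lone '*' to every label of the model; otherwise keep the request."""
    if list(requested) == ["*"]:
        return list(model_labels)
    return list(requested)


def score_columns(labels_per_model: Sequence[Sequence[str]]) -> list[str]:
    """Build score column names for one or more models."""
    if len(labels_per_model) == 1:
        # Single model: no model index in the column name.
        return [f"score_{label}" for label in labels_per_model[0]]
    # Multiple models: prefix with the model index (assumed zero-based here).
    return [
        f"score_m{idx}_{label}"
        for idx, labels in enumerate(labels_per_model)
        for label in labels
    ]


# Stand-ins for model.get_labels() on two models that share "__label__cc".
model_a_labels = ["__label__cc", "__label__hq"]      # "__label__hq" is invented
model_b_labels = ["__label__cc", "__label__other"]   # "__label__other" is invented

single = score_columns([expand_labels(["*"], model_a_labels)])
multi = score_columns([
    expand_labels(["*"], model_a_labels),
    expand_labels(["__label__cc"], model_b_labels),
])
```

In this sketch the shared label yields two distinct columns in `multi`, one per model index, which is the collision the `m<idx>` prefix is meant to avoid.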

Also: `--overwrite` replaces an existing output instead of failing; `--resume-from-output` validates that the prior CSV's header is byte-equal to the new run's schema (with a structured added/removed diff on mismatch); per-label stats land in the sidecar as `score.<column>.*`.
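
The resume check can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `HeaderDiff` and `check_resume_header` are hypothetical names, the `url` column is invented, and the header is assumed to be a plain comma-joined line with no quoting.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HeaderDiff:
    added: list[str]    # columns in the new schema but not in the prior CSV
    removed: list[str]  # columns in the prior CSV but not in the new schema


def check_resume_header(prior_header: str, expected_columns: list[str]) -> HeaderDiff | None:
    """Return None when the headers match byte for byte, else an added/removed diff."""
    expected_header = ",".join(expected_columns)
    if prior_header.encode("utf-8") == expected_header.encode("utf-8"):
        return None
    prior_columns = prior_header.split(",")
    return HeaderDiff(
        added=[c for c in expected_columns if c not in prior_columns],
        removed=[c for c in prior_columns if c not in expected_columns],
    )


# A prior single-model output resumed with a two-model schema ("url" is invented).
diff = check_resume_header(
    "url,score___label__cc",
    ["url", "score_m0___label__cc", "score_m1___label__cc"],
)
```

Because the comparison is byte-level, a header with the same columns in a different order would also count as a mismatch; in this sketch it would produce a diff with empty `added` and `removed` lists.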

malteos added 3 commits May 29, 2026 10:08
@malteos malteos merged commit c15b2c5 into main Jun 1, 2026
1 check passed
@malteos malteos deleted the feat/multi-labels branch June 1, 2026 16:08