commoncrawl · malteos · May 29, 2026 · May 29, 2026
diff --git a/README.md b/README.md
@@ -31,9 +31,17 @@ uv run ccoa --help
 
 `ccoa classify-warc` streams WARC files from S3 (or any fsspec URL),
 extracts plain text from each response record with trafilatura, and
-applies a HuggingFace-hosted fasttext classifier. Per-record output is a
-CSV `URL,prediction_score,warc_filename,warc_record_index`; a one-shot score-distribution summary is
-logged at the end and written to a `<output>.summary.csv` file.
+applies one or more HuggingFace-hosted fasttext classifiers in a single
+pass. Per-record output is a CSV with one `score_<label>` column per
+requested label, between `URL` and the `warc_filename`/`warc_record_index`
+tail:
+
+```
+URL,score_<label_1>,...,score_<label_N>,warc_filename,warc_record_index
+```
+
+A per-column score-distribution summary is logged at the end and written
+to a `<output>.summary.csv` file.
 
 ```bash
 uv run ccoa classify-warc \
@@ -86,10 +94,37 @@ HTTPS gateway URL
 (`https://data.commoncrawl.org/...`) with no credentials.
 
 The default classifier is
-[`ibm-granite/GneissWeb.Sci_classifier`](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier)
-(`__label__science`). The first run downloads the ~4 GB model into the
-HuggingFace cache. Override with `--model-repo`, `--model-file`, and
-`--target-label`.
+[`ibm-granite/GneissWeb.Sci_classifier`](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier).
+Without `--labels` it emits both of the model's labels —
+`score___label__science` and `score___label__cc`, which sum to 1.0 per
+record. The first run downloads the ~4 GB model into the HuggingFace
+cache. Override with `--model-repo`, `--model-file`, and `--labels`.
+
+`--model-repo` and `--model-file` are list-valued and zipped positionally,
+so you can score against multiple classifiers in one pass:
+
+```bash
+uv run ccoa classify-warc \
+  --warc-paths 's3://commoncrawl/.../*.warc.gz' \
+  --model-repo ibm-granite/GneissWeb.Sci_classifier ibm-granite/GneissWeb.Quality_annotator \
+  --model-file fasttext_science.bin <quality_model_filename.bin> \
+  --output data/classified.csv
+```
+
+`--labels` is also list-valued (one entry per model). Each entry is a
+comma-separated list of labels (`"__label__science,__label__cc"`) or the
+literal `*` to use all of that model's labels (the default when `--labels`
+is omitted). Output columns are emitted in the order: models in CLI order,
+labels in the order given (or model-internal order for `*`).
+
+Column naming depends on whether the run has one model or many:
+
+- **Single model**: `score_<label>` (e.g. `score___label__science`).
+- **Multiple models**: `score_m<idx>_<label>`, where `<idx>` is the 0-based
+  CLI position of the model (e.g. `score_m0___label__science`,
+  `score_m1___label__hq`). This namespacing means two models can share a
+  label name — Sci_classifier and Quality_annotator both emit
+  `__label__cc` — without colliding.
 
 `--output` accepts `-` for stdout, any local path, or any fsspec URL —
 including `s3://bucket/key.csv`. S3 outputs use the same `--anonymous-s3`
@@ -133,11 +168,14 @@ uv run ccoa classify-warc \
   --output data/classified__resume-2.csv
 ```
 
-The resume CSV must include the `warc_filename` and `warc_record_index`
-columns (the current output schema always does). Records matching that
-`(warc_filename, record_index)` pair are skipped on the new run; the
-new `--output` contains only the missing rows. Concatenate the two
-CSVs (drop the second header) to get a complete result.
+The resume CSV's header must match the new run's output schema
+**exactly** — same `score_<label>` columns in the same order, between
+the leading `URL` and the trailing `warc_filename`/`warc_record_index`.
+Any drift (reorder, missing, extra) is rejected fast with a structured
+diff so a concatenation (drop the second header) yields a well-formed
+CSV. Records matching that `(warc_filename, record_index)` pair are
+skipped on the new run; the new `--output` contains only the missing
+rows.
 
 With `--records-per-file-limit N` the limit is interpreted as the
 **target total** per file (resumed + new). Files already at the target

diff --git a/notebooks/compare_classifier_scores.ipynb b/notebooks/compare_classifier_scores.ipynb
@@ -35,10 +35,10 @@
    "id": "cell-01-imports",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:21.594052Z",
-     "iopub.status.busy": "2026-05-26T10:15:21.593943Z",
-     "iopub.status.idle": "2026-05-26T10:15:41.227039Z",
-     "shell.execute_reply": "2026-05-26T10:15:41.226661Z"
+     "iopub.execute_input": "2026-05-28T19:40:08.502488Z",
+     "iopub.status.busy": "2026-05-28T19:40:08.502328Z",
+     "iopub.status.idle": "2026-05-28T19:40:19.488256Z",
+     "shell.execute_reply": "2026-05-28T19:40:19.487890Z"
     }
    },
    "outputs": [
@@ -98,7 +98,7 @@
    "source": [
     "## 1. Load the score data\n",
     "\n",
-    "The output CSV schema is `URL,prediction_score,warc_filename,warc_record_index`; we only need `prediction_score` for distribution analysis.\n",
+    "The output CSV schema is `URL,score_<label_1>,...,score_<label_N>,warc_filename,warc_record_index` — one score column per model label. The cell below auto-detects available `score_*` columns and picks the first as `SCORE_COL`; change it (or extend `load_scores`) to compare other labels. Older single-column outputs (`prediction_score`) are accepted as a fallback so existing CSVs in `data/` still work.\n",
     "\n",
     "The focus crawl can be split across **multiple parts** when `--resume-from-output` was used to recover from a crash (e.g. `…_1m_scores.csv` for the first run plus `…_1m_scores__resume-2.csv` for the continuation). We auto-discover all parts matching `{focus name}_1m_scores*.csv` (filtering out the `.summary.csv` sidecars) and concatenate them into a single distribution.\n",
     "\n",
@@ -113,13 +113,21 @@
    "id": "cell-03-load",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:41.228341Z",
-     "iopub.status.busy": "2026-05-26T10:15:41.228232Z",
-     "iopub.status.idle": "2026-05-26T10:15:41.973842Z",
-     "shell.execute_reply": "2026-05-26T10:15:41.973414Z"
+     "iopub.execute_input": "2026-05-28T19:40:19.489927Z",
+     "iopub.status.busy": "2026-05-28T19:40:19.489801Z",
+     "iopub.status.idle": "2026-05-28T19:40:20.224124Z",
+     "shell.execute_reply": "2026-05-28T19:40:20.223694Z"
     }
    },
    "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Score columns in baseline: ['prediction_score (legacy)']\n",
+      "Analysing: prediction_score\n"
+     ]
+    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -136,9 +144,19 @@
     }
    ],
    "source": [
+    "# Auto-detect the score column to analyse. classify-warc now emits one\n",
+    "# `score_<label>` column per model label; older outputs had a single\n",
+    "# `prediction_score`. Edit SCORE_COL to compare other labels.\n",
+    "_baseline_columns = pd.read_csv(BASELINE_CSV, nrows=0).columns.tolist()\n",
+    "SCORE_COLUMNS = [c for c in _baseline_columns if c.startswith(\"score_\")]\n",
+    "SCORE_COL = SCORE_COLUMNS[0] if SCORE_COLUMNS else \"prediction_score\"\n",
+    "print(f\"Score columns in baseline: {SCORE_COLUMNS or [SCORE_COL + ' (legacy)']}\")\n",
+    "print(f\"Analysing: {SCORE_COL}\")\n",
+    "\n",
+    "\n",
     "def load_scores(csv_path: Path) -> pd.Series:\n",
-    "    \"\"\"Read the prediction_score column from a classify-warc output CSV.\"\"\"\n",
-    "    series = pd.read_csv(csv_path, usecols=[\"prediction_score\"])[\"prediction_score\"]\n",
+    "    \"\"\"Read the `SCORE_COL` column from a classify-warc output CSV.\"\"\"\n",
+    "    series = pd.read_csv(csv_path, usecols=[SCORE_COL])[SCORE_COL]\n",
     "    return series.astype(float)\n",
     "\n",
     "\n",
@@ -151,7 +169,9 @@
     "    is_toy = False\n",
     "    ratio = len(focus_scores) / max(len(baseline_scores), 1)\n",
     "    is_partial = ratio < 0.95\n",
-    "    per_part = \", \".join(f\"{p.name}: {len(s):,}\" for p, s in zip(FOCUS_CSVS, parts))\n",
+    "    per_part = \", \".join(\n",
+    "        f\"{p.name}: {len(s):,}\" for p, s in zip(FOCUS_CSVS, parts, strict=True)\n",
+    "    )\n",
     "    print(\n",
     "        f\"{FOCUS_NAME}: {len(focus_scores):,} scores loaded from {len(FOCUS_CSVS)} \"\n",
     "        f\"file(s) ({ratio:.0%} of baseline size). Parts → {per_part}\"\n",
@@ -164,7 +184,7 @@
     "else:\n",
     "    rng = np.random.default_rng(42)\n",
     "    noisy = baseline_scores.to_numpy() + 0.05 + rng.normal(0.0, 0.10, size=len(baseline_scores))\n",
-    "    focus_scores = pd.Series(np.clip(noisy, 0.0, 1.0), name=\"prediction_score\")\n",
+    "    focus_scores = pd.Series(np.clip(noisy, 0.0, 1.0), name=SCORE_COL)\n",
     "    is_toy = True\n",
     "    is_partial = False\n",
     "    print(\n",
@@ -195,10 +215,10 @@
    "id": "cell-05-stats",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:41.974946Z",
-     "iopub.status.busy": "2026-05-26T10:15:41.974894Z",
-     "iopub.status.idle": "2026-05-26T10:15:42.081238Z",
-     "shell.execute_reply": "2026-05-26T10:15:42.080957Z"
+     "iopub.execute_input": "2026-05-28T19:40:20.225163Z",
+     "iopub.status.busy": "2026-05-28T19:40:20.225105Z",
+     "iopub.status.idle": "2026-05-28T19:40:20.333180Z",
+     "shell.execute_reply": "2026-05-28T19:40:20.332861Z"
     }
    },
    "outputs": [
@@ -381,10 +401,10 @@
    "id": "cell-07-hist",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:42.082323Z",
-     "iopub.status.busy": "2026-05-26T10:15:42.082276Z",
-     "iopub.status.idle": "2026-05-26T10:15:42.642052Z",
-     "shell.execute_reply": "2026-05-26T10:15:42.641716Z"
+     "iopub.execute_input": "2026-05-28T19:40:20.334135Z",
+     "iopub.status.busy": "2026-05-28T19:40:20.334086Z",
+     "iopub.status.idle": "2026-05-28T19:40:20.602002Z",
+     "shell.execute_reply": "2026-05-28T19:40:20.601654Z"
     }
    },
    "outputs": [
@@ -403,15 +423,15 @@
     "fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)\n",
     "bins = np.linspace(0.0, 1.0, 81)\n",
     "\n",
-    "for ax, log_scale in zip(axes, [False, True]):\n",
+    "for ax, log_scale in zip(axes, [False, True], strict=True):\n",
     "    ax.hist(baseline_scores, bins=bins, alpha=0.5, label=BASELINE_NAME, color=\"C0\", density=True)\n",
     "    ax.hist(focus_scores, bins=bins, alpha=0.5, label=FOCUS_NAME, color=\"C1\", density=True)\n",
     "    if log_scale:\n",
     "        ax.set_yscale(\"log\")\n",
     "        ax.set_title(\"Histogram (log y, density-normalised)\")\n",
     "    else:\n",
     "        ax.set_title(\"Histogram (linear y, density-normalised)\")\n",
-    "    ax.set_xlabel(\"prediction_score\")\n",
+    "    ax.set_xlabel(SCORE_COL)\n",
     "    ax.set_ylabel(\"density\")\n",
     "    ax.legend()\n",
     "    ax.grid(True, alpha=0.3)\n",
@@ -436,10 +456,10 @@
    "id": "cell-09-ecdf",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:42.643283Z",
-     "iopub.status.busy": "2026-05-26T10:15:42.643203Z",
-     "iopub.status.idle": "2026-05-26T10:15:43.228237Z",
-     "shell.execute_reply": "2026-05-26T10:15:43.227857Z"
+     "iopub.execute_input": "2026-05-28T19:40:20.602921Z",
+     "iopub.status.busy": "2026-05-28T19:40:20.602874Z",
+     "iopub.status.idle": "2026-05-28T19:40:21.206119Z",
+     "shell.execute_reply": "2026-05-28T19:40:21.205709Z"
     }
    },
    "outputs": [
@@ -469,7 +489,7 @@
     "]:\n",
     "    x, y = ecdf(series)\n",
     "    plt.plot(x, y, label=name, color=color)\n",
-    "plt.xlabel(\"prediction_score\")\n",
+    "plt.xlabel(SCORE_COL)\n",
     "plt.ylabel(\"ECDF P(score ≤ x)\")\n",
     "plt.title(\"Empirical CDF of classifier scores\")\n",
     "plt.legend()\n",
@@ -494,10 +514,10 @@
    "id": "cell-11-box",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:43.229226Z",
-     "iopub.status.busy": "2026-05-26T10:15:43.229167Z",
-     "iopub.status.idle": "2026-05-26T10:15:43.929555Z",
-     "shell.execute_reply": "2026-05-26T10:15:43.929175Z"
+     "iopub.execute_input": "2026-05-28T19:40:21.207200Z",
+     "iopub.status.busy": "2026-05-28T19:40:21.207147Z",
+     "iopub.status.idle": "2026-05-28T19:40:21.891974Z",
+     "shell.execute_reply": "2026-05-28T19:40:21.891665Z"
     }
    },
    "outputs": [
@@ -519,14 +539,14 @@
     "labels = [BASELINE_NAME, FOCUS_NAME]\n",
     "\n",
     "axes[0].boxplot(data, tick_labels=labels, showfliers=False)\n",
-    "axes[0].set_ylabel(\"prediction_score\")\n",
+    "axes[0].set_ylabel(SCORE_COL)\n",
     "axes[0].set_title(\"Box plot (outliers hidden)\")\n",
     "axes[0].grid(True, alpha=0.3)\n",
     "\n",
     "axes[1].violinplot(data, showmedians=True)\n",
     "axes[1].set_xticks([1, 2])\n",
     "axes[1].set_xticklabels(labels)\n",
-    "axes[1].set_ylabel(\"prediction_score\")\n",
+    "axes[1].set_ylabel(SCORE_COL)\n",
     "axes[1].set_title(\"Violin plot\")\n",
     "axes[1].grid(True, alpha=0.3)\n",
     "\n",
@@ -552,10 +572,10 @@
    "id": "cell-13-qq",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:43.930547Z",
-     "iopub.status.busy": "2026-05-26T10:15:43.930496Z",
-     "iopub.status.idle": "2026-05-26T10:15:44.006062Z",
-     "shell.execute_reply": "2026-05-26T10:15:44.005773Z"
+     "iopub.execute_input": "2026-05-28T19:40:21.893014Z",
+     "iopub.status.busy": "2026-05-28T19:40:21.892965Z",
+     "iopub.status.idle": "2026-05-28T19:40:21.973772Z",
+     "shell.execute_reply": "2026-05-28T19:40:21.973431Z"
     }
    },
    "outputs": [
@@ -611,10 +631,10 @@
    "id": "cell-15-tests",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:44.007069Z",
-     "iopub.status.busy": "2026-05-26T10:15:44.007022Z",
-     "iopub.status.idle": "2026-05-26T10:15:44.240328Z",
-     "shell.execute_reply": "2026-05-26T10:15:44.239994Z"
+     "iopub.execute_input": "2026-05-28T19:40:21.974787Z",
+     "iopub.status.busy": "2026-05-28T19:40:21.974737Z",
+     "iopub.status.idle": "2026-05-28T19:40:22.210867Z",
+     "shell.execute_reply": "2026-05-28T19:40:22.210554Z"
     }
    },
    "outputs": [
@@ -714,10 +734,10 @@
    "id": "cell-17-effect",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:44.241419Z",
-     "iopub.status.busy": "2026-05-26T10:15:44.241371Z",
-     "iopub.status.idle": "2026-05-26T10:15:44.262566Z",
-     "shell.execute_reply": "2026-05-26T10:15:44.262191Z"
+     "iopub.execute_input": "2026-05-28T19:40:22.211857Z",
+     "iopub.status.busy": "2026-05-28T19:40:22.211810Z",
+     "iopub.status.idle": "2026-05-28T19:40:22.233262Z",
+     "shell.execute_reply": "2026-05-28T19:40:22.232953Z"
     }
    },
    "outputs": [
@@ -831,10 +851,10 @@
    "id": "cell-19-verdict",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-26T10:15:44.263522Z",
-     "iopub.status.busy": "2026-05-26T10:15:44.263466Z",
-     "iopub.status.idle": "2026-05-26T10:15:44.283675Z",
-     "shell.execute_reply": "2026-05-26T10:15:44.283399Z"
+     "iopub.execute_input": "2026-05-28T19:40:22.234289Z",
+     "iopub.status.busy": "2026-05-28T19:40:22.234236Z",
+     "iopub.status.idle": "2026-05-28T19:40:22.254388Z",
+     "shell.execute_reply": "2026-05-28T19:40:22.254084Z"
     }
    },
    "outputs": [

diff --git a/src/ccoa/classifier/fasttext.py b/src/ccoa/classifier/fasttext.py
@@ -2,12 +2,13 @@
 
 from __future__ import annotations
 
+from collections.abc import Sequence
+
 import fasttext as _fasttext
 from huggingface_hub import hf_hub_download
 
 DEFAULT_MODEL_REPO = "ibm-granite/GneissWeb.Sci_classifier"
 DEFAULT_MODEL_FILE = "fasttext_science.bin"
-DEFAULT_TARGET_LABEL = "__label__science"
 
 FASTTEXT_MAX_INPUT_CHARS = 100_000
 
@@ -36,17 +37,19 @@ def clean_for_fasttext(text: str, max_len: int = FASTTEXT_MAX_INPUT_CHARS) -> st
     return cleaned
 
 
-def predict_target(model, text: str, target_label: str) -> float:
-    """Return the probability the model assigns to `target_label`.
+def predict_targets(model, text: str, target_labels: Sequence[str]) -> list[float]:
+    """Return probabilities for `target_labels` in the order given.
 
-    Cleans the input via `clean_for_fasttext` (strip newlines + NULs, clamp
-    length), then asks the model for probabilities over all labels and
-    returns the probability of `target_label` (0.0 if the model emits no
-    such label).
+    `model.predict(..., k=-1)` returns labels in descending probability order
+    (data-dependent), so we resolve scores by label name. Missing labels
+    yield 0.0.
     """
     cleaned = clean_for_fasttext(text)
     labels, probs = model.predict(cleaned, k=-1)
-    for lbl, prob in zip(labels, probs, strict=False):
-        if lbl == target_label:
-            return float(prob)
-    return 0.0
+    label_to_prob = dict(zip(labels, probs, strict=False))
+    return [float(label_to_prob.get(lbl, 0.0)) for lbl in target_labels]
+
+
+def get_model_labels(model) -> tuple[str, ...]:
+    """Return all labels the model can emit, in model-internal order."""
+    return tuple(model.get_labels())