Merged
62 changes: 50 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -31,9 +31,17 @@ uv run ccoa --help

`ccoa classify-warc` streams WARC files from S3 (or any fsspec URL),
extracts plain text from each response record with trafilatura, and
applies a HuggingFace-hosted fasttext classifier. Per-record output is a
CSV `URL,prediction_score,warc_filename,warc_record_index`; a one-shot score-distribution summary is
logged at the end and written to a `<output>.summary.csv` file.
applies one or more HuggingFace-hosted fasttext classifiers in a single
pass. Per-record output is a CSV with one `score_<label>` column per
requested label, between `URL` and the `warc_filename`/`warc_record_index`
tail:

```
URL,score_<label_1>,...,score_<label_N>,warc_filename,warc_record_index
```

A per-column score-distribution summary is logged at the end and written
to a `<output>.summary.csv` file.
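As a rough illustration of consuming this schema, the sketch below picks out the `score_<label>` columns by prefix and parses them as floats. The sample rows, URLs, and the `score_columns` helper are made up for the example; only the column layout comes from the description above.

```python
import csv
import io

# Made-up sample rows in the documented schema; real output comes from classify-warc.
SAMPLE_CSV = """URL,score___label__science,score___label__cc,warc_filename,warc_record_index
https://example.org/a,0.91,0.09,example-00000.warc.gz,3
https://example.org/b,0.12,0.88,example-00000.warc.gz,7
"""


def score_columns(header: list[str]) -> list[str]:
    """Return the score_<label> columns (hypothetical helper, selects by prefix)."""
    return [name for name in header if name.startswith("score_")]


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
columns = score_columns(list(reader.fieldnames or []))
# One dict of label -> score per record, in file order.
rows = [{name: float(row[name]) for name in columns} for row in reader]
```

Selecting by prefix rather than by position keeps the reader working when the number of requested labels changes between runs.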

```bash
uv run ccoa classify-warc \
@@ -86,10 +94,37 @@ HTTPS gateway URL
(`https://data.commoncrawl.org/...`) with no credentials.

The default classifier is
[`ibm-granite/GneissWeb.Sci_classifier`](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier)
(`__label__science`). The first run downloads the ~4 GB model into the
HuggingFace cache. Override with `--model-repo`, `--model-file`, and
`--target-label`.
[`ibm-granite/GneissWeb.Sci_classifier`](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier).
Without `--labels` it emits both of the model's labels —
`score___label__science` and `score___label__cc`, which sum to 1.0 per
record. The first run downloads the ~4 GB model into the HuggingFace
cache. Override with `--model-repo`, `--model-file`, and `--labels`.

`--model-repo` and `--model-file` are list-valued and zipped positionally,
so you can score against multiple classifiers in one pass:

```bash
uv run ccoa classify-warc \
--warc-paths 's3://commoncrawl/.../*.warc.gz' \
--model-repo ibm-granite/GneissWeb.Sci_classifier ibm-granite/GneissWeb.Quality_annotator \
--model-file fasttext_science.bin <quality_model_filename.bin> \
--output data/classified.csv
```

`--labels` is also list-valued (one entry per model). Each entry is a
comma-separated list of labels (`"__label__science,__label__cc"`) or the
literal `*` to use all of that model's labels (the default when `--labels`
is omitted). Output columns are emitted in the order: models in CLI order,
labels in the order given (or model-internal order for `*`).

Column naming depends on whether the run has one model or many:

- **Single model**: `score_<label>` (e.g. `score___label__science`).
- **Multiple models**: `score_m<idx>_<label>`, where `<idx>` is the 0-based
CLI position of the model (e.g. `score_m0___label__science`,
`score_m1___label__hq`). This namespacing means two models can share a
label name — Sci_classifier and Quality_annotator both emit
`__label__cc` — without colliding.
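The naming rule above can be sketched as a small function. This is an illustration of the documented behaviour, not the project's actual code; `score_column_names` and its argument shape are assumptions made for the example.

```python
from collections.abc import Sequence


def score_column_names(labels_per_model: Sequence[Sequence[str]]) -> list[str]:
    """Build score column names following the single-model / multi-model rule.

    Hypothetical helper: `labels_per_model` holds one label list per model,
    in CLI order.
    """
    if len(labels_per_model) == 1:
        # Single model: no namespace prefix.
        return [f"score_{label}" for label in labels_per_model[0]]
    # Multiple models: prefix each label with the 0-based model position.
    return [
        f"score_m{idx}_{label}"
        for idx, labels in enumerate(labels_per_model)
        for label in labels
    ]


single = score_column_names([["__label__science", "__label__cc"]])
multi = score_column_names(
    [["__label__science", "__label__cc"], ["__label__hq", "__label__cc"]]
)
```

In the multi-model case both `__label__cc` entries survive as `score_m0___label__cc` and `score_m1___label__cc`, which is the collision the namespacing is meant to avoid.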

`--output` accepts `-` for stdout, any local path, or any fsspec URL —
including `s3://bucket/key.csv`. S3 outputs use the same `--anonymous-s3`
@@ -133,11 +168,14 @@ uv run ccoa classify-warc \
--output data/classified__resume-2.csv
```

The resume CSV must include the `warc_filename` and `warc_record_index`
columns (the current output schema always does). Records matching that
`(warc_filename, record_index)` pair are skipped on the new run; the
new `--output` contains only the missing rows. Concatenate the two
CSVs (drop the second header) to get a complete result.
The resume CSV's header must match the new run's output schema
**exactly**: the same `score_<label>` columns in the same order, between
the leading `URL` and the trailing `warc_filename`/`warc_record_index`.
Any drift (reordered, missing, or extra columns) is rejected up front
with a structured diff, so concatenating the two CSVs (dropping the
second header) yields a well-formed CSV. Records whose
`(warc_filename, record_index)` pair already appears in the resume CSV
are skipped on the new run; the new `--output` contains only the missing
rows.
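A minimal sketch of the header check and the concatenation step, under stated assumptions: `check_resume_header` and `concatenate` are hypothetical names, and the error message format is invented here, since the tool's actual structured diff is not shown in this README.

```python
import csv
import io


def check_resume_header(resume_header: list[str], expected_header: list[str]) -> None:
    """Raise ValueError describing any drift between the two headers."""
    if resume_header == expected_header:
        return
    missing = [c for c in expected_header if c not in resume_header]
    extra = [c for c in resume_header if c not in expected_header]
    detail = f"missing={missing} extra={extra}"
    if not missing and not extra:
        # Same columns, different order.
        detail = f"reordered: expected {expected_header}, got {resume_header}"
    raise ValueError(f"resume CSV header does not match output schema ({detail})")


def concatenate(first_csv: str, second_csv: str) -> str:
    """Join two CSV texts with identical headers, dropping the second header."""
    first_rows = list(csv.reader(io.StringIO(first_csv)))
    second_rows = list(csv.reader(io.StringIO(second_csv)))
    check_resume_header(second_rows[0], first_rows[0])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(first_rows + second_rows[1:])
    return out.getvalue()
```

Checking the header before writing anything is what makes the later concatenation safe: a reordered score column would otherwise produce rows that parse cleanly but carry scores under the wrong label.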

With `--records-per-file-limit N` the limit is interpreted as the
**target total** per file (resumed + new). Files already at the target
120 changes: 70 additions & 50 deletions notebooks/compare_classifier_scores.ipynb
@@ -35,10 +35,10 @@
"id": "cell-01-imports",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:21.594052Z",
"iopub.status.busy": "2026-05-26T10:15:21.593943Z",
"iopub.status.idle": "2026-05-26T10:15:41.227039Z",
"shell.execute_reply": "2026-05-26T10:15:41.226661Z"
"iopub.execute_input": "2026-05-28T19:40:08.502488Z",
"iopub.status.busy": "2026-05-28T19:40:08.502328Z",
"iopub.status.idle": "2026-05-28T19:40:19.488256Z",
"shell.execute_reply": "2026-05-28T19:40:19.487890Z"
}
},
"outputs": [
@@ -98,7 +98,7 @@
"source": [
"## 1. Load the score data\n",
"\n",
"The output CSV schema is `URL,prediction_score,warc_filename,warc_record_index`; we only need `prediction_score` for distribution analysis.\n",
"The output CSV schema is `URL,score_<label_1>,...,score_<label_N>,warc_filename,warc_record_index` — one score column per model label. The cell below auto-detects available `score_*` columns and picks the first as `SCORE_COL`; change it (or extend `load_scores`) to compare other labels. Older single-column outputs (`prediction_score`) are accepted as a fallback so existing CSVs in `data/` still work.\n",
"\n",
"The focus crawl can be split across **multiple parts** when `--resume-from-output` was used to recover from a crash (e.g. `…_1m_scores.csv` for the first run plus `…_1m_scores__resume-2.csv` for the continuation). We auto-discover all parts matching `{focus name}_1m_scores*.csv` (filtering out the `.summary.csv` sidecars) and concatenate them into a single distribution.\n",
"\n",
@@ -113,13 +113,21 @@
"id": "cell-03-load",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:41.228341Z",
"iopub.status.busy": "2026-05-26T10:15:41.228232Z",
"iopub.status.idle": "2026-05-26T10:15:41.973842Z",
"shell.execute_reply": "2026-05-26T10:15:41.973414Z"
"iopub.execute_input": "2026-05-28T19:40:19.489927Z",
"iopub.status.busy": "2026-05-28T19:40:19.489801Z",
"iopub.status.idle": "2026-05-28T19:40:20.224124Z",
"shell.execute_reply": "2026-05-28T19:40:20.223694Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score columns in baseline: ['prediction_score (legacy)']\n",
"Analysing: prediction_score\n"
]
},
{
"name": "stdout",
"output_type": "stream",
@@ -136,9 +144,19 @@
}
],
"source": [
"# Auto-detect the score column to analyse. classify-warc now emits one\n",
"# `score_<label>` column per model label; older outputs had a single\n",
"# `prediction_score`. Edit SCORE_COL to compare other labels.\n",
"_baseline_columns = pd.read_csv(BASELINE_CSV, nrows=0).columns.tolist()\n",
"SCORE_COLUMNS = [c for c in _baseline_columns if c.startswith(\"score_\")]\n",
"SCORE_COL = SCORE_COLUMNS[0] if SCORE_COLUMNS else \"prediction_score\"\n",
"print(f\"Score columns in baseline: {SCORE_COLUMNS or [SCORE_COL + ' (legacy)']}\")\n",
"print(f\"Analysing: {SCORE_COL}\")\n",
"\n",
"\n",
"def load_scores(csv_path: Path) -> pd.Series:\n",
" \"\"\"Read the prediction_score column from a classify-warc output CSV.\"\"\"\n",
" series = pd.read_csv(csv_path, usecols=[\"prediction_score\"])[\"prediction_score\"]\n",
" \"\"\"Read the `SCORE_COL` column from a classify-warc output CSV.\"\"\"\n",
" series = pd.read_csv(csv_path, usecols=[SCORE_COL])[SCORE_COL]\n",
" return series.astype(float)\n",
"\n",
"\n",
@@ -151,7 +169,9 @@
" is_toy = False\n",
" ratio = len(focus_scores) / max(len(baseline_scores), 1)\n",
" is_partial = ratio < 0.95\n",
" per_part = \", \".join(f\"{p.name}: {len(s):,}\" for p, s in zip(FOCUS_CSVS, parts))\n",
" per_part = \", \".join(\n",
" f\"{p.name}: {len(s):,}\" for p, s in zip(FOCUS_CSVS, parts, strict=True)\n",
" )\n",
" print(\n",
" f\"{FOCUS_NAME}: {len(focus_scores):,} scores loaded from {len(FOCUS_CSVS)} \"\n",
" f\"file(s) ({ratio:.0%} of baseline size). Parts → {per_part}\"\n",
@@ -164,7 +184,7 @@
"else:\n",
" rng = np.random.default_rng(42)\n",
" noisy = baseline_scores.to_numpy() + 0.05 + rng.normal(0.0, 0.10, size=len(baseline_scores))\n",
" focus_scores = pd.Series(np.clip(noisy, 0.0, 1.0), name=\"prediction_score\")\n",
" focus_scores = pd.Series(np.clip(noisy, 0.0, 1.0), name=SCORE_COL)\n",
" is_toy = True\n",
" is_partial = False\n",
" print(\n",
@@ -195,10 +215,10 @@
"id": "cell-05-stats",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:41.974946Z",
"iopub.status.busy": "2026-05-26T10:15:41.974894Z",
"iopub.status.idle": "2026-05-26T10:15:42.081238Z",
"shell.execute_reply": "2026-05-26T10:15:42.080957Z"
"iopub.execute_input": "2026-05-28T19:40:20.225163Z",
"iopub.status.busy": "2026-05-28T19:40:20.225105Z",
"iopub.status.idle": "2026-05-28T19:40:20.333180Z",
"shell.execute_reply": "2026-05-28T19:40:20.332861Z"
}
},
"outputs": [
@@ -381,10 +401,10 @@
"id": "cell-07-hist",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:42.082323Z",
"iopub.status.busy": "2026-05-26T10:15:42.082276Z",
"iopub.status.idle": "2026-05-26T10:15:42.642052Z",
"shell.execute_reply": "2026-05-26T10:15:42.641716Z"
"iopub.execute_input": "2026-05-28T19:40:20.334135Z",
"iopub.status.busy": "2026-05-28T19:40:20.334086Z",
"iopub.status.idle": "2026-05-28T19:40:20.602002Z",
"shell.execute_reply": "2026-05-28T19:40:20.601654Z"
}
},
"outputs": [
@@ -403,15 +423,15 @@
"fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)\n",
"bins = np.linspace(0.0, 1.0, 81)\n",
"\n",
"for ax, log_scale in zip(axes, [False, True]):\n",
"for ax, log_scale in zip(axes, [False, True], strict=True):\n",
" ax.hist(baseline_scores, bins=bins, alpha=0.5, label=BASELINE_NAME, color=\"C0\", density=True)\n",
" ax.hist(focus_scores, bins=bins, alpha=0.5, label=FOCUS_NAME, color=\"C1\", density=True)\n",
" if log_scale:\n",
" ax.set_yscale(\"log\")\n",
" ax.set_title(\"Histogram (log y, density-normalised)\")\n",
" else:\n",
" ax.set_title(\"Histogram (linear y, density-normalised)\")\n",
" ax.set_xlabel(\"prediction_score\")\n",
" ax.set_xlabel(SCORE_COL)\n",
" ax.set_ylabel(\"density\")\n",
" ax.legend()\n",
" ax.grid(True, alpha=0.3)\n",
@@ -436,10 +456,10 @@
"id": "cell-09-ecdf",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:42.643283Z",
"iopub.status.busy": "2026-05-26T10:15:42.643203Z",
"iopub.status.idle": "2026-05-26T10:15:43.228237Z",
"shell.execute_reply": "2026-05-26T10:15:43.227857Z"
"iopub.execute_input": "2026-05-28T19:40:20.602921Z",
"iopub.status.busy": "2026-05-28T19:40:20.602874Z",
"iopub.status.idle": "2026-05-28T19:40:21.206119Z",
"shell.execute_reply": "2026-05-28T19:40:21.205709Z"
}
},
"outputs": [
@@ -469,7 +489,7 @@
"]:\n",
" x, y = ecdf(series)\n",
" plt.plot(x, y, label=name, color=color)\n",
"plt.xlabel(\"prediction_score\")\n",
"plt.xlabel(SCORE_COL)\n",
"plt.ylabel(\"ECDF P(score ≤ x)\")\n",
"plt.title(\"Empirical CDF of classifier scores\")\n",
"plt.legend()\n",
@@ -494,10 +514,10 @@
"id": "cell-11-box",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:43.229226Z",
"iopub.status.busy": "2026-05-26T10:15:43.229167Z",
"iopub.status.idle": "2026-05-26T10:15:43.929555Z",
"shell.execute_reply": "2026-05-26T10:15:43.929175Z"
"iopub.execute_input": "2026-05-28T19:40:21.207200Z",
"iopub.status.busy": "2026-05-28T19:40:21.207147Z",
"iopub.status.idle": "2026-05-28T19:40:21.891974Z",
"shell.execute_reply": "2026-05-28T19:40:21.891665Z"
}
},
"outputs": [
@@ -519,14 +539,14 @@
"labels = [BASELINE_NAME, FOCUS_NAME]\n",
"\n",
"axes[0].boxplot(data, tick_labels=labels, showfliers=False)\n",
"axes[0].set_ylabel(\"prediction_score\")\n",
"axes[0].set_ylabel(SCORE_COL)\n",
"axes[0].set_title(\"Box plot (outliers hidden)\")\n",
"axes[0].grid(True, alpha=0.3)\n",
"\n",
"axes[1].violinplot(data, showmedians=True)\n",
"axes[1].set_xticks([1, 2])\n",
"axes[1].set_xticklabels(labels)\n",
"axes[1].set_ylabel(\"prediction_score\")\n",
"axes[1].set_ylabel(SCORE_COL)\n",
"axes[1].set_title(\"Violin plot\")\n",
"axes[1].grid(True, alpha=0.3)\n",
"\n",
@@ -552,10 +572,10 @@
"id": "cell-13-qq",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:43.930547Z",
"iopub.status.busy": "2026-05-26T10:15:43.930496Z",
"iopub.status.idle": "2026-05-26T10:15:44.006062Z",
"shell.execute_reply": "2026-05-26T10:15:44.005773Z"
"iopub.execute_input": "2026-05-28T19:40:21.893014Z",
"iopub.status.busy": "2026-05-28T19:40:21.892965Z",
"iopub.status.idle": "2026-05-28T19:40:21.973772Z",
"shell.execute_reply": "2026-05-28T19:40:21.973431Z"
}
},
"outputs": [
@@ -611,10 +631,10 @@
"id": "cell-15-tests",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:44.007069Z",
"iopub.status.busy": "2026-05-26T10:15:44.007022Z",
"iopub.status.idle": "2026-05-26T10:15:44.240328Z",
"shell.execute_reply": "2026-05-26T10:15:44.239994Z"
"iopub.execute_input": "2026-05-28T19:40:21.974787Z",
"iopub.status.busy": "2026-05-28T19:40:21.974737Z",
"iopub.status.idle": "2026-05-28T19:40:22.210867Z",
"shell.execute_reply": "2026-05-28T19:40:22.210554Z"
}
},
"outputs": [
@@ -714,10 +734,10 @@
"id": "cell-17-effect",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:44.241419Z",
"iopub.status.busy": "2026-05-26T10:15:44.241371Z",
"iopub.status.idle": "2026-05-26T10:15:44.262566Z",
"shell.execute_reply": "2026-05-26T10:15:44.262191Z"
"iopub.execute_input": "2026-05-28T19:40:22.211857Z",
"iopub.status.busy": "2026-05-28T19:40:22.211810Z",
"iopub.status.idle": "2026-05-28T19:40:22.233262Z",
"shell.execute_reply": "2026-05-28T19:40:22.232953Z"
}
},
"outputs": [
@@ -831,10 +851,10 @@
"id": "cell-19-verdict",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T10:15:44.263522Z",
"iopub.status.busy": "2026-05-26T10:15:44.263466Z",
"iopub.status.idle": "2026-05-26T10:15:44.283675Z",
"shell.execute_reply": "2026-05-26T10:15:44.283399Z"
"iopub.execute_input": "2026-05-28T19:40:22.234289Z",
"iopub.status.busy": "2026-05-28T19:40:22.234236Z",
"iopub.status.idle": "2026-05-28T19:40:22.254388Z",
"shell.execute_reply": "2026-05-28T19:40:22.254084Z"
}
},
"outputs": [
25 changes: 14 additions & 11 deletions src/ccoa/classifier/fasttext.py
@@ -2,12 +2,13 @@

from __future__ import annotations

from collections.abc import Sequence

import fasttext as _fasttext
from huggingface_hub import hf_hub_download

DEFAULT_MODEL_REPO = "ibm-granite/GneissWeb.Sci_classifier"
DEFAULT_MODEL_FILE = "fasttext_science.bin"
DEFAULT_TARGET_LABEL = "__label__science"

FASTTEXT_MAX_INPUT_CHARS = 100_000

@@ -36,17 +37,19 @@ def clean_for_fasttext(text: str, max_len: int = FASTTEXT_MAX_INPUT_CHARS) -> st
return cleaned


def predict_target(model, text: str, target_label: str) -> float:
"""Return the probability the model assigns to `target_label`.
def predict_targets(model, text: str, target_labels: Sequence[str]) -> list[float]:
"""Return probabilities for `target_labels` in the order given.

Cleans the input via `clean_for_fasttext` (strip newlines + NULs, clamp
length), then asks the model for probabilities over all labels and
returns the probability of `target_label` (0.0 if the model emits no
such label).
`model.predict(..., k=-1)` returns labels in descending probability order
(data-dependent), so we resolve scores by label name. Missing labels
yield 0.0.
"""
cleaned = clean_for_fasttext(text)
labels, probs = model.predict(cleaned, k=-1)
for lbl, prob in zip(labels, probs, strict=False):
if lbl == target_label:
return float(prob)
return 0.0
label_to_prob = dict(zip(labels, probs, strict=False))
return [float(label_to_prob.get(lbl, 0.0)) for lbl in target_labels]


def get_model_labels(model) -> tuple[str, ...]:
"""Return all labels the model can emit, in model-internal order."""
return tuple(model.get_labels())