GneissWeb Annotation Testing

GneissWeb Annotations, powered by IBM Research's GneissWeb methodology, is a dataset of quality and category annotations applied to the Common Crawl corpus.

This dataset enables precise filtering of web content across medical, educational, technology, and scientific domains, making it easier to build high-quality corpora for research projects, language models, and specialized applications.

Learn more about the annotation process and methodology in our official blog post.

What's Inside

GneissWeb Annotations uses the GneissWeb Bloom filter (https://huggingface.co/ibm-granite/GneissWeb.bloom) made publicly available by IBM, along with IBM’s Data Prep Kit (now a Linux Foundation AI & Data project) and the GneissWeb group’s category classifiers. The classifiers cover four categories:

Medical - Health information, medical research, and clinical content

Education - Learning materials, academic resources, and educational platforms

Technology - Software documentation, technical guides, and tech industry content

Science - Research publications, scientific articles, and academic work

You can access annotations at two levels of granularity:

Host-level - Aggregate statistics for entire domains, perfect for broad filtering

URL-level - Individual URL classifications for precise content selection

Getting the Data

Access the dataset through:

Hugging Face: https://huggingface.co/datasets/commoncrawl/gneissweb-annotation-testing-v1

AWS S3: s3://commoncrawl/projects/gneissweb-annotation-testing-v1

Example Usage

Check out the gneissweb examples in the cc-index-annotations GitHub repository.

Schema

Host Index Files

Aggregated domain-level annotations for efficient filtering by source.

Column	Description
`crawl`	Common Crawl archive ID (e.g., CC-MAIN-2024-10)
`in_gneissweb`	Boolean flag for GneissWeb inclusion
`surt_host_name`	SURT-formatted hostname
`gneissweb_medical`	Medical content quality score
`gneissweb_technology`	Technology content quality score
`gneissweb_education`	Educational content quality score
`gneissweb_science`	Scientific content quality score
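As a rough sketch of how the host-level columns above might be used for broad filtering, the snippet below keeps hosts that are flagged `in_gneissweb` and whose score for one category meets a threshold. It assumes the scores are numeric and that higher means stronger, which the table does not state. The sample rows, the `select_hosts` helper, and the thresholds are all hypothetical and exist only to make the example runnable.

```python
from typing import Iterable, List, Mapping

# Made-up rows that mirror the host index columns; the hosts and scores
# are invented for illustration and do not come from the dataset.
SAMPLE_HOST_ROWS = [
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "surt_host_name": "org,example-clinic",
     "gneissweb_medical": 0.92, "gneissweb_technology": 0.03,
     "gneissweb_education": 0.40, "gneissweb_science": 0.35},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "surt_host_name": "com,example-devdocs",
     "gneissweb_medical": 0.01, "gneissweb_technology": 0.88,
     "gneissweb_education": 0.30, "gneissweb_science": 0.10},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": False,
     "surt_host_name": "net,example-other",
     "gneissweb_medical": 0.95, "gneissweb_technology": 0.02,
     "gneissweb_education": 0.05, "gneissweb_science": 0.02},
]


def select_hosts(rows: Iterable[Mapping], category: str, min_score: float) -> List[str]:
    """Return SURT host names flagged in_gneissweb with score >= min_score."""
    column = f"gneissweb_{category}"
    selected = []
    for row in rows:
        score = row.get(column)
        # Assumption: scores are numeric and a missing score means "unknown".
        if row.get("in_gneissweb") and score is not None and score >= min_score:
            selected.append(row["surt_host_name"])
    return selected


medical_hosts = select_hosts(SAMPLE_HOST_ROWS, "medical", 0.8)
```

The third sample row has a high medical score but is excluded because its `in_gneissweb` flag is false. Whether to require that flag is a design choice that depends on the use case.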

URL Index Files

Granular annotations for individual URLs.

Column	Description
`crawl`	Common Crawl archive ID (e.g., CC-MAIN-2024-10)
`in_gneissweb`	Boolean flag for GneissWeb inclusion
`url_surkey`	SURT-formatted URL key
`surt_host_name`	SURT-formatted hostname
`fetch_time`	Timestamp of the page fetch
`gneissweb_medical`	Medical content quality score
`gneissweb_technology`	Technology content quality score
`gneissweb_education`	Educational content quality score
`gneissweb_science`	Scientific content quality score
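For URL-level selection, one possible pattern is to restrict rows to a set of already-chosen hosts and then apply a per-URL score threshold. The sketch below also shows how the highest-scoring category for a row could be picked out. It assumes numeric scores where higher is stronger and a SURT key of the form `host)/path`. The sample rows and the `top_category` and `select_urls` helpers are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Mapping, Optional, Set, Tuple

CATEGORIES = ("medical", "technology", "education", "science")

# Made-up rows that mirror the URL index columns; keys, times, and scores
# are invented for illustration and do not come from the dataset.
SAMPLE_URL_ROWS = [
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "url_surkey": "org,example-clinic)/guides/diabetes",
     "surt_host_name": "org,example-clinic",
     "fetch_time": datetime(2024, 2, 25, 12, 0, 0),
     "gneissweb_medical": 0.94, "gneissweb_technology": 0.02,
     "gneissweb_education": 0.41, "gneissweb_science": 0.38},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "url_surkey": "org,example-clinic)/about/careers",
     "surt_host_name": "org,example-clinic",
     "fetch_time": datetime(2024, 2, 25, 12, 5, 0),
     "gneissweb_medical": 0.12, "gneissweb_technology": 0.05,
     "gneissweb_education": 0.08, "gneissweb_science": 0.04},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "url_surkey": "com,example-devdocs)/api/reference",
     "surt_host_name": "com,example-devdocs",
     "fetch_time": datetime(2024, 2, 26, 8, 30, 0),
     "gneissweb_medical": 0.01, "gneissweb_technology": 0.90,
     "gneissweb_education": 0.33, "gneissweb_science": 0.09},
]


def top_category(row: Mapping) -> Optional[Tuple[str, float]]:
    """Return (category, score) for the highest-scoring category, or None."""
    scored = [(name, row[f"gneissweb_{name}"]) for name in CATEGORIES
              if row.get(f"gneissweb_{name}") is not None]
    return max(scored, key=lambda pair: pair[1]) if scored else None


def select_urls(rows: Iterable[Mapping], allowed_hosts: Set[str],
                category: str, min_score: float) -> List[str]:
    """Return URL keys on allowed hosts whose category score >= min_score."""
    column = f"gneissweb_{category}"
    return [row["url_surkey"] for row in rows
            if row["surt_host_name"] in allowed_hosts
            and row.get(column) is not None
            and row[column] >= min_score]


medical_urls = select_urls(SAMPLE_URL_ROWS, {"org,example-clinic"}, "medical", 0.8)
```

Filtering by host first and by URL second keeps the precise but larger URL-level data restricted to sources that already passed the coarse check.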

Applications

This dataset opens up numerous possibilities:

Train domain-specific language models with curated web data

Conduct research on content quality distribution across the web

Create filtered datasets for specific industries or use cases

Combine with other Common Crawl signals (language, TLD, etc.) for multi-dimensional filtering
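To illustrate the multi-dimensional filtering mentioned in the last item, here is a minimal sketch that combines a category score with a top-level domain derived from `surt_host_name`. It assumes SURT host names list their labels in reverse order separated by commas (for example `org,commoncrawl`), so the first label is the TLD, and that scores are numeric with higher meaning stronger. The sample rows and the `tld_of` and `filter_rows` helpers are hypothetical.

```python
from typing import Iterable, List, Mapping, Set


def tld_of(surt_host_name: str) -> str:
    """Return the first comma-separated label of a SURT host name."""
    # Assumption: labels are reversed, so "org,commoncrawl" has TLD "org".
    return surt_host_name.split(",", 1)[0].strip().lower()


def filter_rows(rows: Iterable[Mapping], tlds: Set[str],
                category: str, min_score: float) -> List[Mapping]:
    """Keep rows whose TLD is in tlds and whose category score >= min_score."""
    wanted = {t.lower().lstrip(".") for t in tlds}
    column = f"gneissweb_{category}"
    return [row for row in rows
            if tld_of(row["surt_host_name"]) in wanted
            and row.get(column) is not None
            and row[column] >= min_score]


# Made-up rows for illustration only; not taken from the dataset.
SAMPLE_ROWS = [
    {"surt_host_name": "edu,example-university", "gneissweb_education": 0.90},
    {"surt_host_name": "com,example-courses", "gneissweb_education": 0.85},
    {"surt_host_name": "edu,example-college", "gneissweb_education": 0.20},
]

edu_rows = filter_rows(SAMPLE_ROWS, {"edu"}, "education", 0.5)
```

Note that this simple split treats only the final label as the TLD, so a host under a multi-label suffix such as `co.uk` would be reported as `uk`.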

Attribution

When using our data in your work, please cite Common Crawl (https://commoncrawl.org) and let us know. We'd love to hear from you!

Licensing

The Common Crawl Foundation's standard terms and conditions apply; see https://commoncrawl.org/terms-of-use for details.