GneissWeb Annotations, powered by IBM Research's GneissWeb methodology, is a dataset of quality and category annotations applied to the Common Crawl corpus.
This dataset enables precise filtering of web content across medical, educational, technology, and scientific domains, making it easier to build high-quality corpora for research projects, language models, and specialized applications.
Learn more about the annotation process and methodology in our official blog post.
GneissWeb Annotations uses the GneissWeb Bloom filter (https://huggingface.co/ibm-granite/GneissWeb.bloom) made publicly available by IBM, along with IBM’s Data Prep Kit (now a Linux Foundation AI & Data project) and the GneissWeb group’s category classifiers.
You can access annotations at two levels of granularity: per host and per URL.
Access the dataset through:
s3://commoncrawl/projects/gneissweb-annotation-testing-v1
Check out the gneissweb examples in the cc-index-annotations GitHub repository.
Aggregated domain-level annotations for efficient filtering by source.
| Column | Description |
|---|---|
| crawl | Common Crawl archive ID (e.g., CC-MAIN-2024-10) |
| in_gneissweb | Boolean flag for GneissWeb inclusion |
| surt_host_name | SURT-formatted hostname |
| gneissweb_medical | Medical content quality score |
| gneissweb_technology | Technology content quality score |
| gneissweb_education | Educational content quality score |
| gneissweb_science | Scientific content quality score |
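As a rough illustration of filtering by source, the sketch below keeps hosts whose score in one category clears a threshold. It works on plain Python dictionaries that mirror the columns above. The sample rows, the hostnames, the function name `filter_hosts`, the 0.8 threshold, and the assumption that scores are numeric with higher meaning better are all hypothetical; check the actual files for the real value range and storage format.

```python
# Score columns taken from the host-level table above.
SCORE_COLUMNS = (
    "gneissweb_medical",
    "gneissweb_technology",
    "gneissweb_education",
    "gneissweb_science",
)

# Hypothetical rows: hostnames and scores are invented for illustration.
SAMPLE_HOST_ROWS = [
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "surt_host_name": "org,example", "gneissweb_medical": 0.92,
     "gneissweb_technology": 0.10, "gneissweb_education": 0.55,
     "gneissweb_science": 0.61},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": False,
     "surt_host_name": "com,example", "gneissweb_medical": 0.95,
     "gneissweb_technology": 0.30, "gneissweb_education": 0.20,
     "gneissweb_science": 0.40},
    {"crawl": "CC-MAIN-2024-10", "in_gneissweb": True,
     "surt_host_name": "net,example", "gneissweb_medical": 0.20,
     "gneissweb_technology": 0.88, "gneissweb_education": 0.35,
     "gneissweb_science": 0.15},
]


def filter_hosts(rows, score_column, min_score, require_gneissweb=True):
    """Return SURT hostnames whose score in score_column is at least min_score.

    Assumes scores are numeric and that higher means better (an assumption,
    not something the column descriptions state).
    """
    if score_column not in SCORE_COLUMNS:
        raise ValueError(f"unknown score column: {score_column}")
    selected = []
    for row in rows:
        # Optionally skip hosts that are not flagged as part of GneissWeb.
        if require_gneissweb and not row["in_gneissweb"]:
            continue
        score = row.get(score_column)
        if score is not None and score >= min_score:
            selected.append(row["surt_host_name"])
    return selected


medical_hosts = filter_hosts(SAMPLE_HOST_ROWS, "gneissweb_medical", 0.8)
print(medical_hosts)
```

Passing `require_gneissweb=False` keeps high-scoring hosts regardless of the `in_gneissweb` flag, which may be useful when the flag and the category scores are meant to be used independently.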
Granular annotations for individual URLs.
| Column | Description |
|---|---|
| crawl | Common Crawl archive ID (e.g., CC-MAIN-2024-10) |
| in_gneissweb | Boolean flag for GneissWeb inclusion |
| url_surkey | SURT-formatted URL key |
| surt_host_name | SURT-formatted hostname |
| fetch_time | TIMESTAMP of the page fetch |
| gneissweb_medical | Medical content quality score |
| gneissweb_technology | Technology content quality score |
| gneissweb_education | Educational content quality score |
| gneissweb_science | Scientific content quality score |
This dataset opens up numerous possibilities:
When using our data in your work, please cite Common Crawl (https://commoncrawl.org) and let us know; we'd love to hear from you!
Common Crawl Foundation's standard terms and conditions apply; see https://commoncrawl.org/terms-of-use for more details.