Contributor Content

This page lists projects and datasets created by the community using Common Crawl data, including contributed corpora and research releases hosted in the Common Crawl S3 bucket.

Corpus

Nemotron-CC

A large-scale curated corpus built from Common Crawl, spanning CC-MAIN-2013-20 through CC-MAIN-2024-30 (99 crawls).

Corpus

DataComp-LM

A contributed language-model training corpus derived from Common Crawl (pre-2023).

Corpus

C4Corpus

A filtered Common Crawl corpus released for research use (2016, one crawl).

Corpus

DepCC

A dependency-parsed Common Crawl corpus for NLP research (2016, one crawl).

Datathon

Web Archives for Social Sciences

Curated Common Crawl datasets prepared for a social science datathon held at the Bristol Digital Futures Institute (BDFI), University of Bristol, in November 2025.

About Contributed Content