Projects and datasets created by the community using Common Crawl data. The contributed corpora and research releases listed below are hosted in the Common Crawl S3 bucket.
A large-scale curated corpus built from Common Crawl, spanning CC-MAIN-2013-20 through CC-MAIN-2024-30 (99 crawls).
A contributed language-model training corpus derived from Common Crawl (pre-2023).
A filtered Common Crawl corpus released for research use (2016, one crawl).
A dependency-parsed Common Crawl corpus for NLP research (2016, one crawl).
Curated Common Crawl datasets prepared for a social science datathon at the BDFI, University of Bristol, November 2025.