The news dataset includes articles from news sites all over the world. WARC files are released on a daily basis. The news crawl was started in 2016; please see the news dataset announcement for further information.
The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.
WARC files follow the path pattern
crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz
with the placeholders yyyymmddHHMMSS and nnnnn. The timestamp (yyyymmddHHMMSS) indicates the time the first record in the WARC file was created.
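
The path pattern above can be taken apart with a regular expression. The following is a minimal sketch, not part of the crawler: the function name `parse_cc_news_path` and the example path are hypothetical, and it assumes that nnnnn is always five digits and that the timestamp is in UTC, neither of which this page states.

```python
import re
from datetime import datetime, timezone

# Regular expression derived from the documented pattern
# crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz
# Assumption: nnnnn is exactly five digits.
CC_NEWS_PATH = re.compile(
    r"crawl-data/CC-NEWS/(?P<year>\d{4})/(?P<month>\d{2})/"
    r"CC-NEWS-(?P<timestamp>\d{14})-(?P<serial>\d{5})\.warc\.gz$"
)


def parse_cc_news_path(path):
    """Return (timestamp, serial) for a CC-NEWS WARC path, or None if it does not match."""
    match = CC_NEWS_PATH.search(path)
    if match is None:
        return None
    # Assumption: the timestamp is UTC; the time zone is not documented here.
    timestamp = datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S")
    return timestamp.replace(tzinfo=timezone.utc), match["serial"]


if __name__ == "__main__":
    # Hypothetical path that merely follows the pattern.
    example = "crawl-data/CC-NEWS/2025/01/CC-NEWS-20250101020153-00001.warc.gz"
    print(parse_cc_news_path(example))
```

Returning `None` for paths that do not match keeps the helper usable as a filter when a listing contains other entries.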
The WARC file listings for each month are available at
s3://commoncrawl/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
or
https://data.commoncrawl.org/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
For accessing the data, please see our Get Started page.
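
As an illustration of how a monthly listing might be used over HTTPS, here is a minimal sketch using only the Python standard library. The helper names (`listing_url`, `parse_listing`, `fetch_warc_urls`) are hypothetical, and the sketch assumes that `warc.paths.gz` is gzip-compressed text with one WARC path per line, each relative to `https://data.commoncrawl.org/`.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def listing_url(year, month):
    """Build the HTTPS URL of the WARC file listing for one month."""
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def parse_listing(gzipped_bytes):
    """Decompress a listing and return its non-empty lines.

    Assumption: the listing is gzip-compressed text, one path per line.
    """
    text = gzip.decompress(gzipped_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_warc_urls(year, month):
    """Download one monthly listing and return full HTTPS URLs of its WARC files.

    Assumption: listed paths are relative to BASE_URL.
    """
    with urllib.request.urlopen(listing_url(year, month)) as response:
        paths = parse_listing(response.read())
    return [BASE_URL + path for path in paths]


if __name__ == "__main__":
    # Requires network access.
    for url in fetch_warc_urls(2024, 3)[:5]:
        print(url)
```

Keeping the URL construction and the parsing separate from the download makes both easy to check without network access.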
For every year (linked) we provide an overview by month including links to the WARC file listings.
| Year | Num. WARC files | Total WARC Size Compressed (TiB) |
|---|---|---|
| 2026 | n/a | n/a |
| 2025 | 5988 | 5.839 |
| 2024 | 6224 | 6.072 |
| 2023 | 8318 | 8.102 |
| 2022 | 7956 | 7.754 |
| 2021 | 6605 | 6.435 |
| 2020 | 5395 | 5.263 |
| 2019 | 3536 | 3.449 |
| 2018 | 2613 | 2.548 |
| 2017 | 1583 | 1.504 |
| 2016 | 207 | 0.151 |
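
The two numeric columns can be combined into an average compressed size per WARC file. The short sketch below does this arithmetic on the figures copied from the table; the function name `average_gib_per_file` is hypothetical. For 2018 onward the result is close to 1 GiB per file, while 2016 comes out lower at roughly 0.75 GiB.

```python
GIB_PER_TIB = 1024

# (year, number of WARC files, total compressed size in TiB), copied from the table.
# 2026 is omitted because its values are listed as n/a.
ROWS = [
    (2025, 5988, 5.839),
    (2024, 6224, 6.072),
    (2023, 8318, 8.102),
    (2022, 7956, 7.754),
    (2021, 6605, 6.435),
    (2020, 5395, 5.263),
    (2019, 3536, 3.449),
    (2018, 2613, 2.548),
    (2017, 1583, 1.504),
    (2016, 207, 0.151),
]


def average_gib_per_file(num_files, total_tib):
    """Average compressed size of one WARC file in GiB."""
    return total_tib * GIB_PER_TIB / num_files


if __name__ == "__main__":
    for year, num_files, total_tib in ROWS:
        print(f"{year}: {average_gib_per_file(num_files, total_tib):.3f} GiB per file")
```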