DCLM-pool

Description

[...] To enable DCLM, we contribute a comprehensive experimental testbed. A key component is DCLM-POOL, a corpus of 240 trillion tokens derived from Common Crawl [42]. DCLM-POOL is the largest public corpus for language model training and forms the cornerstone of the DCLM filtering track, where participants aim to curate the best possible training set out of DCLM-POOL. [...]

Read the full paper on arXiv.

Contents

The data is stored as gzip-compressed JSONL files, organised by crawl ID, with one crawl=<ID>/ prefix per crawl. The crawls used range from CC-MAIN-2013-20 to CC-MAIN-2022-49.

    $ aws s3 ls s3://commoncrawl/contrib/datacomp/DCLM-pool/jsonl/
       PRE crawl=CC-MAIN-2013-20/
       PRE crawl=CC-MAIN-2013-48/
       PRE crawl=CC-MAIN-2014-10/
       PRE crawl=CC-MAIN-2014-15/
       PRE crawl=CC-MAIN-2014-23/
       PRE crawl=CC-MAIN-2014-35/
       PRE crawl=CC-MAIN-2014-41/
       PRE crawl=CC-MAIN-2014-42/
       PRE crawl=CC-MAIN-2014-49/
       PRE crawl=CC-MAIN-2014-52/
       PRE crawl=CC-MAIN-2015-06/
       PRE crawl=CC-MAIN-2015-11/
       PRE crawl=CC-MAIN-2015-14/
       PRE crawl=CC-MAIN-2015-18/
       PRE crawl=CC-MAIN-2015-22/
       PRE crawl=CC-MAIN-2015-27/
       PRE crawl=CC-MAIN-2015-32/
       PRE crawl=CC-MAIN-2015-35/
       PRE crawl=CC-MAIN-2015-40/
       PRE crawl=CC-MAIN-2015-48/
       PRE crawl=CC-MAIN-2016-07/
       PRE crawl=CC-MAIN-2016-18/
       PRE crawl=CC-MAIN-2016-22/
       PRE crawl=CC-MAIN-2016-26/
       PRE crawl=CC-MAIN-2016-30/
       PRE crawl=CC-MAIN-2016-36/
       PRE crawl=CC-MAIN-2016-40/
       PRE crawl=CC-MAIN-2016-44/
       PRE crawl=CC-MAIN-2016-50/
       PRE crawl=CC-MAIN-2017-04/
       PRE crawl=CC-MAIN-2017-09/
       PRE crawl=CC-MAIN-2017-13/
       PRE crawl=CC-MAIN-2017-17/
       PRE crawl=CC-MAIN-2017-22/
       PRE crawl=CC-MAIN-2017-26/
       PRE crawl=CC-MAIN-2017-30/
       PRE crawl=CC-MAIN-2017-34/
       PRE crawl=CC-MAIN-2017-39/
       PRE crawl=CC-MAIN-2017-43/
       PRE crawl=CC-MAIN-2017-47/
       PRE crawl=CC-MAIN-2017-51/
       PRE crawl=CC-MAIN-2018-05/
       PRE crawl=CC-MAIN-2018-09/
       PRE crawl=CC-MAIN-2018-13/
       PRE crawl=CC-MAIN-2018-17/
       PRE crawl=CC-MAIN-2018-22/
       PRE crawl=CC-MAIN-2018-26/
       PRE crawl=CC-MAIN-2018-30/
       PRE crawl=CC-MAIN-2018-34/
       PRE crawl=CC-MAIN-2018-39/
       PRE crawl=CC-MAIN-2018-43/
       PRE crawl=CC-MAIN-2018-47/
       PRE crawl=CC-MAIN-2018-51/
       PRE crawl=CC-MAIN-2019-04/
       PRE crawl=CC-MAIN-2019-09/
       PRE crawl=CC-MAIN-2019-13/
       PRE crawl=CC-MAIN-2019-18/
       PRE crawl=CC-MAIN-2019-22/
       PRE crawl=CC-MAIN-2019-26/
       PRE crawl=CC-MAIN-2019-30/
       PRE crawl=CC-MAIN-2019-35/
       PRE crawl=CC-MAIN-2019-39/
       PRE crawl=CC-MAIN-2019-43/
       PRE crawl=CC-MAIN-2019-47/
       PRE crawl=CC-MAIN-2019-51/
       PRE crawl=CC-MAIN-2020-05/
       PRE crawl=CC-MAIN-2020-10/
       PRE crawl=CC-MAIN-2020-16/
       PRE crawl=CC-MAIN-2020-24/
       PRE crawl=CC-MAIN-2020-29/
       PRE crawl=CC-MAIN-2020-34/
       PRE crawl=CC-MAIN-2020-40/
       PRE crawl=CC-MAIN-2020-45/
       PRE crawl=CC-MAIN-2020-50/
       PRE crawl=CC-MAIN-2021-04/
       PRE crawl=CC-MAIN-2021-10/
       PRE crawl=CC-MAIN-2021-17/
       PRE crawl=CC-MAIN-2021-21/
       PRE crawl=CC-MAIN-2021-25/
       PRE crawl=CC-MAIN-2021-31/
       PRE crawl=CC-MAIN-2021-39/
       PRE crawl=CC-MAIN-2021-43/
       PRE crawl=CC-MAIN-2021-49/
       PRE crawl=CC-MAIN-2022-05/
       PRE crawl=CC-MAIN-2022-21/
       PRE crawl=CC-MAIN-2022-27/
       PRE crawl=CC-MAIN-2022-33/
       PRE crawl=CC-MAIN-2022-40/
       PRE crawl=CC-MAIN-2022-49/
    $ aws s3 ls s3://commoncrawl/contrib/datacomp/DCLM-pool/jsonl.paths.gz
    2025-07-12 21:09:38   15123532 jsonl.paths.gz

The jsonl.paths.gz file is a gzip-compressed listing of the paths to the JSONL files.
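
As a rough illustration of how the listing and the data files might be consumed, here is a minimal Python sketch. It assumes that the listing is plain text with one path per line and that each path contains a crawl=<ID> component, as in the directory listing above. The helper names (read_paths, group_by_crawl, iter_jsonl_gz) are hypothetical, and nothing is assumed about the fields inside each JSON record.

```python
import gzip
import io
import json
import re
from collections import defaultdict

# Matches the partition component seen in the listing above,
# e.g. "crawl=CC-MAIN-2013-20".
CRAWL_RE = re.compile(r"crawl=(CC-MAIN-\d{4}-\d{2})")


def read_paths(fileobj):
    """Yield each non-empty line of a gzip-compressed path listing.

    Assumption: the listing is UTF-8 text with one path per line.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield line


def group_by_crawl(paths):
    """Group paths by crawl ID; paths without a crawl= component are skipped."""
    groups = defaultdict(list)
    for path in paths:
        match = CRAWL_RE.search(path)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)


def iter_jsonl_gz(fileobj):
    """Yield one parsed JSON value per non-blank line of a gzip-compressed JSONL file."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Both readers accept either a local filename or an open binary file object, so they can be pointed at a downloaded copy of jsonl.paths.gz or at a single data file. For example, group_by_crawl(read_paths("jsonl.paths.gz")) would return a dictionary mapping each crawl ID to its list of paths, provided the assumptions above hold.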