[...] As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-BASELINE, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-BASELINE represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6× less compute than Llama 3 8B. [...]
Read the full paper on arXiv.
The data herein is stored as global/local shards represented as ZSTD-compressed JSONL files.
$ aws s3 ls s3://commoncrawl/contrib/datacomp/DCLM-baseline/
PRE global-shard_01_of_10/
PRE global-shard_02_of_10/
PRE global-shard_03_of_10/
PRE global-shard_04_of_10/
PRE global-shard_05_of_10/
PRE global-shard_06_of_10/
PRE global-shard_07_of_10/
PRE global-shard_08_of_10/
PRE global-shard_09_of_10/
PRE global-shard_10_of_10/
2025-06-22 15:40:45 85087 DCLM-baseline.paths.gz
The DCLM-baseline.paths.gz file lists the prefixes/paths of the data files.
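As a rough illustration of working with this layout, the sketch below reads a local copy of the paths file, groups its entries by the `global-shard_NN_of_10` prefix shown in the listing, and streams records from a ZSTD-compressed JSONL file. It assumes the paths file holds one path per line, which is not confirmed here. The function names are hypothetical, and the decompression step relies on the third-party `zstandard` package. No record field names are assumed.

```python
import gzip
import io
import json
import re
from collections import defaultdict

# Matches the top-level prefixes shown in the listing, e.g. "global-shard_01_of_10".
GLOBAL_SHARD = re.compile(r"global-shard_\d+_of_\d+")


def read_paths(paths_gz):
    """Return the non-empty lines of a gzip-compressed paths file.

    Assumption: the file holds one path or prefix per line.
    """
    with gzip.open(paths_gz, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def group_by_global_shard(paths):
    """Group paths by the global-shard prefix they contain."""
    groups = defaultdict(list)
    for path in paths:
        match = GLOBAL_SHARD.search(path)
        key = match.group(0) if match else "(other)"
        groups[key].append(path)
    return dict(groups)


def iter_jsonl(text_stream):
    """Yield one parsed JSON object per non-blank line of a text stream."""
    for line in text_stream:
        if line.strip():
            yield json.loads(line)


def iter_zstd_jsonl(file_path):
    """Stream records from one ZSTD-compressed JSONL file.

    Requires the third-party `zstandard` package; it is imported lazily
    so the helpers above work without it.
    """
    import zstandard

    with open(file_path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))
```

With a downloaded copy of the paths file, `group_by_global_shard(read_paths("DCLM-baseline.paths.gz"))` gives a per-shard view of the listing, which makes it easy to fetch a single global shard rather than the whole dataset. Streaming the decompression keeps memory use flat regardless of file size.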