Training an LLM requires a large amount of high-quality data. Even though many tech giants have open-sourced high-performance LLMs (e.g., LLaMA, Mistral), high-quality data still remains largely private.

Chinese Dataset

English Dataset

RefinedWeb: 600B tokens.

Dolma: open-sourced by AllenAI, contains 3T tokens and ships with a toolkit whose key features are high performance, portability, built-in taggers, fast deduplication, extensibility, and cloud support.

FineWeb: 15 trillion tokens of high-quality web data from the Hugging Face team, who filtered and deduplicated all CommonCrawl snapshots between 2013 and 2024. Models trained on FineWeb outperform those trained on RefinedWeb, C4, Dolma v1.6, The Pile, and SlimPajama.

FineWeb-Edu: curated from FineWeb with additional filtering; the details are in the FineWeb data processing technical report.

DataComp: a 240T-token pool from Common Crawl, together with a collection of tools for running controlled experiments (e.g., different mixing strategies, various data filters).

TxT360 (TRillion eXtracted Text): built on the idea that a top-quality LLM pre-training dataset requires the perfect blend. It is the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 commonly used non-web data sources (e.g., FreeLaw, PG-19), providing teams with a recipe to easily adjust data weighting and obtain one of the largest high-quality open-source datasets.

TxT360 compared to other common pretraining datasets:

| Data Source | TxT360 | FineWeb | RefinedWeb | RedPajamaV2 | C4 | Dolma | RedPajamaV1 | The Pile |
|---|---|---|---|---|---|---|---|---|
| CommonCrawl Snapshots | 99 | 96 | 90 | 84 | 1 | 24 | 5 | 0.6% of 74 |
| Papers** | 5 Sources | - | - | - | - | 1 Source | 1 Source | 4 Sources |
| Wikipedia | 310+ Languages | - | - | - | - | Included | Included | English Only |
| FreeLaw | Included | - | - | - | - | - | - | Included |
| DM Math | Included | - | - | - | - | - | - | Included |
| USPTO | Included | - | - | - | - | - | - | Included |
| PG-19 | Included | - | - | - | - | Included | Included | Included |
| HackerNews | Included | - | - | - | - | - | - | Included |
| Ubuntu IRC | Included | - | - | - | - | - | - | Included |
| EuroParl | Included | - | - | - | - | - | - | Included |
| StackExchange** | Included | - | - | - | - | - | - | Included |
| Code | * | - | - | - | - | Included | Included | Included |

Note: TxT360 doesn’t include code.

Clean Dataset

Hugging Face provides a Python library, datatrove, to process, filter, and deduplicate text data at very large scale.
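To give a flavor of the API, here is a minimal sketch of a datatrove pipeline: read JSONL documents, drop very short ones, and write the survivors back out. The input/output paths and the simple length filter are placeholder assumptions; the library also ships language and quality filters, deduplication blocks, and distributed executors.

```python
# Minimal datatrove pipeline sketch (paths and filter are illustrative).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/raw"),                               # one JSON document per line
        LambdaFilter(lambda doc: len(doc.text.split()) > 50),  # drop very short documents
        JsonlWriter("data/filtered"),                          # write the filtered output
    ],
    tasks=4,  # number of parallel tasks
)

if __name__ == "__main__":
    executor.run()
```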

Synthetic Data Generation

Cosmopedia: the largest open synthetic dataset for pre-training.

Instruction Data Generation

Although some LLMs, such as LLaMA and Gemma, have open weights, their alignment data remains private, which obstructs the democratization of AI. Researchers proposed a method named MAGPIE1 to generate large-scale alignment data. The key insight is that an aligned LLM will generate a user query on its own when given only the left-hand side of its chat template, up to the position reserved for the user message.

Figure 1. The process of self-synthesizing instruction data from aligned LLMs (e.g., Llama-3-8B-Instruct) to create a high-quality instruction dataset. (Image source: MAGPIE: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing)

Concretely, the MAGPIE pipeline works as follows:

  1. Step 1: Instruction Generation: Magpie crafts an input query in the format of the LLM's predefined instruction template. This query defines only the role of the instruction provider (e.g., user) and does not provide any instruction. Because the auto-regressive LLM has been fine-tuned on instruction data in this template format, it autonomously generates an instruction when given Magpie's crafted query as input.
  2. Step 2: Response Generation: Magpie sends the generated instruction back to the LLM to obtain the response (see the sketch after this list). In the paper's experiments, fine-tuning Llama-3-8B-Base on the MAGPIE instruction dataset outperforms fine-tuning on other public datasets.
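Below is a hedged sketch of this two-step idea with the transformers library. The model name and the Llama-3 pre-query template string are assumptions for illustration, not necessarily the exact artifacts used in the paper.

```python
# MAGPIE-style self-synthesis sketch: the aligned model invents the user query itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed aligned model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Step 1: feed only the left side of the chat template, ending right where the
# user message would normally begin; the aligned model completes it with a query.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
gen = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: send the synthesized instruction back through the normal chat template
# to obtain a response, yielding one (instruction, response) training pair.
chat = [{"role": "user", "content": instruction}]
prompt_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
gen = model.generate(prompt_ids, max_new_tokens=512)
response = tokenizer.decode(gen[0][prompt_ids.shape[1]:], skip_special_tokens=True)

print({"instruction": instruction, "response": response})
```

Repeating this loop with sampling enabled yields a large pool of diverse instruction-response pairs that can then be filtered for quality.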

OCR-free Data Generation

MLLMs have long struggled with extracting text from images. The traditional approach runs OCR to extract the text and then decodes the key information in a separate step, which requires considerable extra computation. Now that MLLMs (e.g., GPT-4V) are available, OCR-free methods are expected to handle these tasks directly. SynthDoG2 proposes a way to generate such synthetic document datasets at large scale.
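As a rough illustration of the idea (not SynthDoG's actual pipeline, which adds realistic backgrounds, layouts, fonts, and augmentations), one can render known text onto an image and keep that text as the label, producing (image, text) pairs for OCR-free training. Paths and fonts below are hypothetical.

```python
# Toy sketch: render ground-truth text into an image and record the pair.
import json
from PIL import Image, ImageDraw, ImageFont

def make_sample(text: str, out_path: str, size=(640, 480)):
    img = Image.new("RGB", size, "white")            # plain background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                  # swap in a real TTF for realism
    draw.multiline_text((20, 20), text, fill="black", font=font)
    img.save(out_path)
    return {"image": out_path, "ground_truth": text}  # label is the rendered text

sample = make_sample("Invoice #1234\nTotal: $56.78", "sample_000.png")
with open("metadata.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```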

Dataset Evaluation Benchmark

It is crucial to understand which data curation strategies work best in order to ultimately build better language models. DataComp-LM3 (project) introduces the first benchmark for language model training data curation.

Figure 2. The model achieves different results depending on the training set; a better training set yields better performance. (Image source: DataComp-LM: In search of the next generation of training sets for language models)

In this paper, the authors conduct 416 experiments with different training sets and compute scales, and identify that model-based filtering plays a key role in an effective data curation pipeline. Surprisingly, a simple bigram classifier, combined with a carefully selected set of positive and negative examples, performs best.
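As a hedged sketch of what such a model-based filter can look like, the snippet below trains a fastText classifier with bigram features on labeled positive/negative documents and uses it to keep or drop new documents. The training file name, hyperparameters, and threshold are assumptions, not DCLM's exact settings.

```python
# Bigram fastText quality filter sketch.
import fasttext

# Each line of train.txt looks like:
#   __label__positive <high-quality document text>
#   __label__negative <random web document text>
model = fasttext.train_supervised(input="train.txt", wordNgrams=2, dim=100, epoch=5)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier scores it as 'positive' above the threshold."""
    labels, probs = model.predict(document.replace("\n", " "), k=1)
    return labels[0] == "__label__positive" and probs[0] >= threshold

print(keep("A clear explanation of gradient descent with worked examples."))
```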

Figure 3. The workflow of DCLM. (Image source: DataComp-LM: In search of the next generation of training sets for language models)

DCLM, with a 240T-token pool and model scales up to 7B parameters, is the first large-scale data-centric benchmark for language models. Some interesting conclusions are below:

  1. Among C4, Dolma-V1, RedPajama, and RefinedWeb, RefinedWeb scores best.
  2. For text extraction (e.g., resiliparse, trafilatura (used by RefinedWeb), and the extraction provided by Common Crawl), resiliparse and trafilatura have similar downstream performance, and resiliparse is also considerably faster.
  3. For deduplication, a Bloom filter performs better than MinHash (see the sketch after this list).
  4. For model-based quality filtering, fastText works best, compared with PageRank score filtering, semantic deduplication, linear classifiers fit on pre-trained BGE text embeddings, AskLLM (prompting an LM to decide whether a document is useful), perplexity filtering, and top-k average logits.
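For item 3, here is a toy sketch of Bloom-filter-based fuzzy deduplication over word n-grams, using a from-scratch filter and simplified parameters rather than DCLM's actual implementation.

```python
# Toy Bloom-filter deduplication: flag documents whose n-grams were mostly seen before.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions per item from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def is_duplicate(doc: str, bf: BloomFilter, ngram: int = 13, threshold: float = 0.8) -> bool:
    """Mark a document as duplicate if most of its word n-grams were seen before."""
    words = doc.split()
    grams = [" ".join(words[i:i + ngram]) for i in range(max(1, len(words) - ngram + 1))]
    seen = sum(g in bf for g in grams)
    duplicate = seen / len(grams) >= threshold
    for g in grams:  # register this document's n-grams for future checks
        bf.add(g)
    return duplicate

bf = BloomFilter()
print(is_duplicate("the quick brown fox jumps over the lazy dog " * 3, bf))  # False on first sight
print(is_duplicate("the quick brown fox jumps over the lazy dog " * 3, bf))  # True on repeat
```

Unlike MinHash, this approach needs no pairwise comparisons or LSH buckets, which is part of why it scales well, at the cost of a tunable false-positive rate.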