Pre-training is where a language model acquires the vast majority of its knowledge and capabilities, yet the data engineering behind pre-training remains one of the least transparent aspects of modern AI. While frontier labs invest hundreds of millions of dollars in training compute, the quality, composition, and curation of training data often matter as much as scale. This article examines the full pre-training pipeline, from web scraping and deduplication through quality filtering and data mixing, alongside the training objectives and curriculum strategies that determine how models learn from their data.
A modern pre-training data pipeline processes petabytes of raw web crawl data into a curated, deduplicated, filtered corpus. The pipeline typically has four stages: acquisition, deduplication, quality filtering, and mixing.
The foundation of most pre-training datasets is Common Crawl, a nonprofit organization that performs regular crawls of the public web and makes the data freely available. Each monthly crawl contains 3-5 billion web pages, totaling 200-300 TB of compressed WARC files.
Raw Common Crawl data is unsuitable for training directly: it arrives as raw HTML riddled with boilerplate, contains enormous amounts of duplicated, spammy, and low-quality text, and spans many languages.
The first step is text extraction: converting raw HTML into clean text. Trafilatura (Barbaresi, 2021) and jusText are commonly used tools that identify the main content area of a web page and strip boilerplate.
```python
import trafilatura

def extract_text_from_html(html_content):
    """Extract main content from HTML, removing boilerplate."""
    text = trafilatura.extract(
        html_content,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        favor_precision=True,
    )
    return text
```
After text extraction, each document is tagged with its language using a classifier such as fastText's language identification model (Joulin et al., 2017), which achieves 95%+ accuracy across 176 languages. This enables language-specific filtering and controlled multilingual mixing.
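In practice this step is a thin wrapper around the pretrained classifier. A minimal sketch, assuming the lid.176.bin model file has been downloaded from the fastText site (the confidence threshold here is illustrative):

```python
import fasttext

# Pretrained language-identification model covering 176 languages
lang_model = fasttext.load_model('lid.176.bin')

def detect_language(text, min_confidence=0.65):
    """Return (language, confidence); language is None below threshold."""
    # fastText's predict() expects a single line of text
    labels, probs = lang_model.predict(text.replace('\n', ' '))
    lang = labels[0].replace('__label__', '')
    conf = float(probs[0])
    return (lang, conf) if conf >= min_confidence else (None, conf)
```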
Deduplication is arguably the single most impactful data processing step. Lee et al. (2022) showed that training on deduplicated data produces better models at the same compute budget: duplicate data wastes training FLOPs and can cause memorization of specific passages.
The simplest form removes documents with identical content. This is typically done by computing a hash (SHA-256 or similar) of each document's text and removing duplicates:
```python
import hashlib

def exact_dedup(documents):
    """Remove exactly duplicated documents."""
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        h = hashlib.sha256(doc.encode('utf-8')).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            unique_docs.append(doc)
    return unique_docs
```
Exact deduplication is fast and removes a surprising amount of data, typically 10-30% of a raw web crawl. However, it misses near-duplicates: documents that differ by a few words, a date, or formatting.
MinHash locality-sensitive hashing (Broder, 1997) is the standard approach for near-deduplication at scale. The algorithm estimates the Jaccard similarity between document shingle sets without comparing all pairs:
```python
from datasketch import MinHash, MinHashLSH

def create_minhash(text, num_perm=128, ngram_size=5):
    """Create a MinHash signature for a document."""
    m = MinHash(num_perm=num_perm)
    # Create character n-gram shingles
    shingles = set()
    for i in range(len(text) - ngram_size + 1):
        shingles.add(text[i:i + ngram_size])
    for shingle in shingles:
        m.update(shingle.encode('utf-8'))
    return m

def near_dedup(documents, threshold=0.8):
    """Remove near-duplicate documents using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_indices = []
    for i, doc in enumerate(documents):
        mh = create_minhash(doc)
        # Check if a similar document already exists in the index
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique_indices.append(i)
    return [documents[i] for i in unique_indices]
```
Penedo et al. (2023) in the RefinedWeb paper reported that MinHash deduplication removed an additional 30-40% of data beyond exact deduplication, and the resulting models showed measurable quality improvements, particularly on memorization-sensitive evaluations.
Lee et al. (2022) demonstrated that substring-level deduplication using suffix arrays can further improve data quality. Rather than removing entire near-duplicate documents, this approach identifies and removes duplicated spans of text (e.g., repeated boilerplate paragraphs, copied passages) that occur across many documents. This is more fine-grained than document-level deduplication and catches a class of redundancy that MinHash misses.
The computational cost is significant (building suffix arrays over trillions of tokens requires specialized distributed algorithms), but the quality improvement justifies the cost for major training runs.
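True suffix-array construction is beyond a short example, but the effect can be approximated by counting hashed token spans across the corpus and flagging spans that recur in many documents. A simplified sketch (the 50-token span length and occurrence threshold are illustrative, not values from the paper):

```python
from collections import Counter

def find_duplicated_spans(documents, span_len=50, min_occurrences=10):
    """Flag token spans that repeat across many documents.

    A toy stand-in for suffix-array substring dedup; production
    pipelines build distributed suffix arrays over the tokenized corpus.
    """
    span_counts = Counter()
    for doc in documents:
        tokens = doc.split()
        seen_in_doc = set()  # count each span at most once per document
        for i in range(len(tokens) - span_len + 1):
            h = hash(tuple(tokens[i:i + span_len]))
            if h not in seen_in_doc:
                seen_in_doc.add(h)
                span_counts[h] += 1
    return {h for h, c in span_counts.items() if c >= min_occurrences}
```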
After deduplication, the corpus still contains substantial low-quality text. Quality filtering aims to retain text that will be most useful for training.
Simple heuristic rules remove clearly low-quality documents:
```python
def heuristic_filter(text):
    """Apply basic quality heuristics."""
    # Too short or too long
    if len(text) < 200 or len(text) > 500000:
        return False
    # Low alphabetic ratio (likely code/tables/binary)
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.4:
        return False
    # High ratio of repeated lines
    lines = text.split('\n')
    unique_lines = set(lines)
    if len(unique_lines) / max(len(lines), 1) < 0.3:
        return False
    # Check for known low-quality patterns
    if any(spam in text.lower() for spam in [
        'buy now', 'click here', 'subscribe to our newsletter',
        'cookie policy', 'terms and conditions'
    ]):
        return False  # Oversimplified; real filters are more nuanced
    return True
```
Rae et al. (2021) in the Gopher paper detailed an extensive heuristic filtering pipeline, including word-count and mean-word-length bounds, limits on the symbol-to-word ratio (e.g., hashtags and ellipses), caps on the fraction of lines that are bullet points or end in ellipses, and a requirement that documents contain a minimum number of common English stop words.
Brown et al. (2020) trained a binary classifier to distinguish "high-quality" text (using Wikipedia and curated book corpora as positive examples) from random web text. Documents scoring above a threshold were retained. This approach, while effective, introduces a quality bias: the model learns to prefer text that "looks like" Wikipedia, which may not be the best proxy for training data quality.
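The GPT-3 filtering code is not public, but a classifier in this spirit is straightforward to build with fastText's supervised mode. A sketch, assuming a training file in fastText's __label__ format (the file name, hyperparameters, and threshold are illustrative):

```python
import fasttext

# quality_train.txt contains lines such as:
#   __label__hq <text drawn from Wikipedia or curated books>
#   __label__lq <text sampled at random from the web crawl>
quality_model = fasttext.train_supervised(
    input='quality_train.txt', lr=0.1, epoch=5, wordNgrams=2
)

def keep_document(text, threshold=0.5):
    """Keep a document if it scores as 'high quality' above threshold."""
    labels, probs = quality_model.predict(text.replace('\n', ' '))
    return labels[0] == '__label__hq' and float(probs[0]) >= threshold
```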
Penedo et al. (2023) in RefinedWeb argued that heavy classifier-based filtering can be counterproductive, as it reduces dataset diversity and introduces systematic biases. They showed that with sufficient deduplication, lighter heuristic filtering produces comparable or better models.
An alternative approach uses a pre-trained language model (typically a small one, like a 5-gram KenLM model) to score documents by perplexity. Documents with very high perplexity (incomprehensible text, non-natural language) or very low perplexity (highly repetitive, template text) are removed:
```python
import kenlm

model = kenlm.Model('web_lm.arpa')

def perplexity_filter(text, min_pp=10, max_pp=1000):
    """Filter documents by KenLM perplexity."""
    pp = model.perplexity(text)
    return min_pp <= pp <= max_pp
```
CCNet (Wenzek et al., 2020), used to create the CC-100 dataset and subsequently adopted by many projects, uses this approach to partition Common Crawl into quality buckets.
The training objective determines what the model learns from its data. While next-token prediction dominates, several alternatives and augmentations exist.
The standard objective for decoder-only models: predict the next token given all previous tokens.
$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
Every token in the training sequence provides a training signal, making CLM extremely data-efficient in terms of utilization: each sequence of $T$ tokens yields $T$ prediction targets.
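In code, the CLM loss is a shifted cross-entropy: the logits at position $t$ are scored against the token at position $t+1$. A minimal PyTorch sketch:

```python
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    """Causal LM loss.

    logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len).
    """
    # Shift so that the prediction at position t is scored against token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```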
CLM is used by GPT, Llama, Mistral, Claude, and most modern LLMs.
The objective used by BERT (Devlin et al., 2019): randomly mask 15% of tokens and predict them from the bidirectional context. MLM produces strong bidirectional representations but is less natural for generation tasks.
$$\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}})$$
where $\mathcal{M}$ is the set of masked positions.
MLM models (BERT, RoBERTa, DeBERTa) excel at classification and extraction tasks but require separate decoder architectures for generation. The 15% masking rate means only 15% of tokens provide a training signal per sequence, making MLM less data-efficient than CLM.
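A sketch of BERT-style input masking, following the standard 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) over the selected positions; mask_token_id and vocab_size are assumed to come from the tokenizer:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking; labels are -100 except at masked positions."""
    labels = input_ids.clone()
    # Select ~15% of positions as prediction targets
    target = torch.rand(input_ids.shape) < mlm_prob
    labels[~target] = -100  # positions ignored by cross-entropy

    corrupted = input_ids.clone()
    # 80% of targets become [MASK]
    masked = target & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id
    # Half of the remaining 20% become random tokens; the rest stay as-is
    randomized = target & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    return corrupted, labels
```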
A hybrid approach where the first portion of the sequence (the "prefix") is processed with bidirectional attention, and the remainder is predicted autoregressively:
$$\mathcal{L}_{\text{prefix}} = -\sum_{t=p+1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
where positions $1$ through $p$ are the prefix tokens seen bidirectionally. T5 uses a variant of this, and Tay et al. (2022) in UL2 showed that mixing multiple objective types (prefix LM, CLM, and span corruption) can produce models that are strong at both understanding and generation.
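Architecturally, prefix LM differs from CLM only in the attention mask: prefix positions attend bidirectionally among themselves, while the remaining positions stay causal. A minimal sketch:

```python
import torch

def prefix_lm_mask(seq_len, prefix_len):
    """Boolean attention mask (True = attention allowed)."""
    # Standard causal mask: each position attends to itself and the past
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Prefix tokens additionally attend to the entire prefix
    mask[:prefix_len, :prefix_len] = True
    return mask
```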
Used by T5 (Raffel et al., 2020), span corruption replaces random contiguous spans of tokens with sentinel tokens, and the model must generate the missing spans:
Input: "The <X> sat on the <Y>."
Target: "<X> cat <Y> mat </s>"
Span corruption is more data-efficient than single-token MLM because the model must predict multiple consecutive tokens, learning local coherence. The span length distribution affects what the model learns: short spans emphasize local syntax, while long spans require more semantic understanding.
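A simplified sketch of building span-corruption examples over a token list, using T5's <extra_id_N> sentinel naming; the span sampling and control flow here are illustrative rather than T5's actual implementation:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Replace random contiguous spans with sentinels; return (input, target)."""
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to corrupt
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and random.random() < corruption_rate:
            span = min(max(1, int(random.expovariate(1 / mean_span_len))),
                       budget, len(tokens) - i)
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)
            targets.append(marker)
            targets.extend(tokens[i:i + span])
            i += span
            budget -= span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append("</s>")
    return inputs, targets

# Example: span_corrupt("The cat sat on the mat .".split())
```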
The composition of the training data (the relative proportions of different data sources and domains) significantly affects model capabilities.
Modern training corpora typically include:
| Source | Proportion | Purpose |
|---|---|---|
| Web text | 60-80% | General knowledge, language understanding |
| Code | 5-15% | Programming, structured reasoning |
| Books | 3-10% | Long-form reasoning, narrative, specialized knowledge |
| Academic papers | 2-5% | Scientific knowledge, technical writing |
| Wikipedia | 2-5% | Factual knowledge, structured information |
| Conversational data | 1-5% | Dialogue, informal language |
| Math | 1-5% | Mathematical reasoning |
| Multilingual | Variable | Non-English language capability |
Ye et al. (2024) studied how data mixing ratios affect model capabilities and found that a model's per-domain loss varies predictably with the mixture proportions; these "data mixing laws" make it possible to estimate how a candidate mixture will perform before committing to a full training run.
Xie et al. (2023) with DoReMi introduced an algorithm that learns optimal data mixing ratios using a small proxy model:
```python
import torch

def doremi_mixing(proxy_model, reference_model, domains,
                  num_steps=1000, eta=0.1):
    """DoReMi (simplified): learn domain weights that minimize
    worst-case excess loss across domains.

    Assumes sample_weighted() draws a per-domain batch according to the
    current weights, and that each model's loss() returns a tensor of
    per-domain losses.
    """
    domain_weights = torch.ones(len(domains)) / len(domains)
    for step in range(num_steps):
        # Sample a batch according to the current domain weights
        batch = sample_weighted(domains, domain_weights)
        # Excess loss: how much worse the proxy is than the reference;
        # DoReMi clips negative excess loss to zero
        proxy_loss = proxy_model.loss(batch)
        ref_loss = reference_model.loss(batch)
        excess_loss = torch.clamp(proxy_loss - ref_loss, min=0)
        # Exponentiated-gradient update: upweight high-excess-loss domains
        domain_weights = domain_weights * torch.exp(eta * excess_loss)
        domain_weights = domain_weights / domain_weights.sum()
    return domain_weights
```
Gao et al. (2020) created The Pile, an 825 GB English text dataset assembled from 22 diverse sources. The Pile was significant because it was released openly with full documentation of its sources and construction, demonstrated the value of mixing curated sources (academic papers, code, books, legal text) with web data, and gave the open-source community its first large-scale, reproducible pre-training corpus.
The Pile was used to train GPT-Neo, GPT-J, and Pythia models, making it one of the most studied training datasets.
Together (2023) created RedPajama as an open reproduction of the Llama training data. RedPajama v1 replicated the seven-source composition of Llama (Common Crawl, C4, GitHub, Wikipedia, books, ArXiv, StackExchange) totaling 1.2T tokens.
RedPajama v2 expanded to 30T tokens of raw web data with quality signals, enabling researchers to apply their own filtering criteria.
Soldaini et al. (2024) at the Allen Institute for AI created Dolma, a 3T token dataset used to train the OLMo model family. Dolma's contribution is not just its size but its transparency: the curation pipeline is documented step by step and released as an open-source toolkit.
```python
# Dolma processing pipeline (conceptual)
# 1. Download Common Crawl WARC files
# 2. Extract text with trafilatura
# 3. Language identification with fastText
# 4. Exact dedup with SHA-256
# 5. Near dedup with MinHash (Jaccard threshold 0.8)
# 6. Quality filtering with heuristics + perplexity
# 7. PII removal (email, phone, SSN patterns)
# 8. Tokenization and mixing
```
Penedo et al. (2024) at Hugging Face released FineWeb, a 15T token English web dataset derived from 96 Common Crawl snapshots. FineWeb demonstrated that careful deduplication and filtering of web data alone, without curated sources like Wikipedia or books, can produce models competitive with those trained on mixed-source datasets.
The companion FineWeb-Edu subset used a classifier trained on educational content to extract high-quality educational text, achieving strong results on knowledge benchmarks.
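A sketch of scoring documents with the released classifier via the transformers API; the model name matches the published checkpoint, but the exact preprocessing and the 0-5 score convention should be checked against the model card rather than taken from this sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Released FineWeb-Edu classifier; outputs a single regression logit
# interpreted as an educational-quality score (roughly 0-5)
name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def edu_score(text):
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

# FineWeb-Edu kept documents scoring >= 3 on this scale
```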
Li et al. (2024) introduced DCLM, applying the DataComp competition framework (originally developed for vision-language datasets) to language model pre-training. Rather than fixing the data and varying the model, DCLM fixes the model architecture and training recipe, then benchmarks different data curation strategies against each other on a standardized evaluation suite. The resulting DCLM-Baseline dataset, curated from Common Crawl using a fastText-based quality filter trained on OpenHermes 2.5 examples, produced a 7B-parameter model that matched or exceeded models trained on significantly more tokens. DCLM demonstrated that the data curation algorithm matters more than the raw data volume, a theme that connects directly to the scaling law insights discussed in Scaling Laws. The full competition framework and datasets are open source, enabling the community to systematically compare filtering, deduplication, and mixing strategies on equal footing.
Penedo et al. (2024) extended FineWeb into the multilingual domain with FineWeb2, covering over 1000 languages extracted from Common Crawl. FineWeb2 applies the same rigorous deduplication and quality filtering pipeline as its English predecessor, adapted to handle the additional challenges of multilingual data: language identification accuracy degrades for low-resource languages, quality classifiers trained on English transfer poorly, and tokenization efficiency varies dramatically across scripts (see Tokenization for how vocabulary design interacts with multilingual data). FineWeb2 provides per-language quality scores and deduplication metadata, making it practical to construct custom multilingual training mixtures. For many low-resource languages, FineWeb2 represents the largest curated text corpus available.
The use of synthetically generated text as pre-training data has moved from experimental curiosity to mainstream practice, driven by the recognition that carefully crafted synthetic data can fill gaps that web-crawled data cannot.
The Phi model family (Gunasekar et al., 2023; Li et al., 2023) demonstrated that small models trained on high-quality synthetic data can dramatically outperform larger models trained on web crawls. The Phi-1 model (1.3B parameters) matched the code generation performance of models 10x its size by training on "textbook-quality" synthetic data: code exercises and explanations generated by GPT-3.5 and GPT-4 that followed a pedagogical structure, introducing concepts, providing worked examples, and building in complexity. Phi-1.5 and Phi-2 extended this approach to general reasoning, generating synthetic textbook chapters, exercises, and explanations across science, mathematics, and common-sense reasoning. The key insight is that synthetic data works not because it contains novel facts but because it presents information in a form optimized for learning: structured, progressive, and redundancy-free.
Ben Allal et al. (2024) at Hugging Face released Cosmopedia, a 25B-token synthetic dataset generated by Mixtral-8x7B-Instruct. Cosmopedia contains synthetic textbooks, blog posts, stories, and WikiHow-style articles covering topics derived from web content seeds. The generation pipeline uses curated prompts that specify audience level (e.g., "explain to a college student"), format, and topic, then post-filters the output for quality and coherence. The resulting dataset was used to train the SmolLM model family, demonstrating competitive performance at small model sizes.
Synthetic data is not universally beneficial. Research has identified several failure modes: model collapse when models are trained recursively on their own outputs (Shumailov et al., 2024), reduced lexical and semantic diversity relative to human-written text, and amplification of the generating model's biases and factual errors.
The emerging consensus is that synthetic data works best as a supplement to web-crawled data, not a replacement. It is most effective for domains where web data is scarce (mathematical reasoning, structured problem-solving) or where the pedagogical structure of the data matters more than the factual content.
A growing concern in the LLM community is the data wall: the impending exhaustion of unique, high-quality text available on the public internet.
Villalobos et al. (2024) estimated that the stock of high-quality English text on the internet is approximately 9 trillion tokens, with total web text (including low-quality content) at roughly 50-90 trillion tokens. Frontier models trained in 2024 already consumed 15T+ tokens, and Llama 3 405B was trained on 15.6T tokens, meaning that the highest-quality web text has already been used at least once by leading labs. Extrapolating the historical growth rate of model training data suggests that unique high-quality text could be effectively exhausted by 2028, a timeline that connects directly to the compute-data tradeoffs examined in Scaling Laws.
When unique data is limited, an obvious strategy is to train on the same data multiple times. Muennighoff et al. (2023) systematically studied multi-epoch training and found that repeating data up to 4 epochs causes minimal degradation, but beyond that, models begin to memorize rather than generalize. The effective value of repeated data follows a roughly logarithmic decay: the second pass through the data is worth approximately 60-70% of the first, the third pass 40-50%, and so on. Taylor et al. (2022) in the Galactica paper trained for multiple epochs on curated scientific text and showed that when data quality is very high, more epochs can be tolerated before diminishing returns set in.
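Taking the rough per-pass values quoted above at face value (they are illustrative, not exact figures from the paper), the effective token budget of a repeated corpus can be tallied directly:

```python
def effective_tokens(unique_tokens, epochs,
                     pass_value=(1.0, 0.65, 0.45, 0.3)):
    """Discount repeated passes by the rough multipliers quoted above."""
    total = 0.0
    for e in range(epochs):
        mult = pass_value[e] if e < len(pass_value) else pass_value[-1]
        total += unique_tokens * mult
    return total

# 4 epochs over a 5T-token corpus: 5T * (1.0 + 0.65 + 0.45 + 0.3) = 12T
print(effective_tokens(5e12, 4))  # 1.2e+13 effective tokens
```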
Several strategies extend the effective size of a fixed corpus: paraphrasing or rewriting existing documents to create additional views of the same content, generating synthetic data seeded by existing text (as in Cosmopedia), and tapping modalities whose content has not yet been exhausted, such as transcribed speech.
One path around the text data wall is to tap into the vast quantities of human knowledge encoded in video and audio. YouTube alone hosts over 800 million videos, with an estimated 10+ billion hours of spoken content, potentially trillions of tokens if transcribed. Whisper (Radford et al., 2023) demonstrated that high-quality speech-to-text models can transcribe audio at scale with low error rates. Several projects have explored using Whisper-transcribed YouTube, podcast, and lecture content as training data. The quality characteristics of transcribed speech differ from written text: it is more conversational, contains disfluencies, and represents a different distribution of topics. But it provides genuine human-generated content that is not yet exhausted.
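A minimal transcription sketch with the open-source whisper package (checkpoint size and file path are illustrative; corpus-scale transcription is batched across GPUs rather than run file by file):

```python
import whisper

# Larger checkpoints are more accurate but slower
model = whisper.load_model("base")

def transcribe(audio_path):
    """Transcribe one audio file into text for the training corpus."""
    result = model.transcribe(audio_path)
    return result["text"]
```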
As models expand beyond text to handle images, video, and audio natively, the curation of multimodal training data has become its own discipline.
The earliest large-scale image-text datasets were constructed by scraping alt-text from web images. LAION-5B (Schuhmann et al., 2022) collected 5.85 billion image-text pairs from Common Crawl, filtered using CLIP similarity scores to retain pairs where the image and text were semantically related. LAION-5B enabled the training of open models like Stable Diffusion, but also raised concerns: subsequent audits found CSAM, copyrighted material, and biased content in the dataset, leading to its temporary takedown and a re-filtered release.
DataComp (Gadre et al., 2024) applied the same competition-based framework as DCLM to image-text data, providing standardized evaluation for different filtering strategies. DataComp demonstrated that aggressive CLIP-score filtering improves model quality but reduces dataset diversity, echoing the tension between quality and diversity seen in text-only data curation.
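A sketch of CLIP-score filtering using the transformers CLIP API; the 0.28 threshold echoes the value LAION reported for ViT-B/32 embeddings, but treat the exact numbers as assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# LAION-style filter: keep pairs with similarity above ~0.28
```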
Training models that can naturally mix text and images, like GPT-4V or Gemini, requires interleaved multimodal data where images appear inline within text passages, replicating how images function in web pages, textbooks, and articles. OBELICS (Laurençon et al., 2023) extracted 141 million interleaved image-text documents from Common Crawl, preserving the spatial relationship between text and images on the original web pages. This is fundamentally harder than collecting separate image-caption pairs: the filtering pipeline must assess whether images are content-relevant (not advertisements or icons), whether the surrounding text actually references the image, and whether the document as a whole is coherent.
Video pre-training data adds temporal complexity. Datasets like WebVid (Bain et al., 2021) collected millions of short video clips with text descriptions, while InternVid (Wang et al., 2024) scaled to over 200 million video clips. The challenge with video data is both the storage requirements (orders of magnitude larger than text or images) and the annotation quality β automatically generated descriptions of video content are often shallow ("a person walks across a room") and miss the temporal dynamics that make video informative. Current multimodal models typically sample sparse frames from videos rather than processing full video streams, a compromise driven by computational cost. The relationship between how multimodal data is tokenized and how models process it connects to the broader tokenization challenges discussed in Tokenization.
Curriculum learning, the idea that the order in which data is presented during training matters, has shown promise in LLM pre-training.
Bengio et al. (2009) originally proposed curriculum learning for neural networks: start with "easy" examples and gradually increase difficulty. For LLMs, "difficulty" can be operationalized as document perplexity under a smaller reference model, readability scores, average sentence length, or the rarity of the vocabulary used.
A more practical form of curriculum learning is phase-based training, where the data mixture changes over the course of training: early phases train on the broad, web-heavy mixture, while later phases shift toward higher-quality and domain-specific data.
Llama 3 (Meta, 2024) used this approach, with an explicit annealing phase where the learning rate was decayed while training on a high-quality data subset. This improved benchmark performance significantly compared to continuing with the standard mixture.
The annealing phase has become a standard technique. Blakeney et al. (2024) showed that training on high-quality data during the final phase of pre-training, when the learning rate is being decayed, has an outsized effect on model quality. The intuition is that at low learning rates, the model makes small, precise adjustments rather than large updates, and high-quality data at this stage "polishes" the model's knowledge.
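A conceptual sketch of such a schedule: cosine learning-rate decay throughout, with a switch to a higher-quality mixture for the final stretch of training (the anneal fraction and mixture weights are illustrative, not Llama 3's actual values):

```python
import math

def training_schedule(step, total_steps, peak_lr=3e-4, anneal_frac=0.1):
    """Return (lr, mixture) for a step: cosine decay plus a late
    switch to a high-quality data mixture."""
    progress = step / total_steps
    lr = 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
    if progress < 1 - anneal_frac:
        mixture = {"web": 0.70, "code": 0.15, "books": 0.10, "math": 0.05}
    else:
        # Anneal phase: upweight curated, high-quality sources
        mixture = {"web": 0.30, "code": 0.20, "books": 0.25, "math": 0.25}
    return lr, mixture
```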
A persistent challenge in pre-training data curation is benchmark contamination: the accidental inclusion of evaluation benchmark data in the training corpus. Since training data is sourced from the web, and benchmarks are often published online, contamination is difficult to avoid entirely.
Dodge et al. (2021) and Jacovi et al. (2023) documented widespread contamination across major models and datasets. Mitigation strategies include n-gram overlap checks between training documents and benchmark examples (sketched below), decontamination passes that remove matched documents before training, and reporting contamination statistics alongside evaluation results:
```python
def check_contamination(train_docs, benchmark_examples, n=13):
    """Check for n-gram overlap between training data and benchmarks."""
    # Build n-gram index from benchmark examples
    benchmark_ngrams = set()
    for example in benchmark_examples:
        tokens = example.split()
        for i in range(len(tokens) - n + 1):
            benchmark_ngrams.add(tuple(tokens[i:i + n]))
    contaminated = []
    for doc in train_docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in benchmark_ngrams:
                contaminated.append(doc)
                break
    return contaminated
```
Pre-training data curation intersects with significant ethical and legal questions: the copyright status and fair use of scraped text, the consent of the people who created that content, the presence of personal information in web crawls, and the licensing obligations that propagate through derived datasets.
Longpre et al. (2023) in the Data Provenance Initiative documented the provenance and licensing status of 1800+ text datasets, finding significant ambiguity in licensing terms and frequent chain-of-custody issues where datasets are derived from other datasets without preserving license constraints.