How AI Learns to Speak: A Journey into LLM Pretraining
- Tushar Prasad

- Jun 6
- 13 min read
You’ve seen them, you’ve probably used them. ChatGPT, Claude, Llama – these Large Language Models (LLMs) have exploded onto the scene, writing essays, generating code, and even having surprisingly coherent conversations. But have you ever stopped to wonder what it actually takes to build one of these digital brains? It's not magic (though sometimes it feels like it!), but a fascinating mix of clever ideas, massive amounts of data, and some serious computing horsepower. Let's take a journey into the world of LLM creation. We'll try to keep it understandable, even if you're not an AI PhD (yet!).

The Two Big Acts: Pretraining and Post-training
Think of building a cutting-edge LLM as a two-act play:
Act I: Pretraining (Building the Raw Genius - Think GPT-3): This is where the model learns the fundamental rules of language, grammar, facts about the world, and even some reasoning capabilities. It’s like sending a kid to school for many years to absorb a vast general education.
Act II: Post-training (Teaching it Good Manners & Specific Skills - Think ChatGPT): Raw genius is great, but we also want the LLM to be helpful, follow instructions, and not say weird or harmful things. This is the "finishing school" phase.
The Core Mission: Teaching an AI to "Get" Language by Predicting What's Next
At its heart, pretraining an LLM is often about a surprisingly simple task: predicting the next word (or "token") in a sequence.
It sounds almost deceptively simple, doesn't it? But at the very heart of pretraining many of today's most powerful LLMs lies a core task: language modeling. More specifically, it's usually autoregressive (AR) language modeling.
Imagine you hand the AI a piece of text: "The cat sat on the..." The AI's primary job during pretraining is to figure out what word (or, more accurately, "token") is most likely to come next. In this case, "mat," "floor," or "couch" would all be pretty good guesses. It does this over and over, for trillions of words, across a vast ocean of text scraped from the internet, books, and more.
Why this particular game of "predict the next word"? Because by constantly striving to make accurate predictions, the model is forced to learn the incredibly complex, often subtle, underlying patterns of human language. It's not just memorizing sequences; it's building an internal representation that captures:
Grammar and Syntax: The rules of how words combine to form meaningful sentences. It learns that "the cat sat" is valid, but "sat cat the" isn't.
Semantics: The actual meaning of words and how they relate to each other. It learns that "king" and "queen" are related to royalty, while "apple" and "banana" are fruits.
Contextual Nuance: How the meaning of a word can shift dramatically based on the words around it. "Bank" means one thing in "river bank" and something entirely different in "savings bank."
Factual Knowledge: As it reads through encyclopedias, news articles, and scientific papers, it absorbs facts about the world – who was the first person on the moon, what's the capital of France, etc.
Common Sense Reasoning (to a degree): It learns that if you drop a glass, it will likely break, or that birds can fly but fish typically don't.
Discourse Structure: How ideas flow in longer texts, how paragraphs connect, and how stories unfold.
When it's time for the model to actually generate text (what AI folks call "inference"), the process leverages this learned predictive ability:
Tokenize: Your input (the "prompt") is broken down into tokens.
Forward Pass: These tokens are fed through the LLM's neural network.
Predict Probabilities: The model churns out a list of probabilities for every single token in its vocabulary, indicating how likely each one is to be the next token in the sequence.
Sample: To avoid making the output too predictable and boring, the model doesn't always pick the token with the absolute highest probability. Instead, it often "samples" from the top few most likely tokens, introducing an element of randomness and creativity.
Detokenize: The chosen token is converted from its numerical ID back into readable text and added to the ongoing sentence. This loop (predict, sample, append) continues, with each newly generated token becoming part of the context for predicting the next one, until the model decides the thought is complete (often by generating a special "end of sequence" token) or it reaches a predefined length.
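Here's a rough sketch of that predict-sample-append loop in Python, using the Hugging Face transformers library with a small stand-in model ("gpt2" is just a placeholder, and the 20-token limit and plain multinomial sampling are simplifications of what production systems actually do):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small stand-in model; real LLMs follow the same loop at a vastly larger scale.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1. Tokenize the prompt into token IDs.
input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(20):  # generate up to 20 new tokens
    with torch.no_grad():
        # 2. Forward pass: get logits for every position in the sequence.
        logits = model(input_ids).logits
    # 3. Turn the last position's logits into a probability distribution.
    probs = torch.softmax(logits[0, -1], dim=-1)
    # 4. Sample from the distribution instead of always taking the top token.
    next_id = torch.multinomial(probs, num_samples=1)
    # 5. Append the new token and repeat, stopping at the end-of-sequence token.
    input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)
    if next_id.item() == tokenizer.eos_token_id:
        break

# Detokenize back into readable text.
print(tokenizer.decode(input_ids[0]))
```

Real systems layer on tricks like temperature, top-k/top-p sampling, and batching, but the skeleton of the loop is the same.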
The "Teacher's Red Pen": How LLMs Learn from Their Mistakes via the Loss Function
So, the model makes a guess for the next word. How does it know if it was a good guess or a terrible one? This is where the loss function acts like a strict but fair teacher. For language modeling, the most common choice is cross-entropy loss.
Imagine the model is trying to complete "I saw a cat on a ___." Let's say the actual next word in the training data is "mat."
The model outputs its probability distribution: maybe it thought "mat" was 60% likely, "rug" 30% likely, and "table" 10% likely.
The cross-entropy loss function compares this prediction to the "ground truth" (which is 100% "mat").
If the model assigned high probability to "mat" (the correct word), the loss (think of it as a penalty or an "error score") is low. The model gets a conceptual pat on the back.
If it assigned very low probability to "mat" (meaning it was very "surprised" by the correct answer), the loss is high. The model gets a conceptual "needs improvement."
The entire, massive pretraining process is an optimization problem: the model's internal parameters (millions or billions of them, called "weights") are continuously adjusted, tiny bit by tiny bit, using algorithms like stochastic gradient descent, to try and minimize this total loss across all the trillions of examples in the training dataset.
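To put a number on the "mat" example, here's a tiny sketch of the cross-entropy calculation (the probabilities are just the made-up ones from above):

```python
import math

# Cross-entropy loss for a single prediction is just the negative log of the
# probability the model assigned to the correct next token.
def cross_entropy(prob_of_correct_token):
    return -math.log(prob_of_correct_token)

# The toy example above: the model gave "mat" (the true next word) 60%.
print(cross_entropy(0.60))   # ~0.51 -> low loss, a good guess
# A model that was very "surprised" and gave "mat" only 1%:
print(cross_entropy(0.01))   # ~4.61 -> high loss, "needs improvement"
```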
Breaking Down Language: The Unsung Hero Called the Tokenizer
Computers, for all their power, don't inherently "understand" words like "serendipity" or even simple ones like "the." They speak the language of numbers. This is where the tokenizer plays a crucial, if often overlooked, role. It’s the translator between human language and the numerical world of the LLM.
Why not just assign a unique number to every single word?
Vocabulary Explosion: The number of unique words in any living language is vast and constantly growing (think of slang, new technical terms, names). Trying to create a fixed dictionary for all of them would be a nightmare and would quickly become outdated.
Out-of-Vocabulary (OOV) Words: What happens when the model encounters a word it's never seen during training? If it only knew whole words, it would be completely stumped.
Typos and Variations: Simple misspellings ("teh" instead of "the") or regional spelling differences ("color" vs. "colour") would be treated as entirely new, unrelated words by a whole-word system.
Morphological Richness: Words like "happy," "unhappy," "happiness," "happily" all share a common root and meaning. A good tokenization strategy should be able to capture these relationships.
Instead of whole words, LLMs typically operate on tokens. These are often sub-word units – common sequences of characters. So, a single word might be represented by one token (e.g., "cat" -> [cat]) or multiple tokens (e.g., "unbelievably" -> [un, believe, ably]). This approach offers several advantages:
Handles Novel Words: The model can represent new or rare words by piecing together known sub-word tokens.
Manages Vocabulary Size: The total number of unique tokens can be kept to a manageable size (e.g., 32,000 to 256,000 tokens), even while being able to represent a virtually infinite vocabulary of words.
Efficiency: Processing sequences of tokens is generally more efficient than character-by-character processing and more robust than whole-word processing.
Byte Pair Encoding (BPE) is a very common algorithm used to create these token vocabularies. The process, in a nutshell:
Start with a huge corpus of text.
Initially, consider every individual character in the text as a token.
Iteratively find the pair of adjacent tokens that occurs most frequently in the corpus and merge them into a new, single token. Add this new token to the vocabulary.
Repeat the previous merge step until the vocabulary reaches a predefined target size (e.g., 50,000 tokens) or no more frequent pairs can be beneficially merged.
The result is a vocabulary of tokens that are optimized to represent the training data compactly and efficiently. When you type a prompt into ChatGPT, the very first thing that happens is that your text is run through a tokenizer just like this!
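If you want to see the merge loop in action, here is a toy sketch of BPE on a five-word "corpus". Real tokenizer libraries are vastly more optimized (and typically work on bytes), but the core idea is the same:

```python
from collections import Counter

# Toy Byte Pair Encoding: start from characters, then repeatedly merge the most
# frequent adjacent pair of tokens into a new, single token.
corpus = ["low", "lower", "lowest", "newer", "wider"]
words = [list(w) for w in corpus]          # each word as a list of char tokens
vocab = set(ch for w in words for ch in w)

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0] if pairs else None

num_merges = 10  # in practice this runs until the vocab hits ~30k-250k tokens
for _ in range(num_merges):
    top = most_frequent_pair(words)
    if top is None or top[1] < 2:          # stop if no pair repeats
        break
    (a, b), _count = top
    new_token = a + b
    vocab.add(new_token)
    merged_words = []
    for w in words:                        # replace the pair everywhere it occurs
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                out.append(new_token)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged_words.append(out)
    words = merged_words

print(sorted(vocab))   # merges like "lo", "low", "er" start to appear
print(words)           # e.g. ["low"], ["low", "er"], ["low", "e", "s", "t"], ...
```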
"How Smart Is It?" – The Art and Science of Evaluating Pretrained LLMs
After weeks or months of intensive training, spending millions on compute, how do engineers and researchers figure out if their shiny new LLM is actually any good? Evaluating these behemoths is a complex field in itself.
1. Perplexity (PPL): The Old Faithful
As we discussed with the loss function, perplexity is a direct measure of how well the model predicts unseen text. It's derived from the average loss on a "validation set" – a chunk of text the model didn't see during training.
The Intuition: A lower perplexity score means the model is less "surprised" by new text, indicating it has learned the underlying language patterns better. If a model has a perplexity of 10, it's roughly as uncertain about the next token as if it were choosing uniformly from 10 possibilities.
The Trend: The progress here has been astounding. Over the last decade, perplexity scores on standard datasets have plummeted from the 70s or higher down to single digits for state-of-the-art models. This directly translates to models that generate much more coherent and human-like text.
Practical Use: While it doesn't tell the whole story about a model's abilities on specific tasks (like question answering or coding), perplexity is an invaluable tool during the pretraining phase. If perplexity isn't going down, something is likely wrong with your training setup!
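Concretely, perplexity is just the exponential of the average per-token cross-entropy loss on that validation set, as in this small sketch:

```python
import math

# Perplexity is the exponential of the average per-token cross-entropy loss
# measured on held-out text the model never saw during training.
def perplexity(per_token_losses):
    return math.exp(sum(per_token_losses) / len(per_token_losses))

# An average loss of ~2.30 nats per token corresponds to a perplexity of ~10:
# the model is roughly as uncertain as a uniform choice among 10 tokens.
print(perplexity([2.1, 2.4, 2.3, 2.4]))   # ~9.97
```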
2. Comprehensive Benchmarking: The LLM Olympics
To get a broader understanding of an LLM's capabilities beyond just next-token prediction, researchers rely on large, diverse sets of standardized evaluation tasks, often called benchmarks. The idea behind these aggregated benchmarks is to test the LLM's ability to generalize its learned linguistic knowledge to perform many different kinds of "easily" evaluable tasks, often by seeing how well it can predict a "gold" or correct answer compared to other options.
HELM (Holistic Evaluation of Language Models): This is an ambitious project aiming to provide a truly comprehensive assessment. It covers a wide range of scenarios (e.g., question answering, summarization, sentiment analysis, reasoning) and metrics (accuracy, robustness, fairness, bias, toxicity).
Hugging Face Open LLM Leaderboard: This has become a popular public platform where different LLMs (especially open-source ones) are continuously evaluated on a suite of common benchmarks. It fosters transparency and allows for quick comparisons.
3. MMLU (Massive Multitask Language Understanding): The Knowledge Test
MMLU has emerged as a particularly trusted benchmark for gauging the breadth and depth of an LLM's knowledge and reasoning abilities across many academic and professional domains. It consists of multiple-choice questions covering subjects like history, law, mathematics, computer science, medicine, philosophy, and more. A model's performance on MMLU is often seen as a good proxy for its general world knowledge and its capacity for basic problem-solving in those areas. Success here indicates the model has truly "learned" a lot from its pretraining data, not just surface-level statistical patterns.
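For a sense of how a pretrained (not yet chat-tuned) model gets scored on multiple-choice questions like MMLU's, one common approach is to compare the probability the model assigns to each candidate answer and pick the highest. The sketch below illustrates that idea with a small stand-in model; it simplifies a lot compared to real evaluation harnesses:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Rough sketch: score each answer option by the total log-probability the model
# assigns to it as a continuation of the question, then pick the best option.
# "gpt2" is just a small stand-in; real evaluations use the model under test.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(question, option):
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0], dim=-1)
    # Sum the log-probabilities of the option's tokens only (the logits at
    # position i predict the token at position i + 1).
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

question = "Question: What is the capital of France? Answer:"
options = ["Paris", "London", "Berlin", "Madrid"]
scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get))  # hopefully "Paris"
```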
The Lifeblood of LLMs: The Monumental Task of Data Curation
We've talked about models and evaluation, but let's be crystal clear: the data is king. You can have the most advanced neural network architecture, but if you feed it low-quality, biased, or insufficient data, your LLM will reflect those shortcomings. Pretraining an LLM is fundamentally about distilling the information from a massive dataset into the model's parameters.
The general philosophy is often to "use all of the clean internet." However, "clean" is the operative word, and the internet is anything but! This leads to an incredibly complex and labor-intensive data processing and curation pipeline:
Acquisition – Drinking from the Firehose: The journey often starts with gargantuan web crawls. Common Crawl is a publicly available dataset that regularly scrapes billions of web pages, amounting to petabytes (millions of gigabytes) of raw data. This is supplemented with other large text corpora like books (e.g., Project Gutenberg, Books3), scientific papers (e.g., arXiv), code repositories (e.g., GitHub), conversational text (e.g., Reddit, StackExchange), and news articles.
Extraction – Finding the Signal in the Noise: Raw HTML from web pages is messy. It contains actual content, but also a lot of boilerplate (navigation menus, ads, disclaimers, JavaScript code). Sophisticated text extraction tools are needed to carefully pull out the clean, human-readable text while trying to preserve important formatting like paragraphs, lists, or code blocks.
Filtering – Cleaning Up the Mess: This is a multi-stage process:
Removing Undesirable Content: Automated filters (and sometimes human review) are used to remove or flag content that is pornographic, overtly hateful, excessively toxic, or violates privacy by containing Personally Identifiable Information (PII).
Deduplication: The internet (and other datasets) are rife with repetition. Exact duplicates of documents, paragraphs, or even common lines (like "Click here to subscribe!") are removed. This is crucial for training efficiency and to prevent the model from over-learning and regurgitating common but uninformative phrases. Deduplication can be quite computationally intensive for very large datasets.
Quality Filtering (Heuristics & Models): Not all text is created equal. Various heuristics are applied to weed out low-quality documents. This might include:
Minimum/maximum document length.
Average word length (too short might be spam, too long might be gibberish).
Percentage of alphabetic characters (to filter out mostly code or symbol-heavy pages if not desired).
Repetitiveness within a document.
Presence of "dirty" tokens or too many outlier words. Sometimes, machine learning models are even trained specifically to classify documents as "high-quality" or "low-quality" (e.g., by predicting if a webpage is likely to be cited as a reference on Wikipedia). A simplified sketch of heuristics like these appears just after this list.
Data Mixing & Domain Weighting – The Chef's Special Blend:
An LLM's capabilities are heavily influenced by the mix of data it sees. If it only sees news articles, it won't be good at writing code. If it only sees old books, it won't know about current events.
Therefore, careful thought goes into creating a "data mixture" – blending text from various domains like web pages, books, code, academic papers, dialogues, etc.
The proportions of these domains are critical. Researchers often experiment with different "weights" for each domain, sometimes guided by scaling law experiments, to achieve the desired balance of capabilities in the final pretrained model. For example, oversampling high-quality domains like books or well-curated academic text is common.
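To make the filtering heuristics and domain weighting a little more concrete, here is a heavily simplified sketch; every threshold and weight below is an invented placeholder, not anyone's production recipe:

```python
import random

# Heavily simplified sketch of heuristic quality filtering and domain mixing.
# All thresholds and weights are invented placeholders, not a real recipe.

def passes_quality_filters(doc: str) -> bool:
    words = doc.split()
    if not (50 <= len(words) <= 100_000):              # too short or too long
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 12):                   # likely spam or gibberish
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.6:                               # mostly symbols/markup
        return False
    if len(set(words)) / len(words) < 0.3:              # highly repetitive document
        return False
    return True

# Domain mixture: draw training documents according to hand-chosen weights.
domain_weights = {"web": 0.55, "books": 0.15, "code": 0.15, "academic": 0.15}

def sample_domain():
    domains, weights = zip(*domain_weights.items())
    return random.choices(domains, weights=weights, k=1)[0]

# Pipeline idea: keep only documents that pass the filters, and pull them from
# the different domains in the chosen proportions when building training batches.
```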
The precise recipes for data collection, filtering, and mixing are often closely guarded secrets by the organizations developing frontier LLMs, as this is a major source of competitive advantage. There's also a tremendous amount of ongoing research into new data sources, more effective filtering techniques, the role of synthetic (AI-generated) data in pretraining, and the legal and ethical implications of using vast web-scraped datasets (e.g., copyright concerns).
The Mind-Boggling Price Tag: What it Actually Costs to Build a Top-Tier LLM
We've talked about the complex processes, the vast data, and the smart algorithms. But to really grasp the scale of pretraining a state-of-the-art (SOTA) LLM, let's talk numbers. And be warned, they're astronomical! This isn't something you whip up in your garage over a weekend.
Let's take a hypothetical (but realistic) example of training a very large, cutting-edge model, say something like Llama 3 405B (which means it has around 405 billion internal knobs or "parameters" it learns to tune).
Here's a breakdown of what goes into it:
The Sheer Amount of Text (Data):
15.6 Trillion Tokens: Imagine every word in every book in a giant library, then multiply that by a few thousand. A "token," as we discussed, is a piece of a word. This model needs to "read" and learn from a quantity of text that's almost impossible for a human to comprehend. This isn't just downloading; it's all the cleaning and processing we talked about earlier.
The "Brain Size" (Parameters):
405 Billion Parameters: Think of these as the individual connections or "neurons" in the AI's brain. Each one is a tiny number that gets adjusted during training. Having this many allows the model to store and process an incredible amount of nuanced information from that vast dataset.
The Computational Olympics (FLOPs - Floating Point Operations):
3.8 × 10²⁵ FLOPs (that's 38 followed by 24 zeros!): This is the total number of basic math calculations the computers need to perform to train the model. Every time the model sees a piece of data and adjusts its parameters, it's doing billions of these calculations. The total is just staggering. To put it in perspective, a modern home computer might do a few trillion FLOPs per second. This training run needs many orders of magnitude more than that in total (a back-of-envelope check of how these numbers fit together appears right after this breakdown).
The Army of Supercomputers (Compute Hardware):
16,000 NVIDIA H100 GPUs: These aren't your average gaming graphics cards. H100s are some of the most powerful AI accelerators on the planet, each costing tens of thousands of dollars. You'd need a data center packed with 16,000 of them, all working in concert.
Average Throughput of 400 TFLOPS per GPU: This means each of those 16,000 GPUs is performing 400 trillion calculations every second, on average, for the duration of the training.
The Marathon Training Run (Time):
70 Days (around the clock): Even with that massive army of GPUs crunching numbers non-stop, it would take roughly 70 days – over two months – of continuous operation to complete the pretraining. If anything goes wrong (a power outage, a major software bug), you could lose valuable time and money.
The Jaw-Dropping Bill (Cost):
Rented Compute: ~$52 Million: Just renting the cloud computing time for those 16,000 GPUs for 70 days could cost over 50 million dollars. Cloud providers charge by the hour for these powerful machines.
Salaries: ~$25 Million: You need a large, highly skilled team of AI researchers, data engineers, systems engineers, and software developers to design the model, prepare the data, write the training code, manage the infrastructure, and troubleshoot problems. If you have a team of, say, 50 elite specialists, their combined annual salaries and overheads could easily run into the tens of millions.
Total Estimated Cost: ~$75 Million (could range from $65M to $85M)
The Environmental Footprint (Carbon Emissions):
~4,400 metric tons of CO2 equivalent: All that electricity consumption for the GPUs has a significant environmental impact. This amount of CO2 is roughly equivalent to the emissions from about 2,000 round-trip passenger flights between New York (JFK) and London (LHR). This is a serious consideration that the AI community is increasingly trying to address through more efficient models and greener energy sources.
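As a back-of-envelope sanity check, the FLOP count, the GPU fleet, the sustained throughput, the training time, and the rental bill above all hang together arithmetically. The ~$2 per GPU-hour rate in this sketch is an assumed bulk cloud price, not a quoted figure:

```python
# Back-of-envelope check that the numbers above are mutually consistent.
gpus = 16_000
throughput_per_gpu = 400e12           # 400 TFLOPs/s sustained, on average
total_flops = 3.8e25                  # total training compute

seconds = total_flops / (gpus * throughput_per_gpu)
days = seconds / 86_400
print(f"{days:.0f} days")             # ~69 days -> the "roughly 70 days" above

# Assumed bulk cloud rate of ~$2 per H100 GPU-hour (a placeholder, not a quote):
gpu_hours = gpus * days * 24
print(f"${gpu_hours * 2 / 1e6:.0f}M") # ~$53M -> close to the ~$52M rental figure
```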
And it doesn't stop there...
"Next model? ~10x more FLOPs": The trend in AI has been "bigger is better." The next generation of SOTA models is likely to require even more data, more parameters, and significantly more computation, potentially an order of magnitude more.
This breakdown really underscores why only a handful of the world's largest tech companies and well-funded AI labs can currently afford to develop these "frontier" LLMs from scratch. It's an incredible feat of engineering and a massive investment, all to create that initial, raw linguistic intelligence we call a pretrained LLM. So, yeah, the next time you see an LLM churn out a perfect poem or fix that impossible bug in your code like it's no big deal, just take a second to appreciate what went into that "effortless" bit of brilliance.
But this incredible "linguistic beast" is ready for its next big adventure: finishing school. You definitely don't want to miss Part 2 of this series! We’re going to spill the beans on how we actually try to teach this digital brain some manners, make it genuinely helpful, and stop it from, you know, deciding to answer every question like a grumpy pirate (unless, of course, that's what you're going for!).


