era · present · artificial-intelligence

How LLMs Work — The Mathematics That Made Machines Speak

Probability and transformers turned text into synthetic thought

By Esoteric.Love

Updated 22nd June 2026

Imagine a machine that has never been taught a single rule of English, yet can write poetry, summarize legal documents, and hold a conversation that feels almost human. It does not understand meaning, intention, or truth; it only knows patterns of probability, sculpted by mathematics into something that looks, for all the world, like thought. This is the quiet miracle of large language models, and it began not with a breakthrough in artificial intelligence, but with a simple, ancient question: given what you have seen, what comes next?

Why This Matters

We are living through a quiet revolution in how language is processed, generated, and understood by machines. Less than a decade ago, the idea that a computer could produce coherent paragraphs on demand, translate between dozens of languages, or write code from a description seemed like science fiction. Today, these capabilities are embedded in products used by hundreds of millions of people. Yet the fundamental mechanism behind this shift remains obscure to most — a black box that we trust but rarely examine.

Understanding how large language models work is not merely an academic exercise. It shapes how we evaluate their outputs, how we anticipate their failures, and how we decide where to draw the line between useful tool and deceptive mimic. When a model generates a convincing but false statement, it is not lying — it is doing exactly what it was trained to do: produce the most probable sequence of tokens. Recognizing this distinction is essential for anyone who relies on these systems for research, journalism, education, or decision-making.

The mathematics that made machines speak is also a window into a deeper question: what does it mean to "know" a language? If a statistical model can produce grammatically perfect sentences without any understanding of grammar, then perhaps our own linguistic competence is more pattern-driven than we like to admit. The technology does not just challenge our tools — it challenges our assumptions about mind, meaning, and the nature of communication itself.

Looking forward, the trajectory is clear: models will grow larger, more efficient, and more integrated into daily life. But the core principles — probability, attention, and scale — are unlikely to change. The future of human-machine interaction will be built on the foundations laid in the last decade, and understanding those foundations is the first step toward using them wisely.

The Unreasonable Power of Prediction

At its heart, a language model is a probability distribution over sequences of tokens: words, subwords, or characters. Given the tokens so far, it assigns a probability to every possible next token. The simplest version of this idea is the n-gram model, which looks at the previous n - 1 tokens and uses the frequency of the resulting n-token sequences in a training corpus to predict what follows. For example, a trigram model (n = 3) trained on a large corpus might learn that "I am" is often followed by "going" or "happy," and assign higher probabilities accordingly.

N-gram models have a fatal flaw: they cannot capture long-range dependencies. The word that matters for predicting the next token might be twenty positions back, beyond the model's fixed window. They also suffer from data sparsity — many plausible sequences never appear in training data, so their probability is estimated as zero, which is both mathematically inconvenient and linguistically absurd.
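The counting approach, and the zero-probability problem it runs into, can be sketched in a few lines of Python. This is a minimal illustration rather than a production model: the toy corpus and the function names train_trigram and next_token_probs are invented for this example.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def next_token_probs(counts, context):
    """Turn the raw counts for a two-token context into relative frequencies."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # unseen context: the model has nothing to say
    return {tok: n / total for tok, n in followers.items()}

# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = "i am going home . i am happy . i am going out .".split()
model = train_trigram(corpus)

# "going" follows "i am" twice and "happy" once in this corpus.
print(next_token_probs(model, ["i", "am"]))

# A perfectly plausible context that never occurred gets no probability at all.
print(next_token_probs(model, ["we", "are"]))
```

The empty result for "we are" is the data sparsity problem in miniature: the context is ordinary English, but the counts have no entry for it.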

Modern language models solve these problems by replacing fixed windows with learned representations and flexible attention mechanisms. Instead of counting frequencies, they learn to map each token into a high-dimensional vector — an embedding — that captures its meaning in context. The probability of the next token is then computed not from raw counts, but from the similarity and interaction of these vectors. This shift from counting to learning is what made the current revolution possible.
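To make the shift from counts to vectors concrete, here is a small sketch of how vector similarity can be turned into next-token probabilities. Everything in it is an assumption made for illustration: the three-dimensional embeddings are hand-picked rather than learned, and context_vector stands in for whatever summary of the preceding text a real network would compute.

```python
import math

def softmax(scores):
    """Convert arbitrary scores into positive numbers that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product: larger when two vectors point in similar directions."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embeddings; real models learn hundreds or thousands of dimensions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "mat": [0.1, 0.9, 0.2],
}

# Stand-in for the model's representation of the text so far.
context_vector = [1.0, 0.0, 0.0]

vocab = list(embeddings)
scores = [dot(context_vector, embeddings[tok]) for tok in vocab]
probs = dict(zip(vocab, softmax(scores)))
print(probs)  # tokens whose vectors align with the context get more probability
```

Unlike the count-based model, no token here ever receives a probability of exactly zero, because the softmax of any finite score is positive.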

The training objective remains deceptively simple: predict the next token as accurately as possible. For a model with billions of parameters, trained on trillions of tokens from the internet, this task forces the model to internalize an enormous amount of linguistic structure — grammar, syntax, world knowledge, reasoning patterns — all encoded in the weights of a neural network. The model never explicitly learns that "the cat sat on the mat" is grammatical; it simply learns that this sequence is more probable than "the cat sat on the the," because the latter almost never appears in its training data.

The Transformer Architecture

The breakthrough that enabled modern language models was the introduction of the Transformer architecture in a 2017 paper titled "Attention Is All You Need." Before the Transformer, most sequence models relied on recurrent neural networks (RNNs) or long short-term memory (LSTM) networks, which processed tokens one at a time, maintaining a hidden state that was updated at each step. This sequential processing made training slow and made it difficult to capture dependencies between distant tokens — the hidden state had to carry information across many steps, and gradients would often vanish or explode.

The Transformer dispensed with recurrence entirely. Instead, it processes all tokens in parallel, using a mechanism called self-attention to compute relationships between every pair of tokens in the sequence. For each token, the model computes a weighted sum of the representations of every token in the sequence, itself included, where the weights are determined by how "relevant" each token is to the current one. This allows the model to directly attend to any token, regardless of distance, in a single computational step.

The architecture consists of an encoder and a decoder, each built from stacked layers of self-attention and feed-forward neural networks. The encoder processes the input sequence and produces a set of representations; the decoder uses these representations, along with its own self-attention, to generate the output sequence one token at a time. In practice, many modern language models — including the GPT family — use only the decoder, treating language generation as a purely autoregressive task: predict the next token given all previous tokens.
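The autoregressive loop used by decoder-only models can be sketched as follows. This is a simplified illustration under stated assumptions: toy_model is a hypothetical lookup table that only looks at the last token, whereas a real Transformer conditions on every previous token, and the loop always picks the single most probable token (greedy decoding) rather than sampling.

```python
def generate(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Repeatedly predict one token, append it, and feed the result back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # distribution over the next token
        token = max(probs, key=probs.get)     # greedy choice: take the most probable
        tokens.append(token)
        if token == eos:                      # stop once the model signals the end
            break
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for a trained network."""
    table = {
        "the": {"cat": 0.7, "mat": 0.3},
        "cat": {"sat": 0.9, "<eos>": 0.1},
        "sat": {"<eos>": 1.0},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})

print(generate(toy_model, ["the"], max_new_tokens=10))
```

Each generated token becomes part of the input for the next step, which is why generation is sequential even though training processes all positions in parallel.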

The Transformer's parallel processing makes it highly efficient on modern hardware like GPUs and TPUs, which are designed for matrix operations. Training a Transformer on a large corpus is still expensive — costing millions of dollars in compute — but it is feasible, whereas training an equivalently sized RNN would be prohibitively slow. This efficiency is what made scaling possible.

Attention: The Core Mechanism

The term attention in machine learning refers to a mechanism that allows a model to focus on relevant parts of the input when producing an output. In the Transformer, attention is computed from three sets of vectors: queries, keys, and values. For each token, the model computes a query vector, a key vector, and a value vector by multiplying the token's representation by three learned weight matrices. The raw attention score between two tokens is the dot product of the query of the first and the key of the second, divided by the square root of the dimension of the key vectors. For each query, the scores against all keys are then passed through a softmax function, which turns them into weights that are positive and sum to one.

The intuition is straightforward: the query asks "what am I looking for?" and the key answers "what do I contain?" The dot product measures compatibility. If a token's query aligns well with another token's key, the attention weight is high, and the value of that token contributes more to the output. This allows the model to dynamically select which tokens are most relevant for predicting the next word.
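The query, key, and value computation can be written out directly. The sketch below uses plain Python lists so that every step is visible; a real implementation would use an array library and learned projection matrices, and the tiny Q, K, and V values here are invented for the example.

```python
import math

def softmax(scores):
    """Convert scores into positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns one output vector per query, plus the attention weights.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Two tokens with hypothetical 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

outputs, weights = attention(Q, K, V)
print(weights[0])  # the first token attends mostly to itself
print(outputs[0])  # so its output sits closer to V[0] than to V[1]
```

Because the first query lines up with the first key, the first value vector dominates the output for that token, which is exactly the "what am I looking for, what do I contain" intuition.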

Multi-head attention extends this idea by computing multiple sets of queries, keys, and values in parallel, each learning to attend to different types of relationships. One head might focus on syntactic dependencies — attending to the subject of a verb — while another attends to semantic similarity, and a third tracks positional information. The outputs of all heads are concatenated and projected back to the model's dimension, allowing the model to capture multiple aspects of context simultaneously.

Attention is not just a computational trick; it is a conceptual shift. Instead of compressing the entire input into a fixed-size vector (as RNNs do), the Transformer retains access to all tokens at all times, and the attention mechanism determines which information is used at each step. This makes the model more flexible and more interpretable — attention weights can be visualized to see which parts of the input the model is "looking at" when making a prediction.

Positional Encoding and the Problem of Order

Because the Transformer processes all tokens in parallel, it has no inherent sense of order. Unlike an RNN, which processes tokens sequentially and thus knows that "the cat" comes before "sat," the Transformer sees all tokens simultaneously and must be told their positions explicitly. This is done through positional encoding — adding a vector to each token's embedding that encodes its position in the sequence.

The original Transformer used sinusoidal functions of different frequencies to generate positional encodings. For each position pos and each dimension i, the encoding is:

- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This choice has several desirable properties. The encodings are deterministic and do not require learning. They make it easy for the model to attend to relative positions, because for any fixed offset k the encoding at position pos + k is a linear function of the encoding at position pos. And the sinusoids are defined for every position, so the encodings can be computed for sequences longer than those seen during training; the original authors hypothesized that this would help the model extrapolate to such lengths.
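The sinusoidal formulas translate almost line for line into code. The following sketch computes the encodings with the standard library and then checks the relative-position property for one pair of dimensions; the sizes (10 positions, d_model of 4) and the offset of 3 are arbitrary choices for the example.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):           # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=10, d_model=4)
print(pe[0])  # position 0: sines are 0, cosines are 1

# Relative positions: shifting by k is a rotation of each (sin, cos) pair.
# For the first pair the frequency is 1, so sin(pos + k) can be rebuilt from
# the encoding at pos using only constants that depend on k.
pos, k = 5, 3
sin_pos, cos_pos = pe[pos][0], pe[pos][1]
shifted_sin = sin_pos * math.cos(k) + cos_pos * math.sin(k)
print(shifted_sin, pe[pos + k][0])  # the two values agree up to rounding
```

The last two lines use the angle-addition identity, which is why the encoding at pos + k is a linear function of the encoding at pos.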

Later models, such as those in the GPT family, replaced sinusoidal encodings with learned positional embeddings, which are trained alongside the rest of the model. Both approaches serve the same purpose: giving the model a sense of where each token sits in the sequence, so that it can learn position-dependent patterns like "the word after 'the' is often a noun."

Scaling Laws and Emergent Abilities

One of the most surprising findings in recent AI research is that model performance improves predictably with scale. In a 2020 paper, researchers at OpenAI showed that the test loss of a language model follows a power-law relationship with the number of parameters, the amount of training data, and the compute budget. Multiplying any one of these by a constant factor reduces the loss by a roughly constant fraction, with a different exponent for each resource, provided the other two are not the bottleneck.
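A power law has a simple consequence that is easy to verify numerically: every doubling shrinks the loss by the same factor, no matter where you start. The sketch below assumes a loss of the form (n_c / n_params) ** alpha; the constants n_c and alpha are illustrative placeholders chosen for this example, not values taken from this article.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Hypothetical power-law loss curve; n_c and alpha are illustrative."""
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    ratio = power_law_loss(2 * n) / power_law_loss(n)
    print(f"{n:.0e} parameters: loss {power_law_loss(n):.3f}, "
          f"doubling multiplies loss by {ratio:.4f}")
```

With a small exponent, each doubling buys only a few percent of improvement, which is why meaningful gains under this kind of curve require order-of-magnitude increases in scale.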

This observation, known as the scaling law, has driven the race toward ever-larger models. GPT-3, released in 2020, had 175 billion parameters — more than ten times the size of any previous dense model. It was trained on hundreds of billions of tokens from the internet, books, and other sources. The result was not just a better language model, but a qualitatively different one: GPT-3 could perform tasks it had never been explicitly trained on, simply by being given a few examples in the prompt.

This phenomenon, called few-shot learning, was a surprise. Previous models required fine-tuning on task-specific data to achieve good performance. GPT-3 could translate English to French, answer questions about a passage, or write a poem in the style of Shakespeare, all without any gradient updates — just by conditioning on a few examples provided in the input. The model had internalized enough linguistic and world knowledge during pre-training that it could generalize to novel tasks on the fly.
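In practice, a few-shot prompt is nothing more than carefully arranged text. A minimal sketch of how one might be assembled is shown here; the "English:" and "French:" layout and the function name build_few_shot_prompt are assumptions made for illustration, not a format prescribed by any particular model.

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to French."):
    """Lay out worked examples, then leave the final answer for the model to fill in."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model continues from here
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("good morning", "bonjour")]
prompt = build_few_shot_prompt(examples, "thank you")
print(prompt)
```

No weights change when this prompt is used. The model simply continues the pattern, which is what "conditioning on a few examples" means.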

The mechanisms behind emergent abilities are still debated. Some researchers argue that they are a natural consequence of scale — that as models become larger and more capable, they implicitly learn to perform a wider range of tasks. Others suggest that emergent abilities are artifacts of the evaluation metrics, and that performance improves smoothly rather than suddenly. Regardless, the practical implications are clear: scaling up models unlocks capabilities that are not present in smaller ones, and we do not yet know where the ceiling lies.

Tokenization and the Vocabulary Problem

Before a language model can process text, it must convert raw strings into a sequence of discrete tokens. This process, called tokenization, is surprisingly consequential. The choice of tokenizer determines the model's vocabulary, its ability to handle rare or novel words, and its computational efficiency.

The most common approach is byte-pair encoding (BPE), which starts with a vocabulary of individual characters and iteratively merges the most frequent pairs of tokens. For example, if "th" appears frequently in the training corpus, it becomes a single token. Over many iterations, common subwords like "ing," "tion," and "the" are added to the vocabulary, while rare words are split into smaller pieces. The final vocabulary typically contains tens of thousands of tokens, balancing the need to represent common words efficiently while still being able to handle novel inputs.
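The merge loop at the core of BPE is short enough to write out. This is a simplified sketch under stated assumptions: it works on a hypothetical table of word frequencies, starts from characters rather than bytes, and leaves out details that production tokenizers add, such as word-boundary markers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Find the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply the most frequent merge, num_merges times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical word counts from a tiny corpus.
merges, words = learn_bpe({"the": 5, "then": 3, "this": 2, "that": 2}, num_merges=2)
print(merges)  # "t"+"h" is merged first, then "th"+"e"
print(words)   # "the" is now a single token; "then" is "the" + "n"
```

After only two merges the most common word has become a single token, while the rarer words are still split into pieces, which mirrors the behavior described above.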

Tokenization introduces several quirks. The model never sees raw characters; it only sees token IDs. This means that spelling errors, unusual punctuation, or words from low-resource languages may be tokenized into many small pieces, making them harder for the model to process. A model's maximum context length is also counted in tokens, not in words or characters, so the tokenizer affects how much text the model can "see" at once.

Some recent models have experimented with character-level or byte-level tokenization, which avoids the need for a learned subword vocabulary and can handle any input. However, these approaches increase the sequence length: a sentence that is 20 tokens with BPE might be 100 characters long, and therefore 100 tokens at the character level, which increases computational cost. The trade-off between vocabulary size and sequence length is an active area of research.

Training: From Raw Text to Language Model

Training a large language model is a monumental engineering effort. The process begins with data collection: scraping the web, digitizing books, and gathering other text sources to create a corpus of trillions of tokens. This raw data must be cleaned, deduplicated, and filtered to remove low-quality or harmful content. The resulting dataset is then tokenized and stored in a format that can be efficiently loaded during training.

The training itself uses stochastic gradient descent (SGD) or one of its variants, such as Adam, to minimize the cross-entropy loss — the negative log probability of the correct next token, averaged over all positions in the training data. The model's parameters are updated iteratively, with each update computed on a small batch of sequences. For a model with hundreds of billions of parameters, training requires thousands of GPUs running in parallel for weeks or months.
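The cross-entropy loss described above is a one-line formula in code. The sketch below computes it for a hypothetical two-position sequence over a three-token vocabulary; the probabilities are made up for the example, whereas a real model would produce them from its softmax layer.

```python
import math

def cross_entropy(predicted_probs, target_ids):
    """Average negative log probability assigned to the correct next token."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# One probability distribution per position (hypothetical model outputs).
predicted = [
    [0.7, 0.2, 0.1],   # position 0: the correct token is id 0
    [0.1, 0.8, 0.1],   # position 1: the correct token is id 1
]
targets = [0, 1]

print(cross_entropy(predicted, targets))

# A model that guesses uniformly over 4 tokens pays log(4) per token.
print(cross_entropy([[0.25, 0.25, 0.25, 0.25]], [2]))
```

The loss falls toward zero as the model puts more probability on the correct token, and it grows without bound as that probability approaches zero, which is what pushes training away from confident mistakes.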

The scale of training introduces numerous challenges. The model is too large to fit on a single GPU, so it must be split across many devices using model parallelism — different layers or different parts of the same layer are placed on different GPUs. The data must also be split across devices using data parallelism, where each GPU processes a different batch of data and gradients are averaged. Communication between GPUs becomes a bottleneck, and specialized hardware interconnects are needed to keep training efficient.
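The gradient-averaging step of data parallelism can be illustrated without any GPUs. In this sketch each inner list stands for the gradient one device computed on its own batch; the numbers, the learning rate, and the function names are hypothetical, and a real system would perform the averaging with a collective communication operation across devices.

```python
def average_gradients(per_device_grads):
    """Average the gradient for each parameter across all devices."""
    n_devices = len(per_device_grads)
    return [sum(grads) / n_devices for grads in zip(*per_device_grads)]

def sgd_step(params, grads, lr=0.1):
    """Plain gradient descent update applied identically on every device."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, 0.5]
per_device_grads = [
    [1.0, 2.0],   # gradient from the batch on device 0
    [3.0, 4.0],   # gradient from the batch on device 1
]

avg = average_gradients(per_device_grads)
params = sgd_step(params, avg)
print(avg, params)
```

Because every device applies the same averaged gradient, all copies of the model stay in sync, and the cost of exchanging gradients at every step is where the communication bottleneck comes from.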

Despite these challenges, training has become routine for large companies and research labs. The cost, however, remains prohibitive for most organizations — training a single large model can cost tens of millions of dollars in compute. This has led to concerns about centralization of power and access, as only a handful of entities can afford to train state-of-the-art models.

Post-Training: Instruction Tuning and Alignment

A raw language model trained on internet text is not particularly useful. It can generate coherent text, but it has no concept of following instructions, being helpful, or avoiding harmful outputs. It might complete a sentence with offensive content, or generate a plausible-sounding but false statement. To make models safe and useful, a post-training phase is required.

Instruction tuning involves fine-tuning the model on a dataset of (instruction, response) pairs. The model learns to follow instructions by seeing examples like "Translate this sentence to French: 'Hello, how are you?'" with the expected response. This dramatically improves the model's ability to perform tasks on command, even tasks it was not explicitly trained on.

Reinforcement learning from human feedback (RLHF) goes a step further. Human evaluators rank the model's outputs for quality, and a reward model is trained to predict these rankings. The language model is then fine-tuned to maximize the reward, effectively learning to produce outputs that humans prefer. The aim is to align the model with the preferences of its evaluators, making it more helpful, more honest, and less likely to generate harmful content.
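One common way to train a reward model on rankings is a pairwise loss: for each pair of responses, the preferred one should receive the higher score. The sketch below shows that loss for a single pair. It is an assumption that this particular formulation is in use, since the article does not specify one, and the reward values are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Small when the preferred response scores well above the rejected one."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

print(pairwise_reward_loss(2.0, 0.0))   # ranking respected: low loss
print(pairwise_reward_loss(0.0, 0.0))   # no preference expressed: log(2)
print(pairwise_reward_loss(0.0, 2.0))   # ranking violated: high loss
```

Only the difference between the two scores matters, so the reward model learns a relative scale of quality rather than absolute marks.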

Recent work has introduced direct preference optimization (DPO), which simplifies RLHF by directly optimizing the model on preference data without training a separate reward model. DPO has become popular because it is simpler and more stable than RLHF, while achieving comparable alignment.
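The DPO objective for a single preference pair can be sketched in a few lines. The inputs are the log probabilities that the model being trained (the policy) and a frozen reference copy assign to the preferred and rejected responses; the specific numbers and the beta value of 0.1 are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more likely each response has become relative to the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    return -math.log(sigmoid(beta * (chosen_shift - rejected_shift)))

# Before any training the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Raising the chosen response and lowering the rejected one reduces the loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The shape is the same as a pairwise reward loss, except that the "reward" is read off the policy's own log probabilities, which is why no separate reward model is needed.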

Post-training is where the art of language model development meets the science. The choice of training data, the design of the reward model, and the tuning of hyperparameters all affect the final behavior of the model. It is also where many of the ethical challenges arise — whose preferences are used for alignment? How do we ensure the model is fair across different demographics? These questions remain open.

The Questions That Remain

Despite the remarkable progress, fundamental questions about language models remain unanswered. The first is about understanding: do these models actually "know" anything, or are they just sophisticated pattern matchers? When a model correctly answers a question about history, is it reasoning, or is it reproducing a pattern it saw in training data? The distinction matters for trust and reliability, but we lack a clear way to test it.

The second question concerns the limits of scaling. Will continued increases in model size and data lead to further emergent abilities, or are we approaching a plateau? Some researchers argue that current models are already close to the limits of what can be learned from text alone, and that new architectures or training paradigms will be needed for further progress. Others believe that scaling will continue to yield improvements for the foreseeable future.

The third question is about efficiency. Current models require enormous amounts of energy and specialized hardware to train and run. Can we build models that are as capable but orders of magnitude smaller? Techniques like quantization, pruning, and knowledge distillation offer partial solutions, but the gap between the largest models and the smallest usable ones remains vast.

The fourth question is about safety. As models become more capable, the potential for misuse grows — from generating disinformation to automating cyberattacks. How do we ensure that powerful language models are used for beneficial purposes? Technical solutions like alignment and watermarking are promising, but they are not foolproof, and the social and regulatory dimensions of this question are only beginning to be explored.

The final question is perhaps the most profound: what does it mean that a statistical model can produce language that is indistinguishable from human writing? If the boundary between human and machine text is blurring, then our traditional notions of authorship, creativity, and authenticity are called into question. Language models do not have intentions, beliefs, or desires — but their outputs increasingly look as though they do. We are left with the unsettling possibility that fluent language does not require a mind, only a sufficiently large probability distribution.