AI & ML March 2, 2026

How Large Language Models Work

A 6-minute read

ChatGPT doesn't 'know' anything. It's a very sophisticated next-word predictor, and understanding that changes how you should use it.

In 2020, researchers at OpenAI fed GPT-3 a prompt asking it to write a legal brief. It produced something that looked professional, cited reasonable-sounding case law, and argued coherently. It also invented several of the cases. The model wasn’t lying — it can’t lie. It was doing what it always does: predicting what text should come next. That same quality that makes these models astonishing is also what makes them dangerous to trust blindly.

The short answer

A large language model (LLM) is a system trained to predict the most likely next word (or fragment of a word) given everything that came before it. It learned this by processing hundreds of billions of words from the internet, books, and code. The result isn’t a database of facts. It’s a system that has internalized the statistical patterns of human language so deeply that it can produce text that sounds, and often is, correct.

The full picture

What a language model actually does

Every time you type a message into ChatGPT, the model is doing one thing: predicting what comes next.

It works in units called tokens, which are roughly word fragments. “Unbelievable” might be two tokens: “un” and “believable.” A typical conversation uses a few hundred to a few thousand tokens.

Given a sequence of tokens, the model assigns a probability to every possible next token in its vocabulary (typically around 50,000 options). It picks one, appends it, then repeats. That loop, run hundreds of times, is how you get a paragraph.
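That predict-append-repeat loop can be sketched in a few lines. The "model" below is just a hand-written lookup table, a stand-in for the billions of parameters a real LLM uses to compute its distribution over ~50,000 tokens; everything about it is illustrative, but the generation loop itself has the same shape as the real thing.

```python
# A toy "model": given the previous token, return a probability
# distribution over possible next tokens. A real LLM computes this
# from the entire preceding sequence; this table is purely illustrative.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the":     {"capital": 0.6, "cat": 0.4},
    "capital": {"of": 1.0},
    "of":      {"France": 0.7, "Spain": 0.3},
    "France":  {"is": 1.0},
    "is":      {"Paris": 0.8, "large": 0.2},
    "Paris":   {"<end>": 1.0},
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})
        # Pick the most likely next token (real systems usually sample
        # from the distribution rather than always taking the top choice).
        next_token = max(dist, key=dist.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

print(generate())  # → "the capital of France is Paris"
```

Sampling instead of always taking the most likely token is what gives real models their variety: the same prompt can produce different completions on different runs.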

This is not a search engine. It’s not looking anything up. It’s completing a pattern.

Training: what the model actually learns

Before an LLM can predict anything, it needs to train on data. For GPT-4 or Claude, that means roughly a trillion words, scraped from websites, digitized books, academic papers, Reddit, GitHub, and more.
(Exact training-data figures for frontier models aren't published; these are estimates.)

The training process is simple in concept: show the model a sentence with the last word hidden, ask it to predict the hidden word, compare its guess to the real answer, and adjust the model slightly to do better next time. Do this billions of times.
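The spirit of that loop can be shown with a deliberately crude stand-in: a model that "trains" by counting which word follows which. Real training uses gradient descent over a neural network rather than counting, and a corpus of roughly a trillion words rather than three sentences, but the goal is the same: adjust the model so that the word it predicts matches the word that actually came next.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus, standing in for the real training data.
corpus = [
    "the capital of france is paris",
    "the capital of spain is madrid",
    "the capital of france is paris",
]

# "Training": for each word, count what tends to follow it. Gradient
# descent on a neural network plays this role at vastly larger scale,
# nudging billions of parameters toward the word that actually came next.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict(prev):
    # Return the statistically most likely next word.
    return follows[prev].most_common(1)[0][0]

print(predict("france"))  # → "is"
print(predict("is"))      # → "paris" (seen twice, vs. "madrid" once)
```

Even this toy version shows why the result isn't a filing cabinet of facts: "paris" wins not because the model stored a fact about France, but because that pattern dominated the data it saw.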

What does the model learn? Not facts in a filing cabinet. It learns relationships. It learns that “the capital of France is” is almost always followed by “Paris.” It learns what legal documents sound like, how Python code is structured, how a doctor explains a diagnosis. It learns the texture of human thought across every domain it saw.

Transformers and attention: why context matters

The architecture that makes modern LLMs work is called the transformer. It was introduced by Google researchers in a 2017 paper titled "Attention Is All You Need," building on decades of prior neural network research.

The key idea is attention: the ability for the model to weigh how relevant each previous word is to predicting the next one.

Imagine reading the sentence: “The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to? The trophy. You know this because you paid attention to the size relationship earlier in the sentence. A transformer does the same thing mathematically. For every token, it calculates attention scores across every other token in the context window, determining which earlier words should influence the current prediction most.
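The arithmetic behind those attention scores can be sketched in pure Python. This is scaled dot-product attention for a single query, stripped of the learned projection matrices and multiple heads a real transformer uses; the toy vectors are made up to mimic the trophy/suitcase example, so only the mechanism (not the numbers) is meaningful.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores each key by its similarity to the query, normalizes the
    scores with softmax, and returns the weighted sum of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Toy vectors standing in for "trophy", "suitcase", and "big":
keys   = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query  = [1.0, 0.0]   # the token "it", constructed here to resemble "trophy"

weights, _ = attention(query, keys, values)
# weights[0] is the largest: "it" attends most strongly to "trophy",
# so "trophy" contributes most to the prediction at this position.
```

In a real model this happens for every token against every other token in the context window, in parallel, across dozens of attention heads and layers.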

This is why LLMs can hold context across a long conversation. It’s also why they have limits. Most models can only “see” a certain number of tokens at once (their context window).

Parameters: the knobs that hold everything

A language model is, at its core, a very large mathematical function. The function is defined by billions of numerical values called parameters, or weights.

GPT-3 had 175 billion parameters — a figure that seemed staggering in 2020 but was quickly eclipsed. The field moves fast. By 2024, frontier models like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro had become standard; by early 2026, the leading generation includes GPT-5, Claude Opus 4.5, and Gemini 3 Pro, with capabilities that dwarf GPT-3 by every benchmark. Parameter counts for current frontier models are generally not published, but estimates run into the hundreds of billions to trillions. These numbers represent the “knowledge” of the model: each one a tiny dial, set during training to nudge predictions in the right direction.

Scale matters enormously, and not in an obvious way. When researchers scaled models past certain sizes, new abilities emerged that weren’t present in smaller models: multi-step reasoning, code generation, translation of languages the model had barely seen. Nobody fully understands why scale unlocks these capabilities. It just does.

RLHF: teaching the model to be helpful


A model trained purely to predict the next word is not automatically useful. It might complete your prompt in the style of a Reddit argument, or just give you plausible-sounding nonsense.

To make models useful and safe, researchers use a technique called RLHF: Reinforcement Learning from Human Feedback.

The process: human raters compare different model responses and rank them. A separate model (a “reward model”) is trained on those rankings, learning to score how good a response is. The main model is then updated to produce responses that score highly.
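The reward model's training objective is commonly formulated as a Bradley-Terry preference loss, sketched below. The reward values here are made-up illustrations; the point is the shape of the objective: the loss shrinks as the reward model scores the human-preferred response further above the rejected one.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry model: the probability the reward model assigns
    to the human-preferred response "beating" the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def preference_loss(reward_chosen, reward_rejected):
    # Training minimizes this negative log-likelihood: it falls as the
    # reward model scores the preferred response above the rejected one.
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Preferred response scored clearly higher: low loss.
print(preference_loss(2.0, -1.0))
# Scored equally: loss is log(2) ≈ 0.693, so there's room to improve.
print(preference_loss(0.0, 0.0))
```

Once trained, the reward model acts as a stand-in for the human raters: the main model is updated (via reinforcement learning) to produce responses the reward model scores highly.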

This is how ChatGPT learned to answer questions helpfully, decline harmful requests, and say “I’m not sure” instead of confidently making things up. The helpfulness is trained in, not inherent.

Context windows and why they matter

Every LLM has a context window: the maximum amount of text it can “see” at once, measured in tokens. Early GPT models had context windows of about 4,000 tokens (roughly 3,000 words). Modern models have expanded dramatically. Some now handle 200,000 tokens or more (about 150,000 words, effectively an entire novel).

The context window determines what the model can consider when generating a response. Everything outside the window is invisible. If you’re having a long conversation and the conversation exceeds the context limit, earlier messages effectively disappear. The model isn’t forgetting — it literally cannot see text beyond its window.

This creates real practical implications. Summarizing a very long document? You might need to chunk it into sections the model can process. Asking a model to recall something you said ten conversations ago? Impossible, unless the system explicitly stored and retrieved that memory.
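A minimal chunking helper shows the idea. The token budget and the words-to-tokens ratio below are illustrative assumptions; a real pipeline would count tokens with the model's actual tokenizer rather than estimating from word counts.

```python
def chunk_text(text, max_tokens=1000, tokens_per_word=1.3):
    """Split a long document into pieces that fit a model's context
    window. Uses a rough words-to-tokens ratio purely for illustration;
    production code should measure with the model's own tokenizer.
    """
    max_words = int(max_tokens / tokens_per_word)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A 5,000-word document split into chunks that each fit the budget.
chunks = chunk_text("word " * 5000, max_tokens=1000)
# Typical pattern: summarize each chunk, then summarize the summaries
# to cover the whole document.
```

Note what this works around, not solves: no single model call ever sees the whole document, so chunked summaries can miss connections that span chunk boundaries.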

Context windows also explain why longer prompts generally produce better outputs. The more relevant context you provide, the more the model has to work with. A bare “write an email” produces something generic. “Write an email to a client who missed a meeting, tone should be firm but professional, they’ve done this twice before, we need to reschedule by Friday” gives the model a rich context to work with — and the output is correspondingly better.

One subtle consequence: models don't use every part of a long prompt equally well. Empirically, they tend to retrieve information at the beginning and end of the context more reliably than information buried in the middle (sometimes called the "lost in the middle" effect). Putting the most important context at the beginning and end of your prompt generally produces better results. This isn't obvious, but it's real.

Why it matters

Understanding that LLMs are pattern-completion engines (not knowledge retrieval systems) changes how you use them. They’re extraordinarily good at tasks where the right answer is well-represented in the training data: summarizing, explaining, translating, drafting, coding. They’re unreliable for tasks that require precise, up-to-date facts, mathematical guarantees, or information that wasn’t in their training data.

The hallucination problem flows directly from this architecture. When a model doesn’t “know” the answer, it doesn’t say nothing. It predicts the most plausible-sounding continuation of the conversation, which might be a confidently stated falsehood. This isn’t a bug that will be patched away. It’s baked into how these systems work.

The practical implication: use LLMs as a drafting and reasoning tool, not a reference. Verify claims that matter. Treat the output as a first draft from a very well-read colleague who sometimes confabulates.

Common misconceptions

“LLMs understand language the way humans do.” They don’t. They model statistical relationships between tokens. This produces outputs that look like understanding, but the mechanism is fundamentally different from human cognition.

“Bigger always means better.” Scaling helps up to a point, but training data quality, fine-tuning, and architecture choices matter just as much. A smaller model trained on better data often outperforms a larger one trained on noisy data.

“LLMs are retrieving stored facts.” They’re not. There’s no database of facts inside the model. Knowledge is distributed across billions of parameters in a way that can’t be directly inspected or updated.