How Does Transformer Architecture Work?
An 8-minute read
Every major AI system you've used (GPT, Claude, Gemini, Stable Diffusion) is built on the transformer architecture. Understanding the attention mechanism that makes it work changes how you think about what these systems can and can't do.
In 2017, a team at Google published a paper called “Attention Is All You Need.” The title was a provocation: the claim that a single mechanism, attention, was sufficient to build a powerful sequence model without any of the recurrent loops that had defined the previous decade of neural network research.
They were right. The architecture they introduced, the transformer, became the foundation for virtually every major AI system built since: GPT, Claude, Gemini, Stable Diffusion, DALL-E, Whisper, AlphaFold. Understanding how transformers work is understanding what modern AI actually is.
The problem transformers solved
To understand why transformers were revolutionary, you need to understand what came before.
Language is sequential: words come one after another, and the meaning of each word depends on what came before it. For years, the dominant approach to modeling language was recurrent neural networks (RNNs) and their more sophisticated variant, LSTMs (Long Short-Term Memory networks). These models processed text the obvious way: one token at a time, left to right, maintaining a “hidden state” that summarized everything seen so far.
This approach had two fundamental problems.
Sequential processing made training slow. You can’t start processing token 50 until you’ve finished token 49, so RNNs couldn’t take advantage of modern parallel hardware (GPUs excel at doing many things simultaneously, not one thing after another). Training was a bottleneck.
Long-range dependencies were hard to preserve. Information carried from the beginning of a long sequence to the end through a chain of sequential updates tends to degrade. By the time an RNN got to token 500, the influence of token 5 had been diluted through hundreds of intermediate steps. The model forgot things.
Transformers solved both problems at once.
The core idea: attention
The transformer’s central innovation is the attention mechanism. Instead of processing tokens sequentially, a transformer processes all tokens simultaneously and computes direct relationships between every pair of tokens in the input.
Here’s the intuition. Suppose the model is trying to understand the sentence: “The trophy didn’t fit in the suitcase because it was too big.”
What does “it” refer to? The trophy or the suitcase? A human reader knows immediately: the trophy was too big to fit. But to figure this out, you need to connect “it” back to “trophy” across the intervening words.
In an RNN, that connection has to survive a chain of sequential updates to the hidden state. In a transformer, the model directly computes: how relevant is each other word to understanding the word “it”? It assigns weights to every other token (high weight to “trophy” and “big,” lower weight to “the” and “because”) and uses those weighted relationships to build a rich representation of “it.”
This is attention: learning which other tokens to pay attention to when processing each token.
How attention is computed
The technical implementation involves three vectors per token: a Query, a Key, and a Value.
Think of it like a search engine. The Query is what you’re looking for. The Keys are like the indexed terms in a database. The Values are the actual content you retrieve.
For each token, the model:
- Computes a Query vector (what this token is “asking about”)
- Computes Key vectors for every other token (what each token “offers”)
- Takes the dot product of its Query with every Key to get attention scores (how relevant is each token?)
- Scales the scores by the square root of the Key dimension (to keep the softmax numerically well-behaved)
- Normalizes the scaled scores with a softmax function (so they sum to 1)
- Uses the scores to create a weighted sum of the Value vectors
The result is a new representation of each token that incorporates information from the other tokens most relevant to it.
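The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not production code; the sequence length and dimensions are arbitrary, and the random inputs stand in for learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns a (seq_len, d_v) array where each row is a weighted
    sum of the rows of V, weighted by Query-Key relevance.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 tokens, Query/Key dimension 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))   # Value dimension 16
out = attention(Q, K, V)
print(out.shape)  # (5, 16)
```

Note that every row of the attention-weight matrix is a probability distribution over the tokens, which is exactly the “how much should this token look at each other token” quantity described above.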
Crucially, this entire computation happens for all tokens at once: there is no sequential dependency, so it can be massively parallelized on GPU hardware.
Multi-head attention
One attention computation captures one type of relationship. But language carries many types of relationships at once: syntactic (subject-verb agreement), semantic (word meaning), coreference (“it” → “trophy”), positional (what’s nearby), and more.
Transformers use multi-head attention: running the attention mechanism multiple times in parallel, each “head” using different learned Query/Key/Value matrices. Each head learns to attend to different patterns. One head might focus on syntax, another on coreference, another on sentence structure.
The outputs of all heads are concatenated and passed through a final linear projection to produce the final representation. A transformer might have 8, 16, 32, or more attention heads running simultaneously, each capturing different aspects of the relationships between tokens.
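The split-into-heads, attend, concatenate-and-project pattern can be sketched as follows. For illustration, this assumes each head’s Query/Key/Value projection comes from a single fused weight matrix that is then split along the feature dimension (a common implementation trick, not the only one):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head).
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Each head computes its own attention pattern independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads and project to the final representation.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, n_heads = 32, 4
X = rng.normal(size=(6, d_model))  # 6 tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 32)
```

Because the heads are just slices of the same tensors, all of them run in one batched matrix multiply on a GPU; the parallelism across heads is essentially free.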
The full transformer block
Attention is powerful but not the entire story. A transformer processes input through a stack of identical “blocks,” each containing:
Multi-head self-attention: the mechanism described above, where tokens attend to each other within the sequence.
Feed-forward network: after attention has gathered information across the sequence, a simple two-layer neural network is applied to each token position independently. This is where the model applies stored knowledge: learned facts, patterns, transformations. It’s larger than it looks: the feed-forward layer is typically four times wider than the model’s hidden dimension, and much of the model’s “memory” lives here.
Layer normalization: applied before each sub-component (in most modern “pre-norm” variants) to stabilize training, keeping the values in a numerically tractable range.
Residual connections: the input to each sub-component is added back to its output, creating a “shortcut” path. This lets gradients flow through deep networks during training without vanishing.
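Putting the four pieces together, one pre-norm transformer block can be sketched as below. The attention and feed-forward sub-layers here are simplified stand-ins with random weights, just to show how normalization and residual connections wrap them:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    """Pre-norm block: normalize, apply sub-layer, add the residual back."""
    x = x + attn(layer_norm(x))  # multi-head self-attention + residual
    x = x + ffn(layer_norm(x))   # position-wise feed-forward + residual
    return x

rng = np.random.default_rng(2)
d_model, d_ff = 16, 64          # feed-forward is 4x wider, as described above
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def ffn(x):
    # Two-layer MLP applied to each position independently (ReLU nonlinearity).
    return np.maximum(x @ W1, 0) @ W2

def attn(x):
    # Simplified single-head self-attention stand-in (no learned projections).
    scores = x @ x.T / np.sqrt(d_model)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

x = rng.normal(size=(5, d_model))  # 5 tokens
y = transformer_block(x, attn, ffn)
print(y.shape)  # (5, 16)
```

The residual additions are what make very deep stacks trainable: even if a sub-layer contributes little, the input passes through unchanged.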
A large language model stacks dozens of these blocks, sometimes more than a hundred. GPT-3 had 96 layers. Each layer adds another round of attention (which tokens matter to which?) and transformation (what does the model know about those relationships?).
Positional encoding
There’s a problem: attention has no built-in notion of order. Computing “how relevant is each other token?” doesn’t care whether “dog” comes before or after “bit” in “the dog bit the man.” But word order matters enormously.
Transformers solve this with positional encoding: adding a representation of each token’s position to its embedding before processing begins. The original transformer used fixed sine and cosine functions of different frequencies to create a unique positional signature for each position. Modern models often use learned positional embeddings or more sophisticated schemes like rotary position embedding (RoPE) that encode relative positions rather than absolute ones.
The model learns to use this positional information in conjunction with the token content, understanding not just what tokens are present, but in what order.
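The fixed sine/cosine scheme from the original paper is short enough to write out directly; the sequence length and dimension below are arbitrary:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos positional encoding from the original transformer paper."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) dim pairs
    freq = 1.0 / (10000 ** (2 * i / d_model))     # geometric range of frequencies
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(pos * freq)             # even dimensions: sine
    enc[:, 1::2] = np.cos(pos * freq)             # odd dimensions: cosine
    return enc

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16)
# Each row is a unique signature for its position; all values lie in [-1, 1],
# so the encoding can simply be added to the token embeddings.
```

The mix of frequencies is the point: low-frequency dimensions distinguish distant positions, high-frequency ones distinguish neighbors.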
Encoder, decoder, and encoder-decoder
The original transformer paper introduced an encoder-decoder architecture for translation: the encoder reads the input sentence in full (attending to the whole input simultaneously) and the decoder generates the output word by word, attending both to what’s been generated so far and to the encoder’s representation of the input.
Modern language models have simplified this in two directions:
Encoder-only models (like BERT) process text bidirectionally: each token can attend to all other tokens in both directions. These are excellent for understanding tasks like classification, search, and extracting meaning from text, but they can’t generate new text.

Decoder-only models (like GPT, Claude, Gemini) process text unidirectionally: each token can only attend to previous tokens, never future ones. This is the autoregressive structure that enables text generation: the model predicts one token at a time, and each prediction becomes input for the next. When you send a message to an LLM and it streams a response, you’re watching a decoder-only transformer predict tokens sequentially.
The vast majority of LLMs you interact with today are decoder-only transformers.
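The unidirectional constraint is enforced with a causal mask: before the softmax, every score from a token to a future token is set to negative infinity, so future tokens receive exactly zero attention weight. A toy illustration:

```python
import numpy as np

seq_len = 4
# Causal mask: position i may attend to positions 0..i only (lower triangle).
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(3).normal(size=(seq_len, seq_len))
scores = np.where(mask, scores, -np.inf)   # block attention to future tokens

# Softmax over each row; masked positions get exactly zero weight.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to token 0; row 3 can attend to all four tokens.
```

The upper triangle of the resulting weight matrix is all zeros, which is the “can’t look ahead” property that makes next-token prediction well-defined.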
Why scale works
One of the most surprising findings of the transformer era is that scaling, training larger models on more data with more compute, produces not just incremental improvements but qualitative capability jumps.
Researchers at OpenAI and other labs found that model performance on diverse tasks followed predictable “scaling laws”: loss improved as a power law of model size, dataset size, and compute. This meant that if you wanted a more capable model, the recipe was straightforward in principle: make it bigger and train it longer.
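As a rough sketch of the form these laws take (holding data and compute fixed), loss falls as a power law in parameter count N:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

where N_c and α_N are empirically fitted constants (the exponent reported in the original OpenAI study was roughly 0.076), with analogous laws for dataset size and compute. The small exponent is why each improvement requires a large multiple of resources.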
What nobody fully predicted was that at certain scales, models would develop capabilities that weren’t present at smaller scales, abilities that seemed to “emerge” discontinuously. Arithmetic, chain-of-thought reasoning, few-shot learning, code generation: these capabilities appeared at scale, apparently absent below certain parameter thresholds.
The reason isn’t fully understood, but the transformer architecture seems uniquely positioned to benefit from scale. The attention mechanism lets every token interact directly with every other token through the attention matrix. The feed-forward layers provide vast associative memory. Stack 96 layers of this, train on a trillion tokens, and something qualitatively different emerges.
The limits of transformers
Understanding how transformers work also clarifies what they can’t do well.
Context window constraints are fundamental. Standard attention scales quadratically with sequence length: doubling the context requires four times the attention compute. While techniques like sparse attention and other efficient attention variants have extended context windows dramatically, processing very long documents remains computationally expensive and may involve approximations.
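The quadratic growth is easy to see concretely: attention materializes a score for every pair of tokens, so the number of scores is the square of the context length.

```python
# Standard attention computes a (seq_len x seq_len) score matrix,
# so work and memory grow quadratically with context length.
for n in [1_000, 2_000, 4_000]:
    print(f"{n:>5} tokens -> {n * n:>12,} pairwise scores")
# Each doubling of the context (1k -> 2k -> 4k) quadruples the attention cost.
```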
No persistent memory. A transformer processes each input from scratch. It doesn’t have a “memory” that persists between conversations, just its weights (learned during training) and whatever is in the current context window. Everything a model “knows” either lives in its weights from training or was provided in the current prompt.
Hallucination is structural. A decoder model predicts the next token based on statistical patterns in its training data. It doesn’t retrieve facts from a database, it generates plausible continuations. This is why models confidently produce false information: they’re not lying, they’re extrapolating patterns. Retrieval-augmented generation (RAG) is one way to address this, but it’s a workaround for a fundamental architectural property.
Training data is a frozen snapshot. The model’s knowledge is fixed at training cutoff. Anything that happened after the training data was collected is unknown unless provided in context.
Key terms
Token The basic unit a transformer processes. Tokens are chunks of text, roughly 3-4 characters on average in English, not words. “Unbelievable” might be tokenized as “un,” “believ,” “able.” The model never sees raw characters or whole words.
Embedding A high-dimensional vector representing a token’s meaning. Similar meanings cluster together in embedding space. The transformer converts tokens to embeddings at the input and embeddings to tokens at the output.
Attention head One instance of the attention computation, learning to track one type of relationship in the text. Transformers run many attention heads in parallel (multi-head attention) to capture multiple relationship types simultaneously.
Parameters The learned numerical weights inside a model. A model with 70 billion parameters has 70 billion numbers that were tuned during training. More parameters = more capacity to store patterns, at the cost of more compute to run.
Autoregressive generation The process of generating text one token at a time, where each generated token is fed back as input to generate the next. This is why LLMs appear to “think” sequentially as they write.
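The autoregressive loop itself is tiny; all the intelligence lives in the model it calls. Below is a toy version with a hypothetical stand-in “model” (a fixed random score table keyed on the last token) in place of a real transformer:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size = 10
table = rng.normal(size=(vocab_size, vocab_size))  # toy next-token score table

def model(tokens):
    # Stand-in for a transformer: scores for the next token,
    # here conditioned only on the most recent token.
    return table[tokens[-1]]

tokens = [0]                             # start from a single prompt token
for _ in range(5):
    scores = model(tokens)               # run the model on everything so far
    next_token = int(np.argmax(scores))  # greedy decoding: take the top score
    tokens.append(next_token)            # feed the prediction back as input
print(tokens)                            # prompt token + 5 generated tokens
```

Real systems replace `argmax` with temperature-controlled sampling, but the feed-the-output-back-in structure is exactly this loop, which is why responses stream token by token.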
Common misconceptions
“Transformers understand language.” Transformers are extraordinarily good at predicting statistical patterns in text, which produces behavior that looks like understanding. But they have no world model, no grounded sensory experience, no causal reasoning, just extremely sophisticated pattern completion. Whether this constitutes “understanding” is a genuine philosophical debate; what’s certain is that the mechanism is entirely statistical.
“Larger context window means the model reads everything equally.” In practice, transformer attention is not uniform over the context window. Models tend to pay disproportionate attention to content at the very beginning and very end of the context, with reduced attention to content in the middle, a phenomenon called “lost in the middle.” Simply stuffing a 100,000-token context window doesn’t guarantee all of it influences the output.
“Training is the same as running the model.” They’re fundamentally different operations. Training requires storing activations for every layer and backpropagating gradients; memory and compute requirements scale with model size and batch size, which is why frontier training runs use thousands of specialized chips. Inference (running a trained model) is much cheaper and can run on a single high-end GPU for many model sizes.
“ChatGPT and Claude are GPT models.” GPT (Generative Pre-trained Transformer) is OpenAI’s architecture. Claude is Anthropic’s, built on transformer principles but with different training approaches, architectures, and safety techniques. “GPT” has become colloquial shorthand for “large language model,” which confuses the specific (an OpenAI product) with the general (a class of systems).