How Does Transformer Architecture Work?
An 8-minute read
Every major AI system you've used (GPT, Claude, Gemini, Stable Diffusion) is built on the transformer architecture. Understanding the attention mechanism that makes it work changes how you think about what these systems can and can't do.
In 2017, a team at Google published a paper called “Attention Is All You Need.” The title was a provocation: the claim that a single mechanism, attention, was sufficient to build a powerful sequence model without any of the recurrent loops that had defined the previous decade of neural network research.
They were right. The architecture they introduced, the transformer, became the foundation for virtually every major AI system built since: GPT, Claude, Gemini, Stable Diffusion, DALL-E, Whisper, AlphaFold. Understanding how transformers work is understanding what modern AI actually is.
The problem transformers solved
To understand why transformers were revolutionary, you need to understand what came before.
Language is sequential: words come one after another, and the meaning of each word depends on what came before it. For years, the dominant approach to modeling language was recurrent neural networks (RNNs) and their more sophisticated variant, LSTMs (Long Short-Term Memory networks). These models processed text the obvious way: one token at a time, left to right, maintaining a “hidden state” that summarized everything seen so far.
This approach had two fundamental problems.
Sequential processing made training slow. You can’t start processing token 50 until you’ve finished token 49, so RNNs couldn’t take advantage of modern parallel hardware (GPUs excel at doing many things simultaneously, not one thing after another). Training was a bottleneck.
Long-range dependencies were hard to preserve. Information carried from the beginning of a long sequence to the end through a chain of sequential updates tends to degrade. By the time an RNN got to token 500, the influence of token 5 had been diluted through hundreds of intermediate steps. The model forgot things.
Transformers solved both problems at once.
The core idea: attention
The transformer’s central innovation is the attention mechanism. Instead of processing tokens sequentially, a transformer processes all tokens simultaneously and computes direct relationships between every pair of tokens in the input.
Here’s the intuition. Suppose the model is trying to understand the sentence: “The trophy didn’t fit in the suitcase because it was too big.”
What does “it” refer to? The trophy or the suitcase? A human reader knows immediately: the trophy was too big to fit. But to figure this out, you need to connect “it” back to “trophy” across the intervening words.
In an RNN, that connection has to survive a chain of sequential updates to the hidden state. In a transformer, the model directly computes: how relevant is each other word to understanding the word “it”? It assigns weights to every other token (high weight to “trophy” and “big,” lower weight to “the” and “because”) and uses those weighted relationships to build a rich representation of “it.”
This is attention: learning which other tokens to pay attention to when processing each token.
How attention is computed
The technical implementation involves three vectors per token: a Query, a Key, and a Value.
Think of it like a search engine. The Query is what you’re looking for. The Keys are like the indexed terms in a database. The Values are the actual content you retrieve.
For each token, the model:
- Computes a Query vector (what this token is “asking about”)
- Computes Key vectors for every other token (what each token “offers”)
- Takes the dot product of its Query with every Key to get attention scores (how relevant is each token?)
- Scales the scores by the square root of the Key dimension (to keep the softmax numerically well-behaved)
- Normalizes the scaled scores with a softmax function (so they sum to 1)
- Uses the scores to create a weighted sum of the Value vectors
The result is a new representation of each token that incorporates information from the other tokens most relevant to it.
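The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not production code; the sequence length and dimensions are arbitrary, and the random inputs stand in for learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns a (seq_len, d_v) array where each row is a weighted
    sum of the rows of V, weighted by Query-Key relevance.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 tokens, Query/Key dimension 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))   # Value dimension 16
out = attention(Q, K, V)
print(out.shape)  # (5, 16)
```

Note that every row of the attention-weight matrix is a probability distribution over the tokens, which is exactly the “how much should this token look at each other token” quantity described above.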
Crucially, this entire computation happens for all tokens at once: there is no sequential dependency, so it can be massively parallelized on GPU hardware.
Multi-head attention
One attention computation captures one type of relationship. But language carries many types of relationships at once: syntactic (subject-verb agreement), semantic (word meaning), coreference (“it” → “trophy”), positional (what’s nearby), and more.
Transformers use multi-head attention: running the attention mechanism multiple times in parallel, each “head” using different learned Query/Key/Value matrices. Each head learns to attend to different patterns. One head might focus on syntax, another on coreference, another on sentence structure.
The outputs of all heads are concatenated and passed through a final linear projection to produce the final representation. A transformer might have 8, 16, 32, or more attention heads running simultaneously, each capturing different aspects of the relationships between tokens.
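The split-into-heads, attend, concatenate-and-project pattern can be sketched as follows. For illustration, this assumes each head’s Query/Key/Value projection comes from a single fused weight matrix that is then split along the feature dimension (a common implementation trick, not the only one):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head).
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Each head computes its own attention pattern independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads and project to the final representation.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, n_heads = 32, 4
X = rng.normal(size=(6, d_model))  # 6 tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 32)
```

Because the heads are just slices of the same tensors, all of them run in one batched matrix multiply on a GPU; the parallelism across heads is essentially free.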
The full transformer block
Attention is powerful but not the entire story. A transformer processes input through a stack of identical “blocks,” each containing:
Multi-head self-attention: the mechanism described above, where tokens attend to each other within the sequence.
Feed-forward network: after attention has gathered information across the sequence, a simple two-layer neural network is applied to each token position independently. This is where the model applies stored knowledge: learned facts, patterns, transformations. It’s larger than it looks: the feed-forward layer is typically four times wider than the model’s hidden dimension, and much of the model’s “memory” lives here.
Layer normalization: applied before each sub-component (in most modern “pre-norm” variants) to stabilize training, keeping the values in a numerically tractable range.
Residual connections: the input to each sub-component is added back to its output, creating a “shortcut” path. This lets gradients flow through deep networks during training without vanishing.
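Putting the four pieces together, one pre-norm transformer block can be sketched as below. The attention and feed-forward sub-layers here are simplified stand-ins with random weights, just to show how normalization and residual connections wrap them:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    """Pre-norm block: normalize, apply sub-layer, add the residual back."""
    x = x + attn(layer_norm(x))  # multi-head self-attention + residual
    x = x + ffn(layer_norm(x))   # position-wise feed-forward + residual
    return x

rng = np.random.default_rng(2)
d_model, d_ff = 16, 64          # feed-forward is 4x wider, as described above
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def ffn(x):
    # Two-layer MLP applied to each position independently (ReLU nonlinearity).
    return np.maximum(x @ W1, 0) @ W2

def attn(x):
    # Simplified single-head self-attention stand-in (no learned projections).
    scores = x @ x.T / np.sqrt(d_model)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

x = rng.normal(size=(5, d_model))  # 5 tokens
y = transformer_block(x, attn, ffn)
print(y.shape)  # (5, 16)
```

The residual additions are what make very deep stacks trainable: even if a sub-layer contributes little, the input passes through unchanged.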
A large language model stacks dozens of these blocks, sometimes more than a hundred. GPT-3 had 96 layers. Each layer adds another round of attention (which tokens matter to which?) and transformation (what does the model know about those relationships?).
Positional encoding
There’s a problem: attention has no built-in notion of order. Computing “how relevant is each other token?” doesn’t care whether “dog” comes before or after “bit” in “the dog bit the man.” But word order matters enormously.
Transformers solve this with positional encoding: adding a representation of each token’s position to its embedding before processing begins. The original transformer used fixed sine and cosine functions of different frequencies to create a unique positional signature for each position. Modern models often use learned positional embeddings or more sophisticated schemes like rotary position embedding (RoPE) that encode relative positions rather than absolute ones.
The model learns to use this positional information in conjunction with the token content, understanding not just what tokens are present, but in what order.
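The fixed sine/cosine scheme from the original paper is short enough to write out directly; the sequence length and dimension below are arbitrary:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos positional encoding from the original transformer paper."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) dim pairs
    freq = 1.0 / (10000 ** (2 * i / d_model))     # geometric range of frequencies
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(pos * freq)             # even dimensions: sine
    enc[:, 1::2] = np.cos(pos * freq)             # odd dimensions: cosine
    return enc

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16)
# Each row is a unique signature for its position; all values lie in [-1, 1],
# so the encoding can simply be added to the token embeddings.
```

The mix of frequencies is the point: low-frequency dimensions distinguish distant positions, high-frequency ones distinguish neighbors.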
Encoder, decoder, and encoder-decoder
The original transformer paper introduced an encoder-decoder architecture for translation: the encoder reads the input sentence in full (attending to the whole input simultaneously) and the decoder generates the output word by word, attending both to what’s been generated so far and to the encoder’s representation of the input.
Modern language models have simplified this in two directions:
Encoder-only models (like BERT) process text bidirectionally: each token can attend to all other tokens in both directions. These are excellent for understanding tasks like classification, search, and extracting meaning from text, but they can’t generate new text.

Decoder-only models (like GPT, Claude, Gemini) process text unidirectionally: each token can only attend to previous tokens, never future ones. This is the autoregressive structure that enables text generation: the model predicts one token at a time, and each prediction becomes input for the next. When you send a message to an LLM and it streams a response, you’re watching a decoder-only transformer predict tokens sequentially.
The vast majority of LLMs you interact with today are decoder-only transformers.
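The unidirectional constraint is enforced with a causal mask: before the softmax, every score from a token to a future token is set to negative infinity, so future tokens receive exactly zero attention weight. A toy illustration:

```python
import numpy as np

seq_len = 4
# Causal mask: position i may attend to positions 0..i only (lower triangle).
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(3).normal(size=(seq_len, seq_len))
scores = np.where(mask, scores, -np.inf)   # block attention to future tokens

# Softmax over each row; masked positions get exactly zero weight.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to token 0; row 3 can attend to all four tokens.
```

The upper triangle of the resulting weight matrix is all zeros, which is the “can’t look ahead” property that makes next-token prediction well-defined.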
Why scale works
One of the most surprising findings of the transformer era is that scaling, training larger models on more data with more compute, produces not just incremental improvements but qualitative capability jumps.
Researchers at OpenAI and other labs found that model performance on diverse tasks followed predictable “scaling laws”: loss improved as a power law of model size, dataset size, and compute. This meant that if you wanted a more capable model, the recipe was straightforward in principle: make it bigger and train it longer.
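As a rough sketch of the form these laws take (holding data and compute fixed), loss falls as a power law in parameter count N:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

where N_c and α_N are empirically fitted constants (the exponent reported in the original OpenAI study was roughly 0.076), with analogous laws for dataset size and compute. The small exponent is why each improvement requires a large multiple of resources.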
What nobody fully predicted was that at certain scales, models would develop capabilities that weren’t present at smaller scales, abilities that seemed to “emerge” discontinuously. Arithmetic, chain-of-thought reasoning, few-shot learning, code generation: these capabilities appeared at scale, apparently absent below certain parameter thresholds.
The reason isn’t fully understood, but the transformer architecture seems uniquely positioned to benefit from scale. The attention mechanism lets every token interact directly with every other token through the attention matrix. The feed-forward layers provide vast associative memory. Stack 96 layers of this, train on a trillion tokens, and something qualitatively different emerges.
The limits of transformers
Understanding how transformers work also clarifies what they can’t do well.
Context window constraints are fundamental. Standard attention scales quadratically with sequence length: doubling the context requires four times the attention compute. While techniques like sparse attention and other efficient attention variants have extended context windows dramatically, processing very long documents remains computationally expensive and may involve approximations.
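The quadratic growth is easy to see concretely: attention materializes a score for every pair of tokens, so the number of scores is the square of the context length.

```python
# Standard attention computes a (seq_len x seq_len) score matrix,
# so work and memory grow quadratically with context length.
for n in [1_000, 2_000, 4_000]:
    print(f"{n:>5} tokens -> {n * n:>12,} pairwise scores")
# Each doubling of the context (1k -> 2k -> 4k) quadruples the attention cost.
```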
No persistent memory. A transformer processes each input from scratch. It doesn’t have a “memory” that persists between conversations, just its weights (learned during training) and whatever is in the current context window. Everything a model “knows” either lives in its weights from training or was provided in the current prompt.
Hallucination is structural. A decoder model predicts the next token based on statistical patterns in its training data. It doesn’t retrieve facts from a database, it generates plausible continuations. This is why models confidently produce false information: they’re not lying, they’re extrapolating patterns. Retrieval-augmented generation (RAG) is one way to address this, but it’s a workaround for a fundamental architectural property.
Training data is a frozen snapshot. The model’s knowledge is fixed at training cutoff. Anything that happened after the training data was collected is unknown unless provided in context.
Key terms
Token The basic unit a transformer processes. Tokens are chunks of text, roughly 3-4 characters on average in English, not words. “Unbelievable” might be tokenized as “un,” “believ,” “able.” The model never sees raw characters or whole words.
Embedding A high-dimensional vector representing a token’s meaning. Similar meanings cluster together in embedding space. The transformer converts tokens to embeddings at the input and embeddings to tokens at the output.
Attention head One instance of the attention computation, learning to track one type of relationship in the text. Transformers run many attention heads in parallel (multi-head attention) to capture multiple relationship types simultaneously.
Parameters The learned numerical weights inside a model. A model with 70 billion parameters has 70 billion numbers that were tuned during training. More parameters = more capacity to store patterns, at the cost of more compute to run.
Autoregressive generation The process of generating text one token at a time, where each generated token is fed back as input to generate the next. This is why LLMs appear to “think” sequentially as they write.
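The autoregressive loop itself is tiny; all the intelligence lives in the model it calls. Below is a toy version with a hypothetical stand-in “model” (a fixed random score table keyed on the last token) in place of a real transformer:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size = 10
table = rng.normal(size=(vocab_size, vocab_size))  # toy next-token score table

def model(tokens):
    # Stand-in for a transformer: scores for the next token,
    # here conditioned only on the most recent token.
    return table[tokens[-1]]

tokens = [0]                             # start from a single prompt token
for _ in range(5):
    scores = model(tokens)               # run the model on everything so far
    next_token = int(np.argmax(scores))  # greedy decoding: take the top score
    tokens.append(next_token)            # feed the prediction back as input
print(tokens)                            # prompt token + 5 generated tokens
```

Real systems replace `argmax` with temperature-controlled sampling, but the feed-the-output-back-in structure is exactly this loop, which is why responses stream token by token.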
Common misconceptions
“Transformers understand language.” Transformers are extraordinarily good at predicting statistical patterns in text, which produces behavior that looks like understanding. But they have no world model, no grounded sensory experience, no causal reasoning, just extremely sophisticated pattern completion. Whether this constitutes “understanding” is a genuine philosophical debate; what’s certain is that the mechanism is entirely statistical.
“Larger context window means the model reads everything equally.” In practice, transformer attention is not uniform over the context window. Models tend to pay disproportionate attention to content at the very beginning and very end of the context, with reduced attention to content in the middle, a phenomenon called “lost in the middle.” Simply stuffing a 100,000-token context window doesn’t guarantee all of it influences the output.
“Training is the same as running the model.” They’re fundamentally different operations. Training requires storing activations for every layer and backpropagating gradients; memory and compute requirements scale with model size and batch size, which is why frontier training runs use thousands of specialized chips. Inference (running a trained model) is much cheaper and can run on a single high-end GPU for many model sizes.
“ChatGPT and Claude are GPT models.” GPT (Generative Pre-trained Transformer) is OpenAI’s architecture. Claude is Anthropic’s, built on transformer principles but with different training approaches, architectures, and safety techniques. “GPT” has become colloquial shorthand for “large language model,” which confuses the specific (an OpenAI product) with the general (a class of systems).