How Does Tokenization Work?
A 6-minute read
Before an AI model can read your text, it breaks it into tiny pieces called tokens. This process, called tokenization, is how language models turn words into numbers they can understand.
When you type “Hello, world!” into ChatGPT, the model doesn’t see letters or words directly. Behind the scenes, the text gets chopped into roughly four pieces: “Hello”, “,”, “ world”, and “!”. Each piece becomes a number called a token, and these numbers are what the AI model actually processes. This conversion from text to numbers is called tokenization, and it is the first step in how every language model works.
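As a minimal sketch, the text-to-number mapping might look like this. The vocabulary and ID values below are invented for illustration; they are not the IDs of any real model, and real vocabularies hold tens of thousands of entries:

```python
# Toy illustration: each text piece maps to an integer token ID.
# Vocabulary and ID numbers are made up for this example.
toy_vocab = {"Hello": 101, ",": 42, " world": 517, "!": 7}

def to_ids(pieces):
    """Convert a list of text pieces into their token IDs."""
    return [toy_vocab[piece] for piece in pieces]

print(to_ids(["Hello", ",", " world", "!"]))  # [101, 42, 517, 7]
```

The model never sees “Hello” at all; it sees the sequence of IDs.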
The short answer
Tokenization is the process of breaking text into discrete units called tokens that a language model can process. Rather than working with entire words, models split text into subword pieces based on statistical patterns learned during training. This approach lets models handle any text, from common words to invented terms, by representing everything as a sequence of numbers. The specific way text gets split varies between models, and different tokenizers produce different token counts for the same text.
The full picture
Why tokens instead of words
Early language models tried to work with complete words as their basic units, but this approach ran into problems. A vocabulary of all possible English words would need hundreds of thousands of entries, and it would completely fail when encountering new words, misspellings, or words from other languages. Tokenization solves this by breaking text into subword pieces that capture meaningful fragments.
Common words like “the” or “and” might become single tokens, while rarer words get split into smaller pieces. The word “unhappiness” might tokenize as three pieces: “un”, “happi”, “ness”. Each piece carries meaning, and the model learns to combine them. This flexibility means the same tokenizer can handle any text, whether it is a Shakespeare play, a text message, or an invented startup name.
How tokenizers work
Tokenizers use algorithms learned from large amounts of text data. The most common approach builds a vocabulary of common subword sequences, then uses a greedy algorithm to split any new text into the longest possible tokens from this vocabulary.
During training, the tokenizer scans billions of words and identifies which subword sequences appear most frequently together. It prioritizes keeping common words intact while breaking rare words into meaningful fragments. The result is a vocabulary of roughly 30,000 to 200,000 tokens depending on the model, with each token mapped to a specific number.
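The training step described above can be sketched as a toy byte-pair-encoding (BPE) merge loop. The four-word corpus and merge count here are invented for illustration; real tokenizers run this procedure over billions of words:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent
    adjacent pair of symbols across the corpus."""
    # Start with each word as a sequence of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair in the corpus.
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of that pair with the merged symbol.
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # first merges 'l'+'o', then 'lo'+'w'
```

After two merges, the frequent fragment “low” has become a single symbol, while the rarer endings “er” and “est” remain as characters, exactly the common-stays-whole, rare-gets-split behavior described above.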
When new text arrives, the tokenizer repeatedly looks for the longest token in its vocabulary that matches the current position in the text, then moves forward and repeats until all text is consumed. This is why the same sentence can produce different token counts with different tokenizers.
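That longest-match loop can be sketched in a few lines. The vocabulary below is invented for illustration:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, take the
    longest vocabulary entry that matches, then advance past it."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # No match: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"the", "un", "happi", "ness", "hap"}
print(greedy_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
print(greedy_tokenize("the", vocab))          # ['the']
```

Swap in a different vocabulary and the same text splits differently, which is exactly why token counts vary between tokenizers.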
Tokenization across different models
Each major AI company trains its own tokenizer on its own data, which creates differences in how text gets split. According to research from Google, token counts for the same sentence can differ by 30-50% across models from different companies. This matters for practical reasons like API pricing, since most providers charge per token.
OpenAI’s models use a tokenizer called tiktoken, while Anthropic uses its own BPE-based tokenizer, and Google uses SentencePiece. The underlying technique is similar, but the specific vocabulary and rules differ. This is also why prompting in one model does not transfer perfectly to another, since the same words get chunked differently.
Token limits and why they matter
Every language model has a maximum context window measured in tokens, representing how much text it can process at once. This includes both your input and its output. Modern models range from around 4,000 tokens up to potentially millions in specialized versions.
When you approach the token limit, the text must be cut down to fit, whether by truncating it or by summarizing it first. Understanding tokenization helps you estimate whether your input will fit, since token counts do not map neatly to word counts. A 500-word email might be 650 tokens in one model and 750 in another.
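For a quick estimate, a common rule of thumb is roughly four characters of English text per token. This is an assumption, not a property of any particular tokenizer, and real counts vary by model:

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token-count estimate using the ~4 characters-per-token
    rule of thumb for English text. Real counts vary by tokenizer."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 13 characters -> about 3 tokens
```

For anything where the count matters, such as billing or hard context limits, use the actual tokenizer for your model rather than an estimate.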
Why it matters
Tokenization is the bridge between human language and machine numbers. Without it, models could not process text at all. But beyond the technical necessity, understanding tokenization helps you work more effectively with AI.
Knowing that text gets split into subword pieces explains why models sometimes generate nonsense when given unusual inputs. It also helps you estimate costs and length constraints. If you are building applications around LLMs, understanding tokens helps you design prompts that stay within limits and avoid pricing surprises.
For developers, tokenization matters even more. Fine-tuning a model requires careful consideration of tokenization, since your training data needs to be tokenized the same way the base model expects. Mismatched tokenization is a common source of bugs in AI applications.
Common misconceptions
“Tokens are just words.”
Not true. Tokens are subword units that can be smaller than words. The word “intelligence” might tokenize as “intel”, “li”, “gence”. This is why counting words does not give you an accurate token count.
“All models use the same tokenization.”
They don’t. Different models use completely different tokenization schemes. A prompt that costs $0.01 with ChatGPT might cost $0.015 with Claude for the same text, purely due to different token counts.
“Longer words always equal more tokens.”
Not necessarily. A short common word like “the” is one token, while a longer rare word like “antidisestablishmentarianism” might split into multiple tokens. It depends on how common each piece is in the tokenizer’s vocabulary.
Key terms
Tokenizer: Software that converts raw text into token sequences. Every model has its own tokenizer trained specifically for it.
Vocabulary: The fixed set of tokens a model can recognize. Typically 30,000 to 200,000 tokens, covering common words, subword fragments, and special characters.
BPE (Byte Pair Encoding): The most common tokenization algorithm. It builds a vocabulary by iteratively merging the most frequent adjacent character pairs in training data.
Subword tokenization: Breaking words into smaller meaningful fragments rather than using whole words or individual characters.
Context window: The maximum number of tokens a model can process in a single request, including both input and output.
Token count: The number of tokens in a given text, which determines API costs and whether text fits within model limits.