AI & ML March 29, 2026

How Does Model Distillation Work?

A 7-minute read

A 175-billion-parameter model knows a lot. Model distillation is the process of compressing that knowledge into a model a hundred times smaller, without losing most of what makes it useful.

In 2015, Geoffrey Hinton, then at Google, published a paper with a deceptively simple insight: when a large neural network makes a prediction, the distribution of probabilities it outputs contains far more information than just the final answer. A model that says a photo is 90% cat, 8% lynx, and 2% fox is teaching you something about how cats and lynxes are related. That structured knowledge, encoded in the full probability output, can be used to train a much smaller model. The paper was titled "Distilling the Knowledge in a Neural Network," and the technique has become central to how modern AI is deployed at scale.

The short answer

Model distillation trains a small “student” model to mimic the outputs of a large “teacher” model. Instead of learning from raw labeled data alone, the student learns from the teacher’s full probability distributions, called soft targets, which carry richer information than simple correct-or-wrong labels. The result is a compact model that approximates the teacher’s behavior at a fraction of the compute cost.

The full picture

Why size is a problem

Large models are capable partly because of their size. A model with 70 billion parameters can represent complex relationships that a 7-billion-parameter model struggles with. But size has costs. Running inference on a 70B model requires expensive hardware, generates significant latency, and consumes substantial energy. Deploying such models in consumer apps, on phones, or at high request volumes becomes prohibitively expensive.

The goal of distillation is to close this gap: produce a smaller model that behaves much like the larger one, at a fraction of the cost.

How the teaching works

Standard machine learning trains a model against hard labels: the correct answer for each input. An image classification model learns that a particular photo is labeled “cat.” The model’s loss is measured against this single ground truth.

The problem with hard labels is that they throw away information. The photo labeled “cat” might be 90% cat and 8% lynx in the teacher’s estimation. That 8% carries signal: it means cats and lynxes share visual features the model noticed. Hard labels discard all of this.
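The information gap can be made concrete with a few lines of pure Python. Here the 90/8/2 split is the article's illustrative example, not measured output: a one-hot label carries zero entropy beyond the answer itself, while the teacher's soft target carries measurable extra information about how the classes relate.

```python
import math

def entropy(p):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

hard_label  = [1.0, 0.0, 0.0]       # one-hot "cat": just the answer
soft_target = [0.90, 0.08, 0.02]    # teacher's estimate: cat, lynx, fox

h_hard = entropy(hard_label)        # 0.0 bits -- nothing beyond the label
h_soft = entropy(soft_target)       # positive -- encodes class relationships
```

The nonzero entropy of the soft target is exactly the signal that hard-label training discards.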

Distillation uses the teacher’s full output distribution as training signal. This distribution, the soft targets, tells the student not just what the answer is, but how confident the teacher is, and what alternatives it considered. The student learns by minimizing the difference between its own output distribution and the teacher’s, rather than just trying to match the correct label.

The technical mechanism involves a “temperature” parameter that controls how spread out the soft targets are. Higher temperatures make the distribution softer and more informative, spreading probability more evenly across near-miss categories. The student trains against these softened outputs and learns a more nuanced representation than hard-label training would produce.

What gets transferred

The insight is that a trained model’s outputs contain compressed knowledge about the structure of the problem. A model that has seen millions of images learns that cats and lynxes are visually similar, that certain dog breeds resemble wolves, and that toy cars and real cars share features. This relational structure is encoded in the output probabilities and gets passed to the student.

When the student trains against soft targets rather than hard labels, it learns not just to classify correctly but to approximate the teacher's full output behavior. The student doesn't just learn "this is a cat." It learns the shape of the teacher's uncertainty, which is a proxy for the teacher's understanding.

Distillation in practice: DistilBERT

The most widely cited example is DistilBERT, published by Hugging Face in 2019. BERT had 110 million parameters. DistilBERT was trained to mimic BERT using distillation, resulting in a model with 66 million parameters: roughly 40% smaller, about 60% faster at inference, and retaining around 97% of BERT's performance on language understanding benchmarks.

This is the practical payoff of distillation: not perfect replication, but close enough to be useful while being far cheaper to deploy. DistilBERT powers applications that couldn’t afford full BERT at scale.

The same principle applies to much larger models. Several widely used LLMs incorporate distillation techniques, either using a larger proprietary model as teacher during training or by training smaller variants alongside the full model.

Variations on the technique

The standard form of distillation uses only the final output layer as teacher signal. But researchers have explored transferring knowledge from intermediate layers too, a method called feature-based distillation. Rather than matching only the final probability, the student is trained to match the internal activations of the teacher at multiple points in the network. This can produce better results but requires more careful engineering.
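A minimal sketch of the feature-matching idea, under the assumption that each student layer's activations have already been projected to the teacher's layer width (the function names and toy activation values are hypothetical, for illustration only):

```python
def mse(a, b):
    """Mean squared error between two equal-length activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_distillation_loss(student_features, teacher_features, weights=None):
    """Sum of per-layer MSE terms between student and teacher activations
    at matched 'hint' layers; weights let some layers matter more."""
    if weights is None:
        weights = [1.0] * len(student_features)
    return sum(w * mse(s, t)
               for w, s, t in zip(weights, student_features, teacher_features))

# Toy activations at two matched layers (illustrative values)
teacher_feats = [[0.2, -0.5, 1.1], [0.9, 0.3, -0.2]]
student_feats = [[0.1, -0.4, 1.0], [0.7, 0.5, -0.1]]
loss = feature_distillation_loss(student_feats, teacher_feats)
```

This per-layer loss is typically added to the output-level distillation loss rather than replacing it, which is part of the extra engineering the main text mentions: the layers have to be paired up and the activation shapes reconciled.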

Another variant, relation-based distillation, trains the student to match the relationships between different inputs according to the teacher, rather than matching individual outputs. The student learns the teacher’s sense of similarity rather than its absolute predictions.
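One way to sketch relation-based distillation, assuming both models map a batch of inputs to embedding vectors (the function names and vectors here are hypothetical): the student is penalized for mismatched pairwise similarities rather than mismatched individual outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_matrix(embeddings):
    """Pairwise cosine similarities across a batch of embeddings."""
    n = len(embeddings)
    return [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def relation_distillation_loss(student_emb, teacher_emb):
    """MSE between the student's and teacher's pairwise similarity
    structures: the student matches relations, not absolute outputs."""
    s_mat = similarity_matrix(student_emb)
    t_mat = similarity_matrix(teacher_emb)
    n = len(s_mat)
    return sum((s_mat[i][j] - t_mat[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

Note that a student whose embeddings are scaled copies of the teacher's incurs zero loss here, since cosine similarity ignores scale. That is the point of the variant: the relations match even though the absolute outputs differ.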

Why it matters

The gap between what large models can do and what can be deployed at scale is significant. A frontier research model might require a cluster of high-end GPUs just to run a single query. A distilled version might run on a laptop or a phone. That difference determines whether a capability becomes accessible or stays locked behind expensive infrastructure.

For product developers, distillation means being able to use the reasoning patterns of expensive teacher models without paying for them at inference time. Train once against the teacher; deploy cheaply with the student.

For end users, distillation is why capable AI runs on mobile devices. On-device models that handle voice recognition, image classification, and text generation in real time are often distilled from much larger systems that ran in a data center during the training phase.

Common misconceptions

“Distillation is just making a model smaller.” Distillation specifically refers to training a student model using teacher outputs as supervision. Simply removing layers or parameters from a model is pruning, a different technique. Quantization, another approach, reduces the numerical precision of existing weights. These all reduce model size, but through different mechanisms and with different tradeoffs.

“The student model is just a compressed copy.” The student doesn’t replicate the teacher’s architecture or weights. It’s a new model, typically with a different structure, trained to produce similar outputs. Two models can behave similarly while being completely different internally.

“You need the teacher model to run inference once distillation is complete.” The teacher is only needed during training. Once the student is trained, it runs independently. You might distill a student from GPT-4 and then deploy that student without any further connection to GPT-4.

Key terms

Teacher model: The large, well-trained model whose behavior the student is trained to replicate.

Student model: The smaller model trained through distillation; it learns from the teacher’s outputs rather than (or in addition to) raw labeled data.

Soft targets: The full probability distribution output by the teacher for a given input, as opposed to a single hard label. Soft targets carry relational information between categories that hard labels discard.

Temperature (in distillation): A parameter that controls how spread out the teacher’s output distribution is. Higher temperature produces softer, more uniform distributions that are richer in signal for the student to learn from.

Pruning: A related but distinct technique that reduces model size by removing individual weights or neurons from an existing model, rather than training a new one.