Knowledge Distillation Explained: Teacher-Student Training for LLMs
Knowledge distillation is the process of transferring the capabilities of a large, powerful model (the teacher) into a smaller, efficient model (the student). The student learns to replicate the teacher’s behaviour on a specific task — producing a compact model that’s faster, cheaper, and often nearly as accurate on that task.
It’s the reason you can replace a 70-billion-parameter API call with a 1-billion-parameter model running on a single GPU.
The Core Idea
Large language models are general-purpose. They know a lot about everything, but you only need them to do one thing well. Knowledge distillation exploits this gap: instead of deploying the full model, you train a small model to mimic the large one on your specific task.
The process works in three stages:
- Teacher generates — The large model produces outputs (predictions, labels, reasoning) on a set of inputs relevant to your task
- Student learns — The small model is fine-tuned on the teacher’s outputs, learning to replicate its behaviour
- Student deploys — The trained student model goes into production, running at a fraction of the cost
The key insight is that the teacher’s outputs contain more information than raw labels alone. A teacher model doesn’t just say “positive” — it demonstrates how to reason about the input, what formatting to use, and how to handle edge cases. The student absorbs all of this.
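This insight comes from the original formulation of distillation: instead of one-hot labels, the student trains on the teacher’s full probability distribution, often softened with a temperature so that near-miss classes stay visible. LLM-era distillation usually trains on generated text rather than logits, but the intuition is the same. A minimal sketch in plain Python (the logits and category names are illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities; a higher temperature
    flattens the distribution, exposing how the teacher ranks the
    classes it did NOT pick."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one ticket, over
# [billing, technical, account, other]
logits = [4.0, 2.5, 0.5, -1.0]

hard_label = max(range(len(logits)), key=lambda i: logits[i])  # index 0: "billing"
soft = softmax_with_temperature(logits, temperature=2.0)

# The hard label says only "billing"; the soft distribution also shows
# that "technical" was a plausible second choice -- extra signal a
# student can learn from.
```

The hard label collapses the teacher’s judgement to a single class; the softened distribution preserves the relative plausibility of the alternatives.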
Why Not Just Use a Smaller Model Directly?
You can fine-tune a small model on labelled data without a teacher. But there are practical reasons distillation works better:
- You rarely have enough labelled data. Most teams have a handful of examples, not thousands. A teacher model can generate the training data you’re missing.
- Teacher outputs are richer. A human label might say “urgent.” A teacher’s output demonstrates the reasoning, formatting, and confidence you want the student to replicate.
- It’s faster to iterate. Changing your task description and regenerating synthetic data is faster than relabelling a dataset by hand.
A Concrete Example
Say you want a model that classifies customer support tickets into categories: billing, technical, account, and other.
Without distillation:
- Collect and label 2,000+ support tickets manually
- Fine-tune a small model on the labelled data
- Discover the model struggles with ambiguous tickets
- Label more data, retrain, repeat
With distillation:
- Write a task description and provide 10–20 example tickets with labels
- A teacher model (e.g., Llama 3.3 70B) generates 1,000 diverse synthetic examples
- Fine-tune a student model (e.g., Qwen3 1.7B) on the synthetic data
- The student learns the teacher’s nuanced decision-making, not just surface patterns
The second approach gets you to production faster, with less manual effort, and often with better accuracy on edge cases.
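The distillation workflow above can be sketched as a small pipeline. Everything here is a hypothetical stand-in: `call_teacher` would be an API request to your teacher model, and the fine-tuning stage would use your training stack.

```python
CATEGORIES = {"billing", "technical", "account", "other"}

def call_teacher(prompt: str) -> str:
    """Hypothetical teacher call -- in practice an API request to a
    large model (e.g. Llama 3.3 70B). Stubbed here for illustration."""
    return "billing"

def generate_synthetic_examples(task_description, seed_examples, n=1000):
    """Stage 1: the teacher produces labelled synthetic tickets.
    In a real pipeline, `seed_examples` would guide generation; here
    the inputs are placeholders."""
    examples = []
    for i in range(n):
        ticket = f"Synthetic ticket #{i} derived from seeds"
        prompt = f"{task_description}\n\nTicket: {ticket}\nCategory:"
        label = call_teacher(prompt).strip().lower()
        if label in CATEGORIES:  # basic schema check on the teacher's output
            examples.append({"input": ticket, "label": label})
    return examples

# Stage 2 would fine-tune the student (e.g. Qwen3 1.7B) on `examples`
# with your training library; stage 3 deploys the student.
```

The point of the sketch is the shape of the loop: task description in, validated `(input, label)` pairs out, ready for fine-tuning.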
What Makes Distillation Work Well
Not all distillation is equal. The quality of the result depends on:
Teacher quality
The teacher needs to be genuinely good at your task. A model that’s only 80% accurate will pass its mistakes to the student. Always evaluate the teacher on your test set before generating training data.
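Evaluating the teacher first can be as simple as measuring agreement with a held-out labelled test set. `predict` below is a hypothetical callable wrapping the teacher; the tickets are toy data:

```python
def teacher_accuracy(predict, test_set):
    """Fraction of held-out (input, label) pairs the teacher gets right.
    `predict` is a hypothetical wrapper around the teacher model."""
    correct = sum(1 for x, y in test_set if predict(x) == y)
    return correct / len(test_set)

# Toy check with a stub teacher that always answers "billing":
test_set = [
    ("I need a refund for my last invoice.", "billing"),
    ("The app crashes when I open settings.", "technical"),
]
acc = teacher_accuracy(lambda ticket: "billing", test_set)
# A teacher scoring 50% here would pass half its mistakes to the
# student -- improve the prompt or pick a better teacher before
# generating any training data.
```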
Data diversity
If the teacher generates 1,000 examples that all look the same, the student learns a narrow pattern. Effective distillation pipelines use mutation strategies — varying input complexity, length, topic, and phrasing — to ensure the training data covers the full distribution of real-world inputs.
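One cheap mutation strategy is to expand each seed input along several axes before sending it to the teacher. The axes below (tone, length, topic twist) are illustrative; real pipelines vary many more dimensions:

```python
import itertools

# Illustrative mutation axes -- not an exhaustive set.
TONES = ["frustrated", "polite", "terse"]
LENGTHS = ["one sentence", "a short paragraph"]
TOPIC_TWISTS = ["mentions a deadline", "references a previous ticket"]

def mutate(seed_ticket):
    """Expand one seed ticket into rewrite prompts covering every
    combination of tone, length, and topic twist."""
    prompts = []
    for tone, length, twist in itertools.product(TONES, LENGTHS, TOPIC_TWISTS):
        prompts.append(
            f"Rewrite this support ticket as {length}, in a {tone} tone, "
            f"so that it {twist}:\n{seed_ticket}"
        )
    return prompts

variants = mutate("I was charged twice this month.")
# 3 tones x 2 lengths x 2 twists = 12 distinct prompts from one seed
```

Each variant goes to the teacher for labelling, so a small seed set fans out into training data that spans the input distribution.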
Validation and filtering
Not every example the teacher generates is good. Automated validation catches formatting errors, duplicates, off-topic outputs, and low-quality responses before they pollute the training set.
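A filtering pass can be a chain of simple checks applied before training. The rules below (valid label, minimum length, exact-duplicate removal) are illustrative, not exhaustive:

```python
VALID_LABELS = {"billing", "technical", "account", "other"}

def filter_examples(examples):
    """Drop malformed, degenerate, or duplicate teacher outputs before
    they reach the training set. Checks are illustrative."""
    seen = set()
    kept = []
    for ex in examples:
        text = ex.get("input", "").strip()
        label = ex.get("label", "").strip().lower()
        if label not in VALID_LABELS:   # formatting / off-schema errors
            continue
        if len(text) < 10:              # degenerate or near-empty inputs
            continue
        if text in seen:                # exact duplicates
            continue
        seen.add(text)
        kept.append({"input": text, "label": label})
    return kept

raw = [
    {"input": "I was charged twice this month.", "label": "billing"},
    {"input": "I was charged twice this month.", "label": "billing"},  # duplicate
    {"input": "??", "label": "billing"},                               # too short
    {"input": "Password reset fails on login.", "label": "Technical"}, # case fixed
]
clean = filter_examples(raw)
# Keeps 2 examples: the first billing ticket and the normalised
# technical one.
```

In practice you would add semantic near-duplicate detection and topical checks on top of these, but even trivial rules like these stop obvious junk from polluting the student.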
Task specificity
Distillation works best on well-defined tasks. Classification, information extraction, question answering, and tool calling are ideal. Open-ended creative writing is harder because “correct” is subjective.
Distillation in the LLM Era
The concept of knowledge distillation predates large language models — Hinton et al. introduced it in 2015 for image classifiers. But LLMs have made it dramatically more practical:
- Teachers are available off the shelf. You don’t need to train a teacher — frontier models like GPT-4, Llama 3.3 70B, or Qwen3 235B are ready to use.
- Synthetic data generation scales. A teacher can produce thousands of labelled examples in minutes, solving the data scarcity problem.
- Small models are surprisingly capable. Modern SLMs in the 1B–8B parameter range have enough capacity to absorb task-specific knowledge from much larger teachers.
When to Use Knowledge Distillation
Distillation makes sense when:
- You’re calling a large model API and want to reduce cost or latency
- You need to run inference on-premises or at the edge
- You have a well-defined task but limited labelled data
- You want consistent, predictable outputs instead of behaviour that shifts with every prompt change
It’s less suited for:
- Tasks where you need the full breadth of a general-purpose model
- Rapidly changing requirements where retraining is impractical
- Domains where no existing teacher model performs well
The Bottom Line
Knowledge distillation is how you get from “this works in a demo with GPT-4” to “this runs in production at scale.” It’s the bridge between powerful-but-expensive large models and fast-but-specialised small ones.
The teacher does the thinking. The student does the work.