Teacher-Student Distillation: How It Works and When to Use It

Teacher-student distillation is the core mechanism behind knowledge distillation. A large, capable model (the teacher) generates training signal that a smaller model (the student) learns from — producing a compact model that captures the teacher’s expertise on a specific task.

It’s how you go from a 70B-parameter model running on a GPU cluster to a 1B-parameter model running on a single CPU — without giving up the accuracy that matters.

The Core Idea

The insight behind teacher-student distillation is simple: a large model already knows how to solve your task. Instead of training a small model from scratch on hand-labeled data, you let the large model demonstrate the correct behavior and train the small model to replicate it.

This is fundamentally different from traditional supervised learning. In supervised learning, you need humans to label every example. In distillation, the teacher model does the labeling — and it can generate far more training data, far more cheaply, than any human annotation team.

How the Process Works

Teacher-student distillation follows a straightforward pipeline:

1. Define the Task

Start with a clear task description and a small set of seed examples (as few as 10). This tells the teacher what kind of outputs you expect.
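A task definition can be as small as a description string plus a handful of labeled pairs. A minimal sketch in Python (the sentiment-classification task and the field names are illustrative, not a distil labs API):

```python
# A hypothetical task spec: a plain description plus ~10 seed examples.
# Field names ("description", "input", "output") are illustrative only.
task = {
    "description": (
        "Classify the sentiment of a customer review as "
        "'positive' or 'negative'. Answer with a single word."
    ),
    "seed_examples": [
        {"input": "Arrived quickly and works perfectly.", "output": "positive"},
        {"input": "Broke after two days. Waste of money.", "output": "negative"},
        {"input": "Exactly as described, would buy again.", "output": "positive"},
        # ...around 10 such pairs is enough to anchor the teacher
    ],
}
```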

2. Teacher Generates Training Data

The teacher model — something like Llama 3.3 70B or Qwen3 235B — takes your task description and seed examples, then generates hundreds or thousands of new input-output pairs. These synthetic examples form the training dataset.
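In code, this step boils down to prompting the teacher with the task and seed examples, then parsing its completions into new pairs. A sketch assuming the teacher is exposed as a plain prompt-to-text callable (any LLM client wrapper would do; `generate_pairs` and its one-JSON-object-per-line prompt format are illustrative assumptions):

```python
import json

def generate_pairs(teacher, task_description, seed_examples, n=5):
    """Ask the teacher model for new input-output pairs.

    `teacher` is any callable mapping a prompt string to a raw completion
    string, e.g. a thin wrapper around an LLM API (hypothetical interface).
    """
    prompt = (
        f"Task: {task_description}\n"
        "Here are example input-output pairs:\n"
        + "\n".join(json.dumps(ex) for ex in seed_examples)
        + f"\nGenerate {n} new pairs, one JSON object per line."
    )
    raw = teacher(prompt)
    pairs = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are dropped; more filtering follows in step 3
        if "input" in obj and "output" in obj:
            pairs.append(obj)
    return pairs
```

In practice this loop runs for hundreds or thousands of generations, batched against whichever teacher you selected.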

3. Validate and Filter

Not every teacher output is usable. Automated validation removes duplicates, off-topic examples, malformed outputs, and low-quality generations. This step is critical — noisy training data leads to noisy student models.
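Several of these checks are cheap to implement with plain rules. A sketch of rule-based filtering (the allowed-label check stands in for task-specific validation; the length threshold is an illustrative default):

```python
def filter_pairs(pairs, allowed_outputs=None, min_input_chars=5):
    """Drop duplicate inputs, trivially short inputs, and off-format outputs."""
    seen = set()
    kept = []
    for pair in pairs:
        text = pair.get("input", "").strip()
        label = pair.get("output", "").strip()
        if len(text) < min_input_chars or not label:
            continue  # malformed or trivially short generation
        key = text.lower()
        if key in seen:
            continue  # exact duplicate input (case-insensitive)
        if allowed_outputs is not None and label not in allowed_outputs:
            continue  # off-format or off-topic output
        seen.add(key)
        kept.append({"input": text, "output": label})
    return kept
```

Real pipelines typically add semantic checks on top (near-duplicate detection, teacher self-consistency), but even rules like these remove much of the noise.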

4. Student Learns from Teacher

The filtered dataset is used to fine-tune a small student model (e.g., Llama 3.2 1B, Qwen3 0.6B, or SmolLM2 135M). The student learns to map inputs to outputs the same way the teacher does — but in a fraction of the parameters.
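The fine-tuning itself runs in whatever training framework you use; the framework-independent part is packaging the filtered pairs as supervised records. A sketch converting pairs into chat-style messages (the `{"messages": [...]}` shape is one common SFT format accepted by several fine-tuning libraries; adapt it to whatever your trainer expects):

```python
def to_sft_records(pairs, system_prompt):
    """Convert filtered input-output pairs into chat-style training records."""
    records = []
    for pair in pairs:
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": pair["input"]},
                {"role": "assistant", "content": pair["output"]},
            ]
        })
    return records
```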

5. Evaluate

The trained student is tested against a held-out test set to measure accuracy, consistency, and any regressions compared to the teacher.
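For well-defined tasks like classification or extraction, evaluation can be as simple as exact-match accuracy over the held-out set. A sketch treating the student as a prompt-to-answer callable (hypothetical interface, same shape as the teacher wrapper above):

```python
def exact_match_accuracy(student, test_set):
    """Fraction of held-out examples the student answers exactly right.

    `student` is any callable mapping an input string to an output string,
    e.g. a wrapper around a locally served 1B model (hypothetical).
    """
    if not test_set:
        return 0.0
    correct = sum(
        1 for ex in test_set
        if student(ex["input"]).strip().lower() == ex["output"].strip().lower()
    )
    return correct / len(test_set)
```

Running the same function with the teacher as the callable gives you the regression comparison directly.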

Why Not Just Use the Teacher?

If the teacher already solves the task, why bother training a student? Because in production, the teacher’s strengths become liabilities:

Dimension             Teacher (70B+)                 Student (1B–3B)
Latency               500ms–2s per request           20–100ms per request
Cost per 1M tokens    $1–10                          $0.01–0.10
Infrastructure        Multi-GPU cluster              Single GPU or CPU
Privacy               Often requires API calls       Runs fully on-prem
Reliability           Variable (prompt-dependent)    Deterministic on narrow tasks
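The cost gap compounds at scale. A back-of-envelope calculation, taking one illustrative price point from within each range in the table above and a hypothetical daily workload:

```python
# Illustrative arithmetic in integer cents to keep the numbers exact.
TEACHER_CENTS_PER_1M_TOKENS = 500   # $5.00, a point within the $1-10 range
STUDENT_CENTS_PER_1M_TOKENS = 5     # $0.05, a point within the $0.01-0.10 range

tokens_per_day = 200_000_000        # hypothetical: 1M requests x 200 tokens each

teacher_daily = tokens_per_day // 1_000_000 * TEACHER_CENTS_PER_1M_TOKENS
student_daily = tokens_per_day // 1_000_000 * STUDENT_CENTS_PER_1M_TOKENS

print(f"teacher: ${teacher_daily / 100:.2f}/day")    # teacher: $1000.00/day
print(f"student: ${student_daily / 100:.2f}/day")    # student: $10.00/day
print(f"ratio:   {teacher_daily // student_daily}x") # ratio:   100x
```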

The student isn’t better than the teacher at everything — it’s better at the one thing you need, while being dramatically cheaper and faster to run.

When Teacher-Student Distillation Works Best

Distillation excels in specific conditions:

  • Well-defined tasks — Classification, extraction, QA, tool calling. The clearer the expected output format, the better the student learns.
  • Narrow domains — You’re not trying to build a general-purpose assistant. You need a model that does one thing reliably.
  • Scale matters — You’re running thousands or millions of inferences per day, and cost or latency is a constraint.
  • Limited labeled data — You have a handful of examples, not thousands. The teacher bridges the data gap.

When It Doesn’t

Distillation is less effective when:

  • The task is vaguely defined — If you can’t clearly describe what a good output looks like, the teacher will generate noisy data and the student will learn noise.
  • You need open-ended creativity — Distillation compresses knowledge, which means the student trades breadth for depth.
  • The teacher can’t do the task — The student can’t exceed the teacher’s capability on the training distribution. If the teacher gets it wrong, the student will too.

Choosing the Right Teacher and Student

Teacher Selection

Pick the most capable model that reliably handles your task. Bigger isn’t always better — what matters is that the teacher produces high-quality outputs for your specific domain.

Common teacher choices:

  • Llama 3.3 70B Instruct — strong general-purpose teacher
  • Qwen3 235B — excellent for multilingual and reasoning tasks
  • DeepSeek R1 — strong on tasks requiring chain-of-thought reasoning

Student Selection

Choose based on your deployment constraints:

Model           Parameters    Use case
SmolLM2 135M    135M          Ultra-low latency, edge devices
Qwen3 0.6B      600M          Balance of size and quality
Llama 3.2 1B    1B            Solid general-purpose student
Llama 3.2 3B    3B            Complex tasks needing more capacity

Real-World Performance

In our benchmarks across 10 task-specific datasets, distilled students matched or exceeded the teacher model on 8 out of 10. A 1B-parameter student running on a single GPU consistently hit 90%+ of the teacher’s accuracy — at less than 1% of the inference cost.

The performance gap is smallest on well-defined tasks like classification and extraction, and largest on open-ended generation tasks. For most production use cases, the student is more than good enough.

Getting Started

With distil labs, teacher-student distillation takes minutes, not weeks:

  1. Describe your task and provide 10–50 seed examples
  2. Select a teacher and student model
  3. The platform generates synthetic data, validates it, trains the student, and evaluates the result
  4. Deploy your student model to a cloud endpoint or download it for on-prem use

No ML infrastructure. No training scripts. No data labeling pipeline.


Teacher-student distillation is the most practical path from “I have a prompt and an API key” to “I have a production model that runs anywhere.” It takes the intelligence of frontier models and packages it into something you can actually deploy.