How to Distill a Large Language Model into a Small One
Large language models like GPT-4, Llama 3.3 70B, and Qwen3 235B are impressively capable — but they’re expensive, slow, and impossible to run on your own infrastructure without serious hardware. The good news: you don’t have to use them directly in production.
Model distillation lets you transfer the knowledge of a large “teacher” model into a small “student” model that’s 10–100x cheaper to run — while retaining most (or all) of the teacher’s accuracy on your specific task.
This guide walks you through the entire process.
What Is Model Distillation?
Distillation is the process of training a small model to replicate the behaviour of a large one. Instead of training from scratch on raw data, the student learns from the teacher’s outputs — its predictions, reasoning patterns, and decision boundaries.
The key insight is that a large model’s outputs contain far more information than raw labels. When a teacher classifies a support ticket as “billing issue” with high confidence, but also assigns some probability to “account access,” the student learns that nuance too.
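This insight is the basis of classic logit distillation, where the student is trained against the teacher's softened probability distribution rather than a single hard label. A minimal sketch of that objective, assuming a hypothetical three-class ticket classifier with illustrative logits:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions.
    Higher temperatures expose more of the teacher's 'dark knowledge' about
    the relative likelihood of the non-top classes."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

# Teacher is confident in "billing issue" but keeps some mass on
# "account access"; the soft labels carry that nuance to the student.
teacher = [4.0, 2.0, -1.0]   # billing, account access, other (illustrative)
student = [3.5, 1.8, -0.5]
print(softmax(teacher, temperature=2.0))
print(distillation_loss(student, teacher))
```

In practice, LLM distillation pipelines like the one below usually train on the teacher's generated text (sequence-level distillation) rather than raw logits, but the principle is the same: the student learns from the teacher's full output distribution, not just hard labels.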
The Distillation Pipeline
Distilling an LLM follows a clear sequence of steps:
1. Define Your Task
Distillation works best on well-scoped tasks. Start by writing a clear task description and identifying the input-output format:
- Classification — input text → category label
- Question answering — question (+ optional context) → answer
- Information extraction — document → structured fields
- Tool calling — user request → function call with arguments
The narrower your task, the smaller your student model can be while maintaining accuracy.
2. Gather Seed Examples
You need a small set of examples that demonstrate the task. These seed examples serve two purposes:
- They show the teacher model what you’re looking for
- They anchor the synthetic data generation process
As few as 10–20 high-quality examples are enough to get started. Each example should have an input (the “question”) and the expected output (the “answer”).
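Seed examples can be as simple as a JSONL file of input-output pairs. A hypothetical seed set for a support-ticket classifier (the field names are illustrative, not a required schema):

```python
import json

# Two seed examples for a support-ticket classification task.
seed_examples = [
    {"input": "I was charged twice for my subscription this month.",
     "output": "billing issue"},
    {"input": "I can't log in after resetting my password.",
     "output": "account access"},
]

# One JSON object per line is a convenient format for later pipeline steps.
with open("seed_examples.jsonl", "w") as f:
    for ex in seed_examples:
        f.write(json.dumps(ex) + "\n")
```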
3. Choose a Teacher Model
Your teacher should be a large, capable model that performs well on your task. Common choices:
| Teacher Model | Parameters | Strengths |
|---|---|---|
| Llama 3.3 70B Instruct | 70B | Strong general-purpose, good at following instructions |
| Qwen3 235B-A22B | 235B (22B active) | Excellent reasoning, multilingual |
| DeepSeek R1 | 671B | Deep reasoning, strong on complex tasks |
Test your teacher on a handful of examples before committing. If the teacher can’t do the task well, the student won’t either.
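One quick way to do that test is to run the teacher over your seed examples and check agreement with the expected outputs. A minimal harness, with a stub standing in for the real inference call (the `teacher_fn` wrapper and the 80% threshold are assumptions, not fixed requirements):

```python
def sanity_check_teacher(teacher_fn, seed_examples, threshold=0.8):
    """Run the teacher over a handful of seed examples and report agreement.

    `teacher_fn` is any callable mapping an input string to the teacher's
    answer, e.g. a thin wrapper around your inference API."""
    correct = 0
    for ex in seed_examples:
        prediction = teacher_fn(ex["input"]).strip().lower()
        if prediction == ex["output"].strip().lower():
            correct += 1
    accuracy = correct / len(seed_examples)
    return accuracy, accuracy >= threshold

# Stub teacher for illustration; swap in a real API call in practice.
fake_teacher = lambda text: "billing issue" if "charged" in text else "account access"
seeds = [
    {"input": "I was charged twice this month.", "output": "billing issue"},
    {"input": "I can't log in anymore.", "output": "account access"},
]
accuracy, good_enough = sanity_check_teacher(fake_teacher, seeds)
print(accuracy, good_enough)  # 1.0 True
```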
4. Generate Synthetic Training Data
This is where the teacher earns its keep. Using your seed examples and task description, the teacher generates hundreds or thousands of new training examples.
A good synthetic data pipeline:
- Varies complexity — generates both easy and hard examples
- Covers the input space — uses mutation strategies to avoid repetitive patterns
- Validates outputs — automatically filters malformed, duplicate, or off-topic examples
The result is a rich, diverse training dataset that would have taken weeks to create manually.
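The validation step can be sketched as a simple filter over the teacher's raw generations, dropping malformed records, off-label outputs, and near-duplicate inputs (here approximated by exact match after whitespace normalisation; real pipelines often use fuzzier similarity checks):

```python
def validate_examples(raw_examples, allowed_labels):
    """Keep only well-formed, on-label, non-duplicate synthetic examples."""
    seen = set()
    clean = []
    for ex in raw_examples:
        if not isinstance(ex, dict) or "input" not in ex or "output" not in ex:
            continue  # malformed record
        if ex["output"] not in allowed_labels:
            continue  # off-topic or invalid label
        key = " ".join(ex["input"].lower().split())  # normalise whitespace/case
        if key in seen:
            continue  # duplicate input
        seen.add(key)
        clean.append(ex)
    return clean

raw = [
    {"input": "Refund was never issued", "output": "billing issue"},
    {"input": "refund was  never issued", "output": "billing issue"},  # duplicate
    {"input": "hello", "output": "greeting"},                          # off-label
    {"bad": "record"},                                                 # malformed
]
print(validate_examples(raw, {"billing issue", "account access"}))
```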
5. Fine-Tune the Student
With your synthetic dataset ready, train a small student model. Key decisions:
- Model size — 0.6B to 8B parameters depending on task complexity and deployment constraints
- Training method — LoRA adapters for efficiency, full fine-tuning for maximum accuracy
- Hyperparameters — 3–5 epochs, learning rate around 2e-4 for LoRA
The student doesn’t need to understand everything — just your task. That’s why a 1B-parameter model can outperform a 70B general-purpose model on a specific domain.
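A back-of-envelope sketch of why LoRA is the efficient choice: adapters add two low-rank factors per wrapped weight matrix, so the trainable parameter count is a tiny fraction of the full model. The dimensions below are illustrative, not tied to a specific model:

```python
def lora_trainable_params(d_model, n_layers, rank, matrices_per_layer=2):
    """Trainable parameters when LoRA adapters (two low-rank factors of shape
    d_model x rank each) wrap `matrices_per_layer` weight matrices per layer,
    e.g. the attention query and value projections."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# Illustrative dimensions for a ~1B-parameter student (not a specific model).
full_params = 1_000_000_000
adapter_params = lora_trainable_params(d_model=2048, n_layers=28, rank=16)
print(f"{adapter_params:,} trainable params "
      f"({adapter_params / full_params:.2%} of the full model)")
# → 3,670,016 trainable params (0.37% of the full model)
```

Training well under 1% of the weights is what makes LoRA fine-tuning feasible on modest hardware; full fine-tuning updates every parameter and can squeeze out somewhat higher accuracy at much greater cost.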
6. Evaluate
Compare the student against the teacher on a held-out test set. Track:
- Accuracy — does the student match the teacher’s correctness?
- Consistency — does it produce stable outputs across similar inputs?
- Latency and cost — how much faster and cheaper is the student?
If the student falls short, you can improve results by generating more training data, increasing model size, or refining your task description.
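The comparison can be driven by a small harness that treats each model as a text-to-label callable and measures accuracy and mean latency on the held-out set. The stub student below is for illustration only:

```python
import time

def evaluate(model_fn, test_set):
    """Measure accuracy and mean per-example latency of a text->label callable."""
    correct, latencies = 0, []
    for ex in test_set:
        start = time.perf_counter()
        prediction = model_fn(ex["input"])
        latencies.append(time.perf_counter() - start)
        correct += prediction == ex["output"]
    return correct / len(test_set), sum(latencies) / len(latencies)

held_out = [
    {"input": "Why was I billed twice?", "output": "billing issue"},
    {"input": "Password reset link broken", "output": "account access"},
]
# Stub standing in for the fine-tuned student; use real inference in practice.
student = lambda t: "billing issue" if "bill" in t.lower() else "account access"
accuracy, mean_latency = evaluate(student, held_out)
print(f"accuracy={accuracy:.2f}, mean latency={mean_latency * 1000:.2f} ms")
```

Running the same harness over the teacher gives a like-for-like accuracy, latency, and (with per-call pricing) cost comparison.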
7. Deploy
Fine-tuned SLMs are small enough to deploy almost anywhere:
- Serverless endpoints for easy API integration
- On-premises servers for data privacy requirements
- Edge devices for models under 3B parameters
How Small Can You Go?
The answer depends on your task:
| Task Complexity | Recommended Student Size | Example |
|---|---|---|
| Simple classification (< 10 classes) | 0.6B–1B | Sentiment analysis, intent routing |
| Moderate extraction or QA | 1B–3B | Named entity extraction, FAQ answering |
| Complex reasoning or multi-step | 3B–8B | Tool calling, multi-hop QA |
In our benchmarks, a distilled 1.7B model matches a 70B teacher on 8 out of 10 classification datasets — at roughly 1/40th the inference cost.
Common Mistakes to Avoid
Starting too big. Try a 1B student first. You can always scale up if needed, but you might be surprised how capable small models are on focused tasks.
Skipping evaluation. Always measure against a held-out test set. Synthetic data quality varies, and you need to know if the student actually learned the right patterns.
Using a weak teacher. The student can only be as good as its training signal. If your teacher gets 70% accuracy on the task, don’t expect the student to do better.
Over-generating data. More data isn’t always better. 1,000 high-quality, diverse examples often outperform 10,000 repetitive ones.
Getting Started with distil labs
The distil labs platform handles the entire distillation pipeline:
- Describe your task and provide seed examples
- The platform selects a teacher and generates synthetic training data
- A student model is fine-tuned and evaluated automatically
- Deploy your distilled model to a serverless endpoint or download it
No GPU setup, no ML infrastructure, no data labeling — just a clear task description and a few examples.
Distillation is the most practical path from “we’re using GPT-4 for everything” to “we have fast, private, cost-effective models in production.” The teacher does the hard work once so the student can do it efficiently forever.