Model Distillation Tutorial: From LLM to Deployable SLM
Model distillation is the process of transferring knowledge from a large, expensive language model into a small, efficient one. The result is a model that’s 10–100x smaller, runs on commodity hardware, and matches the original on your specific task.
This tutorial walks you through every step — from choosing a teacher to deploying a production-ready small language model.
What You’ll Build
By the end of this tutorial, you’ll have:
- A teacher model (e.g., Llama 3.3 70B) generating high-quality training data
- A validated synthetic dataset tailored to your task
- A fine-tuned student model (e.g., Qwen3 1.7B) that runs anywhere
- A clear understanding of how each step works
Prerequisites
You don’t need ML expertise. You do need:
- A clear idea of what task you want the model to perform
- 10–50 seed examples showing input-output pairs
- A test set of 20–50 examples to evaluate the result
Step 1: Define Your Task
Every distillation project starts with a task definition. Write a plain-language description of what your model should do:
“Classify incoming customer support tickets into one of five categories: billing, technical, account, shipping, or general.”
Then gather your seed examples. Each example needs a question (input) and an answer (expected output):
```jsonl
{"question": "I was charged twice for my subscription", "answer": "billing"}
{"question": "The app crashes when I try to upload a file", "answer": "technical"}
{"question": "How do I change my email address?", "answer": "account"}
```
Ten to fifty examples are enough to get started. Focus on covering the range of inputs your model will see in production.
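Before moving on, it's worth sanity-checking your seed file programmatically. The sketch below is a minimal validator for the ticket-classification task above; the label set and the inline seed data are illustrative, and in practice you would read the lines from your own JSONL file.

```python
import json

# Illustrative seed data: each line is one JSON object with "question" and "answer".
SEEDS = [
    '{"question": "I was charged twice for my subscription", "answer": "billing"}',
    '{"question": "The app crashes when I try to upload a file", "answer": "technical"}',
    '{"question": "How do I change my email address?", "answer": "account"}',
]
VALID_LABELS = {"billing", "technical", "account", "shipping", "general"}

def validate_seeds(lines):
    """Parse JSONL seed examples, checking required fields and known labels."""
    examples = []
    for i, line in enumerate(lines, start=1):
        ex = json.loads(line)
        assert {"question", "answer"} <= ex.keys(), f"line {i}: missing field"
        assert ex["answer"] in VALID_LABELS, f"line {i}: unknown label {ex['answer']!r}"
        examples.append(ex)
    return examples

seeds = validate_seeds(SEEDS)
covered = {ex["answer"] for ex in seeds}
print(f"{len(seeds)} seeds, labels covered: {sorted(covered)}")
print(f"labels still missing: {sorted(VALID_LABELS - covered)}")
```

A check like this also tells you which categories your seeds don't yet cover, which is exactly the coverage gap to close before generation.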
Step 2: Choose Your Teacher Model
The teacher model generates the synthetic training data your student will learn from. Pick the most capable model you can afford for this step — it only runs during training, not in production.
| Teacher Model | Parameters | Strengths |
|---|---|---|
| Llama 3.3 70B Instruct | 70B | Strong general-purpose, good instruction following |
| Qwen3 235B | 235B | Excellent reasoning, multilingual |
| DeepSeek R1 | 671B (MoE) | Deep reasoning, chain-of-thought |
The teacher doesn’t need to be perfect. It just needs to be better than random on your task — the validation step will catch mistakes.
Step 3: Evaluate the Teacher
Before generating training data, confirm the teacher can actually do your task. Run it against your test set and measure accuracy.
This step catches problems early. If the teacher struggles with your task, you need to either:
- Improve your task description
- Provide better seed examples
- Choose a more capable teacher
A teacher accuracy of 80%+ is a good starting point. Student models routinely match or exceed the teacher after distillation because they benefit from the concentrated, validated training set.
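The evaluation loop itself is simple. In the sketch below, `ask_teacher` is a stand-in for a real call to your chosen teacher (e.g. a hosted Llama 3.3 70B endpoint); it is stubbed here with canned answers so the logic is runnable, and the test set is a made-up three-example sample.

```python
def ask_teacher(question: str) -> str:
    # Stub for a real teacher-model API call; returns canned answers.
    stub_answers = {
        "I was charged twice for my subscription": "billing",
        "The app crashes when I try to upload a file": "technical",
        "How do I change my email address?": "account",
    }
    return stub_answers.get(question, "general")

TEST_SET = [
    {"question": "I was charged twice for my subscription", "answer": "billing"},
    {"question": "The app crashes when I try to upload a file", "answer": "technical"},
    {"question": "Where is my package?", "answer": "shipping"},
]

def evaluate(model, test_set):
    """Fraction of test examples where the model's answer matches the label."""
    correct = sum(
        model(ex["question"]).strip().lower() == ex["answer"] for ex in test_set
    )
    return correct / len(test_set)

accuracy = evaluate(ask_teacher, TEST_SET)
print(f"teacher accuracy: {accuracy:.0%}")
```

The same `evaluate` function can be reused in Step 7 to score the student, which keeps the teacher and student comparison apples-to-apples.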
Step 4: Generate Synthetic Data
This is the core of the distillation pipeline. The teacher model generates hundreds or thousands of new examples based on your task description and seed data.
A good generation pipeline uses mutation strategies to ensure diversity:
- Topic mutation — vary the subject matter across examples
- Complexity mutation — mix simple and difficult cases
- Length mutation — vary input and output length
Each generated example is validated automatically. Invalid, duplicate, or low-quality examples are filtered out. A typical pipeline generates 500–2,000 usable examples from just 10 seeds.
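A minimal version of that validation pass might look like the following. This is a sketch, not any particular platform's implementation: it assumes generated examples arrive as dicts with `question` and `answer` fields, and it drops malformed records, unknown labels, and near-duplicate questions (detected via case and whitespace normalization).

```python
VALID_LABELS = {"billing", "technical", "account", "shipping", "general"}

def filter_generated(examples):
    """Keep only well-formed, validly labeled, non-duplicate examples."""
    seen = set()
    kept = []
    for ex in examples:
        if not isinstance(ex, dict) or "question" not in ex or "answer" not in ex:
            continue  # malformed record
        if ex["answer"] not in VALID_LABELS:
            continue  # label outside the task's category set
        key = " ".join(ex["question"].lower().split())  # normalize for dedup
        if key in seen:
            continue  # near-duplicate question
        seen.add(key)
        kept.append(ex)
    return kept

raw = [
    {"question": "Why was I billed twice?", "answer": "billing"},
    {"question": "why  was I billed TWICE?", "answer": "billing"},   # duplicate
    {"question": "My tracking number doesn't work", "answer": "shipping"},
    {"question": "Refund me now", "answer": "refunds"},              # unknown label
]
clean = filter_generated(raw)
print(f"kept {len(clean)} of {len(raw)} generated examples")
```

Production pipelines typically add stronger checks (semantic deduplication, length limits, teacher self-consistency), but the filter-before-train principle is the same.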
Step 5: Choose Your Student Model
The student model is what you’ll deploy to production. Choose based on your constraints:
| Student Model | Parameters | Best For |
|---|---|---|
| SmolLM2 135M | 135M | Edge devices, ultra-low latency |
| Qwen3 0.6B | 600M | Balance of speed and accuracy |
| Llama 3.2 1B | 1B | General-purpose baseline |
| Llama 3.2 3B | 3B | Complex tasks needing more capacity |
| Llama 3.1 8B | 8B | Maximum accuracy, still far smaller than the teacher |
Smaller models are faster and cheaper to run. Start small and only scale up if accuracy isn’t sufficient.
Step 6: Fine-Tune the Student
Train the student model on your validated synthetic dataset. Key configuration:
```yaml
base:
  task: classification
  student_model_name: Qwen3-1.7B
  teacher_model_name: Llama-3.3-70B-Instruct
tuning:
  num_train_epochs: 4
  use_lora: true
  learning_rate: 0.0002
synthgen:
  generation_target: 1000
```
LoRA (Low-Rank Adaptation) is the default training method. It’s faster, uses less memory, and produces results comparable to full fine-tuning for most tasks.
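The memory savings come from the low-rank structure: instead of updating a full weight matrix, LoRA freezes it and learns two small factors whose product is the update. The arithmetic below illustrates the trainable-parameter reduction for a single matrix; the dimensions and rank are illustrative, not the exact shapes of any particular model.

```python
# One d_out x d_in weight matrix, adapted with LoRA factors
# B (d_out x r) and A (r x d_in), so the update is the product B @ A.
d_in, d_out, rank = 2048, 2048, 8

full_params = d_out * d_in                 # trainable params under full fine-tuning
lora_params = d_out * rank + rank * d_in   # trainable params under LoRA

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x fewer trainable parameters")
```

At rank 8 this single matrix needs roughly two orders of magnitude fewer trainable parameters, which is why LoRA fits on much smaller GPUs than full fine-tuning.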
Training typically takes 30 minutes to a few hours depending on the dataset size and student model.
Step 7: Evaluate the Student
Compare your fine-tuned student against the teacher on your held-out test set. Metrics to track:
- Accuracy — does the student produce correct outputs?
- Consistency — how stable are outputs across similar inputs?
- Latency — how fast is inference compared to the teacher?
- Cost — what’s the per-request cost reduction?
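These metrics reduce to simple arithmetic once you have predictions and measurements. The sketch below computes them on made-up numbers; the predictions, latencies, and per-request costs are purely illustrative.

```python
# Illustrative student predictions vs. ground-truth labels on a tiny test set.
student_preds = ["billing", "technical", "account", "shipping", "billing"]
labels        = ["billing", "technical", "account", "shipping", "general"]

accuracy = sum(p == y for p, y in zip(student_preds, labels)) / len(labels)

# Illustrative measurements: mean request latency and per-request cost.
teacher_latency_ms, student_latency_ms = 1200.0, 35.0
teacher_cost_usd, student_cost_usd = 0.0040, 0.0001

print(f"accuracy: {accuracy:.0%}")
print(f"latency speedup: {teacher_latency_ms / student_latency_ms:.0f}x")
print(f"cost reduction: {teacher_cost_usd / student_cost_usd:.0f}x")
```

Track accuracy on the same held-out test set you used for the teacher in Step 3, so the two numbers are directly comparable.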
In our benchmarks, distilled students match or exceed the teacher on 8 out of 10 datasets — while running orders of magnitude faster.
Step 8: Deploy
Fine-tuned SLMs are small enough to run almost anywhere:
- Serverless API — deploy behind an endpoint for easy integration
- On-premises — run on your own infrastructure for data privacy
- Edge devices — models under 3B parameters run on mobile hardware and laptops
Your deployed model processes requests in milliseconds, costs a fraction of API calls to frontier models, and keeps all data under your control.
Common Pitfalls
Starting with too little evaluation data. Your test set is your compass. If it’s too small or unrepresentative, you won’t know whether distillation worked.
Skipping teacher evaluation. If the teacher can’t do the task, the student won’t learn it. Always validate the teacher first.
Over-generating without validation. More data isn’t always better. A thousand validated examples outperform ten thousand noisy ones.
Choosing a student that’s too small. Start with a 1B–3B model. You can always compress further once you’ve validated the approach.
Putting It All Together
The full distillation pipeline looks like this:
- Define task + gather seed examples
- Select and evaluate a teacher model
- Generate and validate synthetic training data
- Fine-tune a small student model
- Evaluate against your test set
- Deploy to production
With distil labs, this entire pipeline runs from a single configuration. Describe your task, provide your examples, and the platform handles teacher evaluation, data generation, training, and evaluation automatically.
Model distillation isn’t a research technique — it’s a production workflow. Every team running LLM inference at scale should be asking: can I distill this into something smaller, faster, and cheaper? The answer is almost always yes.