Model Distillation Tutorial: From LLM to Deployable SLM

Model distillation is the process of transferring knowledge from a large, expensive language model into a small, efficient one. The result is a model that’s 10–100x smaller, runs on commodity hardware, and matches the original on your specific task.

This tutorial walks you through every step — from choosing a teacher to deploying a production-ready small language model.

What You’ll Build

By the end of this tutorial, you’ll have:

  • A teacher model (e.g., Llama 3.3 70B) generating high-quality training data
  • A validated synthetic dataset tailored to your task
  • A fine-tuned student model (e.g., Qwen3 1.7B) that runs anywhere
  • A clear understanding of how each step works

Prerequisites

You don’t need ML expertise. You do need:

  • A clear idea of what task you want the model to perform
  • 10–50 seed examples showing input-output pairs
  • A test set of 20–50 examples to evaluate the result

Step 1: Define Your Task

Every distillation project starts with a task definition. Write a plain-language description of what your model should do:

“Classify incoming customer support tickets into one of five categories: billing, technical, account, shipping, or general.”

Then gather your seed examples. Each example needs a question (input) and an answer (expected output):

{"question": "I was charged twice for my subscription", "answer": "billing"}
{"question": "The app crashes when I try to upload a file", "answer": "technical"}
{"question": "How do I change my email address?", "answer": "account"}

Ten to fifty examples are enough to get started. Focus on covering the range of inputs your model will see in production.
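Before going further, it's worth sanity-checking the seed file programmatically. The sketch below (a hypothetical helper, assuming the JSONL format shown above with `question`/`answer` fields) drops malformed lines and answers outside your category set:

```python
import json

def load_seed_examples(path, valid_labels=None):
    """Load JSONL seed examples, dropping malformed or mislabeled lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that aren't valid JSON
            if "question" not in record or "answer" not in record:
                continue  # both fields are required
            if valid_labels and record["answer"] not in valid_labels:
                continue  # answer must be one of the task's categories
            examples.append(record)
    return examples
```

A few seeds lost to typos or a mistyped label can quietly skew everything downstream, so this check pays for itself.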

Step 2: Choose Your Teacher Model

The teacher model generates the synthetic training data your student will learn from. Pick the most capable model you can afford for this step — it only runs during training, not in production.

| Teacher Model | Parameters | Strengths |
| --- | --- | --- |
| Llama 3.3 70B Instruct | 70B | Strong general-purpose, good instruction following |
| Qwen3 235B | 235B | Excellent reasoning, multilingual |
| DeepSeek R1 | 671B (MoE) | Deep reasoning, chain-of-thought |

The teacher doesn’t need to be perfect. It just needs to be better than random on your task — the validation step will catch mistakes.

Step 3: Evaluate the Teacher

Before generating training data, confirm the teacher can actually do your task. Run it against your test set and measure accuracy.

This step catches problems early. If the teacher struggles with your task, you need to either:

  • Improve your task description
  • Provide better seed examples
  • Choose a more capable teacher

A teacher accuracy of 80%+ is a good starting point. Student models routinely match or exceed the teacher after distillation because they benefit from the concentrated, validated training set.
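Exact-match accuracy on a classification task is simple to compute yourself. In this sketch, `ask` is a stand-in for however you actually call the model (an API client, a local pipeline, and so on), which is an assumption, not part of any specific SDK:

```python
def evaluate_model(ask, test_set):
    """Measure exact-match accuracy of a model callable on a test set.

    `ask` is any function that takes a question string and returns
    the model's answer string.
    """
    correct = 0
    for example in test_set:
        prediction = ask(example["question"]).strip().lower()
        if prediction == example["answer"].strip().lower():
            correct += 1
    return correct / len(test_set)
```

If the number comes back well under 80%, revisit the task description and seed examples before spending anything on data generation.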

Step 4: Generate Synthetic Data

This is the core of the distillation pipeline. The teacher model generates hundreds or thousands of new examples based on your task description and seed data.

A good generation pipeline uses mutation strategies to ensure diversity:

  • Topic mutation — vary the subject matter across examples
  • Complexity mutation — mix simple and difficult cases
  • Length mutation — vary input and output length

Each generated example is validated automatically. Invalid, duplicate, or low-quality examples are filtered out. A typical pipeline generates 500–2,000 usable examples from just 10 seeds.
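The validation stage can be sketched as a simple filter pass. The checks below are illustrative assumptions (a production pipeline would add task-specific quality checks on top), but they cover the three failure modes named above, invalid, duplicate, and degenerate examples:

```python
def filter_generated(examples, valid_labels, max_question_chars=2000):
    """Keep only well-formed, non-duplicate generated examples."""
    seen = set()
    kept = []
    for ex in examples:
        question = ex.get("question", "").strip()
        answer = ex.get("answer", "").strip()
        if not question or not answer:
            continue  # both fields must be non-empty
        if answer not in valid_labels:
            continue  # teachers occasionally invent labels outside the task
        key = question.lower()
        if key in seen:
            continue  # drop case-insensitive exact duplicates
        if len(question) > max_question_chars:
            continue  # overly long inputs are usually degenerate generations
        seen.add(key)
        kept.append({"question": question, "answer": answer})
    return kept
```

Expect a meaningful fraction of raw generations to be filtered out; that attrition is the point, since only the survivors reach the student.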

Step 5: Choose Your Student Model

The student model is what you’ll deploy to production. Choose based on your constraints:

| Student Model | Parameters | Best For |
| --- | --- | --- |
| SmolLM2 135M | 135M | Edge devices, ultra-low latency |
| Qwen3 0.6B | 600M | Balance of speed and accuracy |
| Llama 3.2 1B | 1B | General-purpose baseline |
| Llama 3.2 3B | 3B | Complex tasks needing more capacity |
| Llama 3.1 8B | 8B | Maximum accuracy, still far smaller than the teacher |

Smaller models are faster and cheaper to run. Start small and only scale up if accuracy isn’t sufficient.

Step 6: Fine-Tune the Student

Train the student model on your validated synthetic dataset. Key configuration:

base:
  task: classification
  student_model_name: Qwen3-1.7B
  teacher_model_name: Llama-3.3-70B-Instruct
tuning:
  num_train_epochs: 4
  use_lora: true
  learning_rate: 0.0002
synthgen:
  generation_target: 1000

LoRA (Low-Rank Adaptation) is the default training method. It’s faster, uses less memory, and produces results comparable to full fine-tuning for most tasks.

Training typically takes 30 minutes to a few hours depending on the dataset size and student model.

Step 7: Evaluate the Student

Compare your fine-tuned student against the teacher on your held-out test set. Metrics to track:

  • Accuracy — does the student produce correct outputs?
  • Consistency — how stable are outputs across similar inputs?
  • Latency — how fast is inference compared to the teacher?
  • Cost — what’s the per-request cost reduction?

In our benchmarks, distilled students match or exceed the teacher on 8 out of 10 datasets — while running orders of magnitude faster.
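Accuracy and latency can be tracked side by side with one small harness. As before, the model callables here are stand-ins for your actual inference clients (an assumption, not a specific API), and wall-clock timing like this only gives a rough per-request figure:

```python
import time

def benchmark(ask, test_set):
    """Return (exact-match accuracy, mean seconds per request)."""
    correct = 0
    start = time.perf_counter()
    for example in test_set:
        prediction = ask(example["question"]).strip().lower()
        if prediction == example["answer"].strip().lower():
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(test_set), elapsed / len(test_set)
```

Running this once with the teacher callable and once with the student gives the accuracy-versus-latency comparison directly; per-request cost then follows from your provider's or hardware's pricing.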

Step 8: Deploy

Fine-tuned SLMs are small enough to run almost anywhere:

  • Serverless API — deploy behind an endpoint for easy integration
  • On-premises — run on your own infrastructure for data privacy
  • Edge devices — models under 3B parameters run on mobile hardware and laptops

Your deployed model processes requests in milliseconds, costs a fraction of API calls to frontier models, and keeps all data under your control.

Common Pitfalls

Starting with too little evaluation data. Your test set is your compass. If it’s too small or unrepresentative, you won’t know whether distillation worked.

Skipping teacher evaluation. If the teacher can’t do the task, the student won’t learn it. Always validate the teacher first.

Over-generating without validation. More data isn’t always better. A thousand validated examples outperform ten thousand noisy ones.

Choosing a student that’s too small. Start with a 1B–3B model. You can always compress further once you’ve validated the approach.

Putting It All Together

The full distillation pipeline looks like this:

  1. Define task + gather seed examples
  2. Select and evaluate a teacher model
  3. Generate and validate synthetic training data
  4. Fine-tune a small student model
  5. Evaluate against your test set
  6. Deploy to production

With distil labs, this entire pipeline runs from a single configuration. Describe your task, provide your examples, and the platform handles teacher evaluation, data generation, training, and evaluation automatically.


Model distillation isn’t a research technique — it’s a production workflow. Every team running LLM inference at scale should be asking: can I distill this into something smaller, faster, and cheaper? The answer is almost always yes.