Distillation vs Quantization: Which Shrinks Your Model Better?
You want a smaller, faster language model. Two techniques keep coming up: knowledge distillation and quantization. Both shrink models, but they do it in completely different ways — and choosing the wrong one can leave performance on the table.
Here’s how they compare and when each one makes sense.
What Is Knowledge Distillation?
Knowledge distillation trains a smaller model (the student) to reproduce the behaviour of a larger model (the teacher). The student learns from the teacher’s outputs — not from the original training data directly.
The result is a genuinely smaller architecture. A 70B-parameter teacher can distil its task-specific knowledge into a 1B-parameter student that runs on a single CPU.
What changes: the model architecture itself — fewer layers, fewer parameters, different weights.
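To make the mechanism concrete, here is a minimal sketch of the classic distillation loss in plain Python: the student minimises the KL divergence between its temperature-softened output distribution and the teacher's. The function names, logit values, and temperature are illustrative only; real training uses a framework's batched, differentiable version of this.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's and the student's softened
    output distributions -- the quantity the student is trained to minimise."""
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The student is penalised for diverging from the teacher's distribution.
teacher = [4.0, 1.0, 0.5]
student = [3.5, 1.2, 0.4]
loss = distillation_loss(teacher, student)
```

In practice this term is usually blended with ordinary cross-entropy on hard labels, and the temperature softens both distributions so the student can also learn the teacher's relative preferences among wrong answers.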
What Is Quantization?
Quantization takes an existing model and reduces the numerical precision of its weights. Instead of storing each weight as a 16-bit or 32-bit floating-point number, you map them onto 8-bit, 4-bit, or even lower-precision integers.
The architecture stays the same. You still have the same number of parameters — they’re just stored more compactly.
What changes: the precision of the numbers, not the structure of the model.
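As a concrete illustration, here is a toy sketch of symmetric post-training quantization, assuming a single scale per weight tensor. Real schemes typically add per-channel scales, zero points, and calibration data, so treat this as the idea rather than a production implementation.

```python
def quantize(weights, bits=8):
    """Symmetric post-training quantization: map floats onto a signed
    integer grid, keeping one floating-point scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    ints = [round(w / scale) for w in weights]
    return ints, scale

def dequantize(ints, scale):
    """Recover approximate floats for use at inference time."""
    return [i * scale for i in ints]

weights = [0.42, -1.30, 0.07, 0.95]
ints, scale = quantize(weights, bits=8)
restored = dequantize(ints, scale)  # close to, but not exactly, the originals
```

Note that the structure is untouched: `ints` has the same length as `weights`; each entry just occupies fewer bits.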
Side-by-Side Comparison
| Dimension | Distillation | Quantization |
|---|---|---|
| What shrinks | Architecture (fewer parameters) | Weight precision (same parameters, smaller numbers) |
| Size reduction | 10–100x | 2–4x |
| Speed improvement | Large (smaller model = faster) | Moderate (depends on hardware support) |
| Accuracy impact | Can match teacher on narrow tasks | Small degradation, increases at lower bit widths |
| Training required | Yes — full fine-tuning run | No training (post-training quantization) or minimal calibration |
| Task specificity | Produces a task-specific model | Preserves general-purpose capability |
| Hardware requirements | GPU for training, CPU/GPU for inference | Minimal — often done on CPU |
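The size-reduction rows above follow from simple arithmetic: weight storage is parameter count times bytes per weight. A quick sanity check, using illustrative model sizes:

```python
def model_size_gb(params, bits_per_weight):
    """Approximate weight-storage footprint in GB (ignores activations,
    KV cache, and per-tensor quantization metadata)."""
    return params * bits_per_weight / 8 / 1e9

teacher_fp16   = model_size_gb(70e9, 16)  # 70B parameters at 16-bit: 140 GB
distilled_fp16 = model_size_gb(1e9, 16)   # 1B student, same precision: 2 GB (70x)
quantized_int4 = model_size_gb(70e9, 4)   # same 70B model at 4-bit: 35 GB (4x)
```

This is why distillation dominates on raw size: it attacks the parameter count, which is the larger of the two factors.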
When to Use Distillation
Distillation is the right choice when:
- You have a well-defined, narrow task (classification, extraction, QA, tool calling)
- You need maximum size reduction — going from 70B to 1B parameters
- Inference cost is a primary concern and you’re running at scale
- You want a model that can run on edge devices or on-prem hardware
- You’re willing to invest time in training for a significantly better production model
The trade-off is that distilled models are specialists. They excel at the task they were trained for but lose the generalist capability of the teacher.
When to Use Quantization
Quantization is the right choice when:
- You need a general-purpose model that’s simply smaller
- You want a quick win without any training infrastructure
- You’re deploying a model that needs to handle diverse, unpredictable tasks
- You’re working with an open-weights model and want to run it locally
- A 2–4x size reduction is sufficient for your deployment constraints
The trade-off is that quantization has diminishing returns. Below 4-bit precision, accuracy degrades noticeably — especially on reasoning-heavy tasks.
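The diminishing returns show up even in a toy round-trip experiment: quantize a synthetic weight tensor and measure the mean reconstruction error at several bit widths. The numbers are illustrative only; real evaluations measure task accuracy, not raw weight error.

```python
def quantization_error(weights, bits):
    """Mean absolute round-trip error of symmetric quantization at a
    given bit width -- the error grows quickly as bits are removed."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [round(w / scale) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

toy_weights = [(-1) ** i * (i % 17) / 10 for i in range(100)]  # synthetic tensor
err8 = quantization_error(toy_weights, 8)
err4 = quantization_error(toy_weights, 4)
err2 = quantization_error(toy_weights, 2)  # degradation accelerates below 4-bit
```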
Can You Combine Them?
Yes — and this is often the best approach for production deployments.
A common pipeline looks like:
1. Distil a large teacher into a small student trained on your specific task
2. Quantize the distilled student to squeeze out additional size and speed gains
For example, you might distil a 70B teacher into a 3B student, then quantize the student to 4-bit precision. The result is a model roughly a hundred times smaller than the original (about 23x fewer parameters, 4x lower precision), runs on commodity hardware, and still performs well on your target task.
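The combined compression can be checked with back-of-the-envelope arithmetic, counting weight storage only and using the illustrative numbers from the example above:

```python
# Rough storage arithmetic for the combined pipeline:
# a 70B fp16 teacher vs a 3B student quantized to 4-bit.
teacher_bytes = 70e9 * 16 / 8  # 140 GB of weights
student_bytes = 3e9 * 4 / 8    # 1.5 GB after distillation + quantization
compression = teacher_bytes / student_bytes  # roughly 93x overall
```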
The Bottom Line
Distillation and quantization solve different problems:
- Distillation changes what the model is — smaller architecture, task-specific knowledge
- Quantization changes how the model is stored — same architecture, lower precision
If you’re building a production system for a specific task, distillation gives you the biggest gains. If you need a quick reduction in model size without retraining, quantization is the pragmatic choice. For maximum compression, use both.