Distillation vs Quantization: Which Shrinks Your Model Better?
You want a smaller, faster language model. Two techniques keep coming up: knowledge distillation and quantization. Both shrink models, but they do it in completely different ways — and choosing the wrong one can leave performance on the table.
Here’s how they compare and when each one makes sense.
What Is Knowledge Distillation?
Knowledge distillation trains a smaller model (the student) to reproduce the behaviour of a larger model (the teacher). The student learns from the teacher’s outputs — not from the original training data directly.
The result is a genuinely smaller architecture. A 70B-parameter teacher can distil its task-specific knowledge into a 1B-parameter student that runs on a single CPU.
What changes: the model architecture itself — fewer layers, fewer parameters, different weights.
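To make the mechanism concrete, here is a minimal sketch of the classic distillation loss in plain Python: the student minimises the KL divergence between its temperature-softened output distribution and the teacher's. The function names, logit values, and temperature are illustrative only; real training uses a framework's batched, differentiable version of this.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's and the student's softened
    output distributions -- the quantity the student is trained to minimise."""
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The student is penalised for diverging from the teacher's distribution.
teacher = [4.0, 1.0, 0.5]
student = [3.5, 1.2, 0.4]
loss = distillation_loss(teacher, student)
```

In practice this term is usually blended with ordinary cross-entropy on hard labels, and the temperature softens both distributions so the student can also learn the teacher's relative preferences among wrong answers.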
What Is Quantization?
Quantization takes an existing model and reduces the numerical precision of its weights. Instead of storing each weight as a 16-bit or 32-bit floating-point number, you map them onto 8-bit, 4-bit, or even lower-precision integers.
The architecture stays the same. You still have the same number of parameters — they’re just stored more compactly.
What changes: the precision of the numbers, not the structure of the model.
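As a concrete illustration, here is a toy sketch of symmetric post-training quantization, assuming a single scale per weight tensor. Real schemes typically add per-channel scales, zero points, and calibration data, so treat this as the idea rather than a production implementation.

```python
def quantize(weights, bits=8):
    """Symmetric post-training quantization: map floats onto a signed
    integer grid, keeping one floating-point scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    ints = [round(w / scale) for w in weights]
    return ints, scale

def dequantize(ints, scale):
    """Recover approximate floats for use at inference time."""
    return [i * scale for i in ints]

weights = [0.42, -1.30, 0.07, 0.95]
ints, scale = quantize(weights, bits=8)
restored = dequantize(ints, scale)  # close to, but not exactly, the originals
```

Note that the structure is untouched: `ints` has the same length as `weights`; each entry just occupies fewer bits.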
Side-by-Side Comparison
| Dimension | Distillation | Quantization |
|---|---|---|
| What shrinks | Architecture (fewer parameters) | Weight precision (same parameters, smaller numbers) |
| Size reduction | 10–100x | 2–4x |
| Speed improvement | Large (smaller model = faster) | Moderate (depends on hardware support) |
| Accuracy impact | Can match teacher on narrow tasks | Small degradation, increases at lower bit widths |
| Training required | Yes — full fine-tuning run | No training (post-training quantization) or minimal calibration |
| Task specificity | Produces a task-specific model | Preserves general-purpose capability |
| Hardware requirements | GPU for training, CPU/GPU for inference | Minimal — often done on CPU |
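The size-reduction rows above follow from simple arithmetic: weight storage is parameter count times bytes per weight. A quick sanity check, using illustrative model sizes:

```python
def model_size_gb(params, bits_per_weight):
    """Approximate weight-storage footprint in GB (ignores activations,
    KV cache, and per-tensor quantization metadata)."""
    return params * bits_per_weight / 8 / 1e9

teacher_fp16   = model_size_gb(70e9, 16)  # 70B parameters at 16-bit: 140 GB
distilled_fp16 = model_size_gb(1e9, 16)   # 1B student, same precision: 2 GB (70x)
quantized_int4 = model_size_gb(70e9, 4)   # same 70B model at 4-bit: 35 GB (4x)
```

This is why distillation dominates on raw size: it attacks the parameter count, which is the larger of the two factors.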
When to Use Distillation
Distillation is the right choice when:
- You have a well-defined, narrow task (classification, extraction, QA, tool calling)
- You need maximum size reduction — going from 70B to 1B parameters
- Inference cost is a primary concern and you’re running at scale
- You want a model that can run on edge devices or on-prem hardware
- You’re willing to invest time in training for a significantly better production model
The trade-off is that distilled models are specialists. They excel at the task they were trained for but lose the generalist capability of the teacher.
When to Use Quantization
Quantization is the right choice when:
- You need a general-purpose model that’s simply smaller
- You want a quick win without any training infrastructure
- You’re deploying a model that needs to handle diverse, unpredictable tasks
- You’re working with an open-weights model and want to run it locally
- A 2–4x size reduction is sufficient for your deployment constraints
The trade-off is that quantization has diminishing returns. Below 4-bit precision, accuracy degrades noticeably — especially on reasoning-heavy tasks.
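The diminishing returns show up even in a toy round-trip experiment: quantize a synthetic weight tensor and measure the mean reconstruction error at several bit widths. The numbers are illustrative only; real evaluations measure task accuracy, not raw weight error.

```python
def quantization_error(weights, bits):
    """Mean absolute round-trip error of symmetric quantization at a
    given bit width -- the error grows quickly as bits are removed."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [round(w / scale) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

toy_weights = [(-1) ** i * (i % 17) / 10 for i in range(100)]  # synthetic tensor
err8 = quantization_error(toy_weights, 8)
err4 = quantization_error(toy_weights, 4)
err2 = quantization_error(toy_weights, 2)  # degradation accelerates below 4-bit
```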
Can You Combine Them?
Yes — and this is often the best approach for production deployments.
A common pipeline looks like:
1. Distil a large teacher into a small student trained on your specific task
2. Quantize the distilled student to squeeze out additional size and speed gains
For example, you might distil a 70B teacher into a 3B student, then quantize the student to 4-bit precision. The result is a model roughly a hundred times smaller than the original (about 23x fewer parameters, 4x lower precision), runs on commodity hardware, and still performs well on your target task.
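The combined compression can be checked with back-of-the-envelope arithmetic, counting weight storage only and using the illustrative numbers from the example above:

```python
# Rough storage arithmetic for the combined pipeline:
# a 70B fp16 teacher vs a 3B student quantized to 4-bit.
teacher_bytes = 70e9 * 16 / 8  # 140 GB of weights
student_bytes = 3e9 * 4 / 8    # 1.5 GB after distillation + quantization
compression = teacher_bytes / student_bytes  # roughly 93x overall
```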
The Bottom Line
Distillation and quantization solve different problems:
- Distillation changes what the model is — smaller architecture, task-specific knowledge
- Quantization changes how the model is stored — same architecture, lower precision
If you’re building a production system for a specific task, distillation gives you the biggest gains. If you need a quick reduction in model size without retraining, quantization is the pragmatic choice. For maximum compression, use both.