Best Small Language Model for Fine-Tuning in 2025
Choosing the right base model is the single highest-leverage decision you make before fine-tuning. Pick wrong and no amount of data or hyperparameter tuning will close the gap. Pick right and you can ship a model that matches GPT-4 on your task — at a fraction of the cost.
In this guide we compare the three leading small language model families available in 2025 — Qwen 3, Llama 3.2, and Gemma 3 — across real fine-tuning workloads. Every number comes from our 12-model benchmark.
Why the base model matters
Fine-tuning adapts a pre-trained model to your task, but it cannot teach a model capabilities it never learned during pre-training. A base model that already understands structured output, multilingual text, or function-calling schemas will converge faster and generalise better after fine-tuning.
The three factors that matter most:
- Pre-training data mix — models trained on more code and structured data handle tool-calling and information-extraction tasks better out of the box.
- Architecture efficiency — models that use grouped-query attention or mixture-of-experts can deliver higher quality at the same parameter count.
- Instruct-tuning quality — the instruct variant you start from determines how well the model follows task-specific instructions before you even fine-tune.
The contenders
Qwen 3 (0.6 B – 8 B)
Qwen 3 ships in four sizes that matter for fine-tuning: 0.6 B, 1.7 B, 4 B, and 8 B. The family uses grouped-query attention and was pre-trained on a broad multilingual corpus with a heavy code component.
Strengths:
- Consistently ranks first or second across classification, QA, and tool-calling benchmarks after fine-tuning
- The 4 B model hits a sweet spot between quality and inference cost
- Native support for structured output and function-calling schemas
Watch out for:
- The tokeniser is less efficient for some European languages, which hurts most at the 0.6 B scale
- Requires careful LoRA rank selection: quality degrades faster with under-parameterised adapters than it does for Llama
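The rank caveat is easy to quantify: a LoRA adapter for a d_in × d_out weight matrix trains only r · (d_in + d_out) parameters, so the rank r directly caps adapter capacity. A quick pure-Python sketch (the 4096 × 4096 layer shape is illustrative, not Qwen's actual dimensions):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters if the full weight matrix were tuned instead."""
    return d_in * d_out

# Example: a hypothetical 4096 x 4096 projection layer.
d = 4096
for r in (4, 16, 64):
    frac = lora_params(d, d, r) / full_params(d, d)
    print(f"rank {r:>2}: {lora_params(d, d, r):>9,} adapter params "
          f"({frac:.2%} of full fine-tuning)")
```

Too small an r under-fits the task, which is the degradation the caveat refers to; a common heuristic is to start around r = 16 and adjust based on validation loss.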
Llama 3.2 (1 B – 3 B) and Llama 3.1 (8 B)
Meta’s Llama family remains the most widely deployed open-weight model line. Llama 3.2 introduced the 1 B and 3 B parameter tiers, while the 8 B Instruct from the 3.1 generation is still the go-to for tasks that need more capacity.
Strengths:
- Enormous community ecosystem — every framework, quantisation method, and deployment tool supports Llama on day one
- Rock-solid tool-calling performance, especially at 3 B+
- Llama 3.1 8 B Instruct is the most battle-tested 8 B model available
Watch out for:
- The 1 B model underperforms Qwen 3 0.6 B on classification despite being larger
- CPU inference throughput is lower than Gemma 3's at the same parameter count
Gemma 3 (270 M – 4 B)
Google’s Gemma 3 line starts at just 270 M parameters — the smallest model in our comparison. It is designed for on-device and edge use cases.
Strengths:
- The 270 M model is the best option when you need a model that runs on microcontrollers or phones
- Surprisingly competitive on classification tasks after fine-tuning at the 1 B tier
- Fast inference with optimised attention implementations
Watch out for:
- Drops off on open-ended generation and QA at the sub-1 B scale
- Smaller community around fine-tuning recipes compared to Llama
Head-to-head: fine-tuning benchmarks
We fine-tuned every model with LoRA using identical hyperparameters and training data drawn from eight tasks: sentiment classification, intent detection, named-entity recognition, open-book QA, closed-book QA, multi-class classification, single-turn tool calling, and multi-turn tool calling.
| Model | Params | Classification | QA (open) | QA (closed) | Tool Calling | NER |
|---|---|---|---|---|---|---|
| Qwen 3 0.6 B | 0.6 B | 89.2 | 71.4 | 64.8 | 62.1 | 78.3 |
| Gemma 3 270 M | 270 M | 82.1 | 58.3 | 51.2 | 44.7 | 69.4 |
| Gemma 3 1 B | 1 B | 88.4 | 69.1 | 62.3 | 59.8 | 76.9 |
| Llama 3.2 1 B | 1 B | 86.7 | 70.2 | 63.1 | 61.4 | 75.2 |
| Qwen 3 1.7 B | 1.7 B | 91.3 | 75.8 | 69.4 | 68.2 | 82.1 |
| Llama 3.2 3 B | 3 B | 92.1 | 78.4 | 72.6 | 74.3 | 84.7 |
| Qwen 3 4 B | 4 B | 93.7 | 80.2 | 74.1 | 76.8 | 86.4 |
| Gemma 3 4 B | 4 B | 92.8 | 78.9 | 73.2 | 73.1 | 85.1 |
| Qwen 3 8 B | 8 B | 95.1 | 84.6 | 78.9 | 81.2 | 89.7 |
| Llama 3.1 8 B | 8 B | 94.8 | 83.9 | 77.4 | 80.7 | 88.9 |
Scores are task-level accuracy (%) averaged across datasets within each category. Full per-dataset breakdowns are in the benchmark article.
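All of the models above were adapted with LoRA, which keeps the pre-trained weight W frozen and learns a low-rank update; the effective weight is W + (α/r) · B · A. A minimal pure-Python sketch of that merge (tiny 2 × 2 shapes for readability; real layers are thousands of dimensions wide):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight actually used at inference."""
    delta = matmul(B, A)  # rank-r update with the same shape as W
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: a rank-1 adapter on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [1.0]]   # shape (d_out x r)
A = [[2.0, 3.0]]     # shape (r x d_in)
print(merge_lora(W, A, B, alpha=2.0, r=1))
```

After training, this merge is a one-time operation, which is why LoRA adds no inference latency once the adapter is folded into the base weights.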
How to choose
If you need the absolute best quality at 8 B: Qwen 3 8 B and Llama 3.1 8 B are within one point of each other on most tasks. Pick Qwen if multilingual support or structured output matters; pick Llama if ecosystem compatibility is your priority.
If you need 3–4 B for edge/on-prem: Qwen 3 4 B edges out both Gemma 3 4 B and Llama 3.2 3 B across the board. It is the best overall model at this tier.
If you need sub-1 B: Qwen 3 0.6 B is the clear winner over Gemma 3 270 M and Llama 3.2 1 B. If you need something truly tiny for an embedded device, Gemma 3 270 M is your only realistic option.
If tool calling is your primary task: Llama and Qwen are both strong. Gemma lags behind at every parameter count.
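For context on what the tool-calling tasks measure: the model must emit a call that conforms to a declared function schema. A representative single-turn example in the common JSON-schema style (the tool name and checker here are hypothetical, not taken from the benchmark):

```python
# Hypothetical tool definition of the kind used in function-calling evals.
tool = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Internal order ID."},
        },
        "required": ["order_id"],
    },
}

# A correct model response to the user turn "Where is order A-1042?".
model_call = {"name": "get_order_status", "arguments": {"order_id": "A-1042"}}

def is_valid_call(call: dict, schema: dict) -> bool:
    """Minimal check: right tool name and all required arguments present."""
    if call.get("name") != schema["name"]:
        return False
    required = schema["parameters"].get("required", [])
    return all(k in call.get("arguments", {}) for k in required)

print(is_valid_call(model_call, tool))
```

Single-turn scoring reduces to checks like this one; the multi-turn variant additionally requires the model to chain calls and carry arguments across turns, which is where the smaller models in the table lose the most points.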
Fine-tune any of them in minutes
With distil labs you don’t need to write training scripts, manage GPUs, or generate synthetic data manually. Upload your seed examples, pick your base model, and distil labs handles the rest — from synthetic data generation to LoRA fine-tuning to deployment.