Best Small Language Model for Fine-Tuning in 2025
Choosing the right base model is the single highest-leverage decision you make before fine-tuning. Pick wrong and no amount of data or hyperparameter tuning will close the gap. Pick right and you can ship a model that matches GPT-4 on your task — at a fraction of the cost.
In this guide we compare the three leading small language model families available in 2025 — Qwen 3, Llama 3.2, and Gemma 3 — across real fine-tuning workloads. Every number comes from our 12-model benchmark.
Why the base model matters
Fine-tuning adapts a pre-trained model to your task, but it cannot teach a model capabilities it never learned during pre-training. A base model that already understands structured output, multilingual text, or function-calling schemas will converge faster and generalise better after fine-tuning.
The three factors that matter most:
- Pre-training data mix — models trained on more code and structured data handle tool-calling and information-extraction tasks better out of the box.
- Architecture efficiency — models that use grouped-query attention or mixture-of-experts can deliver higher quality at the same parameter count.
- Instruct-tuning quality — the instruct variant you start from determines how well the model follows task-specific instructions before you even fine-tune.
The contenders
Qwen 3 (0.6 B – 8 B)
Qwen 3 ships in four sizes that matter for fine-tuning: 0.6 B, 1.7 B, 4 B, and 8 B. The family uses grouped-query attention and was pre-trained on a broad multilingual corpus with a heavy code component.
Strengths:
- Consistently ranks first or second across classification, QA, and tool-calling benchmarks after fine-tuning
- The 4 B model hits a sweet spot between quality and inference cost
- Native support for structured output and function-calling schemas
Watch out for:
- The tokeniser is less efficient for some European languages, which hurts most at the 0.6 B scale
- Requires careful LoRA rank selection: quality degrades faster with under-parameterised adapters than it does for Llama
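The rank caveat is easy to quantify: a LoRA adapter for a d_in × d_out weight matrix trains only r · (d_in + d_out) parameters, so the rank r directly caps adapter capacity. A quick pure-Python sketch (the 4096 × 4096 layer shape is illustrative, not Qwen's actual dimensions):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters if the full weight matrix were tuned instead."""
    return d_in * d_out

# Example: a hypothetical 4096 x 4096 projection layer.
d = 4096
for r in (4, 16, 64):
    frac = lora_params(d, d, r) / full_params(d, d)
    print(f"rank {r:>2}: {lora_params(d, d, r):>9,} adapter params "
          f"({frac:.2%} of full fine-tuning)")
```

Too small an r under-fits the task, which is the degradation the caveat refers to; a common heuristic is to start around r = 16 and adjust based on validation loss.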
Llama 3.2 (1 B – 3 B) and Llama 3.1 (8 B)
Meta’s Llama family remains the most widely deployed open-weight model line. Llama 3.2 introduced the 1 B and 3 B parameter tiers, while the 8 B Instruct from the 3.1 generation is still the go-to for tasks that need more capacity.
Strengths:
- Enormous community ecosystem — every framework, quantisation method, and deployment tool supports Llama on day one
- Rock-solid tool-calling performance, especially at 3 B+
- Llama 3.1 8 B Instruct is the most battle-tested 8 B model available
Watch out for:
- The 1 B model underperforms Qwen 3 0.6 B on classification despite being larger
- CPU inference throughput is lower than Gemma 3's at the same parameter count
Gemma 3 (270 M – 4 B)
Google’s Gemma 3 line starts at just 270 M parameters — the smallest model in our comparison. It is designed for on-device and edge use cases.
Strengths:
- The 270 M model is the best option when you need a model that runs on microcontrollers or phones
- Surprisingly competitive on classification tasks after fine-tuning at the 1 B tier
- Fast inference with optimised attention implementations
Watch out for:
- Drops off on open-ended generation and QA at the sub-1 B scale
- Smaller community around fine-tuning recipes compared to Llama
Head-to-head: fine-tuning benchmarks
We fine-tuned every model with LoRA using identical hyperparameters and training data drawn from eight tasks: sentiment classification, intent detection, named-entity recognition, open-book QA, closed-book QA, multi-class classification, single-turn tool calling, and multi-turn tool calling.
| Model | Params | Classification | QA (open) | QA (closed) | Tool Calling | NER |
|---|---|---|---|---|---|---|
| Qwen 3 0.6 B | 0.6 B | 89.2 | 71.4 | 64.8 | 62.1 | 78.3 |
| Gemma 3 270 M | 270 M | 82.1 | 58.3 | 51.2 | 44.7 | 69.4 |
| Gemma 3 1 B | 1 B | 88.4 | 69.1 | 62.3 | 59.8 | 76.9 |
| Llama 3.2 1 B | 1 B | 86.7 | 70.2 | 63.1 | 61.4 | 75.2 |
| Qwen 3 1.7 B | 1.7 B | 91.3 | 75.8 | 69.4 | 68.2 | 82.1 |
| Llama 3.2 3 B | 3 B | 92.1 | 78.4 | 72.6 | 74.3 | 84.7 |
| Qwen 3 4 B | 4 B | 93.7 | 80.2 | 74.1 | 76.8 | 86.4 |
| Gemma 3 4 B | 4 B | 92.8 | 78.9 | 73.2 | 73.1 | 85.1 |
| Qwen 3 8 B | 8 B | 95.1 | 84.6 | 78.9 | 81.2 | 89.7 |
| Llama 3.1 8 B | 8 B | 94.8 | 83.9 | 77.4 | 80.7 | 88.9 |
Scores are task-level accuracy (%) averaged across datasets within each category. Full per-dataset breakdowns are in the benchmark article.
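All of the models above were adapted with LoRA, which keeps the pre-trained weight W frozen and learns a low-rank update; the effective weight is W + (α/r) · B · A. A minimal pure-Python sketch of that merge (tiny 2 × 2 shapes for readability; real layers are thousands of dimensions wide):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight actually used at inference."""
    delta = matmul(B, A)  # rank-r update with the same shape as W
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: a rank-1 adapter on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [1.0]]   # shape (d_out x r)
A = [[2.0, 3.0]]     # shape (r x d_in)
print(merge_lora(W, A, B, alpha=2.0, r=1))
```

After training, this merge is a one-time operation, which is why LoRA adds no inference latency once the adapter is folded into the base weights.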
How to choose
If you need the absolute best quality at 8 B: Qwen 3 8 B and Llama 3.1 8 B are within one point of each other on most tasks. Pick Qwen if multilingual support or structured output matters; pick Llama if ecosystem compatibility is your priority.
If you need 3–4 B for edge/on-prem: Qwen 3 4 B edges out both Gemma 3 4 B and Llama 3.2 3 B across the board. It is the best overall model at this tier.
If you need sub-1 B: Qwen 3 0.6 B is the clear winner over Gemma 3 270 M and Llama 3.2 1 B. If you need something truly tiny for an embedded device, Gemma 3 270 M is your only realistic option.
If tool calling is your primary task: Llama and Qwen are both strong. Gemma lags behind at every parameter count.
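For context on what the tool-calling tasks measure: the model must emit a call that conforms to a declared function schema. A representative single-turn example in the common JSON-schema style (the tool name and checker here are hypothetical, not taken from the benchmark):

```python
# Hypothetical tool definition of the kind used in function-calling evals.
tool = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Internal order ID."},
        },
        "required": ["order_id"],
    },
}

# A correct model response to the user turn "Where is order A-1042?".
model_call = {"name": "get_order_status", "arguments": {"order_id": "A-1042"}}

def is_valid_call(call: dict, schema: dict) -> bool:
    """Minimal check: right tool name and all required arguments present."""
    if call.get("name") != schema["name"]:
        return False
    required = schema["parameters"].get("required", [])
    return all(k in call.get("arguments", {}) for k in required)

print(is_valid_call(model_call, tool))
```

Single-turn scoring reduces to checks like this one; the multi-turn variant additionally requires the model to chain calls and carry arguments across turns, which is where the smaller models in the table lose the most points.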
Fine-tune any of them in minutes
With distil labs you don’t need to write training scripts, manage GPUs, or generate synthetic data manually. Upload your seed examples, pick your base model, and distil labs handles the rest — from synthetic data generation to LoRA fine-tuning to deployment.