# LoRA vs Full Fine-Tuning: When to Use What
Fine-tuning a language model means updating its weights so it performs better on your specific task. But you don’t always need to update every weight. That’s the core idea behind LoRA (Low-Rank Adaptation) — and understanding when it’s enough (and when it isn’t) can save you significant time and compute.
## What Is Full Fine-Tuning?
Full fine-tuning updates all of the model’s parameters during training. For a 1B-parameter model, that means all one billion weights are adjusted to fit your dataset.
Pros:
- Maximum flexibility — the model can change its behaviour significantly
- Often delivers the best possible accuracy on a given task
- Well-understood, standard approach
Cons:
- High GPU memory requirements — you need to store gradients and optimizer states for every parameter
- Slower training iterations
- Risk of catastrophic forgetting on general capabilities
- Produces a full-size model checkpoint for every experiment
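The memory cost is easy to estimate with back-of-envelope arithmetic. A common mixed-precision Adam setup keeps, per parameter: an fp16 weight and gradient, an fp32 master copy, and two fp32 optimizer moments. The numbers below are a rough sketch under those assumptions (activations and framework overhead excluded):

```python
def full_ft_memory_gb(n_params: float) -> float:
    """Rough training-memory estimate for full fine-tuning with Adam
    in mixed precision, excluding activations.

    Per parameter: fp16 weight (2 B) + fp16 gradient (2 B)
    + fp32 master copy (4 B) + two fp32 Adam moments (4 B + 4 B) = 16 B.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(f"1B model: ~{full_ft_memory_gb(1e9):.0f} GB before activations")
# a 1B-parameter model needs on the order of 16 GB just for weights,
# gradients, and optimizer state
```

This is why full fine-tuning of even a 1B model pushes past consumer-GPU memory once activations are added on top.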
## What Is LoRA?
LoRA freezes the original model weights and injects small trainable matrices into each transformer layer. Instead of updating a large weight matrix W directly, LoRA learns two low-rank matrices B and A such that the update is ΔW = B × A, scaled by a factor α/r. The rank r (the "r" in LoRA) controls how expressive the adaptation is, and alpha (α) controls how strongly the learned update is applied relative to the frozen weights.
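To make the parameter savings concrete, here is a minimal NumPy sketch of the low-rank update. The dimensions, rank, and alpha are illustrative, not tied to any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 512, 512, 8          # layer dimensions and LoRA rank (illustrative)
alpha = 16                     # LoRA scaling hyperparameter

W = rng.normal(size=(d, k))    # frozen pretrained weight matrix
# Standard LoRA init: B starts at zero, so the adapted model
# behaves exactly like the base model before training
B = np.zeros((d, r))
A = rng.normal(size=(r, k))

delta_W = (alpha / r) * (B @ A)   # low-rank update, rank at most r

x = rng.normal(size=(k,))
y = (W + delta_W) @ x             # adapted forward pass

full_params = d * k               # trainable params in full fine-tuning
lora_params = d * r + r * k       # trainable params with LoRA
print(f"full: {full_params:,}, LoRA: {lora_params:,}")
```

At these sizes the trainable parameter count drops from 262,144 to 8,192, about 3% of the full matrix — and the ratio improves further as d and k grow while r stays small.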
Pros:
- Dramatically lower memory usage — often 5–10× less than full fine-tuning
- Faster training iterations
- Tiny adapter files (often < 100 MB) instead of full model checkpoints
- Easy to swap adapters for different tasks on the same base model
Cons:
- Slightly lower ceiling on task accuracy for complex tasks
- Choosing the right rank and alpha requires some experimentation
- Not every architecture benefits equally
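The "tiny adapter files" claim above follows directly from the parameter counts. A hedged sketch, assuming LoRA is applied to the four square attention projections of a hypothetical 1B-scale model (24 layers, d_model = 2048 — illustrative numbers, not a specific architecture):

```python
def lora_adapter_size_mb(n_layers: int, d_model: int, r: int,
                         targets_per_layer: int = 4,
                         bytes_per_param: int = 2) -> float:
    """Approximate on-disk size of fp16 LoRA adapter weights.

    Each adapted square d_model x d_model projection adds two matrices,
    A (r x d_model) and B (d_model x r) -> 2 * d_model * r parameters.
    """
    params = n_layers * targets_per_layer * 2 * d_model * r
    return params * bytes_per_param / 1e6

# 24 layers, d_model=2048, rank 32, q/k/v/o projections adapted
print(f"~{lora_adapter_size_mb(24, 2048, 32):.0f} MB")
```

That works out to roughly 25 MB, squarely in the 10–100 MB range quoted above, versus multiple gigabytes for a full checkpoint of the same model.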
## When to Choose LoRA
LoRA is the right default for most practical scenarios:
- Limited GPU budget — you can fine-tune a 3B model on a single consumer GPU with LoRA
- Multiple tasks on one base model — swap lightweight adapters instead of managing multiple full checkpoints
- Rapid experimentation — shorter training cycles mean faster iteration on data and hyperparameters
- The task is well-scoped — classification, NER, and structured extraction tasks rarely need full fine-tuning to reach production quality
## When to Choose Full Fine-Tuning
Full fine-tuning still makes sense in specific situations:
- The task requires deep behavioural change — for example, teaching a model a new output format or reasoning style that differs substantially from its pre-training
- You have ample compute and data — if cost isn’t a constraint and you want to squeeze out every last point of accuracy
- Very small models — for models under 500M parameters, the memory savings of LoRA are less meaningful, and full fine-tuning is straightforward
## Practical Recommendations
| Factor | LoRA | Full Fine-Tuning |
|---|---|---|
| GPU memory (1B model) | ~6–8 GB | ~24+ GB |
| Training speed | Faster per step | Slower per step |
| Adapter size | 10–100 MB | Full model (2–16 GB) |
| Accuracy ceiling | Very close to full FT | Highest possible |
| Best for | Scoped tasks, fast iteration | Deep adaptation, small models |
For most teams fine-tuning small language models for production tasks like classification, question answering, or tool calling, LoRA is the recommended starting point. You can always fall back to full fine-tuning if you hit an accuracy wall — but in our benchmarks across dozens of tasks, that rarely happens.
## How distil labs Handles This
When you fine-tune a model with distil labs, LoRA is enabled by default. The platform automatically selects a sensible rank and alpha based on the model size and task type, so you don’t need to tune these hyperparameters yourself. If you need full fine-tuning, you can disable LoRA in your configuration — but we recommend trying LoRA first.