The 10x inference tax you don't have to pay
Frontier LLMs keep getting better and cheaper: GPT-5 nano costs $0.05 per million input tokens and Gemini 2.5 Flash Lite is $0.10. At these prices, is there still a case for running your own small models?
If you grab an off-the-shelf small model and point it at a production task, the answer is no. Base small models are simply not good enough. But there's a third option that most teams overlook: fine-tuning changes everything. A small model that has been fine-tuned on your specific task doesn't just close the gap with frontier LLMs, it matches or beats them, while running 10x cheaper on your own hardware.
We tested this across 8 datasets, comparing fine-tuned small models (0.6B to 8B parameters) against 10 frontier LLMs from OpenAI, Anthropic, Google, and xAI. The fine-tuned models ranked first overall on half the tasks and averaged a rank of 3.2, just behind Claude Opus 4.6 (2.5), while beating out Gemini 2.5 Flash (3.5) at 100x lower cost.
If you care about efficiency and LLM inference meaningfully shows up on your bill, let us walk you through the details. All the code, models, and data for this post are available in this repository, and you can reproduce everything using our platform.
What is Distil Labs?
Distil Labs is a platform for training task-specific small language models. You provide a task description and a handful of examples; we handle synthetic data generation, validation, fine-tuning, and evaluation. The result: models 50 to 400x smaller than frontier LLMs that maintain comparable accuracy and run at 10% of the price. The hard parts of fine-tuning (collecting data, picking the right base model, tuning hyperparameters, validating quality) are exactly what we automate. Check out our docs if you want to dive deeper.
Results: Same accuracy, 100x cheaper
We evaluated fine-tuned models against up to 11 frontier LLMs per dataset, including both mid-tier (GPT-5 nano, Gemini 2.5 Flash Lite, Grok 4.1 Fast, GPT-5 mini, Gemini 2.5 Flash, Claude Haiku 4.5) and premium models (GPT-5.2, Sonnet 4.6, Grok 4, Opus 4.6). For each dataset, we ranked all models by accuracy. The "Avg Rank" column below is each model's mean position across all datasets it was evaluated on: 1.0 means it came first on every task, higher is worse.
Frontier model costs are averages across datasets, computed from measured API token usage over 3 runs each. Fine-tuned model costs are computed from sustained vLLM throughput on a single H100 GPU at $2.40/hr.
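To make the "Avg Rank" metric concrete, here is a toy sketch of how it can be computed; the model names and accuracy numbers below are made up for illustration, not taken from the benchmark.

```python
# For each dataset, sort models by accuracy (higher is better), record each
# model's 1-based position, then average positions across datasets.
# Ties fall back to insertion order here; the real benchmark may differ.
from collections import defaultdict

accuracies = {
    "text2sql":   {"slm-4b": 0.98, "opus": 1.00, "flash": 0.98},
    "smart-home": {"slm-4b": 0.97, "opus": 0.95, "flash": 0.93},
}

positions = defaultdict(list)
for dataset, scores in accuracies.items():
    ranked = sorted(scores, key=scores.get, reverse=True)
    for pos, model in enumerate(ranked, start=1):
        positions[model].append(pos)

avg_rank = {m: sum(p) / len(p) for m, p in positions.items()}
# e.g. "opus" placed 1st and 2nd, so its average rank is 1.5
```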
The fine-tuned models rank 3.2 on average, just behind Opus 4.6 (2.5). The cost difference: $3 for fine-tuned SLMs per million requests vs. $6,241 for Opus. The closest competitor on rank is Gemini 2.5 Flash (3.5), which costs $313 per million requests, over 100x more expensive for essentially the same quality.
If you're processing a million requests per day on well-structured problems, a dedicated fine-tuned model will almost certainly save you money, even accounting for training and deployment overhead.
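A quick back-of-the-envelope calculation makes the break-even point concrete. The per-million-request costs come from the benchmark above; the one-time training cost is a hypothetical placeholder, not a quoted price.

```python
# Break-even sketch: daily savings vs. an assumed one-time training cost.
requests_per_day = 1_000_000
frontier_cost_per_million = 313.0  # Gemini 2.5 Flash, measured in this post
slm_cost_per_million = 3.0         # fine-tuned SLM on an H100, measured
training_cost = 500.0              # HYPOTHETICAL one-time fine-tuning cost

daily_savings = (frontier_cost_per_million - slm_cost_per_million) \
    * requests_per_day / 1_000_000
days_to_break_even = training_cost / daily_savings
# At $310/day saved, even a few-hundred-dollar training run pays for
# itself within days.
```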
Results: For well-structured tasks, fine-tuned small models match or beat even the most expensive frontier models
The per-dataset breakdown shows where fine-tuned models win outright and where frontier models hold an edge. The fine-tuned model ranks first on 4 out of 8 tasks, beating every frontier model tested, including the most expensive ones. The gains are largest where the task is narrow and well-defined (function calling, classification, entity extraction), and smaller where broad reasoning or free-form generation is involved.
★ Smart Home and Banking77 use Qwen3-0.6B; Docstring uses Qwen3-8B; all others use Qwen3-4B.
Setup
We compared fine-tuned models (0.6B to 8B parameters) against 10 frontier LLMs across 8 datasets. All task-specific models were fine-tuned using Distil Labs, mostly with the Qwen3 family as the base. All were served on a single GPU via vLLM using the Chat Completions API with thinking disabled. We used the following datasets:
Every model was evaluated on the same test set with the same prompts, same evaluation criteria, and minimal reasoning/effort. For all LLM-as-a-judge runs, we used Claude Sonnet 4.6 with default effort. Frontier models were each run 3 times to measure variance; we report means with standard deviations. The fine-tuned models default to temperature 0, so we report a single run. For more benchmarking results focusing on trainability, see our previous blog post.
Pricing of SLMs
While Frontier LLM APIs charge per token, Distil Labs models run on dedicated GPUs, charged by uptime. This means exact pricing depends on utilization: the more requests you push through a GPU, the cheaper it gets. We report our numbers assuming full utilization since many real workloads get close to it (you only pay when the GPU is processing requests). The conclusions still hold up even assuming a pessimistic 10% utilization.
Specifically, we report sustained throughput on a single H100 GPU node (~$2.40/hr). Since these models are small, they fit on much smaller GPUs as well; however, most of our tasks are prefill-heavy and decode-light. In such scenarios, the H100's FLOP advantage really shines and outweighs the fact that 80GB of memory is overkill.
Deep dive: Text2SQL
Text2SQL is a good test of the accuracy-cost tradeoff because it requires genuine reasoning — translating natural language questions into SQL queries across custom schemas spanning e-commerce, HR, healthcare, finance, education, and social domains. For example, given this input:
Schema:
```sql
CREATE TABLE clinics (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  address TEXT,
  phone TEXT
);

CREATE TABLE visits (
  id INTEGER PRIMARY KEY,
  clinic_id INTEGER REFERENCES clinics(id),
  patient_name TEXT,
  visit_date DATE,
  diagnosis TEXT
);
```
Question:

How many patient visits per clinic this year?

We'd like to get something like:

```sql
SELECT c.name, COUNT(*) FROM clinics c JOIN visits v ON c.id = v.clinic_id WHERE v.visit_date >= '2026-01-01' GROUP BY c.id, c.name;
```

This task is straightforward, and it's clear that the distilled 4B model matches models with orders of magnitude more parameters and cost (notice the x-axis is log-scale!). Our distilled 4B model hits 98%, matching Claude Sonnet 4.6, GPT-5 mini, and Gemini 2.5 Flash, and landing only 2 points behind Claude Opus 4.6's perfect score. Even the 1.7B model at 94% is within run-to-run variance of GPT-5 nano and Flash Lite (96.0% mean across 3 runs, with individual runs ranging 94–98%).
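One nice property of Text2SQL as a benchmark task is that outputs are mechanically checkable. A query like the one above can be sanity-checked against an in-memory SQLite database; the toy rows below are invented for illustration.

```python
# Execute the expected query against the example schema with toy data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clinics (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      address TEXT, phone TEXT);
CREATE TABLE visits (id INTEGER PRIMARY KEY,
                     clinic_id INTEGER REFERENCES clinics(id),
                     patient_name TEXT, visit_date DATE, diagnosis TEXT);
INSERT INTO clinics (id, name) VALUES (1, 'Northside'), (2, 'Downtown');
INSERT INTO visits (clinic_id, patient_name, visit_date) VALUES
    (1, 'A', '2026-02-01'), (1, 'B', '2026-03-15'), (2, 'C', '2025-12-30');
""")

rows = conn.execute(
    "SELECT c.name, COUNT(*) FROM clinics c "
    "JOIN visits v ON c.id = v.clinic_id "
    "WHERE v.visit_date >= '2026-01-01' GROUP BY c.id, c.name;"
).fetchall()
# Only the two 2026 visits at Northside survive the date filter:
# rows == [('Northside', 2)]
```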
Grok 4 does not support setting reasoning effort, so the token counts and corresponding costs are inflated compared to other models.

Same accuracy as Sonnet 4.6 and GPT-5 mini. $3 per million requests vs. $24 for GPT-5 nano. That's an 8x cost reduction over the cheapest frontier option, at higher accuracy.
We report sustained throughput on a single H100 GPU node (~$2.40/hr). At the measured ceiling of 222 RPS for the Text2SQL 4B model, a single GPU handles over 19 million requests per day.
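The arithmetic from GPU price and throughput to cost per million requests is simple enough to show in full, using the numbers quoted above:

```python
# Convert GPU price and sustained throughput into cost per million requests.
gpu_cost_per_hour = 2.40
rps = 222  # measured ceiling for the Text2SQL 4B model

requests_per_hour = rps * 3600
cost_per_million = gpu_cost_per_hour / requests_per_hour * 1_000_000
requests_per_day = rps * 86_400  # over 19 million requests per day

# Under the pessimistic 10% utilization assumption, the effective
# per-request price is simply 10x higher (~$30/M), still far below
# the frontier API costs measured in this post.
cost_per_million_at_10pct = cost_per_million * 10
```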
We're keeping all models in BF16 and not exploring quantization here, though depending on the scenario it can be useful. In brief experiments, FP8 quantization gave us an additional 15% throughput boost with 44% less memory and no measurable accuracy loss. Expect to read more about quantization on our blog soon!
Practical recommendations
In short: you should distill specialist models to handle all your structured tasks and route the open-ended problems to larger, generalist models. Not every task is a good candidate for distillation (this might change in the future!), and that's OK. The best production setups smartly combine both. In other words, use distillation when:
- The task has a well-defined structure (function calling, classification, SQL generation).
- Frontier models haven't seen your specific schema or domain.
- Cost at scale matters — you're making millions of requests.
- Data can't leave your infrastructure — a self-hosted model means no patient records, financial data, or PII ever hits a third-party API (our PII Redaction Healthcare dataset scored 94.0% with everything running on-premise).
Route to a frontier API when:
- The task requires broad world knowledge, such as coding or general conversation.
- Freeform generation quality matters.
- The task is low-volume enough that it barely shows up on your bill.
Most production LLM spend goes to structured, high-volume tasks, exactly where distillation delivers the biggest wins. Route your open-ended or low-volume tasks to a frontier API, and distill everything else.
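The routing logic above can be sketched in a few lines. The task categories, volume threshold, and endpoint names here are illustrative assumptions, not part of any real API:

```python
# Minimal routing sketch: structured, high-volume tasks go to a distilled
# SLM; everything else goes to a frontier API. Threshold is illustrative.
STRUCTURED_TASKS = {
    "function_calling", "classification",
    "entity_extraction", "sql_generation",
}

def route(task_type: str, monthly_requests: int) -> str:
    """Pick a backend for a task based on structure and volume."""
    if task_type in STRUCTURED_TASKS and monthly_requests >= 1_000_000:
        return "distilled-slm"
    return "frontier-api"
```

In practice the volume cutoff would come from your own break-even math, and new task types would default to the frontier API until a distilled model is validated.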
There's also a system maturity angle here: if a task doesn't distill well, it may be too broad. Breaking it into narrower subtasks allows selectively picking off better distillation candidates. For example, instead of "answer any question about this domain," split the task into entity extraction, classification, and targeted generation.
More datasets
GPT-5 nano is the most efficient frontier option on every dataset, ranging from ~$17/M requests (E-commerce) to ~$69/M requests (Git Assistant), though the quality of its answers mostly lags behind alternatives. On the other end of the spectrum is Opus 4.6, which consistently provides the best answers in our benchmarks while being the most expensive model we've tested.
Tool calling
Tool-calling datasets get 2–4x fewer requests per dollar than classification or QA because tool schemas inflate the prompt token count. This task most clearly shows the strength of distillation, where a structured, well-defined problem can be solved by a specialized tiny model (for Smart Home we used Qwen3 0.6B!), outperforming generic alternatives on both quality and efficiency.

The git-assistant dataset is also a function-calling problem; however, here the picture is a little different. Distillation still clearly wins on efficiency, but the quality question is more nuanced, since frontier models are very good at using git (it's well-represented in their training data).

Information extraction & question answering
Similarly, non-function-calling problems with structured outputs lend themselves well to distillation, and the PII redaction dataset demonstrates this: the distilled model outperforms the larger, more generic models on both efficiency and response quality.
There's a similar story on the Text2SQL problem; however, just like with git-assistant, existing models are great at writing SQL, so there isn't much room to outperform them.
Finally, the Docstring results illustrate something interesting too: while we do expect structured output in this problem, part of it is a free-form, plain-language function description. In the general case, understanding and describing a function requires general, rather than specialized, reasoning capability, and that's a difficult area for a small model to compete in.


Open-book question answering problems ask the model to formulate an answer to a question given raw information in the form of “chunks” (e.g. from a RAG system).

Classification
Classification is a very well-defined problem and the distilled models are competitive with (and much cheaper than!) the frontier labs’ cloud models.



Methodology notes
- We used the same test set for distilled and frontier models on every dataset.
- Same evaluation criteria: exact-match accuracy for classification, `tool_call_equivalence` for function calling (i.e. JSON comparison after default parameter normalization), LLM-as-a-judge (Claude Sonnet 4.6) for generation tasks.
- Distilled model training: 50 training examples per dataset (fewer for some).
- Teacher models: a mixture of large open-weight models (not frontier APIs; Distil Labs doesn't train on outputs from closed models like GPT-5 or Claude).
- Student models: Qwen3-4B-Instruct for most datasets; Qwen3-0.6B through 8B for the Text2SQL deep-dive; Qwen3-0.6B through 1.7B for Smart Home.
- Variance: Frontier models were run 3 times per dataset; we report mean ± std. Distilled models use temperature 0 by default, so we skip multiple runs (the results would be the same).
- Cost calculation: Frontier costs computed from measured API token usage per dataset. Distilled costs computed from H100 GPU time at $2.40/hr divided by measured sustained RPS.
- Pricing snapshot from February 2026.
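To illustrate what "JSON comparison after default parameter normalization" means for the tool-calling metric, here is a toy sketch. The function names, schema-defaults format, and normalization rules are assumptions for illustration; the actual `tool_call_equivalence` implementation may differ.

```python
# Two tool calls are treated as equivalent if they name the same tool and
# their arguments agree once schema defaults are filled in. Toy version.
def normalize(call: dict, defaults: dict) -> dict:
    """Fill in schema defaults, then keep only name and arguments."""
    args = {**defaults.get(call["name"], {}), **call.get("arguments", {})}
    return {"name": call["name"], "arguments": args}

def tool_calls_equivalent(a: dict, b: dict, defaults: dict) -> bool:
    return normalize(a, defaults) == normalize(b, defaults)

# Hypothetical schema default: set_light(brightness=100)
defaults = {"set_light": {"brightness": 100}}
pred = {"name": "set_light", "arguments": {"room": "kitchen"}}
gold = {"name": "set_light",
        "arguments": {"room": "kitchen", "brightness": 100}}
# pred omits brightness but matches gold once the default is applied
```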
Start saving
Most of your inference spend is going to structured tasks that a fine-tuned small model can handle just as well, or better, than a frontier LLM. The hard part has always been the fine-tuning itself: collecting training data, choosing the right base model, running experiments, validating quality. That's what Distil Labs automates.
Give us a task description and 50 examples. We'll generate synthetic training data, fine-tune a model, and deliver a production-ready small expert in under 12 hours. No ML team required.
Sign up at distillabs.ai and stop paying the inference tax.
