The 10x inference tax you don't have to pay

Frontier LLMs keep getting better and cheaper: GPT-5 nano costs $0.05 per million input tokens and Gemini 2.5 Flash Lite is $0.10. At these prices, is there still a case for running your own small models?

We ran comprehensive benchmarks across 9 datasets spanning multiple task categories to find out. The answer: yes, decisively. For many real-world tasks, small specialized models match the quality of frontier-lab hosted models while being faster and cheaper. And you can get there with as few as 50 training examples, self-hosting the resulting models on your own infrastructure.

If you care about efficiency and LLM inference meaningfully shows up on your bill, let us walk you through the details. All the code, models, and data for this post are available in this repository, and you can reproduce the results using our platform.

The take-away

The distilled models consistently match or beat mid-sized frontier models, and even beat the biggest models in 4 of 9 cases. At the same time, they can be served at a fraction of the cost, cutting inference cost and latency by roughly a factor of 10.

If you're processing a million requests per day for well-structured problems with mid-tier frontier models, a dedicated distilled model will almost certainly save you money, even accounting for training and deployment overhead.

Setup

We compared distilled models (0.6B–8B parameters) against 10 frontier LLMs from OpenAI, Anthropic, Google, and xAI across 9 datasets spanning classification, question answering, and function calling. All task-specific models were distilled using distil labs, mostly with the Qwen3 family as the base. Each distilled model is then served on a single GPU via vLLM behind an OpenAI-compatible Chat Completions API, with thinking disabled.
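As a sketch, a request to one of these vLLM endpoints could look like the following. The served model name, port, and prompts are placeholders; `chat_template_kwargs.enable_thinking` is the Qwen3 chat-template switch that vLLM forwards to disable thinking mode:

```python
# Sketch of a Chat Completions request body for a local vLLM server.
# Placeholders: model name, port, prompts. "enable_thinking": False is
# the Qwen3 switch for no-thinking mode, passed via chat_template_kwargs.
import json

payload = {
    "model": "qwen3-4b-distilled",  # placeholder served-model name
    "messages": [
        {"role": "system", "content": "Classify the banking intent."},
        {"role": "user", "content": "My card still hasn't arrived."},
    ],
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
# POST this body to http://localhost:8000/v1/chat/completions
```

Any OpenAI-compatible client works the same way; the only non-standard field is `chat_template_kwargs`, which vLLM accepts as an extra body parameter.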

Dataset                  | Category           | Test Size | Eval Metric
Smart Home               | Function Calling   | 50        | Tool call equivalence
Git Assistant            | Function Calling   | 116       | Tool call equivalence
PII Redaction Healthcare | Question Answering | 133       | LLM-as-a-judge
Text2SQL                 | Question Answering | 50        | LLM-as-a-judge
Docstring Generation     | Question Answering | 253       | LLM-as-a-judge
HotpotQA                 | Open-Book QA       | 200       | LLM-as-a-judge
Banking77                | Classification     | 200       | Accuracy
E-commerce               | Classification     | 200       | Accuracy
TREC                     | Classification     | 200       | Accuracy

Every model was evaluated on the same test set, with the same prompts and evaluation criteria, and with minimal reasoning effort. For all LLM-as-a-judge runs, we used Claude Sonnet 4.6 at default effort. Each frontier model was run 3 times to measure variance; we report means with standard deviations. The distilled models default to temperature 0, so we report a single run (repeated runs would produce identical results). Prices were collected in February 2026.

The most apples-to-apples comparison is between the distilled models and the six mid-tier frontier models priced at or below $1/MTok input and $5/MTok output (GPT-5 nano, Gemini 2.5 Flash Lite, Grok 4.1 Fast, GPT-5 mini, Gemini 2.5 Flash, Claude Haiku 4.5), since these are the realistic choices for high-volume production workloads. The four premium models (GPT-5.2, Sonnet 4.6, Grok 4, Opus 4.6) are also included for reference in the per-dataset breakdowns below, but their cost premium is massive, making them less practical at scale. We ran into rate limits with the large Gemini models and will update those numbers once this is resolved; judging by intermediate results, the overall picture is unlikely to change.

For more benchmarking results focusing on trainability, see our previous blog post.

Results

Across all 9 datasets, a distilled model matches or beats the best mid-tier frontier model on 6 out of 9 tasks, effectively ties on a seventh, and comes within a few points on the remaining two.

Dataset       | Distilled | Best Alternative | Best Alternative Model
Smart Home ★  | 98.7% †   | 92.0% ± 0.0pp    | Gemini 2.5 Flash
TREC          | 92.5%     | 88.0% ± 0.5pp    | Gemini 2.5 Flash
PII Redaction | 94.0%     | 91.0% ± 2.3pp    | Gemini 2.5 Flash
E-commerce    | 89.0%     | 88.7% ± 0.8pp    | Gemini 2.5 Flash
Docstring ★   | 90.1%     | 90.2% ± 0.5pp    | Claude Haiku 4.5
Text2SQL      | 98.0%     | 98.7% ± 0.9pp    | Claude Haiku 4.5
Banking77 ★   | 88.0%     | 89.5% ± 0.5pp    | Gemini 2.5 Flash
Git Assistant | 92.2%     | 95.7% ± 0.0pp    | Gemini 2.5 Flash
HotpotQA      | 92.0%     | 98.0% ± 0.4pp    | Claude Haiku 4.5

★ Smart Home and Banking77 use Qwen3-0.6B; Docstring is on Qwen3-8B; all others use Qwen3-4B.
† The raw Smart Home gap is +6.7pp, but much of it comes from the strict evaluation penalizing frontier models for reasonable alternative interpretations (e.g. choosing "unsupported_device" vs "off_topic" for an alarm request). We consider this a strong match rather than a clear win.

Deep dive: Text2SQL

Text2SQL is a good test of the accuracy-cost tradeoff because it requires genuine reasoning — translating natural language questions into SQL queries across custom schemas spanning e-commerce, HR, healthcare, finance, education, and social domains. For example, given this input:

Schema:
CREATE TABLE clinics (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  address TEXT,
  phone TEXT
);
CREATE TABLE visits (
  id INTEGER PRIMARY KEY,
  clinic_id INTEGER REFERENCES clinics(id),
  patient_name TEXT,
  visit_date DATE,
  diagnosis TEXT
);

Question:
How many patient visits per clinic this year?

We'd like to get something like

SELECT c.name, COUNT(*) FROM clinics c JOIN visits v ON c.id = v.clinic_id WHERE v.visit_date >= '2026-01-01' GROUP BY c.id, c.name;
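As a quick sanity check (not part of the benchmark), the expected query can be run against the schema with Python's built-in sqlite3; the sample rows below are invented for illustration:

```python
# Run the expected Text2SQL answer against the schema in an in-memory
# SQLite database. The inserted rows are made up for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clinics (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      address TEXT, phone TEXT);
CREATE TABLE visits (id INTEGER PRIMARY KEY,
                     clinic_id INTEGER REFERENCES clinics(id),
                     patient_name TEXT, visit_date DATE, diagnosis TEXT);
INSERT INTO clinics VALUES (1, 'Northside', NULL, NULL),
                           (2, 'Downtown', NULL, NULL);
INSERT INTO visits VALUES (1, 1, 'A', '2026-01-15', NULL),
                          (2, 1, 'B', '2026-03-02', NULL),
                          (3, 2, 'C', '2025-12-30', NULL);
""")
rows = conn.execute(
    "SELECT c.name, COUNT(*) FROM clinics c "
    "JOIN visits v ON c.id = v.clinic_id "
    "WHERE v.visit_date >= '2026-01-01' GROUP BY c.id, c.name"
).fetchall()
# Northside has two 2026 visits; Downtown's only visit is from 2025,
# so it drops out of the result entirely.
```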

On this task the distilled 4B model matches models with orders of magnitude more parameters and cost. It hits 98%, tying Claude Sonnet 4.6 and Grok 4.1 Fast, edging out GPT-5 mini and Gemini 2.5 Flash, and landing only 2 points behind Claude Opus 4.6's perfect score. Even the 1.7B model at 94% sits within the run-to-run variance of GPT-5 nano and Flash Lite (96.0% mean across 3 runs, with individual runs ranging 94–98%).

Model                 | Mean Score ± Std | $ / M requests
Claude Opus 4.6       | 100.0% ± 0.0pp   | $1,623
GPT-5.2               | 98.7% ± 1.2pp    | $582
Claude Haiku 4.5      | 98.7% ± 1.2pp    | $378
grok-4-0709           | 98.7% ± 2.3pp    | $2,890
Qwen3-4B (distilled)  | 98.0%            | $3.00
Claude Sonnet 4.6     | 98.0% ± 0.0pp    | $1,042
Grok 4.1 Fast         | 98.0% ± 0.0pp    | $78
GPT-5 mini            | 97.3% ± 1.2pp    | $122
Gemini 2.5 Flash      | 97.3% ± 1.2pp    | $130
GPT-5 nano            | 96.0% ± 2.0pp    | $24
Gemini 2.5 Flash Lite | 96.0% ± 2.0pp    | $30

Grok 4 does not support setting reasoning effort, so the token counts and corresponding costs are inflated compared to other models.

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/question-answering/text2sql/data

Note: while frontier LLM APIs charge per token, distil labs models run on dedicated GPUs billed by uptime. Exact pricing therefore depends on utilization: the more requests you push through a GPU, the cheaper each one gets. We report numbers assuming full utilization, since many real workloads get close to it (you only pay while the GPU is processing requests). The conclusions hold up even assuming a pessimistic 10% utilization.

Specifically, we report sustained throughput on a single H100 GPU node (~$2.40/hr). At the measured ceiling of 222 RPS for the Text2SQL 4B model, a single GPU handles over 19 million requests per day. Since these models are small, they fit on much smaller GPUs as well; however, most of our tasks are prefill-heavy and decode-light, and in that regime the H100's FLOP advantage shines and outweighs the fact that 80 GB of memory is overkill.

Metric            | Value
Max sustained RPS | 222
p50 latency       | 390 ms
p95 latency       | 640 ms
p99 latency       | 870 ms
GPU memory        | 7.6 GiB
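The dollars-per-million-requests figure follows directly from uptime billing. A back-of-envelope version of the calculation, using the H100 price and measured RPS from this section:

```python
# Back-of-envelope self-hosting cost, using the numbers from this section:
# an H100 at $2.40/hr sustaining 222 RPS on the Text2SQL 4B model.

def cost_per_million_requests(gpu_hourly_usd: float, sustained_rps: float,
                              utilization: float = 1.0) -> float:
    """USD per 1M requests for a GPU billed by uptime."""
    requests_per_hour = sustained_rps * 3600 * utilization
    return gpu_hourly_usd / requests_per_hour * 1_000_000

full = cost_per_million_requests(2.40, 222)                    # ~$3.0 / M requests
pessimistic = cost_per_million_requests(2.40, 222, 0.10)       # ~$30 / M requests
```

At full utilization this reproduces the $3.00/M-requests figure in the table above; at 10% utilization the per-request cost scales up by exactly 10x, since the GPU bill is fixed.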

We keep all models in BF16 and don't explore quantization here, though depending on the scenario it can be useful. In brief experiments, FP8 quantization gave us an additional 15% throughput boost with 44% less memory and no measurable accuracy loss. Expect to read more about quantization on our blog soon!

Practical recommendations

In short: distill specialist models for your structured tasks and route open-ended problems to larger, generalist models. Not every task is a good candidate for distillation (this might change in the future!), and that's OK; the best production setups combine both. Use distillation when:

  • The task has a well-defined structure (function calling, classification, SQL generation).
  • Frontier models haven't seen your specific schema or domain.
  • Cost at scale matters — you're making millions of requests.
  • Data can't leave your infrastructure — a self-hosted model means no patient records, financial data, or PII ever hits a third-party API (our PII Redaction Healthcare dataset scored 94.0% with everything running on-premise).

Route to a frontier API when:

  • The task requires broad world knowledge, e.g. coding or general conversations.
  • Freeform generation quality matters.
  • The task is low-volume enough that it barely shows up on your bill.

Most production LLM spend goes to structured, high-volume tasks, exactly where distillation delivers the biggest wins. Route your open-ended or low-volume tasks to a frontier API, and distill everything else.

There's also a system maturity angle here: if a task doesn't distill well, it may be too broad. Breaking it into narrower subtasks lets you selectively pick off good distillation candidates. For example, instead of "answer any question about this domain," split the task into entity extraction, classification, and targeted generation.
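As a minimal illustration of the routing idea (task names and endpoints below are hypothetical, not part of our benchmarks):

```python
# Hypothetical router: structured, high-volume tasks go to self-hosted
# distilled models; anything else falls through to a frontier API.
DISTILLED_ENDPOINTS = {
    "intent_classification": "http://localhost:8000/v1",  # placeholder ports
    "pii_redaction": "http://localhost:8001/v1",
    "text2sql": "http://localhost:8002/v1",
}
FRONTIER_ENDPOINT = "https://api.example.com/v1"  # placeholder frontier API

def route(task: str) -> str:
    """Return the endpoint that should serve this task."""
    return DISTILLED_ENDPOINTS.get(task, FRONTIER_ENDPOINT)
```

Since both sides speak the Chat Completions API, the router only needs to pick a base URL; the calling code stays identical.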

More datasets

GPT-5 nano is the cheapest frontier option on every dataset, ranging from ~$17/M requests (E-commerce) to ~$69/M requests (Git Assistant), though its answer quality mostly lags the alternatives. On the other end of the spectrum, Opus 4.6 consistently provides the best answers in our benchmarks while being the most expensive model we've tested.

Tool calling

Tool-calling datasets get 2–4x fewer requests per dollar than classification or QA because tool schemas inflate the prompt token count. This task most clearly shows the strength of distillation, where a structured, well-defined problem can be solved by a specialized tiny model (for Smart Home we used Qwen3 0.6B!), outperforming generic alternatives on both quality and efficiency.

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/function-calling/smart-home/data

The git-assistant dataset is also a function-calling problem, but here the picture is a little different. Distillation still clearly wins on efficiency, but the quality question is more nuanced: frontier models are very good at using git, since it's well represented in their training data.

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/function-calling/git_assistant/data

Information extraction & question answering

Similarly, non-function-calling problems with structured outputs lend themselves well to distillation, and the PII redaction dataset demonstrates this: the distilled model outperforms the larger, more generic models on both efficiency and response quality.

There's a similar story on the Text2SQL problem; however, just like with git-assistant, existing models are already great at writing SQL, so there isn't much headroom to outperform them.

Finally, the Docstring results illustrate something interesting too: while we do expect structured output here, part of it is a free-form, plain-language description of the function. Understanding and describing an arbitrary function requires general rather than specialized reasoning, and that's a difficult area for a small model to compete in.

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/question-answering/pii-redaction-healthcare/data
https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/question-answering/docstring-generation/data

Open-book question answering problems ask the model to formulate an answer to a question given raw information in the form of “chunks” (e.g. from a RAG system).

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/open-book-qa/hotpot-qa/data

Classification

Classification is a very well-defined problem and the distilled models are competitive with (and much cheaper than!) the frontier labs’ cloud models.

https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/classification/banking77/data
https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/classification/ecommerce/data
https://github.com/distil-labs/inference-efficiency-benchmarks/tree/main/classification/TREC/data

Methodology notes

  • We used the same test set for distilled and frontier models on every dataset.
  • Same evaluation criteria: exact-match accuracy for classification, tool_call_equivalence for function calling (i.e. JSON comparison after default parameter normalization), LLM-as-a-judge (Claude Sonnet 4.6) for generation tasks.
  • Distilled model training: 50 training examples per dataset (fewer for some).
  • Teacher models: a mixture of large open-weight models (not frontier APIs; distil labs doesn't train on outputs from closed models like GPT-5 or Claude).
  • Student models: Qwen3-4B-Instruct for most datasets; Qwen3-0.6B through 8B for the Text2SQL deep-dive; Qwen3-0.6B through 1.7B for Smart Home.
  • Variance: Frontier models were run 3 times per dataset; we report mean ± std. Distilled models use temperature 0 by default, so we skip multiple runs (the results would be the same).
  • Cost calculation: Frontier costs computed from measured API token usage per dataset. Distilled costs computed from H100 GPU time at $2.40/hr divided by measured sustained RPS.
  • Pricing snapshot from February 2026.