Using distil labs we cut our inference costs by roughly 50% for selected tasks without sacrificing quality. We were able to spin up custom small models without dedicating a full-time engineer to it.

Lucas Hild, CTO Knowunity

Using the distil labs training and inference platform allowed Knowunity to shave off ~50% of their inference costs without any change to accuracy, a meaningful win at a scale of hundreds of millions of monthly requests.

The problem

Knowunity is a fast-growing edtech startup founded in Berlin in 2020, which recently secured its €27M Series B. It serves tens of millions of students across several countries and processes hundreds of millions of AI requests each month.

One of the problems they are solving is classifying incoming student requests by the study subject they relate to. The twist is that each country has a different list of subjects available, so the set of possible classes is dynamic.
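Purely for illustration, here is a minimal Python sketch of what such a dynamic-label classification call could look like against an OpenAI-compatible chat endpoint. The endpoint URL, model name, and subject lists are hypothetical placeholders, not Knowunity's actual setup.

```python
# Illustrative sketch only: subject lists, endpoint, and model name are hypothetical.
from openai import OpenAI

# Each country exposes its own subject list, so the label set is passed per request.
SUBJECTS_BY_COUNTRY = {
    "DE": ["Mathematik", "Deutsch", "Biologie", "Geschichte"],
    "FR": ["Mathématiques", "Français", "SVT", "Histoire"],
}

client = OpenAI(base_url="https://inference.example.com/v1", api_key="...")  # placeholder endpoint

def classify_subject(question: str, country: str) -> str:
    subjects = SUBJECTS_BY_COUNTRY[country]
    response = client.chat.completions.create(
        model="subject-classifier",  # hypothetical model name
        messages=[
            {"role": "system",
             "content": "Classify the student request into exactly one of: " + ", ".join(subjects)},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

The key point is that the label set is part of the request rather than baked into the model, which is what makes the per-country subject lists workable.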

At this scale, cloud LLM costs become a significant line item, yet accuracy, throughput, and latency constraints often make optimization challenging. The large models offered by the big cloud providers deliver great accuracy, but they are expensive, even in their smallest variants. Knowunity evaluated open-weight alternatives and found they could match proprietary models on either accuracy or efficiency, but not both.

The solution

At distil labs, we offer a turnkey solution for fine-tuning specialized Small Language Models for specific tasks. We use large, open teacher models and distill their abilities into much smaller, sub-8-billion-parameter models that perform just as well (sometimes better!) on those specific tasks. Because of distillation, our platform is remarkably data-efficient, requiring fewer than 100 example data points to get started. The process does not require specialized AI knowledge to operate, yet it remains highly configurable and leaves ML experts room to experiment.
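To make "fewer than 100 example data points" concrete, here is a sketch of what such a tiny seed dataset could look like in JSONL form. The field names and file layout are assumptions for illustration, not the platform's actual input schema.

```python
# Purely illustrative: a tiny labeled dataset in JSONL form, the kind of
# small seed set a distillation run could start from.
# Field names and file layout are assumptions, not the platform's actual schema.
import json

examples = [
    {"text": "Can someone explain how photosynthesis works?", "label": "Biology"},
    {"text": "How do I solve 3x + 5 = 20?", "label": "Mathematics"},
    {"text": "What caused the fall of the Roman Empire?", "label": "History"},
]

with open("seed_examples.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```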

Once training is complete, our customers can take advantage of our inference platform by simply updating their old API endpoint to point to the new, specialized model. Those who prefer to self-host the model on their own infrastructure can download the trained model weights as well.
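Assuming an OpenAI-compatible client, the switch can be as small as changing the base URL and model identifier. The URLs and names below are placeholders, not real endpoints.

```python
from openai import OpenAI

# Before: proprietary cloud endpoint (placeholder URL and model name)
# client = OpenAI(base_url="https://cloud-provider.example.com/v1", api_key="...")
# MODEL = "large-proprietary-model"

# After: the distilled small model served behind the same API shape (placeholders)
client = OpenAI(base_url="https://inference.example.com/v1", api_key="...")
MODEL = "distilled-subject-classifier"

# The calling code stays identical; only the endpoint and model identifier change.
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Which subject is this question about? ..."}],
)
```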

Because our small models are specialized, their performance on the task they were fine-tuned for is usually the same as, or better than, that of the large teacher model (in this case Qwen3-235B). And because of their low parameter count, they can be hosted much more efficiently. As a result, Knowunity is able to run their workflows at a fraction of the cost of a comparable big cloud provider.

Results

The table below compares classification accuracy on a held-out test set, as well as latency and cost, between Gemini 2.5 Flash Lite (Google's best small proprietary model) and our custom distilled model.

TKTK TABLE

We ran the benchmarks using variable traffic with bursts of up to 10 requests per second and counted any response taking longer than 8s as a failed request. Refer to this link for more details about our benchmarking setup.

Note about evaluation: the production traffic we serve for Knowunity exceeds 130 requests per second. However, to keep the comparison fair in the presence of cloud provider quotas and rate limits, the benchmarks were run at a smaller scale; the results should generalize linearly.

Note about pricing: we calculate prices based on instance uptime, while most big cloud providers charge per token. The prices above are taken from production usage on a specific use case over several days, not from synthetic benchmarks.

Impact

Knowunity was able to take advantage of the two new custom models by updating the API endpoint in their workflows, without trading off performance or quality. The team is now empowered to train their own models independently using the distil labs platform.

Contact us if you want to train your own models and save a significant portion of your LLM budget.

Appendix

The benchmarking script simulates multiple classification requests being sent every second and varies the traffic intensity (RPS, requests per second) over time. We start with a baseline of 1 RPS for 30s (the “warmup” period), then gradually increase to 7 RPS, hold steady at that level for 60s, gradually ramp back down to the baseline, and keep it there for another 30s. Each request's timing also includes some random jitter.
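The actual harness is implemented in Elixir (see below); purely as an illustration of the load shape described above, here is a rough Python sketch. The endpoint URL, ramp step lengths, and jitter range are assumptions; only the 1→7→1 RPS profile, the random jitter, and the 8-second failure threshold come from the setup described here.

```python
import random
import threading
import time
import urllib.request

ENDPOINT = "https://inference.example.com/classify"  # placeholder URL
TIMEOUT_S = 8  # responses slower than this count as failed

# (duration_seconds, requests_per_second) segments:
# 30s warmup at 1 RPS, ramp up to 7 RPS, 60s hold, ramp down, 30s cooldown at 1 RPS.
# Ramp step lengths are an assumption for illustration.
PROFILE = [(30, 1)] + [(5, rps) for rps in range(2, 8)] + [(60, 7)] + \
          [(5, rps) for rps in range(6, 1, -1)] + [(30, 1)]

def fire_request():
    try:
        urllib.request.urlopen(ENDPOINT, data=b"{}", timeout=TIMEOUT_S)
    except Exception:
        pass  # timeouts and errors would be recorded as failures in a real harness

for duration, rps in PROFILE:
    for _ in range(duration):          # roughly one second per iteration
        for _ in range(rps):
            threading.Thread(target=fire_request, daemon=True).start()
            # random jitter so requests are not sent in lockstep
            time.sleep(random.uniform(0.8, 1.2) / rps)
```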

We found that this scenario effectively surfaces common issues with autoscaling setups and provides a good overview of request latency under different conditions. The benchmark script is implemented in Elixir to give us good control over request timing.

In production, each model we host can serve up to 4 million requests per day, with variable traffic ranging up to 150 RPS, so rapid auto-scaling to meet changing demand is crucial.

In the benchmark, we also cap request latency at 8s; any request taking longer than that counts as failed.