Data preparation
There are two ways to prepare data for training with distil labs:
- Trace processing — use this if you already have production logs from an LLM-powered application. Upload your traces and our pipeline handles the rest.
- Minimal dataset — use this if you don’t have production traces but can provide a small set of labeled examples for your task.
Trace processing
If you have production traces (logs of real interactions with an LLM), you can upload them and our pipeline will automatically filter, relabel, and convert them into training and test data. This is the fastest way to get started if you already have an LLM-powered application in production.
Your traces directory needs three files:
| File | Format | Description |
|---|---|---|
| `traces.jsonl` | JSONL | Production traces in the OpenAI messages format |
| `job_description.json` | JSON | Task description defining what the model should do |
| `config.yaml` | YAML | Training config with `trace_processing` parameters |
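Each line of `traces.jsonl` is one JSON object containing a `messages` array in the OpenAI chat format. A minimal sketch of producing such a file (the example conversation content is illustrative, not a required schema):

```python
import json
import os

os.makedirs("traces", exist_ok=True)

# One production trace per line, using the OpenAI chat messages format:
# a list of {"role": ..., "content": ...} turns.
traces = [
    {
        "messages": [
            {"role": "system", "content": "You are a support assistant."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security and choose 'Reset password'."},
        ]
    }
]

with open("traces/traces.jsonl", "w") as f:
    for trace in traces:
        f.write(json.dumps(trace) + "\n")
```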
```shell
distil model upload-traces <model-id> --data ./traces
```
Learn more about trace processing →
Minimal dataset
If you don’t have production traces, prepare a small structured dataset with labeled examples. You only need a few dozen high-quality examples that capture the essence of your task.
Your data directory needs the following files:
| File | Format | Required | Description |
|---|---|---|---|
| `job_description.json` | JSON | Yes | Task description defining what the model should do |
| `train.csv` | CSV or JSONL | Yes | 20+ labeled (question, answer) pairs |
| `test.csv` | CSV or JSONL | Yes | Held-out evaluation set |
| `config.yaml` | YAML | Yes | Training hyperparameters |
| `unstructured.csv` | CSV or JSONL | No | Domain-relevant text for synthetic data generation |
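The train and test files are plain tabular data with one labeled pair per row. A minimal sketch of writing `train.csv` in Python (the `question`/`answer` column names are an assumption here; check the per-task formatting requirements for the exact schema your task type expects):

```python
import csv
import os

os.makedirs("data", exist_ok=True)

# Labeled (question, answer) pairs. In practice train.csv needs 20+
# rows, plus a separate held-out test.csv in the same format.
examples = [
    ("How do I reset my password?", "Go to Settings > Security and choose 'Reset password'."),
    ("Which plans do you offer?", "We offer Free, Pro, and Enterprise plans."),
]

with open("data/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    writer.writerows(examples)
```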
```shell
distil model upload-data <model-id> --data ./data
```
For detailed formatting and structure requirements per task type, refer to: