Data preparation
Proper input preparation is critical to the success of your model training with distil labs. The input consists of a job description, training data, test data, a configuration file, and an optional unstructured dataset.
Job description
The job description explains what you want your model to do. It is similar to a detailed prompt you would give to a large language model: it defines the task the model should perform, the expected input and output formats, and any specific requirements or constraints.
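As an illustration, a job description can be assembled programmatically and saved as JSON. The field names below (`task`, `output_format`, `constraints`) are illustrative assumptions, not the platform's required schema; see the task-specific guides for the exact format.

```python
import json

# Hypothetical job description for a ticket-classification task.
# Field names here are illustrative, not the platform's required schema.
job_description = {
    "task": "Classify customer support tickets by urgency.",
    "output_format": "One of: low, medium, high",
    "constraints": "Respond with the label only, no explanation.",
}

# Write it to the file referenced by the upload snippet below
with open("job_description.json", "w") as f:
    json.dump(job_description, f, indent=2)
```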
Training data
Training data provides concrete examples of desired model behavior. The examples demonstrate input-output relationships and help calibrate the knowledge distillation process. With distil labs, you only need a few dozen high-quality examples, so focus on representative examples that capture the essence of your task.
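A small script can assemble such examples into the CSV the platform expects. The column names used here (`input`, `output`) are assumptions for illustration; the task-specific guides define the exact schema for each task type.

```python
import csv

# A few illustrative labeled examples for a ticket-urgency task.
# Column names ("input", "output") are assumptions — check the
# task-specific data preparation guide for the required schema.
examples = [
    {"input": "My payment failed twice and I need this fixed today.", "output": "high"},
    {"input": "How do I change my profile picture?", "output": "low"},
    {"input": "The export feature has been slow since yesterday.", "output": "medium"},
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(examples)
```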
Test data
Test data serves as your verification mechanism, providing an independent assessment of whether the model has learned the task. These examples remain separate from training and help measure how well the model generalizes to unseen inputs. Good test data identifies potential weaknesses and builds confidence before deployment.
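One common way to keep test data genuinely separate is a seeded random split of your labeled pool before writing the two CSV files. This is a generic sketch, not a platform requirement; the 80/20 ratio and the example structure are assumptions.

```python
import random

# Hypothetical pool of labeled examples (same input/output structure
# as the training data).
examples = [{"input": f"example {i}", "output": "label"} for i in range(50)]

# Hold out ~20% for testing; seed the shuffle so the split is reproducible.
random.seed(42)
random.shuffle(examples)
split = int(len(examples) * 0.8)
train_examples = examples[:split]
test_examples = examples[split:]

print(len(train_examples), len(test_examples))  # 40 10
```

Keeping the held-out examples out of `train.csv` entirely is what makes the later evaluation an honest measure of generalization.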
Unstructured data (optional)
Unstructured data provides broader domain knowledge without requiring labeled examples. This supplementary information guides the distillation process towards the right domain without the labor-intensive process of creating additional labeled examples.
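If your domain knowledge lives in raw documents, a short script can turn them into one-text-per-row CSV. The single `text` column is an assumption about the expected schema; consult the relevant task guide for the actual format.

```python
import csv

# Sketch: convert raw domain documents into a one-text-per-row CSV.
# The single "text" column is an assumed schema, not a confirmed one.
documents = [
    "Refund requests are processed within 5 business days of approval.",
    "Enterprise accounts include priority support and SSO integration.",
]

with open("unstructured.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    for doc in documents:
        writer.writerow([doc])
```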
Configuration file
The configuration file is your control panel for the training process. While the job description defines what the model should do, the configuration dictates how it learns to do it. It specifies task type, model size, and training parameters, bridging your requirements with the technical aspects of model training.
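As a rough sketch, a configuration might look like the YAML below. Every key name here is a placeholder assumption; the actual schema is defined in "Creating the configuration file", so use that guide as the source of truth.

```python
# Hypothetical config — all key names below are illustrative placeholders,
# not the platform's confirmed schema.
config_yaml = """\
task_type: classification
student_model_size: small
training:
  epochs: 3
"""

with open("config.yaml", "w") as f:
    f.write(config_yaml)
```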
Task-specific guidance
For detailed formatting and structure requirements, refer to:
- Question Answering data preparation
- Classification data preparation
- Tool Calling data preparation
- Multi-Turn Tool Calling data preparation
- Open Book QA (RAG) data preparation
- Closed-Book QA data preparation
- Creating the configuration file
Data upload
Once the inputs are prepared, upload them to the distil labs platform with the following snippet (see Account and Authentication for obtaining your token):
import requests
from pathlib import Path

# See Account and Authentication for the distil_bearer_token() implementation
auth_header = {"Authorization": f"Bearer {distil_bearer_token()}"}

# Load data from files; model_id is the ID of the model created earlier
data = {
    "job_description": {"type": "json", "content": Path("data/job_description.json").read_text()},
    "train_data": {"type": "csv", "content": Path("data/train.csv").read_text()},
    "test_data": {"type": "csv", "content": Path("data/test.csv").read_text()},
    "unstructured_data": {"type": "csv", "content": Path("data/unstructured.csv").read_text()},
    "config": {"type": "yaml", "content": Path("data/config.yaml").read_text()},
}

# Package and upload your data; requests serializes the payload as JSON
# and sets the Content-Type header automatically
response = requests.post(
    f"https://api.distillabs.ai/models/{model_id}/uploads",
    json=data,
    headers=auth_header,
)
response.raise_for_status()
upload_id = response.json()["id"]
print(f"Upload successful. ID: {upload_id}")
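Before running the upload, it can help to sanity-check that every input file exists and is non-empty; an empty or missing file would otherwise surface only as an API error. This is a generic helper sketch, not part of the distil labs client.

```python
from pathlib import Path

def check_inputs(paths):
    """Return the paths that are missing or empty."""
    return [p for p in paths if not Path(p).is_file() or Path(p).stat().st_size == 0]

# Check the same files the upload snippet reads
problems = check_inputs([
    "data/job_description.json",
    "data/train.csv",
    "data/test.csv",
    "data/config.yaml",
])
print("Problems:", problems)
```

An empty result means all files are present and non-empty; anything listed should be fixed before posting the upload.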