Data preparation
Proper input preparation is critical to the success of your model training with distil labs. The input consists of a job description, training data, test data, a configuration file, and an optional unstructured dataset.
Job description
The job description explains what you want your model to do. It is similar to a detailed prompt you would give to a large language model: it defines the task the model should perform, the expected input and output formats, and any specific requirements or constraints.
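As an illustration, a job description can be assembled programmatically and saved as JSON. The field names below (`task`, `output_format`, `constraints`) are illustrative assumptions, not the platform's required schema; see the task-specific guides for the exact format.

```python
import json

# Hypothetical job description for a ticket-classification task.
# Field names here are illustrative, not the platform's required schema.
job_description = {
    "task": "Classify customer support tickets by urgency.",
    "output_format": "One of: low, medium, high",
    "constraints": "Respond with the label only, no explanation.",
}

# Write it to the file referenced by the upload snippet below
with open("job_description.json", "w") as f:
    json.dump(job_description, f, indent=2)
```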
Training data
Training data provides concrete examples of desired model behavior. The examples demonstrate input-output relationships and help calibrate the knowledge distillation process. With distil labs, you only need a few dozen high-quality examples, so focus on representative examples that capture the essence of your task.
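A small script can assemble such examples into the CSV the platform expects. The column names used here (`input`, `output`) are assumptions for illustration; the task-specific guides define the exact schema for each task type.

```python
import csv

# A few illustrative labeled examples for a ticket-urgency task.
# Column names ("input", "output") are assumptions — check the
# task-specific data preparation guide for the required schema.
examples = [
    {"input": "My payment failed twice and I need this fixed today.", "output": "high"},
    {"input": "How do I change my profile picture?", "output": "low"},
    {"input": "The export feature has been slow since yesterday.", "output": "medium"},
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(examples)
```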
Test data
Test data serves as your verification mechanism, providing an independent assessment of whether the model has learned the task. These examples remain separate from training and help measure how well the model generalizes to unseen inputs. Good test data identifies potential weaknesses and builds confidence before deployment.
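One common way to keep test data genuinely separate is a seeded random split of your labeled pool before writing the two CSV files. This is a generic sketch, not a platform requirement; the 80/20 ratio and the example structure are assumptions.

```python
import random

# Hypothetical pool of labeled examples (same input/output structure
# as the training data).
examples = [{"input": f"example {i}", "output": "label"} for i in range(50)]

# Hold out ~20% for testing; seed the shuffle so the split is reproducible.
random.seed(42)
random.shuffle(examples)
split = int(len(examples) * 0.8)
train_examples = examples[:split]
test_examples = examples[split:]

print(len(train_examples), len(test_examples))  # 40 10
```

Keeping the held-out examples out of `train.csv` entirely is what makes the later evaluation an honest measure of generalization.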
Unstructured data (optional)
Unstructured data provides broader domain knowledge without requiring labeled examples. This supplementary information guides the distillation process towards the right domain without the labor-intensive process of creating additional labeled examples.
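If your domain knowledge lives in raw documents, a short script can turn them into one-text-per-row CSV. The single `text` column is an assumption about the expected schema; consult the relevant task guide for the actual format.

```python
import csv

# Sketch: convert raw domain documents into a one-text-per-row CSV.
# The single "text" column is an assumed schema, not a confirmed one.
documents = [
    "Refund requests are processed within 5 business days of approval.",
    "Enterprise accounts include priority support and SSO integration.",
]

with open("unstructured.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    for doc in documents:
        writer.writerow([doc])
```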
Configuration file
The configuration file is your control panel for the training process. While the job description defines what the model should do, the configuration dictates how it learns to do it. It specifies task type, model size, and training parameters, bridging your requirements with the technical aspects of model training.
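As a rough sketch, a configuration might look like the YAML below. Every key name here is a placeholder assumption; the actual schema is defined in "Creating the configuration file", so use that guide as the source of truth.

```python
# Hypothetical config — all key names below are illustrative placeholders,
# not the platform's confirmed schema.
config_yaml = """\
task_type: classification
student_model_size: small
training:
  epochs: 3
"""

with open("config.yaml", "w") as f:
    f.write(config_yaml)
```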
Task-specific guidance
For detailed formatting and structure requirements, refer to:
- Question Answering data preparation
- Classification data preparation
- Tool Calling data preparation
- Multi-Turn Tool Calling data preparation
- Open Book QA (RAG) data preparation
- Closed-Book QA data preparation
- Creating the configuration file
Data upload
Once the inputs are prepared, upload them to the distil labs platform with the following snippet (see Account and Authentication for obtaining your token):
import requests
from pathlib import Path

# See Account and Authentication for the distil_bearer_token() implementation
auth_header = {"Authorization": f"Bearer {distil_bearer_token()}"}

# Load data from files; model_id is the ID of the model created earlier
data = {
    "job_description": {"type": "json", "content": Path("data/job_description.json").read_text()},
    "train_data": {"type": "csv", "content": Path("data/train.csv").read_text()},
    "test_data": {"type": "csv", "content": Path("data/test.csv").read_text()},
    "unstructured_data": {"type": "csv", "content": Path("data/unstructured.csv").read_text()},
    "config": {"type": "yaml", "content": Path("data/config.yaml").read_text()},
}

# Package and upload your data; requests serializes the payload as JSON
# and sets the Content-Type header automatically
response = requests.post(
    f"https://api.distillabs.ai/models/{model_id}/uploads",
    json=data,
    headers=auth_header,
)
response.raise_for_status()
upload_id = response.json()["id"]
print(f"Upload successful. ID: {upload_id}")
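Before running the upload, it can help to sanity-check that every input file exists and is non-empty; an empty or missing file would otherwise surface only as an API error. This is a generic helper sketch, not part of the distil labs client.

```python
from pathlib import Path

def check_inputs(paths):
    """Return the paths that are missing or empty."""
    return [p for p in paths if not Path(p).is_file() or Path(p).stat().st_size == 0]

# Check the same files the upload snippet reads
problems = check_inputs([
    "data/job_description.json",
    "data/train.csv",
    "data/test.csv",
    "data/config.yaml",
])
print("Problems:", problems)
```

An empty result means all files are present and non-empty; anything listed should be fixed before posting the upload.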