Classification
Model training with the distil labs platform
The distil labs platform allows anyone to benefit from state-of-the-art methods for model fine-tuning. You don’t need to be a machine learning expert to get a highly performant model customized to your needs within a day.
Overview
In this notebook, we will train a small language model (SLM) with the distil labs platform. We will follow a three-step process and, at the end, download our own SLM for local deployment.
In practice, you will transform a compact “student” model into a domain expert—without writing a single training loop yourself. Distil labs takes care of every heavy-lifting step:
| Stage | What happens under the hood | Why it matters |
|---|---|---|
| Data upload & validation | You submit a job description, tiny train / test CSVs, and (optionally) an unstructured corpus. The platform checks schema, finds label mistakes, and estimates achievable accuracy. | Catches data bugs before you waste compute. |
| LLM evaluation | A large foundation model (“teacher”) answers your test questions. distil labs measures accuracy and shows a pass/fail report. | If the teacher can’t solve the task, small models won’t either—stop here instead of two hours later. |
| SLM training (synthetic generation + distillation) | Automatically generates additional Q&A pairs from your corpus to fill knowledge gaps, then fine-tunes the 135M-parameter student with LoRA/QLoRA adapters while distilling the teacher’s reasoning. A lightweight hyper-parameter search runs in the background. | Produces a model up to 70× smaller than the teacher yet usually within a few percentage points of its accuracy, ready for CPU-only devices. |
| Benchmarking & packaging | Once training finishes, distil labs re-evaluates both teacher and student on your held-out test set, generates a side-by-side metrics report, and bundles the weights in a tarball. | You get hard numbers and a model you can run locally in one command. |
Registration
The first step towards model distillation is creating an account at app.distillabs.ai. Once you sign up, you can use the CLI to log in with your credentials.
Notebook Setup
Section titled “Notebook Setup”Copy over necessary data
%%bash
# Check if the data directory already exists
if [ -d "data-mental-health" ]; then
  echo "Data directory does exist, nothing to do"
else
  echo "Data directory does not exist, cloning from a repository"
  # Clone the repo to a temp location
  git clone https://github.com/distil-labs/distil-labs-examples.git distil-labs-examples
  # Copy the example datasets into the working directory
  cp -r distil-labs-examples/classification-tutorial/data-mental-health data-mental-health
  cp -r distil-labs-examples/classification-tutorial/data-injury data-injury
  cp -r distil-labs-examples/classification-tutorial/data-ecommerce data-ecommerce
  cp -r distil-labs-examples/classification-tutorial/data-banking-routing data-banking-routing
  # Delete the cloned repo
  rm -rf distil-labs-examples
  echo "Subdirectories copied and repo removed."
fi
! pip install pandas requests rich torch transformers pyyaml
import pandas
pandas.set_option("display.max_rows", 10)
distil labs authentication
To begin, we need to authenticate. Log in using the distil labs CLI with your email and password. The CLI will handle token management for you.
distil login
import getpass
import json
import requests
def distil_bearer_token(DL_USERNAME: str, DL_PASSWORD: str) -> str:
    """Exchange distil labs credentials for a Cognito access token."""
    response = requests.post(
        "https://cognito-idp.eu-central-1.amazonaws.com",
        headers={
            "X-Amz-Target": "AWSCognitoIdentityProviderService.InitiateAuth",
            "Content-Type": "application/x-amz-json-1.1",
        },
        data=json.dumps({
            "AuthParameters": {
                "USERNAME": DL_USERNAME,
                "PASSWORD": DL_PASSWORD,
            },
            "AuthFlow": "USER_PASSWORD_AUTH",
            "ClientId": "4569nvlkn8dm0iedo54nbta6fd",
        }),
    )
    response.raise_for_status()
    return response.json()["AuthenticationResult"]["AccessToken"]
DL_USERNAME = "YOUR_EMAIL"
DL_PASSWORD = getpass.getpass()
AUTH_HEADER = {"Authorization": distil_bearer_token(DL_USERNAME, DL_PASSWORD)}
print("Success")
Register a new model
The first component of the workflow is registering a new model - this helps us keep track of all our experiments down the line.
distil model create testmodel-1234
from pprint import pprint
# Register a model
data = {"name": "testmodel-1234"}
response = requests.post(
"https://api.distillabs.ai/models",
data=json.dumps(data),
headers={"content-type": "application/json", **AUTH_HEADER},
)
pprint(response.json())
model_id = response.json()["id"]
print(f"Registered a model with ID={model_id}")
Inspect our models
Now that the model is registered, we can take a look at all the models in our repository.
distil model list
from pprint import pprint
# Retrieve all models
response = requests.get(
"https://api.distillabs.ai/models",
headers=AUTH_HEADER
)
pprint(response.json())
Data Validation
To get started with model training, we need to upload the necessary data components. The details of formatting are discussed in Data Preparation Guidelines for Classification, but if you don’t have a dataset ready, you can follow one of the data preparation notebooks to prepare an example dataset. Each distil labs training relies on:
- Job description that explains the classification task and describes all classes
- Train and test datasets (tens of examples) that demonstrate the expected inputs and outputs
- (optional) Unstructured dataset with unlabelled data points related to the problem
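To make the three components concrete, here is a sketch that writes a toy version of each file. The column names (question, answer, context) follow the descriptions later in this tutorial, but the directory name and row contents are invented for illustration:

```python
import csv
import json
from pathlib import Path

# Build a toy data directory mirroring the expected layout.
# The real tutorial datasets are richer; this only illustrates the shape.
toy = Path("data-toy")
toy.mkdir(exist_ok=True)

# Job description: plain-English task summary plus one entry per class.
job = {
    "task_description": "Route a customer message to the right banking team.",
    "classes_description": {
        "card_issue": "Problems with a debit or credit card.",
        "account_access": "Trouble logging in or resetting credentials.",
    },
}
toy.joinpath("job_description.json").write_text(json.dumps(job, indent=2))

# Train/test: (question, answer) pairs, one labelled example per row.
with toy.joinpath("train.csv").open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["question", "answer"])
    writer.writerow(["My card was swallowed by the ATM", "card_issue"])
    writer.writerow(["I forgot my online banking password", "account_access"])

# Unstructured: a single `context` column with unlabelled domain text.
with toy.joinpath("unstructured.csv").open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["context"])
    writer.writerow(["Cards are usually reissued within five business days."])
```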
from pathlib import Path
data_location = Path("data-banking-routing")
assert data_location.exists()
The data for this example should be stored in the data_location directory. Let's first take a look at the current directory to make sure all files are available. Your current directory should look like:
├── README.md
├── classification-training.ipynb
└── <data_location>
├── job_description.json
├── test.csv
├── train.csv
└── unstructured.csv
Job Description
A job description explains the classification task in plain English and follows the general structure below:
{
"task_description": "<Enter job description here>",
"classes_description":
{
"class A": "<Enter class A description here>",
"class B": "<Enter class B description here>",
...
}
}
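Before uploading, it can help to sanity-check that a job description follows the structure above. The helper below is my own sketch, not part of the platform or CLI; its checks mirror only the template shown here, and the platform's own validation may be stricter:

```python
def check_job_description(job: dict) -> list[str]:
    """Return a list of problems found in a job-description dict (empty = OK)."""
    problems = []
    # The task description must be present and non-empty.
    if not job.get("task_description", "").strip():
        problems.append("task_description is missing or empty")
    # Classification needs at least two classes, each with a description.
    classes = job.get("classes_description")
    if not isinstance(classes, dict) or len(classes) < 2:
        problems.append("classes_description needs at least two classes")
    else:
        for name, desc in classes.items():
            if not str(desc).strip():
                problems.append(f"class {name!r} has an empty description")
    return problems
```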
For this problem, we use the job description stored in data_location/. Let's inspect the job_description.json prepared for our task:
import json
import rich.json
with open(data_location.joinpath("job_description.json")) as fin:
rich.print(rich.json.JSON(fin.read()))
Train and test data
We need a small training dataset to begin the distil labs training and a test dataset that we can use to evaluate the performance of the fine-tuned model. Here, we use the train and test datasets from the data_location directory, where each is a CSV file with fewer than 100 (question, answer) pairs.
Let’s inspect the available datasets to see the format and a few examples.
from pathlib import Path
from IPython.display import display
import pandas
print("# --- Train set")
train = pandas.read_csv(data_location.joinpath("train.csv"))
display(train)
print("# --- Test set")
test = pandas.read_csv(data_location.joinpath("test.csv"))
display(test)
Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. It can be documentation, unlabelled examples, or even industry literature that contains such information. Here, we use the unstructured dataset from the data_location/ directory, a CSV file with a single column (context).
Let’s inspect the available datasets to see the format and a few examples.
unstructured = pandas.read_csv(data_location.joinpath("unstructured.csv"))
display(unstructured)
Upload and Validate data
We upload all data elements to the distil labs platform and use data validation to check that everything is in order for our jobs.
First, create a config.yaml file in your data directory with the following content:
base:
task: classification
Then upload the data:
distil model upload-data <model-id> --data data-banking-routing
import json
import requests
import yaml
from pathlib import Path
import pandas
# Specify the config
config = {
"base": {
"task": "classification",
}
}
# Package your data
data = {
"job_description": {
"type": "json",
"content": open(data_location / "job_description.json", encoding="utf-8").read()
},
"train_data": {
"type": "csv",
"content": open(data_location / "train.csv", encoding="utf-8").read()
},
"test_data": {
"type": "csv",
"content": open(data_location / "test.csv", encoding="utf-8").read()
},
"unstructured_data": {
"type": "csv",
"content": open(data_location / "unstructured.csv", encoding="utf-8").read()
},
"config": {
"type": "yaml",
"content": yaml.dump(config)
},
}
# Upload data
response = requests.post(
f"https://api.distillabs.ai/models/{model_id}/uploads",
data=json.dumps(data),
headers={"content-type": "application/json", **AUTH_HEADER},
)
print(response.json())
upload_id = response.json()["id"]
Teacher evaluation
In the teacher evaluation stage, we will use our test set to validate whether our chosen ‘teacher’ LLM can solve the task well enough.
If a large model can solve a problem, we can then distil the problem-solving ability of the larger model into a small model. The accuracy of the teacher LLM will give us an idea of the performance to expect from our SLM.
Start the teacher evaluation:
distil model run-teacher-evaluation <model-id>
from pprint import pprint
# Start teacher evaluation
data = {"upload_id": upload_id}
response = requests.post(
f"https://api.distillabs.ai/models/{model_id}/teacher-evaluations",
data=json.dumps(data),
headers={"content-type": "application/json", **AUTH_HEADER},
)
pprint(response.json())
teacher_evaluation_id = response.json().get("id")
Check status and results
Run the command below to check the status and results of the LLM evaluation.
High accuracy on LLM evaluation indicates our task is well defined and we can move on to training. When training an SLM for this task, we can use the LLM evaluation as the quality benchmark for the trained model.
distil model teacher-evaluation <model-id>
import json
from pprint import pprint
import pandas as pd
response = requests.get(
f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status",
headers=AUTH_HEADER,
)
pprint(response.json()["message"])
try:
    display(pd.DataFrame(response.json().get("results")).transpose())
except Exception:
    # Results are only available once the evaluation has finished.
    pass
SLM Training
Now that we are satisfied with the LLM evaluation, we will start the distil labs training process where the SLM learns to mimic the LLM’s behavior on your specific task. Once the training is complete, we will review the SLM’s performance against the LLM’s benchmark and decide if the quality meets your requirements.
To kick off the training job, run the following command:
distil model run-training <model-id>
import time
from pprint import pprint
# Start SLM training
data = {"upload_id": upload_id}
response = requests.post(
f"https://api.distillabs.ai/models/{model_id}/training",
data=json.dumps(data),
headers={"content-type": "application/json", **AUTH_HEADER},
)
pprint(response.json())
slm_training_job_id = response.json().get("id")
Training status and evaluation results
We can check the status of the training job using the CLI. The following command displays the current status of the job we started before.
distil model training <model-id>
import json
response = requests.get(
f"https://api.distillabs.ai/trainings/{slm_training_job_id}/status",
headers=AUTH_HEADER,
)
response.json()
When the job is finished (status=complete), the command above will also display the benchmarking results: the accuracy of the LLM and the accuracy of the fine-tuned SLM.
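Training takes a while, so rather than re-running the status cell by hand, you can wait for a terminal status in a small polling loop. This is a sketch of my own, not part of the platform SDK; the terminal status strings (`complete`, `failed`) are assumptions based on the text above, so adjust them to whatever the API actually returns:

```python
import time

def wait_for_training(fetch_status, poll_seconds=60, timeout_seconds=4 * 3600):
    """Poll `fetch_status()` (a callable returning the status JSON as a dict)
    until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        # Assumed terminal states; inspect a real response to confirm.
        if status.get("status") in ("complete", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training did not finish in time")

# Usage with the endpoint above (hypothetical):
# final = wait_for_training(lambda: requests.get(
#     f"https://api.distillabs.ai/trainings/{slm_training_job_id}/status",
#     headers=AUTH_HEADER).json())
```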
Interpreting results
Inspecting the classification results, we can compare the accuracy of the small model (1B parameters) to that of the teacher model, roughly 70× its size. In most cases, the accuracies should be comparable, indicating successful training.
SLM Ready
Once the model is fully trained, we can download the model binaries and deploy them on our own infrastructure with full control. A trained model can later be deployed for inference; this is explained in the next section.
distil model download <model-id>
import json
from pprint import pprint
import requests
response = requests.get(
f"https://api.distillabs.ai/models",
headers=AUTH_HEADER,
)
pprint(response.json())
from pprint import pprint
slm_training_job_id = "SELECTED-MODEL"
response = requests.get(
f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
headers=AUTH_HEADER,
)
pprint(response.json())
import tarfile
import urllib.request
print("Downloading ...")
def status(count, block, total):
    print("\r", f"Downloading: {count * block / total:.1%}", end="")

# The tarball URL comes from the trainings/{id}/model response above;
# the exact field name here is an assumption, so check the printed JSON.
s3url = response.json()["url"]
urllib.request.urlretrieve(
    s3url,
    "model.tar",
    reporthook=status,
)
print("\nUnpacking ...")
with tarfile.open("model.tar", mode="r:*") as tar:
    tar.extractall(path=".")
Model deployment
import torch
import pandas
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TextClassificationPipeline
model = AutoModelForSequenceClassification.from_pretrained("model")
tokenizer = AutoTokenizer.from_pretrained("model", padding_side="left")
llm = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=None)
answer = llm("I have a charge for cash withdrawal that I want to learn about")
pandas.DataFrame(answer.pop()).sort_values(by="score", ascending=False)
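With top_k=None, the pipeline returns, for each input, a list of {"label", "score"} dicts covering every class. A small helper (my own, not part of the tutorial code) can reduce that to the single best label:

```python
def top_label(scores):
    """Pick the highest-scoring entry from one input's list of
    {"label": ..., "score": ...} dicts, as returned with top_k=None."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]

# Usage with the pipeline above (hypothetical):
# label, score = top_label(llm("I have a charge for cash withdrawal")[0])
```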