Local deployment
Once your model is trained, you can download and deploy it locally using the inference framework of your choice. Alternatively, you can push it to Hugging Face Hub for easy sharing and deployment.
Deploying with distil CLI
The distil CLI provides a built-in deployment command that handles model download and serving for you.
Local deployment (experimental)
Deploy your trained model locally using llama.cpp as the inference backend:
distil model deploy local <model-id>
This downloads the model and starts a local llama-server on port 8000. You can customize the port and enable server logs:
distil model deploy local --port 9000 --logs <model-id>
Once running, query the model using the OpenAI-compatible API at http://localhost:<port>/v1.
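The local endpoint can also be queried with plain HTTP from Python. A minimal sketch, assuming the default port 8000 and that the served model is registered under the name `model` (adjust both to your deployment):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # match the --port you chose


def build_chat_request(question: str,
                       system: str = "You are a helpful assistant.") -> dict:
    # OpenAI-compatible chat-completions payload
    return {
        "model": "model",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    }


def ask(question: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("Your question here"))
```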
To get the command to invoke your locally deployed model:
distil model invoke <model-id>
This outputs a ready-to-run command using uv. Copy and run it directly:
uv run $PATH_TO_CLIENT --question "Your question here"
For question answering models that require context, use the --context flag:
uv run $PATH_TO_CLIENT --question "Your question here" --context "Your context here"
| Option | Description |
|---|---|
| `--port <port>` | Port number for the local llama-server (default: 8000) |
| `--logs` | Show llama-server logs during local deployment |
| `--output json` | Output results in JSON format |
Downloading your model
Download your trained model using the CLI or API:
distil model download <model-id>
Or, via the API:
import requests
# See Account and Authentication for distil_bearer_token() implementation
auth_header = {"Authorization": f"Bearer {distil_bearer_token()}"}
slm_training_job_id = "YOUR_TRAINING_ID"
response = requests.get(
f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
headers=auth_header,
)
# Get the download URL
print(response.json())
After downloading, extract the tarball. You will have a directory containing your trained SLM with all necessary files for deployment:
├── model/
├── model-adapters/
├── Modelfile
├── model_client.py
├── README.md
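The extraction step can be sketched in Python. This is a generic helper, not part of the distil tooling; the archive filename in the usage line is a placeholder:

```python
import tarfile
from pathlib import Path


def extract_model_tarball(tar_path: str, dest: str = ".") -> list[str]:
    """Extract the downloaded model archive and list its top-level entries."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:*") as tar:
        # The archive comes from your own training job, so extractall is safe here.
        tar.extractall(dest_dir)
    return sorted(p.name for p in dest_dir.iterdir())


if __name__ == "__main__":
    print(extract_model_tarball("model.tar.gz", "slm"))
```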
Deploying with vLLM
vLLM is a high-performance inference engine for LLMs. To get started, set up a virtual environment and install dependencies:
python -m venv serve
source serve/bin/activate
pip install vllm openai
Start the vLLM server:
vllm serve model --api-key EMPTY
For tool calling models:
vllm serve model --enable-auto-tool-choice --tool-call-parser hermes --api-key EMPTY
Note that `model` refers to the directory containing your model weights. The server runs in the foreground, so run it in a separate terminal or as a background process.
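Because the server starts in the foreground, a script that depends on it may need to wait until it is ready. A hedged sketch that polls the OpenAI-compatible /v1/models endpoint, assuming the Bearer token matches the --api-key you passed to vllm serve:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8000",
                    api_key: str = "EMPTY",
                    timeout: float = 60.0) -> bool:
    """Poll /v1/models until the server answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(req, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not up yet; retry
    return False
```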
Query the model using the OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
response = client.chat.completions.create(
model="model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Your question here"},
],
)
print(response.choices[0].message.content)
Or use the provided client script:
python model_client.py --question "Your question here"
For question answering models that require context, use the --context flag:
python model_client.py --question "Your question here" --context "Your context here"
Pushing to Hugging Face Hub (API only)
You can upload your model directly to your private Hugging Face repository. Once pushed, you can deploy it from Hugging Face using various inference frameworks.
Requirements:
- Training ID of the model (`YOUR_TRAINING_ID`)
- Hugging Face user access token with write privileges (`YOUR_HF_TOKEN`)
- Repository name for your model (`YOUR_USERNAME/MODEL_NAME`)
import json
import requests
# See Account and Authentication for distil_bearer_token() implementation
auth_header = {"Authorization": f"Bearer {distil_bearer_token()}"}
slm_training_job_id = "YOUR_TRAINING_ID"
hf_details = {
"hf_token": "YOUR_HF_TOKEN",
"repo_id": "YOUR_USERNAME/MODEL_NAME"
}
response = requests.post(
f"https://api.distillabs.ai/trainings/{slm_training_job_id}/huggingface_models",
data=json.dumps(hf_details),
headers={"Content-Type": "application/json", **auth_header},
)
print(response.json())
We push models to two repositories on Hugging Face, one in GGUF format and one in safetensors format. Once your model is on Hugging Face, you can deploy it directly using vLLM:
pip install vllm
vllm serve "YOUR_USERNAME/MODEL_NAME"
Production considerations
When deploying your model to production, consider:
- Resource Requirements: Even small models benefit from GPU acceleration for high-throughput applications.
- Security: Apply appropriate access controls, especially if your model handles sensitive information.
- Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.
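The container-deployment point can be illustrated with a minimal Dockerfile, assuming the official vllm/vllm-openai image and that your extracted `model` directory sits next to the Dockerfile. Treat this as a starting point under those assumptions, not a production-ready setup:

```dockerfile
# Base image ships vLLM with the OpenAI-compatible server as its entrypoint.
FROM vllm/vllm-openai:latest

# Bake the downloaded model weights into the image.
COPY model /model

EXPOSE 8000

# These arguments are appended to the image's api-server entrypoint.
CMD ["--model", "/model", "--api-key", "EMPTY"]
```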