Gitara: How we trained a 3B Function-Calling Git Agent for Local Use
We fine-tuned a small, tool-calling language model to turn plain-English questions into git commands with the accuracy of a cloud LLM. You can play with it by checking out this GitHub repo or getting the model directly from Hugging Face. Because it's small, you can run it locally on your own machine.
Introduction
As you might know, we think small agents are the future. If we, as a society, are to deploy AI much more widely than it is today, we will need to start prioritizing efficiency. While most of the effort so far has been focused on capabilities, we're now seeing more of the industry shift to making already-smart-enough-for-many-tasks models economical to run across different hardware platforms.
As companies figure out how best to make use of AI in their workflows, you hear the word "agentic" over and over again. While nobody can quite agree on what an agent even is, one component is definitely necessary: the ability to interact with the external world. This is usually done via tool calling, i.e. giving the underlying model a set of tools to choose from, letting it reply with a "tool call", executing this call and then giving the result back to the model. OpenAI has a good overview of the usual flow.
When creating an agentic system to automate workflows, it's usually recommended to split the overarching general workflow into several sub-agents, each providing a specific, narrow function. Whenever you see a large, general model performing a narrow function, you should think: this looks like a good place for distillation! So let's walk through how to build a small-model-based, tool-calling agent, using a git assistant as a motivating example.
The task
If you still remember learning git, chances are it's not all happy recollections of knowing right away which commands and options to use when. Remembering the correct commands and syntax to achieve a given task takes time. Wouldn't it be great if you could just explain what you want to do and get the correct command to run? Because we have fine-tuned a small model to produce valid git commands from plain-English queries, now you can! Let's look at some examples of how the end result, gitara, can be used (and don't worry, gitara will not execute any commands, it'll just print them out for you):
> gitara "what's in the latest stash, show diff"
git stash show --patch
> gitara "push feature-x to origin, override any changes there and track it"
git push origin feature-x --force --set-upstream
> gitara "show staged changes with diffs"
git status --verbose
Now, all of git is pretty large, so we'll make our lives a little easier and define a 20% subset of commands that covers 95% of my daily usage. We'll support the following commands, each with a reasonable, but not exhaustive, subset of options: status, add, commit, push, pull, branch, switch, restore, merge, stash, rebase, reset, log. Sorry, reflog and bisect fans!
Side note: if you're an old-school git-head (meaning you learned git before 2019) you might've noticed the absence of git checkout in the list above. This is not a coincidence! While I still find it hard to let go of checkout, switch and restore are better, more modern alternatives, so we use them and skip checkout.
Overview of tool calling
The usual way to give a language model access to tools is to specify a JSON schema for each tool. We follow the OpenAI function calling format, which exposes the above set of git commands using schemas such as this one (this is just for git add; you can see the entire set in the repo):
{
  "type": "function",
  "function": {
    "name": "git_add",
    "description": "Stage files for commit",
    "parameters": {
      "type": "object",
      "properties": {
        "files": {
          "type": "array",
          "description": "List of file paths to stage (use ['.'] for all files)",
          "items": { "type": "string" },
          "minItems": 1
        }
      },
      "required": ["files"],
      "additionalProperties": false
    }
  }
}
You pass a list of those to the completions API endpoint, together with the messages containing the query, and expect something like the following in return:
{"name": "git_add", "parameters": {"files": ["README.md"]}}
There are a bunch of different formats for tool calling, both for specifying the tool schemas (e.g. Anthropic expects a slightly different structure) and for the answer returned by the LLM (e.g. recent Llama models are trained to return Python-style tool calls like git_add(files=["README.md"])). We'll mostly stick to the OpenAI format here because it's the most popular one. We'll just make two modifications: while the OpenAI API returns arguments as a string containing JSON, we'll train our model to return nested JSON, where parameters (not arguments!) is just an object. There are trade-offs connected to the string/JSON choice (streaming and validation can be easier when arguments is a string), but for our use case keeping the response a single nested JSON object makes things simpler. The choice to use parameters instead of arguments is driven by the Llama 3 family being trained with this key name. Either way, it's easy to add a wrapper around the model to convert the output to any format you might want to use in your application.
One thing that's important to consider is adding a do_nothing tool so the model can respond reasonably when faced with unreasonable queries. For instance, when our git assistant is asked to make a sandwich or hack the Pentagon, it should ideally respond with "I'm afraid I can't do that" rather than an arbitrary git command. Giving it do_nothing as an option, together with a handful of examples, is a way to ensure that.
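A schema for such a tool can be very small. The version below is an illustrative sketch in the same format as the git_add schema above, not necessarily the exact wording used in the repo:

{
  "type": "function",
  "function": {
    "name": "do_nothing",
    "description": "Use when the request cannot be accomplished with a git command",
    "parameters": {
      "type": "object",
      "properties": {},
      "additionalProperties": false
    }
  }
}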
Implementation
Creating seed dataset
To start the process, we need to create some seed data: what we expect the model to accept and what it should respond with. Normally, fine-tuning a language model requires way more examples than it's practical to create manually, but we only need a small set to get started (our platform will take care of generating the rest). Let's start with something like the following:
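Each seed example pairs a plain-English query with the tool call we expect back. The pairs below illustrate the format; the wrapping keys and exact entries are stand-ins rather than lines copied from the repo's seed set:

{"input": "stage README.md", "output": {"name": "git_add", "parameters": {"files": ["README.md"]}}}
{"input": "apply the sixth stash", "output": {"name": "git_stash", "parameters": {"action": "apply", "stash_ref": "stash@{5}"}}}
{"input": "order me a pizza", "output": {"name": "do_nothing", "parameters": {}}}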
Ideally, we want decent coverage of both the possible git commands and realistic queries, for both training and validation. For our problem, decent coverage means something like ~100 examples. While it'd be painful to write this many by hand, it's fairly easy to generate them using your stochastic parrot of choice and then just validate that they're correct, throwing out anything you don't like. A set like this is available in the repo here.
Baseline: evaluating a large model
With data like this in hand, we should check how well a standard large model does on this task, to establish a baseline. We'll use Llama 3.3 70B Instruct here because it's still a fairly popular and widely-deployed model. To check, we'll use a system prompt like the following:
You are a tool-calling model responding with the git operation tool call based on the desired action.
You will be given a single task and you should solve it by generating an appropriate tool call according to the provided tool schema. Stick to the format of the following examples:
{"name": "git_stash", "parameters": {"action": "apply", "stash_ref": "stash@{5}"}}
{"name": "git_merge", "parameters": {"branch": "vendor", "strategy": "ours"}}
To score predictions, we parse both the ground truth and the model output into Python dicts. Then we normalize them to remove any default-value arguments and compare the normalized dicts for structural equality. This ignores irrelevant formatting differences (whitespace, key order) and whether a default value is included or not, but flags any output that does not match the reference tool call.
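A minimal sketch of this metric might look as follows; the DEFAULTS table is a hypothetical stand-in for whatever per-tool default values the real evaluation derives from the schemas:

import json

# Hypothetical per-tool defaults; in practice these would come from
# the tool schemas rather than being hard-coded.
DEFAULTS = {"git_push": {"force": False, "set_upstream": False}}

def normalize(call: dict) -> dict:
    """Drop arguments that merely restate a tool's default value."""
    defaults = DEFAULTS.get(call["name"], {})
    params = {
        k: v for k, v in call.get("parameters", {}).items()
        if defaults.get(k) != v
    }
    return {"name": call["name"], "parameters": params}

def is_correct(reference: str, prediction: str) -> bool:
    """Parse both tool calls and compare them structurally."""
    try:
        return normalize(json.loads(reference)) == normalize(json.loads(prediction))
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # unparsable or malformed output counts as wrong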
With this setup, our benchmark "teacher" model achieves a score of 0.94 on the test dataset linked above.
Training the student
To actually start fine-tuning a smaller model, we need much more data than we have above. One way to get it is to use our seed data, together with the tool call schemas, as guides for generating many more examples. Our platform has its own way of doing this, which you can find out more about here. We end up with a dataset of 10k input-output pairs, which we can use to train the much smaller student model (Llama 3.2 3B Instruct) to do just as well as the teacher on this task while having roughly 25x fewer parameters, bringing massive cost and latency benefits.
The end result
With the above setup, our student model achieves a score of 0.94 on the hold-out test dataset, exactly the same as the much larger teacher model. At 3B parameters, the end result is straightforward to host on any modern dev machine (most queries take less than 2s to return a response on my M4 MacBook Pro once the model is loaded, and that's without quantization).
We also experimented with a 1B variant instead of the 3B one, which achieves a score of 0.88. This model is even less demanding and, depending on your use case and dataset difficulty, might offer good-enough performance.
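As a rough illustration of local hosting, the sketch below loads a fine-tuned Llama-3.2-style checkpoint with Hugging Face transformers and asks it for a tool call. The model identifier and file paths are placeholders, and we assume the chat template injects the tool schemas into the prompt the way Llama 3 templates do:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/or/hub-id-of-the-fine-tuned-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

with open("tools.json") as f:  # the git tool schemas (placeholder path)
    tools = json.load(f)

messages = [{"role": "user", "content": "push feature-x to origin and track it"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected shape of the reply (parameter names illustrative):
# {"name": "git_push", "parameters": {"remote": "origin", "branch": "feature-x", "set_upstream": true}}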
Future improvements
There are several improvements we could make from here. One obvious one is constrained decoding, where we set up the inference engine so it can only output valid JSON. While the student model was producing syntactically correct structure most of the time, this would likely bump the performance further with minimal latency cost.
Similarly, a natural next step would be to train a model to solve more complex tasks using multi-turn workflows. The idea is for the user to describe a larger task and then run a few back-and-forth rounds, each time either returning the result of the last requested tool call or giving the model feedback about what to change.
Finally, quantization could make the model even smaller and easier to deploy without noticeably compromising performance.
Conclusions
Wrapping up, we took a tiny model and made it work locally just as well as one 25x bigger. The workflow shown above is generic: while we demonstrated it on a git assistant, it will work just as well for other tool-calling scenarios. You now have a process in your toolbelt that can be applied to many other problems!
While doing all of this manually is a bunch of work, the distil labs platform abstracts away most of the difficult parts so you can focus on defining the task.