
A Founder’s Guide to AI Fine-Tuning

Kyle Corbitt
October 11th, 2024

Introduction

About the Author: I’m Kyle Corbitt, the founder and CEO of OpenPipe. We make it easy to move from GPT-4 to a custom fine-tuned model that continuously improves based on your users’ feedback.
Who this is for: You’re a busy founder or engineer building a GenAI-powered feature. You’ve heard about fine-tuning but don’t have deep ML expertise. You might already have a problem you think fine-tuning can solve (lower costs? higher quality?) or might just be getting ready for the future.
What you’ll learn:
  1. What is fine-tuning?
  2. Why fine-tune? What problems does it solve?
  3. When is the right time to think about fine-tuning?
  4. What tools and expertise do I need to be successful?

What is AI Fine-Tuning?

Fine-tuning is the process of steering a model’s behavior by updating its actual weights, as opposed to prompting, which only rewrites the instructions or adds examples to the context. Compared to prompting, fine-tuning gives you much deeper and more nuanced control over a model’s behavior. On the other hand, when done incorrectly it can lead to conditions like “catastrophic forgetting,” where the model actually gets much worse instead of better. Fine-tuning is a power tool!

Why Fine-Tune?

Modern frontier AI models are incredible. They can “zero-shot” many tasks like document classification, information extraction, chatbots and role-playing. However, they still have weaknesses. These include:
  1. Unreliable adherence to instructions. For example, if you ask an LLM to produce a summary of under 50 words, it might still return a 60+ word summary 20% of the time. This is even more common with instructions that are nuanced and subjective (a quick way to measure this kind of drift is sketched after this list).
  2. High costs for operations at scale. Current frontier model costs are low enough that they can usually be justified for realtime user interactions. But for large jobs like sifting through millions of documents or scraping large parts of the web, they quickly become cost-prohibitive.
  3. Latency that is too high for certain applications. If your app requires multiple rounds of LLM calls, e.g. in an agentic workflow, or near-realtime responses for a text-to-voice stream, frontier models may not respond fast enough to provide a good experience.
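
You can quantify that first failure mode directly from your logs. Here’s a minimal sketch, assuming your completions live in a JSONL file with an output field (both the file name and the field name are illustrative):

```python
# Minimal sketch: measure how often logged summaries break an "under 50
# words" instruction. The file name and "output" field are illustrative
# assumptions about how you store completions.
import json

WORD_LIMIT = 50
total = violations = 0

with open("logged_completions.jsonl") as f:
    for line in f:
        summary = json.loads(line)["output"]
        total += 1
        if len(summary.split()) > WORD_LIMIT:
            violations += 1

if total:
    print(f"{violations}/{total} summaries ({violations / total:.0%}) broke the limit")
```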
Fine-tuning can help on all three of these dimensions simultaneously.
  1. A good dataset leads to far greater reliability because the model is strongly conditioned to create outputs that correspond to your needs.
  2. Fine-tuned models are generally much smaller than generalist models. This leads to much lower inference costs (often a 10-100x improvement).
  3. Since fine-tuned models are much smaller, they also can have much lower latency than large generalist models.
This effect is well illustrated in the graph below: the cost-vs-quality Pareto frontier for fine-tuned models is significantly more attractive than the one for prompted models alone.

When Should You Fine-Tune?

Cool, so is fine-tuning a panacea with no tradeoffs? Unfortunately not. There are some very good reasons why you should probably start with prompting.
  1. Fine-tuning specializes a model on a specific shape of input. But if you’re just starting off, you probably don’t know exactly what your input shape will look like! You need actual users using your system to see what inputs they will bring, so you can specialize on that.
  2. Fine-tuning does take some time and money (even if it’s minimal, as we’ll see below!). So it doesn’t make sense to reach for it until you’ve exhausted the limits of easier approaches like prompt engineering.
The typical flow we see is to prompt engineer first, and then fine-tune only when you can’t wring more gains out of prompting.
  1. Prototype Quickly with GPT-4: When exploring an idea or seeking product-market fit, using an off-the-shelf model like GPT-4 makes sense. You can iterate rapidly without worrying about custom infrastructure.
  2. Scale with Fine-Tuning: Once you’ve identified your core use case and are ready to scale, fine-tuning becomes valuable. A fine-tuned model can be:
  • Cheaper: A smaller model optimized for your specific task carries far less inference overhead.
  • Faster: Smaller models also return responses more quickly.
  • More Consistent: Less variance in output, enhancing predictability.
  • More Controllable: You can define exactly how your AI responds, making it a better fit for your users.

How to Fine-Tune

Fine-tuning might seem daunting, but it can be broken down into four high-level steps:
1. Prepare Your Data: Start by saving interactions from your application—these could be customer support queries, user feedback, or prompts and responses from prototyping. This data forms the foundation of your fine-tuning dataset. Tools like OpenPipe or observability providers like Helicone and Portkey can assist with data collection (a minimal logging sketch follows this list).
2. Train Your Model: You can use open-source tools (e.g. Unsloth, Axolotl or LLaMA-Factory) for self-hosted setups, or opt for hosted platforms like OpenPipe, Together, or Fireworks for a more managed experience (see the training sketch below).
3. Evaluate Performance: Evaluation occurs in two loops:
  • Inner Loop: Rapid, iterative evaluations using a “golden dataset” or an LLM acting as a judge (sketched below).
  • Outer Loop: Business metrics—how well is the fine-tuned model performing in terms of user satisfaction and task success?
4. Deploy: Depending on your technical capabilities, you can either self-host your fine-tuned model (using tools like vLLM or TensorRT-LLM) or use hosted options like OpenPipe, AWS Bedrock, Together or Fireworks (a vLLM sketch follows).
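
To make step 1 concrete, here is a minimal sketch that appends each production interaction to a JSONL file in the chat-messages format most fine-tuning tools accept. The helper name and file path are assumptions for illustration:

```python
# Minimal sketch: append each production interaction to a JSONL file in the
# chat-messages format most fine-tuning tools accept. The helper name and
# file path are illustrative assumptions.
import json

def log_interaction(system_prompt: str, user_input: str, model_output: str,
                    path: str = "training_data.jsonl") -> None:
    record = {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": model_output},
    ]}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Call this wherever your app handles a model response worth keeping:
log_interaction("Summarize the document in under 50 words.",
                "<long document text>",
                "<the model's summary>")
```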
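
For step 2, a self-hosted run might look like the sketch below, using Hugging Face transformers with a LoRA adapter from peft and reading the JSONL file produced above. The base model and hyperparameters are illustrative assumptions; tools like Unsloth, Axolotl, and LLaMA-Factory wrap this same basic workflow:

```python
# A minimal self-hosted training sketch: Hugging Face transformers plus a
# LoRA adapter from peft, reading the JSONL file produced by the logging
# sketch above. Base model and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.pad_token or tok.eos_token  # make batch padding work

# LoRA trains small adapter matrices instead of every weight: far cheaper,
# and less prone to catastrophic forgetting than full fine-tuning.
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(BASE),
    LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
               target_modules=["q_proj", "v_proj"]),
)

def tokenize(example):
    # Flatten the logged chat messages into one training string.
    text = tok.apply_chat_template(example["messages"], tokenize=False)
    return tok(text, truncation=True, max_length=2048)

data = load_dataset("json", data_files="training_data.jsonl")["train"]
data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```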
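
For the inner evaluation loop in step 3, an LLM-as-judge check can be a few lines. The sketch below uses the OpenAI Python client; the judge model, prompt wording, and golden-dataset fields are illustrative assumptions:

```python
# Minimal sketch: grade the fine-tuned model's output against a reference
# answer using a frontier model as the judge. The judge model, prompt
# wording, and golden-dataset fields are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str) -> bool:
    prompt = (
        "You are grading a model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

# Usage over an assumed golden dataset of {"question", "reference"} records:
# pass_rate = sum(judge(ex["question"], ex["reference"],
#                       my_finetuned_model(ex["question"]))
#                 for ex in golden_dataset) / len(golden_dataset)
```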
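
Finally, for self-hosted deployment in step 4, vLLM can load your fine-tuned weights and serve batched requests in a few lines of Python (the local model path is an assumption):

```python
# Minimal sketch: load the fine-tuned weights locally with vLLM and run a
# batched generation. The model path is an illustrative assumption (e.g. a
# directory containing merged fine-tuned weights).
from vllm import LLM, SamplingParams

llm = LLM(model="./out/merged-fine-tuned-model")  # assumed local path
params = SamplingParams(temperature=0.2, max_tokens=100)

outputs = llm.generate(
    ["Summarize the document in under 50 words: <document text>"], params)
print(outputs[0].outputs[0].text)
```

In production you would more likely run vLLM’s OpenAI-compatible HTTP server so your application keeps talking to a familiar chat-completions endpoint.
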
All of these are achievable by a reasonably competent software engineer with no specific training in data science or machine learning. We’ve seen organizations go from a prompt to a fine-tuned model with less than an hour of engineering time, if they were already collecting their data!

Is Fine-Tuning Right for You?

If you’ve reached the limits of prompt engineering and your AI-powered feature could still be improved by lower costs, higher reliability or lower latency, fine-tuning may be able to help a lot!