Comparing GPT-4o mini vs Llama 3.1 for fine-tuning

Kyle Corbitt, the founder of OpenPipe, shares his experience comparing GPT-4o mini and Llama 3.1 for fine-tuning. The bottom line? AI models are getting good, really good.

Two weeks ago, two new major model families became available for fine-tuning: Llama 3.1, which comes in 8B, 70B and 405B(!) variants, and GPT-4o mini. We’ve added them to the OpenPipe platform and run all of them (except Llama 3.1 405B) through our evaluation harness. Here’s what you need to know if you’re interested in using them as a base for fine-tuning.

High-level Stats

Model Quality

The good news is that all 3 models are extremely high quality. The bad news is that they saturate most of the standard evals we ran, which makes comparing them difficult! In fact, both of the smaller Llama 3.1 variants saturate all 3 of the standard evals we ran, and GPT-4o mini saturates 2 of the 3.

What do we mean by saturate? For any given input, you can imagine there is a potential “perfect” output (or set of outputs) that cannot be improved upon. The more complex the task, the more difficult it is for a model to generate a perfect output. However, once a model is strong enough to consistently generate a perfect output for that task, we consider the task saturated for that model. In our LLM-as-judge evals, this usually shows up as a cluster of models all doing about the same on the task, with no model significantly outperforming the others.

And in fact, that’s what we see in the evaluations below:

In the chart above, all 3 fine-tuned models do about as well as each other (win rates within 6%) on both the Resume Summarization and Data Extraction tasks. On Chatbot Responses, however, both Llama 3.1 variants significantly outperform GPT-4o mini. So the Chatbot Responses task isn’t saturated for GPT-4o mini, but all other tasks and models are.

This is very significant—we chose these tasks explicitly because older models on our platform, like Mistral 7B and Llama 3 8B, did not saturate these tasks! There are two main reasons why we’re seeing this saturation now:

The new models we’re testing here are stronger than the previous generation of models available on-platform.
Our benchmark models are now all trained on datasets relabeled with Mixture of Agents, which substantially improves the quality of the dataset and thus the fine-tuned model.
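
To make the clustering signal described earlier concrete, here is a minimal Python sketch of one way to flag models whose win rates sit within a tolerance of the best performer on a task. The function names, the 6% default tolerance (borrowed from the chart discussion), and the sample numbers are all illustrative assumptions. This is not OpenPipe's evaluation harness, and the numbers are placeholders rather than real results.

```python
from typing import Dict, Set


def models_in_top_cluster(win_rates: Dict[str, float],
                          tolerance: float = 0.06) -> Set[str]:
    """Return the models whose win rate is within `tolerance` of the best one.

    `win_rates` maps a model name to its win rate (0.0 to 1.0) on one task.
    The tolerance is an assumed heuristic, not a published threshold.
    """
    if not win_rates:
        return set()
    best = max(win_rates.values())
    return {name for name, rate in win_rates.items() if best - rate <= tolerance}


def looks_saturated(win_rates: Dict[str, float],
                    tolerance: float = 0.06) -> bool:
    """True when every model on the task falls inside the top cluster."""
    return models_in_top_cluster(win_rates, tolerance) == set(win_rates)


# Placeholder numbers for illustration only, NOT results from the article.
example_clustered = {"model_a": 0.50, "model_b": 0.52, "model_c": 0.47}
example_spread = {"model_a": 0.55, "model_b": 0.53, "model_c": 0.35}

if __name__ == "__main__":
    print(looks_saturated(example_clustered))
    print(sorted(models_in_top_cluster(example_spread)))
```

A tight cluster is only a hint of saturation: models can also cluster because they share the same weakness, so a check like this is best treated as a prompt to look for a harder benchmark rather than as proof.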

We’re working on developing better benchmarks (please email me if you have a high-quality internal dataset you’d be interested in adapting into an internal benchmark!), and once we have some higher-difficulty ones we’ll analyze Llama 3.1 405B as well. Of course, happy to answer any questions about fine-tuning!

About the author

Kyle Corbitt is the founder of OpenPipe, the easiest way to train and deploy your own fine-tuned models. Formerly, Kyle was a director at Y Combinator, engineer at Google, and co-founder at Emberall.

This article was originally published on OpenPipe's blog.