Elos and Benchmarking LLMs

Every week, it seems like a new LLM is released. AI experts and influencers start talking almost immediately about what this particular new model excels at and why you might want to consider dropping whatever you’re using now and getting in on the new hotness. It’s very hard to know when this analysis is based upon actual data and when it’s just grabbing on to something shiny.

Benchmarking is what we’re all relying on to make these judgements, except there are a lot of different benchmarks and they don’t always agree. Many people’s benchmark of choice has been the LMSYS Chatbot Arena Elo ratings. As of September 24th, their current leaderboard looks like this:

Chatbot Arena Elo ratings

This clearly shows OpenAI’s “o1-preview” model at the top of the chart, with an Elo score of 1355. Then a bunch of ties as we go through the top 10.

This rating is based upon a human judging the winner of an anonymous head-to-head matchup between LLMs in the “Chatbot Arena”.

Unlike most other benchmarks, this rating is not based upon giving the model some kind of automated quiz, but rather an A vs. B comparison with the winner chosen by a person.

Here’s an example, which I would have gotten wrong because I thought there was only one “R” in sarsaparilla. Gemini must have gotten it wrong assuming that all words now have three “R”s if asked 🙃.

A win for May ’24 GPT-4o

After you vote, you get to see which model produced the answer you picked.

Like in chess (where Elo ratings have their roots), the Elo rating indicates the likelihood of winning a head-to-head matchup. It’s not any sort of rating of skill or intelligence beyond that. Aside from incorrectly spelling it as “ELO” (it’s not an acronym; it’s named after a physicist named Arpad Elo), the most common misunderstanding of Elo is that it has some kind of inherent value across disciplines. Just because chess 🐐 Magnus Carlsen hit a chess Elo of 2882 doesn’t mean anything about how smart GPT-4o is now or may be in the future. It just means that if I, as a 1000 Elo player, played him, I’d have something like a 0.00197% chance of winning. Which honestly seems way too high, though maybe Magnus might bonk his head on something or get caught up in a sharknado during the match.

o1-preview’s rating of 1355 means that it should theoretically beat Claude 3.5 Sonnet (rating 1269) 62% of the time in matchups where there’s a winner. As o1-preview is still pretty new, there’s a big confidence interval there (+12/-11); the actual win rate of 57% didn’t quite match, but it’s close.
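If you want to check those two probabilities yourself, here’s a minimal sketch of the standard Elo expected-score formula; the function is my own illustration, not anything from the LMSYS leaderboard code:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Me (1000) vs. Magnus Carlsen (2882): roughly a 0.002% chance.
print(f"{expected_score(1000, 2882):.7f}")   # ~0.0000197

# o1-preview (1355) vs. Claude 3.5 Sonnet (1269): roughly 62%.
print(f"{expected_score(1355, 1269):.2f}")   # ~0.62
```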

The fact that Claude 3.5 Sonnet beat o1-preview 43% of the time seems to indicate that everything at the top of the list right now is pretty close in terms of quality. Even the lowest-scoring model in the top 25 right now (Meta-Llama-3.1-70b-Instruct, Elo rating 1248) beat o1-preview 24% of the time. Except that doesn’t really seem to line up with people’s real-world experience with the models, where the leaders seem to be further ahead than that.

So intuitively, it doesn’t seem like there’s a big difference in 100 Elo points, which is the spread of the top 25 of the leaderboard right now. After all, if I’m playing on chess.com rated 1000 and I come across someone rated 1100, I’m probably going to think to myself that they aren’t really much better than me and this is a pretty even matchup.

The difference with LLM Elos is that they are the result of a lot more matchups from systems with a lot more consistency.

Some of these models in the arena have over 4,000 decisive matchups against other models. Even if May ’24 GPT-4o has only a 53% win rate and a 16-point Elo edge on Claude 3.5 Sonnet, they have decisively battled 3,829 times in the arena… meaning GPT has a 114-win lead on Claude in the series. I’d posit that my perception of GPT-4o being slightly better is because I’m thinking about those few cases I saw where one was distinctly better than the other.

Yet on other benchmarks, like the Scientific Reasoning & Knowledge (GPQA) benchmark, Claude wins against the May ’24 version of GPT-4o. So which one is better?

Better at what?

This is where things get interesting. One of the reasons people have turned to the Chatbot Arena rankings is that it is based upon actual human evaluation over a much wider range of questions than other rankings. There are a lot of other test-based rankings: MMLU, GPQA, MATH, MGSM, and these can be very useful (check out Artificial Analysis for comparisons using those), but the consensus of overall “best” seems to have most closely followed the Chatbot Arena rankings.

Actual humans judging on a wide range of questions has to be better than automated testing against a fixed set of questions, right? I still think yes, but I’d like to highlight one significant bias in these judgements and present a concern about using this method long-term.

The biggest bias is the type of questions asked and who is doing the judging. Unsurprisingly, the Chatbot Arena is full of techies asking techie questions… which may or may not match your intended use case.

LMSYS released 33,000 actual questions & answers that were used last year for the Chatbot Arena. I’ve attempted to categorize them (using GPT-4o), and of the 60%+ that I was able to categorize, over 80% were technical or mathematical in nature.
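For the curious, the categorization pass looked roughly like the sketch below; the category list and prompt wording here are my assumptions for illustration, not necessarily what produced the exact numbers above:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical category list for illustration.
CATEGORIES = "programming, math, science, writing, roleplay, general knowledge, other"

def categorize(question: str) -> str:
    """Ask GPT-4o to pick a single category for an arena prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Classify the user's question into exactly one of: {CATEGORIES}. "
                        "Reply with the category name only."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(categorize("What is the average air speed velocity of an unladen swallow?"))
```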

There are a huge number of questions about programming. If that’s what you use LLMs for, then great: these rankings are probably going to be pretty spot-on for you. If you’re using it for medieval English literature, then you may be out of luck; there are zero questions that mention Chaucer. Yet there were dozens of questions about Star Trek. Maybe a model that does well on Star Trek also does well on Chaucer? It’s hard to say without anyone testing for it. Still, there are over 4,000 mentions of Python, including at least 10 in relation to this famous question:

Question: What is the average air speed velocity of an unladen swallow?
claude-instant-v1 (voted winner ✅): African or European swallow?
claude-v1: I apologize, but I do not actually know the air speed velocity of an unladen swallow. That was a reference to a line from Monty Python and the Holy Grail.

Human judgement, of the nerdy variety.
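If you want to poke at the released conversations yourself, here’s a rough sketch of the kind of keyword counting behind the Python / Chaucer / Star Trek tallies above. The dataset ID and field names are my assumptions about the public Hugging Face release, so treat this as a starting point rather than the exact script used:

```python
from datasets import load_dataset

# Assumed dataset ID for the released arena conversations.
ds = load_dataset("lmsys/chatbot_arena_conversations", split="train")

def count_mentions(keyword: str) -> int:
    """Count conversations whose user prompts mention a keyword."""
    keyword = keyword.lower()
    total = 0
    for row in ds:
        # Assumed schema: user turns of model A's conversation hold the prompts.
        prompt_text = " ".join(
            turn["content"] for turn in row["conversation_a"] if turn["role"] == "user"
        )
        if keyword in prompt_text.lower():
            total += 1
    return total

for word in ("python", "chaucer", "star trek"):
    print(word, count_mentions(word))
```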
