Elos and Benchmarking LLMs

Every week, it seems like a new LLM is released. AI experts and influencers start talking almost immediately about what this particular new model excels at and why you might want to consider dropping whatever you’re using now and getting in on the new hotness. It’s very hard to know when this analysis is based upon actual data and when it’s just grabbing on to something shiny.

Benchmarking is what we’re all relying on to make these judgements, except there are a lot of different benchmarks and they don’t always agree. Many people’s benchmark of choice has been the LMSYS Chatbot Arena Elo ratings. As of September 24th, their current leaderboard looks like this:

Chatbot Arena Elo ratings

This clearly shows OpenAI’s “o1-preview” model at the top of the chart, with an Elo score of 1355. Then a bunch of ties as we go through the top 10.

This rating is based upon a human judging the winner of an anonymous head-to-head matchup between LLMs in the “Chatbot Arena”.

Unlike most other benchmarks, this rating is not based upon giving the model some kind of automated quiz, but rather an A vs. B comparison with the winner chosen by a person.

Here’s an example, which I would have gotten wrong because I thought there was only one “R” in sarsaparilla. Gemini must have gotten it wrong assuming that all words now have three “R”s if asked 🙃.

A win for May ’24 GPT-4o

After you vote, you get to see which model produced the answer you picked.

Like in chess (where Elo ratings have their roots), the Elo rating indicates the likelihood of winning a head-to-head matchup. It’s not any sort of rating of skill or intelligence beyond that. Aside from incorrectly spelling it as “ELO” (it’s not an acronym; it’s named after a physicist named Arpad Elo), the most common misunderstanding of Elo is that it has some kind of inherent value across disciplines. Just because chess 🐐 Magnus Carlsen hit a chess Elo of 2882 doesn’t mean anything about how smart GPT-4o is now or may be in the future. It just means that if I, as a 1000 Elo player, played him, I’d have something like a 0.00197% chance of winning. Which honestly seems way too high, though maybe Magnus might bonk his head on something or get caught up in a sharknado during the match.

o1-preview’s rating of 1355 means that it should theoretically beat Claude 3.5 Sonnet (rating 1269) 62% of the time in matchups where there’s a winner. As o1-preview is still pretty new, there’s a big confidence interval there (+12/-11); the actual win rate of 57% didn’t quite match, but it’s close.
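If you want to check those two probabilities yourself, here’s a minimal sketch of the standard Elo expected-score formula; the function is my own illustration, not anything from the LMSYS leaderboard code:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Me (1000) vs. Magnus Carlsen (2882): roughly a 0.002% chance.
print(f"{expected_score(1000, 2882):.7f}")   # ~0.0000197

# o1-preview (1355) vs. Claude 3.5 Sonnet (1269): roughly 62%.
print(f"{expected_score(1355, 1269):.2f}")   # ~0.62
```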

The fact that Claude 3.5 Sonnet beat o1-preview 43% of the time seems to indicate that everything at the top of the list right now is pretty close in terms of quality. Even the lowest-scoring model in the top 25 right now (Meta-Llama-3.1-70b-Instruct, Elo rating 1248) beat o1-preview 24% of the time. Except that doesn’t really seem to line up with people’s real-world experience with the models, where the leaders seem to be further ahead than that.

So intuitively, it doesn’t seem like there’s a big difference in 100 Elo points, which is the spread of the top 25 of the leaderboard right now. After all, if I’m playing on chess.com rated 1000 and I come across someone rated 1100, I’m probably going to think to myself that they aren’t really much better than me and this is a pretty even matchup.

The difference with LLM Elos is that they are the result of a lot more matchups from systems with a lot more consistency.

Some of these models in the arena have over 4,000 decisive matchups against other models. Even if May ’24 GPT-4o has only a 53% win rate and a 16-point Elo edge on Claude 3.5 Sonnet, they have decisively battled 3,829 times in the arena… meaning GPT has a 114-win lead on Claude in the series. I’d posit that my perception of GPT-4o being slightly better is because I’m thinking about those few cases I saw where one was distinctly better than the other.

Yet on other benchmarks, like the Scientific Reasoning & Knowledge (GPQA) benchmark, Claude wins against the May ’24 version of GPT-4o. So which one is better?

Better at what?

This is where things get interesting. One of the reasons people have turned to the Chatbot Arena rankings is that it is based upon actual human evaluation over a much wider range of questions than other rankings. There are a lot of other test-based rankings: MMLU, GPQA, MATH, MGSM, and these can be very useful (check out Artificial Analysis for comparisons using those), but the consensus of overall “best” seems to have most closely followed the Chatbot Arena rankings.

Actual humans judging on a wide range of questions has to be better than automated testing against a fixed set of questions, right? I still think yes, but I’d like to highlight one significant bias in these judgements and present a concern about using this method long-term.

The biggest bias is the type of questions asked and who is doing the judging. Unsurprisingly, the Chatbot Arena is full of techies asking techie questions… which may or may not match your intended use case.

LMSYS released 33,000 actual questions & answers that were used last year for the Chatbot Arena. I’ve attempted to categorize them (using GPT-4o), and of the 60%+ that I was able to categorize, over 80% were technical or mathematical in nature.
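For the curious, the categorization pass looked roughly like the sketch below; the category list and prompt wording here are my assumptions for illustration, not necessarily what produced the exact numbers above:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical category list for illustration.
CATEGORIES = "programming, math, science, writing, roleplay, general knowledge, other"

def categorize(question: str) -> str:
    """Ask GPT-4o to pick a single category for an arena prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Classify the user's question into exactly one of: {CATEGORIES}. "
                        "Reply with the category name only."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(categorize("What is the average air speed velocity of an unladen swallow?"))
```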

There are a huge number of questions about programming. If that’s what you use LLMs for, then great: these rankings are probably going to be pretty spot-on for you. If you’re using it for medieval English literature, then you may be out of luck; there are zero questions that mention Chaucer. Yet there were dozens of questions about Star Trek. Maybe a model that does well on Star Trek also does well on Chaucer? It’s hard to say without anyone testing for it. Still, there are over 4,000 mentions of Python, including at least 10 in relation to this famous question:

Question: What is the average air speed velocity of an unladen swallow?
claude-instant-v1 (voted winner ✅): African or European swallow?
claude-v1: I apologize, but I do not actually know the air speed velocity of an unladen swallow. That was a reference to a line from Monty Python and the Holy Grail.

Human judgement, of the nerdy variety.
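If you want to poke at the released conversations yourself, here’s a rough sketch of the kind of keyword counting behind the Python / Chaucer / Star Trek tallies above. The dataset ID and field names are my assumptions about the public Hugging Face release, so treat this as a starting point rather than the exact script used:

```python
from datasets import load_dataset

# Assumed dataset ID for the released arena conversations.
ds = load_dataset("lmsys/chatbot_arena_conversations", split="train")

def count_mentions(keyword: str) -> int:
    """Count conversations whose user prompts mention a keyword."""
    keyword = keyword.lower()
    total = 0
    for row in ds:
        # Assumed schema: user turns of model A's conversation hold the prompts.
        prompt_text = " ".join(
            turn["content"] for turn in row["conversation_a"] if turn["role"] == "user"
        )
        if keyword in prompt_text.lower():
            total += 1
    return total

for word in ("python", "chaucer", "star trek"):
    print(word, count_mentions(word))
```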
