How Good is AI at Google Analytics 4?

Google has long offered a “Google Analytics Certification” quiz. For GA4, it’s 50 questions (drawn from a pool of 150) that cover a broad range of topics within GA. The quiz is timed, and to be certified you have to score 80% or better. Since the quiz is open book and most of the answers are available in the help section, it is not a terribly hard quiz. Nor is it meant to be.

After reading so many discussions about which LLM is better at which task, I got curious about how good current models have gotten with GA4.

My first step was to run current models through the GA4 Certification quiz, so I fed 9 different models 100 questions from the exam. All of the current leading models passed it easily; a couple of years ago that likely wouldn’t have been the case. In fact, I was a little surprised at the degree to which they all aced it, with scores ranging from 86% (GPT-4o-mini) to 98% (Claude 3.7 Sonnet).

I asked ChatGPT to create its own certificate of completion, and it did not disappoint.

I couldn’t complete the quiz any better myself. I know because I tried. OpenAI’s newest model, o3, finished 50 questions in 4 minutes 49 seconds and scored 98%. I tried speed-running the quiz, but it took me 9 minutes and I only got 88% correct. You are allotted 1 hour 15 minutes, so I don’t feel too bad (but I still feel like I could have done better).

The GA Certification quiz is largely a regurgitation of what’s in the GA4 help section. As long as an LLM has ingested an up-to-date version of that help content, and the docs themselves are accurate, it should have no trouble passing. After all, if GPT-4 can pass the bar exam, we shouldn’t be shocked that today’s more advanced models can ace this simple quiz.

Spitting the documentation back out doesn’t make you an expert, though, so we needed to go deeper. Next I found a significantly harder quiz, provided by MeasureSchool.

MeasureSchool’s questions require more synthesis and real knowledge of best practices. The answers are also harder to infer from the questions alone, which was a weakness of the first test.

The tested models still did OK, though only o3 and Gemini 2.5 Pro hit the 80% level. Scores ranged from 59% (DeepSeek Chat v3) to 94% (Gemini 2.5 Pro).

If in the past you wondered why Google’s own tools were so mediocre at GA knowledge, it seems like the newest Gemini model has turned that trend around.

We still needed to go further to really test the limits of these models’ knowledge and reasoning. To do so, I created my own advanced GA4 quiz, containing questions that even real human experts should have to think about.

My quiz is only 15 questions, but it dives pretty deep into GA4. I tried to use real examples that I’ve run across and that make you think. I also avoided deliberately tricky questions, since I don’t see the point of exploiting known LLM weaknesses just to show we can.

After making the quiz I enlisted Todd Bullivant, one of the most knowledgeable and prolific contributors on Measure Slack’s #google-analytics channel, to validate it. He got 100% right, though he said he did have to think about some of them. With ToddGPT’s validation, it was time to see how the actual AIs did.

They had a tough time with the questions, and only o3 reached the 80% level. Scores ranged from 27% (Grok 3) to 80% (o3).

GA4 Quiz, Hard Mode

Model                                 Correct   Total   Percentage
openai/o3                                  12      15          80%
google/gemini-2.5-pro-preview-03-25         9      15          60%
openai/gpt-4o-mini                          9      15          60%
openai/gpt-4.1                              8      15          53%
deepseek/deepseek-chat-v3-0324              7      15          47%
meta-llama/llama-4-maverick                 7      15          47%
anthropic/claude-3.7-sonnet                 7      15          47%
google/gemini-2.0-flash-001                 5      15          33%
x-ai/grok-3-beta                            4      15          27%

I expect that quite a few of my readers are looking at that list and thinking, “but you didn’t test model X”. Sorry, I’ve got my preferences and I’m not trying to break the bank on openrouter.ai credits (the multi-LLM service I ran the tests through).
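If you want to run a similar test yourself, a minimal sketch of a harness against OpenRouter’s OpenAI-compatible API might look like the following. The question list and the letter-matching grader here are simplified placeholders, not my exact setup:

```python
# Minimal sketch: score several models on multiple-choice questions
# via OpenRouter's OpenAI-compatible endpoint. QUESTIONS is a
# placeholder; load your own (question, correct letter) pairs.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your openrouter.ai key
)

MODELS = ["openai/o3", "anthropic/claude-3.7-sonnet", "openai/gpt-4o-mini"]

QUESTIONS = [
    # Illustrative entry; replace with your own quiz.
    ("In GA4, what scope does a user property have?\n"
     "A) Event\nB) Session\nC) User\nD) Item", "C"),
]

for model in MODELS:
    correct = 0
    for question, answer in QUESTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Answer the multiple-choice question with "
                            "only the letter of the best answer."},
                {"role": "user", "content": question},
            ],
        )
        reply = resp.choices[0].message.content.strip().upper()
        if reply.startswith(answer):
            correct += 1
    print(f"{model}: {correct}/{len(QUESTIONS)} "
          f"({correct / len(QUESTIONS):.0%})")
```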


The cases where the AIs had trouble are perhaps instructive:

  • Questions where something was true in UA but is no longer true in GA4 proved difficult.
    • E.g. the “500 hits per session” limit
  • Hallucination bait, where I offered up a feature that probably should exist but doesn’t.
    • E.g. an “Email Alerts” section in the admin
  • The LLMs also seized upon statements that were true but not germane to the question.
    • E.g. “Email is considered PII by GA4 and therefore disallowed by policy”

The AI world is changing so fast that the model on top today might well be surpassed next week. As an example, when I started writing this article two days ago, o3 had not yet been released. Without o3 the results of my “advanced” quiz would have made it seem as though LLMs had a long way to go when it came to examples requiring reasoning. The pace of change is absurdly fast.

I last wrote about Elos and benchmarking LLMs only about 6 months ago. The benchmarking landscape has changed quite a lot since then.

  1. DeepSeek shook the AI world, not through benchmark improvements (which were modest, if any) but through improvements in cost per performance.
  2. Llama has been accused of trying to game Elo ratings.
  3. The Humanity’s Last Exam benchmark was released to add headroom beyond older benchmarks that are trending toward being solved.

While it does seem as though these AIs are on the brink of being able to effectively answer most any factual question about GA4, they still seem to have a long way to go when it comes to doing actual analysis. This is certainly much harder to validate, but I have yet to see an “AI analysis” platform that does much more than highlight anomalies regardless of context. They can, of course, be incredible tools to help you do your analysis, but I would argue we’re still pretty far from being able to point an AI at a GA4 account and tell it to “go find insights”.
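To make “highlighting anomalies regardless of context” concrete, here’s a purely illustrative sketch of what such a check boils down to: a trailing-window z-score over a daily sessions series, with made-up numbers and a made-up threshold. Nothing in it knows whether the flagged spike was a campaign launch or a tagging bug, and that gap is the analysis part:

```python
# Purely illustrative: context-free anomaly flagging, roughly the
# level of "insight" many AI analysis tools offer. The sessions
# series and 3-sigma threshold are invented for the example.
import statistics

daily_sessions = [1200, 1180, 1250, 1230, 1210, 1190, 1220,
                  1240, 1205, 1215, 2600, 1225, 1195, 1235]

WINDOW = 7        # trailing days used as the baseline
THRESHOLD = 3.0   # flag anything more than 3 standard deviations out

for day in range(WINDOW, len(daily_sessions)):
    baseline = daily_sessions[day - WINDOW:day]
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    z = (daily_sessions[day] - mean) / stdev if stdev else 0.0
    if abs(z) > THRESHOLD:
        # An analyst would now ask *why*; the tool just points.
        print(f"Day {day}: {daily_sessions[day]} sessions (z = {z:.1f})")
```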

Curious to take the “advanced” GA4 quiz yourself? I’ve embedded it here; let me know if you think I’ve gotten anything wrong.
