How Accurate is HyperLogLog++ in GA4?
Categories: analytics
There are a lot of reasons why your GA4 data may not line up with expectations, but one of the most confusing is HyperLogLog++ (HLL++) cardinality estimation, which is used to estimate user counts. This estimation algorithm shows up in the GA4 UI for all counts, whether they are flagged as “sampled” or not, and the resulting numbers will not match what you see in BigQuery. So, what is HLL++ and how much is it really messing with our numbers?
First let’s dispel a few myths:
- HLL++ is pretty accurate.
  - Usually within 1.6% accuracy for users (95% confidence interval).
  - It’s less accurate for sessions, ±3.3%.
  - Yes, it can be frustrating to have numbers not line up exactly, but as far as issues that affect the accuracy of your analytics go, it’s pretty low on the list compared to things like ad blockers, cookie limitations, bots, etc. Let’s face it: if you’re focusing on making that last 1-2% match up you are probably wasting your time.
- HLL++ is not something new to GA4; it was used in UA as well.
  - It was used in UA for user counts dating back to 2017.
  - GA4 also uses it for more, including counting sessions.
  - HLL is also used in Adobe, though just for count distinct on dimension values and not for user counting like in GA.
- HLL++ is not AI, machine learning, sampling, or modeling.
  - It’s a probabilistic counting method: an algorithm to quickly estimate cardinality using randomness.
So… what the heck is it?
Let’s start with that name. While “HyperLogLog++” does sound pretty intensely sci-fi, it’s actually kind of a silly name. Computer science has a long history of stupid names for cool things. “C++” was named as an improved version of C (“++” meaning incrementing by one). “C” was an improved version of the B programming language. Similarly “HyperLogLog++” is a double improved version of the LogLog algorithm. Or perhaps a triple improved version, since there was already a “super-LogLog”. The “LogLog” algorithm itself may be so named because the measure of how much memory the method uses is based upon a logarithm of a logarithm. If it was called “ReallyExcellent Multiplication” it wouldn’t seem as intimidating… though let’s be honest that’d be a pretty awful name.
What does it do?
It counts how many unique items there are in a list. For example, if we wanted to know how many different users we had this month with source=google, we couldn’t just sum all the visits per day, because some users had one visit and some had 10 over multiple days. We need to de-duplicate that list to count the number of distinct user ids. HLL++ provides a much more efficient way to do that without actually going through each user id and storing it in memory to do a comparison.
If you’re familiar with SQL, what HLL++ does is provide a more efficient way to get the count of a “select distinct” without having to look at all the data. If you’ve ever done a select distinct on a big dataset you know it can be pretty slow. Within GA4, this provides the count estimates for sessions and users in the UI and the Data API. This means that counts of these things in GA4 are not “actuals” and won’t match up exactly with things that are not estimated, like a direct count in BigQuery.
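As a rough illustration of that difference, here are two BigQuery queries against a GA4 export table (the dataset name is a placeholder, mirroring the one used in the validation queries later in this post). The first is the slow, exact way; the second uses APPROX_COUNT_DISTINCT, which BigQuery implements with HLL++:

-- Exact count: BigQuery has to keep track of every distinct id it has seen.
SELECT COUNT(DISTINCT user_pseudo_id) AS exact_users
FROM `ga4-data.analytics_xxx.events_202308*`
WHERE traffic_source.source = 'google';

-- Approximate count: HLL++ under the hood, using a small fixed-size sketch
-- instead of the full list of ids.
SELECT APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_users
FROM `ga4-data.analytics_xxx.events_202308*`
WHERE traffic_source.source = 'google';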
How does it work?
We’re not going to get at all deep into the algorithm, but at a very basic level what it does is create hashes of the data and examine how unlikely those hashes are. The more unlikely the hashes are, the more different items there probably are in that set.
Imagine someone who told you that they were flipping coins and got heads six times in a row. You can either assume that they are very lucky, or that they had been doing a whole bunch of coin flipping in order to get that result. If multiple people tell you they got six consecutive heads, it becomes nearly assured that there’s a whole lot of coin flipping going on. In HLL, the coin flips would be the starting digits of the binary representation of the hash. So the assumption is that if you see multiple hashes that start with 000000, there must be quite a lot of unique items in your original data, since that is an unlikely outcome.
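If you want to see that coin-flip intuition as a query, here is a toy version of the idea in BigQuery. This is not HLL++ itself (the real algorithm splits the hashes into thousands of buckets and averages across them to smooth out luck); it just hashes each id, finds the longest run of leading zero bits, and guesses the distinct count from that. The table name is the same placeholder used elsewhere in this post:

-- Toy single-estimator version of the idea, NOT real HLL++.
WITH hashed AS (
  SELECT FARM_FINGERPRINT(user_pseudo_id) AS h
  FROM `ga4-data.analytics_xxx.events_202308*`
),
runs AS (
  SELECT
    CASE
      WHEN h < 0 THEN 0  -- sign bit is set, so zero leading zeros
      ELSE 63 - CAST(FLOOR(LOG(CAST(h AS FLOAT64), 2)) AS INT64)
    END AS leading_zero_bits
  FROM hashed
  WHERE h != 0
)
-- The longer the longest run of leading zeros, the more ids we probably hashed.
SELECT POW(2, MAX(leading_zero_bits) + 1) AS very_rough_user_estimate
FROM runs;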
An excellent, more detailed explanation of the algorithm can be found here. That blog post predates HLL++, but the “++” doesn’t change the algorithm conceptually; it just adds some optimizations.
The output of this estimation is called a “sketch” and is kind of like a compressed version of what’s in that set, suitable for comparison or merging with other sketches, though not actually containing any of the original data.
One of the interesting things about HLL is that the sketches are always the same size at whatever precision level is set. Whether you’re trying to get a count across a million rows or 10, the sketch size is the same. Kind of like how an MD5 hash is always 32 characters long no matter what the contents of the file you’re hashing are.
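BigQuery exposes these sketches directly, which is a nice way to see the merging idea in action. As a sketch of the pattern (same placeholder table as above), you can build one fixed-size sketch per day and then merge them into a de-duplicated monthly user count without ever re-reading the raw ids:

-- One sketch per day, then merge the daily sketches for a monthly distinct count.
WITH daily_sketches AS (
  SELECT
    event_date,
    HLL_COUNT.INIT(user_pseudo_id, 14) AS user_sketch  -- precision 14, like GA4 uses for users
  FROM `ga4-data.analytics_xxx.events_202308*`
  GROUP BY event_date
)
SELECT HLL_COUNT.MERGE(user_sketch) AS monthly_users
FROM daily_sketches;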
How accurate is HLL++?
It’s easy enough to plot accuracy vs. sketch size, and Google even provides this data.
We can see here that as the precision setting is increased, the amount of error goes down and the amount of memory used goes up. It’s interesting to note that while the default setting in BigQuery is a precision of 15, the settings in GA4 are lower: 12 for sessions and 14 for users. This makes sense from Google’s perspective, as they have decided that an additional 0.5% accuracy isn’t worth doubling the storage size from 16K to 32K… but it feels a little stingy. The precision for sessions in particular is lower than would be ideal. For cases like A/B testing, where this sort of error could change test outcomes, it would be nice to have both higher precision and options to get the actual counts via the Data API.
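The reason precision maps to error this way is that the sketch has 2^precision buckets, and HLL’s standard error is roughly 1.04 / √(2^precision): about 0.8% at precision 14 and about 1.6% at precision 12, which at a 95% confidence interval works out to roughly the 1.6% and 3.3% figures quoted above. If you want to see the effect yourself, you can run the same estimate at several precisions in BigQuery (same placeholder table as before):

-- Same data, three precisions: 12 (GA4 sessions), 14 (GA4 users), 15 (BigQuery default).
SELECT
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 12)) AS users_p12,
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS users_p14,
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 15)) AS users_p15,
  COUNT(DISTINCT user_pseudo_id) AS users_exact
FROM `ga4-data.analytics_xxx.events_202308*`;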
Those figures are all at a 95% confidence interval, so nearly all of the time we should see less error than that. The error distribution itself is roughly normal, so the number of extreme outliers should be very low.
But that’s all in theory, right? Does the performance actually match this? In my testing it does. It’d be challenging to test this comprehensively, but it’s pretty trivial to do some basic validation. For example:
SELECT HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS total_user_count FROM `ga4-data.analytics_xxx.events_202308*`
(which would simulate what happens inside GA4)
SELECT COUNT(DISTINCT user_pseudo_id) AS total_user_count FROM `ga4-data.analytics_xxx.events_202308*`
(which is an actual count). This showed a 0.93% error for me in an example with millions of users. Running more spot tests never showed anything with an error > 1.63%, though again I did not test this extensively.
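If you’d rather get the relative error from a single query instead of eyeballing two results, something along these lines works (same placeholder table):

-- Approximate and exact user counts side by side, plus the relative error in percent.
WITH counts AS (
  SELECT
    HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS approx_users,
    COUNT(DISTINCT user_pseudo_id) AS exact_users
  FROM `ga4-data.analytics_xxx.events_202308*`
)
SELECT
  approx_users,
  exact_users,
  ABS(approx_users - exact_users) / exact_users * 100 AS pct_error
FROM counts;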
Thanks to Jim Kultgen for helping to clarify how HLL is used in Adobe.
Thanks for this, it really helped me when working with BQ. This is even more crucial for GA’s core value, which is connecting sessions to traffic sources. I got some strange results where not-that-large data sets (less than 5K MAUs) get WILDLY different numbers for users per source/medium between GA4 and BQ, and even wilder differences on large data sets (millions of users).
Maybe you can explain a bit more about this phenomenon: for the same data set, when breaking down users per source/medium, “small” sources (i.e. a few dozen users) show a smaller difference than “large” sources (a few hundred). How is HLL++ causing this?
Hi Or, that sounds more like thresholding to me than HLL++. If there’s thresholding on your data in the standard interface due to Google Signals it wouldn’t happen in GBQ: https://support.google.com/analytics/answer/13644080
No Signals, no thresholding.
For example:
Looking at the number of users per source/medium, for a “small” source GA shows 13 users and BQ shows 24.
For a “large” source, BQ shows 522 vs. 403 in GA.
This is single-day data, and I am aware of the time difference between GA and BQ, BUT even when checking the same sources +/- a day the numbers don’t add up.
I’ve also tested this on a client with millions of users (an average of 350K MAUs) and am getting the same weird results.
The GBQ exports don’t include session-level attribution like the regular GA interface does; they include only event-level attribution and (delayed) returning user-level attribution. It’s very annoying. https://support.google.com/analytics/answer/9358801?hl=en
I know that, BUT the last update they released supports the ability to create session-based attribution based on the auto events session_start & page_view, with source/medium being part of the event params.
Hmmm… I’m not sure what else it might be then, sorry!
Unfortunately, the use of HLL, HLL++, or any alternative method based on the same core principle to obtain a count destroys (not an overstatement) any possibility of using it in most statistical analyses: https://blog.analytics-toolkit.com/2020/the-effect-of-using-cardinality-estimates-like-hyperloglog-in-statistical-analyses/ In short, these “small” relative errors are to blame because they are constant across most sample sizes, instead of shrinking as the sample gets larger. It is 1.6% relative error for 50,000 unique IDs and 1.6% relative error for 50 million unique IDs, which is contrary to the data models used in most statistical tools.
Hi Georgi, thanks for your comment and your excellent article. I did actually cite your article, but I think it’s worth re-stating.