How Accurate is HyperLogLog++ in GA4?

There are a lot of reasons why your GA4 data may not line up with expectations, but one of the most confusing is HyperLogLog++ (HLL++) cardinality estimation, which is used to estimate user counts. This estimation algorithm is applied to all such counts in the GA4 UI, whether they are flagged as “sampled” or not, so they will not exactly match what is seen in BigQuery. So, what is HLL++ and how much is it really messing with our numbers?

First let’s dispel a few myths:

  • HLL++ is pretty accurate.
    • Usually within ±1.6% of the actual count for users (95% confidence interval).
      • It’s less accurate for sessions, ±3.3%.
    • Yes, it can be frustrating to have numbers not exactly line up, but as far as issues that affect the accuracy of your analytics, it’s pretty low on the list compared to things like ad blockers, cookie limitations, bots, etc. Let’s face it — if you’re focusing on making that last 1-2% match up you are probably wasting your time.
  • HLL++ is not something new to GA4; it was used in UA as well.
    • It was used in UA for user counts dating back to 2017.
    • GA4 also uses it for more, including counting sessions.
    • HLL is also used in Adobe, though just for count distinct on dimension values and not for user counting like in GA.
  • HLL++ is not AI, machine learning, sampling, or modeling.
    • It’s a probabilistic counting method: an algorithm to quickly estimate cardinality using randomness.

So… what the heck is it?

Let’s start with that name. While “HyperLogLog++” does sound pretty intensely sci-fi, it’s actually kind of a silly name. Computer science has a long history of stupid names for cool things. “C++” was named as an improved version of C (“++” meaning incrementing by one). “C” was an improved version of the B programming language. Similarly “HyperLogLog++” is a double improved version of the LogLog algorithm. Or perhaps a triple improved version, since there was already a “super-LogLog”. The “LogLog” algorithm itself may be so named because the measure of how much memory the method uses is based upon a logarithm of a logarithm. If it was called “ReallyExcellent Multiplication” it wouldn’t seem as intimidating… though let’s be honest that’d be a pretty awful name.

What does it do?

It counts how many unique items there are in a list. For example, if we wanted to know how many different users we had this month with source=google, we couldn’t just sum all the visits per day, because some users had one visit and some had 10 over multiple days. We need to de-duplicate that list to count the number of distinct user IDs. HLL++ provides us a much more efficient way to do that without actually storing each user ID in memory to do a comparison.

If you’re familiar with SQL, what HLL++ does is provide a more efficient way to do a count of a “select distinct” without having to keep every distinct value in memory. If you’ve ever done a select distinct on a big dataset you know it can be pretty slow. Within GA4, this provides count estimates for sessions and users within the UI or Data API. This means that counts of these things in GA4 are not “actuals” and won’t match up exactly with things that are not estimated, like a direct count in BigQuery.

How does it work?

We’re not going to get at all deep into the algorithm, but at a very basic level what it does is create hashes of the data and examine how unlikely those hashes are. The more unlikely the hashes are, the more different items there probably are in that set.

Imagine someone told you that they were flipping coins and got heads six times in a row. You can either assume that they are very lucky, or that they had been doing a whole bunch of coin flipping in order to get that result. If multiple people tell you they got six consecutive heads, it becomes nearly assured that there’s a whole lot of coin flipping going on. In HLL, the coin flips would be the starting digits of the binary representation of the hash. So the assumption is that if you see multiple hashes that start with 000000, there must be quite a lot of unique items in your original data, since that is an unlikely outcome.
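To make the coin-flip idea a bit more concrete, here is a rough Python sketch. It is only an illustration of the intuition, not how GA4 or HLL++ actually hashes anything: the choice of SHA-256, the 32-bit width, and the helper names `leading_zero_bits` and `longest_zero_run` are all assumptions made for this example.

```python
import hashlib


def leading_zero_bits(value: str, width: int = 32) -> int:
    """Count the leading zero bits in the first `width` bits of a hash.

    SHA-256 and the 32-bit width are arbitrary choices for illustration.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    bits = int.from_bytes(digest[: width // 8], "big")
    return width - bits.bit_length()


def longest_zero_run(items) -> int:
    """The 'luckiest' streak of zeros seen across all hashed items."""
    return max(leading_zero_bits(item) for item in items)


# Hypothetical user IDs: a tiny set and a much larger one.
small = [f"user-{i}" for i in range(10)]
large = [f"user-{i}" for i in range(100_000)]

# Duplicates hash to the same value, so repeating items changes nothing.
print("10 users:          ", longest_zero_run(small))
print("10 users, repeated:", longest_zero_run(small * 5))
print("100,000 users:     ", longest_zero_run(large))
```

The larger set will generally show a longer streak of leading zeros, which is the signal the estimate is built on. A single streak is a very noisy measurement, which is why the real algorithm spreads the work across many "buckets" and averages them.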

An excellent, more detailed explanation of the algorithm can be found here. That blog post predates HLL++, but the “++” doesn’t change the algorithm conceptually; it just adds some optimizations.

The output of this estimation is called a “sketch” and is kind of like a compressed version of what’s in that set, suitable for comparison or merging with other sketches, though not actually containing any of the original data.

One of the interesting things about HLL is that the sketches are always the same size at whatever precision level is set. Whether you’re trying to get a count across a million rows or 10, that sketch size is the same. Kind of like how an MD5 hash is always 32 hexadecimal characters long no matter what the contents of the file you are hashing.
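The fixed-size and mergeable properties are easier to see in code. Below is a minimal sketch of plain HyperLogLog in Python. It is deliberately simplified and is not HLL++ as Google implements it: there is no sparse representation or bias correction, and the class name `TinyHLL`, the SHA-256 hash, and the made-up user IDs are assumptions for this example only.

```python
import hashlib
import math


class TinyHLL:
    """Simplified HyperLogLog (not HLL++), for illustration only."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision              # number of registers
        self.registers = bytearray(self.m)   # same size however much data is added

    def add(self, value: str) -> None:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # use 64 bits of the hash
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits are the coin flips
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the first 1 bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "TinyHLL") -> "TinyHLL":
        """Combine two sketches by keeping the larger value in each register."""
        assert self.p == other.p, "sketches must use the same precision"
        merged = TinyHLL(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)          # constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting) from the original HLL paper.
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping groups of hypothetical users; the true union is 100,000.
a, b = TinyHLL(), TinyHLL()
for i in range(60_000):
    a.add(f"user-{i}")
for i in range(40_000, 100_000):
    b.add(f"user-{i}")

print("register bytes:", len(a.registers), len(b.registers))
print("merged estimate:", round(a.merge(b).count()))
```

Merging keeps the larger value in each register, so the merged sketch is identical to one built from the combined data, and users that appear in both groups are not double counted.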

How accurate is HLL++?

It’s easy enough to plot accuracy against sketch size for each precision setting, and Google even provides this data.

Google’s figures show that as the precision setting is increased, the amount of error goes down and the amount of memory used goes up. It’s interesting to note that while the default setting in BigQuery is a precision of 15, the settings in GA4 are lower: 12 for sessions and 14 for users. This makes sense from Google’s perspective, as they have decided that an additional 0.5% of accuracy isn’t worth doubling the storage size from 16 KB to 32 KB… but it feels a little stingy. The precision of sessions in particular is lower than would be ideal. For cases like A/B testing, where this sort of error could change test outcomes, it would be nice to have both higher precision and options for the actual counts via the Data API.

This is all at a 95% confidence interval, so roughly 95% of the time we should see less than that % of error. The error distribution itself is roughly normal, such that the number of extreme outliers should be very low.
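The error figures quoted above can be roughly reproduced from the standard error given in the original HyperLogLog paper, about 1.04 / sqrt(m), where m = 2^precision is the number of registers. The short Python sketch below rests on two simplifying assumptions of mine: the 95% interval is treated as two standard errors, and each register is assumed to take one byte (which lines up with the 16 KB and 32 KB figures). The function name `hll_relative_error` is made up for this example.

```python
import math


def hll_relative_error(precision: int, sigmas: float = 2.0) -> float:
    """Approximate relative error at `sigmas` standard errors.

    Uses the 1.04 / sqrt(m) standard error of plain HyperLogLog,
    where m = 2 ** precision is the number of registers.
    """
    registers = 2 ** precision
    return sigmas * 1.04 / math.sqrt(registers)


# GA4 sessions (12), GA4 users (14), and the BigQuery default (15).
for p in (12, 14, 15):
    kib = 2 ** p / 1024  # assumes one byte per register
    print(f"precision {p}: ~{kib:.0f} KiB, ±{hll_relative_error(p):.2%} at ~95%")
```

Each extra step of precision doubles the memory but only shrinks the error by a factor of sqrt(2), which is the trade-off behind the lower settings in GA4.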

But that’s all in theory, right? Does the performance actually match this? In my testing it does. It’d be challenging to test this comprehensively, but it’s pretty trivial to do some basic validation. For example:

SELECT HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS total_user_count FROM `ga4-data.analytics_xxx.events_202308*`
(which would simulate what happens inside GA4), compared against
SELECT COUNT(DISTINCT user_pseudo_id) AS total_user_count FROM `ga4-data.analytics_xxx.events_202308*`
(which is an actual count), showed a 0.93% error for me in an example with millions of users. Running more spot tests never showed anything with error > 1.63%, though again I did not test this extensively.

All in all, HLL++ is an ingenious method to do this sort of counting. While it may be one more example of how the “real” numbers are becoming more abstracted in our analytics tools, it’s at least a very predictable way in which error is introduced.

Thanks Jim Kultgen for helping to clarify how HLL is used in Adobe.
