Google Analytics Sampling

If you’re dealing with a high-traffic site or running reports in Google Analytics over long periods of time, you may have run into report sampling. Y’know, the yellow box:


To me there are three obvious questions when you see it:
1. Why is it happening?
2. What can I do about it?
3. Is it bad?

#1 Why??

The why is well defined, if a little detailed. The Google documentation is here, but basically what it’s saying is that any report built on the fly can’t be based on more than 250k visits. A standard report without a segment applied, like Audience Overview, isn’t built on the fly, so you won’t run into sampling no matter how many visits are in it. Likewise you can slice & dice a custom report all day and not hit sampling if it’s < 250k visits. Usually the reason you’re seeing sampling is that you’re running something custom over a long time period.

#2 How can I get rid of it??

a. First and easiest is to drag the precision slider (the square grid icon above the yellow box) all the way to “highest precision”. You probably already knew that. This makes the report take a little longer to run, but ups that 250k limit to 500k visits. So if the “report is based on x visits (y% of visits)” message has a y% that’s > 50%, you’re good to “fix” it this way.

b. Use a standard report if you can. For example, if you’re using a segment like “visits with a goal completion” to get the number of goals when you could have used the standard Conversions > Goals > Overview report, then you’re introducing sampling when you don’t need to. Basically, these boxes:


Just aren’t your friends when it comes to triggering sampling.

c. Create a new profile / view with your segment in it as a filter. Say you are always segmenting out Google Organic traffic to see stats on just that traffic, but you’re hitting sampling too often. Create a profile with a filter for just that traffic and then all your standard reports have that segment already set. This is a great solution but don’t use it too much or you’ll end up with 50 profiles and then managing them will be a royal pain.

d. Just look at a shorter time period. If you’re trying to run the report on 4 months but you’re hitting sampling, then maybe run it one month at a time? Just like above, you can tell roughly how many time periods you have to cut it up into by the % of visits in your current sample. So if it’s sampling at ~25% of visits or more for your 4 month report, then doing 4 reports, one per month, and adding up the stats you need yourself will work.
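The arithmetic behind options a and d can be sketched in a few lines of Python. This is only a rough estimate: it assumes the 250k and 500k limits quoted above, treats them as inclusive, and assumes traffic is spread evenly over the date range. The function names are made up for illustration.

```python
import math

DEFAULT_LIMIT = 250_000            # visit limit quoted above for on-the-fly reports
HIGHEST_PRECISION_LIMIT = 500_000  # limit quoted above with the slider at highest precision


def estimated_total_visits(sampled_visits, sample_pct):
    """Back out total visits from "report is based on x visits (y% of visits)"."""
    return sampled_visits * 100 / sample_pct


def highest_precision_is_enough(sampled_visits, sample_pct):
    """True if the whole date range should fit under the highest-precision limit."""
    return estimated_total_visits(sampled_visits, sample_pct) <= HIGHEST_PRECISION_LIMIT


def periods_needed(sampled_visits, sample_pct, limit=DEFAULT_LIMIT):
    """Number of equal date-range chunks needed so each chunk fits under the limit.

    Assumes visits are spread evenly over time, which real traffic rarely is.
    """
    return math.ceil(estimated_total_visits(sampled_visits, sample_pct) / limit)


# A 4-month report based on 250,000 visits (25% of visits):
print(estimated_total_visits(250_000, 25))       # about 1,000,000 visits in total
print(highest_precision_is_enough(250_000, 25))  # False: 1M is over the 500k limit
print(periods_needed(250_000, 25))               # 4, so one report per month
```

If your traffic is lumpy (a big launch month, say), the busiest chunk may still trip the limit, so treat the chunk count as a starting point rather than a guarantee.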

If none of those work, there are more involved solutions as well. This is a great article from Luna Metrics for more detail.

#3 So is sampling really messing up my report or not??

If you can’t fix it, now’s the time to think about what the sampling really means. You know it’s taking a time-adjusted random group of visitors based on all of your visitors. Other than the part where it has to figure out which random visitors to pull into that sample, that seems pretty simple. But is it messing up the conclusions you make from that report? There are two main things to consider here:

a. What % of visits is it getting?
This is the obvious one. If it’s getting 90% of your visits then you’re probably good unless you really need an exact number, and if it’s getting 1% then there are some problems. But it’s important to also consider:

b. What’s the frequency and distribution of the event I’m trying to measure?
This is very key and the main reason I’m writing this article. First consider you’re trying to measure your percentage of mobile users. That’s probably something that’s very consistent across any random sample of users. Every user is going to either be mobile or not, so looking at 1,000 random users on one day (broken out across all hours) is going to give you very close to the same breakdown as looking at 100,000. This is where sampling works great. You don’t have to check every single visit because there’s just not that much variation and from a business perspective you’re probably not paying anyone anything differently if you actually had 22.4% instead of 22.5% like your sampled report says.
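To put a rough number on that, here’s a small Python sketch using the textbook margin of error for a proportion. It assumes the sample behaves like a simple random sample of visits, which is a simplification of whatever GA actually does, and the 22.5% mobile share is just the example figure from above.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a share measured on a random sample."""
    return z * math.sqrt(share * (1 - share) / sample_size)


mobile_share = 0.225  # example figure from the text, not real data

for sample_size in (1_000, 100_000):
    moe = margin_of_error(mobile_share, sample_size)
    print(f"{sample_size:>7,} visits: {mobile_share:.1%} +/- {moe:.2%}")
```

Under those assumptions the 1,000-visit sample lands within a few percentage points of the true share, and the 100,000-visit sample within a fraction of a point.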

But what if you’re measuring something that doesn’t happen that often? Say in this case you’re measuring signups to your email newsletter, but you don’t put that everywhere on your site and maybe you get a ton of visitors but not too many actually want your emails.

If that signup rate is really low, say like 0.1% of your total visits, then even if you are looking at a pretty decent % of your overall visits in the sample you could still be missing a lot of signups in your report, and then it looks way off because it doesn’t match the actual number which you have from your email list manager.
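Here’s a rough Python sketch of why rare events get hit harder. It assumes a simple random sample of visits (a simplification of what GA does), and the traffic numbers are hypothetical: 1,000,000 visits with the 0.1% signup rate from the example above.

```python
import math


def relative_error(event_rate, sample_size, z=1.96):
    """Approximate 95% margin of error as a fraction of the true value.

    Uses the binomial approximation and ignores the finite-population
    correction, so it somewhat overstates the error when the sample is a
    large share of all visits.
    """
    expected_events = event_rate * sample_size
    return z * math.sqrt((1 - event_rate) / expected_events)


total_visits = 1_000_000   # hypothetical traffic
signup_rate = 0.001        # 0.1% of visits sign up

for sample_fraction in (0.25, 0.01):
    sample_size = round(total_visits * sample_fraction)
    seen = signup_rate * sample_size
    err = relative_error(signup_rate, sample_size)
    print(f"{sample_fraction:.0%} sample: ~{seen:.0f} signups in the sample, "
          f"scaled-up total within about +/-{err:.0%}")
```

For comparison, a common trait like a 22.5% mobile share measured on the same 250,000-visit sample, `relative_error(0.225, 250_000)`, comes out well under 1%.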

So with very infrequent events or highly bursty events you can end up with some pretty inaccurate results. What we’ve found is that this is especially problematic because many of those highly infrequent events are the things that people care about most. In other words, if you’re bothering to look at all at something that only happens 1 in 1,000 times, you probably care quite a bit about it! Those kinds of events are usually the checkouts, signups & conversions that are the most important events on our sites!

Google says this, “In particular, GA’s approach to sampling works very well for fast, top N queries and other queries that have a relatively broad, uniform distribution across visits. Session sampling can be less accurate for ‘needle in a haystack’ problems, such as single keyword analysis and longtail analysis, or cases with narrow dimension filtering, such as heavily filtered views or conversion analysis where conversions constitute a small fraction of visits.” That’s totally accurate, but for me it really undersells the issues that can come up and how important it is to understand when sampling doesn’t work.
