If you’re running a GA account with anything much beyond the defaults, you probably have some kind of traffic exclusion filters on — I know I usually do! Do you have an unfiltered view with no filters at all? If so, do you ever look in there? And when you do find some kind of automated traffic, do you update those exclusion filters and just ignore that traffic?
Filtering out the “junk” in GA is pretty standard practice, but what is that junk anyway? In general we can lump it all together as “bot traffic”, but as I’ve written before there are many different kinds of bots out there, and as they’ve gotten smarter more of them have shown up in GA.
So there are lots of different kinds of “junk”, and I’m here today to try to convince you not to ignore potentially useful data by filtering too aggressively. Bot filtering inside of GA is just hiding the problem. The bots are still on your site and you’ve done nothing to actually block them; you’ve just set yourself up to forget about them since you don’t see them.
If you’ve ever dropped in a filter for Boardman, Oregon or Ashburn, Virginia (bot-town U.S.A.), you were thinking about cleaning up your account… and that’s a good thought, but we’ve got to remember that just because we put in a GA filter doesn’t mean the traffic has gone anywhere! I realize this is perhaps odd advice coming from someone who has run an automated GA filter service for 4 years, but the nature of bot traffic has shifted quite a bit since 2015. I haven’t installed the full set of spam filters from my service on any new view in over a year.
What we don’t want to filter out is traffic that calls for action to be taken! If there’s a content-stealing scraper hitting your site, and you see the traffic it creates in GA, but your response is to filter it out of GA and take no other action — you’re not listening to what your reports are telling you. But beyond scrapers and other sundry miscreants, what we should mostly be concerned about is ad fraud.
To be clear, I’m not talking about spam filters here — GA spam exists largely to advertise these spam sites and just muck up your analytics in general. This is pure data pollution with no real analytical value. If you have a pollution problem you should feel free to filter away — but keep in mind that your spam filter might be catching other kinds of bot traffic as well. So if you do add filters, make sure to do it smartly! I suggest reading Carlos Escalera Alonso’s excellent and consistently updated guide to the topic.
Ad fraud traffic is much different. The most important difference is that you could be paying for that traffic! In a recent eye-opening episode of the Digital Analytics Power Hour on ad fraud featuring Dr. Augustine Fou they talked about uncovering this fraud, and Dr. Fou brought up reading your own analytics reports with an eye towards fraudulent patterns as one of the best lines of defense.
If your site advertises, you have ad fraud. How much? If you run programmatic display ads it could be a very significant percentage (like at least 20%, we’re not talking single digits here), even if you are running anti-fraud services on your inventory. But I wouldn’t fully trust any specific numbers in reports on ad fraud, and even if someone could state a specific number for industry-wide ad fraud, that doesn’t mean it would hold for your campaigns. If some “trusted” source publishes a number — like, for example, I heard Quantable.com said “at least 20%” (citation: two sentences ago) — that is a guess, no matter how slick the associated dataviz is. What are the error bars on that number? We don’t know, and if there isn’t even enough confidence to say what the uncertainty level on a guess is, that’s a pretty big guess.
Numbers on fraud levels should be taken as an absolute minimum, since they largely represent only the fraud that got caught. There are plenty of ad fraud detection systems, and you should probably use them, but all are limited and in general are fighting the last fight while the fraudsters move on to their next tactic. Am I saying that we can create reports in GA that can ID the fraud better than these detection systems? Definitely not, but since you know your site best, you’ll know what sticks out in reports. Here are some tell-tale signs to look for in a traffic segment:
- Bounce rate 0% or 100%.
- Unnaturally coordinated traffic (e.g. all from the same place or at the same time in a way that you wouldn’t expect).
- Traffic that doesn’t ever submit any forms.
- Traffic that doesn’t ever convert.
- Traffic that *always* converts.
- Traffic with clearly forged internal referrals (hitting a bunch of different pages without the logical antecedent URLs, can be hard to detect in GA though obvious from raw logs).
There are plenty more possibilities, but the main idea is that when you see something unnatural that you can’t explain, the goal should be to stop it and explain it first, not just hide it from reporting.
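That last sign, forged internal referrals, is easier to spot in raw server logs than in GA. Here’s a minimal sketch of the idea in Python, assuming a simplified, pre-parsed log of `(client_ip, requested_path, referer_path)` tuples — the log format and field names are hypothetical, not from any particular server or tool:

```python
# Flag hits whose internal referer page was never actually requested
# by the same client first — a tell-tale sign of forged referrals.
# Assumes hits arrive in chronological order.

def find_forged_referrals(hits):
    """Return hits claiming an internal referer the client never visited."""
    seen = {}      # client_ip -> set of paths that client has requested
    forged = []
    for ip, path, referer in hits:
        if referer and referer.startswith("/"):  # internal referral
            if referer not in seen.get(ip, set()):
                forged.append((ip, path, referer))
        seen.setdefault(ip, set()).add(path)
    return forged

hits = [
    ("10.0.0.1", "/", None),
    ("10.0.0.1", "/pricing", "/"),          # normal: visited / first
    ("10.0.0.2", "/pricing", "/features"),  # forged: never loaded /features
]
print(find_forged_referrals(hits))  # [('10.0.0.2', '/pricing', '/features')]
```

A real version would have to account for multiple devices behind one IP, cached pages, and sessions that span log rotations, but even this crude check surfaces traffic worth investigating rather than filtering.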
So when should you filter? Well, it depends on how often you want to check in on your unfiltered GA data and what type of bot problem you have. As a reminder, it really is important to have an “unfiltered” Google Analytics view. This is the recommendation that accompanies any good article or service that talks about filtering, and yet I rarely see this unfiltered view set up. And if the raw unfiltered view is set up, it’s never looked at. Well, at least I know I hardly ever do; maybe you are more diligent.
An unfiltered view isn’t just a backup view in case you mess up your filters (though that can certainly be very helpful!), it’s also a way to monitor everything being excluded. If you want to get creative, you could even set up a view that contains ONLY the things being filtered.
So if we’re not filtering, what should we do? I suggest using segments, and offer my “bot-ish” segment in the GA solutions gallery. I’ve been adding to this segment for a while now, but there’s always new bot patterns out there so if you see something I should add, let me know!
Bot-ish Segment Dimensions
My “bot-ish” segment is nothing fancy, just a series of the same kind of dimension checks you might normally see in bot filters.
| Dimension | Values to match | Notes |
|---|---|---|
| Service Provider | List of server-farm ISPs, e.g. “digitalocean llc” or “facebook ireland ltd” | Determined by a whois lookup of the IP. |
| Network Domain | List of server domains, e.g. “amazonaws.com” | Determined by a reverse DNS lookup of the IP; may or may not match up with the service provider. |
| Screen Resolution | Either 0x0 or some bogus non-numeric info | Some poorly written bots don’t set this. |
| Browser Size | Either 0x0 or some bogus non-numeric info | Again, some poorly written bots don’t set this. |
| Browser | (not set) or Mozilla | Things like “Mozilla Compatible Agent” tend to be bots, but not always. |
| Hostname | (not set) | Usually only (not set) on ghost spam, but can still be helpful. |
The “service provider” and “network domain” checks are doing most of the work here. We will catch some lazy bot writers with the other dimensions too, but the majority of matches will likely be based on the IP address. You can do the same kind of lookup that Google does to see info about an IP with the whois or host commands (or dig -x for reverse DNS rather than host; I am a fan of the dig command). This is not any sort of proprietary Google info, it’s just public internet registration info — and as you can imagine, for that reason it is not always super accurate.
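If you want to script the same kind of check, here’s a rough sketch: reverse-DNS an IP (what `dig -x` or `host` would show you), then suffix-match the result against a list of hosting domains. The list below is deliberately tiny and illustrative, nowhere near complete:

```python
import socket

# A few server-farm domains for illustration — far from a complete list.
SERVER_FARM_DOMAINS = {"amazonaws.com", "googleusercontent.com",
                       "digitalocean.com"}

def looks_like_server_farm(network_domain):
    """True if the reverse-DNS domain ends in a known hosting domain."""
    d = network_domain.lower().rstrip(".")
    return any(d == farm or d.endswith("." + farm)
               for farm in SERVER_FARM_DOMAINS)

def network_domain(ip):
    """Reverse DNS lookup for an IP; (not set) if there's no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return "(not set)"

print(looks_like_server_farm("ec2-3-85-1-2.compute-1.amazonaws.com"))  # True
print(looks_like_server_farm("c-73-0-0-1.hsd1.ga.comcast.net"))        # False
```

Note that an empty or missing PTR record is itself mildly suspicious for “user” traffic, since most consumer ISPs do set one.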
Here’s where we need to talk about how weak this kind of bot detection inside of GA really is. We’re dealing with derived information (dimensions like “browser” instead of the full HTTP user agent, or the service provider dimension instead of the user’s IP), and we can’t give this traffic a score based upon multiple factors. A more sophisticated bot detection system would consider multiple factors together and would also base its decisions on real usage data. So we’re never going to get much beyond ID-ing the most obvious bots, and even then not without a pretty high level of false positives.
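To make the “score based upon multiple factors” idea concrete, here’s a toy sketch of the kind of weighted scoring a real detection system can do and a GA segment can’t: no single weak signal labels a session, only the combination does. The signal names, weights, and threshold are all made up for illustration:

```python
# Toy multi-signal bot score: each weak signal contributes a weight,
# and only the combined score crosses the "bot-ish" line.
# All weights and the threshold here are invented for illustration.

SIGNALS = {
    "server_farm_isp":   0.5,  # Service Provider / Network Domain match
    "no_screen_res":     0.2,  # Screen Resolution 0x0 or bogus
    "generic_browser":   0.2,  # "Mozilla Compatible Agent" etc.
    "zero_second_visit": 0.3,
}

def bot_score(session):
    """Sum the weights of whichever signals a session triggers."""
    return sum(w for sig, w in SIGNALS.items() if session.get(sig))

def is_bot_ish(session, threshold=0.6):
    return bot_score(session) >= threshold

vpn_user = {"server_farm_isp": True}   # one signal alone: below threshold
likely_bot = {"server_farm_isp": True, "no_screen_res": True}
print(is_bot_ish(vpn_user))    # False
print(is_bot_ish(likely_bot))  # True: 0.5 + 0.2 crosses 0.6
```

The point of the threshold is exactly the false-positive problem: a real user on a VPN trips the server-farm signal too, and a single match shouldn’t be enough to write them off.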
If you’d like to see how many false positives your segment might have, one good way is to look at conversions. While certainly some fraud bots can convert, it is unusual, and especially unusual if a payment is required.
Examples of false positives:
- Users on VPNs routing through cloud servers.
- Corporate users, for example a real user at Microsoft that gets grouped in with the Microsoft Azure-based bots.
- Users with unusual user agents that GA doesn’t recognize, and so get ID’d as a “Mozilla Compatible Agent”.
- Errors in whois or reverse DNS that place an IP into a block it shouldn’t be in.
Examples of false negatives:
- Bots using some other cloud hosting provider I’ve not listed (I’ve only listed a few big cloud providers, not even a full list of the major ones).
- Bots running off of compromised devices on consumer IPs (your great aunt’s compromised Windows XP machine that is part of a botnet).
- Any number of bots smart enough to just not be obviously bots.
There’s plenty of bots (especially ad fraud bots) that are so smart as to be effectively indistinguishable from real users.
So where’s the good news here? What do we do? Beyond this being a good market for bot and ad fraud detection vendors, the good news is that the same kind of problem-solving approach we’ve always applied to “weird” traffic is still the right one. Figuring out “why” has always been the analyst’s top skill, and while the technologies are ever changing and evolving, our analytical problem-solving skills remain our best tool.