Improving Analytics Integrity

The data quality in web analytics reports has long been a point of concern for analysts. These days we’re dealing with Google Analytics spammers, bots, and other noise in our reports, but the idea that web analytics reports are “just not quite right” has long been considered the norm.

One of the reasons for this lack of quality is that the methods we use to report data to analytics services are pretty simplistic and easily spoofed. I’m a big fan of keeping it simple, but if our data collection methods are so easily subverted, then it seems like time to take a look at how we might fix that. We’re talking Google Analytics here, but the methods are similar for other client-side measurement protocols.

The core simplicity of the Google Analytics Measurement Protocol (GAMP) is great, and makes it very flexible and powerful — but the lack of verification on the requests made to it leaves us open to abuse. The same functionality that makes it easy to plug GA into whatever platform we want to measure also makes it easy to spoof.

What we're going to be talking about is a way to ensure that the GA Measurement Protocol part of the measurement transaction actually comes from the page we placed the tracking snippet on (i.e. that it is authorized) and that the contents of the transaction are what we intended and haven't been tampered with (i.e. that it has integrity).

Legit vs. Spoofed Request

A basic GAMP request might look like this:

https://www.google-analytics.com/collect?v=1&_v=j41&a=702618035&t=pageview&_s=1&dl=https://www.quantable.com/&ul=en-us&de=UTF-8&dt=Quantable - Analytics & Optimization&sd=24-bit&sr=1680x1050&vp=1442x464&je=0&_u=SCCAAUAjK~&jid=&cid=157092037.1441829013&tid=UA-34128028-1&z=823826407

Broken out into individual variables, that request contains the following parameters:

Google Analytics Measurement Protocol (GAMP) parameters (via ObservePoint Tag Debugger)

This is where it becomes obvious why it's so easy to muck up the data: just change some of those variables to whatever you want, and the data in your GA reports will change. The request doesn't even have to come from your site; it just needs to hit that collection URL.

There's nothing really stopping you* from changing the parameters on the fly and putting whatever you want into that account (* Google will block a lot of abuse, though really only the most blatant, and generally not immediately). Thus the ease and prevalence of GA spam.
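To make that concrete, here's a hypothetical sketch (in PHP, with a placeholder property ID and made-up values) of just how little it takes to fabricate a "pageview"; any script that can make an HTTP request will do:

  <?php
  // Forge a GAMP "pageview" for an arbitrary property ID and client ID.
  // Nothing here has to run on, or even visit, the site being "measured".
  $params = http_build_query(array(
      'v'   => 1,                            // protocol version
      'tid' => 'UA-XXXXXXX-1',               // any property ID you want to pollute
      'cid' => '555.12345',                  // any client ID
      't'   => 'pageview',
      'dl'  => 'http://example.com/fake-page',
      'dt'  => 'Totally Fabricated Pageview',
  ));
  // GA records whatever shows up at the collection endpoint.
  file_get_contents('https://www.google-analytics.com/collect?' . $params);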

But this isn't just about spam, it's about the integrity of the reporting in general; spam is just the most glaring current issue. What about competitors polluting each other's analytics with incorrect info, or scraped sites that include your original tracker code, or bots run amok?

The solution I want to explore in this experiment is relatively simple, but it does require changing how we view the data collection stage of our analytics protocols, and it is not applicable to all cases. In other words: it’s a proof-of-concept, not something I’m suggesting my clients start running.


If you've ever used a 3rd party API, you know that it usually requires an API key in order to validate that the requests made of that API are indeed from an authorized source. Well, why are there no API keys for our analytics API (aka the measurement protocol)?

The answer is that the data collection happens fully on the client-side, meaning any API key that was used would have to be viewable by the client’s browser. This means anyone hitting the site could trivially pick it off and use it, making it pretty pointless as a key.

Think of it this way: an API key that was in your GA tracking code on your site would prevent only the most basic of abuse where the abuser never even looked at your site. This would solve so-called “ghost” visit spam (as described in Mike Sullivan’s guide to spam here), but that’s it.

So let’s say we wanted to use a custom dimension as an “API key”; we’d add it to our tracker code and then add an include filter in GA so as to ignore anything that doesn’t have our custom dimension. The tracker code is simple enough:
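For example, it could be as little as this (a hypothetical sketch of a PHP template emitting the snippet; "dimension1" stands in for whichever custom dimension index you set up, and the analytics.js loader is omitted):

  <?php $apiKey = 'not-really-a-secret'; // defined in the theme/template ?>
  <script>
  ga('create', 'UA-XXXXXXX-1', 'auto');
  ga('set', 'dimension1', '<?php echo $apiKey; ?>'); // our "API key" custom dimension
  ga('send', 'pageview');
  </script>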

Client-side is, by its nature, not secret.

But as you can see, that custom dimension has to be right on the page (or in the GTM-injected tag), and anyone can pick it off, so it's not any sort of secret.


OK, so it's clear that a client-side-only solution can't provide the level of verification we want.

My solution is to bring a server-side GA proxy into the mix that verifies a request to GA is both authorized and untampered with:

Auth’d Analytics Data Flow

The two main additions to the process are the HMAC signature (hash-based message authentication code, I’ll explain that a little more shortly) and the fact that we take the Google Analytics measurement protocol request and send it to a server-side proxy rather than directly to GA.

I've simplified the steps from my earlier "Legitimate GA Call" flow diagram to keep things from getting too crazy, so let's get into the differences:

1st step: Server-side HMAC creation.

If you're not familiar with HMAC, it is a hashing scheme designed to do exactly what we're looking for: create a keyed hash of a request that can't be tampered with. To quote the HMAC Wikipedia entry, "it may be used to simultaneously verify both the data integrity and the authentication of a message." That is precisely why HMAC is widely used to sign requests to many REST-based APIs; we're not breaking new ground in API authentication here, this kind of method has been used for years.

Not to get too much into cryptography (and lose the rest of the 10% of my audience that is still reading this far in), but we could achieve most of what we want with a simple salted hash — however HMAC is a more robust solution to the problem and the additional server-side overhead is pretty trivial.
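For the curious, the difference on the PHP side is a single function call either way (a sketch with made-up inputs; the real pieces are defined in the next section):

  <?php
  $secret  = 'server-side-secret';
  // The pieces we want to protect, concatenated into one message.
  $message = $_SERVER['HTTP_USER_AGENT'] . $_SERVER['REQUEST_URI'] . time();

  // Naive "salted" hash: the key is simply concatenated onto the message.
  $salted = hash('sha256', $secret . $message);

  // HMAC: same inputs, but the key is folded in per the HMAC construction,
  // which avoids the pitfalls of plain concatenation (length extension, etc.).
  $hmac = hash_hmac('sha256', $message, $secret);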

I usually find a bit of code makes things clearer, so here’s what we’re doing on the server-side (with PHP):
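Something along these lines, as a minimal sketch rather than the exact production code: a snippet included by the page template computes the signature and hands it to the tracker code. The sendHitTask override is one documented analytics.js way to redirect hits (the actual wiring may differ), and the field names payload, sig, ts, and uri are just illustrative.

  <?php
  // Included by the page template when the HTML is generated.
  // The secret key never leaves the server; only the resulting signature does.
  $secret = 'replace-with-a-long-random-secret';

  $ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  $uri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/';
  $ts  = time();

  // Keyed hash over user-agent + requested URL + request time.
  $sig = hash_hmac('sha256', $ua . $uri . $ts, $secret);
  ?>
  <script>
  // The signature plus the un-hashed time and URI, exposed to the tracker code.
  var gaSig = {
    sig: '<?php echo $sig; ?>',
    ts:  <?php echo $ts; ?>,
    uri: '<?php echo addslashes($uri); ?>'
  };

  // Instead of letting analytics.js beacon straight to google-analytics.com,
  // post the serialized hit plus the signature fields to our local proxy.
  ga(function(tracker) {
    tracker.set('sendHitTask', function(model) {
      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/signedproxy.php', true);
      xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
      xhr.send('payload=' + encodeURIComponent(model.get('hitPayload')) +
               '&sig=' + gaSig.sig +
               '&ts=' + gaSig.ts +
               '&uri=' + encodeURIComponent(gaSig.uri));
    });
  });
  </script>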

We’re making a copy of the hit request that would normally get sent directly to GA and instead sending it to our local PHP proxy, including some additional data:

  1. the HMAC signature, a keyed hash based on the browser user-agent, the URL requested, and the time of the request.
  2. un-hashed copies of the time of request and URI (so we can re-create the HMAC in the next step when we need to verify it).

2nd Step: Client sends GAMP request to proxy, which checks the HMAC.

As you can see in step 1 — we make an AJAX request to post our data to our local proxy, “signedproxy.php”, which:

  1. Checks that the request time (used in the signature) is within 5 minutes of the proxy's current server time. This breaks the ability to replay old requests over and over in the future.
  2. Checks that the HMAC passed to it is valid by generating its own HMAC and comparing. This is why we pass the un-hashed copies of the request time and URI in step 1, so the proxy has all the pieces it needs to recreate the HMAC.
  3. Checks the request for any known spammer URLs
    (no need for GA filters to exclude spammers, it happens before it even gets to GA).
  4. Passes a custom dimension as an “API key” like we discussed earlier, but now it’s stored server-side so it’s actually secret.
  5. Returns the beacon GIF.

And the code for that:
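Again, a sketch rather than the exact production code: the POST field names match the step 1 sketch, and the custom dimension index and spam blocklist are placeholders.

  <?php
  // signedproxy.php: verifies the signature, then relays the hit to GA.
  $secret      = 'replace-with-a-long-random-secret';  // same key as in step 1
  $apiKey      = 'server-side-only-api-key';           // sent as a custom dimension
  $spamDomains = array('free-share-buttons.com', 'semalt.com'); // example blocklist

  $payload = isset($_POST['payload']) ? $_POST['payload'] : '';
  $sig     = isset($_POST['sig']) ? $_POST['sig'] : '';
  $ts      = isset($_POST['ts'])  ? (int) $_POST['ts'] : 0;
  $uri     = isset($_POST['uri']) ? $_POST['uri'] : '';
  $ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  // 1. Reject anything signed more than 5 minutes ago (or dated in the future).
  if (abs(time() - $ts) > 300) {
      header('HTTP/1.1 403 Forbidden');
      exit;
  }

  // 2. Re-create the HMAC from the un-hashed pieces and compare signatures.
  $expected = hash_hmac('sha256', $ua . $uri . $ts, $secret);
  if (!hash_equals($expected, $sig)) {
      header('HTTP/1.1 403 Forbidden');
      exit;
  }

  // 3. Drop hits referencing known spammer domains before they ever reach GA.
  foreach ($spamDomains as $domain) {
      if (stripos($payload, $domain) !== false) {
          header('HTTP/1.1 403 Forbidden');
          exit;
      }
  }

  // 4. Append the server-side "API key" custom dimension (index 2 assumed here)
  //    and relay the hit. GA now sees the server's IP rather than the visitor's,
  //    which is why the usual IP exclusion filters no longer apply.
  $payload .= '&cd2=' . urlencode($apiKey);
  $ch = curl_init('https://www.google-analytics.com/collect');
  curl_setopt($ch, CURLOPT_POST, true);
  curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
  curl_setopt($ch, CURLOPT_USERAGENT, $ua);            // preserve the real browser UA
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_exec($ch);
  curl_close($ch);

  // 5. Return a 1x1 transparent GIF, just like GA's own beacon.
  header('Content-Type: image/gif');
  echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');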

OK, so maybe I oversold it being "relatively simple". Creating a proxy like this is not a lot of coding, but it is surely way more complicated than just dropping in the tracking code, and it does have some other drawbacks.

What this does provide is confidence that what you see in GA is definitely the result of a unique request from a browser hitting your site.

It doesn't mean the hits can't come from bots, but the bots must have rendered the page to get the needed HMAC signature.

It also means the hits in GA are not the result of old requests. Not only does this break the malicious replaying of old requests, it also means that old client-side requests triggered by a page refresh outside the 5-minute window won't be counted. For example, I have frequently seen sessions in GA that are simply the result of users who had tabs open on their iPhone; when they come back to Safari it refreshes (but does not reload) the page. This would stop that annoyance as well, although some may see that as undesirable data loss rather than annoyance.

One drawback with my code here is that your pages can't be server-side cached. So if you use either a WordPress caching plugin (like W3 Total Cache) or a proxy-caching service (like CloudFlare), this would not work as drawn up. It should not be that much more work to add another step where an AJAX request calls the server-side on each page load to get the HMAC for each page request (sketched below), but I feel like this is complex enough already.
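If you did want to go that route, the extra step could be as small as a hypothetical endpoint like this (name and parameters are illustrative), called via AJAX on page load:

  <?php
  // getsig.php (hypothetical): hands a cached page a fresh signature on load.
  $secret = 'replace-with-a-long-random-secret';
  $ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  $uri = isset($_GET['uri']) ? $_GET['uri'] : '/';
  $ts  = time();

  header('Content-Type: application/json');
  echo json_encode(array(
      'sig' => hash_hmac('sha256', $ua . $uri . $ts, $secret),
      'ts'  => $ts,
      'uri' => $uri,
  ));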


But does it work? I've been running this in parallel with regular GA (delivered via GTM) on this very site for a few weeks, and the numbers match up pretty well (over 90%). We can take a look at who didn't match up exactly by exporting a day out of the User Explorer.


My blog doesn't get big traffic, so looking at one day we're only talking about 36 client IDs (regular) vs. 38 client IDs (auth'd). The diff (excluding spam) was:

Only on the "regular" side:
  1338227961.1468903925 – Return visitor, client-side cached.
Only on the auth'd side:
  157092037.1441829013 – That's me. GA filtered on the "regular" side but not the auth'd side (because IPs are automatically anonymized on the auth'd side, so the normal IP exclusion filters don't work).
  170843534.1469514852 – Reason unknown; partial GA blocker?

Here are the outcomes:

  1. The authenticated system excludes all the spam without any additional filters in GA.
  2. The auth'd system misses some sessions that are refreshes of previously open pages (mostly on mobile).
  3. Once you filter out the spam, there are a few more users in the authenticated system than in the standard system.
    There appear to be a couple of things going on here:
    a) bots that would courteously skip firing GA, but that don't recognize our proxy as GA (Googlebot, Bingbot, etc.), so they get recorded. If you turn on bot filtering in your GA view this will mostly go away.
    b) a few new mystery users – these may be users that did not block the loading of GA but do block the GAMP calls (which would break the regular client-side reporting but not our hybrid setup here). This number was very low, only a couple of percentage points.

It is also important to note (since otherwise it might be a violation of the GA terms of service) that this method still respects users who are opting out of GA by using the official opt-out browser add-on (or any blocker that blocks the loading of analytics.js, for that matter).

So in the end what does this get us? It seems like a lot of work for something that mostly only excludes spam that we can easily exclude in other ways. That is certainly true, but I also think this type of hybrid method can eventually lead down the path to more confidence in the quality of our reporting.

 


Thanks to Tim Wilson for editing help.