Supporting Do Not Track in Google Analytics

I’ve long been a proponent of the “Do Not Track” (DNT) browser setting, including writing an article a few years ago on how to track how many users have DNT turned on. DNT is a simple signal sent in the header of all HTTP requests indicating that a user doesn’t want to be tracked. No browser plugins, no consent pop-ups, just a single flag turned off or on. The user indicates their preference, but it’s up to the site to determine what it should do to comply with that preference. The lack of a clear standard for what complying with a DNT request means, together with there being no real downside to non-compliance, has led to very low levels of implementation. However, especially considering how user-unfriendly many GDPR consent mechanisms are, I think DNT support is still worth our time.

Google Analytics doesn’t support DNT directly. There are plenty of instructions out there on how to respect the DNT flag in Google Analytics by simply not firing GA at all when DNT is on. This is a reasonable solution if you’re willing to simply not report anything for those users (roughly 10%, but your site may vary).
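As a rough illustration of the "don't fire GA at all" approach, here is a minimal sketch of a DNT check. The function name `isDntEnabled` is hypothetical, and the set of properties and values checked is an assumption based on how browsers have historically exposed the flag.

```javascript
// Returns true when the browser reports that Do Not Track is enabled.
// `nav` and `win` are passed in so the check can also run outside a browser.
function isDntEnabled(nav, win) {
  nav = nav || {};
  win = win || {};
  // Browsers have exposed the flag in different places and with different values.
  var value = nav.doNotTrack || win.doNotTrack || nav.msDoNotTrack;
  return value === '1' || value === 'yes';
}

// Hypothetical usage on a page: skip Google Analytics entirely for DNT users.
// if (!isDntEnabled(navigator, window)) { /* load and fire GA here */ }
```

Values such as '0' or 'unspecified' are treated as "DNT off" here, so GA would still fire for those users.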

However, for many sites, completely flushing 10% of data from the primary system used to measure site activity is a non-starter. For analytics services where working with only a sample of the data makes sense, I definitely agree with not firing the tracker at all (for example, HotJar won’t capture sessions for DNT users), but GA is more problematic because it is generally used to measure overall traffic in aggregate.

As a basic compromise you could toggle advertising features per request as well as turn on IP anonymization based on the DNT flag (or a GDPR consent status). This is a good idea, but beyond those two items there’s still a lot of potentially identifiable info that could show up in GA.
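A minimal sketch of that basic compromise might look like the following. It assumes the analytics.js fields `allowAdFeatures` and `anonymizeIp`; the helper name `gaFieldsForDnt` is hypothetical, and the field names are worth checking against the current field reference before relying on this.

```javascript
// Builds analytics.js field values from the DNT state.
// Assumption: allowAdFeatures and anonymizeIp are the relevant field names.
function gaFieldsForDnt(dntEnabled) {
  return {
    allowAdFeatures: !dntEnabled, // no advertising features for DNT users
    anonymizeIp: dntEnabled       // mask the IP for DNT users
  };
}

// Hypothetical on-page usage with analytics.js:
// var fields = gaFieldsForDnt(navigator.doNotTrack === '1');
// ga('set', 'allowAdFeatures', fields.allowAdFeatures);
// ga('set', 'anonymizeIp', fields.anonymizeIp);
```

The same function could take a GDPR consent status instead of the DNT flag, with the boolean inverted as needed.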

I wanted to come up with something that respects the DNT flag, but still allows site owners to know that a user is on their site and using it. This would give much more limited information about that DNT user than a regular user, but still give something. This is probably not full compliance with stricter interpretations of DNT like the EFF DNT policy, but it’s about the closest you could come and still have session data show up in GA. Again, if you want 100% compliance then not firing the tag makes the most sense; this is a compromise solution.

I was working with Simo Ahava’s customTask method for removing PII from GA hits and realized this same method could be used to remove/redact parts of GA hits that might be used to identify a user.

What data we’re scrubbing:

| Data | Measurement Protocol Variable(s) | Data Removal Effect(s) |
|---|---|---|
| user IP | uip | geo, network domain |
| user agent | ua | browser, OS, mobile |
| additional browser info: screen resolution, viewport size, screen depth, java status, flash status | sr, vp, sd, je, fl | more browser info |
| other Google property join ids | jid, gjid, a, gclid | no doubleclick or adsense cookie joining (thus no demographics data), no gclids (thus no adwords data linkage) |
| custom dimensions/metrics | cd1-200, cm1-200 | no storage of anything custom, including things that might be privacy-impacting |
| individual page referral data | dr | all incoming referrals will just be domain, not full page info (as if every external domain has the meta referrer origin tag on its site) |
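The referrer row above can be sketched as a small helper that trims the `dr` value down to scheme and host. The function name `referrerToOrigin` is hypothetical, and returning an empty string for non-http(s) values is an assumption.

```javascript
// Reduces a full referrer URL to scheme + host, dropping path, query and fragment.
function referrerToOrigin(referrer) {
  var match = /^(https?:\/\/[^\/?#]+)/i.exec(referrer || '');
  // Return an empty string when the value is not an http(s) URL.
  return match ? match[1] + '/' : '';
}

// Example: a search results URL loses its path and query string.
// referrerToOrigin('https://www.example.com/search?q=private+term')
//   returns 'https://www.example.com/'
```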

This removes much of what could possibly be used to get an idea who an individual might be, but it leaves a general idea of what marketing campaign they may have come in on and then what they did on the site.

You’d still know a user came from Google Organic and filled out a contact form, but you wouldn’t know where they were in the world or what kind of computer they had. Additionally Google would be prevented from tying their cookie to any advertising cookies for that user, which is also important to try and maintain that user’s anonymity.

Another example of the compromise here is replacing gclid with utm_source=google, utm_medium=cpc. This severs the linkage between GA and AdWords, so keyword or cost data cannot be associated with that particular click (nor can whatever other data matching might happen on the AdWords side), but the user still shows as coming from AdWords.
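A minimal sketch of that substitution, working on a plain object of hit parameters, could look like this. It assumes `cs` and `cm` are the Measurement Protocol parameters for campaign source and campaign medium; the function name `replaceGclid` is hypothetical.

```javascript
// Replaces an AdWords click id with generic campaign values.
// `params` is a plain object of Measurement Protocol parameters.
function replaceGclid(params) {
  var out = {};
  for (var key in params) {
    // Copy everything except the click id itself.
    if (Object.prototype.hasOwnProperty.call(params, key) && key !== 'gclid') {
      out[key] = params[key];
    }
  }
  // Only hits that carried a gclid are re-labelled as google / cpc.
  if (Object.prototype.hasOwnProperty.call(params, 'gclid')) {
    out.cs = 'google';
    out.cm = 'cpc';
  }
  return out;
}
```

Note that the click id may also appear inside the page URL sent with the hit, which this sketch does not touch.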

Here’s the code:
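The following is only a minimal sketch of such a customTask, not a definitive implementation. It assumes the hit payload is available as a URL-encoded string via `hitPayload` and that `sendHitTask` can be wrapped, as in Simo’s method. The names `DNT_OVERRIDES`, `DNT_REMOVE`, `scrubPayload` and `dntCustomTask` are hypothetical, the `0.0.0.0` placeholder IP is an assumption, and the gclid-to-campaign substitution described earlier is left out for brevity.

```javascript
// Placeholder values written into scrubbed hits. 'DNT' and '1x1' mirror the fake
// browser and screen resolution mentioned in this post; the IP is an assumption.
var DNT_OVERRIDES = { uip: '0.0.0.0', ua: 'DNT', sr: '1x1' };
// Parameters dropped outright, following the table of scrubbed data.
var DNT_REMOVE = /^(vp|sd|je|fl|jid|gjid|a|gclid|cd\d+|cm\d+)$/;

// Scrubs a URL-encoded Measurement Protocol payload string.
function scrubPayload(payload) {
  var kept = [];
  var pairs = payload.split('&');
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf('=');
    var key = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
    // Skip removed parameters and any parameter that is overridden below.
    if (!key || DNT_REMOVE.test(key) || DNT_OVERRIDES.hasOwnProperty(key)) {
      continue;
    }
    if (key === 'dr') {
      // Keep only scheme + host of the referrer.
      var ref = decodeURIComponent(pairs[i].slice(eq + 1));
      var m = /^(https?:\/\/[^\/?#]+)/i.exec(ref);
      if (m) {
        kept.push('dr=' + encodeURIComponent(m[1] + '/'));
      }
      continue;
    }
    kept.push(pairs[i]);
  }
  // Append the fake values last.
  for (var name in DNT_OVERRIDES) {
    if (DNT_OVERRIDES.hasOwnProperty(name)) {
      kept.push(name + '=' + encodeURIComponent(DNT_OVERRIDES[name]));
    }
  }
  return kept.join('&');
}

// customTask: wraps sendHitTask so the payload is scrubbed just before sending.
function dntCustomTask(model) {
  var originalSendHitTask = model.get('sendHitTask');
  model.set('sendHitTask', function (sendModel) {
    sendModel.set('hitPayload', scrubPayload(sendModel.get('hitPayload')), true);
    originalSendHitTask(sendModel);
  });
}

// In a GTM Custom JavaScript variable this would be wrapped roughly as:
// function() { return navigator.doNotTrack === '1' ? dntCustomTask : undefined; }
```

Keeping `scrubPayload` as a pure string-to-string function makes it easy to check against a captured request before wiring it into a tag.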

 

And it is implemented just like Simo’s example mentioned above:
1. Create a Custom JavaScript variable with the code above, and call it “DNT Custom Task” or whatever you’d like.

2. Add that variable to the customTask field in your GA variable configuration:

A couple notes:

  • This solution currently requires GTM. I haven’t experimented with what the on-page or plugin-based code would be yet.
  • It’s important that this task be on every Universal Analytics tag that you have. The easiest way to do this is to add it to the GA Settings variable as shown in my screenshot above. Otherwise the tags that are missing it will send the regular data to GA, which may still be attached to the session from the hits you have masked, making this whole thing kind of pointless.
  • If you already have an existing customTask you will have to merge the two, because you can’t have two customTask fields. Simo’s got an answer for this issue too!
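One generic way to merge customTasks (a sketch only, and not necessarily the approach Simo describes) is to wrap them in a single function that calls each in turn. The name `mergeCustomTasks` is hypothetical.

```javascript
// Combines several customTask functions into one that runs them in order.
// Each task receives the same model object, so later tasks see earlier changes.
function mergeCustomTasks(tasks) {
  return function (model) {
    for (var i = 0; i < tasks.length; i++) {
      // Ignore entries that are not functions (for example an unset variable).
      if (typeof tasks[i] === 'function') {
        tasks[i](model);
      }
    }
  };
}

// Hypothetical usage in a GTM Custom JavaScript variable:
// function() { return mergeCustomTasks([existingTask, dntTask]); }
```

Order matters if more than one task wraps sendHitTask: the task that runs last wraps the others, so its changes are applied to the payload first.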

Ok, so if we implement this, what actually happens? What does the data look like?

Let’s take a look at a typical request. The measurement protocol parameters plus the HTTP data available in the request itself (like the IP and User Agent) are all the data Google receives from GA.

The original request is on the left, and the heavily scrubbed version with DNT enabled is on the right. All the items in the little blue boxes have either been removed completely or had fake data stuffed into the field. (Those particularly concerned about Google seeing IP addresses should note that because the measurement protocol collection requests hit Google directly, Google still sees the IPs and user agents in those requests. We are just telling them what to store with the override parameters (ua, uip). This is still much better than GA’s IP anonymization feature, which simply removes the last part of the IP address and still allows most geo and network-based identification to work.)

Obviously this method is not perfect. There are many other ways someone could still be identified, including cases where the user self-identifies. There could also be some less commonly used measurement protocol parameters I’ve missed (please let me know if so).

I’ve only been using this method on my site for a few days, but so far I like the results and feel it’s a good compromise for people looking to respect DNT or have a clearly consent-less GA implementation.

So far about 15% of my readers have DNT on and are therefore subject to this scrubbing. If you want to see those users, use any of the fake parameters we set as a dimension, like browser=DNT or screen resolution=1×1. For example, a simple segment with browser=DNT will let you see how many users you have with DNT on (although not much more about them beyond that!).
