There are a few questions that frequently come up in Google Analytics with respect to data processing:
1. “When will my data show up in GA?”
2. “When is yesterday’s data complete in GA?”
3. “Can I edit or remove data in GA?”
And finally a slightly less commonly-asked question I am also looking to explore here:
4. “Can I back-date data in GA? If so, by how much?”
By back-dating I mean placing hits into GA with timestamps in the past, since normally all hits to GA are given a timestamp of the current time.
To get to an answer to #4 we first have to investigate questions #1-3, as they are all related to the GA data processing flow.
When Will My Data Show Up?
As far as when hits show up, data of course shows up first in the real-time reports, within seconds.
Then data shows up in standard reports — within minutes for smaller sites, within hours for bigger sites (we’re only talking about standard free GA here, not GA 360). Larger sites get fewer processing updates throughout the day, which can lead to some pretty big time lags waiting for data to show in standard reports. The data flow is the same no matter the size of the site, just the bigger sites require a lot more processing and therefore have to wait longer.
GA’s default reporting time period always ends on yesterday, for good reasons: it’s potentially very confusing to report on up-to-the-minute numbers (unless you are extremely careful about your time periods), and Google would prefer you not pull reports on data that’s not done processing yet.
Here’s an example of reporting latency in a standard report for a moderately trafficked site (<10k sessions/day) vs. a more heavily trafficked site (50-100k sessions / day). Both have some processing latency, however processing latency on the bigger site is much more evident and problematic for current-day reporting.
This site is showing a very small amount of latency. At 2:30, the dot for the 2:00 hour should be about halfway up to its final position, and it is just shy of that. Looking at other data points at the top of the hour generally showed very minimal latency: about 95% of the sessions for the previous hour were already counted when measured at the top of the hour. This is not an exact measurement though, because sessions are not always additive; sometimes sessions are taken away in processing for various reasons.
The higher trafficked site here shows hours of lag on reporting, and only updates a few times a day.
It’s clearly stated in the Google Analytics documentation that there is a 24 to 48 hour processing latency on data. That number represents a maximum amount of reporting lag; for most small sites your data will be mostly complete in GA’s standard reports within minutes. Even larger sites will generally see standard reports only lagging by several hours at most. That same piece of Google documentation says sites with 200,000+ sessions/day may only be updated once per day, but sites with under 200k/day may also see a less frequent update schedule as is illustrated with our example above.
It’s also important to remember that the data processing output does not always have to be in chronological order; just because you are getting data for 2pm doesn’t mean the data from 8am is 100% complete (again this is especially true with larger sites).
Other data coming from outside GA like Google Search Console data may come in much later (2-3 days), but we’re only looking at GA-processed hit data here.
When is Yesterday’s Data Complete?
This puts us into the realm of our second initial question, when is yesterday’s data complete? When can I stop worrying if the numbers are going to still jump around a bit? Not just up-to-date with maybe a few straggling hits to still get counted or reprocessed, but totally done processing never to change again.
The GA API has a flag for this in the batchGet method, “golden” (it appears as the isDataGolden field in Reporting API v4 responses). The name presumably comes from the idea of a golden master in software, which is the final release version of the program ready to ship out to the customer. Once that golden flag shows up on a query, it means the data in that date range will never change again.
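As a rough illustration of checking that flag, here is a minimal Python sketch that inspects an already-parsed batchGet response. It assumes the Reporting API v4 response shape, where each report carries `data.isDataGolden` and the field is simply absent while data is still processing. The helper name `all_reports_golden` and the two sample responses are made up for this example and are not real API output.

```python
def all_reports_golden(response: dict) -> bool:
    """Return True only if every report in a batchGet-style response is golden."""
    reports = response.get("reports", [])
    if not reports:
        # No reports at all: nothing we can call finalized.
        return False
    # A missing isDataGolden field is treated as "still processing".
    return all(r.get("data", {}).get("isDataGolden", False) for r in reports)


# Hand-written sample responses (hypothetical, trimmed to the relevant field).
still_processing = {"reports": [{"data": {"rowCount": 3}}]}
finalized = {"reports": [{"data": {"rowCount": 3, "isDataGolden": True}}]}

print(all_reports_golden(still_processing))
print(all_reports_golden(finalized))
```

Treating a missing field as "not golden" is the cautious choice here: the check can only return True when the response says so explicitly.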
So our flow is: 1) real-time reports, 2) standard reports (processing), and 3) standard reports (finalized/golden). We can of course start looking at our standard reports while they are processing, but to be safe we should not seriously use them until they are finalized.
So what time is that?? I’ve been running a test for the last few months to find out. The test is pretty simple — every 5 minutes I asked if a basic query was processing or golden. Once it’s golden it stays that way, so the chart below represents the first time a query came back marked golden.
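The bookkeeping for a test like this can be sketched in a few lines of Python. This is not the script used for the experiment, just an assumed reconstruction of the idea: given a log of (check time, was it golden?) pairs, find the first check that came back golden. The function name and the sample log below are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def first_golden_time(
    observations: Iterable[Tuple[datetime, bool]]
) -> Optional[datetime]:
    """Return the earliest check time at which the query was golden, if any."""
    for checked_at, is_golden in sorted(observations, key=lambda o: o[0]):
        if is_golden:
            return checked_at
    return None  # never went golden within the logged checks


# Made-up log of polls taken 5 minutes apart (arbitrary date).
polls = [
    (datetime(2000, 1, 2, 14, 5), False),
    (datetime(2000, 1, 2, 14, 10), True),
    (datetime(2000, 1, 2, 14, 15), True),
]
print(first_golden_time(polls))
```

With a 5-minute polling interval the result is only accurate to within 5 minutes of the real completion time.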
On average, data was completed for our test site (very low traffic, Eastern timezone) at 2:13pm. There was quite a bit of variation (standard deviation of 1h 56m); on many days it completed processing much later in the day. There was even one day (April 5th) that didn’t actually finish processing the previous day by midnight (though in the interest of not totally confusing my graphs I have just marked that as midnight). Was April 5th still in the 24-48 hour window? Depends how you interpret that window and what it means to have completed processing… A hit at 12:01am on the reporting day that was not marked as golden until after 12:01am the day after the processing day is more than 48 hours elapsed from the original hit, but only from that hit, not from the end of the reporting day itself. In any case that was just one day, and overall the processing speed has improved somewhat since April.
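For anyone repeating the test, the average and standard deviation of daily completion times can be computed roughly as below. This is a sketch under assumptions: every completion time falls on the same calendar day (a day that ran past midnight would need special handling), and the three sample times are invented purely to show the calculation, not taken from the experiment.

```python
from datetime import time
from statistics import mean, stdev


def minutes_after_midnight(t: time) -> int:
    """Convert a time of day to whole minutes since midnight."""
    return t.hour * 60 + t.minute


def summarize(completion_times):
    """Return (average time of day, sample standard deviation in minutes)."""
    minutes = [minutes_after_midnight(t) for t in completion_times]
    avg = round(mean(minutes))
    return time(avg // 60, avg % 60), stdev(minutes)


# Invented completion times for three days, only to exercise the function.
sample = [time(13, 0), time(14, 0), time(15, 0)]
avg_time, spread_minutes = summarize(sample)
print(avg_time, spread_minutes)
```

`statistics.stdev` is the sample standard deviation; use `pstdev` instead if the logged days are the whole population you care about.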
This is what the processing completion time was for our test site only. My experiments looking at the processing completion time across multiple sites showed that as you’d expect they don’t all finish processing at the same time, though within the same timezone they did generally finish pretty close in time.
Unfortunately there’s no way to tell in GA if the data is fully processed just from the UI, but from our experiment here it’s a good bet that if you’re looking at the data at ~3pm or later the following day, the data is probably done processing. Once it’s done processing, that data is set in stone.
This brings up an important point about GA which most of us know from experience but probably don’t think about much, which is that GA is essentially a write-once system. Once the hits are recorded in GA, you can’t delete or edit them (even if you’d really, really like to…). There are some exceptions to this like cost data (from AdWords or via Data Import), but the core hit data is only writeable once on the currently processing day.
To extend the “golden” metaphor (which supposedly comes from the gold CD-Rs that a final product was written to), we can think about this as if GA burns a CD at the close of each processing day for the previous day. After that close you can read the CD as many times as you’d like, but you can’t change it.
Oh wait.. that was our third question wasn’t it?
Can I edit or remove data?
Nope, next question.
(edit 2019-01-25) OK — this isn’t totally true anymore, as of 2018 you can remove users by client id (thanks GDPR?). You still can’t edit data though, maybe to unnecessarily extend the CD metaphor this is like putting a scratch in the right exact place for one cid.
Can I Back-date Data? How Far Back?
This brings us to our final question and the second part of the experiment, which is “can I back-date data?”. We know that we can’t change data on any day that is already closed, and we know we can’t alter any data that we’ve already recorded, but a lesser known feature of GA is that we can actually back-date hits… at least by a few hours.
Specifically, I mean that when we send a Measurement Protocol request we can use the “qt” (queue time) parameter to tell GA that a hit actually happened in the past. So if we want to replay offline events (like an offline sale, or visits to an offline local copy of a site) we can do so, but only while the day we want to add the hits to is still processing. So if we wanted to put in a hit from last week, we couldn’t.
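To make the qt parameter concrete, here is a hedged Python sketch that builds (but does not send) a Measurement Protocol v1 event payload. The parameter names `v`, `tid`, `cid`, `t`, `ec`, `ea` and `qt` are from the Measurement Protocol, and qt is the age of the hit in milliseconds. The helper name, the placeholder tracking ID `UA-XXXXX-Y`, the client ID, and the event names are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Universal Analytics collection endpoint; the payload would be POSTed here.
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_backdated_event(tracking_id, client_id, occurred_at, now,
                          category, action):
    """Build a URL-encoded event payload whose qt says how old the hit is."""
    queue_time_ms = int((now - occurred_at).total_seconds() * 1000)
    if queue_time_ms < 0:
        raise ValueError("occurred_at is in the future")
    params = {
        "v": "1",             # protocol version
        "tid": tracking_id,   # property ID (placeholder below)
        "cid": client_id,     # anonymous client ID
        "t": "event",         # hit type
        "ec": category,
        "ea": action,
        "qt": str(queue_time_ms),  # milliseconds since the hit occurred
    }
    return urlencode(params)


# Hypothetical example: an offline sale that happened two hours ago.
now = datetime(2000, 1, 2, 3, 0, tzinfo=timezone.utc)
payload = build_backdated_event(
    "UA-XXXXX-Y", "555", now - timedelta(hours=2), now, "offline", "sale"
)
print(payload)
```

Because qt is relative to when the request is sent, the payload should be built right before sending rather than queued with a stale value.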
For back-dating hits, the documentation says “Values greater than four hours may lead to hits not being processed”. What’s with the vagueness Google?? Well, it’s because four hours is the minimum amount of back-dating guaranteed to be available.
The previous day ends at midnight and stops accepting new hits at around 4am. So a hit that happened at 11:59pm the previous day can be back-dated until 3:59am at the latest, which is 4 hours into the past. A hit that happened at 12:01am the previous day could also still be sent at 3:59am, which is roughly 28 hours into the past, the largest possible offset for a back-dated hit. This is all assuming that the window for accepting hits closes at 4am.
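The arithmetic above can be sketched as a small check. This assumes the 4am close described in this post (the only part the documentation's four-hour figure guarantees), and naive datetimes that are already in the view's time zone. The function name and the constant are hypothetical.

```python
from datetime import datetime, time, timedelta

ASSUMED_CLOSE_HOUR = 4  # assumption from this post, not a documented constant


def day_is_open(target: datetime, now: datetime,
                close_hour: int = ASSUMED_CLOSE_HOUR) -> bool:
    """True if a hit back-dated to `target` should still be accepted at `now`."""
    if target > now:
        return False  # cannot place hits in the future
    # The target's day closes at close_hour on the following morning.
    next_midnight = datetime.combine(target.date() + timedelta(days=1), time(0, 0))
    return now < next_midnight + timedelta(hours=close_hour)


now = datetime(2000, 1, 2, 3, 59)
late_hit = datetime(2000, 1, 1, 23, 59)   # 4 hours back
early_hit = datetime(2000, 1, 1, 0, 1)    # 27 h 58 min back

print(now - late_hit, day_is_open(late_hit, now))
print(now - early_hit, day_is_open(early_hit, now))
print(day_is_open(late_hit, datetime(2000, 1, 2, 4, 1)))
```

Running a check like this before sending is worthwhile precisely because a hit aimed at a closed day gives no error back.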
As long as the day is still open to accept hits you can place hits at whatever time during that day you’d like. That could mean just back-dating a hit a few minutes while still in the current day, or it could mean taking advantage of the open time window into the next morning to back-date as far back as the beginning of the previous day.
This is all relative to the time zone the view is set to! If your view is set to US Pacific time and your back-dating process is running on US Eastern time, that means your window to place hits in the previous day would be open until 7am ET because midnight from your view’s perspective is at 3am ET. Timezones are confusing, ugh.. I’ll go back to just assuming you are in the same timezone as your view.
But does that mean that the previous day’s data goes “golden” at 4am? Unfortunately no… for that to be possible it’d mean Google would have to take a back-dated hit that happened right before 4am and have it totally integrated and done processing immediately after. There’s the time that the day closes for business (roughly 4am) and then the time the day is fully baked & golden brown.
So in addition to testing when the data becomes golden, I also tested when the day actually closed to new hits, to see how hard that 4am deadline for new data really is.
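Here is that Pacific-to-Eastern conversion as a short sketch. To stay self-contained it uses fixed standard-time offsets (UTC-8 and UTC-5), which is an assumption that ignores daylight saving; a real script would more likely use `zoneinfo` with named zones. The names below are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets, assumed for simplicity (no DST handling).
PACIFIC = timezone(timedelta(hours=-8), "PST")
EASTERN = timezone(timedelta(hours=-5), "EST")


def close_time_for_sender(view_close: datetime, sender_tz: timezone) -> datetime:
    """Express the view's closing time in the sending machine's time zone."""
    return view_close.astimezone(sender_tz)


# The assumed 4am close in a Pacific-time view...
view_close = datetime(2000, 1, 2, 4, 0, tzinfo=PACIFIC)
# ...as seen from a back-dating process running on Eastern time.
local_close = close_time_for_sender(view_close, EASTERN)
print(local_close)
```

Both values describe the same instant; only the wall-clock reading changes, which is why comparing timezone-aware datetimes avoids the confusion.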
As an example, here’s one day’s data. On August 9th I was able to place hits back-dated into the previous day until 5:45am. Meaning that the 5:45am hit trying to put data into the previous day worked, but the 5:50am attempt did not work and that hit just disappeared. This is an important note when back-dating hits: if the day you are attempting to place your hit in has closed, the hit will silently fail. No error messages or validation available, it just ends up in the bit bucket (i.e. nowhere).
So in reality for that day we had an extra 1 hour 45 minutes to put in back-dated hits. That turned out to be a pretty typical closing time: for the 168 days the experiment ran, the average closing time was just shy of 5:45am. The closing time was much more consistent than the golden time above; the day closed at pretty much the same time every day.
The earliest closing time was 4:45am, and there was one outlier: April 4th stayed open until 3pm. As we know from the golden completion test above, April 5th was clearly not a great day for GA data processing. This is interesting (well, to me at least) to see how processing works, but the delayed close times are not helpful in practice because any time after 4am is not guaranteed, and hits attempted after the day has closed simply disappear.
Overall (time zones aside) this flow of processing is pretty simple:
I’m sure in the future this will all change. Perhaps processing will not be broken down into discrete per-day data “chunks”. Hopefully at some point there will be the ability to edit hits after the fact, or send hits further back in time. Until that time hopefully this post helps illustrate how and why the current system works as it does.