We all know our clients can be impatient sometimes… This was the case yesterday, when one of my clients decided they needed to know how much traffic their site was getting and absolutely could not wait until the next morning, when the Google Analytics numbers would be updated. Fortunately, I had enabled logging on the server for this site, and was able to pull down what are called "raw logs" for some almost up-to-the-minute numbers. (The "almost" is important: the hosting provider doesn't currently offer real-time logging. Instead, it updates the logs nightly, and has the data available before Google does.)
When I pulled down the numbers, they were about a third higher than what Google Analytics was reporting. In this case, that was a difference of about 10,000 unique visitors!
So why the discrepancy and what should we do about it?
First, we need to know how it's even possible that these two tools show such vastly different numbers. To answer that, we just have to look at HOW each collects information.
In the case of the raw logs, every time a user requests a page, a line of data is stored in a log file on the web server. Barring some catastrophic server error, this never fails. Ever. In fact, the server will write that line to the log file even if you don't bother to let the whole page load. The only downside is that search spiders and robots request web pages in the same way humans do. These are programs run by the major search companies and by people like me who have to write scripts to scrape thousands of pages of data from government websites. They're supposed to pass along a little piece of data that says "I'm a robot", but some don't, and those get counted as people, which artificially inflates the number of unique visitors (Sorry about that script I wrote <website I won't mention here>!).
In some situations, a user can return to a site they've already seen and receive a copy of it that's stored either in their browser or on one of their service provider's computers. That visit will not be tracked on the server, because no request ever reached it, but at the end of the day, we don't care as much about repeat visitors as we do about unique visitors to the site.
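For the curious, here is a rough sketch of what counting unique visitors from raw logs can look like. It rests on several assumptions that are mine, not facts about this client's server: that the log uses the common Apache/Nginx "combined" format, that one IP address equals one unique visitor (a simplification), and that a robot can be spotted by a keyword in its user agent. The function names, the keyword list, and the sample lines are all made up for illustration.

```python
import re

# Assumed "combined" log format:
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical, deliberately short list of user-agent keywords.
# Robots that don't identify themselves will slip past this check.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def is_declared_bot(agent):
    """Return True if the user agent admits to being a robot."""
    agent = agent.lower()
    return any(hint in agent for hint in BOT_HINTS)


def count_unique_visitors(lines):
    """Return (human_ips, bot_ips) as counts of distinct IP addresses."""
    humans, bots = set(), set()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that don't fit the assumed format
        target = bots if is_declared_bot(match["agent"]) else humans
        target.add(match["ip"])
    return len(humans), len(bots)


# Made-up sample lines, not real traffic.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '203.0.113.5 - - [10/Oct/2011:13:56:01 -0700] "GET /about.html HTTP/1.1" 200 1840 "http://example.com/index.html" "Mozilla/5.0 (Windows NT 6.1)"',
    '198.51.100.7 - - [10/Oct/2011:14:02:10 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.44 - - [10/Oct/2011:14:05:42 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Macintosh)"',
]

human_count, bot_count = count_unique_visitors(SAMPLE)
print(human_count, "human visitors,", bot_count, "declared robots")
```

On the four sample lines this reports 2 human visitors and 1 declared robot: the first two requests come from the same address, so they count once. A robot that sends a browser-like user agent would land in the human count, which is exactly the inflation described above.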
Google Analytics works in a very different way. When a user visits a web page that uses Google Analytics, their browser also runs a tiny piece of JavaScript hidden somewhere in the page's code. The user goes to the page, gets the script, the script runs, and it sends information back to the Google servers to be added to a database. There are several transactions happening there, and as a result there are several opportunities for things to get messed up. Getting messed up means that the visit is not tracked.
A few examples where a visit might not be recorded are:
The user has JavaScript disabled.
The user doesn't let the page fully load.
The user's internet connection fails, or is blocked from communicating with the Google server, when the analytics script runs.
Some mysterious error happens in the Google system.
Visits are dropped in each of those cases, but each case is rare. Google Analytics also doesn't count search spiders or robots… because they generally don't run the scripts in a page; they just crawl through the links.
So what's a responsible online team to do?
For the most part, we talk to our clients about site traffic in terms of Google Analytics data, but given the information above, I think it's important that we always attach a disclaimer to those numbers. I think the responsible course of action is to sell Google Analytics as an INDICATION of the amount of traffic a site is getting, and not as an actual count of that traffic. There are just far too many things that can go wrong in the process of capturing the user's visit to be sure that this number is 100% accurate.
GA is a GREAT tool, and I don't think we should stop using it. It's easy to use, and gives a quick snapshot of the traffic trends on the site in a very digestible format. However, when we need to know exactly how many people have visited, we need to go with the raw logs. There is no pretty website, and it takes a little time for me to generate a report, but the plus side is a higher (about 30% in this case) and more truthful number of unique visitors to tell your client about. We should sell this as an added level of service to the client, perhaps as a line item in the scope, and say that we'll deliver a more detailed report of site activity at the end of the project's life or at meaningful intervals.
