Lies, Damn Lies, and Statistics

I read Tim O'Reilly's article "How the Web was almost won" this morning, and it got me thinking. One of the things Tim says is:

Browser-access statistics prove fairly conclusively that even in Netscape's heyday, the majority of clients were running on the Windows platform.

I take exception to that. While the fact may be strictly true (i.e. more than 50% of the browsers were running on Windows), there's a lot more to the story that's left unsaid.

The first thing is that where you compile the statistics has a lot to do with what statistics you're going to get. On my site, on an average day, I tend to see about a quarter of the users on Macs. On some days, I see 99% of my hits coming from browsers that say they're a Mac (that happens on days when I release new software, and it's covered in the online Mac software journals). Similarly, I've quit reading CNN from my Mac, because they switched to using Windows Media for all of their movies. Fine. I'll get my news elsewhere. The important point is that the content of the site will influence which browsers show up in its logs. If your content is Mac-hostile, there's not much chance of you seeing many Macintoshes.
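
To put rough numbers on that, here's a minimal sketch in Python. The site names and hit counts are made up for illustration; only the proportions for my own site (about a quarter on an average day, 99% on a release day) come from what I described above. The point is just that the "Mac share" you measure depends entirely on whose logs you're reading.

```python
# Hypothetical daily logs as (Mac hits, total hits). Every count is invented;
# only the rough proportions for the first two entries follow the text.
logs = {
    "Mac software site, average day": (250, 1000),
    "Mac software site, release day": (9900, 10000),
    "site with Windows-only video": (20, 1000),
}


def mac_share(mac_hits, total_hits):
    """Fraction of hits that claimed to come from a Mac."""
    return mac_hits / total_hits


shares = {site: mac_share(mac, total) for site, (mac, total) in logs.items()}

for site, share in shares.items():
    print(f"{site}: {share:.0%} Mac")
```

Three sets of logs, three very different answers, and none of them says much about how many Macs are actually out there.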

Secondly, not all browsers tell the truth. For example, it's possible (or at least it was, I haven't looked to see if that feature is still there) to customize iCab to have it say that it's any browser and operating system you'd like. I had mine set to say it was Bob's Pretty Good Browser running on a Timex/Sinclair 9000 for a while. I also use the WebTV viewer to preview my own web sites, and then will often surf other places to see how things look through that window. When I'm running that on my Macintosh, it doesn't say I'm a Mac, it says I'm a WebTV. Do you think that might have screwed up some statistics? Similarly, any company running a web proxy server can set the proxy server to say that all the browsers are whatever the person configuring the server would like. Often this is set to either Windows IE or Windows Netscape, since that causes the fewest compatibility problems (Macs can read most content targeted at Windows). If you set the proxy server to say something goofy (like I've done with iCab), some servers won't serve you any pages at all. Again, there's a bias towards saying you're running on Windows, just because it makes life easier. It's a white lie, and the only trouble comes when people actually believe it.
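
Here's a rough sketch in Python of how that plays out in a log analyzer. The classify_platform function is hypothetical (I'm not claiming any real statistics package works exactly this way), and the User-Agent strings are illustrative examples rather than captured traffic. It simply trusts whatever the browser claims to be, then compares the tally against what was really being used.

```python
from collections import Counter


def classify_platform(user_agent):
    """Guess the platform from a User-Agent string, taking it at its word."""
    ua = user_agent.lower()
    if "webtv" in ua:
        return "WebTV"
    if "windows" in ua:
        return "Windows"
    if "mac" in ua:
        return "Macintosh"
    return "Other"


# Each entry: (platform actually in use, User-Agent string that was sent).
# All of these strings are made up for illustration.
hits = [
    ("Macintosh", "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"),
    ("Macintosh", "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"),  # proxy says Windows
    ("Macintosh", "Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)"),    # WebTV viewer
    ("Macintosh", "Bob's Pretty Good Browser (Timex/Sinclair 9000)"),  # customized
    ("Windows", "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"),
]

actual = Counter(platform for platform, _ in hits)
reported = Counter(classify_platform(ua) for _, ua in hits)

print("actual:  ", dict(actual))
print("reported:", dict(reported))
```

In this toy sample, four of the five hits really came from Macs, but the tally credits the Mac with one hit and Windows with two.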

Lastly, no survey includes machines used solely on intranets. These aren't available to the public, and there's just no good way to take a survey of them. While browsers will often poke their heads outside the intranet, it will often be via a proxy server, and again, there's no way to tell what the browser really is.

As ever, the big problem with statistics is that you can gather statistics that say anything you want (by surveying the right population), and even if you try to get good data, there are a number of factors that can skew it. Accepting statistics and making sweeping statements based upon them can be dangerous, especially if you're basing important decisions on bad data.

