How disclosive is traffic data? A Wikipedia example

5 September 2012

The Home Office constantly insists that traffic data is not about the content of the pages you look at, only about the sites you visit.

This would have made some sense in 1999 when RIPA was first being debated, but technology has moved on and new open data sources are now available. Together these allow for vastly more invasive tracking in 2012 than was envisaged in 2000. We’ve done a little bit of work on how…

The English Wikipedia contains 4 million articles, which contain 18 million links out to other websites.

We’ve run an analysis on those articles and links, and looked at how many of the outbound hostnames uniquely identify the page that you were looking at (a hostname is the part of a web address that names the site, such as en.wikipedia.org). Of the 4 million articles on Wikipedia, 1.3m of them - i.e. one in three - contain a link that is enough to identify the Wikipedia page you were looking at, simply because only one page on Wikipedia contains a link to that hostname.
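To make the counting step concrete, here is a minimal sketch in Python of the kind of analysis described above. It is not the code we ran: the function names and the toy `article_links` data (with made-up `.example` hostnames) are assumptions for illustration, and a real run would read the links from a Wikipedia database dump.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def identifying_hostnames(article_links):
    """Map each hostname that is linked from exactly one article to that article.

    article_links: dict of article title -> list of outbound URLs (hypothetical input shape).
    """
    articles_by_host = defaultdict(set)
    for article, urls in article_links.items():
        for url in urls:
            host = urlsplit(url).hostname  # lower-cased hostname, or None
            if host:
                articles_by_host[host].add(article)
    # Keep only hostnames that appear on a single article.
    return {
        host: next(iter(articles))
        for host, articles in articles_by_host.items()
        if len(articles) == 1
    }


def identifiable_fraction(article_links):
    """Fraction of articles containing at least one uniquely identifying hostname."""
    if not article_links:
        return 0.0
    identifiable = set(identifying_hostnames(article_links).values())
    return len(identifiable) / len(article_links)


# Toy data: invented titles and hostnames, for illustration only.
article_links = {
    "Article A": ["http://a-only.example/page", "http://shared.example/x"],
    "Article B": ["http://shared.example/y"],
    "Article C": ["http://C-ONLY.example/", "http://shared.example/z"],
}

unique = identifying_hostnames(article_links)
fraction = identifiable_fraction(article_links)
```

In the toy data, two of the three articles carry a hostname that appears nowhere else, so anyone who sees a visit to Wikipedia followed by a visit to one of those hostnames can name the article that was read.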

Some of those are obvious: for example, one hostname is linked only from the Wikipedia page for Julian Huppert.

Less obviously, if you visited Wikipedia and then followed one particular outbound link, that link appears only on the page for Cambridge and on no other page. From traffic data alone, it is then possible to know that it was the “Culture: Festivals and events” section of the page that was read.

Wikipedia articles go into multiple categories, and while one in three articles can be pinpointed exactly from a single hostname, the proportion for which the category, and hence some aspect of what was read, can be identified will be higher. This information may be sufficiently indicative of intent that content can be inferred from context. The police already have adequate tools to investigate people suspected of criminal offences - making everyone a suspect and demanding yet more data will simply be counter-productive.
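The category point can be sketched in the same way. The Python below is an illustration under stated assumptions rather than part of our analysis: `categories_revealed` is a hypothetical helper, and the hostnames, article titles and categories are invented. The idea is that even when a hostname is linked from several articles, any category shared by all of those articles is still given away.

```python
def categories_revealed(host_articles, article_categories):
    """For each hostname, return the categories shared by every article linking to it.

    host_articles: dict of hostname -> set of article titles (hypothetical input shape).
    article_categories: dict of article title -> set of category names.
    """
    revealed = {}
    for host, articles in host_articles.items():
        category_sets = [set(article_categories.get(a, ())) for a in articles]
        if not category_sets:
            continue
        common = set.intersection(*category_sets)
        if common:
            revealed[host] = common
    return revealed


# Toy data: invented hostnames, titles and categories, for illustration only.
host_articles = {
    "festival.example": {"Town X", "Town Y"},
    "news.example": {"Town X", "Person Z"},
}
article_categories = {
    "Town X": {"Towns", "Places"},
    "Town Y": {"Towns"},
    "Person Z": {"People"},
}

revealed = categories_revealed(host_articles, article_categories)
```

Here `festival.example` does not identify a single article, but it does show that the reader was looking at something in the "Towns" category, while `news.example` reveals nothing because its two articles share no category.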