Open everything? Some background to current debates on open data
Let's be clear: the Open Data movement is not about the pursuit of complete and unconditional openness. We know that it would be unwise to publish details of police patrol patterns, or the combination to the safe containing the Crown Jewels. We believe that fundamental reference data like Ordnance Survey maps, transport timetables, and company information should be freely available to all - information about objects, rather than information about people. Slightly different standards apply in different countries, but in the UK open data can be defined as "non-personally identifiable data produced in the course of an organisation’s ordinary business". However, between 'open data' and 'personal data' there is a large grey area, and inevitably the boundaries are sometimes blurry. Serious privacy issues usually arise when an institution sees individuals as objects.
Open data is published on the internet, for free, for anyone, forever - open data licences are designed this way in order to prevent an institution from arbitrarily removing access to data once it has been released and used. This is important, but has institutional downsides. Too often, an organisation will release open data one week, only to discover the next week that some or all of the data should not in fact have been released, and that it is difficult to retract. Even when the pre-release due diligence is undertaken properly by experienced staff, mistakes can still happen.
As an example, Wikipedia recently made some data available from its search tool. They diligently ensured that only text box data and timestamps were included, in an attempt to make records non-personally identifiable. However, as anyone who has ever accidentally typed their password into a search box or pasted unexpected contents into a chat window might have foreseen, some of the data was still personally identifiable. The threat model Wikipedia's team had been working to was inadequate, and since far fewer than 1% of the records gave rise to concern, inspection of a sample would likely not have detected the problem.
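To see why sampling is a weak safeguard here, a rough calculation helps. The sketch below is illustrative only: the function name, the rates and the sample sizes are hypothetical and are not Wikipedia's actual figures. It assumes records are drawn independently and that a reviewer would recognise every problem record they saw, which is optimistic.

```python
def detection_probability(problem_rate: float, sample_size: int) -> float:
    """Chance that a random sample contains at least one problem record.

    Assumes independent draws, a reasonable approximation when the
    sample is small relative to the whole dataset.
    """
    if not 0.0 <= problem_rate <= 1.0:
        raise ValueError("problem_rate must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must not be negative")
    # Probability that every sampled record is clean, subtracted from 1.
    return 1.0 - (1.0 - problem_rate) ** sample_size


if __name__ == "__main__":
    # Hypothetical rates and sample sizes, chosen only for illustration.
    for rate in (0.01, 0.001, 0.0001):
        for size in (100, 1000):
            chance = detection_probability(rate, size)
            print(f"rate={rate:.2%} sample={size}: {chance:.1%} chance of seeing a problem")
```

Under these assumptions, if one record in a thousand is a problem, a reviewer who reads 100 records has a little under a 10% chance of seeing even one of them.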
As an organisation, Wikipedia cares more deeply about the privacy of its users than most. If the open data process can go wrong for them, it can go wrong for anyone - disclosure control is extremely hard to do even adequately, and harder still to do well. And unfortunately, nobody's perfect.