Tuesday, 2 December 2014

On the usefulness of log scales

In preparation for a piece I am writing I tripped over the following graphic in the OECD's recently published report on Measuring the Digital Economy:

Average data storage cost for consumer 1998-2012
Per Gbit

It just had to be replotted onto a log scale so I could actually see what was going on - yup darn close to a straight line (that'll be one of them thar Moore's Law things).
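For anyone wondering why the log scale helps, here is a minimal sketch. The numbers are made up, not the OECD figures: it assumes a cost that halves every year from an arbitrary starting value (START_COST and synthetic_cost are names invented for this illustration). On a linear axis the later years vanish into the floor, while the log of the cost drops by the same amount every year, which is exactly a straight line.

```python
import math

# Hypothetical illustration only: these are NOT the OECD figures.
# Assumption: the cost halves every year from an arbitrary starting value.
START_COST = 100.0
YEARS = list(range(1998, 2013))


def synthetic_cost(year: int) -> float:
    """Cost in a given year under the assumed halve-every-year model."""
    return START_COST * 0.5 ** (year - YEARS[0])


costs = [synthetic_cost(y) for y in YEARS]
log_costs = [math.log10(c) for c in costs]

# Year-to-year change on a linear axis: large at first, then invisible.
linear_steps = [b - a for a, b in zip(costs, costs[1:])]
# Year-to-year change on a log axis: the same size every year.
log_steps = [b - a for a, b in zip(log_costs, log_costs[1:])]

for year, cost, log_cost in zip(YEARS, costs, log_costs):
    print(f"{year}  cost={cost:12.6f}  log10(cost)={log_cost:8.4f}")
```

Any constant percentage decline gives the same picture; only the slope of the line changes.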

However, a big open data hoorah! and 1000 Internets to the folks at OECD who have published all the data for the tables and graphs in their reports so that I was able to do this. Long may this trend continue.

From: OECD (2014), Measuring the Digital Economy: A New Perspective, OECD Publishing.
DOI: 10.1787/9789264221796-en

Friday, 26 September 2014

Sunday, 14 September 2014

How Broadband ADSL Works

Another in the series of videos about how stuff works, dating back to the time when I were an engineer...

Wednesday, 30 July 2014

Anonymous - we need some limits

I've been much annoyed recently by the woolliness of thinking around anonymity by many and thought it time to remind folks to look the word up in a dictionary before declaring that what they have done is "anonymized the data".

  1. not identified by name; of unknown name
  2. having no outstanding, individual, or unusual features
The recent release of the New York taxi data serves as a key example - the translation of the taxi hack license number (identifying the driver) and taxi medallion (identifying the vehicle) through an MD5 hash function simply changes the names - well, check defn. 1. The underpinning data, however, was highly individual - fail on defn. 2. More accurately this should have been described as de-identified data, but anyone who had recorded their own journey in a cab could use it to start to re-identify the cabby and vehicle by correlating their journey with the open data. Say someone set up a website to share that information - we would have re-identification by crowdsourcing.

(In fact the failure was worse than that - the de-identification algorithm only had 18.4m possibilities for drivers and 1.3m possibilities for vehicles. No human would want to take the time to sift through that lot, but to the computer it's done in the blink of an eye.)
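To make the "blink of an eye" point concrete, here is a minimal sketch of reversing an unsalted hash over a small identifier space. The ID format is an assumption for illustration (5-digit zero-padded numbers, far smaller than the real licence and medallion spaces), and the names md5_hex, build_reverse_table and reidentify are invented for the sketch. The idea is simply to hash every possible ID once and then look the published hashes up.

```python
import hashlib
from typing import Dict, Optional

# Assumption for illustration: IDs are 5-digit zero-padded numbers.
# The real licence and medallion formats differ; only the idea carries over.
ID_DIGITS = 5


def md5_hex(identifier: str) -> str:
    """Unsalted MD5 of an identifier, as a hex string."""
    return hashlib.md5(identifier.encode("ascii")).hexdigest()


def build_reverse_table(digits: int = ID_DIGITS) -> Dict[str, str]:
    """Hash every possible ID once and index the results by hash."""
    table = {}
    for n in range(10 ** digits):
        identifier = f"{n:0{digits}d}"
        table[md5_hex(identifier)] = identifier
    return table


reverse_table = build_reverse_table()


def reidentify(hashed: str) -> Optional[str]:
    """Return the original ID for a published hash, if it is in the table."""
    return reverse_table.get(hashed)


published = md5_hex("04217")  # what a "de-identified" record would contain
print(reidentify(published))  # prints 04217
```

The table is built once and every lookup after that is a dictionary hit, so the size of the data set being attacked hardly matters. Only the size of the ID space does, and a few million candidates is not much of a space.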

Much fuss was kicked up about care.data recently; many factors contributed here, but one aspect that often receives attention is this aspect of re-identification. I'm sure it is annoying to the authors of the riveting page turner "Anonymisation Standard for Publishing Health and Social Care Data Specification" who have gone to considerable lengths to consider a vast array of possible re-identification attacks, but those still concerned point out that we can't conceive of all of the future data sources that could be correlated to enable re-identification and so the risk is too high to accept.

It does call out for a new angle of research on the problem - can we establish some theoretical limit, for a given anonymization technique, on what granularity of re-identification would be possible?

We've been looking recently for inspiration from some of our related work within Horizon which has looked at the predictability in human mobility from GPS traces (open access PDF). The work aims to provide a limit to what could possibly be achieved in terms of mobility prediction no matter how cunning we can be in future. And of course location data is in itself one of the most concerning privacy violating data sources that many people unknowingly continually stream to random third parties from their smart phones.

In any case we need to:
  1. remind ourselves about the definition of anonymous regularly
  2. somehow get beyond the unquantified risk argument...

Thursday, 24 July 2014

Tour de France in Artworks

ArtMaps continues apace, and most recently we mapped the Tour de France with Tate artworks.

So I wandered down to Pau to today's stage to see what was there. Noting that the course goes through Lourdes, I was somewhat surprised to find a bicycle based place holder rather than a geolocated artwork - surely Tate has something for such a site of religious significance (that thought triggered not least by watching The Monuments Men last night!).

Off to the Tate website and search on Lourdes and of course there is something - from the religious and colourful dichotomous Eric Gill "Our Lady of Lourdes".

But only one. And there's got to be so much more out there. Roll on the day that we can get a federated image search across all cultural archives and start to bring to life the richness embedded in these cultures.

Cultural API anyone?

Monday, 30 June 2014

Social media and real time analytics

Having responded to the UK parliamentary select committee on Science and Technology inquiry into social media data and real time analytics, I was pleasantly surprised to be invited to give oral evidence in a session a few days back.

The video on Parliament TV is two hours, but the transcript is perhaps more readily digestible!

  • Professor Derek McAuley, Horizon Digital Economy Research Institute, Professor David De Roure, Director, Economic and Social Research Council, and Sir Nigel Shadbolt, Web Science Trust
  • Professor Liesbet van Zoonen, Loughborough University, Professor David Robertson, UK Computing Research Committee, Dr Mathieu d'Aquin, The Open University, and Emma Carr, Big Brother Watch

Thursday, 5 June 2014

How Cell Phones Reveal Your Location

Many of us use Location Services & GPS on smartphones but mobile phone companies have been able to track us for a long time. More from Computerphile.

Saturday, 10 May 2014

Why sheep need open data

On Monday 19 May in collaboration with the ODI, Horizon is hosting a workshop, grandly entitled:
Show me the future of: the built environment and open data
So what's it all about? I've been on a journey...

About 10 years ago I bought the field next to my house; two border collies need space to stretch their legs....

Then we thought chickens. DEFRA regulations require that if you keep 50 or more birds then you need to register - we've not taken to supplying the village with eggs, so we were able to avoid this piece of agricultural bureaucracy.

Until we very recently decided sheep. One sheep and you need to register. You need a "County Parish Holding" number which you obtain by phoning the nice folks at Rural Payments Agency and providing "postcode or OS Grid Reference(s)" - postcode not much use as it covers most of the village, but my O-level geography seemed to mostly consist of providing and looking up OS Grid References so I can do that.

Then I thought - when I bought this land it was lodged with the Land Registry; would it not be great if I could simply refer to the registry entry and then DEFRA would know not only some point in the middle of a field, but know the boundaries. That might be joined up.

However, I was saddened to remember that I had previously been looking for a digital representation of the boundaries for my home on a planning permission matter [1] and it was not readily available - seemed historically such dangerous digital information was reserved for "the trade".

This information is subject to Crown copyright and
is reproduced with the permission of Land Registry
I'm not a geospatial expert so was only aware in principle of the EU INSPIRE activity, but not the detail. So I was pleased to hear this week at a Big Data event at RAL from the Land Registry that, prompted by INSPIRE, and as part of their open data initiative, the "index polygon" data for land registered with them [2] is now available under OGL. It can be visualized on data.gov.uk directly, or downloaded in chunks from the Land Registry site (warning many files!). It is in GML format, so I had to dust off the Quantum GIS [3], but now have the open data happiness of being able to work with the authoritative data.

Still can't send it to DEFRA though; and likewise, now that I have my CPH number from DEFRA I'm not actually sure what they have on record as the land concerned, so still plenty of room for improvement and joined up thinking.

Which leads us to the topic of the workshop - how many other sources of information (public and private) out there could we be pulling together around the environment in which we live and what new services could we offer. My semi agricultural example is simple compared to the complexity of, say, providing a service that allows Joe Public to send an alert to the tenant in a multi occupancy building that water is gushing out of their flat even though Joe should never know who the tenant is (N.B. posts passim on privacy).

Join us - the wilder the speculation the better the result :-)

[1] You can deface your property and the neighborhood with solar panels on the roof but woe betide you if you want to install them at the bottom of your garden out of sight behind a 6ft fence - you'll need planning for that!

[2] Currently for freehold registries.

[3] Quantum GIS is a free and open source Geographical Information System.

Wednesday, 7 May 2014

Wednesday, 16 April 2014

Is it the End for Moore's Law?

I think Computerphile asked me about Moore's Law because he saw I once worked for Intel....

Read the classic paper by Gordon Moore here.

There's also a great article on the issues ahead by Bob Colwell from August 2013 - he gives it ten years. As Intel’s former chief architect, and now head of DARPA’s Microsystems Technology Office, he is, I'd say, an authoritative voice on this.

The picture from Moore's paper I was trying to draw from memory shows the "sweet spot" curves that actually indicate why building chips with too few transistors for a given generation is as bad an idea as trying to cram on too many!

Wednesday, 2 April 2014

Ethics and social media...

The UK parliamentary select committee on Science and Technology recently agreed to hold an inquiry into social media data and real time analytics. As we at Horizon have been beavering away at privacy preserving means to handle personal data (Dataware), poked at legal aspects of digital assets at death (Death 2.0) and have had our fair share of ethics committee input (wearable camera use), it seemed useful to respond to their call for written evidence to the inquiry.

So having spent the time, I thought I might post the Horizon response here. Many thanks to Gilad Rosner for background research and Lilian Edwards for editorial comments. It's not light reading.

The Horizon Digital Economy Research Institute centred at the University of Nottingham was created through grants totalling £30m from the RCUK Digital Economy programme and currently involves over 20 academics and 30 research fellows from many disciplines. Since opening in October 2009 it has engaged in research around the “Lifelong Contextual Footprint” investigating novel and creative uses of personal data, balanced with the need to remain human in the digital age, by addressing social and personal issues such as trust and privacy. Social media has been one of many footprints investigated.

Horizon response.

What are the barriers to implementing real time data analysis? Is the new Government data-capability strategy sufficient to overcome these barriers?

1. “Social media data offers the possibility of studying social processes as they unfold at the level of populations as an alternative to traditional surveys or interviews.”
A true statement for a number of social media systems in existence at this time. However, already many users have migrated from initial services that were open by default and difficult to configure for privacy to new technologies that are declared and implemented to provide more limited and comprehensible sharing (e.g. SnapChat, WhatsApp). The window for this form of analysis may only ever be open for a short time and provide a very limited demographic view.

What are the ethical concerns of using personal data and how is this data anonymised for research?

2. While social media data holds a great deal of promise for insight into civil issues, its use is ethically complicated. Government analysis of social media data is not different from academic research, and as such must adhere to the same ethical strictures for government action to be legitimate. While there is a temptation to see social media data as fully public and therefore not deserving of protections given to human subjects in other research, this perspective is wrong - the use of any data that is potentially linkable to human beings, thereby making people identifiable, is potentially privacy-invasive, and must be rigorously examined for its ability to harm them or violate their dignity.

3. Informed consent – Participation in research demands informed consent.  Modern privacy thought stresses that, with regard to the use of personal data, context matters (Nissenbaum, 2010; Solove, 2006). The users of social media participate on various websites, fora and applications in the context of those services – for the purposes of sharing, communicating, shopping, entertainment, and so on. Those services were not created for research, and so government use of that data, especially without informed consent, is potentially a violation of the context of the original data collection and of the intent of the human sources.

4.  Spectrum of private to public – Social media sites have a wide variety of privacy controls. As social media technology and business models advance, more nuanced and granular controls of data dissemination appear. For example, on Facebook, users’ posts can be set to Public, Friends, Friends Except Acquaintances, Custom, and other categories. On Twitter, accounts can be Public or Locked, requiring user consent before her or his tweets are visible to the applicant, and the way tweets are formed determines who sees or doesn’t see posts by default. These examples illustrate a spectrum of public to private, and that users’ intentions about the appropriateness and sensitivity of their data can be expressed in a variety of ways. The implication of this spectrum is that social media data must not be considered an undifferentiated mass of public data, ripe for use without the need to account for the privacy expectations of users. This issue is amplified when one considers that a) users can retract posts or make formerly public posts private, indicating a clear intent for the use of their data, and b) social media sites change their terms of service frequently, often with inadequate notice to users, muddying the already opaque view into users’ intentions.

5.  Anonymity – A key characteristic of social media is its personal nature. As such, it follows that social media data is highly identifiable. Given this, and the inevitable sensitivity involved in the collection, analysis, comparison, compilation and dissemination of personal data, anonymizing social media data is an essential step in government use of it.  The techniques used by the Census for many years would be a good place to start and adapt to this new environment.

6. Vulnerable populations – The inclusiveness of social media networks means that it is very easy to ingest the personal data of vulnerable populations in any research. These populations – including children, those fleeing domestic violence, marginalized social groups, and other at-risk individuals – must be treated with higher degrees of care and more stringent safety procedures. The danger of potentially sweeping up children’s data cannot be overstated. Given the above discussions of consent, anonymity and harm, government must understand that age verification on social media is flawed and ineffective. Existing methods for determining if a user is an adult or child are so weak as to cause any data set drawn from social media to be suspect of containing children’s data.

7. Social inequality – It is important to statistically consider the representativeness of social media populations. Government analysis of social media data must account for bias within such data so as to ensure that its research does not rely on and amplify social and economic disparities. The whole of the UK population does not use social media, and so social media analytics should not be used to exacerbate the socioeconomic issues of the less digital parts of the citizenry.

8. Academic research of this form is overseen by ethics review panels that operate under published guidelines, with reviews undertaken by independent experts. It is recommended that government engage privacy scholars, security experts and data scientists external to the stakeholder groups undertaking research, to review research methods and safeguards. Sir David Omand (Omand, 2010) has identified principles that should apply even when the analysis is in pursuit of state security: 1) There must be sufficient sustainable cause; 2) There must be integrity of motive; 3) The methods used must be proportionate and necessary; 4) There must be right authority validated by external oversight; 5) Recourse to secret intelligence must be a last resort if more open sources can be found.

What impact is the upcoming EU Data Protection Legislation likely to have on access to social media data for research?

9. The forthcoming EU General Data Protection Regulation (GDPR) tightens the legal language mandating consent, requiring it to be explicit: “Consent should be given explicitly … enabling a freely given specific and informed indication of the data subject's wishes … ensuring that individuals are aware that they give their consent to the processing of personal data…. Silence or inactivity should … not constitute consent.” (Preamble (26));  also: “Where processing is based on the data subject's consent, the controller should have the burden of proving that the data subject has given the consent to the processing operation.” (Preamble (32)); and, the GDPR requires the ability for data subjects to meaningfully withdraw their consent at any time (Art. 7(3)).

10. Right to be forgotten – This is, in essence, a person’s right to have data about her or him deleted upon request.  “Any person should have the right to have personal data concerning them rectified and a 'right to be forgotten'…. In particular, data subjects should have the right that their personal data are erased and no longer processed, where the data are no longer necessary in relation to the purposes for which the data are collected or otherwise processed, [and] where data subjects have withdrawn their consent for processing…” (Preamble (53)); this right also includes “the obligation of the controller which has made the personal data public to inform third parties on the data subject's request to erase any links to, or copy or replication of that personal data.” (Preamble (54)); the right has important restrictions, including that the “retention of the data should be allowed where it is necessary for historical, statistical and scientific research purposes, for reasons of public interest in the area of public health, for exercising the right of freedom of expression, when required by law or where there is a reason to restrict the processing of the data instead of erasing them.” (Preamble (53)). It is clear that the term “necessary” in the above quote will require a great deal of elaboration and negotiation, and that this ambiguity is particularly salient to government social media research designs.

11. The GDPR mandates that data controllers shall provide easy access to held data in service of portability: “The data subject shall have the right … to obtain from the controller a copy of data undergoing processing in an electronic and structured format which is commonly used and allows for further use by the data subject.” (Art. 18(1)).  In the case that government social media research contains identifiable records this provision will apply.

12. The GDPR establishes a stronger bias against the potential harm of profiling than previous data protection policy.  “Every natural person should have the right not to be subject to a measure which is based on profiling by means of automated processing. However, such measure should be allowed when expressly authorised by law, carried out in the course of entering or performance of a contract, or when the data subject has given his consent. In any case, such processing should be subject to suitable safeguards, including specific information of the data subject and the right to obtain human intervention and that such measure should not concern a child.” (Preamble (58)). This bias has particular salience for government plans to perform real-time (and thereby automated) analysis of social media data, especially in light of the previous discussion of the challenge of screening out children’s data on social networks.

13. Privacy by design – The GDPR requires that personal data processing adhere to ‘privacy by design’ principles (Art. 23). At the moment, this means a bias for only collecting minimally necessary amounts of data for a given collection purpose, and storing that data for the minimum necessary time. Further, the data controller “shall ensure that by default personal data are not made accessible to an indefinite number of individuals” (Art. 23(2)). The European Commission has reserved the right to amplify this provision via its delegates and to adopt technical measures to give the provision effect.

Is UK legislation surrounding the collection and use of data fit for purpose?

Privacy advocates have long held that UK legislation regarding personal data has been weak in the following areas (although all of these will need to be reviewed given impending changes due to GDPR):

14. Data breach notification – Currently there is no legal requirement to notify regulators, the public or data subjects in the event of a breach. The Information Commissioner “believes serious breaches should be brought to the attention of his Office,” and has published guidance on what “serious” means as well as the ICO’s possible reaction to a breach (ICO, 2012). Breach notification, explicit penalties and the potential ‘naming and shaming’ that might result are important features of a strong data protection regime.

15. Location privacy – Granular, large collections of location data are extremely revealing of one’s personal activities. A month of location data from someone’s mobile phone can answer the following questions: “Did you visit an STI clinic four times recently?”, “Did you visit a lawyer several times?”, “Were you at a political rally?”, “Did you leave the city at night for five days in a row?” (Blumberg and Eckersley, 2009). The privacy-invasive potential of location data is immense, yet there is no primacy given to it in the UK DPA. The DPA has a category for “sensitive personal data,” including racial, political, religious and sexual information, which triggers heightened protections for data collection and processing. Location data should be added to this list (Raper, 2010).

16. Consent revocation – The UK DPA contains no provisions for data subjects to revoke their consent to data processing. Consent is not meaningful without an ability to withdraw it.

17. Human Rights – Whatever UK DPA law says, it still needs to be interpreted within the context of the European Court of Human Rights, which may find existing and even future DPA law lacking, for example in policing, where processing needs to follow clear rules and be proportionate.


Article 29 Data Protection Working Party. (2013, May 13). Advice paper on essential elements of a definition and a provision on profiling within the EU General Data Protection Regulation. Retrieved from http://ec.europa.eu/justice/data-protection/article-29/documentation/other-document/files/2013/20130513_advice-paper-on-profiling_en.pdf

Barbaro, M. and Zeller, T. (2006, August 9). A Face Is Exposed for AOL Searcher No. 4417749. New York Times. Retrieved from http://www.nytimes.com/2006/08/09/technology/09aol.html

Blumberg, A. and Eckersley, P. (2009). On Locational Privacy, and How to Avoid Losing it forever. Electronic Frontier Foundation. Retrieved from https://www.eff.org/files/eff-locational-privacy.pdf

Buchanan, E. and Zimmer, M. (2012). Internet Research Ethics. Stanford Encyclopedia of Philosophy. Retrieved from http://plato.stanford.edu/entries/ethics-internet-research/

Information Commissioner’s Office [ICO]. (2012). Notification of data security breaches to the Information Commissioner’s Office. Retrieved from http://ico.org.uk/for_organisations/guidance_index/~/media/documents/library/Data_Protection/Practical_application/breach_reporting.ashx

Kadushin, C. (2005). Who benefits from network analysis: ethics of social network research. Social Networks, 27(2), 139-153.

Mahon, P. (2013). Internet Research and Ethics: Transformative Issues in Nursing Education Research. Journal of Professional Nursing, Article in Press.

Nissenbaum, H. (2010). Privacy in Context. Stanford: Stanford University Press.

Omand, Sir David. (2010). Securing the State.

Raper, J. (2010). Data privacy in geographic information. Association for Geographic Information. Retrieved from http://www.agi.org.uk/storage/foresight/policy/Data%20privacy%20in%20geographic%20information.pdf

Solove, D. (2006). A Taxonomy of Privacy. University of Pennsylvania Law Review, 154(3), 477-560.

Sweeney, L. (2000) Simple Demographics Often Identify People Uniquely (Data Privacy Working Paper 3). Retrieved from http://dataprivacylab.org/projects/identifiability/paper1.pdf

Zimmer, M. (2010). ‘‘But the data is already public’’: on the ethics of research in Facebook. Ethics and Information Technology, 12(4), 313-325.

Monday, 10 February 2014

A tale of two blogs

Know your audience:
Wife's blog - children's author
My blog, ramblings of an academic

That is all.