Friday, 6 December 2013

Open data - libre not gratis

It started with a tweet - mischievous Maurizio sending the following:

  The question I was *really* hoping to get at :-) - what's the cost of open data?

I don't think I even mentioned open data in the CloudCom 2013 keynote, but as I pointed out to the questioner - that wouldn't stop me having an opinion.

So easy one - "open data" should be "free as in speech".

This somewhat trite response means that open data should be free for reuse without restriction - more precisely captured by the word libre, loaned from the French.

Such freedom to reuse should not be confused with gratis, supplied at no cost or as the computing community have it, "free as in beer". Open data, like open source, may or may not be available without charge to the user, but it should always be available for reuse.

The Power of Information Task Force in the UK (and there have been several others worldwide with the same view) held that in releasing information, "data should be provided at marginal cost". According to Google:

marginal cost
  the cost added by producing one extra item of a product.

In the digital economy, marginal costs are often surprisingly small, but they are (by definition!) non-zero. Sometimes so few people want the data that the total cost of delivery is minuscule compared to the cost of collecting and curating it, and we decide not to charge. Also, since much government data carries a legal requirement to respond to FOI requests, you might be better off spending your money on opening and delivering the data than on doing data archaeology to answer each FOI. So, while the driving force of open data was transparency, and since there are only so many Grauniad reporters and armchair accountants (and they don't repeatedly access the information to write a story), not charging is probably an appropriate model.

8 armchair accountants downloading 1GB each ≈ $1 per month;
100,000 real users downloading 1GB each ≈ $12,000 per month
More recently though, open data has been touted as a driver of economic growth - at this point it is worth thinking a bit about what marginal costs then mean. In the digital economy, you can find yourself with a lot of users very quickly - one app sold onto smartphones could land you with tens of thousands of users within months, and suddenly the delivery bill becomes a consideration.
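The arithmetic behind those caption numbers is worth making explicit; a back-of-envelope sketch (the $0.12/GB egress price is purely illustrative, not any provider's actual tariff):

```python
# Back-of-envelope sketch of why delivery cost suddenly matters at scale.
# The $0.12/GB egress price is an assumed, illustrative figure.

EGRESS_PRICE_PER_GB = 0.12  # assumed cloud bandwidth price, $/GB

def monthly_delivery_cost(users, gb_per_user=1.0, price_per_gb=EGRESS_PRICE_PER_GB):
    """Marginal cost of delivering the data, ignoring fixed collection/curation costs."""
    return users * gb_per_user * price_per_gb

# A handful of armchair accountants is lost in the noise...
print(monthly_delivery_cost(8))        # ~ $1 per month
# ...but a successful smartphone app is a real bill.
print(monthly_delivery_cost(100_000))  # ~ $12,000 per month
```

The point is not the exact price per gigabyte but the linearity: delivery cost scales with users, while collection and curation costs do not.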

Certainly at this point a government department might decide not to build monstrously scalable computing systems to deliver all this data while trying to figure out where the money can come from (in a world of shrinking govt budgets...). One solution is to require reusers of the data to take one, constantly updated, copy of the data and run their own servers to scale up with the number of their customers; hence open data sources need only scale to the number of commercial reusers of their data rather than the total number of smartphone app users.

And it would seem only sensible that commercial reusers could be asked to cough up the marginal cost, since it would be a small cost to them and a sustainability plan for open data.

Anyway, Lilian threatened me with an FOI; so here, I've come clean. Have a nice weekend.

Friday, 15 November 2013

Images, metadata, orphans and copyright

Originally posted as a CREATe blog. CREATe is the RCUK funded Centre for Copyright and New Business Models in the Creative Economy.

Coincidentally, five days after the publication of the Copyright Licensing Steering Group's report on the last 12 months of work on streamlining copyright, I was due to give a talk at a joint event of CREATe and the EPSRC funded Network of Excellence in Identity. The event "Identity Lost – electronic identity, digital orphan works and copyright law reform", the talk "Digital tool chains; get your act together" - what joy to find the CLSG report, which lays down 10 key principles, formed the perfect frame for what I had planned to talk about! What should we do to avoid the ongoing creation of digital works that are orphans at birth?

Herewith the blogged version of the talk...

Metadata matters: encourage its use and preservation

I. By default create and preserve metadata

What is metadata in a digital image? Simply data attached to the digital image written by the camera when the image is created or afterwards as the image is processed by various tools. Nearly all cameras (including smartphones) capture information about the "camera" - lens, aperture, timing, resolution etc. - and the when and (especially with smartphones) the where. (More here.)

(C) Derek McAuley; not really sufficient...

Professionals are well served when it comes to information concerning copyright and licensing; high end cameras from the likes of Nikon and Canon have for years had the facility to add this metadata.
However, the vast majority of images created by digital cameras today carry no ownership information, so a first step would be to fix that in all consumer and pro-am cameras. It is especially galling that, with all the personal information held in my smartphone, the software doesn't get around to adding any useful identifying metadata to the images it creates.
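The check a camera (or an audit tool) would need is trivial; a minimal sketch, where "Artist" and "Copyright" are the standard TIFF/EXIF tag names (0x013B and 0x8298) and the example records are invented:

```python
# Sketch: flag images whose EXIF-style metadata lacks any ownership information.
# "Artist" and "Copyright" are real TIFF/EXIF tag names; the records are made up.

OWNERSHIP_TAGS = ("Artist", "Copyright")

def missing_ownership(exif: dict) -> list:
    """Return the ownership tags absent (or empty) in an EXIF-style dict."""
    return [t for t in OWNERSHIP_TAGS if not exif.get(t)]

# A typical consumer camera/smartphone output: plenty about the "camera",
# nothing about the owner.
camera_default = {"Model": "SmartPhone X", "FNumber": 1.7,
                  "DateTime": "2013:11:15 10:00:00"}
# What the high-end bodies have offered for years.
pro_camera = {"Model": "Nikon D4", "Artist": "Derek McAuley",
              "Copyright": "(C) Derek McAuley"}

print(missing_ownership(camera_default))  # ['Artist', 'Copyright']
print(missing_ownership(pro_camera))      # []
```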

Attach correct and meaningful metadata to your work so that others can find you or your agent

II. If you wish to claim credit you need to be contactable

Licensing metadata needs to include the means to make contact with the creator or their agent. The latter is an important point for a creator seeking some degree of anonymity, or separation of identities, while maintaining control of their work - a recent high-profile example being that of J. K. Rowling and her alter ego Robert Galbraith.

In this day and age a unique URI per creator identity seems like a good idea - a raft of these already exist, for example OpenID. It's not clear to me that anything specific is required beyond a service that enables contact to be made with the owner of an OpenID, and with certain providers it's no more than a blog post. In future the appropriate URI might be a link to content licensable on a digital copyright exchange (DCX) such as that proposed by the Copyright Hub.

Actually, besides who, it might be really useful to include possible default licenses, especially if it is something well understood like a Creative Commons license. It does raise the interesting question: if you assign a CC0 (no rights reserved) license, can you forgo the need for identification...
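That rule is easy to state in code; a sketch where the license URIs are the real Creative Commons ones but the metadata records and creator URI are invented for illustration:

```python
# Sketch of the rule above: a CC0 image waives the need for a contact route,
# anything else should carry a resolvable creator URI.
# License URIs are the real Creative Commons ones; the records are invented.

CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"

def needs_creator_contact(meta: dict) -> bool:
    """CC0 (no rights reserved) forgoes the need for identification."""
    return meta.get("license") != CC0

tagged = {"creator_uri": "https://openid.example.org/dmcauley",  # hypothetical URI
          "license": "https://creativecommons.org/licenses/by/4.0/"}
public_domain = {"license": CC0}

print(needs_creator_contact(tagged))         # True - follow the URI to license it
print(needs_creator_contact(public_domain))  # False
```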

Support technology that makes it easy for you to include metadata

III.  Use tools that by default maintain metadata

Jolly good idea. Name and shame those that do not.

IV. Embed a unique identifier in the metadata

As with the copyright identifier, URIs could stand us in good stead, e.g. 

Check before use: always look for metadata

V. Take reasonable steps to find metadata

At one level, if we have created metadata, used tools to maintain it, or at least have a unique identifier such as a URI embedded, finding the metadata in the digital age should be easy and operate instantaneously.

(C) Derek McAuley - really it is
However, what constitutes reasonable steps if the metadata is not attached to the image? A talk at the same workshop by the Scottish National Library put a "due diligence" search at 4 hours per image, often coming back with nothing. Some technology might help - PicScout provides a browser plug-in that identifies registered images in webpages, but it is not foolproof. The image on the right was tagged by PicScout as "rights managed" because thousands of tourists (myself included) and at least one photographer using Getty Images have photographed the same temple in Egypt from about the same spot, and it's usually sunny with a blue sky. It does mean it's probably not that valuable an image...
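PicScout's matching technology is proprietary, but a toy "average hash" shows the general idea behind content-based matching, and why two sunny-sky photographs of the same temple collide: similar images reduce to similar fingerprints. The 8x8 pixel grids below are invented stand-ins for real images.

```python
# Toy perceptual "average hash": not PicScout's algorithm, just an
# illustration of content-based image matching on invented 8x8 grids.

def average_hash(pixels):
    """64-bit hash of an 8x8 grayscale grid: bit set where pixel > mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a, b):
    """Number of differing bits; a small distance suggests a match."""
    return bin(a ^ b).count("1")

# Two near-identical "scenes" (blue sky over sandstone) and one unrelated one.
scene     = [[200] * 8] * 4 + [[90] * 8] * 4
scene_2   = [[198] * 8] * 4 + [[92] * 8] * 3 + [[95] * 8]
unrelated = [[10 * (r + c) % 255 for c in range(8)] for r in range(8)]

print(hamming(average_hash(scene), average_hash(scene_2)))    # 0 - a "match"
print(hamming(average_hash(scene), average_hash(unrelated)))  # 48 - no match
```

Which is exactly the failure mode described above: the hash cannot tell my holiday snap from a rights-managed Getty image of the same view.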

Do not ignore licensing metadata included with an image

VI.  If it is there use it!

Work within the law

VII. Obtain a license.

For some images, the metadata might include a quite permissive license (e.g. Creative Commons), for others you may need to contact the owner by following the URI trail, but the hope is that in future you find yourself at a DCX where a simple online transaction obtains you a license. One major challenge to automating this process is the degree to which many works are licensed for use in a specific context - for example an image for use in a company report would not cost the same as one used for the front cover of OK magazine...

If in doubt do not use

VIII. Absence of metadata is not a right to use.

There is a major implication in this for all those photo-sharing sites - if there is no licensing metadata should they permit images to be uploaded? At one level this seems extreme, but if in the future all cameras generated this metadata, most legitimate photo-sharing users would not even know the difference.

Do not break the chain: maintain the connection to the rights holder

IX. Don’t remove metadata

Indeed if you find damaged metadata, you might want to fix it...

If you must remove metadata from the original file, store it elsewhere

X. Maintain metadata associated with image somewhere

New windowing systems required...
The final point here feels to me like it should be transitional guidance; I'm racking my brains to figure out a technical reason this would be necessary and drawing a blank - yes, many pieces of software that do this by default need fixing, but why not give the industry, say, three years to get their collective acts together, then get back to the naming and shaming game.

There is one final evilness in the big scheme of creating new pictures - what I have done on the right here - the screen capture. This means of digital copying currently does not preserve metadata, as most windowing systems deal in pixels, having no concept that a particular arrangement of pixels may have a license associated with them. Hmm, no reason it couldn't - there's a paper in that...

In conclusion, there do seem to be a set of quite simple technical means by which we could bring a lot more tidiness to the management of digital imagery; perhaps some device, software and service vendors should seek market advantage by being first movers in getting their act together. And there's certainly a fruity one that does it all in a vertically integrated offering.

Friday, 1 November 2013

The data opportunity

The recently published government report entitled "Seizing the data opportunity" lays out a wide ranging programme of activities to take on the challenge of both turning UK companies into data driven organizations and leaders in new data driven services.

The Connected Digital Economy Catapult (CDEC) gets multiple billings, and as CIO there, I'm excited to be involved in pursuing various opportunities, for example: in e-Infrastructure developing novel tools and platforms to simplify access to what are often complex underpinning software architectures; and through our Trusted Data Accelerator, aiming to bring creative data processors together with rich datasets. My personal research for the last four years (drdrmc posts passim.) has been around personal data, so it is exciting to now be involved with CDEC in looking at what next for midata, and the midata innovation lab. More on these initiatives as they roll out...

Skills are very high on the agenda in the report, and I am reminded of two very interesting conversations with colleagues from across academia and industry: firstly at the Research Councils UK e-Infrastructure Users' Strategic Conference last week, and then earlier this week at the E-Infrastructure for SMEs Workshop. At the former, a great quote from a colleague in Arts and Humanities: "...we need computational thinking skills in all our undergraduates..." and, repeatedly, from the HPC centres: "...we have the equipment, we need the cross-disciplinary skills now..." (*).

The BCS and the Computing at School initiative are proactively working to drive "computational thinking" into the school curriculum - this is not solely about delivering more computer scientists, but about delivering on the vision laid out by Jeannette Wing and others at the CMU Center for Computational Thinking:
How many moves?

"Computational Thinking is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent."

For example, everyone should be able to solve the Tower of Hanoi problem; and it is more important that everyone understands the analysis of the problem - that a computational solution is possible (and how long it would take...) - than that they necessarily know how to program a solution.
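To make that concrete, a minimal sketch of both the analysis and the solver: the analysis tells you a solution exists and takes exactly 2**n - 1 moves before you write a line of code (the peg names are arbitrary).

```python
# Tower of Hanoi: the classic recursive decomposition.
# Transferring n discs takes exactly 2**n - 1 moves.

def hanoi(n, src="A", dst="C", via="B", moves=None):
    """Return the list of (from, to) moves transferring n discs src -> dst."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, via, dst, moves)   # park n-1 discs out of the way
        moves.append((src, dst))             # move the largest disc
        hanoi(n - 1, via, dst, src, moves)   # stack the n-1 discs back on top
    return moves

print(len(hanoi(3)))   # 7, i.e. 2**3 - 1
print(len(hanoi(10)))  # 1023 - and at one move a second, 64 discs outlasts us all
```

That exponential move count, known before any programming starts, is the computational thinking part.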

No doubt there will be masters and PhD programmes arising from this report, but my appeal is to move to a programme of undergraduate computational thinking courses, accessible to all disciplines, that skills up the workforce of the future en masse.

(*) Someone did suggest this become the 4th R - hmm - no R in computational thinking unless you're a pirate.

Saturday, 5 October 2013

Virtual Machines Power the Cloud

Virtual Machines 101 for Computerphile.

Well maybe the first 5 mins of welcome to virtual machines 101. 

Thursday, 26 September 2013

Thoughts from a recent workshop

What do they know about you? MyDex CIC
Just posted an article over on The Conversation on the sentiments at our recent Horizon workshop around personal data.

 "Many researchers are concerned that inadequate checks and balances are in place to make sure the data gathered through midata is not used in ways that we might not like or that threatens our privacy."

Nice find by the editor of the image!

Tuesday, 17 September 2013

Data where?

@gikii and #gikii2013 on twitter
Attending GikII gave me a great opportunity to talk to folks at the junction of law and technology who concern themselves with personal privacy and see a sea of tech washing over the population that causes them great concern. I'd been laying out my strategy for dealing with the current tech and our research plans in this space, and was encouraged to get it written down in an easy-to-read version - i.e. not our research publications! So here goes...

The desire to access information anywhere has been leading to an increasing centralization of services into the cloud so that one can have access to email, files, contacts, etc. from anywhere - I'll refer to this as "data in the cloud". Following closely on this have been a series of applications either built-in (MacOS Mail, Android Mail), free (Dropbox, SkyDrive) or purchased (Outlook) that synchronize contents between the cloud servers and mobile devices and computers in the background - "data sync with cloud". This is done so that when the user interacts with the application, both the application and its data are local, which makes it more responsive and able to operate even when disconnected. One logical and privacy enhancing conclusion to this trend is to arrange the devices to synchronize information directly with each other and forget about maintaining a copy in the cloud - "data on my devices".

Crazy file sharing icons
These "data on my devices" services are already emerging for file sharing - I currently run seven file-sharing applications, which fall into two distinct categories.

"Files in the cloud" services include Dropbox, SkyDrive, GoogleDrive, Memopal and SpiderOak. The first three all maintain an unencrypted copy of my files in the cloud, while the latter two assert they store encrypted copies - your level of PRISM-related paranoia will dictate whether you trust the encryption of the latter, but for the big three you need to trust the provider to maintain confidentiality. Hence I use these services for my research talks, publications and random other storage uses where the information is not private or personal; the consequences of a breach of confidentiality for this information are nothing more than a minor irritation - someone sees a work-in-progress paper or a half-baked presentation.

For private and personal information, including any data relating to other people, I use services that synchronize files across my devices without maintaining a copy in the cloud or ever seeing the contents; examples include BitTorrentSync and AeroFS. In these examples the cloud service merely provides the means for devices to find each other, and possibly provides an encrypted forwarding service if they cannot communicate directly (e.g. your iPad in the Internet cafe talking to your home computer behind your home router). 

The impending serious concern is the simultaneous arrival of the “Internet of Things” and “personal data stores” – the scope for dangerous privacy breaches if these services are all in the cloud is significant.  I have a simple take on this – don’t put the data in the cloud, synchronize it across your devices and run applications locally. I mean - it's not like we don't know how to do it and still make it easy for the user.

To this end we have been developing Nymote as a general solution for secure data synchronization across computing systems, one use of which is to securely store and share private information across my personal devices. Nymote is composed of three elements: Signpost, Irminsule and Mirage. Signpost provides the cloud service that allows your devices to find each other and establish secure communications paths. Irminsule provides the distributed data store, which moves beyond files to provide a robust database that allows simultaneous conflicting updates to data items on different devices with application hooks for their resolution - simply a more useful building block than files. Mirage is the underlying runtime environment that, in its most secure instantiation, runs within its own virtual machine.
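Irminsule's branch-and-merge model (its real API is OCaml and not shown here) can be illustrated with a minimal Python sketch of the key idea: when two devices update the same key while disconnected, an application-supplied hook resolves the conflict instead of one update silently winning.

```python
# Sketch of application-hook conflict resolution - the idea behind
# Irminsule's design, not its actual API. Three-way merge of dict snapshots.

def merge(base, mine, theirs, resolvers):
    """Merge two divergent snapshots against a common base.
    resolvers maps key -> function(a, b) used only on genuine conflicts."""
    merged = dict(base)
    for key in set(mine) | set(theirs):
        a, b = mine.get(key), theirs.get(key)
        if a == b or b == base.get(key):   # agree, or theirs unchanged
            merged[key] = a
        elif a == base.get(key):           # mine unchanged
            merged[key] = b
        else:                              # genuine conflict: ask the application
            merged[key] = resolvers[key](a, b)
    return merged

base   = {"shopping": ["milk"]}
phone  = {"shopping": ["milk", "eggs"]}    # edited on the train
laptop = {"shopping": ["milk", "coffee"]}  # edited at home, while disconnected

# Application hook: a shopping-list conflict is resolved by set union.
result = merge(base, phone, laptop,
               {"shopping": lambda a, b: sorted(set(a) | set(b))})
print(result)  # {'shopping': ['coffee', 'eggs', 'milk']}
```

Files give you none of this; a database with merge hooks is simply a more useful building block.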

That’s the tech side, aiming to build in “privacy by design”; it still needs the underpinning legislation for consumer protection in digital services rather than informed consent (blogs passim), especially if we are going to roll it out as a legal obligation to companies. So over to you JR...

Wednesday, 17 July 2013

Making Computers Count

Talking to Jonathan Mitchener of TSB (@jonathan_uk) today reminded me to post a link to the e-Infrastructure SIG site - mission:

The SIG promotes interaction and engagement from all points of the e-Infrastructure compass to facilitate the wider adoption of, and accessibility to, high performance and data-intensive computing within the UK. It will perform these duties by engaging relevant UK stakeholders both in the supply of, and demand for, e-Infrastructure, in order to facilitate economic growth, promote excellence in scientific and industrial research, and impact on societal challenges.

Wednesday, 26 June 2013

Cyber-security is a two-edged sword.

One of the great (not eligible for REF 2014) impact stories I have is the small specialist company producing a VOIP service specialised for "blue light" services that provides interoperability between all those different communications systems they use. This SME came to talk to us in Horizon and we rapidly got into how they deployed their service - the relevant "ah ha!" moment for the CEO was when we explained that Cloud is not about the technology but about what it enables: it translates what was previously capital expense into operational expense - Cloud needs to be understood by the CFO. In particular, this company was concerned about bidding for large contracts as they did not know how to access that much capital even if they won the contract.

I explained the Cloud was simple to experiment with - here's the URL, get your credit card out, and it'll take your IT man 15 minutes from a standing start to get a server up and running on which you can deploy. That was 15:00 on day 1; at 11:00 on day 2 I get the phone call: "Mac, we're up and running, thanks; this just changed our business". I love those days.

I think this is the story for many of us who have been working on Cloud technologies for a while, but recent work sponsored by Microsoft indicates:
 'more than half (52%) of the companies that do not currently use the cloud said that data security concerns were "an inhibitor to adoption".'
Similar concerns were raised about data visibility and compliance.

I leave it to my colleagues who are much more knowledgeable than me about privacy law and human rights to make that case, but recent events are not really going to encourage SMEs to entrust their critical business functions to the Cloud. In the week that the Home Office launched a cyber-security awareness programme, I fear PRISM has had more impact on concerns around cyber-security than the £4m the Home Office has put aside for it, and not in a good way.

I was finally compelled to write something based on the tweet from @jaggeree pointing me at the article flagging how using encryption would cause your traffic to be especially suspicious:
When encryption is encountered, however, the gloves can come off, with analysts being allowed to retain "communications that are enciphered or reasonably believed to contain secret meaning" for any period of time. 
Yup - commercial confidentiality is about maintaining secrets. Having worked for Intel (Grove-era motto: "Only the paranoid survive"), I was drilled in the importance of commercial confidentiality.

I've usually said of TOR that using it from your home is akin to standing in the town square shouting "I want to be anonymous", but if agencies are systematically retaining all VPN traffic as possibly "containing secret meaning", I can't see any of this helping the cause of migration of SMEs to the cloud.

What to do? Well, some smart service provider might try to address at least some concerns for UK SMEs by providing UK-based services, which at least mean you only need to worry about one agency looking at your data, and you might hope that agency has the economic wellbeing of the UK at heart. (Other countries are available - s/UK/yourcountry/.)

Cyber-security is a two-edged sword; bad guys can use the very technologies that are vital if we are to expand the digital economy. We need to work to achieve a sensible balance.

Thursday, 23 May 2013

CIO (In the house!)

Over the last year I've been involved in many of the events organised by the Technology Strategy Board to shape the creation of the Connected Digital Economy Catapult (CDEC), and have been keen to see how we can pass the results coming out of the Research Councils' Digital Economy programme downstream into the hands of innovators. Hence I've been delighted to be offered the role of Chief Innovation Officer (in residence) at CDEC to help shape it during the start-up phase. This entails a couple of days in London each week - so, calendar challenges - but I'm already having fun. Details as it happens...

Thursday, 9 May 2013

iPhoto challenges my knowledge of history

So in preparing the previous post I am just vastly amused by iPhoto on my Mac challenging me on this face tagging.

I'd have to say I would concur with this, or there is some monstrous archaeological conspiracy.

Mind you what does it mean for real name policies, for example if I tag this on FaceBook?

So you're informed?

Given the pick-up of a very recent paper at CHI on the complexity of 'terms and conditions' and issues of informed consent, I think Ewa can claim impact for this research output, since it has been picked up and applied to a bunch of betting websites - their results here.

The literatin plugin has featured here before simply because I'm using it every day now to inform myself of how badly wrong the "informed consent" model of digital services is. (Psst - we are not informed, so there is no consent.)

Due to circumstances beyond my control I found myself in Egypt recently and captured this image - I did think that, given the general literacy of the day, it managed both to convey the relevant detailed message to those who could read the hieroglyphs and a simple-to-understand message to those who couldn't.

Ts&Cs as graphics?

Monday, 4 February 2013

Artmaps technicals for a Tate audience

I had planned to put a fusion of a couple of the Artmaps blog posts from here onto the Tate Artmaps blog, but in the end wrote a much broader piece on "semi-structured blogging", so thought I would post a link to it...

Monday, 28 January 2013

midata - let's be clear, it needs to be my data

After the consultation last year on midata, to which I think several members of Horizon responded (well done team!), we saw the government response in November, and now the move to legislation in the House of Lords (search for 58C on that page - and could someone teach them about the name tag!).

I am a strong proponent of making data available to customers. Today I have online access to a lot of my data - bank, credit card, energy, but the data is presented as webpages designed for human consumption accessed via some arbitrary means of authentication. If I can download the data it requires me to do it interactively and it arrives in some random (often proprietary) format and / or with a private schema.

We have been here before; the open data movement initially struggled with human readable webpages and proprietary file formats, and continues to campaign to get the datasets released in agreed and open standard forms that can be downloaded and processed by software on our behalf.

If midata legislation can be used to drive towards standardisation in the industries that already allow us access to our personal data to enable a new software market, for example in personal finance management and energy planning, then great. If the legislation can be applied more generally to all my personal transactional data, I can envisage applications offering dietary advice based on my shopping habits, and through mobile location aware apps, fusions of data offering personalised advice related to off the shelf medications, and household heating control using historical occupancy data and current household members whereabouts; etc.

The creativity potential here is vast and can be implemented using software running on my computer / tablet / mobile that can process my data in private; this approach has been at the heart of our research in the dataware project for the last three years.

However, I am deeply concerned by the idea that third parties are going to step in and start offering these applications as online services based on the assumption we will hand over this base raw personal data to them. If you thought the furore about smart meters was justified, this more general approach stands to be the privacy violation of the century.

There are times when society should protect us all and not resort to caveat emptor - consumer protection is enshrined in many countries' laws, for example in the UK we have the Sale of Goods Act 1979 - and at some point we must adopt the same approach to these online services.

A fool and their data are easily parted.

Literatin in use - 18.6 years of education
needed to understand this privacy policy
At the root of the issue is the legal basis on which many companies process this personal data - by clicking on "I agree to the terms and conditions" you are presumed to have given informed consent. Really? Are the vast majority of consumers actually engaging in informed consent at this point?

Using our Literatin [1] tool, I find a well-known social media site where the privacy policy is "...suitable only for a graduate-level audience". I would suggest that many in society simply have not been informed, even if they did read it!

Furthermore, it is well documented [2] that many people think the very presence of an (often not very prominent) privacy policy means their data is kept private, rather than it being a liability disclaimer by the company collecting, mining and sharing it. So, a societal-level misunderstanding; possibly (and the basis of some of our ongoing research!) founded on a presumption that civilised societies have consumer protection legislation...

Babies and their bath water.

Does this line of thinking foreclose cloud services generally (the current darling of entrepreneurs and venture capitalists alike), and the comparison websites (e.g. uswitch, confused, moneysupermarket, those damn meerkats) which are much loved by those who believe in the informed consumer?

No - it just requires a bit more thinking about the data transaction than the "give 'em everything" philosophy. Let's just think through a couple of examples:
Energy data - personal yes, but if you 
don't know who I am, is it private? 
Energy switching - even in a complex market with time-of-day tariffs, where a service provider might need my raw energy data at one-minute intervals to compute the optimal tariff [3], they then only need the first part of my (UK) postcode to understand which suppliers might be relevant - amongst the many things they do not need to do the job are my name, address, phone number, email address, age, or shoe size.
Car insurance - this actually came up during the consultation meeting on midata at BIS in London, and a chap from a comparison website agreed that they didn't actually need any identifiable information - yes postcode, age, driving experience, employment, car type, etc., but again no need for a name or specific address. Indeed, even when they deliver you as a customer to a specific insurer (who will require your name!), they can earn their commission without ever knowing who you are...
That some of these folks want much more data is not about the service they supply to you, but that they operate a multi-sided platform [4] and deliver you and your data to advertisers and service suppliers; oh and perhaps so they can send you a cuddly toy - but to tell you the truth I'll pass on possessing a clan of Suricata suricatta if I can have my privacy please.
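The energy-switching example above can be sketched in a few lines: the comparison service needs a usage profile and an outward postcode, nothing identifying. The tariffs, rates and readings below are invented illustrative numbers, not real supplier data.

```python
# Sketch of privacy-preserving energy switching: pick the cheapest supplier
# from a usage profile plus outward postcode only. All figures are invented.

def day_cost(usage_kwh, tariff):
    """Cost of a half-hourly usage profile (48 slots) under a tariff."""
    return sum(kwh * tariff["rate"](slot) for slot, kwh in enumerate(usage_kwh))

def best_supplier(usage_kwh, outward_postcode, tariffs):
    """Cheapest supplier serving this postcode area - no name or address needed."""
    offers = {name: day_cost(usage_kwh, t) for name, t in tariffs.items()
              if outward_postcode in t["areas"]}
    return min(offers, key=offers.get)

# One day of half-hour readings; heavy use 16:00-20:00 (slots 32-39).
usage = [0.2] * 32 + [1.5] * 8 + [0.3] * 8
tariffs = {
    "FlatCo":  {"areas": {"NG7", "NG8"}, "rate": lambda slot: 0.15},
    "PeakyCo": {"areas": {"NG7"},
                "rate": lambda slot: 0.25 if 32 <= slot < 40 else 0.08},
}
print(best_supplier(usage, "NG7", tariffs))  # FlatCo - peak-heavy use suits a flat rate
```

Nothing in the computation touches who I am; the postcode prefix only filters which suppliers operate in my area.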

[1] Luger, E., Moran, S., and Rodden, T. Consent for All: Revealing the Hidden Complexity of Terms and Conditions. to appear in SIGCHI Conference on Human Factors in Computing Systems, 2013.
[2] Joseph Turow, Chris Jay Hoofnagle, Deirdre K. Mulligan, Nathaniel Good, and Jens Grossklags, The Federal Trade Commission and Consumer Privacy in the Coming Decade, I/S: Journal of Law and Policy for the Information Society, 3(3), 2007 (link).
[3] Given recent changes in UK energy billing the chances we pursue such complex tariffs seems remote.
[4] Ng, Irene C.L., Value & Worth: Creating New Markets in the Digital Economy, Cambridge: Innovorsa Press, 2013.

Monday, 21 January 2013

Wardriving in Aspley

First plots of our wardriving exploits in Aspley, Nottingham, from the PAWS project.

This map shows the density of WiFi access points in one of the urban areas of the UK with low penetration of fixed-line broadband access. Our work within PAWS is looking at how all citizens can be enabled to access key public services via the internet, using the unused capacity within this vast number of visible WiFi access points. It is worth flagging that this is not the same as a free internet service...


Many access points are already enabled with the BT FON service, but it is only available to BT subscribers - our mission is to bring this sort of widespread service to all, to provide easy access to, amongst others, eGovernment services.

This is in part technological - how can we ensure those getting free access do not impact the subscribers, and how good is the coverage in reality? However, the key challenges are in the economics and attitudes towards such a service. It is widely viewed that online access to public services can reduce the cost of delivery and increase economic activity; so, from the sociological perspective, will citizens view this as a public good and consider that enabling the community should be up there alongside public libraries?

For the technologically minded - the data was captured using Wigle and the plot includes 1064 separate access points, the data having been scrubbed of duplicate MAC addresses, only including those base-stations broadcasting their SSIDs and the judicious use of Occam's razor for some spurious readings. Some folks seem to enjoy providing details of their set up in their SSIDs, so we've truncated the names to four letters which at least indicates who the service provider is for home routers with default configurations.
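The scrubbing steps described above are straightforward to express; a minimal sketch (the MAC addresses and SSIDs are invented, and the real pipeline also handled spurious readings, which is omitted here):

```python
# Sketch of the wardriving data scrub: drop duplicate MAC addresses, drop
# hidden SSIDs, truncate names to four letters (enough to spot the ISP on
# default-configured home routers). All readings below are invented.

def scrub(readings):
    """readings: list of (mac, ssid) tuples from the capture."""
    seen, out = set(), []
    for mac, ssid in readings:
        if mac in seen or not ssid:      # duplicate MAC, or SSID not broadcast
            continue
        seen.add(mac)
        out.append((mac, ssid[:4]))      # truncate: provider hint, no setup details
    return out

raw = [("aa:bb:cc:00:00:01", "BTHub3-XYZW"),
       ("aa:bb:cc:00:00:01", "BTHub3-XYZW"),  # same AP heard twice on the drive
       ("aa:bb:cc:00:00:02", ""),             # SSID broadcast disabled
       ("aa:bb:cc:00:00:03", "SKY12345")]
print(scrub(raw))  # [('aa:bb:cc:00:00:01', 'BTHu'), ('aa:bb:cc:00:00:03', 'SKY1')]
```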