Tuesday, 12 December 2017

The Anonymisation Problem

He came for one video and got two! After filming Net Neutrality, Sean mentioned that someone on @computer_phile had been raked over the coals for saying things could be done anonymously. So what is the anonymisation problem?


Given the top YouTube comment and with apologies to @PrivateEyeNews, I think there is a suggestion:

At least one of them is pretending to be an academic. But who? We should be told.

Monday, 4 December 2017

A few thoughts on Net Neutrality

Sean came visiting again, so time for a @computerphile video; he wanted to know about Net Neutrality.

Not sure I gave him much other than "it's complicated"! The global philosophical debate about it plays out differently in each country, depending on local legislation, regulation, the lobbying power of the various factions and the time-varying whim of national governments...


Tuesday, 28 November 2017

Why Databox?

Databox results from many years of research into personal data and their ecosystems. This short note lays out the primary motivations and the thinking behind Databox without delving into the technical detail. As background, I recommend watching the “What is Databox” video on YouTube to obtain a high-level view of the Databox approach. Fundamentally, the forces that motivate Databox arise from the EU General Data Protection Regulation, the advent of the Internet of Things, and the need to balance consumer concerns such as privacy and accountability with commercial desire to exploit new opportunities provided by the widespread generation, collection and analysis of data.

The EU General Data Protection Regulation (see Wired GDPR article for a recent review) has been on the agenda since Horizon Digital Economy Research began over seven years ago. While we have seen amendments to details, the structure has remained robust for a great many years, which is unsurprising given the GDPR enacts elements of the European Convention on Human Rights with regards to personal information processing. Simultaneously we have seen the growth of movements seeking access to data in machine readable forms for reuse for new applications, where such data was previously either jealously guarded or simply viewed as not worth the effort. This equally applies to the open data movement seeking to open government data (whether for transparency or new business opportunities) and personal data (where access by the public is viewed as empowering them as consumers). In the UK, the latter has been the subject of the midata campaign, most recently delivering the Open Banking APIs. It is only natural that these converge in the “right to portability” enshrined in GDPR.

One simple purpose of online data access is to promote switching between providers. In the current cloud computing era, the prevalent technical implementation is a cloud-hosted web service through which a consumer shares their data obtained from one provider, either directly with a competitor, or more usually with a comparison site. However, since each consumer’s data is essentially processed independently for these purposes, each consumer could perform the calculation themselves if provided with easy-to-use software on commonly used platforms such as PCs and smartphones – and, in the future, Databox.
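To make this concrete, here is a minimal sketch of such a locally run comparison – the tariffs, rates and usage figures are entirely hypothetical, but the point is that nothing here needs to leave the consumer's device:

```python
# Hypothetical illustration: comparing energy tariffs locally on the
# consumer's own device, using their own usage data - no cloud upload.

monthly_usage_kwh = [350, 320, 300, 250, 200, 180,
                     170, 180, 220, 280, 310, 340]  # one year of readings

# Hypothetical tariffs: (daily standing charge in pence, pence per kWh)
tariffs = {
    "ProviderA": (25.0, 14.5),
    "ProviderB": (20.0, 15.8),
    "ProviderC": (30.0, 13.9),
}

def annual_cost_pounds(standing_pence, unit_pence, usage):
    """Annual cost = 365 days of standing charge + unit rate * total kWh."""
    return (365 * standing_pence + unit_pence * sum(usage)) / 100

costs = {name: annual_cost_pounds(s, u, monthly_usage_kwh)
         for name, (s, u) in tariffs.items()}
best = min(costs, key=costs.get)
print(f"Cheapest tariff: {best} at £{costs[best]:.2f}/year")
```

The same computation a comparison site performs in the cloud, done at home – no data shared, no data controller involved.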

This example brings to the fore a core and increasing concern for consumers: privacy. In sharing our data, the rights obtained by the service provider – the data controller, in Data Protection terms – are often very general and permit reuse (targeted advertising being merely the tip of the iceberg, one which many consumers have come to accept) and resale. We must be cognisant that such reuse and resale may in fact be a business requirement, as the web service itself may not be profitable in its own right – although given the cost of data protection compliance, a provider of software rather than a service might become profitable simply by removing those compliance costs. This also raises an interesting question with regard to the GDPR “Privacy by Design” (PbD) requirement - see Art. 25 GDPR Data protection by design and by default (*): if a function can be provided without sharing data (i.e. data is kept strictly private rather than confidentially shared), should sharing approaches simply be illegal? Databox addresses this by providing an environment in which consumers can execute software supplied by providers on their data, in private.

The issue of resale of data also highlights the need for accountability. Ordinary citizens should be able to see how their data is used and what consequences arise. This right is embedded within the GDPR but, as with all matters of accountability, without the ability of third parties to audit the systems, the valid concerns raised by lobbying groups and a steady stream of breaches continue to undermine public confidence. Databox addresses this through three interlinked measures: (i) the requirement for a machine-readable manifest that stipulates the data to be accessed, the processing to be performed, and any data to be exported; (ii) conversion of the manifest to a service level agreement through direct interaction with the consumer, which Databox subsequently enforces as the processing is carried out; and (iii) mandatory logging by the Databox platform of all application activities, including data access and export, for inspection by concerned consumers, perhaps on receipt of a notification.
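A toy sketch of the three measures – manifest, consumer-approved SLA, and enforced logging – might look like the following. The field names are purely illustrative, not the real Databox manifest schema:

```python
# Toy sketch of: (i) a machine-readable manifest, (ii) its approval into
# an SLA, (iii) enforcement plus mandatory logging of every data access.
# Field names are illustrative only, not the actual Databox format.

manifest = {
    "app": "activity-insights",
    "datasources": ["smart-meter", "fitness-tracker"],
    "processing": "weekly aggregate statistics",
    "export": [],   # nothing leaves the box
}

audit_log = []

def consumer_approves(m):
    # Stand-in for the direct interaction with the consumer; here we
    # auto-approve only apps that export no data.
    return m["export"] == []

def access(source, sla):
    """Enforce the SLA on every data access, and log the outcome."""
    if source not in sla["datasources"]:
        audit_log.append(("DENIED", source))
        raise PermissionError(f"{source} not covered by SLA")
    audit_log.append(("READ", source))
    return f"<data from {source}>"

sla = dict(manifest) if consumer_approves(manifest) else None
data = access("smart-meter", sla)
try:
    access("microphone", sla)   # not in the manifest: refused and logged
except PermissionError:
    pass
print(audit_log)
```

Both the permitted read and the refused attempt end up in the log, which is the point: the consumer (or an auditor acting for them) can inspect exactly what an application did.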

So far, the discussion has centred on existing applications, but a key desire of Databox is to open new opportunities for applications that perform data fusion across multiple consumer data sources. Data protection is but one of a series of regulatory requirements that companies must adhere to. For example, in the UK financial processing must be performed in accordance with rules issued by the Financial Conduct Authority (FCA), while much medical data falls within the requirements of the NHS Information Governance framework. A service aiming to process “all” a consumer’s data via data sharing will find itself in a complex, possibly even conflicting, compliance pickle. In Databox, a company with a great idea, e.g., to fuse consumer purchasing, medical and activity information, simply supplies the software to the consumer for them to execute, avoiding on-going compliance overheads.

The general class of solutions to these issues is often referred to as Personal Information Management (PIM) platforms. Most have proposed cloud-based solutions, including personal virtual machines to host PIMs. Even though the current Databox implementation is based on Docker containers and could run on any platform, Databox eschews the cloud approach. Under GDPR, the clear separation between “data controller” and “data processor” is no longer present, and so cloud providers may well find themselves in difficult compliance situations where they may previously have viewed themselves merely as data processors. Even worse, in the UK they will likely be viewed as Communications Service Providers for the purposes of the IP and DE bills, leading to further significant compliance costs.

An emerging class of data is that of the Internet of Things (IoT). For Databox the primary context for considering IoT is the domestic environment, encompassing equipment installed in the home and the ubiquitous personal mobile devices such as smartphones and tablets that form the control and management interface to such infrastructures. In this domestic market, the ecosystem is still emergent – high profile security failures and data breaches have rightly raised concerns about building what will become national infrastructure in a completely ad hoc manner. The role of Databox in the IoT context includes all the foregoing benefits (privacy, accountability and new opportunities), but also adds resilience. Reliance on continuous Internet connectivity and service availability for basic in-home operation of heating, lighting and security is not tenable. Even more so when considering that in-home IoT is advocated as enabling the elderly to continue to live independent lives in their own homes. This requires the in-home infrastructure to be able to supply functionality for significant periods of time without reliance on cloud infrastructure – we propose that the integration point in the home is the Databox.

Finally, with regards to domestic IoT, Databox provides scalability. Many in-home sensors can produce substantial data rates that need local processing to ensure they do not consume the home owner’s upstream broadband service – consider as examples face recognition used as part of a home intrusion detection, or “condition monitoring” of home appliances using high frequency sampling of the electricity supply. In-home processing in a Databox avoids the need for high value servers in data centres, along with associated bandwidth and storage costs, by pushing the processing to the edge. This approach also removes the central services that hold significant personal data and are control points for possibly millions of devices – such services are honey pots for hackers, whether they are simply mischievous, engaged in direct action campaigning on some topic, or with criminal intent.
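As a sketch of the "condition monitoring" example, consider reducing high-frequency electricity samples to compact per-window summaries at the edge – the sampling rate and summary statistics here are assumptions for illustration:

```python
# Sketch of in-home edge processing: high-frequency electricity samples
# are collapsed locally to small per-second summaries, so only a tiny
# fraction of the raw data would ever need the upstream broadband link.
import statistics

SAMPLE_HZ = 1000        # hypothetical 1 kHz sampling of the supply
WINDOW = SAMPLE_HZ      # summarise one second at a time

def summarise(samples):
    """Collapse one window of raw samples to (mean, peak, stdev)."""
    return (statistics.fmean(samples), max(samples),
            statistics.pstdev(samples))

# Simulated raw data: a steady load with a brief appliance spike.
raw = [230.0] * (3 * WINDOW)
for i in range(1500, 1600):
    raw[i] = 260.0

summaries = [summarise(raw[i:i + WINDOW])
             for i in range(0, len(raw), WINDOW)]

# 3000 raw samples reduced to 3 summary tuples - a 1000x reduction in
# what would cross the home's upstream link, if anything needs to at all.
print(len(raw), "->", len(summaries), "summaries")
```

The spike still shows up clearly in the second window's peak value, so the useful signal survives even though the raw stream never leaves the home.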

So, is it necessary that the Databox be a separate computing appliance in the home? As noted, the current Databox implementation is based on Docker containers, so in the home the software could run on a PC, home router, set top box, fridge, etc. Indeed, running Databox as a peer-to-peer application on several different devices in the home network would improve resilience and enable defence against compromise of individual systems. However, given the envisaged usage of Databox, not only for IoT but as the location of secure private computation, we do envisage that at least one Databox instance will be available in the home at all times – and a specialist appliance seems likely to be the most secure, cost effective and energy efficient solution.

w: https://databoxproject.uk
e: info@databoxproject.uk
t: @databoxproject

(*) You happy @tnhh?

Wednesday, 2 August 2017

On things end-to-end

The latest from the UK Home Secretary on "end-to-end encryption" and the responses make me feel the need to explain some things cryptographic and historical.

Modern secure messaging apps and services provide several functions:

  1. I can find people - they provide a directory, where often the primary identity is simply the phone number;
  2. They provide a private and easy way to exchange encryption keys;
  3. They utilise state of the art, and publicly documented encryption protocols between the end points;
  4. And, since we lost end-to-end networking for most users, they implement a network forwarding service.

My observation of the popular use of the phrase "end-to-end encryption" is that folks often mean both functions 2 & 3 - that is, both private key exchange and use of state-of-the-art encryption between the end points. Hence, as a technical pedant, I find myself peeved by much discussion on this topic which confounds and confuses these two functions, so I have felt compelled to write this post!

The directory service is technically the most boring but actually a very useful element of such a service; one route to finding friends is simply that the smartphone app reads your contacts and makes connections by looking for matching phone numbers already using the service. Compared to suggestions that we build a web of trust, it's a heck of a lot simpler to use!

Many of these apps (including WhatsApp) implement a very cunning state-of-the-art key exchange algorithm developed by Open Whisper Systems and widely available in their open source [1] Signal app. This allows two parties who wish to communicate to share a key with which to encrypt their messages without the service provider knowing the key. As noted in the excellent article by Kieren McCarthy:
...companies like Facebook, Google, Apple and so on could redesign their systems to make it possible to decrypt them. They could even avoid the problem of a simple backdoor by using constantly changing encryption keys – so long as they keep a copy of those keys.
The desired point of intervention is the key exchange protocol: it would be straightforward to arrange that the keys and messages are kept only for those targeted for surveillance. Doing this on a per-service and per-target basis is not the end of the Internet, of secure banking, of eCommerce, etc. However, it is a backdoor - bad'uns will try to attack it, of course, and it is a risk. That said, all the public key and certificate infrastructure that underpins session key exchange for https, and hence the majority of the Internet services we use, relies on keeping the secret half of the public/private key pair secure - that's an even juicier target for bad'uns - grab that and they can subvert the whole service, not just a single conversation... So we do need some perspective on the hacking risk.
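To see why the key exchange, not the cipher, is the natural point of intervention, here is a toy classic Diffie-Hellman exchange. To be clear: Signal actually uses a far more sophisticated protocol (X3DH plus the double ratchet) over elliptic curves, and the tiny numbers here are purely for demonstration - but the essential property is the same:

```python
# Toy Diffie-Hellman exchange (NOT Signal's real protocol, and with
# tiny demonstration numbers - real deployments use elliptic curves).
# The point: the forwarding service sees only p, g, A and B, yet both
# ends derive the same secret key.

p, g = 23, 5            # public parameters, visible to the service

a = 6                   # Alice's private value, never transmitted
b = 15                  # Bob's private value, never transmitted

A = pow(g, a, p)        # Alice sends A via the service
B = pow(g, b, p)        # Bob sends B via the service

alice_key = pow(B, a, p)   # what Alice computes
bob_key = pow(A, b, p)     # what Bob computes

assert alice_key == bob_key   # shared secret, unknown to the middleman
print("shared key:", alice_key)
```

A key escrow scheme intervenes exactly here: if the service substitutes or records key material during this step for a targeted user, the encryption algorithm itself never needs to be weakened.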

Importantly, the point of intervention is not the encryption algorithm itself - here again I find myself peeved when I hear the statement "it is against the laws of mathematics". Not really - we could simply be relying on our mathematical ignorance. Currently we think the cryptanalysis of our modern crypto algorithms is too hard (that is, computationally hard and hence expensive) - but we haven't proved it mathematically; maybe we just haven't figured it out yet [2] - indeed, therein lies the history of cryptanalysis! We didn't know about differential cryptanalysis for years...

It is worth noting that where there is a need for encryption between two parties but with recoverable key exchange, protocols have been designed specifically to provide this - for example, MIKEY-SAKKE was designed to allow an organisation to deploy end-to-end encryption while also allowing that organisation to acquire the keys and decrypt the messages, specifically where there is a regulatory requirement to do so. Examples cited include emergency services or financial services - or maybe we should just trust the bankers not to collude on price fixing; that has always worked well.

However, the key point of Kieren's article that resonates is that governments have simply shown themselves to be untrustworthy, and on that, there is no going back. To be clear though, as a professional paranoid, I don't understand why we trust the companies either - WhatsApp could be saying they do "end-to-end encryption", but since their code is not open to independent review, how would we know? If you are serious about your privacy check out the EFF Secure Messaging Scorecard - me, I trust Signal.

So far, so current affairs; but I think we should analyse how we got here and where we might go. The root of this evil is in fact that we lost end-to-end networking.

1981 was a key year for the Internet; growing from the experience of years of research on Arpanet, it saw the release of RFC 791, which defined IPv4, the Internet Protocol we still live with today - and, for many, the only supported network level protocol [3]. The philosophy was encapsulated in "the End-to-End Argument", as expounded in the seminal paper by Saltzer, Reed and Clark [4]. Importantly, underpinning the idea was that computers could all speak to each other if they were connected to the Internet; indeed, in defining the IPv4 address:
"Internet addresses distinguish sources and destinations to the host level and provide a protocol field as well.  It is assumed that each protocol will provide for whatever multiplexing is necessary within a host." 
However, we lost the plot on this. In general my 4G/wifi roaming smartphone "on the Internet" cannot speak to my home computer "on the Internet" directly - we built a series of network level mechanisms in the home, in broadband infrastructure and in mobile networks that broke end-to-end network connectivity. Today we require most interaction between two users to be via "over-the-top" service providers, so nearly everything is now mediated by the behemoths of our era - Google, Facebook, Twitter, ... Furthermore, it is in the commercial interests of these companies to continue to mediate communication, and that applies to WhatsApp too. And it's getting worse - most IoT kit does the same.

This then leads to the current regulatory situation - if you provide a service, an app, and forward most of the messages, you are going to be seen as a plausible target for regulation, and in particular a single point at which someone might require key escrow on demand.

However, the day will come when someone tweaks and releases the open source Signal app to not use the Signal service, rather a combination of direct SMS messages and an end-to-end network layer like IPv6 [5], and all the regulating of these service providers will have been an exercise in futility.

Anyway, whether you believe that or not, can you at least be clear, when talking about “end-to-end encryption”, to separate the issues of key handling and actual encryption?

---- 8 ----

[1] As is repeatedly pointed out, the cat is out of the bag - the technology is freely available as open source for anyone to use...
[2] I promise not to create any more theorems, so let me put this stake in the ground - I have a proof concerning factoring large numbers but it is too large to fit in the margin.
[3] Recent reports indicate IPv6 is over 15% of all Internet traffic, but it has yet to achieve wide-scale deployment at the edge of the network, and hence be accessible to most end users.
[4] Saltzer, J. H., D. P. Reed, and D. D. Clark (1981) "End-to-End Arguments in System Design". In: Proceedings of the Second International Conference on Distributed Computing Systems. Paris, France. April 8–10, 1981. IEEE Computer Society, pp. 509-512.
[5] Possible 3rd year undergraduate project I think.

Thursday, 27 April 2017

Algorithms in decision-making

1.       UnBias[1] is a research project funded under the Digital Economy theme’s Trust, Identity, Privacy and Security programme (EPSRC grant EP/N02785X/1). The project brings together researchers from the universities of Nottingham, Oxford and Edinburgh to study the user experience of algorithm driven internet services and the process of algorithm design with special attention to the experience of young people (13 to 17 years old) and issues related to non-operationally justified bias. UnBias aims to provide policy recommendations, ethical guidelines and a ‘fairness toolkit’ co-produced with young people and other stakeholders. The toolkit will include educational materials and resources to support youth understanding about online environments as well as raise awareness among online providers about the concerns and rights of young internet users. The draft report[2] summarizing the outcomes of a set of case study discussions with stakeholders from academia, teachers, NGOs and SMEs has just been finalised.

2.       Professor Derek McAuley, Dr Ansgar Koene and Dr Elvira Perez Vallejos are part of Horizon Digital Economy Research[3] which is a Research Institute at The University of Nottingham and a Research Hub within the RCUK Digital Economy programme[4]. Horizon brings together researchers from a broad range of disciplines to investigate the opportunities and challenges arising from the increased use of digital technology in our everyday lives. Prof McAuley is Director of Horizon and principal investigator on the UnBias project. Dr Koene and Dr Perez Vallejos are Senior Research Fellows at Horizon and co-investigators on the UnBias[5] project. Dr Koene chairs the IEEE working group for the development of a Standards on Algorithm Bias Considerations[6].

3.       Professor Marina Jirotka, Dr Menisha Patel, and Dr Helena Webb are part of the Human Centred Computing (HCC) group[7] at the Department of Computer Science, University of Oxford. This is an interdisciplinary research group that seeks to increase understanding of how innovation impacts society and advance opportunities for new technologies to be developed in ways that are more responsive to societal acceptability and desirability. Prof Jirotka, and Dr Webb are co-investigators on the UnBias project.

1. The extent of current and future use of algorithms in decision-making in Government and public bodies, businesses and others, and the corresponding risks and opportunities.

4.       As part of the UnBias project we have been reviewing case studies of controversies over potential bias in algorithmic practice and scoping the informed opinion of stakeholders in this area (academics, educators, entrepreneurs, staff at platforms, NGOs, and staff at regulatory bodies etc.). It is apparent that the ever-increasing use of algorithms to support decision-making, whilst providing opportunities for efficiency in practice, carries a great deal of risk relating to unfair or discriminatory outcomes. When considering the role of algorithms in decision making we need to think not only of cases where an algorithm is the complete and final arbiter of a decision process, but also the many cases where algorithms play a key role in shaping a decision process, even when the final decision is made by humans; this may be illustrated by the now [in]famous example of the sentencing support algorithm used in some US courts which was shown to be biased[8]. Given the ubiquitous nature of computer based processing of data, almost all services, be they government, public, business or otherwise, are in some way affected by algorithmic decision-making. As the complexity of these algorithmic practices increases, so do the inherent risks of bias as there are a greater number of stages in the process where errors can occur and accumulate. These problems are in turn exacerbated by the absence of oversight and effective regulation.

5.       The recent research work that we have conducted with young people has highlighted important concerns around algorithm use and trust issues. Results from a series of 'Youth Juries'[9] show that many young people experience a lack of trust toward the digital world and are demanding a broader curriculum beyond the current provision of e-safety to help them understand algorithmic practices, and to increase their digital agency and confidence. Current use of algorithms in decision-making (e.g., by job recruitment agencies) appears surprising to many young people, especially those unaware of such practices. Algorithms are perceived by most young people as a necessary mechanism to filter, rank or select large amounts of data, but their opacity and lack of accessibility or transparency is viewed with suspicion and undermines trust in the system. The Youth Juries also facilitated young people to deliberate together about what they require to regain this trust – the request is for a comprehensive digital education as well as for choices online to be meaningful and transparent.

2. Whether 'good practice' in algorithmic decision-making can be identified and spread, including in terms of:

2a. The scope for algorithmic decision-making to eliminate, introduce or amplify biases or discrimination, and how any such bias can be detected and overcome?

6.       When discussing bias in algorithmic decision-making it is important to start with a clear distinction between operationally-justified and non-operationally-justified bias. Operationally-justified bias prioritises certain items/people as part of performing the desired task of the algorithm, e.g. identifying frail individuals when assigning medical prioritisation. Non-operationally-justified bias, by contrast, is not integral to being able to do the task; it is often unintended and its presence is unknown unless explicitly looked for.
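One simple way such unintended bias can be detected, if explicitly looked for, is to compare selection rates across groups in a decision log. The sketch below uses hypothetical data and the "four-fifths rule" threshold common in disparate-impact analysis; a low ratio does not prove wrongdoing, but it flags a disparity whose operational justification should then be examined:

```python
# Illustrative bias check: compare selection rates across two groups in
# a (hypothetical) decision log, using the "four-fifths rule" threshold
# common in disparate-impact analysis.

decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_a", True), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False),
    ("group_b", False), ("group_b", False),
]

def selection_rate(group):
    outcomes = [ok for g, ok in decisions if g == group]
    return sum(outcomes) / len(outcomes)

rate_a = selection_rate("group_a")   # 4/5 = 0.8
rate_b = selection_rate("group_b")   # 1/5 = 0.2
ratio = rate_b / rate_a

# A ratio below 0.8 is the conventional warning threshold - a prompt to
# investigate whether the disparity is operationally justified.
print(f"selection-rate ratio: {ratio:.2f}")
```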

7.       In order to identify good practice related to biases or discrimination, some important processual issues must be taken into account, for example:

  1. In order to understand the scope for algorithmic decision-making in relation to bias adequately and appropriately, it is necessary to engage with, and integrate the views of, multiple stakeholders to understand how algorithms are designed, developed and appropriated into the social world, how they have been experienced, and what the concerns surrounding their use are;
  2. Importantly, this undertaking and exploration should be achieved through rigorous research rather than abstract orientations towards good practice in relation to algorithms: thus, considering examples of the consequences that people have experienced when algorithms have been implemented, particular scenarios surrounding their use, and, as emphasised in the point above, talking to people about their experiences.
  3. Given the complexities of the landscape in which algorithms are developed and used, we need to recognise that it is difficult, in some cases impossible, to develop completely unbiased algorithms and that this would be an unrealistic ideal to aim towards. Instead, it is important to base good practice on a balanced understanding and consideration of multi-stakeholder needs.

8.       The need for ‘good practice’ guidance regarding bias in algorithmic decision-making has also been recognized by professional associations such as the Institute of Electrical and Electronic Engineers (IEEE), which in April 2016 launched a Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems[10]. As part of this initiative Dr Koene is chairing the working group for the development of a Standard on Algorithm Bias Considerations[11], which will provide certification oriented methodologies to identify and mitigate non-operationally-justified algorithm biases through:

I. the use of benchmarking procedures;

II. criteria for selecting bias validation data sets;

III. guidelines for the communication of application domain limitations (using the algorithm for purposes beyond this scope invalidates the certification).

2b. Whether and how algorithmic decision-making can be conducted in a ‘transparent’ or ‘accountable’ way, and the scope for decisions made by an algorithm to be fully understood and challenged?

9.       What is essential here is to create a meaningful transparency: that is a transparency that all stakeholders can engage with, allowing the workings of, and practical implications of, algorithms to be accessible across the diverse stakeholder base that experience them.

10.   In order to create a meaningful transparency, we need to understand what stakeholders feel such a transparency would have to incorporate for them to be adequately informed, and enable them to engage with the positive and negative implications of algorithms. Though it is unlikely that there would be complete consensus, such stakeholder engagement can provide key insights for the nature and shape of solutions to be developed.

11.   Importantly, this meaningful transparency should also relate to a meaningful accountability. It is not enough for stakeholders just to understand how algorithms are developed and how they make decisions.  In making things meaningfully transparent, stakeholders should be given some agency to challenge algorithmic decision-making processes and outcomes.

12.   In principle, algorithmic decisions can be traced, step by step, to reconstruct how the outcome was arrived at. The problem with many of the more complex ‘big data’ type processes is the high dimensionality of the underlying data. This makes it very difficult to comprehend which contributing factors are salient and which are effectively acting as noise (for any given specific decision). Analytic methods for dimension reduction can be used to make this more understandable in many situations, but may need to be applied on a case-by-case basis to appropriately evaluate the important outlying and challenging cases.
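For a simple model class, the idea of separating salient factors from noise for one specific decision can be sketched directly. The weights and features below are hypothetical, and real systems are rarely this linear, but the principle of ranking per-decision contributions carries over:

```python
# Sketch of explaining one specific decision of a simple linear scoring
# model: rank each factor's contribution to *this* outcome, separating
# salient factors from those effectively acting as noise.
# (Weights, features and threshold are hypothetical.)

weights = {"income": 0.9, "age": 0.05, "postcode_code": 0.02,
           "browser_version": 0.01, "years_at_address": 0.4}
applicant = {"income": 1.2, "age": -0.3, "postcode_code": 0.8,
             "browser_version": 1.5, "years_at_address": -1.0}

# Per-factor contribution to this applicant's score.
contributions = {f: weights[f] * applicant[f] for f in weights}
score = sum(contributions.values())

# Order factors by absolute contribution to this particular decision.
ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
salient = [f for f, c in ranked if abs(c) >= 0.1]   # arbitrary cut-off

print(f"score = {score:.3f}; salient factors: {salient}")
```

Here only two of the five recorded factors materially affected the outcome; presenting just those to the person concerned is far more meaningful than dumping the full high-dimensional input.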

13.   Similarly, it is important to note that many ‘big data’ and ‘artificial intelligence’ algorithms learn from the data they are supplied with and modify their behaviour. We must look not only at the code that constitutes an algorithm, but also at the “training data” from which it learns. Practically, this is becoming increasingly difficult as algorithms become embedded in off-the-shelf software packages and cloud services, where the algorithm itself is reused in various contexts and trained on different data – there is no one point at which the code and data are viewed together.

14.   The IEEE Global Initiative (see paragraph 8) is also working to establish a Standard for Transparency of Autonomous Systems[12] which aims to set out measurable, testable levels of transparency. The working group for this standard is chaired by Prof. Alan Winfield[13].

2c. The implications of increased transparency in terms of copyright and commercial sensitivity, and protection of an individual’s data

15.   As mentioned in our responses to 2b, while there is a need for meaningful transparency, this does not require that copyrighted code (or data) is made public. Within the community currently researching this topic, a recurring suggestion is the use of a neutral (or government associated) auditing body that could be tasked with certifying algorithmic systems through a process of expert analysis. This algorithm auditing could be done under a non-disclosure agreement, protecting both the IP and the individual data. A detailed discussion outlining arguments in favour of such an approach was developed in an open access paper by Andrew Tutt with the title “An FDA for Algorithms”[14].

16.   Even if copyrighted code is not made public, making aspects of the design of algorithms more visible may still be useful. We see how the food industry makes elements of its produce accessible to consumers, allowing them to make informed decisions about what they purchase. At this point it is difficult to say what is better or worse without full and proper engagement with industry and other stakeholders, as we are currently pursuing through the UnBias project.

17.   It is necessary to have a dialogue with industry to understand their genuine concerns surrounding increased transparency, and how a way forward can be forged. There are elements of business procedures which have to be made transparent already (e.g. the requirements for audit, health and safety, etc.), so industry is not unaccustomed to such requirements. However, given that there is an element of commercial sensitivity in this context, it is important to see what suggestions industry would have to allow for increased transparency.

18.   We should be careful not to give the impression that commercial interests supersede the rights of people to obtain information about themselves. We should be cautious about assuming industry interests are more important than others, and move forward with a balanced approach.

19.   Finally, the traditional bargain between society and inventors has been the patent – disclosure to stimulate innovation in return for commercial protection – and the question arises as to what role patents might play in transparency. The situation concerning software patents is globally complex, but then the issue of algorithmic transparency is rapidly becoming a global issue too.

3. Methods for providing regulatory oversight of algorithmic decision-making, such as the rights described in the EU General Data Protection Regulation 2016

20.   The right to explanation in GDPR is still open to interpretation and the actual practice will become established as cases unfold when enforcement starts in 2018. For example, the right to recourse and to challenge algorithmically made decisions is restricted to decisions that are made fully autonomously by algorithms and that have clearly significant impact on the person – it will be some time before we understand how these clauses will be implemented, and, with impending Brexit, whether the UK will continue to align with the EU on these interpretations. The recent paper by Wachter et al.[15] puts forward the case that much more is needed to deliver a ‘right to explanation’.

21.   More broadly, it is our position as a project that open dialogue amongst key stakeholders is an important step towards advancing the responsible oversight of algorithmic decision-making. It is necessary to include the perspectives of those from a wide range of sectors, alongside government and industry, in order to scope concerns over the current and future use of algorithms, and to identify genuine opportunities for regulation that are technically feasible as well as legally and societally valid. As noted above, the activities of the UnBias project include the scoping of opinion amongst a wide range of informed stakeholders. By promoting discussion between stakeholder groups we are working to identify potentially effective methods for oversight of algorithmic decision-making. From the work we have conducted in this area so far, it is clear (as described above) that transparency alone is not a meaningful solution to the potential problems caused by algorithmic practices. Regulatory oversight also needs to incorporate responsibility and accountability, so that users affected by algorithmic decision-making have opportunities to 1) understand how decisions about them were reached and 2) challenge those decisions if they feel them to be unfair. As also noted above, suggestions emerging from our project stakeholder dialogue so far include the possibility of an expert auditing or ombudsman system that oversees practice and mediates disputes. Further suggestions, in line with developments by the IEEE and elsewhere, include the provision of industry standards and certificates.

22.   The Council of Europe’s Committee of Experts on Internet Intermediaries (MSI-NET)[16] is also currently exploring the human rights dimensions of automated data processing techniques (in particular algorithms) and possible regulatory implications. As part of this investigation, a preliminary report[17] was published on February 20th which includes a number of relevant case studies and recommendations applicable to the topic of this inquiry.

April 2017

Tuesday, 21 March 2017

Search engines have no sense of humour

A colleague drew to my attention some recent plans to try to take on offensive and "fake" content... from the Search Engine Land article:

“We’re explicitly avoiding the term ‘fake news,’ because we think it is too vague,” said Paul Haahr, one of Google’s senior engineers who is involved with search quality. “Demonstrably inaccurate information, however, we want to target.”

Seems like a reasonable objective, although as the SEL article and another today in TechDirt note, the current drive is focussed on "upsetting-offensive" material. There are certainly categories of material, much of it already illegal, whose offensiveness would receive universal agreement, but it is not long before we move into more difficult and subjective territory: for example, is the Viz comic offensive? Well, yes – and designed that way, as a parody of British comics, politics, media, life, etc.

Trying to solve the "offensive" problem by making some universal binary categorization is not achievable. Too much of life is subjective; we need much richer categorizations of content. If you wanted to start with one, try age-appropriate search: start with readability using something simple like SMOG, and implement a search tag "reader:reading_age" to allow the user (browser) to configure the search.
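For the curious, the SMOG grade is simple enough to sketch in a few lines of Python. This is a minimal illustration only: the vowel-group syllable counter is a crude stand-in for a proper dictionary-based one, and real texts should be sampled per the published SMOG procedure.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels (min. 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    # SMOG estimates the school grade needed to read a text from the
    # density of polysyllabic (3+ syllable) words, normalised to 30
    # sentences: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

A search engine could compute something like this at index time and store it as the document's "reading_age" metadata for the tag to match against.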

In general then, indexing and rich tagging of content, with the user able to configure their search tags, seems like a means to achieve some actual match between what users want from search and the results they get. Of course, here speaks someone who uses duckduckgo because I do not want what I searched for last time to affect what I search for next without my knowledge...
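To make the idea concrete, here is a toy sketch of how such user-configurable tags might work; the index, tag names and thresholds are all hypothetical, invented for illustration:

```python
import re

# Hypothetical index: each document carries rich metadata tags.
INDEX = [
    {"title": "Who invented stairs?", "parody": True, "reading_age": 9},
    {"title": "Stairs in architecture", "parody": False, "reading_age": 14},
]

def parse_query(query: str):
    # Split a query like "stairs parody:off reader:12" into
    # free-text terms and tag constraints.
    tags, terms = {}, []
    for token in query.split():
        m = re.fullmatch(r"(\w+):(\S+)", token)
        if m:
            tags[m.group(1)] = m.group(2)
        else:
            terms.append(token)
    return terms, tags

def search(query: str):
    terms, tags = parse_query(query)
    results = []
    for doc in INDEX:
        # Apply tag constraints, then naive substring matching on terms.
        if "parody" in tags and doc["parody"] != (tags["parody"] == "on"):
            continue
        if "reader" in tags and doc["reading_age"] > int(tags["reader"]):
            continue
        if all(t.lower() in doc["title"].lower() for t in terms):
            results.append(doc["title"])
    return results
```

The point is not the matching (which here is trivially naive) but that the filtering policy lives in the user's query rather than in an opaque server-side ranking.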

This brings me full circle to the SEL article's comment on "fake results" and parody. Try a Google search for "who invented stairs"... here's one I did earlier (in case they fix it!):

... a myth propagated since 2007 by a parody website, now archived here. How do I know it is parody? Well, not least because I have used stairs older than 1948, but a tell-tale sign is the statement "In case you hadn't guessed, this is a big, fat parody."

So the next search tag to add is "parody:on|off"; well until the search engines get a sense of humour.