Exploratory Data Analysis

It doesn’t take much to trigger me into a rant about the weaknesses of reports on data and “dashboards” purporting to be “analytics” or “business intelligence”. Lots of pie charts and line graphs with added bling are like the proverbial red rag to a bull.

Until recently my response was to demand more rigorous statistics: hypothesis testing, confidence limits, tests for reverse causality (while recognising that causality is a slippery concept in complex systems). Having recently spent some time thinking about using data analysis to gain actionable insights, particularly in the setting of an educational institution, it has become clear to me that this response is too shallow. It embeds an assumption of a linear process: ask a question, operationalise it in terms of data and statistics, and crunch some numbers. As my previous post indicates, I don’t suppose all questions are approachable. Actually, thinking back to the ways I’ve done a little text and data mining in the past, it wasn’t quite like this either.

The label “exploratory data analysis” captures the antithesis of the linear process. It was popularised in statistical circles by John W Tukey in the early 1960s, and he used it as the title of a highly influential book. Tukey was trying to challenge a statistical community that was very focused on hypothesis testing and other forms of “confirmatory data analysis”. He argued that statisticians should do both, approaching data with flexibility and an open frame of mind, and he saw a well-stocked toolkit of graphical methods as essential for exploration (Tukey invented a number of plot types that are now widely used, the box plot and the stem-and-leaf display among them).
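As an aside (my illustration, not part of the original post): Tukey's graphical tools rest on simple order statistics. The five-number summary underlying the box plot can be sketched in a few lines of Python, here using Tukey-style hinges (medians of each half of the sorted data, including the overall median when the count is odd):

```python
import statistics

def five_number_summary(data):
    """Tukey's five-number summary: min, lower hinge, median, upper hinge, max."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    # Hinges are medians of the two halves; when n is odd, the overall
    # median is included in both halves (Tukey's convention).
    lower = xs[: mid + (n % 2)]
    upper = xs[mid:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Made-up data for illustration
print(five_number_summary([2, 4, 4, 5, 7, 9, 11, 15, 30]))  # (2, 4, 7, 11, 30)
```

The hinge spread (upper minus lower hinge) then gives Tukey's 1.5x "fence" rule for flagging outliers on a box plot; in the example above, 30 falls beyond the upper fence.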

In 1964 Tukey read a paper entitled “The Technical Tools of Statistics” at the 125th Anniversary Meeting of the American Statistical Association. It anticipated the development of computational tools (of which R and RapidMiner are modern examples), is well worth a read, and contains timeless gems like:

“Some of my friends felt that I should be very explicit in warning you of how much time and money can be wasted on computing, how much clarity and insight can be lost in great stacks of computer output. In fact, I ask you to remember only two points:

  1. The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
  2. Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do.”

There is a correspondence between the open-minded and flexible approach to exploratory data analysis that Tukey advocated and the Grounded Theory (GT) Method of the social sciences. To a non-social scientist, GT seems to be trying a bit too hard to be a Methodology (academic disputes and all), but the premise of using both inductive and deductive reasoning and going into a research question free of the prejudice of a hypothesis that you intend to test (prove? how often is data analysed to find a justification for a prejudice?) is appealing.

Although GT is really focussed on qualitative research, some of the practical methods that the GT originators and practitioners have proposed might be applicable to data captured in IT systems and to practitioners of analytics. I quite like the dictum of “no talk” (see the Wikipedia entry for an explanation).

My take-home, then, is something like this: if we are serious about analytics we need to be thinking about both exploratory and confirmatory data analysis, and the label “analytics” is certainly inappropriate if neither is occurring. For exploratory data analysis we need visualisation tools, an open mind and an inquisitive nature.
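To make the exploratory/confirmatory pairing concrete, here is a hedged sketch (my illustration, with made-up data, not from the original post): having spotted an apparent difference between two groups while exploring, a permutation test is one simple confirmatory check that needs nothing beyond the Python standard library.

```python
import random
import statistics

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns an approximate p-value: the fraction of random relabellings
    that produce a difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            count += 1
    return count / n_perm

# Hypothetical data: say, minutes on task for two cohorts of learners
p = permutation_test([12, 15, 14, 16, 13], [18, 21, 17, 20, 19])
print(p)
```

The exploratory step (plots, summaries) suggests the question; the permutation step asks how often chance alone would produce so large a difference, which is the confirmatory discipline Tukey insisted should accompany exploration.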

A Poem for Analytics

There are many traps for the unwary in the practice of analytics, which I take to be the process of developing actionable insights through problem definition and the application of statistical models. The technical traps are most obvious but the epistemological traps are better disguised.

That these traps exist and are seemingly not recognised in the commercial and corporate rhetoric around analytics worries the more philosophically-minded; Virginia Tech’s Gardner Campbell has shared some clear and well-received thoughts on the potential for damaging reductionism in Learning Analytics. I particularly like Anne Zelenka’s blogged reaction to Gardner’s LAK12 MOOC session (I believe there is a recording, but Elluminate recordings don’t seem to play on Linux), and my colleague Sheila has also blogged on the topic.

I don’t see reduction as being the issue per se, but careless reductionism, and failing to remember that our models are surrogates for what might be, does worry me. Analytics gives us power for “myth busting” and a means to reduce the degree to which anecdote, prejudice and the opinion of the powerful determine action, but let us be very wary indeed.

This all reminded me of the following poem by my favourite poet and mythographer, Robert Graves. Let us be slow.

In Broken Images

He is quick, thinking in clear images;
I am slow, thinking in broken images.

He becomes dull, trusting to his clear images;
I become sharp, mistrusting my broken images,

Trusting his images, he assumes their relevance;
Mistrusting my images, I question their relevance.

Assuming their relevance, he assumes the fact,
Questioning their relevance, I question the fact.

When the fact fails him, he questions his senses;
When the fact fails me, I approve my senses.

He continues quick and dull in his clear images;
I continue slow and sharp in my broken images.

He in a new confusion of his understanding;
I in a new understanding of my confusion.

Robert Graves

Making Sense of “Analytics”

There is currently growing interest in putting data from various sources to use so that organisations can be more effective, and a growing number of strategies for doing so. The term “analytics” is frequently applied to descriptions of these situations, but often without clarity as to what the word is intended to mean. This makes it difficult to make sense of what is happening, to decide what to appropriate from other sectors, and to make creative leaps forward in exploring how to adopt analytics.

I have just completed a public draft of a paper entitled “Making Sense of Analytics: a framework for thinking about analytics” [link removed – please visit our publications site to access the final versions] in an attempt to help anyone who is grappling with these questions in relation to post-compulsory education (as I am). It does so by:

  • considering the definition of “analytics”;
  • outlining analytics in relation to research management, teaching and learning or whole-institution strategy and operational concerns;
  • describing some of the key characteristics of analytics (the Framework).

The Framework is intended to support critical evaluation of examples of analytics, whether from commerce/industry or the research community, without resorting to definitions of application or product categories. The intention behind this approach is to avoid discussion of “what it is” and to focus on “what it does” and “how it does it”.

This is a draft. Please feel free to comment via this blog or directly to me. A revised version will be published in June.

This paper is the first of a series that CETIS is producing and commissioning. These will be emerging during the coming months and collected together in a unified online resource in July/August. This is referred to briefly by Sheila MacNeill in her recent post “Learning Analytics, where do you stand?”.

UK Government Open Standards Consultation – CETIS Response

Earlier this year the UK Government Cabinet Office published what I thought was a rather good set of proposals for the role of open standards in government IT. They describe it as a “formal public consultation on the definition and mandation of open standards for software interoperability, data and document formats in government IT.” There are naturally points where we have critical comments but the direction of travel is broadly one that CETIS supports. The topic of mandation is, however, one to be approached with a great deal of caution in our view.

Our full response, which should be read alongside the consultation document (which includes the questions), is available for your information.

The consultation has now been extended to June 4th 2012 following the revelation of a conflict of interest; the chair of a public consultation meeting in April was found to be also working for Microsoft. This is the latest in a long series of concerns about Microsoft lobbying reported in Computer Weekly and elsewhere. I am actually encouraged by the Cabinet Office response both to FoI requests linked to meetings with Microsoft and to this recent revelation; they do seem to be trying to do the right thing.