Analytics and Big Data – Reflections from the Teradata Universe Conference 2012

As part of our current work on investigating trends in analytics and contextualising them to post-compulsory education – which we are calling our Analytics Reconnoitre – I attended the Teradata Universe Conference recently. Teradata Universe is very much not an academic conference; this was a trip to the far side of the moon, to the land of corporate IT, grey suits galore and a dress code…

Before giving some general impressions and then following with some more in-depth reflections and arising thoughts, I should be clear about the terms “analytics” and “big data”.

My working definition for Analytics, which I will explain in more detail in a forthcoming white paper and associated blog post, is:
“Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data.”

I am interpreting Big Data as being data that is at such a scale that conventional databases (single server relational databases) can no longer be used.

Teradata has a 30-year history of selling and supporting Enterprise Data Warehouses, so it should not have been a surprise that infrastructure figured in the conference. What was surprising was the degree to which infrastructure (and infrastructural projects) figured compared to applications and analytical techniques. There were some presentations in which brief case studies outlined applications, but I did not hear any reference to algorithmic or methodological development, nor indeed any reference to existing techniques from the data mining (a.k.a. “knowledge discovery in databases”) repertoire.

My overall impression is that the corporate world is generally grappling with pretty fundamental data management issues and generally focused on reporting and descriptive statistics rather than inferential and predictive methods. I don’t believe this is due to complacency but simply to the reality of where they are now. As the saying goes, “if I were going there, I wouldn’t start from here”.

The Case for “Data Driven Decisions”

Erik Brynjolfsson, Director of the MIT Center for Digital Business, gave an interesting talk entitled “Strength in Numbers: How do Data-Driven Decision-Making Practices affect Performance?”

The phrase “data driven decisions” raises my hackles since it implies automation and the elimination of the human component. This is not an approach to strive for. Stephen Brobst, Teradata CTO, touched on this issue in the last plenary of the conference when he asserted that “Success = Science + Art” and backed up the assertion with examples. Whereas my objections to data driven decisions revolve around the way I anticipate such an approach would lead to staff alienation and significant disruption to the effective working of an organisation, Brobst was referring to the trap of incremental improvement leading to missed opportunities for breakthrough innovation.

As an example of a case where incremental improvement found a locally optimal solution but a globally sub-optimal one, Brobst cited actuarial practice in car insurance. Conventionally, risk estimation uses features of the car, the driver’s driving history and location and over time the fit between these parameters and statistical risk has been honed to a fine point. It turns out that credit risk data is actually a substantially better fit to car accident risk, a fact that was first exploited by Progressive Insurance back in 1996.

Rather than “data driven decisions”, I advocate human decisions supported by the use of good tools to provide us with data-derived insights. Paul Miller argues the same case against just letting the data speak for itself on his “cloud of data” blog.

This is, I should add, something Brynjolfsson and co-workers also advocate; they are only adopting terminology from the wider business world. See, for example, an article in The Futurist (Brynjolfsson, Erik and McAfee, Andrew, “Thriving in the Automated Economy”, The Futurist, March-April 2012). In this article, Brynjolfsson and McAfee make the case for partnering humans and machines throughout the world of work and leisure. They cite an interesting example: the current best chess “player” in the world is two amateur American chess players using three computers. They go on to make some specific recommendations to try to make sure that we avoid some socio-economic pathologies that might arise from a humans vs technology race (as opposed to humans with machines), although not everyone will find all the recommendations ethically acceptable.

To return to the topic of Brynjolfsson’s talk: it is expanded in a paper of the same title (Brynjolfsson, Erik, Hitt, Lorin and Kim, Heekyung, “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance”, April 2011). The abstract:
“We examine whether performance is higher in firms that emphasize decisionmaking based on data and business analytics (which we term a data-driven decisionmaking approach or DDD). Using detailed survey data on the business practices and information technology investments of 179 large publicly traded firms, we find that firms that adopt DDD have output and productivity that is 5-6% higher what would be expected given their other investments and information technology usage. Using instrumental variables methods, we find evidence that these effects do not appear to be due to reverse causality. Furthermore, the relationship between DDD and performance also appears in other performance measures such as asset utilization, return on equity and market value. Our results provide some of the first large scale data on the direct connection between data-driven decisionmaking and firm performance.”

This is an important piece of research, adding to a relatively small existing body – which shows correlation between high levels of analytics use and factors such as growth (see the paper) – and one which I have no doubt will be followed up. They have taken a thorough approach to the statistics of correlation and tested for reverse causation. The limitation of the conclusion is clear from the abstract, however: it applies to “large publicly traded firms”. What of smaller firms? Furthermore, business sector (industry) is treated as a “control”, but my hunch is that the 5-6% figure conceals some really interesting variation. The study also fails to establish mechanism, i.e. to demonstrate what it is about the context of firm A and the interventions undertaken that leads to enhanced productivity etc. These kinds of issues with evaluation in the social sciences are the subject of writings by Nick Tilley and Ray Pawson (see, for example, “Realistic Evaluation: An Overview“), which I hold in high regard. My hope is that future research will attend to these issues. For now we must settle for less complete, but still useful, knowledge.

I expect that as our Analytics Reconnoitre proceeds we will return to this and related research to explore further whether any kind of business case for data-driven decisions can be robustly made for Higher or Further Education, or whether we need to gather more evidence by doing. I suspect the latter to be the case and that for now we will have to resort to arguments on the basis of analogy and plausibility of benefits.

Zeitgeist: Data Scientists

“Data Scientist” is a term which seems to be capturing the imagination in the corporate big data and analytics community but which has not been much used in our community.

A facetious definition of data scientist is “a business analyst who lives in California”. Stephen Brobst gave his distinctions between data scientist and business analyst in his talk. His characterisation of a business analyst is someone who: is interested in understanding the answers to a business question; uses BI tools with filters to generate reports. A data scientist, on the other hand, is someone who: wants to know what the question should be; embodies a combination of curiosity, data gathering skills, statistical and modelling expertise and strong communication skills. Brobst argues that the working environment for a data scientist should allow them to self-provision data, rather than having to rely on what is formally supported in the organisation, to enable them to be inquisitive and creative.

Michael Rappa from the Institute for Advanced Analytics doesn’t mention curiosity but offers a similar conception of the skill-set for a data scientist in an interview in Forbes magazine. The Guardian Data Blog has also reported on various views of what comprises a data scientist in March 2012, following the Strata Conference.

While it can be a sign of hype for new terminology to be spawned, the distinctions being drawn by Brobst and others appeal to me because they put space between the mainstream practice of business analysis and some arguably more effective practices. As universities and colleges move forward, we should be cautious of adopting the prevailing view from industry – the established business analyst role with a focus on reporting and descriptive statistics – and missing out on a set of more effective practices. Our lack of a baked-in BI culture might actually be a benefit if it allows us to more quickly adopt the data scientist perspective alongside necessary management reporting. Furthermore, our IT environment is such that self-provisioning is more tractable.

Experimentation, Culture and HiPPOs

Like most stereotypes, the HiPPO – decision-making based on the Highest Paid Person’s Opinion – is founded on reality. While it is likely that UK universities and colleges are some cultural distance from the world of corporate America that stimulated the coining of “HiPPO”, we are certainly not immune from decision-making on the basis of management intuition, and anecdote suggests that many HEIs are falling into more autocratic and executive-style management in response to a changing financial regime. As a matter of pride, though, academia really should try to be more evidence-based.

Avinash Kaushik (Digital Marketing Evangelist at Google) talked of HiPPOs and data driven decision making (sic) culture back in 2006, yet in 2012 these issues are still main-stage items at Teradata Universe. Cultural inertia. In addition to proposing seven steps to becoming more data-driven, Kaushik’s posting draws the kind of distinction between reporting and analysis that accords with the business analyst vs data scientist distinctions, above.

Stephen Brobst’s talk – “Experimentation is the Key to Business Success” – took a slightly different approach to challenging the HiPPO principle. Starting from the observation that business culture expects its leadership to have the answers to important and difficult questions – something even argumentative academics can still be found to do – Brobst argued for experimentation to acquire the data necessary for informed decision-making. He gained a further nod from me by asserting that the experiment should be designed on the basis of theorisation about mechanism (see the earlier reference to the work of Tilley and Pawson).

Procter & Gamble’s approach to pricing a new product by establishing price elasticity through a set of trial markets with different price points is one example. It is hard to see this being tractable for fee-setting in most normal courses in most universities, but maybe not for all, and it becomes a lot more realistic with large-scale distance education. Initiatives like Coursera have the opportunity to build out for-fee services with much better intelligence on pricing than mainstream HE can dream of.

Big Data and Nanodata Velocity

There is quite a lot of talk about Big Data – data that is at such a scale that conventional databases can no longer be used – but I am somewhat sceptical that the quantity of talk is merited. One presenter at Teradata Universe actually proclaimed that big data was largely an urban myth, but this was not the predominant impression; others boasted about how many petabytes of data they had (1PB = 1,000TB = 1,000,000GB). There seems to be an unwarranted implication that big data is necessary for gaining insights. While it is clear that more data points improve statistical significance, and that if you have a high volume of transactions/interactions then even small percentage improvements can have significant bottom-line value (e.g. a 0.1% increase in purchase completion at Amazon), there remains a great deal of opportunity to be more analytical in the way decisions are made using smaller-scale data sources. The absence of big data in universities and colleges is an asset, not an impediment.

Erik Brynjolfsson chose the term “nanodata” to draw attention to the fine-grained components of most Big Data stores. Virtually all technology-mediated interactions are capable of capturing such “nanodata” and many do. The availability of nanodata is, of course, one of the key drivers of innovation in analytics. Brynjolfsson also pointed to data “velocity”, i.e. the near-real-time availability of nanodata.

The insights gained from using Google search terms to understand influenza are a fairly well-known example of using the “digital exhaust” of our collective activities to short-cut traditional epidemiological approaches (although I do not suggest it should replace them). Brynjolfsson cited a similar approach used in work with former co-worker Lynn Wu on house prices and sales (pdf), which anticipated official figures rather well. The US Federal Reserve Bank, we were told, was envious.

It has taken a long time to start to realise the vision of Cybersyn. Yet still our national and institutional decision-making relies on slow-moving and broadly obsolete data; low-velocity information is tolerated when maybe it should not be. In some cases the opportunities from more near-real-time data may be neglected low-hanging fruit, and it doesn’t necessarily have to be Big Data. Maybe there should be talk of Fast Data?

Data Visualisation

Stephen Few, author and educator on the topic of “visual business intelligence”, gave both a keynote and a workshop that could be very concisely summarised as a call to: 1) take more account of how human perception works when visualising data; 2) make more use of visualisation for sense-making. Stephen Brobst (Teradata CTO) made the latter point too: that data scientists use data visualisation tools for exploration, not just for communication.

Few gave an accessible account of visual perception as applied to data visualisation with some clear examples and reference to cognitive psychology. His “Perceptual Edge” website/blog covers a great deal of this – see for example “Tapping the Power of Visual Perception” (pdf) – as does his accessible book, “Now You See It“. I will not repeat that information here.

His argument that “visual reasoning” is powerful is easily demonstrated by comparing what can be immediately understood from the right kind of graphical presentation with a tabulation of the same data. The point is that visual reasoning usually happens transparently (subconsciously), and hence that we need to guard against visualisation techniques that mislead, confuse or overwhelm.

I did feel that he advocated visual reasoning beyond the point at which it is reliable by itself. For example, I find parallel coordinates quite difficult. I would have liked to see more emphasis on visualising the results of statistical tests on the data (e.g. correlation, statistical significance), particularly as I am a firm believer that we should know the strength of an inference before deciding on action. Is that correlation really significant? Are those events really independent in time?
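As an illustration of the kind of check I have in mind, here is a small sketch (plain Python, made-up data) that computes a Pearson correlation and the associated t-statistic. With only five points, even r = 0.8 falls short of the conventional 5% significance threshold (t ≈ 3.18 for 3 degrees of freedom) – exactly the sort of thing a pretty scatter plot hides:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_stat(r, n):
    """t-statistic for H0 'no correlation'; compare with the t-distribution
    on n - 2 degrees of freedom before acting on the correlation."""
    return r * math.sqrt((n - 2) / (1 - r * r))

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)  # 0.8 -- looks strong...
t = t_stat(r, len(xs)) # ...but ~2.31 < 3.18: not significant at 5% (two-tailed)
```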

Few’s second key point – about the use of data visualisation for sense-making – began with claims that the BI industry has largely failed to support it. He summarised the typical pathway for data as: collect > clean > transform > integrate > store > report. At this point there is a wall, Few claims, that blocks a productive sense-making pathway: explore > analyse > communicate > monitor > predict.

Visualisation tools tend to have been created with before-the-wall use cases in mind, to be about the plot in the report. I rather agree with Few’s criticism that such tool vendors tend to err towards a “bling your graph” feature-set or flashy dashboards, but there is hope in the form of tools such as Tibco Spotfire and Tableau, while Open Source aficionados or the budget-less can use GGobi for complex data visualisation, Octave or R (among others). The problem with all of these is complexity; the challenge to visualisation tool developers is to create more accessible tools for sense-making. Without this kind of advance it requires too much skill acquisition to move beyond reporting to real analytics, and that limits the number of people doing analytics that an organisation can sustain.

It is worth noting that “Spotfire Personal” is free for one year and that “Tableau Public” is free and intended to let data-bloggers et al publish to their public web servers, although I have not yet tried them.

Analytics & Enterprise Architecture

The presentation by Adam Gade (CIO of Maersk, the shipping company) was ostensibly about their use of data but it could equally have been entitled “Maersk’s Experiences with Enterprise Architecture”. Although at no point did Gade utter the words “Enterprise Architecture” (EA), many of the issues he raised have appeared in talks at the JISC Enterprise Architecture Practice Group: governance, senior management buy-in, selection of high-value targets, tactical application, … etc. It is interesting to note that Adam Gade has a marketing and sales background – not the norm for a CIO – yet seems to have been rather successful; maybe he could sell the idea internally?

The link between EA and Analytics is not one which has been widely made (in my experience and on the basis of Google search results) but I think it is an important one which I will talk of a little more in a forthcoming blog post, along with an exploration of the Zachman Framework in the context of an analytics project. It is also worth noting that one of the enthusiastic adopters of our ArchiMate (TM) modelling tool, “Archi“, is Progressive Insurance which established a reputation as a leader in putting analytics to work in the US insurance industry (see, for example the book Analytics at Work, which I recommend, and the summary from Accenture, pdf).

Adam Gade also talked of the importance of “continuous delivery”, i.e. that analytics or any other IT-based projects start demonstrating benefits early rather than only after the “D-Day”. I’ve come across a similar idea – “time to value” – being argued for as being more tactically important than return on investment (RoI). RoI is, I think, a rather over-used concept and a rather poor choice if you do not have good baseline cost models, which seems to be the case in F/HEIs. Modest investments returning tangible benefits quickly seems like a more pragmatic approach than big ideas.

Conclusions – Thoughts on What this Means for Post-compulsory Education

For all that the general perception is that universities and colleges are relatively undeveloped in terms of putting business intelligence and analytics to good use, I think there are some important “but…” points to make. The first “but” is that we shouldn’t measure ourselves against the most effective users from the commercial sector. The second is that the absence of entrenched practices means that there should be less inertia to adopting the most modern practices. Third, we don’t have data at the scale that forces us to acquire new infrastructure.

My overall impression is that there is opportunity if we make our own path, learning from (but not following) others. Here are my current thoughts on this path:

Learn from the Enterprise Architecture pioneers in F/HE

Analytics and EA are intrinsically related and the organisational soft issues in adopting EA in F/HE have many similarities to those for adopting analytics. One resonant message from the EA early adopters, which can be adapted for analytics, was “use just enough EA”.

Don’t get hung up on Big Data

While Big Data is a relevant technology trend, the possession of big data is not a pre-requisite for making effective use of analytics. The fact that we do not have Big Data is a freedom not a limitation.

Don’t focus on IT infrastructure (or tools)

Avoid the temptation (and sales pitches) to focus on IT infrastructure as a means to get going with analytics. While good tools are necessary, they are not the right place to start.

Develop a culture of being evidence-based

The success of analytics depends on people being prepared to critically engage with evidence based on data (including its potential weaknesses or biases, and without being over-trusting of numbers) and to take action on the analysis rather than being slaves to anecdote and the HiPPO. This should ideally start from the senior management. “In God we trust, all others bring data” (probably mis-attributed to W. Edwards Deming).

Experiment with being more analytical at craft-scale

Rather than thinking in terms of infrastructure or major initiatives, get some practical value with the infrastructure you have. Invest in someone with “data scientist” skills as a master crafts-person and give them access to all data, but don’t neglect the value of developing apprentices and of developing wider appreciation of the capabilities and limitations of analytics.

Avoid replicating the “analytics = reporting” pitfall

While the corporate sector fights its way out of that hole, let us avoid following them into it.

Ask questions that people can relate to and that have efficiency or effectiveness implications

Challenge custom and practice or anecdote on matters such as: “do we assess too much?”, “are our assessment instruments effective and efficient?”, “could we reduce heating costs with different use of estate?”, “could research groups like mine gain greater REF impact through publishing in OA journals?”, “how important is word of mouth or twitter reputation in recruiting hard-working students?”, “can we use analytics to better model our costs?”

Look for opportunities to exploit near-real-time data

Are decisions being made on old data, or no changes being made because the data is essentially obsolete? Can the “digital exhaust” of day-to-day activity be harnessed as a proxy for a measure of real interest in near-real-time?

Secure access to sector data

Sector organisations have a role to play in making sure that F/HEIs have access to the kind of external data needed to make the most of analytics. This might be open data or provisioned as a sector shared service. The data might be geospatial, socio-economic or sector-specific. JISC, HESA, TheIA, LSIS and others have roles to play.

Be open-minded about “analytics”

The emerging opportunities for analytics lie at the intersection of practices and technologies. Different communities are converging and we need to be thinking about creative borrowing and blurring of boundaries between web analytics, BI, learning analytics, bibliometrics, data mining, … etc. Take a wide view.

Collaborate with others to learn by doing

We don’t yet know the pathway for F/HE and there is much to be gained from sharing experiences in dealing with both the “soft” organisational issues and the challenge of selecting and using the right technical tools. While we may be competing for students or research funds, we will all fail to make the most of analytics, and to effectively navigate the rapids of environmental factors, if we fail to collaborate; competitive advantage is to be had from how analytics is applied, but that can only occur if capability exists.

New Draft British Standard – Exchanging Course Related Information

The two parts of a draft British Standard (BS), “BS 8581 – Exchanging course related information – Course advertising profile”, have recently been released for public comment on the British Standards Institution “Draft Review” website.

This standard is heavily based on the XCRI-CAP 1.2 specification, which has been developed and piloted over the past few years with support from JISC and CETIS, and would create a British Standard that is consistent with the European Standard “Metadata for Learning Opportunities – Advertising” (EN 15982, also to be adopted as BS) but extends it and provides more detail suited to UK application.

The two parts for public review, which closes on April 30th 2012, are:

Registration is required to access the drafts and comment.

Background information and details of implementations of XCRI may be found on the XCRI-CAP website.

EdTech Blogs – a visualisation playground

During the CETIS Conference today (Feb 22nd), I showed a few graphs, plots and other visualisations that show the results of text mining around 7500 blog posts, mostly from 2011 and into early 2012. These were crawled by the RWTH Aachen University “Mediabase“.

There are far too many to show here and each of the three analyses has its separate auto-generated output, which is linked to below. Each of these outlines key aspects of the method and headline statistics. I am quite aware that it is bad practice just to publish a load of visualisations without either an explicit or implicit story. If this bothers you, you might want to stop now, or visit my short piece “East and West: two worlds of technology enhanced learning“, which uses the first method outlined below but is not such a “bag of parts”. If you want to weave your own story… read on!

Stage 1: Dominant Themes

The starting point is simply to look at the dominant themes in blog posts from 2011 and early 2012 through the lens of frequent terms used. Common words with little significance (stop words) are removed and similar words are aggregated (e.g. learn, learner, learning). This set of blog posts is then split into two sets: those from CETIS and those from a broadly representative set of Ed Tech blogs. The frequent terms are then filtered into those that are statistically more significant in the CETIS set and those that are statistically more significant in the Ed Tech set.

The results of doing this are: “Comparison: CETIS Blogging vs EdTech Bloggers Generally (Jan 2011-Feb 2012)

Co-occurrence Pattern - Ed Tech Blogger Frequent Terms. (see the "results" link above for explanation and more...)
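For those curious about the mechanics, the core of Stage 1 can be sketched in a few lines of Python. The stop-word list and suffix-stripping here are deliberately crude stand-ins (the real analysis was done in R with proper stemming), and the smoothing constant is my own illustrative choice:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "an", "to", "in", "is", "it",
              "for", "on", "that", "with"}

def term_counts(posts):
    """Frequent-term counts for a set of posts, with stop words removed and
    similar words crudely aggregated (learn/learner/learning -> learn)."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z]+", post.lower()):
            if word in STOP_WORDS or len(word) < 3:
                continue
            counts[re.sub(r"(ing|ers?|s)$", "", word)] += 1
    return counts

def distinctive_terms(counts_a, counts_b, min_ratio=2.0):
    """Terms whose relative frequency in set A is at least min_ratio times
    their (add-one smoothed) relative frequency in set B."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return sorted(
        term for term, n in counts_a.items()
        if (n / total_a) / ((counts_b.get(term, 0) + 1) / (total_b + 1)) >= min_ratio
    )
```

Running `distinctive_terms` in both directions (CETIS vs Ed Tech, then Ed Tech vs CETIS) gives the two filtered term lists described above.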

Stage 2: Emerging and Declining Themes

Stage 2a: Finding Rising and Falling Terms

In this case, I home in on CETIS blogs only, but go back further in time: to January 2009. The blog posts are split into two sets: one contains posts from the last six months and the other contains earlier posts, going back to the end of January 2009. The distribution of terms appearing in each set is compared to find those whose change is statistically significant, taking into account the sample size. This process identifies four classes of term: terms that appear anew in recent months, terms that rose from very low frequencies, those that rose from moderate or higher frequencies and those that fell (or vanished).

The results of doing this are: “Rising and Falling Terms – CETIS Blogs Jan 31 2012“. This has a VERY LARGE number of plots, many of which can be skipped over but are of use when trying to dig deeper. This auto-generated report also contains links to the relevant blog posts and ratings for “novelty” and “subjectivity”.

Significant Falling Terms
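The statistical comparison behind Stage 2a can be sketched with Dunning’s log-likelihood statistic (G²), a standard test for comparing a term’s frequency across two corpora of different sizes; this is an illustrative Python rendering, not the R code actually used:

```python
import math

def g_squared(a, b, total_a, total_b):
    """Dunning log-likelihood (G^2) for a term seen a times in a corpus of
    total_a tokens and b times in a corpus of total_b tokens.  Values above
    ~3.84 indicate a difference significant at the 5% level (1 df)."""
    e_a = total_a * (a + b) / (total_a + total_b)  # expected count in corpus A
    e_b = total_b * (a + b) / (total_a + total_b)  # expected count in corpus B
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e_a)
    if b > 0:
        ll += b * math.log(b / e_b)
    return 2 * ll

# A term used 20 times in 10,000 recent tokens vs 5 times in 10,000 older
# tokens: G^2 ~ 9.64 > 3.84, so it would count as significantly "rising".
print(round(g_squared(20, 5, 10_000, 10_000), 2))
```

Whether a significant term is classed as “new”, “rising” or “falling” then depends simply on which side of the comparison has the higher relative frequency, and whether the older count was zero.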

Stage 2b: Visualising Changes Over Time

Various terms were chosen from Stage 2a and the changes in time rendered using the (in-) famous “bubble chart”. Although these should not be taken too seriously since the quantity of data per time step is rather small, these allow for quite a lot of experimentation with a range of related factors: term frequency, number of documents containing the term, positive/negative sentiment in posts containing the term. Four separate charts were created for CETIS blogs from 2009-2012: Rising, Established, Falling and Familiar (dominant terms from Stage 1). The dominant non-CETIS terms are also available, but only for 2011.

Final Words

Due to some problems with the blog crawler, a number of blogs could not be processed or had incompletely extracted postings so this is not truly representative. The results are not expected to change dramatically but there will be some terms appearing and some disappearing when these issues are fixed. This posting will be altered and the various auto-generated reports will be re-generated in due course.

The R code used, and results from using the same methods on conference abstracts/papers are available from my GitHub. This site also includes some notes on the technicalities of the methods used (i.e. separate from the way these were actually coded).

The Network of Society of Scholars (Fiction)

As preparation for the session “Emerging Reality: Making sense of new models of learning organisations” at this week’s CETIS Conference (a session hosted by the TELMap Project), I have created the following scenario to try to make real some plausible drivers, issues, etc. The session will be debating the plausibility of these and other issues, and hence their potential as shapers of future learning organisations. I will emend this posting once the outcomes of the workshop are published.

The scenario is pure fiction, an informal speculation about something that might happen by around 2020-5.

The Scenario

What is a “Society of Scholars”?

A Society of Scholars is based in one or more large old houses with a combination of study-bedrooms, communal cooking and social spaces, a library and a central seminar. Students and some of the Fellows live there while many Fellows live with their families and study there.

There are no fixed courses but a framework within which depth and breadth of scholarship is guided and measured. This framework is validated by an established University, which awards the degree and provides QA (all for a fee). The Network of Societies of Scholars™ has additional ethical codes and strict membership rules.
The physical co-location is a central part of the Society, combined with the wider (virtual) network of peers.

Genesis

Societies of Scholars sprang out of an initial “wild card” experiment where a small group of progressive academics with experience of inquiry-based learning pooled their redundancy payments from one of many rounds of staff-culling. A few sold their houses.

Their idea was to strip out the accumulation of both central services and the formality of the teaching and learning setting, and to get back to basics while reducing cost and being able to do more of what they enjoy: thinking and talking. In doing this they hoped to attract students who were otherwise being asked to pay ever higher fees to endure ever more “commoditised” offerings and suffer poor employment prospects. The promise of high wages to pay off high debt is elusive for many who follow the conventional route. Graduate employment and student satisfaction are worse for those who opt for the newer “no frills degree course” offerings, which have cut costs without re-inventing the educational experience.

For several years they struggled to attract students, but gradually a few gifted students who managed to develop ultra-high web reputations started to attract more applications. The turning point was the winning of an international prize for work on “Smart Cities”, which led to a media frenzy in 2018. This triggered a spate of endowments of new Societies by successful entrepreneurs and the establishment of satellite societies to Cambridge and Oxford Universities in the UK and ETH Zurich in Switzerland, with others quickly following (all recognising the threat but also the early-mover opportunity).

Character of a Society of Scholars

Societies are highly reputation-conscious, as are the individuals within them. They are highly effective at using the web (what we called “new media” in the noughties) and at media management generally.

With the exception of assessment, Fellows and Students undertake essentially the same kinds of activity; the Students strive to emulate the attitude and work of the Fellows. Both divide their time among private study and informal and formal discussion. Collaboration works. There is no “Fellows teach Students”; all teach each other through the medium of the seminar. All consider “teaching the world” to be an important (but not dominating) part of what they do.

The selection process plays a key role in shaping the character of the Society. Students are admitted NOT primarily on the basis of examination grades but on evidence of self-discipline, self-awareness and especially self-directed intellectual activity.

Course and Assessment

There are no specified courses and all Students follow a unique pathway of their own. Fellows offer guidance and almost all Students piece together a collection of topics that are identifiable (e.g. similar to a conference theme, a textbook, etc). There is no fixed minimum or maximum period of study.

Societies typically focus on 3-4 disciplines but always adopt a multi-disciplinary perspective; for example, the combination of computer science, electronic engineering, built environment and social theory led to the “Smart Cities” prize.

Online resources are exploited to the fullest extent. Free or cheap MOOCs (massive open online courses, especially the form pioneered by Stanford University and udacity.com) are combined with the for-fee examinations offered alongside them.

Wikipedia is considered to be a “has been”; Society members (across the Network) and others collaborate on DIY textbooks using a system built on top of “git” (permitting multiple versions, derivatives, etc; see GitHub for a “social coding” example) and a decentralised network of small servers. While being widely useful, this activity is also a valued learning activity with the side effect of promoting coherence in the study pathway.

Assessment is complicated primarily by the idiosyncrasy of all pathways but also by the need to connect achievement to the breadth/depth framework. An award is typically evidenced by a mixture of: externally taught and examined modules; public examinations of the University of London; a patchwork of personal work (a “portfolio”); contributions to the DIY textbooks; seminar performance.

Demand and Expansion

Societies of Scholars are niche occupiers in a much wider higher education landscape. Demand is no more than 5% and supply only about 3% in 2025. There is a feeling that graduates of the Societies are the “new elite”.

While some politicians call for the massification of the Society concept, society at large recognises that they need a special kind of student: more of an intellectual entrepreneur. The rise of the Society of Scholars has, however, started to change the way society understands (and answers) questions like: “what is the purpose of education?”; “how does learning happen?”… The long-term effect of this change on the face of education is not known yet (2025).

Employers in particular have understood what Societies offer and, while graduate unemployment for those following a conventional route to a degree remains close to 2012 levels, Society graduates are highly employable. Employers value: creativity, good communication skills, media-savvy people, multi-disciplinary thinking, self-motivation, intellectual flexibility, collaborative and community-oriented lifestyle.

The Drivers/Issues

This is a summary of some of the implicit or explicit assumed drivers/issues embedded in the scenario and which determine the plausibility of it (or alternatives). They are intentionally phrased as statements that could be disagreed with, argued for, …

  1. Physical co-location and (especially intimate) face-to-face interactions will continue to be seen to be an essential aspect of high quality education. Students who can afford (or otherwise access) this will generally do so. Employers will value awards arising from courses containing it more highly than those that do not. Telegraph newspaper article.
  2. Graduate unemployment will be an issue for years to come. Effective undergraduates will find ways to distinguish themselves. HESA Statistics
  3. Wikipedia (and similar centralised “web commons” services) are unsustainable in their current form. As the demand from users rises and the support from contributors and sponsors wanes (it becomes less cool to be a Wikipedian) a point of unsustainability is reached. One option is to monetise but another is to “go feral” and transition to peer-to-peer or decentralised approaches. Digital Trends article.
  4. Universities and colleges will increase the supply of course and educational components, disaggregated from “the course”, “the programme” and “the institutional offering”. Examinations, Award Granting and Quality Assurance are all potentially independent marketable offerings. David Willetts article on the BBC (see “Flexible Learning”)
  5. Cheap large-scale online courses are capable of replacing a significant percentage of conventional teaching time. The “Introduction to AI” course demonstrated this: see http://is.gd/JOseb.
  6. Employers are conservative when it comes to education. While employers bemoan narrow knowledge of graduates, poor “soft skills”, etc, their shortlisting criteria continue to favour candidates with conventional degree titles and high grades from research-intensive universities. They will generally fail to take advantage of rich portfolio evidence.

The Stanford “Introduction to AI” Course – the sign of a disruptive innovation?

Over on the JISC Observatory website an interview with Seb Schmoller has just been published in which he talks about his experiences – from the perspective of an online distance educator – of the recent large-scale open online course “Introduction to AI” run in association with Stanford University. As the interview unfolded, it occurred to me that the aspects of the course that had struck Seb as being of potentially profound importance fitted the criteria for a “low end disruptive innovation” in the terminology of innovation theorist Clayton M Christensen. Low-end disruption refers to the way apparently well-run businesses can be disrupted by newcomers with cheaper but good-enough offerings that focus on core customer needs and often make use of generic off-the-shelf technologies.

Interesting stuff to ponder…

(interview on the JISC Observatory site)

Data Protection – Anticipating New Rules

On January 25th 2012, the European Commission released its proposals for significant reform of data protection rules in Europe (drafts had been leaked in late 2011). These proposals have been largely welcomed by the Information Commissioner’s Office, although it also recommends further thought over some of the proposals. The dramatic changes in the scale and scope of handling personal information in online retailing and social networking since the 1990s, when the current rules were implemented, are an obvious driver for change. The rise of “cloud computing” is a related factor.

What might this mean for the UK education system, especially for those concerned with educational technology?

On the whole, the answer is probably a fairly bland “not much” since we are, as a sector, pretty good at being responsible with personal data. The sector’s ethic, regardless of legislation, is to be institutionally concerned and careful and, provided enough time is available to adapt systems (of working and IT), this should be a relatively low-impact change. There are, however, a few implications worthy of comment…

The Principle of Data Portability

If you know anything about CETIS, it should come as no surprise that “data portability” caught my eye. EC Fact Sheet No. 2 says:

‘The Commission also wants to guarantee free and easy access to your personal data, making it easier for you to see what personal information is held about you by companies and public authorities, and make it easier for you to transfer your personal data between service providers – the so-called principle of “data portability”.’

Notice that this includes “public authorities”. Quite how this principle will affect practice remains to be seen but it does appear to have implications at the level of individual educational establishments and sector services such as the Learning Records Service (formerly MIAP). It is conceivable that this requirement will be satisfied by “download as HTML”, a rather lame interpretation of making it easier to transfer personal data, but I do hope not.

So: are there candidate interoperability standards? Yes, there are:

  • LEAP2A for e-portfolio portability and interoperability,
  • A European Standard, EN 15981, “European Learner Mobility Achievement Information” (an earlier open-access version is available as a CEN Workshop Agreement, CWA 16132)

These do not cover absolutely everything you might wish to “port” but widespread adoption as part of demonstrating compliance with a legislative “data portability” requirement is an option that is available to us.

It is also worth noting Principle 7 of “Information Principles for the UK Public Sector” (pdf) – see also my previous posting – which is entitled “Citizens and Businesses Can Access Information About Themselves” and recommends that information strategies should go “… beyond the legal obligations” and identify opportunities “to proactively make information about citizens available to them by default”, noting that this would negate the cost of processes and systems for responding to Subject Access Requests. I hope that this attitude is embraced and that the software is designed on a “give them everything” principle rather than “give them the minimum we think the law requires”. Software vendors should be thinking about this now.

There are some interesting possibilities for learner mobility if learners have a right to access and transfer fine grained achievement and progress information, especially where that is linked to well defined competence (etc) structures. Can we imagine more nomadic learners, especially those who may be early adopters of offerings from the kind of new providers that David Willetts and colleagues are angling for?

The Right to be Forgotten

This right is clearly aimed squarely at the social network hubs and online retailers (see the EC Fact Sheet No.3, pdf). It isn’t very likely that anyone would want to have their educational experiences and achievements forgotten unless they plan to “vanish”. Indeed, it would be surprising if existing records retention requirements were changed, and the emerging trend of having secure document storage and retrieval services under user control – e.g. DARE – seems set to continue and be the way we manage this issue cost-effectively.

The right to be forgotten may be more of a threat to realising the “learning analytics” dream, even if only in adding to existing uncertainty, doubt and sometimes also fear. We need some robust and widely accepted protocols to define legally and ethically acceptable practice.

Uniformity of Legislation

The national laws that were enacted to meet the existing data protection requirements are all different and the new proposals are to have a single uniform set of rules. This makes sense from the point of view of a multi-nation business, although it will not be without critics. This is just one factor that could make a pan-European online Higher Education initiative easier to realise, whether a single provider or a collaboration. I perceive signs that people are moving closer to viable approaches to large scale online distance education using mature technologies, and possibly English as the language of instruction and assessment; looming “low-end disruptions” (see the Wikipedia article on “Disruptive Innovation“) for the academy as we know it. [Look out for an interview with Seb Schmoller which has influenced my views, due to be published soon on the JISC Observatory website.]

These are, of course, just some initial impressions of some proposals. I am sure there is a great deal that I have missed from a fairly quick scan of material from the Commission, and there is bound to be a lot of carping from those with businesses built around exploiting personal data, so the final shape of things might be quite different.

Information Principles for the Public Sector – the Case of Principle 4

In December 2011, Version 1.0 of “Information Principles for the UK Public Sector” (pdf) was published by the Cabinet Office. The principles have been endorsed by both the CIO and CTO councils within government. What surprised me is how good this document is.

The approach taken recognises that the principles will be implemented in diverse ways according to the context. It is well written and full of material which strikes me as being widely applicable (not just to government bodies) in addition to containing a number of points that indicate a progressive attitude to information. In particular, “Principle 4 – Information is Standardised and Linkable”, gives me cause to nod with approval.

The standards message is not, of course, a new one for government; it is the inclusion of “linkable” in a principle that will be applied across government activities which is. This is not simply “linked data is cool” expressed in Civil Service Speak; the principle is deeper than that and speaks to me of a possible paradigm shift in the way [collected] data is understood.

Under Implications for Information Strategy, it recommends “a framework for linking information is established” and goes on to say:

“Aspects to consider include:

  • Unambiguous identification of items (eg using authoritative reference data, or URIs)
  • Classifying items and the relationships between them.
  • Linking of items (eg potentially using the open standard web mechanisms governed by the W3C)

Consideration should be given to both internal linkages to other information sources within the organisation, and also to external linkages to other information sources across government.”

In essence, I see this as being indicative of a shift away from conceiving of data as “stuff in databases” and towards “distributed data on the network”. I see this as being a Really Big Deal and significantly more sophisticated than the piecemeal publication of data seen so far (even on data.gov.uk, which remains an important innovation).
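As a concrete illustration of what “standardised and linkable” might mean in practice, here is a minimal sketch (my own, not drawn from the Cabinet Office document) of representing records as subject-predicate-object triples in which every item is identified by a URI, so that internal and external sources can be linked without copying data between them. All the URIs and predicates below are invented examples.

```python
# Sketch: items identified by URIs and linked as triples,
# following the spirit of Principle 4. All identifiers are hypothetical.

triples = set()

def add(subject, predicate, obj):
    """Record one link between two unambiguously identified items."""
    triples.add((subject, predicate, obj))

# Internal linkage: a course record linked to its department,
# both within the same organisation's identifier space.
add("http://example.ac.uk/id/course/ch101",
    "http://example.ac.uk/def/taughtBy",
    "http://example.ac.uk/id/dept/chemistry")

# External linkage: the same course identifier points at a record held
# by another body, without that body's data being copied locally.
add("http://example.ac.uk/id/course/ch101",
    "http://example.org/def/accreditedBy",
    "http://agency.example.gov.uk/id/accreditation/4711")

def links_from(subject):
    """All outgoing links for one item, whether internal or external."""
    return sorted((p, o) for s, p, o in triples if s == subject)

for predicate, obj in links_from("http://example.ac.uk/id/course/ch101"):
    print(predicate, "->", obj)
```

The point of the sketch is the shift it embodies: once every item has a stable URI, “linking information” is just asserting triples, and it no longer matters whether the object of a link lives inside or outside the organisation.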

By itself, of course, Principle 4 achieves nothing but two recent events and the Principle add up to suggest that this is not just a pipe-dream.

The first event, which is the culmination of several prior steps, was the initiation of the inelegantly, but accurately, named “Interim Regulatory Partnership Group: Project B, Redesigning the higher education data and information landscape”. While this project is at present only deliberating on a feasibility report, a little imagination about where it might go, with some inspiration from the Reports and Documents (e.g. the “HE Information Landscape Study“, pdf), leads towards making more use of distributed collected data. Maybe I am making a leap too far, but a combination of reducing the data collection burden with principles of collect-once-use-many-times inevitably leads to linking data between (among others) UCAS, the Student Loans Company and HESA, since it is inconceivable to me that we would not have a multi-party landscape.

The second, more recent, event was a discussion with a member of the Technical Support Service of the Information Standards Board for education, skills and children’s services (ISB TSS), from which I understand that the intention is to assign URIs to entities in the standards the ISB TSS creates. Only a small step but…

My take-home message on all of this: nothing will happen very quickly, but the gradual permeation of an understanding of the implications of distributed data on the network will make possible conversations, decisions and interventions that are currently rather difficult; the drivers behind Project B are also drivers that will, I hope, accelerate this process.

Preparing for a Thaw – Seven Questions to Make Sense of the Future

“Preparing for a Thaw – Seven Questions to Make Sense of the Future” is the title of a workshop at ALT-C 2011. The idea of the workshop was to use a simple conversational technique to capture perspectives on where Learning Technology might be going, hopes and fears and views on where education and educational technology should be going. The abstract for the workshop (ALT-C CrowdVine site) gives a little more information and background and there is also a handout available that extends this. The workshop had two purposes: to introduce participants to the technique with a view to possible use in their organisations; to gather some interesting information on the issues and forces shaping the future of learning technology. All materials have Creative Commons licences.

The short versions of the Seven Questions used in the session (it is usual to use variants of one form or another) are:

  • Questions for the Oracle about 2025 [what would you ask?]
  • What would be a favourable outcome by 2025?
  • What would be an unfavourable outcome?
  • How will culture and institutions need to change?
  • What lessons can we learn from the past?
  • Which decisions need to be made and actions taken?
  • If you had a “Magic Wand”, what would you do?

Fourteen people, predominantly from Higher Education, attended the workshop; their responses to the “Seven Questions” are all online at http://is.gd/7QResponses and there is an online form to collect further responses at http://is.gd/7Questions that is now open to all. This blog post is a first reaction to the responses made during the workshop, where a peer-interviewing approach was used. A more considered analysis will be conducted on September 14th, taking account of any further responses gathered by then.

Wordle of all responses to the “Seven Questions” that were made during the ALT-C 2011 workshop.
(NB: four words have been eliminated – “learning”, “education”, “technology”, “technologies”)

My quick take on the responses is that there are about half a dozen themes that recur and a few surprising ideas. These appear to be:

  • Universal and affordable access to education, and the avoidance of a situation where access to technological advances favours one section of society (or region), were a concern. There was support for ensuring access to connectivity and hardware for everyone, and that this should be an entitlement. This was a very strong theme in the “if you had a magic wand” responses.
  • The need for improved digital literacies amongst teaching staff but also across the institution as a whole came out several times. Similarly (but distinct from this) is a desire to be more effective with teacher education and staff development.
  • There were questions about whether there would be “learning technologies” in 2025 and whether there would still be “learning technologists”, at least as defined by their current role.
  • The transformation of assessment and accreditation was also drawn out as an uncertainty.
  • Several responses wondered about the dominant devices that would be used in 2025 and the kind of interface (mouse, gesture,… what next?).
  • An increasing potential role for using data was tempered by concerns about unethical or exploitative use of collected data.
  • There was interest in multi-direction, collaborative education and the role of technology in that.
  • The risk of educational institutions holding onto established (old) models came out several times.
  • I wasn’t the only person to mention “interoperability”.

The workshop participants seemed to enjoy the approach and I was pleased at how well the peer interviewing worked, although I can see that this might not be right for all groups. I’ll also be watching out for differences in response that might be present between peer interview and solo completion of the online form. One hour for introduction, reciprocal interviews and closing discussion was rather restrictive, but it seems to have surfaced some good material and it certainly gave a good indication of what could be achieved with a little more time.

A more considered and detailed write-up will be published soon; I’ll add a comment.

Educational Data Mining Conference – Impressions and Personal Favourites

The 2011 Educational Data Mining (EDM) Conference (in Eindhoven) is the fourth so far and provided an interesting, if sometimes quite esoteric, experience. A measure of comfort with statistical and data mining methods and vocabulary was necessary to really follow most of the papers, although it was by no means the case that they were of a narrow technical nature.

The Shock of Tutoring Systems

The majority of the papers were devoted to tutoring systems, particularly “intelligent tutoring systems” (ITS) and to a model of learning that is rather contrasting to the centre of gravity of the dominant models within the educational technology (e-Learning, technology enhanced learning, etc) domain. I will comment on this first, before mentioning some of my personal favourite posters/papers.

The dominance by papers on ITS and tutoring systems generally seems to be a consequence of the origins of the EDM Conference, arising particularly from a workshop of the Artificial Intelligence in Education (AIED) conference. The dominant model of learning in this work is based on “knowledge components” and the idea of “knowledge tracing“. There is a strong bond between this and AI concepts.

At times I got the feeling that there is quite a lot of work going on within this paradigm, almost blinkered to challenging its assumptions. It is certainly the case that there are very widely used ITS systems in the US (e.g. Cognitive Tutor) and a VERY large amount of log data available in the Pittsburgh Science of Learning DataShop. Given the practicability of working in this area (data, an established model and large-scale real use of some pieces of tutoring software), it is not surprising that career academics publish in it. As the conference wore on, it became clear that the EDM community is not blinkered; several conversations, moments of excitement in post-presentation questions and some clear signposting from a few thought leaders suggested to me that this is a community that is looking well beyond the point of current focus and is interested in reaching out to people whose pedagogic home is somewhat closer to social constructivism.

One thing remains absolutely clear: there is still a great deal of difference between North American and European educational practice and culture. I wonder, though, whether “we” should recover from our recoil against ITS and give a bit more thought to whether there is something we could make our own for some applications.

Some Personal Favourites

The full text of the papers and abstracts of posters is available online; the links are to these, which are PDF files.

The poster that won the “best poster” prize and a poster that was never shown both interested me for the same reason: they are essentially about mining and visualising data in the service of reflection. This seems to be a rather neglected area in general; Business Intelligence and the kind of tracking and performance features in VLEs are all very well, but they seem rather dry and oriented to outcome rather than process. My feeling is that this is an area where some good innovative ideas could go a long way: in helping students to understand and manage their own learning and learning practices; and in helping teachers understand how students are really using the VLE, maybe challenging some assumptions, whether tacit or openly held. While neither of these two pieces of software is finished-and-ready for use, both stimulate further ideas and are stepping stones across this particular stream:
* eLAT: An Exploratory Learning Analytics Tool for Reflection and Iterative Improvement of Technology Enhanced Learning (Anna Lea Dyckhoff, Dennis Zielke, Mohamed Amine Chatti and Ulrik Schroeder), “best poster” prize
* Brick: Mining Pedagogically Interesting Sequential Patterns (Anjo Anjewierden, Hannie Gijlers, Nadira Saab and Robert De Hoog)

My top three papers are (comments follow):

* What’s an Expert? Using learning analytics to identify emergent markers of expertise through automated speech, sentiment and sketch analysis (Marcelo Worsley and Paulo Blikstein)
* Student Translations of Natural Language into Logic: The Grade Grinder Translation Corpus Release 1.0 (Dave Barker-Plummer, Richard Cox and Robert Dale)
* A Dynamical System Model of Microgenetic Changes in Performance, Efficacy, Strategy Use and Value during Vocabulary Learning (Philip I. Pavlik Jr. and Sue-Mei Wu)

“What’s an Expert” reports on work following the constructionist tradition of Seymour Papert. Its primary research question – “How can we use informal student speech and drawings to decipher meaningful ‘markers of expertise’ in an automated and natural fashion?” – made this the first of the presented papers grounded in such an educational theme. Whether for summative or formative purposes, this approach looks really interesting for anyone interested in making the connection between “authentic” behaviour and the more obscure growth of ability.

The “Grade Grinder” paper is ostensibly rather esoteric, being based on the release of data from an online self-test tool for the translation of logical statements from visual representations and natural language to formal logic. The interesting aspect is what they have been doing with this data, which is to try to uncover and better understand the causes of error and misconception. This is not a new kind of research question – science education research has considered misconception for years – but the methods employed to date have generally been interview-centric rather than data-centric. My curiosity about such questions, which explore the characterisation of misconception and its consequences – not to mention their relevance to educators – is why I chose this paper.

It isn’t so easy to justify choosing the last of my top three; the exploratory nature of the work is clear in the paper and was noted by the presenter. What made it stand out is that the researchers have been trying to look at the question of motivation, and I believe this is an important question in its own right and one where intelligent use of data mining could help to make visible some of the invisible causes and consequences of changes in motivation. On the whole, however, the discussion and results tended to say more about the use of strategies. They had also investigated simulations, which I find an appealing, if problematical, means of investigating the dynamical effects of proposed models of affect.

Curiously, all but one of the things that I found most interesting were not full papers and it seems that at least two of these had been rejected as such by the reviewers.

Two of the three invited speakers were from outside the EDM community. The conference opener was Barry Smyth from University College Dublin, a clearly entrepreneurial chap. His topic was social search and in particular his start-up, HeyStaks. Barry made much of statistics on the amount of time people spend on web search and the percentage of time the search is for something previously found by the searcher or a friend/colleague; they are indeed most sobering. HeyStaks is intended to address this and falls under the category of “social search”. Social search has been lurking just off the hype radar for a while but has the feel of a term that we’ll be hearing a lot of pretty soon, not least due to the release of Google+. HeyStaks takes a significantly different approach to Google+, being based on a browser plugin that sits over Google, Bing or Yahoo and communicates with the HeyStaks server to store the “staks”. HeyStaks also lets you re-find things from your own search-and-select history. A peer-to-peer version would be great but not so appealing to an entrepreneur. I can see HeyStaks being attractive for teaching and learning or research group use, and it would be interesting to see an analysis of the different perspectives on, and realisations of, “social search” more generally and the relative affordances for educational use.

The invited speaker for day 2 was Erik-Jan van der Linden, the CEO of Dutch visualisation software company MagnaView and a man with a history of research on data visualization at Amsterdam and Tilburg universities. MagnaView pitches variations of their software at the legal profession, education, etc, so far limited to secondary education in the Netherlands. This seems to have made a number of “what if” questions become more answerable but also stimulate more questions among the users. They are also experimenting with using user-intuitive actions to drive parameters for data mining, for example by allowing users to re-assign clustered items by drag/drop.

In conclusion: for all that this conference has some history, and to some extent some baggage, in AI and ITS, and in spite of the quite technically challenging nature of a lot of the work, I see this as a community ready to embrace those from other backgrounds with an interest in “learning science”, and one where some more people whose mindset is education-first will have a challenging but exciting time over the next few years.

Full proceedings are available for download from the EDM2011 site.

Weak Signals and Text Mining II – Text Mining Background and Application Ideas

Health warning: this is quite a long posting and describes ideas for work that has not yet been undertaken.

My previous post gave a broad introduction to what we are doing and why. This one explores the application of text mining after a brief introduction to text mining techniques in the context of a search for possible weak signals. The requirements, in outline, are that the technique(s) to be adopted should:

  • consume diverse sources to compensate for the surveillance filter;
  • strip out the established trends, memes etc;
  • not embed a-priori mental models;
  • discriminate relevance (reduce “noise”).

Furthermore, the best techniques will also satisfy aims such as:

  • replicability; they should be easily executed a number of times for different recorded source collections and longitudinally in time.
  • adoptability; it should be possible for digitally literate people to take up and potentially modify the techniques with only a modest investment of time in understanding them.
  • intelligibility; what it is that the various computer algorithms do should be communicable in language to an educated person.
  • parsimony; the simplest approach that yields results (Occam’s Razor) is preferred.
  • flexibility; tuning, tweaking etc and an exploratory approach should be possible.
  • amenability to automation; once a workable “machine” has been developed we would like to be able to treat it as a black box, either with simple inputs and outputs or one that can be executed at preset time intervals.
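To make the “strip out the established trends” requirement above more concrete, here is a deliberately simple sketch of one possible approach (my own illustration, not a description of any existing tool): score each term by how much more frequent it is in a recent source collection than in a background corpus, and keep the fastest-rising terms as candidate signals. The example texts and the smoothing constant are invented.

```python
# Sketch: frequency-ratio filter for candidate "rising" terms.
# A term common in recent sources but rare in the background corpus
# scores highly; established memes score near 1 and drop away.
from collections import Counter

def term_frequencies(texts):
    """Relative frequency of each whitespace-separated term."""
    counts = Counter(word.lower() for text in texts for word in text.split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def rising_terms(recent_texts, background_texts, smoothing=1e-4, top_n=5):
    """Terms whose relative frequency has grown most against the background."""
    recent = term_frequencies(recent_texts)
    background = term_frequencies(background_texts)
    # Smoothing avoids division by zero for terms absent from the background.
    scores = {term: freq / (background.get(term, 0.0) + smoothing)
              for term, freq in recent.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

background = ["the vle is used for content delivery",
              "the vle supports assessment and content"]
recent = ["learners mine their own activity data",
          "activity data informs reflection"]
print(rising_terms(recent, background))
```

A real implementation would need stop-word removal, stemming and far larger corpora, but even this toy version shows how the background corpus acts as the filter for established trends while no a-priori model of the domain is embedded.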

The approach taken will be less elaborate than the software built by Klerx (see “An Intelligent Screening Agent to Scan the Internet for Weak Signals of Emerging Policy Issues (ISA)“) but his paper has influenced my thinking.

Finally: this is not text mining research and the findings of importance are about technology enhanced learning.

Data Mining Strategies

Text mining is not a new field, and it is one with rather fuzzy boundaries and various definitions. As far back as 1999, Hearst considered various definitions (“Untangling Text Data Mining”) in what was then a substantially less mature field. As the web exploded, more applications and more implied definitions became apparent. I will not attempt to create a new definition nor to adopt someone else’s, although Hearst’s conception of “a process of exploratory data analysis that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known” nicely captures the spirit.

The methods of data mining, of which text mining is a part, can be crudely split into two:

  • methods based on an initial model and the deduction of the rules or parameters that fit the data to the model. The same strategy is used to fit experimental data to a mathematical form in least squares fitting, for example to an assumed linear relationship.
  • methods that do not start with an explicit model but use a more inductive approach.

Bearing in mind our requirement to avoid a priori mental models, an inductive approach is clearly most appropriate. Furthermore, to adopt a model-based approach to textual analysis is a challenging task involving formal grammars, ontologies, etc and would fail the test of parsimony until other approaches have been exhausted. Finally: even given a model-based approach to text analysis it is not clear that models of the space in which innovation in technology enhanced learning occurs are tractable; education and social change are deeply ambiguous, fuzzy and relevant theories are many, unvalidated and often contested. The terms “deductive” and “inductive” should not be taken too rigidly, however, and I am aware that they may be applied to different parts of the methods described below.

Inductive approaches, sometimes termed “machine learning”, are extensively used in data mining and may again be subjected to a binary division into:

  • “supervised” methods, which make use of some a priori knowledge, and
  • “unsupervised” methods, which simply look for patterns.

Supervision should be understood rather generally; the typical realisation of supervision is the use of a training set of data that has previously been classified, related to known outcomes or rated in some way. The machine learning then involves induction of the latent rules by which new data can be classified, outcomes predicted etc. This can be achieved using a range of techniques and algorithms, such as artificial neural networks. A supervised learning approach to detecting insurance fraud might start with data on previous detected cases to learn the common features. A text mining example is the classification of the subject of a document given a training set of documents with known subjects.
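
To make the supervised idea concrete, here is a minimal sketch (in Python, used purely for illustration) of the document classification example: a toy multinomial naive Bayes classifier trained on a set of documents with known subjects. The documents, tokens and labels are entirely invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Train a multinomial naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Return the most probable label for a new bag of words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior plus log likelihood with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training set: two toy subjects
training = [
    (["learning", "design", "pedagogy"], "education"),
    (["curriculum", "pedagogy", "assessment"], "education"),
    (["server", "database", "query"], "computing"),
    (["query", "index", "database"], "computing"),
]
model = train_nb(training)
```

New documents can then be labelled with `classify_nb(tokens, model)`; real work would of course use a far larger training set and an established implementation.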

Unsupervised methods can take a range of forms and the creation of new ones is a continuing research topic. Common methods use one of several definitions of a measure of similarity to identify clusters. Whether the algorithm divides a set in successive binary splits, aggregates into overlapping or non-overlapping clusters, etc. will tend to give slightly different results.

From this synopsis of inductive approaches it seems like we do not have an immediately useful strategy to hand. By definition, we do not have a training set for weak signals (although it could be argued that there are latent indicators of a weak signal and that we would gain some insight by looking at the history of changes that were once weak signals). The standard methods are oriented towards finding patterns, regularities, similarities, making predictions given previous patterns, which are not weak signals by definition.

For discovery of possible weak signals, it appears that we need to look from the opposite direction: to find the regularities so that they can be filtered out. Another way of expressing this is to say that it is outliers that will sometimes contain the information we most value, which is not usually the case in statistics. A concise description of outliers from Hawkins is as applicable to our weak signals work as it is to general statistics: “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism” (Identification of Outliers / D.M. Hawkins, ISBN 041221900X). Upon this definition a case for the exclusion of outliers in a statistical treatment may often be built, whereas for us an outlier is a pointer to a subject for further exploration, dissemination or hypothesis.

Actually, it is likely to be more subtle than simply filtering out regularities and to require some cunning in the design of our mining process. I will describe some ideas later. Some of these describe a process that will filter out the largest regularities while retaining just-detectable ones; maybe a larger dataset than a human can reasonably absorb will show these up. Others will look at differences in the regularities between different domains. Following from this, it is not the case that finding regularities is bad, rather that it may be necessary to stray a little from normal practice, although borrowing as much as possible. “Two Non-elementary Techniques”, below, briefly outlines two relevant approaches to finding structure.

Ensembles

It should be quite clear that any kind of search for potential weak signals is beset by indeterminacy inherent in the socio-technical world in which they arise. I contend that any approach to finding such signals is also easily challenged over reliability and validity. Referring back to “Weak Signals and Text Mining I”, the TELMap human sources (Delphi based) and recorded sources (text mining based) will do the best they can but neither will be able to mount a robust defence of any potential weak signal except in retrospect. This is why we say “potential” and emphasise that discourse over such signals is essential.

One way of mitigating this problem is to take inspiration from the use of “ensembles” in modeling and prediction in complex systems such as the weather. The idea is quite simple; use a range of different models or assumptions and either take a weighted average or look for commonality. The assumption is that wayward behaviour arising from approximations and assumptions, which are practically necessary, can be caught.

A slightly different perspective on dealing with indeterminacy is expressed by Petri Tapio in “Disaggregative policy Delphi: Using cluster analysis as a tool for systematic scenario formation” (Technological Forecasting & Social Change 70 (2002) 83–101):

“Henrion et al. go as far as suggesting to run the analysis with all available clustering methods and pick up the one that makes sense. From the standard Popperian deductionist point of view adopted in statistical science, this would be circular reasoning. But from the more Newtonian inductionist point of view adopted in many technical and qualitatively oriented social sciences, experimenting [with] different methods would also seem as a relevant strategy, because of the Dewian ‘learning by doing’ background philosophy.”

The combination of these two related ideas will be adopted:

  • Bearing in mind the risk of re-introduction of the “mentality filter” (see part I), various methods and information sources will be experimented with to look for what “makes sense”. In an ideal scenario, several people with different backgrounds would address the same corpus to compensate for the mentality filter of each.
  • Cross-checking between the possible weak signals identified in the human and recorded sources approaches and between text mining results (even those that don’t “make sense” by themselves) will be undertaken to look for more defensible signals by analogy with ensemble methods.

Having a human in the process – seeing what makes sense – should help to spot juxtaposition of concepts, dissonance to context, … etc as well as seeing when the method just seems to be working. It will also help to eliminate false-positives, e.g. an apparently new topic might actually be a duplicated spelling mistake.

A Short Introduction to Elementary Text Mining Techniques

The starting point, whatever specific approach is adopted, will always be to process some text, which will be referred to as a “document” whether or not this term would be used colloquially, to generate some statistics upon which computation can occur. These statistics might be quite abstract measures used to assess similarity or they might be more intelligible. On the whole, I prefer the latter since the whole point of the work is to find meaning in dissimilarity. The mining process will consider a large number, where “large” may start in the hundreds, of documents in a collection from a particular source. The term “corpus” will be used for collections like this.

I will be drawing from the standard toolkit of text processing to get from text to statistics, comprising the separate operations described below. These are “elementary” in the sense that they don’t immediately lead us to useful information. They are operations suited to a “bag of words” treatment, which seems quite crude but is common practice; it has been shown to be good enough for many applications, it is computationally tractable with quite large corpora and it lends itself to relatively intelligible results. In “bag of words”, word order is almost totally neglected and there is no concept of the meaning of a sentence. The meaning of a document becomes something that is approached statistically rather than through the analysis of the linguistic structure of sentences. Bag-of-words is just fine for our situation as we don’t actually want the computer to “understand” anything and we do want to apply statistical measures over moderate-sized corpora.

“Stop Word” Removal

Some of the words used in a document indicate little or nothing of what the document is about in a bag-of-words treatment, although they may be highly significant in a sentence. “No” is a good example of a word with different significance at sentence and bag-of-words levels. It is easy to call other examples to mind: or, and, them, of, in… In the interest of processing efficiency and the removal of meaningless and distracting indicators of similarity/difference, stop words should be removed at an early stage rather than trying to filter out what they cause later in the process. Differences in the occurrence of stop words can be considered to be 100% “noise” but they are easily filtered out at an early stage. Standard stop-word lists exist for most languages and are often built into software for text mining, indexing, etc. It is possible that common noise-words will be discovered while looking for possible weak signals but these can be added to the stop-list.
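
A stop-word filter is, at heart, a set-membership test. A minimal sketch in Python (the stop-word list here is a tiny invented one, not one of the standard lists built into text mining software):

```python
# A tiny illustrative stop-word list; real work would use a standard list.
STOP_WORDS = {"a", "an", "and", "in", "no", "of", "or", "the", "them", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "no signs of change in the use of learning design".split()
filtered = remove_stop_words(tokens)
# filtered is now ['signs', 'change', 'use', 'learning', 'design']
```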

Tokenisation

Tokenisation involves taking a stream of characters – letters, punctuation, numbers – and breaking it up into discrete items. These items are often what we would identify as words but they could be sentences, fixed length chunks or some other definable unit. Sometimes so-called “n-grams” are created in tokenisation. I will generally use single word tokens but some studies may include bi-grams or tri-grams. For example, all of the following might appear as items in a bag-of-words using bi-gram tokenisation: “learning”, “design”, “learning design”.
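
Tokenisation and n-gram generation might be sketched as follows (illustrative Python; real work would use the tokenisers built into text mining software):

```python
import re

def tokenise(text):
    """Split text into lower-case word tokens, discarding punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Produce space-joined n-grams from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenise("Learning design, revisited.")
# A bag of words with both single words and bi-grams:
bag = tokens + ngrams(tokens, 2)
```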

Part-of-Speech Tagging and Filtering

For the analysis of meaning in a sentence, the tagging of the part of speech (POS) of each word is clearly important; for bag-of-words text mining it is not. I expect to use POS tagging in only a few applications (see below). When used, it will probably be accompanied by a filtering operation to limit the study to nouns or various forms of verb (VB* in the Penn Treebank POS tags scheme) in the interest of relevance discrimination.

Stemming

Many words change according to the part of speech and have related forms which effectively carry similar meaning in a bag of words. For example: learns, learner, learning, learn. This will generally equate to “noise”, at best a minor distraction and at worst something that hides a potential weak signal by dissipating a detectable signal concept into lexically-distinct terms. In general it is statistically desirable to reduce the dimensionality of the variables, especially if they are not independent, and stemming does this since each word/term occurrence is a variable.

The standard approach is “stemming”, which reduces related words to what is generally their common initial part (e.g. “learn”), although it often leads to a stem that is not actually a word. There are a variety of ways this can be done, even within a single language. The Porter stemmer is widely used for English.
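
To illustrate the idea (not the Porter algorithm itself, which involves several phases of rules), here is a deliberately crude suffix-stripping stemmer; the suffix list is an invented toy:

```python
# Suffixes checked longest-first; this is illustration only, not Porter stemming.
SUFFIXES = ("ers", "ing", "er", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this toy, “learns”, “learner”, “learning” and “learn” all reduce to the single term “learn”, which is exactly the dimensionality reduction wanted.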

Document Term Statistics

A simple but practically-useful way to generate some statistics is to count the occurrence of terms (words or n-grams, having been filtered for stop-words and stemmed). A better measure, which compensates for differences in document length, is the term frequency rather than the count; term frequency may be expressed as the percentage of terms occurring in the document that are of a given type. Sometimes a simple yes/no indicator of occurrence may be useful.

A potentially more interesting statistic for a search for possible weak signals is “term frequency–inverse document frequency” (tf-idf). This opaquely-named measure is obtained by multiplying the term frequency by the logarithm of the inverse of the fraction of documents that contain the term. This elevates the measure if the term is sparsely distributed among documents, which is exactly what is needed to emphasise outliers.
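
As a sketch of the computation (one common formulation among several tf-idf variants; illustrative Python over a tokenised corpus):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf for each term in each document of a tokenised corpus.

    tf is the term's fraction of the document; idf is log(N / df), so a
    term appearing in every document scores zero everywhere.
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf([["learn", "design"], ["learn", "mooc"]])
```

Here “learn” appears in both documents and so scores zero, while the document-specific terms are elevated.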

Given one of these kinds of document term statistic it is possible to hunt for some interesting information. This might involve sorting and visually reviewing a table of data, resorting to a graphical presentation, some ad-hoc recipe, use of a structure-discovery algorithm (e.g. clustering) that computes a “distance measure” between documents from the term statistics, or a combination of these.

Synonyms, Hyponyms, …

For the reasons outlined in the section on stemming, it can be helpful to reduce the number of terms by mapping several synonyms onto one term or hyponyms onto their hypernym (e.g. scarlet, vermilion, carmine, and crimson are all hyponyms of red). The Wordnet lexical database contains an enormous number of word relationships, not just synonyms and hypo/hyper-nyms. I do not intend to adopt this approach, at least in the first round of work, as I fear that it will be hard to be master of the method rather than the reverse. For example, Wordnet makes “imprinting” a hyponym of “learning” – “imprinting — (a learning process in early life whereby species specific patterns of behavior are established)” – which I can see as a linguistically sensible relationship but one with the unintended consequence of suppressing potential weak signals.

An alternative use of Wordnet (or similar database) would be to expand one or more words into a larger set. This might be easier to control and would quickly generate what could be used as a set of query terms, for example to search for further evidence once a possible weak signal shortlist has been created. One of my “Application Scenarios”, below, proposes the use of expansion.
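
A sketch of such expansion, using a small hand-made table in place of Wordnet (the terms and groupings here are invented purely for illustration):

```python
# Hypothetical expansion table standing in for a lexical database such as
# Wordnet; the groupings are illustrative only, not Wordnet's actual synsets.
EXPANSIONS = {
    "teaching": ["instruction", "tuition", "pedagogy"],
    "learning": ["acquisition", "study"],
}

def expand_terms(seed_terms, table=EXPANSIONS):
    """Expand seed terms into a larger, de-duplicated set of query terms."""
    expanded = set(seed_terms)
    for term in seed_terms:
        expanded.update(table.get(term, []))
    return sorted(expanded)
```

The output of `expand_terms(["teaching"])` could then be used directly as a set of search queries once a possible weak signal shortlist exists.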

Two Non-elementary Techniques

The section “Data Mining Strategies” outlined some of the strengths and weaknesses of common approaches in relation to our quest for possible weak signals. It stressed the need to work with methods that focus on finding structure and regularities alongside a search for outliers. Two relevant techniques that go beyond the relative simplicity of the elementary text mining techniques outlined above are clustering and topic modeling. “Topic modeling” is a relatively specific term whereas “clustering” covers more diversity.

Clustering methods – many being well-established – may be split into three categories: partitioning methods, hierarchical methods and mapping methods. These usually work by computing a similarity (or distance) between the items being clustered. In our case it is the document term statistics that will give a location for each document.

The hierarchical approach is expected to make visible the most significant structure viewed from the top down (although the algorithms work from the bottom up in “agglomerative” hierarchical clustering), and so is not expected to lend itself to a search for possible weak signals, although it is appropriate for scientific studies and for document subject classification, where we naturally use hierarchical taxonomies.

Partitioning does not coerce the structure into a hierarchy and so may be expected to leave more of the detail in place. There are a number of different objectives that may be chosen for partitioning and potentially several algorithms for each. The “learning by doing” philosophy noted in the section “Ensembles” will be adopted. An old but useful description of the mathematics of some established clustering algorithms has been provided by Anja Struyf, Mia Hubert, Peter Rousseeuw in the Journal of Statistical Software Vol 1 Issue 4.
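
In the “learning by doing” spirit, here is a toy partitioning sketch – a simple leader-style clustering over term-count vectors with cosine similarity. It is far simpler than the established algorithms described by Struyf et al. and is intended only to show the mechanics; the documents are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def leader_cluster(docs, threshold=0.5):
    """Assign each tokenised document to the first cluster whose leader it
    resembles, or start a new cluster: a crude partitioning method."""
    leaders, clusters = [], []
    for i, doc in enumerate(docs):
        vec = Counter(doc)
        for j, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[j].append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters

docs = [
    ["learning", "design", "pattern"],
    ["learning", "design", "tool"],
    ["database", "index", "query"],
]
```

Note that the result depends on document order and the threshold, which is precisely the kind of tuning the exploratory approach anticipates.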

Hierarchical and partitioning approaches have been popular for many years whereas mapping approaches are a more recent innovation, probably due to greater computing power requirements. Self Organising Maps and multi-dimensional scaling are two common “mapping approaches”. They are probably best understood by thinking of the problem of representing the document term statistics (a table with many columns, each representing the occurrence of a term, and many rows, one for each document) in two dimensions. This is clustering in the sense that aspects of sameness among the ‘n’ dimensions of the columns are aggregated. Although this process of dimension-reduction has the desirable property of making the data easier to visualise, it may be unsuitable for the discovery of possible weak signals.

Clustering is a stereotypical unsupervised learning approach; there is no embedded model of the data. All that is needed is a means to compute similarity, hence the same methods can be applied to experimental data, text, etc. Topic Modeling, however, introduces a theory of how words, topics and documents are related. This has two important consequences: the results of the algorithm(s) may be more intelligible, and the model constrains the range of possible results. The latter may be either desirable or undesirable depending on the validity of the model. Intelligibility is improved because we are better able to relate to the concept of a topic than to some abstract statement of statistical similarity.

Probabilistic Topic Models (see Steyvers and Griffiths, pdf) assume that a bag of words, each word having a frequency value, can be used to capture the essence of a topic. A given document may cover several topics to differing degrees, leading to a document term frequency in the obvious way. The statistical inference required to work backwards from a set of documents to a plausible set of topics – generating the word frequency weightings for each topic and the topic percentages in each document – requires some mathematical cunning but has been programmed for R (see Grün and Hornik’s topicmodels package) as well as MATLAB (see Mark Steyvers’ toolbox).

Probabilistic Topic Models could be useful to attenuate some of the noisiness expected with approaches working purely at the document-term level. This might make identification of possible weak signals easier; “discriminate relevance” was how it was phrased in the statement of requirements above. It is expected that some tuning of parameters, especially the number of topics, will be required. There is also a random element in the algorithms employed, which means that results may differ between runs. The margin of the stable part of the results may well contain pointers to possible weak signals. As for clustering, a “learning by doing” approach will be used, taking care not to introduce a mentality filter.

Sources of Information for Mining

One of the premises of text mining is access to relatively large amounts of information and the explosion of text on the web is clearly a factor in an increasing interest in text mining over the last decade and a half. There are both obvious and non-obvious reasons why an unqualified “look on the web” is not an answer to the question “where should I look for possible weak signals”. Firstly, the web is simply too mindbogglingly big. More subtly, it is to be expected that differences in style and purpose of different pieces of published text would hide possible weak signals; some profiling will be required to create corpora that contain comparable documents within each. Finally, crawling web pages and scraping out the kernel of information is a laborious and unreliable operation when you consider the range of menus, boilerplate, advertisements, etc that typically abound.

Three kinds of source get around some of these issues:

  • blogs occur with a reduced range of style and provide access to RSS feeds that serve-up the kernel of information as well as publication date;
  • journal abstracts generally have quite a constrained range of style, are keyword-rich and can be obtained using RSS or OAI-PMH to avoid the need for scraping;
  • email list archives are less widely exposed as RSS (though feeds are sometimes available) and there is often stylistic consistency, although quoted sections, email “signatures” and anti-spam scan messages may present material in need of removal.

My focus will be on blogs and journal abstracts, which are expected to generally contain different information. RSS and OAI-PMH are important methods for getting the data with a minimum of fuss but are not the panacea for all text acquisition woes. RSS came out of news syndication and to this day RSS feeds serve up only the most recent entries, so any study that looks at change over time using RSS to acquire the information will generally have to be conducted over a period of time. Atom, a later and similar feed grammar, is sometimes available, but the paging and archival features imagined in RFC5005 generally are not. Even the RSS feeds provided by journal publishers are limited to the latest issue and there is usually no obvious means to hack a URL to recover abstracts from older issues. The OAI-PMH provides a means to harvest abstract (etc.) information over a date range and there is even an R package that implements OAI-PMH, but many publishers do not offer OAI-PMH access.
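
Parsing the items out of an RSS feed is straightforward once the XML is in hand. A minimal Python sketch (the feed fragment below is invented; a real harvest would first fetch the XML over HTTP, e.g. with urllib):

```python
import xml.etree.ElementTree as ET

# An invented, minimal RSS 2.0 fragment standing in for a fetched feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example TEL blog</title>
  <item><title>First post</title>
    <pubDate>Mon, 05 Dec 2011 10:00:00 GMT</pubDate>
    <description>Some thoughts on learning design.</description></item>
  <item><title>Second post</title>
    <pubDate>Tue, 06 Dec 2011 10:00:00 GMT</pubDate>
    <description>More thoughts.</description></item>
</channel></rss>"""

def extract_items(rss_text):
    """Pull (title, date, description) tuples out of an RSS channel."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"),
         item.findtext("pubDate"),
         item.findtext("description"))
        for item in root.iter("item")
    ]

items = extract_items(SAMPLE_RSS)
```

The description elements then become the “documents” fed into the elementary operations above.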

A final problem which is specific to blogs is how to garner the URLs for the blogs to be analysed. It seems that all methods by which a list could be compiled are flawed; the surveillance and mentality filters seem unavoidable.

The bottom line is: there will be some work to do before text mining can begin.

On Tools, Process and Design…

The actual execution of the simple text mining approaches outlined in the “short introduction” is relatively straightforward. There are several pieces of software, both code libraries/modules and desktop applications, that can be used after a little study of basic text mining techniques. I plan on using R with the “tm” package (Feinerer et al) for the “elementary” operations, alongside RapidMiner. The former requires programming experience whereas the latter offers a graphical approach similar to Yahoo Pipes. In principle, the software used should be independent of the text mining process it implements, which can be thought of in flow-chart terms, so long as standard algorithms (e.g. for stemming) are used. In practice, of course, there will be some compromises. The essence of this is that the translation from a conceptual to an executable mining process is an engineering task.

The critical, less-well-defined and more challenging task is to decide how the toolkit of text-mining techniques will be applied. This starts with considering what text sources will be used, moves through how to “clean” text and then to how tokenisation, term frequency weights etc. will be used, and concludes with how clustering (etc.) and visualisation should be deployed. In a sense, this is “designing the experiment” – but I use the term “application scenario” – and it will determine the meaningfulness of the results.

Some Application Scenarios

This section speculates on a number of tactics that might yield possible weak signals. Each application scenario’s title will be used as its name. Future postings will develop and apply one or more of these application scenarios.

Out-Boundary Crossing

Idea: Strong signals in other domains are spotted by a minority in the TEL domain who see some relevance of them. Discover these.

Notes: The signal should be strong in the other domain so that there is confidence that it is in some sense “real”.

Operationalisation: Extract low-occurrence nouns from a corpus of TEL domain blog posts and cross-match to search term occurrence in Google Trends.

Challenges: Google Trends does not provide an API for data access.

In-Boundary Crossing

Idea: Some people in another domain (e.g. built environment, architectural design) appear to be talking about teaching and learning. Discover what aspect of their domain they are relating to ours.

Notes: This approach clearly cannot work with text from the education domain.

Operationalisation: Use Wordnet to generate one or more sets of keywords relevant to the practice of education and training. Use a corpus of journal abstracts (separate corpora for different domains) and identify documents with high-frequency values for the keyword set(s).

Challenges: It may be difficult to eliminate papers about education in the subject area (e.g. about teaching built environment) other than by manual filtering.

Novel Topics

Idea: Detect newly emerging topics against a base-line of past years.

Notes: This is the naive application scenario, although there are many ways in which it can be operationalised.

Operationalisation: Against a corpus of several previous years, compare the topics identified by Probabilistic Topic Modeling with those for the last six months or year. Both TEL blogs and TEL journal abstracts could be considered.

Challenges: It may be difficult to acquire historical text that is comparable – i.e. not imbalanced due to differences in source or quantity.

Parallel Worlds

Idea: There may be different perspectives between sub-communities within TEL: practitioners, researchers and industry. Identify areas of mismatch.

Notes: The mismatch is not so much a source of possible weak signals as an exploration of possible failure in what should be a synergy between sub-communities.

Operationalisation: Compare the topics displayed in email lists for institutional TEL champions, TEL journal abstracts and trade-show exhibitor profiles.

Challenges: Different communities tend to use communications in different ways: medium, style, etc, which is reflected in the different text sources. This may well over-power the capabilities of text mining. Web page scraping will be required for exhibitor profiles and maybe email list archives.

Rumble Strip

A “rumble strip” provides an alert to drivers when they deviate from their lane.

Idea: Discover differences between a document and a socially-normalised description of the same topic.

Notes: –

Operationalisation: Use an online classification services (e.g. OpenCalais) to obtain a set of subject “tags” for each document. Retrieve and merge the wikipedia entries relevant to each. Compare the document term frequencies for the original document and the merged wikipedia entries.

Challenges: Documents are rarely about a single topic; the practicability of this application scenario is slim.

Ripples on the Pond

Idea: A new idea or change in the environment may lead to a perturbation in activity in an established topic.

Notes: Being established is key to filtering out hype; these are not new topics.

Operationalisation: Identify some key-phrase indicators for established topics (e.g. “learning design”). Mine journal abstracts for the key phrase and analyse the time series of occurrence. Use OAI-PMH sources to provide temporal data.

Challenges: The results will be sensitive to the means by which the investigated topics are decided.
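
The time-series analysis at the heart of this scenario could start as simply as counting key-phrase occurrences per year. A sketch with invented abstracts (a real study would harvest dated abstracts via OAI-PMH, as noted above):

```python
from collections import Counter

def phrase_series(dated_docs, phrase):
    """Count occurrences of a key phrase per year across (year, text) pairs."""
    series = Counter()
    for year, text in dated_docs:
        series[year] += text.lower().count(phrase.lower())
    return dict(sorted(series.items()))

# Invented toy abstracts for illustration only.
abstracts = [
    (2009, "A review of learning design tools."),
    (2010, "Learning design in practice; learning design patterns."),
    (2010, "Assessment analytics."),
]
series = phrase_series(abstracts, "learning design")
```

Perturbations would then show up as deviations from the established trend in the resulting series, subject to normalisation for the number of abstracts per year.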

Shifting Sands

Idea: Over time the focus within a topic may change although writers would still say they are talking about the same subject. Discover how this focus changes as a source of possible weak signals.

Notes: Although the scenario considers an aggregate of voices in each time period, the voices of individuals may be influential on the results.

Operationalisation: Use key-phrases as for “Ripples on the Pond” but use Probabilistic Topic Modeling with a small number of topics. Analyse the drift in the word-frequencies determined for the most significant topics.

Challenges: –

Alchemical Synthesis

Idea: Words being newly associated may be early signs of an emerging idea.

Notes: –

Operationalisation: Using single-word nouns in the corpus, compute an uplift for n-grams that quantifies the n-gram frequency compared to what it would be by chance. Sample a corpus of TEL domain blog posts and look for bi-grams or tri-grams with unexpected uplift.

Challenges: –
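
The uplift computation for bi-grams might be sketched as follows (illustrative Python): a simple lift measure comparing the observed bi-gram count to the count expected if the two words occurred independently.

```python
from collections import Counter

def bigram_uplift(tokens):
    """Ratio of observed bi-gram count to the count expected under the
    assumption that the two words occur independently."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    n_bi = n - 1                      # number of bi-gram slots
    uplift = {}
    for (w1, w2), count in bigrams.items():
        expected = (unigrams[w1] / n) * (unigrams[w2] / n) * n_bi
        uplift[" ".join((w1, w2))] = count / expected
    return uplift
```

For example, in the invented token stream `["open", "badge", "open", "badge", "open", "source"]` the bi-gram “open badge” occurs far more often than chance would predict, and that is the kind of pair this scenario would surface.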

Final Remarks

As was noted at the start, the implementation of these ideas is not yet undertaken. I may be rash in publishing such immature work but I do so in the hope that constructive criticism or offers of collaboration might arise.

There is much more that could be said about issues, challenges and what is kept out of scope, but two points warrant comment: I am only looking at text in English and recognise that this gives a biased set of possible weak signals; and there are other analytical strategies, such as social network analysis, that provide interesting results both independently of and alongside the kind of topic-oriented analysis I describe.

I hope to be able to report some possible weak signals in due course for comment and debate. These may appear on the TELMap site but will be signposted from here.