Adam Cooper » horizon scan – Cetis Blogs

Data Protection – Anticipating New Rules (2 February 2012)

On January 25th 2012, the European Commission released its proposals for significant reform of data protection rules in Europe (drafts had been leaked in late 2011). These proposals have been largely welcomed by the Information Commissioner’s Office, although it also recommends further thought on some of them. The dramatic changes in the scale and scope of handling personal information in online retailing and social networking since the 1990s, when the current rules were implemented, are an obvious driver for change. The rise of “cloud computing” is a related factor.

What might this mean for the UK education system, especially for those concerned with educational technology?

On the whole, the answer is probably a fairly bland “not much”, since we are, as a sector, pretty good at being responsible with personal data. The sector’s ethic, regardless of legislation, is to be institutionally concerned and careful and, provided enough time is available to adapt systems (of working and of IT), this should be a relatively low-impact change. There are, however, a few implications worthy of comment…

The Principle of Data Portability

Unless you know nothing about CETIS, it should come as no surprise that “data portability” caught my eye. EC Fact Sheet No. 2 says:

‘The Commission also wants to guarantee free and easy access to your personal data, making it easier for you to see what personal information is held about you by companies and public authorities, and make it easier for you to transfer your personal data between service providers – the so-called principle of “data portability”.’

Notice that this includes “public authorities”. Quite how this principle will affect practice remains to be seen but it does appear to have implications at the level of individual educational establishments and sector services such as the Learning Records Service (formerly MIAP). It is conceivable that this requirement will be satisfied by “download as HTML”, a rather lame interpretation of making it easier to transfer personal data, but I do hope not.

So: are there candidate interoperability standards? Yes, there are:

  • LEAP2A for e-portfolio portability and interoperability,
  • A European Standard, EN 15981, “European Learner Mobility Achievement Information” (an earlier open-access version is available as a CEN Workshop Agreement, CWA 16132)

These do not cover absolutely everything you might wish to “port” but widespread adoption as part of demonstrating compliance with a legislative “data portability” requirement is an option that is available to us.

It is also worth noting Principle 7 of “Information Principles for the UK Public Sector” (pdf) – see also my previous posting – which is entitled “Citizens and Businesses Can Access Information About Themselves” and recommends that information strategies should go “… beyond the legal obligations” and identify opportunities “to proactively make information about citizens available to them by default”, noting that this would remove the cost of processes and systems for responding to Subject Access Requests. I hope that this attitude is embraced and that the software is designed on a “give them everything” principle rather than “give them the minimum we think the law requires”. Software vendors should be thinking about this now.

There are some interesting possibilities for learner mobility if learners have a right to access and transfer fine grained achievement and progress information, especially where that is linked to well defined competence (etc) structures. Can we imagine more nomadic learners, especially those who may be early adopters of offerings from the kind of new providers that David Willetts and colleagues are angling for?

The Right to be Forgotten

This right is clearly aimed squarely at the social network hubs and online retailers (see the EC Fact Sheet No. 3, pdf). It isn’t very likely that anyone would want to have their educational experiences and achievements forgotten unless they plan to “vanish”. Indeed, it would be surprising if existing records retention requirements were changed, and the emerging trend of having secure document storage and retrieval services under user control – e.g. DARE – seems set to continue and be the way we manage this issue cost-effectively.

The right to be forgotten may be more of a threat to realising the “learning analytics” dream, even if only in adding to existing uncertainty, doubt and sometimes also fear. We need some robust and widely accepted protocols to define legally and ethically acceptable practice.

Uniformity of Legislation

The national laws that were enacted to meet the existing data protection requirements are all different, and the new proposal is to have a single uniform set of rules. This makes sense from the point of view of a multi-nation business, although it will not be without critics. This is just one factor that could make a pan-European online Higher Education initiative easier to realise, whether a single provider or a collaboration. I perceive signs that people are moving closer to viable approaches to large-scale online distance education using mature technologies, and possibly English as the language of instruction and assessment; these look like looming “low-end disruptions” (see the Wikipedia article on “Disruptive Innovation”) for the academy as we know it. [Look out for an interview with Seb Schmoller which has influenced my views, due to be published soon on the JISC Observatory website.]

These are, of course, just some initial impressions of the proposals. I am sure there is a great deal that I have missed from a fairly quick scan of material from the Commission, and there is bound to be a lot of carping from those with businesses built around exploiting personal data, so the final shape of things might be quite different.

Preparing for a Thaw – Seven Questions to Make Sense of the Future (8 September 2011)

“Preparing for a Thaw – Seven Questions to Make Sense of the Future” is the title of a workshop at ALT-C 2011. The idea of the workshop was to use a simple conversational technique to capture perspectives on where Learning Technology might be going, hopes and fears, and views on where education and educational technology should be going. The abstract for the workshop (ALT-C CrowdVine site) gives a little more information and background, and there is also a handout available that extends this. The workshop had two purposes: to introduce participants to the technique with a view to possible use in their organisations, and to gather some interesting information on the issues and forces shaping the future of learning technology. All materials have Creative Commons licences.

The short version of the Seven Questions used in the session (it is usual to use variants of one form or another) is:

  • Questions for the Oracle about 2025 [what would you ask?]
  • What would be a favourable outcome by 2025?
  • What would be an unfavourable outcome?
  • How will culture and institutions need to change?
  • What lessons can we learn from the past?
  • Which decisions need to be made and actions taken?
  • If you had a “Magic Wand”, what would you do?

Fourteen people, predominantly from Higher Education, attended the workshop; their responses to the “Seven Questions” are all online at http://is.gd/7QResponses and there is an online form to collect further responses at http://is.gd/7Questions that is now open to all. This blog post is a first reaction to the responses made during the workshop, where a peer-interviewing approach was used. A more considered analysis will be conducted on September 14th, taking account of any further responses gathered by then.

[Image: Wordle of all responses to the “Seven Questions” made during the ALT-C 2011 workshop. Four words were excluded: “learning”, “education”, “technology”, “technologies”.]

My quick take on the responses is that there are about half a dozen themes that recur and a few surprising ideas. These appear to be:

  • Universal and affordable access to education, and the avoidance of a situation where access to technological advances favours one section of society (or one region), was a recurring concern. There was support for ensuring access to connectivity and hardware for everyone, and for this to be an entitlement. This was a very strong theme in the “if you had a magic wand” responses.
  • The need for improved digital literacies amongst teaching staff, but also across the institution as a whole, came out several times. Related to this (but distinct from it) is a desire to be more effective with teacher education and staff development.
  • There were questions about whether there would be “learning technologies” in 2025 and whether there would still be “learning technologists”, at least as defined by their current role.
  • The transformation of assessment and accreditation was also drawn out as an uncertainty.
  • Several responses wondered about the dominant devices that would be used in 2025 and the kind of interface (mouse, gesture,… what next?).
  • An increasing potential role for using data was tempered by concerns about unethical or exploitative use of collected data.
  • There was interest in multi-directional, collaborative education and the role of technology in that.
  • The risk of educational institutions holding onto established (old) models came out several times.
  • I wasn’t the only person to mention “interoperability”.

The workshop participants seemed to enjoy the approach and I was pleased at how well the peer interviewing worked, although I can see that this might not be right for all groups. I’ll also be watching out for differences in response that might be present between peer interview and solo completion of the online form. One hour for the introduction, reciprocal interviews and closing discussion was rather restrictive, but it seems to have surfaced some good materials and it certainly gave a good indication of what could be achieved with a little more time.

A more considered and detailed write-up will be published soon; I’ll add a comment.

Weak Signals and Text Mining II – Text Mining Background and Application Ideas (12 May 2011)

Health warning: this is quite a long posting and describes ideas for work that has not yet been undertaken.

My previous post gave a broad introduction to what we are doing and why. This one explores the application of text mining after a brief introduction to text mining techniques in the context of a search for possible weak signals. The requirements, in outline, are that the technique(s) to be adopted should:

  • consume diverse sources to compensate for the surveillance filter;
  • strip out the established trends, memes etc;
  • not embed a priori mental models;
  • discriminate relevance (reduce “noise”).

Furthermore, the best techniques will also satisfy aims such as:

  • replicability; they should be easily executed a number of times for different recorded source collections and longitudinally in time.
  • adoptability; it should be possible for digitally literate people to take up and potentially modify the techniques with only a modest investment in time to understand topic mining techniques.
  • intelligibility; what it is that the various computer algorithms do should be communicable in language to an educated person.
  • parsimony; the simplest approach that yields results (Occam’s Razor) is preferred.
  • flexibility; tuning, tweaking etc and an exploratory approach should be possible.
  • amenable to automation; once a workable “machine” has been developed we would like to be able to treat it as a black box, either with simple inputs and outputs or one that can be executed at preset time intervals.

The approach taken will be less elaborate than the software built by Klerx (see “An Intelligent Screening Agent to Scan the Internet for Weak Signals of Emerging Policy Issues (ISA)“) but his paper has influenced my thinking.

Finally: this is not text mining research and the findings of importance are about technology enhanced learning.

Data Mining Strategies

Text mining is not a new field, and it is one with rather fuzzy boundaries and various definitions. As far back as 1999, Hearst considered various definitions (“Untangling Text Data Mining“) in what was then a substantially less mature field. As the web exploded, more applications and more implied definitions have become apparent. I will not attempt to create a new definition nor to adopt someone else’s, although Hearst’s conception of “a process of exploratory data analysis that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known” nicely captures the spirit.

The methods of data mining, of which text mining is a part, can be crudely split into two:

  • methods based on an initial model and the deduction of the rules or parameters that fit the data to the model. The same strategy is used to fit experimental data to a mathematical form in least squares fitting, for example to an assumed linear relationship.
  • methods that do not start with an explicit model but use a more inductive approach.

Bearing in mind our requirement to avoid a priori mental models, an inductive approach is clearly most appropriate. Furthermore, to adopt a model-based approach to textual analysis is a challenging task involving formal grammars, ontologies, etc., and would fail the test of parsimony until other approaches have been exhausted. Finally: even given a model-based approach to text analysis, it is not clear that models of the space in which innovation in technology enhanced learning occurs are tractable; education and social change are deeply ambiguous and fuzzy, and the relevant theories are many, unvalidated and often contested. The terms “deductive” and “inductive” should not be taken too rigidly, however, and I am aware that they may be applied to different parts of the methods described below.

Inductive approaches, sometimes termed “machine learning”, are extensively used in data mining and may again be subjected to a binary division into:

  • “supervised” methods, which make use of some a priori knowledge, and
  • “unsupervised” methods, which just look for patterns.

Supervision should be understood rather generally; the typical realisation of supervision is the use of a training set of data that has previously been classified, related to known outcomes or rated in some way. The machine learning then involves induction of the latent rules by which new data can be classified, outcomes predicted etc. This can be achieved using a range of techniques and algorithms, such as artificial neural networks. A supervised learning approach to detecting insurance fraud might start with data on previous detected cases to learn the common features. A text mining example is the classification of the subject of a document given a training set of documents with known subjects.

Unsupervised methods can take a range of forms and the creation of new ones is a continuing research topic. Common methods use one of several definitions of a measure of similarity to identify clusters. Whether the algorithm divides a set in successive binary splits, aggregates items into overlapping or non-overlapping clusters, etc., will tend to give slightly different results.

From this synopsis of inductive approaches it seems like we do not have an immediately useful strategy to hand. By definition, we do not have a training set for weak signals (although it could be argued that there are latent indicators of a weak signal and that we would gain some insight by looking at the history of changes that were once weak signals). The standard methods are oriented towards finding patterns, regularities, similarities, making predictions given previous patterns, which are not weak signals by definition.

For discovery of possible weak signals, it appears that we need to look from the opposite direction: to find the regularities so that they can be filtered out. Another way of expressing this is to say that it is outliers that will sometimes contain the information we most value, which is not usually the case in statistics. A concise description of outliers from Hawkins is as relevant to our weak signals work as it is to general statistics: “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism” (Identification of Outliers / D.M. Hawkins, ISBN 041221900X). Upon this, a case for the exclusion of outliers in a statistical treatment may often be built, whereas for us an outlier is a pointer to a subject for further exploration, dissemination or hypothesis.

Actually, it is likely to be more subtle than simply filtering out regularities and to require some cunning in the design of our mining process. I will describe some ideas later. Some of these describe a process that will filter out the largest regularities while retaining just-detectable ones; maybe a larger dataset than a human can reasonably absorb will show these up. Others will look at differences in the regularities between different domains. Following from this, it is not the case that finding regularities is bad, rather that it may be necessary to stray a little from normal practice, although borrowing as much as possible. “Two Non-elementary Techniques”, below, briefly outlines two relevant approaches to finding structure.

Ensembles

It should be quite clear that any kind of search for potential weak signals is beset by indeterminacy inherent in the socio-technical world in which they arise. I contend that any approach to finding such signals is also easily challenged over reliability and validity. Referring back to “Weak Signals and Text Mining I”, the TELMap human sources (Delphi based) and recorded sources (text mining based) will do the best they can but neither will be able to mount a robust defence of any potential weak signal except in retrospect. This is why we say “potential” and emphasise that discourse over such signals is essential.

One way of mitigating this problem is to take inspiration from the use of “ensembles” in modeling and prediction in complex systems such as the weather. The idea is quite simple; use a range of different models or assumptions and either take a weighted average or look for commonality. The assumption is that wayward behaviour arising from approximations and assumptions, which are practically necessary, can be caught.

A slightly different perspective on dealing with indeterminacy is expressed by Petri Tapio in “Disaggregative policy Delphi: Using cluster analysis as a tool for systematic scenario formation” (Technological Forecasting & Social Change 70 (2002) 83–101):

“Henrion et al. go as far as suggesting to run the analysis with all available clustering methods and pick up the one that makes sense. From the standard Popperian deductionist point of view adopted in statistical science, this would be circular reasoning. But from the more Newtonian inductionist point of view adopted in many technical and qualitatively oriented social sciences, experimenting [with] different methods would also seem as a relevant strategy, because of the Dewian ‘learning by doing’ background philosophy.”

The combination of these two related ideas will be adopted:

  • Bearing in mind the risk of re-introduction of the “mentality filter” (see part I), various methods and information sources will be experimented with to look for what “makes sense”. In an ideal scenario, several people with different backgrounds would address the same corpus to compensate for the mentality filter of each.
  • Cross-checking between the possible weak signals identified in the human and recorded sources approaches and between text mining results (even those that don’t “make sense” by themselves) will be undertaken to look for more defensible signals by analogy with ensemble methods.

Having a human in the process – seeing what makes sense – should help to spot juxtaposition of concepts, dissonance to context, … etc as well as seeing when the method just seems to be working. It will also help to eliminate false-positives, e.g. an apparently new topic might actually be a duplicated spelling mistake.

A Short Introduction to Elementary Text Mining Techniques

The starting point, whatever specific approach is adopted, will always be to process some text – which will be referred to as a “document” whether or not this term would be used colloquially – to generate some statistics upon which computation can occur. These statistics might be quite abstract measures used to assess similarity or they might be more intelligible. On the whole, I prefer the latter since the whole point of the work is to find meaning in dis-similarity. The mining process will consider a large number of documents – where “large” may start in the hundreds – in a collection from a particular source. The term “corpus” will be used for collections like this.

I will be drawing from the standard toolkit of text processing to get from text to statistics, comprising the separate operations described below. These are “elementary” in the sense that they don’t immediately lead us to useful information. They are operations suited to a “bag of words” treatment, which seems quite crude but is common practice; it has been shown to be good enough for many applications, it is computationally tractable with quite large corpora and it lends itself to relatively intelligible results. In “bag of words”, the word order is almost totally neglected and there is no concept of the meaning of a sentence. The meaning of a document becomes something that is approached statistically rather than through the analysis of the linguistic structure of sentences. Bag-of-words is just fine for our situation as we don’t actually want the computer to “understand” anything and we do want to apply statistical measures over moderate-sized corpora.

“Stop Word” Removal

Some of the words used in a document indicate little or nothing of what the document is about in a bag-of-words treatment, although they may be highly significant in a sentence. “No” is a good example of a word with different significance at sentence and bag-of-words levels. It is easy to call other examples to mind: or, and, them, of, in… In the interest of processing efficiency and the removal of meaningless and distracting indicators of similarity/difference, stop words should be removed at an early stage rather than trying to filter out what they cause later in the process. Differences in the occurrence of stop words can be considered to be 100% “noise” but they are easily filtered out at an early stage. Standard stop-word lists exist for most languages and are often built into software for text mining, indexing, etc. It is possible that common noise-words will be discovered while looking for possible weak signals but these can be added to the stop-list.
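
As a concrete illustration, here is a minimal sketch of this clean-up stage using R and the “tm” package (the toolkit I plan to use, as discussed later in this post); the toy documents and the extra stop words are invented for the example.

    library(tm)   # text mining framework: stop-word lists and standard transformations

    # A toy corpus standing in for a collection of blog posts or journal abstracts
    docs <- c("Learning analytics is no longer a niche interest in education",
              "No sign yet of widespread adoption of open badges in universities")
    corpus <- Corpus(VectorSource(docs))

    # Normalise case, strip punctuation and numbers, then remove standard English
    # stop words plus any project-specific noise words discovered along the way
    extra.stopwords <- c("education", "learning")   # invented additions to the stop-list
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, c(stopwords("english"), extra.stopwords))

    inspect(corpus)   # show what survives the filtering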

Tokenisation

Tokenisation involves taking a stream of characters – letters, punctuation, numbers – and breaking it up into discrete items. These items are often what we would identify as words but they could be sentences, fixed length chunks or some other definable unit. Sometimes so-called “n-grams” are created in tokenisation. I will generally use single word tokens but some studies may include bi-grams or tri-grams. For example, all of the following might appear as items in a bag-of-words using bi-gram tokenisation: “learning”, “design”, “learning design”.
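
To make this concrete, a rough base-R illustration of what single-word and bi-gram tokenisation of one short (already cleaned) document produces:

    # Tokenise one cleaned document into words, then form adjacent word pairs
    text    <- "widespread adoption of learning design"
    words   <- unlist(strsplit(text, "\\s+"))            # single-word tokens
    bigrams <- paste(head(words, -1), tail(words, -1))   # bi-grams: adjacent pairs

    words    # "widespread" "adoption" "of" "learning" "design"
    bigrams  # "widespread adoption" "adoption of" "of learning" "learning design"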

Part-of-Speech Tagging and Filtering

For the analysis of meaning in a sentence, the tagging of the part of speech (POS) of each word is clearly important. For the bag-of-words text mining it will not be so. I expect to use POS tagging in only a few applications (see below). When used, it will probably be accompanied by a filtering operation to limit the study to nouns or various forms of verb (VB* in the Penn Treebank POS tags scheme) in the interest of relevance discrimination.

Stemming

Many words change according to the part of speech and have related forms that effectively carry similar meaning in a bag of words. For example: learns, learner, learning, learn. This will generally equate to “noise”, at best a minor distraction and at worst something that hides a potential weak signal by dissipating a detectable concept into lexically-distinct terms. In general it is statistically desirable to reduce the dimensionality of the variables, especially if they are not independent, and stemming does this since each word/term occurrence is a variable.

The standard approach is “stemming”, which reduces related words to what is generally the starting part of all of the related words (e.g. “learn”), although it often leads to a stem that is not actually a word. There are a variety of ways this can be done, even within a single language. The Porter stemmer is widely used for English.
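
By way of illustration, a tiny sketch using the SnowballC package’s Porter-family stemmer (tm’s stemDocument performs the same operation over a whole corpus); note that stemmers are imperfect and the output need not be a dictionary word.

    library(SnowballC)   # provides wordStem(), an implementation of the Porter-family stemmer

    wordStem(c("learns", "learner", "learning", "learn"), language = "english")
    # typically gives: "learn" "learner" "learn" "learn" - imperfect, but it
    # collapses most of the related forms onto a single term

    # Applied to a whole corpus (continuing the earlier sketch):
    # corpus <- tm_map(corpus, stemDocument)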

Document Term Statistics

A simple but practically-useful way to generate some statistics is to count the occurrence of terms (words or n-grams, having been filtered for stop-words and stemmed). A better measure, which compensates for differences in document length, is to use the term frequency rather than the count; term frequency may be expressed as a percentage of the terms occurring in the document that are of a given type. Sometimes a simple yes/no indicator of occurrence may be useful.

A potentially more interesting statistic for a search for possible weak signals is the “term frequency–inverse document frequency” (tf-idf). This opaquely-named measure is obtained by multiplying the term frequency by the logarithm of the inverse of the fraction of documents that contain the term (i.e. the total number of documents divided by the number of documents containing the term). This elevates the measure if the term is sparsely distributed among documents, which is exactly what is needed to emphasise outliers.
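
Continuing the sketch, both weightings can be obtained from tm’s DocumentTermMatrix; the last line is one crude way of surfacing terms that are distinctive of a single document, which is the kind of outlier hinted at above.

    # Document-term matrices over the preprocessed corpus from the earlier sketch
    dtm.tf    <- DocumentTermMatrix(corpus)                        # raw term counts
    dtm.tfidf <- DocumentTermMatrix(corpus,
                                    control = list(weighting = weightTfIdf))

    # Terms with a high tf-idf in some document are sparsely distributed but
    # heavily used where they do occur - candidates for a closer look
    m <- as.matrix(dtm.tfidf)
    head(sort(apply(m, 2, max), decreasing = TRUE), 10)   # ten most "distinctive" terms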

Given one of these kinds of document item statistic it is possible to hunt for some interesting information. This might involve sorting and visually reviewing a table of data, resorting to a graphical presentation, applying some ad-hoc recipe, using a structure-discovery algorithm (e.g. clustering) that computes a “distance measure” between documents from the item statistics… or a combination of these.

Synonyms, Hyponyms, …

For the reasons outlined in the section on stemming, it can be helpful to reduce the number of terms by mapping several synonyms onto one term or hyponyms onto their hypernym (e.g. scarlet, vermilion, carmine, and crimson are all hyponyms of red). The Wordnet lexical database contains an enormous number of word relationships, not just synonyms and hypo/hyper-nyms. I do not intend to adopt this approach, at least in the first round of work, as I fear that it will be hard to be the master of the method rather than the reverse. For example, Wordnet makes “imprinting” a hyponym of “learning” – “imprinting — (a learning process in early life whereby species specific patterns of behavior are established)” – which I can see as a linguistically sensible relationship but one with the unintended consequence of suppressing potential weak signals.

An alternative use of Wordnet (or similar database) would be to expand one or more words into a larger set. This might be easier to control and would quickly generate what could be used as a set of query terms, for example to search for further evidence once a possible weak signal shortlist has been created. One of my “Application Scenarios”, below, proposes the use of expansion.

Two Non-elementary Techniques

The section “Data Mining Strategies” outlined some of the strengths and weaknesses of common approaches in relation to our quest for possible weak signals. It stressed the need to work with methods that focus on finding structure and regularities alongside a search for outliers. Two relevant techniques that go beyond the relative simplicity of the elementary text mining techniques outlined above are clustering and topic modeling. “Topic modeling” is a relatively specific term whereas “clustering” covers more diversity.

Clustering methods – many being well-established – may be split into three categories: partitioning methods, hierarchical methods and mapping methods. These usually work by computing a similarity (or distance) between the items being clustered. In our case it is the document term statistics that will give a location for each document.

The hierarchical approach is expected to make visible the most significant structure viewed from the top down (although the algorithms work from the bottom up in “aggregative” hierarchical clustering) and so is not expected to lend itself to a search for possible weak signals, although it is appropriate for scientific studies and for document subject classification, where we naturally use hierarchical taxonomies.

Partitioning does not coerce the structure into a hierarchy and so may be expected to leave more of the detail in place. There are a number of different objectives that may be chosen for partitioning and potentially several algorithms for each. The “learning by doing” philosophy noted in the section “Ensembles” will be adopted. An old but useful description of the mathematics of some established clustering algorithms has been provided by Anja Struyf, Mia Hubert, Peter Rousseeuw in the Journal of Statistical Software Vol 1 Issue 4.

Hierarchical and partitioning approaches have been popular for many years whereas mapping approaches are a more recent innovation, probably due to greater computing power requirements. Self Organising Maps and Multi-dimensional Scaling are two common “mapping approaches”. They are probably best understood by thinking of the problem of representing the document term statistics (a table with many columns, each representing the occurrence of terms, and many rows, one for each document) in two dimensions. This is clustering in the sense that aspects of sameness among the ‘n’ dimensions of the columns are aggregated. Although this process of dimension-reduction has the desirable property of making the data easier to visualise, it may be unsuitable for the discovery of possible weak signals.
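
For the hierarchical and partitioning cases, base R is enough to sketch what this looks like in practice; the distance measure and the number of clusters below are arbitrary illustrative choices, and a corpus of more than a handful of documents is assumed.

    # Cluster documents on their term statistics (the tf-idf matrix from the earlier sketch)
    m <- as.matrix(dtm.tfidf)

    # Hierarchical (aggregative) clustering: a top-down view of the structure
    d  <- dist(m)            # Euclidean distances between document rows
    hc <- hclust(d)
    plot(hc)                 # dendrogram

    # Partitioning into k groups with k-means (k chosen arbitrarily here)
    km <- kmeans(m, centers = 5)
    km$cluster               # cluster membership for each document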

Clustering is a stereotypical unsupervised learning approach; there is no embedded model of the data. All that is needed is a means to compute similarity, hence the same methods can be applied to experimental data, text, etc. Topic Modeling, however, introduces a theory of how words, topics and documents are related. This has two important consequences: the results of the algorithm(s) may be more intelligible; the model constrains the range of possible results. The latter may be either desirable or undesirable depending on the validity of the model. Intelligibility is improved because we are able to better relate to the concept of a topic than we are to some abstract statement of statistical similarity.

Probabilistic Topic Models (see Steyvers and Griffiths, pdf) assume that a bag of words, each word having a frequency value, can be used to capture the essence of a topic. A given document may cover several topics to differing degrees, leading to the document term frequencies in the obvious way. The statistical inference required to work backwards from a set of documents to a plausible set of topics – and the generation of the word frequency weightings for each topic and topic percentages in each document – requires some mathematical cunning but has been programmed for R (see Grün and Hornik’s topicmodels package) as well as MATLAB (see Mark Steyvers’ toolbox).

Probabilistic Topic Models could be useful to attenuate some of the noisiness expected with approaches working purely at the document-term level. This might make identification of possible weak signals easier; “discriminate relevance” was how it was phrased in the statement of requirements above. It is expected that some tuning of parameters, especially the number of topics, will be required. There is also a random element in the algorithms employed, which means that the results of different runs may differ. The margin of the stable part of the results may well contain pointers to possible weak signals. As for clustering, a “learning by doing” approach will be used, taking care not to introduce a mentality filter.
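
A minimal sketch of fitting such a model with the topicmodels package mentioned above; the number of topics and the seed are illustrative tuning choices, and the raw-count document-term matrix (not the tf-idf weighted one) is assumed as input.

    library(topicmodels)   # Gruen & Hornik's implementation of probabilistic topic models

    # Fit a Latent Dirichlet Allocation model; k (the number of topics) needs tuning,
    # and fixing a seed tames the random element between runs
    k   <- 20
    lda <- LDA(dtm.tf, k = k, control = list(seed = 1234))

    terms(lda, 8)    # the eight highest-weighted words for each inferred topic
    topics(lda, 1)   # the single most likely topic for each document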

Sources of Information for Mining

One of the premises of text mining is access to relatively large amounts of information and the explosion of text on the web is clearly a factor in an increasing interest in text mining over the last decade and a half. There are both obvious and non-obvious reasons why an unqualified “look on the web” is not an answer to the question “where should I look for possible weak signals”. Firstly, the web is simply too mindbogglingly big. More subtly, it is to be expected that differences in style and purpose of different pieces of published text would hide possible weak signals; some profiling will be required to create corpora that contain comparable documents within each. Finally, crawling web pages and scraping out the kernel of information is a laborious and unreliable operation when you consider the range of menus, boilerplate, advertisements, etc that typically abound.

Three kinds of source get around some of these issues:

  • blogs occur with a reduced range of style and provide access to RSS feeds that serve up the kernel of information as well as the publication date;
  • journal abstracts generally have quite a constrained range of style, are keyword-rich and can be obtained using RSS or OAI-PMH to avoid the need for scraping;
  • email list archives are not so widely available via RSS (although feeds are sometimes provided) and there is often stylistic consistency, though quoted sections, email “signatures” and anti-spam scan messages may present material in need of removal.

My focus will be on blogs and journal abstracts, which are expected to generally contain different information. RSS and OAI-PMH are important methods for getting the data with a minimum of fuss but are not the panacea for all text acquisition woes. RSS came out of news syndication and to this day RSS feeds serve up the most recent entries. Any attempt to undertake a study that looks at change over time using RSS to acquire the information will generally have to be conducted over a period of time. Atom, a later and similar feed grammar, is sometimes available but not the paging and archival features imagined in RFC5005. Even RSS feeds provided by journal publishers are limited to the latest issue, and there is usually no obvious means to hack a URL to recover abstracts from older issues. The OAI-PMH provides a means to harvest abstract (etc) information over a date range, and there is even an R package that implements OAI-PMH, but many publishers do not offer OAI-PMH access.
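
For the blog side, the acquisition step might look something like the sketch below, using R’s XML package; the feed URL is a placeholder, and because a feed only serves recent items the script would need to be run repeatedly over time and the results accumulated.

    library(XML)   # feed parsing; fetching some http(s) feeds may additionally need RCurl

    feed  <- xmlParse("http://example.org/blog/feed")   # placeholder feed URL
    items <- getNodeSet(feed, "//item")

    posts <- data.frame(
      title = sapply(items, function(i) xmlValue(i[["title"]])),
      date  = sapply(items, function(i) xmlValue(i[["pubDate"]])),
      text  = sapply(items, function(i) xmlValue(i[["description"]])),
      stringsAsFactors = FALSE
    )

    # Accumulate over repeated runs, since RSS only exposes the most recent entries
    write.csv(posts, file = paste("feed-", Sys.Date(), ".csv", sep = ""), row.names = FALSE)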

A final problem which is specific to blogs is how to garner the URLs for the blogs to be analysed. It seems that all methods by which a list could be compiled are flawed; the surveillance and mentality filters seem unavoidable.

The bottom line is: there will be some work to do before text mining can begin.

On Tools, Process and Design…

The actual execution of the simple text mining approaches outlined in the “short introduction” is relatively straightforward. There are several pieces of software, both code libraries/modules and desktop applications, that can be used after a little study of basic text mining techniques. I plan on using R and the “tm” package (Feinerer et al.) for the “elementary” operations, as well as RapidMiner. The former requires programming experience whereas the latter offers a graphical approach similar to Yahoo Pipes. In principle, the software used should be independent of the text mining process it implements, which can be thought of in flow-chart terms, so long as standard algorithms (e.g. for stemming) are used. In practice, of course, there will be some compromises. The essence of this is that the translation from a conceptual to an executable mining process is an engineering one.

The critical, less-well-defined and more challenging task is to decide how the toolkit of text-mining techniques will be applied. This starts with considering what text sources will be used, moves through how to “clean” text and then how tokenisation, term frequency weights etc will be used, and concludes with how clustering (etc) and visualisation should be deployed. In a sense, this is “designing the experiment” – but I use the term “application scenario” – and it will determine the meaningfulness of the results.

Some Application Scenarios

This section speculates on a number of tactics that might yield possible weak signals. The title of each application scenario will be used as its name. Future postings will develop and apply one or more of these application scenarios.

Out-Boundary Crossing

Idea: Strong signals in other domains are spotted by a minority in the TEL domain who see some relevance of them. Discover these.

Notes: The signal should be strong in the other domain so that there is confidence that it is in some sense “real”.

Operationalisation: Extract low-occurrence nouns from a corpus of TEL domain blog posts and cross-match to search term occurrence in Google Trends.

Challenges: Google Trends does not provide an API for data access

In-Boundary Crossing

Idea: Some people in another domain (e.g. built environment, architectural design) appear to be talking about teaching and learning. Discover what aspect of their domain they are relating to ours.

Notes: This approach clearly cannot work with text from the education domain.

Operationalisation: Use Wordnet to generate one or more sets of keywords relevant to the practice of education and training. Use a corpus of journal abstracts (separate corpora for different domains) and identify documents with high-frequency values for the keyword set(s).

Challenges: It may be difficult to eliminate papers about education in the subject area (e.g. about teaching built environment) other than by manual filtering.

Novel Topics

Idea: Detect newly emerging topics against a base-line of past years.

Notes: This is the naive application scenario, although there are many ways in which it can be operationalised.

Operationalisation: Against a corpus of several previous years, how do the topics identified by Probabilistic Topic Modeling correlate with those for the last 6 months or year? Both TEL blogs and TEL journal abstracts could be considered.

Challenges: It may be difficult to acquire historical text that is comparable – i.e. not imbalanced due to differences in source or due to quantity.

Parallel Worlds

Idea: There may be different perspectives between sub-communities within TEL: practitioners, researchers and industry. Identify areas of mismatch.

Notes: The mismatch is not so much a source of possible weak signals as an exploration of possible failure in what should be a synergy between sub-communities.

Operationalisation: Compare the topics displayed in email lists for institutional TEL champions, TEL journal abstracts and trade-show exhibitor profiles.

Challenges: Different communities tend to use communications in different ways: medium, style, etc, which is reflected in the different text sources. This may well over-power the capabilities of text mining. Web page scraping will be required for exhibitor profiles and maybe email list archives.

Rumble Strip

A “rumble strip” provides an alert to drivers when they deviate.

Idea: Discover differences between a document and a socially-normalised description of the same topic.

Notes: –

Operationalisation: Use an online classification service (e.g. OpenCalais) to obtain a set of subject “tags” for each document. Retrieve and merge the Wikipedia entries relevant to each. Compare the document term frequencies for the original document and the merged Wikipedia entries.

Challenges: Documents are rarely about a single topic; the practicability of this application scenario is slim.

Ripples on the Pond

Idea: A new idea or change in the environment may lead to a perturbation in activity in an established topic.

Notes: Being established is key to filtering out hype; these are not new topics.

Operationalisation: Identify some key-phrase indicators for established topics (e.g. “learning design”). Mine journal abstracts for the key phrase and analyse the time series of occurrence. Use OAI-PMH sources to provide temporal data.

Challenges: The results will be sensitive to the means by which the investigated topics are decided.

Shifting Sands

Idea: Over time the focus within a topic may change, although writers would still say they are talking about the same subject. Discover how this focus changes as a source of possible weak signals.

Notes: Although the scenario considers an aggregate of voices in each time period, the voices of individuals may be influential on the results.

Operationalisation: Use key-phrases as for “Ripples on the Pond” but use Probabilistic Topic Modeling with a small number of topics. Analyse the drift in the word-frequencies determined for the most significant topics.

Challenges: –

Alchemical Synthesis

Idea: Words being newly associated may be an early sign of an emerging idea.

Notes: –

Operationalisation: Using single-word nouns in the corpus, compute an uplift for n-grams that quantifies the n-gram frequency compared to what it would be by chance. Sample a corpus of TEL domain blog posts and look for bi-grams or tri-grams with unexpected uplift (a rough sketch of the computation follows below).

Challenges: –
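
One way this uplift could be computed is to compare the observed bi-gram frequency with the frequency expected if the two words occurred independently (essentially an un-logged pointwise mutual information). A base-R sketch, where all.text is assumed to hold the (cleaned) corpus as a single string:

    # Toy computation of bi-gram "uplift": observed frequency vs. chance expectation
    words   <- unlist(strsplit(tolower(all.text), "\\s+"))   # all.text: assumed corpus string
    bigrams <- paste(head(words, -1), tail(words, -1))

    p.word   <- table(words) / length(words)        # single-word probabilities
    p.bigram <- table(bigrams) / length(bigrams)    # observed bi-gram probabilities

    first  <- sub(" .*", "", names(p.bigram))       # first word of each bi-gram
    second <- sub(".* ", "", names(p.bigram))       # second word of each bi-gram
    uplift <- as.numeric(p.bigram) /
              (as.numeric(p.word[first]) * as.numeric(p.word[second]))
    names(uplift) <- names(p.bigram)

    head(sort(uplift, decreasing = TRUE), 20)       # unexpectedly strong pairings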

Final Remarks

As was noted at the start, the implementation of these ideas has not yet been undertaken. I may be rash in publishing such immature work but I do so in the hope that constructive criticism or offers of collaboration might arise.

There is much more that could be said about issues, challenges and what is kept out of scope, but two points warrant comment: I am only looking at text in English and recognise that this gives a biased set of possible weak signals; and there are other analytical strategies, such as social network analysis, that provide interesting results both independently of and alongside the kind of topic-oriented analysis I describe.

I hope to be able to report some possible weak signals in due course for comment and debate. These may appear on the TELMap site but will be signposted from here.

Weak Signals and Text Mining I – An Introduction to Weak Signals (12 May 2011)

“Weak Signals” is a rather fashionable term used in parts of the future-watching community, although it is an ill-defined term, as evidenced by the lack of a specific entry in Wikipedia (there is only a reference under Futurology). There is an air of mystique and magic about Weak Signals Analysis that turns some people off, me included, but I have come to the conclusion that a sober interpretation of the idea can be provided. This is what we are trying to do in a work package led by the Zentrum für Soziale Innovation (strapline “all innovations are socially relevant”) in the TELMap project. This work combines two approaches: one with direct engagement with people, our “human sources” track, and one looking at “recorded sources”, i.e. existing written texts. My area of interest, and that of colleagues at RWTH Aachen University, is in the recorded sources. This post provides an introduction to the work and, I hope, a sober interpretation of “weak signals”; a following post will outline some initial ideas about how text mining might be used.

A Weak Signal is essentially a sign of change that is not generally appreciated. In some cases the majority of experts, or people generally, would dismiss it as being irrelevant or simply fail to notice it. In these days of social software and ubiquitous near-instantaneous global communication the focus is generally on trends, memes, etc. Thought leaders of various kinds – individuals and organisations – wield huge power over the focus of attention of a following majority. The act of anticipating what the next trend/meme/etc will be could be construed as looking for a weak signal. There are a number of problems with identifying these and a naive approach is bound to fail; for example, to ask people to “tell me some weak signals” is equivalent to asking them to name something they think is irrelevant but which might be important. Neither can you ask the experts, by definition. The point here is that the person who spots a sign of change may well be an outsider, on the periphery, or be in a despised sub-culture.

In spite of Weak Signals being a problematic concept, the fact remains that to anticipate change would give an innovator an advantage and potentially help an agent in the mainstream to avoid being blind-sided. To make even a small contribution here is part of the mission of both TELMap and CETIS. Our intention is to divert some attention away from the hot topics of the day and to discover some neglected perceptions or ideas that are worthy of more attention, both social attention and analytical investigation. This intention, and an assertion that we only ever consider Possible Weak Signals, is my “sober interpretation”. There is no magic here, no shamanic trance leading to revelation.

There is ample literature around the topic of Weak Signals but I will only mention a couple of sources. Elina Hiltunen is a well-known figure; see, for example, some slides and references (pdf) in which she gives an informal checklist for weak signals (quoted with minor changes to the English) that should be viewed as indicative of necessary rather than sufficient criteria:

  1. Makes your colleagues laugh (ridicule)
  2. Your colleagues oppose it: no way, it will never happen
  3. Makes people wonder
  4. No one has heard about it before
  5. People would rather that no-one talks about it anymore (a taboo)

Two more Finns, Leena Ilmola and Anna Kotsalo-Mustonen, discuss the importance of filters: “When monitoring their operating environments for weak signals and for other disruptive information companies face filters that hinder the entry of the information to the company”. Substitution of “technology enhanced learning community” for “company” gives us our initial problem statement. Ilmola and Kotsalo-Mustonen describe three kinds of filter, following earlier work by Igor Ansoff, who is generally credited with introducing the concept of Weak Signals in the 1970s:

  1. The surveillance filter. Colloquially, “just looking under the street-lamp”. The obvious compensator for the surveillance filter in our situation is diversity of recorded sources.
  2. The mentality filter. We tend to only notice things that are relevant to our immediate context and problems. Information overload and tendencies to conform to social norms and be influenced by fashion compound the effects of people working “in the trenches”. By using text mining approaches we hope to compensate for these problems by filtering information in the recorded sources in a mental-model-agnostic manner.
  3. The power filter. The signals of change that lead to change of strategy or action do so through an existing power structure and become filtered according to political considerations. Ideas that challenge the status quo are threatening. As for the previous filter, we hope to avoid some of the effect of the power filter, although not entirely. Most recorded sources have already been subjected to implicit (many bloggers self-censor to protect their job/career) or explicit (e.g. journal or magazine articles) power filters.

The adoption of a text mining approach over a diverse range of recorded sources offers a promising means to draw out some Possible Weak Signals, although I am clear that text mining will be challenging to apply and that it will only be useful in tandem with human engagement. Given an initial list of possible signals, it will be necessary to apply some heuristics, such as the Hiltunen checklist, to try to reduce “noise”. These can then be used to facilitate discussion and disputation, cross-reference with other studies and with the conclusions of our “human sources” work, leading to ever shorter lists. If we find a few cases where people say “you know what, that isn’t so crazy after all”, or similar, I will consider the activity to have been a success. The next post summarises mainstream text mining approaches, describes how Weak Signals considerations affect the selection of text mining methods, and outlines some ideas for application of text mining to look for possible weak signals.

An Informal “Horizon Scan” from CETIS (8 April 2011)

For the past three years we have created a largely internal and informal “horizon scan” of technology trends and issues of interest and relevance to members of CETIS. The 2011 edition (.doc) was created in March and has just been uploaded to a public URL. All three are available in “.doc” format under the Resources section of our “Horizon Scan” topic page.

These should be seen as a set of unprocessed perceptions rather than the product of a formal process; a great deal of ground is not scanned in this paper and it should be understood that no formal prioritisation process was undertaken. They are, therefore, not at all comparable with NMC Horizon Reports. The CETIS Horizon Scan should be seen as a set of potentially-idiosyncratic “takes”, material on which discourse and disputation may occur to make possible futures clearer.

ÜberStudent, Edubuntu – A sign of what is to come? (8 February 2011)

ÜberStudent is a newcomer (launched 2010) to the world of Linux distributions, aimed at higher education and advanced secondary level students. Edubuntu has been around a few years longer. Both are based on the Ubuntu Linux distribution, which has a strong user-base among people who are not hard-core techies and who have migrated from Microsoft. A “distribution” is effectively a packaged-up combination of generic Linux code, drivers, applications, etc.

As someone who switched over to Ubuntu in October 2010, and with no regrets (I kept Windows 7 in reserve on the same machine but have never used it), I can imagine that ÜberStudent, or maybe a successor, may be onto a winner. Consider some “what ifs”:

  • Ubuntu makes headway with its offering for MIDs and netbooks and erodes the Android and iPhone territory
  • students look to save more money
  • students react against “the man” (after blithely paying for years) as part of a general reaction to the banking crisis and government policy
  • Facebook, Apple or Google get too greedy or conceited.

I speculate that it is just a matter of time before “we” (staff in universities and colleges and their IT suppliers etc) need to grapple with a new wave of issues around user-owned technologies. How well would we cope if everyone accessed the VLE and portal (SharePoint?) with Firefox and accessed PowerPoint presentations or submitted documents/spreadsheets using LibreOffice? How much worse does this look when they are paying £6000-£9000 per annum?

Before ending, I should say that Ubuntu is not all bliss – getting modern operating systems to work across diverse hardware and configurations is seriously difficult – but overall, it has been a good move for me. I just want a computer I can use for work: mostly email, calendar, documents, web…

So, what do I like about the change? NB these are a mix of technical aspects of Linux/Ubuntu and consequences of the Open Source model. In brief:

  • Less friction – there are fewer times when I have to sit, waiting for something to happen.
  • Less conceitedness – I really dislike the way MS Windows tries to control the way you do things. This has got worse over the years. It seems you have to brainwash yourself into the Windows Way to avoid the temptation to profanity.
  • More freedom – no-one is trying to “monetise” things. No I don’t want to change my search engine, use a particular cloud-based application, get my games from XXX. This is one of the things that stops me going Android.
  • Less memory use – less than 1GB is more than adequate, whereas my XP installation used to just swallow it up
  • More disciplined software management – if you stick to using the Package Manager

And what don’t I like? Only that the video driver often crashes if I change to using a monitor rather than the laptop screen. In general, hardware is a source of niggles and battery consumption is not so well optimised as for Windows or MacOS (Apple is especially good at this as they control the hardware and operating system).

In conclusion:

  • I recommend you try Ubuntu; it can be tried by running off a CD or a USB stick without “nuking” your current system, or by installing it on a PC that has just been retired (NB if you try a machine much older than 5 years the hardware driver support may cause problems)
  • Watch out for the future!
CETIS Informal Horizon Scan – March 2009 (8 July 2009)

In March 2009 we conducted an informal Horizon Scan as an internal exercise in identifying areas of work to prioritise in the coming year. A public version is now available.

As the caveat at the start of the document notes, this was not intended to be a thorough or robust process but feel free to comment with any burning issues that you think we missed…

Technology Forecast (8 July 2009)

PricewaterhouseCoopers started publishing their annual Technology Forecast book as a quarterly journal during 2008. Previous editions of the PwC Technology Forecast have focussed on the question of IT in support of increasing pressures on “business agility”, pressures also felt in post-compulsory education, whereas the latest edition pays particular attention to the increased significance of semantic web technology in practice. The latter actually seems to do a pretty good job of providing an introduction to some of the concepts and technology choices for the semantic web, at least for an IT professional.
