A lesson in tagging for UKOER

We’ve been encouraging projects in the HE Academy / JISC OER programmes to use platforms that help get the resources out onto the open web and into the places where people look, rather than expecting people to come to them. YouTube is one such place. However, we also wanted to be able to find all the outputs from the various projects wherever they had been put, without relying on a central registry, so one of the technical recommendations for the programme was that resources are tagged UKOER.

So, I went to YouTube and searched for UKOER, and this was the top hit. Well, it’s a lesson in tagging, I suppose. I don’t think it invalidates the approach: we never expected 100% fidelity, and this seems to be a one-off among the first 100 or so of the 500+ results. And it’s great to see results from Chemistry.FM and CoreMaterials topping 10,000 views.

Text and Data Mining workshop, London 21 Oct 2011

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and was well articulated by John McNaught of NaCTeM. Briefly, the potential lies in the extraction of new knowledge from old, through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of the volume of information involved. Full-text access is crucial: John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, from for example Jeff Lynn of the Coalition for a Digital Economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if they were to remain competitive.

While these speakers touched on the legal barriers, it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before), when publishers acquire copyright, or a licence to publish with enough restrictions to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-BY, this requires attribution. Naomi spoke of attribution stacking, a problem John had also mentioned: when a result is found by mining thousands of papers, do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: text mining is not covered by fair dealing (though it is covered by fair use in the US, and by similar exceptions in Norwegian and Japanese law, but nowhere else); the exceptions for transient copies (such as those in a computer’s memory when reading online) only apply if the copy has no intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealing to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review, people who understand text mining and care about legal issues make sure that any legislation is sufficient to allow innovation; otherwise innovators will have to move to jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA for organising this event; there’s obviously plenty more for them to do.

Testing CaPRéT

I’ve been testing the alpha release of CaPRéT, a tool that aids attribution and tracking of openly licensed content from web sites. According to the CaPRéT website:

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and, if their application supports it, the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

I tested CaPRéT on a single page, my institutional home page, and on this blog. To enable CaPRéT for material on a website you need to include links to four JavaScript files in your webpages. I went with the files hosted on the CaPRéT site, so all I had to do was put this into my home page’s <head>. (The testing on my home page is easier to describe, since the options for WordPress will depend on the theme you have installed.)


<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/jquery.plugin.clipboard.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/oer_license_parser.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/capret.js" type="text/javascript"></script>

Then you need to put the relevant information, properly marked up, into the webpage. Currently CaPRéT cites the Title, source URL, Author, and Licence URI of the page from which the text was copied. The easiest way to get this information into your page is to use a platform which generates it automatically, e.g. WordPress or Drupal with the OpenAttribute plug-in installed. The next easiest way is to fill out the form at the Creative Commons License generator. Be sure to supply the additional information if you use that form.

If you’re doing it manually, this is what does the work:

Title is picked up from any text marked up as
<span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type"></span> or, if that’s not found, from the page <title> in the <head>.

Source URL comes from the page URL.

Author name is picked up from the contents of <a xmlns:cc="http://creativecommons.org/ns#" href="http://jisc.cetis.org.uk/contact/philb" property="cc:attributionName" rel="cc:attributionURL"></a> (actually, the author attribution URL in the href attribute isn’t currently used, so this could just as well be a span).

Licence URI is picked up from the href attribute of <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">.

You might want to suggest other things that could be in the attribution/citation.
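For the curious, the sort of extraction described above can be sketched in Python with the standard library’s html.parser. This is my rough illustration of the approach, not CaPRéT’s actual code, and the sample markup is invented along the lines of the snippets above:

```python
from html.parser import HTMLParser

class AttributionParser(HTMLParser):
    """Collect title, author and licence URI from RDFa-style markup,
    roughly in the way CaPRéT's oer_license_parser does (a sketch only)."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.author = None
        self.license_uri = None
        self._capture = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("property") == "dct:title":
            self._capture = "title"
        elif a.get("property") == "cc:attributionName":
            self._capture = "author"
        elif tag == "a" and "license" in (a.get("rel") or ""):
            self.license_uri = a.get("href")

    def handle_data(self, data):
        if self._capture == "title" and self.title is None:
            self.title = data.strip()
        elif self._capture == "author" and self.author is None:
            self.author = data.strip()
        self._capture = None

html_doc = (
    '<span property="dct:title">Example Resource</span>\n'
    '<a property="cc:attributionName" rel="cc:attributionURL" '
    'href="http://example.org/author">A. N. Author</a>\n'
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
    'CC BY 3.0</a>'
)

p = AttributionParser()
p.feed(html_doc)
print(p.title, p.author, p.license_uri)
```

In the real tool this extraction happens in the browser, in JavaScript, when text is copied; the point here is just that the markup carries everything needed.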

Reflections
As far as attribution goes it seems to work. Copy something from my home page or this blog and paste it elsewhere and the attribution information should magically appear. What’s also there is an embedded tracking gif, but I haven’t tested whether that is working.

What I like about this approach is that it converts self-description into embedded metadata. Self-description is the practice of including within a resource the information that is important for describing it: the title, author, date etc. Putting this information into the resource isn’t rocket science, it’s just good practice. To convert this information into metadata it needs to be encoded in such a way that a machine can read it. That’s where the RDFa comes in. What I like about RDFa (and microformats and microdata) as a way of publishing metadata is that the actual descriptions it builds on are the very same ones that it’s just good practice to include in the resource. Having them on view in the resource is likely to help with quality assurance, and, while the markup is fiddly (and best dealt with by the content management system in use, not created by hand), creating the metadata should be no extra effort over what you should do anyway.

Caprét is being developed by MIT OEIT and Tatemae (OERGlue) as part of the JISC CETIS mini projects initiative; it builds on the browser plug-in developed independently by the OpenAttribute team.

The hunting of the OER

“As internet resources are being moved, they can no longer be traced.” I read in a press release from Knowledge Exchange. This struck me as important for OERs since part of their “openness” is the licence to copy them, and I have recently been on something of an OER hunt, which highlights the importance of using identifiers correctly and of “curatorial responsibility”.

The OER I was hunting was an “Interactive timeline on Anglo-Dutch relations (50 BC to 1830)” from the UKOER Open Dutch project. It was recommended to me a year or so ago as a great output, one whose utility pretty much anyone could see, and it used the MIT SIMILE timeline software to create a really engaging interface. I liked it, but more importantly for what I’m considering now, I used it as an example when investigating whether putting resources into a repository enhanced their visibility on Google (in this case it did).

Well, that was a year or more ago. The other week I wanted to find it again, so I went to Google and searched for “anglo dutch timeline” (without the quotes). Sure enough, I got three results for the one I was looking for on the first page (of course, your results may vary; Google’s like that nowadays). These were, from the bottom up:

  1. A link to a record in the NDLR (the Irish National Digital Learning Resources Repository) which gave the link URL as http://open.jorum.ac.uk:80/xmlui/handle/123456789/517 (see below)
  2. A link to a resource page in HumBox, which turned out to be a manifest-only content package (i.e. metadata in a zip file). Looking into it, there’s no resource location given in the metadata, and the pointer to the content (which should be the resource being described) actually points to the Open Dutch home page.
  3. Finally, a link to a resource page in Jorum. This also describes the resource I was looking for but actually points to the Open Dutch project page. The URL of the Jorum page describing the resource is given as the persistent link. I believe that the NDLR harvests metadata from Jorum, so my guess is that that is why the NDLR lists this as the location of the resource.

Finding descriptions of a resource isn’t really helpful to many people. OK, I now know the full name and the author of the resource, which might help me track it down, but at this point I couldn’t. Furthermore, nobody wants to find a description of a resource that links to a description of the resource. I think one lesson concerns the importance of identifiers: “describe the thing you identify; identify the thing you describe.”

This story (and I very much suspect it is not an isolated case) has significance for debates about whether repositories should accept metadata-only “representations” of resources. Whether or not it is a good idea to deposit resources you are releasing as OERs in a third-party repository will depend on what you want to achieve by releasing them; whether or not it is a good idea for a repository to take and store resources from third parties will depend on what the repository’s sponsors wish to facilitate. Either way, someone needs to take some curatorial responsibility for the resource and for the metadata about it. That means on the one hand making sure that the resource stays on the web and on the other hand making sure that the metadata record continues to point to the right resource (automatic link checking for HTTP 404 responses etc. helps but, as this post on link rot notes, it’s not always that simple).
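Automatic link checking catches the easy failures. A minimal sketch in Python, using only the standard library (the function and its behaviour are my own illustration): note that a 200 response alone doesn’t prove a record still points at the right resource, since a redirect to a project home page, as in the Open Dutch case, still ends in a 200, so it is worth comparing the final URL with the one you expected.

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Return (status, final_url) for a URL, or (None, url) if unreachable.

    Comparing final_url with the expected location catches the subtle
    rot case where a resource URL now redirects somewhere else entirely.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as e:
        return e.code, url          # e.g. 404: the classic dead link
    except urllib.error.URLError:
        return None, url            # DNS failure, refused connection, etc.
```

A curator’s job doesn’t end with running such a script, of course; it only flags candidates for a human to look at.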

By the way, thanks to the incomparable David Kernohan, I now know that the timeline is currently at http://www.ucl.ac.uk/alternative-languages/OER/timeline/.

What I didn’t tweet from #OpenN11

For various reasons I didn’t get around to tweeting from the Open Nottingham 2011 seminar last Thursday, but that just gives me the excuse to record my impressions of it here, perhaps not in 140 chars per thought but certainly without much by way of discursive narrative.

Lack of travel options meant that I arrived late and missed the first couple of presentations.

Prof. Wyn Morgan (Director of Learning and Teaching, University of Nottingham): Nottingham started its OER initiative in 2006-7; it launched U-Now about the same time as the OU’s OpenLearn, and well before HEFCE funding. Before that they were “hiding behind passwords and VLEs”. Motivations included: corporate social responsibility, widening participation, marketing/promotion, sharing materials with overseas campuses, and cost savings.

Wayne Mackintosh (WikiEducator, OERU). Like many, Wayne went into education in order to share knowledge: OER aligns with the core values of those in HE. He compared the model of OERU to the University of London external examinations ca. 1850: decoupling learning from accreditation. While the open content is there, open curriculum development is something that needs working on.

Steve Stapleton (Open Learning Support Officer, The University of Nottingham): case studies on re-use. One case study showed that reuse happens at the micro-level, is routine, and is not distinguished from other stuff on the web, hence it is difficult to see what is being used. The other involved students remixing OERs for the next year of students to use: an example of open licensing enabling pedagogy.

Greg DeKoenigsberg (founder and first chairman of the Fedora Project Board at RedHat, now CTO of ISKME): What I learnt from open source. “People are the way we filter information” but we are interested in “tiny niche domains, the micro communities” which leads to the question “How do I find people like me?” Argued that the driver for open content may be the same as the driver for open source software: it allows you to stop competing on “non-differentiated value” and focus on what you do that is different.

Andy Lane (Director of OpenLearn, Open University): SCORE. Mentioned an interesting idea in passing, ‘born open’: after 5 years open content is becoming mainstream at the OU; they no longer think of releasing existing content as open, rather they are developing open content.

Rob Pearce (HE Academy Engineering Subject Centre) spoke about simple tags (date-of-birth codes) that can be used to track resources by providing text within the resource that can be searched-for on Google.

Nathan Yergler gave his last presentation as CTO of Creative Commons. Key points: discovery on the web works best when it aligns with the structure of the web, and the structure of the web is the links. Nathan suggested that the most important link for online learning is Attribution. The next step for discovery (he says) is the use of structured data to support search, e.g. RDFa in Creative Commons licences, and we need to develop practices of linking, attribution and annotation to support this.

Finally, in response to a question from Amber Thomas, “what could we do to mess this up?”, Nathan answered “check-box openness”, that is, stuff that is open just because it is a grant requirement, but with no real commitment. Which aligns nicely with an observation from Wyn’s presentation: although there is support from the top for Open Nottingham, there is no mandate. Individuals get involved if they think it is worthwhile, which many of them do.

Many thanks to those who presented and organised this seminar.


Self description and licences

One of the things that I noticed when I was looking for sources of UKOERs was that when I got to a resource there was often no indication on it that it was open: no UKOER tag, no CC licence information or logo. There may have been some indication of this somewhere along the way, e.g. on a repository’s information page about the resource, but that’s no good if someone arrives from a Google search or a direct link to the resource, or once someone has downloaded the file and put it on their own VLE.

Naomi Korn has written a very useful briefing paper on embedding metadata about Creative Commons licences into digital resources as part of the OER IPR Support project starter pack. All the advice in it is worth following, but please also make sure that licence and attribution information is visible on the resource itself. John has written about this in general terms in his excellent post on OERs, metadata, and self-description, where he points out that this type of self-description “is just good practice”, complemented, not supplanted, by technical metadata.

So, OER resources, when viewed on their own, as if someone had found them through Google or a direct link, should display enough information about authorship, provenance, etc. for the viewer to know that they are open without needing an application to extract the metadata. The cut and paste legal text and technical code generated by the licence selection form on the Creative Commons website is good for this. (Incidentally, for HTML resources this code also includes technical markup so that the displayed text works as encoded metadata, which has been exploited recently by the OpenAttribute browser addon. I know the OpenAttribute team are working on embedding tools for licence selection and code generation into web content management systems and blogs).

Images, videos and sounds present their own specific problems for including human-readable licence text. Following practice from the publishing industry, a small amount of text discreetly tucked away at the bottom or side of an image can be enough. That example was generated by the Xpert attribution tool from an image of a bridge found on Flickr. The Xpert tool also does useful work for sounds and videos; but for sounds it is also possible to follow the example of BBC podcasts and provide spoken information at the beginning or end of the audio, and for videos one can of course have scrolling credits at the end.

UKOER Sources

I have been compiling a directory of how people can get at the resources released by the UKOER pilot phase projects: that is, the websites for human users and the “interoperability end points” for machines, i.e. the RSS and Atom feed URLs, SRU targets, OAI-PMH base URLs and API documentation. This wasn’t nearly as easy as it should have been: I would have hoped that just listing the main URL for each project would have been enough for anyone to get at the resources they wanted, or the interoperability end point, in a click or two, but that often wasn’t the case.

So here are some questions I would like OER providers to answer by way of self assessment, which will hopefully simplify this in the future.

Does your project website have a very prominent link to where the OERs you have released may be found?

The phase 1 technical requirements for delivery platforms said:

Projects are free to use any system or application as long as it is capable of delivering content freely on the open web. … In addition projects should use platforms that are capable of generating RSS/Atom feeds, particularly for collections of resources

So: what RSS feeds do you provide for collections of resources, and where do you describe them? Have you thought about how many items you have in each feed and how well described they are?

Are your RSS feed URLs and other interoperability endpoints easy to find?

Do your interoperability end points work? I mean, have you tested them? Have you spoken to people who might use them?
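One quick self-test is to pull a feed apart, count the items, and check that each has a description. A minimal Python sketch using only the standard library (the sample feed is invented):

```python
import xml.etree.ElementTree as ET

sample_feed = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example OER collection</title>
  <item><title>Lecture 1</title>
        <description>Intro slides, CC BY.</description>
        <link>http://example.org/oer/1</link></item>
  <item><title>Lecture 2</title>
        <description></description>
        <link>http://example.org/oer/2</link></item>
</channel></rss>"""

channel = ET.fromstring(sample_feed).find("channel")
items = channel.findall("item")
# Items with no usable description are exactly the ones a harvester
# or aggregator will struggle to present to users.
undescribed = [i.findtext("title") for i in items
               if not (i.findtext("description") or "").strip()]
print(len(items), undescribed)
```

Running something like this against your real feed URL (fetched with urllib) takes minutes and answers most of the questions above.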

While you’re thinking about interoperability end points: have you ever thought of your URI scheme as one? If, for example, you have a coherent scheme that puts all your OERs under a base URI and, better, provides URIs with some easily identifiable pattern for those OERs that form a coherent collection, then building simple applications such as Google Custom Search Engines becomes a whole lot easier. A good example is how MIT OCW is arranged: almost all of the URIs have the pattern http://ocw.mit.edu/courses/[department]/[courseName]/[resourceType]/[filename].[ext] (the exceptions are things like video recordings where the actual media file is held elsewhere).
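With a scheme like that, the department and resource type can be recovered from the URI alone. A small Python sketch (the pattern and example URL are illustrative, based on the layout described above, not MIT’s definitive scheme):

```python
import re

# Illustrative pattern following the [department]/[courseName]/
# [resourceType]/[filename] layout described above.
OCW_PATTERN = re.compile(
    r"^http://ocw\.mit\.edu/courses/"
    r"(?P<department>[^/]+)/(?P<course>[^/]+)/"
    r"(?P<resource_type>[^/]+)/(?P<filename>[^/]+)$")

m = OCW_PATTERN.match(
    "http://ocw.mit.edu/courses/physics/"
    "8-01-classical-mechanics/lecture-notes/lec1.pdf")
print(m.group("department"), m.group("resource_type"))
```

Anything from a Google Custom Search Engine to a link checker can be scoped or driven by patterns like this, which is the sense in which the URI scheme itself is an interoperability end point.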

JISC CETIS OER Technical Mini Projects Call

JISC has provided CETIS with funding to commission a series of OER Technical Mini Projects to explore specific technical issues that have been identified by the community during CETIS events such as #cetisrow and #cetiswmd and which have arisen from the JISC / HEA OER Programmes.

Mini project grants will be awarded as a fixed fee of £10,000 payable on receipt of agreed deliverables. Funding is not restricted to UK Higher and Further Education Institutions. This call is open to all OER Technical Interest Group members, including those outwith the UK. Membership of the OER TIG is defined as those members of oer-discuss@jiscmail.ac.uk who engage with the JISC CETIS technical discussions.

The CETIS OER Mini Projects are building on rapid innovation funding models already employed by the JISC. In addition to exploring specific technical issues these Mini Projects will aim to make effective use of technical expertise, build capacity, create focussed pre-defined outputs, and accelerate sharing of knowledge and practice. Open innovation is encouraged: projects are expected to build on existing knowledge and share their work openly.

It is expected that three projects will be funded in the first instance. If this model proves successful, additional funding may be made available for further projects.

Technical Mini Project Topics
Project 1: Analysis of Learning Resource Metadata Records

The aim of this mini project is to identify those descriptive characteristics that are frequently recorded for, or associated with, learning resources and that collection managers deem to be important.

The project will undertake a semantic analysis of a large corpus of educational metadata records to identify what properties and characteristics of the resources are being described. Analysis of textual descriptions within these records will be of particular interest e.g. free text used to describe licence conditions, educational levels and approaches.

The data set selected for analysis must include multiple metadata formats (e.g. LOM and DC) and be drawn from at least ten collections. The data set should include metadata from a number of open educational resource collections, but it is not necessary for all records to be from OER collections.
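By way of illustration only (not part of the call), the sort of tallying such an analysis might start from can be sketched in a few lines of Python; the sample Dublin Core records below are invented:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two made-up simple Dublin Core records; a real analysis would load
# thousands of records in several formats (LOM, DC, ...).
records = [
    """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:title>Cell biology slides</dc:title>
         <dc:description>Slides for a first-year lecture. CC BY.</dc:description>
         <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
       </record>""",
    """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:title>Quantum mechanics notes</dc:title>
         <dc:subject>Physics</dc:subject>
       </record>""",
]

counts = Counter()
for xml_text in records:
    root = ET.fromstring(xml_text)
    for el in root:
        # strip the XML namespace, keeping the property name
        counts[el.tag.split("}")[-1]] += 1

print(counts.most_common())
```

The interesting work, of course, is in the semantic analysis of the free-text values, not in counting which elements appear; this is just the first pass.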

For further background information on this topic, and for a list of potential metadata sources, please see Lorna’s blog post on #cetiswmd activities.

Funding: £10,000 payable on receipt of agreed deliverables.

Project 2: Search Log Analysis

Many sites hosting collections of educational materials keep logs of the search terms used by visitors to the site when searching for resources. The aim of this mini project is to develop a simple tool that facilitates the analysis of these logs, classifying the search terms used with reference to the characteristics of a resource that may be described in the metadata. Such information should assist a collection manager in building their collection (e.g. by showing what resources were in demand) and in describing their resources in a way that helps users find them.

The analysis tool should be shown to work with search logs from a number of sites (we have identified some who are willing to share their data) and should produce reports in a format that is readily understood, for example a breakdown of how many searches were for “subjects” and which were the most popular subjects searched for. It is expected that a degree of manual classification will be required, but we would expect the system to be capable of learning how to handle certain terms, and that this learning would be shared between users: a user should not have to tell the system that “Biology” is a subject once they or any other user has done so. The analysis tool should be free to use or install without restriction and should be developed as Open Source Software.

Further information on the sort of data that is available, and what it might mean, is outlined in my blog post Metadata Requirements from the Analysis of Search Logs.

Funding: £10,000 payable on receipt of agreed deliverables.

Project 3: Open Call

Proposals are invited for one short technical project or demonstrator in any area relevant to the management, distribution, discovery, use, reuse and tracking of open educational resources. Topics that applicants may wish to explore include, but are not restricted to: resource aggregations, presentation / visualisation of aggregations, embedded licences, “activity data”, sustainable approaches to RSS endpoint registries, common formats for sharing search logs, analysis of use of advanced search facilities, use of OAI ORE.

Funding: £10,000 payable on receipt of agreed deliverables.

Guidelines

Proposals must be no more than 1500 words long and must include the following information:

  1. The name of the mini project.
  2. The name and affiliation and full contact details of the person or team undertaking the work plus a statement of their experience in the relevant area.
  3. A brief analysis of the issues the project will be addressing.
  4. The aims and objectives of the project.
  5. An outline of the project methodology and the technical approaches the project will explore.
  6. Identification of proposed outputs and deliverables.

Proposals are not required to include a budget breakdown, as projects will be awarded a fixed fee on completion.

All projects must be completed within six months of the date of approval.

Submission Dates

In order to encourage open working practices, project proposals must be submitted to the oer-discuss mailing list at oer-discuss@jiscmail.ac.uk by 17.00 on Friday 8th April. List members will then have until the 17th of April to discuss the proposals and to provide constructive comments. Proposals will be selected by a panel of JISC and CETIS representatives, who will take into consideration comments put forward by OER TIG members. Successful bidders will be notified by the 21st of April, and projects are expected to start in May and end by 31st October 2011.

Successful bidders will be required to disseminate all project outputs under a relevant open licence, such as CC-BY. Projects must post regular short progress updates and all deliverables including a final report to the oer-discuss list and to JISC CETIS.

We encourage all list members to engage with the Mini Projects and to input comments, suggestions and feedback through the list.

If you have any queries about this call please contact Phil Barker at phil.barker@hw.ac.uk

Metadata requirements from analysis of search logs

Many sites hosting collections of educational materials keep logs of the search terms used by visitors who search the site for resources. Since it came up during the CETIS What Metadata (#cetiswmd) event, I have been thinking about what we could learn about metadata requirements from the analysis of these search logs. I’ve been helped by having some real search logs from Xpert to poke at with some Perl scripts (thanks Pat).

Essentially the idea is to classify the search terms used with reference to the characteristics of a resource that may be described in metadata. For example, terms such as “biology”, “English civil war” and “quantum mechanics” can readily be identified as relating to the subject of a resource; “beginners”, “101” and “college-level” relate to educational level; “power point”, “online tutorial” and “lecture” relate in some way to the type of the resource. We believe that knowing such information would assist a collection manager in building their collection (by showing what resources were in demand) and in describing their resources in such a way that helps users find them. It would also be useful to those who build standards for the description of learning resources to know which characteristics of a resource are worth describing in order to facilitate resource discovery. (I had an early run at doing this when OCWSearch published a list of top searches.)

Looking at the Xpert data has helped me identify some complications that will need to be dealt with. Some of the examples above show how a search phrase with more than one word can relate to a single concept, but in other cases, e.g. “biology 101” and “quantum mechanics for beginners”, the search term relates to more than one characteristic of the resource. Some search terms may be ambiguous: “French” may relate to the subject of the resource or its language (or both); “Charles Darwin” may relate to the subject or the author of a resource. Some terms are initially opaque but on investigation turn out to be quite rich: for example, 15.822 is the course code for an MIT OCW course, and so implies a publisher/source, a subject and an educational level. Also, in real data I see the same search term being used repeatedly in a short period of time, I guess an artifact of how someone paging through results is logged as a series of searches: should these be counted as a single search or as multiple searches?

I think these are all tractable problems, though different people may want to deal with them in different ways. So I can imagine an application that would help someone do this analysis. In my mind it would import a search log and allow the user to go through it, search by search, classifying each with respect to the characteristic of the resource to which the search term relates. Tedious work, perhaps, but it wouldn’t take too long to classify enough search terms to get an adequate statistical snapshot (you might want to randomise the order in which the terms are classified, to help ensure the snapshot isn’t drawn from a particularly unrepresentative period of the logs). The interface should help speed things up by allowing the user to classify most searches with a single key press. There could be some computational support: the system would learn how to handle certain terms, and this learning would be shared between users; a user should not have to tell the system that “Biology” is a subject once they or any other user has done so. It may also be useful to distinguish between broad top-level subjects (like biology) and more specific terms like “mitosis”, or alternatively to know that specific terms like “mitosis” relate to the broader term “biology”: in other words, the option to link to a thesaurus might be useful.
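The shared-learning idea can be sketched very simply: once any user has labelled a phrase, record it and apply it automatically next time. A toy Python illustration, where the terms, categories and greedy longest-phrase-first matching are my own choices, just to make the idea concrete:

```python
# Phrases any user has previously classified; in a real tool this
# dictionary would be persistent and shared between users.
known = {
    "biology": "subject",
    "quantum mechanics": "subject",
    "101": "educational level",
    "beginners": "educational level",
    "power point": "resource type",
}

def classify(query):
    """Greedily match known phrases (longest first) in a search query.

    Returns (facets, leftover): facets maps each matched characteristic
    to the phrase found; leftover words still need manual classification.
    """
    q = query.lower()
    facets = {}
    for phrase in sorted(known, key=len, reverse=True):
        if phrase in q:
            facets[known[phrase]] = phrase
            q = q.replace(phrase, " ")
    leftover = [w for w in q.split() if w not in {"for", "the", "of"}]
    return facets, leftover

print(classify("quantum mechanics for beginners"))
print(classify("biology 101"))
```

A real implementation would have to handle the ambiguous and multi-faceted cases above (one facet per category is clearly too crude for a query with two subjects), but even this toy shows that “quantum mechanics for beginners” decomposes cleanly once the phrases are known.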

This still seems achievable and useful to me.