The modern art of metadata

At a meeting relating to the Resource Discovery Task Force the other day, Caroline Williams compared the variety of resource description metadata standards and profiles in Libraries, Archives, Museums and beyond to a Jackson Pollock picture.
She wouldn’t call it a mess (nor would I), but I think we could agree that it’s a bit chaotic.

That got me wondering what the alternatives might be…

Yves Klein?
Monochrome bleu sans titre [Untitled blue monochrome], 1960
Everything the same shade of blue. No, not achievable even if that is what you want.

How about Piet Mondrian?
Nice and neat, but compartmentalized, and with quite a few empty compartments.

Or Picasso,
trying to fit several perspectives into one picture with the result that… well, two noses.

Actually Caroline had the answer when I asked her what she would like to see:

Bridget Riley
June, 1992-2002
Diversity, but working together.

Descriptions and metadata; documents and RDF

I keep coming back to thinking about embedding metadata into human-oriented resource description web pages.

Last week I was discussing RDFa vs triple stores with Wilbert. Wilbert was making the point that publishing RDF is easier to manage, less error prone and easier on the consumer if you deal with it on its own, rather than trying to encode triples and produce a human-readable web page with valid XHTML all at the same time. A valid point, though Wilbert’s starting point was “if you’re wanting to publish RDF”, which left me with the question: when do we want metadata, i.e. encoded machine-readable resource descriptions, when do we want resource descriptions that people can read, and do we really have to separate the two?

Then yesterday, following a recommendation by Dan Rehak, I read this excellent comparison of three approaches that could be used to manage resource descriptions or metadata: relational databases, document stores/NoSQL, and triple stores/RDF. It really helps in that it explains how storing information about “atomic” resources is a strength of document stores (with features like versioning and flexible schemas) and storing relationships is a strength of triple stores (with, you know, features like links between concepts). So you might store information about a resource as an XML document structured by some schema so that you could extract the title, author name, etc., but sometimes you want to give more detail; for example, you might want to show how the subject relates to other subjects, in which case you’re into the world where RDF has strengths. And then again, while an author name is enough for many uses, an unambiguous identifier for the author, encoded so that a machine will understand it as a link to more information about the author, is also useful.
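To make the triple-store side of that concrete, here is a minimal sketch using Python and rdflib; the URIs are invented for illustration, and I’m not suggesting this is how any particular repository does it. The point is that the author becomes a link to more information rather than just a name string.

```python
# A minimal sketch (Python + rdflib) of describing a resource as triples.
# The author is an unambiguous identifier linking to more information,
# not just a name string. All URIs are invented examples.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, FOAF

g = Graph()
resource = URIRef("http://example.org/resources/geometric-algebra-notes")
author = URIRef("http://example.org/people/jane-smith")

g.add((resource, DCTERMS.title, Literal("Applied Geometric Algebra notes")))
g.add((resource, DCTERMS.creator, author))            # a link, not a string
g.add((resource, DCTERMS.subject,
       URIRef("http://example.org/subjects/geometric-algebra")))
g.add((author, FOAF.name, Literal("Jane Smith")))     # the human-readable name

print(g.serialize(format="turtle"))
```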

Also relevant:

CETIS “What metadata…?” meeting summary

Yesterday we had a meeting in London with about 25 people thinking about the question “What metadata is really useful?”

My thinking behind having a meeting on this subject was that resource description can be a lot of effort, so we need to be careful that the decisions we make about how it is done are evidence-based. Given the right data we should be able to get evidence about what metadata is really used for, as opposed to what we might speculate that it is useful for (with the caveat that we need to allow for innovation, which sometimes involves supporting speculative usage scenarios). So, what data do we have, and what evidence could we get, that would help us decide such things as whether providing a description of a characteristic such as the “typical learning time for using a resource” either is or isn’t helpful enough to justify the effort? Pierre Far went to an even more basic level and asked in his presentation: why do we use XML for sharing metadata? Is it the result of a reasoned appraisal of the alternatives, such as JSON, or did it just seem the right thing to do at some point?
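Pierre’s question is easy enough to make concrete. Here is the same invented minimal record serialized both ways with the Python standard library; this is just to show the shape of the choice, not to argue for either format.

```python
# The same invented minimal record serialized as JSON and as XML,
# using only the Python standard library, to make the comparison concrete.
import json
import xml.etree.ElementTree as ET

record = {"title": "Applied Geometric Algebra notes",
          "creator": "Jane Smith",
          "subject": ["geometric algebra", "mathematics"]}

# JSON: one call, and parsed natively by most web-scripting languages.
print(json.dumps(record, indent=2))

# XML: more ceremony, but namespaces and schema validation come with it.
root = ET.Element("record")
for key, value in record.items():
    for item in (value if isinstance(value, list) else [value]):
        ET.SubElement(root, key).text = item
print(ET.tostring(root, encoding="unicode"))
```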

Dan Rehak made the very useful point to me that we need a reason for wanting to answer such questions, i.e. what is it we want to do? what is the challenge? Most of the people in the room were interested in disseminating educational resources (often OERs): some have an interest in disseminating resources that had been provided by their own project or organization, others have an interest in services that help users find resources from a wide range of providers. So I had “help users find resources they needed” as the sort of reason for asking these questions; but I think Dan was after something new, less generic, and (though he would never say this) less vague and unimaginative. What he suggested as a challenge was something like “how do you build a recommender system for learning materials?” Which is a good angle, and I know it’s one that Dan is interested in at the moment; I hope that others can either buy into that challenge or have something equally interesting that they want to do.

I had suggested that user surveys, existing metadata and search logs are potential sources of data reflecting real use and real user behaviour, and no one disagreed, so I structured much of the meeting around discussion of those. We had short overviews of examples of previous work on each of these, and some discussion about that, followed by group discussions in more depth on each. I didn’t want this to be an academic exercise; I wanted the group discussions to turn up ideas that could be taken forward and acted on, and I was happy at the end of the day. Here’s a sampler of the ideas turned up during the day:
* continue to build the collection of background resources and information that I gathered for the meeting.
* promote the use of common survey tools, for example the online tool used by David Davies for the MeDeV subject centre (results here).
* textual analysis of metadata records to show what is being described in what terms.
* sharing search logs in a common format so that they can be analysed by others (echoes here of Dave Pattern’s sharing of library usage data and subsequent work on the business intelligence that can be extracted from it).
* analysis of search logs to show which queries yield zero hits, which would identify topics for which there is unmet demand (a sketch of this follows below).
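As a sketch of that last idea, assuming we had agreed on a shared log format of one query and hit count per line (the format here is invented; agreeing on one is exactly the point), the zero-hit analysis is only a few lines of Python:

```python
# A sketch of the zero-hit analysis, assuming a shared log format of one
# tab-separated "query<TAB>hit_count" line per search. The format is
# invented; agreeing on one is exactly the point of the idea above.
import csv
from collections import Counter

zero_hit = Counter()
with open("search_log.tsv", newline="") as f:
    for query, hits in csv.reader(f, delimiter="\t"):
        if int(hits) == 0:
            zero_hit[query.strip().lower()] += 1

# The most frequently searched-for topics the collection cannot satisfy.
for query, count in zero_hit.most_common(20):
    print(f"{count:5d}  {query}")
```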

In the coming weeks we shall be working through the ideas generated at the meeting in more depth, with the intention of seeing which can actually be brought to fruition. In the meantime keep an eye on the wiki page for the meeting, which I shall be turning into a more detailed record of the event.

Analysing OCWSearch logs

We have a meeting coming up on the topic of investigating what data we have (or could acquire) to answer the question of what metadata is really required to support the discovery, selection, use and management of educational resources. At the same time as I was writing a blog post about that, over at OCWSearch they were publishing the list of top searches for their collection (I think Pierre Far is the person to thank for that). So, what does this tell us about metadata requirements?

I’ve been through the terms in the top half of the list (it says that the list is roughly in descending order of popularity; it would be really good to know more about how popular each search term actually was) and tried to judge what characteristic or property of the resource the searcher was searching on.

There were just under 170 search terms in total. It doesn’t surprise me that the vast majority (over 95%) of them are subject searches. Both higher-level, broad subject terms (disciplines, e.g. “Mathematics”) and lower-level, finer-grained subject terms (topics, e.g. “Applied Geometric Algebra”) crop up in abundance. I’m not sure you can say much about their relative importance.

What’s left is (to me) more interesting. We have:

  • resource types, specifically: “online text book”, “audio”, “online classes”.
  • people, who seem to be staff at MIT; while it’s possible someone is searching for material about them or about their theories, I think it is more likely that people are searching for them as resource creators.
  • level, specifically: 101, Advanced (x2), college-level. These are often used in conjunction with subject terms.
  • Course codes e.g. HSM 260, 15.822, Psy 315. (These also imply a level and a subject.)

I think with more data and more time spent on the analysis we could get some interesting results from this sort of approach.
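For what it’s worth, here is a rough sketch in Python of the kind of rule-based categorisation I did by hand; the regular expression and word lists are my own guesses for illustration, not anything derived from the OCWSearch data itself.

```python
# A rough sketch of the hand categorisation above: bucket each search term
# as a course code, resource type, level or (by default) a subject search.
# The regular expression and word lists are illustrative guesses only.
import re

LEVEL_TERMS = {"101", "advanced", "college-level", "introductory"}
TYPE_TERMS = {"audio", "video", "online text book", "online classes"}
COURSE_CODE = re.compile(r"^[a-z]{2,4}\s*\d{2,3}$|^\d+\.\d+$", re.IGNORECASE)

def categorise(term: str) -> str:
    t = term.strip().lower()
    if COURSE_CODE.match(t):
        return "course code"
    if t in TYPE_TERMS:
        return "resource type"
    if any(word in LEVEL_TERMS for word in t.split()):
        return "level"
    return "subject"

for term in ["Applied Geometric Algebra", "HSM 260", "15.822",
             "advanced calculus", "online text book"]:
    print(f"{term!r:32} -> {categorise(term)}")
```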

Jorum and Google ranking

Les Carr has posted an interesting analysis of Visibility of OER Material: the Jorum Learning and Teaching Competition. He searches Google for six resources and compares the ranking, in the results page, of the copy in Jorum with that of the same resource hosted elsewhere. The results are mixed: sometimes Jorum takes the top place, sometimes some other site (an institutional or author’s site) does, though it should be said that, with one exception, we’re talking about which is first and which is second. In other words, both would be found quite easily.

Les concludes:

Can we draw any general patterns from this small sample? To be honest, I don’t think so! The range of institutions is too diverse. Some of the alternative locations are highly visible, so it is not surprising that Jorum is eclipsed by their ranking (e.g. Cambridge, very newsworthy Gurkhas international organisation). Some 49% of Open Jorum’s records provide links to external sources rather than holding bitstream contents directly. It would be very interesting to see the bigger picture of OER visibility by undertaking a more comprehensive survey.

Yes it would be very interesting to see the bigger picture, and also it would be interesting to see a more thorough investigation of just the Jorum’s role (I don’t think Les will mind the implication that he has no more than scraped the surface).

Some random thoughts that this raises in my mind:

  • Title searches are too easy; the quality of resource description will only be tested by searching for the keywords that are really used by people looking for these resources. Some will know the title of the resource, but not many. Just have a play with using the most important one or two words from the title rather than the whole title and see how the results change.
  • To say that Jorum enhances/doesn’t enhance visibility depending on whether it comes above or below the alternative sites is too simplistic. If it links to the other site Jorum will enhance the visibility of that site even if it ranks below it; having the same resource represented twice in the search engine results page enhances its visibility no matter what the ordering; on the other hand, having links from elsewhere pointing to two separate sites probably reduces the visibility of both.
  • Sometimes Jorum hosts a copy of the resource, sometimes it just points to a copy elsewhere; that’s got to have an effect (hasn’t it?).
  • What is the cause of the difference? When I’ve tried similar (very superficial) comparisons, I’ve noticed that Jorum gets some of the basics of SEO right (e.g. using the resource’s title in the HTML title element; curiously it doesn’t seem to provide a description meta element): a quick check of these basics is sketched below. How does this compare to other hosts? I’ve noticed some other OER sites that don’t get this right, so we could see Jorum as guaranteeing a certain basic quality of resource discovery rather than as necessarily enhancing visibility. (Question: is this really necessary?)
  • What happens over time? Do people link to the copy in Jorum or elsewhere? This will vary a lot, but there may be a trend. I’ll note in passing that choosing six resources that had been promoted by Jorum’s learning and teaching competition may have skewed the results.
  • Which should be highest ranked anyway? Do we want Jorum to be highly ranked to reflect its role as part of the national infrastructure, a place to showcase what you’ve produced; or do institutions see releasing OERs as part of a marketing strategy, in which case the best Jorum can do is quietly improve the ranking of the OERs on the institution’s site by linking to them? This surely relates to the choice between having Jorum host the resource or just having it link to the resource on the institution’s site (doesn’t it?).

Sorry, all questions and no answers!
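Well, almost no answers: one of the basics mentioned above is at least easy to check. Here is a quick sketch, using only the Python standard library, that fetches a page and reports whether it has a sensible title element and a description meta element; the URL at the bottom is a placeholder, not a real Jorum page.

```python
# A quick check of the two SEO basics mentioned above for any page hosting
# a resource: is the resource title in the <title> element, and is there a
# description meta element at all? Standard library only; URL is a placeholder.
from html.parser import HTMLParser
from urllib.request import urlopen

class BasicsParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        if tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def check(url):
    parser = BasicsParser()
    parser.feed(urlopen(url).read().decode("utf-8", errors="replace"))
    print(url)
    print("  title:", parser.title.strip() or "(missing)")
    print("  meta description:", parser.description or "(missing)")

# check("http://example.org/some-oer-landing-page")   # placeholder URL
```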

What do we know about educational metadata requirements?

We at CETIS are in the early stages of planning a meeting (pencilled in for October, date and venue tbc) to collect and compare evidence on what we know about user requirements for metadata to support the discovery, retrieval, use and management of educational resources. We would like to know who has what to contribute, so if you’re in the business of creating metadata for educational resources, please come and tell us what it is useful for.

One approach taken to developing metadata standards and application profiles is to start with use cases and derive requirements from them; the problem is that when standardizing a new domain these use cases are often aspirational. In other words, someone argues a case for describing some characteristic of a resource (shall we use “semantic density” as an example?) because they would like to use those descriptions in some future application that they think would be valuable. Whether or not that application materialises, the metadata describing the characteristic remains in the standard. Once the domain matures we can look back at what is actually useful practice. Educational metadata is now a mature domain, and some of this reviewing of what has been found to be useful is happening; it is this that we want to extend. We hope that in doing so we will help those involved in disseminating and managing educational resources make evidence-based decisions on what metadata they should provide.

I can think of three approaches for carrying out a review of what metadata really is useful. The first is to look at what metadata has been created, that is, what fields have been used. This has been happening for some time now; for example, back in 2004 Norm Friesen looked at LOM instances to see which elements were used, and Carol Jean Godby looked at application profiles to see which elements were recommended for use. More recent work associated with the IEEE LOM working group seems to confirm the findings of these early studies. The second approach is to survey users of educational resources to find out how they search for them. David Davies presented the results of a survey asking “what do people look for when they search online for learning resources?” at a recent CETIS meeting. Finally, we can look directly at the logs kept by repositories and catalogues of educational materials to ascertain the real search habits of users, e.g. what terms they search for, what characteristics they look for, and which browse links they click. I’m not sure that this final type of information is shared much, if at all, at present (though there have been some interesting ideas floated recently about sharing various types of analytic information for OERs, and there is the wider Collective Intelligence work of OLNet). If you have information from any of these approaches (or one I haven’t thought of) that you would be willing to share at the meeting, I would like to hear from you. Leave a comment below or email phil.barker@hw.ac.uk.
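To illustrate the first of those approaches, here is a sketch of counting which element paths actually occur across a directory of LOM XML instances. It is deliberately naive (namespaces are stripped, and real LOM records would need more care), and the directory name is a placeholder.

```python
# A sketch of the first approach: given a directory of LOM XML instances,
# count which element paths actually occur in them. Namespace handling is
# deliberately crude; real LOM records would need more care than this.
import glob
import xml.etree.ElementTree as ET
from collections import Counter

def paths(elem, prefix=""):
    tag = elem.tag.split("}")[-1]            # drop any namespace
    path = f"{prefix}/{tag}"
    yield path
    for child in elem:
        yield from paths(child, path)

usage = Counter()
for filename in glob.glob("lom_records/*.xml"):   # placeholder directory
    root = ET.parse(filename).getroot()
    usage.update(set(paths(root)))           # count each path once per record

for path, count in usage.most_common(25):
    print(f"{count:6d}  {path}")
```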

CETIS Gathering

At the end of June we ran an event about technical approaches to gathering open educational resources. Our intent was that we would provide space and facilities for people to come and talk about these issues, but we would not prescribe anything like a schedule of presentations or discussion topics. So, people came, but what did they talk about?

In the morning we had a large group discussing approaches to aggregating resources and information about them through feeds such as RSS or Atom, and another, smaller group discussing tracking what happens to OER resources once they are released.

I wasn’t part of the larger discussion, but I gather that they were interested in the limits of what can be brought in by RSS and the difficulties due to the (shall we say) flexible semantics of the elements typically used in RSS, even when extended in the typical way with Dublin Core. They would like to bring in information which is more tightly defined, and also information from a broader range of sources relating to the actual use of the resource. They would also like to identify the contents of resources at a finer granularity (e.g. an image or movie rather than a lesson) while retaining the context of the larger resource. These are perennial issues, and bring to my mind technologies such as OAI-PMH with metadata richer than the default Dublin Core, Dublin Core Terms (in contrast to the Dublin Core Element Set), OAI-ORE, and projects such as PerX and TicToCs (see JournalToCs) (just to mention two which happened to be based in the same office as me). At CETIS we will continue to explore these issues, but I think it is recognised that the solution is not as simple as using a new metadata standard that is in some way better than what we have now.
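For the RSS and Dublin Core end of that discussion, a minimal sketch of pulling the typically used elements out of a feed looks something like the following; the feed URL is invented, and the namespace URI is the standard Dublin Core elements one.

```python
# A minimal sketch of the RSS + Dublin Core aggregation discussed above:
# pull title, dc:creator and dc:subject out of each item of an RSS 2.0 feed.
# The feed URL is invented; the namespace URI is the standard DC elements one.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

DC = "{http://purl.org/dc/elements/1.1/}"

def items(feed_url):
    tree = ET.parse(urlopen(feed_url))
    for item in tree.iter("item"):
        yield {
            "title": item.findtext("title"),
            "creator": item.findtext(f"{DC}creator"),
            "subjects": [s.text for s in item.findall(f"{DC}subject")],
            "link": item.findtext("link"),
        }

# for record in items("http://example.org/oer/feed.rss"):   # invented feed
#     print(record)
```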

The discussion on tracking resources (summarized here by Michelle Bachler) was prompted by some work from the Open University’s OLNet on Collective Intelligence, and also some CETIS work on tracking OERs. For me the big “take-home” idea was that many individual OER providers and services must hold information about the use of their resources which, while interesting in itself, would become really useful if made available more widely. So how about, for example, open usage information about open resources? That could really give us some data to analyse.

There were some interesting overlaps between the two discussions: for example, how to make sure that a resource is identified in such a way that you can track it and gather information about it from many sources, and what role usage information can play in the full description of a resource.

After lunch we had a demo of a search service built by cross-searching Web 2.0 resource hosts via their APIs, which has been used by the Engineering Subject Centre’s OER pilot project. This led on to a discussion of the strengths and limitations of this approach: essentially it is relatively simple to implement and can be used to provide a tailored search for a specialised OER collection, so long as the number of targets being searched is reasonably low and their APIs are stable and reliable. The general approach of pulling in information via APIs could be useful for pulling in some of the richer information discussed in the morning. The diversity of APIs led on to another well-rehearsed discussion mentioning SRU and OpenSearch as standard alternatives.
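The cross-search pattern itself is simple enough to sketch: fan the query out to each host’s API, normalise the responses, and merge the results. The endpoints and response shapes below are placeholders rather than real services, which is rather the point about the diversity of APIs.

```python
# A sketch of the cross-search pattern: send the query to each host's search
# API, normalise the responses, merge the results. The endpoints and response
# shapes are placeholders, not real services.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HOSTS = {
    "hostA": "http://api.hosta.example/search?",     # hypothetical APIs
    "hostB": "http://api.hostb.example/v1/find?",
}

def search_host(name, base_url, query):
    with urlopen(base_url + urlencode({"q": query})) as response:
        data = json.load(response)
    # assume each host returns {"results": [{"title": ..., "url": ...}, ...]}
    return [{"host": name, "title": r["title"], "url": r["url"]}
            for r in data.get("results", [])]

def cross_search(query):
    merged = []
    for name, base_url in HOSTS.items():
        try:
            merged.extend(search_host(name, base_url, query))
        except OSError:
            pass     # a slow or broken API shouldn't sink the whole search
    return merged

# print(cross_search("geometric algebra"))
```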

We also had a demonstration of the iCOPER search / metadata enrichment tool, which uses REST, Atom and SPI to allow annotation of metadata records: very interesting as a follow-on from the discussions above, which were beginning to see metadata not as a static record but as an evolving body of information associated with a resource.

Throughout the day, but especially after these demos, people were talking in twos and threes, finding out about QTI, Xerte, Cohere, and anything else that one person knew about and others wanted to learn about. I hope people who came found it useful, but it’s very difficult as an organiser of such an event to provide a single definitive summary!

Additional technical work for UKOER

CETIS has been funded by JISC to do some additional technical work relevant to the UKOER programme. The work will cover three topics: deposit via RSS feeds, aggregation of OERs, and tracking & analysis of OER use.

Feed deposit
There is a need for services hosting OERs to provide a mechanism for depositors to upload multiple resources with minimal human intervention per resource. One possible way to meet this requirement that has already been identified by some projects is “feed deposit”. This approach is inspired by the way in which metadata and content are loaded onto user devices and applications in podcasting. In short, RSS and Atom feeds are capable, in principle, of delivering the metadata required for deposit into a repository and, in addition, can either provide a pointer to the content or embed the content itself in the feed. There are a number of issues with this approach that would need to be overcome.
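In outline (and only in outline: the element choices are mine, not an agreed deposit profile), a single entry of such a deposit feed might be built like this, with the content carried as an enclosure-style link in the podcasting manner.

```python
# An outline of a single Atom entry for feed deposit: descriptive metadata
# plus an enclosure-style link pointing at the content, podcast fashion.
# The metadata choices are illustrative, not an agreed deposit profile.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("", ATOM)
ET.register_namespace("dcterms", DCTERMS)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Applied Geometric Algebra notes"
ET.SubElement(entry, f"{{{ATOM}}}id").text = "http://example.org/resources/123"
ET.SubElement(entry, f"{{{DCTERMS}}}license").text = (
    "http://creativecommons.org/licenses/by/2.0/uk/")
ET.SubElement(entry, f"{{{ATOM}}}link", {
    "rel": "enclosure",                       # the pointer to the content
    "type": "application/zip",
    "href": "http://example.org/resources/123/content.zip",
})

print(ET.tostring(entry, encoding="unicode"))
```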

In this work we will: (1) Identify projects, initiatives, services, etc. that are engaged in relevant work [if that’s you, please get in touch]. (2) Identify and validate the issues that would arise with respect to feed deposit, starting with those outlined in the Jorum paper linked to above. (3) Identify current approaches used to address these issues, and identify where consensus may be readily achieved.

Aggregation of OERs
There is interest in facilitating a range of options for the provision of aggregations of resources representing the whole or a subset of the UKOER programme output (possibly along with resources from other sources). There have been some developments that implement solutions based on RSS aggregation, e.g. Ensemble and Xpert; and the UKOLN tagometer measures the number of resources on various sites that are tagged as relevant to the UKOER programme.

In this work we will illustrate and report on other approaches, namely (a) Google custom search, (b) query and result aggregation through Yahoo Pipes, and (c) querying through the host service APIs. We will document the benefits and affordances as well as the drawbacks and limitations of each of these approaches, including the ease with which they may be adopted, the technical expertise necessary for their development, their dependency on external services (which may still be in beta), their scalability, etc.
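As an indication of what approach (a) involves, here is a sketch of querying a Google Custom Search Engine programmatically via the Custom Search JSON API; you would need an API key and the id (cx) of a CSE configured to cover the relevant OER sites, and both values below are placeholders.

```python
# A sketch of approach (a): querying a Google Custom Search Engine through
# the Custom Search JSON API. The key and cx values are placeholders; the
# CSE itself would be configured to cover the OER sites of interest.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_KEY = "YOUR_API_KEY"     # placeholder
CSE_ID = "YOUR_CSE_ID"       # placeholder

def cse_search(query):
    params = urlencode({"key": API_KEY, "cx": CSE_ID, "q": query})
    with urlopen("https://www.googleapis.com/customsearch/v1?" + params) as r:
        data = json.load(r)
    return [(item["title"], item["link"]) for item in data.get("items", [])]

# for title, link in cse_search("ukoer geometric algebra"):
#     print(title, link)
```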

Tracking and analysis of OER use
Monitoring, through technical means, the release of resources through various channels, how those resources are used and reused, and the comments and ratings associated with them is highly relevant to evaluating the uptake of OERs. CETIS have already described some of the options for resource tracking that are relevant to the UKOER programme.

In this work we will write and commission case studies to illustrate the use of these methods, and synthesise the lessons learnt from this use.

Who’s involved in this work
The work will be managed by me, Phil Barker, and Lorna M Campbell.

Lisa J Rogers will be doing most of the work related to feed deposit and aggregation of OERs.

R John Robertson will be doing most of the work relating to tracking and analysis of OER use.

Please do contact us if you’re interested in this work.

Resource profiles for learning materials

Stephen Downes has written a position paper which builds on his idea of Resource Profiles from 2003. The abstract runs:

Existing learning object metadata describing learning resources postulates descriptions contained in a single document. This document, typically authored in IEEE-LOM, is intended to be descriptively complete, that is, it is intended to contain all relevant metadata related to the resource. Based on my 2003 paper, Resource Profiles, an alternative approach is proposed. Any given resource may be described in any number of documents, each conforming to a specification relevant to the resource. A workflow is suggested whereby a resource profiles engine manages and combines this data, producing various views of the resource, or in other words, a set of resource profiles.

I’ve been interested in the idea of resource profiles since I first read about them, but more recently had them in mind while doing the Learning Materials Application Profile Scoping Study. Throughout that work we found heterogeneity to be a big theme: different metadata requirements and standards for different resource types, different requirements for supporting different activities, and information (potentially) available from a diverse range of systems. These all align well with what Downes says about resource profiles (and I wish I had said more along those lines in the report).

One thing I’d like to see demonstrated is how you link all the information together. The same resource is going to be identified differently in different systems, and sometimes not at all. So if you have a video of a lecture you might want to pull in technical metadata from one system (remembering that the same video may be available in different technical formats), licence metadata from another system which uses a different identifier, and link it to information about the course for which the lecture was delivered, held in a system that doesn’t know about the video at all. How do you make and maintain these links? Some of the semantic web ideas will help here (e.g. providing ways of saying that the resource identified here and the resource identified there are the same as each other, or providing typed relations: “this resource was used in that course”). One of the positive things I’ve seen in the DC-Ed Application Profile and ISO-MLR Education work is that they are both building domain models that make these relationships explicit (see the DC-Ed model and ISO MLR model).
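The semantic web machinery for those links is at least easy to write down. Here is a sketch with rdflib and invented identifiers: one statement says that two systems’ identifiers denote the same video, another gives the typed relation to the course it was used in (the wasUsedInCourse term is made up for the example).

```python
# A sketch (rdflib, invented identifiers) of linking descriptions held in
# different systems: assert that two identifiers denote the same video, and
# give a typed relation to the course it was used in. The wasUsedInCourse
# term is made up for the example.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import OWL, DCTERMS

EX = Namespace("http://example.org/terms/")

video_in_repo = URIRef("http://repository.example/items/lecture42")
video_on_host = URIRef("http://videohost.example/v/abc123")
course = URIRef("http://vle.example/courses/physics-101")

g = Graph()
g.add((video_in_repo, OWL.sameAs, video_on_host))        # same resource
g.add((video_in_repo, EX.wasUsedInCourse, course))        # typed relation
g.add((video_on_host, DCTERMS.format, Literal("video/mp4")))  # held elsewhere

print(g.serialize(format="turtle"))
```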

This work also reminds us that much of the educational information that we would like to gather relates not so much to the resource per se as to the use (intended or actual) of the resource in an educational setting. Maybe some of the work on tracking OER use could be helpful here: one of the challenges with tracking OERs is to discover when an OER has been used by someone and what they used it for. If (and it is a very big if; it won’t happen accidentally) that leads you on to metadata about their use, then perhaps you could start to gather meaningful information about what education level the resource relates to, etc.