Descriptions and metadata; documents and RDF

I keep coming back to thinking about embedding metadata into human-oriented resource descriptions, i.e. web pages.

Last week I was discussing RDFa vs triple stores with Wilbert. Wilbert was making the point that publishing RDF is easier to manage, less error-prone and easier on the consumer if you deal with it on its own, rather than trying to encode triples and produce a human-readable web page with valid XHTML all at the same time. A valid point, though Wilbert’s starting point was “if you’re wanting to publish RDF”, which left me still with the question: when do we want metadata, i.e. encoded machine-readable resource descriptions, when do we want resource descriptions that people can read, and do we really have to separate the two?

Then yesterday, following a recommendation by Dan Rehak, I read this excellent comparison of three approaches that could be used to manage resource descriptions or metadata: relational databases, document stores/NoSQL, and triple stores/RDF. It really helps in that it explains how storing information about “atomic” resources is a strength of document stores (with features like versioning and flexible schemas) and storing relationships is a strength of triple stores (with, you know, features like links between concepts). So you might store information about a resource as an XML document structured by some schema so that you could extract the title, author name and so on; but sometimes you want to give more detail, e.g. you might want to show how the subject relates to other subjects, in which case you’re into the world where RDF has strengths. And then again, while an author name is enough for many uses, an unambiguous identifier for the author, encoded so that a machine will understand it as a link to more information about the author, is also useful.
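To make that contrast concrete, here is a rough sketch of the same resource description held both ways. Everything here (the schema, the URIs, the resource itself) is invented for illustration; triples are shown as plain tuples rather than via any particular RDF library.

```python
import xml.etree.ElementTree as ET

# Document-store style: one self-contained, "atomic" record, easy to
# version as a unit, and the schema can vary from record to record.
record = """<resource>
  <title>Introduction to Tidal Energy</title>
  <author>J. Bloggs</author>
  <subject>Renewable energy</subject>
</resource>"""

doc = ET.fromstring(record)
title = doc.findtext("title")  # extract fields the schema promises
print(title)

# Triple-store style: the same facts as subject-predicate-object
# statements, with the author as an unambiguous identifier (a URI)
# rather than just a name string. URIs are hypothetical.
RES = "http://example.org/resource/42"
triples = [
    (RES, "dc:title", "Introduction to Tidal Energy"),
    (RES, "dc:creator", "http://example.org/person/j-bloggs"),
    ("http://example.org/person/j-bloggs", "foaf:name", "J. Bloggs"),
]

# Relationships fall out naturally: follow the creator link to find
# more statements about that person.
author_uri = next(o for s, p, o in triples if s == RES and p == "dc:creator")
print(author_uri)
```

The document version is convenient to validate and version as a whole; the triples version makes the author an addressable thing you can keep saying more about.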

Also relevant:

CETIS “What metadata…?” meeting summary

Yesterday we had a meeting in London with about 25 people thinking about the question “What metadata is really useful?”

My thinking behind having a meeting on this subject was that resource description can be a lot of effort, so we need to be careful that the decisions we make about how it is done are evidence-based. Given the right data we should be able to get evidence about what metadata is really used for, as opposed to what we might speculate that it is useful for (with the caveat that we need to allow for innovation, which sometimes involves supporting speculative usage scenarios). So, what data do we have and what evidence could we get that would help us decide such things as whether providing a description of a characteristic such as the “typical learning time for using a resource” either is or isn’t helpful enough to justify the effort? Pierre Far went to an even more basic level and asked in his presentation: why do we use XML for sharing metadata? Is it the result of a reasoned appraisal of the alternatives, such as JSON, or did it just seem the right thing to do at some point?
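Pierre’s question can be made concrete: for a simple record there is little to choose between the two serializations in terms of what you can express; the differences are in tooling, validation and habit. A sketch, with field names invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

# The same (invented) minimal metadata record, serialized both ways.
xml_record = (
    "<resource><title>Ocean Physics</title>"
    "<author>A. Nother</author></resource>"
)
json_record = '{"title": "Ocean Physics", "author": "A. Nother"}'

# Both parse with standard-library tools and yield the same field.
title_from_xml = ET.fromstring(xml_record).findtext("title")
title_from_json = json.loads(json_record)["title"]
print(title_from_xml == title_from_json)
```

At this level the choice looks like a matter of convention; the appraisal Pierre asked for would weigh things like schema validation, namespaces and the consuming applications’ ecosystems.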

Dan Rehak made the very useful point to me that we need a reason for wanting to answer such questions, i.e. what is it we want to do? what is the challenge? Most of the people in the room were interested in disseminating educational resources (often OERs): some have an interest in disseminating resources that had been provided by their own project or organization, others have an interest in services that help users find resources from a wide range of providers. So I had “help users find resources they needed” as the sort of reason for asking these questions; but I think Dan was after something new, less generic, and (though he would never say this) less vague and unimaginative. What he suggested as a challenge was something like “how do you build a recommender system for learning materials?” Which is a good angle, and I know it’s one that Dan is interested in at the moment; I hope that others can either buy into that challenge or have something equally interesting that they want to do.

I have suggested that user surveys, existing metadata and search logs are potential sources of data reflecting real use and real user behaviour, and no one has disagreed, so I structured much of the meeting around discussion of those. We had short overviews of examples of previous work on each of these, and some discussion about that, followed by more in-depth group discussions of each. I didn’t want this to be an academic exercise; I wanted the group discussions to turn up ideas that could be taken forward and acted on, and I was happy at the end of the day. Here’s a sampler of the ideas turned up during the day:
* continue to build the resources with background information that I gathered for the meeting.
* promote the use of common survey tools, for example the online tool used by David Davies for the MeDeV subject centre (results here).
* textual analysis of metadata records to show what is being described in what terms.
* sharing search logs in a common format so that they can be analysed by others (echoes here of Dave Pattern’s sharing of library usage data and subsequent work on business intelligence that can be extracted from it).
* analysis of search logs to show which queries yield zero hits, which would identify topics on which there is unmet demand.
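The last of those is straightforward to prototype. A hedged sketch, assuming a search log that records each query alongside its hit count, one search per line; the log format and the queries themselves are invented:

```python
from collections import Counter

# Hypothetical log lines: "query<TAB>number_of_hits".
log = """tidal energy\t14
learning styles\t0
oer licensing\t3
learning styles\t0
quantum knitting\t0
"""

zero_hits = Counter()
for line in log.strip().splitlines():
    query, hits = line.rsplit("\t", 1)
    if int(hits) == 0:
        zero_hits[query] += 1

# The most frequent zero-hit queries point at unmet demand.
print(zero_hits.most_common(2))
```

In practice the hard part is the one the bullet above it identifies: agreeing a common log format so that analyses like this can run across providers rather than within one service.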

In the coming weeks we shall be working through the ideas generated at the meeting in more depth with the intention of seeing which can actually be brought to fruition. In the meantime keep an eye on the wikipage for the meeting which I shall be turning into a more detailed record of the event.