Phil has already blogged a summary of last week’s memorably tagged What Metadata or cetiswmd meeting. During the latter part of the meeting we split up to discuss practical tasks and projects that the community could undertake, with support from CETIS and JISC, to explore the kinds of issues that were raised at the meeting. We agreed to draft a rough outline of some of these potential activities and then feed them back to the community for comment and discussion. So if you have any thoughts or suggestions please let us know. CETIS are proposing to set up a task group or working group of some kind to develop this work and to provide a forum to explore technical issues relating to resource description, management and discovery in the context of open educational resources.
I helped to facilitate the breakout group that focused on what we might be able to achieve by looking at existing metadata collections. Here’s an outline of the activity we discussed.
Textual Analysis of Metadata Records
A large number of existing collections of metadata records were identified by participants, including NDLR, JorumOpen, OU OpenLearn and US data.gov collections, all of which could be analysed to ascertain which fields are used most widely and how they are described. Clearly this metadata exists in a wide range of heterogeneous formats, so the task is not as simple as comparing like with like. The “traditional” way to compare different metadata schemas and records is through the use of crosswalks. However, developing crosswalks is a non-trivial task that in itself requires considerable time and resource.
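As a rough illustration of what “ascertaining which fields are used most widely” could look like in practice, here is a minimal Python sketch. It assumes the records are available as XML strings; the two sample records, the function names and the decision to ignore namespaces are all made up for illustration, and real LOM or Dublin Core records are considerably more nested than this.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop any '{namespace}' prefix so differently namespaced 'title' fields count together."""
    return tag.rsplit("}", 1)[-1]


def field_usage(xml_records):
    """Count how many records contain each non-empty leaf field."""
    usage = Counter()
    for xml_text in xml_records:
        root = ET.fromstring(xml_text)
        # A set, so a field repeated within one record is only counted once.
        present = {
            local_name(el.tag)
            for el in root.iter()
            if len(el) == 0 and (el.text or "").strip()
        }
        usage.update(present)
    return usage


# Two invented records in different (heavily simplified) formats.
records = [
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Intro to Caving</dc:title>"
    "<dc:creator>N. Casteret</dc:creator></record>",
    "<lom><general><title>Cave Surveying</title>"
    "<description>A short course.</description></general></lom>",
]

usage = field_usage(records)
print(usage.most_common())
```

Matching on the bare element name is a crude stand-in for a crosswalk: it will conflate fields that share a name but mean different things, which is exactly the sort of issue a real analysis would need to look at.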
An alternative approach was put forward by ADL’s Dan Rehak who suggested treating the metadata collections as text, stripping out fields and formatting and running the raw data through a semantic analysis tool such as Open Calais. Open Calais uses natural language processing, machine learning and other methods to analyse documents and find the entities within them. Calais claim to go “well beyond classic entity identification and return the facts and events hidden within your text as well.”
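To make the “treat it as text” idea concrete, the sketch below strips the field names and formatting from a record and keeps only the text content. It assumes XML input and uses an invented sample record; it deliberately stops short of calling Open Calais or any other analysis service, since how the text would be submitted is something we would still need to work out.

```python
import xml.etree.ElementTree as ET


def record_to_text(xml_text):
    """Return the text content of a record with all element names and whitespace runs removed."""
    root = ET.fromstring(xml_text)
    chunks = (chunk.strip() for chunk in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


# Invented sample record, for illustration only.
sample = (
    "<record><title>Cave Surveying</title>\n"
    "  <creator>Norbert Casteret</creator></record>"
)

plain_text = record_to_text(sample)
print(plain_text)
```

Note that after stripping, nothing in the output says which words were the title and which were the creator.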
Applying data mining and semantic analysis techniques to a large corpus of educational metadata records would be an interesting exercise in itself, but until we attempt such an analysis it’s hard to speculate what it might be possible to achieve with the output data. It would certainly be valuable to compare frequently occurring terms and relationships with an analysis of web search logs to see if the metadata records are actually describing the characteristics that users are searching for.
There was general agreement amongst participants that this would be an interesting and innovative project. Participants felt it would be advisable to start small with a comparison of two or three metadata collections, possibly those of JorumOpen, Xpert and the OU OpenLearn, before taking this forward further.
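A very simple version of that comparison might look like the following Python sketch. It assumes we already have the stripped metadata text and a list of search queries pulled from a log; the sample data, the function names and the naive lower-case word matching are all assumptions made for illustration, not a proposed method.

```python
import re
from collections import Counter


def term_counts(texts):
    """Count lower-cased alphabetic words across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def coverage(metadata_texts, search_queries):
    """Return the share of distinct search terms found in the metadata, plus the unmatched terms."""
    meta = term_counts(metadata_texts)
    searched = term_counts(search_queries)
    unmatched = sorted(term for term in searched if term not in meta)
    matched = len(searched) - len(unmatched)
    share = matched / len(searched) if searched else 0.0
    return share, unmatched


# Invented data, for illustration only.
metadata_texts = [
    "Cave Surveying Norbert Casteret",
    "Introduction to geology for beginners",
]
search_queries = ["geology beginners", "cave video", "geology podcast"]

share, unmatched = coverage(metadata_texts, search_queries)
print(share, unmatched)
```

In this toy example the unmatched terms are about the format of the resource rather than its subject, which is the kind of gap between what is described and what is searched for that the comparison is meant to surface.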
One thing I am slightly unsure about regarding this method is that Open Calais identifies the relationship between words but once we strip out the metadata encoding of our sample records this information will be lost. I don’t know enough about how these semantic analysis tools work to know whether this is a problem or if they are clever enough for this not to be an issue. I suppose the only way we’ll find out if the results are sensible or useful is to give it a try!
I’d also be very interested to hear how this approach compares with work being undertaken on a much larger scale by the Digging into Data Challenge projects and Mimas’ Bringing Meaning into Search initiative.
Other Activities
Phil has already summarised the other possible tasks and activities put forward by the other breakout groups which include:
- Establishing a common format for sharing search logs.
- Identifying which fields are used on advanced search forms and how many people use advanced search facilities.
- Analysing the relative proportion of users who search and browse for resources, and how many people click onwards from the initial resources.
- Developing further the search questionnaire used by David Davies. If sufficient responses could be gathered to the same questions, this would facilitate meta-analysis of the results.
- Working with communities around specific repositories to find out what works and doesn’t work across individual platforms and installations.
- Creating a research question inventory on the CETIS wiki and inviting people to put forward ideas.
If anyone has any comments or suggestions on any of the above ideas we’d love to hear from you!
On the suggestion of analysing stripped-out metadata using OpenCalais (or similar offerings; I gather Zemanta, for example, is more likely than OpenCalais to serve up “public” URIs for entities), I think you would have to compose proper sentences from the metadata rather than just stripping it, e.g. “This presentation was created by Norbert Casteret”.
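A minimal Python sketch of that sentence-composing idea, assuming each record has already been reduced to a simple field-to-value mapping. The templates, field names and function name are invented for illustration; which fields to cover and how to word them would need deciding per schema.

```python
# Hypothetical templates: one sentence pattern per metadata field.
TEMPLATES = {
    "title": 'This {type} is titled "{value}".',
    "creator": "This {type} was created by {value}.",
    "subject": "This {type} is about {value}.",
}


def record_to_sentences(record, resource_type="resource"):
    """Turn a field-to-value mapping into sentences, skipping fields without a template."""
    sentences = []
    for field, value in record.items():
        template = TEMPLATES.get(field)
        if template and value:
            sentences.append(template.format(type=resource_type, value=value))
    return " ".join(sentences)


# Invented record, for illustration only.
text = record_to_sentences(
    {"title": "Underground Rivers", "creator": "Norbert Casteret"},
    resource_type="presentation",
)
print(text)
```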
An ingenious approach to crosswalk!
Hi Adam, thanks for the ref to Zemanta. It may be that OpenCalais is not the most appropriate tool for this kind of semantic analysis; that would be one of the things we’d want to investigate.