Feeding a repository

There has been some discussion recently about mechanisms for remote or bulk deposit in repositories and similar services. David Flanders ran a very thought-provoking and lively show-and-tell meeting a couple of weeks ago looking at deposit. In part this is familiar territory: looking at and tweaking the work that the creators of the SWORD profile have done based on APP, or looking again at WebDAV. But there is also a newly emerging approach of using RSS or Atom feeds to populate repositories, a sort of feed-deposit. Coincidentally, we also received a query at CETIS from a repository which is looking to collect outputs of the UKOER programme, asking for help in firming up the requirements for bulk or remote deposit, and asking how RSS might fit into this.

So what is this feed-deposit idea? The first thing to be aware of is that, as far as I can make out, a lot of the people who talk about this don’t necessarily have the same idea of “repository” and “deposit” as I do. For example, the Nottingham Xpert rapid innovation project and the Ensemble feed aggregator are both populated by feeds (you can also disseminate material through iTunesU this way). But (I think) these are all links-only collections, so I would call them catalogues, not repositories, and I would say that they work by metadata harvest(*), not deposit. However, they do show that you can do something with feeds which the people who think that RSS or Atom is only for things like showing the last ten items published should take note of. The other thing to take note of is podcasting, by which I don’t mean sticking audio files on a web server and letting people find them, but feeds that either carry or point to audio/video content so that applications and devices like phones and wireless-network-enabled media players can automatically load that content. If you combine what Xpert and Ensemble are doing by way of getting information about entire collections with the way that podcasts let you automatically download content, then you could populate a repository through feeds.
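To make the podcast mechanism concrete, here is a minimal sketch of the kind of feed entry involved (titles and URLs invented for illustration): an RSS 2.0 item whose enclosure element gives the location, size and type of the content file, which is what lets an application, or a harvesting repository, fetch the content automatically.

<item>
  <title>Introduction to Thermodynamics (lecture recording)</title>
  <link>http://example.ac.uk/oer/thermo-intro</link>
  <description>First lecture in an open course on thermodynamics.</description>
  <!-- the enclosure is what makes automatic download possible:
       it gives the content file's URL, size in bytes and MIME type -->
  <enclosure url="http://example.ac.uk/oer/thermo-intro.mp3"
             length="10485760"
             type="audio/mpeg"/>
  <guid>http://example.ac.uk/oer/thermo-intro</guid>
</item>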

The trouble is, though, that once you get down to details there are several problems and several different ways of overcoming them.

For example, how do you go beyond having a feed for just the last 10 resources? Putting everything into one feed doesn’t scale. If your content is broken down into manageably sized collections (e.g. the OU’s OpenLearn courses, and I guess many other OER projects) you could put everything from each collection into a feed and then have an OPML file to say where all the different feeds are (which works up to a point, especially if the feeds will be fairly static, until your OPML file gets too large). Or you could have an API that allows the receiver of the feed to specify how they want to chunk up the data: OpenSearch should be useful here, and it might be worth looking at YouTube as an example. Then there are similar choices to be made for how just about every piece of metadata, and the content itself, is expressed in the feed, starting with the choice of flavour(s) of RSS or Atom.
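To illustrate the OPML approach (all names and URLs invented), the OPML file is nothing more than a machine-readable list of where the collection-level feeds live:

<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
  <head>
    <title>Example OER collections</title>
  </head>
  <body>
    <!-- one outline element per collection-level feed -->
    <outline type="rss" text="Example course: Science"
             xmlUrl="http://example.ac.uk/feeds/science.rss"/>
    <outline type="rss" text="Example course: Engineering"
             xmlUrl="http://example.ac.uk/feeds/engineering.rss"/>
  </body>
</opml>

A harvester would read this file and then fetch each feed listed; the scaling limit mentioned above is just that this list, too, grows with the number of collections.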

But feed-deposit is a potential solution, and it’s not good to start with a solution and then articulate the problem. The problem that needs addressing (by the repository that made the query I mentioned above) is how best to deposit hundreds of items given (1) a local database which contains the necessary metadata and (2) enough programming expertise to read that metadata from the database and republish it or post it to an API. The answer does not involve someone sat for a week copy-and-pasting into a web form that the repository provides as its only means of deposit.

There are several ways of dealing with that. So far a colleague who is in this position has had success depositing into Flickr, SlideShare and Scribd by repeated calls to their respective APIs for remote deposit—which you could call a depositor-push approach—but an alternative is that she puts the resources somewhere and provides information to tell repositories where they are, so that any repository that is listening can come and harvest them—which would be more like a repository-pull approach, and in that case feed-deposit might be the solution.

[* Yes, I know about OAI-PMH, the comparison is interesting, but this is a long post already.]

Resource description requirements for a UKOER project

CETIS have provided information on what we think are the metadata requirements for the UK OER programme, but we have also said that individual projects should think about their own metadata requirements in addition to these. As an example of what I mean by this, here is what I produced for the Engineering Subject Centre’s OER project.

Like it says on the front page, it’s an attempt to define what information about a resource should be provided, why, for whom, and in what format, where:

“Who” includes project funders (HEFCE, with JISC and the Academy as their agents), project partners contributing resources, the project manager, end users (teachers and students), and aggregators—that is, people who wish to build services on top of the collection.

“Why” includes resource management, selection and use, as well as discovery through Google or otherwise, and so on.

“Format” includes free text for humans to read (which, incidentally, is what Google works from) and encoded text for machine operations (e.g. XML, RSS, HTML meta tags, microformats, metadata embedded in other formats or held in the database of whatever content management system lies behind the host we are using).
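As a small illustration of the encoded-text case (the values here are invented), resource description can be as lightweight as Dublin Core meta tags in the head of the HTML page that hosts a resource, following the usual schema.DC convention:

<head>
  <title>Introduction to Thermodynamics</title>
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"/>
  <!-- the visible page carries the free text for humans to read;
       these meta tags repeat the description for machines -->
  <meta name="DC.title" content="Introduction to Thermodynamics"/>
  <meta name="DC.creator" content="A. N. Author"/>
  <meta name="DC.rights" content="http://creativecommons.org/licenses/by/2.0/uk/"/>
</head>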

You can read it on Scribd: Resource description requirements for EngSC OER project

[I should note that I work for the Engineering Subject Centre as well as CETIS and this work was not part of my CETIS work.]

It would be useful to know if other projects have produced anything similar…

Distribution platforms for OERs

One of the workpackages for CETIS’s support of the UKOER programme is:

Technical Guidelines–Services and Applications Inventory and Guidance:
Checklist and notes to support projects in selecting appropriate publication/distribution applications and services with some worked examples (or recommendations).
Output: set of wiki pages based on content type and identifying relevant platforms, formats, standards, ipr issues, etc.

I’ve made a start on this here, in a way which I hope will combine the three elements mentioned in the workpackage:

  1. An inventory of host platforms by resource type: which platforms are being used for which media or resource types?
  2. A checklist of technical factors that projects should consider in their choice of platform.
  3. Further information and guidance for some of the host platforms; essentially, that’s the checklist filled in.

In keeping with the nature of this phase of the UKOER programme as a pilot, we’re trying not to be prescriptive about the type of platform projects will use. Specifically, we’re not assuming that they will use standard repository software, and we are encouraging projects to explore and share any information about the suitability of Web2.0 social sharing sites. At the moment the inventory is pretty biased towards these Web2.0 sites, but that’s just a reflection of where I think new information is required.

How you can help

Feedback
Any feedback on the direction of this work would be welcome. Are there any media types I’m not considering that I should? Are the factors being considered in the checklist the right ones? Is the level of detail sufficient? Where are the errors?

Information
I want to focus on the platforms that are actually being used, so it would be helpful to know which these are. Also, I know from talking to some of you that there is invaluable experience of using some of these services: some APIs are better documented than others, some offer better functionality than others, and some have limitations that aren’t apparent until you try to use them seriously. It would be great to have this in-depth information; there is space in the entry for each platform for these “notes and comments”.

Contributions
The more entries are filled out the better, but there’s a limit on what I can do, so all contributions would be welcome. In particular, I know that iTunes/iTunesU is important for audio/video and podcasting, but I don’t have access myself — it seems to require some sort of plug-in called “iTunes” ;-) — so if anyone can help with that I would be especially grateful.

Depending on how you feel, you can help by emailing me (philb@icbl.hw.ac.uk), or by registering on the CETIS wiki and either using the article talk page (please sign your comments) or editing the article itself. Anything you write is likely to be distributed under a Creative Commons cc-by-nc licence.

About metadata & resource description (pt 2)

Trying to show how resource description on sites such as Flickr relates to metadata…

Some people have looked at the metadata requirements for the UK OER programme and taken them as a prescription for which LOM or Dublin Core elements they should use. After all, that’s what metadata is, isn’t it? But UK OER projects are also encouraged to use Web2.0 or social sharing platforms (Flickr, YouTube, SlideShare etc.) to make their resources available, and these sites don’t know anything about metadata, do they?

Well, in my previous post I tried to distinguish between resource description and metadata, where resource description is pretty much any information about anything, and metadata is the structured information about a resource (acknowledging that the distinction is not always made by everyone). I think that some of the “metadata” requirements given for OER in various discussions are actually better seen at first as resource description requirements.

The second problem with seeing the UK OER metadata requirements as a prescription for which elements to use is that, to me at least, it misses the point of what metadata does best. I think that the best view of metadata is that it shows the relationships between resources. “Resources” here means anything — information resources like the OERs, people, places, things, organizations, abstract concepts — so long as the thing can be identified. What metadata does is express or assert a relationship such as “this OER was created by this person”.

So looking at an image’s “canonical” page on Flickr, we see a resource description which has a link to the photo stream of the person who uploaded it (me) and from there there is a link to my profile page on Flickr. That’s done with metadata, but how do we get at it?

Well, in the HTML for the image page the link is rendered as

<a href="/photos/philbarker/"
   title="Link to phil barker's photostream"
   rel="dc:creator cc:attributionURL">
       <b>phil barker</b>
</a>

the rel="dc:creator cc:attributionURL" tells a computer what the relationship between this page and the URL is, i.e. that the URL identifies the creator of the page and should be used for attribution. That’s not great, because I’m not my photostream; in fact my photostream doesn’t even describe me.

Things are better on the photostream page, though; it has in its HTML

<link rel="alternate"
  type="application/rss+xml"
  title="Flickr: phil barker's Photostream RSS feed"
  href="http://api.flickr.com/services/feeds/photos_public.gne?id=56583935@N00&lang=en-us&format=rss_200">

which points any application that knows how to read HTML and RSS to the RSS feed for my photostream, where we see in the entry for that picture the following:

<author flickr:profile="http://www.flickr.com/people/philbarker/">nobody@flickr.com (phil barker)</author>

As well as the description of me (my name and not-my-email-address), there is the link to my profile page. Looking at the HTML for that profile page, not only does it generate a human-readable rendering in a browser, but it includes the following:


<div class="vcard">
    <span class="nickname">phil barker</span>
...
    <span class="RealName">/
        <span class="fn n">
           <span class="given-name">Phil</span>
           <span class="family-name">Barker</span>
        </span>
    </span>
...
</div>

That is a computer-readable hCard microformat version of my contact information (coincidentally, it’s the same underlying person-data schema, vCard, that is used in the LOM).

So there’s your Author metadata on Flickr. And I’ll note that all this happened without me ever thinking that I was “cataloguing”!

To generalise and idealise slightly, the resource pages (the canonical page for the image, the photostream page, my profile page) have embedded in them one or more of the following:

  • links which describe the relationship of the resources described on those pages to each other in a computer-readable manner
  • links to alternative descriptions in machine-readable metadata, e.g. an RSS or Atom XML file for the resource described on the page
  • embedded computer-readable metadata, e.g. vCard person-data embedded in the hCard microformat.

See also Adam’s post Objects in this Mirror are Closer than they Appear: Linked Data and the Web of Concepts.

About metadata & resource description (pt 1)

Trying to distinguish between metadata and resource description…

In our online support sessions for the UKOER programme, some of which John has summarized (1 2 3), instead of giving participants a definition of what metadata is we gave them a choice and asked them to vote on what they understood the word to mean.

The options were:
A: data about data
B: structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.
C: pretty much any information about anything.
D: any of the above.

You might recognise option A as the etymological definition, and B as NISO’s definition, found in Understanding Metadata [pdf]. I was interested in how many people included C in what they understood when they used or heard the term metadata. This was prompted by a comment, I forget from whom and in what context, that the idea of metadata defined in option B was fine in a specialized academic sense, but the word was used more widely, and so loosely, that you could no longer rely on that being what people meant. In other words, you could not assume that someone who said they had metadata would be able to provide you with nicely structured, machine-readable XML/RDF/HTML-meta-tagged information.

Our sample of participants in the online session wasn’t scientifically chosen. Everyone had some connection with the UK OER programme, either working for a project or helping to manage or provide advice to the programme; there was approximately equal representation of managers and technical people (with some overlap, I guess), and one person had a library/information science background (that was my co-presenter, John!). The vote came out as:
5 for A: Data about data;
14 for B: Structured information…;
0 for C: any information about anything;
10 for D: any of the above.

In retrospect it’s not surprising that no one voted for C, since the people in our audience who recognise that as a meaning are likely to have come across A and B as well.

As someone said during the vote, you can tell B is the “right” answer because it is the longest and most formal-looking option :-). For me, data about data is too restrictive in range, and I think it would be helpful not to call option C/D metadata. I would rather use the term resource description to cover all the options, and reserve metadata for structured information about a resource (which includes, but is broader than, data about data). So metadata tells a computer that 2009-09-11 is to be interpreted as a date in ISO 8601 format, and is the sort of structured information found in LOM and Dublin Core. Resource description may be metadata or may be free text for people to read. Computers such as those run by Google can do a pretty good job of processing information aimed at people; people (on the whole) aren’t very good at information aimed at computers.
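To sketch that date example both ways (markup invented for illustration): a person happily reads the free-text version, while the structured version tells a machine exactly what the value is and how to parse it.

<!-- free text, for people: -->
<p>Published on the 11th of September, 2009</p>

<!-- structured metadata, for machines (Dublin Core; namespace declaration omitted): -->
<dc:date>2009-09-11</dc:date>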

I think that the best view of metadata is that it shows the relationships between resources. “Resources” here means anything that can be identified (if you cannot identify it, you cannot show how things are related to it), including information resources like the OERs, people, places, things, organizations, and abstract concepts. What metadata does is express the assertion that this OER (for example) was created by this person. I’ll try to show how this allows the mixing up of metadata and resource description (in a good way) in my next post.

Web2 vs iTunesU

There was an interesting discussion last week on the JISC-Repositories email list that kicked off after Les Carr asked

Does anyone have any experience with iTunes U? Our University is thinking of starting a presence on Apple’s iTunes U (the section of the iTunes store that distributes podcasts and video podcasts from higher education institutions). It looks very professional (see for example the OU’s presence at http://projects.kmi.open.ac.uk/itunesu/ ) and there are over 300 institutions who are represented there.

HOWEVER, I can’t shake the feeling that this is a very bad idea, even for lovers of Apple products. My main misgiving is that the content isn’t accessible apart from through the iTunes browser, and hence it is not Googleable and hence it is pretty-much invisible. Why would anyone want to do that? Isn’t it a much better idea to put material on YouTube and use the whole web/web2 infrastructure?

I’d like to summarize the discussion here so that the important points raised get a wider airing; however, it is a feature of high-quality discussions like this one that people learn and change their minds as a result, so please don’t assume that people quoted below still hold the opinions attributed to them. (For example, invisibility on Google turned out to be far from the case for some resources.) If you would like to see the whole discussion, look in the JISCMAIL archive.

The first answer, from a few posters, was that it is not an either/or decision.

Patricia Killiard:

Cambridge has an iTunesU site. […] the material is normally deposited first with the university Streaming Media Service. It can then be made accessible through a variety of platforms, including YouTube, the university web pages and departmental/faculty sites, and the Streaming Media Service’s own site, as well as iTunesU.

Mike Fraser:

Oxford does both using the same datafeed: an iTunesU presence (which is very popular in terms of downloads and as a success story within the institution); and a local, openly available site serving up the same content.

Jenny Delasalle and David Davis of Warwick and Brian Kelly of UKOLN also highlighted how iTunesU complemented rather than competed with other hosting options, and was discoverable on Google.

Andy Powell, however, pointed out that it was so “Googleable” that a video from Warwick University on iTunesU came higher in the search results for University of Warwick No Paradise without Banks than the same video on Warwick’s own site. (The first result I get is from Warwick, about the event, but doesn’t seem to give access to the video—at least not so easily that I can find it; the second result I get is the copy from iTunes U, on deimos.apple.com. Incidentally, I get nothing for the same search term on Google Videos.) He pointed out that this is “(implicitly) encouraging use of the iTunes U version (and therefore use of iTunes) rather than the lighter-weight ‘web’ version.”

Andy also raised other “softer issues”, such as which version students will be referred to, which might reinforce one version rather than another as the copy of choice even if it isn’t the best one for them.

Ideally it would be possible to refer people to a canonical version or a list of available versions (Graham Triggs mentioned Google’s canonical URLs, which could perhaps help if Google relax the rules on how they’re applied), but I’m not convinced that’s likely to happen. So there’s a compromise: a variety of platforms for a variety of needs versus possibly diluting the web presence for any given resource.
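For reference, the canonical URL mechanism Graham mentioned is a link element in the head of each duplicate copy of a page (the URL below is invented); the rule-relaxing would be needed because, as I understand it, Google currently only honours it for copies within a single site, which is exactly what the iTunesU situation is not.

<!-- in the <head> of every copy of the page: -->
<link rel="canonical" href="http://www2.warwick.ac.uk/videos/no-paradise"/>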

And a response from David Davies:

iTunesU is simply an RSS aggregator with a fancy presentation layer.
[…]
iTunesU content is discoverable by Google – should you want to, but as we’ve seen there are easier ways of discovering the same content, it doesn’t generate new URLs for the underlying content, is based upon a principle of reusable content, Apple doesn’t claim exclusivity for published content so is not being evil, and it fits within the accepted definition of web architecture. Perhaps we should simply accept that some people just don’t like it. Maybe because they don’t understand what it is or why an institution would want to use it, or they just have a gut feeling there’s something funny about it. And that’s just fine.

Mmm, I don’t know about all these web architecture principles; I just know that I can’t access the only copy I find on Google. But then, I admit I do have something of a gut feeling against iTunesU; maybe that’s fine, maybe it’s not; and maybe it’s just something about the example Andy chose: searching Google for University of Warwick slow poetry video gives access to copies at YouTube and Warwick, but no copy on iTunes.

I’m left with the feeling that I need to understand more about how using these services affects the discoverability of resources through Google, which is one of the things I would like to address during the session I’m organising for the CETIS conference in November.

Semantic Web in HE meeting in Nice

Just announced: “SemHE ’09: semantic technologies for teaching and learning support in higher education”, a meeting co-located with the 4th European Conference on Technology Enhanced Learning in Nice, 29 or 30 September (tbc). This isn’t a CETIS meeting, but it is part-organized by the JISC SemTech project at Southampton University, a project which is supported by a CETIS working group and which had its origins in the Semantic Structures for Teaching and Learning session at the 2007 CETIS conference.

Full details and call for papers for the Nice meeting are at http://www.semhe.org/.