ePub metadata: what gets shown?

Posted on 18/06/2013 by Phil Barker

One of the issues around eTextBooks is how to describe them, specifically by way of educational metadata in ePub. That’s something that on the face of it shouldn’t be too difficult to address (at least to the extent that we know how to describe any educational resource). One thing that would be useful in demonstrating different choices for educational metadata is an app or tool that will display any metadata found in the ePub package in a sensible way. As a bit of a long shot I tried four eBook readers to see whether they would; they don’t. The details follow, if you’re interested, but do let me know if you know of any tool that might be useful.

The package metadata of an ePub can include a selection of Dublin Core elements and terms. These can be refined: for example, you may have two dc:title elements with refinements to specify that one is the main title and the other the subtitle. You can also extend with elements from other XML namespaces, or if you prefer you can just link to a metadata record of your favourite flavour, which can be either inside the ePub package or elsewhere on the web. Any of this metadata can relate to the eBook as a whole or to some part of it, e.g. a single chapter or image. Without going into details, there seems to be enough scope there to experiment with how educational characteristics of the eBook might be described.

But how to see the results? I took an ePub (a copy of O’Reilly’s EPUB 3 Best Practices, since it seemed likely to provide as good a starting point as I was going to find in a real book), made a copy, unzipped it and changed the values of the meta elements so that I could easily identify what elements were being displayed. For example I changed
<dc:title id="pub-title">EPUB 3 Best Practices</dc:title> to
<dc:title id="pub-title">dc:title</dc:title> and so on.
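The relabelling step described above can also be scripted. What follows is only a minimal sketch in Python, not the method used for this post (the edits here were made by hand). It assumes the usual ePub layout, in which META-INF/container.xml names the package document, and it uses a tiny made-up package (the identifier and modified date are invented) rather than the real book. The function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Keep the familiar prefixes when the package document is written back out.
ET.register_namespace("", OPF_NS)
ET.register_namespace("dc", DC_NS)

# Made-up sample data: a cut-down package document and container file.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-identifier">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title id="pub-title">EPUB 3 Best Practices</dc:title>
    <dc:identifier id="pub-identifier">urn:example:hypothetical-id</dc:identifier>
    <meta property="dcterms:modified">2013-06-18T12:00:00Z</meta>
    <meta name="cover" content="cover-image"/>
  </metadata>
</package>"""

SAMPLE_CONTAINER = """<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def find_opf_path(epub):
    """Return the package document path named in META-INF/container.xml."""
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find(".//{%s}rootfile" % CONTAINER_NS)
    return rootfile.get("full-path")


def element_label(el):
    """Name an element as 'dc:title', 'meta dcterms:modified' and so on."""
    if el.tag.startswith("{"):
        uri, local = el.tag[1:].split("}", 1)
    else:
        uri, local = "", el.tag
    if uri == DC_NS:
        return "dc:" + local
    if local == "meta" and el.get("property"):
        return "meta " + el.get("property")
    return local


def label_metadata(opf_xml):
    """Replace the text of each metadata element with the element's own name."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("{%s}metadata" % OPF_NS)
    for el in metadata:
        if el.text and el.text.strip():
            el.text = element_label(el)
    return ET.tostring(root, encoding="unicode")


def build_sample_epub():
    """Build a tiny in-memory stand-in for an ePub (not a complete, valid book)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as epub:
        epub.writestr("META-INF/container.xml", SAMPLE_CONTAINER)
        epub.writestr("OEBPS/package.opf", SAMPLE_OPF)
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(build_sample_epub()) as epub:
    opf_path = find_opf_path(epub)
    relabelled = label_metadata(epub.read(opf_path))

print(relabelled)
```

Elements with no text content, such as the cover meta element, are left alone. Writing the result back into a real book would need more care than this sketch takes, since an ePub expects its mimetype file to come first in the archive and to be stored uncompressed.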

Here’s a list of the metadata elements in that file:

<dc:title id="pub-title">
<dc:creator id="..." >
<dc:publisher>
<dc:date>
<meta property="dcterms:modified">
<dc:identifier id="pub-identifier">
<dc:language id="pub-language">
<dc:contributor> (repeated)
<dc:rights>
<dc:subject>
<dc:description>
<meta id="meta-identifier" property="dcterms:identifier">
<meta property="dcterms:title" id="meta-title">
<meta property="dcterms:language" id="meta-language">
<meta property="dcterms:rights">
<meta property="dcterms:rightsHolder">
<meta property="dcterms:publisher">
<meta property="dcterms:subject">
<meta property="dcterms:description">
<meta id="...." property="dcterms:creator"> (repeated, different ids)
<meta name="cover" content="cover-image"/>
<meta property="ibooks:specified-fonts">

I then looked at this with various eBook readers:

Readium

I had hopes for Readium since it is pretty much the reference implementation of EPUB3. It displayed

in Readium

dc:title
dc:creator
dc:publisher
dc:date
meta dcterms:modified
dc:identifier

Note that it doesn’t even check for a valid value for dates.
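For comparison, a check on the modified date is not much code. This is a rough sketch in Python, assuming the EPUB 3 form for dcterms:modified of CCYY-MM-DDThh:mm:ssZ; the function name is hypothetical and it says nothing about how any of the readers tested here handle dates.

```python
import re
from datetime import datetime

# Shape of the timestamp: four-digit year, two-digit fields, trailing Z for UTC.
MODIFIED_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def is_valid_modified(value):
    """Return True if value looks like CCYY-MM-DDThh:mm:ssZ and is a real date."""
    if not MODIFIED_PATTERN.fullmatch(value):
        return False
    try:
        # strptime rejects impossible values such as month 13 or 30 February.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return True


for candidate in ("2013-06-18T12:00:00Z", "meta dcterms:modified", "2013-13-01T00:00:00Z"):
    print(candidate, is_valid_modified(candidate))
```

The dummy value used in this experiment, "meta dcterms:modified", fails the check at the first step.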

Calibre

Calibre, while it doesn’t claim to support ePub3, is targeted at managing personal book libraries. It displays:

in Calibre

dc:title
dc:creator
dc:subject (for tags)
dc:description
dc:publisher

It probably uses dc:language and dc:date (for published) as well but recognises that the values dc:language / dc:date aren’t valid.

Ideal Reader for Android

The Ideal Reader for Android is the other ePub3 reader I use. It displays

In Ideal Android Reader

dc:title
meta dcterms:creator (just one of them)
dc:date
dc:publisher
dc:description
dc:subject
dc:rights

iTunes

Finally I gave a chunk of diskspace to Apple

in iTunes desktop for Windows 7

dc:title
dc:creator
dc:title (again)
dc:subject (in the info tab, as Genre)

Yep, title is there twice: the info tab shows dc:title in the Name and Album fields, so you can gauge the amount of effort that Apple have put into adapting iTunes for books.

What did I learn?

I learnt that none of the ePub reading/management apps or tools that I have show more than the bare minimum of metadata, even if it is there. None of them will be much good for trying out ideas for how educational characteristics can be described, since I strongly suspect that none of it will be viewable. That’s not too surprising, especially when you consider that none of the tools I looked at are geared around resource discovery, but I can’t really go uploading dummy ePub files to book seller sites just to see what they look like. Maybe any meaningful exploration/demonstration of educational metadata in ePub is going to need a bespoke application, but if you know of a tool that might be helpful do drop me a line.

Three levels of design and innovation

Posted on 06/06/2013 by Phil Barker

An electronics company has just won a patent claim against another electronics company. It’s not relevant to this post which companies and what patent were involved; it just served to remind me once again of the different types of innovation that are subject to these patent claims. Where there is a patent there is at least a claim of innovation, and that is what interests me. Specifically, I find it interesting that some of these patent claims are for antenna design and others for certain user interactions, and that links to an idea I heard presented by Adam Procter a year or so ago which has stuck with me, that there are three levels to design and innovation:

level one: the base technology. In phones this would be the physical design of antennae, the compression algorithms used for audio and video, the physics of the various sensors, and so on.

level two: the product. That is, putting all the base technologies together to create features of a working unit that (if it is to be successful) fulfills a need.

level three: user experience. Making the use of those features a pleasure.

I’ve found this useful in thinking about what it is that Apple gets right compared to, say, Nokia. It’s my impression (and I think the various patent claims bear this out) that Apple are very good at innovating for user experience whereas Nokia and others did a lot of the work somewhere around technologies and product.

I’ve also found it enlightening to reflect on just how hard it is to work out from the technology alone what would be a set of features that make up a successful product. I was using CCD cameras for science experiments in the early 90s, when the technology had been around twenty-odd years, and never once did it occur to me that it would be a really good idea to put one in a phone. Light sensors so that your curtains would open and close automatically, sure they were certain to come, but a camera in your phone!–why would anyone want that?

Put those together, and I think what you get is a picture of some people who are good at spotting (or just prepared to experiment with) how technologies can do something useful, and others who are good at spotting what is required in order to make those features pleasant enough to use. So Diamond and Creative and others showed that really small MP3 players were devices that people might find useful (others before them had put together advances in audio compression and storage technology to show such devices were possible); Apple made something that people wanted to use. What was it that made the difference? The integration with iTunes maybe?

Sometimes identifying the useful feature comes before the technology that makes it usable, at least to a certain extent: Palm showed how touch screen devices could be useful but Apple waited until the technology (capacitive rather than resistive sensors) was available to give the user experience they wanted. Of course that area of human endeavour which puts creation of innovative products completely ahead of technology developments is called science fiction–how’s that flying car coming along?

eTextBooks Europe

Posted on 21/01/2013 by Phil Barker

I went to a meeting for stakeholders interested in the eTernity (European textbook reusability networking and interoperability) initiative. The hope is that eTernity will be a project of the CEN Workshop on Learning Technologies with the objective of gathering requirements and proposing a framework to provide European input to ongoing work by ISO/IEC JTC 1/SC36, WG6 & WG4 on eTextBooks (which is currently based around Chinese and Korean specifications). Incidentally, as part of the ISO work there is a questionnaire asking for information that will be used to help decide what that standard should include. I would encourage anyone interested to fill it in.

The stakeholders present represented many perspectives from throughout Europe: publishers, publishing industry specification bodies (e.g. IDPF who own EPUB3, and DAISY), national bodies with some sort of remit for educational technology, and elearning specification and standardisation organisations. I gave a short presentation on the OER perspective.

Many issues were raised through the course of the day, including (in no particular order)

Interactive and multimedia content in eTextbooks
Accessibility of eTextbooks
eTextbooks shouldn’t be monolithic and immutable chunks of content, it should be possible to link directly to specific locations or to disaggregate the content
The lifecycle of an eTextbook. This goes beyond initial authoring and publishing
Quality assurance (of content and pedagogic approach)
Alignment with specific curricula
Personalization and adaptation to individual needs and requirements
The ability to describe the learning pathway embodied in an eTextbook, and vary either the content used on this pathway or to provide different pathways through the same content
The ability to describe a range of IPR and licensing arrangements of the whole and of specific components of the eTextbook
The ability to interact with learning systems with data flowing in both directions

If you’re thinking that sounds like a list of the educational technology issues that we have been busy with for the last decade or two, then I would agree with you. Furthermore, there is a decade or two’s worth of educational technology specs and standards that address these issues. Of course not all of those specs and standards are necessarily the right ones for now, and there are others that have more traction within digital publishing. EPUB3 was well represented in the meeting (DITA is the other publishing standard mentioned in the eTernity documentation, but no one was at the meeting to talk about that) and it doesn’t seem impossible to meet the educational requirements outlined in the meeting within the general EPUB3 framework. The question is which issues should be prioritised and how should they be addressed.

Of course a technical standard is only an enabler: it doesn’t in itself make any change to teaching and learning; change will only happen if developers create tools and authors create resources that exploit the standard. For various reasons that hasn’t happened with some of the existing specs and standards. A technical standard can facilitate change but there needs to be a will or a necessity to change in the first place. One thing that made me hopeful about this was a point made by Owen White of Pearson that he did not think of the business he is in as being centred around content creation and publishing but around education and learning, and that leads away from the view of eBooks as isolated static aggregations.

For more information keep an eye on the eTernity website

The Human Computer: a diversion from normal CETIS work

Posted on 26/06/2012 by Phil Barker

Alan Turing, 1951. Source: wikipedia

No, there’s no ‘Interaction’ missing in that title, this is about building a computer, or at least a small part of one, out of humans. The occasion was a birthday party that the department I work in, Computer Science at Heriot-Watt University, held to commemorate the centenary of Alan Turing’s birth. It was also the finale of a programming competition that the department set for sixth-formers, to create a simulation of a Turing Machine. So we had some of the most promising computer science pupils in the country attending.

As well as the balloons, cake and crisps, we had some party games, well, activities for our guests. They could have a go at the Turing test, at breaking Enigma codes, and my contribution was for them to be a small part in a computer, a 2-bit adder. The aim was to show how the innards of a computer processor are little more than a whole load of switches and it doesn’t matter much (at least to a mathematician like Turing) what these switches are. I hoped this would help show that computers are more than black boxes, and help add some context to what electronic computers were about when Turing was working. (And, yes, I do know that it was Shannon not Turing who developed the theory.)

So, it starts with a switch that can turn another switch on and off. Here’s a simulation of one which uses a transistor to do that. If you click on that link a Java window should open that shows a simple circuit. The input on the left is at a Low voltage, the output is at a Low voltage. Click on the input to set it to a High, and it will turn on the transistor, connecting the output to the high voltage source, so the output goes High. So by setting the input to high voltage (presumably by pressing a switch) you can set the output to high voltage. You’re allowed to be under-impressed at this stage. (Make sure you close any windows or browser tabs opened by that link; leaving them open might cause later examples not to work.)

Turing didn’t have access to transistors. At the time he worked these switches were electromechanical relays, a physical spring-loaded switch that was closed by the magnetic attraction between a coil and permanent magnet when a current ran through the coil. Later, vacuum tube valves were available to replace these, but much to Tommy Flowers’ chagrin, Turing wasn’t at all interested in that. For mathematicians the details of the switching mechanism are a distraction. By not caring, maybe not even knowing, about the physics of the switch Turing was saved from worrying about a whole load of details that would have been out of date by the 1960s; as it is his work is still relevant today. This illustrates my favourite feature of Mathematics, which is that maths is the only subject where it is best not to know what you are talking about.

Back to this thing of turning a voltage signal high or low by turning a voltage high or low.

Two transistor AND gate

That may be underwhelming, but put two of these next to each other and something interesting happens: the output will only be High if both the inputs are. In other words the output is High if both input 1 AND input 2 are high. That’s mathematics: a simple logic calculation. You can try it out in the simulation. You can also try other arrangements that show an OR logic calculation and an XOR calculation, that is an exclusive OR: the output is high if one of input 1 or input 2 is high but not both. We call these circuits logic gates. Remember to close all windows and browser tabs when going from one simulation to another.
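For anyone without a Java plugin to hand, the three gates can also be written as one-line functions. This is a small Python sketch of the logic only, with True standing for a High voltage and False for Low; the function names are mine and it has nothing to do with how the linked simulations are built.

```python
def and_gate(a, b):
    """High only if both inputs are High."""
    return a and b


def or_gate(a, b):
    """High if at least one input is High."""
    return a or b


def xor_gate(a, b):
    """High if exactly one input is High."""
    return a != b


def truth_table(gate):
    """List (input 1, input 2, output) for every combination of inputs."""
    return [(a, b, gate(a, b)) for a in (False, True) for b in (False, True)]


def level(value):
    """Describe a boolean as a voltage level."""
    return "High" if value else "Low"


for name, gate in (("AND", and_gate), ("OR", or_gate), ("XOR", xor_gate)):
    print(name)
    for a, b, out in truth_table(gate):
        print("  input 1 %-4s  input 2 %-4s  output %s" % (level(a), level(b), level(out)))
```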

This is where we leave electronics and start using the audience. My colleague and I each had a flag and we gave everyone in the audience a flag. We were the inputs, they had to be logic gates; they had to raise their flag if she AND I both raised ours, or if she OR I had a flag up, or if she or I, but not both of us raised a flag (the XOR calculation).

The next trick was to show how these logic calculations relate to adding numbers together: A+B = S. First, of course, the numbers must be represented as binary with a low voltage/flag down equivalent to the digit 0 and high voltage/flag up equivalent to the digit 1. And we have to do the addition one digit at a time, starting from the units. Adding the first digit, the units, is easy enough. 0+0 = 0, 0+1=1, 1+0=1, 1+1=0 with 1 to carry. Think of that as input 1 + input 2 = output, where the output can either be the digit for the sum or the digit to carry. For the sum, the output is 1 if either input 1 or input 2 is high, but not both, so S = input 1 XOR input 2; and we carry 1 if input 1 AND input 2 are both 1, so carry = input 1 AND input 2. The second and subsequent digits are harder since we need to add the digit from each number and the carry, but it’s not too difficult.
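The sum-and-carry rule for the units digit is usually called a half adder, and the version that also takes a carry in is a full adder. Here is a short Python sketch of both, using the standard textbook construction of a full adder from two half adders and an OR; that construction is an assumption on my part and is not necessarily how the gates in our circuit were arranged.

```python
def half_adder(a, b):
    """Add two bits: the sum is a XOR b, the carry is a AND b."""
    return a ^ b, a & b


def full_adder(a, b, carry_in):
    """Add two bits and an incoming carry using two half adders and an OR."""
    partial_sum, carry_1 = half_adder(a, b)
    total, carry_2 = half_adder(partial_sum, carry_in)
    return total, carry_1 | carry_2


for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print("%d + %d = sum %d, carry %d" % (a, b, s, c))
```

The loop prints the same four cases as the text: only 1 + 1 gives a sum of 0 with 1 to carry.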

We can use logic gates to do the calculation for each bit of the addition. The circuit looks like this:
2bitadder
You can hopefully see how bit one of the sum is the XOR of the inputs for bit one of the numbers A and B, and the carry to the calculation of the second bit is the AND of these inputs. Again there is a simulation you can try; you might need to stretch the Java window to see all the circuit. Try 1 plus 1 (01+01 = 10, so set inputs A1 and B1 High, A2 and B2 Low to give Output S1 Low and Output S2 High). And 2 + 2 (10 + 10).
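Chaining the per-bit calculation gives an adder of any width. The Python sketch below is my own illustration rather than a transcription of the circuit diagram: bits are listed units first (so A1 comes before A2), and the final carry is returned separately, which is an assumption since the text only mentions outputs S1 and S2.

```python
def ripple_add(a_bits, b_bits):
    """Add two equal-length lists of bits, least significant bit first.

    Returns (sum_bits, carry_out). With no carry into the first bit, that
    bit reduces to sum = a XOR b and carry = a AND b, as described above.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def to_bits(number, width):
    """Convert a non-negative integer to a list of bits, units first."""
    return [(number >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Convert a list of bits, units first, back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


# 1 + 1: A1 and B1 High, A2 and B2 Low.
print(ripple_add([1, 0], [1, 0]))
# 2 + 2: the answer needs a third bit, so it shows up as the carry out.
print(ripple_add([0, 1], [0, 1]))
# The same function handles eight-bit numbers.
bits, carry = ripple_add(to_bits(100, 8), to_bits(55, 8))
print(from_bits(bits), carry)
```

For 1 + 1 the sum bits come out as S1 = 0 and S2 = 1, matching the Low and High outputs in the simulation.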

We implemented this circuit using our audience of flag-wavers. We put pupils on the front row to be the inputs, pupils on the next row to be gates 1-4, and so on, making sure that each one knew at whom they should be looking and what condition should be met for them to raise their flag. We ran this three times, and each time it worked brilliantly. OK, so we could only add numbers less than 3, which isn’t much computing power, but given another 35 people we could have done eight-bit addition. And I’m pretty sure that we could have managed flip-flops and registers, but we would need something like 10,000 pupils to build a processor equivalent to an 8086, so the logistics might be difficult.

CETIS publications, now on WordPress

Posted on 13/03/2012 by Phil Barker

We have recently changed how we present our publications to the world. Where once we put a file on the web somewhere, anywhere, and entered the details into a home-spun publication database, now we use WordPress. We’re quite pleased with how that has worked out, so we’re sharing the information that might help others use WordPress as a means of presenting publications to the world (a repository, if you like).

Why WordPress?
First, what were we trying to achieve? The overall aims were to make sure that our publications had good exposure online, to have a more coherent approach to managing them (for example to collect all the files into one place in case we ever need to migrate them), and to move away from the bespoke system we were using to a system that someone else maintains. There were a few other requirements: we wanted something that was easy for us to adapt to fit the look and feel of the rest of our website, that was easy to maintain (familiarity is an important factor in how easy something is–it’s easy to use something if you know how to use it), and we wanted something that would present our publications in HTML and RSS sliced and diced by topic, author, and publication type: a URL for each publication and for each type of publication and feeds for everything. We’re not talking about a huge number of publications, maybe 100 or so, so we didn’t want a huge amount of up-front effort.

We thought about Open Journal Systems, but there seemed to be a whole load of workflow stuff that was relevant to journals but not our publications. Likewise we thought about EPrints and DSpace, but they didn’t quite look like we wanted, and we are far more familiar with WordPress. As a wildly successful open source project, WordPress also fits the requirement of being maintained by other people, not just the core programme, but all those lovely plugins and themes. So the basic plan was to represent each publication in a WordPress post and to use a suitable theme and plugins to present them as we wanted.

The choice of theme
Having settled on WordPress the first decision was which theme to use. In order to get the look and feel to be similar to the rest of the CETIS website (and, to be honest, to make sure our publications pages didn’t look like a blog) we needed a very flexible theme. The most flexible theme I know of is Atahualpa: with over 200 options, including custom CSS snippets, parameters and HTML snippets, it’s close to being a template for producing your own custom themes. So, for example, the theme options I have set include a byline of By %meta('By')%. %date('F Y')% which automatically inserts the additional metadata field ‘By’ and the date in the format of my choice, all of which can be styled any way I want. I’ll come back to the “byline” metadata later.

One observation here: there is clearly a trade-off between this level of customisation and ease of maintenance. On the one hand these are options set within the Atahualpa theme that can be saved between theme upgrades, which is better than would have been the case had we decided to fork the theme or add a few lines of custom code to the theme’s PHP files. On the other hand, it is not always immediately obvious which setting in the several pages of Atahualpa theme options has been used to change some aspect of the site’s appearance.

A post for each publication
As I mentioned above we can represent each publication by creating a WordPress post, but what information do we want to provide about each publication and how does it fit into a WordPress post? Starting with the simple stuff:

Title of the publication -> title of WordPress post.
Abstract / summary -> body of post.
Publication file -> uploaded as attached media.
Type of publication -> category.
Topic of publication -> tag.

Slightly less simple:

The date of the publication is represented as the date of the post. This is possible because WordPress lets you choose when to publish a post. The default is for posts to be published immediately when you press the Publish button; however, you can edit this to have them published in the past.

WordPress publication date option
The author of the publication becomes the author of the post, but there are some complications. It’s simple enough when the publication has a single author who works for CETIS: I just added everyone as an “author” user of WordPress and a WordPress admin user can attribute any given post to the author of the publication it represents. Where there are two or more authors a nifty little plugin called Co-Authors Plus allows them all to be assigned to the post. But we have some publications that we have commissioned from external authors, so I created a user called “Other” for these “external to CETIS” authors. This saves having a great long list of authors to maintain and present, but creates a problem of how to attribute these external authors, a problem that was solved using WordPress’s “additional metadata” feature to enter a “by-line” for all posts. This also provides a nicely formatted by-line for multi-author papers without worrying about how to add PHP to put in commas and “and”s.
The only other additional metadata added was an identifier for each publication, e.g. the latest QTI briefing paper is No. 2011:B02.
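For the curious, the commas-and-“and”s logic that the hand-entered by-line sidesteps looks something like this. It is a sketch in Python rather than PHP, purely to show the rule; it is not code from our site, the function names are hypothetical and the author names are placeholders.

```python
def join_names(names):
    """Join author names as 'A', 'A and B' or 'A, B and C'."""
    names = list(names)
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def format_byline(names, date_text):
    """Mimic a by-line of the form 'By <names>. <month year>'."""
    return "By %s. %s" % (join_names(names), date_text)


print(format_byline(["Author One"], "March 2012"))
print(format_byline(["Author One", "Author Two"], "March 2012"))
print(format_byline(["Author One", "Author Two", "Author Three"], "March 2012"))
```

Entering the by-line by hand in a metadata field avoids maintaining even this much code in the theme.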

Presenting it all
As well as customisation for the look and feel, the Atahualpa theme allows for menus and widgets to be added to the user interface. Atahualpa has an option to insert a menu into the page header which we used for the links to the other parts of the CETIS website. On the left hand side bar we’ve used the custom menu widget to list the tags and categories to provide access to the publications divided by topic and publication type as HTML and as a feed (just add /feed to the end of the URL). Also on the left, the List Authors plugin gives us links to publications by author.

In order to provide a preview of the publication in the post I used the TGN embed everything plugin. The only problem is that the “preview” is too good: it’s readable but not the highest quality, so it might lead some people to think that we’re disseminating low quality versions of the papers, whereas we do include links to high quality downloadable files.

The built-in WordPress search is rubbish. For example, it doesn’t include the author field in the search (not that the first thing we tested was vanity searching), and the results returned are sorted by date not relevance. Happily the relevanssi plugin provides all the search we need.

Finally a few tweaks. We chose URL patterns that avoid unnecessary cruft, and closed comments to avoid spam. We installed the Google analytics plugin, so we know what you’re doing on our site, and the login lock plugin for a bit of security. The only customisation that we wanted that couldn’t be done with a theme option or plugin was providing some context to the multi-post pages. These are pages like the list of all the publications, or all the briefing papers, and we wanted a heading and some text to explain what that particular cut of our collection was. Some themes do this by default, based on information entered about the tag/category/author on which the cut is made, but not Atahualpa. I put a few lines of PHP into the theme’s index.php template to deal with publication types, but we’ve yet to do it properly for all possible multi-post pages.

And in the end…
As I said at the top, we’re happy with this approach; if you have any comments on it, do please leave them below.

One last thing. Using a popular platform like WordPress means that there is a lot of support, and I don’t just mean a well supported code base and directory of plugins and themes. One of the most useful sources of support has been the WordPress community, especially the local group of WPUK, at whose meet-ups I get burritos and advice on themes, plugins, security and all things wordpressy.

LRMI: after the meeting

Posted on 15/09/2011 by Phil Barker

Last week I was at the first face to face meeting of the Learning Resource Metadata Initiative technical working group; here are my reflections on it. In short, what I said in my previous post was about right, and the discussion went the way I hoped. One addition, though, that I didn’t cover in that post, was some discussion of accessibility conditions. That was one of a number of issues that were set aside as being of more general importance than learning resources and best dealt with bearing that wider scope in mind; the resources of the LRMI project being better spent on those issues that are specific to learning materials.

An interesting take on the scope of the project that someone (I forget who) raised during the meeting concerns working within the constraints of the search engine interface and results page. Yes, Google, Bing and Yahoo have advanced search interfaces that allow check-box selection of conditions such as licence requirements; they also provide support for specialist search, e.g. Google Scholar and custom search engines. However, real success will come if the information that can be marked up as a result of LRMI is effective for people using the default search engine. What this means is that the actions that result from a use case or scenario should be condensed down to a few key words typed into a search box,

Bing search box

and the information displayed as an outcome should fit into an inch or two of screen space on a search engine results page.

Bing search result

That’s quite useful in terms of focussing on what is really important, but of course it won’t meet everyone’s ambitions for learning resource metadata. The question this raises is to what extent should the schema.org vocabulary attempt to meet these requirements? That, I think, is still an open question, but I am sure that embedded metadata markup such as schema.org has limitations and external metadata such as is provided by the IEEE LOM, Dublin Core and ISO MLR is a complementary approach that may be necessary in meeting some of the more extensive use cases for learning resource metadata. Indeed, one requirement of LRMI which was raised during the meeting is to provide a means of linking to external metadata. One more observation on this line: at least from the basis of this meeting, it seems that the penetration of standards for educational metadata into the commercial educational publishing world (both online and more conventional) is not great.

A final issue concerning the scope of LRMI, and schema.org more generally, with respect to the use of other approaches to handling metadata is relevant to the idea of linking to external metadata, but is better illustrated by the issue of how to convey licence information. At the moment there is no schema.org term for indicating licence terms; however, there is a perfectly good approach advocated by Creative Commons and recognised by Google and many other search and content providers (i.e. a link or anchor with attributes rel="license" href="licenceURL" optionally spanning a textual description of the licence–no prizes for guessing how I think this could be extended to links to external metadata). Is it helpful to reproduce this in schema.org? On the one hand, one of the aims of schema.org is to offer web masters a unified approach and a single coherent set of recommendations for embedding metadata; on the other hand this approach seems to be in accord with HTML in general and is already widespread, so perhaps any clarification or coherence in terms of the schema.org offering would be at the expense of muddying and fragmentation of practice with respect to how to embed licence information in HTML more generally.
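To show how little a consumer needs in order to use the rel="license" convention, here is a minimal Python sketch that pulls licence URLs out of a page using only the standard library. The class name and the sample HTML are illustrative assumptions; this is not how any search engine is known to process the markup.

```python
from html.parser import HTMLParser


class LicenceLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> element marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licences = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, e.g. rel="license nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "license" in rel_values and attributes.get("href"):
            self.licences.append(attributes["href"])


SAMPLE_HTML = """
<p>This briefing is available under a
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative
Commons Attribution licence</a>. See also the <a href="/about">about page</a>.</p>
"""

parser = LicenceLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.licences)
```

The same pattern, with a different rel value, is the obvious shape for a link to an external metadata record.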

Google custom search for UKOER

Posted on 20/01/2011 by Phil Barker

It has become very clear to me over the last week or so that I haven’t done enough to publicise some work done over the summer by my colleague Lisa Scott (Lisa Rogers, as she then was) on showing how you can create a Google Custom Search Engine to search for OER materials. In summary, it’s very easy, very useful, but not quite the answer to all your UKOER discovery problems.

A Google Custom Search Engine (Google CSE) allows one to use Google to search only certain selected pages. The pages to be searched can be specified individually or as URL patterns identifying a part of a site or an entire site. Furthermore the search query can be modified by adding terms to those entered by the user.

The custom search engine can be accessed through a search box that can be hosted on Google or embedded in a web page, blog etc. Likewise, the search results page can be presented on Google or embedded in another site. Embedding of both search box and results page utilises javascript hosted on the Google site.

The pages to be searched can be specified in three ways: by entering the URL patterns directly via the Google CSE interface, by listing them in an XML or TSV (tab separated values) file which is uploaded to the Google CSE site, or as a feed from any external site. This latter option offers powerful possibilities for dynamic or collective creation of Custom Search Engines, especially since Google provide a javascript snippet which will use the links on a page as the list of URLs to search. So, for example a group of people could have access to a wiki on which they list the sites they wish to search and thus build a CSE for their shared interest, whatever that may be.

A refinement that is sometimes useful is to label pages or sites that are searched. Labels might refer to sub-topics of the theme of the custom search engine or to some other facet such as resource type. So a custom search engine for engineering OERs might label pages as different branches of engineering {mechanical, electronic, civil, chemical, …} or type of resource to be found {presentation, image, movie, simulation, articles, …}. In practice, whatever the categorisation chosen for labels, there will often be pages or sites that mix resources from different categories, so use of this feature requires thought as to how to handle this.

A Google CSE for UKOER
Our example of a simple Google CSE can be found hosted on Google.

This works as a Google search limited to pages at the domains/directories listed below; where the URL pattern doesn’t lead only to content that is UKOER, the term ‘+UKOER’ is added to the search terms entered by the user. The ‘+’ in the added term means that only those pages which contain the term UKOER are returned. This is possible since the programme mandated that all resources should be associated with the tag UKOER. Each site was labelled so that after searching, the user could limit results to those found on any one site (e.g. just those on Jorum Open) or strand of UKOER. The domains/directories searched are:

* http://open.jorum.ac.uk/
* http://www.vimeo.com/
* http://www.youtube.com/
* http://www.slideshare.net/
* http://www.scribd.com/
* http://www.flickr.com/
* http://repository.leedsmet.ac.uk/main/
* http://openspires.oucs.ox.ac.uk/
* http://unow.nottingham.ac.uk/
* https://open.exeter.ac.uk/repository/
* http://web.anglia.ac.uk/numbers/
* http://www.multimediatrainingvideos.com/
* http://www.cs.york.ac.uk/jbb
* http://www.simshare.org.uk/
* http://fetlar.bham.ac.uk/repository/
* http://open.cumbria.ac.uk/moodle/
* http://skillsforscientists.pbworks.com/
* http://core.materials.ac.uk/search/
* http://www.humbox.ac.uk/

These were chosen as they were known to be used by a number of UKOER projects for disseminating resources. We must stress that these are meant to be illustrative of sites where UKOER resources may be found; they are definitely not intended to be a complete or even a sufficient set of sites.

This is the simplest option: the configuration files are hosted on Google and managed through forms on the Google website. Expanding it to cover other web sites requires being given permission to contribute by the original creator and then adding URLs as required.
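The way the ‘+UKOER’ term is added for some sites and not others can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the configuration Google stores: the labels and the "wholly UKOER" flags below are invented for the example and do not record how any of the listed sites was actually configured.

```python
# Hypothetical configuration: URL pattern -> (label, is every page there UKOER?).
# The labels and True/False flags are assumptions made for this example.
SITES = {
    "http://open.jorum.ac.uk/": ("jorum_open", True),
    "http://www.youtube.com/": ("youtube", False),
    "http://www.slideshare.net/": ("slideshare", False),
    "http://www.humbox.ac.uk/": ("humbox", True),
}


def effective_query(user_terms, url_pattern):
    """Return the query to run against one site.

    Where a site mixes UKOER and other content, '+UKOER' is appended so
    that only pages containing that term are returned.
    """
    _label, wholly_ukoer = SITES[url_pattern]
    if wholly_ukoer:
        return user_terms
    return f"{user_terms} +UKOER"


def patterns_for_label(label):
    """Return the URL patterns carrying a label, for limiting results to one site."""
    return [pattern for pattern, (site_label, _) in SITES.items() if site_label == label]


print(effective_query("dental", "http://open.jorum.ac.uk/"))
print(effective_query("dental", "http://www.youtube.com/"))
print(patterns_for_label("jorum_open"))
```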

Reflections
Setting up this search engine was almost trivially easy. Embedding it in a website is also straightforward (Google provides code snippets to cut and paste).

The approach will only be selective for OERs if those resources can be identified through a term or tag added to the user-entered search query or if they can be selected through a specific URL pattern (including the case where a site is wholly or predominantly OERs). This wasn’t always the case.

Importantly, not all expected results appear. This is possibly because the resources on these sites aren’t tagged as UKOER, or it may be due to the pages not being indexed by Google. However, sometimes the omission seems inexplicable. For example, a search for “dental” limited to the Core materials website on Google yields the expected results, whereas the equivalent search on the CSE yields no results.

While hosting the configuration files off Google and editing them as XML files or modifying them programmatically allows some interesting refinement of the approach, we found this to be less easy. One difficulty is that the documentation on Google is somewhat fragmented and frequently confusing. Different parts of it seem to have been added by different people at different times, and it was often the case that a “link to more information” about something we were trying to do failed to resolve the difficulty that had been encountered. This was compounded by some unpredictable behaviour which may have been caused by caching (maybe on serving the configuration files, or Google reading them, or Google serving the results), or by delays in updating the indexes for the search engine, which made testing changes to the configuration files difficult. These difficulties can be overcome, but we were unconvinced that there would be much benefit in this case and so concentrated our effort elsewhere.

Conclusions
If it works for the sites you’re interested in, we recommend the simple Google custom searches as a very quick method for providing a search for a subset of resources from across a specified range of hosts. We reserve judgement on the facility for creating dynamic search engines by hosting the configuration files on one’s own server.

Sharing service information?

Posted on 01/11/2010 by Phil Barker

Over the past few weeks the question of how to find service end-points keeps coming up in conversation (I know, says a lot about the sort of conversations I have); for example, we have been asked whether we can provide information about the RSS feed locations for the services/collections created by all the UKOER projects. I would generalise this to service end points, by which I mean things like the base URL for OAI-PMH or RSS/ATOM feed locations or SRU target locations, more generally the location of the web API or protocol implementations that provide machine-to-machine interoperability. It seems that these are often harder to find than they should be, and I would like to recommend one and suggest another approach to helping make them easier to find.
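To show why the base URL is the one piece of information that matters, here is a small Python sketch that builds OAI-PMH requests from it. The verbs and the metadataPrefix argument come from the OAI-PMH protocol; the repository address and the function name are hypothetical.

```python
from urllib.parse import urlencode


def oai_pmh_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a service's base URL."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical end point; a real one has to be found on the provider's site.
base = "http://repository.example.org/oai"

# Ask the repository to describe itself.
print(oai_pmh_request(base, "Identify"))

# Ask for all records as simple Dublin Core.
print(oai_pmh_request(base, "ListRecords", metadataPrefix="oai_dc"))
```

Once the base URL is known everything else follows from the protocol, which is why it is frustrating when that one address is hard to find.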

The approach I would like to recommend to those who provide service end points, i.e. those of you who have a web-based service (e.g. a repository or OER collection) that supports machine-to-machine interoperability (e.g. for metadata dissemination, remote search, or remote upload) is that taken by web 2.0 hosts. Most of these have reasonably easy-to-find sections of their website devoted to documenting their API, and providing “how-to” information for what can be done with it, with examples you can follow, and the best of them with simple step-by-step instructions. Here’s a quick list by way of providing examples

I’ll mention Xpert Labs as well because, while the “labs” or “backstage” approach in general isn’t quite what I mean by simple “how-to” information, it looks like Xpert are heading that way and “labs” sums up the experimental nature of what they provide.

That helps people wanting to interoperate with those services and sites they know about, but it begs a more fundamental question, which is how to find those services in the first place; for example, how do you find all those collections of OERs? Well, some interested third-party could build a registry for you, but that’s an extra effort for someone who is neither providing nor using the data/service/API. Furthermore, once the information is in the registry it’s dead, or at least at risk of death. What I mean is that there is little contact between the service provider and the service registry: the provider doesn’t really rely on the service registry for people to use their services and the service registry doesn’t actually use the information that it stores. Thus, it’s easy for the provider to forget to tell the service registry when the information changes, and if it does change there is little chance of the registry maintainer noticing. So my suggestion is that those who are building aggregation services based on interoperating with various other sites provide access to information about the endpoints they use. An example of this working is the JournalToCs service, which is an RSS aggregator for research journal tables of contents but which has an API that allows you to find information for the Journals that it knows about (JOPML showed the way here, taking information from a JISC project that spawned JournalToCs and passing on lists of RSS feeds as OPML). Hopefully this approach of endpoint users providing information about what they used would only provide information that actually worked and was useful (at least for them).
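As a rough illustration of what passing on a list of RSS feeds as OPML involves, the following Python sketch writes a minimal OPML document using only the standard library. It is not how JOPML or JournalToCs do it; the feed names and addresses are placeholders.

```python
import xml.etree.ElementTree as ET


def feeds_to_opml(title, feeds):
    """Serialise (name, feed URL) pairs as a minimal OPML 2.0 document."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, url in feeds:
        # One <outline> per feed; xmlUrl carries the feed location.
        ET.SubElement(body, "outline", type="rss", text=name, xmlUrl=url)
    return ET.tostring(opml, encoding="unicode")


# Placeholder feeds: an aggregator would list the end points it actually uses.
feeds_in_use = [
    ("Example OER collection", "http://oer.example.org/feed.rss"),
    ("Example journal TOC", "http://journal.example.org/toc.rss"),
]
print(feeds_to_opml("Feeds used by our aggregator", feeds_in_use))
```

Because the list is generated from the feeds the aggregator is actually reading, it stays current for as long as the aggregator does.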

Descriptions and metadata; documents and RDF

Posted on 27/10/2010 by Phil Barker

I keep coming back to thinking about embedding metadata into human-oriented resource description web pages.

Last week I was discussing RDFa vs triple stores with Wilbert. Wilbert was making the point that publishing RDF is easier to manage, less error prone and easier on the consumer if you deal with it on its own rather than trying to deal with encoding triples and producing a human readable web page with valid XHTML all at the same time. A valid point, though Wilbert’s starting point was “if you’re wanting to publish RDF” and that left me still with the question of when do we want metadata, i.e. encoded machine readable resource descriptions, and when do we want resource descriptions that people can read, and do we really have to separate the two?

Then yesterday, following a recommendation by Dan Rehak, I read this excellent comparison of three approaches that could be used to manage resource descriptions or metadata: relational databases, document stores/noSQL, and triple stores/RDF. Which really helps in that it explains how storing information about “atomic” resources is a strength of document stores (with features like versioning and flexible schema) and storing relationships is a strength of triple stores (with, you know, features like links between concepts). So you might store information about a resource as an XML document structured by some schema so that you could extract the title, author name etc., but sometimes you want to give more detail, e.g. you might want to show how the subject related to other subjects, in which case you’re into the world where RDF has strengths. And then again, while author name is enough for many uses, an unambiguous identifier for the author encoded so that a machine will understand it as a link to more information about the author is also useful.
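The contrast can be shown with a toy example in plain Python (no document store or triple store involved). All the identifiers, names and the little subject hierarchy are invented; the property names are borrowed from Dublin Core, FOAF and SKOS only to make the triples look familiar.

```python
# Document view: one self-contained record per resource. Easy to fetch the
# title or author name, but the subject is just a string.
record = {
    "id": "res:42",
    "title": "Introduction to Thermodynamics",
    "author_name": "A. N. Author",
    "subject": "Thermodynamics",
}

# Triple view: (subject, predicate, object) statements. The author and the
# subject are identifiers that other statements can say more about.
triples = [
    ("res:42", "dc:creator", "person:ana"),
    ("person:ana", "foaf:name", "A. N. Author"),
    ("res:42", "dc:subject", "subj:thermodynamics"),
    ("subj:thermodynamics", "skos:broader", "subj:physics"),
    ("subj:physics", "skos:broader", "subj:science"),
]


def objects(subject, predicate):
    """Return every object stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def broader_subjects(subject):
    """Follow skos:broader links upwards from a subject."""
    found = []
    frontier = objects(subject, "skos:broader")
    while frontier:
        current = frontier.pop(0)
        if current not in found:
            found.append(current)
            frontier.extend(objects(current, "skos:broader"))
    return found


print(record["title"])                          # a job for the document view
print(broader_subjects("subj:thermodynamics"))  # a job for the triple view
```

Getting the title is trivial with the document; asking how the subject relates to other subjects only works once the relationships are stored as links.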

Also relevant:

Building Linked Data For Both Humans and Machines [pdf]
Linked data for people first, machines second [ppt] (you had to be there, but slide 4 is worth ~~stealing~~ a look).
An infrastructure service anti-pattern

Event: what metadata is really useful?

Posted on 08/09/2010 by Phil Barker

CETIS are organising an event “What metadata is really useful” at Brettenham House in London on Mon. 18 October.

This meeting will focus on looking at what data we have (or could acquire) to answer the question of what metadata is really required to support the discovery, selection, use and management of educational resources. The emphasis is on identifying data that demonstrates a real requirement from some party; this is in contrast to other approaches such as hypothetical, future-looking use cases. Future-looking use cases have their place–we would all like to see applications and services which allow us to do things that we cannot do now–but now seems to be a suitable point to reflect on what needs to be prioritised because it meets the needs of users today. Of those four functions (discovery, selection, use and management), it is likely that the meeting will deal mainly with the first two or three; I think we will be able to find more data for these, but it is important to keep all four functions in mind before we say that there is no demonstrated need to describe some characteristic of a resource. The data in question may come from various sources, for example, user surveys of how people look for educational resources, current practice in metadata production, or analysis of user search behaviour.

Here’s an example of what we can get from user surveys from David Davies. I hope that at this meeting we will be able to build on this and any other existing work people care to bring along. We might, for example, want to consider whether we can increase the scope and reach of such questionnaires in the future by suggesting some common questions they could include.

A second source of data can be found in the current cataloguing practice for existing repositories; this can be surfaced by examination of application profiles or cataloguing guidelines in use and examination of the records themselves. So we can find out whether people using the LOM do find it useful to have separate description elements for general and educational properties (not to mention all those that come in the classification category). This is especially interesting since it is perhaps the only source of data I have thought of that reflects metadata required internally to the repository for managing resources.

Finally, and this is where I think we will have most to discuss, data can be obtained from logging access and queries. This is what I have in mind by way of questions that could be answered this way:

How do people find the site? Is it through search engines or direct referral? Do they land on a resource page (=> they were looking for a resource and found it directly with an external search) or on your home page (=> they were looking for a collection of resources)? Obviously the answers will depend on who your users are and why they are coming to your site. If you have an institutional repository or other local collection of your own resources (e.g. an OER site) you might find that members of your own institution, staff and students, behave differently from others from outwith your institution.
What search terms do people use to find resources? We can divide this into two: people who search elsewhere, e.g. Google, with query terms discovered through referrer logs or other web analytics tools; and people who search using a site’s own search functionality. A lot of the search terms will be subject keywords and they’ll be of interest to cataloguers or thesaurus developers for a specific site, but there will be other search terms (e.g. ‘powerpoint’, ‘ppt’, and ‘slides’ all featured in the one set of logs I looked at recently), which lead us to …
What do the search terms tell us about what characteristics of a resource people are searching for? And how do they conceptualize those characteristics? So a search for “powerpoint” suggests that they’re searching for a particular resource type; “introduction to…” would suggest a way of thinking about educational level. This would help us when making decisions about what metadata elements to use.
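By way of a minimal sketch of the kind of analysis these questions imply, the Python below pulls the search terms out of referrer URLs and sorts them into rough groups. The referrers are invented, the rules are deliberately crude, and it assumes the search engine puts the query in a parameter called q; real logs would need rather more care.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Terms taken as a sign that the searcher wants a particular resource type.
RESOURCE_TYPE_TERMS = {"powerpoint", "ppt", "slides"}


def query_from_referrer(referrer):
    """Return the search terms from a referrer URL, or None if there are none.

    Assumes the query is carried in a 'q' parameter.
    """
    values = parse_qs(urlparse(referrer).query).get("q")
    return values[0] if values else None


def classify(query):
    """Make a crude guess at which characteristic a query is about."""
    lowered = query.lower()
    if any(word in RESOURCE_TYPE_TERMS for word in lowered.split()):
        return "resource type"
    if lowered.startswith("introduction to"):
        return "educational level"
    return "subject or other"


# Invented referrers standing in for a site's logs.
referrers = [
    "http://www.google.com/search?q=thermodynamics+ppt",
    "http://www.google.com/search?q=introduction+to+statistics",
    "http://www.google.com/search?q=dental+anatomy",
    "http://www.example.org/some/page",
]

queries = [q for q in map(query_from_referrer, referrers) if q]
counts = Counter(classify(q) for q in queries)
print(counts)
```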

Interested?
If you are interested in this event, you can register online. More details about the programme etc. will become available on the event’s wiki page (at the time of writing there is very little there that isn’t also on this post). Most importantly, if you think you have something to present or contribute, please get in touch with me, phil.barker /at/ hw.ac.uk.

Phil Barker
