Where to put your EPUB metadata

Posted on 15/01/2014 by Phil Barker

Even in the knowledge that current mainstream EPUB readers and applications for managing eBooks will most likely ignore all but the most trivial metadata, we still have use cases that involve more sophisticate metadata. For example we would like to use the LRMI alignment object in schema.org to say that a particular subsection of a book can be useful in the context of a specific unit in a shared curriculum.

So, without evaluating pros and cons, starting from the most basic/most common, what are the options? This is a summary takes information from Garrish and Gulling, EPUB 3 Best Practices, OReilly 2013, (which I take to be authoritative and also as an example of best practice with regard to the metadata in the epub file) as well as the EPUB 3.0 Publications and Content Documents specifications. Any comments would be greatly appreciated.

1. Simple Dublin Core

Within the OEPBS directory of an unpacked EPUB3 is the content.opf file. It pretty much equates to the manifest of an IMS Content Package. The top-level element is <package> and <metadata> is a required first child of <package>.

The default metadata vocabulary is the Dublin Core Metadata Element Set (DCMES, simple DC), with prefix dc:. Three elements are mandatory–title, identifier and language–others are optional. For example, in /OEPBS/content.opf

<?xml version=’1.0’ encoding=’UTF-8’?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/ [...]">
    <metadata>
        <dc:identifier>urn:isbn:9781449325299</dc:identifier>
        <dc:title>EPUB 3 Best Practices</dc:title>
        <dc:language>en</dc:language>
        <dc:rights>Copyright © 2013 Matt Garrish and Markus Gylling</dc:rights>
[...]

2 Other metadata schemas

The package element has a prefix attribute that may be used to declare prefixes for metadata schemas other than DCMES. Four vocabularies are reserved, i.e. the prefix may be used without a declaration: dcterms, marc, onix and media (the vocabulary used for EPUB3 media overlays). Example

<dcterms:title>EPUB 3 Best Practices</dcterms:title>

Other vocabularies may be used providing a prefix and a URL in a way so similar to xmlns that is makes you wonder why they didn’t just use xmlns.

<package prefix="prism: http://prismstandard.org/namespaces/basic/3.0/" [...]>

3 the meta element

If used without the refines attribute (see below) the meta element can provide information about the package as a whole, e.g.

<meta property="dcterms:title">EPUB 3 Best Practices</meta>

I have no idea what would be the benefit of this over <dcterms:title>.

4 Refining metadata elements: id attribute and the meta element

The id attribute can be used to provide an identifier any element in the metadata that it may be refined. One example of this is mandatory, i.e. that one occurrence of the dc:identifier element must be the publication identifier:

<dc:identifier id="pub-identifier">urn:isbn:9781449325299</dc:identifier>

In general the refinements are described using the meta element with the refines attribute and a property attribute that specifies the nature of the refinement. It’s kind of like RDF reification. The default vocabulary for the property attribute includes “file-as” – an alternative string for a name to be used when filing, “identifier-type” – a way to distinguish between different identifiers, “meta-auth” – the authority for a given instance of metadata, “title-type” – which of the six forms of title being provided.

<dc:creator id="1234">Matt Garrish</dc:creator>
<meta refines="#1234" property="file-as" id="5678">Garrish, Matt</meta>
<meta refines="#1234" property="role">Author</meta>

Terms from other vocabularies may be used for “property” so long as a prefix is declared.

Refinements may have ids and so may be refined.

<meta refines="#5678" property="meta-auth">Phil Barker</meta>

So and so you can make statements about your metadata statements to you heart’s content (though including the whole of the linked data graph in each epub would be silly).

The scheme attribute may be used to identify the controlled vocabulary from which the meta element’s value is drawn. For example, if the identifier is a DOI (which in onix is apparently entry 06 of codelist 5) you can have

<dc:identifier id="pub-id">urn:doi:10.1016/j.iheduc.2008.03.001 </dc:identifier>
<meta refines="#pub-id"
      property="identifier-type"
      scheme="onix:codelist5">06</meta>

Or, using the marc relator value Aut to specify author

<meta refines="#1234" property="role" scheme="marc:relators">Aut</meta>

5 Sub-package level metadata

The id attribute may be used to provide an identifier of an subelement of <package> or any element in the XHTML content documents, down to a span element around a phrase, word or character. So a chapter may have id=”chap1″ then we can use meta elements in the metadata to describe it seperately from the rest of the epub

<meta refines="#chap1" property="prism:contentType">bookChapter<meta>

6 Links to metadata records

The link element is an optional, repeatable subelement of <metadata>, “used to associate resources with a publication, such as metadata records” The metadata may be within package or anywhere on the www.
Example

<link rel="marc21xml-record" href="pub/meta/nor-wood-marc21.xml" />
<link refines="#chap1" rel="ex:schema_org-record"
      media-type="application/ld+json"
      href="http://example.org/nor-wood-lrmi.json" />

Metadata embedded in the XHTML5 content

As far as I can see the EPUB3 specs are mute on metadata in HTML of the content documents, e.g. as html:meta elements or as microdata or RDFa, there doesn’t seem to be any reason why one should not put metadata here. I wouldn’t expect any EPUB system to look that deeply into the package but it would be a good approach to helping the metadata travel with the resource if the EPUB is disaggregated and passed into a non-EPUB3 CMS.

Heads up for HEDIIP

Posted on 24/07/2013 by Phil Barker

A while back I summarised the input about semantics and academic coding that Lorna and I had made on behalf of Cetis for a study on possible reforms to JACS, the Joint Academic Coding System. That study has now been published.

JACS is mainatained by HESA (the Higher Education Statistics Agency) and UCAS (Universities and Colleges Admissions Service) as a means of classifying UK University courses by subject; it is also used by a number of other organisations for classification of other resources, for example teaching and learning resources. The report (with appendices) considers the varying requirements and uses of subject coding in HE and sets out options for the development of a replacement for JACS.

Of course, this is all only of glancing interest, until you realise that stuff like Unistats and the Key Information Set (KIS) are powered by JACS.
- See more at Followers of the apocalypse

If you’re not sure why this should interest you (and yet for some reason have read this far) David Kernohan has written what I can only describe as an appreciation of the report, Hit the road JACS, from which the quote above is taken.

To move forward from this and the other reports commissioned from the Redesigning the HE data landscape study, the Higher Education Data and Information Improvement Programme (HEDIIP) is being established to enhance the arrangements for the collection, sharing and dissemination of data and information about the HE system. Follow them on twitter.

Learning Resource Metadata is Go for Schema

Posted on 24/04/2013 by Phil Barker

The Learning Resource Metadata Initiative aimed to help people discover useful learning resources by adding to the schema.org ontology properties to describe educational characteristics of creative works. Well, as of the release of schema draft version 1.0a a couple of weeks ago, the LRMI properties are in the official schema.org ontology.

Schema.org represents two things: 1, an ontology for describing resources on the web, with a hierarchical set of resource types each with defined properties that relate to their characteristics and relationships with other things in the schema hierarchy; and 2, a syntax for embedding these into HTML pages–well, two syntaxes, microdata and RDFa lite. The important factor in schema.org is that it is backed by Google, Yahoo, Bing and Yandex, which should be useful for resource discovery. The inclusion of the LRMI properties means that you can now use schema.org to mark up your descriptions of the following characteristics of a creative work:

audience the educational audience for whom the resource was created, who might have educational roles such as teacher, learner, parent.

educational alignment an alignment to an established educational framework, for example a curriculum or frameworks of educational levels or competencies. Expressed through an abstract thing called an Alignment Object which allows a link to and description of the node in the framework to which the resource aligns, and specifies the nature of the alignment, which might be that the resource ‘assesses’, ‘teaches’ or ‘requires’ the knowledge/skills/competency to which the resource aligns or that it has the ‘textComplexity’, ‘readingLevel’, ‘educationalSubject’ or ‘educationLevel’ expressed by that node in the educational framework.

educational use a text description of purpose of the resource in education, for example assignment, group work.

interactivity type The predominant mode of learning supported by the learning resource. Acceptable values are ‘active’, ‘expositive’, or ‘mixed’.

is based on url A resource that was used in the creation of this resource. Useful for when a learning resource is a derivative of some other resource.

learning resource type The predominant type or kind characterizing the learning resource. For example, ‘presentation’, ‘handout’.

time required Approximate or typical time it takes to work with or through this learning resource for the typical intended target audience

typical age range The typical range of ages the content’s intended end user.

Of course, much of the other information one would want to provide about a learning resource (what it is about, who wrote it, who published it, when it was written/published, where it is available, what it costs) was already in schema.org.

Unfortunately one really important property suggested by LRMI hasn’t yet made the cut, that is useRightsURL, a link to the licence under which the resource may be used, for example the creative common licence under which is has been released. This was held back because of obvious overlaps with non-educational resources. The managers of schema.org want to make sure that there is a single solution that works across all domains.

Guides and tools

To promote the uptake of these properties, the Association of Educational Publishers has released two new user guides.

The Smart Publisher’s Guide to LRMI Tagging (pdf)

The Content Developer’s Guide to the LRMI and Learning Registry (pdf)

There is also the InBloom Tagger described and demonstrated in this video.

LRMI in the Learning Registry

As the last two resources show, LRMI metadata is used by the Learning Registry and services built on it. For what it is worth, I am not sure that is a great example of its potential. For me the strong point of LRMI/schema.org is that it allows resource descriptions in human readable web pages to be interpreted as machine readable metadata, helping create services to find those pages; crucially the metadata is embedded in the web page in way that Google trusts because the values of the metadata are displayed to users. Take away the embedding in human readable pages, which is what seems to happen when used with the learning registry, and I am not sure there is much of an advantage for LRMI compared to other metadata schema,–though to be fair I’m not sure that there is any comparative disadvantage either, and the effect on uptake will be positive for both sides. Of course the Learning Registry is metadata agnostic, so having LRMI/schema.org metadata in there won’t get in the way of using other metadata schema.

Disclosure (or bragging)

I was lucky enough to be on the LRMI technical working group that helped make this happen. It makes me vary happy to see this progress.

On Semantics and the Joint Academic Coding System

Posted on 17/04/2013 by Phil Barker

Lorna and I recently contributed a study on possible reforms to JACS, a study which is part of a larger piece of work on Redesigning the HE data landscape. JACS, the Joint Academic Coding System, is mainatained by HESA (the Higher Education Statistics Agency) and UCAS (Universities and Colleges Admissions Service) as a means of classifying UK University courses by subject; it is also used by a number of other organisations for classification of other resources, for example teaching and learning resources. The study to which we were contributing our thoughts had already identified a problem with different people using JACS in different ways, which prompted the first part of this post. We were keen to promote technical changes to the way that JACS is managed that would make it easier for other people to use (and incidentally might help solve some of the problems in further developing JACS for use by HESA and UCAS), which are outline in the second part.

There’s nothing new here, I’m posting these thoughts here just so that they don’t get completely lost.

Subjects and disciplines in JACS

One of the issue identified with the use of JACS is that “although ostensibly believing themselves to be using a single system of classification, stakeholders are actually applying JACS for a variety of different purposes” including Universities who “often try to align JACS codes to their cost centres rather than adopting a strictly subject-based approach”. The cost centres in question are academic schools or departments, which are discipline based. This is problematic to the use of JACS to monitor which subjects are being learnt since the same subject may be taught in several departments. A good example of this is statistics, which is taught across many fields from Mathematics through to social sciences, but there are many other examples: languages taught in mediaeval studies and business translation courses, elements of computing taught in electronic engineering and computer science and so on. One approach would be to ignore the discipline dimension, to say the subject is the same regardless of the different disciplinary slants taken, that is to say statistics taught to mathematicians is the same as statistics taught to physicists is the same as statistics taught to social sciences. This may be true at a very superficial level, but obviously the relevance of theoretical versus practical elements will vary between those disciplines, as will the nature of the data to be analysed (typically a physicist will design an experiment to control each variable independently so as not to deal with multivariate data, this is not often possible in social sciences and so multivariate analysis is far more important). When it comes to teaching and learning resources something aimed at one discipline is likely to contain examples or use an approach not suited to others.

Perhaps more important is that academics identify with a discipline as being more than a collection of subjects being taught. It encapsulates a way of thinking, a framework for deciding on which problems are worth studying and a way of approaching these problems. A discipline is a community, and an academic who has grown up in a community will likely have acquired that community’s view of the subjects important to it. This should be taken into account when designing a coding scheme that is to be used by academics since any perception that the topic they teach is being placed under someone else’s discipline will be resisted as misrepresenting what is actually being taught, indeed as a threat to the integrity of the discipline.

More objectively, the case for different disciplinary slants on a problem space being important is demonstrated by the importance of multidisciplinary approaches to solving many problems. Both the reductionist approach of physics and the holistic approach of humanities and social sciences have their strengths, and it would be a shame if the distinction were lost.

The ideal coding scheme would be able to represent both the subject learnt and the discipline context in which it was learnt.

JACS and 5* data

Tim Berners-Lee suggested a 5 star deployment scheme for open data on the world wide web:
* make your stuff available on the Web (whatever format) under an open licence
** make it available as structured data (e.g., Excel instead of image scan of a table)
*** use non-proprietary formats (e.g., CSV instead of Excel)
**** use URIs to denote things, so that people can point at your stuff
**** link your data to other data to provide context

Currently JACS fails to meet the open licence requirement for 1-star data explicitly, but that seems to be a simple omission of a licensing statement that shows the intention that JACS should be freely available for others to use. It is important that this is fixed, but aside from this, JACS operates at about 3-star level. Assigning URIs to JACS subjects and providing useful information when someone accesses these URIs will allow JACS to be part of the web of linked open data. The benefits of linking data over the web include:

The identifiers are globally unique and unambiguous, they can be used in any system without fear of conflicting with other identifiers.
The subjects can be referenced globally by humans by from websites, emails, and by computer systems in/from data feeds and web applications.
The subjects can be used for semantic web approaches to representing ontologies, such as RDF.
These allow relationships such as subject hierarchies and relationships with other concepts (e.g. academic discipline) to be represented independently of the coding scheme used. An example of this is SKOS, see below.

In practical terms, implementing this would mean:

Devising a URI scheme. This could be as simple as adding the JACS codes to a suitable base URI. For example H713 could become http://id.jacs.ac.uk/H713
Setting up a web service to provide suitable information. Anyone connecting to that URL would be redirected to information that matched parameters in their request. A simple web browser would request an HTML page and so be redirected to http://id.jacs.ac.uk/H713.html; web applications would request data in a machine readable form such as xml, rdf or json.

The main overhead is in setting up, maintaining and managing the data provided by the web service, but Southampton University have already set one up for their own use. (The only problem with the Southampton service–and I believe Oxford have done something similar–is a lack of authority, i.e. it isn’t clear to other users whether the data from this service is authoritative, up to date, used under a correct license, sustainable.)

JACS and SKOS

SKOS (Simple Knowledge Organization System) is a semantic web application of RDF which provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. It allows for the description of a concept and the expression of the relationship betweens pairs of concepts. But first the concept must be identified as such, with a URI. For example:
jacs:H713 rdf:type skos:concept
In this example jacs: is shorthand for the JACS base URI, http://id.jacs.ac.uk/ as suggested above; rdf: and skos: are shorthand for the base URIs for RDF and SKOS. This triple says “The thing identified by http://id.jacs.ac.uk/H713 is a resource of type (as defined by RDF) concept (as defined by SKOS)”.

Other assertions can be made about the resource, e.g. the preferred label to be used for it and a scope note for it.
jacs:H713 skos:prefLabel “Production Processes”
jacs:H713 skos:scopeNote “The study of the principles of engineering as they apply to efficient application of production-line technology.”

Assuming the other JACS codes have been suitably identified, relationships between them can be described:
jacs:H713 skos:broader jacs:H710
jacs:H713 skos:related jacs:H830

Once JACS is on the semantic web relationships between the JACS subjects and things in other schemas can also be represented
http://example.org/123 dct:subject jacs:H713
(The resource identified by the URI http://example.org/123 is about the subject identified by jacs:H713).

Where does schema.org fit in the (semantic) web?

Posted on 16/08/2012 by Phil Barker

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that big search engines have agreed to support. I haven’t said much about why I think it is important, with the corollary of “what it is for?”.

The schema.org FAQ answers that second question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind because when you get into the nitty gritty of what schema.org does, i.e. identifying things and describing their characteristics and relationships to other things, in the context of the web, then you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org. To help understand where schema.org fits into the semantic web more generally it is useful to think about what various semantic web initiatives cover that schema doesn’t. Starting with what is closest to schema.org, this includes: resource description for purposes other than discovery; descriptions not on web pages; data feeds for machine to machine communication; interoperability for raw data in different formats (e.g. semantic bioinformatics); ontologies in general, beyond the set of terms agreed by schema.org partners, and their representation. RDFa brings some of this sematic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”

Secondly there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach, if you’re already committed to RDFa then hopefully you can use it in a way that Google and other search engines will support (if that is important to you). However schema.org was the search engine providers’ first choice when it came to resource discovery, at least first in the chronological sense. Whether it will remain their first preference is moot but in that same blog post mentioned above they make a commitment to it that (to me at least) reads as a stronger commitment than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is interesting also to note that schema.org is the search engine company’s own creation. It’s not that there is a shortage of other options for embedding metadata into web pages, HTML has always had meta tags for description, keywords, author, title; yet not only are these not much supported but the keywords tag especially can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar–but note that Google Scholar differs in its use of metadata from Google’s main search index.)

So why create schema.org? The Google schema.org faq says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about “Microdata and more” approach)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to and comment on the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft Bing, Yahoo and Yandex pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, that icing on the old semantic web layer cake (perhaps the cake is upside down, the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed if it the marked-up content is hidden Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is use on the main search site, which we know is the first stop for people wanting to find information, that would bring about significant benefits in terms the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price etc., and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa etc., which predates and has evolved into the schema.org initiative).

Secondly, there is the Knowledge graph, which while it is known to use FreeBase, and seems to get much of its data from dbpedia, has a “things not strings” approach which resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also

Webinar: Learning resource metadata for schema.org

Posted on 13/07/2012 by Phil Barker

As you may know, I have been involved in the development of the Learning Resource Metadata Initiative‘s extension of schema.org since about this time last year. Things are shaping up well for the inclusion of the LRMI properties in the main schema.org vocabulary, so this seems like a good time(*) to start explaining and promoting them. To that end, we will be running webinar, hosted on JISC’s BlackBoard Collaborate service on Fri 27 July starting at 15:00 UK time, it will run for up to 2 hours.

Update: the webinar happened, you can get the slides that were used from slideshare and you can view a full recording of the webinar (that’s a BlackBoard Collaborate recording, you need Java for it to play).

In this webinar we will explore the background, intent and output of the Learning Resource Metadata Initiative (LRMI). The LRMI has proposed extensions to the schema.org microdata vocabulary with the aim of facilitating the discovery of learning resources through major search engines and other discovery services. We will provide an introduction to schema.org and describe the specific approach taken by LRMI.

My first take at an outline programme is along the lines of:

Outline of schema.org as semantic tagging of HTML content (this isn’t intended to be a tutorial on how to add schema to a web page, but I think it will be useful to make sure everyone starts from the same understanding of schema’s place in the web)
Who is behind schema.org
Their motivation: “improve search services”–what that means
What schema.org (initial release) offers for Learning Resources and what it doesn’t.
Who is behind LRMI
How LRMI worked
Most importantly, what LRMI produced

I am delighted that helping me with this webinar will be two key players in LRMI and schema.org. Dan Brickley, who many of you will know from his years of activity on RDF and the semantic web and who is heavily involved in the outreach, standards and community work around schema.org, and Greg Grossmeier of Creative Commons who is Co-chair of the LRMI technical working group and so has steered us from the collection of user requirement through to the development of new schema.org properties.

The target audience is staff from UK Further and Higher Education with an interest in the dissemination of learning resources (for example Open Educational Resources, OERs) and building services for their discovery, especially those people involved in JISC projects and services. If demand is high priority will be given to this audience.

(* yeah, OK, Friday afternoon at the end of July isn’t really a good time for this, but it ended up as the best time for the people involved given their other constraints….)

Text and Data Mining workshop, London 21 Oct 2011

Posted on 21/10/2011 by Phil Barker

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and were well articulated by JohnMcNaught of NaCTeM. Briefly the potential lies in the extraction of new knowledge from old through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of volume of information involved. Full text access is crucial, John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, from for example Jeff Lynn of the Coalition for a digital economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if it were to remain competetive.

While these speakers touched on the legal barriers it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before) when publishers acquire copyright, or a licence to publish with enough restriction to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-by, this requires attribution. Naomi spoke of attribution stacking, a problem John had mentioned when a result is found by mining 1000s of papers: do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: it is not covered by fair dealling (though it is covered by fair use in the US and similar exceptions in Norwegian and Japanese law, nowhere else); the exceptions for transient copies (such as in a computers memory when readng on line) only apply if that copy has not intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealling to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review people who understand text mining and care about legal issues make sure that any legislation is sufficient to allow innovation, otherwise innovators will have to move to those jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA on organising this event; there’s obviously plenty more for them to do.

Testing Caprét

Posted on 17/08/2011 by Phil Barker

I’ve been testing the alpha release of CaPRéT , a tool that aids attribution and tracking of openly licensed content from web sites. According to the Caprét website.

When a user cuts and pastes text from a CaPRéT-enabled site:

The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.

The OER site can also track what text was cut, allowing them to better understand how users are using their site.

I tested Caprét on a single page, my institutional home page and on this blog. To enable Caprét for material on a website you need to include links to four javascript files in your webpages. I went with the files hosted on the Caprét site so all I had to do was put this into my homepage’s <head> (The testing on my home page is easier to describe, since the options for WordPress will depend on the theme you have installed.)

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js" type="text/javascript"></script> <script src="http://capret.mitoeit.org/js/jquery.plugin.clipboard.js" type="text/javascript"></script> <script src="http://capret.mitoeit.org/js/oer_license_parser.js" type="text/javascript"></script> <script src="http://capret.mitoeit.org/js/capret.js" type="text/javascript"></script>

Then you need to put the relevant information, properly marked up into the webpage. Currently caprét cites the Title, source URL, Author, and Licence URI of the page from which the text was copied. The easiest way to get this information into your page is to use a platform which generates it automatically, e.g. WordPress or Drupal with the OpenAttribute plug-in installed. The next easiest way is to fill out the form at the Creative Commons License generator. Be sure to supply the additional information if you use that form.

If you’re into manual, this is what does the work:

Title, is picked up from any text marked as
<span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type"><span> or, if that’s not found, the page <title> in the <head>

Source URL comes from page url

Author name, is picked up from contents of <a xmlns:cc="http://creativecommons.org/ns#" href="http://jisc.cetis.org.uk/contact/philb" property="cc:attributionName" rel="cc:attributionURL"></a> (actually, the author attribution URL in the href attribute isn’t currently used, so this could just as well be a span)

Licence URI, is picked up from the href attribute of <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">

You might want to suggest other things that could be in the attribution/citation.

Reflections
As far as attribution goes it seems to work. Copy something from my home page or this blog and paste it elsewhere and the attribution information should magically appear. What’s also there is an embedded tracking gif, but I haven’t tested whether that is working.

What I like about this approach is that it converts self-description into embedded metadata. Self description is practice of including within a resource that information which is important for describing it: the title, author, date etc. Putting this information into the resource isn’t rocket science, it’s just good practice. To convert this information into metadata it needs to be encoded in such a way that a machine can read it. That’s where the RDFa comes in. What I like about RDFa (and microformats and microdata) as a way of publishing metadata is that it builds the actual descriptions are the very same ones that it’s just good practice to include in the resource. Having them on view in the resource is likely to help with quality assurance, and, while the markup is fiddly (and best dealt with by the content management system in use, not created by hand) creating the metadata should be no extra effort over what you should do anyway.

Caprét is being developed by MIT OEIT and Tatemae (OERGlue) as part of the JISC CETIS mini projects initiative; it builds on the browser plug-in developed independently by the OpenAttribute team.

Call for Papers: Semantic Technologies for Learning and Teaching Support in Higher Education

Posted on 03/03/2011 by Phil Barker

Our friends at the University of Southampton, Hugh Davis, David Millard and Thanassis Tiropanis (with whom we worked on the SemTech project and who organised a subsequent workshop) are guest editing a Special Section of IEEE Transactions on Learning Technology on Semantic Technologies for Learning and Teaching Support in Higher Education.

Call for papers (pdf) from the IEEE Computer Society. Deadline for submission 1 April 2011.

Semantic web applications in higher education

Posted on 12/11/2010 by Phil Barker

Last week I was in Southampton for the second workshop on Semantic web applications in higher education (SemHE), organised by Thanasis Tiropanis and friends from Learning Societies Lab at Southampton University. These same people had worked with CETIS on the Semantic Technologies in Learning and Teaching (SemTech) project. The themes from the meeting seemed to be an emphasis on using this technology to solve real problems, i.e. the applications in the workshop title, and, to quote Thanasis in his introduction a consequent “move away from complex idiosyncratic ontologies not much used outside of the original developers” and towards a simpler “linked data field”.

First: the scope of the workshop, at least in terms of what was meant by “higher education” in the title. The interests of those who attended came under at least two (not mutually exclusive) headings. One was HE as an enterprise, and the application of semantic web applications to the running of the University, i.e. e-administration, resource and facility management, and the like. The other was the role semantic technologies in teaching and learning, one aspect of which was summed up nicely by Su White as identifying the native semantic technologies that would give the student an authentic learning experience to prepare them for a world where massive amounts data are openly available, e.g. preparing geography students to work with real data sets.

The emphasis on solving a real problem was nicely encapsulated by a presentation from Farhana Sarker where she identified ~20 broad challenges facing UK HE such as management, funding, widening participation, retention, contribution to economy, assessment, plagiarism, group formation in learning & teaching, construction of personal and group knowledge…. She then presented what you might call a factorisation of the data that could help address these challenges into about 9 thematic repositories (using that word in a broad sense) containing: course information, teaching material, student records, research output, research activities, staff expertise, infrastructure data, accreditation records (course/institnl accred), staff development programme details (I may have missed a couple). Of course each repository addresses more than one of the challenges, and to do so much of the data held in them needs to be shared outside of the institution.

A nice, concrete, example of using shared data to address a problem in resource management and discovery was provided by Dave Lambert showing how external linked data sources such as Dewey.info, Library of Congress, GeoNames, sindice.com zemanta.com and a vocabulary drawn from from the FOAF, DC in RDFS, SKOS, WGS84 and Timeline ontologies, have been used by the OU to catalogue videos in the annomation tool and provide a discovery service through the SugarTube project.

One comment that Dave made was that many relevant ontologies were too heavyweight for the purpose he had, and this focus on what is needed to solve a problem linked with another theme that ran through the meeting, that of pragmatism and as much simplicity as possible. Chris Gutteridge made a very interesting observation, that the uptake of semantic technologies, like the uptake of the web in the late 1990s, would involve a change in the people working on it from those who were doing so because they were interested in the semantic web to those who were doing so because their boss had told them they had to. This has some interesting consequences, for example: there are clear gains to be made (says Chris) from the application of semantic technologies to e-admin, however the IT support for admin is not often well versed in semantic ideas. Therefore, to realise these gains those pioneering the use of the semantic web and linked data should supply patterns that are easy to follow; consuming data from a million different ontologies won’t scale.

Towards the end of the day the discussion on pragmatism rather than idealism settled on the proposal, I forget who made it, that ontologies were a barrier to mass adoption of the semantic web, and that what would be better would be to create a “big bag of predicates” with domain thing, range thing. The suggestion being that more specific domains or ranges tended to be ignored anyway. (Aside, I don’t know whether the domain & range would be owl:Thing s, or whether it would matter if rdfs:Resource were used instead. If you can explain how a distinction between those two helps interoperability then I would be interested; throw skos:Concept into the mix and I’ll buy you a pint.)

Returning to the SemTech project, the course of the meeting did a lot to reiterate the final report of that project, and in particular the roadmap it produced, which was a sequence of 1) release data openly as linked data with an emphasis on lightweight knowledge models; 2) creation and deployment of applications build on this data; 3) emergence of ontologies and pedagogy-aware semantic applications. While the linked data cloud shows the progress of step 1, I would suggest that it is worth keeping an eye on whether step 2 is happening (the SemTech project provided a baseline survey for comparison, so what I am suggesting is a follow up of that at some point).

Finally: thanks to Thanasis for organising the workshop, I know he had a difficult time of it, and I hope that doesn’t put him off organising a third (once you call something the second… you’ve created a series!)

Phil Barker

Cetis Blog

Category Archives: semantic technologies