Heads up for HEDIIP

A while back I summarised the input about semantics and academic coding that Lorna and I had made on behalf of Cetis for a study on possible reforms to JACS, the Joint Academic Coding System. That study has now been published.

JACS is mainatained by HESA (the Higher Education Statistics Agency) and UCAS (Universities and Colleges Admissions Service) as a means of classifying UK University courses by subject; it is also used by a number of other organisations for classification of other resources, for example teaching and learning resources. The report (with appendices) considers the varying requirements and uses of subject coding in HE and sets out options for the development of a replacement for JACS.

Of course, this is all only of glancing interest, until you realise that stuff like Unistats and the Key Information Set (KIS) are powered by JACS.
- See more at Followers of the apocalypse

If you’re not sure why this should interest you (and yet for some reason have read this far) David Kernohan has written what I can only describe as an appreciation of the report, Hit the road JACS, from which the quote above is taken.

hediip_logoTo move forward from this and the other reports commissioned from the Redesigning the HE data landscape study, the Higher Education Data and Information Improvement Programme (HEDIIP) is being established to enhance the arrangements for the collection, sharing and dissemination of data and information about the HE system. Follow them on twitter.

Learning Resource Metadata is Go for Schema

The Learning Resource Metadata Initiative aimed to help people discover useful learning resources by adding to the schema.org ontology properties to describe educational characteristics of creative works. Well, as of the release of schema draft version 1.0a a couple of weeks ago, the LRMI properties are in the official schema.org ontology.

Schema.org represents two things: 1, an ontology for describing resources on the web, with a hierarchical set of resource types each with defined properties that relate to their characteristics and relationships with other things in the schema hierarchy; and 2, a syntax for embedding these into HTML pages–well, two syntaxes, microdata and RDFa lite. The important factor in schema.org is that it is backed by Google, Yahoo, Bing and Yandex, which should be useful for resource discovery. The inclusion of the LRMI properties means that you can now use schema.org to mark up your descriptions of the following characteristics of a creative work:

audience the educational audience for whom the resource was created, who might have educational roles such as teacher, learner, parent.

educational alignment an alignment to an established educational framework, for example a curriculum or frameworks of educational levels or competencies. Expressed through an abstract thing called an Alignment Object which allows a link to and description of the node in the framework to which the resource aligns, and specifies the nature of the alignment, which might be that the resource ‘assesses’, ‘teaches’ or ‘requires’ the knowledge/skills/competency to which the resource aligns or that it has the ‘textComplexity’, ‘readingLevel’, ‘educationalSubject’ or ‘educationLevel’ expressed by that node in the educational framework.

educational use a text description of purpose of the resource in education, for example assignment, group work.

interactivity type The predominant mode of learning supported by the learning resource. Acceptable values are ‘active’, ‘expositive’, or ‘mixed’.

is based on url A resource that was used in the creation of this resource. Useful for when a learning resource is a derivative of some other resource.

learning resource type The predominant type or kind characterizing the learning resource. For example, ‘presentation’, ‘handout’.

time required Approximate or typical time it takes to work with or through this learning resource for the typical intended target audience

typical age range The typical range of ages the content’s intended end user.

Of course, much of the other information one would want to provide about a learning resource (what it is about, who wrote it, who published it, when it was written/published, where it is available, what it costs) was already in schema.org.

Unfortunately one really important property suggested by LRMI hasn’t yet made the cut, that is useRightsURL, a link to the licence under which the resource may be used, for example the creative common licence under which is has been released. This was held back because of obvious overlaps with non-educational resources. The managers of schema.org want to make sure that there is a single solution that works across all domains.

Guides and tools

To promote the uptake of these properties, the Association of Educational Publishers has released two new user guides.

The Smart Publisher’s Guide to LRMI Tagging (pdf)

The Content Developer’s Guide to the LRMI and Learning Registry (pdf)

There is also the InBloom Tagger described and demonstrated in this video.

LRMI in the Learning Registry

As the last two resources show, LRMI metadata is used by the Learning Registry and services built on it. For what it is worth, I am not sure that is a great example of its potential. For me the strong point of LRMI/schema.org is that it allows resource descriptions in human readable web pages to be interpreted as machine readable metadata, helping create services to find those pages; crucially the metadata is embedded in the web page in way that Google trusts because the values of the metadata are displayed to users. Take away the embedding in human readable pages, which is what seems to happen when used with the learning registry, and I am not sure there is much of an advantage for LRMI compared to other metadata schema,–though to be fair I’m not sure that there is any comparative disadvantage either, and the effect on uptake will be positive for both sides. Of course the Learning Registry is metadata agnostic, so having LRMI/schema.org metadata in there won’t get in the way of using other metadata schema.

Disclosure (or bragging)

I was lucky enough to be on the LRMI technical working group that helped make this happen. It makes me vary happy to see this progress.

Where does schema.org fit in the (semantic) web?

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that big search engines have agreed to support. I haven’t said much about why I think it is important, with the corollary of “what it is for?”.

The schema.org FAQ answers that second question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind because when you get into the nitty gritty of what schema.org does, i.e. identifying things and describing their characteristics and relationships to other things, in the context of the web, then you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org. To help understand where schema.org fits into the semantic web more generally it is useful to think about what various semantic web initiatives cover that schema doesn’t. Starting with what is closest to schema.org, this includes: resource description for purposes other than discovery; descriptions not on web pages; data feeds for machine to machine communication; interoperability for raw data in different formats (e.g. semantic bioinformatics); ontologies in general, beyond the set of terms agreed by schema.org partners, and their representation. RDFa brings some of this sematic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”

Secondly there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach, if you’re already committed to RDFa then hopefully you can use it in a way that Google and other search engines will support (if that is important to you). However schema.org was the search engine providers’ first choice when it came to resource discovery, at least first in the chronological sense. Whether it will remain their first preference is moot but in that same blog post mentioned above they make a commitment to it that (to me at least) reads as a stronger commitment than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is interesting also to note that schema.org is the search engine company’s own creation. It’s not that there is a shortage of other options for embedding metadata into web pages, HTML has always had meta tags for description, keywords, author, title; yet not only are these not much supported but the keywords tag especially can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar–but note that Google Scholar differs in its use of metadata from Google’s main search index.)

So why create schema.org? The Google schema.org faq says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about “Microdata and more” approach)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to and comment on the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft Bing, Yahoo and Yandex pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, that icing on the old semantic web layer cake (perhaps the cake is upside down, the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed if it the marked-up content is hidden Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is use on the main search site, which we know is the first stop for people wanting to find information, that would bring about significant benefits in terms the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price etc., and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa etc., which predates and has evolved into the schema.org initiative).

Secondly, there is the Knowledge graph, which while it is known to use FreeBase, and seems to get much of its data from dbpedia, has a “things not strings” approach which resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also

Text and Data Mining workshop, London 21 Oct 2011

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and were well articulated by JohnMcNaught of NaCTeM. Briefly the potential lies in the extraction of new knowledge from old through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of volume of information involved. Full text access is crucial, John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, from for example Jeff Lynn of the Coalition for a digital economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if it were to remain competetive.

While these speakers touched on the legal barriers it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before) when publishers acquire copyright, or a licence to publish with enough restriction to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-by, this requires attribution. Naomi spoke of attribution stacking, a problem John had mentioned when a result is found by mining 1000s of papers: do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: it is not covered by fair dealling (though it is covered by fair use in the US and similar exceptions in Norwegian and Japanese law, nowhere else); the exceptions for transient copies (such as in a computers memory when readng on line) only apply if that copy has not intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealling to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review people who understand text mining and care about legal issues make sure that any legislation is sufficient to allow innovation, otherwise innovators will have to move to those jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA on organising this event; there’s obviously plenty more for them to do.

Testing Caprét

I’ve been testing the alpha release of CaPRéT , a tool that aids attribution and tracking of openly licensed content from web sites. According to the Caprét website.

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

I tested Caprét on a single page, my institutional home page and on this blog. To enable Caprét for material on a website you need to include links to four javascript files in your webpages. I went with the files hosted on the Caprét site so all I had to do was put this into my homepage’s <head> (The testing on my home page is easier to describe, since the options for WordPress will depend on the theme you have installed.)

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/jquery.plugin.clipboard.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/oer_license_parser.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/capret.js" type="text/javascript"></script>

Then you need to put the relevant information, properly marked up into the webpage. Currently caprét cites the Title, source URL, Author, and Licence URI of the page from which the text was copied. The easiest way to get this information into your page is to use a platform which generates it automatically, e.g. WordPress or Drupal with the OpenAttribute plug-in installed. The next easiest way is to fill out the form at the Creative Commons License generator. Be sure to supply the additional information if you use that form.

If you’re into manual, this is what does the work:

Title, is picked up from any text marked as
<span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type"><span> or, if that’s not found, the page <title> in the <head>

Source URL comes from page url

Author name, is picked up from contents of <a xmlns:cc="http://creativecommons.org/ns#" href="http://jisc.cetis.org.uk/contact/philb" property="cc:attributionName" rel="cc:attributionURL"></a> (actually, the author attribution URL in the href attribute isn’t currently used, so this could just as well be a span)

Licence URI, is picked up from the href attribute of <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">

You might want to suggest other things that could be in the attribution/citation.

As far as attribution goes it seems to work. Copy something from my home page or this blog and paste it elsewhere and the attribution information should magically appear. What’s also there is an embedded tracking gif, but I haven’t tested whether that is working.

What I like about this approach is that it converts self-description into embedded metadata. Self description is practice of including within a resource that information which is important for describing it: the title, author, date etc. Putting this information into the resource isn’t rocket science, it’s just good practice. To convert this information into metadata it needs to be encoded in such a way that a machine can read it. That’s where the RDFa comes in. What I like about RDFa (and microformats and microdata) as a way of publishing metadata is that it builds the actual descriptions are the very same ones that it’s just good practice to include in the resource. Having them on view in the resource is likely to help with quality assurance, and, while the markup is fiddly (and best dealt with by the content management system in use, not created by hand) creating the metadata should be no extra effort over what you should do anyway.

Caprét is being developed by MIT OEIT and Tatemae (OERGlue) as part of the JISC CETIS mini projects initiative; it builds on the browser plug-in developed independently by the OpenAttribute team.

Call for Papers: Semantic Technologies for Learning and Teaching Support in Higher Education

Our friends at the University of Southampton, Hugh Davis, David Millard and Thanassis Tiropanis (with whom we worked on the SemTech project and who organised a subsequent workshop) are guest editing a Special Section of IEEE Transactions on Learning Technology on Semantic Technologies for Learning and Teaching Support in Higher Education.

Call for papers (pdf) from the IEEE Computer Society. Deadline for submission 1 April 2011.

Semantic web applications in higher education

Last week I was in Southampton for the second workshop on Semantic web applications in higher education (SemHE), organised by Thanasis Tiropanis and friends from Learning Societies Lab at Southampton University. These same people had worked with CETIS on the Semantic Technologies in Learning and Teaching (SemTech) project. The themes from the meeting seemed to be an emphasis on using this technology to solve real problems, i.e. the applications in the workshop title, and, to quote Thanasis in his introduction a consequent “move away from complex idiosyncratic ontologies not much used outside of the original developers” and towards a simpler “linked data field”.

First: the scope of the workshop, at least in terms of what was meant by “higher education” in the title. The interests of those who attended came under at least two (not mutually exclusive) headings. One was HE as an enterprise, and the application of semantic web applications to the running of the University, i.e. e-administration, resource and facility management, and the like. The other was the role semantic technologies in teaching and learning, one aspect of which was summed up nicely by Su White as identifying the native semantic technologies that would give the student an authentic learning experience to prepare them for a world where massive amounts data are openly available, e.g. preparing geography students to work with real data sets.

The emphasis on solving a real problem was nicely encapsulated by a presentation from Farhana Sarker where she identified ~20 broad challenges facing UK HE such as management, funding, widening participation, retention, contribution to economy, assessment, plagiarism, group formation in learning & teaching, construction of personal and group knowledge…. She then presented what you might call a factorisation of the data that could help address these challenges into about 9 thematic repositories (using that word in a broad sense) containing: course information, teaching material, student records, research output, research activities, staff expertise, infrastructure data, accreditation records (course/institnl accred), staff development programme details (I may have missed a couple). Of course each repository addresses more than one of the challenges, and to do so much of the data held in them needs to be shared outside of the institution.

A nice, concrete, example of using shared data to address a problem in resource management and discovery was provided by Dave Lambert showing how external linked data sources such as Dewey.info, Library of Congress, GeoNames, sindice.com zemanta.com and a vocabulary drawn from from the FOAF, DC in RDFS, SKOS, WGS84 and Timeline ontologies, have been used by the OU to catalogue videos in the annomation tool and provide a discovery service through the SugarTube project.

One comment that Dave made was that many relevant ontologies were too heavyweight for the purpose he had, and this focus on what is needed to solve a problem linked with another theme that ran through the meeting, that of pragmatism and as much simplicity as possible. Chris Gutteridge made a very interesting observation, that the uptake of semantic technologies, like the uptake of the web in the late 1990s, would involve a change in the people working on it from those who were doing so because they were interested in the semantic web to those who were doing so because their boss had told them they had to. This has some interesting consequences, for example: there are clear gains to be made (says Chris) from the application of semantic technologies to e-admin, however the IT support for admin is not often well versed in semantic ideas. Therefore, to realise these gains those pioneering the use of the semantic web and linked data should supply patterns that are easy to follow; consuming data from a million different ontologies won’t scale.

Towards the end of the day the discussion on pragmatism rather than idealism settled on the proposal, I forget who made it, that ontologies were a barrier to mass adoption of the semantic web, and that what would be better would be to create a “big bag of predicates” with domain thing, range thing. The suggestion being that more specific domains or ranges tended to be ignored anyway. (Aside, I don’t know whether the domain & range would be owl:Thing s, or whether it would matter if rdfs:Resource were used instead. If you can explain how a distinction between those two helps interoperability then I would be interested; throw skos:Concept into the mix and I’ll buy you a pint.)

Returning to the SemTech project, the course of the meeting did a lot to reiterate the final report of that project, and in particular the roadmap it produced, which was a sequence of 1) release data openly as linked data with an emphasis on lightweight knowledge models; 2) creation and deployment of applications build on this data; 3) emergence of ontologies and pedagogy-aware semantic applications. While the linked data cloud shows the progress of step 1, I would suggest that it is worth keeping an eye on whether step 2 is happening (the SemTech project provided a baseline survey for comparison, so what I am suggesting is a follow up of that at some point).

Finally: thanks to Thanasis for organising the workshop, I know he had a difficult time of it, and I hope that doesn’t put him off organising a third (once you call something the second… you’ve created a series!)