Learning Resource Metadata is Go for Schema

The Learning Resource Metadata Initiative aimed to help people discover useful learning resources by adding to the schema.org ontology properties to describe educational characteristics of creative works. Well, as of the release of schema draft version 1.0a a couple of weeks ago, the LRMI properties are in the official schema.org ontology.

Schema.org represents two things: 1, an ontology for describing resources on the web, with a hierarchical set of resource types each with defined properties that relate to their characteristics and relationships with other things in the schema hierarchy; and 2, a syntax for embedding these into HTML pages–well, two syntaxes, microdata and RDFa lite. The important factor in schema.org is that it is backed by Google, Yahoo, Bing and Yandex, which should be useful for resource discovery. The inclusion of the LRMI properties means that you can now use schema.org to mark up your descriptions of the following characteristics of a creative work:

audience the educational audience for whom the resource was created, who might have educational roles such as teacher, learner, parent.

educational alignment an alignment to an established educational framework, for example a curriculum or frameworks of educational levels or competencies. Expressed through an abstract thing called an Alignment Object which allows a link to and description of the node in the framework to which the resource aligns, and specifies the nature of the alignment, which might be that the resource ‘assesses’, ‘teaches’ or ‘requires’ the knowledge/skills/competency to which the resource aligns or that it has the ‘textComplexity’, ‘readingLevel’, ‘educationalSubject’ or ‘educationLevel’ expressed by that node in the educational framework.

educational use a text description of purpose of the resource in education, for example assignment, group work.

interactivity type The predominant mode of learning supported by the learning resource. Acceptable values are ‘active’, ‘expositive’, or ‘mixed’.

is based on url A resource that was used in the creation of this resource. Useful for when a learning resource is a derivative of some other resource.

learning resource type The predominant type or kind characterizing the learning resource. For example, ‘presentation’, ‘handout’.

time required Approximate or typical time it takes to work with or through this learning resource for the typical intended target audience

typical age range The typical range of ages the content’s intended end user.

Of course, much of the other information one would want to provide about a learning resource (what it is about, who wrote it, who published it, when it was written/published, where it is available, what it costs) was already in schema.org.

Unfortunately one really important property suggested by LRMI hasn’t yet made the cut, that is useRightsURL, a link to the licence under which the resource may be used, for example the creative common licence under which is has been released. This was held back because of obvious overlaps with non-educational resources. The managers of schema.org want to make sure that there is a single solution that works across all domains.

Guides and tools

To promote the uptake of these properties, the Association of Educational Publishers has released two new user guides.

The Smart Publisher’s Guide to LRMI Tagging (pdf)

The Content Developer’s Guide to the LRMI and Learning Registry (pdf)

There is also the InBloom Tagger described and demonstrated in this video.

LRMI in the Learning Registry

As the last two resources show, LRMI metadata is used by the Learning Registry and services built on it. For what it is worth, I am not sure that is a great example of its potential. For me the strong point of LRMI/schema.org is that it allows resource descriptions in human readable web pages to be interpreted as machine readable metadata, helping create services to find those pages; crucially the metadata is embedded in the web page in way that Google trusts because the values of the metadata are displayed to users. Take away the embedding in human readable pages, which is what seems to happen when used with the learning registry, and I am not sure there is much of an advantage for LRMI compared to other metadata schema,–though to be fair I’m not sure that there is any comparative disadvantage either, and the effect on uptake will be positive for both sides. Of course the Learning Registry is metadata agnostic, so having LRMI/schema.org metadata in there won’t get in the way of using other metadata schema.

Disclosure (or bragging)

I was lucky enough to be on the LRMI technical working group that helped make this happen. It makes me vary happy to see this progress.

Where does schema.org fit in the (semantic) web?

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that big search engines have agreed to support. I haven’t said much about why I think it is important, with the corollary of “what it is for?”.

The schema.org FAQ answers that second question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind because when you get into the nitty gritty of what schema.org does, i.e. identifying things and describing their characteristics and relationships to other things, in the context of the web, then you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org. To help understand where schema.org fits into the semantic web more generally it is useful to think about what various semantic web initiatives cover that schema doesn’t. Starting with what is closest to schema.org, this includes: resource description for purposes other than discovery; descriptions not on web pages; data feeds for machine to machine communication; interoperability for raw data in different formats (e.g. semantic bioinformatics); ontologies in general, beyond the set of terms agreed by schema.org partners, and their representation. RDFa brings some of this sematic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”

Secondly there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach, if you’re already committed to RDFa then hopefully you can use it in a way that Google and other search engines will support (if that is important to you). However schema.org was the search engine providers’ first choice when it came to resource discovery, at least first in the chronological sense. Whether it will remain their first preference is moot but in that same blog post mentioned above they make a commitment to it that (to me at least) reads as a stronger commitment than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is interesting also to note that schema.org is the search engine company’s own creation. It’s not that there is a shortage of other options for embedding metadata into web pages, HTML has always had meta tags for description, keywords, author, title; yet not only are these not much supported but the keywords tag especially can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar–but note that Google Scholar differs in its use of metadata from Google’s main search index.)

So why create schema.org? The Google schema.org faq says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about “Microdata and more” approach)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to and comment on the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft Bing, Yahoo and Yandex pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, that icing on the old semantic web layer cake (perhaps the cake is upside down, the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed if it the marked-up content is hidden Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is use on the main search site, which we know is the first stop for people wanting to find information, that would bring about significant benefits in terms the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price etc., and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa etc., which predates and has evolved into the schema.org initiative).

Secondly, there is the Knowledge graph, which while it is known to use FreeBase, and seems to get much of its data from dbpedia, has a “things not strings” approach which resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also

Will using schema.org metadata improve my Google rank?

It’s a fair question to ask. Schema.org metadata is backed by Google, and has the aim of making it easier for people to find the right web pages, so does using it to describe the content of a page improve the ranking of that page in Google search results? The honest answer is “I don’t know”. The exact details of the algorithm used by Google for search result ranking are their secret; some people claim to have elucidated factors beyond the advice given by Google, but I’m not one of them. Besides, the algorithm appears to be ever changing, so what worked last week might not work next week. What I do know is that Google says:

Google doesn’t use markup for ranking purposes at this time—but rich snippets(*) can make your web pages appear more prominently in search results, so you may see an increase in traffic.

*Rich Snippets is Google’s name for the semantic mark up that it uses, be it microformats, microdata (schema.org) or RDFa.

I see no reason to disbelieve Google on this, so the answer to the question above would seem to be “no”. But how then does using schema.org make it easier for people to find the right web pages? (and let’s assume for now that yours are the right pages). Well, that’s what the second part of the what Google says is about: making pages appear more prominently in search result pages. As far as I can see this can happen in two ways. Try doing a search on Google for potato salad. Chances are you’ll see something a bit like this

Selection from the results page for a Google search for potato salad showing enhanced search options (check boxes for specific ingredients, cooking times, calorific value) and highlighting these values in some of the result snippets.

Selection from the results page for a Google search for potato salad showing enhanced search options (check boxes for specific ingredients, cooking times, calorific value) and highlighting these values in some of the result snippets.

You see how some of the results are embellished with things like star ratings, or information like cooking time and number of calories–that’s the use of rich snippets to make a page appear more prominent.

But there’s more: the check boxes on the side allow the search results to be refined by facets such as ingredients, cooking time and calorie content. If a searcher uses those check boxes to narrow down their search, then only pages which have the relevant information marked-up using schema.org microdata (or other rich snippet mark-up) will appear in the search results.

So, while it’s a fair question to ask, the question posed here is the wrong question. It would be better to ask “will schema.org metadata help people find my pages using Google”, to which the answer is yes if Google decides to use that mark up to enhance search result pages and/or provide additional search options.

Background
I have been involved in the LRMI (Learning Resource Metadata Initiative), which has proposed extensions to schema.org for describing the educational characteristics of resources–see this post I did for Creative Commons UK for further details. I have promised a more technical briefing of the hows and whys of LRMI/Schema.org to be developed here, but given my speed of writing I wouldn’t hold my breath waiting for it.–In the meantime this is one of several questions I thought might be worth answering. If you can think of any, let me know.

RDFa Rich snippets for educational content

Prompted by a comment from Andy Powell that

It would be interesting to think about how much of required resource description for UKOER can be carried in the RDFa vocabularies currently understood by Google. Probably quite a lot.

I had a look at Googles webmaster advice on Marking up products for rich snippets.

My straw man mapping from the UKOER description requirements to Rich Snippets was:

Mandated Metadata
Programme tag = Brand?
Project tag = Brand?
Title = name
Author / owner / contributor = seller?
Date =
URL = offerURL (but not on OER page itself)
Licence information [Use CC code] price=0

Suggested Metadata
Language =
Subject = category
Keywords = category?
Additonal tags = category?
Comments = a review
Description = description

I put this into a quick example, and you can see what Google makes of it using the rich snippet testing tool. [I’m not sure I’ve got the nesting of a Person as the seller right.]

So, interesting? I’m not sure that this example shows much that is interesting. Trying to shoe-horn information about an OER into a schema that was basically designed for adverts isn’t ideal, but they already done recipes as well, once they’ve got the important stuff like that done they might have a go at educational resources. But it is kind-of interesting that Google are using RDFa; there seems to be a slow increase in the number of tools/sites that are parsing and using RDFa.