Where does schema.org fit in the (semantic) web?

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that the big search engines have agreed to support. What I haven’t said much about is why I think it is important, with the corollary question of “what is it for?”.

The schema.org FAQ answers that second question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind, because when you get into the nitty gritty of what schema.org does in the context of the web, i.e. identifying things and describing their characteristics and relationships to other things, you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org.

To help understand where schema.org fits into the semantic web more generally, it is useful to think about what various semantic web initiatives cover that schema.org doesn’t. Starting with what is closest to schema.org, this includes:

  • resource description for purposes other than discovery;
  • descriptions not on web pages;
  • data feeds for machine-to-machine communication;
  • interoperability for raw data in different formats (e.g. semantic bioinformatics);
  • ontologies in general, beyond the set of terms agreed by the schema.org partners, and their representation.

RDFa brings some of this semantic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”
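
To make this concrete, here is a minimal sketch (the name and affiliation are invented for illustration) of the same simple description expressed first as schema.org microdata and then as RDFa Lite; the mapping is essentially mechanical: itemscope/itemtype become vocab/typeof and itemprop becomes property.

<!-- schema.org expressed as microdata -->
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span> teaches at
  <span itemprop="affiliation">Example University</span>.
</div>

<!-- the same description expressed as RDFa Lite -->
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span> teaches at
  <span property="affiliation">Example University</span>.
</div>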

Secondly, there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach; if you’re already committed to RDFa then hopefully you can use it in a way that Google and the other search engines will support (if that is important to you). However, schema.org was the search engine providers’ first choice when it came to resource discovery, at least first in the chronological sense. Whether it will remain their first preference is a moot point, but in the same blog post mentioned above they make a commitment to it that (to me at least) reads as stronger than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is also interesting to note that schema.org is the search engine companies’ own creation. It’s not that there is a shortage of other options for embedding metadata into web pages: HTML has always had meta tags for description, keywords, author and title; yet not only are these largely unsupported, the keywords tag in particular can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar–but note that Google Scholar differs in its use of metadata from Google’s main search index.)
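
For reference, that traditional approach looks something like the sketch below (values invented for illustration): easy enough to publish, but the big search engines now take little or no notice of it.

<head>
  <title>An example page</title>
  <!-- classic HTML metadata -->
  <meta name="description" content="A short summary of what this page is about.">
  <meta name="keywords" content="example, metadata, keywords">
  <meta name="author" content="Jane Doe">
  <!-- Dublin Core expressed in meta tags, as many repositories do -->
  <meta name="DC.title" content="An example page">
  <meta name="DC.creator" content="Jane Doe">
</head>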

So why create schema.org? The Google schema.org FAQ says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about the “Microdata and more” approach)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to and comment on the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft Bing, Yahoo and Yandex pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, that icing on the old semantic web layer cake (perhaps the cake is upside down, and the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers, and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general, Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed, if the marked-up content is hidden Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML that can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is its use on the main search site, which we know is the first stop for people wanting to find information, that would bring about significant benefits in terms of the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price, etc., and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa, etc., an approach which predates and has evolved into the schema.org initiative.)

Secondly, there is the Knowledge Graph, which, while it is known to use Freebase and seems to get much of its data from DBpedia, has a “things not strings” approach that resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also

Will using schema.org metadata improve my Google rank?

It’s a fair question to ask. Schema.org metadata is backed by Google, and has the aim of making it easier for people to find the right web pages, so does using it to describe the content of a page improve the ranking of that page in Google search results? The honest answer is “I don’t know”. The exact details of the algorithm used by Google for search result ranking are their secret; some people claim to have elucidated factors beyond the advice given by Google, but I’m not one of them. Besides, the algorithm appears to be ever changing, so what worked last week might not work next week. What I do know is that Google says:

Google doesn’t use markup for ranking purposes at this time—but rich snippets(*) can make your web pages appear more prominently in search results, so you may see an increase in traffic.

*Rich Snippets is Google’s name for the semantic markup that it uses, be it microformats, microdata (schema.org) or RDFa.

I see no reason to disbelieve Google on this, so the answer to the question above would seem to be “no”. But how then does using schema.org make it easier for people to find the right web pages? (Let’s assume for now that yours are the right pages.) Well, that’s what the second part of what Google says is about: making pages appear more prominently in search result pages. As far as I can see this can happen in two ways. Try doing a search on Google for potato salad. Chances are you’ll see something a bit like this:

Selection from the results page for a Google search for potato salad showing enhanced search options (check boxes for specific ingredients, cooking times, calorific value) and highlighting these values in some of the result snippets.

You see how some of the results are embellished with things like star ratings, or information like cooking time and number of calories–that’s the use of rich snippets to make a page appear more prominent.

But there’s more: the check boxes on the side allow the search results to be refined by facets such as ingredients, cooking time and calorie content. If a searcher uses those check boxes to narrow down their search, then only pages which have the relevant information marked-up using schema.org microdata (or other rich snippet mark-up) will appear in the search results.
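
For a rough idea of what drives this, the sketch below shows the sort of schema.org Recipe markup (in microdata) such a page might carry; the recipe content and values are invented, but cookTime, nutrition/calories and ingredients are the kinds of properties the facets draw on.

<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Potato salad</h1>
  By <span itemprop="author">Jane Doe</span>.
  <!-- cooking time given as an ISO 8601 duration in the datetime attribute -->
  Cooking time: <time itemprop="cookTime" datetime="PT30M">30 minutes</time>.
  <span itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    <span itemprop="calories">250 calories</span> per serving.
  </span>
  Ingredients: <span itemprop="ingredients">potatoes</span>,
  <span itemprop="ingredients">mayonnaise</span>.
</div>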

So, while it’s a fair question to ask, the question posed here is the wrong question. It would be better to ask “will schema.org metadata help people find my pages using Google?”, to which the answer is yes, if Google decides to use that markup to enhance search result pages and/or provide additional search options.

Background
I have been involved in the LRMI (Learning Resource Metadata Initiative), which has proposed extensions to schema.org for describing the educational characteristics of resources–see this post I did for Creative Commons UK for further details. I have promised a more technical briefing on the hows and whys of LRMI/schema.org to be developed here, but given my speed of writing I wouldn’t hold my breath waiting for it. In the meantime, this is one of several questions I thought might be worth answering; if you can think of any others, let me know.

A lesson in tagging for UKOER

We’ve been encouraging projects in the HE Academy / JISC OER programmes to use platforms that help get the resources out onto the open web and into the places where people look, rather than expecting people to come to them. YouTube is one such place. However, we also wanted to be able to find all the outputs from the various projects wherever they had been put, without relying on a central registry, so one of the technical recommendations for the programme was that resources be tagged UKOER.

So, I went to YouTube and searched for UKOER, and this was the top hit. Well, it’s a lesson in tagging, I suppose. I don’t think it invalidates the approach: we never expected 100% fidelity, and this seems to be a one-off among the first 100 or so of the 500+ results. And it’s great to see results from Chemistry.FM and CoreMaterials topping 10,000 views.

Text and Data Mining workshop, London 21 Oct 2011

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and was well articulated by John McNaught of NaCTeM. Briefly, the potential lies in the extraction of new knowledge from old, through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of the volume of information involved. Full text access is crucial: John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, from for example Jeff Lynn of the Coalition for a Digital Economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if they are to remain competitive.

While these speakers touched on the legal barriers, it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before), when publishers acquire copyright, or a licence to publish with enough restrictions to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-BY, this requires attribution. Naomi spoke of attribution stacking, a problem John had also mentioned: when a result is found by mining thousands of papers, do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: text mining is not covered by fair dealing (though it is covered by fair use in the US, and by similar exceptions in Norwegian and Japanese law, but nowhere else); the exception for transient copies (such as those in a computer’s memory when reading online) only applies if that copy has no intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation, not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealing to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review, people who understand text mining and care about the legal issues make sure that any legislation is sufficient to allow innovation; otherwise innovators will have to move to jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA for organising this event; there’s obviously plenty more for them to do.

Testing CaPRéT

I’ve been testing the alpha release of CaPRéT, a tool that aids attribution and tracking of openly licensed content from web sites. According to the CaPRéT website:

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports it, the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

I tested CaPRéT on a single page (my institutional home page) and on this blog. To enable CaPRéT for material on a website you need to include links to four JavaScript files in your webpages. I went with the files hosted on the CaPRéT site, so all I had to do was put this into my home page’s <head>. (The testing on my home page is easier to describe, since the options for WordPress will depend on the theme you have installed.)


<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/jquery.plugin.clipboard.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/oer_license_parser.js" type="text/javascript"></script>
<script src="http://capret.mitoeit.org/js/capret.js" type="text/javascript"></script>

Then you need to put the relevant information, properly marked up, into the webpage. Currently CaPRéT cites the title, source URL, author, and licence URI of the page from which the text was copied. The easiest way to get this information into your page is to use a platform which generates it automatically, e.g. WordPress or Drupal with the OpenAttribute plug-in installed. The next easiest way is to fill out the form at the Creative Commons License generator; be sure to supply the additional information if you use that form.

If you’re doing it manually, this is what does the work:

Title is picked up from any text marked up as
<span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">…</span> or, if that’s not found, from the page <title> in the <head>

Source URL comes from the page URL

Author name is picked up from the contents of <a xmlns:cc="http://creativecommons.org/ns#" href="http://jisc.cetis.org.uk/contact/philb" property="cc:attributionName" rel="cc:attributionURL">…</a> (actually, the author attribution URL in the href attribute isn’t currently used, so this could just as well be a span)

Licence URI is picked up from the href attribute of <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
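
Putting those pieces together, a minimal attribution block along the lines of the Creative Commons licence generator output might look like this (the title, author name and author URL here are placeholders for illustration):

<div xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/">
  <!-- title: dct:title, picked up by CaPRéT as described above -->
  <span href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">An example page</span>
  by
  <!-- author: cc:attributionName (the attributionURL in href isn't currently used) -->
  <a href="http://example.org/author" property="cc:attributionName" rel="cc:attributionURL">Jane Doe</a>
  is licensed under a
  <!-- licence URI: taken from the href of the rel="license" link -->
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 Unported License</a>.
</div>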

You might want to suggest other things that could be in the attribution/citation.

Reflections
As far as attribution goes it seems to work. Copy something from my home page or this blog and paste it elsewhere and the attribution information should magically appear. What’s also there is an embedded tracking gif, but I haven’t tested whether that is working.

What I like about this approach is that it converts self-description into embedded metadata. Self-description is the practice of including within a resource the information that is important for describing it: the title, author, date, etc. Putting this information into the resource isn’t rocket science, it’s just good practice. To convert this information into metadata it needs to be encoded in such a way that a machine can read it, and that’s where the RDFa comes in. What I like about RDFa (and microformats and microdata) as a way of publishing metadata is that the actual descriptions are the very same ones that it’s just good practice to include in the resource anyway. Having them on view in the resource is likely to help with quality assurance, and, while the markup is fiddly (and best dealt with by the content management system in use, not created by hand), creating the metadata should be no extra effort over what you should do anyway.

CaPRéT is being developed by MIT OEIT and Tatemae (OERGlue) as part of the JISC CETIS mini projects initiative; it builds on the browser plug-in developed independently by the OpenAttribute team.

RDFa Rich snippets for educational content

Prompted by a comment from Andy Powell that

It would be interesting to think about how much of required resource description for UKOER can be carried in the RDFa vocabularies currently understood by Google. Probably quite a lot.

I had a look at Google’s webmaster advice on Marking up products for rich snippets.

My straw man mapping from the UKOER description requirements to Rich Snippets was:

Mandated Metadata
Programme tag = Brand?
Project tag = Brand?
Title = name
Author / owner / contributor = seller?
Date =
URL = offerURL (but not on OER page itself)
Licence information [Use CC code] price=0

Suggested Metadata
Language =
Subject = category
Keywords = category?
Additional tags = category?
Comments = a review
Description = description

I put this into a quick example, and you can see what Google makes of it using the rich snippet testing tool. [I’m not sure I’ve got the nesting of a Person as the seller right.]
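
For what it’s worth, the markup I was experimenting with looked roughly like the sketch below. It follows the mapping above using the data-vocabulary.org product vocabulary from Google’s rich snippets documentation as I recall it, so treat the exact property names, and especially the nesting of the Person as seller, as assumptions rather than gospel; the resource details are invented.

<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Product">
  <span property="v:name">Introduction to Quantum Mechanics (OER)</span>
  <!-- programme and project tags shoe-horned into brand, as in the mapping above -->
  <span property="v:brand">ukoer</span>
  <span property="v:category">Physics</span>
  <span property="v:description">Lecture slides and notes released as an open educational resource.</span>
  <span rel="v:offerDetails" typeof="v:Offer">
    <!-- licence information approximated as a zero price -->
    <span property="v:price">0</span>
    <!-- author / owner / contributor mapped to seller; this nesting is the bit I'm unsure of -->
    <span rel="v:seller" typeof="v:Person">
      <span property="v:name">A. N. Author</span>
    </span>
  </span>
</div>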

So, interesting? I’m not sure that this example shows much that is interesting. Trying to shoe-horn information about an OER into a schema that was basically designed for adverts isn’t ideal, but they have already done recipes as well; once they’ve got the important stuff like that done they might have a go at educational resources. But it is kind-of interesting that Google are using RDFa; there seems to be a slow increase in the number of tools/sites that are parsing and using RDFa.

Self description and licences

One of the things that I noticed when I was looking for sources of UKOERs was that when I got to a resource there was often no indication on it that it was open: no UKOER tag, no CC-license information or logo. There may have been some indication of this somewhere on the way, e.g. on a repository’s information page about that resource, but that’s no good if someone arrives from a Google search, a direct link to the resource, or once someone has downloaded the file and put it on their own VLE.

Naomi Korn has written a very useful briefing paper on embedding metadata about Creative Commons licences into digital resources as part of the OER IPR Support project starter pack. All the advice in that is worth following, but please also make sure that licence and attribution information is visible on the resource as well. John has written about this in general terms in his excellent post on OERs, metadata, and self-description, where he points out that this type of self-description “is just good practice”, which is complemented, not supplanted, by technical metadata.

So, OER resources, when viewed on their own, as if someone had found them through Google or a direct link, should display enough information about authorship, provenance, etc. for the viewer to know that they are open without needing an application to extract the metadata. The cut and paste legal text and technical code generated by the licence selection form on the Creative Commons website is good for this. (Incidentally, for HTML resources this code also includes technical markup so that the displayed text works as encoded metadata, which has been exploited recently by the OpenAttribute browser addon. I know the OpenAttribute team are working on embedding tools for licence selection and code generation into web content management systems and blogs).

Images, videos and sounds present their own specific problems for including human-readable licence text. Following practice from the publishing industry suggests that small amounts of text discreetly tucked away at the bottom or side of an image can be enough to help. That example was generated by the Xpert attribution tool from an image of a bridge found on Flickr. The Xpert tool also does useful work for sounds and videos; but for sounds it is also possible to follow the example of the BBC podcasts and provide spoken information at the beginning or end of the audio, and for videos of course one can have scrolling credits at the end.

Metadata requirements from analysis of search logs

Many sites hosting collections of educational materials keep logs of the search terms used by visitors to the site who search for resources. Since it came up during the CETIS What Metadata (CETISWMD) event I have been thinking about what we could learn about metadata requirements from the analysis of these search logs. I’ve been helped by having some real search logs from Xpert to poke at with some Perl scripts (thanks Pat).

Essentially the idea is to classify the search terms used with reference to the characteristics of a resource that may be described in metadata. For example, terms such as “biology”, “English civil war” and “quantum mechanics” can readily be identified as relating to the subject of a resource; “beginners”, “101” and “college-level” relate to educational level; “power point”, “online tutorial” and “lecture” relate in some way to the type of the resource. We believe that knowing such information would assist a collection manager in building their collection (by showing what resources were in demand) and in describing their resources in such a way that helps users find them. It would also be useful to those who build standards for the description of learning resources to know which characteristics of a resource are worth describing in order to facilitate resource discovery. (I had an early run at doing this when OCWSearch published a list of top searches.)

Looking at the Xpert data has helped me identify some complications that will need to be dealt with. Some of the examples above show how a search phrase with more than one word can relate to a single concept, but in other cases, e.g. “biology 101” and “quantum mechanics for beginners”, the search term relates to more than one characteristic of the resource. Some search terms may be ambiguous: “French” may relate to the subject of the resource or the language (or both); “Charles Darwin” may relate to the subject or the author of a resource. Some terms are initially opaque but on investigation turn out to be quite rich: for example, 15.822 is the course code for an MIT OCW course, and so implies a publisher/source, a subject and an educational level. Also, in real data I see the same search term being used repeatedly in a short period of time, I guess as an artifact of how someone paging through results is logged as a series of searches: should these be counted as a single search or as multiple searches?

I think these are all tractable problems, though different people may want to deal with them in different ways. So I can imagine an application that would help someone do this analysis. In my mind it would import a search log and allow the user to go through it, search by search, classifying the terms with respect to the characteristic of the resource to which each relates. Tedious work, perhaps, but it wouldn’t take too long to classify enough search terms to get an adequate statistical snapshot (you might want to randomise the order in which the terms are classified in order to help ensure the snapshot isn’t looking at a particularly unrepresentative period of the logs). The interface should help speed things up by allowing the user to classify most searches with a single key press. There could also be some computational support: the system would learn how to handle certain terms, and this learning would be shared between users. A user should not have to tell the system that “biology” is a subject once they or any other user has done so. It may also be useful to distinguish between broad top-level subjects (like biology) and more specific terms like “mitosis”, or alternatively to know that specific terms like “mitosis” relate to the broader term “biology”: in other words, the option to link to a thesaurus might be useful.

This still seems achievable and useful to me.