Examples of good licence embedding

I was asked last week to provide some good examples of embedded licences in OERs. I’m pleased to do that (with the proviso that this is just my personal opinion of “good”) since it makes a change from carping about how some of the outputs of the UKOER programme neglect seemingly obvious points about self-description. For example, anyone who gets hold of a copy of a resource would want to see that it is an OER, so it seems obvious that the Creative Commons licence should be clearly displayed on the resource; they would also want to see something about who created, owned or published it, partly to comply with the attribution condition of Creative Commons licences but also to conform with good academic and information literacy practice around provenance and citation. With few exceptions, the machine readable metadata hidden in OER files (such as MS Office file properties, id3 tags, EXIF etc.) are an irremediable mess, especially for licence and attribution information, which cannot on the whole be created automatically, and so are generally ignored. Likewise, the metadata stored in a content management system such as a repository and displayed on the landing page for the resource are not relevant when the resource is copied and used in some other system. So what I’m looking at here is human readable information about licence and attribution that travels with the resource when it is copied. Different approaches are required for different resource types, so I’ll take them in turn.

Text, e.g. office documents, MS Word, Powerpoint, PDF
Pretty simple really: you can have a title section with the name of the resource creator and a footer with the copyright and licensing information. You can also have a more extensive “credits” page at the end of the document. Running page headers and footers work well if you think that people might take just a few pages rather than the whole document.
Example text OER with attribution and licence information. Note that the licence statement and logo link to the legal deed on the Creative Commons website.
Example OER powerpoint with licence and attribution information. Note how the final slide gives licence and attribution information of third party resources used.

Web pages
Basically a special case of a text document: the attribution and licence information can be included in a title or footer section; scroll down to the bottom of this page to see an example. For HTML there is a good case for making this information machine readable by wrapping it in microdata or RDFa tags. Plugins exist for many web content management systems to do this, and the Creative Commons licensing generator will produce an HTML snippet that includes such tags.
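For illustration, here is a simplified version of the sort of snippet the Creative Commons licensing generator produces (the work title, attribution name and choice of licence are just placeholders); the rel="license" attribute and the RDFa properties make the statement machine readable as well as human readable:

    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
      <img alt="Creative Commons Licence" style="border-width:0"
           src="http://i.creativecommons.org/l/by/3.0/88x31.png" />
    </a>
    <br />
    <span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Example OER</span>
    by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Example University</span>
    is licensed under a
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 Unported Licence</a>.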

Images
Example of photo with attribution and licence information
Really the only option for putting the essentially textual information about licence and attribution into an image is to add it as a bar to the image. The Attribute Images and related projects at Nottingham have been doing good work on automating this.

Audio
A spoken introduction can provide the information required. BBC podcasts give good examples, though they are not OERs; also the introduction to the video below works as audio.

Video
An introductory screen or credits at the end (with optional voice over) can provide the required information. See for example this video from MIT OCW (be sure to skip to the end to see credits to third party resources used).

Podcasts (and other RSS feeds)

As well as having <copyright> and <creativeCommons:license> tags in the RSS feed at channel and item level, Oxford University’s OER podcasts use an image for the channel that includes the Creative Commons logo. This is useful because the image is displayed by many feed readers and podcast applications. Of course the recordings themselves should have licence information in them, just as with any other audio or video OER.
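For anyone who has not met those tags, here is a minimal sketch of what they look like in an RSS 2.0 feed using the Creative Commons module (the feed details are invented for illustration):

    <rss version="2.0"
         xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule">
      <channel>
        <title>Example OER Podcast</title>
        <link>http://example.edu/podcasts/</link>
        <description>Openly licensed lecture recordings</description>
        <copyright>Copyright Example University</copyright>
        <creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</creativeCommons:license>
        <item>
          <title>Lecture 1: Introduction</title>
          <enclosure url="http://example.edu/podcasts/lecture1.mp3" length="12345678" type="audio/mpeg" />
          <creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</creativeCommons:license>
        </item>
      </channel>
    </rss>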

The Challenge of ebooks

Yesterday I was in London, along with a group of people with a wide range of experience in digital resource management, OERs and publishing, for a workshop which was part of the Challenge of eBooks project. Here’s a quick summary and some reflections.

To kick off, Ken Chad defined eBooks for the purpose of the workshop, and I guess for the report to be delivered by the project, as anything delivered digitally that is longer than a journal article. I’ll come back to what I think are the problems with that later, but we didn’t waste time discussing it. It did mean that the discussion included things like scanned copies of texts, such as those that can be made under the CLA licence, and the difficulties around managing and distributing those.

For the earliest printed books, or incunabula, such as the Gutenberg Bible, printers sought to mimic the hand-written manuscripts with which 15th-century scholars were familiar; in much the same way, publishers now seek to replicate printed books as ebooks.

The main part of the workshop was organised around a “jobs to be done” framework. The idea of this is to focus on what people are trying to do: “people don’t want a 5mm drill bit, they want a 5mm hole”. I found that useful in distinguishing ebooks in the domain of HE from the vast majority of those sold. In the latter case the job to be done is simply reading the book: the customer wants a copy of a book because they want to read that book, or a book by that author, or a book of that genre, but there isn’t necessarily any motive beyond wanting the experience of reading it. In HE the job to be done (ultimately) is for the student or researcher to learn something, though other players may have a job to do that leads to this, for example providing a student with resources that will help them learn something. I have views on how the computing power in the delivery platform can be used for more than just making the delivery of text more convenient: how it can be used to make the content interactive, or to deliver multimedia content, or to aid discussion, or just to connect different readers of the same text (I was pleased that someone mentioned the way a Kindle will show which passages have been bookmarked or commented on by other readers).

The issues raised in discussion included rights clearance, the (to some extent technical, but mostly legal) difficulties of creating course packs containing excerpts of selected texts, the diversity of platforms and formats, disability access, and relationships with publishers.

It was really interesting that accessibility featured so strongly. Someone suggested that this was because the mismatch between an ebook and the device on which it is displayed creates an impairment so frequently that accessibility issues are plain for all to see.

A lot of the issues seem to go back to publishers struggling with a new challenge, not knowing how they can meet it and keep their business model intact. It was great to have Suzanne Hardy of the PublishOER project there, with her experience of how publishers will respond to an opportunity (such as getting more information about their users through tracking) but need help in knowing what the opportunities are when all they can see is the threat of losing control of their content. Whether publishers can make the necessary changes to currently print-oriented business processes to realise these benefits was questioned. There are also challenges for libraries in HE, which are used to being able to buy one copy of a book for an institution, whereas publishers now want to be able to sell access to individuals–partly, I guess, so that they can make that link between a user and the content they provide, but also because one digital copy can go a lot further than a single physical copy.

Interestingly, the innovation in ebooks is coming not from conventional publishers but from players such as Amazon and Apple, and from publishers such as O’Reilly and Pearson. (Note that Pearson have a stake in education that includes an assessment business, online courses and colleges, and so go beyond being a conventional publisher.) Also, the drive behind these innovations comes from new technology making new business models possible, not from evolution of current business, nor, arguably, from user demand.

So, anyway, what is an ebook? I am not happy with a definition that includes web sites of additional content created to accompany a book, or pages of a physical book that have been scanned. That doesn’t represent the sort of technical innovation that is creating new and interesting opportunities and the challenges that come with them. Yes, there are important (long-standing) issues around digital content in general, some of which overlap with ebooks, but I will be disappointed if the report from this project is full of issues that could have been written about ten years ago. That’s not because I think those issues are dead but because I think ebooks are something different that deserves attention. I’ll suggest two approaches to defining what that something is:

1. an ebook is what ebook reading devices and apps read well. By and large that means content in mobi or ePub format. Ebook readers don’t handle scanned page images well. They don’t read most PDFs well (though this depends on the tool and the nature of the PDF; the aim of PDF was to maintain page layout, which is exactly what you don’t want on an ebook reader). Word processed files are borderline, but most word processed documents are page-oriented, which raises the same issue as with PDFs. In short, WYSIWYG and ebooks don’t match.

2. an ebook is aggregated content, packaged so that it can be moved from server to device, with more-or-less linear navigation. In the aggregation (which is often a zip file under another file extension) are assets (the text, images and other content that are viewed) plus metadata that describes the book as a whole (and maybe the assets individually) and information about how the assets should be navigated (structural metadata describing the organisation of the book). That’s essentially what mobi and ePub are. It’s also what IMS Content Packaging and its offspring, SCORM and Common Cartridge, are; and for that matter it’s what the MS Office and Open Office formats are.
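To make that concrete, here is a rough sketch of what you find inside an ePub (version 2) once you unzip it; the file names are illustrative, but the pattern of assets plus descriptive metadata plus structural metadata is the point:

    mimetype                   (the string "application/epub+zip")
    META-INF/container.xml     (points to the package file below)
    OEBPS/content.opf          (descriptive metadata, manifest of assets, spine)
    OEBPS/toc.ncx              (navigation / table of contents)
    OEBPS/chapter1.xhtml       (the content itself)
    OEBPS/images/figure1.png

and a cut-down package file looks something like this:

    <package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
      <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
        <dc:title>Example ebook</dc:title>
        <dc:creator>A. N. Author</dc:creator>
        <dc:language>en</dc:language>
      </metadata>
      <manifest>
        <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml" />
        <item id="fig1" href="images/figure1.png" media-type="image/png" />
        <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml" />
      </manifest>
      <spine toc="ncx">
        <itemref idref="ch1" />
      </spine>
    </package>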

I had a short discussion with Zak Mensah of JISC Digital Media about whether the content should be mostly text based. I would like to see as much non-text material as is useful, but clearly there is a limit. It would be perverse to take a set of videos, sequence them one after another with a screen of text between each one, like the caption frames in a silent movie, and then call it a book. However, there is something more than text that would still make sense as a book: imagine replacing all the illustrations in a well-illustrated text book with models, animations, videos… for example, a chemistry book with interactive models of chemical structures and graphs that change when you alter the parameters, or a Shakespeare text with videos of performances in parallel with the text. That still makes sense as a book.

[image of page from Gutenberg Bible taken from wikipedia]

A short update on resource tracking

In our reflections on technical aspects of phase 2 of the UKOER programme, we said that we didn’t understand why projects weren’t worrying more about tracking the use and reuse of the OERs they released. The reason for this was that if you don’t know how much your resources are used you will not be in a good position to sustain your project after JISC have stopped funding it. For example, how can you justify the effort and cost of clearing resources for release under a Creative Commons licence unless you can show that people want their own copies of the resources you release, rather than just viewing the copy you have on your own server? Here is a quick update on projects related to resource tracking.

TrackOER
Under the OER Rapid Innovation programme JISC have funded the TrackOER project. It was known from the outset that the project would start slowly, but in the last couple of weeks it has got some momentum going. The nub of the problem they are looking at is that

when an OER is taken from its host or origin server, in order to be used and reused the origin institution and the community generally lose track of it.

Building on work by Scott Leslie, their prospective solution is the use of a web bug/beacon: an image, normally invisible (though TrackOER may use the Creative Commons licence badge), embedded in the resource but hosted by whoever is collecting the stats (let’s say the OER publisher). So long as the image is not removed, whenever the resource is loaded a request will be sent to the publisher’s server for that image, and that request can be logged. Additional information can be acquired by appending ?key1=value1&key2=value2… onto the src URL of the img element in the resource; anything after the ? is logged in the server logs but does not affect the image that is served. For example, you could encode an identifier for the OER like this:

<img src="http://example.com/tracker.png?oerID=1234">

TrackOER are investigating the use of Google Analytics and the open source alternative Piwik (both with and without JavaScript, maybe) for the actual tracking. One of their challenges is that both normally assume that the person doing the tracking knows where the resource is, i.e. it will be where they put it, whereas with OERs one of the things most worth knowing is whether anyone has made a copy of your resource somewhere else. However, if you use JavaScript you have access to this information and can write it to the tracking image URL. Another challenge that comes with using Creative Commons licence images instead of an invisible tracking bug is that you use several images for tracking, not just one. TrackOER have modified Piwik to allow for the use of multiple alternative images.
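As a sketch of that JavaScript approach (this is not TrackOER’s actual script, and the tracker URL and parameter names are invented for illustration), the idea is to fire an extra logged request that records where this copy of the resource actually lives:

    <img src="http://example.com/tracker.png?oerID=1234" alt="" />
    <script type="text/javascript">
      // Without JavaScript, the plain image request above is still logged.
      // With JavaScript, log a second, more informative request that records
      // the address of the page this copy of the resource is embedded in.
      var trackerBase = 'http://example.com/tracker.png?oerID=1234';
      new Image().src = trackerBase + '&location=' + encodeURIComponent(window.location.href);
    </script>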

As an aside, TrackOER have also found a service called Stipple, they say:

using Stipple to track OER across the web in the same way as the TrackOER script is perfectly feasible. It might even be easy. You could get richer analytics as well as access to promotional tools.

OER tracking at Creative Commons
Creative Commons have posted three ideas for tracking OERs: two which use a mechanism they call refback, and one which provides an API to data they acquire as a result of people linking to their licences and using images of licence badges served from their hosts. In all cases it is a priority to avoid anything that smacks of DRM or excessive and covert surveillance, which is understandable given that Creative Commons as an organisation is a third party between resource user and owner and cannot do anything that would risk losing the trust of either.

Refback tracking involves putting a link in the resource being tracked to the site doing the tracking (the two variants are that this may be either the publisher or Creative Commons, i.e. independent and distributed, or hosted and centralised). If a curious user follows that link (and the assumption is that occasionally someone will) the tracking site will log the request for the page to which the link goes; included in the log information is the “referrer”, i.e. the URL of the page on which the user clicked the link. An application on the tracking site will work through this referrer log and fetch the pages for any URL it does not recognise to ascertain (e.g. from the attribution metadata) whether they are copies of a resource that it is tracking.
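In its simplest form, then, refback tracking relies on nothing more than an ordinary attribution link in the copied resource, something like the invented example below:

    <p>
      Adapted from <a href="http://example.edu/oer/widgets-101">Widgets 101</a>
      by Example University, licensed under
      <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>.
    </p>

When a reader follows the first link, the example.edu server log records the referring page, i.e. the address of the copy, which the tracking application can then go and inspect; it only works while that link stays in place and only when someone actually clicks it.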

The third approach involves Creative Commons logging the referrers for requests for copies of their licence badges, and then looking at the attribution metadata on the web pages in which the badges were embedded to build up a graph of pages that represent re-use of one another. This information would be hosted on Creative Commons servers and be available to others via an API.

Where does schema.org fit in the (semantic) web?

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that the big search engines have agreed to support. I haven’t said much about why I think it is important, with the corollary question of “what is it for?”.

The schema.org FAQ answers that second question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind, because when you get into the nitty gritty of what schema.org does, i.e. identifying things and describing their characteristics and relationships to other things in the context of the web, you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org. To help understand where schema.org fits into the semantic web more generally, it is useful to think about what various semantic web initiatives cover that schema.org doesn’t. Starting with what is closest to schema.org, this includes: resource description for purposes other than discovery; descriptions not on web pages; data feeds for machine to machine communication; interoperability for raw data in different formats (e.g. semantic bioinformatics); and ontologies in general, beyond the set of terms agreed by the schema.org partners, and their representation. RDFa brings some of this semantic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”
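To see what that mapping looks like in practice, here is the same minimal description of a creative work marked up first as schema.org microdata and then in RDFa Lite (the values are placeholders):

    <!-- schema.org expressed as microdata -->
    <div itemscope itemtype="http://schema.org/CreativeWork">
      <span itemprop="name">Example briefing paper</span> by
      <span itemprop="author">Example University</span>
    </div>

    <!-- the same description expressed as RDFa Lite -->
    <div vocab="http://schema.org/" typeof="CreativeWork">
      <span property="name">Example briefing paper</span> by
      <span property="author">Example University</span>
    </div>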

Secondly there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach; if you’re already committed to RDFa then hopefully you can use it in a way that Google and other search engines will support (if that is important to you). However, schema.org was the search engine providers’ first choice when it came to resource discovery, at least first in the chronological sense. Whether it will remain their first preference is moot, but in that same blog post mentioned above they make a commitment to it that (to me at least) reads as stronger than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is interesting also to note that schema.org is the search engine companies’ own creation. It’s not that there is a shortage of other options for embedding metadata into web pages: HTML has always had meta tags for description, keywords, author and title; yet not only are these not much supported, the keywords tag especially can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar–but note that Google Scholar differs in its use of metadata from Google’s main search index).
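For comparison, that older style of embedded metadata looks something like the snippet below; Dublin Core in HTML uses the same meta tag mechanism with DC-prefixed names. None of it is visible to the reader, which is part of why it proved so easy to abuse:

    <meta name="description" content="A short summary of the page" />
    <meta name="keywords" content="comma, separated, keywords" />
    <meta name="author" content="A. N. Author" />
    <!-- Dublin Core equivalents -->
    <meta name="DC.title" content="Title of the page" />
    <meta name="DC.creator" content="A. N. Author" />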

So why create schema.org? The Google schema.org FAQ says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about the “Microdata and more” approach.)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to, and comment on, the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft Bing, Yahoo and Yandex, pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, that icing on the old semantic web layer cake (perhaps the cake is upside down, and the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers, and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general, Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed, if the marked-up content is hidden Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML that can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is use on the main search site, which we know is the first stop for people wanting to find information, that would bring about significant benefits in terms of the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price etc., and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa etc., which predates and has evolved into the schema.org initiative).

Secondly, there is the Knowledge Graph which, while it is known to use Freebase and seems to get much of its data from DBpedia, has a “things not strings” approach that resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also

The Human Computer: a diversion from normal CETIS work

Alan Turing, 1951. Source: wikipedia

No, there’s no ‘Interaction’ missing in that title: this is about building a computer, or at least a small part of one, out of humans. The occasion was a birthday party that the department I work in, Computer Science at Heriot-Watt University, held to commemorate the centenary of Alan Turing’s birth. It was also the finale of a programming competition that the department had set for sixth-formers: to create a simulation of a Turing Machine. So we had some of the most promising computer science pupils in the country attending.

As well as the balloons, cake and crisps, we had some party games, well, activities for our guests. They could have a go at the Turing test or at breaking Enigma codes, and my contribution was for them to be a small part of a computer: a 2-bit adder. The aim was to show how the innards of a computer processor are little more than a whole load of switches, and that it doesn’t matter much (at least to a mathematician like Turing) what those switches are. I hoped this would help show that computers are more than black boxes, and help add some context to what electronic computers were about when Turing was working. (And, yes, I do know that it was Shannon, not Turing, who developed the theory.)

So, it starts with a switch that can turn another switch on and off. Here’s a simulation of one which uses a transistor to do that. If you click on that link a Java window should open showing a simple circuit. The input on the left is at a low voltage, and the output is at a low voltage. Click on the input to set it to High, and it will turn on the transistor, connecting the output to the high voltage source, so the output goes High. So by setting the input to a high voltage (presumably by pressing a switch) you can set the output to a high voltage. You’re allowed to be under-impressed at this stage. (Make sure you close any windows or browser tabs opened by that link; leaving them open might cause later examples not to work.)

Turing didn’t have access to transistors. At the time he was working these switches were electromechanical relays: physical spring-loaded switches closed by the magnetic attraction between a coil and a permanent magnet when a current ran through the coil. Later, vacuum tube valves were available to replace these but, much to Tommy Flowers’ chagrin, Turing wasn’t at all interested in that. For mathematicians the details of the switching mechanism are a distraction. By not caring, maybe not even knowing, about the physics of the switch, Turing was saved from worrying about a whole load of details that would have been out of date by the 1960s; as it is, his work is still relevant today. This illustrates my favourite feature of mathematics, which is that maths is the only subject where it is best not to know what you are talking about.

Back to this thing of turning one voltage signal high or low by turning another voltage high or low.

Two transistor AND gate

That may be underwhelming, but put two of these next to each other and something interesting happens: the output will only be High if both the inputs are. In other words, the output is High if both input 1 AND input 2 are high. That’s mathematics: a simple logic calculation. You can try it out in the simulation. You can also try other arrangements that show an OR logic calculation and an XOR calculation, that is an exclusive OR: the output is high if one of input 1 or input 2 is high, but not both. We call these circuits logic gates. Remember to close all windows and browser tabs when going from one simulation to another.

This is where we leave electronics and start using the audience. My colleague and I each had a flag, and we gave everyone in the audience a flag. We were the inputs; they had to be logic gates: they had to raise their flag if she AND I both raised ours, or if she OR I had a flag up, or if she or I, but not both of us, raised a flag (the XOR calculation).

The next trick was to show how these logic calculations relate to adding numbers together: A+B = S. First, of course, the numbers must be represented in binary, with a low voltage/flag down equivalent to the digit 0 and a high voltage/flag up equivalent to the digit 1. And we have to do the addition one digit at a time, starting from the units. Adding the first digit, the units, is easy enough: 0+0 = 0, 0+1 = 1, 1+0 = 1, 1+1 = 0 with 1 to carry. Think of that as input 1 + input 2 = output, where the output can be either the digit for the sum or the digit to carry. For the sum, the output is 1 if either input 1 or input 2 is high, but not both, so S = input 1 XOR input 2; and we carry 1 if input 1 AND input 2 are 1. The second and subsequent digits are harder, since we need to add the digit from each number and the carry, but it’s not too difficult.
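Putting that together for two-bit numbers A (digits A2, A1) and B (digits B2, B1), the whole calculation comes down to four logic expressions; this is the standard textbook form, not necessarily the exact gate numbering we used on the day:

    S1 = A1 XOR B1                                (units digit of the sum)
    C1 = A1 AND B1                                (carry from the units)
    S2 = A2 XOR B2 XOR C1                         (twos digit of the sum)
    S3 = (A2 AND B2) OR (C1 AND (A2 XOR B2))      (final carry, the fours digit)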

We can use logic gates to do the calculation for each bit of the addition. The circuit looks like this:
[image: 2-bit adder circuit]
You can hopefully see how bit one of the sum is the XOR of the inputs for bit one of the numbers A and B, and the carry into the calculation of the second bit is the AND of those inputs. Again there is a simulation you can try; you might need to stretch the Java window to see all of the circuit. Try 1 plus 1 (01+01 = 10, so set inputs A1 and B1 High and A2 and B2 Low to give output S1 Low and output S2 High). And 2 + 2 (10 + 10).

We implemented this circuit using our audience of flag-wavers. We put pupils on the front row to be the inputs, pupils on the next row to be gates 1-4, and so on, making sure that each one knew at whom they should be looking and what condition should be met for them to raise their flag. We ran this three times, and each time it worked brilliantly. OK, so we could only add numbers less than 3, which isn’t much computing power, but given another 35 people we could have done eight-bit addition. And I’m pretty sure that we could have managed flip-flops and registers, but we would need something like 10,000 pupils to build a processor equivalent to an 8086, so the logistics might be difficult.

Using Turn-it-in to track re-use of OERs…

…isn’t really worth the bother–a simple web search seems to work better.

I’ve wondered, somewhat idly, whether Turn-it-in (t-i-n) might be a useful way to track whether an OER has been re-used on the more-or-less open web. T-i-n is plagiarism detection software: it is designed to detect plagiarism in student work by looking for resources with the same content. Simple idea: just put your original into t-i-n and see whether any resources out there have been created using it. So I used a briefing I wrote on the LOM in 2005, which we later submitted as a wikipedia article on Learning Object Metadata. Selecting a chunk of text from the wikipedia article, putting it in quotes and searching for it finds quite a few verbatim copies. But when I asked a colleague who has access to t-i-n to look for copies of the briefing, t-i-n found only four with high match values: which, to be fair, is enough to meet the t-i-n use case of showing that a significant amount of it had been copied (in this case to the web), but not much use for tracking the re-use of an OER.

Anyone tried anything similar?

Will using schema.org metadata improve my Google rank?

It’s a fair question to ask. Schema.org metadata is backed by Google, and has the aim of making it easier for people to find the right web pages, so does using it to describe the content of a page improve the ranking of that page in Google search results? The honest answer is “I don’t know”. The exact details of the algorithm used by Google for search result ranking are their secret; some people claim to have elucidated factors beyond the advice given by Google, but I’m not one of them. Besides, the algorithm appears to be ever changing, so what worked last week might not work next week. What I do know is that Google says:

Google doesn’t use markup for ranking purposes at this time—but rich snippets(*) can make your web pages appear more prominently in search results, so you may see an increase in traffic.

*Rich Snippets is Google’s name for the semantic mark up that it uses, be it microformats, microdata (schema.org) or RDFa.

I see no reason to disbelieve Google on this, so the answer to the question above would seem to be “no”. But how then does using schema.org make it easier for people to find the right web pages? (And let’s assume for now that yours are the right pages.) Well, that’s what the second part of what Google says is about: making pages appear more prominently in search result pages. As far as I can see this can happen in two ways. Try doing a search on Google for potato salad. Chances are you’ll see something a bit like this:

Selection from the results page for a Google search for potato salad showing enhanced search options (check boxes for specific ingredients, cooking times, calorific value) and highlighting these values in some of the result snippets.

You see how some of the results are embellished with things like star ratings, or information like cooking time and number of calories–that’s the use of rich snippets to make a page appear more prominent.

But there’s more: the check boxes on the side allow the search results to be refined by facets such as ingredients, cooking time and calorie content. If a searcher uses those check boxes to narrow down their search, then only pages which have the relevant information marked-up using schema.org microdata (or other rich snippet mark-up) will appear in the search results.
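For what it’s worth, the mark-up behind those recipe results looks something like the much simplified, invented snippet below; it is the machine readable cooking time, calories and ingredients that feed both the embellished snippets and the check boxes:

    <div itemscope itemtype="http://schema.org/Recipe">
      <h1 itemprop="name">Potato salad</h1>
      By <span itemprop="author">Example Cook</span>.
      Total time: <time itemprop="totalTime" datetime="PT30M">30 min</time>.
      <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
        <span itemprop="calories">250 calories</span> per serving.
      </div>
      <ul>
        <li itemprop="ingredients">500g new potatoes</li>
        <li itemprop="ingredients">3 tbsp mayonnaise</li>
      </ul>
      <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
        Rated <span itemprop="ratingValue">4.5</span>/5 by
        <span itemprop="ratingCount">11</span> people.
      </div>
    </div>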

So, while it’s a fair question to ask, the question posed here is the wrong question. It would be better to ask “will schema.org metadata help people find my pages using Google”, to which the answer is yes if Google decides to use that mark up to enhance search result pages and/or provide additional search options.

Background
I have been involved in the LRMI (Learning Resource Metadata Initiative), which has proposed extensions to schema.org for describing the educational characteristics of resources–see this post I did for Creative Commons UK for further details. I have promised a more technical briefing on the hows and whys of LRMI/schema.org to be developed here, but given my speed of writing I wouldn’t hold your breath waiting for it. In the meantime this is one of several questions I thought might be worth answering. If you can think of any others, let me know.

CETIS publications, now on WordPress

We have recently changed how we present our publications to the world. Where once we put a file on the web somewhere, anywhere, and entered the details into a home-spun publication database, now we use WordPress. We’re quite pleased with how that has worked out, so we’re sharing the information that might help others use WordPress as a means of presenting publications to the world (a repository, if you like).

Why WordPress?
First, what were we trying to achieve? The overall aims were to make sure that our publications had good exposure online, to have a more coherent approach to managing them (for example, to collect all the files into one place in case we ever need to migrate them), and to move away from the bespoke system we were using to a system that someone else maintains. There were a few other requirements: we wanted something that was easy for us to adapt to fit the look and feel of the rest of our website, that was easy to maintain (familiarity is an important factor in how easy something is–it’s easy to use something if you know how to use it), and we wanted something that would present our publications in HTML and RSS, sliced and diced by topic, author and publication type: a URL for each publication and for each type of publication, and feeds for everything. We’re not talking about a huge number of publications, maybe 100 or so, so we didn’t want a huge amount of up-front effort.

We thought about Open Journal Systems, but there seemed to be a whole load of workflow stuff that was relevant to journals but not to our publications. Likewise we thought about EPrints and DSpace, but they didn’t quite look like what we wanted, and we are far more familiar with WordPress. As a wildly successful open source project, WordPress also fits the requirement of being maintained by other people: not just the core program, but all those lovely plugins and themes. So the basic plan was to represent each publication in a WordPress post and to use a suitable theme and plugins to present them as we wanted.

The choice of theme
Having settled on WordPress, the first decision was which theme to use. In order to get the look and feel to be similar to the rest of the CETIS website (and, to be honest, to make sure our publications pages didn’t look like a blog) we needed a very flexible theme. The most flexible theme I know of is Atahualpa: with over 200 options, including custom CSS snippets, parameters and HTML snippets, it’s close to being a template for producing your own custom themes. So, for example, the theme options I have set include a byline of By %meta('By')%. %date('F Y')%, which automatically inserts the additional metadata field ‘By’ and the date in the format of my choice, all of which can be styled any way I want. I’ll come back to the “byline” metadata later.

One observation here: there is clearly a trade-off between this level of customisation and ease of maintenance. On the one hand these are options set within the Atahualpa theme that can be saved between theme upgrades, which is better than would have been the case had we decided to fork the theme or add a few lines of custom code to the theme’s PHP files. On the other hand, it is not always immediately obvious which setting in the several pages of Atahualpa theme options has been used to change some aspect of the site’s appearance.

A post for each publication
As I mentioned above we can represent each publication by creating a WordPress post, but what information do we want to provide about each publication and how does it fit into a WordPress post? Starting with the simple stuff:

  • Title of the publication -> title of WordPress post.
  • Abstract / summary -> body of post.
  • Publication file -> uploaded as attached media.
  • Type of publication -> category.
  • Topic of publication -> tag.

Slightly less simple:

  • The date of the publication is represented as the date of the post. This is possible because WordPress lets you choose when to publish a post. The default is for posts to be published immediately when you press the Publish button; however, you can edit this to have them published in the past :)
    WordPress publication date option

  • The author of the publication becomes the author of the post, but there are some complications. It’s simple enough when the publication has a single author who works for CETIS: I just added everyone as an “author” user of WordPress, and a WordPress admin user can attribute any given post to the author of the publication it represents. Where there are two or more authors, a nifty little plugin called Co-Authors Plus allows them all to be assigned to the post. But we have some publications that we have commissioned from external authors, so I created a user called “Other” for these “external to CETIS” authors. This saves having a great long list of authors to maintain and present, but creates a problem of how to attribute these external authors, a problem that was solved by using WordPress’s “additional metadata” feature to enter a “by-line” for all posts. This also provides a nicely formatted by-line for multi-author papers without worrying about how to add PHP to put in commas and “and”s.
  • The only other additional metadata added was an identifier for each publication, e.g. the latest QTI briefing paper is No. 2011:B02.

Presenting it all
As well as customisation of the look and feel, the Atahualpa theme allows menus and widgets to be added to the user interface. Atahualpa has an option to insert a menu into the page header, which we used for the links to the other parts of the CETIS website. On the left hand side bar we’ve used the custom menu widget to list the tags and categories, providing access to the publications divided by topic and publication type as HTML and as a feed (just add /feed to the end of the URL). Also on the left, the List Authors plugin gives us links to publications by author.

In order to provide a preview of the publication in the post I used the TGN embed everything plugin. The only problem is that the “preview” is too good: it’s readable but not the highest quality, so it might lead some people to think that we’re disseminating low quality versions of the papers, whereas we do include links to high quality downloadable files.

The built-in WordPress search is rubbish. For example, it doesn’t include the author field in the search (not that the first thing we tested was vanity searching), and the results returned are sorted by date not relevance. Happily the relevanssi plugin provides all the search we need.

Finally, a few tweaks. We chose URL patterns that avoid unnecessary cruft, and closed comments to avoid spam. We installed the Google Analytics plugin, so we know what you’re doing on our site, and the Login Lock plugin for a bit of security. The only customisation we wanted that couldn’t be done with a theme option or plugin was providing some context for the multi-post pages. These are pages like the list of all the publications, or of all the briefing papers, and we wanted a heading and some text to explain what that particular cut of our collection was. Some themes do this by default, based on information entered about the tag/category/author on which the cut is made, but not Atahualpa. I put a few lines of PHP into the theme’s index.php template to deal with publication types, but we’ve yet to do it properly for all possible multi-post pages.

And in the end…
As I said at the top, we’re happy with this approach; if you have any comments on it, please do leave them below.

One last thing. Using a popular platform like WordPress means that there is a lot of support, and I don’t just mean a well supported code base and directory of plugins and themes. One of the most useful sources of support has been the WordPress community, especially the local group of WPUK, at whose meet-ups I get burritos and advice on themes, plugins, security and all things wordpressy.

A reflection for open education week

It’s open education week: lots of interesting events are happening and lots of reflections are being made on what open education means. One set of reflections that caught my eye was a trio of posts from Jisc programme managers David, Amber and Lawrie: three personal attempts to draw a picture of the open education space to answer the question “what is open education and how does it fit in with everything else?”. These sprang from an attempt “to describe the way JISC-funded work is contributing to developing this space”. They are great. But I think they miss one thing: the time dimension. By a stroke of good luck, Lou Macgill has recently produced an OER Timeline which I think represents this very nicely. (Yes, I know that there is much more to education than resources, and much more to open education than OER, but it’s resource management and dissemination that I mostly work on.)

Maybe it’s a sign of age, but the change in approaches to supporting the sharing of content is something that has been interesting me more and more of late. Nearly two years ago Lorna, John and I produced a paper for the ADL Repositories and Registries Summit called Then and Now, which highlighted changes in the technical approaches of JISC programmes that CETIS had helped support between 2002 and 2010. The desire to share resources had always been there; the change was from a focus on tight technical specifications to one which put openness at the centre. This wasn’t done for any ideological reason, but because we had an aim, “share stuff”, and the open approach seemed the one that presented the fewest obstacles. I tried to describe the advantages of the open approach in An open and closed case for educational resources.

The timeline helps me understand why we are doing OER rather than using some other means of solving the problem of how to share content, but that is just one aspect. What I really like about the open approach is that it creates new possibilities as well as solving old problems. So as well as a timeline of solutions, what we should have is a timeline of what we are trying to do, one which shows the changing aims as well as the changing solutions; and that, I think, would show a trend towards Open Education.