Where to put your EPUB metadata

Posted on 15/01/2014 by Phil Barker

Even in the knowledge that current mainstream EPUB readers and applications for managing eBooks will most likely ignore all but the most trivial metadata, we still have use cases that involve more sophisticate metadata. For example we would like to use the LRMI alignment object in schema.org to say that a particular subsection of a book can be useful in the context of a specific unit in a shared curriculum.

So, without evaluating pros and cons, starting from the most basic/most common, what are the options? This is a summary takes information from Garrish and Gulling, EPUB 3 Best Practices, OReilly 2013, (which I take to be authoritative and also as an example of best practice with regard to the metadata in the epub file) as well as the EPUB 3.0 Publications and Content Documents specifications. Any comments would be greatly appreciated.

1. Simple Dublin Core

Within the OEPBS directory of an unpacked EPUB3 is the content.opf file. It pretty much equates to the manifest of an IMS Content Package. The top-level element is <package> and <metadata> is a required first child of <package>.

The default metadata vocabulary is the Dublin Core Metadata Element Set (DCMES, simple DC), with prefix dc:. Three elements are mandatory–title, identifier and language–others are optional. For example, in /OEPBS/content.opf

<?xml version=’1.0’ encoding=’UTF-8’?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/ [...]">
    <metadata>
        <dc:identifier>urn:isbn:9781449325299</dc:identifier>
        <dc:title>EPUB 3 Best Practices</dc:title>
        <dc:language>en</dc:language>
        <dc:rights>Copyright © 2013 Matt Garrish and Markus Gylling</dc:rights>
[...]

2 Other metadata schemas

The package element has a prefix attribute that may be used to declare prefixes for metadata schemas other than DCMES. Four vocabularies are reserved, i.e. the prefix may be used without a declaration: dcterms, marc, onix and media (the vocabulary used for EPUB3 media overlays). Example

<dcterms:title>EPUB 3 Best Practices</dcterms:title>

Other vocabularies may be used providing a prefix and a URL in a way so similar to xmlns that is makes you wonder why they didn’t just use xmlns.

<package prefix="prism: http://prismstandard.org/namespaces/basic/3.0/" [...]>

3 the meta element

If used without the refines attribute (see below) the meta element can provide information about the package as a whole, e.g.

<meta property="dcterms:title">EPUB 3 Best Practices</meta>

I have no idea what would be the benefit of this over <dcterms:title>.

4 Refining metadata elements: id attribute and the meta element

The id attribute can be used to provide an identifier any element in the metadata that it may be refined. One example of this is mandatory, i.e. that one occurrence of the dc:identifier element must be the publication identifier:

<dc:identifier id="pub-identifier">urn:isbn:9781449325299</dc:identifier>

In general the refinements are described using the meta element with the refines attribute and a property attribute that specifies the nature of the refinement. It’s kind of like RDF reification. The default vocabulary for the property attribute includes “file-as” – an alternative string for a name to be used when filing, “identifier-type” – a way to distinguish between different identifiers, “meta-auth” – the authority for a given instance of metadata, “title-type” – which of the six forms of title being provided.

<dc:creator id="1234">Matt Garrish</dc:creator>
<meta refines="#1234" property="file-as" id="5678">Garrish, Matt</meta>
<meta refines="#1234" property="role">Author</meta>

Terms from other vocabularies may be used for “property” so long as a prefix is declared.

Refinements may have ids and so may be refined.

<meta refines="#5678" property="meta-auth">Phil Barker</meta>

So and so you can make statements about your metadata statements to you heart’s content (though including the whole of the linked data graph in each epub would be silly).

The scheme attribute may be used to identify the controlled vocabulary from which the meta element’s value is drawn. For example, if the identifier is a DOI (which in onix is apparently entry 06 of codelist 5) you can have

<dc:identifier id="pub-id">urn:doi:10.1016/j.iheduc.2008.03.001 </dc:identifier>
<meta refines="#pub-id"
      property="identifier-type"
      scheme="onix:codelist5">06</meta>

Or, using the marc relator value Aut to specify author

<meta refines="#1234" property="role" scheme="marc:relators">Aut</meta>

5 Sub-package level metadata

The id attribute may be used to provide an identifier of an subelement of <package> or any element in the XHTML content documents, down to a span element around a phrase, word or character. So a chapter may have id=”chap1″ then we can use meta elements in the metadata to describe it seperately from the rest of the epub

<meta refines="#chap1" property="prism:contentType">bookChapter<meta>

6 Links to metadata records

The link element is an optional, repeatable subelement of <metadata>, “used to associate resources with a publication, such as metadata records” The metadata may be within package or anywhere on the www.
Example

<link rel="marc21xml-record" href="pub/meta/nor-wood-marc21.xml" />
<link refines="#chap1" rel="ex:schema_org-record"
      media-type="application/ld+json"
      href="http://example.org/nor-wood-lrmi.json" />

Metadata embedded in the XHTML5 content

As far as I can see the EPUB3 specs are mute on metadata in HTML of the content documents, e.g. as html:meta elements or as microdata or RDFa, there doesn’t seem to be any reason why one should not put metadata here. I wouldn’t expect any EPUB system to look that deeply into the package but it would be a good approach to helping the metadata travel with the resource if the EPUB is disaggregated and passed into a non-EPUB3 CMS.

Embed innovation or implant potential?

Posted on 12/08/2013 by Phil Barker

This thought on etextbooks is an overflow from a conversation I was having on skype with Li and Tore about a workshop aimed at scoping what we would like the etextbooks of the future to look like. We were talking about how the idea of a textbook–its role in teaching and learning and hence (perhaps) its nature–was different in different cultures (Europe, US, Asia) and educational settings (school, higher education), when Tore said something along the lines of “why are are discussing this, shouldn’t we be talking about educational requirements”. Of course we should be talking about educational requirements and how they might be met by technologies such as ebooks, but I think there is more than that. My immediate reply was that by defining an area of interest as “etextbooks” we were implying a continuity with textbooks. I don’t think continuity implies a simple like-for-like replacement because I think the potential for etextbooks is far greater than that for paper textbooks, so moving to etextbooks should radically shift the trajectory of change. But the implication seems to be that etextbooks will pick up where paper text books leave off. That, I think is different from 20 or so years ago when we were talking about how computer based learning (or more recently online courses and technology enhanced learning) marked a step change in how education was delivered. In that case much of the talk was about how technology will radically change education. Even if my characterisation of the two cases as opposing is a bit crude (as it is), it’s worth comparing the two approaches. I’ll do that here, just briefly.

The technology-will-revolutionise-education approach runs the risk of alienating the people who you most need on your side if that revolutionary change is to be an improvement, that is the teachers and students. I remember we used to talk about technology as a Trojan Horse for introducing pedagogic improvement in HE, something that I stopped doing when I went to a presentation where the speaker pointed out that the Trojan Horse was an act of war in the context of a bloody siege, and perhaps that isn’t the way learning technologists should approach teachers. More importantly, introducing technology probably isn’t the best way to approach improving education. Introducing technology is not straightforward, it will take attention away from other matters: whatever the initial intent, it will distract from thinking about teaching and learning. If you want to improve education you should focus on that and probably not do something else that is really difficult in it’s own right at the same time.

So the start-with-something-familiar approach has an advantage here in that it simply focuses on planting a technology with higher potential into existing practice. The risk is that substiution is seen as as all that needs to be done, or that requirements that arise from this objective are over prioritised. For example, I have seen requirements for page-faithful display (i.e. the ability to reproduce on the ebook reader exactly what would be on paper) and page numbers as requirements for etextbooks. They may be desirable for marketing purposes, and there are real functional requirements relating to how content is presented and how it may be referenced, but building-in these restrictions as requirements would, in my view, be a mistake. Let’s have a strategy where we aim to embed but with a view to enhancing.

A triangle of objectives for etextbook technology; from the bottom: cost, availability, portability, functionality, innovation.

The path forward suggested for the US by the Educause/Internet 2 pilot etextbook pilot. Start with a basis aimed at increasing adoption and move forward to improvements in functionality and transformation.
Image from Grajek, Susan, Understanding what higher education needs from e-textbooks: an EDUCAUSE/Internet2 pilot (Research Report), EDUCAUSE July 2013.

I think this is the approach which is suggested by the recent report on the Educause/Internet2 pilots Understanding what higher education needs from e-textbooks, summarised in the image on the right. I must admit that I find this somewhat depressing, I am interested in getting to the peak of that pyramid as quickly as possible, but I would rather get there with teachers and learners than to be touting some theoretical improvement that is divorced from real teaching and learning. And of course, it’s important to be thinking from the outset what functionality and innovation should be built once the technology is in people’s hands.

I am presenting a session at Alt-C 2013 entitled Into the Mainstream? New developments in eTextBooks next month where I hope to discuss ideas like this.

ePub metadata what gets shown?

Posted on 18/06/2013 by Phil Barker

One of the issues around eTextBooks is how to describe them, specifically by way of educational metadata in ePub. That’s something that on the face of it shouldn’t be too difficult to address (at least to the extent that we know how to describe any educational resource). One thing that would be useful in demonstrating different choices for educational metadata is an app or tool that will display any metadata found in the ePub package in a sensible way. As a bit of long shot I tried four eBook readers to see whether they would; they don’t. The details follow, if you’re interested, but do let me know if you know of any tool that might be useful.

The package metadata of an ePub can include a selection of Dublin Core elements and terms. These can be refined, for example you may have two dc:title elements with refinements to specify that one is the main title and the other the subtitle. You can also extend with elements from other XML namespaces, or if you prefer you can just link to a metadata record of your favourite flavour which can be either inside the ePub package or elsewhere on the web. Any of this metadata can relate to the eBook as a whole or some part of it, e.g. a single chapter or image. Without going into details there seems to be enough scope there to experiment with how educational characteristics of the eBook might be described.

But how to see the results? I took an ePub (a copy of O’reily’s EPUB 3 Best Practices, since it seemed likely to provide as good a starting point as I was going to find in a real book), made a copy, unzipped it and changed the values of the meta elements so that I could easily identify what elements were being displayed. For example I changed
<dc:title id="pub-title">EPUB 3 Best Practices</dc:title> to
<dc:title id="pub-title">dc:title</dc:title> and so on.

Here’s a list of the metadata elements in that file:

<dc:title id="pub-title">
<dc:creator id="..." >
<dc:publisher>
<dc:date>
<meta property="dcterms:modified">
<dc:identifier id="pub-identifier">
<dc:language id="pub-language">
<dc:contributor> (repeated)
<dc:rights>
<dc:subject>
<dc:description>
<meta id="meta-identifier" property="dcterms:identifier">
<meta property="dcterms:title" id="meta-title">
<meta property="dcterms:language" id="meta-language">
<meta property="dcterms:rights">
<meta property="dcterms:rightsHolder">
<meta property="dcterms:publisher">
<meta property="dcterms:subject">
<meta property="dcterms:description>
<meta id="...." property="dcterms:creator"> (repeated, different ids)
<meta name="cover" content="cover-image"/>
<meta property="ibooks:specified-fonts">

I then looked at this with various eBook readers:

Readium

I had hopes for Readium since it is pretty much the reference implementation of EPUB3. It displayed

in Readium

dc:title
dc:creator
dc:publisher
dc:date
meta dcterms:modified
dc:identifier

Note that it doesn’t even check for a valid value for dates.

Calibre

Calibre, while it doesn’t claim to support ePub3 is targetted at managing personal book libraries. It displays:

in Calibre

dc:title
dc:creator
dc:subject (for tags)
dc:description
dc:publisher

It probably uses dc:language and dc:date (for published) as well but recognises that the values dc:language / dc:date aren’t valid.

Ideal Reader for Android

The Ideal Reader for Android is the other ePub3 reader I use. It displays

In Ideal Android Reader

dc:title
meta dcterms:creator (just one of them)
dc:date
dc:publisher
dc:description
dc:subject
dc:rights

iTunes

Finally I gave a chunk of diskspace to Apple

in iTunes desktop for Windows 7

dc:title
dc:creator
dc:title (again)
dc:subject (in the info tab, as Genre)

Yep, title is there twice: the info tab shows dc:title in the Name and Album fields, so you can gauge the amount of effort that Apple have put into adapting iTunes for books.

What did I learn?

I learnt that none of the ePub reading/management apps or tools that I have show more than the bare minimum of metadata, even if it is there. None of them will be much good for trying out ideas for how educationally characteristics can be described since I strongly suspect that none of it will be viewable. That’s not too surprising, especially when you consider that none of the tools I looked at are geared around resource discovery, but I can’t really go uploading dummy ePub files to book seller sites just to see what they look like. May be any meaningful exploration/demonstration of educational metadata in ePub is going to need a bespoke application, but if you know of a tool that might be helpful do drop me a line.

ebooks 2013

Posted on 13/05/2013 by Phil Barker

Every year for the past dozen or so years the Department of Information Sciences at UCL have organised a meeting on ebooks. I’ve only been to one of them before, two or three years ago, when the big issues were around what publishers’ DRM requirements for ebooks meant for libraries. I came away from that musing on what the web would look like if it had been designed by publishers and librarians (imagine questions like: “when you lend out our web page, how will you know that the person looking at the screen is a member of your library?”…). So I wasn’t sure what to expect when I decided to go to this year’s meeting. It turned out to be far more interesting than I had hoped, I latched on to three themes of particular interest to me: changing paradigms (what is an ebook?), eTextBooks and discovery.

Changing paradigms

With the earliest printed books, or incunabula, such as the Gutenberg Bible, printers sought to mimic the hand written manuscripts with which 15th cent scholars were familiar; in much the same way as publishers now seek to replicate printed books as ebooks.

In the first presentation of the day Lorraine Estelle, chief executive of Jisc Collections, focussed on access to electronic resources. Access not lending; resources not ebooks. She highlighted the problems of using yesterday’s language and thinking as being problematic in this context, like having a “horseless carriage” and buying it hay. [This is my chance to make the analogy between incunabula and ebooks again, see right.] The sort of discussions I recalled from the previous meeting I attended reflect this thinking, publishers wanting a digital copy of a book to be equivalent to the physical book, only lendable to one person at a time and to require replacing after a certain number of loans.

We need to treat digital content as offering new possibilities and requiring new ways of working. This might be uncomfortable for publishers (some more than others) and there was some discussion about how we cannot assume that all students will naturally see the advantages, especially if they have mostly encountered problematic content that presents little that could not be put on paper but is encumbered with DRM to the point that it is questionable as to whether they really own the book. But there is potential as well as resistance. Of course there can be more interesting, more interactive content–Will Russell of the Royal Society of Chemistry described how they have been publishing to mobile devices, with tools such as Chem Goggles that will recognise a chemical structure and display information about the chemical. More radically, there can also be new business models: Lorraine suggested Institutions could become publishers of their own teaching content, and later in the day Caren Milloy, also of Jisc Collections, and Brian Hole of Ubiquity Press pointed to the possibilities of open access scholarly publishing.

Caren’s work with the OAPEN Library is worth looking through for useful information relating to quality assurance in open monograms such as notifying readers of updates or errata. Caren also talked about the difficulties in advertising that a free online version of a resource is available when much of the dissemination and discovery ecosystem (you know, Amazon, Google…) is geared around selling stuff, difficulties that work with EDitEUR on the ONIX metadata scheme will hopefully address soon.

Brian described how Ubiquity Press can publish open access ebooks by driving down costs and being transparent about what they charge for. They work from XML source, created overseas, from which they can publish in various formats including print on demand, and explore economies of scale by working with university presses, resulting in a charge to the author (or their funders) of about £150 for a chapter assuming there is nothing to complex in that chapter.

eTextBooks

All through the day there were mentions of eTextBooks, starting again with Lorraine who highlighted the paperless medic and how his quest to work only with digital resources is complicated by the non-articulation of the numerous systems he has to use. When she said that what he wanted was all his content (ebooks, lecture handouts, his own notes etc.) on the same platform, integrated with knowledge about when and where he had to be for lectures and when he had exams, I really started to wonder how much functionality can you put into an eContent platform before it really becomes a single-person content-oriented VLE. And when you add in the ability to share notes with the social and communication capability of most mobile devices, what then do you have?

A couple of presentations addressed eTextBooks directly, from a commercial point of view. Jenni Evans spoke about Vital Source and Andrejs Alferovs about Kortext both of which are in the business of working with institutions distributing online textbooks to students. Both seem to have a good grasp of what students want, which I think should be useful requirements to feed into eTextBook standardization efforts such as eTernity, these include:

ability to print
offline access
availability across multiple devices
reliable access under load
integration with VLE
integration with syllabus/curriculum
epub3 interactive content
long term access
ability for student to highlight/annotate text and share this with chosen friends
ability to search text and annotations

Discovery

There was also a theme of resource discovery running through the day, and I have already mentioned in passing that this referenced Google and Amazon, but also social media. Nick Canty spoke about a survey of library use of social media, I thought it interesting that there seemed to be some sophisticated use of the immediacy of Twitter to direct people to more permanent content, e.g. to engagement on Facebook or the library website.

Both Richard Wallis of OCLC and Robert Faber of OUP emphasized that users tend to use Google to search and gave figures for how much of the access to library catalogue pages came direct from Google and other external systems, not from their own catalogue search interface. For example the Biblioteque Nationale de France found that 80% of access to their catalogue pages cam directly from web search engines not catalogue searches, and Robert gave similar figures for access to Oxford Journals. The immediate consequence of this is that if most people are trying to find content using external systems then you need to make sure that at least some (as much as possible, in fact) of your content is visible to them–this feeds in to arguments about how open access helps solve discoverability problems. But Richard went further, he spoke about how the metadata describing the resources needs to be in a language that Google/Bing/Yahoo understand, and that language is schema.org. He did a very good job distinguishing between the usefulness of specialist metadata schema for exchanging precise information between libraries or publishers, but when trying to pass general information to Google:

it’s no use using a language only you speak.

Richard went on to speak about the Google Knowledge graph and their “things not strings” approach facilitated by linked data. He urged libraries to stop copying text and to start linking, for example not to copy an author name from an authority file but to link to the entry in that file, in Eric Miller’s words to move from cataloguing to “catalinking”.

ebooks?

So was this really about ebooks? Probably not, and the point was made that over the years the name of the event has variously stressed ebooks and econtent and that over that time what is meant by “ebook” has changed. I must admit that for me there is something about the idea of a [e]book that I prefer over a “content aggregation” but if we use the term ebook, let’s use it acknowledging that the book of the future will be as different from what we have now as what we have now is from the medieval scroll.

Picture Credit
Scanned image of page of the Epistle of St Jerome in the Gutenberg bible taken from Wikipedia. No Copyright.

eTextBooks Europe

Posted on 21/01/2013 by Phil Barker

I went to a meeting for stakeholders interested in the eTernity (European textbook reusability networking and interoperability) initiative. The hope is that eTernity will be a project of the CEN Workshop on Learning Technologies with the objective of gathering requirements and proposing a framework to provide European input to ongoing work by ISO/IEC JTC 1/SC36, WG6 & WG4 on eTextBooks (which is currently based around Chinese and Korean specifications). Incidentally, as part of the ISO work there is a questionnaire asking for information that will be used to help decide what that standard should include. I would encourage anyone interested to fill it in.

The stakeholders present represented many perspectives from throughout Europe: publishers, publishing industry specification bodies (e.g. IPDF who own EPUB3, and DAISY), national bodies with some sort of remit for educational technology, and elearning specification and standardisation organisations. I gave a short presentation on the OER perspective.

Many issues were raised through the course of the day, including (in no particular order)

Interactive and multimedia content in eTextbooks
Accessibility of eTextbooks
eTextbooks shouldn’t be monolithic and immutable chunks of content, it should be possible to link directly to specific locations or to disaggregate the content
The lifecycle of an eTextbook. This goes beyond initial authoring and publishing
Quality assurance (of content and pedagogic approach)
Alignment with specific curricula
Personalization and adaptation to individual needs and requirements
The ability to describe the learning pathway embodied in an eTextbook, and vary either the content used on this pathway or to provide different pathways through the same content
The ability to describe a range IPR and licensing arrangements of the whole and of specific components of the eTextbook
The ability to interact with learning systems with data flowing in both directions

If you’re thinking that sounds like a list of the educational technology issues that we have been busy with for the last decade or two, then I would agree with you. Furthermore, there is a decade or two’s worth of educational technology specs and standards that address these issues. Of course not all of those specs and standards are necessarily the right ones for now, and there are others that have more traction within digital publishing. EPUB3 was well represented in the meeting (DITA is the other publishing standard mentioned in the eTernity documentation, but no one was at the meeting to talk about that) and it doesn’t seem impossible to meet the educational requirements outlined in the meeting within the general EPUB3 framework. The question is which issues should be prioritised and how should they be addressed.

Of course a technical standard is only an enabler: it doesn’t in itself make any change to teaching and learning; change will only happen if developers create tools and authors create resources that exploit the standard. For various reasons that hasn’t happened with some of the existing specs and standards. A technical standard can facilitate change but there needs to a will or a necessity to change in the first place. One thing that made me hopeful about this was a point made by Owen White of Pearson that he did not to think of the business he is in as being centred around content creation and publishing but around education and learning and that leads away from the view of eBooks as isolated static aggregations.

For more information keep an eye on the eTernity website

Jisc Observatory report on Ebooks in Education

Posted on 20/12/2012 by Phil Barker

The joint CETIS and UKOLN Observatory has just published a report “Preparing for Effective Adoption and Use of Ebooks in Education” written by James Clay. My CETIS colleague Li and I wrote the foreword for this report, which I’ve reproduced here but really you would be better going to the observatory and downloading the whole report.
Continue reading →

The Challenge of ebooks

Posted on 22/08/2012 by Phil Barker

Yesterday I was in London, along with a group of people with a wide range of experience in digital resource management, OERs, and publishing for a workshop which was part of the Challenge of eBooks project. Here’s a quick summary and some reflections.

To kick off, Ken Chad defined eBooks for the purpose of the workshop, and I guess the report to be delivered by the project, as anything delivered digitally that was longer than a journal article. I’ll come back to what I think are the problems with that later, but we didn’t waste time discussing it. It did mean that we included in the discussion such things as scanned copies of texts such as those that can be made under the CLA licence, and the difficulties around managing and distributing those.

For the earliest printed books, or incunabula, such as the Gutenberg Bible, printers sought to mimic the hand written manuscripts with which 15th cent scholars were familiar; in much the same way publishers now seek to replicate printed books as ebooks.

With the earliest printed book, or incunabula, such as the Gutenberg Bible, printers sought to mimic the hand written manuscripts with which 15th cent scholars were familiar; in much the same way as publishers now seek to replicate printed books as ebooks.

The main part of the workshop was organised around a “jobs to be done” framework. The idea of this is to focus on what people are trying to do “people don’t want a 5mm drill bit, they want a 5mm hole”. I found that useful in distinguishing ebooks in the domain of HE from the vast majority of those sold. In the latter case the job to be done is simply reading the book: the customer wants a copy of a book simply because they want to read that book, or a book by that author, or a book of that genre, but there isn’t necessarily any further motive beyond wanting the experience of reading the book. In HE the job to be done (ultimately) is for the student or researcher to learn something, though other players may have a job to do that leads to this, for example providing a student with resources that will help them learn something. I have views on how the computing power in the delivery platform can be used for more than just making the delivery of text more convenient: how it can be used to make the content interactive, or to deliver multimedia content, or to aid discussion or just connect different readers of the same text (I was pleased that someone mentioned the way a kindle will show which passages have been bookmarked/commented on by other readers).

The issues raised in discussion included rights clearance, the (to some extent technical, but mostly legal) difficulties of creating course packs containing excerpts of selected texts, the diversity of platforms and formats, disability access, and relationships with publishers.

It was really interesting that accessibility featured so strongly. Someone suggested that this was because the mismatch between an ebook and the device on which it is displayed creates an impairment so frequently that accessibility issues are plain for all to see.

A lot of the issues seem to go back publishers struggling with a new challenge, not knowing how they can meet it and keep their business model intact. It was great to have Suzanne Hardy of the PublishOER project there with her experience of how publishers will respond to an opportunity (such as getting more information about their users through tracking) but need help in knowing what the opportunities are when all they can see is the threat of losing control of their content. Whether publishers can make the necessary changes in currently print-oriented business processes to realise these benefits was questioned. Also there are challenges to libraries in HE, who are used to being able to buy one copy of a book for an institution whereas publishers now want to be able to sell access to individuals–partly, I guess, so that they can make that link between a user and the content they provide, but also because one digital copy can go a lot further than a single physical copy.

Interestingly, the innovation in ebooks is coming not from conventional publishers but from players such as Amazon, Apple and from publishers such as O’Reilly and Pearson. (Note that Pearson have a stake in education that includes an assessment business, online courses and colleges and so go beyond being a conventional publisher.) Also, the drive behind these innovations comes from new technology making new business models possible, not from evolution of current business, nor, arguably, from user demand.

So, anyway, what is an ebook? I am not happy with a definition that includes web sites of additional content created to accompany a book, or pages of a physical book that have been scanned. That doesn’t represent the sort of technical innovation that is creating new and interesting opportunities and the challenges come with them. Yes there are important (long-standing) issues around digital content in general, some of which will overlap with ebooks, but I will be disappointed if the report from this project is full of issues that could have been written about 10yrs ago. That’s not because I think those issues are dead but because I think ebooks are something different that deserves attention. I’ll suggest two approaches defining to what that something is:

1. an ebook is what the ebook reading devices and apps read well. By-and-large that means content in mobi or ePub format. Ebook readers don’t handle scanned page images well. They don’t read most pdf well (though depends on the tool and nature of pdf used, but aim of pdf was to maintain page layout which is exactly what you don’t want on an ebook reader). Word processed files are borderline but mostly word processed documents are page-oriented which raises the same issue as with pdfs. In short WYSIWYG and ebooks don’t match.

2. an ebook is aggregated content, packaged so that it can be moved from server to device, with more-or-less linear navigation. In the aggregation (which is often a zip file under another extension name) are assets (the text, images and other content that are viewed) plus metadata that describes the book as a whole (and maybe the assets individually) and information about how the assets should be navigated (structural metadata describing the organisation of the book). That’s essentially what mobi and ePub are. It’s also what IMS Content Packaging and offspring like SCORM and Common Cartridge are; and for that matter it’s what the MS Office and Open Office formats are.

I had a short discussion with Zak Mensah of JISC Digital Media about whether the content should be mostly text based. I would like to see as much non-text material as is useful, but clearly there is a limit. It would be perverse to take a set of videos, sequence them one after another with screen of text between each one like a caption frame in a silent movie, and then call it a book. However, there is something more than text that would make sense as a book: imagine replacing all the illustrations in a well-illustrated text book with models, animations, videos … for example, a chemistry book with interactive models of chemical structures, graphs that change when you alter the parameters; or a Shakespeare text with videos of performance in parallel with text…that still makes sense as a book.

[image of page from Gutenberg Bible taken from wikipedia]

Text and Data Mining workshop, London 21 Oct 2011

Posted on 21/10/2011 by Phil Barker

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and were well articulated by JohnMcNaught of NaCTeM. Briefly the potential lies in the extraction of new knowledge from old through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of volume of information involved. Full text access is crucial, John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, from for example Jeff Lynn of the Coalition for a digital economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if it were to remain competetive.

While these speakers touched on the legal barriers it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before) when publishers acquire copyright, or a licence to publish with enough restriction to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-by, this requires attribution. Naomi spoke of attribution stacking, a problem John had mentioned when a result is found by mining 1000s of papers: do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: it is not covered by fair dealling (though it is covered by fair use in the US and similar exceptions in Norwegian and Japanese law, nowhere else); the exceptions for transient copies (such as in a computers memory when readng on line) only apply if that copy has not intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealling to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review people who understand text mining and care about legal issues make sure that any legislation is sufficient to allow innovation, otherwise innovators will have to move to those jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA on organising this event; there’s obviously plenty more for them to do.

Hopes and fears for eReaders and eTextBooks

Posted on 24/01/2011 by Phil Barker

About 15 years ago, when I was first starting to promote the use of resources for “computer aided learning” the message was fairly clear: reading text off a screen is problematic so don’t use computers for this, use them for what they are good at. For me, in physical sciences at that time, they were good at multimedia presentation and the calculations necessary for creating interactive models that allow active engagement with the physics being taught. More generally, computers were good at things that allowed more pedagogically appropriate approaches to teaching and learning.

I’ve been disappointed since then: the most widely adopted applications of technology in teaching and learning are to use them to project presentations instead of transparencies on an OHP, and as VLEs to distribute course info and handouts. In both example the net impact of the computer is to do the same thing in a slightly more convenient way. Now a platform has reached maturity that allows a slightly more convenient way to read books, reproducing the text-on-paper experience. It’s bound to be the next big thing.

So, what to do about this? Admit that in practice technology will enhance learning by making small incremental improvements to established practice? Press for enhanced capability where it will facilitate good pedagogy? Work in anticipation of some revolutionary change driven by factors outwith the HE system?

In the meantime, some relevant stuff elsewhere:

An open e-Textbook usecase, our contribution to the ISO/IEC JTC1 SC36 Study Period on e-Textbooks.

Digital Textbooks, a blog devoted to documenting significant initiatives that relate to any and all aspects of digital textbooks, most notably their use in higher education.

Wolfram assistants the sort of good stuff that could find its way into a digital text book.

Amazon Kindle customer review on the bad stuff: problems with footnotes in academic eTexts.

A short update on Ramlet

Posted on 03/12/2009 by Phil Barker

Ramlet, or Resource Aggregation Model for Learning, Education and Training (which is working group 13 of the IEEE Learning Technology Standards Subcommittee) is an ongoing piece of work which aims to define a conceptual model that includes an ontology and a nomenclature for enabling the interpretation of externalized representations of digital aggregates of resources for learning, education, and training applications. In other words, it will help show semantic relationships between content aggregation formats such as IMS CP, ATOM, MPEG 21 DID and OAI-ORE.

Like many standardization efforts, progress is slow and gradual so it’s difficult to know when it’s worth giving an update. But last week the RAMLET technical editor, Scott Lewis sent this message about the conceptual model:

This standard has taken a long time, but it is a complex standard that presents an ontology for resource aggregation and down-loadable files to help implement the ontology.

The good news is that virtually all of the technical work has been done for the standard and for a series of IEEE recommend practices that will be published after the standard is published. The working group expects to have a draft of the base standard for internal review by year’s end and a balloting draft submitted to IEEE in Q1 of 2010. The series of recommended practices that specify mappings for IMS CP, ATOM, METS, MPEG-21 DID, and OAI-PMH ORE will be published as soon as possible after the standard is published. Again, the technical work for these recommend practice has been done, and it is just a matter of converting that work to IEEE recommended practices after the base standard has been approved.

CETIS’s Wilbert Kraan is taking part in the RAMLET work, working on a proof of concept implementation using standard open source components.

Phil Barker

Cetis Blog

Category Archives: aggregated content