Wilbert Kraan » metadata – Cetis blog (http://blogs.cetis.org.uk/wilbert)

Poking data.gov.uk for a day
http://blogs.cetis.org.uk/wilbert/2010/01/23/poking-datagovuk-for-a-day/
Sat, 23 Jan 2010

data.gov.uk – a portal for UK governmental open data – is clearly a fantastic idea, and has the makings of a real treasure trove. But estimating how far along it is in practice means picking up a stick and poking the beast.

Which data

This is what it’s all about. To my unscientific eyeballing, the existence of nearly every dataset of interest is flagged, and you can find them too, with some patience and use of the tags. Clicking through to the source quickly shows that most aren’t actually available to us yet in any shape or form, though. That’s fine; I’m sure something like that just takes time.

What would be good, though, is some indication in the navigation or search results of which datasets are actually available, where ‘available’ means queryable through the SPARQL interface or downloadable as an RDF dump.

Getting at the data

The main means of getting at the data is the SPARQL query service provided. That works, but has some quirks:

From experience, the convention for such services is to provide a single endpoint at something like http://mysite.org/sparql: you get a human SPARQL query form if you go there with a browser, and an answer if you fire a SPARQL protocol query at it from the comfort of your own SPARQL client.

There is a human interface at https://www.data.gov.uk/sparql, but machine queries need to go someplace else: http://services.data.gov.uk/{optional dataset}/sparql. Confusingly, the latter has a basic but very nice human interface too…

Of more concern is that both spit out either JSON (in theory) or XML, but not RDF. Including XML and JSON is very sensible indeed, because the greatest number of people can suck those formats up and stick them in the mashups they know and love. But for the promises of linked data to work, there is an absolute need for some kind of RDF output.
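To make the querying concrete, here is a minimal sketch of firing a SPARQL protocol query at one of the services.data.gov.uk endpoints and asking for JSON results, using Python’s requests library. It assumes the endpoint pattern quoted above; the dataset name in the URL is purely illustrative, and whether the Accept header actually yields JSON is exactly the ‘in theory’ caveat just mentioned.

import requests

# Endpoint pattern from the post; the "education" dataset name is illustrative.
ENDPOINT = "http://services.data.gov.uk/education/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},  # fall back to XML if ignored
    timeout=30,
)
response.raise_for_status()

# SPARQL results JSON puts each row in results.bindings, one dict per variable.
for row in response.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])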

Exploring the data

In order to formulate the whizzy linked data queries that this stuff is all about, you have to get a feel for what an open dataset and its vocabulary look like. Support for that is a bit lacking at the moment too: you can ask the SPARQL endpoint for types, and then keep poking, but making the whole thing browsable on something like the OpenLink RDF browser would be even better. I stumbled on some ways of exploring vocabularies, but they didn’t seem to allow navigation between concepts just yet.
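The ‘ask the endpoint for types’ step looks something like the sketch below. It reuses the same illustrative endpoint as before and simply lists the distinct classes in use, which is about as far as blind poking gets you.

import requests

ENDPOINT = "http://services.data.gov.uk/education/sparql"  # illustrative, as above

# First pass at the vocabulary: which rdf:type values does the store use?
TYPES_QUERY = """
SELECT DISTINCT ?type
WHERE { ?s a ?type }
LIMIT 50
"""

r = requests.get(
    ENDPOINT,
    params={"query": TYPES_QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
for binding in r.json()["results"]["bindings"]:
    print(binding["type"]["value"])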

There are a forum and a wiki where people can cooperate on how to work on these issues, but I couldn’t see how to join them – hence this post.

So?

If you’re getting rather disappointed by now: don’t. As far as I can see, the underlying platform is easily capable of addressing each of the points I stumbled across. More importantly, all the pieces are there for something truly compelling: freely mixable open data. It’s just that not all the pieces have been put together yet. The roof has been pitched, but the rest still needs doing.

Semantic tech finds its niches and gets productive
http://blogs.cetis.org.uk/wilbert/2008/06/03/semantic-tech-finds-its-niches-and-gets-productive/
Tue, 03 Jun 2008

Rather than the computer science foundations, the annual semantic technology conference in San Jose focusses on practical applications of the approach. Visitor numbers are growing at a ‘double digit’ clip, and the vendors are starting to include big names such as Oracle. We take a look.

It seems that we’re through the trough of disillusionment about the fact that the semantic web as outlined by Tim Berners-Lee in 1999 has not exactly materialised (yet). It’s not news that we do not all have intelligent agents that can seek out all kinds of data on the ’net and integrate it to satisfy our specific queries and desires. What we do have is a couple of interesting and productive approaches to the mixing and matching of disparate information that hint at a slope of enlightenment, heading to a plateau of productivity.

Easily the clearest example from the conference is the case of Blue Shield of California, a sizeable health care insurer in the US. They faced the familiar issue of a pile of legacy applications with custom connections that were required to do things they were never designed to do. As a matter of fact, customer and policy data (core data, clearly) were spread over two systems of different vintage, making a unified view very difficult.

In contrast to what you might expect, the solution they built leaves the data in the existing databases: nothing is replicated in a separate store. Instead, the integration takes place in a ‘semantic layer’. In essence, that means that users can ask pretty complex and detailed questions of any information that is held in any system, in terms of a set of models of the business. These queries end up at the same old systems, where they get mapped from semantic web query form into good old SQL database queries.

This kind of approach doesn’t look all that different from the Enterprise Service Bus (ESB) in principle, but it takes the insulation of service consumers from the details of service providers rather further. Service consumers in a semantic layer have just one API for any kind of data (the W3C’s SPARQL query language) and one data model (RDF, though output in XML or JSON is common). Most importantly, the meaning of the data is modelled in a set of ontologies, not in the documentation of the service providers, or the heads of their developers.
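From the consumer’s side, that ‘one API, one data model’ idea looks something like the sketch below: a query phrased purely in terms of a business ontology, which the semantic layer would rewrite into SQL against whichever legacy system actually holds the records. Every name in it (the endpoint URL, the ins: vocabulary) is hypothetical – the actual Blue Shield models aren’t public.

import requests

# Hypothetical semantic layer endpoint; the ins: vocabulary below is invented
# for illustration and stands in for a real set of business ontologies.
SEMANTIC_LAYER = "http://integration.example.org/sparql"

QUERY = """
PREFIX ins: <http://example.org/ontology/insurance#>

SELECT ?customerName ?policyNumber
WHERE {
  ?customer a ins:Customer ;
            ins:name ?customerName ;
            ins:holdsPolicy ?policy .
  ?policy ins:policyNumber ?policyNumber ;
          ins:status ins:Active .
}
"""

# The consumer never sees the legacy schemas; the layer maps this query
# into SQL against the systems that actually hold customer and policy data.
rows = requests.get(
    SEMANTIC_LAYER,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
).json()["results"]["bindings"]

for row in rows:
    print(row["customerName"]["value"], row["policyNumber"]["value"])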

While the Blue Shield of California case was done by BBN, other vendors that exhibited in San Jose have similar approaches, often built on open source components. The most eye-catching of those components (and also used by BBN) is netkernel: the overachieving offspring of the web and unix. It’s not necessarily semantic tech per se, but more of a language-agnostic application environment that competes with J2EE.

Away from the enterprise, and more towards the webby (and educational!) end of things, the state of semantic technology becomes less clear. There are big web apps such as the Twine social network, where the triples are working very much in the background, or Powerset, where they are much more in your face, but to little apparent effect.

Much less polished, but much, much more interesting is dbpedia.org: an attempt to put all public domain knowledge in a triple store. Yes, that includes Wikipedia, but also the US census and much more. DBpedia is accessible via a number of interfaces, including SPARQL. The result is the closest thing yet to a live instance of the semantic web as originally conceived, where it really is possible to ask questions like “give me all football players with number 11 shirts from clubs with stadiums with more than 40000 seats, who were born in a country with more than 10M inhabitants”. Because of the inherent flexibility of a triple store and the power of the SPARQL interface, DBpedia could end up powering all kinds of other web applications and user interfaces.
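That football question translates into a SPARQL query along the lines of the sketch below, aimed at the public DBpedia endpoint. Treat the dbo: property names as illustrative rather than authoritative – the DBpedia vocabulary has shifted over the years, so this shows the shape of the query rather than something guaranteed to run unmodified.

import requests

DBPEDIA = "http://dbpedia.org/sparql"  # the public DBpedia SPARQL endpoint

# Property names (dbo:number, dbo:ground, dbo:seatingCapacity, ...) are
# illustrative; check the current DBpedia ontology before relying on them.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?player
WHERE {
  ?player a dbo:SoccerPlayer ;
          dbo:number 11 ;
          dbo:team ?club ;
          dbo:birthPlace ?country .
  ?club dbo:ground ?stadium .
  ?stadium dbo:seatingCapacity ?seats .
  ?country dbo:populationTotal ?population .
  FILTER (?seats > 40000 && ?population > 10000000)
}
LIMIT 25
"""

results = requests.get(
    DBPEDIA,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
).json()

for binding in results["results"]["bindings"]:
    print(binding["player"]["value"])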

Nearer to a real semantic web, though, is Yahoo’s well-publicised move to start supporting relevant standards. While the effect isn’t yet as obvious as semantic integration in the enterprise or DBpedia, it could end up being significant, for the simple reason that it focusses on the organisational model. It does that by taking data in various ‘semantic web light’ formats embedded in webpages and using it in the structuring and presentation of search results. If you want to present a nice set of handles on your site’s content in a Yahoo search results page (links to maps, contact info, schedules etc.), it’s time to start using RDFa or microformats.

Beyond the specifics of semantic standards or technologies at this point in time, though, lies the increasing demand for such tools. The volume and heterogeneity of data is increasing rapidly, not least because the means of fishing structured data out of unstructured data are improving. At the same time, the format of structured data (its syntax) is much less of an issue than it once was, as is the means of shifting that data around. What remains is making sense of that data, and that requires semantic technology.

Resources

The Semantic Technology Conference site gives an overview, but no presentations, alas.

The Blue Shield of California architecture was built with BBN’s Asio tool suite.

More about the netkernel resource-oriented computing platform can be found on the 1060 website.

Twine is still in private beta, but will open up in overhauled form in the autumn.

Powerset is Wikipedia with added semantic sauce.

DBpedia is the community effort to gather all public domain knowledge in a triple store. There’s a page that outlines all ways of accessing it over the web.

Yahoo’s SearchMonkey is the project to utilise semweb light specs in search results.

The e-framework, social software and mashups
http://blogs.cetis.org.uk/wilbert/2007/10/10/the-e-framework-social-software-and-mashups/
Wed, 10 Oct 2007

The e-framework has just opened a wiki, and will happily accommodate Web APIs and mashups. We show how the former works by submitting an example of the latter.

The e-framework is all about sharing service-oriented solutions in a way that can be easily replicated by other people in a similar situation. The situation of the Design for Learning programme is simple, and probably quite familiar: a group of very diverse JISC projects want to be able to share resources with each other, but also with a number of other communities. Some of these communities have pretty sophisticated sharing infrastructures built around institutional, national or even global repositories that do not interoperate with each other out of the box.

There are any number of sophisticated solutions to that issue, several of which are also being documented as e-framework Service Usage Models (SUMs). Sometimes, though, a simple, low-cost solution – preferably one that doesn’t require agreement between all the relevant infrastructures beforehand – is good enough.

Simplicity and low cost are the essence of the Design for Learning sharing SUM. It achieves that by concentrating on a collection of collaboratively authored bookmarks that point to the Design for Learning resources. That way, the collection can happily sit alongside any number of more sophisticated infrastructures that need to store the actual learning resource. It also makes the solution flexible, future-proof and applicable to any resource that can be addressed with a URL.

The JISC Design for Learning sharing SUM diagram.

There is, of course, slightly more to it than that, and that’s what the SUM format is designed to capture. The various headings that make up a SUM draw out all the information that’s needed either to find the SUM’s parts (i.e. services), or to replicate, adapt or incorporate the SUM itself.

For example, the Design for Learning SUM shows how to link a bookmark store such as del.icio.us to a community website that can render a newsfeed. That is: it calls for a Syndicate service. This SUM doesn’t say which kind of Syndicate service exactly, but it does show where you have the choice and roughly what the choices are.

By the same token, the SUM can be taken up and tweaked to meet different needs with a minimum of effort. Accommodating different bookmark stores at the same time, for example, is a matter of adding one of several mash-up builders between the Syndicate service providers and the community website. Or you could simply refer to the whole SUM as one part of a much wider and more ambitious resource sharing strategy.
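For a feel of how little machinery the Syndicate step needs, here is a minimal sketch in Python: pull a tag feed from one or more bookmark stores and render the entries as a list of links for the community website. The feed URLs are only illustrative (del.icio.us’s tag-feed scheme of the time, plus a hypothetical second store); the point is that any store publishing a newsfeed can be dropped in.

import feedparser  # pip install feedparser

# Illustrative feed URLs: a del.icio.us-style tag feed and a hypothetical
# second bookmark store; swap in whatever your stores actually publish.
FEEDS = [
    "http://del.icio.us/rss/tag/designforlearning",
    "http://bookmarks.example.org/rss?tag=designforlearning",
]

items = []
for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        items.append((entry.get("title", "untitled"), entry.get("link", "")))

# Render a plain HTML fragment that the community website can embed.
html = "<ul>\n" + "\n".join(
    '  <li><a href="{0}">{1}</a></li>'.format(link, title) for title, link in items
) + "\n</ul>"
print(html)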

Fortunately, completing a SUM is a bit easier now that there’s a specialised wiki. Bits and bobs can be added gradually and collaboratively up to the point where it can be submitted as the official version. Once that’s done, feedback from the wider e-framework community will follow, and the SUM will be hosted in the e-framework registry. You should be able to see how that works by watching the Design for Learning SUM page in the coming weeks.

The Design for Learning sharing SUM wiki page.
The Design for Learning support pages, which will have an implementation of the sharing SUM soon.
How you can start your own e-framework page is outlined on the wiki itself.

SCORM getting ready to leave ADL home
http://blogs.cetis.org.uk/wilbert/2007/03/23/scorm-getting-ready-to-leave-adl-home/
Fri, 23 Mar 2007

Not a new idea, but one that took a few decisive steps this week in a series of meetings in London: the moving of the SCORM educational content format out of the US Department of Defense’s ADL initiative, and into something that reflects the current SCORM user community. The parents will continue to cough up some fees, and provide a home if necessary, but would rather SCORM found its own way now.

The ‘something that reflects the current SCORM user community’ is, of course, the crucial bit. A prospectus for such a something (LETSI: Learning, Education & Training Systems Interoperability) was introduced to coalesce thinking about the issue at the meetings. That prospectus tentatively assumed that LETSI would be a new, stand-alone organisation. Another acronym to put on the pile, if you like.

This is not universally popular, for the obvious reason that the education technology interoperability sphere is plenty fragmented as it is. Australia’s DEST (Department of Education, Science and Training) makes that point eloquently. At the London event, offers from existing organisations such as IMS to look after a LETSI-like set-up were dismissed with the observation that not one of the existing organisations is truly representative of the community that now uses SCORM.

The other point of controversy is whether the organisation is to do more than just SCORM. As the DEST paper indicates, the prospectus gives no concrete indication of what such work might look like, and the meetings didn’t clarify that further either.

After the meetings, both issues are still relatively open. What was decided was that a strategic and operational charter is to be drawn up by June, for ratification come September, in a meeting parallel to the quarterly of ISO/IEC JTC 1 SC36 (the educational technology arm of the most formal standards body) in Toronto. Contributions of time and/or money are solicited; there’ll be more on the ADL community site about that.

It’ll be in that process that the heft of LETSI (or some other, better name) and its scope will be decided. It could be an open annual meeting and forum to tweak SCORM and nothing more, or it could be a full bells-and-whistles, big dollar membership organisation that looks at everything to do with education and interoperability, worldwide, etc, etc, etc.

It seems unlikely that we’ll fully escape another acronym, regrettably, but, on the bright side, you could argue that it is in effect a direct swap out of ADL for not-LETSI. That is, from the point of view of the education community, the total acronym population remains constant, but the ecosystem shifted slightly.

New content aggregation object work started
http://blogs.cetis.org.uk/wilbert/2006/10/19/new-content-aggregation-object-work-started/
Thu, 19 Oct 2006

From the same people that brought you the Open Archives Initiative’s Protocol for Metadata Harvesting (OAI-PMH) now comes Object Reuse and Exchange (ORE). It’s a Mellon-funded project around the exchange of digital objects between repositories, with a particular focus on the objects themselves and the services that allow access to and ingest of them.

The project is intended to produce an “interoperability fabric” that will facilitate interoperability between research-oriented digital repositories. Since we’re at the start of the project, the scope is not entirely clear, but it would at least encompass a data model for the objects to be exchanged, a means of binding those objects (i.e. sticking them in an agreed file format) and some interfaces for moving them about.

The obvious question in a case like this is: why do another set of specs?

Why design another such fabric when there are already plenty of digital object aggregation formats such as the Metadata Encoding and Transmission Standard (METS), and no shortage of digital repository query protocols and other interfaces? Indeed, the OAI’s own PMH is a widely used means of providing basic interoperability between repository catalogues, and the authors themselves cite the MPEG-21 Digital Item Declaration standard as a good example of a digital object data model (Interoperability-finalreport.pdf).

Answer in short: unmet use cases in the specific domain of inter-repository interoperability.

There are between-repository operations one would want to do beyond just harvesting metadata; obtaining and putting digital objects spring to mind, for example. To be sure, there are protocols for that as well, but possibly none tailored specifically to inter-repository use, with its formal workflows and multi-layered understanding of sets of related artefacts. Still, it’d be nice to see those use cases spelled out, and an indication of the extent to which existing specs support them.
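As a point of reference for that baseline, the sketch below shows what a plain OAI-PMH metadata harvest looks like in Python – roughly the level of inter-repository interoperability that already exists today. The repository base URL is a placeholder; the verb and metadata prefix are standard OAI-PMH.

import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://repository.example.org/oai"  # placeholder base URL

# Standard OAI-PMH request: list records in simple Dublin Core.
resp = requests.get(
    OAI_ENDPOINT,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=60,
)
root = ET.fromstring(resp.content)

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

for record in root.findall(".//oai:record", NS):
    identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
    title = record.findtext(".//dc:title", namespaces=NS)
    print(identifier, "-", title)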

Much the same goes for the digital object data model, though existing specs in this field could well be more directly relevant to the domain. Documentation (Interoperability-finalreport.pdf, MellonProposalwithoutbudget.pdf) suggests that the ORE work is likely to build directly on a data model developed under the NSF Pathways project, which at least looks fairly simple and appears to have some implementation experience behind it.

More widely, the fact that people propose work on a new spec in a well-known area could suggest that the area simply has not reached full maturity yet. Only when the last counter-proposal to a dominant specification has failed to gain any traction can anyone be sure that the technology it codifies is truly well understood.
