The cloud is for the boring

Posted on April 15, 2011 by Wilbert Kraan

Members of the Strategic Technologies Group of the JISC’s FSD programme met at King’s Anatomy Theatre to, ahem, dissect the options for shared services and the cloud in HE.

The STG’s programme included updates on projects of the members as well as previews of the synthesis of the Flexible Service Delivery programme of which the STG is a part, and a preview of the University Modernisation Fund programme that will start later in the year.

The main event, though, was a series of parallel discussions on business problems where shared services or cloud solutions could make a difference. The one I was at considered a case from the CUMULUS project: how to extend rather than replace a Student Record System in a modular way.

View from the King's anatomy theatre up to the clouds

In the event, a lot of the discussion revolved around what services could profitably be shared in some fashion. When the group looked at what is already being run on shared infrastructure and what has proven very difficult, the pattern is actually very simple: the more predictable, uniform, mature, well understood and inessential to the central business of research and education, the better. The more variable, historically grown, institution specific and bound up with the real or perceived mission of the institution or parts thereof, the worse.

Going round the table to sort the soporific cloudy sheep from the exciting, disputed, in-house goats, we came up with the following lists:

Cloud:

email
Travel expenses
HR
Finance
Student network services
Telephone services
File storage
Infrastructure as a Service

In house:

Course and curriculum management (including modules etc)
Admissions process
Research processes

This ought not to be a surprise, of course: the point of shared services – whether in the cloud or anywhere else – is economies of scale. That means that the service needs to be the same everywhere, change little or not at all, give its users no competitive advantage, and have well understood and predictable interfaces.

Linked Data meshup on a string

Posted on February 25, 2010 by Wilbert Kraan

I wanted to demo my meshup of a triplised version of CETIS’ PROD database with the impressive Linked Data Research Funding Explorer at the Linked Data meetup yesterday. I couldn’t find a good slot and still make my train home, so here’s a broad outline:

The data

The Department for Business Innovation and Skills (BIS) asked Talis if they could use the Linked Data Principles and practice demonstrated in their work with data.gov.uk to produce an application that would visualise some grant data. What popped out was a nice app with visuals by Iconomical, based on a couple of newly available data sets that sit on Talis’ own store for now.

The data concerns research investment in three disciplines, which are illustrated per project, by grant level and number of patents, as they changed over time and plotted on a map.

CETIS have PROD: a database of JISC projects, with a varying amount of information about the technologies they use, the programmes they were part of, and any cross links between them.

The goal

Simple: it just ought to be possible to plot the JISC projects alongside the advanced tech of the Research Funding Explorer. If not, then at least the data in PROD should be augmentable with the data that drives the Research Funding Explorer.

Tools

Anything I could get my hands on, chiefly:

The D2R toolkit
OpenLink’s Virtuoso platform and associated kit like their RDF browser
Talis’ Platform

The recipe

For one, though PROD pushes out Description Of A Project (DOAP, an RDF vocabulary) files per project, it doesn’t quite make all of its contents available as linked data right now. The D2R toolkit was used to map (part of) the contents to known vocabs, and then make the contents of a copy of PROD available through a SPARQL interface. Bang, we’re on the linked data web. That was easy.

Since I don’t have access to the slick visualisation of the Research Funding Explorer, I’d have to settle for augmenting PROD’s data. This is useful for two reasons: 1) PROD has rather, erm, variable institutional names. Synching these with canonical names from a set that will go into data.gov.uk is very handy. 2) PROD doesn’t know much about geography, but Talis’ data set does.

To make this work, I made a SPARQL query that grabs basic project data from PROD, and institutional names and locations from the Talis data set, and visualises the results.
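To give a flavour of what such a query can look like, here is a minimal sketch in Python that only builds the query text. It assumes both data sets have been loaded into one store as two named graphs. The graph URIs, the use of doap:vendor to link a project to its institution, and the label and geo properties on the Talis side are all my assumptions for illustration, not the mapping actually used.

```python
from string import Template

# Hypothetical graph names: placeholders, not the real stores.
PROD_GRAPH = "http://example.org/graph/prod"
GRANTS_GRAPH = "http://example.org/graph/research-grants"

# The property choices below are assumptions about how the data might be modelled.
QUERY_TEMPLATE = Template("""\
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?project ?projectName ?canonicalName ?lat ?long
WHERE {
  GRAPH <$prod_graph> {
    ?project a doap:Project ;
             doap:name ?projectName ;
             doap:vendor ?org .        # assumed link from project to institution
    ?org foaf:name ?prodName .
  }
  GRAPH <$grants_graph> {
    ?institution rdfs:label ?canonicalName ;
                 geo:lat ?lat ;
                 geo:long ?long .
  }
  # Loose, case-insensitive match to cope with variable names in PROD.
  FILTER regex(str(?canonicalName), str(?prodName), "i")
}
LIMIT $limit
""")


def build_meshup_query(limit=100):
    """Return the SPARQL text with graph names and result limit filled in."""
    return QUERY_TEMPLATE.substitute(
        prod_graph=PROD_GRAPH,
        grants_graph=GRANTS_GRAPH,
        limit=int(limit),
    )


if __name__ == "__main__":
    print(build_meshup_query(limit=25))
```

The regex filter is deliberately forgiving; a stricter join would give fewer false matches but would also miss more of the oddly spelled institutions.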

Results

A partial map of England, Wales and southern Scotland with markers indicating where projects took place
An excerpt of PROD project data, augmented with proper institutional names and geographic positions from Talis’ Research Grant Explorer, visualised in OpenLink RDF browser.

A star shaped overview of various attributes of a project, with the name property highlighted
Zooming in on a project, this time to show the attributes of a single project. Still in OpenLink RDF browser.

A two column list of one project's attributes and their values
A project in D2R’s web interface; not shiny, but very useful.

From blagging a copy of the SQL tables from the live PROD database to the screen shots above took about two days. Opening up the live server straight to the web would have cut that time by more than half. If I had waited for the Research Grant Explorer data to be published at data.gov.uk, it would have been a matter of about 45 minutes.

Lessons learned

Opening up any old database as linked data is incredibly easy.

Cross-searching multiple independent linked data stores can be surprisingly difficult. This is why a single SPARQL endpoint across them all, such as the one presented by uberblic’s Georgi Kobilarov yesterday, is interesting. There are many other good ways to tackle the problem too, but whichever approach you use, making your linked data available as simple big graphs per major class of thing (entity) in your dataset helps a lot. I was stymied somewhat by the fact that I wanted to make use of data that either wasn’t published properly yet (Talis’ research grant set), or wasn’t published at all (our own PROD triples).

A bit of judicious SPARQLing can alleviate a lot of inconsistent data problems. This is salient to a recent discussion on twitter around Brian Kelly’s Linked Data challenge. One conclusion was that it was difficult, because the data was ‘bad’. IMHO, this is the web, so data isn’t really bad, just permanently inconsistent and incomplete. If you’re willing to put in some effort when querying, a lot can be rectified. We, however, clearly need to clean up PROD’s data to make it easier on everyone.
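The same kind of effort can also be spent on the client side, after the results come back. The sketch below is a Python stand-in for that idea rather than the SPARQL I used: it normalises institution names before matching them against a canonical list. The function names and the sample names are made up for illustration.

```python
import re


def normalise(name):
    """Lower-case, expand a common abbreviation, drop punctuation and filler words."""
    text = name.lower().replace("&", " and ")
    # Expand "uni", "univ" and "univ." to "university" (whole words only).
    text = re.sub(r"\buni(v)?\b\.?", "university", text)
    # Replace anything that is not a letter, digit or space.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    words = [w for w in text.split() if w not in {"the", "of"}]
    # Sorting makes "X University" and "University of X" compare equal.
    return " ".join(sorted(words))


def match_names(messy_names, canonical_names):
    """Map each messy name to its canonical form, or None if nothing matches."""
    lookup = {normalise(c): c for c in canonical_names}
    return {m: lookup.get(normalise(m)) for m in messy_names}


if __name__ == "__main__":
    canonical = ["The University of Exampleton"]  # made-up sample data
    messy = ["Univ. of Exampleton", "exampleton university", "Unknown College"]
    for name, hit in match_names(messy, canonical).items():
        print(name, "->", hit)
```

Sorting the words is a blunt instrument: it will happily conflate two different institutions whose names contain the same words in a different order, so anything matched this way deserves an eyeball check.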

SPARQL-panning for gold in multiple datastores (or even feeds or webpages) is way too much fun to seem like work. To me, anyway.

What’s next

What needs to happen is to make all the contents of PROD and related JISC project information available as proper linked data. I can see three stages for this:

We clean up the PROD data a little more at source, and load it into the Data Incubator to polish and debate the database to triple mapping. Other meshups would also be much easier at that point.
We properly publish PROD as linked data either on a cloud platform such as Talis’, or else directly from our own server via D2R or OpenLink Virtuoso. Simal would be another great possibility for an outright replacement of PROD, if it’s far enough along at that point.
JISC publishes the public part of its project information as Linked Data, and PROD just augments (rather than replicates) it.

Poking data.gov.uk for a day

Posted on January 23, 2010 by Wilbert Kraan

data.gov.uk – a portal for UK governmental open data – is clearly a fantastic idea, and has the makings of a real treasure trove. But estimating how far along it is in practice means picking up a stick and poking the beast.

Which data

What it’s all about. To my unscientific eyeballing, the existence of most every dataset of interest is flagged. You can find them too, with some patience and use of the tags. Clicking through to the source quickly shows that most aren’t available to us yet in any shape or form, though. That’s fine, I’m sure something like that just takes time.

What would be good, though, is some indication in the navigation or search of which ones are available, where available means: queryable through the SPARQL interface, or downloadable as an RDF dump.

Getting at the data

The main means through which you get at the data is via the SPARQL query service provided. That works, but has some quirks:

From experience, the convention for such services would be that you’d provide one at http://mysite.org/sparql, and that you’d get a human SPARQL query form if you go there with a browser, and you get an answer if you fire a SPARQL protocol query at it from the comfort of your own SPARQL client.

There is a human interface at https://www.data.gov.uk/sparql, but machine queries need to go someplace else: http://services.data.gov.uk/{optional dataset}/sparql. Confusingly, the latter has a basic but very nice human interface too…
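For the record, firing a SPARQL protocol query from a client amounts to little more than an HTTP GET with the query text in a `query` parameter. A minimal sketch in Python, using only the standard library; the endpoint is the placeholder address from the convention above, the helper name is my own, and actually sending the request (with urllib.request.urlopen, say) is left out:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint, query,
                         accept="application/sparql-results+json"):
    """Build (but do not send) a SPARQL protocol GET request."""
    # The protocol carries the query text in the 'query' parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header asks the service which result format to return.
    return Request(url, headers={"Accept": accept})


if __name__ == "__main__":
    request = build_sparql_request(
        "http://mysite.org/sparql",  # placeholder endpoint
        "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    )
    print(request.get_method(), request.full_url)
```

Whether a given service honours the Accept header, or wants the format spelled out in some parameter of its own, is something to check per service.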

Of more concern is that both spit out either JSON (in theory) or XML, but not RDF. Including XML and JSON is very sensible indeed, because the greatest number of people can suck those formats up and stick them in the mashups they know and love. But for the promises of linked data to work, there is an absolute need for some kind of RDF output.
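To show why the JSON is so mashup friendly, and also why it is not the same thing as RDF, here is a rough sketch that flattens a result in the standard SPARQL JSON results layout into plain rows. The sample document and its values are invented; only the head/vars and results/bindings structure is taken from the format itself.

```python
import json

# Invented sample in the SPARQL JSON results layout.
SAMPLE = """
{
  "head": {"vars": ["school", "name"]},
  "results": {"bindings": [
    {"school": {"type": "uri", "value": "http://example.org/school/1"},
     "name": {"type": "literal", "value": "Example Primary"}},
    {"school": {"type": "uri", "value": "http://example.org/school/2"}}
  ]}
}
"""


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({
            v: (binding[v]["value"] if v in binding else None)
            for v in variables
        })
    return rows


if __name__ == "__main__":
    for row in rows_from_sparql_json(SAMPLE):
        print(row)
```

What comes out is a table of variable bindings. That is handy for a map or a chart, but it is not a graph of triples that you can load straight into another store.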

Exploring the data

In order to formulate the whizzy linked data queries that this stuff is all about, you have to get a feel for what an open data set and its vocabulary looks like. That is a bit lacking as well at the moment: you can ask the SPARQL endpoint for types, and then keep poking, but making the whole thing browsable on something like the OpenLink RDF browser would be even better. I stumbled on some ways of exploring vocabularies, but that didn’t seem to allow navigation between concepts just yet.
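The poking itself tends to follow a pattern: first ask which types exist, then ask which properties things of one type carry. A small sketch of helpers that generate those two queries; the function names and the example type URI are mine, and the queries are generic SPARQL rather than anything specific to data.gov.uk.

```python
def types_query(limit=50):
    """Query listing the distinct types used in a data set."""
    return "SELECT DISTINCT ?type WHERE { ?thing a ?type } LIMIT %d" % limit


def properties_query(type_uri, limit=50):
    """Query listing the distinct properties used on things of one type."""
    # Refuse characters that would break out of the <...> IRI syntax.
    if any(ch in type_uri for ch in '<>" '):
        raise ValueError("not a plain IRI: %r" % type_uri)
    return (
        "SELECT DISTINCT ?property WHERE { ?thing a <%s> ; ?property ?value } "
        "LIMIT %d" % (type_uri, limit)
    )


if __name__ == "__main__":
    print(types_query())
    # Hypothetical type URI, for illustration only.
    print(properties_query("http://example.org/def/School"))
```

On a large store the unconstrained types query can be slow, which is one more reason a browsable view would be welcome.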

There is a forum and wiki where people can cooperate on how to work on these issues, but I couldn’t see how to join them, hence the post here.

So?

If you’re getting rather disappointed by now: don’t. As far as I can see, the underlying platform is easily capable of addressing each of the points I stumbled across. More importantly, all the pieces are there for something truly compelling: freely mixable open data. It’s just that not all the pieces have been put together yet. The roof has been pitched, but the rest still needs doing.

Wilbert Kraan

Cetis blog

Category Archives: cloud computing
