Inspired by the VirtualDutch timeline, I wondered how easy it would be to create something similar with all JISC e-learning projects that I could get linked data for. It worked, and I learned some home truths about scalability and the web architecture in the process.
As Lorna pointed out, UCL’s VirtualDutch timeline is a wonderful example of using time to explore a dataset. Since I’d already done ‘place’ in my previous meshup, I thought I’d see how close I could get with £0.- and a few nights’ tinkering.
The plan was to make a timeline of all JISC e-learning projects, and of the developmental and affinity relations between them that are recorded in CETIS’ PROD database of JISC projects, more or less as Scott Wilson and I did by hand in a report about the toolkits and demonstrators programme. This didn’t quite work: SPARQLing up the data is trivial, but those kinds of relations don’t fit well into the widely used Simile timeline, from both a technical and a usability point of view. I’ll try again with some other diagram type.
What I do have is one very simple, and one not so simple, timeline for e-learning projects that get their data from the intensely useful RKB Explorer Linked Data knowledge base of JISC projects and my own private PROD stash:
Recipe for the simple one
- Go to the SPARQL proxy web service
- Tell it to ask this question (a sketch of the general shape of such a query follows after this recipe)
- Tell the proxy to ask that question of the RKB by pointing it at the RKB SPARQL endpoint (http://jisc.rkbexplorer.com/sparql/), and to return the results as CSV
- Copy the URL of the CSV results page that the proxy gives you
- Stick that URL in a ‘=ImportData(“{yourURL}”)’ function inside a fresh Google Spreadsheet
- Insert timeline gadget into Google spreadsheet, and give it the right spreadsheet range
- Hey presto, one timeline coming up:
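For a flavour of what such a question looks like, here is a sketch along the lines of the query I used. The class and property names are placeholders rather than the actual RKB Explorer vocabulary, so treat it as an illustration of the shape of the query, not the real thing:

    # Illustrative sketch only: 'ex:' stands in for the real RKB Explorer
    # vocabulary; the actual query is the one linked in the recipe above.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/schema#>

    SELECT ?project ?label ?start ?end
    WHERE {
      ?project a ex:Project ;
               rdfs:label ?label ;
               ex:startDate ?start ;
               ex:endDate ?end .
    }
    ORDER BY ?start

The proxy serves the results of a question like that as CSV at a stable URL, which is exactly what the spreadsheet’s ImportData function wants.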
Recipe for the not so simple one
For this one, I wanted to stick a bit more in the project ‘bubbles’, and I also wanted the links in the bubbles to point to the PROD pages rather than the RKB resource pages. Trouble is, RKB doesn’t know about the data in PROD and, for some very sound reasons I’ll come to in a minute, it won’t allow external datasets to be pulled in via SPARQL’s ‘FROM’ operator either. All the other SPARQL endpoints on the web that I know of that do allow FROM couldn’t handle my query: they either hung or conked out. So I did this instead:
- Download and install your own SPARQL endpoint (I like Leigh Dodds’ simple but powerful Twinkle)
- Feed it this query (again, a rough sketch of its shape follows after this recipe)
- Copy and paste the results into a spreadsheet and fiddle with concatenation
- Hoist spreadsheet into Google docs
- Insert timeline gadget into Google spreadsheet, and give it the right spreadsheet range
- Hey presto, a more complex timeline:
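Again, the real query is the one linked in the recipe; as a rough sketch of its shape, the trick is to let Twinkle itself pull in and merge the two datasets with FROM rather than asking a public endpoint to do it. All the graph URLs and vocabulary below are placeholders:

    # Rough sketch only: the graph URLs and the 'ex:' and 'prod:' terms are
    # placeholders, not the real RKB Explorer or PROD vocabularies.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/schema#>
    PREFIX prod: <http://example.org/prod/schema#>

    SELECT ?project ?label ?start ?end ?prodPage
    FROM <http://example.org/dumps/rkb-jisc-projects.rdf>   # RKB data, fetched by Twinkle
    FROM <file:///home/me/prod-export.rdf>                  # local PROD export
    WHERE {
      ?project a ex:Project ;
               rdfs:label ?label ;
               ex:startDate ?start ;
               ex:endDate ?end .
      OPTIONAL { ?project prod:prodPage ?prodPage . }
    }
    ORDER BY ?start

The links to the PROD pages and the extra detail for the bubbles come out of the OPTIONAL pattern and the concatenation fiddling in the spreadsheet.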
It’s Linked Data on the Web, stupid
When I first started meshing up with real data sets on the web (as opposed to poking my own triple store), I had this inarticulate idea that SPARQL endpoints were these magic oracles that we could ask anything about anything. And then you notice that there is no federated search facility built into the language. None. And that the most obvious way of querying across more than one dataset – pulling in datasets from outside via SPARQL’s FROM – is not allowed by many SPARQL endpoints. And that if they do allow FROM, they frequently cr*p out.
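To be concrete about the pattern I mean, it is the one where the endpoint, rather than the client, is asked to fetch and merge somebody else’s data at query time, along these lines (the dataset URL is purely illustrative):

    # The kind of request many public endpoints refuse: fetch and merge an
    # external dataset on the fly. The dataset URL is purely illustrative.
    SELECT ?s ?p ?o
    FROM <http://example.org/somebody-elses-dataset.rdf>
    WHERE { ?s ?p ?o }

The FROM in the Twinkle recipe above looks the same, but there it is my own machine that does the fetching and merging.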
The simple reason behind all this is that federated search doesn’t scale. The web does. Linked Data is also known as the Web of Data for that reason: it has the same architecture. SPARQL queries are computationally expensive at the best of times, and federated SPARQL queries would be exponentially so. It’s easy to come up with a SPARQL query that takes a long time, floors a massive server (or cloud), or simply fails.
That’s a problem if you think every server needs to handle lots of concurrent queries all the time, especially if it depends on other servers on a (creaky) network to satisfy those queries. By contrast, chomping on the occasional single query is trivial for a modern PC, just like parsing and rendering big, complex HTML pages is perfectly possible on a poky phone these days. By the same token, serving a few big gobs of (RDF/XML) text that sit at the end of a sensible URL is an art that servers have perfected over the past 15 years.
The consequence is that exposing a data set as Linked Data is not so much a matter of installing a SPARQL endpoint as of serving sensibly factored datasets in RDF with cool URLs, as outlined in Designing URI Sets for the UK Public Sector (pdf). That way, servers can easily satisfy all comers without breaking a sweat, which is important if you want every data provider to play in Linked Data space. At the same time, consumers can ask whatever they want, without constraint. If they ask queries that are too complex or require too much data, they can either beef up their own machines, or live with long delays and the occasional dead SPARQL engine.
Well done, another very useful demo! I can only claim to vaguely grasp the technicalities here; however, I wonder if you could say a little more about SPARQL functionality (or not, as the case may be!). I’ve been reading a lot of posts recently about the necessity, or not, of SPARQL for data to become Linked Data, and I wondered what your opinion was. In a recent blog post Tony Hirst has suggested that SPARQL doesn’t cut it. What do you think? Can you have Linked Data without a SPARQL endpoint?
SPARQL and RDF are a sine qua non of Linked Data, IMHO. You can keep the label, widen the definition out, and include other things, but then I’d have to find another label for what I’m interested in here.
What I’m arguing here is that data _providers_ should focus on nicely factored graphs and sensible URIs first, and worry about SPARQL endpoints later. From a _consumer’s_ point of view, though, SPARQL endpoints are the place where you start.
In that sense, there is also a matter of developer preference and experience. Tony Hirst is the grand wizard of Yahoo Pipes. I don’t really get that stuff; it feels like trying to do calligraphy with boxing gloves on. And that’s before considering the fact that I’m reluctant to sink a lot of time into a proprietary platform.
On the other hand, I find it very easy and natural to think in RDF and SPARQL, but I guess not everyone’s brain is wired that way.
Not to fret, though: the simple meshup quickly goes from the results of a SPARQL query to formats more amenable to Tony’s mashups, namely CSV and Google spreadsheets. JSON and XML are just as easy.
Thanks Wilbert, that’s very helpful indeed. More food for thought! I’ll come back to you about this via other channels shortly.
Pingback: Lorna’s JISC CETIS blog » When is Linked Data not Linked Data? - A summary of the debate
Hey Wilbert,
A few pointers that might be interesting to you:
You may want to have a look at my slides about “Querying Linked Data with SPARQL”. These slides list different options for querying Linked Data, including the pros and cons. This slideset is from the Consuming Linked Data tutorial we gave at last year’s ISWC. At the WWW conference in April we will give a similar tutorial again.
I’m working on a novel query execution paradigm called link traversal based query execution that makes it possible to execute SPARQL queries over the Web of Linked Data (at least over connected parts of it). The idea is to discover data relevant to answering a query by following specific RDF links during the query execution itself. All that is required of publishers is adherence to the Linked Data principles; i.e. there is no requirement for SPARQL endpoints. You can read about link traversal based query execution in my ISWC’09 paper “Executing SPARQL Queries over the Web of Linked Data”. A query engine that implements this approach is part of the Semantic Web Client Library. On top of this library I implemented SQUIN, which provides the functionality of the query engine as a simple Web service that can be accessed like an ordinary SPARQL endpoint.
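To give a flavour of what this looks like from the outside: the engine simply takes an ordinary SPARQL query, for example a toy one like the sketch below (the starting URI is just a placeholder), and answers it by dereferencing the URIs it discovers during execution instead of contacting any endpoint.

    # Toy illustration only; the starting URI is a placeholder. A link
    # traversing engine would dereference it, follow foaf:knows links found
    # in the retrieved data, and continue until the pattern is answered.
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?name
    WHERE {
      <http://example.org/people/alice#me> foaf:knows ?friend .
      ?friend foaf:name ?name .
    }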
Greetings,
Olaf
Hi Olaf,
Many thanks for the links; there’s a lot of interesting and highly usable stuff there.
I like the SQUIN approach, since it solves the data source discovery and selection issue, and also follows the ‘smart, hard-working client & dumb, data shoveling server’ approach I’ve tried to outline here.
As you’ve indicated, though, the SQUIN approach seems best suited to discovery-oriented queries. For queries where you know the datasets and just need up-to-date results, a ‘named graphs as RDF dumps’ approach seems fine to me so far. I know that’s technically a local data replication option, but it seems a lot easier, quicker and more timely (depending on how often the client refreshes the dumps) than other local data replication methods.
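Concretely, I mean no more than loading the known dumps as named graphs on my own machine and querying across them, roughly like this (the dump URLs and the Project class are placeholders, not real locations or vocabulary):

    # Sketch of the 'named graphs as RDF dumps' idea; the dump URLs and the
    # Project class are placeholders.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/schema#>

    SELECT ?project ?label ?source
    FROM NAMED <http://example.org/dumps/rkb-jisc.rdf>
    FROM NAMED <http://example.org/dumps/prod.rdf>
    WHERE {
      GRAPH ?source {
        ?project a ex:Project ;
                 rdfs:label ?label .
      }
    }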
Cheers!
Wilbert
Pingback: Wilbert’s work blog» Blog Archive » How to meshup eportfolios, learning outcomes and learning resources using Linked Data, and why
Pingback: Sheila’s work blog » Talis platform day
Pingback: OER Visualisation Project: Timelines, timelines, timelines [day 30] #ukoer – MASHe