Rather than the computer science foundations, the annual semantic technology conference in San Jose focuses on practical applications of the approach. Visitor numbers are growing at a ‘double digit’ clip, and the vendor roster is starting to include big names such as Oracle. We take a look.
It seems we’re through the trough of disillusionment over the fact that the semantic web, as outlined by Tim Berners-Lee in 1999, has not exactly materialised (yet). It’s not news that we do not all have intelligent agents that can seek out all kinds of data on the ‘net and integrate it to satisfy our specific queries and desires. What we do have is a couple of interesting and productive approaches to mixing and matching disparate information that hint at a slope of enlightenment, heading towards a plateau of productivity.
Easily the clearest example from the conference is the case of Blue Shield of California, a sizeable health care insurer in the US. They faced the familiar problem of a pile of legacy applications with custom connections, required to do things they were never designed to do. In particular, customer and policy data (core data, clearly) were spread over two systems of different vintage, making a unified view very difficult.
In contrast to what you might expect, the solution they built leaves the data in the existing databases: nothing is replicated in a separate store. Instead, the integration takes place in a ‘semantic layer’. In essence, that means users can ask fairly complex and detailed questions of any information held in any system, in terms of a set of models of the business. These queries end up at the same old systems, where they are mapped from semantic web query form into good old SQL database queries.
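To make that mapping concrete, here is a minimal sketch of what such a translation might look like. The query vocabulary (a hypothetical ex: business ontology) and the table and column names are assumptions for illustration, not BBN’s actual mapping:

    # A SPARQL query phrased against the business model...
    PREFIX ex: <http://example.org/business#>
    SELECT ?name ?policyNumber WHERE {
      ?customer a ex:Customer ;
                ex:name ?name ;
                ex:holdsPolicy ?policy .
      ?policy ex:policyNumber ?policyNumber ;
              ex:status "active" .
    }

    -- ...which the semantic layer might rewrite into SQL against
    -- one of the legacy schemas (tables and joins are assumptions):
    SELECT c.full_name, p.policy_no
    FROM customers c
    JOIN policies p ON p.customer_id = c.id
    WHERE p.status = 'active';

The point is that the consumer only ever sees the business vocabulary; which system answers, and in what schema, is the layer’s problem.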
This kind of approach doesn’t look all that different from the Enterprise Service Bus (ESB) in principle, but it takes the insulation of service consumers from the details of service providers rather further. Service consumers in a semantic layer have just one API for any kind of data (the W3C’s SPARQL query language) and one data model (RDF, though output in XML or JSON is common). Most importantly, the meaning of the data is modelled in a set of ontologies, not in the documentation of the service providers or the heads of their developers.
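As a rough illustration of what that single data model looks like, the RDF below (in Turtle notation) shows how customer records from two differently shaped legacy systems could surface as instances of one ontology class. The ex: vocabulary and the identifiers are made up for the example:

    @prefix ex: <http://example.org/business#> .

    # The same business concept, regardless of which system holds it
    <urn:example:crm:cust-4711>       a ex:Customer ; ex:name "J. Doe" .
    <urn:example:mainframe:rec-0582>  a ex:Customer ; ex:name "A. Smith" .

A consumer querying for ex:Customer gets both, without knowing or caring that one lives in a CRM package and the other on a mainframe.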
While the Blue Shield of California system was built by BBN, other vendors exhibiting in San Jose offer similar approaches, often built on open source components. The most eye-catching of those components (and also used by BBN) is NetKernel: the overachieving offspring of the web and Unix. It’s not semantic tech per se, but rather a language-agnostic application environment that competes with J2EE.
Away from the enterprise, and more towards the webby (and educational!) end of things, the state of semantic technology becomes less clear. There are big web apps such as the Twine social network, where the triples work very much in the background, and Powerset, where they are much more in your face, but to little apparent effect.
Much less polished, but much, much more interesting, is dbpedia.org: an attempt to put all public domain knowledge in a triple store. Yes, that includes Wikipedia, but also the US census and much more. DBpedia is accessible via a number of interfaces, including SPARQL. The result is the closest thing yet to a live instance of the semantic web as originally conceived, where it really is possible to ask questions like “give me all football players with number 11 shirts from clubs with stadiums with more than 40,000 seats, who were born in a country with more than 10M inhabitants”. Because of the inherent flexibility of a triple store and the power of the SPARQL interface, DBpedia could end up powering all kinds of other web applications and user interfaces.
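That question translates quite directly into SPARQL. The sketch below uses property names from the DBpedia ontology as I understand it; the exact terms have shifted over time, so treat them as approximations rather than a copy-and-paste recipe:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?player WHERE {
      ?player a dbo:SoccerPlayer ;
              dbo:number 11 ;            # shirt number
              dbo:team ?club ;
              dbo:birthPlace ?country .
      ?club dbo:ground ?stadium .
      ?stadium dbo:seatingCapacity ?seats .
      ?country a dbo:Country ;
               dbo:populationTotal ?population .
      FILTER (?seats > 40000 && ?population > 10000000)
    }

Note how the query hops across players, clubs, stadiums and countries in one go: exactly the kind of join across datasets that would be painful to set up anywhere else.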
Nearer to a real semantic web, though, is Yahoo’s well-publicised move to start supporting the relevant standards. While the effect isn’t yet as obvious as semantic integration in the enterprise or DBpedia, it could end up being significant, for the simple reason that it focuses on the organisational model. It does that by using data in various ‘semantic web light’ formats, embedded in web pages, to structure and present search results. If you want to present a nice set of handles on your site’s content in a Yahoo search results page (links to maps, contact info, schedules and so on), it’s time to start using RDFa or microformats.
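For instance, a contact block marked up with RDFa might look something like the snippet below, using the W3C vCard vocabulary. Which vocabularies SearchMonkey actually picks up is Yahoo’s choice, so take this as a sketch of the technique rather than a recipe:

    <!-- The property attributes make the visible text machine-readable too -->
    <div xmlns:v="http://www.w3.org/2006/vcard/ns#" typeof="v:VCard">
      <span property="v:fn">Example Clinic</span>,
      <span property="v:street-address">1 Main Street</span>,
      <span property="v:locality">San Jose</span>.
      Phone: <span property="v:tel">+1 408 555 0100</span>
    </div>

The markup costs next to nothing, and the same human-readable page doubles as structured data for any crawler that understands RDFa.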
Beyond the specifics of semantic standards or technologies at this point in time, though, lies the increasing demand for such tools. The volume and heterogeneity of data are increasing rapidly, not least because the means of fishing structured data out of unstructured data are improving. At the same time, the format of structured data (its syntax) is much less of an issue than it once was, as are the means of shifting that data around. What remains is making sense of that data, and that requires semantic technology.
Resources
The semantic technology conference site gives an overview, but no presentations, alas.
The Blue Shield of California architecture was built with BBN’s Asio tool suite.
More about the NetKernel resource-oriented computing platform can be found on the 1060 website.
Twine is still in private beta, but will open up in overhauled form in the autumn.
Powerset is Wikipedia with added semantic sauce.
DBpedia is the community effort to gather all public domain knowledge in a triple store. There’s a page that outlines all the ways of accessing it over the web.
Yahoo’s SearchMonkey is the project to utilise semweb light specs in search results.