
Ebook: Reasoning Techniques for the Web of Data

Linked Data publishing has brought about a novel “Web of Data”: a wealth of diverse, interlinked, structured data published on the Web. These Linked Datasets are described using the Semantic Web standards and are openly available to all, produced by governments, businesses, communities and academia alike. However, the heterogeneity of such data – in terms of how resources are described and identified – poses major challenges to potential consumers.
This book examines use cases for pragmatic, lightweight reasoning techniques that leverage Web vocabularies (described in RDFS and OWL) to better integrate large-scale, diverse Linked Data corpora. It takes a test corpus of 1.1 billion RDF statements collected from 4 million RDF Web documents and analyses how RDFS and OWL are used therein. The next part of the book details and evaluates scalable, distributed techniques for applying rule-based materialisation to translate data between different vocabularies, and for resolving coreferent resources: those that refer to the same thing. It shows how such techniques can be made robust in the face of noisy and often impudent Web data. It also examines a use case for incorporating a PageRank-style algorithm to rank the trustworthiness of facts produced by reasoning, subsequently using those ranks to fix formal contradictions in the data. All of the methods are validated against our real-world, large-scale, open-domain Linked Data evaluation corpus.
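To give a flavour of what rule-based materialisation means in practice, the following is a minimal sketch of one RDFS entailment rule (rdfs9: subclass inheritance) applied to a fixpoint over a toy set of triples. The data and prefixed names are purely illustrative, not drawn from the evaluation corpus, and real systems apply the full RDFS/OWL rule sets with distributed, scalable machinery rather than this naive in-memory loop.

```python
# Minimal sketch of rule-based materialisation using one RDFS rule:
# (?s rdf:type ?c1), (?c1 rdfs:subClassOf ?c2)  =>  (?s rdf:type ?c2)

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialise(triples):
    """Apply the rdfs9 rule until no new facts are derived (fixpoint)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Collect the current subclass assertions.
        subclass = [(s, o) for s, p, o in facts if p == SUBCLASS]
        for s, p, o in list(facts):
            if p == RDF_TYPE:
                for c1, c2 in subclass:
                    if o == c1 and (s, RDF_TYPE, c2) not in facts:
                        facts.add((s, RDF_TYPE, c2))
                        changed = True
    return facts

# Hypothetical example data (identifiers are invented for illustration):
data = {
    ("ex:Tim", RDF_TYPE, "ex:Researcher"),
    ("ex:Researcher", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
}
inferred = materialise(data) - data
# Derives ex:Tim rdf:type ex:Person, and transitively ex:Tim rdf:type ex:Agent.
```

The fixpoint loop is what makes the inference transitive: the fact derived in one pass (Tim is a Person) feeds the next pass (so Tim is also an Agent). Materialising such inferences up front is what allows consumers to query the data without reasoning at query time.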
This book is based on a thesis I submitted to the National University of Ireland, Galway in pursuance of a Doctor of Philosophy in February 2011. (I passed!) The thesis was originally titled “Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora”. My supervisor was Axel Polleres, my internal examiner was Stefan Decker and my external examiner was Jim Hendler. In this version I've ironed out a few kinks and made some corrections. I have also abbreviated some parts to maintain a reasonable length.
Returning to work on a thesis after three years can make one philosophical about the PhD process. I don't wish to get philosophical here except to say that even in the darkest days of my PhD, I was driven by the idea that the Web has not come close to realising its full potential yet, and that perhaps I could influence, in some small but meaningful way, what the Web will look like in ten, twenty or fifty years' time. (What will it look like?)
For better or worse, I bet my early twenties on investigating methods by which deductive reasoning can be applied over lots of diverse Web data without making an enormous mess, and this thesis was the result. Automated deductive reasoning techniques can help machines better understand and process the content of the Web, leading to new applications. But needless to say, it was and still is an ambitious goal.
Three years the wiser, I concede that the work presented herein has yet to revolutionise the Web. But I learnt a lot from the PhD process. And I would like to think that, in its own unique way, this work demonstrates that the original vision of the Semantic Web, with “altruistic” machines automatically acting upon their “deep understanding” of Web content, is not science fiction. Machines can certainly do useful things by reasoning over Web data. The only question is to what extent such reasoning will be needed by the future Web of ten, twenty or fifty years' time.