
Ebook: Federated Query Processing for the Semantic Web

In recent years, the amount of RDF data on the Web exposed via SPARQL endpoints has increased exponentially. These endpoints allow users to run SPARQL queries directly against the RDF data. Federated SPARQL query processing makes it possible to query several of these RDF databases as if they were a single one, integrating the results from all of them. This is a key concept in the Web of Data and a hot topic in the community. In addition, the W3C SPARQL-WG has standardized it in the new SPARQL 1.1 Recommendation.
This book provides a formalization of the W3C proposed recommendation. This formalization makes it possible to identify existing errors and correct them before the implementation phase, or when the execution of these federated queries starts. The book is a valuable resource for any implementer, since it also proposes solutions to the problems identified, together with a set of SPARQL pattern reordering rules that significantly reduce the execution time of federated queries.
Another strong point of this book is its research methodology. It clearly states the problems in the state of the art, then defines the research hypothesis, and then provides a thorough analysis of the semantics of the SPARQL 1.1 specification. Once the theoretical part is concluded, the book moves on to the implementation, clearly describing the implementation decisions, and finally evaluates the overall system.
Recent years have witnessed constant growth in the amount of RDF data available on the Web. This growth is largely driven by the increasing rate of data publication on the Web by different actors such as governments, life science researchers or geographical institutes. RDF data is mainly generated by converting existing legacy data resources into RDF (e.g. converting data stored in relational databases), but also by creating RDF data directly (e.g. from sensors). These RDF data are normally exposed by means of Linked Data-enabled URIs and SPARQL endpoints.
Given the sustained growth in the number of SPARQL endpoints available, the need to send federated SPARQL queries across them has also grown. Tools for accessing sets of RDF data repositories are starting to appear. They differ in how they let users access these data: some let users specify directly which RDF data set they want to query, while others make this process transparent to them. To overcome this heterogeneity in federated query processing solutions, the W3C SPARQL working group is defining a federation extension for SPARQL 1.1, which allows combining, in a single query, graph patterns that can be evaluated at several endpoints.
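As an illustration of the federation extension described above, the following minimal sketch assembles a SPARQL 1.1 query in which each group of graph patterns is wrapped in a SERVICE clause and so delegated to a different endpoint. The endpoint URLs, predicates, and helper names (`service_block`, `federated_query`) are hypothetical and chosen only for this example; they do not come from the book or from SPARQL-DQP.

```python
def service_block(endpoint_url: str, patterns: list[str]) -> str:
    """Wrap triple patterns in a SPARQL 1.1 SERVICE clause for one endpoint."""
    body = "\n".join(f"    {p} ." for p in patterns)
    return f"  SERVICE <{endpoint_url}> {{\n{body}\n  }}"


def federated_query(variables: list[str], blocks: dict[str, list[str]]) -> str:
    """Build a SELECT query whose graph patterns are split across endpoints."""
    projection = " ".join(f"?{v}" for v in variables)
    services = "\n".join(service_block(url, pats) for url, pats in blocks.items())
    return f"SELECT {projection} WHERE {{\n{services}\n}}"


# Hypothetical endpoints and predicates, for illustration only.
query = federated_query(
    ["drug", "gene"],
    {
        "http://example.org/drugs/sparql": ["?drug <http://example.org/targets> ?gene"],
        "http://example.org/genes/sparql": ["?gene <http://example.org/label> ?label"],
    },
)
print(query)
```

The shared variable `?gene` is what joins the results returned by the two endpoints. The order in which the SERVICE blocks are evaluated is left to the query processor.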
In this PhD thesis, we describe the syntax of that SPARQL extension for providing access to distributed RDF data sets and formalise its semantics. We adapt existing techniques for distributed data access in relational databases to deal with SPARQL endpoints, and we have implemented them in our federated query evaluation system (SPARQL-DQP). We describe the static optimisation techniques that we implemented in our system and carry out a series of experiments showing that our optimisations significantly speed up query evaluation in the presence of large query results and optional operators.