Ebook: Query Processing over Graph-structured Data on the Web
In recent years, Linked Data initiatives have encouraged the publication of large graph-structured datasets using the Resource Description Framework (RDF). Due to the constant growth of RDF data on the web, more flexible data management infrastructures are needed to efficiently and effectively exploit the vast amount of knowledge accessible on the web.
This book presents flexible query processing strategies over RDF graphs on the web using the SPARQL query language. In this work, we show how query engines can change plans on-the-fly with adaptive techniques to cope with unpredictable conditions and to reduce execution time. Furthermore, this work investigates the application of crowdsourcing in query processing, where engines are able to contact humans to enhance the quality of query answers.
The theoretical and empirical results presented in this book indicate that flexible techniques allow for querying RDF data sources efficiently and effectively.
Linked Data initiatives have encouraged the publication of large datasets on the Web. As a result, a huge dataspace known as the Linked Open Data (LOD) Cloud has emerged, where data is represented using the graph-based data model RDF and can be queried using the SPARQL language. To support querying capabilities over Linked Data sets, web access interfaces such as SPARQL endpoints or Triple Pattern Fragment (TPF) servers have been deployed. TPF servers support the evaluation of single triple patterns and have recently been proposed as a highly available mechanism to query RDF data online. Despite these developments, the web-like characteristics of RDF sources pose fundamental challenges to Linked Data management that affect the efficiency and effectiveness of query processing engines over autonomous and remote RDF datasets.
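To make the notion of evaluating a single triple pattern concrete, here is a minimal Python sketch over an in-memory list of triples. It is an illustration only, not a TPF server implementation: the names `match_triple_pattern`, `fragment`, `EXAMPLE_TRIPLES`, and the `page_size` parameter are hypothetical, the example data is made up, and a real TPF server answers over HTTP with paged results and metadata.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
# A pattern position set to None acts as a variable (matches anything).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Made-up example data, used only for illustration.
EXAMPLE_TRIPLES: List[Triple] = [
    ("Berlin", "type", "City"),
    ("Bonn", "type", "City"),
    ("Berlin", "locatedIn", "Germany"),
]


def match_triple_pattern(triples: List[Triple], pattern: Pattern) -> List[Triple]:
    """Return all triples whose constant positions agree with the pattern."""
    return [
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


def fragment(triples: List[Triple], pattern: Pattern, page_size: int = 1) -> dict:
    """Return one page of matches plus the total match count as metadata."""
    matches = match_triple_pattern(triples, pattern)
    return {"count": len(matches), "page": matches[:page_size]}
```

Calling `fragment(EXAMPLE_TRIPLES, (None, "type", "City"))` returns the first page of matching triples together with a count, which is the kind of per-pattern statistic a client-side engine could use when planning a query.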
Regarding efficient query processing, the lack of statistics about selectivities and data distributions, as well as unpredictable data transfer rates and server workload, can negatively impact the performance of query engines that consume Linked Data, even in the presence of the innovative querying capabilities offered by TPF servers. This problem arises mainly because existing SPARQL engines execute fixed plans following the traditional optimize-then-execute paradigm, instead of following adaptive strategies that adjust query execution to unexpected runtime conditions. To tackle this problem, in this thesis we present an adaptive SPARQL query engine tailored to execute queries against TPFs. Our solution exploits the statistics provided by TPFs during query optimization to devise effective plans quickly. The plans are executed by our adaptive engine, which is able to change query execution schedulers to reduce query runtime. The results of our empirical studies indicate that our solution outperforms static web query schedulers in scenarios with unpredictable transfer delays or data distributions, and also provide novel insights about the tradeoffs of different adaptive strategies when evaluating selective and non-selective queries.
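As a rough illustration of how per-pattern statistics can guide plan construction, the following Python sketch orders triple patterns greedily by estimated match count while preferring patterns that share a variable with those already chosen. This is a generic heuristic under stated assumptions, not the optimizer or the adaptive scheduler described in the thesis; `greedy_order`, `variables`, and the example counts are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Terms that start with "?" are treated as variables (an assumption of this sketch).
Pattern = Tuple[str, str, str]


def variables(pattern: Pattern) -> Set[str]:
    """Collect the variable names that occur in a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def greedy_order(counts: Dict[Pattern, int]) -> List[Pattern]:
    """Order patterns by ascending estimated count, avoiding Cartesian products.

    At each step, only patterns sharing a variable with the plan so far are
    considered; if there are none, all remaining patterns are candidates.
    """
    remaining = dict(counts)
    plan: List[Pattern] = []
    bound: Set[str] = set()
    while remaining:
        connected = [p for p in remaining if variables(p) & bound]
        candidates = connected or list(remaining)
        best = min(candidates, key=lambda p: remaining[p])
        plan.append(best)
        bound |= variables(best)
        del remaining[best]
    return plan


# Made-up count estimates, used only for illustration.
EXAMPLE_COUNTS: Dict[Pattern, int] = {
    ("?city", "type", "City"): 1000,
    ("?city", "locatedIn", "?country"): 5000,
    ("?country", "name", "Germany"): 1,
}
```

With `EXAMPLE_COUNTS`, the most selective pattern is placed first and the larger `locatedIn` pattern follows because it is the only one connected to `?country`. A plan built this way is still fixed before execution; an adaptive engine would additionally revise its decisions as runtime conditions change.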
An orthogonal but equally important aspect of querying Linked Data is the quality of the retrieved data. Recent studies reveal that RDF datasets exhibit varying quality in different dimensions including completeness, semantic validity, and semantic accuracy. Moreover, the semi-structured nature of RDF data makes it very hard to assess the quality of datasets up front. Executing SPARQL queries against data with quality issues leads to low-quality and even incomplete results. To overcome similar challenges in structured databases, state-of-the-art solutions have investigated a hybrid paradigm in which the contribution of human crowds is integrated into query processing to enhance the quality of the query results. Based on these findings, we propose a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. Our solution implements a query engine that relies on the graph structure of RDF data to decide on the fly which parts of a SPARQL query should be executed against a dataset or via crowdsourcing. Our engine encodes the knowledge collected from the crowd as fuzzy RDF graphs, which are exploited in subsequent query executions. We empirically evaluated the performance of our solution, and the experimental results show that our engine is able to enhance the completeness of SPARQL query answers while retrieving correct answers from the crowd. Furthermore, we conducted an extensive empirical analysis to study the applicability of crowdsourcing to detect different quality issues in Linked Data. We compared the performance of combining experts and lay users in two different workflows. Our results indicate that crowdsourcing is also a feasible solution to detect low-quality statements in Linked Data sets and that both types of crowds exhibit complementary skills when assessing different quality issues.
In summary, the main contribution of this thesis is the definition of flexible query processing strategies over RDF graphs on the web. In this thesis, we show how query engines can change plans on-the-fly with adaptive techniques to reduce execution time or even contact humans to enhance the quality of query answers. Due to the constant growth of graph-structured data on the web, more flexible data management infrastructures are required in order to be able to efficiently and effectively exploit the vast amount of knowledge accessible on the web.