Preface
During recent years a set of best practices for publishing and connecting structured data on the World Wide Web (WWW) has emerged. These best practices are referred to as the Linked Data principles and the resulting form of Web data is called Linked Data. The increasing adoption of these principles has lead to the creation of a globally distributed space of Linked Data that covers various domains such as government, libraries, life sciences, and media. Approaches that conceive this data space as a huge distributed database and enable an execution of declarative queries over this database hold an enormous potential; they allow users to benefit from a virtually unbounded set of up-to-date data. As a consequence, several research groups have started to study such approaches. However, the main focus of existing work is to address practical challenges that arise in this context. Research on the foundations of such approaches is largely missing. This dissertation closes this gap.
This dissertation first establishes a well-defined framework for defining and studying queries over Linked Data on the WWW. In particular, we introduce a data model that enables us to formally conceive Linked Data on the WWW as a (distributed) database and a computation model that captures the capabilities of a query execution system for this database. Based on these models, we adapt the declarative query language SPARQL to the given scenario. More precisely, we define a full-Web query semantics and a family of reachability-based query semantics such that each of these query semantics presents a well-defined basis for using SPARQL to query Linked Data on the WWW. Thereafter, we show theoretical properties of queries under these query semantics. Perhaps the most important result of this study formally verifies the common assumption that a computation of query results that are complete w.r.t. all Linked Data on the WWW, is not feasible. However, we also identify classes of queries for which the computational feasibility is less limited.
After analyzing queries over Linked Data independent of specific approaches for executing such queries, this dissertation focuses on a particular execution approach and studies fundamental aspects thereof. The studied approach presents a general strategy for executing queries by integrating traversal-based data retrieval into the result construction process. To analyze this notion of traversal-based query execution formally, we define it in the form of an abstract query execution model. In addition, we discuss a concrete implementation approach for our execution model; this approach is based on the well-known concept of iterators. Our analysis of both the execution model and the iterator-based implementation shows that (i) for one of our reachability-based query semantics, the given notion of traversal-based query execution, in general, is sound and complete, whereas (ii) for the same query semantics, the specific, iterator-based implementation approach cannot guarantee completeness of query results. Based on an experimental analysis we verify that the latter limitation has a significant impact in practice.