Ebook: Decentralized Query Processing over Heterogeneous Sources of Knowledge Graphs
Graph-based data formats are a flexible way of representing data – semantic data models in particular – where the schema is part of the data, and have become more popular and had some commercial success in recent years. Semantic data models are also the basis for the Semantic Web – a Web of data governed by open standards in which computer programs can freely access the data provided.
This book is about checking the correctness of programs that can access semantic data. Although the flexibility of semantic data models is one of their greatest strengths, it can lead programmers to accidentally fail to account for unintuitive edge cases, leading to run-time errors or unintended side-effects during program execution. A program may even run for a long time before such an error occurs and the program crashes. Providing a type system is an established methodology for proving the absence of run-time errors in programs without requiring execution. The book defines type systems that can detect and avoid such run-time errors based on schema languages available for the Semantic Web. Using the Web Ontology Language (OWL) and its theoretic underpinnings i.e. description logics, and the Shapes Constraint Language (SHACL) in particular, the book defines systems that can provide type-safe data access to semantic data graphs. The book is divided into 3 parts: Part I contains an introduction and preliminaries; Part II covers type systems for the Semantic Web; and Part III includes related work and conclusions.
The concept of “Knowledge Graphs” refers to a knowledge representation technique that has become increasingly prominent in scientific and industrial applications. Knowledge graphs describe a set of typed entities with their attributes and the relationships between those entities. Linked Data is a set of best practices for representing and publishing knowledge graphs on the Web. The foundations of Linked Data are the Resource Description Framework (RDF) as a graph-based data model and SPARQL as the query language for RDF data. The increasing number and size of knowledge graphs published as Linked Data in autonomous sources led to the development of various Web interfaces to query these knowledge graphs. The Linked Data Fragment framework provides a uniform way to describe these interfaces regarding their querying expressivity and the metadata they provide, ranging from Triple Pattern Fragment servers to SPARQL endpoints. As a result, heterogeneous interfaces and decentralized, federated publishing lead to new challenges for querying knowledge graphs on the Web.
First, many traditional query planning methods rely on fine-grained statistics on the interfaces’ querying performance and the data distributions of the knowledge graphs. Therefore, performance and data distribution profiling approaches need to be adapted to the capabilities and limitations of different Linked Data Fragment interfaces. The first part of this thesis addresses this challenge and investigates sample-based profiling approaches to support query processing over large, decentralized knowledge graphs. We propose a sample-based approach for generating fine-grained performance profiles and showcase how the information from those profiles can be leveraged in cost model-based query planning. Moreover, we propose a sample-based data distribution profiling approach that aims to estimate statistical profile features of large knowledge graphs. We also demonstrate the applicability of these estimations in federated querying processing.
Another challenge in such decentralized scenarios is devising efficient query processing approaches even if no fine-grained statistics are available and when heterogeneous interfaces need to be queried. The second part of this thesis focuses on query processing techniques to address this challenge. To this end, we investigate robust query planning and execution techniques to support efficient query processing in the absence of fine-grained statistics. Our experimental results demonstrate the benefits of our robust query processing techniques as they outperform state-of-the-art approaches. Finally, we propose a framework for federated query processing over heterogeneous federations of Linked Data Fragments to exploit the capabilities of different sources by defining interface-aware approaches. We showcase the effectiveness of our framework and show that substantial performance improvements are achieved by devising these interface-aware approaches exploiting the capabilities of heterogeneous interfaces in federations.
In summary, this thesis contributes to SPARQL query processing over knowledge graphs by investigating novel approaches to address the challenges that arise in the presence of decentralized, heterogeneous sources of knowledge graphs. Throughout this thesis, we empirically evaluate and demonstrate the effectiveness of these approaches using various real-world and synthetic large-scale knowledge graphs.