The concept of “Knowledge Graphs” refers to a knowledge representation technique that has become increasingly prominent in scientific and industrial applications. Knowledge graphs describe a set of typed entities with their attributes and the relationships between those entities. Linked Data is a set of best practices for representing and publishing knowledge graphs on the Web. The foundations of Linked Data are the Resource Description Framework (RDF) as a graph-based data model and SPARQL as the query language for RDF data. The increasing number and size of knowledge graphs published as Linked Data in autonomous sources led to the development of various Web interfaces to query these knowledge graphs. The Linked Data Fragment framework provides a uniform way to describe these interfaces regarding their querying expressivity and the metadata they provide, ranging from Triple Pattern Fragment servers to SPARQL endpoints. As a result, heterogeneous interfaces and decentralized, federated publishing lead to new challenges for querying knowledge graphs on the Web.
First, many traditional query planning methods rely on fine-grained statistics on the interfaces’ querying performance and the data distributions of the knowledge graphs. Therefore, performance and data distribution profiling approaches need to be adapted to the capabilities and limitations of different Linked Data Fragment interfaces. The first part of this thesis addresses this challenge and investigates sample-based profiling approaches to support query processing over large, decentralized knowledge graphs. We propose a sample-based approach for generating fine-grained performance profiles and showcase how the information from those profiles can be leveraged in cost model-based query planning. Moreover, we propose a sample-based data distribution profiling approach that aims to estimate statistical profile features of large knowledge graphs. We also demonstrate the applicability of these estimations in federated querying processing.
Another challenge in such decentralized scenarios is devising efficient query processing approaches even if no fine-grained statistics are available and when heterogeneous interfaces need to be queried. The second part of this thesis focuses on query processing techniques to address this challenge. To this end, we investigate robust query planning and execution techniques to support efficient query processing in the absence of fine-grained statistics. Our experimental results demonstrate the benefits of our robust query processing techniques as they outperform state-of-the-art approaches. Finally, we propose a framework for federated query processing over heterogeneous federations of Linked Data Fragments to exploit the capabilities of different sources by defining interface-aware approaches. We showcase the effectiveness of our framework and show that substantial performance improvements are achieved by devising these interface-aware approaches exploiting the capabilities of heterogeneous interfaces in federations.
In summary, this thesis contributes to SPARQL query processing over knowledge graphs by investigating novel approaches to address the challenges that arise in the presence of decentralized, heterogeneous sources of knowledge graphs. Throughout this thesis, we empirically evaluate and demonstrate the effectiveness of these approaches using various real-world and synthetic large-scale knowledge graphs.