
Ebook: Strategies and Techniques for Federated Semantic Knowledge Integration and Retrieval

The vast amount of data available on the web has created a need for effective retrieval techniques that transform that data into usable machine knowledge. But creating integrated knowledge, especially knowledge about the same entity drawn from different web data sources, is a challenging task that requires solving interoperability problems.
This book addresses the problem of knowledge retrieval and integration from heterogeneous web sources, and proposes a holistic semantic knowledge retrieval and integration approach for creating knowledge graphs on-demand from diverse web sources. Semantic Web Technologies have evolved as a novel approach to tackle the problem of knowledge integration from heterogeneous data. However, because the Extraction-Transformation-Load approach dominates the process, knowledge retrieval and integration from web data sources is expensive, and in some cases full physical integration of the data is impeded by restricted access. The book focuses on representing data from web sources as pieces of knowledge belonging to the same entity, which can then be synthesized into a knowledge graph. This helps to solve interoperability conflicts and allows for a more cost-effective integration approach, providing a method that enables the creation of valuable insights from heterogeneous web data.
Empirical evaluations of this holistic approach provide evidence that the methodology and techniques proposed in this book help to effectively integrate the disparate knowledge spread over heterogeneous web data sources. The book also demonstrates how applications in three domains (law enforcement, job market analysis, and manufacturing) have been developed and managed using the approach.
The vast amount of data shared on the Web requires effective and efficient techniques to retrieve it and create machine-usable knowledge from it. The creation of integrated knowledge from the Web, especially knowledge about the same entity spread over different web data sources, is a challenging task. Several data interoperability problems, such as schema, structure, or domain conflicts, need to be solved during the integration process. Semantic Web Technologies have evolved as a novel approach to tackle the problem of knowledge integration from heterogeneous data. However, knowledge retrieval and integration from web data sources is an expensive process, mainly due to the Extraction-Transformation-Load approach that dominates the process. In addition, there are increasingly many scenarios where a full physical integration of the data is either prohibitive (e.g. due to data being hidden behind APIs) or not allowed (e.g. for data privacy concerns). Thus, a more cost-effective and federated integration approach is needed: a method that helps organizations create valuable insights from the heterogeneous data spread over web sources.
In this book, we tackle the problem of knowledge retrieval and integration from heterogeneous web sources and propose a holistic semantic knowledge retrieval and integration approach that creates knowledge graphs on-demand from a federation of web sources. We focus on representing the data from web sources that belongs to the same entity as pieces of knowledge, and then synthesizing these pieces into a knowledge graph, solving interoperability conflicts at integration time. First, we propose MINTE, a novel semantic integration approach that solves interoperability conflicts present in heterogeneous web sources. MINTE defines the concept of RDF molecules to represent web source data as pieces of knowledge. Then, MINTE relies on a semantic similarity function to determine which RDF molecules belong to the same entity. Finally, MINTE employs fusion policies for the synthesis of RDF molecules into a knowledge graph. Second, we define a similarity framework for RDF molecules to identify semantically equivalent entities. The framework includes state-of-the-art semantic similarity metrics, such as GADES, as well as MateTee, an embedding-based semantic similarity metric developed in the scope of this book. Third, based on MINTE and our similarity framework, we design a federated semantic retrieval engine named FuhSen. FuhSen is able to effectively integrate data from heterogeneous web data sources and create integrated knowledge graphs on-demand. FuhSen is equipped with a faceted browsing user interface designed to facilitate the exploration of knowledge graphs built on-demand.
We conducted several empirical evaluations to assess the effectiveness and efficiency of our holistic approach. More importantly, three domain applications, i.e., Law Enforcement, Job Market Analysis, and Manufacturing, have been developed and managed using our approach. Both the empirical evaluations and the concrete applications provide evidence that the methodology and techniques proposed in this book help to effectively integrate the pieces of knowledge about entities that are spread over heterogeneous web data sources.
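The three MINTE steps named in the abstract (RDF molecules, a similarity function, and fusion policies) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the book: the `Molecule` class, the Jaccard overlap used as a stand-in for metrics such as GADES or MateTee, the union-based fusion policy, the greedy matching loop, the 0.5 threshold, and the sample data.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Illustrative RDF molecule: one subject plus its (predicate, object) pairs."""
    subject: str
    properties: set = field(default_factory=set)


def jaccard(a, b):
    """Stand-in similarity (not GADES or MateTee): property-set overlap in [0, 1]."""
    union = a.properties | b.properties
    if not union:
        return 0.0
    return len(a.properties & b.properties) / len(union)


def union_fusion(a, b):
    """Hypothetical fusion policy: keep the first subject, merge all properties."""
    return Molecule(a.subject, a.properties | b.properties)


def integrate(molecules, threshold=0.5):
    """Greedily fuse each molecule into the first already-fused molecule
    whose similarity reaches the threshold; otherwise keep it separate."""
    graph = []
    for m in molecules:
        for i, g in enumerate(graph):
            if jaccard(g, m) >= threshold:
                graph[i] = union_fusion(g, m)
                break
        else:
            graph.append(m)
    return graph


# Made-up molecules from three imaginary web sources.
source_a = Molecule("a:p1", {("name", "Ada Lovelace"), ("born", "1815")})
source_b = Molecule("b:x9", {("name", "Ada Lovelace"), ("born", "1815"),
                             ("field", "mathematics")})
source_c = Molecule("c:k3", {("name", "Alan Turing"), ("born", "1912")})

graph = integrate([source_a, source_b, source_c])
for molecule in graph:
    print(molecule.subject, sorted(molecule.properties))
```

In this toy run the first two molecules share enough properties to be fused into one entity, while the third stays separate. A real similarity metric would compare the meaning of the properties rather than require exact matches, which is the gap the similarity framework described above is meant to fill.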