Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many in formation systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real word entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and any data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most of the real world domains.
In this thesis, we investigate the hypothesis if Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process, and different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks.
Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high dimensional feature vectors even after dimensionality reduction, and also, the reusability of such feature vectors across different datasets is limited.
In Part II, we propose an approach that circumvents the shortcomings introduced with the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representation of entities have high relevance for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks.
In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs, besides the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. More precisely, we show that Semantic Web graphs can be exploited for analyzing statistics, building recommender systems, entity and document modeling, and taxonomy induction.