Many web pages include structured data in the form of semantic markup, which can be transformed into RDF or provides an interface to retrieve RDF data directly. This RDF data enables machines to process and use the data automatically. When an application needs data from more than one source, typically because a single source is incomplete or the sources cover different aspects, the data has to be integrated. Vocabularies are used to describe the data concisely. Because of the decentralized nature of the web, however, multiple data sources can provide similar information using different vocabularies. These differing vocabularies and modeling choices on the data provider side make integration difficult. The approach of this thesis identifies similar statements about entities across sources, independent of vocabulary and data modeling choices.
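To illustrate the problem, consider a minimal sketch in which two providers state the same fact about the same entity with different vocabularies. All URIs, prefixes, and values here are hypothetical; triples are plain tuples rather than a full RDF library.

```python
# Hypothetical example: the same fact, expressed by two data providers
# with different vocabularies. Triples are (subject, predicate, object).
source_a = [
    ("ex1:Berlin", "rdfs:label", "Berlin"),
    ("ex1:Berlin", "dbo:populationTotal", "3644826"),
]
source_b = [
    ("ex2:city_berlin", "skos:prefLabel", "Berlin"),
    ("ex2:city_berlin", "ex2:inhabitants", "3644826"),
]

# Purely syntactic integration on predicate URIs finds no overlap:
preds_a = {p for _, p, _ in source_a}
preds_b = {p for _, p, _ in source_b}
assert preds_a & preds_b == set()

# Matching entities via their label information instead succeeds:
LABEL_PREDICATES = {"rdfs:label", "skos:prefLabel"}
labels_a = {o: s for s, p, o in source_a if p in LABEL_PREDICATES}
labels_b = {o: s for s, p, o in source_b if p in LABEL_PREDICATES}
matches = {(labels_a[l], labels_b[l]) for l in labels_a.keys() & labels_b.keys()}
# matches pairs ex1:Berlin with ex2:city_berlin
```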
Previous approaches rely on clean and extensively modeled ontologies for aligning statements. In a web context, however, data is usually noisy and does not necessarily meet these prerequisites. To tackle this problem, the use of the RDF label information of entities is proposed, which allows a better integration of noisy data; the experiments presented in this thesis confirm this. Traditional alignment approaches rely on string similarity measures operating on a purely syntactic level: they can neither handle synonyms nor detect semantic relationships between words. To incorporate a measure of semantic similarity, the use of textual embeddings is investigated, which yields superior results.
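The limitation of purely syntactic matching can be seen with any off-the-shelf string similarity measure; the sketch below uses Python's standard-library `difflib.SequenceMatcher` as a stand-in, with illustrative word pairs.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Purely syntactic similarity in [0, 1]; no semantics involved."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Near-identical spellings score high ...
print(string_sim("organization", "organisation"))  # > 0.9
# ... but synonyms score low, although they denote the same concept:
print(string_sim("film", "movie"))  # < 0.3
```

A semantic measure based on textual embeddings would instead assign a high similarity to "film" and "movie", since the two words occur in similar contexts.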
However, textual embeddings are restricted to the information reported in text and thereby neglect facts that are self-evident to humans and therefore rarely stated explicitly. To mitigate this reporting bias, we investigate the incorporation of information from other modalities: we explore the potential of complementing the textual knowledge by learning a shared latent representation that integrates information across three modalities: images, text, and knowledge graphs. Thereby, we leverage the results of years of research in different domains: Computer Vision, Computational Linguistics, and the Semantic Web. In Computer Vision, visual object features are learned from large image collections; in Computational Linguistics, word embeddings capturing distributional semantics are extracted from huge text corpora; and in the Semantic Web, knowledge graph embeddings effectively capture explicit relational knowledge about entities. This thesis investigates whether fusing the single-modal representations into a multi-modal one yields a more holistic representation; to this end, the problem of aligning and combining modalities is investigated. The holistic representation is demonstrated to identify similarities better, as it covers the different aspects of an entity captured by the different modalities; for example, visual attributes cover shape and color information that is not easily obtained from other modalities.
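The effect of fusing modalities can be sketched with a deliberately naive late-fusion scheme: each modality vector is L2-normalized so that no modality dominates, and the normalized vectors are concatenated. This is only a stand-in for a learned shared latent space; the entity names and toy vectors are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (so each modality contributes equally)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse(text_vec, image_vec, kg_vec):
    """Naive late fusion: normalize each modality, then concatenate."""
    return l2_normalize(text_vec) + l2_normalize(image_vec) + l2_normalize(kg_vec)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy example: "zebra" and "horse" have identical text and KG vectors here,
# but the visual modality (stripes vs. plain coat) separates them.
zebra = fuse([0.9, 0.1], [1.0, 0.0], [0.8, 0.2])
horse = fuse([0.9, 0.1], [0.0, 1.0], [0.8, 0.2])
print(cosine(zebra, horse))  # < 1.0: text alone would judge them identical
```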
While the benefits of multi-modal embeddings have become clear, they are limited to a small number of concepts: the fusion requires cross-modal alignments, which are available for only a few concepts. Since alignments across modalities are rare and expensive to create, an extrapolation approach that translates entity representations outside the training corpus into the shared representation space is developed as the final contribution of this thesis.
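The extrapolation idea can be sketched as follows: from a handful of aligned pairs, fit a linear map from a single-modal space into the shared space, then apply it to entities that have no alignment. This is a minimal least-squares sketch in two dimensions with illustrative data; the actual approach in the thesis may learn a richer mapping.

```python
# Minimal linear extrapolation sketch (2-D, pure Python, illustrative data).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix (assumes it is invertible)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_linear_map(X, Y):
    """Least-squares W minimizing ||XW - Y||: W = (X^T X)^{-1} X^T Y."""
    Xt = transpose(X)
    return matmul(inv2(matmul(Xt, X)), matmul(Xt, Y))

# Aligned training pairs: source-space vector -> shared-space vector.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]
W = fit_linear_map(X, Y)

# An entity without any cross-modal alignment is translated into the
# shared space by applying the learned map:
unseen = [[0.5, 2.0]]
print(matmul(unseen, W))  # approximately [[1.0, 6.0]]
```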