Ebook: Further with Knowledge Graphs
The field of semantic computing is highly diverse, linking areas such as artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management. As such it forms an essential part of the computing technology that underpins all our lives today.
This volume presents the proceedings of SEMANTiCS 2021, the 17th International Conference on Semantic Systems. As a result of the continuing Coronavirus restrictions, SEMANTiCS 2021 was held in a hybrid form in Amsterdam, the Netherlands, from 6 to 9 September 2021. The annual SEMANTiCS conference provides an important platform for semantic computing professionals and researchers, and attracts information managers, IT architects, software engineers, and researchers from a wide range of organizations, such as research facilities, NPOs, public administrations, and the largest companies in the world. The subtitle of the 2021 conference was “In the Era of Knowledge Graphs”, and 66 submissions were received, from which the 19 papers included here were selected following a rigorous single-blind reviewing process; an acceptance rate of 29%. Topics covered include data science, machine learning, logic programming, content engineering, social computing, and the Semantic Web, as well as the additional sub-topics of digital humanities and cultural heritage, legal tech, and distributed and decentralized knowledge graphs.
Providing an overview of current research and development, the book will be of interest to all those working in the field of semantic systems.
This volume contains the proceedings of the 17th International Conference on Semantic Systems, SEMANTiCS 2021. SEMANTiCS is the annual meeting place for professionals and researchers who make semantic computing work, who understand its benefits and encounter its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers, and researchers from organisations ranging from research facilities and NPOs, through public administrations, to the largest companies in the world.
SEMANTiCS offers a forum for the exchange of the latest scientific results in semantic systems and complements these topics with new research challenges in areas like data science, machine learning, logic programming, content engineering, social computing, and the Semantic Web. The conference is in its 17th year and has developed into an internationally visible and professional event at the intersection of academia and industry.
Contributors to and participants of the conference learn from top researchers and industry experts about emerging trends and topics in the wide area of semantic computing. The SEMANTiCS community is highly diverse; attendees have responsibilities in interlinking areas such as artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
The conference’s subtitle in 2021 was “In the Era of Knowledge Graphs”, and submissions were especially welcomed on the following topics:
∙ Web Semantics & Linked (Open) Data
∙ Enterprise Knowledge Graphs, Graph Data Management, and Deep Semantics
∙ Machine Learning & Deep Learning Techniques
∙ Semantic Information Management & Knowledge Integration
∙ Terminology, Thesaurus & Ontology Management
∙ Data Mining and Knowledge Discovery
∙ Reasoning, Rules, and Policies
∙ Natural Language Processing
∙ Data Quality Management and Assurance
∙ Explainable Artificial Intelligence
∙ Semantics in Data Science
∙ Semantics in Blockchain environments
∙ Trust, Data Privacy, and Security with Semantic Technologies
∙ Economics of Data, Data Services, and Data Ecosystems
We additionally issued calls for three special sub-topics:
∙ Digital Humanities and Cultural Heritage
∙ LegalTech
∙ Distributed and Decentralized Knowledge Graphs
Due to the health crisis caused by the Coronavirus pandemic, the 2021 edition of SEMANTiCS took place in a hybrid form in Amsterdam. A call for papers was distributed publicly and we received 66 submissions to the Research and Innovation track.
To provide high-quality reviews, a program committee comprising 99 members supported us in selecting the papers with the highest impact and scientific merit. For each submission, at least three reviews were written independently by the assigned reviewers in a single-blind review process (author names are visible to reviewers, reviewers stay anonymous). After all reviews were submitted, the PC chairs compared them, discussed discrepancies and diverging opinions with the reviewers, compiled a meta-review, and suggested a recommendation to accept or reject the paper. Overall, we accepted 19 papers, resulting in an acceptance rate of 29%.
In addition to the peer-reviewed work, the conference had four renowned keynote speakers: Joe Pairman (Senior Product Manager, Tridion Docs); Prof. Enrico Motta (Professor of Knowledge Technologies at the Knowledge Media Institute, The Open University); Prof. Maria-Esther Vidal (Head of the Scientific Data Management Research Group, TIB – Leibniz Information Centre for Science and Technology and University Library); and Dr. Vanessa Lopez (Research Scientist and Manager, AI for Health and Social Care, IBM).
Additionally, the program included posters and demos, a comprehensive set of workshops, and talks from industry leaders.
We thank all authors who submitted papers. We particularly thank the program committee, which provided careful reviews with a quick turnaround time. Their service is essential to the quality of the conference.
Sincerely yours,
The Editors
Amsterdam, September 2021
Nowadays, the demand in industry for dialogue systems able to communicate naturally with industrial systems is increasing, as such systems can enhance productivity and security in these scenarios. However, adapting these systems to different use cases is a costly process, due to the complexity of the scenarios and the lack of available data. This work presents the Task-Oriented Dialogue management Ontology (TODO), which aims to provide a core and complete base for semantic-based task-oriented dialogue systems in the context of industrial scenarios in terms of, on the one hand, domain and dialogue modelling and, on the other hand, dialogue management and tracing support. Furthermore, its modular structure, besides grouping specific knowledge in independent components, allows each module to be easily extended according to the needs of the different use cases. These characteristics allow the ontology to be adapted easily to different use cases, with a considerable reduction in time and costs. To demonstrate the capabilities of the ontology when integrated into a task-oriented dialogue system, TODO has been validated in real-world use cases. Finally, an evaluation is also presented, covering different relevant aspects of the ontology.
This paper describes the steps taken to model a valency lexicon for Latin (Latin Vallex) according to the principles of the Linked Data paradigm, and to interlink its valency frames with the lexical senses recorded in a manually checked subset of the Latin WordNet. The valency lexicon and the WordNet share lexical entries and are part of the LiLa Knowledge Base, which interlinks multiple linguistic resources for Latin. After describing the overall architecture of LiLa, as well as the structure of the lexical entries of Latin Vallex and Latin WordNet, the paper focuses on how valency frames have been modeled in LiLa, in line with a submodule of the Predicate Model for Ontologies (PreMOn) created specifically for the representation of grammatical valency. A mapping between the valency frames and the WordNet synsets assigned to the lexical entries shared by the two resources is detailed, as well as a number of queries that can be run across the interoperable resources for Latin currently included in LiLa.
Matching tables against Knowledge Graphs is a crucial task in many applications. A widely adopted solution to improve the precision of matching algorithms is to refine the set of candidate entities by their type in the Knowledge Graph. However, it is not uncommon for a type to be missing for a given entity. In this paper, we propose a methodology to improve the refinement phase of matching algorithms based on type prediction and soft constraints. We apply our methodology to state-of-the-art algorithms, showing a performance boost on different datasets.
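As a hedged illustration of the general idea in this abstract, the following Python snippet treats type compatibility as a soft constraint that adjusts a candidate's score rather than a hard filter that discards it. This is not the authors' implementation; the candidate structure, labels, and the 0.3 weight are invented for the example.

```python
# Illustrative sketch (not the paper's implementation): rank candidate
# entities for a table cell by combining string similarity with a soft
# type-compatibility bonus instead of filtering candidates by type.
from difflib import SequenceMatcher

def rank_candidates(cell_value, candidates, expected_types, type_weight=0.3):
    """candidates: list of dicts like {"label": str, "types": set of type IRIs}.
    expected_types: types predicted for the table column (may be empty)."""
    scored = []
    for cand in candidates:
        string_sim = SequenceMatcher(None, cell_value.lower(),
                                     cand["label"].lower()).ratio()
        # Soft constraint: a missing or mismatching type lowers the score
        # but never discards the candidate outright.
        type_bonus = 1.0 if cand["types"] & expected_types else 0.0
        scored.append(((1 - type_weight) * string_sim + type_weight * type_bonus,
                       cand["label"]))
    return sorted(scored, reverse=True)

# Example: the second candidate lacks the expected type but is not pruned.
print(rank_candidates(
    "Paris",
    [{"label": "Paris", "types": {"dbo:City"}},
     {"label": "Paris (mythology)", "types": set()}],
    expected_types={"dbo:City"}))
```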
Knowledge graphs facilitate systematic large-scale data analysis by providing structures that are both human- and machine-readable and can be shared across different domains and platforms. Nowadays, knowledge graphs can be used to standardise the collection and sharing of user information in many different sectors such as transport, insurance, smart cities, and the Internet of Things. Regulations such as the GDPR ensure that users are not taken advantage of when they share data. From a legal standpoint, it is necessary to have the user’s consent to collect information. This consent is only valid if the user is aware at all times of the information collected. To increase this awareness, we present a knowledge graph visualisation approach which informs users about the activities linked to their data sharing agreements, especially after they have already given their consent. To visualise the graph, we introduce a user-centred application which showcases sensor data collection and distribution to different data processors. Finally, we present the results of a user study conducted to find out whether this visualisation leads to more legal awareness and trust. We show that with our visualisation tool data sharing consent rates increase from 48% to 81.5%.
The Semantic Web research community has understood since its beginning how crucial it is to equip practitioners with methods to transform non-RDF resources into RDF. Proposals focus on either engineering content transformations or accessing non-RDF resources with SPARQL. Existing solutions require users to learn specific mapping languages (e.g. RML), to know how to query and manipulate a variety of source formats (e.g. XPath, JSONPath), or to combine multiple languages (e.g. SPARQL Generate). In this paper, we explore an alternative solution and contribute a general-purpose meta-model for converting non-RDF resources into RDF: Facade-X. Our approach can be implemented by overriding the SERVICE operator and does not require extending the SPARQL syntax. We compare our approach with the state-of-the-art methods RML and SPARQL Generate and show that our solution has lower learning demands and cognitive complexity, and is cheaper to implement and maintain, while having comparable extensibility and efficiency.
The last decades have witnessed significant advancements in data generation, management, and maintenance. This has resulted in vast amounts of data becoming available in a variety of forms and formats, including RDF. As RDF data is represented as a graph structure, applying machine learning algorithms to extract valuable knowledge and insights from it is not straightforward, especially when the size of the data is enormous. Although Knowledge Graph Embedding models (KGEs) convert RDF graphs into low-dimensional vector spaces, these vectors often lack explainability. In contrast, in this paper we introduce a generic, distributed, and scalable software framework that is capable of transforming large RDF data into an explainable feature matrix. This matrix can be exploited in many standard machine learning algorithms. Our approach, by exploiting semantic web and big data technologies, is able to extract a variety of existing features by deeply traversing a given large RDF graph. The proposed framework is open-source, well-documented, and fully integrated into the active community project Semantic Analytics Stack (SANSA). Experiments on real-world use cases show that the extracted features can be successfully used in machine learning tasks like classification and clustering.
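To make the notion of an explainable feature matrix concrete, here is a minimal, hedged sketch using rdflib on a tiny in-memory graph; it is not the SANSA framework. Each feature is a human-readable property=value pair collected from an entity's outgoing edges.

```python
# Minimal, illustrative sketch (not the SANSA framework itself): build an
# explainable binary feature matrix from an RDF graph, where each feature
# is a human-readable "property=value" pair of the entity's outgoing edges.
from rdflib import Graph
from collections import defaultdict

TTL = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksAt ex:acme ; ex:knows ex:bob .
ex:bob   ex:worksAt ex:acme .
"""

g = Graph().parse(data=TTL, format="turtle")

features = defaultdict(set)          # entity -> set of "p=o" feature names
for s, p, o in g:
    features[str(s)].add(f"{g.qname(p)}={g.qname(o)}")

columns = sorted({f for fs in features.values() for f in fs})
matrix = {e: [1 if c in fs else 0 for c in columns]
          for e, fs in features.items()}

print(columns)
for entity, row in matrix.items():
    print(entity, row)
```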
We describe the use of Linguistic Linked Open Data (LLOD) to support a cross-lingual transfer framework for concept detection in online health communities. Our goal is to develop multilingual text analytics as an enabler for analyzing health-related quality of life (HRQoL) from self-reported patient narratives. The framework capitalizes on supervised cross-lingual projection methods, so that labeled training data are required only for a source language and not for the target languages. Cross-lingual supervision is provided by LLOD lexical resources to learn bilingual word embeddings that are simultaneously tuned to represent an inventory of HRQoL concepts based on the World Health Organization’s quality of life surveys (WHOQOL). We demonstrate that lexicon induction from LLOD resources is a powerful method that yields rich and informative lexical resources for the cross-lingual concept detection task and can outperform existing domain-specific lexica. Furthermore, in a comparative evaluation we find that our models based on bilingual word embeddings exhibit a high degree of complementarity with an approach that integrates machine translation and rule-based extraction algorithms. In a combined configuration, our models rival the performance of state-of-the-art cross-lingual transformers, despite being of considerably lower model complexity.
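The following sketch shows, under stated assumptions, one standard way a bilingual seed lexicon can supervise cross-lingual embeddings: a least-squares linear projection between two monolingual spaces. This is a generic technique, not necessarily the exact model used in the paper; the words, vectors, and dimensionality are placeholders.

```python
# Illustrative sketch of lexicon-based cross-lingual projection (a generic
# least-squares mapping, not the paper's exact model): given a seed bilingual
# lexicon, learn W that maps source-language vectors into the target space.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
src_emb = {w: rng.normal(size=dim) for w in ["douleur", "sommeil", "fatigue"]}
tgt_emb = {w: rng.normal(size=dim) for w in ["pain", "sleep", "tiredness"]}

# Seed lexicon, e.g. derived from an LLOD resource (pairs are illustrative).
lexicon = [("douleur", "pain"), ("sommeil", "sleep"), ("fatigue", "tiredness")]

X = np.stack([src_emb[s] for s, _ in lexicon])   # source vectors
Y = np.stack([tgt_emb[t] for _, t in lexicon])   # target vectors
W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # X @ W ~= Y

def project(word):
    """Map a source-language word into the target embedding space."""
    return src_emb[word] @ W

print(project("douleur").shape)  # (50,)
```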
Knowledge Graph Question Answering (KGQA) systems are often based on machine learning algorithms that require thousands of question-answer pairs as training examples, or on natural language processing pipelines that need module fine-tuning. In this paper, we present a novel QA approach, dubbed TeBaQA. Our approach learns to answer questions based on graph isomorphisms from basic graph patterns of SPARQL queries. Learning basic graph patterns is efficient due to the small number of possible patterns. This novel paradigm reduces the amount of training data necessary to achieve state-of-the-art performance. TeBaQA also speeds up the domain adaptation process by transforming the QA system development task into a much smaller and easier data compilation task. In our evaluation, TeBaQA achieves state-of-the-art performance on QALD-8 and delivers comparable results on QALD-9 and LC-QuAD v1. Additionally, we performed a fine-grained evaluation on complex queries that deal with aggregation and superlative questions, as well as an ablation study, highlighting future research challenges.
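As a hedged sketch of the "graph isomorphism of basic graph patterns" idea (toy patterns, not TeBaQA's code), the snippet below abstracts triple patterns into a directed graph and uses networkx to test whether two BGPs share the same shape and could therefore be served by the same learned query template.

```python
# Sketch of grouping SPARQL basic graph patterns (BGPs) by graph isomorphism
# using networkx; the patterns are toy examples, not TeBaQA's implementation.
import networkx as nx

def bgp_to_graph(triples):
    """Treat each triple pattern (s, p, o) as an edge s -> o; predicates and
    variable names are ignored so only the pattern's shape matters."""
    g = nx.MultiDiGraph()
    for s, _, o in triples:
        g.add_edge(s, o)
    return g

bgp_a = [("?person", "dbo:birthPlace", "?city"),
         ("?city", "dbo:country", "dbr:France")]
bgp_b = [("?x", "dbo:author", "?y"),
         ("?y", "dbo:nationality", "dbr:Ireland")]

# Both are a chain of two triple patterns, so their shapes are isomorphic
# and they could be answered by the same learned query template.
print(nx.is_isomorphic(bgp_to_graph(bgp_a), bgp_to_graph(bgp_b)))  # True
```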
RDF Stream Processing (RSP) has been proposed as a way of bridging the gap between the Complex Event Processing (CEP) paradigm and the Semantic Web standards. Uncertainty has been recognized as a critical aspect in CEP, but it has received little attention within the context of RSP. In this paper, we investigate the impact of different RSP optimization strategies for uncertainty management. The paper describes (1) an extension of the RSP-QL⋆ data model to capture bind expressions, filter expressions, and uncertainty functions; (2) optimization techniques related to lazy variables and caching of uncertainty functions, and a heuristic for reordering uncertainty filters in query plans; and (3) an evaluation of these strategies in a prototype implementation. The results show that using a lazy variable mechanism for uncertainty functions can improve query execution performance by orders of magnitude while introducing negligible overhead. The results also show that caching uncertainty function results can improve performance under most conditions, but that maintaining this cache can potentially add overhead to the overall query execution process. Finally, the effect of the proposed heuristic on query execution performance was shown to depend on multiple factors, including the selectivity of uncertainty filters, the size of intermediate results, and the cost associated with the evaluation of the uncertainty functions.
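The two optimization ideas named here, lazy evaluation and caching of uncertainty functions, can be illustrated outside any RSP engine with the following minimal Python sketch; the uncertainty function and numbers are placeholders, and this is not the RSP-QL⋆ prototype.

```python
# Minimal sketch of two generic ideas mentioned above, outside any RSP engine:
# (1) a lazy variable that defers an expensive uncertainty function until the
# value is actually needed, and (2) caching (memoising) its results.
from functools import lru_cache
import time

@lru_cache(maxsize=None)              # cache repeated calls with equal inputs
def uncertainty(sensor_reading: float) -> float:
    time.sleep(0.01)                  # stand-in for an expensive computation
    return min(1.0, sensor_reading / 100.0)

class Lazy:
    """Defers evaluation; the function never runs if the binding is
    filtered out earlier in the query plan."""
    def __init__(self, fn, *args):
        self.fn, self.args, self._value = fn, args, None
        self.evaluated = False
    def value(self):
        if not self.evaluated:
            self._value = self.fn(*self.args)
            self.evaluated = True
        return self._value

binding = {"reading": 42.0, "confidence": Lazy(uncertainty, 42.0)}
# Only if a later filter actually inspects ?confidence is the function run:
if binding["confidence"].value() > 0.3:
    print("keep binding", binding["reading"])
```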
In Faceted Search Systems (FSS), users navigate the information space through facets, which are attributes or meta-data that describe the underlying content of the collection. Type-based facets (aka t-facets) help explore the categories associated with the searched objects in a structured information space. This work investigates how personalizing t-facet ranking can minimize the user effort needed to reach the intended search target. We propose a lightweight personalization method based on the Vector Space Model (VSM) for ranking the t-facet hierarchy in two steps. The first step scores each individual leaf-node t-facet by computing the similarity between the t-facet’s BERT embedding and the user profile vector. In this model, the user’s profile is expressed in a category space through vectors that capture the user’s past preferences. In the second step, this score is used to re-order and select the sub-tree to present to the user. The final ranked tree reflects the t-facet’s relevance both to the query and to the user profile. Through the use of embeddings, the proposed method effectively handles unseen facets without adding extra processing to the FSS. The effectiveness of the proposed approach is measured by the user effort required to retrieve the sought item when using the ranked facets. The approach outperformed existing personalization baselines.
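A minimal sketch of the first scoring step follows, assuming the facet embeddings and the user profile vector are already available as plain vectors; the toy three-dimensional vectors below stand in for BERT embeddings and a real category-space profile.

```python
# Illustrative sketch of the scoring step described above: rank leaf t-facets
# by cosine similarity between a facet embedding and the user profile vector.
# The vectors below are toy stand-ins; in the paper the facet vectors come
# from BERT and the profile from the user's past category preferences.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

user_profile = np.array([0.9, 0.1, 0.4])          # toy category-space vector
facet_embeddings = {
    "Italian restaurant": np.array([0.8, 0.2, 0.3]),
    "Sushi bar":          np.array([0.1, 0.9, 0.2]),
    "Vegan cafe":         np.array([0.7, 0.0, 0.6]),
}

ranked = sorted(facet_embeddings,
                key=lambda f: cosine(facet_embeddings[f], user_profile),
                reverse=True)
print(ranked)   # facets most similar to the user's past preferences first
```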
The usage of Named Entity Recognition tools on domain-specific corpora is often hampered by insufficient training data. We investigate an approach to produce fine-grained named entity annotations of a large corpus of Austrian court decisions from a small manually annotated training data set. We apply a general-purpose Named Entity Recognition model to produce annotations of common coarse-grained types. Next, a small sample of these annotations is manually inspected by domain experts to produce an initial fine-grained training data set. To make efficient use of the small manually annotated data set, we formulate the task of named entity typing as a binary classification task: for each originally annotated occurrence of an entity, and for each fine-grained type, we verify whether the entity belongs to it. For this purpose we train a transformer-based classifier. We randomly sample 547 predictions and evaluate them manually. The incorrect predictions are used to improve the performance of the classifier by adding the corrected annotations to the training set. The experiments show that re-training with even a very small number (5 or 10) of originally incorrect predictions can significantly improve the classifier performance. We finally train the classifier on all available data and re-annotate the whole data set.
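The binary reformulation can be pictured with the following hedged sketch, which only shows how (mention, candidate type) pairs are turned into 0/1 training examples; the sentence, type labels, and input encoding are invented placeholders, and the transformer classifier itself is omitted.

```python
# Sketch of the task reformulation only (the classifier itself is omitted):
# every annotated entity occurrence is paired with every fine-grained type,
# yielding one binary example per (mention, type) pair. Types and sentences
# below are invented placeholders, not the paper's annotation scheme.
FINE_TYPES = ["COURT", "JUDGE", "LAW_FIRM"]

coarse_annotations = [
    {"sentence": "The appeal was dismissed by the Oberster Gerichtshof.",
     "mention": "Oberster Gerichtshof", "gold_type": "COURT"},
]

def to_binary_examples(annotations, fine_types):
    examples = []
    for ann in annotations:
        for t in fine_types:
            examples.append({
                # A typical encoding: the candidate type is prepended so a
                # text-pair classifier can verify membership.
                "text": f"{t} [SEP] {ann['mention']} [SEP] {ann['sentence']}",
                "label": int(ann["gold_type"] == t),
            })
    return examples

for ex in to_binary_examples(coarse_annotations, FINE_TYPES):
    print(ex["label"], ex["text"])
```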
With the significant growth in RDF datasets, application developers demand online availability of these datasets to meet end users’ expectations. Various interfaces are available for querying RDF data using the SPARQL query language. Studies show that SPARQL endpoints may provide high query runtime performance at the cost of low availability; for example, it has been observed that only 32.2% of public endpoints have a monthly uptime of 99–100%. One possible reason for this low availability is the high workload experienced by these SPARQL endpoints. As complete query execution is performed on the server side (i.e., at the SPARQL endpoint), this high query processing workload may result in performance degradation or even a service shutdown. We performed extensive experiments to measure the query processing capabilities of well-known triple stores by using their SPARQL endpoints. In particular, we stressed these triple stores with multiple parallel requests from different querying agents. Our experiments revealed the maximum query processing capabilities of these triple stores, beyond which they lead to service shutdowns. We hope this analysis will help triple store developers to design workload-aware RDF engines that improve the availability of their public endpoints with high throughput.
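For illustration, a stress test of this kind can be approximated with a few lines of Python that issue parallel requests over the standard SPARQL HTTP protocol; the endpoint URL, query, and client count below are placeholders rather than the paper's benchmark configuration.

```python
# Illustrative sketch of stressing a SPARQL endpoint with parallel querying
# agents over the standard SPARQL HTTP protocol. The endpoint URL, query and
# client count are placeholders, not the benchmark configuration of the paper.
import concurrent.futures
import requests

ENDPOINT = "http://localhost:8890/sparql"           # placeholder endpoint
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 100"     # placeholder query
N_CLIENTS = 16

def run_query(i):
    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    return i, resp.status_code, resp.elapsed.total_seconds()

with concurrent.futures.ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
    for i, status, secs in pool.map(run_query, range(N_CLIENTS)):
        print(f"client {i}: HTTP {status} in {secs:.2f}s")
```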
This paper describes a new semantic metadata-based approach to describing and integrating diverse data processing activity descriptions gathered from heterogeneous organisational sources such as departments, divisions, and external processors. This information must be collated to assess and document GDPR legal compliance, for example by creating a Register of Processing Activities (ROPA). Most GDPR knowledge graph research to date has focused on developing detailed compliance graphs. However, many organisations already have diverse data collection tools for documenting data processing activities, and this heterogeneity is likely to grow in the future. We provide a new approach extending the well-known DCAT-AP standard with the Data Privacy Vocabulary (DPV) to express the concepts necessary to complete a ROPA. This approach enables data catalog implementations to merge and federate the metadata for a ROPA without requiring full alignment or merging of all the underlying data sources. To show the feasibility of our approach, we demonstrate a deployment use case and develop a prototype system based on diverse data processing records and a standard set of SPARQL queries for a Data Protection Officer preparing a ROPA to monitor compliance. The key benefits of our catalog are that it is a lightweight, metadata-level integration point with a low cost of compliance information integration, capable of representing processing activities from heterogeneous sources.
The growing web of data warrants better data management strategies. Data silos are single points of failure, and they face availability problems which lead to broken links. Furthermore, the dynamic nature of some datasets increases the need for a versioning scheme. In this work, we propose a novel architecture for a linked open data infrastructure, built on open decentralized technologies. IPFS is used for storage and retrieval of data, and the public Ethereum blockchain is used for naming, versioning, and storing metadata of datasets. We furthermore exploit two mechanisms for maintaining a collection of relevant, high-quality datasets in a distributed manner in which participants are incentivized. The platform is shown to have a low barrier to entry and to be censorship-resistant, and it benefits from the fault tolerance of its underlying technologies. Furthermore, we validate the approach by implementing our solution.
Social media, as infrastructure for public discourse, provide valuable information that needs to be preserved. Several tools for social media harvesting exist, but only fragmented workflows can be formed from different combinations of such tools. On top of that, both social media data and preservation-related metadata standards are heterogeneous, resulting in a costly manual process. In the framework of BESOCIAL at the Royal Library of Belgium (KBR), we develop a sustainable social media archiving workflow that integrates heterogeneous data sources in a Europeana- and PREMIS-based data model to describe data preserved by open source tools. This allows data stewardship on a uniform representation, and metadata records are generated automatically via queries. In this paper, we present a comparison of social media harvesting tools and our Knowledge Graph-based solution, which reuses off-the-shelf open source tools to harvest social media and automatically generate preservation-related metadata records. We validate our solution by generating Encoded Archival Description (EAD) and bibliographic MARC records for the preservation of harvested social media collections from Twitter collected at KBR. Other archiving institutions can build upon our solution and customize it to their own social media archiving policies.
Learned latent vector representations are key to the success of many recommender systems of recent years. However, traditional approaches like matrix factorization produce vector representations that capture only the global distributions of a static recommendation scenario. Such latent user or item representations do not capture background knowledge and are not customized to a concrete situational context and the sequential history of events leading up to it.
This is a fundamentally limiting restriction for many tasks and applications, since the latent state can depend on a) abstract background information, b) the current situational context, and c) the history of related observations. An illustrative example is a restaurant recommendation scenario, where a user’s assessment of the situation depends a) on taxonomical information regarding the type of cuisine, b) on situational factors like time of day, weather or location, and c) on the subjective individual history and experience of this user in preceding situations. This situation-specific internal state of the user is not captured by a traditional collaborative filtering approach, since background knowledge, the situational context, and the sequential nature of an individual’s history cannot easily be represented in the matrix.
In this paper, we investigate how well state-of-the-art approaches exploit these different dimensions relevant to POI recommendation tasks. Naturally, we represent such a scenario as a temporal knowledge graph and compare plain knowledge graph, taxonomy, and hypergraph embedding approaches, as well as a recurrent neural network architecture, to exploit the different context dimensions of such rich information. Our empirical evidence indicates that the situational context is most crucial to the prediction performance, while the taxonomical and sequential information are harder to exploit. However, they still have their specific merits depending on the situation.
The results of a SPARQL query are generally presented as a table with one row per result and one column per projected variable. This is an immediate consequence of the formal definition of SPARQL results as a sequence of mappings from variables to RDF terms. However, because of the flat structure of tables, some of the RDF graph structure is lost. This often leads to duplicates in the contents of the table, and to difficulties in reading and interpreting the results. We propose to use nested tables to improve the presentation of SPARQL results. A nested table is a table where cells may contain embedded tables instead of RDF terms, and so on recursively. We introduce an automated procedure that lifts flat tables into nested tables, based on an analysis of the query. We have implemented the procedure on top of Sparklis, a guided query builder in natural language, in order to further improve the readability of its UI. It can equally be implemented on any SPARQL querying interface, as it depends only on the query and its flat results. We illustrate our proposal in the domain of pharmacovigilance, and evaluate it on complex queries over Wikidata.
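A hedged sketch of the lifting idea follows: flat result rows are grouped on one variable and the remaining columns become an embedded table. In the paper the nesting is derived from an analysis of the query; here the grouping variable is simply passed in by hand, and the drug/adverse-effect rows are invented.

```python
# Illustrative sketch only: group flat SPARQL result rows on one variable and
# nest the remaining columns as an embedded table, removing the duplication
# that the flat presentation introduces.
from collections import defaultdict

flat_rows = [   # flat results: one row per (drug, adverseEffect) combination
    {"drug": "DrugA", "adverseEffect": "nausea"},
    {"drug": "DrugA", "adverseEffect": "headache"},
    {"drug": "DrugB", "adverseEffect": "nausea"},
]

def nest(rows, key):
    nested = defaultdict(list)
    for row in rows:
        inner = {k: v for k, v in row.items() if k != key}
        nested[row[key]].append(inner)     # embedded table per key value
    return [{key: k, "rows": v} for k, v in nested.items()]

for entry in nest(flat_rows, "drug"):
    print(entry)   # DrugA appears once, with its two effects nested inside
```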
Infusing autonomous artificial systems with knowledge about the physical world they inhabit is of utmost importance and a long-standing goal in Artificial Intelligence (AI) research. Training systems with relevant data is a common approach; yet, it is not always feasible to find the data needed, especially since a large portion of this knowledge is commonsense. In this paper, we propose a novel method for extracting and evaluating relations between objects and actions from knowledge graphs such as ConceptNet and WordNet. We present a complete methodology for locating, enriching, evaluating, cleaning, and exposing knowledge from such resources, taking semantic similarity methods into consideration. One important aspect of our method is the flexibility in deciding how to deal with the noise that exists in the data. We compare our method with typical approaches found in the relevant literature, such as methods that exploit the topology or the semantic information in a knowledge graph, and embedding-based methods. We test the performance of these methods on the Something-Something Dataset.
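One ingredient mentioned above, semantic similarity over a lexical knowledge graph, can be sketched with NLTK's WordNet interface. This is an illustration only, not the paper's pipeline; the word pairs are arbitrary and the scores depend on WordNet's coverage.

```python
# Hedged sketch: score object-action pairs with a WordNet-based semantic
# similarity (NLTK path similarity). Illustration only, not the paper's
# method; cross-POS scores may be low or undefined depending on WordNet.
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def similarity(word_a, word_b):
    """Maximum path similarity over the first few synsets of each word."""
    best = 0.0
    for sa in wn.synsets(word_a)[:3]:
        for sb in wn.synsets(word_b)[:3]:
            score = sa.path_similarity(sb)
            if score is not None:
                best = max(best, score)
    return best

# Compare two object-action pairs; a higher score suggests a stronger
# semantic relation between the object and the action.
print(similarity("cup", "drink"), similarity("cup", "fly"))
```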
This paper presents a prosopographical knowledge graph describing the Members of Parliament in Finland and related actors in politics, extracted from the databases and textual descriptions of the Parliament of Finland. The data has been interlinked internally and enriched with links to external data sources according to the 5-star Linked Data model. The data has been published together with its schema for better re-usability and is validated using ShEx. The knowledge graph presented is integrated with another knowledge graph covering over 900 000 parliamentary plenary speeches in Finland (1907–) to form FinnParla, a larger parliamentary LOD publication of the Parliament of Finland. The data is being used for Digital Humanities research on parliamentary networks, culture, and language.