Ebook: Towards a Knowledge-Aware AI
Semantic systems lie at the heart of modern computing, interlinking with areas as diverse as AI, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, enterprise vocabulary management, machine learning, logic programming, content engineering, social computing, and the Semantic Web.
This book presents the proceedings of SEMANTiCS 2022, the 18th International Conference on Semantic Systems, held as a hybrid event – live in Vienna, Austria and online – from 12 to 15 September 2022. The SEMANTiCS conference is an annual meeting place for the professionals and researchers who make semantic computing work, who understand its benefits and encounter its limitations, and is attended by information managers, IT architects, software engineers, and researchers from organizations ranging from research facilities and NPOs, through public administrations to the largest companies in the world. The theme and subtitle of the 2022 conference was Towards A Knowledge-Aware AI, and the book contains 15 papers, selected on the basis of quality, impact and scientific merit following a rigorous review process which resulted in an acceptance rate of 29%. The book is divided into four chapters: semantics in data quality, standards and protection; representation learning and reasoning for downstream AI tasks; ontology development; and learning over complementary knowledge.
Providing an overview of emerging trends and topics in the wide area of semantic computing, the book will be of interest to anyone involved in the development and deployment of computer technology and AI systems.
Abstract. This volume contains the proceedings of the 18th International Conference on Semantic Systems, SEMANTiCS 2022, held under the title “Towards A Knowledge-Aware AI”. SEMANTiCS is the annual meeting place for professionals and researchers who make semantic computing work, who understand its benefits and encounter its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers, and researchers from organizations ranging from research facilities and NPOs, through public administrations, to the largest companies in the world.
Keywords. Semantic Systems, Knowledge Graphs, Artificial Intelligence, Semantic Web, Linked Data, Machine Learning, Knowledge Discovery
SEMANTiCS offers a forum for the exchange of the latest scientific results in semantic systems and complements these topics with new research challenges in areas like data science, machine learning, logic programming, content engineering, social computing, and the Semantic Web. The conference is in its 18th year and has developed into an internationally visible and professional event at the intersection of academia and industry.
Contributors to and participants of the conference learn from top researchers and industry experts about emerging trends and topics in the wide area of semantic computing. The SEMANTiCS community is highly diverse: attendees have responsibilities in interlinking areas such as artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
The conference’s subtitle in 2022 was “Towards A Knowledge-Aware AI”, and submissions on the following topics were especially welcome:
∙ Enterprise Linked Data & Data Integration
∙ Data Fabric: access and sharing of data in a distributed data environment
∙ Knowledge Graphs
∙ Business models and Success Factors in applying Semantic Technologies
∙ Digital Twins and Industry 4.0
∙ VR augmented by Semantic Technologies
∙ Semantics on the Web & schema.org
∙ Terminology, Thesaurus & Ontology Management
∙ Knowledge Discovery, Semantic Search and Recommender Systems
∙ Smart Connectivity & Interlinking
∙ Blockchain and Distributed Ledger Applications
∙ Machine Learning, NLP & Semantic Computing as a combined approach
∙ Big Data & Advanced Analytics
∙ Explainable AI
∙ Semantically Enriched Digital Experience Platforms
∙ Data Governance and Data Quality Management
∙ Data Portals & Knowledge Visualization
∙ Economics of data, data services, data ecosystems and data strategies
∙ Community, Social & Societal Aspects
∙ Apps, services and industry demos that make use of any of the above-mentioned topics
To ensure high-quality reviews, a program committee of 114 members supported us in selecting the papers with the highest impact and scientific merit. Each submission received at least three independent reviews from the assigned reviewers in a single-blind review process (author names are visible to reviewers, while reviewers stay anonymous). After all reviews were submitted, the PC chairs compared them, discussed discrepancies and diverging opinions with the reviewers, compiled a meta-review, and suggested a recommendation to accept or reject the paper. Overall, we accepted 15 papers, which resulted in an acceptance rate of 29%.
In addition to the peer-reviewed work, the conference featured four renowned keynote speakers: Prof. Olaf Hartig (Associate Professor at the Department of Computer and Information Science, Linköping University), Steve Moyle (Founder of Amplify Intelligence, Fractional CIO at Freeman Clarke), Adam Keresztes (Business Owner, IKEA Knowledge Graph), and Jans Aasman (CEO of Franz Inc). The program also included posters and demos, a comprehensive set of workshops, and talks from industry leaders. In 2022, SEMANTiCS also hosted for the first time the co-located conference LT Innovate, an established community event for language and translation technologies.
We thank all authors who submitted papers. We particularly thank the program committee, which provided careful reviews with a quick turnaround. Their service is essential for the quality of the conference.
Sincerely yours,
The Editors
Vienna, September 2022
The credibility and trustworthiness of online content has become a major societal issue as human communication and information exchange continues to evolve digitally. The prevalence of misinformation, circulated by fraudsters, trolls, political activists and state-sponsored actors, has motivated a heightened interest in automated content evaluation and curation tools. We present an automated credibility evaluation system to aid users in credibility assessments of web pages, focusing on the automated analysis of 23 mostly language- and content-related credibility signals of web content. We find that emotional characteristics, various morphological and syntactical properties of the language, and exclamation mark and all caps usage are particularly indicative of credibility. Less credible web pages have more emotional, shorter and less complex texts, and put a greater emphasis on the headline, which is longer, contains more all caps and is frequently clickbait. Our system achieves a 63% accuracy in fake news classification, and a 28% accuracy in predicting the credibility rating of web pages on a five-point Likert scale.
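As a purely illustrative aside (not part of the paper), the short Python sketch below shows how a handful of the surface-level signals mentioned above, such as exclamation-mark usage, all-caps words and sentence length, could be computed; the function name, feature set and example inputs are assumptions made for this sketch, not the authors' actual 23 signals.

```python
import re

def surface_credibility_signals(text: str, headline: str) -> dict:
    """Compute a few illustrative surface-level signals of the kind the
    abstract mentions (not the paper's actual feature set)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    all_caps = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "exclamations_per_sentence": text.count("!") / max(len(sentences), 1),
        "all_caps_ratio": len(all_caps) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "headline_length": len(headline.split()),
        "headline_all_caps": sum(w.isupper() for w in headline.split() if len(w) > 1),
    }

print(surface_credibility_signals(
    "SHOCKING!!! You won't BELIEVE what happened next. Experts are stunned!",
    "BREAKING: The ONE trick THEY don't want you to know"))
```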
Linked Data datasets, when published, typically have varying levels of quality. These datasets are created using mapping artefacts, which define the transformation rules from non-graph-based data into graph-based RDF data. Currently, quality issues are detected after the mapping artefact has been executed and the Linked Data has already been published. This paper argues that addressing quality issues within the mapping artefacts will positively affect the quality of the resulting dataset. Furthermore, we suggest that an explicit quality process for mappings will improve quality, maintenance, and reuse. This paper describes the evaluation of the Mapping Quality Vocabulary (MQV) Framework, which aims to guide Linked Data producers in producing high-quality datasets by enabling the quality assessment and subsequent improvement of the mapping artefacts. The evaluation of the MQV framework involved 58 participants with varying levels of background knowledge.
The GDPR requires assessing and conducting a Data Protection Impact Assessment (DPIA) for processing of personal data that may result in high risk and impact to the data subjects. Documenting this process requires information about processing activities, entities and their roles, risks, mitigations and resulting impacts, and consultations. Impact assessments are complex activities in which stakeholders face difficulties identifying relevant risks and mitigations, especially for emerging technologies and the specific considerations of their use cases, and documenting the outcomes in a consistent and reusable manner. We address this challenge by utilising linked data to represent DPIA-related information so that it can be better managed and shared in an interoperable manner. For this, we consulted the guidance documents produced by EU Data Protection Authorities (DPAs) regarding DPIAs and by ENISA regarding risk management. The outcome of our efforts is an extension to the Data Privacy Vocabulary (DPV) for documenting DPIAs and an ontology for risk management based on the ISO 31000 family of standards. Our contributions fill an important gap within the state of the art and pave the way for shared impact assessments under future regulations, such as those for AI and cybersecurity.
The growing number of incidents caused by (mis)using Artificial Intelligence (AI) is a matter of concern for governments, organisations, and the public. To control the harmful impacts of AI, multiple efforts are being made around the world, from guidelines promoting trustworthy development and use, to standards for managing risks, to regulatory frameworks. Amongst these efforts, the first-ever AI regulation proposed by the European Commission, known as the AI Act, is prominent as it takes a risk-oriented approach towards regulating the development and use of AI within systems. In this paper, we present the AI Risk Ontology (AIRO) for expressing information associated with high-risk AI systems based on the requirements of the proposed AI Act and the ISO 31000 series of standards. AIRO assists stakeholders in determining ‘high-risk’ AI systems, maintaining and documenting risk information, performing impact assessments, and achieving conformity with AI regulations. To show its usefulness, we model existing real-world use cases from the AIAAIC repository of AI-related risks, determine whether they are high-risk, and produce documentation for the EU’s proposed AI Act.
Knowledge graphs have emerged as an effective tool for managing and standardizing semistructured domain knowledge in a human- and machine-interpretable way. In terms of graph-based domain applications, such as embeddings and graph neural networks, current research is increasingly taking into account the time-related evolution of the information encoded within a graph. Algorithms and models for stationary and static knowledge graphs are extended to make them accessible for time-aware domains, where time-awareness can be interpreted in different ways. In particular, a distinction needs to be made between the validity period and the traceability of facts as objectives of time-related knowledge graph extensions. In this context, terms and definitions such as dynamic and temporal are often used inconsistently or interchangeably in the literature. Therefore, with this paper we aim to provide a short but well-defined overview of time-aware knowledge graph extensions and thus facilitate future research in this field.
In the automotive industry, testing for reliability and safety is very important but costly. Due to the deployment of an increasing number of features within these systems, mapping them to compatible test environments becomes more and more complex. In this paper, we present a use case for applying ontological reasoning in the automotive industry to support testers in selecting test environments. The task has been to map the software under test, together with its test cases, to test environments through ontological reasoning. To this end, we defined an ontology of test environments. It can be used for ontological reasoning, applying both instance classification and subsumption reasoning, to assign test environments. This approach is prototypically implemented in Stardog, in combination with OWL2 and SPARQL. It is deployed alongside existing software at our industry partner’s premises and provides a user interface that supports testers in selecting test environments and executing tests.
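To make the described mapping more concrete, the following sketch mimics the query side of such an assignment with rdflib and SPARQL; the class and property names (ex:TestEnvironment, ex:requiresFeature, ex:supportsFeature) are hypothetical, and the paper's actual system relies on Stardog with OWL2 reasoning rather than plain pattern matching over rdflib.

```python
from rdflib import Graph

# Hypothetical mini test-environment ontology; the real ontology and the
# reasoning run in Stardog with OWL2, this only mimics the query side.
TTL = """
@prefix ex: <http://example.org/testenv#> .
ex:HilRig a ex:TestEnvironment ; ex:supportsFeature ex:CanBus .
ex:SimBench a ex:TestEnvironment ; ex:supportsFeature ex:RestApi .
ex:BrakeTest a ex:TestCase ; ex:requiresFeature ex:CanBus .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

# Find environments that provide the feature a test case requires
# (simplified here to a single required feature per test case).
q = """
PREFIX ex: <http://example.org/testenv#>
SELECT ?test ?env WHERE {
  ?test a ex:TestCase ; ex:requiresFeature ?f .
  ?env a ex:TestEnvironment ; ex:supportsFeature ?f .
}
"""
for test, env in g.query(q):
    print(f"{test} can run on {env}")
```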
The problem of entity resolution is central in the field of Digital Humanities. It is also one of the major issues in the Golden Agents project, which aims at creating an infrastructure that enables researchers to search for patterns that span across decentralised knowledge graphs from cultural heritage institutes. To this end, we created a method to perform entity resolution on complex historical knowledge graphs. In previous work, we encoded and embedded the relevant (duplicate) entities in a vector space to derive similarities between them based on sharing a similar context in RDF graphs. In some cases, however, available domain knowledge or rational axioms can be applied to improve entity resolution performance. We show how domain knowledge and rational axioms relevant to the task at hand can be expressed as (probabilistic) rules, and how the information derived from rule application can be combined with quantitative information from the embedding. In this work, we apply our entity resolution method to two data sets. First, we apply it to a data set for which we have a detailed ground truth for validation. This experiment shows that combining the embedding with the application of domain knowledge and rational axioms leads to improved resolution performance. Second, we perform a case study by applying our method to a larger data set for which there is no ground truth and where the outcome is subsequently validated by a domain expert. The results of this case study demonstrate that our method achieves very high precision.
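As a schematic illustration only (the paper's own rule formalism and embeddings are not reproduced here), the sketch below combines cosine similarity between entity vectors with two made-up domain rules, one acting as a hard veto and one as a soft boost.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_score(vec_a, vec_b, attrs_a, attrs_b):
    """Combine embedding similarity with illustrative domain rules.
    The rules and weights below are assumptions, not taken from the paper."""
    score = cosine(vec_a, vec_b)
    # Hard rule: two persons with different known birth years cannot match.
    if attrs_a.get("birth_year") and attrs_b.get("birth_year"):
        if attrs_a["birth_year"] != attrs_b["birth_year"]:
            return 0.0
    # Soft rule: a shared surname increases the belief in a match.
    if attrs_a.get("surname") == attrs_b.get("surname"):
        score = min(1.0, score + 0.1)
    return score

a = np.array([0.9, 0.1, 0.3]); b = np.array([0.8, 0.2, 0.25])
print(match_score(a, b, {"surname": "Vermeer", "birth_year": 1632},
                        {"surname": "Vermeer", "birth_year": 1632}))
```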
We present the Frame-based ontology Design Outlet (FrODO), a novel method and tool for automatically drafting ontologies from competency questions. Competency questions are expressed in natural language and are a common solution for representing requirements in a number of agile ontology engineering methodologies, such as eXtreme Design (XD) or SAMOD. FrODO builds on top of FRED: it leverages frame semantics for drawing domain-relevant boundaries around the RDF produced by FRED from a competency question, thus drafting domain ontologies. We carried out a user-based study to assess how well FrODO supports engineers in ontology design tasks. The study shows that FrODO is effective in this respect and that the resulting ontology drafts are of good quality.
Ontology development poses many challenges, among the most prominent of which are modularization and the evolution of ontologies over time. Based on lessons learned from popular programming language package managers, we present a novel approach to package management of OWL ontologies. Most prominently, we integrate a dependency resolution algorithm based on the popular SemVer versioning scheme with tooling support for dependency locking, which allows publication and consumption of ontologies to be decoupled, reducing the need for coordination in ontology evolution. To complete our unified approach, we additionally provide an integrated registry, which serves as a domain-agnostic repository for ontologies (https://registry.field33.com).
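A minimal sketch of the SemVer-style resolution and dependency-locking idea, against a hypothetical in-memory registry; the actual tooling and the registry at https://registry.field33.com are not involved, and the special SemVer rules for 0.x versions are ignored for brevity.

```python
# Hypothetical in-memory registry of published ontology package versions.
REGISTRY = {
    "org:core": ["1.0.0", "1.2.0", "1.3.1", "2.0.0"],
    "org:sensors": ["0.4.0", "0.5.2"],
}

def parse(v):
    return tuple(int(x) for x in v.split("."))

def resolve_caret(package, constraint):
    """Pick the highest version compatible with a caret-style constraint,
    i.e. same major version and not lower than the constraint
    (simplified: SemVer's special handling of 0.x versions is ignored)."""
    base = parse(constraint)
    candidates = [v for v in REGISTRY[package]
                  if parse(v)[0] == base[0] and parse(v) >= base]
    if not candidates:
        raise ValueError(f"no version of {package} satisfies ^{constraint}")
    return max(candidates, key=parse)

# "Lock" the resolved versions so consumers are decoupled from new releases.
dependencies = {"org:core": "1.2.0", "org:sensors": "0.4.0"}
lockfile = {pkg: resolve_caret(pkg, c) for pkg, c in dependencies.items()}
print(lockfile)  # {'org:core': '1.3.1', 'org:sensors': '0.5.2'}
```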
Blockchain technology provides integrity and reliability of information, thus offering a suitable solution to guarantee trust in a multi-stakeholder scenario that involves actors defining business agreements. The Ride2Rail project investigated the use of the blockchain to record, as smart contracts, the agreements between different stakeholders in the multimodal transportation domain. Modelling an ontology to represent the smart contracts enables a machine-readable and interoperable representation of the agreements. On the one hand, the underlying blockchain ensures trust in the execution of the contracts; on the other hand, their ontological representation facilitates the retrieval of information within the ecosystem. The paper describes the development of the Ride2Rail Ontology for Agreements to showcase how the concept of an ontological smart contract, defined in the OASIS ontology, can be applied to a specific domain. The usage of the designed ontology is discussed by describing how business agreements defined in a ride-sharing scenario are modelled as ontological smart contracts.
Many tools for knowledge management and the Semantic Web presuppose the existence of an arrangement of instances into classes, i.e. an ontology. Creating such an ontology, however, is a labor-intensive task. We present an unsupervised method to learn an ontology from text. We rely on pre-trained language models to generate lexical substitutes of given entities and then use matrix factorization to induce new classes and their entities. Our method differs from previous approaches in that (1) it captures the polysemy of entities; (2) it produces interpretable labels for the induced classes; (3) it does not require any particular structure of the text; (4) no re-training is required. We evaluate our method on German and English WikiNER corpora and demonstrate improvements over state-of-the-art approaches.
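For readers unfamiliar with the building blocks, the sketch below strings together a fill-mask language model (via Hugging Face transformers) and non-negative matrix factorization (scikit-learn) in the spirit of the described pipeline; the entities, context sentences, checkpoint and number of classes are toy assumptions, not the paper's setup.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import NMF
from transformers import pipeline

# Toy entities with one context sentence each; the paper works on WikiNER
# corpora with a more elaborate substitution procedure.
contexts = {
    "Paris":  "He moved to {} last year.",
    "Berlin": "He moved to {} last year.",
    "Merkel": "{} gave a speech to parliament.",
    "Obama":  "{} gave a speech to parliament.",
}

unmasker = pipeline("fill-mask", model="bert-base-cased", top_k=10)
substitutes = {
    entity: Counter(r["token_str"].strip()
                    for r in unmasker(ctx.format(unmasker.tokenizer.mask_token)))
    for entity, ctx in contexts.items()
}

# Entity-by-substitute count matrix, factorized into two latent "classes".
vocab = sorted({s for counts in substitutes.values() for s in counts})
X = np.array([[substitutes[e][s] for s in vocab] for e in contexts])
W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(X)
for entity, row in zip(contexts, W):
    print(entity, "-> class", int(np.argmax(row)))
```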
In recent years, the task of sequence-to-sequence neural abstractive summarization has gained a lot of attention. Many novel strategies have been used to improve the saliency, human readability, and consistency of these models, resulting in high-quality summaries. However, because the majority of these pretrained models were trained on news datasets, they contain an inherent bias. One such bias is that most of the generated summaries originate from the start or end of the text, much like a news story might be summarized. Another issue we encountered while using these summarizers in our technical discussion forum use case was token recurrence, which resulted in lower ROUGE-precision scores. To overcome these issues, we present a unique approach that includes: a) an additional parameter in the loss function based on the ROUGE-precision score that is optimized alongside the categorical cross-entropy loss; b) an adaptive loss function based on the token repetition rate which is optimized along with the final loss, so that the model may provide contextual summaries with less token repetition and learn successfully from the fewest training samples; and c) extra metadata indicator tokens, added to effectively contextualize the summarizer for technical forum discussion platforms, which aid the model in learning latent features and dependencies in text segments with relevant metadata information. To avoid overfitting due to data scarcity, we test and verify all models on a hold-out dataset that was not part of the training or validation data. This paper discusses the various strategies we used and compares the performance of the fine-tuned models against baseline summarizers on the test dataset. By training our models end-to-end with these losses, we obtain substantially better ROUGE scores while producing the most legible and relevant summaries on the technical forum dataset.
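The paper's exact loss formulation is not reproduced here; the following PyTorch sketch merely illustrates one simple way to combine categorical cross entropy with a ROUGE-1-precision term and a token-repetition penalty, by computing both penalties on the greedy decode and using them as scalar multipliers of the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def repetition_rate(token_ids):
    """Fraction of generated tokens that repeat an earlier token."""
    seen, repeats = set(), 0
    for t in token_ids:
        repeats += t in seen
        seen.add(t)
    return repeats / max(len(token_ids), 1)

def rouge1_precision(pred_ids, ref_ids):
    """Unigram overlap of the prediction with the reference."""
    ref = set(ref_ids)
    return sum(t in ref for t in pred_ids) / max(len(pred_ids), 1)

def combined_loss(logits, target_ids, alpha=0.5, beta=0.5):
    """Cross entropy scaled by two non-differentiable penalties computed on
    the greedy decode: (1 - ROUGE-1 precision) and the repetition rate.
    This is a schematic combination, not the paper's exact loss."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    penalty = (alpha * (1.0 - rouge1_precision(pred_ids, target_ids[0].tolist()))
               + beta * repetition_rate(pred_ids))
    return ce * (1.0 + penalty)

logits = torch.randn(1, 6, 100, requires_grad=True)   # (batch, seq, vocab)
targets = torch.randint(0, 100, (1, 6))
print(combined_loss(logits, targets))
```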
Deep Learning models based on the Transformer architecture have revolutionized the state of the art in NLP tasks. As English is the language in which most significant advances are made, languages like Spanish require specific training, but this training has a computational cost so high that only large corporations with servers and GPUs are capable of carrying it out. This work explores how to create a model for the Spanish language from a big multilingual model; specifically, a model aimed at text summarization, a very common task in NLP. The results, concerning the quality of the summaries (ROUGE score), show that these small models, for a specific language, achieve results similar to those of much bigger models, with reasonable training requirements in terms of time and computational power, and are significantly faster at inference.
Text documents are rich repositories of causal knowledge. While journal publications typically contain analytical explanations of observations on the basis of scientific experiments conducted by researchers, analyst reports, news articles, and even consumer-generated text contain not only the viewpoints of their authors but often also causal explanations for those viewpoints. As interest in data science shifts towards understanding causality rather than mere correlations, there is also a surging interest in extracting causal constructs from text to provide augmented information for better decision making. Causality extraction from text is viewed as a relation extraction problem which requires identification of causal sentences as well as separate detection of cause and effect clauses. In this paper, we present a joint model for causal sentence classification and extraction of cause and effect clauses, using a sequence-labeling architecture cascaded with a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) language model. The cause and effect clauses are further processed to identify named entities and build a causal graph using domain constraints. We have conducted multiple experiments to assess the generalizability of the model. We observe that when fine-tuned with sentences from a mixed corpus, and further trained to solve both tasks correctly, the model learns the nuances of expressing causality independently of the domain. The proposed model has been evaluated against multiple state-of-the-art models proposed in the literature and found to outperform them all.
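To illustrate the token-labeling half of such a cascade, the sketch below runs a BERT token-classification head with a BIO label set for cause and effect clauses; the checkpoint is the generic bert-base-cased model with an untrained classification head, so its predictions are random until fine-tuned, and the label set is an assumption for this sketch rather than the authors' scheme.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label set for cause/effect clause tagging.
labels = ["O", "B-CAUSE", "I-CAUSE", "B-EFFECT", "I-EFFECT"]
name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels))

sentence = "Heavy rainfall caused severe flooding in the region."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, num_labels)

# Greedy label per subword token; meaningful only after fine-tuning.
pred = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, p in zip(tokens, pred):
    print(f"{tok:12s} {labels[p]}")
```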
Sentences in which two verbs share a single argument represent a complex and highly ambiguous syntactic phenomenon. The argument-sharing relations must be considered during the detection process from both a syntactic and a semantic perspective. Such expressions can represent ungrammatical constructions, denoted as zeugma, or idiomatic elliptical phrase combinations. Rule-based classification methods prove ineffective because of the need to reflect the meaning relations of the analyzed sentence constituents.
This paper presents the development and evaluation of ZeugBERT, a language model tuned for the sentence classification task using a pre-trained Czech transformer model for language representation. The model was trained on a newly prepared dataset of 7,849 Czech sentences, which is also published with this paper, to classify Czech syntactic structures containing coordinated verbs that share a valency argument (or an optional adjunct) in the context of coordination. ZeugBERT reaches 88% accuracy on the test set. The text describes the process of creating and annotating the new dataset, and it offers a detailed error analysis of the developed classification model.
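As an illustration of the general classification setup (not the published ZeugBERT model), the sketch below loads a sequence-classification head on top of a multilingual BERT checkpoint; the paper instead fine-tunes a Czech pretrained transformer on the published dataset, and the two-label scheme used here is a simplification.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Two illustrative labels; the head below is untrained, so predictions are
# random until the model is fine-tuned on the published dataset.
labels = ["acceptable-coordination", "zeugma"]
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=len(labels))

sentence = "Petr si oblékl kabát a vyšel ven."   # "Petr put on his coat and went out."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

print({label: round(float(p), 3) for label, p in zip(labels, probs)})
```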