Ebook: Knowledge Graphs in the Age of Language Models and Neuro-Symbolic AI
Semantic systems encompass a variety of technologies that play a fundamental role in our daily lives, such as artificial intelligence, machine learning, knowledge graphs and ontology engineering, and enterprise vocabulary management.
This book presents the proceedings of SEMANTiCS 2024, the 20th International Conference on Semantic Systems, held from 17 to 19 September 2024 in Amsterdam, the Netherlands. The conference has become recognized as an important international event, providing a regular opportunity for professionals and researchers actively engaged in harnessing the power of semantic computing and artificial intelligence to gather and discuss the possibilities and practical limitations of various transformative technologies. These include the semantic web and artificial intelligence, as well as areas such as data science, machine learning, logic programming, content engineering, social computing, natural language processing, digital humanities, and many more. A total of 95 submissions were received for the conference. These were subjected to a double-anonymous peer review process, with a minimum of 3 independent reviews for each submission, after which 26 papers were accepted for presentation and publication, representing an acceptance rate of 27%. The papers are divided into 6 sections: knowledge engineering with large language models; embeddings and machine learning on knowledge graphs; ontologies and knowledge graphs; linked data management; question answering and querying systems; and digital humanities and cultural heritage.
Providing an overview of emerging trends and themes within the vast field of semantic computing and artificial intelligence, the book will be of interest to all those working in the field.
Abstract. This volume encompasses the proceedings of SEMANTiCS 2024, the 20th International Conference on Semantic Systems, a pivotal event for professionals and researchers actively engaged in harnessing the power of semantic computing and Artificial Intelligence. At SEMANTiCS, attendees gain a profound understanding of the transformative potential of these technologies, while also confronting their practical limitations. Each year, the conference attracts researchers, IT architects, information managers, and software and data engineers from a broad spectrum of organisations, spanning research facilities, non-profit entities, public administrations, and the world’s largest corporations.
Keywords. Semantic Systems, Knowledge Graphs, Artificial Intelligence, Semantic Web, Linked Data, Machine Learning, Knowledge Discovery, Neuro-Symbolic AI, Large Language Models, Decentralized Web
SEMANTiCS serves as a vibrant platform facilitating the exchange of cutting-edge scientific findings in the realm of semantic systems and Artificial Intelligence. Furthermore, it extends its scope to encompass research challenges in areas such as data science, machine learning, logic programming, content engineering, social computing, natural language processing, digital humanities, and the Semantic Web. Having reached its 20th anniversary, the conference has evolved into a distinguished international event that seamlessly bridges the gap between academia and industry. Participants and contributors of SEMANTiCS gain invaluable insights from esteemed researchers and industry experts, enabling them to stay abreast of emerging trends and themes within the vast field of semantic computing and artificial intelligence. The SEMANTiCS community thrives on its diverse composition, attracting professionals with multifaceted roles encompassing artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
In 2024, the conference embraced the subtitle “Knowledge Graphs in the Time of Language Models and Neuro-Symbolic AI” and particularly welcomed submissions pertaining to the following topics:
∙ Web Semantics & Linked (Open) Data
∙ Enterprise Knowledge Graphs, Graph Data Management
∙ Machine Learning Techniques for/using Knowledge Graphs (e.g. reinforcement learning, deep learning, data mining and knowledge discovery)
∙ Interplay between generative AI and Knowledge Graphs (e.g., RAG approach)
∙ Knowledge Management (e.g. acquisition, capture, extraction, authoring, integration, publication)
∙ Terminology, Thesaurus & Ontology Management, Ontology engineering
∙ Reasoning, Rules, and Policies
∙ Natural Language Processing for/using Knowledge Graphs (e.g. entity linking and resolution using target knowledge such as Wikidata and DBpedia, foundation models)
∙ Crowdsourcing for/using Knowledge Graphs
∙ Data Quality Management and Assurance
∙ Mathematical Foundation of Knowledge-aware AI
∙ Multimodal Knowledge Graphs
∙ Semantics in Data Science
∙ Semantics in Blockchain environments
∙ Trust, Data Privacy, and Security with Semantic Technologies
∙ IoT, Stream Processing, dealing with temporal data
∙ Conversational AI and Dialogue Systems
∙ Provenance and Data Change Tracking
∙ Semantic Interoperability (via mapping, crosswalks, standards, etc.)
∙ Linked Data storage, triple stores, graph databases
∙ Robust and scalable management, querying and analysis of semantics and data
∙ User interfaces for the Semantic Web & its management
∙ Explainable and Interoperable AI
∙ Decentralised and Federated Knowledge Graphs (e.g., Federated querying, link traversal)
Application of Semantically-Enriched and AI-Based Approaches, such as, but not limited to:
∙ Knowledge Graphs in Bioinformatics, Medical AI and preventive healthcare
∙ Clinical Use Case of semantic-enabled AI-based Approaches
∙ AI for Environmental Challenges
∙ Semantics in Scholarly Communication and Scientific Knowledge Graphs
∙ AI and LOD within GLAM (galleries, libraries, archives, and museums) institutions
∙ Knowledge Graphs & hybrid AI for predictive maintenance and Industry 4.0/5.0
∙ Digital Humanities and Cultural Heritage
∙ LegalTech, AI Safety, EU AI Act
∙ Economics of Data, Data Services, and Data Ecosystems
Following the public call for papers, the Research and Innovation track attracted 95 submissions. To ensure high-quality evaluations, a program committee of 103 members collaborated to identify the papers of greatest impact and scientific merit. A double-anonymous review process was implemented, in which the identities of both authors and reviewers were concealed, and a minimum of three independent reviews was conducted for each submission. Upon completion of all reviews, the program committee chairs compared and deliberated on the evaluations, addressing any disparities or differing viewpoints with the reviewers. This comprehensive approach culminated in a meta-review, enabling the committee to recommend acceptance or rejection of each paper. Ultimately, we were pleased to accept 26 papers, resulting in an acceptance rate of 27%.
In addition to the peer-reviewed work, the conference featured two keynotes by renowned speakers: Alon Halevy (Senior Principal Scientist at Amazon) and Maria-Esther Vidal (TIB-Leibniz Information Centre for Science and Technology, Leibniz University of Hannover, and the L3S Research Centre). The keynote of the co-located DBpedia Day was given by Ruben Verborgh (Ghent University - imec).
Additionally, the program had posters and demos, a comprehensive set of workshops, as well as talks from industry leaders.
We thank all authors who submitted papers. We particularly thank the program committee which provided careful reviews in a quick turnaround time. Their service is essential for the quality of the conference.
Special thanks also go to our sponsors without whom this event would not be possible. At the time of submission of the proceedings the sponsors were:
Gold Sponsors: Metaphacts, Ordina, PoolParty, Taxonic
Silver Sponsors: Ontotext, Pantopix, RDFox, redpencil.io, Triply
Startup Sponsors: MeaningFy, ModelDesk, SP - Semantic Partners
Sincerely yours,
The Editors
Amsterdam, September 2024
The interoperability of domain ontologies, which are developed independently by domain experts, necessitates their alignment. Within these ontologies, defined concepts often suffer from ambiguity stemming from the use of natural language. This interoperability issue gives rise to the underlying ontology matching (OM) challenge. OM can be defined as the identification of correspondences or relationships between two or more entities, such as classes or properties, across two or more ontologies. Rule-based ontology matching approaches, e.g., LogMap and AML, have not outperformed machine learning-based matchers on the Ontology Alignment Evaluation Initiative (OAEI) benchmark datasets, especially on the OAEI Conference track, since 2020. Supervised machine or deep learning approaches produce the best results but require labeled training datasets. In the era of Large Language Models (LLMs), robust zero-shot prompting of LLMs can also return convincing responses. While prompt generation requires prompt template engineering by domain experts, contextual information about the concepts to be aligned can be retrieved by leveraging graph search algorithms. In this work, we explore how graph search algorithms, namely (i) Random Walk and (ii) Tree Traversal, can be utilized to retrieve the contextual information to be incorporated into prompt templates. Through these algorithms, our approach refrains from considering all triples connected to a concept when creating its contextual information. Our experiments show that including the retrieved contextual information in prompt templates improves the matcher’s performance. Additionally, our approach outperforms previous works leveraging zero-shot prompting.
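To make the idea concrete, the sketch below illustrates how a short random walk over an ontology's triples could collect contextual statements for a concept and slot them into a zero-shot prompt. This is a minimal illustration of the general technique, not the paper's implementation: the toy ontology fragment, walk parameters, and prompt wording are all assumptions.

```python
import random

# Toy ontology fragment as an adjacency list of (predicate, neighbour) pairs.
# In practice these triples would come from the parsed OWL/RDF ontology.
ONTOLOGY = {
    "ConferenceMember": [("rdfs:subClassOf", "Person"), ("rdfs:label", "conference member")],
    "Person": [("rdfs:subClassOf", "Agent"), ("rdfs:label", "person")],
    "Agent": [("rdfs:label", "agent")],
}

def random_walk_context(graph, start, walk_length=3, num_walks=2, seed=0):
    """Collect triples seen on short random walks starting from a concept."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            neighbours = graph.get(node, [])
            if not neighbours:
                break
            predicate, nxt = rng.choice(neighbours)
            context.append((node, predicate, nxt))
            node = nxt
    return context

def build_prompt(source_concept, target_concept, source_ctx, target_ctx):
    """Fill a zero-shot prompt template with the retrieved context."""
    def fmt(ctx):
        return "\n".join(f"  {s} {p} {o}" for s, p, o in ctx)
    return (
        f"Source concept: {source_concept}\nContext:\n{fmt(source_ctx)}\n\n"
        f"Target concept: {target_concept}\nContext:\n{fmt(target_ctx)}\n\n"
        "Question: Do the two concepts refer to the same notion? Answer yes or no."
    )

ctx = random_walk_context(ONTOLOGY, "ConferenceMember")
print(build_prompt("ConferenceMember", "Participant", ctx, []))
```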
Compliance with legal documents related to industrial maintenance obliges a company to oversee, maintain, and repair its equipment. As legal documents endlessly evolve, companies favour automatically processing these texts to facilitate analysis and compliance. The first step in this automatic pipeline is the extraction of legal entities. However, state-of-the-art approaches such as BERT have so far required a large amount of data to be effective. Creating such a training dataset, however, is a time-consuming task requiring input from domain experts. In this paper, we bootstrap legal entity extraction by leveraging Large Language Models and a semantic model in order to reduce the involvement of the domain experts. We develop the industrial perspective by detailing the technical implementation choices. Consequently, we present our roadmap for an end-to-end pipeline designed expressly for the extraction of legal rules while limiting the involvement of experts.
This paper presents an exploratory study that investigates the use of various Large Language Models (LLMs) for the task of taxonomy expansion. Our objective is to enhance the taxonomical structure by querying LLMs for (1) child taxons and (2) alternative labels of existing taxons. Beginning with an incomplete taxonomy, we explore the most effective ways to prompt LLMs, exploiting explicit and shared knowledge captured in manually curated taxonomies to provide context for the task at hand. We experiment with different prompting templates, well-recognized taxonomies (EuroVoc, STW, UNESCO), and popular language models (Claude, Claude3, Llama2). Our results suggest the feasibility of solving the proposed task with modern LLMs and human oversight. Moreover, we observe certain patterns and trends in the performance of the models, noting that it was not possible to identify a single best configuration that would fit all models.
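A rough illustration of how such prompts might be assembled is given below. The template wording, the taxonomy fragment, and the idea of passing the broader-term path plus known narrower terms as context are assumptions for illustration only; the paper's actual prompt templates differ.

```python
def expansion_prompts(taxon, broader_path, known_children):
    """Build zero-shot prompts for (1) child taxons and (2) alternative labels."""
    context = (
        f"Taxonomy path: {' > '.join(broader_path + [taxon])}\n"
        f"Known narrower terms: {', '.join(known_children) or 'none'}\n"
    )
    child_prompt = (
        context
        + f"List further narrower terms of '{taxon}' that fit this taxonomy, one per line."
    )
    altlabel_prompt = (
        context
        + f"List alternative labels (synonyms) for '{taxon}', one per line."
    )
    return child_prompt, altlabel_prompt

child_p, alt_p = expansion_prompts(
    "renewable energy",
    broader_path=["energy policy"],
    known_children=["solar energy", "wind power"],
)
print(child_p)
```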
Traditional dataset retrieval systems rely on metadata for indexing, rather than on the underlying data values. However, high-quality metadata creation and enrichment often require manual annotation, which is a labour-intensive process that is challenging to automate. In this study, we propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini. Our analysis focuses on classifying column headers based on domain-specific topics from the Consortium of European Social Science Data Archives (CESSDA), a Linked Data controlled vocabulary. Our approach operates in a zero-shot setting, integrating the controlled topic vocabulary directly within the input prompt. This integration serves as a Large Context Windows approach, with the aim of improving the results of the topic classification task.
We evaluated the performance of the LLMs in terms of internal consistency, inter-machine alignment, and agreement with human classification. Additionally, we investigate the impact of contextual information (i.e., dataset description) on the classification outcomes. Our findings suggest that ChatGPT and GoogleGemini outperform GoogleBard in terms of internal consistency as well as LLM-human-agreement. Interestingly, we found that contextual information had no significant impact on LLM performance.
This work proposes a novel approach that leverages LLMs for topic classification of column headers using a controlled vocabulary, presenting a practical application of LLMs and Large Context Windows within the Semantic Web domain. This approach has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data on the Web.
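The sketch below shows what a zero-shot, large-context prompt of this kind could look like, with the controlled vocabulary embedded directly in the prompt. The topic labels and prompt wording are hypothetical stand-ins; the study uses the full CESSDA topic vocabulary and its own templates.

```python
# Hypothetical subset of a controlled topic vocabulary; the real study uses the
# CESSDA topic vocabulary, and the prompt wording below is an assumption.
TOPICS = ["Health", "Education", "Labour and employment", "Housing", "Migration"]

def classification_prompt(column_header, dataset_description=None, topics=TOPICS):
    """Embed the whole controlled vocabulary in the prompt (large-context approach)."""
    parts = [
        "Classify the following dataset column header into exactly one topic "
        "from the controlled vocabulary below.",
        "Controlled vocabulary: " + "; ".join(topics),
    ]
    if dataset_description:  # optional context, found to have little effect in the study
        parts.append("Dataset description: " + dataset_description)
    parts.append(f"Column header: {column_header}")
    parts.append("Answer with the topic label only.")
    return "\n".join(parts)

print(classification_prompt("highest_completed_degree"))
```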
The increasing demand for automatic high-level image understanding, including the detection of abstract concepts (AC) in images, presents a complex challenge both technically and ethically. This demand highlights the need for innovative and more interpretable approaches that reconcile traditional deep vision methods with the situated, nuanced knowledge that humans use to interpret images at such high semantic levels. To bridge the gap between the deep vision and situated perceptual paradigms, this study aims to leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. To facilitate downstream tasks such as AC-based image classification, we compute Knowledge Graph Embeddings (KGE). We experiment with relative representations [1] and hybrid approaches that fuse these embeddings with visual transformer embeddings. Finally, for interpretability, we conduct post-hoc qualitative analyses by examining model similarities with training instances. The adoption of the relative representation method significantly bolsters KGE-based AC image classification, while our hybrid methods outperform state-of-the-art approaches. The post-hoc interpretability analyses reveal the visual transformer’s proficiency in capturing pixel-level visual attributes, contrasting with our method’s efficacy in representing more abstract and semantic scene elements. Our results demonstrate the synergy and complementarity between KGE embeddings’ situated perceptual knowledge and deep visual models’ sensory-perceptual understanding for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available at https://github.com/delfimpandiani/Stitching-Gaps
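As background on one ingredient of this pipeline, the snippet below sketches the relative representation idea: each sample is re-expressed as its cosine similarities to a fixed set of anchor samples, which makes embeddings from different models (here, toy KGE and visual-transformer vectors) comparable and easy to fuse. The dimensions, random values, anchor selection, and concatenation-based fusion are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def relative_representation(embeddings, anchor_embeddings):
    """Represent each sample by its cosine similarities to a fixed set of anchors."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    return normalize(embeddings) @ normalize(anchor_embeddings).T

rng = np.random.default_rng(0)
kge = rng.normal(size=(100, 128))   # KG embeddings of image nodes (toy values)
vit = rng.normal(size=(100, 768))   # visual transformer embeddings (toy values)
anchors = rng.choice(100, size=20, replace=False)

# Relative representations live in a shared anchor space, so the two modalities
# can simply be concatenated for a downstream abstract-concept classifier.
rel_kge = relative_representation(kge, kge[anchors])
rel_vit = relative_representation(vit, vit[anchors])
fused = np.concatenate([rel_kge, rel_vit], axis=1)   # shape (100, 40)
print(fused.shape)
```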
Entity Linking is crucial for numerous downstream tasks, such as question answering, knowledge graph population, and general knowledge extraction. A frequently overlooked aspect of entity linking is the potential encounter with entities not yet present in a target knowledge graph. Although some recent studies have addressed this issue, they primarily utilize full-text knowledge bases or depend on external information such as crawled webpages. However, full-text knowledge bases are not available in all domains, using external information entails additional effort, and in most use cases these resources are simply unavailable. In this work, we rely solely on the information within a knowledge graph and assume no external information is accessible.
To investigate the challenge of identifying and disambiguating entities absent from the knowledge graph, we introduce a comprehensive silver-standard benchmark dataset that covers texts from 1999 to 2022. Based on our novel dataset, we develop an approach using pre-trained language models and knowledge graph embeddings without the need for a parallel full-text corpus. Moreover, by assessing the influence of knowledge graph embeddings on the given task, we show that a sequential entity linking approach, which considers the whole sentence, can in specific instances outperform clustering techniques that handle each mention separately.
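A minimal sketch of one way to flag entities that are absent from the knowledge graph is shown below: a mention embedding (e.g., from a pre-trained language model) is compared against candidate entity embeddings, and a NIL decision is made when no similarity exceeds a threshold. The threshold value and the random vectors are placeholders; the paper's sequential, sentence-level approach is considerably more involved.

```python
import numpy as np

def link_or_nil(mention_embedding, candidate_ids, entity_embeddings, threshold=0.6):
    """Return the best-matching KG entity, or None if the mention is likely
    absent from the KG (all similarities fall below the threshold)."""
    cands = entity_embeddings[candidate_ids]
    sims = cands @ mention_embedding / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(mention_embedding) + 1e-12
    )
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return candidate_ids[best], float(sims[best])
    return None, float(sims[best])   # out-of-KG (NIL) mention

rng = np.random.default_rng(1)
entity_emb = rng.normal(size=(1000, 64))   # toy KG entity embeddings
mention = rng.normal(size=64)              # toy PLM mention embedding
print(link_or_nil(mention, [3, 17, 42], entity_emb))
```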
Knowledge Graphs (KGs) are relational knowledge bases that represent facts as a set of labelled nodes and the labelled relations between them. Their machine learning counterpart, Knowledge Graph Embeddings (KGEs), learn to predict new facts based on the data contained in a KG – the so-called link prediction task. To date, almost all forms of link prediction for KGs rely on some form of embedding model, and KGEs hold state-of-the-art status for link prediction. In this paper, we present TWIG-I (Topologically-Weighted Intelligence Generation for Inference), a novel link prediction system that can represent the features of a KG in latent space without using node or edge embeddings. TWIG-I shows mixed performance relative to state-of-the-art KGE models – at times exceeding or falling short of baseline performance. However, unlike KGEs, TWIG-I can be natively used for transfer learning across distinct KGs. We show that using transfer learning with TWIG-I can, in some cases, lead to performance gains both over KGE baselines and over TWIG-I models trained without fine-tuning. While these results are still mixed, TWIG-I clearly demonstrates that structural features are sufficient to solve the link prediction task in the absence of embeddings. Finally, TWIG-I opens up cross-KG transfer learning as a new direction in link prediction research and application.
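To illustrate the underlying idea of embedding-free link prediction from structural features, the sketch below scores candidate triples using only simple topological statistics (node degrees and relation frequency) fed to an off-the-shelf classifier. It is a toy approximation of the concept, assuming made-up triples and features; TWIG-I itself uses a neural model and a much richer feature set.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

triples = [("a", "knows", "b"), ("b", "knows", "c"), ("a", "worksAt", "x"),
           ("c", "worksAt", "x"), ("b", "worksAt", "y")]

# Structural statistics of the KG: node degrees and relation frequencies.
deg, rel_freq = Counter(), Counter()
for s, p, o in triples:
    deg[s] += 1
    deg[o] += 1
    rel_freq[p] += 1

def features(s, p, o):
    """Purely structural features of a candidate triple: no node/edge embeddings."""
    return [deg[s], deg[o], rel_freq[p]]

positives = [features(*t) for t in triples]
negatives = [features("a", "knows", "x"), features("c", "knows", "y")]  # corrupted triples
X = np.array(positives + negatives)
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([features("a", "worksAt", "y")])[0, 1])  # plausibility score
```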
Relational graph convolutional networks (RGCNs) have been successful in learning from knowledge graphs. However, training on large-scale knowledge graphs becomes challenging due to the exponential growth of the neighborhood size across the network layers. Moreover, knowledge graphs have multiple relations, and the literals can often have multimodal content; these properties make it extra challenging to scale up the training of RGCNs to large-scale graphs. Graph sampling techniques have been shown to be effective in scaling learning to large graphs by reducing the number of processed nodes and lowering memory usage. However, only a few studies have focused on sampling for knowledge graphs. In this work, we introduce ReWise, a relation-wise sampling framework that includes a family of sampling methods designed for knowledge graphs. Our experiments demonstrate that sampling reduces memory usage by up to 50% compared to training without sampling, while maintaining the same classification accuracy and, in some cases, even improving it. Additionally, we show that our sampling strategy is compatible with the multimodal RGCN, exhibiting the same behavior as with standard RGCNs.
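A minimal sketch of relation-wise neighbour sampling is shown below: instead of sampling from the pooled neighbourhood, a separate budget is applied per relation type so that rare relations are not crowded out by frequent ones. The budget value and toy neighbourhood are assumptions; ReWise comprises a family of more elaborate sampling methods.

```python
import random

def relation_wise_sample(neighbours_by_relation, budget_per_relation, seed=0):
    """Sample at most `budget_per_relation` neighbours for every relation type."""
    rng = random.Random(seed)
    sampled = {}
    for relation, neighbours in neighbours_by_relation.items():
        k = min(budget_per_relation, len(neighbours))
        sampled[relation] = rng.sample(neighbours, k)
    return sampled

node_neighbourhood = {
    "rdf:type": ["Person"],
    "foaf:knows": ["alice", "bob", "carol", "dave", "erin"],
    "schema:image": ["img1.jpg", "img2.jpg"],   # multimodal literal neighbours
}
print(relation_wise_sample(node_neighbourhood, budget_per_relation=2))
```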
Machine learning (ML) is becoming increasingly important in healthcare decision-making, requiring highly interpretable insights from predictive models. Although integrating ML models with knowledge graphs (KGs) holds promise, conveying model outcomes to domain experts remains challenging, hindering usability despite accuracy. We propose semantically describing predictive model insights to overcome these communication barriers. Our pipeline predicts lung cancer relapse likelihood, providing oncologists with patient-centric explanations based on input characteristics. Consequently, domain experts gain insights into both the characteristics of classified lung cancer patients and their relevant population. These insights, along with model decisions, are semantically described in natural language to enhance understanding, particularly when paired with interpretability techniques such as LIME and SHAP. Our approach, SemDesLC, documents ML model pipelines into KGs and fulfills the needs of three types of users: KG builders, analysts, and consumers. Experts’ opinions indicate that semantic descriptions are effective for elucidating relapse determinants. SemDesLC is openly accessible on GitHub, promoting transparency and collaboration in leveraging ML for healthcare decision support.
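As a rough illustration of turning a model decision and its feature attributions into a patient-centric natural-language description, consider the sketch below. The feature names, scores, and sentence template are hypothetical; SemDesLC additionally documents the whole pipeline and its explanations in a knowledge graph.

```python
def describe_prediction(patient_id, relapse_probability, attributions, top_k=3):
    """Render a model decision and its top feature attributions as a sentence.
    `attributions` maps feature names to signed contribution scores,
    e.g. produced by a SHAP- or LIME-style explainer."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    reasons = ", ".join(
        f"{name} ({'increases' if score > 0 else 'decreases'} the risk)"
        for name, score in ranked
    )
    return (
        f"For patient {patient_id}, the model estimates a relapse likelihood of "
        f"{relapse_probability:.0%}. The most influential characteristics are: {reasons}."
    )

print(describe_prediction(
    "P-042", 0.72,
    {"tumour stage III": 0.31, "smoking history": 0.18, "age under 50": -0.09},
))
```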
Purpose:
A previous paper proposed the usage of SHACL to assess the FAIRness of software repositories. Following this call to action, this paper introduces and discusses the changes made to QUARE, a SHACL-based tool for validating GitHub repositories against sets of quality criteria, to facilitate this task.
Methodology:
An operationalization of the abstract FAIR best practices from previous work is devised to enable a FAIRness assessment based on concrete quality criteria. Afterwards, a SHACL shapes graph implementing these constraints is introduced, followed by a discussion of the efficient generation of suitable RDF representations for GitHub repositories. Improvements regarding the usability of QUARE are examined, as well. An evaluation on the FAIRness of 223 GitHub repositories and on the runtime performance of the assessment is conducted.
Findings:
On average, trending repositories comply with fewer FAIR best practices than repositories expected to be FAIR. However, the latter still exhibit deficiencies, for example regarding the correct application of semantic versioning. The low average runtime of the FAIRness assessment, 3.50 and 5.73 seconds per repository for the two repository sets respectively, permits the integration of QUARE in, e.g., CI/CD pipelines.
Value:
The FAIR principles are often mentioned as a measure to tackle the reproducibility crisis, which continues to have a significant impact on science. To implement these principles in practice, it is crucial to provide tools that facilitate the automated assessment of the FAIRness of software repositories. The enhanced version of QUARE introduced in this paper represents our proposal for this demand.
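For readers unfamiliar with SHACL-based validation, the sketch below shows the general mechanism on a deliberately tiny, hypothetical shape (every repository must declare a license and a README) using rdflib and pySHACL. The vocabulary and constraints are placeholders and not QUARE's actual shapes graph.

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical shape: every software repository must declare a license and a README.
SHAPES = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/repo#> .

ex:RepositoryShape a sh:NodeShape ;
    sh:targetClass ex:Repository ;
    sh:property [ sh:path ex:license ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:readme  ; sh:minCount 1 ] .
"""

# RDF representation of one repository; the README statement is missing.
DATA = """
@prefix ex: <http://example.org/repo#> .
ex:myRepo a ex:Repository ;
    ex:license "MIT" .
"""

conforms, _, report = validate(
    Graph().parse(data=DATA, format="turtle"),
    shacl_graph=Graph().parse(data=SHAPES, format="turtle"),
)
print(conforms)   # False: the README constraint is violated
print(report)
```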
Purpose:
The Smart Readiness Indicator (SRI) is an energy rating scheme targeted at buildings to evaluate their capacity to integrate and benefit from smart technologies for enhanced energy efficiency and overall performance. Existing tools for SRI assessment and rating do not provide a standard format for data exchange. However, there are several scenarios in which a FAIR, standardised data format is beneficial, such as data exchange between building tools, comparison of different assessments, or computing statistics about buildings.
Methodology:
We propose the Semantic Smart Readiness Indicator framework, consisting of an SRI information model and a SPARQL-based SRI score calculation. We follow the Linked Open Terms ontology engineering method by specifying the use case from which the requirements and competency questions are derived. We reuse existing ontologies and extend them to create the SRI ontology.
Findings:
The model is published according to the FAIR principles. Moreover, it is flexible to accommodate specific SRI requirements, and can be aligned with existing semantic building models to facilitate data linking and exchange. The score calculation, in turn, is composed of multiple SPARQL queries defined over the model.
Value:
In this paper, we describe our proposed framework, the ontology engineering process, and the evaluation of both the model and the SPARQL-based SRI calculation. All the resources are openly available for reuse.
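The snippet below gives a minimal impression of a SPARQL-based score calculation over an RDF model, using hypothetical class and property names rather than the actual SRI ontology terms; the real calculation is composed of multiple queries and domain-specific weightings.

```python
from rdflib import Graph

# Hypothetical instance data; the real SRI ontology defines its own terms and weights.
DATA = """
@prefix ex: <http://example.org/sri#> .
ex:building1 ex:hasAssessment [ ex:domain ex:Heating ;    ex:score 3 ; ex:maxScore 4 ] ,
                              [ ex:domain ex:Lighting ;   ex:score 1 ; ex:maxScore 4 ] ,
                              [ ex:domain ex:Monitoring ; ex:score 2 ; ex:maxScore 4 ] .
"""

# Aggregate the achieved scores against the maximum achievable scores per building.
QUERY = """
PREFIX ex: <http://example.org/sri#>
SELECT ?building ((SUM(?score) / SUM(?max)) * 100 AS ?sriPercentage)
WHERE {
  ?building ex:hasAssessment ?a .
  ?a ex:score ?score ; ex:maxScore ?max .
}
GROUP BY ?building
"""

g = Graph().parse(data=DATA, format="turtle")
for row in g.query(QUERY):
    print(row.building, float(row.sriPercentage))   # 50.0 for the toy data
```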
Sustainability reporting by Small and Medium Enterprises (SMEs) is gaining importance. SMEs form the backbone of European industries, and their customers rely on them to ensure regulatory compliance. In preparing sustainability reports, a combination of standards is commonly used, which encompasses overlapping yet distinct requirements on sustainability indicators. Different standards categorize shared indicators under varying topics, while they also mandate unique indicators to assess identical sustainability phenomena. This poses challenges for SMEs reporting against multiple standards. Considerable human effort is required to determine the interconnected requirements across different standards. Additionally, reporting on overlapping indicators for new standards results in significant redundant work. Mapping indicators between different standards enables the semantic interoperability of standards by indicating matching and distinct requirements, helping to address these challenges. Therefore, this paper focuses on developing an ontology for mapping indicators from two significant standards, GRI and ESRS. We introduce the Sustainability Reporting Standards Ontology (RSO). RSO formally represents environmental indicators in GRI and ESRS, and is available online. Furthermore, we provide an ontology-based mapping between indicators, supported by concrete examples that illustrate the interconnections between them.
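A minimal sketch of what such an ontology-based mapping could look like in RDF is given below, using SKOS mapping properties and illustrative URIs; the RSO identifiers and the exact mapping predicates used by the authors may differ.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical URIs for illustration only; RSO defines its own identifiers.
RSO = Namespace("http://example.org/rso#")
GRI = Namespace("http://example.org/gri#")
ESRS = Namespace("http://example.org/esrs#")

g = Graph()
g.bind("skos", SKOS)

# Both standards cover direct (Scope 1) greenhouse-gas emissions with distinct indicators.
gri_indicator = GRI["305-1_DirectGHGEmissions"]
esrs_indicator = ESRS["E1-6_GrossScope1Emissions"]
for ind in (gri_indicator, esrs_indicator):
    g.add((ind, RDF.type, RSO.EnvironmentalIndicator))

# An ontology-based mapping marks the overlap so SMEs can report once for both standards.
g.add((gri_indicator, SKOS.closeMatch, esrs_indicator))

print(g.serialize(format="turtle"))
```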
Purpose:
The first objective of this research is to represent an event with 5W1H characteristics (who, what, where, when, why, and how) through ontologies. The second objective is to propose an approach for enriching an event knowledge graph (EKG) based on this ontology using EvCBR, a high-performing case-based reasoning algorithm from the literature. Furthermore, we have studied the impact of each W (Who, Where, and When) on the performance of EvCBR on the Wikipedia Causal Event dataset.
Methodology:
We proposed the XPEventCore ontology to represent 5W1H characteristics of events by integrating multiple event ontologies (SEM and FARO) and introduced new object properties for representing Cause and Method to answer “Why” and “How” questions. We adopted this XPEventCore ontology for a specific use case (the MR4AP Wikipedia dataset), and populated the EKG. Furthermore, we adapted EvCBR, a case-based reasoning approach, to enrich this EKG.
Findings:
The XPEventCore ontology provides a structured and adaptable foundation for capturing the essential facets of an event. It can be adapted to any domain (such as the MR4AP Wikipedia dataset) and populated to generate EKGs. We then applied EvCBR, and the subsequent analysis revealed that reasoning had a significant impact: notably, EvCBR performed better when combined with reasoning approaches on the dataset.
Originality:
The XPEventCore ontology is the first ontology to represent an event with 5W1H characteristics.
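To give a flavour of the 5W1H modelling described above, the sketch below populates a tiny event graph with who/what/where/when/why/how statements using rdflib. The namespace and property names are hypothetical placeholders, not the actual XPEventCore terms.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

# Hypothetical namespace and property names for illustration; XPEventCore
# integrates SEM and FARO and defines its own terms for Cause and Method.
XPE = Namespace("http://example.org/xpeventcore#")
EX = Namespace("http://example.org/events#")

g = Graph()
g.bind("xpe", XPE)

event = EX.flood2021
g.add((event, RDF.type, XPE.Event))
g.add((event, XPE.who, EX.localResidents))                           # who
g.add((event, XPE.what, Literal("river flooding")))                  # what
g.add((event, XPE.where, EX.AhrValley))                              # where
g.add((event, XPE.when, Literal("2021-07-14", datatype=XSD.date)))   # when
g.add((event, XPE.why, EX.extremeRainfallEvent))                     # why (cause)
g.add((event, XPE.how, EX.damOverflow))                              # how (method)

print(g.serialize(format="turtle"))
```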
This paper develops the first deepfake domain ontology, which could assist ordinary individuals in understanding the growing concerns around AI-manipulated digital media and researchers in domain knowledge integration and inference. For a foundational ontology, the authors focused on structuring knowledge related to a deepfake attack, such as the vulnerable entity, deepfake creator, attack goal, medium, generation technique, consequences, and preventive measures. The authors used a knowledge engineering methodology, Protégé Desktop, and the W3C Web Ontology Language for ontology creation. A manual literature review of prominent research publications and an evaluation of existing ontologies helped identify 19 core entities and 28 relations describing the deepfake domain. The paper also presents a knowledge graph application of the developed ontology. Textual data on more than 35 global deepfake events in the context of politics, law, world security, etc. is collected and visualized in the form of knowledge graphs. The authors created SWRL rules that helped infer additional information from the deepfake attack knowledge base via the knowledge graph application, such as the various ways a particular entity can be affected by a deepfake, the mediums used for attacks, and the online security measures victims can adopt. The ontology can be extended iteratively with new domain advancements. As a next step, the authors plan to adopt NLP approaches for automating domain entity research and deepfake event knowledge base population.
To advance the adoption of linked data in the context of the Dutch Federated Data System (Dutch abbreviation: FDS), it is necessary to have robust access control for native linked data sources. For this purpose, research was initiated to assess whether it is feasible to implement access controls on linked data sources in this context. A four-phase design science research methodology is applied. The first phase defines both the question guiding this research and the context in which the research was conducted. The second phase includes a review of the state of the art and an evaluation of the existing approaches to access control that could support the FDS use case. Having determined that no existing approach completely fulfils the requirements of the FDS use case, the third phase describes a prototype enforcement mechanism designed and developed as part of this research. The fourth and final phase evaluates this prototype with respect to its feasibility to support the requirements of the FDS context. At present, there are no standardized solutions for securing native linked data sources. Existing literature and industry examples from the Netherlands and Europe highlight several potential solution directions for access control on SPARQL endpoints. These solution directions are used as inspiration for the development of a prototype enforcement mechanism. This prototype shows potential when applied to the Dutch Federated Data System and suggests that a more generic approach could be taken when applying these controls in a broader context. Further research, testing and standardization efforts are required to bring such an approach to maturity. Any linked data ecosystem containing closed information requires a robust approach to access control. This research contributes to the existing literature on approaches taken to such access controls and highlights the increasing need for, and the feasibility of implementing, these controls in governmental contexts. Bringing such a solution to maturity would support wider adoption of linked data technologies in this context.
This paper addresses the issue of searching legislative documents in an international multilingual setting. Legal documents are published in different countries using local languages, as well as country-specific semantic keyword and classification systems. Consequently, when moving from one country to another, citizens may face obstacles when looking for regulations on immigration, health care, education, etc. To address this challenge, this paper presents FinEstLawSampo, a cross-border solution approach and proof-of-concept demonstrator based on Linked Open Data (LOD) and Semantic Web technologies. We describe the design and implementation of this idea using consolidated laws from Finland and Estonia alongside EU directives as a case study. The demonstrator includes a LOD service and a semantic portal, based on the Sampo Model, which adapts the interface and contents to the user-chosen language. The main novelty presented is the provision of heterogeneous cross-border, multilingual, distributed legal data through faceted search and data exploration, as well as data analysis for legal informatics.
Traditionally, querying knowledge graphs is free of charge. However, ensuring data and service availability incurs costs for knowledge graph providers. The Delayed-Answer Auction (DAA) model has been proposed to fund the maintenance of knowledge graph endpoints by allowing customers to sponsor entities in the knowledge graph so that query results that include them are delivered with priority. However, implementing DAA with a time to first results acceptable for data consumers is challenging because it requires reordering results according to bid values. In this paper, we present the AuctionKG approach to enable DAA with a low impact on query execution performance. AuctionKG relies on (i) reindexing sponsored entities by bid values to ensure they are processed first and (ii) Web preemption to ensure delayed answering. Experimental results demonstrate that our approach outperforms a baseline approach to enable DAA in terms of time to first results.
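The sketch below only illustrates the delivery order that DAA aims for: solutions mentioning higher-bid sponsored entities come first, and unsponsored solutions are delayed. AuctionKG achieves this at the index level (reindexing by bid) combined with Web preemption rather than by post-hoc sorting, so this is a conceptual illustration with hypothetical bids, not the paper's mechanism.

```python
# Hypothetical bid table: sponsors pay to have their entities served first.
BIDS = {"ex:ProductA": 12.0, "ex:ProductB": 5.0}

def prioritise(results, bids=BIDS):
    """Order query solutions so that those mentioning higher-bid entities are
    delivered first; unsponsored solutions are delayed to the end."""
    def best_bid(solution):
        return max((bids.get(value, 0.0) for value in solution.values()), default=0.0)
    return sorted(results, key=best_bid, reverse=True)

results = [
    {"?product": "ex:ProductC"},
    {"?product": "ex:ProductA"},
    {"?product": "ex:ProductB"},
]
print(prioritise(results))
```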
One challenge in utilizing knowledge graphs, especially with machine learning techniques, is the issue of scalability. In this context, we propose a method to substantially reduce the size of these graphs, allowing us to concentrate on the most relevant sections of the graph for a specific application or context. We define the notion of context graph as an extract from one or more general knowledge bases (such as DBpedia, Wikidata, Yago) that contains the set of information relevant to a specific domain while preserving the properties of the original graph. We validate the approach on a DBpedia excerpt for entities related to the Data&Musée project and the KORE reference set according to two aspects: the coverage of the context graph and the preservation of the similarity between its entities. The results show that the use of context graphs makes the exploitation of large knowledge bases more manageable and efficient while preserving the features of the initial graph.
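A minimal sketch of extracting such a context graph by a bounded traversal around seed entities is shown below; the paper's method additionally aims to preserve properties of the original graph, such as entity similarity, which this toy k-hop extraction does not address.

```python
from collections import deque

def context_graph(triples, seed_entities, hops=2):
    """Keep only the triples reachable within `hops` steps of the seed entities,
    producing a smaller graph focused on one domain or application."""
    adjacency = {}
    for t in triples:
        s, _, o = t
        adjacency.setdefault(s, []).append((o, t))
        adjacency.setdefault(o, []).append((s, t))   # traverse edges in both directions

    frontier = deque((e, 0) for e in seed_entities)
    seen, kept = set(seed_entities), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbour, triple in adjacency.get(node, []):
            kept.add(triple)
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return kept

kg = [("MonaLisa", "exhibitedAt", "Louvre"), ("Louvre", "locatedIn", "Paris"),
      ("Paris", "capitalOf", "France"), ("France", "memberOf", "EU")]
print(context_graph(kg, {"Louvre"}, hops=1))
```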
Purpose:
With the increasing size of Resource Description Framework (RDF) graphs, the resulting graph structures can become too large to be managed on a single compute node, which then lacks the necessary resources to execute a partitioning of the graph – in particular when the partitioning method relies on global graph information, for which the entire graph has to be loaded into main memory. This paper introduces a window-based streaming partitioning technique to obtain distributed RDF graphs, overcoming the memory limitations of traditional partitioning methods.
Methodology:
We evaluated our approach, UniPart, by comparing it with established graph partitioning algorithms such as METIS, LDG, and WStream. The comparison focused on key metrics, including the proportion of edge cuts.
Findings:
Through practical assessments using the LUBM dataset, our algorithm demonstrated strong performance in load balance, execution time, and memory usage. Notably, under the DFS streaming order, UniPart achieved a 20% reduction in edge-cut ratio compared to LDG.
Value:
UniPart operates without the need for global graph information, making it exceptionally suited for dynamic environments with unbounded streams and unpredictable data sizes.
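For orientation, the sketch below implements a generic LDG-style streaming heuristic of the kind UniPart is compared against: each vertex is greedily placed in the partition holding its already-seen neighbour, penalised by partition load. It is not UniPart's window-based algorithm, and the capacity value and toy edge stream are assumptions.

```python
def stream_partition(edge_stream, num_partitions, capacity):
    """Greedily place each vertex in the partition holding its already-seen
    neighbour, penalised by partition load (an LDG-style streaming heuristic)."""
    assignment, loads = {}, [0] * num_partitions

    def place(vertex, neighbour):
        if vertex in assignment:
            return
        def score(p):
            has_neighbour = 1 if assignment.get(neighbour) == p else 0
            return has_neighbour * (1 - loads[p] / capacity)
        # Prefer the highest score; break ties towards the least-loaded partition.
        best = max(range(num_partitions), key=lambda p: (score(p), -loads[p]))
        assignment[vertex] = best
        loads[best] += 1

    for subject, obj in edge_stream:   # single pass over the (unbounded) edge stream
        place(subject, obj)
        place(obj, subject)
    return assignment

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("e", "f")]
print(stream_partition(edges, num_partitions=2, capacity=4))
```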