Ebook: Semantic Data Mining
Ontologies are now increasingly used to integrate, describe, and organize data and knowledge, particularly in data and knowledge-intensive applications in both research and industry.
The book is devoted to semantic data mining – a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies and knowledge graphs, rather than purely empirical data alone.
The introductory chapters of the book provide theoretical foundations of both data mining and ontology representation. Taking a unified perspective, the book then covers several methods for semantic data mining, addressing tasks such as pattern mining, classification, and similarity-based approaches. It attempts to provide state-of-the-art answers to specific challenges and peculiarities of data mining with the use of ontologies, in particular: How to deal with incompleteness of knowledge and the so-called Open World Assumption? What is a truly “semantic” similarity measure?
The book contains several chapters with examples of applications of semantic data mining. The examples start from a scenario with moderate use of lightweight ontologies for knowledge graph enrichment and end with a full-fledged scenario of an intelligent knowledge discovery assistant using complex domain ontologies for meta-mining, i.e., an ontology-based meta-learning approach to full data mining processes.
The book is intended for researchers in the fields of semantic technologies, knowledge engineering, data science, and data mining, and developers of knowledge-based systems and applications.
Ontologies are now increasingly used to integrate, describe, and organize data and knowledge, particularly in data and knowledge-intensive applications in both research and industry. The book is devoted to semantic data mining, a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies and knowledge graphs, rather than purely empirical data alone.
Semantic data mining is a young research area. Kralj Novak, Vavpetič, Trajkovski, and Lavrač coined the term in 2009. In 2011, the author of this book co-organized the Semantic Data Mining tutorial as part of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD). The topic of data mining with and from ontologies and knowledge graphs is now present in major Semantic Web, knowledge engineering, data mining, and artificial intelligence journals and at major conferences. It is also the focus of workshop series largely devoted to semantic data mining, such as the Workshop on Knowledge Discovery and Ontologies (KDO), the Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (IRMLeS), the Workshop on Third Generation Data Mining: Towards Service-Oriented Knowledge Discovery (SoKD), the Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data (Know@LOD), the Workshop on Linked Data for Knowledge Discovery (LD4KD), and others.
This monograph intends to provide a synthesis of semantic data mining, summarizing some of the key problems and results. It is the first book on semantic data mining, providing a unifying perspective. This monograph is also intended to fulfill requirements for habilitation, and thus the methods and applications described in more detail (one per chapter in Chapters 4–6 and Chapters 8–10) are those that were (co-)developed by the author of this monograph. The envisaged readers of the book are researchers in the fields of semantic technologies, knowledge engineering, data science, and data mining, and developers of knowledge-based systems and applications.
The monograph starts with an introductory chapter on both data mining and ontology representation. Chapter 2 provides theoretical foundations of ontology representation languages and relevant query languages. Chapter 3 briefly introduces basic data mining tasks and then discusses data mining problems as search problems. Chapter 4 defines the problem of pattern mining and presents an algorithm for mining frequent patterns in description logic knowledge bases. Chapter 5 defines the task of classification, then focuses on the problem of pattern-based classification and presents an algorithm for classification of semantic data. Chapter 6 defines the task of clustering and discusses similarity measures and distances for semantic data, with special attention to semantic kernels. This chapter also addresses the question of what constitutes a truly “semantic” similarity measure. Chapter 7 discusses the peculiarities arising from the so-called Open World Assumption adopted by ontological representations, how they influence data mining on such representations, and how to deal with them. Chapter 8 describes the problem of knowledge graph refinement and discusses in more detail the task of finding synonymous properties in a knowledge graph. Chapter 9 introduces the concept of semantic categorization and a framework for semantic categorization of conjunctive query results for exploratory search. Chapter 10 first introduces semantic meta-mining, i.e., an ontology-based meta-learning approach over full data mining processes. It then describes a scenario involving an intelligent discovery assistant that supports users in designing data mining workflows.
The author would like to thank the co-authors of joint publications, co-workers from joint projects, and co-organizers of joint events and endeavours, without whose contribution this book could not have been written, including especially: Jedrzej Potoniec for a number of joint works and continued collaboration; Łukasz Józefowski, Tomasz Łukaszewski, and Joanna Józefowska for co-developing semantic kernels; Mikołaj Morzy for his work on substitutive sets; Claudia d'Amato and Nicola Fanizzi for co-developing a framework for semantic categorization of conjunctive query results; co-workers from the e-LICO EU FP7 project, including Huyen Do, Simon Fischer, Dragan Gamberger, Lina Al-Jadir, Simon Jupp, Alexandros Kalousis, Joerg-Uwe Kietz, Petra Kralj Novak, Nada Lavrač, Babak Mougouie, Phong Nguyen, Raúl Palma, Floarea Serban, Robert Stevens, Anze Vavpetič, Jun Wang, Derry Wijaya, Adam Woznica, and especially Melanie Hilario, the coordinator of the project; Maria Keet for her contribution to the Data Mining OPtimization Ontology; the co-organizers of the Semantic Data Mining Tutorial (Nada Lavrač, Anze Vavpetič, Jedrzej Potoniec, Melanie Hilario, and Alexandros Kalousis) and of the IRMLeS workshops (Claudia d'Amato, Nicola Fanizzi, Blaz Fortuna, Marko Grobelnik, Vojtech Svátek); and colleagues from the W3C Machine Learning Schema Community Group (including Diego Esteves, Panče Panov, Larisa Soldatova, Tommaso Soru, and Joaquin Vanschoren).
The author is grateful to those who hosted her seminars devoted to semantic data mining and offered fruitful discussions and feedback: Claudia d'Amato and Nicola Fanizzi from the University of Bari; the Protégé Group at Stanford University, headed by Mark Musen, especially Tania Tudorache (also for fruitful research collaboration); the Ontology Engineering Group at the Technical University of Madrid, headed by Asunción Gómez-Pérez and Oscar Corcho, especially Mari Carmen Suárez-Figueroa; and the Advanced Data Analysis in Applications Group (ADAA) at the Silesian University of Technology, including Marek Sikora and Michał Kozielski.
I thank Ingrid Maria Spakler and Arnoud de Kemp from AKA Verlag for the editing work.
I would also like to especially thank Pascal Hitzler, the founding editor-in-chief of the book series Studies on the Semantic Web, for his patience and support for this book.
The author is grateful for the support from funding agencies for the research projects: the Polish Ministry of Science and Higher Education for the grant No N N516 186437 on “Inductive reasoning on ontological knowledge bases” (2009–2012); the European Commission for the EU FP7 ICT-2007.4.4 project No 231519, entitled “e-LICO: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science” (2009–2012); the Foundation for Polish Science for the project “Learning and Evolving Ontologies from Linked Open Data (LeoLOD)” under the PARENT-BRIDGE program, co-financed by the European Union, Regional Development Fund (No POMOST/2013-7/8) (2013–2015); and the National Science Centre (Poland) for the project “ARISTOTELES: Methodology and algorithms for automatic revision of ontologies in task based scenarios” (No 2014/13/D/ST6/02076) under the SONATA program (2015–2018).
Above all, I would like to thank my family for their love, patience, and continued support during the writing of this book.
Poznan, March 2017
Agnieszka Ławrynowicz