Ebook: Ontology Learning and Population: Bridging the Gap between Text and Knowledge
The promise of the Semantic Web is that future web pages will be annotated not only with bright colors and fancy fonts as they are now, but with annotations extracted from large domain ontologies that specify, in a form a computer can exploit, what information is contained on a given web page. The presence of this information will allow software agents to examine pages and to make decisions about content as humans are able to do now. The classic method of building an ontology is to gather a committee of experts in the domain to be modeled, and to have this committee agree on which concepts cover the domain, on which terms describe which concepts, on what relations exist between concepts, and on what the possible attributes of each concept are. All ontology learning systems begin with an ontology structure, which may just be an empty logical structure, and a collection of texts in the domain to be modeled. An ontology learning system can be seen as an interplay between three things: an existing ontology, a collection of texts, and lexical syntactic patterns. The Semantic Web will only be a reality if we can create structured, unambiguous ontologies that model domain knowledge in a way that computers can handle. The creation of vast arrays of such ontologies, to be used to mark up web pages for the Semantic Web, can only be accomplished by computer tools that can extract and build large parts of these ontologies automatically. This book presents the state of the art in automatic extraction and modeling techniques for ontology building. The maturation of these techniques will lead to the creation of the Semantic Web.
In recent years, the field of ontology learning from text has attracted a lot of attention, resulting in a wide variety of approaches to the extraction of knowledge from textual data. Yet, results so far are still limited as the semantic gap between human language on the one hand and formalized knowledge on the other is significant. Knowledge formalized in the form of ontologies is declarative, explicit and in general monotonic and crisp. Knowledge expressed by human language is highly diluted, very implicit, vague and even defeasible.
As Brewster et al. [1] have correctly argued, when writing an article, authors assume a large body of background knowledge which they share with their community and potential readers, while focusing on a very specific aspect, i.e. on the specific message they want their text to convey. Thus, most of the knowledge in texts is actually very implicit and remains “under the surface”. Further, natural language lacks conceptual precision, allowing people to have very different conceptualizations and yet use the very same words to express them. In addition, for reasons of economy, people use language in a rather vague and underspecified way and are just precise enough to still allow reasonable communication. Thus, knowledge conveyed by means of language has to be considered implicit, vague and defeasible in general, and is consequently far removed from current ontology models, which assume knowledge that is defined declaratively and explicitly as well as in a crisp and monotonic manner.
By definition, an ontology is an explicit specification of a shared conceptualization (see [2] and [3]). In essence, it is thus a view on how the world or a specific domain is structured as agreed upon by the members of a community. Even assuming that we had perfect natural language processing tools for extracting knowledge from text, it is still questionable whether we would be able to actually learn an ontology from text, as the conceptualization behind an ontology is typically assumed to be the result of an intentional process. Ontologies therefore cannot be “learned” by machines in the strict sense of the word, as machines lack intention and purpose. Instead, ontology learning may only support an ontology engineer in defining their conceptualization of a particular part of the world, e.g. a technical domain, on the basis of empirical evidence derived from textual and other data.
If we adopt such a view of ontology learning, we have to conclude however that a number of important questions in this regard remain largely unanswered by the current literature:
• textual evidence: What kind of empirical textual evidence should an ontology engineer actually consider when modelling an ontology?
• evidence-based agreement: How can we foster the process of consensus building and agreement by presenting empirical evidence derived from data for different design choices?
• data-driven ontology engineering: On a more general note, what should the role of data-driven ontology learning be in the overall process of ontology engineering?
• methodological integration: How should ontology learning tools be integrated into a larger framework for ontology engineering from a methodological point of view?
• user interface: What is the best way to support an ontology engineer in presenting empirical evidence at the user interface level and what is the optimal way for a user to interact with such a system?
At least four different research communities may contribute to answering these questions: natural language processing, machine learning, knowledge representation/engineering and user interface design. In fact, it seems to us that the above questions can only be addressed through an interdisciplinary research program across these research communities. We will briefly elaborate why.
The natural language processing community has so far applied its best techniques to the task of ontology learning, mainly for term extraction and for learning paradigmatic relations between terms such as synonymy (see [4] and [5]), hypernymy (see [6,7]) and meronymy (see [8]). However, these are lexical relations, which do not hold between concepts with explicitly defined intensions. Lexical relations do not in fact map straightforwardly to relations between concepts. Subsumption between concepts, for example, is defined extensionally: A is a subconcept of B iff every A is also a B. In contrast, according to Lyons [9], hypernymy is defined as “the relation which holds between a more specific, or subordinate, lexeme and a more general, or superordinate, lexeme”, which is clearly not equivalent to the definition above in terms of subsumption of extensions. Typically, hypernym relations are indicated through so-called diagnostic frames. In the case of hypernymy, one useful diagnostic frame is “An X is a kind/type of Y” (see [10]). However, such diagnostic frames clearly lack the necessary precision. First of all, they do not distinguish whether the terms are roles (in the sense of OntoClean [11]) or actually types (concepts). Thus, student and person can actually be found in a diagnostic frame such as “A student is a person who studies”, although the first is clearly a (material) role and the second a type. Similar remarks hold for the meronymy relation. It is well known in artificial intelligence that there are various types of part-of relations (compare [12]) that clearly cannot be differentiated from each other through diagnostic frames. In summary, an important problem is that there is no straightforward mapping from terms in language to concepts with a well-defined intension and extension, nor can lexical relations be mapped to ontological relations in the general case (see also [13]).
NLP research has in many cases ignored such intricate questions in knowledge acquisition and focused instead on learning paradigmatic relations between linguistic objects. In this sense, stronger bonds between the NLP and knowledge representation communities are definitely needed.
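To make the limitation of diagnostic frames concrete, the following minimal sketch encodes the frame “An X is a kind/type of Y” as a regular expression. The pattern, the function name `hypernym_candidates` and the two example sentences are illustrative assumptions, not taken from any of the cited systems.

```python
import re

# Hypothetical encoding of the diagnostic frame "An X is a kind/type of Y".
# Single-word terms only; a real system would operate on parsed noun phrases.
KIND_OF = re.compile(
    r"\b(?:an?|the)\s+(?P<hypo>\w+)\s+is\s+an?\s+(?:kind|type)\s+of\s+(?P<hyper>\w+)",
    re.IGNORECASE,
)

def hypernym_candidates(text):
    """Return (hyponym, hypernym) term pairs matched by the frame."""
    return [
        (m.group("hypo").lower(), m.group("hyper").lower())
        for m in KIND_OF.finditer(text)
    ]

# Invented example sentences.
sentences = "A student is a kind of person. An oak is a type of tree."
pairs = hypernym_candidates(sentences)
print(pairs)
```

Both pairs come out in exactly the same form, although student is a role and oak is a type. Nothing in the lexical pattern itself carries that ontological distinction, which therefore has to be supplied by additional analysis or by the ontology engineer.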
The machine learning community provides a large number of sound techniques for data-driven (inductive) learning but, with a few exceptions, is in fact quite opposed to the idea of learning ontologies. Ontologies are logical theories and declarative by nature. Machine learning is in principle concerned with developing analytical models that explain data. In supervised learning (compare [14]), such models serve prediction purposes, i.e. classifying novel examples. In unsupervised learning, one aims to discover regularities or patterns in data such as homogeneous groups or clusters (see [15]) or general associations (see for instance [16]). Many techniques from unsupervised machine learning, such as clustering and mining associations, have been applied to ontology learning. Mädche and Staab have for example used association rules to discover relations between (lexicalizations of) concepts (compare [17]), and Cimiano et al. have used clustering techniques to group and hierarchically arrange words (see [18]). Most of the papers in this volume also apply machine learning techniques in some way, in particular clustering (Brunzel, Poesio et al.), classification (Poesio et al.), memory-based learning (Tanev et al.) as well as induction of patterns from examples (Pantel et al., Alfonseca et al.). However, analytical models as considered in machine learning are generally not declarative in the sense of a logical theory. Some branches of machine learning research have indeed aimed at learning declarative logical theories from data. This is the case, for example, for Inductive Logic Programming (ILP) [19]. However, theories learned from data through ILP differ crucially from ontologies. The latter reflect a shared understanding of a domain of interest, produced as the byproduct of reflection and consensus within a certain community and thus representing a commitment to a specific conceptualization.
For logical theories derived inductively from data, it is unclear to what extent they can be seen as expressing a shared conceptualization. The most promising way of applying inductive techniques seems to be in ontology refinement. First blueprints in this direction can be found in the works of Lisi and Esposito [20] and Rudolph et al. [21]. In general, it seems to us that an important avenue for future machine learning work in ontology learning is to systematically analyze how inductively derived models, classifications, associations etc. can support an ontology engineer in formulating or refining their conceptualization in the form of an ontology, seeing ontology learning always as an interactive and cooperative process between an ontology engineer and a system (see also the definition of ontology learning in [22]).
The knowledge representation community has traditionally focused on methods for efficient reasoning and inference, but has to a large extent neglected the following issues: i) integrating insights from linguistics into ontology development (with the exception of some of the work on DOLCE [23]), ii) integrating ontology learning into methodologies for engineering ontologies, and iii) integrating knowledge representation and inferencing paradigms which are closer to the way knowledge is expressed in human language (notable exceptions being the work on computing with words of Zadeh [24], the work on natural logic [25] or the conceptual graphs of Sowa [26]). The linguistics community has in fact developed category systems based on linguistic principles that could be integrated into ontologies, for example Vendler's verb categories [27] or the so-called ‘Aktionsarten’ [28]. Ontologists have largely neglected such distinctions, which might be useful precisely in bridging the gap between text and knowledge. While there is some work on integrating machine learning into traditional knowledge acquisition and engineering methodologies such as CommonKADS [29], the integration of ontology learning with more recent ontology engineering methodologies such as On-To-Knowledge [30], DILIGENT [31] or METHONTOLOGY [32] has not been approached to a satisfactory extent. A first step in this direction is included in this volume (Paslaru-Bontas et al.), while methodological issues related to the interplay between linguistic analysis and ontology engineering are addressed by Aussenac-Gilles et al. (also included in this volume).
Finally, a contribution from the user interface community is urgently needed in ontology learning. We have argued above that ontology learning cannot be, by its very nature, fully automatic. On the contrary, ontology engineering is a highly interactive task in which a user interacts with a system that presents empirical textual evidence in support of the human task of modelling a particular domain. Novel user interface paradigms are needed here. First blueprints considering usability aspects can be found in the work of Wang et al. [33] and Missikoff et al. [34]. Unfortunately, no contribution on this issue is included in this volume.
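As an illustration of how mined associations can serve as evidence for an ontology engineer rather than as a finished theory, here is a minimal sketch of association rules over co-occurring terms using the standard support and confidence measures. It is not the specific algorithm used in the cited work; the function name, the thresholds and the toy transactions are all assumptions made for this example.

```python
from collections import Counter
from itertools import permutations

def association_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (a, b, support, confidence) read as 'a implies b'.

    support(a -> b)    = fraction of transactions containing both a and b
    confidence(a -> b) = count(a and b) / count(a)
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        item_counts.update(items)
        # Ordered pairs, so that a -> b and b -> a are scored separately.
        pair_counts.update(permutations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)

# Invented data: terms co-occurring within the same sentence.
transactions = [
    {"hotel", "room"},
    {"hotel", "room", "price"},
    {"hotel", "price"},
    {"room", "hotel"},
]
rules = association_rules(transactions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}  support={support:.2f}  confidence={confidence:.2f}")
```

The output only says that "room" and "price" rarely occur without "hotel" in this data. Whether that reflects a part-of relation, an attribute, or no ontological relation at all remains a modelling decision for the engineer.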
In summary, ontology learning research in which the “ontology” is taken seriously requires a joint effort of various communities. Through this volume we therefore aim at forging stronger bonds between them by presenting promising research from the different communities in one collection. In this way we hope to have contributed to the development of a more integrated and cross-disciplinary approach to ontology learning. We hope that this book will stimulate further research in the field and encourage researchers to increasingly tackle the harder challenges in ontology learning as outlined above.
Paul Buitelaar, Philipp Cimiano, Saarbrücken/Karlsruhe, November 2007
[1] C. Brewster, F. Ciravegna, and Y. Wilks. Background and foreground knowledge in dynamic ontology construction. In Proceedings of the SIGIR Semantic Web Workshop, 2003.
[2] T.R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993.
[3] R. Studer, R. Benjamins, and D. Fensel. Knowledge engineering: Principles and methods. Data & Knowledge Engineering, 25(1-2):161–197, 1998.
[4] P.D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML), pages 491–502, 2001.
[5] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL), pages 768–774, 1998.
[6] M.A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING), pages 539–545, 1992.
[7] K. Ahmad and H. Fulford. Knowledge processing: Semantic relations and their use in elaborating terminology. Technical report, University of Surrey, 1992.
[8] M. Berland and E. Charniak. Finding parts in very large corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 57–64, 1999.
[9] J. Lyons. Semantics: Volume 1. Cambridge University Press, 1977.
[10] D. Cruse. Lexical Semantics. Cambridge University Press, 1986.
[11] C.A. Welty and N. Guarino. Supporting ontological analysis of taxonomic relationships. Data & Knowledge Engineering (DKE), 39(1):51–74, 2001.
[12] A. Artale, E. Franconi, N. Guarino, and L. Pazzi. Part-whole relations in object-centered systems: An overview. Data & Knowledge Engineering, 20(3):347–383, 1996.
[13] J. Völker, P. Hitzler, and P. Cimiano. Acquisition of OWL DL axioms from lexical resources. In Proceedings of the European Semantic Web Conference (ESWC), pages 670–685, 2007.
[14] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[15] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[16] R. Agrawal, T. Imielinski, and A.N. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22(2):207–216, 1993.
[17] A. Mädche and S. Staab. Discovering conceptual relations from text. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), pages 321–325, 2000.
[18] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research (JAIR), 24:305–339, 2005.
[19] N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[20] F. Lisi and F. Esposito. Two orthogonal biases for choosing the intensions of emerging concepts in ontology refinement. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 765–766, 2006.
[21] S. Rudolph, J. Völker, and P. Hitzler. Supporting lexical ontology learning by relational exploration. In Proceedings of the International Conference on Conceptual Structures (ICCS), pages 488–491, 2007.
[22] A. Mädche and S. Staab. Handbook of Ontologies, chapter Ontology Learning, pages 173–190. Handbook of Information Systems. Springer, 2004.
[23] C. Masolo, S. Borgo, A. Gangemi, N. Guarino, and A. Oltramari. Ontology library (final). WonderWeb deliverable D18.
[24] L.A. Zadeh. From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45:105–119, 1999.
[25] L.S. Moss. Natural language, natural logic, natural deduction. Draft available from the author.
[26] J.F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984.
[27] Z. Vendler. Verbs and times. The Philosophical Review, 66:143–160, 1957.
[28] B. Comrie. Aspect: Introduction to the Study of Verbal Aspect and Related Problems. Cambridge University Press, 1976.
[29] W. van de Velde. Machine learning issues in CommonKADS. KADS-II Project Deliverable D2.11, 1992.
[30] Y. Sure, S. Staab, and R. Studer. Methodology for development and employment of ontology-based knowledge management applications. SIGMOD Record, 31(4):18–23, 2002.
[31] D. Vrandecic, H.S. Pinto, Y. Sure, and C. Tempich. The DILIGENT knowledge processes. Journal of Knowledge Management, 9(5):85–96, 2005.
[32] M. Fernandez-Lopez, A. Gomez-Perez, and N. Juristo. METHONTOLOGY: From ontological art towards ontological engineering. In Proceedings of the AAAI Spring Symposium on Ontological Engineering, pages 33–40, 1997.
[33] Y. Wang, J. Völker, and P. Haase. Towards semi-automatic ontology building supported by large-scale knowledge acquisition. In Proceedings of the AAAI Fall Symposium On Semantic Web for Collaborative Knowledge Acquisition, pages 70–77, 2006.
[34] M. Missikoff, R. Navigli, and P. Velardi. The usable ontology: An environment for building and assessing a domain ontology. In Proceedings of the International Semantic Web Conference (ISWC), pages 39–53, 2002.
Ontology Learning has so far been dominated by techniques which use text as input. Only a few methods use a different data source. Techniques which use highly structured data as input have the disadvantage that such data sources are rare. On the other hand, enormous amounts of Web content are available today.
We present the XTREEM (Xhtml TREE Mining) methods which enable Ontology Learning from Web Documents. Those methods rely on the semi-structure of Web Documents. The added value of Web document markup is exploited by the XTREEM methods. We show methods for the acquisition of terms, synonyms and semantic relations.
The XTREEM techniques are based on the structure of Web documents; they are domain and language independent. There is no need for NLP software nor for training. They do not rely on domain or document collection specific resources or background knowledge, such as patterns, rules or other heuristics; nor do they rely on manually assembling a document collection.
When extracting information about concepts from the Web, the problem is not recall, but precision: trying to identify which properties of a concept are genuinely distinctive. We discuss a series of experiments in empirical ontology using both unsupervised and supervised methods, showing that not all semantic relations we can extract from text are equally useful, and suggesting that attempting to identify concept attributes (parts, qualities, and the like) and their values results in better concept descriptions than those obtained by being less selective.
The automatic extraction of ontologies from text and lexical resources has become more and more mature. Nowadays, the results of state-of-the-art ontology learning methods are already good enough for many practical applications. However, most of them aim at generating rather inexpressive ontologies, i.e. bare taxonomies and relationships, whereas many reasoning-based applications in domains such as bioinformatics or medicine rely on much more complex axiomatizations. Those are extremely expensive if built by purely manual efforts, and methods for the automatic or semi-automatic construction of expressive ontologies could help to overcome the knowledge acquisition bottleneck. At the same time, a tight integration with ontology evaluation and debugging approaches is required to reduce the amount of manual post-processing which becomes harder the more complex learned ontologies are. Particularly, the treatment of logical inconsistencies, mostly neglected by existing ontology learning frameworks, becomes a great challenge as soon as we start to learn huge and expressive axiomatizations. In this chapter we present several approaches for the automatic generation of expressive ontologies along with a detailed discussion of the key problems and challenges in learning complex OWL ontologies. We also suggest ways to handle different types of inconsistencies in learned ontologies, and conclude with a visionary outlook to future ontology learning and engineering environments.
Learning ontologies requires the acquisition of relevant domain concepts and taxonomic, as well as non-taxonomic, relations. In this chapter, we present a methodology for automatic ontology enrichment and document annotation with concepts and relations of an existing domain core ontology. Natural language definitions from available glossaries in a given domain are processed and regular expressions are applied to identify general-purpose and domain-specific relations. We evaluate the performance of the methodology in extracting hypernymy and non-taxonomic relations. To this end, we annotated and formalized a relevant fragment of the glossary of the Art and Architecture Thesaurus (AAT) with a set of 10 relations (plus the hypernymy relation) defined in the CIDOC CRM cultural heritage core ontology, a recent ISO standard. Finally, we assessed the generality of the approach on a set of web pages from the domains of history and biography.
Manual ontology building in the biomedical domain is a work-intensive task requiring the participation of both domain and knowledge representation experts. The representation of biomedical knowledge has been found to be of great use for biomedical text mining and the integration of biomedical data. In this chapter we present an unsupervised method for learning arbitrary semantic relations between ontological concepts in the molecular biology domain. The method uses the GENIA corpus and ontology to learn relations between annotated named entities by means of several standard natural language processing techniques. An in-depth analysis of the output evaluates the accuracy of the model and its potential for text mining and ontology building applications. The proposed learning method does not require domain-specific optimization or tuning and can be straightforwardly applied to arbitrary domains, provided the basic processing components exist.
This chapter investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning. We describe a method for term recognition using linguistic and statistical techniques, making use of contextual information to bootstrap learning. We then investigate how term recognition techniques can be useful for the wider task of information extraction, making use of similarity metrics and contextual information. We describe two tools we have developed which make use of contextual information to help the development of rules for named entity recognition. Finally, we evaluate our ontology-based information extraction results using a novel technique we have developed which makes use of similarity-based metrics first developed for term recognition.
We present a weakly supervised approach to automatic ontology population from text and compare it with two other unsupervised approaches. In our experiments we populate a part of our ontology of Named Entities. We considered two high-level categories, geographical locations and person names, and ten sub-classes for each category. For each sub-class we automatically learn a syntactic model from a list of training examples and a parsed corpus. A novel syntactic indexing method allowed us to use large quantities of syntactically annotated data. The syntactic model for each named entity sub-class is a set of weighted syntactic features, i.e. words which typically co-occur with the members of the class in the corpus. The method is weakly supervised, since no manually annotated corpus is used in the learning process. The syntactic models are used to classify the unknown Named Entities in the test set. The method achieved promising results, i.e. 65% accuracy, and significantly outperforms the other two approaches.
An architecture is proposed that, focusing on Wikipedia as a textual repository, aims at enriching it with semantic information in an automatic way. This approach combines linguistic processing, Word Sense Disambiguation and Relation Extraction techniques for adding semantic annotations to the existing texts.
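The idea of classifying an entity with per-class sets of weighted syntactic features can be sketched as follows. The abstract does not state how feature weights are combined, so the summed-weight scoring below is an assumption, and the class names, feature names and weights are invented for illustration.

```python
def classify(entity_features, class_models):
    """Score each class by summing the weights of the features observed
    with the entity, then return the best class and all scores.

    Summation is an assumed scoring rule, not the chapter's stated method.
    """
    scores = {
        label: sum(weights.get(feature, 0.0) for feature in entity_features)
        for label, weights in class_models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical class models: feature -> weight.
class_models = {
    "city": {"mayor_of": 2.0, "located_in": 1.5},
    "river": {"flows_through": 2.0, "bank_of": 1.5, "located_in": 0.5},
}

# Hypothetical syntactic features observed with an unknown entity.
label, scores = classify(["located_in", "mayor_of"], class_models)
print(label, scores)
```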
With the advent of the Web and the explosion of available textual data, it is key for modern natural language processing systems to access, represent and reason over large amounts of knowledge in semantic repositories. Separately, the knowledge representation and natural language processing communities have been developing representations/engines for reasoning over knowledge and algorithms for automatically harvesting knowledge from textual data, respectively. There is a pressing need for collaboration between the two communities to provide large-scale robust reasoning capabilities for knowledge rich applications like question answering. In this chapter, we propose one small step by presenting algorithms for harvesting semantic relations from text and then automatically linking the knowledge into existing semantic repositories. Experimental results show better than state of the art performance on both relation harvesting and ontologizing tasks.
Designed about ten years ago, the TERMINAE method and workbench for ontology engineering from texts have continued to evolve since then. Our investigations integrate the experience gained through its use in industrial and academic projects, the progress of natural language processing as well as the evolution of ontology engineering. Several new methodological guidelines, such as the reuse of core ontologies, have been added to the method and implemented in the workbench. It has also been modified in order to be compliant with recent standards such as the OWL knowledge representation language.
The paper recalls the terminology engineering principles underlying TERMINAE and comments on its originality. Then it presents the kind of conceptual model that is built with this method, and its knowledge representation. The method and the support provided by the workbench are detailed and illustrated with a case study in law. With regard to the state of the art, TERMINAE is one of the most supervised methods in the field of ontology learning. This option raises epistemological issues about how language and knowledge can be articulated and about the distance that separates formal ontologies from learned conceptual models.
Due to the well-known difficulties of manually building an ontology, machine-driven knowledge acquisition techniques, in particular in the field of ontology learning, are promoted by many ontology engineering methodologies as a feasible way to aid ontology engineers in this challenging process. Though the benefits of ontology learning are widely acknowledged, to date its systematic application is considerably constrained by the lack of adequate methodological support. The advantages of an elaborated ontology learning methodology are twofold. On the one hand, it reduces the need for a high level of expertise in this field: a detailed description of the process and of best practices for operating it in a variety of situations makes ontology learning techniques more accessible to large communities of ontology developers and users. On the other hand, the methodology clearly formalizes the ways in which ontology learning results are integrated into a more general ontology engineering framework, thus opening up new application scenarios for these techniques and technologies. In this article we aim at contributing to the operationalization of ontology learning processes by introducing a methodology describing the major coordinates of these processes in terms of activities, actors, inputs, outputs and support tools. The methodology was employed to build an ontology in the legal domain. We present the lessons learned from the case study, which are used to empirically validate the proposed process model.
An important aspect of ontology learning is proper evaluation. Generally, one can distinguish between two scenarios: (i) quality assurance during an ontology engineering project in which ontology learning techniques may also be used and (ii) evaluating and comparing ontology learning algorithms in the laboratory during their development. This paper gives an overview of different evaluation approaches and matches them against the requirements of the scenarios. It will be shown that different evaluation approaches have to be applied depending on the scenario. Special attention will be paid to the second scenario and the gold standard based evaluation of ontology learning, for which concrete measures for the lexical and taxonomic layer will be presented.
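A gold standard based measure for the lexical layer can be sketched as precision and recall over term sets. This follows the common set-overlap definition and is not necessarily the exact formulation used in the chapter; the function name and the example terms are assumptions.

```python
def lexical_precision_recall(learned_terms, gold_terms):
    """Compare the terms of a learned ontology with a gold standard.

    precision = |learned AND gold| / |learned|
    recall    = |learned AND gold| / |gold|
    Terms are compared case-insensitively.
    """
    learned = {term.lower() for term in learned_terms}
    gold = {term.lower() for term in gold_terms}
    overlap = learned & gold
    precision = len(overlap) / len(learned) if learned else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return precision, recall

# Invented example: four learned terms, three gold-standard terms.
precision, recall = lexical_precision_recall(
    ["Hotel", "room", "price", "city"],
    ["hotel", "room", "booking"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A measure of this kind covers only the lexical layer. Comparing taxonomic structure requires measures that also take the position of each concept in the hierarchy into account.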