This book presents new approaches for detecting and extracting information from unstructured text documents that is both relevant and novel. A major contribution of these approaches is that both the information already known and the extracted information are modeled semantically. This has the following benefits: (a) ambiguities in the language can be resolved; (b) the exact information needs regarding relevance and novelty can be specified; and (c) knowledge graphs can be incorporated.
More specifically, this book presents the following scientific contributions:
An assessment of the suitability of existing large knowledge graphs (namely, DBpedia, Freebase, OpenCyc, Wikidata, and YAGO) for the task of detecting novel information in text documents.
A description of an approach by which emerging entities that are missing in a knowledge graph are detected in a stream of text documents.
A proposed approach for extracting novel, relevant, semantically structured statements from text documents.
The developed approaches are suitable for recommending emerging entities and novel statements for the purpose of knowledge graph population, and for assisting users who require novel information, such as journalists and technology scouts.
Companies frequently face the challenge of screening a continuously increasing number of (Web) documents and assessing the information they contain with respect to its relevance and novelty. For instance, technology scouts need to discover and monitor new technologies, while investors and stock brokers would like to be informed about recent acquisitions. The systems developed so far for detecting novel information (semi-)automatically in text documents are often very inefficient, because most approaches consider only the relevance, but not the novelty, of text documents. The few existing approaches for novel information detection do not use any semantically structured representation of the already known information or of the extracted information.
In this thesis, new approaches for detecting and extracting novel, relevant information from unstructured text documents are presented that exploit the explicit modeling of the semantics of the given and extracted information. Using semantics has the benefit of resolving ambiguities in the language and specifying the exact information need regarding relevance and novelty. The explicit modeling is performed by using Semantic Web technologies such as the Resource Description Framework (RDF). In the presented work, we assume that all knowledge that is known to the system is available in the form of an RDF knowledge graph. Hence, novelty and relevance are considered with regard to a knowledge graph.
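The idea of judging novelty and relevance against a knowledge graph can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes that the graph is a plain set of subject-predicate-object triples, that a statement is novel if it is absent from the graph, that relevance is given by a hypothetical set of predicates of interest, and that an entity is emerging if it does not occur anywhere in the graph. All names and the `ex:` identifiers are invented for illustration.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF-style statement (subject, predicate, object) as plain strings."""
    subject: str
    predicate: str
    obj: str


def entities(graph: Set[Triple]) -> Set[str]:
    """Every identifier that occurs as a subject or an object in the graph.

    Simplification: literal objects are treated like entities here.
    """
    return {t.subject for t in graph} | {t.obj for t in graph}


def is_novel(statement: Triple, graph: Set[Triple]) -> bool:
    """Assumed notion of novelty: the statement is not yet in the graph."""
    return statement not in graph


def is_relevant(statement: Triple, relevant_predicates: Set[str]) -> bool:
    """Assumed notion of relevance: the predicate is one the user asked for."""
    return statement.predicate in relevant_predicates


def emerging_entities(statement: Triple, graph: Set[Triple]) -> Set[str]:
    """Subjects or objects of the statement that the graph does not mention."""
    known = entities(graph)
    return {e for e in (statement.subject, statement.obj) if e not in known}


# Hypothetical example data: what the system already knows ...
known_graph = {Triple("ex:CompanyA", "ex:acquired", "ex:CompanyB")}
# ... the information need, expressed as predicates of interest ...
relevant_predicates = {"ex:acquired"}
# ... and a statement extracted from a new text document.
candidate = Triple("ex:CompanyA", "ex:acquired", "ex:CompanyC")

if is_novel(candidate, known_graph) and is_relevant(candidate, relevant_predicates):
    print("recommend statement:", candidate)
    print("emerging entities:", emerging_entities(candidate, known_graph))
```

In this toy setting, a statement that repeats a known acquisition is filtered out, while a statement about a new acquisition is recommended together with the entity that is missing from the graph.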
The contributions of this thesis can be summarized as follows:
1. We assess the suitability of existing large knowledge graphs for the task of detecting novel information in text documents.
2. We present an approach for predicting emerging entities and recommending them for a knowledge graph.
3. We present an approach for extracting novel, relevant, semantically-structured statements from text documents.
The contributions are presented, applied, and evaluated with the help of several scenarios. The developed approaches are suitable for recommending emerging entities and novel statements for the purpose of knowledge graph population, as well as for supporting users who depend on novel information (such as journalists and technology scouts).
IOS Press, Inc.