M. d'Aquin, F. Badra, S. Lafrogne, J. Lieber, A. Napoli, L. Szathmary
795 - 796
In case-based reasoning, the adaptation step generally depends on domain-specific knowledge, which motivates studies on adaptation knowledge acquisition (AKA). CABAMAKA is an AKA system based on principles of knowledge discovery in databases. This system explores the variations within the case base to elicit adaptation knowledge. It has been successfully tested in an application of case-based decision support to breast cancer treatment.
We describe Polynomial Conditional Random Fields for signal processing tasks. This hybrid model combines the ability of Polynomial Hidden Markov Models to model complex dynamic signals with the discriminant power of Conditional Random Fields. We detail the learning of these models and report experimental results on handwriting recognition.
We present a novel algorithm for clustering streams of multidimensional points based on kernel density estimates of the data. The algorithm requires only one pass over each data point and a constant amount of space, which depends only on the desired clustering accuracy. It recognizes clusters of non-spherical shape and handles both insertions and deletions in the input stream. A query for the cluster membership of a point is answered in constant time.
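The abstract does not spell the algorithm out, but the stated properties (one pass, constant space tied to accuracy, non-spherical clusters, insertions and deletions, constant-time queries) can be illustrated with a minimal grid-based sketch. The class below is an assumption-laden illustration, not the authors' method; scipy's image tools stand in for a real density estimator, and the grid size, bandwidth, and threshold are all invented parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

class KDEStreamClusterer:
    def __init__(self, bounds, bins=64, bandwidth=1.5, threshold=0.1):
        self.bounds = np.asarray(bounds, dtype=float)  # shape (d, 2): per-dimension range
        self.bins, self.bandwidth, self.threshold = bins, bandwidth, threshold
        self.counts = np.zeros((bins,) * len(bounds))  # space depends only on resolution

    def _cell(self, x):
        lo, hi = self.bounds[:, 0], self.bounds[:, 1]
        i = ((np.asarray(x) - lo) / (hi - lo) * self.bins).astype(int)
        return tuple(np.clip(i, 0, self.bins - 1))

    def insert(self, x):                 # constant-time update per point
        self.counts[self._cell(x)] += 1

    def delete(self, x):                 # a deletion is just a negative update
        self.counts[self._cell(x)] -= 1

    def clusters(self):
        # Smooth the counts with a Gaussian kernel and take connected
        # components of dense cells: components may have arbitrary shape,
        # which is how non-spherical clusters can be recognized.
        density = gaussian_filter(self.counts, sigma=self.bandwidth)
        labels, n = label(density > self.threshold * density.max())
        return labels, n

    def membership(self, x, labels):     # constant-time lookup
        return int(labels[self._cell(x)])
```

The grid resolution plays the role of the accuracy parameter here: refining it sharpens the clustering, while memory stays independent of the stream length.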
In this paper, we are interested in learning stratified hypotheses from examples and counter-examples associated with weights that express their prototypical importance. This leads to an extension of the well-known version space learning framework. To this end, we emphasize that the treatment of positive and negative examples in version space learning is reminiscent of a bipolar revision process recently studied in the setting of possibilistic information representation. Bipolarity appears when the positive and negative sides of information are specified in distinct ways. We then use the possibilistic bipolar representation setting, which distinguishes between what is guaranteed to be possible and what is simply not impossible, as a basis for extending version space learning to examples associated with possibility degrees. This allows us to define a formal framework for learning layered hypotheses.
Grafting is a method for constructing ensembles of decision trees in which each tree is a grafted tree: a tree built in two stages, where a first method creates an initial tree and a second method completes it. In this work, the initial tree is an unpruned tree grown from a 10% sample of the training data. Grafting by itself outperforms Bagging.
Moreover, grafted trees can also be used within any other ensemble method. This is clearly beneficial for Bagging and Random Forests, while with Boosting the results depend on the variant considered. The best overall method is Grafted Random Forest.
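The completion method is not specified in the abstract, so the sketch below fixes one plausible reading: stage one grows an unpruned tree on a 10% sample, stage two "completes" it by fitting a full subtree in each leaf from all the training points routed there. The class name and the per-leaf refit are illustrative assumptions, not the authors' construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class GraftedTree:
    def __init__(self, sample_frac=0.10, seed=None):
        self.sample_frac = sample_frac
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n = len(X)
        idx = self.rng.choice(n, size=max(1, int(self.sample_frac * n)), replace=False)
        # Stage 1: an unpruned initial tree grown from a 10% sample.
        self.base_ = DecisionTreeClassifier().fit(X[idx], y[idx])
        # Stage 2 (completion, here a per-leaf refit): route all training
        # points through the initial tree and grow a subtree in each leaf.
        leaves = self.base_.apply(X)
        self.subtrees_ = {leaf: DecisionTreeClassifier().fit(X[leaves == leaf],
                                                             y[leaves == leaf])
                          for leaf in np.unique(leaves)}
        return self

    def predict(self, X):
        leaves = self.base_.apply(X)
        out = np.empty(len(X), dtype=object)
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            out[mask] = self.subtrees_[leaf].predict(X[mask])
        return out
```

Because each GraftedTree is a self-contained classifier, an ensemble method simply trains many of them, e.g. one per bootstrap sample for a bagged variant.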
We present a learning algorithm for nominal data. It builds a classifier by iteratively adding simple patch functions that modify the current classifier. Its main advantage is that the parameters of each patch function can be learned optimally from a Bayesian point of view, thus avoiding overtraining.
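As a rough illustration of the iterative scheme: the paper's patch family and Bayesian estimator are not given in the abstract, so the (feature, value) indicator patches and the Laplace-smoothed estimates below are assumptions standing in for them.

```python
import numpy as np

def fit_patch_classifier(X, y, n_classes, n_patches=20, alpha=1.0):
    """X: integer-coded nominal features, shape (n, d); y in {0..n_classes-1}."""
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) + alpha
    prior = prior / prior.sum()
    scores = np.tile(np.log(prior), (n, 1))
    patches = []
    for _ in range(n_patches):
        best = None
        for j in range(d):
            for v in np.unique(X[:, j]):
                mask = X[:, j] == v
                # Laplace-smoothed (Bayesian posterior mean) class distribution
                # on the patch: a closed-form, overtraining-resistant estimate.
                counts = np.bincount(y[mask], minlength=n_classes) + alpha
                delta = np.log(counts / counts.sum())
                err = np.mean((scores + mask[:, None] * delta).argmax(1) != y)
                if best is None or err < best[0]:
                    best = (err, j, v, delta)
        err, j, v, delta = best
        scores += (X[:, j] == v)[:, None] * delta   # apply the chosen patch
        patches.append((j, v, delta))
    return prior, patches

def predict_patch_classifier(prior, patches, X):
    scores = np.tile(np.log(prior), (len(X), 1))
    for j, v, delta in patches:
        scores += (X[:, j] == v)[:, None] * delta
    return scores.argmax(1)
```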
E.N. Smirnov, S. Vanderlooy, I.G. Sprinkhuizen-Kuyper
811 - 812
We propose a meta-typicalness approach that applies the typicalness framework to any type of classifier. The approach can be used to construct classifiers with a specified classification performance. Experiments show that the resulting classifiers can outperform an existing typicalness-based classifier.
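For readers unfamiliar with the framework, the sketch below shows only the standard typicalness (conformal) p-value machinery that such classifiers build on; it is background, not the meta-typicalness method itself, and the nonconformity function is left as a user-supplied assumption.

```python
import numpy as np

def typicalness_p_value(alphas_train, alpha_new):
    """Fraction of examples at least as nonconforming as the new one;
    a low value means the example is atypical for the tried label."""
    return np.mean(np.append(alphas_train, alpha_new) >= alpha_new)

def classify_with_confidence(nonconformity, X_train, y_train, x, labels):
    # Try every candidate label for x and measure how typical it would be.
    # (A fully transductive version would recompute the training scores with
    # (x, label) included; we skip that here for brevity.)
    a_train = np.array([nonconformity(xi, yi, X_train, y_train)
                        for xi, yi in zip(X_train, y_train)])
    p = {lab: typicalness_p_value(a_train,
                                  nonconformity(x, lab, X_train, y_train))
         for lab in labels}
    prediction = max(p, key=p.get)
    confidence = 1 - sorted(p.values())[-2]   # one minus second-largest p-value
    credibility = p[prediction]               # largest p-value
    return prediction, confidence, credibility
```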
Authorship identification can be seen as a single-label multi-class text categorization problem. Very often there are extremely few training texts, at least for some of the candidate authors. In this paper, we present methods for handling imbalanced multi-class textual datasets. The main idea is to segment the training texts into sub-samples according to the size of their class: minority classes can be segmented into many short samples and majority classes into fewer, longer ones. Moreover, we explore text re-sampling in order to construct a training set with a desirable distribution over the classes. Essentially, text re-sampling can be viewed as providing new synthetic data that increase the training size of a class. Based on a corpus of newswire stories in English, we present authorship identification experiments on various multi-class imbalanced cases.
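A minimal sketch of the segmentation step, under one simple reading of the abstract in which every class is cut into the same number of sub-samples (so, relative to its size, a minority class is split into many short samples and a majority class into fewer, longer ones). The word-level splitting and the target count are illustrative assumptions.

```python
def segment_by_class(texts_by_author, samples_per_class=10):
    """Split each author's training text into a fixed number of sub-samples,
    so that a small class yields short samples, a large class long ones, and
    every class contributes a comparable number of training instances."""
    segments = {}
    for author, text in texts_by_author.items():
        words = text.split()
        size = max(1, len(words) // samples_per_class)
        segments[author] = [" ".join(words[i:i + size])
                            for i in range(0, len(words), size)]
    return segments
```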
This paper contributes to a facet of Web Information Retrieval that has recently received much attention: the satisfaction of a user's personal information need with respect to text type, presentation type, or information quality. We assume that such properties can be quantified for all kinds of Web documents, and we subsume them under the term “Web genre” or simply “genre”.
Recent surveys show that there is, to a certain degree, a common understanding of Web genre. However, the strictness with which genre and non-genre aspects of a document are perceived is an individual matter. To gain a better understanding of the challenges of Web genre identification and its possible limits, we investigate in this paper a very interesting question, which has not been posed until now:
Given a categorization C of documents (or bookmarks, links, document identifiers), can we provide a reliable assessment whether C is governed by topic or by genre considerations?
Marco Ernandes, Giovanni Angelini, Marco Gori, Leonardo Rigutini, Franco Scarselli
819 - 820
Term weighting is a crucial task in many Information Retrieval applications. Common approaches are based either on statistical or on natural language analysis. In this paper, we present a new algorithm that capitalizes on the advantages of both strategies. In the proposed method, the weights are computed by a parametric function, called the Context Function, that models the semantic influence exerted among the terms. The Context Function is learned from examples, so that its implementation is largely automatic. The algorithm was successfully tested on a data set of crossword clues, which represent a case of Single-Word Question Answering.
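The Context Function itself is parametric and learned from examples; the abstract does not give its form, so the sketch below assumes a simple exponential decay over token distance purely for illustration, with fixed rather than learned parameters.

```python
from collections import Counter

def context_weights(tokens, decay=0.5, strength=0.3, window=5):
    """Each term's weight is its base frequency plus influence received from
    nearby terms through an (assumed) distance-decaying context function."""
    tf = Counter(tokens)
    base = {t: tf[t] / len(tokens) for t in tf}
    weights = dict(base)
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # influence of tokens[j] on t decays with their distance
                weights[t] += strength * base[tokens[j]] * decay ** abs(i - j)
    return weights
```

In the paper the parameters of the Context Function are fitted from labeled examples; here they are hard-coded only to keep the sketch self-contained.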
This paper presents an approach to exploiting free text descriptions of TV programmes, as available from EPG data sets, for a recommendation system that takes the content of programmes into account. The paper focuses on classifying free text descriptions in relation to natural language user queries.
Richard Nock, Pascal Vaillant, Frank Nielsen, Claudia Henry
823 - 824
Without prior knowledge, distinguishing different languages may be a hard task, especially when their borders are permeable.
We develop an extension of spectral clustering, a powerful unsupervised classification toolbox, and show that it accurately solves the task of soft language distinction. At the heart of our approach, we replace the usual hard membership assignment of spectral clustering with a soft, probabilistic assignment, which also has the advantage of bypassing a well-known complexity bottleneck of the method.
Experiments with a readily available system display the potential of the method, which yields a visually appealing soft distinction between the languages that may together make up a whole corpus.
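The abstract leaves the probabilistic model unspecified; one way to realize the idea is to keep the standard spectral embedding and substitute Gaussian-mixture responsibilities for the usual k-means hard assignment, as in this sketch (the GMM step is a stand-in, not the authors' construction, and degrees are assumed nonzero).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.mixture import GaussianMixture

def soft_spectral_clustering(X, k, gamma=1.0):
    W = rbf_kernel(X, gamma=gamma)                   # affinity matrix
    d = W.sum(axis=1)
    L = np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)  # normalized affinity
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]                                 # top-k eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True) # row-normalize the embedding
    gmm = GaussianMixture(n_components=k).fit(U)
    return gmm.predict_proba(U)                      # soft cluster memberships
```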
Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess
825 - 826
This poster describes OntoGene: an environment, based on a deep-linguistic parser, aimed at supporting the process of Text Mining from Biomedical Scientific Literature. We will illustrate in particular the Relation Extraction component and the development support facilities.
We describe a text summarization system that moves beyond standard approaches by using a hybrid approach of linguistic and statistical analysis and by employing text-sort-specific knowledge of document structure and phrases indicating importance. The system is highly modular and entirely XML-based so that different components can be combined easily.
We introduce a new approach to spellchecking for languages with extreme phonetic irregularities. Spelling correction for such languages can be significantly improved if knowledge about pronunciation and sound becomes the central part of the spelling algorithm. However, given a weak phoneme-grapheme correspondence, standard spelling algorithms, which are rule-based or edit-distance-based, are severely limited in their phonetic capabilities.
A production approach to spelling can overcome these limitations, but suffers from the size of its search space. In this paper we describe the main building blocks for tackling this problem with heuristic search. Our ideas have been operationalized in the SMARTSPELL algorithm, with impressive results in both spelling correction quality and runtime.
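To make the production idea concrete, here is a toy best-first search over phoneme-to-grapheme productions; the production table, costs, heuristic, and beam pruning are all illustrative assumptions and not SMARTSPELL's actual components.

```python
import heapq, itertools

PRODUCTIONS = {                      # toy phoneme -> (grapheme, cost) options
    "f": [("f", 0.0), ("ph", 0.4), ("gh", 0.9)],
    "o": [("o", 0.0), ("eau", 0.7), ("ow", 0.5)],
    "n": [("n", 0.0), ("kn", 0.8)],
}

def spell(phonemes, lexicon, beam=1000):
    counter = itertools.count()      # tie-breaker so the heap never compares states
    frontier = [(0.0, next(counter), 0.0, 0, "")]
    while frontier:
        _, _, cost, i, s = heapq.heappop(frontier)
        if i == len(phonemes):
            if s in lexicon:         # accept only spellings the lexicon knows
                return s, cost
            continue
        for grapheme, c in PRODUCTIONS.get(phonemes[i], [(phonemes[i], 0.0)]):
            h = 0.1 * (len(phonemes) - i - 1)   # cheap estimate of remaining cost
            heapq.heappush(frontier,
                           (cost + c + h, next(counter), cost + c, i + 1, s + grapheme))
        if len(frontier) > beam:     # prune: this is what tames the search space
            frontier = heapq.nsmallest(beam, frontier)
    return None, float("inf")

# e.g. spell(["f", "o", "n"], {"phon"}) returns ("phon", 0.4)
```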
Juan Ángel García-Pardo, Stella Heras Barberá, Rafael Ramos-Garijo, Alberto Palomares, Vicente Julián, Miguel Rebollo, Vicent Botti
833 - 834
In this paper, a new CBR system for help-desk environments is presented. This CBR system provides intelligent customer support for multiple domains, and it is portable and flexible. The system is implemented as a module of a complete help-desk application, making it as independent as possible of changes in the help-desk system. Each phase of the reasoning cycle is likewise separated into an independent module, making the CBR system easy to update. The system has been tested in a real call center managed by the Spanish company TISSAT S.A.
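A minimal sketch of such a modular cycle, using the classic retrieve/reuse/revise/retain phase names; the concrete interfaces here are assumptions, and the paper's actual module boundaries may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class CBRSystem:
    retrieve: Callable   # (query, case_base) -> similar cases
    reuse: Callable      # (query, cases) -> proposed solution
    revise: Callable     # solution -> validated solution
    retain: Callable     # (query, solution, case_base) -> updated case base
    case_base: List[Tuple] = field(default_factory=list)

    def solve(self, query):
        cases = self.retrieve(query, self.case_base)
        solution = self.revise(self.reuse(query, cases))
        self.case_base = self.retain(query, solution, self.case_base)
        return solution
```

Swapping a phase, e.g. a domain-specific retriever for a new help-desk domain, only means passing a different callable; the cycle itself is unchanged.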
Arjen Hommersom, Perry Groot, Peter Lucas, Michael Balser, Jonathan Schmitt
835 - 836
The use of a medical guideline can be seen as the execution of computational tasks, sequentially or in parallel, in the face of patient data. It has been shown that many such guidelines can be represented as a 'network of tasks', i.e., as a number of steps each having a specific function or goal. To investigate the quality of such guidelines, we propose a formalization of criteria for good-practice medicine with which a guideline should comply. We use this theory, in conjunction with medical background knowledge, to verify the quality of a guideline dealing with diabetes mellitus type 2 using the interactive theorem prover KIV. Verification using task execution and background knowledge is a novel approach to the quality checking of medical guidelines.
In many real-world planning environments, some of the information about the world is both external (the planner must request it from external information sources) and volatile (it changes before the planning process completes). In such environments, a planner faces two challenges: how to generate plans despite changes in the external information during planning, and how to guarantee that a plan returned by the planner will remain valid for some period of time after the planning ends. Previous work on planning with volatile information has addressed the first challenge, but not the second.
This paper provides a general model for planning with volatile external information in which the planner offers a guarantee of how long a solution will remain valid after it is returned, together with an incompleteness theorem showing that no planner can solve all solvable planning problems involving volatile external information.
State trajectory constraints and plan preferences are the two language features recently introduced to PDDL in the context of the 5th International Planning Competition. For planning with soft constraints, an objective function monitors their violation. This paper introduces a symbolic approach to finding cost-optimal plans. The set-based branch-and-bound algorithm exploits an efficient symbolic representation of the objective function. State trajectory constraints are compiled into automata, while ordinary preferences are evaluated on-line on the intersection of the search frontier with the goal.
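The following explicit-state sketch illustrates only the branch-and-bound idea: keep the cheapest goal-reaching plan found so far as an incumbent and prune branches that cannot beat it. The real system operates on symbolic (BDD-based) state sets with automata-compiled trajectory constraints, which this simplification does not capture; all function signatures are assumptions.

```python
import heapq, itertools

def branch_and_bound(start, is_goal, successors, violation_cost):
    """successors(state) -> iterable of (action, next_state) pairs;
    violation_cost(plan, state) -> accumulated soft-constraint penalty.
    States are assumed hashable."""
    counter = itertools.count()
    best_cost, best_plan = float("inf"), None
    best_seen = {}                               # cheapest cost per visited state
    frontier = [(0.0, next(counter), start, [])]
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        if cost >= best_cost:                    # bound: cannot beat the incumbent
            continue
        if best_seen.get(state, float("inf")) <= cost:
            continue
        best_seen[state] = cost
        if is_goal(state):                       # goal reached: new incumbent
            best_cost, best_plan = cost, plan
            continue
        for action, nxt in successors(state):    # branch on applicable actions
            nplan = plan + [action]
            heapq.heappush(frontier,
                           (violation_cost(nplan, nxt), next(counter), nxt, nplan))
    return best_plan, best_cost
```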