
Ebook: Finite-State Methods and Natural Language Processing

These proceedings contain the final versions of the papers presented at the 7th International Workshop on Finite-State Methods and Natural Language Processing, FSMNLP 2008. The workshop was held in Ispra, Italy, on September 11–12, 2008. The event was the seventh in the series of FSMNLP workshops, and the third arranged as a stand-alone event. In 2008, FSMNLP was merged with the FASTAR workshop.
The aim of the FSMNLP workshops is to bring together members of the research and industrial communities working on finite-state based models in language technology, computational linguistics, web mining, linguistics, and cognitive science on the one hand, and on related theory and methods in fields such as computer science and mathematics on the other. Thus, the workshop series is a forum for researchers and practitioners working on applications as well as on theoretical and implementation aspects. The special theme of FSMNLP 2008 was high-performance finite-state devices in large-scale natural language text processing systems and applications.
For FSMNLP 2008, we received in total 37 submissions, of which 13 were selected as regular papers, 6 as short papers, and 1 as a demo paper. The acceptance rate for regular papers was 46.4%. Most of the papers were evaluated by at least four Programme Committee members, with the help of external reviewers. Only 15% of the papers were reviewed by three reviewers. In addition to the submitted papers, four lectures were given by invited speakers. The invited speakers and the authors of the papers represented (at least) Croatia, Finland, France, Germany, Italy, Luxembourg, the Netherlands, Portugal, Puerto Rico, Sweden, the U.K., and the USA.
We would like to thank all workshop participants for their contributions and lively interaction during the two days. The presented papers covered a range of interesting NLP applications, including machine learning and translation, logic, computational phonology, morphology and semantics, data mining, information extraction and disambiguation, as well as programming, optimization and compression of finite-state networks. The applied methods included weighted algorithms, kernels, and tree automata. In addition, relevant aspects of software engineering, standardization, and European funding programmes were discussed.
We are greatly indebted to the members of the Programme Committee and the external referees for reviewing the papers and maintaining the high standard of the FSMNLP workshops. The members of the Programme Committee of FSMNLP 2008 were Cyril Allauzen (Google Research, New York, USA), Francisco Casacuberta (Instituto Tecnológico de Informática, Valencia, Spain), Jean-Marc Champarnaud (Université de Rouen, France), Maxime Crochemore (Department of Computer Science, King's College London, U.K.), Jan Daciuk (Gdańsk University of Technology, Poland), Karin Haenelt (Fraunhofer Gesellschaft and University of Heidelberg, Germany), Thomas Hanneforth (University of Potsdam, Germany), Colin de la Higuera (Jean Monnet University, Saint-Etienne, France), André Kempe (Yahoo Search Technologies, Paris, France), Derrick Kourie (Dept. of Computer Science, University of Pretoria, South Africa), András Kornai (Budapest Institute of Technology, Hungary and MetaCarta, Cambridge, USA), Marcus Kracht (University of California, Los Angeles, USA), Hans-Ulrich Krieger (DFKI GmbH, Saarbrücken, Germany), Eric Laporte (Université de Marne-la-Vallée, France), Stoyan Mihov (Bulgarian Academy of Sciences, Sofia, Bulgaria), Hermann Ney (RWTH Aachen University, Germany), Kemal Oflazer (Sabanci University, Turkey), Jakub Piskorski (Joint Research Centre of the European Commission, Italy), Michael Riley (Google Research, New York, USA), Strahil Ristov (Ruđer Bošković Institute, Zagreb, Croatia), Wojciech Rytter (Warsaw University, Poland), Jacques Sakarovitch (École nationale supérieure des télécommunications, Paris, France), Max Silberztein (Université de Franche-Comté, France), Wojciech Skut (Google Research, Mountain View, USA), Bruce Watson (Dept. of Computer Science, University of Pretoria, South Africa) (PC co-chair), Shuly Wintner (University of Haifa, Israel), Atro Voutilainen (Connexor Oy, Finland), Anssi Yli-Jyrä (University of Helsinki and CSC – IT Center for Science, Espoo, Finland) (PC co-chair), Sheng Yu (University of Western Ontario, Canada), and Lynette van Zijl (Stellenbosch University, South Africa). The external reviewers were Marco Almeida, Marie-Pierre Béal, Oliver Bender, Jan Bungeroth, Pascal Caron, Loek Cleophas, Matthieu Constant, Stefan Hahn, Christopher Kermorvant, Sylvain Lombardy, Patrick Marty, Evgeny Matusov, Takuya Nakamura, Ernest Ketcha Ngassam, Jyrki Niemi, Sébastien Paumier, Maciej Pilichowski, Adam Przepiórkowski, Magnus Steinby, Yael Sygal, David Vilar, Hsu-Chun Yen, François Yvon, Artur Zaroda, and Djelloul Ziadi.
FSMNLP 2008 was organised by the Institute for the Protection and Security of the Citizen of the Joint Research Centre (JRC) of the European Commission in Ispra, Italy, in cooperation with the host of the next FSMNLP event, the FASTAR group of the University of Pretoria in South Africa. The Organizing Committee in 2008 had five JRC representatives: Regina Corradini, Daniela Negri, Jakub Piskorski (OC chair), Hristo Tanev, and Vanni Zavarella, and two members from the Department of Computer Science, University of Pretoria, South Africa: Derrick Kourie and Bruce Watson. A complementary role in long-term planning and coordination was played by the Steering Committee: Lauri Karttunen (Palo Alto Research Center, USA, and Stanford University, USA), Kimmo Koskenniemi (University of Helsinki, Finland), Kemal Oflazer (Sabanci University, Turkey), and Anssi Yli-Jyrä (University of Helsinki and CSC – IT Center for Science, Espoo, Finland).
This year's event is pivotal in the series of FSMNLP workshops, since it starts the tradition of organizing the workshops on a yearly basis. Locations for successive events, including FSMNLP 2008 in Ispra and FSMNLP 2009 in Pretoria, were already proposed at FSMNLP 2007 in Potsdam. The success of FSMNLP 2008 indicates that there is a growing and broad interdisciplinary community with a shared interest in finite-state methods and natural language processing. We are therefore looking forward to FSMNLP 2009, to be held in Pretoria, South Africa, next year!
In October 2008
Jakub Piskorski, Bruce Watson, Anssi Yli-Jyrä
An emerging European research infrastructure called CLARIN and a related project called HFST are briefly described. The HFST project has built a programming interface on top of existing open source finite-state packages such as SFST and OpenFst. To verify its utility, open source tools have been built on top of this HFST interface. These tools create lexical transducers, compile morphophonological two-level rules, and combine them into a transducer lexicon. The tools have been tested with independently created full-scale lexicons and rules for the Northern Sámi and Lule Sámi languages, which have a more complicated lexical and morphophonological structure than most other European languages.
Weighted finite-state transducers have been used successfully in a variety of natural language processing applications, including speech recognition, speech synthesis, and machine translation. This paper shows how weighted transducers can be combined with existing learning algorithms to form powerful techniques for sequence learning problems.
The emergence of WWW search engines since the 1990s has changed the scale of many natural language processing applications. Text mining, information extraction and related tasks can now be applied to tens of billions of documents, which sets new efficiency standards for NLP algorithms. Finite-state machines are an obvious choice of a formal framework for such applications. However, the scale of the problem (size of the searchable corpus, number of patterns to be matched) often poses a problem even to well-established finite-state string matching techniques. In my presentation, I will focus on the experience gained in the implementation of a finite-state matching library optimized for searching for large numbers of complex patterns in a WWW-scale repository of documents. Both algorithmic and implementation-related aspects of the task will be discussed. The library is based on OpenFst.
Kleene is a high-level programming language for building, manipulating and testing weighted finite-state acceptors and transducers. It allows users to define networks using regular expressions and right-linear phrase-structure grammars, and it provides variables, functions and familiar control structures. Pre-edited Kleene scripts can be run from the command line, and a graphical user interface is provided for interactive programming and testing. The Kleene parser is implemented in JavaCC/JJTree, and the interpreter calls functions in the OpenFst library via the Java Native Interface (JNI). The design, implementation, development status and future plans for Kleene are described.
The Cambridge University Engineering Department phrase-based statistical machine translation system follows a generative model of translation and is implemented by the composition of component models of translation and movement realised as Weighted Finite State Transducers. Our flexible architecture requires no special purpose decoder and readily handles the large-scale natural language processing demands of state-of-the-art machine translation systems. In this paper we describe the CUED system's participation in the NIST 2008 Arabic-English machine translation evaluation task.
This paper presents a new approach to proper noun recognition and classification in which the knowledge of ambiguities within morphological analyses is used exhaustively in the analysis. Here, a proper noun recognizer/classifier is defined by proper noun context patterns on the one hand and by a filter that takes the ambiguity information into account on the other. Furthermore, techniques such as lemma-based coreference resolution and a softening of the closed-world assumption made by the morphology are presented, which improve the analysis. The approach is implemented with weighted finite-state transducers and tested within the analysis system SynCoP via a hand-written grammar.
Like common noun phrases, proper names contain ambiguous conjoined phrases that make their delimitation and classification difficult in text. This paper presents a finite-state approach to the disambiguation of Portuguese candidate proper name strings containing the coordinating conjunction e (and). In such name strings, the conjunction can denote a relation between two independent names, but it can also be part of a multiword proper name. The coordination of multiword independent names may involve ellipsis of some lexical constituents, which causes additional difficulties to proper name identification and classification.
Many NLP tasks based on finite-state automata create acyclic result automata which contain many ε-transitions. We propose a refinement of an existing algorithm for ε-removal with better memory consumption in many practical cases.
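For readers unfamiliar with the baseline operation being refined, the following is a minimal sketch of standard ε-removal (not the paper's memory-optimized variant): compute the ε-closure of each state, then rewire non-ε transitions and finality over the closures. The dictionary-based automaton representation is purely illustrative.

```python
# Baseline epsilon-removal sketch (illustrative representation, not the
# paper's refined algorithm).
from collections import defaultdict

def eps_closure(state, eps):
    """All states reachable from `state` via epsilon transitions only."""
    stack, seen = [state], {state}
    while stack:
        for nxt in eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def remove_epsilons(states, trans, eps, finals):
    """Return epsilon-free transitions and final states.

    trans: dict state -> list of (symbol, target) non-epsilon moves
    eps:   dict state -> list of epsilon targets
    """
    new_trans = defaultdict(set)
    new_finals = set()
    for q in states:
        closure = eps_closure(q, eps)
        # A state is final if its closure touches a final state.
        if closure & finals:
            new_finals.add(q)
        # Pull every non-epsilon move of the closure back to q.
        for p in closure:
            for sym, target in trans.get(p, ()):
                new_trans[q].add((sym, target))
    return dict(new_trans), new_finals
```

Materializing every closure is exactly what costs memory on large acyclic result automata, which is the behavior the paper's refinement targets.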
This paper proposes an extension to the formalism of regular expressions with a form of predicate logic where quantified propositions apply to substrings. The implementation hinges crucially on the manipulation of auxiliary symbols which has been a common, though previously unsystematized practice in finite-state language processing. We also apply the notation to give alternate compilation methods for two-level grammars and various types of replacement rules found in the literature, and show that, under a certain interpretation, two-level rules and many types of replacement rules are equivalent.
We provide a new term-like representation for multi-dimensional trees as defined by Rogers [1,2], which establishes them as a direct generalization of classical trees. As a consequence, these structures can be used as input for finite-state applications based on classical term-based tree language theory. Via the correspondence between string and tree languages, these applications can then be seen as capable of processing even some language classes beyond context-freeness.
In this paper, we describe the use of an incremental construction method for minimal, acyclic, deterministic FSTs. The approach consists of constructing a transducer in a single step, adding new strings one by one and minimizing the resulting automaton incrementally. We then present a new method to encode the morphological information associated with the dictionary entries. The new encoding unifies a large number of word-form analyses, thus reducing the number of terminal states of the dictionary's FST, which makes the minimization process more efficient. Finally, we present experimental results on the FST that represents the Arabic dictionary.
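The incremental idea can be illustrated with the classic construction of a minimal acyclic automaton from sorted input (a simplified acceptor sketch, not the paper's transducer or its encoding): after each word, only the suffix of the previous word can still be non-minimal, so it is folded into a register of equivalent states.

```python
# Sketch of incremental minimal acyclic automaton construction for
# lexicographically sorted words (simplified; acceptor only).
class State:
    def __init__(self):
        self.trans = {}    # symbol -> State
        self.final = False
    def signature(self):
        # Two states are equivalent iff finality and outgoing arcs match.
        return (self.final, tuple(sorted((c, id(t)) for c, t in self.trans.items())))

class IncrementalDFA:
    def __init__(self):
        self.root = State()
        self.register = {}     # signature -> representative State
        self.previous = ""

    def add(self, word):
        assert word > self.previous, "insert words in lexicographic order"
        # Longest common prefix with the previously added word.
        common = 0
        while (common < len(word) and common < len(self.previous)
               and word[common] == self.previous[common]):
            common += 1
        node = self.root
        for c in word[:common]:
            node = node.trans[c]
        # The old suffix below `node` is complete: minimize it now.
        self._replace_or_register(node)
        # Append the new suffix as fresh states.
        for c in word[common:]:
            node.trans[c] = State()
            node = node.trans[c]
        node.final = True
        self.previous = word

    def _replace_or_register(self, node):
        if not node.trans:
            return
        last_sym = max(node.trans)        # most recently added child
        child = node.trans[last_sym]
        self._replace_or_register(child)
        sig = child.signature()
        if sig in self.register:
            node.trans[last_sym] = self.register[sig]   # fold duplicate
        else:
            self.register[sig] = child

    def finish(self):
        # Minimize the path of the last word.
        self._replace_or_register(self.root)

    def accepts(self, word):
        node = self.root
        for c in word:
            if c not in node.trans:
                return False
            node = node.trans[c]
        return node.final
```

Shared suffixes such as those of "car" and "star" end up as a single registered state, which is what keeps the automaton minimal without a separate global minimization pass.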
This paper elaborates a model for representing various types of semantic calendar expressions (SCEs), which correspond to the disambiguated intensional meanings of natural-language calendar phrases. The model uses finite-state transducers (FSTs) to mark denoted periods of time on a set of timelines also represented as an FST. In addition to an overview of the model, the paper presents methods to combine the periods marked on two timeline FSTs into a single timeline FST and to adjust the granularity and span of time of a timeline FST. The paper also discusses advantages and limitations of the model.
As [1] and [2] have shown, some applications of Optimality Theory can be modelled using finite-state algebra, provided that the constraints are regular. However, their approaches suffered from an upper bound on the number of constraint violations. We present a method to construct finite-state transducers that can handle an arbitrary number of constraint violations, using a variant of the tropical semiring as the weighting structure. In general, any Optimality Theory system whose constraints can be represented by regular relations can be modelled this way. Unlike [3], who used roughly the same idea, we can show that this can be achieved using only the standard (weighted) automaton algebra.
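A toy illustration of the underlying tropical-semiring idea (not the paper's transducer construction): "addition" is min and "multiplication" is +, so weights accumulate violation counts along a candidate and the optimal output is the candidate of minimal total weight. The constraint functions below are hypothetical stand-ins for regular constraint relations.

```python
# Tropical semiring: (min, +) over the nonnegative reals plus infinity.
TROPICAL_ZERO = float("inf")   # semiring zero: an impossible path
TROPICAL_ONE = 0.0             # semiring one: no violations

def oplus(a, b):
    """Semiring addition: keep the better (cheaper) alternative."""
    return min(a, b)

def otimes(a, b):
    """Semiring multiplication: accumulate violations along a path."""
    return a + b

def best_candidate(candidates, constraints):
    """candidates: iterable of strings; constraints: functions mapping a
    candidate to its violation count (its tropical weight)."""
    def weight(cand):
        w = TROPICAL_ONE
        for c in constraints:
            w = otimes(w, c(cand))
        return w
    return min(candidates, key=weight)

# Hypothetical toy constraints: syllables prefer an onset and no coda.
onset = lambda s: 0 if s[0] not in "aeiou" else 1
nocoda = lambda s: 0 if s[-1] in "aeiou" else 1
```

Because min and + are total operations, nothing in this scheme bounds how many violations a candidate may accrue, which is the point of moving to weighted automata.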
This paper deals with Finite State Automata used in Natural Language Processing to represent very large dictionaries. We present a method for an important operation applied to these automata, the compression with quick access. Our proposal is to factorize subautomata other than those representing common prefixes or suffixes. Our algorithm uses a DAWG of subautomata to iteratively choose the best substructure to factorize. The linear time accepting complexity is kept in the resulting compact automaton. Experiments performed on ten automata are reported.
This paper reports on our experience of adapting a real-world live event extraction system based on a cascade of finite-state extraction grammars to the processing of a new language, namely Italian. The real-time event extraction processing chain and the pattern specification language are briefly presented. The major part of the paper focuses on the creation of event extraction grammars and related resources for English and their adaptation for extracting events in Italian news articles. Some interesting phenomena which complicate the event extraction task for Italian are pinpointed and the results of the evaluation are presented. In particular, we compared two versions of the system for Italian, one based on surface-level patterns and a hybrid one, which integrates slightly more linguistically sophisticated patterns for covering a rich variety of morphological and syntactic constructions in Italian.
Natural language is probably one of the most common types of input for text processing algorithms. Therefore, it is often desirable to have a large training/testing set of input of this kind, especially when dealing with algorithms tuned for natural language texts. In many cases, the lack of a large corpus of natural language texts can be remedied by simply concatenating a set of collected texts, even ones from heterogeneous contexts and by different authors.
In this note we present a preliminary study of a finite-state model for text generation which maintains statistical and structural characteristics of natural language texts, i.e., Zipf's law and the inverse-rank power law, thus providing a very good approximation for testing purposes.
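The flavor of such a generator can be sketched with the well-known "random typing" source (an assumption-level illustration, not the authors' model): a trivial finite-state process emitting letters and spaces at random already produces a roughly Zipfian rank-frequency curve.

```python
# Random-typing text source: a minimal finite-state generator whose
# word rank-frequency distribution is approximately Zipfian.
import random
from collections import Counter

def random_text(n_chars, alphabet="abc", space_prob=0.2, seed=0):
    """Emit n_chars symbols; each is a space with probability space_prob,
    otherwise a uniformly random letter."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_chars):
        out.append(" " if rng.random() < space_prob else rng.choice(alphabet))
    return "".join(out)

def rank_frequencies(text):
    """Word frequencies sorted from most to least frequent."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)
```

Plotting the returned frequencies against rank on log-log axes shows the characteristic near-linear (power-law) decay; a model for testing purposes mainly needs to reproduce this shape at scale.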
CroMo is a finite-state transducer that combines three major functionalities into one high-performance monolithic machine for morphological segmentation, annotation, and lemmatization. It is designed to be flexible, extensible, and applicable to any language that allows for purely morphotactic modeling on the lexical level of morphological structure. While it minimizes the development cycle of the morphology, its annotation schema maximizes interoperability by using a direct mapping from the GOLD ontology of linguistic concepts and features. The use of standardized ontology-based annotations provides advanced possibilities for DL-based post-processing of the annotated output.
Pattern matching, acceptance, and parsing algorithms on node-labeled, ordered, ranked trees (‘tree algorithms’) are important for applications such as instruction selection and tree transformation/term rewriting. Many such algorithms have been developed. They are often based on results for such algorithms on words, or on generalizations thereof, using finite (tree) automata. Regrettably, no coherent, extensive toolkit of such algorithms and automata existed, complicating their use.
Our toolkit FOREST FIRE contains many such algorithms and automata constructions. It is accompanied by the graphical user interface (GUI) FIRE WOOD. The toolkit and GUI provide a useful environment for experimenting with and comparing the algorithms. In this tool paper we give an overview of the toolkit and GUI, their context and design rationale, and mention some results obtained with them.
We present an XML format that makes it possible to describe a large class of weighted finite automata and transducers. Our design choices stem from our policy of making the implementation as simple as possible. This format has been tested for the communication between the modules of our automata manipulation platform Vaucanson, but this document is less an experiment report than a position paper intended to open the discussion among the community of automata software writers.
This paper presents a simple formalism for capturing reduplication phenomena in the morphology and phonology of natural languages. After a brief survey of the facts common in reduplicative elements cross-linguistically, these facts are described in terms of finite-state systems. The principal idea is that an operator can be derived to ensure equivalence of finite discontinuous strings at some level of representation.
This paper presents a method for converting back and forth between the Perso-Arabic and a romanized writing system for Persian. Given a word in one writing system, we use finite state transducers to generate morphological analysis for the word that is subsequently used to regenerate the orthography of the word in the other writing system. The system has been implemented in XFST and LEXC.