
Ebook: Language Engineering for Lesser-Studied Languages

Technologies enabling computers to process specific languages facilitate economic and political progress of societies where these languages are spoken. Development of methods and systems for language processing is therefore a worthy goal for national governments as well as for business entities and scientific and educational institutions in every country in the world. As work on systems and resources for the ‘lower-density’ languages becomes more widespread, an important question is how to leverage the results and experience accumulated by the field of computational linguistics for the major languages in the development of resources and systems for lower-density languages. This issue was at the core of the NATO Advanced Studies Institute on language technologies for middle- and low-density languages held in Georgia in October 2007. This publication is a collection of publication-oriented versions of the lectures presented there and is a useful source of knowledge about many core facets of modern computational-linguistic work. By the same token, it can serve as a reference source for people interested in learning about strategies that are best suited for developing computational-linguistic capabilities for lesser-studied languages, either ‘from scratch’ or using components developed for other languages. The book should also be quite useful in teaching practical system- and resource-building topics in computational linguistics.
Technologies enabling the computer processing of specific languages facilitate economic and political progress of societies where these languages are spoken. Development of methods and systems for language processing is, therefore, a worthy goal for national governments as well as for business entities and scientific and educational institutions in every country in the world. Significant progress has been made over the past 20–25 years in developing systems and resources for language processing. Traditionally, the lion's share of activity concentrated on the “major” languages of the world, defined not so much in terms of the number of speakers as with respect to the amount of publications of various kinds appearing in the language. Thus, much of the work in the field has been devoted to English, with Spanish, French, German, Japanese, Chinese and, to some extent, Arabic also claiming strong presence. The term “high-density” has been used to describe the above languages.
The rest of the languages of the world have fewer computational resources and systems available. As work on systems and resources for the “lower-density” languages becomes more widespread, an important question is how to leverage the results and experience accumulated by the field of computational linguistics for the major languages in the development of resources and systems for lower-density languages. This issue was at the core of the NATO Advanced Studies Institute on language technologies for middle- and low-density languages held in Batumi, Georgia, in October 2007. This book is a collection of publication-oriented versions of the lectures presented there.
The book is divided into three parts. The first part is devoted to the development of tools and resources for the computational study of lesser-studied languages. Typically, each chapter approaches this by describing the work that went into creating an existing resource. Readers should find in this part's papers practical hints for streamlining the development of similar resources for the languages on which they are about to undertake comparable work. In particular, Dan Tufis describes an approach to text tokenization, part-of-speech tagging and morphological stemming, as well as alignment for parallel corpora. Rodolfo Delmonte describes the process of creating a treebank of syntactically analyzed sentences for Italian. Marjorie McShane's chapter is devoted to the important issue of recognizing, translating and establishing co-reference of proper names in different languages. Ivan Derzhanski analyzes the issues related to the creation of multilingual dictionaries.
The second part of the book is devoted to levels of computational processing of text and a core application, machine translation. Kemal Oflazer describes the needs of and approaches to computational treatment of morphological phenomena in language. David Tugwell's contribution discusses issues related to syntactic parsing, especially parsing for languages that feature flexible word order. This topic is especially important for lesser-studied languages because much of the work on syntactic parsing has traditionally been carried out on languages with restricted word order, notably English, while a much greater variety exists in the languages of the world. Sergei Nirenburg's section describes the acquisition of knowledge prerequisites for the analysis of meaning. The approach discussed is truly interlingual: it relies on an ontological metalanguage for describing meaning that does not depend on any specific natural language. Issues of reusing existing ontological-semantic resources to speed up the acquisition of lexical semantics for lesser-studied languages are also discussed. Leo Wanner and François Lareau discuss the benefits of applying the meaning-text theory to creating text generation capabilities for multiple languages. Finally, Stella Makrantonatou and her co-authors Sokratis Sofianopoulos, Olga Giannoutsou and Marina Vassiliou describe an approach to building machine translation systems for lesser-studied languages.
The third and final part of the book contains three case studies on specific language groups and particular languages. Shuly Wintner surveys language resources for Semitic languages. Karine Megerdoomian analyzes specific challenges in processing Armenian and Persian, and Oleg Kapanadze describes the results of projects devoted to applying two general computational semantic approaches (finite-state techniques and ontological semantics) to Georgian.
The book is a useful source of knowledge about many core facets of modern computational-linguistic work. By the same token, it can serve as a reference source for people interested in learning about strategies that are best suited for developing computational-linguistic capabilities for lesser-studied languages – either “from scratch” or using components developed for other languages. The book should also be quite useful in teaching practical system- and resource-building topics in computational linguistics.
This chapter presents some of the basic language engineering pre-processing steps (tokenization, part-of-speech tagging, lemmatization, and sentence and word alignment). Tagging is among the most important processing steps, and its accuracy significantly influences any further processing. Therefore, tagset design, validation and correction of training data, and the various techniques for improving tagging quality are discussed in detail. Since sentence and word alignment are prerequisite operations for exploiting parallel corpora for a multitude of purposes, such as machine translation, bilingual lexicography and annotation import, these issues are also explored in detail.
This chapter deals with treebanks and their applications. We describe VIT (Venice Italian Treebank), focusing on the syntactic-semantic features of the treebank that depend partly on the adopted tagset, partly on the reference linguistic theory, and, lastly, as in every treebank, on the chosen language: Italian. By discussing examples taken from treebanks available in other languages, we show the theoretical and practical differences and motivations that underlie our approach. Finally, we discuss the quantitative analysis of the data of our treebank and compare them to other treebanks. In general, we try to substantiate the claim that treebanking grammars or parsers strongly depend on the chosen treebank, and that this dependence seems to stem both from the adopted linguistic framework for structural description and, ultimately, from the described language.
This article discusses methods for developing proper name recognition, translation and cross-linguistic matching capabilities for any language or combination of languages in a short amount of time, with relatively minimal work by native speaker informants. Unlike much work on proper name recognition, this work is grounded in knowledge-based rather than stochastic methods, and it extends to multi-lingual and multi-script name processing.
This paper covers the fundamentals of the design, implementation and use of bi- and multilingual electronic dictionaries. I also touch upon the Bulgarian experience, past and present, in creating electronic dictionaries, as well as specialised tools for their development, within several international projects.
Many language processing tasks, such as parsing or surface generation, either need to extract and process the information encoded in words or need to synthesize words from available semantic and syntactic information. This chapter presents an overview of the main concepts in building morphological processors for natural languages, based on the finite-state approach, the state-of-the-art mature paradigm for describing and implementing such systems.
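As a rough illustration of the finite-state idea summarized in this abstract, the following Python sketch builds a toy transducer that maps between lexical forms (stem plus tags) and surface forms and can be run in both directions. It is not taken from the chapter: the class `ToyTransducer`, the function `build_noun_transducer`, the tags `+N`, `+Sg`, `+Pl` and the two English stems are all hypothetical choices made for this example, and real systems rely on dedicated finite-state toolkits rather than hand-written code of this kind.

```python
from typing import Dict, List, Set, Tuple

# An arc is (lexical label, surface label, destination state).
# The empty string "" plays the role of epsilon (nothing consumed or emitted).
Arc = Tuple[str, str, int]


class ToyTransducer:
    """Hypothetical minimal transducer; assumes there are no epsilon cycles."""

    def __init__(self) -> None:
        self.arcs: Dict[int, List[Arc]] = {}
        self.finals: Set[int] = set()
        self._next_state = 1  # state 0 is the start state

    def new_state(self) -> int:
        state = self._next_state
        self._next_state += 1
        return state

    def add_arc(self, src: int, lexical: str, surface: str, dst: int) -> None:
        self.arcs.setdefault(src, []).append((lexical, surface, dst))

    def add_string(self, src: int, text: str, dst: int) -> None:
        """Add identity arcs (x:x) spelling out `text` from src to dst."""
        state = src
        for i, ch in enumerate(text):
            nxt = dst if i == len(text) - 1 else self.new_state()
            self.add_arc(state, ch, ch, nxt)
            state = nxt

    def _run(self, text: str, side: int) -> List[str]:
        # side == 0: consume lexical labels and emit surface labels.
        # side == 1: consume surface labels and emit lexical labels.
        results = []
        stack = [(0, text, "")]
        while stack:
            state, rest, out = stack.pop()
            if not rest and state in self.finals:
                results.append(out)
            for arc in self.arcs.get(state, []):
                consumed, emitted, dst = arc[side], arc[1 - side], arc[2]
                if rest.startswith(consumed):
                    stack.append((dst, rest[len(consumed):], out + emitted))
        return sorted(set(results))

    def generate(self, lexical_form: str) -> List[str]:
        return self._run(lexical_form, side=0)

    def analyze(self, surface_form: str) -> List[str]:
        return self._run(surface_form, side=1)


def build_noun_transducer() -> ToyTransducer:
    """Toy English noun fragment: stem, then +N, then +Sg or +Pl."""
    fst = ToyTransducer()
    stem_end = fst.new_state()
    tagged = fst.new_state()
    final = fst.new_state()
    for stem in ("cat", "dog"):
        fst.add_string(0, stem, stem_end)
    fst.add_arc(stem_end, "+N", "", tagged)  # category tag has no surface form
    fst.add_arc(tagged, "+Sg", "", final)    # singular: no suffix
    fst.add_arc(tagged, "+Pl", "s", final)   # plural: surface suffix "s"
    fst.finals.add(final)
    return fst


if __name__ == "__main__":
    fst = build_noun_transducer()
    print(fst.generate("cat+N+Pl"))  # ['cats']
    print(fst.analyze("dogs"))       # ['dog+N+Pl']
```

The same arc table serves both analysis and generation; only the side of each arc that is consumed changes. This sketch handles nothing beyond plain concatenation, so phenomena such as spelling alternations would need additional machinery.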
This paper presents an approach to the automatic syntactic processing of natural language based on the newly emerging paradigm of Dynamic Syntax, and argues that this approach offers a number of practical advantages for this task. In particular, it is shown that the approach is particularly suited to tackling the problems posed by languages displaying a relatively free constituent order, which is often the case for the lesser-studied low- and middle-density languages.
Dynamic Syntax relies on three assumptions, all of which run against the mainstream of generative orthodoxy. These are that the basis of grammar should be taken to be individual grammatical constructions, that it must rely on a rich representational semantics, and most radically that it should be a dynamic system building structure incrementally through the sentence, thus matching the time-flow of language interpretation.
The paper outlines the construction of a probabilistic syntactic model for English, and discusses how it may be extended and adapted for other languages.
We present a methodology and tools that facilitate the acquisition of lexical-semantic knowledge about a language L. The lexicon that results from the process described in this paper expresses the meaning of words and phrases in L using a language-independent formal ontology, the OntoSem ontology. The acquisition process benefits from the availability of an ontological-semantic lexicon for English. The methodology also addresses the task of aligning any existing computational grammar of L with the expectations of the syntax-oriented zone of the ontological-semantic lexicon. Illustrative material in this paper is presented by means of the DEKADE knowledge acquisition environment.
The linguistic model as defined in the Meaning-Text Theory (MTT) is, first of all, a language production model. This makes MTT especially suitable for language engineering tasks related to synthesis: text generation, summarization, paraphrasing, speech generation, and the like. In this article, we focus on text generation. Large-scale text generation requires substantial resources, namely grammars and lexica. While these resources are increasingly compiled for high-density languages, for low- and middle-density languages often no generation resources are available. The question of how to obtain them in the most efficient way thus becomes pressing. In this article, we address this question for MTT-oriented text generation resources.
The main aim of this article is to present the prototype hybrid Machine Translation (MT) system METIS. METIS is interesting in two ways. As regards MT for low- and middle-density languages, METIS relies on relatively cheap resources: monolingual corpora of the target language (TL), flat bilingual lexica and basic NLP tools (taggers, lemmatizers, chunkers). In terms of research, METIS uses pattern matching algorithms and patterns in an innovative way. In order to put the discussion of METIS in context and define the niche that this research prototype fills, the landscape of state-of-the-art Machine Translation, especially as regards low- and middle-density languages, is briefly described.
Language resources are crucial for research and development in theoretical, computational, socio- and psycho-linguistics, and for the construction of natural language processing applications. This paper focuses on Semitic languages, a language family that includes Arabic and Hebrew and has over 300 million speakers. The paper discusses the challenge that Semitic languages pose for computational processing, and surveys the current state of the art, providing references to several existing solutions.
This paper presents research on the feasibility and development of methods for the rapid creation of stopgap language technology resources for low-density languages. The focus is on two broad strategies: (i) related-language bootstrapping can be used to port existing technology from a resource-rich language to its associated lower-density variant; and (ii) clever use of linguistic knowledge can be employed to scale down the need for large amounts of training or development data. Drawing on Persian and Armenian, the paper illustrates several methods that can be implemented in each instance with the goal of reducing human effort and avoiding the data scarcity faced by statistical systems.
The first part of the paper discusses the application of finite-state tools to Georgian, one of the Southern Caucasian languages. Finite-state tools have been very popular in computational morphology and other lower-level applications in natural-language engineering. The basic claim of finite-state morphology is that a morphological analyzer for a natural language can be implemented as a data structure called a Finite State Transducer. Such transducers are bidirectional, principled, fast and compact. In Georgian, as in many non-Indo-European agglutinative languages, concatenative morphotactics is impressively productive within its rich morphology. The Georgian lexical transducer presented is capable of producing (analyzing and generating) all theoretically possible options for the lemmata from 21 identified sets of Georgian nouns and for most of the lemmata from about 150 sets of verb constructions. The second part of the paper is devoted to the application of ontological semantics to Georgian. In a general ontological semantics lexicon, meanings of words and expressions are represented in terms of instances of concepts from the ontology. Each lexicon entry comprises a morphological category and a description of its syntactic and semantic features. A syntactic structure reflects syntactic valency represented as a syntactic subcategorization frame of an entry. A semantic structure links the lexicon entry with the language-independent ontological-semantic static knowledge sources: the ontology and the fact database. In a Georgian version of the ontological lexicon, alongside the mentioned monolingual information, each entry is supplied with its English translation equivalents. Consequently, we consider it a potential bilingual ontological lexicon for multilingual NLP applications.
The paper covers the specifics of the formal description of lexicon entries in the bilingual lexicon and discusses possible solutions for “tolerating” differences in morpho-syntactic structure in the framework of a Georgian-English Ontological Semantics Lexicon.