Preface
Technologies that enable the computer processing of specific languages facilitate the economic and political progress of the societies where these languages are spoken. Development of methods and systems for language processing is, therefore, a worthy goal for national governments as well as for business entities and scientific and educational institutions in every country in the world. Significant progress has been made over the past 20–25 years in developing systems and resources for language processing. Traditionally, the lion's share of activity has concentrated on the “major” languages of the world, defined not so much by the number of speakers as by the volume of publications of various kinds appearing in the language. Thus, much of the work in the field has been devoted to English, with Spanish, French, German, Japanese, Chinese and, to some extent, Arabic also claiming a strong presence. The term “high-density” has been used to describe these languages.
The rest of the languages of the world have fewer computational resources and systems available. As work on systems and resources for the “lower-density” languages becomes more widespread, an important question is how to leverage the results and experience that the field of computational linguistics has accumulated for the major languages when developing resources and systems for lower-density languages. This issue was at the core of the NATO Advanced Studies Institute on language technologies for middle- and low-density languages held in Batumi, Georgia, in October 2007. This book is a collection of publication-oriented versions of the lectures presented there.
The book is divided into three parts. The first part is devoted to the development of tools and resources for the computational study of lesser-studied languages. Typically, this is done by describing the work that went into creating an existing resource. Readers should find in the papers of this part practical hints for streamlining the development of similar resources for the languages on which they are about to undertake comparable work. In particular, Dan Tufis describes an approach to text tokenization, part-of-speech tagging and morphological stemming, as well as alignment for parallel corpora. Rodolfo Delmonte describes the process of creating a treebank of syntactically analyzed sentences for Italian. Marjorie McShane's chapter is devoted to the important issue of recognizing, translating and establishing co-reference of proper names in different languages. Ivan Derzhanski analyzes the issues related to the creation of multilingual dictionaries.
The second part of the book is devoted to levels of computational processing of text and a core application, machine translation. Kemal Oflazer describes the needs of and approaches to computational treatment of morphological phenomena in language. David Tugwell's contribution discusses issues related to syntactic parsing, especially parsing for languages that feature flexible word order. This topic is especially important for lesser-studied languages because much of the work on syntactic parsing has traditionally been carried out on languages with restricted word order, notably English, while a much greater variety exists in the languages of the world. Sergei Nirenburg's section describes the acquisition of knowledge prerequisites for the analysis of meaning. The approach discussed is truly interlingual – it relies on an ontological metalanguage for describing meaning that does not depend on any specific natural language. Issues of reusing existing ontological-semantic resources to speed up the acquisition of lexical semantics for lesser-studied languages are also discussed. Leo Wanner and François Lareau discuss the benefits of applying the meaning-text theory to creating text generation capabilities for multiple languages. Finally, Stella Markantonatou and her co-authors Sokratis Sofianopoulos, Olga Giannoutsou and Marina Vassiliou describe an approach to building machine translation systems for lesser-studied languages.
The third and final part of the book contains three case studies on specific language groups and particular languages. Shuly Wintner surveys language resources for Semitic languages. Karine Megerdoomian analyzes specific challenges in processing Armenian and Persian, and Oleg Kapanadze describes the results of projects devoted to applying two general computational semantic approaches – finite state techniques and ontological semantics – to Georgian.
The book is a useful source of knowledge about many core facets of modern computational-linguistic work. It can also serve as a reference source for people interested in learning about the strategies best suited for developing computational-linguistic capabilities for lesser-studied languages, either “from scratch” or using components developed for other languages. The book should also be quite useful in teaching practical system- and resource-building topics in computational linguistics.