
Ebook: Human Language Technologies – The Baltic Perspective

Human language technologies continue to play an important part in the modern information society. This book contains papers presented at the fifth international conference ‘Human Language Technologies – The Baltic Perspective’ (Baltic HLT 2012), held in Tartu, Estonia, in October 2012. Baltic HLT provides a special venue for new and ongoing work in computational linguistics and related disciplines, both in the Baltic states and in a broader geographical perspective. It brings together scientists, developers, providers and users of HLT, and is a forum for sharing new ideas and recent advances in human language processing, promoting cooperation between the research communities of computer science and linguistics from the Baltic countries and the rest of the world. Twenty long papers, as well as the twenty posters and demos accepted for presentation at the conference, are published here. They cover a wide range of topics: morphological disambiguation, dependency syntax and valency, computational semantics, named entities, dialogue modeling, terminology extraction and management, machine translation, corpus and parallel corpus compilation, speech modeling and multimodal communication. Some of the papers also give a general overview of the state of the art of human language technology and language resources in the Baltic states. This book will be of interest to all those whose work involves the use and application of computational linguistics and related disciplines.
This volume contains papers presented at the Fifth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2012), held in Tartu, Estonia, on 4–5 October 2012.
Since its first edition in 2004, Baltic HLT has served as a special venue for new and ongoing work in computational linguistics and related disciplines in the Baltic states as well as in a broader geographical perspective.
The main aim of this conference is to provide a forum for the sharing of new ideas and recent advances in human language processing and to promote cooperation between the research communities of computer science and linguistics from the Baltic countries and the rest of the world. The conference brings together scientists, developers, providers and users to discuss the state of the art of human language technologies in the Baltic countries, to exchange information, to discuss problems, to find new synergies and to promote initiatives for international cooperation.
The call for papers for the fifth Baltic HLT laid special emphasis on multilinguality in language resources and on applications of human language technology, while also encouraging potential authors to submit papers on other subfields of computational linguistics and related disciplines.
A total of 51 submissions were received; each submission was evaluated by at least two reviewers.
The Programme Committee consisted of 25 members from 13 different countries. Based on the reviewers' scores and their comments on the content and quality of the papers, 20 long papers and 20 posters or demos were accepted for presentation and publication.
The accepted submissions cover a wide range of topics: morphological disambiguation, dependency syntax and valency, computational semantics, named entities, dialogue modeling, terminology extraction and management, machine translation, corpus and parallel corpus compilation, speech modeling and multimodal communication. A few papers give a general overview of the state of the art of human language technology and/or language resources in the Baltic states.
Completing the programme are the invited lectures by Lori Lamel “Multilingual Speech Processing Activities in Quaero: application to multimedia search in unstructured data” and Bente Maegaard “A Multilingual Research Infrastructure”.
We wish to express our gratitude to the members of the Programme Committee who worked hard to review all submissions.
We also want to thank the organizers and supporters of this conference: the Institute of Computer Science of the University of Tartu, and the Estonian Ministry of Education and Research as the funder of the National Programme for Estonian Language Technology. The conference was also supported by the CLARIN and META-NORD projects.
Arvi Tavast
Kadri Muischnek
Kadri Vider
Mare Koit
Spoken language processing technologies are principal components in most of the applications being developed as part of the Quaero program. Quaero is a large research and industrial innovation program focusing on the development of technologies for automatic analysis and classification of multimedia and multilingual documents. Concerning speech processing, research aims to substantially improve the state of the art in speech-to-text transcription, speaker diarization and recognition, language recognition, and speech translation.
In this presentation we will focus on the scientific and technical goals for CLARIN, the contributions of the various communities and the strategies for the future.
This paper describes a speech-to-text system for semi-spontaneous Estonian speech. The system is trained on about 100 hours of manually transcribed speech and a 300M word text corpus. Compound words are split before building the language model and reconstructed from recognizer output using a hidden event N-gram model. We use a three pass transcription strategy with unsupervised speaker adaptation between individual passes. The system achieves a word error rate of 34.6% on conference speeches and 25.6% on radio talk shows.
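The abstract above mentions splitting compound words before language model training and reconstructing them from recognizer output. The system described uses a hidden event N-gram model to decide where to rejoin parts; the sketch below is a simplified stand-in that instead marks compound parts with an explicit "+" suffix, a common convention in subword language modeling, so the round trip can be shown in a few lines. The marker and the toy splitter are assumptions, not the authors' implementation.

```python
# Minimal sketch of the split-and-rejoin idea. The real system decides
# joins with a hidden event N-gram model; here, continuation parts carry
# an explicit "+" marker instead (an assumed convention).

def split_compounds(tokens, compound_splitter):
    """Split each token into parts; every part except the last gets a
    '+' marker so the original word can be restored later."""
    out = []
    for token in tokens:
        parts = compound_splitter(token)
        out.extend(p + "+" for p in parts[:-1])
        out.append(parts[-1])
    return out

def join_compounds(tokens):
    """Reconstruct compounds by gluing every '+'-marked token to its
    successor in the recognizer output."""
    out, buffer = [], ""
    for token in tokens:
        if token.endswith("+"):
            buffer += token[:-1]
        else:
            out.append(buffer + token)
            buffer = ""
    if buffer:                      # trailing marker with no continuation
        out.append(buffer)
    return out

# Round trip with a toy splitter ("|" pretends to mark compound joints;
# "raudteejaam" is Estonian for "railway station"):
splitter = lambda w: w.split("|")
tokens = split_compounds(["raudtee|jaam", "on", "suur"], splitter)
assert join_compounds(tokens) == ["raudteejaam", "on", "suur"]
```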
The paper presents ongoing research on applying a pattern-based approach for Lithuanian that can help to automatically extract term-defining contexts from a specialized corpus of education and science texts. The stages of the research include analysis of the constituent elements of definitional patterns; formalization of the definitional patterns; and automatic extraction of term-defining contexts. The first evaluation shows that, despite the relatively low frequency of term-defining contexts, their quality can be high enough to serve as a starting point for definitions.
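To make the idea of formalized definitional patterns concrete, here is a small sketch of pattern-based extraction using regular expressions. The patterns and the example sentence are English stand-ins for illustration only; the work described formalizes Lithuanian patterns over a corpus of education and science texts.

```python
import re

# Illustrative English stand-ins for formalized definitional patterns.
DEFINITIONAL_PATTERNS = [
    re.compile(r"(?P<term>[A-Z][\w -]+?) is defined as (?P<definition>[^.]+)\."),
    re.compile(r"(?P<term>[A-Z][\w -]+?) is a (?P<definition>[^.]+)\."),
]

def extract_defining_contexts(sentences):
    """Return (term, definition) candidates for later manual filtering."""
    hits = []
    for sentence in sentences:
        for pattern in DEFINITIONAL_PATTERNS:
            m = pattern.search(sentence)
            if m:
                hits.append((m.group("term"), m.group("definition")))
    return hits

print(extract_defining_contexts(
    ["Morphology is a branch of linguistics that studies word structure."]))
# [('Morphology', 'branch of linguistics that studies word structure')]
```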
This paper reports on a specific problem of automatic terminology extraction in Lithuanian – base form inference. While the process of lemmatisation is properly carried out by existing tools, problems arise with normalizing multiword terms. The problem can be described as the discrepancy between the base form (i.e. the lemma) of a term and the sequence of the base forms of the constituent lexical items within the term. Lithuanian is a strongly inflected language, and lemmatising each word separately within a multiword term breaks the syntactic relations expressed by inflection (case, gender, number), which need to be kept in order to preserve the cohesion of the term.
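The following toy sketch illustrates the discrepancy the abstract describes. The lemmatizer outputs and the small term lexicon are hypothetical placeholders (real Lithuanian morphology is far richer); the point is only that per-word lemmatization loses agreement inside the term, while a term-level lookup preserves the correct base form.

```python
# Hypothetical per-word lemmatizer output and term lexicon; the forms
# are illustrative only, not claims about any real tool's behavior.

WORD_LEMMAS = {
    "aukštųjų": "aukštas",      # inflected adjective -> adjective lemma
    "mokyklų":  "mokykla",      # inflected noun -> noun lemma
}
TERM_LEMMAS = {
    ("aukštųjų", "mokyklų"): "aukštoji mokykla",
}

def naive_lemmatize(term_tokens):
    """Per-word lemmatization: breaks agreement inside the term."""
    return " ".join(WORD_LEMMAS.get(t, t) for t in term_tokens)

def term_lemmatize(term_tokens):
    """Term-level normalization, falling back to the naive method only
    when the multiword term is not in the lexicon."""
    return TERM_LEMMAS.get(tuple(term_tokens)) or naive_lemmatize(term_tokens)

tokens = ["aukštųjų", "mokyklų"]       # genitive plural in running text
print(naive_lemmatize(tokens))         # "aukštas mokykla"  (broken agreement)
print(term_lemmatize(tokens))          # "aukštoji mokykla" (correct base form)
```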
In this paper, we present the results of a series of experiments done to improve the quality of a Lithuanian-English statistical MT (SMT) system. We particularly focus on word alignment and out-of-vocabulary issues in SMT when translating from a morphologically rich language into English.
There is a significant difficulty in discriminating vowels sung at extremely high fundamental frequencies, especially when the fundamental frequency (F0) produced is above the region where the first vowel formant (F1) would normally occur. Apart from the difficulties involved in discriminating vowels, aspects of phonology might be expected to contribute to problems of intelligibility. Can such vowels be correctly identified and, if so, does context provide the necessary information, or are acoustic elements also operative? The paper studies the perception of sung vowels in order to gain insight into the intelligibility of singing. A perceptual study of the intelligibility of Russian vowels was carried out: 49 subjects took perceptual tests aimed at identifying sung vowels. Classification of the confusions showed that incorrectly identified vowels tend to be confused with [a]. An acoustic analysis of the material was also carried out. The results of the perceptual and acoustic analyses are compared and discussed.
We describe experiments on Estonian-English statistical machine translation with a strong emphasis on domain adaptation. We show that disregarding text domains can harm a translation system and that even a small in-domain corpus can lead to significant translation quality improvements.
The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech-English parallel corpus and a shallow MT system for Czech→Slovak translation. Several options are explored to identify the best possible configuration.
We also present our two contributions to available data resources, namely the English-Slovak parallel corpus and the Slovak variant of the WMT 2011 test set.
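The core idea of the abstract above, creating synthetic parallel data via a closely related language, can be sketched in a few lines. The function names below are hypothetical; the paper's shallow Czech→Slovak MT system is abstracted here as a callable `cs_to_sk`.

```python
# Sketch of the synthetic-parallel-data idea: the Czech side of a large
# Czech-English corpus is machine-translated into Slovak, producing a
# synthetic Slovak-English corpus that augments the scarce genuine data.

def synthesize_sk_en(cs_en_pairs, cs_to_sk):
    """Yield (Slovak, English) pairs from (Czech, English) pairs."""
    for cs_sentence, en_sentence in cs_en_pairs:
        yield cs_to_sk(cs_sentence), en_sentence

def build_training_corpus(genuine_sk_en, cs_en_pairs, cs_to_sk):
    """Concatenate genuine and synthetic data; the paper explores several
    such configurations to find the best one."""
    return list(genuine_sk_en) + list(synthesize_sk_en(cs_en_pairs, cs_to_sk))
```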
This paper presents the work on terminology extraction from comparable corpora for Latvian. In the first section we introduce our work; the second section briefly describes the concept of the project and the implemented general terminology processing chain; the following two sections focus on terminology extraction workflow for Latvian and evaluation of results, respectively.
Biomedical text processing relies heavily on terminological resources. Whether terminologies are extracted automatically from a domain corpus or crafted by hand, one aspect is rarely considered: terms evolve over time. Terms in the domain literature change due to many factors: new factual evidence, new hypotheses being proposed or old ones denied, a shift towards increasing specificity, variation in expression, different people working independently on the same novel phenomenon, etc. This paper reports an experimental investigation carried out on biomedical domain literature, capturing how specific domain terminology changes over time.
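One simple way to observe such change, sketched below under assumed data structures and not taken from the paper, is to slice the document collection by year and compare each term's relative frequency across slices.

```python
from collections import Counter, defaultdict

# Sketch: track a term's relative frequency across time slices of a
# corpus. Input format and terms are illustrative assumptions.

def term_trends(documents, terms):
    """documents: iterable of (year, list_of_tokens).
    Returns {term: {year: relative_frequency}}."""
    counts, totals = defaultdict(Counter), Counter()
    for year, tokens in documents:
        totals[year] += len(tokens)
        for token in tokens:
            if token in terms:
                counts[token][year] += 1
    return {t: {y: counts[t][y] / totals[y] for y in totals} for t in terms}

docs = [(1995, ["gene", "sequence", "analysis"]),
        (2005, ["genome", "wide", "association", "gene"])]
print(term_trends(docs, {"gene", "genome"}))
```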
This article presents a simple yet efficient method for resolving lemma ambiguity as a part of morphological tagging of Estonian. By lemma ambiguity the authors mean the situation when a word-form has several (mostly two) possible morphological readings and the only difference between these readings lies in the form of the lemma, i.e. the POS and grammatical categories are the same, but the possible lemmas are different. This type of ambiguity is characteristic of 1.5% of tokens in an otherwise morphologically disambiguated text. A text- and corpus-based method is used to resolve this kind of ambiguity. The precision of the method is 0.94 and the recall 0.67.
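A hedged sketch of what a text- and corpus-based decision could look like (not the authors' exact algorithm): when readings differ only in the lemma, prefer the candidate that unambiguous tokens elsewhere in the same text support, fall back to corpus-wide counts, and otherwise leave the token ambiguous, which is consistent with a recall below 1.

```python
from collections import Counter

def pick_lemma(candidates, text_lemma_counts, corpus_lemma_counts):
    """candidates: lemmas that share POS and grammatical categories.
    Try text-level evidence first, then corpus-level; return None if
    neither gives a unique, non-zero winner."""
    for counts in (text_lemma_counts, corpus_lemma_counts):
        scored = [(counts[c], c) for c in candidates]
        best = max(scored)
        if best[0] > 0 and sum(1 for s, _ in scored if s == best[0]) == 1:
            return best[1]
    return None   # remains ambiguous

text_counts = Counter({"lemma_a": 3})   # unambiguous support in this text
print(pick_lemma(["lemma_a", "lemma_b"], text_counts, Counter()))  # "lemma_a"
```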
The Geoinformational Database of Lithuanian Toponyms serves several goals: (1) to foster scientific research and applied science, (2) to satisfy the practical needs of inhabitants, and (3) to preserve Lithuanian toponyms as cultural heritage. It is the first database in Lithuania that jointly provides linguistic and geographic information about toponyms.
Our paper describes work we have done for Estonian WordNet as part of the META-NORD project tasks. We discuss the linking of Estonian WordNet and Core WordNet from linguistic, lexicographic and technical points of view. Cross-language linking is also briefly described.
EELex is a web-based dictionary writing system with Estonian language support, including various linguistic resources necessary for dictionary making [1, 2]. Nearly 40 dictionaries of different types (monolingual and bilingual, general and learners' dictionaries, etc.) with standard XML markup make EELex a multipurpose lexicographic database. Using the example of the active Basic Estonian Dictionary [3], this paper describes, from a lexicographer's point of view, the functions of EELex that allow various specialized dictionaries to be generated. We focus on the generation of syntagmatic dictionaries, mainly valency and collocation dictionaries.
This paper discusses the different methods that have been used for managing word form variation throughout the history of textual information retrieval. These techniques have been characterized in many ways in the IR literature. We pinpoint the most meaningful features of the approaches and make comparisons that have practical value. In the discussion we characterize word form variation management methods from different angles and offer the reader a practical guide for choosing between them.
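Two of the classic management strategies the survey covers can be contrasted in a short sketch: rule-based stemming conflates forms by suffix stripping, while dictionary-based lemmatization maps forms to canonical lemmas. The sketch uses NLTK's Snowball stemmer; the tiny lemma table is a stand-in for a full morphological analyzer.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run"}  # toy dictionary

def stem_match(query_word, doc_word):
    """Match query and document words on their stems."""
    return stemmer.stem(query_word) == stemmer.stem(doc_word)

def lemma_match(query_word, doc_word):
    """Match query and document words on their lemmas."""
    norm = lambda w: LEMMAS.get(w, w)
    return norm(query_word) == norm(doc_word)

print(stem_match("running", "runs"))   # True: both stem to "run"
print(stem_match("mice", "mouse"))     # False: stemming misses irregulars
print(lemma_match("mice", "mouse"))    # True: the dictionary covers it
```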
We analyse dialogues in order to determine the dialogue structure formed by micro-level units – dialogue acts. The empirical material of the study is a sub-corpus of Estonian directory inquiries. Dialogue recordings are transliterated using the transcription conventions of conversation analysis, and dialogue acts are annotated in the corpus. Rules for identifying different dialogue parts are formulated, based on sequences of dialogue acts and their position in the dialogue. Our further aim is to implement software for automatic pragmatic analysis of dialogues in order to recognize their linear structure as well as sub-dialogues.
In this work we study a set of adaptation methods for improving the recognition accuracy of foreign entity names in morph-based speech recognition for Finnish. Supervised forms of language model and lexicon adaptation are evaluated. Morpheme adaptation is performed by restoring over-segmented foreign words back into their dictionary forms. This is important for determining the correct pronunciation. To further improve pronunciation modeling of foreign words, non-native phonemes are included in the acoustic model by augmenting the training set with English sentences spoken by native Finnish speakers. English phonemes that do not have a close native counterpart are added to the Finnish phoneme set. A combination of language model, acoustic model, pronunciation and morpheme adaptation produces the lowest error rate for foreign entity names. We also performed tests to determine whether improved recognition of foreign words improves performance on a spoken document retrieval task. We were unable to verify any significant improvements. However, the test queries in this particular material included few foreign words, so a large performance improvement was not expected.
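The morpheme adaptation step can be sketched as follows, with details assumed rather than taken from the paper: if the morphs of a word concatenate to an entry in a foreign-name lexicon, restore the unsegmented dictionary form so its pronunciation can be modeled as a whole.

```python
# Illustrative lexicon; the real system uses a much larger list of
# foreign entity names.
FOREIGN_NAMES = {"beethoven", "shakespeare"}

def restore_foreign_words(words_as_morphs):
    """words_as_morphs: list of morph lists, one per word,
    e.g. [["Beet", "hove", "n"], ["soitti"]]."""
    restored = []
    for morphs in words_as_morphs:
        word = "".join(morphs)
        if word.lower() in FOREIGN_NAMES:
            restored.append([word])        # single unsegmented unit
        else:
            restored.append(morphs)        # keep native segmentation
    return restored

print(restore_foreign_words([["Beet", "hove", "n"], ["soitti"]]))
# [['Beethoven'], ['soitti']]
```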
In the current paper we report our first results in the development of audiovisual speech synthesis for Estonian. The MASSY model, developed originally for German, serves as a prototype for the Estonian AV synthesis. First, we give an overview of the methods of AV speech synthesis and the Estonian viseme inventory, then we introduce the MASSY model and its adaptation for Estonian; finally, we discuss the ideas for further development.
The paper introduces work in progress on multimodal articulatory data collection involving multiple instrumental techniques such as electrolaryngography (EGG), electropalatography (EPG) and electromagnetic articulography (EMA). The data is recorded from two native Estonian speakers (one male and one female); the target size of the corpus is approximately one hour of speech from both subjects. The paper introduces the instrumental systems used for data collection and the recording set-ups, gives examples of multimodal data analysis, and discusses possible uses of the corpus.
Development of a verb valency lexicon for Latvian has recently been started. The chosen approach combines and supplements the experience of similar lexical resources developed for other languages. The paper describes our approach to verb valency annotation: the valency layers (syntactic and semantic valency, selectional restrictions) and the set of semantic roles. The annotation process, using an online tool developed for valency annotation, is also briefly described. From the annotated corpus examples, the valency patterns (models) for each verb are generated, considering only the core semantic roles. As a result of this work, summarized information on the valency patterns of each verb will be available, as well as a corpus of annotated examples. Currently, more than 150 verbs (more than 16 000 sentences) have been annotated using data from the Balanced Corpus of Modern Latvian.
The article describes the creation of hidden Markov model (HMM) based speech models for both a male and a female voice for Estonian text-to-speech synthesis. A brief overview of the text-to-speech synthesis process is given, focusing on statistical parametric synthesis in particular. The HTS system is employed to generate the voice models. The creation of the speech corpus of the Institute of the Estonian Language is analyzed. The process of adapting Estonian training data and the linguistic specification to HTS is described, as well as experiments carried out on data from different speakers, subcorpora and linguistic specifications. Findings from the speech model evaluation are given and possible courses of action to improve the quality of the trained HMM-based speech models are proposed.
The paper describes work in progress on building a catalogue of named entities – people, places and organizations – based on a recently digitized large (4.5 billion tokens) Latvian corpus. The authors propose an annotation standard for the markup of named entities within the Latvian corpus, according to which a representative set of documents (150 000 words) is manually annotated. This corpus is used for training and evaluation of an automated named entity recognition system based on the Stanford CRF classifier, achieving an F-score of up to 81%. The named entities indexed within the Latvian National Library corpus and the annotated documents are publicly available online for linguistic and historical research.
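The paper trains the Stanford CRF classifier (a Java tool); as a rough Python stand-in, the sketch below shows a comparable CRF setup using the sklearn-crfsuite library. The feature template and the toy BIO-labelled example are illustrative assumptions, not the paper's configuration.

```python
import sklearn_crfsuite

def token_features(sentence, i):
    """A small, illustrative feature template for one token."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
        "prev.istitle": sentence[i - 1].istitle() if i > 0 else False,
    }

def featurize(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# Toy training data in BIO labels ("Jānis dzīvo Rīgā" = "Jānis lives in
# Riga"); real training uses the manually annotated 150 000-word set.
X = [featurize(["Jānis", "dzīvo", "Rīgā"])]
y = [["B-PER", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, y)
print(crf.predict([featurize(["Rīgā", "dzīvo", "Jānis"])]))
```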
In this paper the authors present various techniques for achieving MT domain adaptation with limited in-domain resources. The paper gives a case study of what works and what does not when one has to build a domain-specific machine translation system. Systems are adapted using in-domain comparable monolingual and bilingual corpora (crawled from the Web) and bilingual terms and named entities. The authors show how to efficiently integrate terms within statistical machine translation systems, thus significantly improving upon the baseline.
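One common way to integrate bilingual terms into an SMT system, sketched below, is to append each term pair to the parallel training data as a short pseudo-sentence pair, optionally repeated to boost its phrase-table probability. This is a generic technique offered for illustration; the paper's actual integration method may differ, and the file names and term pair are assumptions.

```python
def append_terms(src_path, tgt_path, term_pairs, repeat=3):
    """Append bilingual term pairs to a sentence-aligned parallel corpus
    stored as two line-aligned text files.

    term_pairs: iterable of (source_term, target_term)."""
    with open(src_path, "a", encoding="utf-8") as src, \
         open(tgt_path, "a", encoding="utf-8") as tgt:
        for source_term, target_term in term_pairs:
            for _ in range(repeat):
                src.write(source_term + "\n")
                tgt.write(target_term + "\n")

# Hypothetical usage ("datu bāze" is Latvian for "database"):
# append_terms("corpus.lv", "corpus.en", [("datu bāze", "database")])
```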