Ebook: Human Language Technologies – The Baltic Perspective

Proceedings of the Fifth International Conference Baltic HLT 2012

Series

Frontiers in Artificial Intelligence and Applications

Volume

247

Published

2012

Editors

Arvi Tavast, Kadri Muischnek, Mare Koit

ISBN

978-1-61499-132-8 (print) | 978-1-61499-133-5 (online)

Subject(s)

Artificial Intelligence

Open Access

Description

Human language technologies continue to play an important part in the modern information society. This book contains papers presented at the fifth international conference ‘Human Language Technologies – The Baltic Perspective (Baltic HLT 2012)’, held in Tartu, Estonia, in October 2012. Baltic HLT provides a special venue for new and ongoing work in computational linguistics and related disciplines, both in the Baltic states and in a broader geographical perspective. It brings together scientists, developers, providers and users of HLT, and is a forum for the sharing of new ideas and recent advances in human language processing, promoting cooperation between the research communities of computer science and linguistics from the Baltic countries and the rest of the world. Twenty long papers, as well as the posters or demos accepted for presentation at the conference, are published here. They cover a wide range of topics: morphological disambiguation, dependency syntax and valency, computational semantics, named entities, dialogue modeling, terminology extraction and management, machine translation, corpus and parallel corpus compiling, speech modeling and multimodal communication. Some of the papers also give a general overview of the state of the art of human language technology and language resources in the Baltic states. This book will be of interest to all those whose work involves the use and application of computational linguistics and related disciplines.


Contents

Front Matter

Pages

i - xi

Preface

This volume contains papers presented at the Fifth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2012), held in Tartu, Estonia on 4–5 October 2012.

Since its first edition in 2004, Baltic HLT has served as a special venue for new and ongoing work in computational linguistics and related disciplines in the Baltic states as well as in a broader geographical perspective.

The main aim of this conference is to provide a forum for the sharing of new ideas and recent advances in human language processing and to promote cooperation between the research communities of computer science and linguistics from the Baltic countries and the rest of the world. The conference brings together scientists, developers, providers and users to discuss the state of the art of human language technologies in the Baltic countries, to exchange information and discuss problems, to find new synergies and to promote initiatives for international cooperation.

The call for papers for the fifth Baltic HLT laid special emphasis on multilinguality in language resources and on applications of human language technology, while also encouraging the potential authors to submit papers on other subfields of computational linguistics and related disciplines.

51 submissions were received; each submission was evaluated by at least two reviewers.

The Programme Committee consisted of 25 members from 13 different countries. Based on their scores and the comments they provided on the content and quality of the papers, 20 long papers and 20 posters or demos were accepted for presentation and publication.

The accepted submissions cover a wide range of topics: morphological disambiguation, dependency syntax and valency, computational semantics, named entities, dialogue modeling, terminology extraction and management, machine translation, corpus and parallel corpus compiling, speech modeling and multimodal communication. A few papers give a general overview of the state of the art of human language technology and/or language resources in the Baltic states.

Completing the programme are the invited lectures by Lori Lamel “Multilingual Speech Processing Activities in Quaero: application to multimedia search in unstructured data” and Bente Maegaard “A Multilingual Research Infrastructure”.

We wish to express our gratitude to the members of the Programme Committee who worked hard to review all submissions.

We also want to thank the organizers and supporters of this conference: the Institute of Computer Science of the University of Tartu, and the Estonian Ministry of Education and Research as funder of the National Programme for Estonian Language Technology. The conference is also supported by the CLARIN and META-NORD projects.

Arvi Tavast

Kadri Muischnek

Kadri Vider

Mare Koit


Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data

Authors

Lori Lamel

Pages

1 - 8

DOI

10.3233/978-1-61499-133-5-1

Abstract

Spoken language processing technologies are principal components in most of the applications being developed as part of the Quaero program. Quaero is a large research and industrial innovation program focusing on the development of technologies for automatic analysis and classification of multimedia and multilingual documents. Concerning speech processing, research aims to substantially improve the state of the art in speech-to-text transcription, speaker diarization and recognition, language recognition, and speech translation.


A Multilingual Research Infrastructure

Authors

Bente Maegaard

Page

DOI

10.3233/978-1-61499-133-5-9

Abstract

In this presentation we will focus on the scientific and technical goals for CLARIN, the contributions of the various communities and the strategies for the future.


Transcription System for Semi-Spontaneous Estonian Speech

Authors

Tanel Alumäe

Pages

10 - 17

DOI

10.3233/978-1-61499-133-5-10

Abstract

This paper describes a speech-to-text system for semi-spontaneous Estonian speech. The system is trained on about 100 hours of manually transcribed speech and a 300M word text corpus. Compound words are split before building the language model and reconstructed from recognizer output using a hidden event N-gram model. We use a three-pass transcription strategy with unsupervised speaker adaptation between individual passes. The system achieves a word error rate of 34.6% on conference speeches and 25.6% on radio talk shows.


Towards the Automatic Extraction of Term-defining Contexts in Lithuanian

Authors

Agnė Bielinskienė, Loïc Boizou, Jolanta Kovalevskaitė, Andrius Utka

Pages

18 - 26

DOI

10.3233/978-1-61499-133-5-18

Abstract

The paper presents ongoing research on applying a pattern-based approach to Lithuanian that can help to automatically extract term-defining contexts from a specialized corpus of education and science. The stages of research include analysis of the constituent elements of definitional patterns; formalization of definitional patterns; and automatic extraction of term-defining contexts. The first evaluation shows that despite the relatively low frequency of term-defining contexts, their quality can be high enough to serve as a starting point for definitions.


Automatic Inference of Base Forms for Multiword Terms in Lithuanian

Authors

Loïc Boizou, Gintarė Grigonytė, Erika Rimkutė, Andrius Utka

Pages

27 - 35

DOI

10.3233/978-1-61499-133-5-27

Abstract

This paper reports on a specific problem of automatic terminology extraction in Lithuanian – base form inference. While the process of lemmatisation is properly carried out by existing tools, problems arise with normalizing multiword terms. The problem can be described as the discrepancy between the base form (i.e. lemma) of a term and the sequence of the base forms of constituent lexical items within a term. Lithuanian is a strongly inflected language, and the lemmatisation of each word separately within a multiword term breaks the syntactic relations expressed by inflection (case, gender, number), which need to be kept in order to ensure the cohesion of the term.


Data Pre-Processing to Train a Better Lithuanian-English MT System

Authors

Daiga Deksne, Raivis Skadiņš

Pages

36 - 41

DOI

10.3233/978-1-61499-133-5-36

Abstract

In this paper, we present the results of a series of experiments done to improve the quality of a Lithuanian-English statistical MT (SMT) system. We particularly focus on word alignment and out-of-vocabulary issues in SMT when translating from a morphologically rich language into English.


Perception of Russian Vowels in Singing

Authors

Karina Evgrafova, Vera Evdokimova

Pages

42 - 49

DOI

10.3233/978-1-61499-133-5-42

Abstract

There is significant difficulty in discriminating vowels sung at extremely high fundamental frequencies, especially when the fundamental frequency (F0) produced is above the region where the first vowel formant (F1) would normally occur. Apart from the difficulties involved in discriminating vowels, aspects of phonology might be expected to contribute to problems of intelligibility. Can such vowels be correctly identified and, if so, does context provide the necessary information or are acoustical elements also operative? The paper studies the perception of sung vowels in order to gain insight into the issue of singing intelligibility. A perceptual study on the intelligibility of Russian vowels was carried out. 49 subjects underwent perceptual tests which were aimed at identifying sung vowels. Classification of the confusions showed that incorrectly identified vowels tend to be confused with [a]. An acoustic analysis of the material was also carried out. The results of the perception and acoustic analyses are compared and discussed.


In-domain Data FTW

Authors

Mark Fishel

Pages

50 - 57

DOI

10.3233/978-1-61499-133-5-50

Abstract

We describe experiments on Estonian-English statistical machine translation with a strong emphasis on domain adaptation. We show that disregarding text domains can harm a translation system and that even a small in-domain corpus can lead to significant translation quality improvements.


Improving SMT by Using Parallel Data of a Closely Related Language

Authors

Petra Galuščáková, Ondřej Bojar

Pages

58 - 65

DOI

10.3233/978-1-61499-133-5-58

Abstract

The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech-English parallel corpus and a shallow MT system for Czech→Slovak translation. Several options are explored to identify the best possible configuration.

We also present our two contributions to available data resources, namely the English-Slovak parallel corpus and the Slovak variant of the WMT 2011 test set.


Terminology Extraction from Comparable Corpora for Latvian

Authors

Tatiana Gornostay, Anita Ramm, Ulrich Heid, Emmanuel Morin, Rima Harastani, Emmanuel Planas

Pages

66 - 73

DOI

10.3233/978-1-61499-133-5-66

Abstract

This paper presents the work on terminology extraction from comparable corpora for Latvian. In the first section we introduce our work; the second section briefly describes the concept of the project and the implemented general terminology processing chain; the following two sections focus on terminology extraction workflow for Latvian and evaluation of results, respectively.


Change of Biomedical Domain Terminology Over Time

Authors

Gintarė Grigonytė, Fabio Rinaldi, Martin Volk

Pages

74 - 81

DOI

10.3233/978-1-61499-133-5-74

Abstract

Biomedical text processing relies heavily on terminological resources. Independently of the method used for creating terminologies, whether automatically extracted from a domain corpus or human crafted, there is one aspect that is rarely considered – that terms evolve over time. Terms in the domain literature change due to many factors: new factual evidence, the proposal of new hypotheses or the rejection of old ones, a shift towards increasing specificity, variation in expression, different people working independently on the same novel phenomenon, etc. This paper reports an experimental investigation carried out on biomedical domain literature capturing how specific domain terminology changes over time.


A trivial method for choosing the right lemma

Authors

Heiki-Jaan Kaalep, Riin Kirt, Kadri Muischnek

Pages

82 - 89

DOI

10.3233/978-1-61499-133-5-82

Abstract

This article presents a simple yet efficient method for solving lemma ambiguity as a part of morphological tagging of Estonian. By lemma ambiguity the authors mean the situation when a word-form has several (mostly two) possible morphological readings and the only difference between these readings lies in the form of the lemma, i.e. the POS and grammatical categories are the same, but the possible lemmas are different. This type of ambiguity is characteristic of 1.5% of tokens in an otherwise morphologically disambiguated text. A text- and corpus-based method is used to resolve this kind of ambiguity. The precision of the method is 0.94 and the recall 0.67.


Geoinformational Database of Lithuanian Toponyms

Authors

Dalia Kačinaitė-Vrubliauskienė

Pages

90 - 95

DOI

10.3233/978-1-61499-133-5-90

Abstract

The Geoinformational Database of Lithuanian Toponyms serves several goals: (1) to foster scientific research and applied science, (2) to satisfy the practical needs of inhabitants, and (3) to preserve Lithuanian toponyms as cultural heritage. It is the first database in Lithuania that jointly provides linguistic and geographic information about the toponyms.


Cross-linking Experience of Estonian WordNet

Authors

Neeme Kahusk, Heili Orav, Kadri Vare

Pages

96 - 102

DOI

10.3233/978-1-61499-133-5-96

Abstract

Our paper describes the work we have done on Estonian WordNet as part of the META-NORD project tasks. We discuss the process of linking Estonian WordNet and Core WordNet from linguistic, lexicographical and technical points of view. Cross-language linking is also briefly described.


Automatic Generation of Specialized Dictionaries Using the Dictionary Writing System EELex

Authors

Jelena Kallas, Margit Langemets

Pages

103 - 110

DOI

10.3233/978-1-61499-133-5-103

Abstract

EELex is a web-based dictionary writing system with Estonian language support including various linguistic resources necessary for dictionary making [1, 2]. Nearly 40 dictionaries of different types (monolingual and bilingual, general and learners' dictionaries, etc.) with standard XML markup make EELex a multipurpose lexicographic database. Using the example of the active Basic Estonian Dictionary [3], this paper describes, from the point of view of a lexicographer, the functions of EELex that allow various specialized dictionaries to be generated. We focus on the generation of syntagmatic dictionaries, mainly valency and collocation dictionaries.


Managing Word Form Variation of Text Retrieval in Practice – why Five Character Truncation Takes it all?

Authors

Kimmo Kettunen

Pages

111 - 119

DOI

10.3233/978-1-61499-133-5-111

Abstract

This paper discusses the different methods that have been used to manage word form variation throughout the history of textual information retrieval (IR). These techniques have been characterized in many ways. We pinpoint the most meaningful features of the approaches and make comparisons that have practical value. In the discussion we characterize word form variation management methods in different ways and offer the reader an overall practical guide for choosing between them.


Towards Automatic Recognition of the Structure of Estonian Directory Inquiries

Authors

Mare Koit

Pages

120 - 128

DOI

10.3233/978-1-61499-133-5-120

Abstract

We analyse dialogues in order to determine the dialogue structure formed by micro-level units – dialogue acts. The empirical material of the study is a sub-corpus of Estonian directory inquiries. Dialogue recordings are transliterated by using the transcription of conversation analysis. Dialogue acts are annotated in the corpus. Rules for identification of different dialogue parts will be formulated which use sequences of dialogue acts and their position in dialogue. Our further aim is to implement software for automatic pragmatic analysis of dialogues in order to recognize their linear structure as well as sub-dialogues.


Adaptation of Morpheme-based Speech Recognition for Foreign Entity Names

Authors

André Mansikkaniemi, Mikko Kurimo

Pages

129 - 137

DOI

10.3233/978-1-61499-133-5-129

Abstract

In this work we study a set of adaptation methods for improving the recognition accuracy of foreign entity names in morph-based speech recognition for Finnish. Supervised forms of language model and lexicon adaptation are evaluated. Morpheme adaptation is performed by restoring over-segmented foreign words back into their dictionary forms. This is important for determining the correct pronunciation. To further improve pronunciation modeling of foreign words, non-native phonemes are included in the acoustic model by augmenting the training set with English sentences spoken by native Finnish speakers. English phonemes which don't have any close native counterpart are included in the Finnish phoneme set. A combination of language model, acoustic model, pronunciation and morpheme adaptation produces the lowest error rate for foreign entity names. We also performed tests to determine whether improved recognition of foreign words improves performance on a spoken document retrieval task. We were unable to verify any significant improvements. However, the test queries in this particular material included few foreign words, so a big performance improvement wasn't expected.


Towards Audiovisual TTS in Estonian

Authors

Einar Meister, Sascha Fagel, Rainer Metsvahi

Pages

138 - 145

DOI

10.3233/978-1-61499-133-5-138

Abstract

In the current paper we report our first results in the development of audiovisual speech synthesis for Estonian. The MASSY model, developed originally for German, serves as a prototype for the Estonian AV synthesis. First, we give an overview of the methods of AV speech synthesis and the Estonian viseme inventory, then we introduce the MASSY model and its adaptation for Estonian; finally, we discuss the ideas for further development.


Multimodal Corpus of Speech Production: Work in Progress

Authors

Einar Meister, Lya Meister

Pages

146 - 153

DOI

10.3233/978-1-61499-133-5-146

Abstract

The paper introduces work in progress on multimodal articulatory data collection involving multiple instrumental techniques such as electrolaryngography (EGG), electropalatography (EPG) and electromagnetic articulography (EMA). The data is recorded from two native Estonian speakers (one male and one female); the target size of the corpus is approximately one hour of speech from both subjects. In the paper, the instrumental systems used for data collection and the recording set-ups are introduced, examples of multimodal data analysis are given and the possible uses of the corpus are discussed.


Towards a Latvian Valency Lexicon

Authors

Gunta Nešpore, Baiba Saulīte, Normunds Grūzītis, Ginta Garkāje

Pages

154 - 161

DOI

10.3233/978-1-61499-133-5-154

Abstract

The development of a verb valency lexicon for Latvian has recently begun. The chosen approach combines and supplements the experience of similar lexical resources developed for other languages. The paper describes our approach to verb valency annotation: the valency layers (syntactic and semantic valency, selectional restrictions) and the set of semantic roles. The annotation process, which uses an online tool developed for valency annotation, is also briefly described. From the annotated corpus examples, the valency patterns (models) for each verb are generated considering only the core semantic roles. As a result of this work, summarized information on the valency patterns of each verb will be available, as well as a corpus of the annotated examples. Currently, more than 150 verbs (more than 16 000 sentences) have been annotated using data from the Balanced Corpus of Modern Latvian.


Creation of HMM-based Speech Model for Estonian Text-to-Speech Synthesis

Authors

Tõnis Nurk

Pages

162 - 168

DOI

10.3233/978-1-61499-133-5-162

Abstract

The article describes the creation of Hidden Markov Model based speech models for both male and female voices for Estonian text-to-speech synthesis. A brief overview of the text-to-speech synthesis process is given, focusing on statistical parametric synthesis in particular. The HTS system is employed to generate voice models. The creation of the speech corpus of the Institute of the Estonian Language is analyzed. The process of adapting Estonian-related training data and linguistic specification to HTS is described, as well as experiments carried out on data from different speakers, subcorpora and linguistic specifications. The findings from the speech model evaluation are given, and possible courses of action to improve the quality of the trained HMM-based speech models are proposed.


Towards named entity annotation of Latvian National Library corpus

Authors

Peteris Paikens, Ilze Auzina, Ginta Garkaje, Madara Paegle

Pages

169 - 175

DOI

10.3233/978-1-61499-133-5-169

Abstract

The paper describes work in progress on building a catalogue of named entities – people, places and organizations – based on a recently digitized large (4.5 billion tokens) Latvian corpus. The authors propose an annotation standard for the markup of named entities within the Latvian corpus, according to which a representative set of documents (150 000 words) is manually annotated. This corpus is used for training and evaluation of an automated named entity recognition system based on the Stanford CRF classifier, achieving an F-score of up to 81%. The named entities indexed within the Latvian National Library corpus and the annotated documents are publicly available online for linguistic and historical research.


MT Adaptation for Under-Resourced Domains – What Works and What Not

Authors

Mārcis Pinnis, Raivis Skadiņš

Pages

176 - 184

DOI

10.3233/978-1-61499-133-5-176

Abstract

In this paper the authors present various techniques for achieving MT domain adaptation with limited in-domain resources. The paper gives a case study of what works and what does not when building a domain-specific machine translation system. Systems are adapted using in-domain comparable monolingual and bilingual corpora (crawled from the Web) and bilingual terms and named entities. The authors show how to efficiently integrate terms within statistical machine translation systems, thus significantly improving upon the baseline.
