Ebook: Human Language Technologies – The Baltic Perspective

Proceedings of the Sixth International Conference Baltic HLT 2014

Series

Frontiers in Artificial Intelligence and Applications

Volume

268

Published

2014

Editors

Andrius Utka, Gintarė Grigonytė, Jurgita Kapočiūtė-Dzikienė, Jurgita Vaičenonienė

ISBN

978-1-61499-441-1 (print) | 978-1-61499-442-8 (online)

Subject(s)

Open Access

Description

This book contains papers from the Fourth International Conference on Human Language Technologies - the Baltic Perspective (Baltic HLT 2010), held in Riga in October 2010. This conference is the latest in a series which provides a forum for sharing recent advances in human language processing, and promotes cooperation between the computer science and linguistics communities of the Baltic countries and the rest of the world. Bringing together scientists, developers, providers and users, the conference is an opportunity to exchange information, discuss problems, find new synergies and promote initiatives for international cooperation.

The 32 papers collected have been submitted by 77 authors from 11 countries, after review by an international program committee. They cover a wide range of research topics in corpus linguistics, machine translation, speech technologies, semantics and other areas of HLT research. This proceedings reflects the current state of HLT in the Baltic countries and the work towards creating a Baltic linguistic infrastructure. Human Language Technologies – The Baltic Perspective is a useful and comprehensive repository of information and will facilitate further research and development of HLT in the Baltic region, and the creation of a pan-European research infrastructure of the language resources and technology.

Contents

Front Matter

Pages

i - xiii

Preface

It is our great pleasure to introduce the proceedings of the Sixth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2014).

This volume contains papers presented at the Sixth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2014), held in Kaunas, Lithuania on 26–27 September 2014. The conference traditionally gathers scientists from the Baltic region to discuss the most pressing problems they face in their research, strengthen partnerships and brainstorm promising new ideas. This conference completes the second round of the Baltic HLT conferences, which have now been organised twice in each of Latvia (in 2004 and in 2010, in Riga), Estonia (in 2005 in Tallinn and in 2012 in Tartu), and Lithuania (in 2007 and in 2014, in Kaunas).

Baltic HLT 2014 is also important for gathering and consolidating ideas for the following reasons:

• a great urge for Baltic countries to establish and join international research infrastructures;

• possible participation in the EU Research and Innovation programme Horizon 2020 and in the new period of the EU structural funds (2014–2020).

We have received papers on a wide range of topics: syntactic analysis, sentiment analysis, coreference resolution, authorship attribution, information extraction, document clustering, machine translation, corpus and parallel corpus compilation, speech recognition and synthesis, and others. A total of 51 submissions were received, of which 21 were accepted as full papers and 19 as short papers.

The Baltic HLT 2014 programme features presentations that are divided into three HLT branches: namely, speech technology (7 papers), methods in computational linguistics (16 papers), and preparation of language resources (16 papers).

The conference features four invited speakers who give overviews of important and topical issues: Steven Krauwer (Holland, CLARIN ERIC) speaks on the importance of CLARIN, Walter Daelemans (Belgium, Antwerp University) on computational stylometry, Adam Kilgarriff (UK, Lexical Computing Ltd.) on corpus evaluation, and Ruta Petrauskaite (Lithuania, LMT) on the future perspectives of language technologies in the Baltics.

We want to express our gratitude to all the people who helped to organize and support this conference. We are indebted to all the members of the programme committee for their detailed inspection of all submitted work and for their valuable comments. We thank Gailius Raškinis and Vytautas Rudžionis for reviewing and editing papers on speech recognition and synthesis, Irena Markievicz for setting up the conference website, and Darius Amilevicius for advice on financial matters. Special thanks go to the local organizing committee and ViaConventus, who helped to make Baltic HLT 2014 a memorable event.

Andrius Utka,

Gintarė Grigonytė,

Jurgita Kapočiūtė-Dzikienė,

and Jurgita Vaičenonienė

Speech Technology

Full-duplex Speech-to-text System for Estonian

Authors

Tanel Alumäe

Pages

3 - 10

DOI

10.3233/978-1-61499-442-8-3

Abstract

The paper describes a distributed online speech-to-text system. The main features of the system are real-time speech recognition and full-duplex user experience, meaning that the partially recognized utterance is progressively displayed to the user during speaking. Other benefits include easy client-server communication protocol and system scalability to many concurrent user sessions. The paper also describes two Estonian speech-to-text applications based on the developed framework: a general-domain dictation application with an estimated word error rate of 26.4% and a radiology report dictation system with a word error rate of 13.7%. The system is open-source and based on free software.
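For readers unfamiliar with the metric quoted in this abstract, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.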

Evaluation of the Estonian Audiovisual Speech Synthesis

Authors

Einar Meister, Rainer Metsvahi, Sascha Fagel

Pages

11 - 18

DOI

10.3233/978-1-61499-442-8-11

Abstract

In the current paper we report the first evaluation results of the Estonian virtual talking head. The testing scenario involved perceptual experiments with unimodal audio and bimodal audiovisual speech stimuli in five noise conditions. As expected, in the presence of noise, the scores of consonant recognition of audiovisual stimuli were always higher than the scores of audio stimuli. The average recognition error in audio-only presentation was reduced by 42%–65% when the virtual talking head was displayed along with audio.

A System of Spoken Subtitles for Estonian Television

Authors

Meelis Mihkla, Indrek Hein, Indrek Kiissel, Artur Räpp, Risto Sirts, Tanel Valdna

Pages

19 - 26

DOI

10.3233/978-1-61499-442-8-19

Abstract

Systems for automatic reading and broadcasting of subtitles (spoken subtitles) are meant to eliminate the language barrier that TV viewers with special needs (such as visually impaired and dyslexic viewers) may experience when watching subtitled TV films or broadcasts in foreign languages. In such systems, a speech signal synchronised with TV subtitles is generated through a separate audio channel. The present article focuses on the questions that have arisen during the development and application of the system of spoken subtitles for Estonian Public Broadcasting: selection of a TTS system and of a synthetic voice, synchronization between the subtitles and synthetic speech utterances, and the marking of speaking turns. The editor interface of the system and the database of foreign name pronunciations are also covered.

Statistical Parametric Speech Synthesis for Online Dictionaries – Problems and Solutions

Authors

Liisi Piits, Elgar Kudritski, Indrek Kiissel, Indrek Hein

Pages

27 - 32

DOI

10.3233/978-1-61499-442-8-27

Abstract

This paper describes an attempt to use Estonian statistical parametric speech synthesis for audio pronunciation of words and word forms in online dictionaries. Two new HTS voices were created and compared for this purpose. The paper gives an overview of the design and evaluation process for these voices. Various errors were detected, including quantity errors, bad sound quality, accent errors, gemination at the boundary of compound word components, etc. The level of correctness and sound quality for the two parametric speech synthesisers ranged from 69% to 76%. The paper demonstrates that the voice Eva-2, which can accept text with diacritics as input, produces fewer errors. Still, the error rate of both new voices is too high to meet the criteria of orthoepy in learner's dictionaries.

Combining Multiple Foreign Language Speech Recognizers by using Neural Networks

Authors

Tomas Rasymas, Vytautas Rudžionis

Pages

33 - 39

DOI

10.3233/978-1-61499-442-8-33

Abstract

This paper proposes a new method for combining multiple foreign language speech recognizers which are adapted to recognize Lithuanian voice commands. The recognizers are combined by using a neural network. The type or structure of a speech recognizer is not important for the method, but it must at least return the recognized command and the recognition hypothesis. These two parameters are used to train the neural network and to make the final decision about the recognized command. With the proposed method, recognition accuracy increased by 4.94% compared to the best single recognizer.

Medical – pharmaceutical information system with recognition of Lithuanian voice commands

Authors

Vytautas Rudžionis, Gailius Raškinis, Kastytis Ratkevičius, Algimantas Rudžionis, Gintarė Bartišiūtė

Pages

40 - 45

DOI

10.3233/978-1-61499-442-8-40

Abstract

This paper presents a Lithuanian voice recognition system for medical – pharmaceutical terms. The system consists of two separate speech recognition modules working in parallel. One recognizer is a proprietary CD-HMM Lithuanian speech recognizer. The second recognizer is a Spanish speech recognizer adapted to recognize Lithuanian voice commands. The outputs of both recognizers are combined by the decision making block, which yields the final decision. The decision making block was automatically derived by an induction algorithm that learns a set of symbolic rules. The investigations showed that the two recognizers produce uncorrelated outputs and can complement each other. The investigations also showed that the Lithuanian speech recognizer achieves higher accuracy (over 96 percent in a speaker independent mode), but the use of the adapted foreign language recognizer allows this baseline accuracy to be increased even further (over 98 percent in a speaker independent mode for 1000 voice commands). The voice recognition system is in the process of being embedded into several medical information systems which will be used by healthcare practitioners.

The Development of Conversational Agent Based Interface

Authors

Inese Vīra, Andrejs Vasiļjevs

Pages

46 - 53

DOI

10.3233/978-1-61499-442-8-46

Abstract

In this paper we present two prototypes of 3D based virtual agents: one chatbot which in addition to the ability to hold a conversation can perform translation from English into Spanish, Russian, and French; and another which supplies currency conversion (lats to euro and euro to lats) in the Latvian language. Both chatbots are voice controlled, with natural mimicry and representations of human-like emotions. We describe the motivation, development process, design and architecture of these mobile applications. The evaluation of both applications and their usage in selected scenarios is also presented.

Methods in Computational Linguistics

Comparison of Rule-based and Statistical Methods for Grapheme to Phoneme Modelling

Authors

Ilze Auziņa, Mārcis Pinnis, Roberts Darǵis

Pages

57 - 60

DOI

10.3233/978-1-61499-442-8-57

Abstract

Grapheme to phoneme modelling is one of the key features in automated speech recognition and speech synthesis. In this paper, the authors compare two different approaches: a statistical machine translation based method using the phonetically transcribed Latvian Speech Recognition Corpus and a rule-based method for phonetic transcription of words from grammatically correct forms. The paper provides 10-fold cross-validation results and error analysis for both methods.

English-Lithuanian Word Alignment with Bilingwis: Evaluation of the Alignment

Authors

Loïc Boizou, Jolanta Kovalevskaitė, Erika Rimkutė, Roger Wechsler

Pages

61 - 68

DOI

10.3233/978-1-61499-442-8-61

Abstract

This paper presents the first qualitative evaluation of the word-alignment system Bilingwis for the English-Lithuanian language pair. The evaluation was performed by scoring alignments for the most frequent autosemantic words from the English-Lithuanian parallel corpus. The main tendencies revealed by the evaluation are presented and some problematic issues as well as future improvements are discussed.

Syntactic Engine for the Lithuanian Language

Authors

Loïc Boizou, Francesco Zamblera

Pages

69 - 74

DOI

10.3233/978-1-61499-442-8-69

Abstract

This paper describes the syntactic engine for the Lithuanian language which is currently being developed at Vytautas Magnus University. The program is written in Haskell and consists of four modules: a constituent analysis, a shallow constituent analysis, a dependency analysis and a dependency analysis augmented with thematic roles. The “work-in-progress” status of the project is emphasized, as well as the significant results achieved in the current stage.

Baseline for Keyword Spotting in Latvian Broadcast Speech

Authors

Roberts Darǵis, Artūrs Znotiņš

Pages

75 - 82

DOI

10.3233/978-1-61499-442-8-75

Abstract

This paper describes the first efforts at keyword identification in unconstrained Latvian broadcast speech. In this research, large vocabulary continuous speech recognition (LVCSR), spotting in LVCSR lattices and acoustic keyword spotting were compared. Open source tools and the recently created 100-hour Latvian Speech Recognition Corpus were used.

Normalization and Automatized Sentiment Analysis of Contemporary Online Latvian Language

Authors

Ginta Garkāje, Evelīna Zilgalve, Roberts Darǵis

Pages

83 - 86

DOI

10.3233/978-1-61499-442-8-83

Abstract

We describe the influence of transliteration in contemporary online language on automatized sentiment analysis. We argue that text normalization helps to achieve better results in automatized sentiment analysis and provide results to support the claim. The data used for the experiments were gathered via the project Virtual Aggression Barometer. We use a normalization tool and an automatized classifier that labels internet user comments as having aggressive or non-aggressive sentiment.

Tracing Mistakes and Finding Gaps in Automatic Word Alignments for Latvian-English Translation

Authors

Valdis Girgzdis, Maija Kale, Martins Vaicekauskis, Ieva Zarina, Inguna Skadiņa

Pages

87 - 94

DOI

10.3233/978-1-61499-442-8-87

Abstract

This paper aims to contribute to an in-depth understanding of computer based word alignment processes in machine translation (MT). The performance of word alignment, based on IBM models and incorporated in GIZA++, has been widely discussed in machine translation literature. The debate has led towards a general consensus that GIZA++ does not provide sufficiently good results for word alignments. In this paper, we analyse the performance of GIZA++ and Fast Align for the Latvian-English pair against the manually aligned Gold Standard. Experiments showed that Fast Align was approximately 2–3% more accurate and three times faster than GIZA++ in the alignment task. As for pre-processing, the removal of articles has a small, but positive, influence on alignment quality and machine translation output. We also present a Word Alignment Visualisation tool for analysis and editing of word alignments.
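As an illustration of what comparing automatic alignments against a manually aligned gold standard can involve, the sketch below scores a set of predicted alignment links with precision, recall and F1. The abstract does not state which accuracy measure the authors used, so this choice of metrics, the function name `alignment_scores`, and the representation of links as (source index, target index) pairs are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def alignment_scores(gold: Set[Link], predicted: Set[Link]) -> Dict[str, float]:
    """Precision, recall and F1 of predicted links against gold links.

    Illustrative helper (hypothetical name), not the paper's evaluation code.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running the same function on the output of two aligners for the same sentence pairs gives directly comparable numbers, which is the kind of comparison the abstract reports.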

Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays

Authors

Gintarė Grigonytė, Björn Hammarberg

Pages

95 - 98

DOI

10.3233/978-1-61499-442-8-95

Abstract

This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in spelling errors made by L2 Swedish learners. We also propose a method that helps to automatically discriminate the misspellings affected by pronunciation from other types of misspellings.

Automatic Author Profiling of Lithuanian Parliamentary Speeches: Exploring the Influence of Features and Dataset Sizes

Authors

Jurgita Kapočiūtė-Dzikienė, Ligita Šarkutė, Andrius Utka

Pages

99 - 106

DOI

10.3233/978-1-61499-442-8-99

Abstract

Extraction of demographic or cultural background characteristics or psychometric traits of an author from an anonymous text has a number of potential applications in such fields as forensics, security or user-targeted services. Despite significant advances in automatic author profiling, most of the research has been done on Germanic languages and not so much on morphologically rich languages. Consequently, this work is the first attempt at finding a good method for solving automatic author profiling in three dimensions for Lithuanian: age (6 categories), gender (2 categories) and political view (3 categories). To tackle this task we used a dataset which contains text transcripts of Lithuanian parliamentary speeches and debates, thus representing formal spoken, but normative Lithuanian language. In our paper we explored different feature types (ultimate style markers, lexical, morphological, character, and aggregated) and dataset sizes (of 100, 200, 500, 1,000, 2,000, 5,000 instances in each category). The best results were obtained with the Support Vector Machine method, the largest tested dataset and lemmas as features: an accuracy of 44.6% for age with interpolation up to trigrams, and 74.6% for gender and 58.7% for political view with interpolation up to bigrams.

Empirical Study on Unsupervised Feature Selection for Document Clustering

Authors

Aušra Mackutė-Varoneckienė, Tomas Krilavičius

Pages

107 - 110

DOI

10.3233/978-1-61499-442-8-107

Abstract

Unsupervised feature selection is very important in the document clustering process. This paper presents empirical research on the suitability of feature selection methods, clustering methods and feature representations for Lithuanian and Russian document clustering.

Dependency Parsing of Estonian: Statistical and Rule-based Approaches

Authors

Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen

Pages

111 - 118

DOI

10.3233/978-1-61499-442-8-111

Abstract

This paper gives an overview of the latest developments in computational syntactic analysis of Estonian. We present the Estonian Dependency Treebank, an ongoing corpus annotation project. Although the treebank construction is still under way, we have used it for training MaltParser and experimenting with combining MaltParser with a rule-based Constraint Grammar parser for Estonian. MaltParser achieves an unlabeled attachment score (UAS; correct links to head node) of 83.4% and a label accuracy (LA) of 88.6%. The labeled attachment score (LAS) was 80.3%.

Applying different algorithms for combining MaltParser with the Constraint Grammar parser improved the results by 1%. A special CG rule set for fixing some typical MaltParser errors improved the UAS by up to 1.5%.
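The three scores quoted above follow the usual dependency parsing conventions: UAS counts tokens whose predicted head is correct, LA counts tokens whose predicted dependency label is correct, and LAS counts tokens for which both are correct. A minimal sketch of these definitions is given below; the function name `attachment_scores` and the representation of each token as a (head index, label) pair are assumptions for illustration and do not come from the paper or from MaltParser.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str]  # (index of head token, dependency label)


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Dict[str, float]:
    """Compute UAS, LA and LAS over two token-aligned analyses.

    Illustrative helper (hypothetical name); both lists must describe the
    same tokens in the same order.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")

    n = len(gold)
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return {"UAS": head_ok / n, "LA": label_ok / n, "LAS": both_ok / n}
```

Since a token counts towards LAS only when it already counts towards both UAS and LA, LAS can never exceed either of them, which is consistent with the figures reported in the abstract.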

Latvian Newswire Information Extraction System and Entity Knowledge Base

Authors

Pēteris Paikens

Pages

119 - 125

DOI

10.3233/978-1-61499-442-8-119

Abstract

This paper describes an information extraction system designed for obtaining CV-style structured information about publicly mentioned persons, organizations and their relations by analyzing newswire archives in the Latvian language. The described text analysis pipeline consists of morphosyntactic analysis, NER and coreference resolution, and a semantic role labeling system based on FrameNet principles. We also implement an entity linking process, matching the entity mentions in each document to an entity knowledge base that is initially seeded with authoritative information on relevant people and organizations. The accuracy of automated frame extraction varies depending on specifics of each frame type, but the average accuracy currently is 53% F-score for frame target identification, and 61% for frame element role classification. The currently targeted volume of text is the total archives of Latvian newspapers, magazines and news portals, consisting of about 3.5 million articles.

Uses of Machine Translation in the Sentiment Analysis of Tweets

Authors

Jānis Peisenieks, Raivis Skadiņš

Pages

126 - 131

DOI

10.3233/978-1-61499-442-8-126

Abstract

This paper reports on the viability of using machine translation (MT) for determining the original sentiment of tweets when translating tweets written in an internationally less used language into a more frequently used one. The results of the study show that it is possible to use MT and sentiment analysis (SA) systems to produce SA results with significant precision.

Bootstrapping of a Multilingual Transliteration Dictionary for European Languages

Authors

Mārcis Pinnis

Pages

132 - 140

DOI

10.3233/978-1-61499-442-8-132

Abstract

Transliteration dictionaries are an important resource for the development of machine transliteration systems. The paper describes and analyses a large multilingual transliteration dictionary extracted from probabilistic dictionaries for 24 European languages containing approximately 1.25 million transliterated word pairs. The transliteration dictionary is evaluated: 1) manually for the Latvian-English language pair and 2) automatically within a statistical machine translation based transliteration task for all 23 language pairs.

Building the World's Best General Domain MT for Baltic Languages

Authors

Raivis Skadiņš, Valters Šics, Roberts Rozis

Pages

141 - 148

DOI

10.3233/978-1-61499-442-8-141

Abstract

In this paper we present our experience in building machine translation (MT) systems for the languages of the Baltic States: Estonian, Latvian, and Lithuanian. The paper reports on the implementation, research, data, data collection methods, and evaluation of the MT. Results of the evaluation show that it is possible to collect a sufficient amount of data and train MT systems that can compete with Google in quality and even overtake it in general domain MT.

Multi-Domain Recurrent Neural Network Language Model for Medical Speech Recognition

Authors

Ottokar Tilk, Tanel Alumäe

Pages

149 - 152

DOI

10.3233/978-1-61499-442-8-149

Abstract

We evaluate back-off n-gram and recurrent neural network language models for an automatic speech recognition system for medical applications. We also propose an effective and simple multi-domain recurrent neural network architecture which enables training a joint model for all domains. The multi-domain recurrent neural network model outperforms all other compared models.
