
Ebook: Human Language Technologies – The Baltic Perspective

This book contains papers from the Fourth International Conference on Human Language Technologies – the Baltic Perspective (Baltic HLT 2010), held in Riga in October 2010. This conference is the latest in a series which provides a forum for sharing recent advances in human language processing, and promotes cooperation between the computer science and linguistics communities of the Baltic countries and the rest of the world. Bringing together scientists, developers, providers and users, the conference is an opportunity to exchange information, discuss problems, find new synergies, and promote initiatives for international cooperation. The 32 papers collected here were submitted by 77 authors from 11 countries and reviewed by an international program committee. They cover a wide range of research topics in corpus linguistics, machine translation, speech technologies, semantics, and other areas of HLT research. These proceedings reflect the current state of HLT in the Baltic countries and the work towards creating a Baltic linguistic infrastructure. This book is a useful and comprehensive repository of information and will facilitate further research and development of HLT in the Baltic region, and the creation of a pan-European research infrastructure of language resources and technology.
This volume contains papers presented at the Fourth International Conference “Human Language Technologies – the Baltic Perspective” (Baltic HLT 2010). The series of Baltic HLT conferences provides a forum for sharing recent advances in human language processing and for promoting cooperation between the computer science and linguistics research communities of the Baltic countries and the rest of the world. The conference brings together scientists, developers, providers and users to discuss the state of the art of Human Language Technologies (HLT) in the Baltic countries, to exchange information, to discuss problems, to find new synergies, and to promote initiatives for international cooperation.
The first larger pan-Baltic event on HLT research was the seminar “Language and Technology 2000”, organized by Andrejs Spektors and the Institute of Mathematics and Computer Science, University of Latvia, in Riga in 1994. In 2004, ten years after this seminar, Andrejs Vasiļjevs and Inguna Skadiņa initiated the first international Baltic HLT conference, organized by the Commission of the Official Language of the Chancellery of the President of Latvia. Einar Meister took over this initiative with the second conference, held in Tallinn in 2005 and organized by the Institute of Cybernetics and the Institute of the Estonian Language. Successful continuation of the series was ensured by Rūta Marcinkevičienė, who initiated the third Baltic HLT conference, held in Kaunas in 2007 and organized by Vytautas Magnus University and the Institute of the Lithuanian Language.
This fourth conference takes place in Riga again, on October 7–8, 2010. We would like to thank the supporters and organizers of this conference: Tilde and the Institute of Mathematics and Computer Science, University of Latvia. The conference was also supported by the CLARIN, LetsMT!, and ACCURAT projects.
The last three years were very fruitful for HLT researchers and developers. A new concept – language resources as research infrastructures (RI) – was introduced throughout Europe. The Baltic countries have actively contributed to the first steps towards creating such an RI. An overview section includes two invited papers by the creators of the Baltic linguistic infrastructure, presenting an analysis of the current situation of HLT in Latvia and Lithuania, and a summary of the National Programme for Estonian HLT.
HLT research in the Baltic countries was boosted by several large-scale national and international activities, such as the projects CLARIN, ACCURAT, LetsMT! and others as described in this volume. Research results, work in progress, descriptions of demonstrations, and position papers on these and other activities form the main content of this volume. The contributions were submitted by more than 75 authors from eleven countries and reviewed by an international program committee in a blind review process. Papers selected for the conference represent a wide range of topics of research in corpus linguistics, machine translation, speech technologies, semantics, and other areas of HLT research.
We hope that this volume will serve as a useful and comprehensive repository of information and will facilitate research and development of HLT in the Baltic countries and the creation of the pan-European RI of language resources and technology.
This paper gives a short overview of the development of the Lithuanian human language technology infrastructure in the context of European co-operation. It also presents national policies related to research infrastructures and joint activities planned at different levels: institutional, national and European.
The National Programme for Estonian Language Technology (2006–2010) was launched in 2006 and is approaching its end in 2010. This paper gives an overview of the programme and of its projects covering different areas of language technology; future prospects are also discussed.
The last six years have been very important for research and development of language technologies in Latvia. Several large projects have been funded by the government of Latvia, important tools and resources have been created by industry, and since 2006 Latvia has participated in the CLARIN initiative. Although there is still a gap in language resources and technology (LRT) between Latvian and the more widely used languages, the current LRT for Latvian can already serve as a basic research infrastructure for the humanities. The paper presents an overview and the current status of LRT in Latvia. Special attention is paid to the CLARIN project and its role for the humanities in Latvia.
The Estonian Emotional Speech Corpus serves as the acoustic basis for emotional text-to-speech synthesis. Because the Estonian synthesizer is a TTS synthesizer, we started off by focusing on read texts and the emotions contained in them. The corpus is built on a theoretical model and we are currently at the stage of verifying the components of the model. In the present article we give an overview of the corpus and the principles used in selecting its testers. Some studies show that people who have lived longer in a certain culture can more easily recognize vocal expressions of emotion that are characteristic of the culture without seeing the speaker's facial expressions. We therefore decided not to use people under 30 years of age as testers of emotions in our theoretical model. We used two tests to verify the selection principles for the testers. In the first test, 27 young adults aged under 30 were asked to listen to 35 sentences and identify the emotion of each (joy, anger, sadness, neutral). We then compared the results with those of adults aged over 30. In the second test we asked 32 Latvians to listen to the same sentences, and then compared the results with those of Estonians. Our analysis showed that younger and older testers, and likewise Estonians and Latvians, perceive emotions quite differently. From these test results we can say that the selection principle for corpus testers, using people who are more familiar with Estonian culture, is acceptable.
The study was supported by the National Program for Estonian Language Technology and the project SF0050023s09 “Modeling intermodular phenomena in Estonian”.
This paper describes the implementation and evaluation of a prototype Estonian large vocabulary continuous speech recognition system for the radiology domain. We used a 44 million word corpus of radiology reports to build a word trigram language model. We recorded a test set of radiology reports dictated by ten radiologists. Using speaker-independent speech recognition, we achieved a 9.8% word error rate. Recognition ran at around 0.5 times real time. One of the prominent sources of errors was mistakes in writing compound words.
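To make the reported metric concrete, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. This is not the authors' evaluation code; the function name `word_error_rate` and the example words are assumptions chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) example: a compound written as two words
# costs one substitution plus one insertion.
print(word_error_rate("kopsupõletik leitud", "kopsu põletik leitud"))
```

Under this definition a single compound split into two parts already counts as two errors, which is one reason compound-word spelling can weigh heavily on the score for a language such as Estonian.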
The aim of this paper is to present the development stages and the current state of the Spoken Latvian Corpus. Development of the corpus began in 2006 (with Latvian Council of Science funding), and some individual speech corpora have been developed. There are several stages in the creation of the Spoken Latvian Corpus: development of the concept, speech data collection procedures, transcription and annotation of speech data, and representation of the corpus.
The acoustic (quantitative) properties of the consonants of Standard Lithuanian have not yet been covered extensively or adequately by contemporary research. The objective of this paper is to investigate the quantity of consonants in continuous speech of Standard Lithuanian and to characterize the spontaneous duration of the analyzed sounds in terms of their qualitative (articulatory) features, ignoring other factors such as the length of the segment or the sound's position in a word. The results show that the most significant and distinctive articulatory features influencing duration are the manner of articulation and voicing. The place of articulation and palatalization have no impact on the duration of the analyzed consonants.
The study focuses on Estonian rhythmic structure as revealed in fluent read speech. The core of the study involves determining the distinctive features of the three degrees of Estonian phonetic quantity and assessing the significance of those features by statistical methods. The aim is to enhance the naturalness of synthetic speech by using the features that best identify each quantity degree in fluent Estonian speech. The theory of adjacent phones is tested on a large data set, and the role of intensity as a possible feature for identifying quantity degrees is investigated. According to the results of the phonetic and statistical analysis, the main constitutive factor of quantity degrees and, thus, of speech rhythm is the classical duration ratio of stressed and unstressed syllables, whereas the other duration ratios and the tonal characteristics investigated turned out to be less significant for the data analysed.
This work presents an audio system for electronic texts and audio books, designed for visually impaired people, which helps them read news, newspapers, magazines and books and listen to audio books over the Internet. The possible uses and the functional range of the system are considered. To select among various speech rates, perception tests were arranged for blind and sighted listeners. The perception tests revealed that the so-called trained blind prefer to listen to speech at a higher rate than the sighted or than blind listeners who lack the experience of daily computer use.
The paper describes the development of the Latvian Text-to-Speech Synthesizer at the Institute of Mathematics and Computer Science (University of Latvia).
In automatic speech recognition, the standard choice for a language model is the well-known n-gram model. The n-grams are used to predict the probability of a word given its n-1 preceding words. However, the n-gram model is not able to explicitly learn the grammatical relations of a sentence. In the present work, in order to augment the n-gram model with grammatical features, we apply the Whole Sentence Maximum Entropy framework. The grammatical features are head-modifier relations between pairs of words, together with the labels of the relationships, obtained with a dependency grammar. We evaluate the model in a large vocabulary speech recognition task on the Wall Street Journal speech corpus. The results show a substantial improvement in both test set perplexity and word error rate.
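As a rough illustration of the two ingredients named above, the sketch below estimates n-gram probabilities by maximum likelihood from counts, and then combines a baseline n-gram log-probability with weighted sentence-level feature counts in the usual whole-sentence exponential form, log P0(s) + Σ λ_i f_i(s). It is not the authors' system: the function names, the toy corpus, the feature name and the weight are all assumptions, there is no smoothing, and the normalization constant is left out (it is the same for every sentence, so it does not affect the ranking of competing hypotheses).

```python
from collections import Counter


def train_ngram(sentences, n=3):
    """Count n-grams and their (n-1)-word histories over tokenized sentences."""
    ngrams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngrams[history + (padded[i],)] += 1
            histories[history] += 1
    return ngrams, histories


def ngram_prob(word, history, ngrams, histories):
    """Maximum-likelihood P(word | history); 0.0 if the history was never seen."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0
    return ngrams[history + (word,)] / histories[history]


def whole_sentence_log_score(baseline_logprob, feature_counts, weights):
    """Unnormalized log score: log P0(s) + sum_i lambda_i * f_i(s)."""
    bonus = sum(weights.get(name, 0.0) * count
                for name, count in feature_counts.items())
    return baseline_logprob + bonus


# Toy corpus (illustrative only), bigram model.
corpus = [["the", "cat", "sleeps"], ["the", "cat", "eats"]]
ngrams, histories = train_ngram(corpus, n=2)
print(ngram_prob("sleeps", ["cat"], ngrams, histories))

# Hypothetical dependency feature: (label, head, modifier) seen once in s.
features = {("nsubj", "sleeps", "cat"): 1}
weights = {("nsubj", "sleeps", "cat"): 0.5}
print(whole_sentence_log_score(-2.0, features, weights))
```

In a rescoring setup of this kind, each recognition hypothesis would be parsed, its dependency features counted, and the hypotheses re-ranked by the combined score.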
We study how an inherent dialogue structure is formed by the combination of an Internet opinion article, as a source text, and its anonymous comments. We use methods of conversation analysis and membership categorization analysis, originally intended for the analysis of coherent spoken dialogues. A source text and its comments can be considered dialogue acts, and the commentators turn out to be dialogue participants. Every commentator can react to the source text or to some of the previous comments. A dialogue is thus formed as a ‘bunch’ of many micro-dialogues. In addition, the commentators categorize themselves as well as the agents of the source text. As a result, an additional structural layer is formed. The coherence of turns is also structured non-linearly.
This paper deals with hesitation and uncertainty in spoken dialogue management and discusses communicative signals that are used to express hesitation and uncertainty in conversational interactions. The study focuses especially on the relation between a particular type of gesturing, shoulder shrugging, and speech, and draws conclusions as to how shoulder shrugging is interpreted by the participants. The analysis includes manual annotation and assumes that dialogue is a cooperative activity constrained by social obligations and the participants' roles.
This paper discusses a series of experiments in which so-called simulated dialogues in Estonian were collected in human-computer interaction. The subjects were asked to test a program that interacted with a person in written Estonian and gave information about cinema, TV listings, weather, politics or flights. In reality another person (a so-called Wizard of Oz) answered the subject's questions via the Internet. I will introduce the Wizard of Oz interface. All the experiments were recorded in a log file, and the dialogues are analysed on the basis of this information. I concentrate primarily on a repair type occurring in the dialogues which, in conversation analysis, is called other-initiated repair. I look for the language rules (patterns) that were followed in producing the utterances. This research is one step towards the aim of creating a computer program that provides users with information in the most convenient and pleasant way.
This paper presents a framework for asynchronous dialogue systems. It is used in developing text-based dialogue systems. The framework features web-based asynchronous turn management and AI-assisted live agent chat. Some other features are also briefly covered in this paper (a language-independent solution for spell checking the user input, a language-independent solution for the word-order problem, semantic resolution, and a web-based interface for WOZ data collection). Exploiting the asynchronous communication pattern has improved the communication style of the user, which has resulted in a decreased number of single-word utterances. A higher word count per utterance is important when performing shallow language analysis without complete semantic understanding. The framework is currently tailored for the Estonian language, yet most of its features and modules are language independent.
This paper is an attempt to discover the main challenges in working with the Baltic languages and Estonian, and to identify the most significant sources of errors generated by an SMT system trained on large-vocabulary parallel corpora from the legislative domain. The substantial differences between Latvian and Lithuanian on the one hand and Estonian on the other cause different sets of difficulties, which we classify and compare.
In the analysis step, we move beyond automatic scores and present a human error analysis of MT system output that helps to determine the most prominent sources of errors typical of the SMT systems under consideration.
This paper reports on the implementation and evaluation of English-Latvian and Lithuanian-English statistical machine translation systems. It also gives a brief introduction to the project scope: the Baltic languages, prior implementations of MT, and the evaluation of MT systems. We report the results of both automatic and human evaluation. The results of human evaluation show that factored SMT gives a significant improvement in translation quality compared to baseline SMT.
This position paper presents the recently started European collaboration project LetsMT!. This project creates a platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data. The project extends the use of existing state-of-the-art SMT methods that are applied to data supplied by users to increase quality, scope and language coverage of machine translation. The paper describes the background and motivation for this work, key approaches, and the technologies used.
This paper gives a brief overview of the composition as well as technical and morphological annotation of the Reference Corpus of Estonian. A user interface using the morphological information about lemmas and grammatical categories of word-forms is presented.