
Ebook: Human Language Technologies – The Baltic Perspective

This book contains papers from the Fourth International Conference on Human Language Technologies - the Baltic Perspective (Baltic HLT 2010), held in Riga in October 2010. This conference is the latest in a series which provides a forum for sharing recent advances in human language processing, and promotes cooperation between the computer science and linguistics communities of the Baltic countries and the rest of the world. Bringing together scientists, developers, providers and users, the conference is an opportunity to exchange information, discuss problems, find new synergies and promote initiatives for international cooperation.
The 32 papers collected have been submitted by 77 authors from 11 countries, after review by an international program committee. They cover a wide range of research topics in corpus linguistics, machine translation, speech technologies, semantics and other areas of HLT research. These proceedings reflect the current state of HLT in the Baltic countries and the work towards creating a Baltic linguistic infrastructure. Human Language Technologies – The Baltic Perspective is a useful and comprehensive repository of information and will facilitate further research and development of HLT in the Baltic region, and the creation of a pan-European research infrastructure for language resources and technology.
It is our great pleasure to introduce the proceedings of the Sixth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2014).
This volume contains papers presented at the Sixth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2014), held in Kaunas, Lithuania on 26–27 September 2014. The conference traditionally brings together scientists from the Baltic region to discuss the most pressing problems they face in their research, strengthen partnerships and brainstorm promising new ideas. It completes the second round of the Baltic HLT conferences, which have been organised twice each in Latvia (in 2004 and 2010, in Riga), Estonia (in 2005 in Tallinn and in 2012 in Tartu), and Lithuania (in 2007 and 2014, in Kaunas).
Baltic HLT 2014 is also important for gathering and consolidating ideas for the following reasons:
• the pressing need for the Baltic countries to establish and join international research infrastructures;
• possible participation in the EU Research and Innovation programme Horizon 2020 and in the new period of the EU structural funds (2014–2020).
We received papers on a wide range of topics: syntactic analysis, sentiment analysis, coreference resolution, authorship attribution, information extraction, document clustering, machine translation, corpus and parallel corpus compiling, speech recognition and synthesis, and others. A total of 51 submissions were received, of which 21 were accepted as full papers and 19 as short papers.
The Baltic HLT 2014 programme features presentations grouped into three HLT branches: speech technology (7 papers), methods in computational linguistics (16 papers), and preparation of language resources (16 papers).
The conference features four invited speakers who survey important and topical subjects: Steven Krauwer (Holland, CLARIN ERIC) on the importance of CLARIN; Walter Daelemans (Belgium, Antwerp University) on computational stylometry; Adam Kilgarriff (UK, Lexical Computing Ltd.) on corpus evaluation; and Ruta Petrauskaite (Lithuania, LMT) on the future prospects of language technologies in the Baltics.
We want to express our gratitude to all the people who helped to organize and support this conference. We are indebted to all the members of the programme committee for their detailed inspection of all submitted work and for their valuable comments. We thank Gailius Raškinis and Vytautas Rudžionis for reviewing and editing papers on speech recognition and synthesis, Irena Markievicz for setting up the conference website, and Darius Amilevicius for advice on financial matters. Special thanks go to the local organizing committee and ViaConventus, who helped to make Baltic HLT 2014 a memorable event.
Andrius Utka,
Gintarė Grigonytė,
Jurgita Kapočiūtė-Dzikienė,
and Jurgita Vaičenonienė
The paper describes a distributed online speech-to-text system. The main features of the system are real-time speech recognition and a full-duplex user experience, meaning that the partially recognized utterance is progressively displayed while the user is speaking. Other benefits include a simple client-server communication protocol and scalability to many concurrent user sessions. The paper also describes two Estonian speech-to-text applications based on the developed framework: a general-domain dictation application with an estimated word error rate of 26.4% and a radiology report dictation system with a word error rate of 13.7%. The system is open-source and based on free software.
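For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between the recognizer output and a reference transcript, divided by the number of reference words. The Python sketch below illustrates that standard definition only. The function name and the example sentences are hypothetical and are not taken from the paper, whose exact scoring setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word out of five reference words.
print(word_error_rate("the patient has no fever", "the patient has fever"))  # 0.2
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.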
In the current paper we report the first evaluation results of the Estonian virtual talking head. The testing scenario involved perceptual experiments with unimodal audio and bimodal audiovisual speech stimuli in five noise conditions. As expected, in the presence of noise, the scores of consonant recognition of audiovisual stimuli were always higher than the scores of audio stimuli. The average recognition error in audio-only presentation was reduced by 42%–65% when the virtual talking head was displayed along with audio.
Systems for automatic reading and broadcasting of subtitles (spoken subtitles) are meant to eliminate the language barrier that TV viewers with special needs (such as visually impaired and dyslexic viewers) may experience when watching subtitled TV films or broadcasts in foreign languages. In such systems, a speech signal synchronised with the TV subtitles is generated through a separate audio channel. The present article focuses on the questions that have arisen during the development and application of the system of spoken subtitles for Estonian Public Broadcasting: selection of a TTS system and of a synthetic voice, synchronization between the subtitles and synthetic speech utterances, and the marking of speaking turns. The editor interface of the system and the database of foreign name pronunciations are also covered.
This paper describes an attempt to use Estonian statistical parametric speech synthesis for audio pronunciation of words and word forms in online dictionaries. Two new HTS-voices were created and compared for this purpose. The paper gives an overview of the design and evaluation process for these voices. Different errors were detected, including quantity errors, bad sound quality, accent errors, gemination at the boundary of compound word components, etc. The level of correctness and sound quality for the two parametric speech synthesisers ranged from 69% to 76%. The paper demonstrates that voice Eva-2, which can accept text with diacritics as input, produces fewer errors. Still, the error rate of both new voices is too high to meet the criteria of orthoepy in learner's dictionaries.
This paper proposes a new method for combining multiple foreign-language speech recognizers that have been adapted to recognize Lithuanian voice commands. The recognizers are combined using a neural network. The type and structure of each speech recognizer are not important for the method, but each must at least return a recognized command and a recognition hypothesis. These two parameters are used to train the neural network and to make the final decision about the recognized command. With the proposed method, recognition accuracy increased by 4.94% compared to the best single recognizer.
This paper presents a Lithuanian voice recognition system for medical and pharmaceutical terms. The system consists of two separate speech recognition modules working in parallel. One recognizer is a proprietary CD-HMM Lithuanian speech recognizer. The second recognizer is a Spanish speech recognizer adapted to recognize Lithuanian voice commands. The outputs of both recognizers are combined by a decision-making block that yields the final decision. The decision-making block was automatically derived by an induction algorithm that learns a set of symbolic rules. The investigations showed that the two recognizers produce uncorrelated outputs and can complement each other. They also showed that the Lithuanian speech recognizer achieves higher accuracy (over 96 percent in speaker-independent mode), but that adding the adapted foreign-language recognizer raises this baseline accuracy even further (over 98 percent in speaker-independent mode for 1000 voice commands). The voice recognition system is in the process of being embedded into several medical information systems which will be used by healthcare practitioners.
In this paper we present two prototypes of 3D based virtual agents: one chatbot which in addition to the ability to hold a conversation can perform translation from English into Spanish, Russian, and French; and another which supplies currency conversion (lats to euro and euro to lats) in the Latvian language. Both chatbots are voice controlled, with natural mimicry and representations of human-like emotions. We describe the motivation, development process, design and architecture of these mobile applications. The evaluation of both applications and their usage in selected scenarios is also presented.
Grapheme to phoneme modelling is one of the key features in automated speech recognition and speech synthesis. In this paper, the authors compare two different approaches: a statistical machine translation based method using the phonetically transcribed Latvian Speech Recognition Corpus and a rule-based method for phonetic transcription of words from grammatically correct forms. The paper provides 10-fold cross-validation results and error analysis for both methods.
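As background on the evaluation protocol, 10-fold cross-validation splits the data into ten parts and uses each part once as the test set while training on the other nine. The Python sketch below shows one simple way to generate such splits by index. The function name is hypothetical, and the abstract does not say how the authors partitioned or shuffled their data, so this is an illustration of the general technique rather than their procedure.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")

    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Every item appears in exactly one test fold.
for train, test in k_fold_indices(23, k=10):
    assert not set(train) & set(test)
```

In practice the items are usually shuffled once before splitting, so that each fold is a random sample rather than a contiguous block.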
This paper presents the first qualitative evaluation of the word-alignment system Bilingwis for the English-Lithuanian language pair. The evaluation was performed by scoring alignments for the most frequent autosemantic words from the English-Lithuanian parallel corpus. The main tendencies revealed by the evaluation are presented and some problematic issues as well as future improvements are discussed.
This paper describes the syntactic engine for the Lithuanian language which is currently being developed at Vytautas Magnus University. The program was written in Haskell and consists of four modules: a constituent analysis, a shallow constituent analysis, a dependency analysis and a dependency analysis augmented with thematic roles. The “work-in-progress” status of the project is emphasized, as well as the significant results achieved in the current stage.
This paper describes the first efforts at keyword identification in unconstrained Latvian broadcast speech. In this research, large vocabulary continuous speech recognition (LVCSR), keyword spotting in LVCSR lattices and acoustic keyword spotting were compared. Open source tools and the recently created 100-hour Latvian Speech Recognition Corpus were used.
We describe the influence of contemporary language transliteration on automated sentiment analysis. We argue that text normalization helps to achieve better results in automated sentiment analysis and provide results to support the claim. The data used in the experiments were gathered through the Virtual Aggression Barometer project. We use a normalization tool and an automated classifier for internet user comments with aggressive and non-aggressive sentiment.
This paper aims to contribute to an in-depth understanding of computer-based word alignment processes in machine translation (MT). The performance of word alignment, based on IBM models and incorporated in GIZA++, has been widely discussed in machine translation literature. The debate has led to a general consensus that GIZA++ does not provide sufficiently good results for word alignments. In this paper, we analyse the performance of GIZA++ and Fast Align for the Latvian-English pair against the manually aligned Gold Standard. Experiments showed that Fast Align was approximately 2–3% more accurate and three times faster than GIZA++ in the alignment task. As for pre-processing, the removal of articles has a small but positive influence on alignment quality and machine translation output. We also present a Word Alignment Visualisation tool for analysis and editing of word alignments.
This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in spelling errors made by L2 Swedish learners. We also propose a method that helps to automatically discriminate misspellings affected by pronunciation from other types of misspellings.
Extraction of demographic and cultural background characteristics or psychometric traits of an author from an anonymous text has a number of potential applications in such fields as forensics, security or user-targeted services. Despite significant advances in automatic author profiling, most of the research has been done on Germanic languages and not so much on morphologically rich languages. Consequently, this work is the first attempt at finding a good method for solving automatic author profiling in three dimensions for Lithuanian: age (6 categories), gender (2 categories) and political view (3 categories). To tackle this task we used a dataset containing text transcripts of Lithuanian parliamentary speeches and debates, thus representing formal spoken, but normative, Lithuanian language. In our paper we explored different feature types (ultimate style markers, lexical, morphological, character, and aggregated) and dataset sizes (of 100, 200, 500, 1,000, 2,000, 5,000 instances in each category). The best results were obtained with the Support Vector Machine method, the largest tested dataset and lemmas as features: an accuracy of 44.6% for age with interpolation up to trigrams, 74.6% for gender and 58.7% for political view with interpolation up to bigrams.
This paper presents empirical research on feature selection, clustering methods and the suitability of feature representations for Lithuanian and Russian document clustering. Unsupervised feature selection is very important in the document clustering process.
This paper gives an overview of the latest developments in computational syntactic analysis of Estonian. We present the Estonian Dependency Treebank, an ongoing corpus annotation project. Although the treebank construction is still under way, we have used it for training MaltParser and experimenting with combining MaltParser with a rule-based Constraint Grammar parser for Estonian. MaltParser achieves an unlabeled attachment score (UAS; correct links to head node) of 83.4% and a label accuracy (LA) of 88.6%. The labeled attachment score (LAS) was 80.3%.
Applying different algorithms for combining MaltParser with the Constraint Grammar parser improved the results by 1%. A special CG rule set for fixing some typical MaltParser errors improved the UAS by up to 1.5%.
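The three parser metrics quoted above are related: UAS counts tokens with the correct head, LA counts tokens with the correct dependency label, and LAS counts tokens where both are correct. The Python sketch below computes all three under their usual definitions. The function name, the tuple representation and the toy sentence are hypothetical illustrations, not the evaluation code or data used in the paper.

```python
from typing import List, Tuple

# Each token is described by (index of its head, dependency label).
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float, float]:
    """Return (UAS, LA, LAS) as fractions of tokens."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head correct
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n   # label correct
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # both correct
    return uas, la, las


# Toy four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "advmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "advmod")]
print(attachment_scores(gold, pred))  # (0.75, 0.75, 0.5)
```

Since LAS requires both head and label to match, it can never exceed either UAS or LA, which is consistent with the figures reported in the abstract.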
This paper describes an information extraction system designed for obtaining CV-style structured information about publicly mentioned persons, organizations and their relations by analyzing newswire archives in the Latvian language. The described text analysis pipeline consists of morphosyntactic analysis, NER and coreference resolution, and a semantic role labeling system based on FrameNet principles. We also implement an entity linking process, matching the entity mentions in each document to an entity knowledge base that is initially seeded with authoritative information on relevant people and organizations. The accuracy of automated frame extraction varies depending on specifics of each frame type, but the average accuracy currently is 53% F-score for frame target identification, and 61% for frame element role classification. The currently targeted volume of text is the total archives of Latvian newspapers, magazines and news portals, consisting of about 3.5 million articles.
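An F-score summarises precision and recall in a single number. The Python sketch below assumes the common balanced F1 variant, computed from counts of correct, spurious and missed items. The abstract does not state which variant or counting scheme the authors used, so the function name and the example counts are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives  # items the system produced
    actual = true_positives + false_negatives     # items in the gold standard
    if predicted == 0 or actual == 0:
        return 0.0

    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 6 correct frames, 2 spurious, 4 missed.
score = f1_score(true_positives=6, false_positives=2, false_negatives=4)
print(round(score, 3))  # 0.667
```

With these counts precision is 0.75 and recall is 0.6, and the harmonic mean sits closer to the lower of the two values.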
This paper reports on the viability of using machine translation (MT) for determining the original sentiment of tweets when tweets written in a less widely used language are translated into more widely used ones. The results of the study show that it is possible to use MT and sentiment analysis (SA) systems to produce SA results with significant precision.
Transliteration dictionaries are an important resource for the development of machine transliteration systems. The paper describes and analyses a large multilingual transliteration dictionary extracted from probabilistic dictionaries for 24 European languages containing approximately 1.25 million transliterated word pairs. The transliteration dictionary is evaluated: 1) manually for the Latvian-English language pair and 2) automatically within a statistical machine translation based transliteration task for all 23 language pairs.
In this paper we present our experience in building machine translation (MT) systems for the languages of the Baltic States: Estonian, Latvian, and Lithuanian. The paper reports on the implementation, research, data, data collection methods, and evaluation of the MT. Results of the evaluation show that it is possible to collect a sufficient amount of data and train MT systems that can compete with Google in quality and even overtake it in general domain MT.
We evaluate back-off n-gram and recurrent neural network language models for an automatic speech recognition system for medical applications. We also propose an effective and simple multi-domain recurrent neural network architecture which enables training a joint model for all domains. The multi-domain recurrent neural network model outperforms all other compared models.