
Ebook: Human Language Technologies – The Baltic Perspective

Computational linguistics, speech processing, natural language processing and language technologies in general have all become increasingly important in an era of all-pervading technological development.
This book, Human Language Technologies – The Baltic Perspective, presents the proceedings of the 8th International Baltic Human Language Technologies Conference (Baltic HLT 2018), held in Tartu, Estonia, on 27-29 September 2018. The main aim of Baltic HLT is to provide a forum for sharing new ideas and recent advances in computational linguistics and related disciplines, and to promote cooperation between the research communities of the Baltic States and beyond.
The 24 articles in this volume cover a wide range of subjects, including machine translation, automatic morphology, text classification, various language resources, and NLP pipelines, as well as speech technology, the latter being the most popular topic with 8 papers.
Delivering an overview of the state-of-the-art language technologies from a Baltic perspective, the book will be of interest to all those whose work involves language processing in whatever form.
It is our great pleasure to introduce the proceedings of the Eighth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2018).
Since its first edition in 2004, the main aim of Baltic HLT has been to provide a forum for sharing new ideas and recent advances in computational linguistics and related disciplines, and to promote cooperation between the research communities of the Baltic States and beyond.
The call for papers for the Eighth Baltic HLT encouraged authors to submit papers in the area of natural language processing and language technologies in general, with a special focus on research into the languages spoken in the Baltic countries. We received 43 submissions; each submission was evaluated by three reviewers. We wish to express our gratitude to the members of the Programme Committee, who worked hard to review all submissions. Based on their scores and their comments on the content and quality of the papers, 24 papers were accepted for presentation and publication.
Among the accepted submissions, speech technology was the most popular topic with 8 accepted papers, but overall the papers in this volume cover a wide range of topics: machine translation, automatic morphology, text classification, various language resources and NLP pipelines.
Completing the programme are the invited lectures by Peter Bell, Alexander Fraser and Martin Volk.
We hope that this year's conference “Human Language Technologies – The Baltic Perspective” is an intellectually stimulating event for all of you and that it will generate many new ideas that will help extend your own research.
This year is special for the computational linguistics research community in the Baltics, as Professor Rūta Petrauskaitė, the pioneer of corpus and computational linguistics in Lithuania and one of the initiators of the Baltic HLT conference, celebrates her sixtieth anniversary. She is the founder of the Centre of Computational Linguistics at Vytautas Magnus University (Kaunas), where corpus and computational linguistics research has been carried out for twenty-five years. Thanks to Rūta Petrauskaitė, corpora became an integral part of Lithuanian language research. Her research and work are well known not only in Lithuania but also internationally, and many scientists have been inspired by her ideas. Some research papers by her former students can also be found in this publication.
The organizers would like to express their gratitude to the supporters of this conference: the Centre of Excellence in Estonian Studies (European Regional Development Fund), Mooncascade OÜ, Lingvist Technologies OÜ, Tilde Company and City of Tartu.
This paper describes the current TTÜ speech transcription system for Estonian speech. The system is designed to handle semi-spontaneous speech, such as broadcast conversations, lecture recordings and interviews recorded in diverse acoustic conditions. The system is based on the Kaldi toolkit. Multi-condition training using background noise profiles extracted automatically from untranscribed data is used to improve the robustness of the system. Out-of-vocabulary words are recovered using a phoneme n-gram based decoding subgraph and an FST-based phoneme-to-grapheme model. The system achieves a word error rate of 8.1% on a test set of broadcast conversations. The system also performs punctuation recovery and speaker identification. Speaker identification models are trained using a recently proposed weakly supervised training method.
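As a generic illustration of the evaluation metric behind the 8.1% figure, the sketch below computes word error rate as word-level edit distance; it is not the TTÜ system's own code, and the example strings are invented.

```python
# Minimal sketch: word error rate (WER) as word-level edit distance divided by
# reference length. Generic illustration only; example strings are invented.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("tere head kuulajad", "tere head kuulaja"))  # 1 substitution / 3 words = 0.33
```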
This is a preliminary study in topic interpretation for the Estonian language. We estimate empirically the best number of topics to compute for a 185-million-word newspaper corpus. To assess the difficulty of topics, they are independently labeled by two annotators and translated into English. The Estonian Wordnet and the Princeton Wordnet are then used to identify the word pairs in the topics that have high taxonomic similarity.
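The wordnet-based similarity step could look roughly like the sketch below, which queries the Princeton WordNet via NLTK and uses path similarity as one possible taxonomic measure; the paper's actual measure, threshold and topic words are not reproduced here.

```python
# Hedged sketch: scoring taxonomic similarity between topic word pairs with the
# Princeton WordNet via NLTK. Path similarity and the 0.3 threshold are assumptions.
# Assumes the WordNet data has been downloaded via nltk.download('wordnet').
from itertools import combinations
from nltk.corpus import wordnet as wn

def similar_pairs(topic_words, threshold=0.3):
    """Return topic word pairs whose best noun-synset path similarity exceeds a threshold."""
    pairs = []
    for w1, w2 in combinations(topic_words, 2):
        scores = [
            s1.path_similarity(s2) or 0.0
            for s1 in wn.synsets(w1, pos=wn.NOUN)
            for s2 in wn.synsets(w2, pos=wn.NOUN)
        ]
        if scores and max(scores) >= threshold:
            pairs.append((w1, w2, max(scores)))
    return pairs

print(similar_pairs(["government", "parliament", "weather"]))
```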
In this research we continue the intrinsic evaluation of the two most popular and publicly available Lithuanian morphological analyzers-lemmatizers, Lemuoklis and Semantika.lt. In our previous paper [1] we reported comparative results of the shallow morphological analysis, mostly covering coarse-grained part-of-speech tags. The results were better for Semantika.lt in three domains (administrative, fiction and periodicals), but not on scientific texts. A deeper analysis of the fine-grained morphological categories (case, gender, number, degree, tense, mood, person, and voice) gave a more precise account of the strengths and weaknesses of both analyzers. Further investigation showed that the higher performance of the Lemuoklis analyzer in the scientific domain is probably related to a more successful analysis of long-distance agreement, in spite of the overall slight superiority of the Semantika.lt analyzer.
Thanks to the advancements in neural networks there have been many new breakthroughs in human-like speech synthesis over the last couple of years. For Latvian, there have been no new publications about speech synthesis since 2010. The paper describes efforts to apply recent advancements in neural speech synthesis to Latvian using open-source tools.
In this study, we propose a practical approach to creating a chatbot on the basis of data accumulated by customer support operations. The selected use case is a typical representative of a chatbot application in customer service centers that want to improve their efficiency and raise customer satisfaction. We show how company support information and logs from support interactions can serve as source data for creating a customer support chatbot. The chatbot developed in this use case is targeted at Latvian-speaking users and demonstrates an implementation of Q&A functionality in the Latvian language. We also propose a simple evaluation metric for chatbot responses to natural language questions. As practical chatbots cannot be perfect in providing appropriate answers to all user questions, this metric can be used to assess the readiness of the chatbot for being released to real users. In our experiment, a chatbot with a score of 0.45 showed positive results in a user survey.
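As a purely hypothetical illustration of how a single readiness score might be aggregated from graded answer judgments, consider the sketch below; the grading scheme and weights are assumptions and do not reproduce the metric proposed in the paper.

```python
# Purely illustrative sketch: one way a readiness score for chatbot answers could
# be aggregated from graded judgments. The grade set and weights are assumptions.

GRADE_WEIGHTS = {"correct": 1.0, "partially_correct": 0.5, "wrong": 0.0}

def readiness_score(judgments):
    """Average weighted grade over a set of evaluated question-answer pairs."""
    if not judgments:
        return 0.0
    return sum(GRADE_WEIGHTS[g] for g in judgments) / len(judgments)

print(readiness_score(["correct", "wrong", "partially_correct", "correct"]))  # 0.625
```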
This paper presents an effort to provide a level-appropriate study corpus for Lithuanian language learners. The collected corpus includes levelled texts from study books and unlevelled texts from other sources. The main goal is to assign level-appropriate labels (A1, A2, B1, B2) to the texts from other sources. For automatic classification we use preselected surface features, based on text readability research, and shallow linguistic features. First, we train the model on the levelled texts from study books; second, we apply the learned model to classify the other texts. The best classification results are achieved with the logistic regression method.
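A minimal sketch of this kind of setup, assuming a few common surface readability features and scikit-learn's logistic regression, is shown below; the feature set, training texts and labels are illustrative placeholders, not the paper's.

```python
# Hedged sketch: surface readability features fed into logistic regression for
# CEFR-level labels. The features and the toy data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def surface_features(text: str) -> list[float]:
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sentences), 1)              # words per sentence
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)  # characters per word
    type_token_ratio = len(set(w.lower() for w in words)) / max(len(words), 1)
    return [avg_sent_len, avg_word_len, type_token_ratio]

# Levelled study-book texts (placeholders) train the model ...
train_texts = ["Aš esu studentas.", "Nors lijo, mes nusprendėme keliauti toliau."]
train_labels = ["A1", "B1"]
clf = LogisticRegression(max_iter=1000)
clf.fit(np.array([surface_features(t) for t in train_texts]), train_labels)

# ... and the learned model is then applied to unlevelled texts from other sources.
print(clf.predict([surface_features("Vakar skaičiau įdomią knygą apie istoriją.")]))
```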
Years ago, a dedicated infrastructure for facilitating language technology development was set up by the Divvun language technology development group at the University of Tromsø.
We present a case study of using the infrastructure while developing a morphological description for Estonian. We are interested in how the environment helps a linguist focus on the actual linguistics and how it supports reuse of code, by which we mean reuse of dedicated scripts and makefiles that turn the language-specific source material into a finite-state transducer (FST).
The paper is concerned with best practices for the different FST programming language files (lexc, twol and xfst): how the expressions should be written and what conventions should be followed. In many cases there are alternative ways of describing a phenomenon, and the choice proves to be good or bad only after a considerable amount of effort has been put into following it.
This study focuses on analysing whether changes of F0 in the Lithuanian language are influenced by: 1) stress and the type of syllable accent (acute, circumflex); 2) the type of sentence (declarative, exclamatory, interrogative); 3) phrase accent (focused word). The research material consists of recordings of three female Standard Lithuanian speakers. Each of the samples was read 5 times. The comparison of F0 in stressed and unstressed syllables, in acute and circumflex syllables, in focal and non-focal positions, and in declarative, exclamatory and interrogative sentences allows us to assume that pitch is an indicator of intonation rather than of lexical stress and syllable accent.
Studies on the rhythm of the living Baltic languages are scarce (especially comparative ones) and their conclusions are usually ambiguous or even controversial. The aim of this research was to determine the values of the acoustic rhythm correlates of the Lithuanian and Latvian languages and to compare them with the values found in other languages that researchers have assigned to particular rhythm groups. The empirical material of this study consisted of 5 Lithuanian and 5 Latvian native speakers reading Aesop's fable The North Wind and the Sun in their respective native languages. The acoustic analysis of the audio recordings was performed using the Praat and Correlatore applications. The analysis showed that the values of the acoustic rhythm correlates in the two Baltic languages differ. Nevertheless, according to the studied acoustic correlates, Lithuanian belongs to the stress-timed languages, while Latvian is closer to the syllable-timed languages.
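For readers unfamiliar with rhythm correlates, the sketch below computes two widely used ones (%V and ΔC, in the tradition of Ramus et al.) from segmented vocalic and consonantal interval durations; the durations are made-up placeholders and the paper's exact set of correlates is not reproduced.

```python
# Hedged sketch of two standard rhythm correlates (%V and delta-C) computed from
# manually segmented interval durations. The durations below are placeholders.
import statistics

def percent_v(vocalic, consonantal):
    """Proportion of utterance duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100 * sum(vocalic) / total

def delta_c(consonantal):
    """Standard deviation of consonantal interval durations (seconds)."""
    return statistics.stdev(consonantal)

vocalic = [0.08, 0.12, 0.09, 0.15]      # placeholder interval durations in seconds
consonantal = [0.06, 0.10, 0.07, 0.11]
print(f"%V = {percent_v(vocalic, consonantal):.1f}, dC = {delta_c(consonantal):.3f}")
```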
The present paper focuses on linguistic experiments as a source of language resources (LRs). It addresses some of the legal requirements regulating their collection. The primary focus of the paper is on processing personal data (PD), especially how the General Data Protection Regulation (GDPR) defines personal data and what it means to anonymize personal data.
The rise of e-books, the cumulative digitisation of written library materials and the advancement of speech technology have reached a stage where library services and e-books can be read out loud to customers in synthetic speech and paper books (either published or still in print) can be delivered in audio form. The user environment of Digar, the digital archive of the Estonian National Library, includes a special reading machine capable of producing an audio version of electronic texts in Estonian (books, magazines, etc.). The Elisa Raamat application provides access to more than 2500 Estonian e-books, which can not only be read from the screen of a smartphone or tablet but also listened to. The speech server of the Institute of the Estonian Language offers, as a public service, the text-to-speech system Vox populi, which invites people to have an audio version synthesized from any text of interest and converts any uploaded text (an article, paper, subtitle file, e-book, etc.) into an audio file. The present study focuses not only on the description of these systems but also on various issues of text processing and pronunciation, as well as on the reflection of text structure in synthetic speech. The quality of the automatic reading largely depends on how adequately abbreviations, numbers and other non-letter sequences in the input are converted into words in the correct morphological form, and how closely the output pronunciation of foreign names matches that of the source language. In the article we also discuss a special module for text pre-processing, which helps in the case of more complex text structures and character sequences (e.g. geographic coordinates, sports results, numeral inflection). In addition, book reading requires as accurate a rendering of the text structure as possible. The study also analyses audiobooks to capture the essence of human prosodic phrasing, as well as pausing and the marking of reported speech.
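A toy sketch of rule-based text pre-processing of the kind discussed above, expanding a few abbreviations and digits into words before synthesis, is given below; the lookup tables are tiny illustrative samples rather than the actual module's rules, and a real system would also inflect full numerals.

```python
# Hedged sketch of text normalisation before synthesis: expand abbreviations and
# spell out digits. The lookup tables are small illustrative samples only.
import re

ABBREVIATIONS = {"jne": "ja nii edasi", "nt": "näiteks", "lk": "lehekülg"}
DIGITS = {"0": "null", "1": "üks", "2": "kaks", "3": "kolm", "4": "neli",
          "5": "viis", "6": "kuus", "7": "seitse", "8": "kaheksa", "9": "üheksa"}

def normalise(text: str) -> str:
    # expand known abbreviations (with an optional trailing full stop)
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b\.?", full, text)
    # spell out single digits; a real module would handle multi-digit numerals and inflection
    text = re.sub(r"\d", lambda m: f" {DIGITS[m.group()]} ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalise("Vt lk 3 jne"))  # -> "Vt lehekülg kolm ja nii edasi"
```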
A trilingual Latvian-Russian-English corpus of tweets is presented, with an analysis of users, language and topics. The corpus consists of 1.4 million tweets covering the period from April 2017 to July 2018. The language analysis reveals that the majority of users mostly use one language. Within the analysed topics, the proportion of Latvian content is higher than in the collection as a whole. Among many potential use cases, the corpus can be used, for example, to study the public engagement of major Latvian media outlets and public figures, or the factors that determine the language choice and content of a tweet.
This paper reports the lessons learned while creating a FrameNet-annotated text corpus of Latvian. This is still ongoing work, part of a larger project aimed at creating a multilayer text corpus anchored in cross-lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet and PropBank, as well as Abstract Meaning Representation (AMR). For the FrameNet layer, we use the latest frame inventory of Berkeley FrameNet (BFN v1.7), while the annotation itself is done on top of the underlying UD layer. We strictly follow a corpus-driven approach, meaning that lexical units (LUs) in Latvian FrameNet are created only on the basis of the annotated corpus examples. Since we are aiming at a medium-sized yet general-purpose corpus, an important aspect that we take into account is the variety and balance of the corpus in terms of genres, domains and LUs. We have finished the first phase of the FrameNet corpus annotation, and we discuss the cross-lingual issues we have collected and their possible solutions. These issues are relevant for other languages as well, particularly if the goal is to maintain cross-lingual compatibility via BFN.
In this paper, we investigate the use of different types of neural networks for age and gender identification from children's speech, based on the Corpus of Estonian Adolescent Speech. Feed-forward deep neural networks using i-vectors as input are compared with recurrent neural networks using MFCCs as input. Results show that feed-forward neural networks outperform recurrent neural networks for gender classification, while a model that combines both i-vectors and MFCCs via feed-forward and recurrent branches achieves the best performance for age group classification. We also show that for age group classification, it is beneficial to first identify gender and then use a gender-specific age identification model. Experiments with human listeners show that the neural network models outperform humans on both tasks by a large margin.
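A minimal sketch of the feed-forward branch, assuming 400-dimensional i-vectors, two hidden layers of 256 units and random placeholder data, is shown below; these values are assumptions rather than the paper's configuration.

```python
# Minimal sketch of a feed-forward gender classifier over fixed-length i-vectors.
# The i-vector dimensionality, layer sizes and training data are assumptions.
import numpy as np
from tensorflow import keras

IVECTOR_DIM = 400  # assumed i-vector dimensionality

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),  # two gender classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholder data stands in for real i-vectors and labels.
x = np.random.randn(32, IVECTOR_DIM).astype("float32")
y = np.random.randint(0, 2, size=32)
model.fit(x, y, epochs=1, verbose=0)
```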
We present the Latvian Tweet Corpus and its application to sentiment analysis, comparing four different machine learning algorithms and a lexical classification method. We show that the best results are achieved by an averaged perceptron classifier. In our experiments, the more complex neural network-based classification methods (using recurrent neural networks and word embeddings) did not yield better results.
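For illustration, a compact averaged perceptron over sparse bag-of-words features is sketched below; the features, labels and training details of the actual experiments are not reproduced, and the toy data is invented.

```python
# Hedged sketch of an averaged perceptron classifier over sparse features,
# using the standard cached-weights averaging trick. Toy data is invented.
from collections import defaultdict

class AveragedPerceptron:
    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)   # (label, feature) -> current weight
        self.u = defaultdict(float)   # cached sums used for weight averaging
        self.c = 1                    # training example counter

    def _score(self, weights, label, feats):
        return sum(weights[(label, f)] * v for f, v in feats.items())

    def predict(self, feats, weights=None):
        w = self.w if weights is None else weights
        return max(self.labels, key=lambda label: self._score(w, label, feats))

    def train_example(self, feats, gold):
        guess = self.predict(feats)
        if guess != gold:
            for f, v in feats.items():
                self.w[(gold, f)] += v
                self.u[(gold, f)] += self.c * v
                self.w[(guess, f)] -= v
                self.u[(guess, f)] -= self.c * v
        self.c += 1

    def averaged_weights(self):
        # averaging trick: w_avg = w - u / c
        return defaultdict(float, {k: v - self.u[k] / self.c for k, v in self.w.items()})

# Toy bag-of-words training data with made-up feature words.
data = [({"labi": 1, "paldies": 1}, "positive"), ({"slikti": 1}, "negative")]
model = AveragedPerceptron(["positive", "negative"])
for _ in range(5):
    for feats, label in data:
        model.train_example(feats, label)
print(model.predict({"slikti": 1}, weights=model.averaged_weights()))  # -> negative
```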
This paper describes the integration of traditional dictionary data from the Tēzaurs.lv online dictionary with a morphological analysis system, resulting in a structured morphological lexicon that improves coverage of morphological tagging and also enhances the dictionary with automatically generated inflection tables. The main challenge in this integration was the extraction of structured data from textual dictionary information, and classification of dictionary entries into morphological paradigms.
Large parallel corpora that are automatically obtained from the web, documents or other sources often contain many corrupted parts that are bound to negatively affect the quality of the systems and models trained on them. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
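A hedged sketch of the kind of sentence-pair filters such cleaning scripts typically apply (empty or untranslated segments, extreme length ratios, low alphabetic content) is given below; the thresholds and the actual filter set of the paper are assumptions.

```python
# Hedged sketch of common parallel-corpus filters; thresholds are assumptions.
def keep_pair(src: str, tgt: str,
              max_len_ratio: float = 3.0,
              max_tokens: int = 100,
              min_alpha_ratio: float = 0.5) -> bool:
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt or src == tgt:
        return False                                   # empty or untranslated segment
    s_tok, t_tok = src.split(), tgt.split()
    if len(s_tok) > max_tokens or len(t_tok) > max_tokens:
        return False                                   # overly long sentences
    ratio = max(len(s_tok), len(t_tok)) / max(min(len(s_tok), len(t_tok)), 1)
    if ratio > max_len_ratio:
        return False                                   # implausible length mismatch
    alpha = lambda s: sum(ch.isalpha() for ch in s) / max(len(s), 1)
    if alpha(src) < min_alpha_ratio or alpha(tgt) < min_alpha_ratio:
        return False                                   # mostly markup, numbers, noise
    return True

pairs = [("Tere!", "Hello!"), ("", "Empty source"), ("1 2 3 4 5 6", "123456")]
print([p for p in pairs if keep_pair(*p)])  # keeps only the first pair
```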
In this paper, we present Tilde's work on boosting the output quality and availability of Estonian machine translation systems, focusing mostly on the less-resourced and morphologically complex language pairs between Estonian and Russian. We describe our efforts to collect parallel and monolingual data for the development of better neural machine translation models, as well as experiments with various model architectures with the goal of finding the best-performing model for our data. We attain state-of-the-art MT results by training a multi-way Transformer model that improves quality by up to +3.27 BLEU points over the baseline system. We also provide a publicly available translation service via a mobile phone application.
This paper presents an analysis of the structure and acoustic features of the structural units of mirthful and polite laughter. The research material consists of 100 samples of non-overlapping spontaneous laughter. The analysis revealed that the greatest difference between mirthful and polite laughter lies in the structure: polite laughter consists of one bout, whereas mirthful laughter consists of one or more bouts. Further differences can be seen in the duration of laughter units of the same structure and in the range of mean F0, F1, F2, shimmer, and jitter: the duration is longer and the range of the other acoustic features of mirthful laughter is almost always wider than that of polite laughter.
This paper describes the development of a general-purpose automatic speech recognition system for Lithuanian. The system is capable of both transcribing user-submitted audio recordings and performing real-time speech-to-text conversion. The comparative evaluation results show that the presented system outperforms all other ASR systems for the Lithuanian language. The system also includes number and date normalization and is paired with an automatic punctuation restoration model that achieves state-of-the-art results for the Lithuanian language. Importantly, the system is publicly available to any Lithuanian speaker for testing via its web page and mobile application.
As Latvian can still be considered an under-resourced language, several corpora and corpus tools that can be used for its linguistic research are presented in the paper, namely: the InterCorp and Araneum Lettonicum corpora along with the Treq database, a word-sketch grammar for Latvian and the Morfio tool.
We develop neural morphological tagging and disambiguation models for Estonian. First, we experiment with two neural architectures for morphological tagging: a standard multiclass classifier, which treats each morphological tag as a single unit, and a sequence model, which handles morphological tags as sequences of morphological category values. Second, we complement these models with the analyses generated by VABAMORF, a rule-based Estonian morphological analyser (MA), thus performing soft morphological disambiguation. We compare two ways of supplementing a neural morphological tagger with the MA outputs: first, by adding the combined analysis embeddings to the word representation input of the neural tagging model, and second, by adopting an attention mechanism to focus on the most relevant analyses generated by the MA. Experiments on three Estonian datasets show that our neural architectures consistently outperform the non-neural baselines, including HMM-disambiguated VABAMORF, while augmenting the models with MA outputs results in a further performance boost for both models.
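The second strategy can be pictured roughly as in the PyTorch sketch below, which attends over embeddings of the candidate analyses and concatenates the attended summary with the word representation; all dimensions and the exact scoring function are assumptions, not the paper's configuration.

```python
# Hedged sketch: attention over embeddings of candidate morphological analyses,
# combined with the word representation. Dimensions and scoring are assumptions.
import torch
import torch.nn as nn

class AnalysisAttention(nn.Module):
    def __init__(self, word_dim=256, analysis_dim=128):
        super().__init__()
        self.proj = nn.Linear(word_dim, analysis_dim)  # maps word state to query space

    def forward(self, word_state, analysis_embs):
        # word_state: (batch, word_dim); analysis_embs: (batch, n_analyses, analysis_dim)
        query = self.proj(word_state).unsqueeze(2)                 # (batch, analysis_dim, 1)
        scores = torch.bmm(analysis_embs, query).squeeze(2)        # (batch, n_analyses)
        weights = torch.softmax(scores, dim=-1)                    # attention over analyses
        summary = torch.bmm(weights.unsqueeze(1), analysis_embs).squeeze(1)
        return torch.cat([word_state, summary], dim=-1)            # enriched representation

word_state = torch.randn(4, 256)          # e.g. encoder output for one token
analysis_embs = torch.randn(4, 5, 128)    # embeddings of 5 candidate analyses
print(AnalysisAttention()(word_state, analysis_embs).shape)  # torch.Size([4, 384])
```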
Quality estimation is an essential step in applying machine translation systems in practice; however, state-of-the-art approaches require manual post-edits and other expensive resources. We introduce an approach to quality estimation that uses the attention weights of a neural machine translation system and can be applied to a translation produced by any machine translation system; a lighter version of the approach does not require any post-edits at all. Our experiments with German-Estonian and English-Estonian translations show that its performance matches the state-of-the-art baseline.
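As an illustration of how attention weights can be turned into quality indicators, the sketch below derives two simple sentence-level features (attention entropy and attention concentration) from an attention matrix; these features are illustrative and do not reproduce the paper's feature set.

```python
# Hedged sketch: simple sentence-level indicators derived from an NMT attention
# matrix. The features are illustrative, not the paper's actual feature set.
import numpy as np

def attention_features(attn: np.ndarray) -> dict:
    """attn: (target_len, source_len) matrix of attention weights, rows sum to 1."""
    eps = 1e-12
    entropy = -(attn * np.log(attn + eps)).sum(axis=1)   # per-target-token entropy
    concentration = attn.max(axis=1)                     # mass on the strongest source token
    return {
        "mean_entropy": float(entropy.mean()),           # high entropy: diffuse, risky alignment
        "mean_max_attention": float(concentration.mean()),
    }

attn = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.3, 0.3, 0.4]])
print(attention_features(attn))
```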
The paper introduces a modular pipeline that makes it possible to combine multiple Natural Language Processing tools into a unified framework. It aims to make NLP technology more accessible to researchers, non-experts and software developers. The paper describes the architecture of NLP-PIPE and presents publicly available NLP components for Latvian.
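The general idea of such a pipeline can be sketched as below: independent components that each consume and enrich a shared document object; the component names and fields are illustrative and do not reflect the actual NLP-PIPE interface.

```python
# Hedged sketch of a modular NLP pipeline: each component enriches a shared
# document object. Component names and fields are illustrative assumptions.
from typing import Callable

Document = dict          # e.g. {"text": ..., "tokens": ..., "caps": ...}
Component = Callable[[Document], Document]

def tokenizer(doc: Document) -> Document:
    doc["tokens"] = doc["text"].split()
    return doc

def uppercase_tagger(doc: Document) -> Document:
    # stand-in for a real tagger: marks capitalised tokens
    doc["caps"] = [t[0].isupper() for t in doc["tokens"]]
    return doc

def run_pipeline(doc: Document, components: list[Component]) -> Document:
    for component in components:
        doc = component(doc)     # each module adds its own annotation layer
    return doc

print(run_pipeline({"text": "Rīga ir Latvijas galvaspilsēta"},
                   [tokenizer, uppercase_tagger]))
```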