
Ebook: Human Language Technologies – The Baltic Perspective

Human language technology is the study of the methods by which computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech. It consists of natural language processing and computational linguistics on the one hand, and speech technology on the other.
This book presents the proceedings of the 9th International Conference, Human Language Technologies – The Baltic Perspective (Baltic HLT 2020), organised in Kaunas, Lithuania on 22 and 23 September 2020. This biennial conference offers researchers a platform for sharing knowledge on recent advances in human language processing for the Baltic languages, as well as for promoting interdisciplinary and international cooperation in human language-technology research within and beyond the Baltic States. In addition to the traditional topics of natural language processing and language technologies, this year’s conference featured a special session on resource and tool development for teaching and learning the less resourced Baltic languages. This year, 42 submissions were received, each of which was evaluated by two reviewers, resulting in a total of 34 papers being accepted for presentation and publication. The book is divided into four sections: speech and text analysis (9 papers); machine translation and natural language understanding (6 papers); tools and resources (14 papers); and language learning resources (5 papers).
Providing a fascinating overview of current research in the field from a primarily Baltic perspective, the book will be of interest to all those whose work involves human language technology.
It is our great pleasure to introduce the Proceedings of the 9th International Conference “Human Language Technologies – the Baltic Perspective” (Baltic HLT 2020), organized by the Centre of Computational Linguistics and the CLARIN-LT centre at Vytautas Magnus University on September 22–23 in Kaunas, Lithuania. For the first time, this year’s conference was held entirely virtually.
This biennial conference, first organized in 2004, offers researchers a space for sharing knowledge on recent advances in human language processing for the Baltic languages, as well as for promoting interdisciplinary and international cooperation in human language-technology research within and beyond the Baltic States.
In addition to the traditional topics of natural language processing and language technologies, this year’s conference featured a special session on resource and tool development for teaching and learning the less resourced Baltic languages. The keynote talk for this session, given by Elena Volodina (University of Gothenburg, Sweden), served as a good basis for sharing experiences and discussing ideas for further resource development and the application of NLP in language teaching and learning. The talk by keynote speaker Jan Rybicki (Jagiellonian University in Kraków, Poland) offered inspiration to the growing Digital Humanities community, while keynote speaker Daniel Zeman (Charles University in Prague, Czech Republic) discussed the current state of Universal Dependencies – a community effort to define cross-linguistically applicable annotation guidelines for morphology and syntax.
We received 42 submissions this year, each of which was evaluated by two reviewers. We would like to take this opportunity to express our gratitude to the members of the Programme Committee, who worked hard to provide insightful comments. Thirty-four papers were accepted for presentation and publication. Papers in this volume cover speech and text analysis (9 papers), machine translation and natural language understanding (6 papers), tools and resources (14 papers) and language learning resources (5 papers).
We would also like to express our gratitude to the Research Council of Lithuania for funding the conference, Vytautas Magnus University for hosting the event, the European Language Grid for organizing the pre-conference event, the Organizing Committee, our keynote speakers and all participants who, despite all the constraints, attended the virtual conference and contributed to its success.
Andrius Utka
Jurgita Vaičenonienė
Jolanta Kovalevskaitė
Danguolė Kalinauskaitė
This paper presents the first study of Estonian pronominal coreference resolution using machine learning. Appropriate machine learning algorithms and techniques for balancing the data are tested on a human-annotated corpus. The results are encouraging, showing an F-score comparable with the results obtained for English before the advent of deep neural networks.
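The paper’s own features and models are not reproduced here; as a generic illustration of the class-balancing step such a study requires (coreferent mention pairs are far rarer than non-coreferent ones), the following sketch oversamples the minority class before training. The feature matrix, class ratio and choice of classifier are all invented for the example.

```python
# Minimal sketch: random oversampling of the rare positive class on
# synthetic stand-in data, then F1 evaluation on a held-out split.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))           # hypothetical mention-pair features
y = (rng.random(1000) < 0.1).astype(int)  # ~10% positive: heavily imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```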
This study examines the structural models of Lithuanian plosive consonants in intervocalic, word-initial and word-final positions. The research material consists of 24 sentences read three times by six native speakers. The results show that the plosive consonants can be composed of one to three phases, and the most frequent and common model is a closure with a burst release, which may be followed by varying degrees of frication.
Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there exist several multilingual BERT models that can handle multiple languages simultaneously and that have also been trained on Estonian data. In this paper, we evaluate four multilingual models—multilingual BERT, multilingual distilled BERT, XLM and XLM-RoBERTa—on several NLP tasks, including POS and morphological tagging, NER and text classification. Our aim is to establish a comparison between these multilingual BERT models and the existing baseline neural models for these tasks. Our results show that multilingual BERT models can generalise well to different Estonian NLP tasks, outperforming all baseline models for POS and morphological tagging and text classification, and reaching a level comparable with the best baseline for NER, with XLM-RoBERTa achieving the best results among the multilingual models.
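For readers unfamiliar with these models, the sketch below shows how one of the evaluated models, XLM-RoBERTa, can be loaded for a token-level task such as POS tagging via the Hugging Face transformers library; the label count and example sentence are placeholders, and this is not the authors’ experimental code.

```python
# Load XLM-RoBERTa with a freshly initialised token-classification head;
# in a real experiment the model would then be fine-tuned on labelled data.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=17)  # e.g. the 17 UD POS tags

# Estonian example; sub-word pieces must be mapped back to word-level
# tags when scoring against the gold annotation.
inputs = tokenizer("Ma loen raamatut.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels)
print(logits.argmax(-1))
```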
We report an analysis of similarities and differences in terms of selected characteristics of three Lithuanian functional styles (FS): administrative, scientific, and publicistic. We combined eight quantitative indicators and multivariate statistical analysis for this task. We also analyzed the tendency of individual indicators to be more or less pronounced in particular functional styles.
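As a schematic illustration of what such a multivariate analysis can look like (the paper’s actual indicators and method are not reproduced), the fragment below projects a toy text-by-indicator matrix onto two principal components, which could then be plotted by functional style:

```python
# Toy example: 90 texts described by 8 random stand-in indicators,
# reduced to 2 dimensions with PCA for visual comparison of styles.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
texts_by_indicator = rng.normal(size=(90, 8))   # 90 texts x 8 indicators
components = PCA(n_components=2).fit_transform(texts_by_indicator)
print(components.shape)                         # (90, 2)
```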
The paper presents research results for solving the task of targeted aspect-based sentiment analysis in the specific domain of Lithuanian social media reviews. The methodology, system architecture and relevant NLP tools and resources are described, followed by experimental results showing that our solution is suitable for solving targeted aspect-based sentiment analysis tasks for under-resourced, morphologically rich and flexible word order languages.
The paper presents the results of research on deep learning methods aiming to determine the most effective one for the automatic extraction of Lithuanian terms from a specialized domain (cybersecurity) with very restricted resources. A semi-supervised approach to deep learning was chosen for the research, as Lithuanian is a less resourced language and the large amounts of data necessary for unsupervised methods are not available in the selected domain. The findings of the research show that a Bi-LSTM network with Bidirectional Encoder Representations from Transformers (BERT) can achieve close to state-of-the-art results.
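A generic sketch of the architecture named in the abstract follows: a Bi-LSTM tagger over BERT token embeddings (simulated here with random tensors), labelling each token as part of a term or not. All dimensions and the B/I/O label set are assumptions for illustration, not the authors’ configuration.

```python
# Bi-LSTM sequence tagger over BERT-sized token embeddings.
import torch
import torch.nn as nn

class BiLSTMTermTagger(nn.Module):
    def __init__(self, bert_dim=768, hidden=256, num_labels=3):  # B/I/O
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, bert_embeddings):          # (batch, seq, bert_dim)
        h, _ = self.lstm(bert_embeddings)
        return self.out(h)                       # (batch, seq, num_labels)

tagger = BiLSTMTermTagger()
dummy = torch.randn(2, 10, 768)                  # stand-in for BERT output
print(tagger(dummy).shape)                       # torch.Size([2, 10, 3])
```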
Automatic Speech Recognition (ASR) requires huge amounts of real user speech data to reach state-of-the-art performance. However, speech data conveys sensitive speaker attributes, like identity, that can be inferred and exploited for malicious purposes. Therefore, there is an interest in collecting anonymized speech data that has been processed by a voice conversion method. In this paper, we evaluate one such voice conversion method on Latvian speech data and also investigate whether privacy-transformed data can be used to improve ASR acoustic models. Results show the effectiveness of voice conversion against state-of-the-art speaker verification models on Latvian speech and the effectiveness of using privacy-transformed data in ASR training.
In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. We first pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show an improvement over the previous technique. We achieve 76% accuracy for sentiment analysis on Latvian tweets, a substantial improvement over previous work.
Transformer-based language models pre-trained on large corpora have demonstrated good results on multiple natural language processing tasks for widely used languages, including named entity recognition (NER). In this paper, we investigate the role of BERT models in the NER task for Latvian. We introduce a BERT model pre-trained on Latvian language data and demonstrate that this Latvian BERT model, pre-trained on large Latvian corpora, achieves better results than multilingual BERT (81.91 F1-measure on average vs 78.37 with M-BERT on a dataset with nine named entity types, and 79.72 vs 78.83 on another dataset with seven types) and outperforms previously developed Latvian NER systems.
Pipeline-based speech translation methods may suffer from errors found in speech recognition system output. Therefore, it is crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods for parallel data augmentation for pipeline-based speech translation system development. The first method utilises a speech processing workflow to introduce errors, and the second generates commonly found suffix errors using a rule-based method. We show that, in combination, the two methods significantly improve speech translation quality, by 1.87 BLEU points over a baseline system.
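The paper’s actual rules are not listed in the abstract; the following toy fragment merely illustrates the general shape of the second method, rewriting word endings using a table of confusable suffix pairs (the pairs below are invented placeholders):

```python
# Rule-based suffix corruption for data augmentation (illustrative only).
import random

SUFFIX_PAIRS = [("as", "a"), ("iem", "am"), ("oja", "o")]  # invented rules

def corrupt_suffixes(sentence, p=0.2, seed=0):
    rng = random.Random(seed)
    words = []
    for w in sentence.split():
        for src, tgt in SUFFIX_PAIRS:
            if w.endswith(src) and rng.random() < p:
                w = w[:-len(src)] + tgt          # replace the suffix
                break
        words.append(w)
    return " ".join(words)

print(corrupt_suffixes("vecas majas bija lielas", p=1.0))
```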
Neural machine translation systems are typically trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial, as machine translation systems are used to translate texts of informal origin, such as chat conversations, social media posts and web pages. We propose a simple generative noise model to generate adversarial examples of ten different types. We use these to augment machine translation systems’ training data and show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data, while baseline systems’ performance drops by 2–3 BLEU points. To measure the robustness and noise invariance of machine translation systems’ outputs, we use the average translation edit rate between the translation of the original sentence and its noised variants. Using this measure, we show that systems trained on adversarial examples yield, on average, 50% consistency improvements when compared to baselines trained on clean data.
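To make the consistency measure concrete, the sketch below scores the average word-level edit rate between a clean sentence and its noised variants. This is a simplification: real TER additionally models phrase shifts, and in the actual evaluation the compared strings would be system translations of the clean and noised sources.

```python
# Average edit-rate consistency between a sentence and noised variants.
import random

def swap_chars(s, rng):                      # one of many possible noise types
    i = rng.randrange(max(len(s) - 1, 1))
    return s[:i] + s[i + 1:i + 2] + s[i:i + 1] + s[i + 2:]

def edit_rate(hyp, ref):                     # word-level Levenshtein / |ref|
    h, r = hyp.split(), ref.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))
    return d[len(h)][len(r)] / max(len(r), 1)

rng = random.Random(0)
clean = "this is a test sentence"
variants = [swap_chars(clean, rng) for _ in range(5)]
print(sum(edit_rate(v, clean) for v in variants) / len(variants))
```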
This paper reports on the development of a toolkit that enables collecting a dialog corpus for end-to-end goal-oriented dialog system training. The toolkit includes a neural network model that interactively learns to predict the next virtual assistant (VA) action from the conversation history. We first explore methods for VA dialog scenario learning from examples, performing several experiments with the English DSTC dialog sets in order to find the optimal strategy for neural model training. The chosen algorithm is then used for training the next-action prediction model on Latvian dialogs in the public transport inquiries domain, collected using the toolkit. The accuracy of the English and the Latvian dialog models is similar – 0.84 and 0.86, respectively – which shows that the chosen method for neural network model training is language independent.
In this paper, we tackle an intent detection problem for the Lithuanian language with real supervised data. Our main focus is on the enhancement of the Natural Language Understanding (NLU) module, responsible for the comprehension of users’ questions. The NLU model is trained with a properly selected word vectorization type and a Deep Neural Network (DNN) classifier. During our experiments, we investigated fastText and BERT embeddings. In addition, we automatically optimized different architectures and hyper-parameters of the following DNN approaches: Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM) and Convolutional Neural Network (CNN). The highest accuracy, ∼0.715 (∼0.675 and ∼0.625 above the random and majority baselines, respectively), was achieved with the CNN classifier applied on top of BERT embeddings. A detailed error analysis revealed that prediction accuracy degrades for the least covered intents and due to intent ambiguities; therefore, in the future, we plan to make the necessary adjustments to boost the intent detection accuracy for the Lithuanian language even further.
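The following sketch (dimensions, filter count and intent inventory are placeholders) shows the general form of the best-scoring configuration, a 1-D convolutional classifier applied on top of BERT token embeddings:

```python
# CNN intent classifier over BERT-sized token embeddings.
import torch
import torch.nn as nn

class CNNIntentClassifier(nn.Module):
    def __init__(self, emb_dim=768, n_filters=128, kernel=3, n_intents=10):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel, padding=1)
        self.out = nn.Linear(n_filters, n_intents)

    def forward(self, emb):                      # (batch, seq, emb_dim)
        h = torch.relu(self.conv(emb.transpose(1, 2)))
        h = h.max(dim=2).values                  # max-pool over time
        return self.out(h)                       # (batch, n_intents)

model = CNNIntentClassifier()
print(model(torch.randn(4, 16, 768)).shape)      # torch.Size([4, 10])
```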
Human-computer interaction, especially in the form of dialogue systems and chatbots, has become extremely popular during the last decade. The dominant approach in the recent development of practical virtual assistants is the application of deep learning techniques. However, in the case of a less resourced language (or domain), the application of deep learning can be very complicated due to the lack of the necessary training data. In this paper, we discuss the possibility of applying a hybrid approach to dialogue modelling by combining a data-driven approach with a knowledge-based one. Our hypothesis is that by combining different agents (a general domain chatbot, a frequently asked questions module and a goal-oriented virtual assistant) into a single virtual assistant, we can facilitate the adequacy and fluency of the conversation. We investigate the suitability of different widely used techniques in less resourced settings. We demonstrate the feasibility of our approach for Latvian, a morphologically rich, less resourced language, through an initial virtual assistant prototype for the student service of the University of Latvia.
This paper presents LVBERT – the first publicly available monolingual language model pre-trained for Latvian. We show that LVBERT improves the state of the art for three Latvian NLP tasks: Part-of-Speech tagging, Named Entity Recognition and Universal Dependency parsing. We release LVBERT to facilitate future research and downstream applications for Latvian NLP.
This paper describes the Gaelic Linguistic Analyser (GLA), a new resource for the Scottish Gaelic language. The GLA includes a tagger, a lemmatiser and a parser, which were developed largely on the basis of existing resources. The tool is available online as the first component of the Scottish Gaelic Toolkit.
The paper investigates the coverage of The Dictionary of Modern Lithuanian (6th edition) in comparison with the Joint Corpus of Lithuanian. Resources, methods and procedures are described together with the results, which reveal that only 81% of the dictionary lemmas have counterparts in the corpus.
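At its core, such a coverage figure is a set intersection between dictionary lemmas and corpus lemmas, as the toy fragment below shows (the lemma lists are invented, and the paper’s actual pipeline of course also handles lemmatisation and variant mapping):

```python
# Toy coverage computation: 4 of 5 dictionary lemmas occur in the corpus.
dictionary_lemmas = {"namas", "eiti", "gražus", "knyga", "senovinis"}
corpus_lemmas = {"namas", "eiti", "gražus", "knyga", "naujas", "metai"}

covered = dictionary_lemmas & corpus_lemmas
print(f"coverage: {len(covered) / len(dictionary_lemmas):.0%}")  # 80%
```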
This paper describes lessons learned from developing the most recent Balanced Corpus of Modern Latvian (LVK2018) from various online sources. Most new corpora are created from data obtained from various text holders, which requires a cooperation agreement with each of them. Reaching these cooperation agreements is a difficult and time-consuming task, and it may not be necessary if the resource to be created does not need to reach hundreds of millions of words. Although there are many different resources available on the Internet today for a particular language, finding viable online resources to create a balanced corpus is still a challenging task. Developing a balanced corpus from various online sources does not require agreements with text holders, but it presents many more technical challenges, including text extraction, cleaning and validation.
This paper describes ongoing work on the creation of Latvian language resources for the medical domain, focusing on digital imaging, in order to develop a medical speech recognition system for Latvian. The language resources include a pronunciation lexicon, a text corpus for language modelling, and an orthographically transcribed speech corpus for (i) the adaptation of the acoustic model, (ii) the evaluation of speech recognition accuracy, and (iii) the development and testing of rewrite rules for automatic text conversion to the spoken form and back to the written form. This work is part of a larger industry-driven research project which aims at the development of Latvian speech recognition systems specific to the medical domain.
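As a toy illustration of the kind of written-to-spoken rewrite rules mentioned above (the rules and vocabulary below are invented for this sketch, not taken from the project):

```python
# Regex-based rewrite of written forms into spoken forms.
import re

RULES = [
    (re.compile(r"\b(\d+)\s*mm\b"), r"\1 milimetri"),   # unit expansion
    (re.compile(r"\bCT\b"), "datortomogrāfija"),        # abbreviation
]

def to_spoken(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(to_spoken("CT: veidojums 12 mm"))
# -> "datortomogrāfija: veidojums 12 milimetri"
```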
This paper reports on the development of spell checking and morphological analysis tools for Latgalian. The Latgalian written language is a historic variant of the Latvian language. A wide range of language analysis tools is available for Latvian, whereas the Latgalian language lacks such tools. The work is a joint effort of linguists, who work on the creation of a morphologically annotated lexicon, and IT specialists, who work on language tool development. For the creation of the morphological analysis tool, we reuse the FST technology used for the Latvian morphological analyzer. We create a spelling dictionary that can be used with the Hunspell engine. All tools are accessible via a Web Service. At present, the Latgalian lexicon contains 13,139 lemmas assigned to 105 inflection groups, and the work of replenishing the lexicon continues.
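Once the .dic and .aff files exist, they can be consumed by any Hunspell front end; a minimal sketch using the pyhunspell bindings follows (the file names and test words are hypothetical placeholders, not the project’s actual files):

```python
# Checking spelling against a Hunspell lexicon (paths are placeholders).
import hunspell

h = hunspell.HunSpell("ltg_LV.dic", "ltg_LV.aff")  # hypothetical files
print(h.spell("sova"))      # True if the form is licensed by the lexicon
print(h.suggest("sovva"))   # ranked correction candidates
```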
This study continues work in progress on implementing a full-text lexical semantic tagger for Finnish, FiST. The tagger is based on a 46,226-lexeme semantic lexicon of Finnish that was published in 2016 [1]. Kettunen [2], [3] describes the basic working version of FiST, which is built from freely available components: the first implementation uses Omorfi and FinnPos for the morphological analysis and disambiguation of Finnish words. The current paper describes work on compound splitting for semantic tagging and its effects on the lexical coverage of the tagger. We try out two different approaches to the morphological analysis and disambiguation of words for an improved version of FiST, FiSTComp: FinnPos [4] and the Turku Dependency Parser (UD1) [5], [6]. Both tools disambiguate the morphological interpretations of words and provide boundary markings for compounds, but the details and granularity of constituent decomposition vary. Our results with two-, three- and four-part compounds show that analysing compounds through their constituents with UD1 may improve the lexical coverage of the tagger by about 6.6 percentage points at best. Although we are able to make progress on the basic problems of compound splitting, the results are still initial and further work is needed, as compounds are a complex phenomenon.
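The backoff idea behind compound handling can be shown schematically: when a compound is missing from the semantic lexicon, it is tagged through the constituents supplied by the morphological analyser. The mini-lexicon and tag strings below are toy stand-ins, not entries from the actual 46,226-lexeme lexicon:

```python
# Constituent-based backoff lookup for compounds (toy data).
SEM_LEXICON = {"kirja": "Q4.1", "kauppa": "I2.2", "talo": "H1"}

def tag(word, constituents):
    if word in SEM_LEXICON:                       # direct hit
        return SEM_LEXICON[word]
    parts = [SEM_LEXICON.get(c) for c in constituents]
    if all(parts):                                # every part is known
        return "+".join(parts)                    # compositional tag
    return "UNKNOWN"

# "kirjakauppa" (bookshop) is absent, but its parts are covered:
print(tag("kirjakauppa", ["kirja", "kauppa"]))    # Q4.1+I2.2
```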