Ebook: Human Language Technologies – The Baltic Perspective

Proceedings of the Eighth International Conference Baltic HLT 2018

Series

Frontiers in Artificial Intelligence and Applications

Volume

307

Published

2018

Editors

Kadri Muischnek, Kaili Müürisep

ISBN

978-1-61499-911-9 (print) | 978-1-61499-912-6 (online)

Subject(s)

Artificial Intelligence, Computer Sciences, Mathematics & Statistics

Open Access

Description

Computational linguistics, speech processing, natural language processing and language technologies in general have all become increasingly important in an era of all-pervading technological development.

This book, Human Language Technologies – The Baltic Perspective, presents the proceedings of the 8th International Baltic Human Language Technologies Conference (Baltic HLT 2018), held in Tartu, Estonia, on 27-29 September 2018. The main aim of Baltic HLT is to provide a forum for sharing new ideas and recent advances in computational linguistics and related disciplines, and to promote cooperation between the research communities of the Baltic States and beyond.

The 24 articles in this volume cover a wide range of subjects, including machine translation, automatic morphology, text classification, various language resources, and NLP pipelines, as well as speech technology; the latter being the most popular topic with 8 papers.

Delivering an overview of the state-of-the-art language technologies from a Baltic perspective, the book will be of interest to all those whose work involves language processing in whatever form.


Contents

Front Matter

Pages

i - xii

Preface

It is our great pleasure to introduce the proceedings of the Eighth International Conference “Human Language Technologies – The Baltic Perspective” (Baltic HLT 2018).

Since its first edition in 2004, the main aim of Baltic HLT has been to provide a forum for sharing new ideas and recent advances in computational linguistics and related disciplines and to promote cooperation between the research communities of Baltic states and beyond.

The call for papers for the Eighth Baltic HLT encouraged authors to submit papers in the area of natural language processing and language technologies in general, with a special interest in research on languages spoken in the Baltic countries. We received 43 submissions; each submission was evaluated by three reviewers. We wish to express our gratitude to the members of the Programme Committee, who worked hard to review all submissions. Based on their scores and the comments they provided on the content and quality of the papers, 24 papers were accepted for presentation and publication.

Among the accepted submissions, speech technology was the most popular topic with 8 accepted papers, but in general the papers in this volume cover a wide range of topics: machine translation, automatic morphology, text classification, various language resources and NLP pipelines.

Completing the programme are the invited lectures by Peter Bell, Alexander Fraser and Martin Volk.

We hope that this year's conference “Human Language Technologies – The Baltic Perspective” is an intellectually stimulating event for all of you and that it will generate many new ideas that will help extend your own research.

This year is special for the computational linguistics research community in the Baltics, as Professor Rūta Petrauskaitė, the pioneer of corpus and computational linguistics in Lithuania and one of the initiators of the Baltic HLT Conference, celebrates her sixtieth anniversary. She is the founder of the Centre of Computational Linguistics at Vytautas Magnus University (Kaunas), where work in corpus and computational linguistics has been carried out for twenty-five years. Thanks to Rūta Petrauskaitė, corpora became an integral part of Lithuanian language research. Her research and work are well known not only in Lithuania but also internationally, and many scientists have been inspired by her ideas. Some research papers by her former students can also be found in this publication.

The organizers would like to express their gratitude to the supporters of this conference: the Centre of Excellence in Estonian Studies (European Regional Development Fund), Mooncascade OÜ, Lingvist Technologies OÜ, Tilde Company and the City of Tartu.


Advanced Rich Transcription System for Estonian Speech

Authors

Tanel Alumäe, Ottokar Tilk, Asadullah

Pages

1 - 8

DOI

10.3233/978-1-61499-912-6-1

Abstract

This paper describes the current TTÜ speech transcription system for Estonian speech. The system is designed to handle semi-spontaneous speech, such as broadcast conversations, lecture recordings and interviews recorded in diverse acoustic conditions. The system is based on the Kaldi toolkit. Multi-condition training using background noise profiles extracted automatically from untranscribed data is used to improve the robustness of the system. Out-of-vocabulary words are recovered using a phoneme n-gram based decoding subgraph and an FST-based phoneme-to-grapheme model. The system achieves a word error rate of 8.1% on a test set of broadcast conversations. The system also performs punctuation recovery and speaker identification. Speaker identification models are trained using a recently proposed weakly supervised training method.


Topic Interpretation Using Wordnet

Authors

Eduard Barbu, Heili Orav, Kadri Vare

Pages

9 - 17

DOI

10.3233/978-1-61499-912-6-9

Abstract

This is a preliminary study in topic interpretation for the Estonian language. We estimate empirically the best number of topics to compute for a 185 million newspaper corpus. To assess the difficulty of topics, they are independently labeled by two annotators and translated into English. The Estonian Wordnet and Princeton Wordnet are then used to compute the word pairs in the topics that have high taxonomic similarity.


Deeper Error Analysis of Lithuanian Morphological Analyzers

Authors

Loïc Boizou, Jurgita Kapočiūtė-Dzikienė, Erika Rimkutė

Pages

18 - 25

DOI

10.3233/978-1-61499-912-6-18

Abstract

In this research we continue the intrinsic evaluation of the two most popular publicly available Lithuanian morphological analyzer-lemmatizers, Lemuoklis and Semantika.lt. In our previous paper [1] we reported the comparative results of the shallow morphological analysis, mostly covering coarse-grained part-of-speech tags. The results were better for Semantika.lt on 3 domains (administrative, fiction and periodicals), but not on the scientific texts. The deeper analysis of the fine-grained morphological categories (case, gender, number, degree, tense, mood, person, and voice) gave a more precise account of the strengths and weaknesses of both analyzers. Further investigations showed that the higher performance of the Lemuoklis analyzer on the scientific domain is probably related to a more successful analysis of long-distance agreement, in spite of an overall slight superiority of the Semantika.lt analyzer.


Towards a Modern Text-to-Speech System for Latvian

Authors

Roberts Darģis, Ilze Auziņa

Pages

26 - 29

DOI

10.3233/978-1-61499-912-6-26

Abstract

Thanks to the advancements in neural networks there have been many new breakthroughs in human-like speech synthesis over the last couple of years. For Latvian, there have been no new publications about speech synthesis since 2010. The paper describes efforts to apply recent advancements in neural speech synthesis to Latvian using open-source tools.


Collection of Resources and Evaluation of Customer Support Chatbot

Authors

Daiga Deksne, Andrejs Vasiļjevs

Pages

30 - 37

DOI

10.3233/978-1-61499-912-6-30

Abstract

In this study, we propose a practical approach to creating a chatbot on the basis of data accumulated by customer support operations. The selected use case is a typical representative of a chatbot application in customer service centers that want to improve their efficiency and raise customer satisfaction. We show how company support information and logs from support interactions can serve as source data for creating a customer support chatbot. The chatbot developed in this use case is targeted at Latvian-speaking users and demonstrates an implementation of Q&A functionality in the Latvian language. We also propose a simple evaluation metric for chatbot responses to natural language questions. As practical chatbots cannot be perfect in providing appropriate answers to all user questions, this metric can be used to assess the readiness of the chatbot for being released to real users. In our experiment, a chatbot with a score of 0.45 showed positive results in a user survey.


Linguistically-Motivated Automatic Classification of Lithuanian Texts for Didactic Purposes

Authors

Gintarė Grigonytė, Jolanta Kovalevskaitė, Erika Rimkutė

Pages

38 - 46

DOI

10.3233/978-1-61499-912-6-38

Abstract

This paper presents an effort to provide a level-appropriate study corpus for Lithuanian language learners. The collected corpus includes levelled texts from study books and unlevelled texts from other sources. The main goal is to assign the level-appropriate labels (A1, A2, B1, B2) to texts from other sources. For automatic classification we use preselected surface features, based on text readability research, and shallow linguistic features. First, we train the model with levelled texts from study books; second, we apply the learned model to classifying other texts. The best classification results are achieved with the Logistic Regression method.


Estonian Morphology in the Giella Infrastructure

Authors

Heiki-Jaan Kaalep, Sjur Nørstebø Moshagen, Trond Trosterud

Pages

47 - 54

DOI

10.3233/978-1-61499-912-6-47

Abstract

Years ago, a dedicated infrastructure for facilitating language technology development was set up by the Divvun language technology development group at the University of Tromsø.

We present a case study of using the infrastructure while developing a morphological description for Estonian. We are interested in how the environment helps a linguist to focus on the actual linguistics, and supports reuse of code, by which we mean reuse of dedicated scripts and make-files to manipulate the language-specific source material into FST.

The paper is concerned with best practice for the different FST programming language files (lexc, twol and xfst): how one should write the expressions, and what conventions should be followed. In many cases there are alternative ways of describing the phenomena, and the choice will prove to be good or bad only after a considerable amount of effort has been put into following it.


F₀ in Lithuanian: The Indicator of Stress, Syllable Accent, or Intonation?

Authors

Asta Kazlauskienė, Regina Sabonytė

Pages

55 - 62

DOI

10.3233/978-1-61499-912-6-55

Abstract

This study is focused on analysing whether changes of F₀ in the Lithuanian language are influenced by: 1) stress and the type of a syllable accent (acute, circumflex); 2) the type of a sentence (declarative, exclamatory, interrogative); 3) phrase accent (focused word). The research material consists of the recordings of three female Standard Lithuanian speakers. Each of the samples was read 5 times. The analysis of the relation of F₀ in a stressed and an unstressed syllable, in an acute and a circumflex, in a focal and non-focal position, in declarative, exclamatory, interrogative sentences allows us to assume that the pitch is an indicator of intonation rather than of a lexical stress and a syllable accent.


The Speech Rhythm of the Lithuanian and Latvian Languages

Authors

Asta Kazlauskienė, Aistė Zigmantaitė

Pages

63 - 70

DOI

10.3233/978-1-61499-912-6-63

Abstract

Studies on the rhythm of living Baltic languages are scarce (especially comparative ones) and their conclusions are usually ambiguous or even controversial. The aim of this research was to determine the values of the acoustic correlates of the Lithuanian and Latvian languages and to compare them with the values found in other languages, identified by researchers as belonging to particular rhythm groups. The empirical material of this study consisted of 5 Lithuanian and 5 Latvian native speakers reading Aesop's fable The North Wind and the Sun in their respective native languages. The acoustic analysis of the audio recordings was performed using the Praat and Correlatore applications. The analysis showed that the values of the acoustic rhythm correlates in the two Baltic languages were unequal. Nevertheless, according to the studied acoustic correlates, Lithuanian belongs to stress-timed languages, while Latvian is closer to syllable-timed languages.


The Legal Aspects of Using Data from Linguistic Experiments for Creating Language Resources

Authors

Jane Klavan, Arvi Tavast, Aleksei Kelli

Pages

71 - 78

DOI

10.3233/978-1-61499-912-6-71

Abstract

The present paper focuses on linguistic experiments as a source of language resources (LRs). It addresses some of the legal requirements regulating their collection. The primary focus of the paper is on processing personal data (PD), especially how the General Data Protection Regulation (GDPR) defines personal data, and what it means to anonymize personal data.


Self-Reading Texts and Books

Authors

Meelis Mihkla, Indrek Hein, Indrek Kiissel

Pages

79 - 87

DOI

10.3233/978-1-61499-912-6-79

Abstract

The rise of e-books, the cumulative digitisation of written library materials and the advancement of speech technology have reached a stage enabling library services and e-books to be read out loud to customers in synthetic speech and paper books (either published or still in print) to be delivered in the audio form. The user environment of the digital archive Digar of the Estonian National Library includes a special reading machine capable of producing an audio version of electronic texts in Estonian (books, magazines etc.). The application of Elisa Raamat provides access to more than 2500 Estonian e-books, which can not only be read visually from the screen of a smartphone or tablet but also listened to. The speech server of the Institute of the Estonian Language offers, as a public service, the text-to-speech system Vox populi, inviting people to have an audio version synthesized from any text of interest, being prepared to convert any uploaded text (an article, paper, subtitle file, e-book etc.) into an audio file. The present study is focused not only on the description of the systems but also on various issues of text processing and pronunciation as well as on the reflection of text structure in synthetic speech. The quality of self-reading largely depends on how adequately the input abbreviations, numbers and other non-letter sequences are converted into words in correct morphological form and how closely the output pronunciation of foreign names matches that of the source language. In the article we will also discuss a special module for text pre-processing, which helps in the case of more complex text structures and character sequences (e.g. geographic coordinates, sports results, numeral inflection). In addition, book reading requires as accurate a rendering of text structure as possible. The study also analyses audio books to capture the essence of human prosodic phrasing as well as different pauses and the marking of reported speech when talking.


Language Use in a Multilingual Tweet Corpus

Authors

Dmitrijs Milajevs

Pages

88 - 95

DOI

10.3233/978-1-61499-912-6-88

Abstract

A trilingual Latvian-Russian-English corpus of tweets is presented with an analysis of users, language and topics. The corpus consists of 1.4 million tweets that cover a period from April 2017 to July 2018. The language analysis reveals that the majority of users mostly use one language. Across topics, there is more Latvian content than in the whole collection. Among many potential use cases, the corpus can be used, for example, to study the public engagement of major Latvian media outlets and public figures, or the factors that determine language choice and content of a tweet.


Latvian FrameNet: Cross-Lingual Issues

Authors

Gunta Nešpore-Bērzkalne, Baiba Saulīte, Normunds Grūzītis

Pages

96 - 103

DOI

10.3233/978-1-61499-912-6-96

Abstract

This paper reports the lessons learned while creating a FrameNet-annotated text corpus of Latvian. This is still ongoing work, part of a larger project which aims at the creation of a multilayer text corpus, anchored in cross-lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet and PropBank, as well as Abstract Meaning Representation (AMR). For the FrameNet layer, we use the latest frame inventory of Berkeley FrameNet (BFN v1.7), while the annotation itself is done on top of the underlying UD layer. We strictly follow a corpus-driven approach, meaning that lexical units (LU) in Latvian FrameNet are created only based on the annotated corpus examples. Since we are aiming at a medium-sized yet general-purpose corpus, an important aspect that we take into account is the variety and balance of the corpus in terms of genres, domains and LUs. We have finished the first phase of the FrameNet corpus annotation, and we have collected cross-lingual issues, which we discuss together with their possible solutions. The issues are relevant for other languages as well, particularly if the goal is to maintain cross-lingual compatibility via BFN.


Speech-Based Identification of Children's Gender and Age with Neural Networks

Authors

Leo Kristopher Piel, Tanel Alumäe

Pages

104 - 111

DOI

10.3233/978-1-61499-912-6-104

Abstract

In this paper, we investigate using different types of neural networks for age and gender identification from children's speech, based on the Corpus of Estonian Adolescent Speech. Feed-forward deep neural networks using i-vectors as input are compared with recurrent neural networks using MFCCs as input. Results show that feed-forward neural networks outperform recurrent neural networks for gender classification, while a model that combines both i-vectors and MFCCs via feed-forward and recurrent branches achieves the best performance for age group classification. We also show that for age group classification, it is beneficial to first identify gender and then use a gender-specific age identification model. Experiments with human listeners show that the neural network models outperform humans on both tasks by a big margin.


Latvian Tweet Corpus and Investigation of Sentiment Analysis for Latvian

Authors

Mārcis Pinnis

Pages

112 - 119

DOI

10.3233/978-1-61499-912-6-112

Abstract

We present the Latvian Tweet Corpus and its application in sentiment analysis by comparing four different machine learning algorithms and a lexical classification method. We show that the best results are achieved by an averaged perceptron classifier. In our experiments, the more complex neural network-based classification methods (using recurrent neural networks and word embeddings) did not yield better results.


Extending Tēzaurs.lv Online Dictionary into a Morphological Lexicon

Authors

Lauma Pretkalniņa, Pēteris Paikens

Pages

120 - 125

DOI

10.3233/978-1-61499-912-6-120

Abstract

This paper describes the integration of traditional dictionary data from the Tēzaurs.lv online dictionary with a morphological analysis system, resulting in a structured morphological lexicon that improves coverage of morphological tagging and also enhances the dictionary with automatically generated inflection tables. The main challenge in this integration was the extraction of structured data from textual dictionary information, and classification of dictionary entries into morphological paradigms.


Impact of Corpora Quality on Neural Machine Translation

Authors

Matīss Rikters

Pages

126 - 133

DOI

10.3233/978-1-61499-912-6-126

Abstract

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often exhibit many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in data and how such data affects neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.


Advancing Estonian Machine Translation

Authors

Matīss Rikters, Mārcis Pinnis, Roberts Rozis

Pages

134 - 141

DOI

10.3233/978-1-61499-912-6-134

Abstract

In this paper, we present Tilde's work on boosting the output quality and availability of Estonian machine translation systems, focusing mostly on the less resourced and morphologically complex language pairs between Estonian and Russian. We describe our efforts on collecting parallel and monolingual data for the development of better neural machine translation models, as well as experiments with various model architectures with the goal to find the best-performing model for our data. We attain state-of-the-art MT results by training a multi-way Transformer model that improves the quality by up to +3.27 BLEU points over the baseline system. We also provide a publicly available translation service via a mobile phone application.


Mirthful and Polite Laughter: Acoustic Features

Authors

Regina Sabonytė

Pages

142 - 149

DOI

10.3233/978-1-61499-912-6-142

Abstract

This paper covers the analysis of the structure and acoustic features of structural units of mirthful and polite laughter. The research material consists of 100 samples of non-overlapping spontaneous laughter. The analysis revealed that the greatest difference between mirthful and polite laughter can be seen in the structure: polite laughter consists of one bout, whereas mirthful laughter consists of one or more bouts. Some of the differences can be seen in the duration of laughter units of the same structure and in the diapason of mean F0, F1, F2, shimmer, and jitter: the duration is longer and the diapason of other acoustic features of mirthful laughter is almost always wider than that of polite laughter.


General-Purpose Lithuanian Automatic Speech Recognition System

Authors

Askars Salimbajevs, Jurgita Kapočiūtė-Dzikienė

Pages

150 - 157

DOI

10.3233/978-1-61499-912-6-150

Abstract

This paper describes the development of a general-purpose automatic speech recognition system for Lithuanian. The system is capable of performing both the transcription of user-submitted audio recordings and real-time speech-to-text conversion. The comparative evaluation results prove that the presented system outperforms all other ASR systems for the Lithuanian language. The system also includes number and date normalization and is paired with an automatic punctuation restoration model that achieves state-of-the-art results for the Lithuanian language. Importantly, the system is publicly available to any Lithuanian speaker for testing via its web page and mobile application.


Czech & Slovak Corpus Resources Go (not only) Latvian

Authors

Michal Škrabal, Vladimír Benko

Pages

158 - 165

DOI

10.3233/978-1-61499-912-6-158

Abstract

As Latvian can still be considered an under-resourced language, several corpora and corpus tools that can be used for its linguistic research are presented in the paper, namely: the InterCorp and Araneum Lettonicum corpora along with the Treq database, a word-sketch grammar for Latvian and the Morfio tool.


Neural Morphological Tagging for Estonian

Authors

Alexander Tkachenko, Kairit Sirts

Pages

166 - 174

DOI

10.3233/978-1-61499-912-6-166

Abstract

We develop neural morphological tagging and disambiguation models for Estonian. First, we experiment with two neural architectures for morphological tagging – a standard multiclass classifier which treats each morphological tag as a single unit, and a sequence model which handles the morphological tags as sequences of morphological category values. Secondly, we complement these models with the analyses generated by a rule-based Estonian morphological analyser (MA) VABAMORF, thus performing a soft morphological disambiguation. We compare two ways of supplementing a neural morphological tagger with the MA outputs: firstly, by adding the combined analyses embeddings to the word representation input to the neural tagging model, and secondly, by adopting an attention mechanism to focus on the most relevant analyses generated by the MA. Experiments on three Estonian datasets show that our neural architectures consistently outperform the non-neural baselines, including HMM-disambiguated VABAMORF, while augmenting models with MA outputs results in a further performance boost for both models.


Low-Resource Translation Quality Estimation for Estonian

Authors

Elizaveta Yankovskaya, Mark Fishel

Pages

175 - 182

DOI

10.3233/978-1-61499-912-6-175

Abstract

Quality estimation is an essential step in applying machine translation systems in practice; however, state-of-the-art approaches require manual post-edits and other expensive resources. We introduce an approach to quality estimation that uses the attention weights of a neural machine translation system and can be applied to a translation produced by any machine translation system; a lighter version of the approach does not even require any post-edits. Our experiments with German-Estonian and English-Estonian translations show that its performance matches the state-of-the-art baseline.


NLP-PIPE: Latvian NLP Tool Pipeline

Authors

Artūrs Znotiņš, Elita Cīrule

Pages

183 - 189

DOI

10.3233/978-1-61499-912-6-183

Abstract

The paper introduces a modular pipeline that allows multiple Natural Language Processing tools to be combined into a unified framework. It aims to make NLP technology more accessible for researchers, non-experts and software developers. The paper describes the architecture of NLP-PIPE and presents publicly available NLP components for Latvian.

