
Ebook: Human Language Technologies – The Baltic Perspective

Throughout the last decade, the Baltic states have played an active role in regional and international language technology activities, supporting less-resourced languages in the digital age.
This book presents the proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016), held in Riga, Latvia, in October 2016. Baltic HLT 2016 provided a forum for sharing ideas and recent advances in human language processing with a special focus on less-resourced languages. Papers selected for the conference cover a wide range of topics, including a general overview of language technology progress in the Baltic states, current research topics in written and spoken language processing, the creation of language resources and their applications, and proposals for a European language platform. The book is divided into five sections: overview; speech technologies and corpora; machine translation; written language resources; and methods and tools for language processing.
The book will be a useful resource, not only for Baltic language researchers, but also for those working with other less-resourced languages in Europe and beyond.
The Baltic HLT conference, which began in Riga back in 2004, is now returning to the Latvian capital for the third time as the 7th international conference “Human Language Technologies – the Baltic Perspective”. What has happened during these 12 years?
The first Baltic HLT conference, over a decade ago, was a significant moment for researchers, developers, and industry representatives, as it was the first event to focus on language technology development in the Baltic region. This success inspired the organization of biennial conferences in Tallinn (2005), Kaunas (2007), Riga (2010), Tartu (2012), and Kaunas (2014). Since 2010, the conference proceedings have been published by IOS Press, and since 2012 they have been openly accessible online.
During the last decade, the Baltic states have played an active role in regional and international language technology activities, supporting less-resourced languages in the digital age. All three countries – Estonia, Latvia and Lithuania – have joined the CLARIN research infrastructure; researchers and developers have become part of the Multilingual Europe Technology Alliance META-NET; and institutions have initiated and participated in numerous projects in European Union Research and Innovation Framework Programmes.
National funding programs in all three Baltic states have played an important role in the research and development of human language technologies. Programs such as the National Program for Estonian Language Technology, the national program “The Lithuanian Language for Information Society,” and the IT Competence Centre Program in Latvia are helping to fill major gaps in language resources and tools, including those identified in the META-NET comparative study of European languages. Moreover, these programs support the transfer of research results into innovative applications. Many of these achievements are presented in the conference proceedings.
Continuing the tradition of previous conferences, Baltic HLT 2016 provides a forum for sharing new ideas and recent advances in human language processing. Its special focus is on less-resourced languages. More than 65 authors from 15 countries submitted their papers for blind review. Papers selected for the conference represent a wide range of topics, including a general overview of language technology progress in the Baltic states, current research topics in written and spoken language processing, the creation of language resources and their applications, and proposals on the European language platform.
We would like to express our gratitude to all members of the program committee for their hard work. We would also like to thank the authors for their contributions, which demonstrate the most important achievements in our region and reveal the latest trends in language technology research in the Baltic states. We hope that the proceedings will be a useful resource not only for Baltic language researchers, but also for those who work with other less-resourced languages in Europe and beyond.
Inguna Skadiņa
Chair of the Program Committee
We present the results of the Latvian IT Competence Centre (IT CC) in developing several essential language technologies and applications. Eleven language technology projects have been completed in the first phase of the IT CC work. We describe how IT CC has contributed to filling the gaps and improving the quality of the basic language technologies for Latvian in speech processing, machine translation, parsing and grammar checking, intelligent media monitoring, and multi-modal human-computer interaction.
The paper presents an overview of recent advances of language technologies in Lithuania. It is shown that the development of Lithuanian language resources and technologies can be divided into three stages: the first (2004–2012), the second (2012–2015), and the third (2016–2020). The paper focuses on the second stage of development, which is labelled as the systematic breakthrough. The paper contains separate sections on the policy of language technologies, research infrastructures and international collaboration, language resources, and tools.
This paper proposes a large-scale initiative for the creation of the European Platform of online multilingual services to cover the multilingual needs of the Digital Single Market. We describe the three layers of the Platform: a solutions layer, an infrastructure layer, and a research layer. The infrastructure layer will combine mature language technologies in four clouds: Automated Translation Cloud, Human-Computer Interaction Cloud, Multilingual Knowledge Management Cloud, and European Language Cloud, encompassing basic services and language resources. We identify the key gaps and recommend targeted research activities to provide equal technology coverage for all EU languages.
This paper presents experiments on the identification of Estonian consonants in VCV nonsense words produced by a human speaker. The test scenario involves audio-only and audiovisual speech stimuli in five noise conditions. The results confirm that in the presence of noise the consonant identification scores for audiovisual stimuli are always higher than those for audio-only stimuli. In addition, we compare the results with the previous study [9] reporting the perception of similar stimuli produced by a virtual talking head.
This paper describes the development of an automatic broadcast data transcription system for Lithuanian. The system performs fully automatic transcription of broadcast media recordings, including speech/non-speech detection, speaker diarization, speech-to-text conversion and automatic punctuation restoration. The system was developed in collaboration with the Baltic Media Monitoring Group (BMMG). The system is currently used in production for performing various broadcast speech monitoring tasks.
The paper provides an overview of the ongoing development of the Annotated Longitudinal Latvian Children's Speech Corpus. The authors outline the design of this corpus and the layers of annotation (both orthographic and part-of-speech tagging) with which the speech signal is enriched.
This paper deals with the recognition of disease codes and with hybrid recognition technology. So far it is impossible to recognize about 15,000 different diseases, but each disease can be identified by its code, which consists of one letter and some digits. An appropriate Lithuanian name was selected for each letter, and the recognition accuracy of these names, together with that of Lithuanian digit names, was investigated. By the hybrid approach we mean the combination of two different recognizers to achieve higher recognition accuracy. The first recognizer was an HTK-based Lithuanian recognizer; the second was a Spanish-language recognizer adapted to Lithuanian. The experimental results show that a hybrid decision-making rule learned by a “random forest” classifier decreases the recognition error on the Lithuanian digit names speech corpus by 74.1% and on the Lithuanian names speech corpus by 76.7% compared with the HTK-based Lithuanian recognizer when the speaker is unknown.
The output of generic automatic speech recognition systems consists of raw word sequences without any punctuation symbols. Longer sequences are difficult for humans to read and understand. In addition, many natural language understanding and processing tools expect their input to contain punctuation. We present a bidirectional recurrent neural network for punctuation restoration in speech utterances. The proposed model showed promising results, with F1-scores of 0.732 for commas and 0.708 for periods on raw output from a speech recognizer.
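As an illustration of how a per-symbol F1-score of this kind is typically computed, here is a minimal sketch. It assumes that each word is followed by one punctuation slot and that the reference and predicted label sequences are already aligned slot by slot; the function name and the label encoding are hypothetical, and the abstract does not describe the authors' evaluation setup.

```python
def punctuation_f1(reference, predicted, label):
    """Return (precision, recall, F1) for one punctuation label.

    `reference` and `predicted` are equally long lists with one label per
    word slot, e.g. "" (no punctuation), "," or "." (assumed encoding).
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned slot by slot")
    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == label and p == label)
    false_pos = sum(1 for r, p in pairs if r != label and p == label)
    false_neg = sum(1 for r, p in pairs if r == label and p != label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: six word slots with made-up labels.
reference = ["", ",", "", ".", ",", "."]
predicted = ["", ",", ",", ".", "", "."]
print(punctuation_f1(reference, predicted, ","))
print(punctuation_f1(reference, predicted, "."))
```

Scoring commas and periods separately, as the abstract does, shows which symbol the model handles less well.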
In this paper, we introduce the first dictation system for the Latvian language. We present its main features, the details of the automatic speech recognition (ASR) system used in this service, software architecture, and an evaluation of recognition quality. The service will provide Latvian-speaking people with the opportunity to dictate text in Latvian into their computers. The presented system achieved a word error rate (WER) of 23.86% in evaluation of dictation scenario, which is a good result for a system in the beta stage.
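For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below follows that standard definition. The function name and the whitespace tokenization are assumptions for illustration, not details taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

On this definition, a WER of 23.86% corresponds to roughly one word-level error for every four reference words.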
We describe experiments with syntax-based pre-reordering of Estonian for statistical translation into English. The reordering rules are designed manually to address the most prominent differences in typical constituent order between Estonian and English and are based on syntactic information obtained via parsing Estonian input sentences. In the experiments we obtained mixed results and present a discussion of the possible reasons.
This paper investigates a hybrid method for translation from English into Latvian by chaining an NMT system with an SMT system in order to cover out-of-vocabulary word translation. Unlike in other work, the primary translation is handled by the NMT system, and the SMT system acts as a secondary system. Automatic evaluation results show that the hybrid method improves NMT translation quality by up to three BLEU points.
This paper presents an attempt to improve a specific baseline hybrid machine translation (MT) combination system by using brute force and searching through all possibilities for the best combined translation instead of incrementally building the translation piece by piece. The result is an improved phrase-based multi-system MT system that improves the quality of the MT output compared to the baseline while taking much more time to produce the final output. The proposed approach shows an improvement of up to +3.34 BLEU points compared to the baselines and up to +3.61 BLEU points compared to related research.
Processing of multi-word expressions (MWEs) is a well-known ‘pain in the neck’ for human language technology researchers. The problem of MWE treatment affects almost any natural language processing task, including different levels of text analysis and automated translation. It is an extremely complicated task for machine translation (MT), as it includes identification, alignment and translation. Many online machine translation systems translate MWEs as phrases, not as one complex unit. This paper presents several experiments investigating possible ways in which a statistical MT system could learn translations of MWEs. Although no significant improvement was achieved in automatic evaluation, manual inspection of translations revealed some improvement in fluency and adequacy.
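The contrast between exhaustive and incremental combination can be sketched as follows. The sketch assumes a sentence has already been split into chunks, each with candidate translations from several MT systems, and that some scoring function ranks complete sentences. The names `best_combination` and `toy_score` and the bigram-counting scorer are hypothetical stand-ins; the abstract does not say how candidates are scored.

```python
from itertools import product

# Hypothetical stand-in for a real scoring model: a set of "good" word pairs.
KNOWN_BIGRAMS = {
    ("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat"),
}


def toy_score(sentence):
    """Count adjacent word pairs that appear in KNOWN_BIGRAMS."""
    words = sentence.split()
    return sum(1 for pair in zip(words, words[1:]) if pair in KNOWN_BIGRAMS)


def best_combination(chunk_candidates, score):
    """Score every possible combination of chunk translations; keep the best."""
    best_sentence, best_score = None, float("-inf")
    for combo in product(*chunk_candidates):  # one candidate per chunk
        sentence = " ".join(combo)
        current = score(sentence)
        if current > best_score:
            best_sentence, best_score = sentence, current
    return best_sentence, best_score


# Three chunks, two candidate translations each: 2 * 2 * 2 = 8 combinations.
candidates = [["the cat", "a cat"], ["sat on", "sits at"], ["the mat", "mat the"]]
print(best_combination(candidates, toy_score))
```

The number of combinations is the product of the candidate counts per chunk, so the search grows quickly with sentence length. This is consistent with the longer run time the abstract reports.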
This paper describes the ongoing work on the creation of the Lithuanian syntactically annotated corpus ALKSNIS, focusing on its structure and its morphological and syntactic annotation principles. The corpus is scheduled to be completed at the end of 2016, and it should reach about 2350 sentences from texts of various genres. ALKSNIS is based on a dependency model. The corpus is provided in two formats: PML (Prague Markup Language), as a core format, and PAULA XML. The compilation of the list of abbreviations for syntactic labels and the collection of information about the presentation of syntactic relations and dependencies were based, with some changes, on the experience of Czech researchers [1]. At present, 18 main syntactic labels (excluding variants) are used in ALKSNIS.
The paper presents ParliSearch, a system that enables easy discourse analysis in a large text corpus by providing stem search with additional search criteria. The system contains verbatim reports from debates of plenary sittings of the European Parliament and the Saeima (the Parliament of Latvia).
The National Library of Finland has digitized the historical newspapers and journals published in Finland between 1771 and 1910 [1,2]. The size of the whole collection up to 1910 is about 3.1 M pages. The newspaper collection contains approximately 1.961 million pages, mostly in Finnish and Swedish. The Finnish part of the collection consists of about 1 063 648 pages, and the Swedish part of 892 101 pages. Additionally, there are 11 548 pages in German and Russian. The Finnish part of the collection has about 2.407 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data delivery package of the whole text material has been produced recently and will be made publicly available later this year [3]. The quality of OCRed collections is an important topic in digital humanities, as it affects general usability, searchability and advanced processing, such as content mining, of collections [4,5]. There is no single available method to assess the quality of large collections, but different methods can be used to approximate quality. This paper uses corpus-analysis-style methods to approximate the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analyzers, frequency analysis of words, and comparisons with comparable edited lexical data of the same era. Our aim in the quality analysis is twofold: first, to analyze the present state of the lexical data, and second, to establish a set of methods that build up a compact procedure for quality assessment after, for example, re-OCRing or post-correction of the material.
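One of the listed methods, checking words with a morphological analyzer, can be approximated by a recognition rate: the share of tokens that the analyzer accepts. The sketch below assumes a simple interface in which the analyzer is any callable returning true for a recognized word. The small lexicon and the name `recognition_rate` are hypothetical placeholders, not the tools used in the paper.

```python
def recognition_rate(tokens, is_known):
    """Share of tokens accepted by `is_known`, a stand-in for an analyzer."""
    if not tokens:
        raise ValueError("token list must not be empty")
    recognized = sum(1 for token in tokens if is_known(token.lower()))
    return recognized / len(tokens)


# Hypothetical lexicon standing in for a real Finnish morphological analyzer.
lexicon = {"sanomalehti", "on", "vanha"}
tokens = ["Sanomalehti", "on", "vanba", "vanha"]  # "vanba" mimics an OCR error
print(recognition_rate(tokens, lexicon.__contains__))  # 0.75
```

A low rate does not by itself prove poor OCR, since historical spellings and proper names may also be rejected. That may be why the abstract combines several methods rather than relying on one.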
This paper presents the current status of the Latvian-Russian parallel corpus, which is an ongoing project within the Russian National Corpus. It discusses the existing parallel corpora including Latvian texts, availability of sources and the main principles and tools of alignment and morphological annotation, as well as further plans for developing the corpus.
In this paper we present the first Universal Dependency Treebank for Latvian. The Latvian UD Treebank contains approximately 1,000 sentences. It has been created from Latvian Treebank newswire texts by means of automatic conversion. This resource is an important prerequisite for integrating Latvian into various international language processing frameworks and for making Latvian data more accessible to international researchers. This paper also includes an analysis of the main conversion problems and describes known discrepancies between annotations in the Latvian UD Treebank and the Universal Dependency annotation guidelines.
The paper reports on recent work in the development of a grammar checker for Latvian. The grammar checker uses an extended context-free grammar (CFG) formalism to describe both correct and erroneous syntactic structures. The grammar checking engine uses both of these rule sets. The grammar checker is used by language learners as well as native speakers. Our recent work is directed at the creation of an error-annotated corpus of texts written by non-native speakers. Based on this corpus, the CFG rule set is refined.