The Latvian Treebank has been under development since 2010. In this paper we describe the latest developments in this project and the problems currently faced. We examine several gaps in our annotation scheme, such as the annotation of determinants, ellipsis and insertions, and describe the solutions we have chosen.
The usual steps of computational analysis of a text include morphological analysis, morphological disambiguation and (shallow) syntactic analysis. These steps are often carried out in linear order, with each step of the automatic analysis using the output of the previous step as its input, so the quality of each step depends on the quality of the previous steps. Our article is concerned with the impact of the first of these steps – morphological analysis – on (shallow) syntactic analysis. The analyzed language is Estonian – a language characterized by rich morphology and relatively free word order.
Knowledge base construction for dialogue systems is a time-consuming process which requires a considerable amount of attention from engineers. In this paper a tool for acquiring knowledge for dialogue systems is presented. The tool has been developed to facilitate the knowledge base creation process and takes adjacency pairs (or Frequently Asked Questions pairs) as input. The tool uses three simple text-mining algorithms to find keywords in the input pairs and outputs keyword-answer pairs to be used by a knowledge engineer or dialogue system developer in the process of knowledge base construction.
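The step from question-answer pairs to keyword-answer pairs can be illustrated with a minimal sketch. The abstract does not name the tool's three algorithms, so the TF-IDF-style scoring below is a stand-in chosen for illustration, and the names `tokenize`, `keyword_answer_pairs` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_answer_pairs(pairs, top_k=2):
    """Turn (question, answer) pairs into (keywords, answer) pairs.

    Assumption: keywords are the question words with the highest
    TF-IDF-style score; this is not necessarily what the tool does.
    """
    docs = [tokenize(question) for question, _ in pairs]
    # Number of questions each word occurs in.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n = len(pairs)

    result = []
    for tokens, (_, answer) in zip(docs, pairs):
        tf = Counter(tokens)
        scores = {w: tf[w] * math.log(n / doc_freq[w]) for w in tf}
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        # Words that occur in every question score 0 and are dropped.
        keywords = [w for w in ranked[:top_k] if scores[w] > 0]
        result.append((keywords, answer))
    return result


faq = [
    ("What are the opening hours?", "We are open 9 to 17."),
    ("What are the ticket prices?", "Tickets cost 5 euros."),
]
for keywords, answer in keyword_answer_pairs(faq):
    print(keywords, "->", answer)
```

Words shared by every question ("what", "are", "the") receive a score of zero, so only the distinguishing words remain as keywords for each answer.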
This paper proposes a method for encoding textual clinical data while it is being produced by practitioners. To perform this task, a prototypical user interface is introduced. The interface uses autocomplete-like technology and frequency distributions of terminology and common clinical phrases to offer the author terminologically and grammatically correct constructs. The aim of the suggestions is to facilitate the data input process, to produce grammatically and terminologically uniform clinical textual data in order to improve inter-doctor and doctor-patient communication, and to encode the entered texts in real time to facilitate machine processing of clinical notes.
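Frequency-ranked autocompletion of this kind can be sketched as follows. This is only an illustration under stated assumptions: the class `PhraseSuggester`, its `suggest` method and the sample phrases are hypothetical, and the prototype described in the abstract may rank and match suggestions differently.

```python
from collections import Counter


class PhraseSuggester:
    """Suggest completions for a typed prefix, most frequent first."""

    def __init__(self, phrases):
        # Frequency distribution over previously seen phrases.
        self.freq = Counter(p.strip().lower() for p in phrases)

    def suggest(self, prefix, limit=3):
        prefix = prefix.lower()
        matches = [(p, c) for p, c in self.freq.items() if p.startswith(prefix)]
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda pc: (-pc[1], pc[0]))
        return [p for p, _ in matches[:limit]]


# Hypothetical phrase history, made up for this example.
history = ["chest pain", "chest pain", "chest x-ray", "chronic cough"]
suggester = PhraseSuggester(history)
print(suggester.suggest("ch"))
```

A linear scan over all phrases is enough for a sketch; a larger vocabulary would more plausibly use a trie or a sorted index.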
Morphological analysis is an important task in Estonian learner language studies, as it gives information about the words and forms used by the learners. Spelling errors occur frequently in language learner texts, and the morphological analyser fails to find the correct analysis for misspelled words, so these texts should undergo an error correction step before conventional morphological analysis tools are applied. In this paper we compare several different spelling correction models with the aim of improving the lemmatisation accuracy of learner language texts. Experiments show that the simplest non-word noisy-channel spelling correction model, with a disambiguation model applied on top of the morphological analyser output, performs best, while some of the more complicated models fail to beat even the baseline that does not include any spelling correction.
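The idea behind non-word noisy-channel correction is to choose, for a word w not found in the lexicon, the candidate c that maximises P(c) · P(w | c). A minimal sketch follows, assuming a uniform error model over candidates one edit away, so that the ranking reduces to word frequency. The function names, the alphabet and the toy word counts are assumptions for illustration; the disambiguation step on the morphological analyser output mentioned in the abstract is not modelled.

```python
from collections import Counter

# Assumed alphabet for generating candidates (Estonian letters included).
ALPHABET = "abcdefghijklmnopqrsšzžtuvwõäöüxy"


def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, word_counts, error_prob=0.05):
    """Return the most probable correction for a non-word.

    Known words are returned unchanged (non-word correction only).
    """
    if word in word_counts:
        return word
    candidates = [c for c in edits1(word) if c in word_counts]
    if not candidates:
        return word
    total = sum(word_counts.values())
    # P(c) * P(w | c); the channel term is constant under the uniform assumption.
    return max(candidates, key=lambda c: (word_counts[c] / total * error_prob, c))


# Toy lexicon with made-up counts: kool 'school', kook 'cake', maja 'house'.
counts = Counter({"kool": 5, "kook": 1, "maja": 3})
print(correct("koul", counts))
```

A realistic error model would estimate P(w | c) from observed learner errors rather than treating all single edits as equally likely.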
Statistical machine translation (SMT) is an active research topic not only for languages with large quantities of parallel language resources available, but also for under-resourced languages, including the languages of the Baltic countries. Evaluation of SMT systems in the Baltic countries is mostly done using automatic metrics. In this paper we present a linguistic analysis of the output of an English-Latvian SMT system. The main purpose of this study was to obtain a clear understanding of the main pitfalls of current state-of-the-art SMT systems, to classify the main types of errors and to analyze possible reasons for these errors.
Email is an important source of information. Each day we receive many messages: some are related to our work, some are personal, and some are just advertisements. Standard email clients such as Microsoft Outlook and Thunderbird, and webmail services such as Hotmail and Gmail, have little support for relating different messages by topic or for generating summaries of messages. We suggest using various statistical and frequency-based methods to improve email management through automatic categorization and graph exploration.
The Estonian Emotional Speech Corpus is currently being investigated for the distinctive acoustic parameters of three emotions – anger, joy and sadness – and neutral speech, with a view to recognizable synthesis of emotions in Estonian speech. This article focuses on intensity as one of the parameters vital for emotion synthesis. The research question is whether the intensity of Estonian read speech is in any way affected by emotions. The Estonian Emotional Speech Corpus was used as the acoustic basis of the study. The intensity analysis comprised calculations of the means and ranges of the intensities of emotional and neutral speech. In addition, pairwise comparisons were made to find out whether intensity differs across emotions and in comparison with neutral speech in utterance-initial and utterance-final positions. The results revealed that mean intensities differ significantly between individual emotions as well as in comparison with neutral speech. The highest intensity was measured in neutral speech and the lowest in the utterances of sadness. Intensity ranges, however, were not significantly different between the utterance groups analysed. Intensity at the beginning and end of the utterance was also the highest in neutral speech and the lowest with sadness. Those two groups displayed the only statistically significant differences between the intensities of utterance beginnings as well as ends.
This paper discusses multimodal feedback signalling in Finnish first encounter conversations, especially the use of head nodding to signal shared understanding of the presented information. The goal of the paper is to study the correlation between gestures and speech, and to build a model describing to what extent head movements correlate with verbal feedback. We distinguish single and repeated nodding, as well as up-nods and down-nods, and hypothesise that down-nods are used as backchannels while up-nods signal unexpected information.
The current research extends the Asynchronous Dialogue System framework (ADS framework) – a software system that we implemented previously. The ADS framework is a collection of reusable components that can be used in developing text-based natural language dialogue systems. The framework is currently tailored for the Estonian language, yet it can also be used for English, as most of the components are language independent. The goal of our current research was to explore adaptability issues in dialogue systems. Mainly, we were looking at how to adjust the response of the system to the user's style in order to provide better interactions with the users. In text-based human-computer conversations on the internet, the user input is a written request to the dialogue system in a natural language and the output of the system is an answer to the user in the same language. To achieve our goal, we implemented two additional components for the ADS framework that analyze the style of the user input and adjust the output of the system accordingly.
The goal of the paper is to present different problems related to building a parallel corpus for two small languages, namely Latvian and Lithuanian. The Lithuanian-Latvian-Lithuanian Parallel Corpus (LILA) will contain 8 million running words and will be bidirectional and aligned at the sentence level. The problems include identifying, acquiring, preparing, and aligning parallel texts.
This paper describes our work on the identification, assessment and cataloguing of Latvian language resources for sharing through an open language resource infrastructure. This work was carried out in the META-NORD project, which is the Baltic and Nordic branch of the pan-European network META-NET. Criteria and results of the assessment are provided for the major groups of Latvian language resources and tools. Critical gaps are discussed and a general strategy to address them is outlined. The on-going work on selecting language resources and preparing their metadata for distribution on the pan-European sharing and distribution platform META-SHARE is also described.
This paper gives an overview of the strategies and national programmes related to the development of language technology in Estonia. It also briefly describes the international initiatives in which Estonian linguists and language technology researchers participate and which aim to build infrastructures for gathering and sharing language resources.
In the paper a model of semantic representation of sentences/texts that describe motion events is outlined. The task is to build a model that takes into account the ontological features of the entities involved.
This paper deals with average prosodic characteristics of Estonian as observed in 25 hours of manually transcribed spontaneous speech. The proposed investigations address the prosodic realisations on a lexical basis, as one of the goals is to make use of these characteristics within the lexical decision process of automatic speech recognition systems. As a first step, the speech corpus was split into word categories according to lexical frequency (most frequent vs. less frequent words), syllabic word length and gender. Average prosodic profiles were computed with respect to fundamental frequency, segment duration and intensity. Statistical analyses confirm a word-initial syllable stress with high average f0 and intensity values, which then progressively decrease towards word-final syllables. Prosodic profiles also reveal that the longer the words, the higher these word-initial stress values are. Whereas f0 and intensity profiles are very similar across word categories, duration profiles reveal less regular patterns. For the frequent word category, a rising duration can be observed from the first to the final vowels, whilst the other words show longer durations in monosyllabic words and in the final vowels of four-syllable words. Overall, our results suggest that prosodic cues could contribute to word boundary location in continuous speech.