Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish

Kanerva, Jenna; Luotolahti, Juhani; Laippala, Veronika; Ginter, Filip

doi:10.3233/978-1-61499-442-8-184

Authors

Jenna Kanerva, Juhani Luotolahti, Veronika Laippala, Filip Ginter

Pages

184 - 191

DOI

10.3233/978-1-61499-442-8-184

Series

Frontiers in Artificial Intelligence and Applications

Ebook

Volume 268: Human Language Technologies – The Baltic Perspective

Abstract

In this paper, we report on the development of a large-scale Finnish Internet parsebank, currently consisting of 1.5 billion tokens in 116 million sentences. The data is fully analyzed morphologically and syntactically, and has been used to extract flat and syntactic n-gram collections as well as verb-argument and noun-argument n-grams. In addition, distributional vector space representations of the words are induced using the word2vec method. All n-gram collections and the vector space models are made available under an open license.
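
The distinction the abstract draws between flat and syntactic n-grams can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy English sentence, its hand-written dependency analysis, the relation labels, and the function names `flat_ngrams` and `syntactic_bigrams` are hypothetical, and the abstract does not specify the exact n-gram definitions or data format used for the released collections.

```python
from collections import Counter

# Hypothetical toy parse, not taken from the parsebank:
# (word form, 1-based index of the head, relation); head 0 marks the root.
SENTENCE = [
    ("the", 2, "det"),
    ("dog", 3, "nsubj"),
    ("chased", 0, "root"),
    ("a", 5, "det"),
    ("cat", 3, "dobj"),
]


def flat_ngrams(tokens, n):
    """Return n-grams over the linear word order of the sentence."""
    forms = [form for form, _, _ in tokens]
    return [tuple(forms[i:i + n]) for i in range(len(forms) - n + 1)]


def syntactic_bigrams(tokens):
    """Return (head form, relation, dependent form) for every non-root arc."""
    return [
        (tokens[head - 1][0], rel, form)
        for form, head, rel in tokens
        if head != 0
    ]


# Counting over a corpus would accumulate these per sentence.
flat_counts = Counter(flat_ngrams(SENTENCE, 2))
arc_counts = Counter(syntactic_bigrams(SENTENCE))
```

In this toy sentence the verb and its object are linked by a single dependency arc, so the pair appears among the syntactic bigrams, while no flat bigram contains both words because a determiner stands between them.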
