The National Library of Finland has digitized the historical newspapers and journals published in Finland between 1771 and 1910 [1,2]. The size of the whole collection up to 1910 is about 3.1 M pages. The newspaper collection contains approximately 1.961 million pages mostly in Finnish and Swedish. Finnish part of the collection consists of about 1 063 648 pages, and Swedish part of 892 101 pages. Additionally there are 11 548 pages in German and Russian. Finnish part of the collection has about 2.407 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data delivery package of the whole text material has been produced recently and it will be made publicly available later this year . The quality of OCRed collections is an important topic in digital humanities, as it affects general usability, searchability and advanced processing, such as content mining, of collections [4,5]. There is no single available method to assess the quality of large collections, but different methods can be used to approximate quality. This paper uses corpus analysis style methods to approximate overall lexical quality of the Finnish part of the Digi collection. Methods include usage of parallel samples and word error rates, usage of morphological analyzers, frequency analysis of words and comparisons to comparable edited lexical data of the same era. Our aim in the quality analysis is twofold: firstly to analyze the present state of the lexical data and secondly, to establish a set of methods that build up a compact procedure for quality assessment after e.g. re-OCRing or post-correction of the material.
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
Tel.: +1 703 830 6300
Fax: +1 703 830 2300 firstname.lastname@example.org
(Corporate matters and books only) IOS Press c/o Accucoms US, Inc.
For North America Sales and Customer Service
West Point Commons
Lansdale PA 19446
Tel.: +1 866 855 8967
Fax: +1 215 660 5042 email@example.com