Truecasing, or capitalization, is the rewriting of each word of an input text with its proper case information. Many medical texts, especially those from legacy systems, are still written entirely in capital letters, which hampers their readability. We present a pilot study that uses the World Wide Web as a corpus to support automatic truecasing. The texts under scrutiny were German-language pathology reports. By submitting token bigrams to the Google Web search engine, we collected enough case information to achieve 81.3% accuracy for acronyms and 98.5% accuracy for normal words. This is all the more impressive because only half of the words used in this corpus existed in a standard medical dictionary, owing to the excessive use of ad hoc single-word nominal compounds in German. Our system performed less satisfactorily for spelling correction, and in three cases the proposed word substitutions altered the meaning of the input sentence. For routine deployment of this method, the dependency on a (black-box) search engine must be overcome, for example by using cloud-based Web n-gram services.
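The bigram idea can be illustrated with a minimal Python sketch. It is not the system evaluated in the study: the lookup table `TOY_OBSERVATIONS`, the helper `observed_casings`, and the simple vote-counting rule are all hypothetical stand-ins for whatever case evidence a search engine or n-gram service would return, and the counts are invented for illustration only.

```python
from collections import Counter

# Hypothetical stand-in for Web evidence: for each lowercased bigram query,
# the cased forms seen in results and how often each was seen (invented counts).
TOY_OBSERVATIONS = {
    ("kein", "anhalt"): Counter({("kein", "Anhalt"): 40, ("Kein", "Anhalt"): 5}),
    ("anhalt", "für"): Counter({("Anhalt", "für"): 50}),
    ("für", "malignität"): Counter({("für", "Malignität"): 30, ("Für", "Malignität"): 2}),
}


def observed_casings(bigram):
    """Return a Counter of cased bigram forms for a lowercased bigram (assumed API)."""
    return TOY_OBSERVATIONS.get(bigram, Counter())


def truecase(tokens):
    """Pick, for each token, the casing that collects the most bigram votes."""
    keys = [t.lower() for t in tokens]
    votes = [Counter() for _ in tokens]
    # Every adjacent pair is queried once; each token receives votes from
    # the bigram to its left and the bigram to its right.
    for i in range(len(keys) - 1):
        for (left, right), n in observed_casings((keys[i], keys[i + 1])).items():
            votes[i][left] += n
            votes[i + 1][right] += n
    # Tokens without any evidence are returned unchanged.
    return [v.most_common(1)[0][0] if v else tok for v, tok in zip(votes, tokens)]


print(truecase(["KEIN", "ANHALT", "FÜR", "MALIGNITÄT"]))
```

The sketch deliberately ignores sentence-initial capitalization, acronyms, and spelling variants, each of which would need separate handling in practice.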