Grepator: Accents & Case Mix for Thesaurus

Mary, Vincent; le Beux, Pierre

Abstract

There is a real need among researchers and students for pedagogical resources. In France, information retrieval techniques have been developed to meet it, for example in the Doc'CISMeF web site. As in PubMed, documents are indexed with (French) MeSH terms. One of the problems revealed by quality studies is the mismatch between user requests and the MeSH controlled vocabulary. Moreover, French (like Greek or Spanish) poses specific problems for indexing because of its diacritic characters.

In this article, we present the Grepator project. Its main goal is to convert any thesaurus (or any entry) to mixed case with accented characters, for a specific domain. Furthermore, Grepator has to complete MeSH terms according to their usual form in natural language and, finally, to correct user spelling mistakes. Grepator is based on a statistical approach. A large French medical corpus has been built from pedagogical resources indexed in CISMeF. Using regular expressions, Grepator searches this corpus for the most usual way to spell each word. Seventy-five percent of MeSH terms are found in the corpus using this method, with fewer than one mistake per hundred words. We analyze this first evaluation of the tool and discuss further steps that might be developed.
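
The corpus lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the variant table `VARIANTS`, the function names `variant_pattern` and `most_usual_form`, and the three-sentence sample corpus are all assumptions made for illustration. The idea is to turn an unaccented, upper-case term into a regular expression that accepts accented variants of each letter, then keep the spelling that occurs most often in the corpus.

```python
import re
from collections import Counter

# Hypothetical variant table (an assumption, not taken from the paper):
# each unaccented letter maps to the spellings accepted in its place.
VARIANTS = {
    "a": "aàâä", "c": "cç", "e": "eéèêë", "i": "iîï",
    "o": "oôö", "u": "uùûü", "y": "yÿ",
}


def variant_pattern(term):
    """Build a case-insensitive regex matching accented variants of `term`."""
    parts = []
    for ch in term.lower():
        if ch in VARIANTS:
            parts.append("[" + VARIANTS[ch] + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def most_usual_form(term, corpus):
    """Return the most frequent corpus spelling of `term`, or None if absent."""
    matches = variant_pattern(term).findall(corpus)
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]


# Tiny illustrative corpus (made up for this sketch).
corpus = (
    "L'hépatite virale est fréquente. "
    "Une hépatite aiguë. "
    "Hépatite B : vaccination."
)

print(most_usual_form("HEPATITE", corpus))
```

In this sample the lower-case accented form occurs twice and the capitalized form once, so the lower-case form is returned; a term with no match in the corpus yields `None`, which corresponds to the MeSH terms the method does not find.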
