An implementation of the Trigram Phrase Matching method for text similarity problems

Tardelli, Adalberto O.; Anção, Meide S.; Packer, Abel L.; Sigulem, Daniel

doi:10.3233/978-1-60750-946-2-43

Abstract

Representing texts as term vectors whose element values are calculated by a TFIDF method yields significant results in text similarity problems, such as finding related documents in bibliographic or full-text databases, identifying MeSH concepts in medical texts by a lexical approach, harmonizing journal citations in ISI/SciELO references, and normalizing author affiliations in MEDLINE. Our work takes “trigrams” as the terms (elements) of the term vector representing a text, following the Trigram Phrase Matching method published by the NLM's Indexing Initiative and its logarithmic Term Frequency – Inverse Document Frequency method for term weighting. Trigrams are overlapping 3-character strings extracted from a text by a couple of rules, and a trigram matching method may improve the probability of identifying synonymous phrases or similar texts. The matching process was implemented as a simple algorithm and requires a certain amount of computing resources, so we adopted an efficiency-focused implementation in C. In addition, some heuristic rules improved the efficiency of the method and made feasible a regular “find your scientific production in SciELO collection” information service. We describe an implementation of the Trigram Matching method, the software tool we developed, and a set of experimental parameters for the above results.
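
As a rough illustration of the idea summarized in the abstract, the sketch below extracts overlapping trigrams, weights them with a logarithmic TF-IDF, and compares texts by cosine similarity. It is written in Python for brevity and is not the authors' C implementation. The word-padding rule, the exact weighting formula, and the function names `trigrams`, `weight_vectors`, and `cosine` are assumptions made for this example; the abstract does not give the extraction rules or the heuristics actually used.

```python
import math
from collections import Counter


def trigrams(text):
    """Return overlapping 3-character strings, taken word by word.

    Assumption: each lowercased word is padded with one space on each
    side so that word boundaries are captured. The real extraction
    rules of the Trigram Phrase Matching method may differ.
    """
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def weight_vectors(texts):
    """Build one sparse trigram vector per text with log TF-IDF weights.

    Assumption: weight = (1 + ln tf) * ln(1 + N / df), a common
    logarithmic variant chosen here only for illustration.
    """
    counts = [Counter(trigrams(t)) for t in texts]
    n_docs = len(counts)
    # Document frequency: number of texts containing each trigram.
    doc_freq = Counter(g for c in counts for g in c)
    return [
        {g: (1 + math.log(tf)) * math.log(1 + n_docs / doc_freq[g])
         for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


texts = ["heart attack", "heart attacks", "renal failure"]
vecs = weight_vectors(texts)
print(cosine(vecs[0], vecs[1]))  # near-duplicate phrases share most trigrams
print(cosine(vecs[0], vecs[2]))  # unrelated phrases share none
```

Because similarity is computed on character trigrams rather than whole words, small spelling or inflection differences (such as a plural ending) still leave most of the vector in common, which is what makes the approach attractive for matching citations and affiliations.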
