Clustering algorithms usually assume that document collections are static and can be processed as a whole. However, in contexts where data is constantly being produced (e.g., the Web), systems that receive and process documents incrementally are becoming increasingly important. We propose OHDOCLUS, an online and hierarchical algorithm for document clustering. OHDOCLUS builds a tree of clusters in which each document is classified as soon as it is received. It is based on COBWEB and CLASSIT, two well-known data clustering algorithms that create hierarchies of probabilistic concepts and have seldom been applied to text data. An experimental evaluation was conducted on categorized corpora, and the preliminary results support the validity of the proposed method.