The crux of data compression is to process a string of bits in order, predicting each subsequent bit as accurately as possible. The accuracy of this prediction is reflected directly in compression effectiveness: the more probability the model assigns to the bit that actually occurs, the fewer bits are needed to encode it. Dynamic Markov Compression (DMC) uses a simple finite-state model that grows and adapts in response to each bit, and achieves state-of-the-art compression on a variety of data streams. While its performance on text is competitive with the best known techniques, its major strength is that it lacks prior assumptions about language and data encoding, and therefore works well for binary data such as executable programs and aircraft telemetry. The DMC model alone may be used to predict any activity represented as a stream of bits. For example, DMC plays “Rock, Paper, Scissors” quite effectively against humans. Recently, DMC has been shown to be applicable to email and web spam detection, where it is among the best known techniques. The reasons for its effectiveness in this domain are not completely understood, because DMC performs poorly on some other standard text classification tasks. I conjecture that the reason is DMC's ability to process non-linguistic information, such as email headers, and to predict the nature of polymorphic spam rather than relying on fixed features to identify it. In this presentation I describe DMC and its application to classification and prediction, particularly in environments where specific patterns of data and behavior cannot be anticipated and may be chosen by an adversary so as to defeat classification and prediction.
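To make the idea of a finite-state model that grows with each bit concrete, the following is a minimal sketch of a DMC-style bit predictor. It is not taken from this abstract. It follows the commonly described state-cloning heuristic, and the class name `DMCPredictor`, the thresholds `min_edge` and `min_rest`, the initial count `prior`, and the single-state starting model are all illustrative assumptions. The helper `code_length_bits` sums the ideal code length, -log2 of the probability given to each observed bit, to show how prediction accuracy translates into compressed size.

```python
import math


class DMCPredictor:
    """Sketch of a DMC-style adaptive bit predictor (illustrative assumptions)."""

    def __init__(self, min_edge=2.0, min_rest=2.0, prior=0.2):
        # Assumed starting model: one state whose two transitions loop to itself.
        self.next = [[0, 0]]            # next[state][bit] -> successor state
        self.count = [[prior, prior]]   # count[state][bit] -> transition weight
        self.state = 0
        self.min_edge = min_edge        # assumed cloning threshold on the edge used
        self.min_rest = min_rest        # assumed threshold on the target's other traffic

    def p_one(self):
        """Estimated probability that the next bit is 1."""
        c0, c1 = self.count[self.state]
        return c1 / (c0 + c1)

    def update(self, bit):
        """Adapt the model to an observed bit, cloning the target state if warranted."""
        s = self.state
        t = self.next[s][bit]
        edge = self.count[s][bit]
        total_t = self.count[t][0] + self.count[t][1]
        # Clone t when this edge is well used and t also carries other traffic.
        if edge > self.min_edge and total_t - edge > self.min_rest:
            ratio = edge / total_t
            clone = [self.count[t][0] * ratio, self.count[t][1] * ratio]
            self.count[t][0] -= clone[0]
            self.count[t][1] -= clone[1]
            self.next.append(list(self.next[t]))   # clone inherits t's successors
            self.count.append(clone)
            self.next[s][bit] = len(self.next) - 1  # redirect the edge to the clone
        self.count[s][bit] += 1.0
        self.state = self.next[s][bit]


def code_length_bits(bits):
    """Ideal code length, in bits, if each bit is coded with the model's prediction."""
    model = DMCPredictor()
    total = 0.0
    for b in bits:
        p1 = model.p_one()
        total += -math.log2(p1 if b else 1.0 - p1)
        model.update(b)
    return total
```

On a strongly patterned input such as `[0, 1] * 500`, `code_length_bits` should return far fewer than the 1000 bits of the raw stream, because cloning separates the contexts in which 0 and 1 occur. A practical compressor would pair such a predictor with an arithmetic coder and typically a richer initial model. The same per-bit cost can also serve as a classification score, by comparing the code length of a message under models trained on different classes.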