The ever-growing flood of data arises from many different sources: huge databases store increasing amounts of data on customers, sales, expenses, travel, credit card usage, tax payments and the like; the Web gives us access to practically unlimited information in the form of text, images, videos and sound. Everything we do leaves an information trail which can potentially be stored and used. Most of the time, the uses are beneficial (providing us with a list of our expenses, information on our next trip or the delivery of some book we bought on the Web ...), but sometimes this information can also be exploited by rogues to put us at risk (identity theft, credit card fraud, intrusion into a sensitive computer network, terrorism ...). Information security breaches are thus becoming an everyday threat for the citizens of modern societies and, when linked to terrorism, an increasingly dangerous one. However, discriminating good uses from bad ones is not easy if we want to maintain our standards of democracy and individual freedom. We therefore need to develop different security applications, both efficient and non-invasive, that can be deployed across a wide range of activities in order to deter and detect the bad uses. The main research challenges of such security applications are: to gather and share large amounts of data, or just to sample them when the volume is too big (data streams); to fuse data from different origins (such as numerical data and text); to extract the relevant information in the correct context; to develop effective user interfaces from which people can obtain quick interpretations and security alerts; and to preserve people's privacy.
All security issues share common traits: one has to handle enormous amounts of data (massive data sets) that are heterogeneous in nature, and one has to use a variety of techniques to store and analyze these data to achieve security. This requires an interdisciplinary approach, combining techniques from computer science, machine learning and statistics, computational linguistics, Web search and social networks with aggregation techniques that present results in a form the user can easily interpret. Academic research in all of the above areas has made great progress in recent years, while commercial applications now exist all around us (Amazon, eBay, Google, Yahoo!, ...). The real power for security applications will come from the synergy of academic and commercial research focusing on the specific issue of security.
Special constraints, which are not always taken into consideration by academic research, apply to this domain and are critical for successful security applications:
Large volumes: techniques must be able to handle huge amounts of data, and perform ‘on-line’ computation;
Scalability: algorithms must have processing times that scale well with ever-growing volumes;
Automation: the analysis process must be automated so that information extraction can ‘run on its own’;
Ease of use: ordinary citizens should be able to extract and assess the necessary information;
Robustness: systems must be able to cope with data of poor quality (missing or erroneous data).
The NATO Advanced Study Institute (ASI) on Mining Massive Data Sets for Security, held in Villa Cagnola, Gazzada, Varese (Italy) from 10 to 21 September 2007, brought together around 90 participants to discuss these issues. The scientific program consisted of invited lectures, oral presentations and posters from participants. The present volume includes the most important contributions, but it cannot, of course, fully reflect the lively interactions which allowed the participants to exchange their views and share their experience.
The book is organized along the five themes of the workshop, providing both introductory reviews and state-of-the-art contributions, thus giving the reader a comprehensive view of each theme. The bridge between academic methods and industrial constraints is systematically discussed throughout. This volume will thus serve as a reference book for anyone interested in understanding the techniques for handling very large data sets and how to apply them in combination to solve security issues.
Section 1 on Data Mining brings together contributions on algorithms for learning from large data sets. Section 2 on Search highlights the problems of scale and the threats of the Web. Section 3 on Social Networks presents the theoretical tools and various issues around very large network structures. Section 4 on Text Mining focuses on techniques to extract structured information from multilingual and very large text collections. Finally, Section 5 presents various applications of these techniques to security: fraud, money laundering, intelligence, terrorism, geolocalization, intrusion.
The ASI event and the publication of this book have been co-funded by the NATO Program ‘Science for Peace and Security’, by the European Commission's (EC) Enlargement Program and the EC-funded PASCAL Network of Excellence. All lectures have been filmed and are now freely accessible via the website http://videolectures.net/.
The NATO ASI was made possible through the efforts of the four co-Directors Clive Best (JRC, Ispra, Italy), Françoise Fogelman-Soulié (Kxen, France), Patrick Gallinari (University of Paris 6 – LIP6, France) and Naftali Tishby (Hebrew University, Israel), as well as the other members of the program committee: Léon Bottou (NEC Labs America, USA), Lee Giles (Pennsylvania State University, USA), Domenico Perrotta (JRC, Ispra, Italy), Jakub Piskorski (JRC, Ispra, Italy) and Ralf Steinberger (JRC, Ispra, Italy). We would like to thank the additional reviewers, whose valuable feedback helped to improve the quality of this book: Spyros Arsenis, Thierry Artières, Carlo Ferigato, Nicholas Heard, Hristo Tanev, Cristina Versino, Dennis Wilkinson and Roman Yangarber. Our special thanks go to Sabine Gross (Public Relations, JRC Ispra) for her dedication and the endless hours she spent organizing the ASI. Without her, the event would not have taken place.
June 2008
Clive Best and Françoise Fogelman-Soulié
Domenico Perrotta, Jakub Piskorski and Ralf Steinberger