
Ebook: Mining Massive Data Sets for Security

The ever growing flood of data arises from many different sources: huge databases store increasing amounts of data on customers, sales, expenses, travel, credit card usage, tax payments and the like; the Web gives us access to basically unlimited information in the form of text, images, videos and sound. Everything we do leaves an information trail which can potentially be stored and used. Most of the time the uses are beneficial (providing us with a list of our expenses, information on our next trip or the delivery of a book we bought on the Web ...), but sometimes this information can also be exploited by rogues to put us at risk (identity theft, credit card fraud, intrusion into a sensitive computer network, terrorism ...). Information security breaches are thus becoming an everyday threat for the citizens of modern societies and – when linked to terrorism – increasingly dangerous. However, discriminating good uses from bad ones is not easy if we want to maintain our standards of democracy and individual freedom. We therefore need to develop security applications – both efficient and non-invasive – that can be deployed across a wide range of activities in order to deter and detect the bad uses. The main research challenges of such security applications are: to gather and share large amounts of data, or to sample them when the volume is too big (data streams); to fuse data from different origins (numerical, textual, ...); to extract the relevant information in the correct context; to develop effective user interfaces from which people can obtain quick interpretations and security alerts; and to preserve people's privacy.
All security issues share common traits: one has to handle enormous amounts of heterogeneous data (massive data sets), and one has to use a variety of techniques to store and analyze these data in order to achieve security. This requires an interdisciplinary approach, combining techniques from computer science, machine learning and statistics, computational linguistics, Web search and social network analysis, together with aggregation techniques that present the results in a form easily interpretable by the user. Academic research in all of these areas has made great progress in recent years, while commercial applications now exist all around us (Amazon, eBay, Google, Yahoo!, ...). The real power for security applications will come from the synergy of academic and commercial research focusing on the specific issue of security.
Special constraints apply to this domain, which are not always taken into consideration by academic research, but are critical for successful security applications:
Large volumes: techniques must be able to handle huge amounts of data, and perform ‘on-line’ computation;
Scalability: algorithms must have processing times that scale well with ever growing volumes;
Automation: the analysis process must be automated so that information extraction can ‘run on its own’;
Ease of use: everyday citizens should be able to extract and assess the necessary information;
Robustness: systems must be able to cope with data of poor quality (missing or erroneous data).
The NATO Advanced Study Institute (ASI) on Mining Massive Data Sets for Security, held in Villa Cagnola, Gazzada, Varese (Italy) from 10 to 21 September 2007, brought together around 90 participants to discuss these issues. The scientific program consisted of invited lectures, oral presentations and posters from participants. The present volume includes the most important contributions, but it cannot, of course, entirely reflect the lively interactions which allowed the participants to exchange their views and share their experience.
The book is organized along the five themes of the workshop, providing both introductory reviews and state-of-the-art contributions, thus giving the reader a comprehensive view of each theme. The bridge between academic methods and industrial constraints is discussed systematically throughout. This volume will thus serve as a reference book for anyone interested in understanding the techniques for handling very large data sets and in how to apply them in combination to solve security issues.
Section 1 on Data Mining brings together contributions on algorithms for learning from large data sets. Section 2 on Search highlights the problems of scale and the threats present on the Web. Section 3 on Social Networks presents theoretical tools and various issues around very large network structures. Section 4 on Text Mining focuses on techniques for extracting structured information from multilingual and very large text collections. Finally, Section 5 presents various applications of these techniques to security: fraud, money laundering, intelligence, terrorism, geolocalization and intrusion detection.
The ASI event and the publication of this book have been co-funded by the NATO Program ‘Science for Peace and Security’, by the European Commission's (EC) Enlargement Program and the EC-funded PASCAL Network of Excellence. All lectures have been filmed and are now freely accessible via the website http://videolectures.net/.
The NATO ASI was made possible through the efforts of the four co-Directors Clive Best (JRC, Ispra, Italy), Françoise Fogelman-Soulié (Kxen, France), Patrick Gallinari (University of Paris 6 – LIP6, France) and Naftali Tishby (Hebrew University, Israel), as well as of the other members of the program committee: Léon Bottou (NEC Labs America, USA), Lee Giles (Pennsylvania State University, USA), Domenico Perrotta (JRC, Ispra, Italy), Jakub Piskorski (JRC, Ispra, Italy) and Ralf Steinberger (JRC, Ispra, Italy). We would like to thank the additional reviewers, whose valuable feedback helped to improve the quality of this book: Spyros Arsenis, Thierry Artières, Carlo Ferigato, Nicholas Heard, Hristo Tanev, Cristina Versino, Dennis Wilkinson and Roman Yangarber. Our special thanks go to Sabine Gross (Public Relations, JRC Ispra) for her dedication and the endless hours she spent on organizing the ASI. Without her, the event would not have taken place.
June 2008
Clive Best and Françoise Fogelman-Soulié
Domenico Perrotta, Jakub Piskorski and Ralf Steinberger
The classical setting of the supervised learning problem is to learn a decision rule from labeled data, where the data are points in some space 𝒳 and the labels are in {+1, −1}. In this paper we consider an extension of this supervised setting: given training vectors in the space 𝒳 along with their labels, together with a description of the same data in another space 𝒳*, find in the space 𝒳 a decision rule better than the one found in the classical setting [1]. Thus, in this setting two spaces are used to describe the training data, while the test data are given only in the space 𝒳. Using SVM-type algorithms, we demonstrate the potential advantage of the new setting.
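As a rough illustration of the setting just described, one formulation from the learning-using-privileged-information literature, usually called SVM+, replaces the usual slack variables by a correcting function evaluated in the second space. The sketch below assumes linear correcting functions; γ and C are trade-off parameters and ℓ is the number of training examples:

    \min_{w,\,b,\,w^*,\,b^*} \;\; \frac{1}{2}\|w\|^2 \;+\; \frac{\gamma}{2}\|w^*\|^2
        \;+\; C \sum_{i=1}^{\ell} \bigl( \langle w^*, x_i^* \rangle + b^* \bigr)

    \text{subject to} \quad
        y_i \bigl( \langle w, x_i \rangle + b \bigr) \;\ge\; 1 - \bigl( \langle w^*, x_i^* \rangle + b^* \bigr),
        \qquad
        \langle w^*, x_i^* \rangle + b^* \;\ge\; 0,
        \qquad i = 1, \dots, \ell.

At test time only w and b are used, since the privileged descriptions x_i^* exist for the training data alone.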
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.
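To make the large-scale argument concrete, here is a minimal Python sketch (not taken from the chapter) of plain stochastic gradient descent for a regularized linear classifier with hinge loss; the learning-rate schedule and regularization constant are illustrative assumptions:

    import random

    def sgd_hinge(data, dim, epochs=5, lam=1e-4, eta0=0.1):
        """Train a linear classifier w with L2-regularized hinge loss by SGD.

        data: list of (x, y) pairs, x a list of floats, y in {+1, -1}.
        Each update touches a single example, so the cost per step is
        independent of the data set size -- the property that makes SGD
        attractive for large-scale learning.
        """
        w = [0.0] * dim
        t = 0
        for _ in range(epochs):
            random.shuffle(data)
            for x, y in data:
                t += 1
                eta = eta0 / (1.0 + eta0 * lam * t)          # decreasing step size
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                w = [wi * (1.0 - eta * lam) for wi in w]     # regularization shrinkage
                if margin < 1.0:                             # hinge loss is active
                    w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        return w

    # toy usage on two separable points
    model = sgd_hinge([([1.0, 2.0], 1), ([-1.0, -2.0], -1)], dim=2)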
Feature selection encompasses a wide variety of methods for selecting a restricted number of input variables or “features” which are “relevant” to the problem at hand. In this chapter, we guide practitioners through the maze of methods that have recently appeared in the literature, particularly for supervised feature selection. Starting from the simplest methods of feature ranking with correlation coefficients, we branch out in various directions and explore topics including “conditional relevance”, “local relevance”, “multivariate selection” and “causal relevance”. We make recommendations for assessment methods and stress the importance of matching the complexity of the method employed to the available amount of training data. Software and teaching material associated with this tutorial are available [12].
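As an illustration of the simplest method mentioned above, the hypothetical sketch below ranks features by the absolute value of their Pearson correlation with the target; it is not code from the chapter's accompanying software:

    import math

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

    def rank_features(X, y):
        """X: list of samples (each a list of feature values); y: targets.
        Returns feature indices sorted by decreasing |correlation| with y."""
        n_features = len(X[0])
        scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
        return sorted(range(n_features), key=lambda j: -scores[j])

    # toy usage: feature 0 tracks the target, feature 1 is noise
    print(rank_features([[1, 5], [2, 3], [3, 8], [4, 1]], [1, 2, 3, 4]))   # -> [0, 1]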
Today data mining is used more and more extensively by highly competitive enterprises. This development, driven by the increasing availability of massive datasets, is only possible if solutions are found to challenges that are both theoretical and operational: identifying algorithms which can produce models when datasets have thousands of variables and millions of observations; learning how to run and control the correct execution of hundreds of models; and automating the data mining process. We will present these constraints in industrial contexts, show how to exploit theoretical results (coming from Vladimir Vapnik's work) to produce robust models, and give a few examples of real-life applications. We will thus demonstrate that it is indeed possible to industrialize data mining so as to turn it into an easy-to-use component whenever data is available.
Labeling data is expensive, whilst unlabeled data is often abundant and cheap to collect. Semi-supervised learning algorithms that use both types of data can perform significantly better than supervised algorithms that use labeled data alone. However, for such gains to be observed, the amount of unlabeled data trained on should be relatively large. Therefore, making semi-supervised algorithms scalable is paramount. In this work we review several recent techniques for semi-supervised learning, and methods for improving the scalability of these algorithms.
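One of the simplest members of this family, self-training, gives a concrete feel for how unlabeled data enters the loop; the sketch below is a generic illustration (the `train` and `predict_with_confidence` callables are placeholders, not an API from the chapter):

    def self_train(train, predict_with_confidence, labeled, unlabeled,
                   rounds=5, threshold=0.9):
        """Generic self-training loop (illustrative sketch).

        train(labeled)                -> model
        predict_with_confidence(m, x) -> (label, confidence in [0, 1])
        labeled:   list of (x, y) pairs
        unlabeled: list of x
        """
        labeled, unlabeled = list(labeled), list(unlabeled)
        for _ in range(rounds):
            model = train(labeled)
            confident, rest = [], []
            for x in unlabeled:
                label, conf = predict_with_confidence(model, x)
                (confident if conf >= threshold else rest).append((x, label))
            if not confident:
                break
            labeled += confident                    # add pseudo-labeled points
            unlabeled = [x for x, _ in rest]
        return train(labeled)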
This paper presents a survey of usage mining (UM) tasks and of related machine learning and data mining problems. It first proposes a panorama of UM applications, grouped into a few categories according to the similarity of the tasks from the machine learning point of view. It then discusses in more detail usage mining for hypermedia and web-based applications.
Using unlabeled data to unravel the structure of the data and thereby support the learning process is the goal of semi-supervised learning. The kernel framework makes it possible to model the data structure using graphs and to build kernel machines such as the Laplacian SVM [1]. A drawback, however, is the lack of sparsity in the variables of the obtained model, which leads to long running times when classifying new points. We provide a way of alleviating this problem by using an L1 penalty, together with an algorithm to compute the solution efficiently. Empirical evidence shows the benefit of the algorithm.
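For orientation, the manifold-regularization objective behind the Laplacian SVM of [1] has the following standard form, with ℓ labeled and u unlabeled points, kernel norm and graph Laplacian L built from all ℓ + u points (sketched here from the standard formulation, not from the chapter itself):

    \min_{f \in \mathcal{H}_K} \;\;
        \frac{1}{\ell} \sum_{i=1}^{\ell} \max\bigl(0,\; 1 - y_i f(x_i)\bigr)
        \;+\; \gamma_A \, \|f\|_K^2
        \;+\; \frac{\gamma_I}{(\ell + u)^2}\, \mathbf{f}^{\top} L \,\mathbf{f},
    \qquad
    \mathbf{f} = \bigl(f(x_1), \dots, f(x_{\ell+u})\bigr)^{\top}.

The representer theorem expands f over all ℓ + u points, which is precisely where the lack of sparsity mentioned above comes from.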
We describe here a categorization system that can manage an arbitrary number of categories. It is based on a cascade of categorizers arranged as a tree. The categorizers are deployed on a set of machines to parallelize the processing.
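A hypothetical sketch of such a tree-structured cascade is given below; the routing functions and category names are made up for illustration:

    class CategorizerNode:
        """One node of a categorizer cascade arranged as a tree (illustrative sketch).

        `classify` is any callable mapping a document to the key of a child
        node, or to a final category label when no matching child exists."""
        def __init__(self, classify, children=None):
            self.classify = classify
            self.children = children or {}          # key -> CategorizerNode

        def categorize(self, document):
            outcome = self.classify(document)
            child = self.children.get(outcome)
            # descend the cascade until a leaf decision is reached
            return child.categorize(document) if child else outcome

    # toy usage: a two-level cascade (coarse router, then a finer categorizer)
    sports = CategorizerNode(lambda d: "football" if "goal" in d else "tennis")
    root = CategorizerNode(lambda d: "sports" if "match" in d else "other",
                           children={"sports": sports})
    print(root.categorize("the match ended with a late goal"))   # -> football

In a deployment like the one described above, different subtrees could simply be assigned to different machines.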
This paper provides an introduction to the field of data stream management and mining. The increase in data production in operational information systems makes it impossible to monitor these systems with the old paradigm of storing the data before analyzing it. New approaches have recently been developed to process data ‘on the fly’, treating them as structured data streams. These approaches cover both the querying and the mining of streaming data.
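As a minimal example of such ‘on the fly’ processing (not taken from the paper), the sketch below maintains a uniform random sample of a stream with reservoir sampling, using memory proportional to the sample size regardless of the stream length:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of size k from a stream of unknown length.

        Each element is seen exactly once and either stored or discarded,
        which is the defining constraint of data stream processing."""
        reservoir = []
        for n, item in enumerate(stream, start=1):
            if n <= k:
                reservoir.append(item)
            else:
                j = random.randrange(n)          # uniform in [0, n)
                if j < k:
                    reservoir[j] = item          # replace with probability k/n
        return reservoir

    print(reservoir_sample(range(10**6), 5))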
We propose a method to specify, in a modular way, complex systems formed by interacting agents. The method is based on the notion of a view, that is, a partial representation of the system reflecting one of its specific aspects. By composing the different views, we obtain the overall system, described as a special kind of transition system. By means of a suitable logical language, we can express interesting properties of the system; model-checking techniques can then be used to assess their validity. Views can be specified using different languages or notations, provided they can be translated into so-called agent-aware transition systems. The method is explained with the help of a simple but non-trivial example.
In the area of web search, there is a disconnect between some of the research in the field and what is actually done within a large-scale commercial search engine. Everything from measuring the wrong things to pursuing the wrong research objectives has the potential to produce work that, although academically interesting, has little commercial value. This paper attempts to reduce this gap by summarizing my experience as both a researcher and an engineer at commercial search engines.
The paper first focuses on the theoretical components of a commercial search engine and then describes several of the real-world issues and challenges faced by commercial search engines. Next, it briefly explores the issue of “relevance” and some challenges specific to studying and improving relevance in a commercial search engine.
In this article we introduce and briefly discuss the issue of privacy preservation in the publication of search engine query logs. In particular, we present a new privacy concern, website privacy, as a special case of business privacy.
A high ranking of a Web site in search engines can be directly correlated with high revenues. This amplifies the phenomenon of Web spamming, which can be defined as preparing or manipulating any features of Web documents or hosts in order to mislead search engines' ranking algorithms and gain an undeservedly high position in search results. Web spam considerably deteriorates the quality of the information available on the Web and thus affects the whole Web community, including search engines. The struggle between search engines and spammers is ongoing: both sides apply increasingly sophisticated techniques and counter-techniques against each other.
In this paper, we first present the general background of the Web spam phenomenon. We then explain why the machine learning approach is so attractive for combating Web spam. Finally, we provide the results of our experiments, aimed at answering certain open questions. We investigate the quality of the data provided in the Web Spam Reference Corpus, widely used by the research community as a benchmark, and propose some improvements. We also address the question of parameter tuning for cost-sensitive classifiers and examine the possibility of using linguistic features for distinguishing spam from non-spam.
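On the cost-sensitivity question, a common device (used here purely as an illustration, not as the chapter's setup) is to weight errors on the two classes differently, for instance by scaling the loss for the spam class in a linear hinge-loss learner; the cost values below are arbitrary:

    import random

    def weighted_sgd_hinge(data, dim, cost_pos=10.0, cost_neg=1.0,
                           epochs=10, eta=0.01, lam=1e-4):
        """Cost-sensitive linear classifier: errors on the positive class
        (say, spam, y = +1) are penalized cost_pos times as heavily as
        errors on the negative class (illustrative values)."""
        w = [0.0] * dim
        for _ in range(epochs):
            random.shuffle(data)
            for x, y in data:
                cost = cost_pos if y > 0 else cost_neg
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                w = [wi * (1.0 - eta * lam) for wi in w]
                if margin < 1.0:
                    # the class-dependent cost scales the gradient step
                    w = [wi + eta * cost * y * xi for wi, xi in zip(w, x)]
        return w

The cost ratio itself is a parameter to tune, typically by validating against a measure that reflects the asymmetric consequences of missing spam versus flagging legitimate pages.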
The Internet has enabled people to coactively create, share, rate and classify content on an unprecedented scale. Examples of coactive systems include open source software development, wiki communities, social news aggregators and many others. This paper discusses regularities in online coactive systems of thousands or millions of participants. First, user participation levels are shown to follow a heavy-tailed power-law distribution over their entire range, so that a small number of very active users make the vast majority of contributions. The power law arises from a simple rule in which the probability that a person stops contributing varies inversely with the number of contributions made so far. The power-law exponent is moreover demonstrated to discriminate between systems according to the effort required to contribute. Next, the level of activity per topic is shown to follow a heavy-tailed distribution as well, this time lognormal, generated by a simple stochastic popularity reinforcement mechanism. The vast majority of activity thus occurs among a small number of very popular topics. The trends are demonstrated to hold for four large independent online communities with different scopes and purposes.
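The stopping rule is easy to simulate; in the sketch below (an illustration with made-up parameter values), a user who has made n contributions stops with probability min(1, c/n), and the resulting distribution of contributions per user is heavy-tailed:

    import random
    from collections import Counter

    def simulate_contributions(n_users=50000, c=0.8, cap=10**5):
        """Each user keeps contributing; after the n-th contribution the user
        stops with probability min(1, c/n). Returns contributions per user."""
        counts = []
        for _ in range(n_users):
            n = 0
            while n < cap:
                n += 1
                if random.random() < min(1.0, c / n):
                    break
            counts.append(n)
        return counts

    hist = Counter(simulate_contributions())
    # crude look at the tail: number of users with exactly n contributions
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, hist.get(n, 0))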
Information cascades are phenomena in which individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. Here we consider information cascades in the context of recommendations and information propagation on the blogosphere. In particular, we study the patterns of cascading recommendations that arise in large social networks. We review recent studies of cascading behavior in product recommendation networks and of information diffusion on the blogosphere. Next, we examine theoretical models of information, virus and influence propagation. Last, we present developments in selecting and targeting nodes in networks so as to maximize influence or to detect cascades and disease/information outbreaks effectively.
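A standard reference point among these models is the independent cascade process; the sketch below is a generic simulation (not tied to any particular study reviewed here) in which each newly activated node gets a single chance to activate each out-neighbour with a fixed probability p:

    import random

    def independent_cascade(graph, seeds, p=0.1):
        """Simulate one run of the independent cascade model.

        graph: dict mapping node -> list of out-neighbours
        seeds: initially active nodes
        Each newly active node tries once to activate each still-inactive
        neighbour, succeeding independently with probability p.
        Returns the set of nodes active when the cascade dies out."""
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for node in frontier:
                for neighbour in graph.get(node, []):
                    if neighbour not in active and random.random() < p:
                        active.add(neighbour)
                        next_frontier.append(neighbour)
            frontier = next_frontier
        return active

    # toy usage on a small directed graph
    g = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
    print(independent_cascade(g, seeds=["a"], p=0.5))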
This chapter presents a link analysis approach for analyzing large networks of entities extracted from document collections. We assume that an initial named entity recognition (NER) and relationship extraction step has been performed, and we are interested in analyzing the structure of the network of these entities. We start by presenting the basic definitions and then focus on centrality analysis and on computing equivalence classes of entities within the network. We use the network of the 9/11 hijackers as a running example.
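To fix ideas, the sketch below computes the simplest centrality measure, degree centrality, on an undirected entity network given as an edge list; the example edges are invented and unrelated to the chapter's running example:

    from collections import defaultdict

    def degree_centrality(edges):
        """Degree centrality of an undirected graph given as (u, v) pairs:
        the degree of each node divided by the maximum possible degree n - 1."""
        adjacency = defaultdict(set)
        for u, v in edges:
            adjacency[u].add(v)
            adjacency[v].add(u)
        n = len(adjacency)
        return {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}

    # invented example network
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
    print(degree_centrality(edges))   # "A" has the highest centrality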
Most real networks evolve through time: the topology changes as nodes and/or edges appear and/or disappear, and the types or weights of nodes and edges can also change even if the topology stays static. Mobile devices with wireless capabilities (mobile phones, laptops, etc.) are a typical example of an evolving network, where nodes or users are spread in the environment and connections between users can only occur if they are near each other. This who-is-near-whom network evolves every time users move, and communication services (such as the spread of any piece of information) depend heavily on this mobility and on the characteristics of the underlying network. This paper presents some recent results concerning the characterization of the dynamics of complex networks from three different angles: the evolution of parameters computed on snapshots of the network, parameters describing the evolution itself, and intermediate approaches that study specific phenomena or users of interest through time.
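A minimal, hypothetical illustration of the first angle, computing parameters on successive snapshots, could look like the following; the three edge lists are invented:

    def snapshot_statistics(snapshots):
        """snapshots: list of undirected edge lists, one per time step.
        Returns (number of nodes, number of edges, average degree) per snapshot."""
        stats = []
        for edges in snapshots:
            nodes = {endpoint for edge in edges for endpoint in edge}
            n, m = len(nodes), len(edges)
            avg_degree = 2.0 * m / n if n else 0.0
            stats.append((n, m, avg_degree))
        return stats

    # invented contact network observed at three time steps
    t0 = [("a", "b"), ("b", "c")]
    t1 = [("a", "b"), ("c", "d"), ("b", "d")]
    t2 = [("a", "d")]
    print(snapshot_statistics([t0, t1, t2]))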
Interactive visualization supports the analytical process: interacting with abstract views and using the data as a malleable material, analysts build hypotheses that can then be validated on the whole dataset. We use graph clustering to group elements and show them as meta-nodes, reducing the number of visual items, and we further organize the data into several layers in an effort to provide a multilevel model of the studied phenomenon. In doing so, hierarchical clustering not only contributes to the study of the data but also brings in ingredients for interaction, enabling the user to zoom in and out of clusters while exploring the data in search of evidence.
Anomaly detection on graphs of social or communication networks has important security applications. The definition of a graph anomaly typically depends on the data and the application of interest. The connections amongst the graph's entities are usually central to anomaly detection, and various methods have been developed for their analysis. We review these techniques and also discuss the challenges currently faced with high-dimensional dynamic graphs.
A lot of information is hidden in foreign-language text, making multilingual information extraction tools – and applications that allow cross-lingual information access – particularly useful. Only a few system developers offer their products for more than two or three languages; typically, they develop the tools for one language and then adapt them to the others. Guidelines on how to produce highly multilingual applications with the least possible effort are scarce. Due to the multinational character of the European institutions, the text mining applications developed at the Joint Research Centre (JRC) had to be planned with multilinguality requirements in mind. The purpose of this chapter is to present the design principles behind these in-house developments and to show a number of different applications built according to these guidelines. The results of the text analysis applications presented here are visible on the publicly accessible multilingual news gathering, analysis and exploration tool NewsExplorer.
NewsExplorer currently covers 19 languages. It is publicly accessible at http://press.jrc.it/NewsExplorer.