Ebook: Security Informatics and Terrorism: Patrolling the Web
This book is based on presentations given at the workshop held at Ben-Gurion University.
For a variety of reasons, most, but not all, of the presentations could be included in this book. A complete list of the presenters and their contact information is given in Appendix 1. Readers can find slides of some of the presentations posted on: http://cmsprod.bgu.ac.il/Eng/conferences/Nato/Presentations.htm
Audience
This work is intended to be of interest to counter-terrorism experts and professionals, to academic researchers in information systems, computer science, political science, and public policy, and to graduate students in these areas.
The goal of this book is to highlight several aspects of patrolling the Web that were raised and discussed during the workshop by experts from different disciplines. The book includes academic studies from related technical fields, namely, computer science and information technology, the strategic point of view as presented by intelligence experts, and finally the practical point of view, presented by industry experts describing lessons learned from practical efforts to tackle these problems.
This volume is organized into four major parts: definition and analysis of the subject, data-mining techniques for terrorism informatics, other theoretical methods to detect terrorists on the Web, and practical relevant industrial experience on patrolling the Web.
Part I addresses the current status of the relationship between terrorists and the Internet. The presenters are experienced intelligence experts who describe the causes and impacts of terrorists' use of the Web, its current status, and governmental responses, and who provide an overview of methods for detecting and preventing terrorist use of the Web in different parts of the world.
Part II addresses data and Web mining techniques for identifying and detecting terrorists on the Web. The presenters are primarily computer scientists and they present recent studies suggesting data-mining techniques that are applicable for detecting terrorists and their activities on the Web.
Part III addresses theoretical techniques (other than data mining) applicable to terrorism informatics. The presenters are again computer and information scientists, but they propose computational methods that are not (presently) commonly used in terrorism informatics. These papers suggest new directions and promising techniques for the detection problem, such as visual recognition, information extraction and machine learning techniques.
Part IV reports on “learning from experience” and the presenters are industry practitioners who describe their applications and their experiences with operations attempting to patrol the Web.
Together, the participants worked to fashion a summary statement, drawing attention to the strengths and the limitations of our present efforts to patrol and limit the use of the World Wide Web by terrorist organizations. The summary statement may be regarded as representing a consensus, but the reader is cautioned that not every participant agrees with every element of the summary.
Summary Statement
As the proceedings show, a wide range of topics were discussed, and several points of view were presented. In a final session reflecting upon the entire workshop, participants identified a number of key points which should be kept in mind for future studies and efforts to limit the effectiveness of the World Wide Web as an aid to terrorists. These key points can be divided into two areas: 1) Social/Policy Issues, and 2) more narrow Technical Issues.
1. Social and Policy Issues
The Internet is used by terrorists for various activities such as recruitment, propaganda, operations, etc., without their physically meeting. But the Internet can be used to track the conspirators once they are identified. Some activities are open, others are hidden. The struggle against terrorism is the quintessential example of asymmetric warfare. In addition, terrorism stands on the ill-defined boundary where criminality, warfare, and non-governmental actors meet. It was agreed that, collectively, nations have the resources needed to counter terrorism, but that it is essential to share information in order to combat the geographically distributed nature of the terrorist organizations, which may have a very small footprint in any one nation. Therefore a key recommendation of the workshop is that: International cooperation is required among intelligence and law enforcement experts and computer scientists.
Other policy issues concern the interplay between the called-for cooperation, and the rights of individual citizens. This may be formulated as a technical question: “Can data mining truly protect privacy when the data is held and mined by ‘distrusted’ custodians?”. This important question was not addressed at the present workshop, but should definitely be on the agenda for future research.
At the interface between policy and technical matters, several participants stressed that some idea of a scheme, a model or a scenario is needed to interdict terrorists, because simple searching cannot cover every possibility. Therefore, whatever technical means and alarms are developed will have to be triggered by considerations of likelihood or probability, and/or by existing intelligence from other sources.
2. Technical Issues
As noted, technical means and alarms will have to be triggered by considerations of likelihood or probability, and/or by existing intelligence from other sources. In summary, a second key agreement is that: It is necessary to have people in the analytical loop, supplying human judgment.
An important point to consider in patrolling the Web for terrorism is that rates of false positives are necessarily high for any automated method of identification or discovery. As noted, the search must be moderated by some understanding of plausible scenarios. There is a clear interface to policy and social issues when considering the consequences of a “false positive” (that is, naming a person, an organization, or a website as terrorist when it is not), which must be weighed and balanced against the consequences of failing to identify terrorist activity on the Web.
It was noted that computers work “from the bottom up”, digesting large masses of data and producing indications of when something is out of the ordinary. Human analysts, in contrast, work “from the top down”, guided by models or scenarios which may be drawn from previous experience, or may be suggested, for the very first time, by some configuration in the available data.
In working to make the computer a more powerful ally, it would be of immense value to have some common “challenge tasks”. This is the final finding of the workshop: It is necessary to have some model tasks which are well defined, and which have a “gold standard” known correct resolution or answer. Ideally these model challenges should be driven by the real missions of the several NATO nations. It was noted that most of the presentations dealt with websites in English, and a few with sites in Arabic. All of the technical work needs also to be extended to other languages.
Overall, there was an extremely effective exchange of ideas and of concerns between the experts in technical/computer issues and the experts in social/policy issues. It is highly recommended that this type of boundary spanning workshop be expanded and replicated in the future.
Bracha Shapira, Paul Kantor, Cecilia Gal.
A note on the production of this volume
The papers of this volume have been produced by a multi-step process. First, we recorded the talk given by each author at the conference in June (due to some technical difficulties a few presentations were not recorded). Next, we transcribed each recording. The authors then produced a draft of their paper from these transcriptions, refining each draft until the final version. Although the papers are not exactly like the talks given, some do retain the informal and conversational quality of the presentations. Other authors, however, preferred to include a more formal paper based on the material presented at the conference.
A few notes about language and conventions used in the book. Since the authors in this volume come from different parts of the globe we have tried to preserve their native cadences in the English versions of their papers. We have also decided to use both British English and American English spelling and standards depending on the specific style the author preferred to use. Language conventions for naming entities – such as Al Qaeda, 9/11 or Hezbollah – which have more than one standard spelling, were taken from the New York Times editorial practices. The formatting and style of references, when used, are consistent within each paper but vary between papers. And finally, a number of papers have pictures from screen captures of illustrations or of proprietary software. Although every effort was made to include the highest quality pictures so they would reproduce well in print, in some instances these pictures may not reproduce as well as might be desired, and we beg the reader's indulgence.
Cecilia Gal, Rutgers.
Acknowledgements
It is a pleasure to acknowledge the superb hospitality of Ben-Gurion University, which provided their magnificent faculty Senate Hall for the two days of the Conference, together with excellent audio-visual support.
The Deutsche Telekom Laboratory at BGU and Dr. Roman Englert provided additional hospitality for the participants.
We want to thank Rivka Carmi, the President of Ben-Gurion University, for her gracious welcome; Yehudith Naftalovitz and Hava Oz for all their hard work with the conference arrangements; and Professor Fernando Carvalho Rodrigues and Elizabeth Cowan at NATO for their generous support.
Terrorism was part of our lives before the Internet, and it will be a part of our lives even if we someday have all the means and all the possibilities to control the Internet. So when we look at the Internet as a tool for terrorists, we must remember that at the end of the day, the Internet is not the source of terrorism and the Internet is not the only way to implement terrorism. There are four major areas where the Internet has had a major impact on terrorist activities. First, the Internet has become a substitute for the “open squares” in cities: an anonymous meeting place for radicals of like minds. Second, non-governmental organizations can spread propaganda, stories, movies, and pictures about their successes, and no one can stop them. Third, the Internet is used by terrorists as a source of information. Fourth, the Internet can be employed as a command and control vehicle, that is, for operational coordination. Finally, a smaller point, the Internet can be used by terrorist organizations to raise money for their activities. Anti-terrorist organizations can also use the Internet to collect information on known terrorists. Ultimately, democratic governments can find the standards that are needed and share knowledge about the way terrorists are using the Internet, and in the long run the governmental systems, not the terrorist organizations, will prevail.
Cyber terrorism is a commonly heard expression but lacks definition and precision. Does it apply to terrorists' use of the Internet or to computer attacks against critical infrastructures? There are many entities that play a role in this complex topic, from governments, through technology and telecommunication service providers, to the terrorists themselves. This paper explores the role of law enforcement in the international fight against terrorists' use of the Internet and provides an overview of the wide extent to which terrorists and their support structures have fully embraced cyber space.
The nature of threats to international and national security is changing at a rapid pace, forcing a series of fundamental changes in intelligence management and tasking methods and processes. These are clearly visible in counterterrorism operations, particularly in intelligence collection against terrorist targets in the Internet environment. The need to develop new analytical capabilities and tools requires a new level of cooperation and dialogue between the intelligence community and research organizations, both at the national and the international level. NATO can aid and foster international intelligence cooperation through the requirements definition work of NC3A, which will help address some of the compatibility and tool-availability issues the intelligence services currently face while developing a joint response to the terrorist threat.
As with many other new technologies in the past, the World Wide Web presents us with great opportunities for progress as well as potential for misuse. A few of the crimes committed on the Internet are network attacks, credit card fraud, theft of money from bank accounts, corporate espionage, and the distribution of child pornography. These crimes represent a significant social danger to Ukraine and other Commonwealth of Independent States (CIS) countries. Below I discuss three different countries and the ways in which they have managed their technology infrastructure and dealt with Internet crime.
Terrorist organizations and their supporters make extensive use of the Internet for a variety of purposes, including the recruitment, training and indoctrination of jihad fighters and supporters. The overwhelming majority of the Islamist/jihadist websites are hosted by Internet Service Providers (ISPs) based in the West, many of which are unaware of the content of the Islamist/jihadist sites they are hosting. Experience shows that, once informed, most of the ISPs remove these sites from their servers. Therefore, an effective way to fight the phenomenon is to expose the sites via the media. It would be advisable to establish an organization – governmental or non-governmental – which would maintain a database and publish information about Islamist/jihadist sites on a regular basis, and/or provide such information to ISPs upon their request.
Web server logs can be used to build a variable-length Markov model representing users' navigation through a website. With the aid of such a Markov model we can attempt to predict the user's next navigation step on a trail that is being followed, using a maximum likelihood method that predicts that the highest-probability link will be chosen. We investigate three different scoring metrics for evaluating this prediction method: the hit-and-miss score, the mean absolute error, and the ignorance score. We present an extensive experimental evaluation on three data sets that are split into training and test sets. The results confirm that the quality of prediction increases with the order of the Markov model, and increases further after removing unexpected, i.e. low-probability, clicks from the test set.
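To make the maximum-likelihood prediction step concrete, the sketch below builds a simple first-order Markov model from navigation trails and predicts the most probable next page. The trail data and function names are illustrative assumptions; the chapter itself works with variable-length models and the richer scoring metrics listed above.

```python
from collections import defaultdict

def build_markov_model(trails):
    """Count transitions page -> next_page over all observed trails."""
    counts = defaultdict(lambda: defaultdict(int))
    for trail in trails:
        for current_page, next_page in zip(trail, trail[1:]):
            counts[current_page][next_page] += 1
    return counts

def predict_next(counts, current_page):
    """Maximum-likelihood prediction: the most frequent successor, with its probability."""
    successors = counts.get(current_page)
    if not successors:
        return None, 0.0
    total = sum(successors.values())
    page, count = max(successors.items(), key=lambda item: item[1])
    return page, count / total

# Hypothetical training trails extracted from a web server log.
trails = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "specs"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]
model = build_markov_model(trails)
print(predict_next(model, "products"))   # ('cart', 0.666...)
```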
As the Internet becomes more pervasive in all areas of human activity, attackers can use the anonymity of cyberspace to commit crimes and compromise the IT infrastructure. As there is currently no generally implemented authentication technology, we have to monitor the content and relations of messages and Internet traffic to detect infringements. In this paper, we present recent research on Internet threats such as fraud or the hampering of critical information infrastructures. One approach concentrates on the rapid detection of phishing email, designed to make it next to impossible for attackers to obtain financial resources or commit identity theft in this way. Then we address how another type of Internet fraud, the violation of the rights of trademark owners by the selling of faked merchandise, can be semi-automatically solved with text mining methods. Thirdly, we report on two projects that are designed to prevent fraud in business processes in public administration, namely in the healthcare sector and in customs administration. Finally, we focus on the issue of critical infrastructures, and describe our approach towards protecting them using a specific middleware architecture.
The ATDS system is aimed at detecting potential terrorists on the Web by tracking and analyzing the content of pages accessed by users in a known environment (e.g., university, organization). The system would alert and report on any user who is “too” interested in terrorist-related content. The system learns and represents the typical interests of the users in the environment. It then monitors the content of pages the users access and compares it to the typical interests of the users in the environment. The system issues an alarm if it discovers a user whose interests are significantly and consistently dissimilar to the other users' interests. This paper briefly reviews the main ideas of the system and suggests improving the detection accuracy by learning terrorists' typical behaviors from known terrorist related sites. An alarm would be issued only if a “non-typical” user is found to be similar to the typical interests of terrorists. Another enhancement suggested is the analysis of the visual content of the pages in addition to the textual content.
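As a rough illustration of the detection logic described above (not the actual ATDS implementation), the following sketch learns a "typical interests" profile from the content of pages accessed in the environment and flags a user whose accessed pages are consistently dissimilar to it. The bag-of-words representation, cosine measure, and thresholds are simplifying assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors (Counters)."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def typical_profile(page_texts):
    """Aggregate term frequencies of pages accessed in the environment: the 'typical interests'."""
    profile = Counter()
    for text in page_texts:
        profile.update(text.lower().split())
    return profile

def is_anomalous(user_pages, profile, threshold=0.2, min_hits=3):
    """Flag a user whose accessed pages are consistently dissimilar to the typical profile."""
    dissimilar = sum(1 for text in user_pages
                     if cosine(Counter(text.lower().split()), profile) < threshold)
    return dissimilar >= min_hits
```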
This chapter presents statistical and algorithmic approaches to discovering groups of actors that hide their communications within the myriad of background communications in a large communication network. Our approach to discovering hidden groups is based on the observation that a pattern of communications exhibited by actors in a social group pursuing a common objective is different from that of a randomly selected set of actors. We distinguish two types of hidden groups: temporal, which exhibits repeated communication patterns; and spatial, which exhibits correlations within a snapshot of communications aggregated over some time interval. We present models and algorithms, together with experiments showing the performance of our algorithms on simulated and real data inputs.
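A minimal sketch of the temporal idea, under the assumption that communications have already been bucketed into discrete time intervals: actor pairs that keep communicating across many intervals are grouped into candidate hidden groups. The chapter's own models and statistical significance tests are considerably more elaborate.

```python
from collections import defaultdict

def persistent_pairs(windows, min_windows):
    """
    windows: list of sets of (sender, receiver) pairs, one set per time interval.
    Return pairs that communicate in at least `min_windows` intervals, i.e. pairs
    whose communication pattern repeats over time.
    """
    counts = defaultdict(int)
    for pairs in windows:
        for pair in pairs:
            counts[frozenset(pair)] += 1
    return {tuple(sorted(p)) for p, c in counts.items() if c >= min_windows}

def candidate_groups(pairs):
    """Merge persistent pairs into connected components: candidate hidden groups."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        groups.append(component)
    return groups
```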
Typical authorship attribution methods are based on the assumption that we have a small closed set of candidate authors. In law enforcement scenarios, this assumption is often violated. There might be no closed set of suspects at all or there might be a closed set containing thousands of suspects. We show how even under such circumstances, we can make useful claims about authorship.
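One simple way to handle an open (or very large) candidate set, sketched below, is verification with a rejection threshold: attribute the document to the stylistically closest suspect only if the match is close enough, and otherwise answer "none of the known suspects". The function-word features, distance measure, and threshold here are illustrative assumptions rather than the chapter's actual method.

```python
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def style_vector(text):
    """Relative frequencies of common function words: a crude stylometric fingerprint."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def attribute(anonymous_text, candidate_texts, reject_threshold=0.05):
    """
    Open-set attribution: return the closest candidate author only if the match
    is close enough; otherwise return None ("none of the known suspects").
    """
    query = style_vector(anonymous_text)
    best_author, best_dist = None, float("inf")
    for author, text in candidate_texts.items():
        d = distance(query, style_vector(text))
        if d < best_dist:
            best_author, best_dist = author, d
    return best_author if best_dist <= reject_threshold else None
```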
Research on modeling the identification of materials related to a given topic, person or country is reported. The models use Bayesian analysis, and a sparsity-inducing prior distribution of the chance that a term will be useful. The result is concise computer-generated models, which can be understood, and improved, by human users of the tools. Examples are given based on technical literature, and on materials of interest to intelligence and policy analysts. The methods are particularly effective in learning to recognize materials pertinent to a specific topic, from very small sets of learning materials, provided that general background information (such as might be found in a text-book or encyclopedia) can be used to set the prior probability that a term will be used in the machine learning model.
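A hedged illustration: the MAP estimate under a Laplace (sparsity-inducing) prior corresponds to L1-regularized logistic regression, so the scikit-learn sketch below stands in for the chapter's Bayesian models. The toy corpus and the TF-IDF features are invented for illustration; the point is that only a handful of terms survive with non-zero weights, yielding a concise, human-readable model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = [            # invented toy examples
    "uranium enrichment centrifuge cascade design",
    "reactor fuel cycle and enrichment policy",
    "local football league results and fixtures",
    "recipe for lentil soup with fresh herbs",
]
labels = [1, 1, 0, 0]     # 1 = on-topic, 0 = background

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

# L1 penalty = MAP estimation under a Laplace prior on the term weights.
model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
model.fit(X, labels)

# The surviving (non-zero) terms form a small, human-readable model.
terms = vectorizer.get_feature_names_out()
kept = [(t, w) for t, w in zip(terms, model.coef_[0]) if w != 0.0]
print(sorted(kept, key=lambda tw: -abs(tw[1])))
```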
In the last few years the Internet has become a prominent vehicle for communication with the side effect that digital media also has become more relevant for criminal and terrorist activities. This necessitates the surveillance of these activities on the Internet. A simple way to monitor content is the spotting of suspicious words and phrases in texts. Yet one of the problems with simply looking for words is the ambiguity of words, whose meaning often depends on context. Information extraction aims at recovering the meaning of words and phrases from the neighboring words. We give an overview of term and relation extraction methods based on pattern matching and trainable statistical methods and report on experiments of semi-supervised training of such methods.
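A toy example of the pattern-matching side of information extraction, using hypothetical surface patterns for a person-to-organization affiliation relation; trainable statistical extractors of the kind reported in the chapter generalize far beyond such hand-written patterns.

```python
import re

# Hypothetical surface patterns; real systems combine many such patterns with
# trainable statistical models.
PATTERNS = [
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), (?:a|the) (?:member|leader) of (?P<org>[A-Z][\w\- ]+)"),
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined (?P<org>[A-Z][\w\- ]+)"),
]

def extract_affiliations(text):
    """Return (person, organization) pairs matched by any of the surface patterns."""
    relations = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            relations.append((match.group("person"), match.group("org").strip()))
    return relations

print(extract_affiliations("John Smith, a member of Example Front, was seen in Vienna."))
# [('John Smith', 'Example Front')]
```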
The crux of data compression is to process a string of bits in order to predict each subsequent bit as accurately as possible. The accuracy of this prediction is reflected directly in compression effectiveness. Dynamic Markov Compression (DMC) uses a simple finite state model which grows and adapts in response to each bit, and achieves state-of-the-art compression on a variety of data streams. While its performance on text is competitive with the best known techniques, its major strength is that it lacks prior assumptions about language and data encoding and therefore works well for binary data like executable programs and aircraft telemetry. The DMC model alone may be used to predict any activity represented as a stream of bits. For example, DMC plays “Rock, Paper, Scissors” quite effectively against humans. Recently, DMC has been shown to be applicable to the problem of email and web spam detection – one of the best known techniques for this purpose. The reasons for its effectiveness in this domain are not completely understood, because DMC performs poorly for some other standard text classification tasks. I conjecture that the reason is DMC's ability to process non-linguistic information like the headers of email, and to predict the nature of polymorphic spam rather than relying on fixed features to identify spam. In this presentation I describe DMC and its application to classification and prediction, particularly in an environment where particular patterns of data and behavior cannot be anticipated, and may be chosen by an adversary so as to defeat classification and prediction.
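The following sketch is not DMC itself, but it illustrates the same compression-as-prediction principle with a much simpler order-1 character model: train one adaptive model per class and label a new message with the class whose model assigns it the shortest code length. The Laplace smoothing constant and the 256-symbol alphabet are simplifying assumptions.

```python
import math
from collections import defaultdict

class CharModel:
    """Order-1 adaptive character model: predicts each character from its predecessor."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for prev, cur in zip(text, text[1:]):
            self.counts[prev][cur] += 1

    def cost_bits(self, text):
        """Code length (in bits) this model assigns to the text; lower = better predicted."""
        bits = 0.0
        for prev, cur in zip(text, text[1:]):
            ctx = self.counts[prev]
            prob = (ctx[cur] + 1) / (sum(ctx.values()) + 256)   # Laplace smoothing
            bits += -math.log2(prob)
        return bits

def classify(message, spam_model, ham_model):
    """Label the message with the class whose model compresses it best."""
    return "spam" if spam_model.cost_bits(message) < ham_model.cost_bits(message) else "ham"
```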
Paper documents are routinely found in general litigation and in criminal and terrorist investigations. The current state-of-the-art processing of these documents is to simply OCR them and search only the recognized text. This ignores all handwriting, signatures, logos, images, watermarks, and any other non-text artifacts in a document. Technology, however, exists to extract key metadata from paper documents, such as logos and signatures, and match these against a set of known logos and signatures. We describe a prototype that moves beyond simple OCR processing of paper documents and relies on additional document artifacts, rather than only on text, in the search process. We also describe a benchmark developed for the evaluation of paper document search systems.
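A sketch of the matching step only, assuming a logo or signature has already been segmented from the page and reduced to a numeric feature vector (how that descriptor is computed is outside the scope of this sketch): the extracted artifact is matched against a gallery of known artifacts by nearest neighbour, with a rejection threshold for unknown artifacts.

```python
import math

def match_artifact(query_vec, gallery, accept_dist=0.25):
    """
    Nearest-neighbour matching of an extracted document artifact (e.g. a logo or
    signature descriptor) against a gallery of known artifacts. `gallery` maps a
    label to its feature vector. Returns the best label, or None if nothing is close.
    """
    best_label, best_dist = None, float("inf")
    for label, vec in gallery.items():
        d = math.dist(query_vec, vec)        # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= accept_dist else None
```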
Visual objects are composed of parts, like a body, arms, legs and a head for a human, or wheels, a hood, a trunk and a body for a car. This compositional structure significantly limits the representation complexity of objects and renders learning of structured object models tractable. Adopting this modeling strategy, I describe a system which (i) automatically decomposes objects into a hierarchy of relevant compositions and (ii) learns such a compositional representation for each category without supervision. Compositions are represented as probability distributions over their constituent parts and the relations between them. The global shape of objects is captured by a graphical model which combines all compositions. Experiments on large standard benchmark data sets underline the competitive recognition performance of this approach and they provide insights into the learned compositional structure of objects.
Numerous counterterrorist activities on the Web have to distinguish between terror-related and non-terror-related items. In this case Machine Learning algorithms can be employed to construct classifiers from examples. Machine Learning applications often face the problem that real-life concepts tend to change over time, so that classifiers learned from old observations become out of date. This problem is known as concept drift. It seems to be doubly valid for terrorists acting on the Web, because they want to avoid being tracked. This paper gives a brief overview of the approaches that aim to deal with drifting concepts. Further, it describes in more detail two mechanisms for dealing with drifting concepts which are able to adapt dynamically to changes by forgetting irrelevant data and models. The presented mechanisms are general in nature and can be an add-on to any concept learning algorithm. Results from experiments that give evidence for the effectiveness of the presented approaches are reported and discussed.
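A minimal sketch of the forgetting idea, assuming a scikit-learn-style base classifier: only the most recent labelled examples are kept, so the model quietly forgets data from before the drift. Real drift-handling mechanisms, including those in the chapter, adapt the window or the set of models rather than relying on a fixed-size buffer.

```python
from collections import deque

class SlidingWindowLearner:
    """
    Wrap any base classifier so it adapts to concept drift by forgetting old examples:
    the model is retrained only on the most recent `window_size` labelled observations.
    """
    def __init__(self, base_classifier_factory, window_size=500):
        self.factory = base_classifier_factory
        self.window = deque(maxlen=window_size)   # old examples fall out automatically
        self.model = None

    def update(self, features, label):
        self.window.append((features, label))
        X, y = zip(*self.window)
        self.model = self.factory()               # retrain from scratch on the window
        self.model.fit(list(X), list(y))

    def predict(self, features):
        return self.model.predict([features])[0]
```

For example, SlidingWindowLearner(lambda: DecisionTreeClassifier(), window_size=200) would retrain a fresh scikit-learn decision tree after each new labelled observation, using only the latest 200 examples.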
The following is an introduction to the world of techno-intelligence signatures, given by Gadi Aviran, CEO of Hazard Threat Analysis Ltd. Hazard Threat Analysis Ltd. (HTA) specializes in the assessment of terror threats derived from Internet-sourced intelligence material (WebInt), utilizing an array of expertise in the collection, translation and analysis of focused data. ‘Techno-intelligence signatures’ is a collective name for the indicative effects an action has on its environment. Either physical or behavioral, these ‘signatures’ vary in nature, ranging from patterns of communication to unusual scents and sounds. During the development of a new technical or operational capability, or during the execution of one, these ‘signatures’ are emitted, allowing intelligence analysts to collect and analyze them in an effort to issue an intelligence alert. The difficulty in achieving a timely alert based on the analysis of such signatures lies in the process of “connecting the dots” – making sense of information particles often lacking an obvious connection. In order to begin ‘connecting the dots’, it is first vital to know what the ‘dots’ the analyst is looking at are, by making the distinction between theoretical threats and current intelligence threats. A profound understanding of both current capabilities, with their distinctive ‘signatures’, and the nature of information sources results in the life-saving intelligence alert the counter-terrorism agencies are after.
The terror and anti-terror activities on the Web pose a real challenge for web harvesting and data analysis tasks. The Internet virtual domain is likely to be used intensively by terrorist organizations. Using advanced, innovative, AI-based web harvesting technologies gives a surprising advantage to those who master Internet open sources. In addition, developing and implementing a large-scale intelligent web harvesting terror and security IT system is an extensive and time-consuming organizational endeavor. In order to ensure a successful implementation that delivers the payoff of such a system, the organization has to initiate and manage a large-scale operation consisting of methodological, organizational, data-integration and technological components. In this paper we present a comprehensive system to tackle this challenge, starting with a theoretical web harvesting framework and leading to a collection of web technologies that were developed into a complete product which has given its users vast field experience throughout the globe. The paper has two main objectives: to present a comprehensive technological and methodological framework for developing and implementing a web harvesting IT security system, and to draw and discuss lessons learned and conclusions from field experience gathered over the last decade. To achieve these goals we introduce and describe in detail an IT solution equipped with all necessary algorithms and tools, and then describe how this system and methodology were implemented in the intelligence community over the last decade. Lessons learned and real field experience conclude the paper, with a discussion of the system's advantages in the fight against terror and terrorist attacks in cyber space.
This article deals with the protection of home and enterprise users, and in particular Critical Infrastructures (CIs), against attacks unleashed by terrorists or criminals. Threats and challenges in large-scale network protection are discussed, and the corresponding defense mechanisms are classified into defensive and offensive. One defensive and one offensive mechanism are described. The Early Detection, Alert and Response (eDare) framework is a defensive mechanism aimed at removing malware from NSPs' traffic. eDare employs powerful network traffic scanners for sanitizing web traffic of known malware. The remaining traffic is monitored, and various types of algorithms are used for identifying unknown malware. To augment the algorithms' judgments, experts' opinions are used to classify suspected files about which the algorithms are not decisive. Finally, collaborative feedback and tips from end-users are meshed into the identification process. The DENAT system is an offensive mechanism which uses Machine Learning algorithms to analyze traffic that is sent from organizations such as universities through Network Address Translators (NAT). The analysis associates users with the content of their traffic in order to identify access to terror-related websites.
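To illustrate the layered decision flow described for eDare (a schematic assumption, not the actual eDare code), the sketch below passes a file through signature scanners, then through statistical classifiers, and escalates indecisive cases to a human expert queue.

```python
def edare_style_verdict(file_bytes, known_scanner, classifiers, expert_queue,
                        high=0.9, low=0.1):
    """
    Schematic of the layered decision flow:
    1. signature scanners give a definite verdict on known malware;
    2. otherwise, statistical classifiers (callables returning scores in [0, 1]) score the file;
    3. indecisive scores are escalated to human experts.
    """
    if known_scanner(file_bytes):
        return "block"                       # signature match: known malware
    score = sum(c(file_bytes) for c in classifiers) / len(classifiers)
    if score >= high:
        return "block"                       # confidently malicious
    if score <= low:
        return "allow"                       # confidently benign
    expert_queue.append(file_bytes)          # undecided: ask a human analyst
    return "pending"
```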