Ebook: Web Intelligence and Security
Terrorists are continuously learning to utilize the Internet as an accessible and cost-effective information infrastructure. Since constant manual monitoring of terrorist-generated multilingual web content is not feasible, automated Web Intelligence and Web Mining methods are indispensable for efficiently securing the Web against its misuse by terrorists and other dangerous criminals. Web Intelligence and Security contains chapters by the key speakers of the NATO Advanced Research Workshop on Web Intelligence and Security, which took place on November 18-20, 2009 in Ein-Bokek, Israel. The Workshop brought together a multinational group of leading scientists and practitioners interested in exploiting data and text mining techniques for countering terrorist activities on the Web. Most talks focused on presenting available methods and tools that can alleviate the information overload of intelligence and security experts. The key features of this book include:
- An up-to-date analysis of the current and future threats posed by the misuse of the Internet by terrorists and other malicious elements, including cyberterrorism, terror financing, and interactive online communication by terrorists and their supporters;
- A detailed presentation of state-of-the-art algorithms and tools aimed at detecting and monitoring malicious online activities on the Web;
- An introduction to novel data mining and text mining methods that can be used to efficiently analyze massive amounts of multilingual Web content.
The book's wide audience includes research scientists, graduate students, intelligence analysts, and data/text mining practitioners.
Terrorists are continuously learning to use the Internet as an accessible and cost-effective information infrastructure. Secure and non-secure web sites, online forums, and file-sharing services are routinely used by terrorist groups for spreading their propaganda, recruiting new members, and communicating with their supporters, along with sharing knowledge on forgery, explosive preparation, and other “core” terrorist activities. The current number of known terrorist sites and active extremist forums is so large, and their URL addresses so volatile, that constant manual monitoring of their multilingual content is definitely out of the question. Moreover, terrorist web sites often try to conceal their real identity, e.g., by masquerading as news portals or religious forums. This is why automated Web Intelligence and Web Mining methods are so important for efficiently securing the Web against its misuse by terrorists and other dangerous criminals.
This book contains chapters by the key speakers of the NATO Advanced Research Workshop on Web Intelligence and Security, which took place on November 18-20, 2009 in Ein-Bokek, Israel. The goal of the Advanced Research Workshop was to bring together scientists and practitioners interested in recent developments in exploiting data and text mining techniques for countering terrorist activities on the Web. The emphasis was placed on presenting available methods and tools that can alleviate the information overload of intelligence and security experts. The main areas of discussion included the origins of terrorism, the threats of the “Dark Web”, web content mining and Open Source Intelligence (OSI), text mining and data mining methods for security applications, and methods of Financial Intelligence (FI) for stopping terror financing activities. State-of-the-art solutions and open problems in the defense against web-based crime were highlighted by world-renowned experts in intelligence and security informatics (ISI) from eight NATO countries, three Partner countries, and one Mediterranean Dialogue country. The Workshop was attended by 44 participants from 15 countries, and the videos of all presentations are posted on the Workshop website (http://cmsprod.bgu.ac.il/Eng/conferences/nato2009/).
Similar to the Workshop program, this volume is organized into three main parts: Terror and the Dark Web, Web Content Mining and Open Source Intelligence, and Data and Text Mining for Security. A brief overview of each part is provided below.
Part I, Terror and the Dark Web, discusses the current and future threats posed by the misuse of the Internet by terrorists and other malicious elements. These threats include hardly predictable (“black swan”) events caused by cyberterrorism (Chapter 1 by A. Kandel), new ways of terror financing arising from the worldwide financial crisis (Chapter 2 by J. Bollag), the increasing use of social networking tools by terrorists (Chapter 3 by G. Weimann), the “Virtual Jihad” as the source and “trigger” of the “Real Jihad” (Chapter 4 by Sh. Shay), and highly effective cyber attacks based on a variety of “social engineering” tricks (Chapter 5 by A. Barseghyan).
Part II, Web Content Mining and Open Source Intelligence, presents state-of-the-art algorithms and tools aimed at detecting and monitoring malicious activities on the Web. Several web-based early warning systems and multi-lingual information extraction tools are described by C. Best in Chapter 6. An ontology-driven information extraction approach to discovering extremist groups from the Dark Web is presented by Hladky et al. in Chapter 7. The problem of Internet traffic monitoring poses new challenges and research directions, which are discussed by B. Porat and E. Porat in Chapter 8. Chapter 9 by G. Margarov describes data hiding using steganography and some steganalysis techniques for detecting such hidden information. Finally, a new approach to detecting Internet banking fraud and money laundering transactions is presented in Chapter 10 by M. H. Özçelik and E. Duman.
Part III, Data and Text Mining for Security, covers several data mining and text mining methods that can be used to efficiently analyze the massive amounts of multi-lingual Web content. These methods include data stream mining algorithms (Chapter 11 by J. Gama et al.), visual analytics techniques (Chapter 12 by D.A. Keim et al.), fuzzy system models (Chapter 13 by I. B. Türkşen), fuzzy logic approaches to database querying (Chapter 14 by J. Kacprzyk and S. Zadrożny), query log analysis (Chapter 15 by R. Baeza-Yates), natural language understanding tools (Chapter 16 by D. Roth), automated summarization of multilingual textual content (Chapter 17 by M. Last and M. Litvak), and semantic web applications (Chapter 18 by V.F. Khoroshevsky).
Acknowledgements
We are grateful to the NATO Science for Peace and Security Programme for its generous support of the Advanced Research Workshop on Web Intelligence and Security and of the publication of this book. We are also grateful to the Homeland Security Research Institute at Ben-Gurion University of the Negev, Israel, which sponsored the videotaping of the Workshop presentations. We thank all the key speakers and other Workshop participants for making it such a successful event. Our special thanks go to Ms. Keren Solomon, the Workshop Secretary, for her hard and remarkable work before, during, and after the Workshop.
Mark Last and Abraham Kandel
June 2010
In this paper, a comprehensive assessment of “black swan” capabilities is discussed. The use of fuzzy logic as a possible tool for uncertainty management is investigated, and some fuzzy techniques are demonstrated in order to remind us of both the strengths and the weaknesses of these methods.
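As a toy illustration of the kind of fuzzy technique the chapter alludes to, the following minimal Python sketch (not taken from the chapter; the linguistic terms and thresholds are invented for illustration) shows how fuzzy membership functions and a single Mamdani-style rule express graded, rather than crisp, uncertainty:

```python
# A minimal sketch (not from the chapter) of fuzzy membership and a single
# Mamdani-style rule, illustrating how fuzzy logic expresses graded uncertainty.

def triangular(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical linguistic terms for an "attack likelihood" indicator in [0, 100].
def low(x):    return triangular(x, -1, 0, 50)
def medium(x): return triangular(x, 25, 50, 75)
def high(x):   return triangular(x, 50, 100, 101)

# Rule: IF likelihood is high AND impact is high THEN alert is high.
# Mamdani inference truncates the consequent at the rule's firing strength (min).
likelihood, impact = 70.0, 85.0
firing_strength = min(high(likelihood), high(impact))
print(f"alert 'high' to degree {firing_strength:.2f}")   # -> 0.40
```

The resulting firing strength of 0.40 conveys partial, hedged evidence for a high alert, which is exactly the kind of graded judgment that crisp logic cannot express.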
This chapter discusses the impact of the worldwide financial crisis on terror financing. An overview of the current crisis is followed by a description of the Western governments’ reaction to the crisis and its impact on the financing of terror. We then discuss the trend of investors moving their assets to countries of the Muslim world. Finally, some serious concerns are raised in the Conclusions section.
The Internet has long been a favorite tool for terrorists. Terrorists use the Net to expand the reach of their propaganda to a global audience, to recruit adherents, to communicate with international supporters and ethnic diasporas, to solicit donations, and to foster public awareness and sympathy for their causes. About 90% of terrorist activity on the Internet makes use of social networking tools, be they independent bulletin boards, Paltalk, or Yahoo! eGroups. These forums act as a virtual firewall safeguarding the identities of those who participate, and they offer surfers easy access to terrorist material, a place to ask questions, and even ways to contribute to and help out the cyber jihad. This chapter examines the current use of interactive online communication by terrorists and their supporters.
In the 21st century, the theater of the conflict and war between Al Qaeda, the Global Jihad, and their adversaries is located in two spheres: the “Real” theater of jihad (Iraq, Afghanistan, Kashmir, Somalia, Palestine) and the “Virtual” theater of jihad (cyberspace and the media). In order to win the war against radical Islam, it is necessary to develop a comprehensive global strategy that simultaneously engages Al Qaeda and other jihadi entities in both the “Real” and the “Virtual” theaters.
One of the strongest methods of attacking cyber security is an attack on the human factor, which exploits human weaknesses: the attacker manipulates people into divulging confidential information or performing certain actions. This is known as social engineering. We discuss some classical types of social engineering, such as pretexting, phishing, vishing, and baiting, the schemes of attack that social engineers use, defense schemes, and real-life examples.
Live monitoring of Internet content is now providing new possibilities for early warning of security threats across a range of fields. This paper reviews the techniques available, their methodologies, and their application to three areas: regional conflict early warning; terrorism forecasting and early warning; and disease outbreaks and health threat monitoring. The main challenges in this research are information retrieval in multiple languages, the extraction of relevant data and information from unstructured text, and the development of methodologies and indicators capable of detecting trends and small trigger signals. For such early warning systems to be effective, they must inform operators of potential threats without an undue number of false alarms. This paper reviews the techniques used for information retrieval and extraction and describes three applications in the areas concerned.
Due to the lack of multilingual and multimedia extremist collections and of advanced analytical methodologies, our empirical understanding of the Internet, and of the dark web in particular, is still very limited. Content mining and intelligence gathering on the Internet are becoming an increasing challenge for security bodies, financial organizations (e.g., financial intelligence units, FIUs), and law enforcement agencies. Tracking large amounts of digital information from various sources, such as the public Internet, the dark web, the long-tail web, blogs, and other social networks, creates new challenges for the research community. A first challenge is to create intelligent crawlers that can identify any link on the web and extract the digital footprint from the web page. Some of the key challenges we face are in the areas of automatic multilingual text analysis, the harmonization of extracted knowledge, and unique identity resolution. Taxonomies and thesauri do not offer a complete solution for the automatic discovery of hidden relations or of newly coined expressions for named entities. In order to understand shadow groups, we need to apply advanced technologies from artificial intelligence and computational linguistics. In this paper we share the experience we have gained from various projects in Europe, Russia, and Central Asia. We discuss how an ontology-driven approach to information extraction from large multilingual document collections can help to create understanding and therefore valuable knowledge. We further demonstrate how to merge the various ontologies used for different domains and languages using the concept of an upper ontology, and we conclude by sharing insights on how to create rules for automatic identity resolution for specific named entities.
Pattern matching is one of the basic topics in computer science, and there have already been two waves of research into it. Due to new problems and changes in the model, as well as its practical applications, we must now examine pattern matching problems in the context of new models. Such considerations give rise to many interesting research problems in this area. The main difference between the two past waves of pattern matching research and the new wave is the need to deal with more practical questions: such problems are encountered first in industry applications, whereas in the past researchers posed the problems themselves. Two papers [4, 16] could already be tagged as the beginning of the new wave. We discuss the new challenges that must be handled in these exciting future research directions.
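For readers less familiar with the field, the following minimal Python sketch shows a representative classical technique, Rabin-Karp rolling-hash matching (our choice of example, not one drawn from the chapter); the newer, practice-driven problems the chapter discusses extend such classical building blocks to streaming, approximate, and resource-constrained settings:

```python
# A minimal sketch of classical exact pattern matching: Rabin-Karp
# rolling-hash search over a text, reporting all match positions.

def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_000_007):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)            # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):                      # initial hashes of pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Verify on a hash hit to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                       # roll the window one character to the right
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))    # -> [0, 7]
```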
This article is devoted to the problems of Web-based data hiding by means of steganography and the investigation of the Internet for the purpose of detecting steganographic content by means of steganalysis. Rumors in the mass media about terrorists using steganography are reviewed. The basic idea of steganography, its history, and its Web-based applications are considered. A classification and examples of available software are outlined. The main principles of steganalysis and the detection of steganographic content are discussed.
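To make the basic idea concrete, here is a minimal Python sketch of least-significant-bit (LSB) embedding, the simplest steganographic scheme (an illustrative example, not one of the tools surveyed in the chapter); real systems add encryption and spread the payload, and steganalysis looks for the statistical traces such embedding leaves in the low-order bits:

```python
# A minimal sketch of LSB steganography: hiding a byte string in the
# lowest bits of cover bytes, e.g. raw image pixel values.

def embed(cover: bytearray, message: bytes) -> bytearray:
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    assert len(bits) <= len(cover), "cover too small for message"
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit   # overwrite the lowest bit
    return stego

def extract(stego: bytearray, n_bytes: int) -> bytes:
    bits = [b & 1 for b in stego[:n_bytes * 8]]
    return bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )

cover = bytearray(range(256)) * 2            # stand-in for pixel data
stego = embed(cover, b"meet at dawn")
print(extract(stego, 12))                    # -> b'meet at dawn'
```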
The income sources of most terrorist organizations are based on illegal activities. These mainly include money stolen from the bank accounts of innocent people and black money generated by illegal activities. As internet banking becomes more popular, frauds based on stealing customers' passwords and transferring the funds in their accounts to mule accounts have also increased. In this study we aim at detecting such transactions by establishing a Customer Trustability Index (CTI) for all customers in the database of a bank. The CTI has a rule-based component and a similarity-based component built on link analysis and RFM (recency, frequency, monetary value) techniques.
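The abstract does not spell out the CTI formula, so the following Python sketch is purely hypothetical: it illustrates how a rule-based score and an RFM-style behavioral score could be combined into a single trustability index, with all rules, weights, and thresholds invented for illustration:

```python
# A hypothetical sketch of a Customer Trustability Index in the spirit of the
# chapter: a rule-based score combined with an RFM-style score. All rules,
# weights, and thresholds here are illustrative, not the authors' model.

from dataclasses import dataclass

@dataclass
class Customer:
    account_age_days: int
    days_since_last_login: int      # recency
    transfers_per_month: float      # frequency
    avg_transfer_amount: float      # monetary
    linked_to_flagged_account: bool # from link analysis

def rule_score(c: Customer) -> float:
    score = 1.0
    if c.account_age_days < 90:
        score -= 0.3                # very new accounts are less trusted
    if c.linked_to_flagged_account:
        score -= 0.5                # link analysis: proximity to known mules
    return max(score, 0.0)

def rfm_score(c: Customer) -> float:
    # Trust grows with regular, moderate activity; illustrative normalization.
    recency = 1.0 / (1.0 + c.days_since_last_login / 30.0)
    frequency = min(c.transfers_per_month / 10.0, 1.0)
    moderation = 1.0 / (1.0 + c.avg_transfer_amount / 5_000.0)
    return (recency + frequency + moderation) / 3.0

def cti(c: Customer, w_rule: float = 0.6) -> float:
    return w_rule * rule_score(c) + (1 - w_rule) * rfm_score(c)

suspect = Customer(30, 200, 1, 9_000, True)
print(f"CTI = {cti(suspect):.2f}")   # a low value flags the account for review
```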
In the last two decades, machine learning research and practice have focused on batch learning, usually with small datasets. Nowadays there are applications in which the data are best modeled not as persistent tables, but rather as transient data streams. Learning from data streams is a growing research area with challenging applications and contributions from fields such as databases, learning theory, machine learning, and data mining. In this work we identify the main characteristics of stream mining algorithms and present two illustrative examples of such algorithms.
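As a concrete example of the constraints stream mining imposes, namely a single pass over the data and bounded memory, here is a minimal Python sketch of reservoir sampling, a classic stream algorithm; it is our illustrative choice and not necessarily one of the two examples presented in the chapter:

```python
# A minimal sketch of reservoir sampling (Algorithm R): maintain a uniform
# random sample of fixed size k over an unbounded stream in O(k) memory
# and a single pass.

import random

def reservoir_sample(stream, k: int):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # keep the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 items uniformly from a stream of one million integers.
print(reservoir_sample(range(1_000_000), 5))
```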
This paper presents a visual analytics approach to exploring a large collection of news articles in the domains of polarity, spatial, and entity analysis. The exploration is performed on data collected with the Europe Media Monitor (EMM), a system that monitors over 2,500 online sources and processes 90,000 articles per day. In analyzing the news feeds, we want to find out which topics are important in different countries, what the general polarity of the articles within these topics is, and how the quantitative evolution of entities mentioned in the news, such as persons and organizations, develops over time. To assess the polarity of a news article, automatic techniques for polarity analysis are employed, and the results are visualized using Literature Fingerprinting. In the spatial description of the news feeds, every article can be represented by two geographic attributes: the news origin and the location of the event itself. To assess these spatial properties of news articles, we conduct an analysis that is able to cope with the size and spatial distribution of the data. To demonstrate the use of our system, we also present case studies that show (a) temporal analysis of entities and (b) analysis of their co-occurrence in news articles. Within this application framework, we show how real-time news feed data can be efficiently analyzed.
We first review the development of Fuzzy System Models, from the “Fuzzy Rule Bases” proposed by Zadeh (1965, 1975) and applied by Mamdani et al. (1981) to the “Fuzzy Functions” proposed by Türkşen (2007-2008) and further developed by Celikyilmaz and Türkşen (2007-2009) in a variety of versions. Next, we review the complementary development of the “Fuzzy C-Regression Model” (FCRM) proposed by Hathaway and Bezdek (1993), as well as the “Combined FCM and FCRM” algorithms proposed by Höppner and Klawonn (2003).
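For context, the following NumPy sketch shows the core iteration of the fuzzy c-means (FCM) algorithm on which the combined FCM/FCRM approaches build: alternating membership and center updates. It is a textbook illustration, not code from the chapter:

```python
# A minimal NumPy sketch of fuzzy c-means: alternate between updating fuzzy
# memberships and cluster centers for a fixed number of iterations.

import numpy as np

def fuzzy_c_means(X, c: int, m: float = 2.0, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), c, replace=False)]
    for _ in range(n_iter):
        # Distances of every point to every center (n_points x c).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        # Center update: weighted means of the points with weights u^m.
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, u = fuzzy_c_means(X, c=2)
print(centers)   # two centers, near (0, 0) and (5, 5)
```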
We are concerned with browsing numerical data stored in popular relational databases by querying those data repositories with non-standard, human-consistent querying tools that fall within the broad category of flexible queries. We present queries with linguistic quantifiers and bipolar queries, which can help express real human intentions. This is crucial because information that can suggest a security threat typically concerns a sophisticated concept, related to a sophisticated combination of attribute values that is difficult to express using traditional querying tools; moreover, the terms involved mostly concern issues and aspects that are difficult to specify and quantify precisely, and hence tools of fuzzy logic are employed. We assume a numerical relational database in which some very specific combination of attribute values can suggest a potentially dangerous situation. To retrieve information, the user has to express his or her interest, intention, or information needs as a query. Usually, the resulting query criteria cannot imply a binary accept/reject decision for a row, so a degree of matching is employed. Flexibility is obtained by fuzzy modeling of linguistic terms and a non-standard aggregation of the elementary conditions in a query: in our case, a user may be fully satisfied when, e.g., most of the conditions are fulfilled, possibly with varying importances, which results in queries with linguistic quantifiers. We also discuss another non-standard approach to flexible querying that involves a specific flexible aggregation scheme assuming two types of conditions within a query, the so-called bipolar queries involving mandatory (required) and, to some extent, optional (desired) conditions.
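To illustrate the flavor of such flexible queries, here is a minimal Python sketch of a query with the linguistic quantifier “most” in the style of Zadeh's calculus of linguistically quantified propositions; the elementary conditions, membership functions, and importance weights below are invented for illustration and are not the chapter's examples:

```python
# A minimal sketch of a flexible query with the linguistic quantifier "most".
# Each elementary condition returns a matching degree in [0, 1]; the row's
# overall score is mu_most applied to the importance-weighted mean of degrees.

def mu_most(x: float) -> float:
    """Piecewise-linear membership function of the fuzzy quantifier 'most'."""
    if x <= 0.3:
        return 0.0
    if x >= 0.8:
        return 1.0
    return (x - 0.3) / 0.5

# Hypothetical fuzzy conditions over transaction attributes.
def large_amount(v):   return min(max((v - 5_000) / 10_000, 0.0), 1.0)
def recent(days):      return min(max((30 - days) / 30, 0.0), 1.0)
def many_transfers(n): return min(max((n - 5) / 10, 0.0), 1.0)

def row_score(row, weights=(0.5, 0.3, 0.2)) -> float:
    degrees = (large_amount(row["amount"]),
               recent(row["days_ago"]),
               many_transfers(row["transfers"]))
    weighted_mean = sum(w * d for w, d in zip(weights, degrees)) / sum(weights)
    return mu_most(weighted_mean)

row = {"amount": 12_000, "days_ago": 3, "transfers": 9}
print(f"degree of matching: {row_score(row):.2f}")   # -> 0.80
```

Rows are then ranked by this degree of matching instead of being crisply accepted or rejected, which is the essence of the flexibility discussed above.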
Query logs are the traces of queries sent to search engines. This chapter covers the new privacy problem inherent in query logs, which was exposed by the America Online (AOL) incident in August 2006. We then propose several ways to protect the privacy of query logs and finally discuss some related problems in web security.
Recent studies have shown that over 85% of the information organizations deal with is unstructured – the vast majority of it in text. A multitude of techniques has to be used to enable intelligent access to this information and to support transforming it into forms that allow its sensible use. While the most commonly used approach is keyword search, popularized by commercial search engines such as Google, Yahoo! and Bing, the research community has made a lot of progress moving beyond these techniques. The fundamental issue that all these advances need to address is that of semantics: there is a need to move toward understanding the text at an appropriate level, beyond the word level, in order to support access, knowledge extraction, and synthesis. This paper surveys some of our research in these directions, focusing on software tools and products more than on techniques; in doing so, it addresses several dimensions of text understanding that can facilitate access to and extraction of knowledge from unstructured text, transforming it into forms that are useful to different users in different settings and integrating it along multiple dimensions and with existing institutional resources.
Text summarization is the process of distilling the most important information from source/sources to produce an abridged version for a particular user/users and task/tasks. Automatically generated summaries can significantly reduce the information overload on intelligence analysts in their daily work. Moreover, automated text summarization can be utilized for automated classification and filtering of text documents, information search over the Internet, content recommendation systems, online social networks, etc.
The increasing trend of cross-border globalization accompanied by the growing multi-linguality of the Internet requires text summarization techniques to work equally well on multiple languages. However, only some of the automated summarization methods proposed in the literature can be defined as “multi-lingual” or “language-independent,” as they are not based on any morphological analysis of the summarized text.
In this chapter, we present a novel approach called MUSE (MUltilingual Sentence Extractor) to “language-independent” extractive summarization, which represents the summary as a collection of the most informative fragments of the summarized document without any language-specific text analysis. We use a Genetic Algorithm to find the best linear combination of 31 sentence scoring metrics based on vector and graph representations of text documents. Our summarization methodology is evaluated on two monolingual corpora of English and Hebrew documents, and, in addition, on a bilingual collection of English and Hebrew documents. The results are compared to 15 statistical sentence scoring methods for extractive single-document summarization found in the literature and to several state-of-the-art summarization tools. These bilingual experiments show that the MUSE methodology significantly outperforms the existing approaches and tools in both languages.
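The following Python sketch conveys the genetic-algorithm idea behind MUSE at a very high level: evolving a weight vector over sentence scoring metrics. The fitness function below is a placeholder stand-in; MUSE itself optimizes summary quality (a ROUGE-based objective) over training documents, and its actual operators and parameters may differ:

```python
# A minimal sketch of a genetic algorithm over metric weights, in the spirit
# of MUSE. The fitness function is a placeholder; in MUSE it would score the
# summaries produced with `weights` against gold-standard summaries.

import random

N_METRICS = 31                      # number of sentence scoring metrics

def fitness(weights):
    # Placeholder objective: in MUSE, build summaries with `weights` and
    # evaluate them (e.g., mean ROUGE over a training corpus).
    target = [1.0 / N_METRICS] * N_METRICS
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def evolve(pop_size=50, generations=100, mut_rate=0.1):
    pop = [[random.random() for _ in range(N_METRICS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_METRICS)    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(N_METRICS):              # Gaussian mutation
                if random.random() < mut_rate:
                    child[i] += random.gauss(0, 0.05)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best[:5])   # first few evolved metric weights
```

At summarization time, each sentence receives the weighted sum of its 31 metric scores, and the top-ranked sentences form the extractive summary.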
There are two main parts in this keynote talk. The first is devoted to an overview of R&D in the information extraction domain carried out by academic institutions and commercial organizations in Russia. The main topics in this part deal with methods for processing multilingual document collections and appropriate instrumental tools for information extraction. The second part of the talk presents a semantic technology oriented toward the development and implementation of Semantic Web applications, elaborated jointly by specialists from the Computer Centre of the Russian Academy of Sciences, the Russian IT company “Avicomp Services”, and the company Ontos AG from Switzerland. Some challenges of highlighting “shadow” knowledge are outlined.