Ebook: Applications of Data Mining in E-Business and Finance
The application of Data Mining (DM) technologies has shown explosive growth in an increasing number of different areas of business, government and science. Two of the most important business areas are finance, in particular in banks and insurance companies, and e-business, such as web portals, e-commerce and ad management services. In spite of the close relationship between research and practice in Data Mining, it is not easy to find information on some of the most important issues involved in real-world application of DM technology, from business and data understanding to evaluation and deployment. Papers often describe research that was developed without taking into account constraints imposed by the motivating application. When these issues are taken into account, they are frequently not discussed in detail because the paper must focus on the method. Therefore, knowledge that could be useful for those who would like to apply the same approach to a related problem is not shared. The papers in this book address some of these issues. This book is of interest not only to Data Mining researchers and practitioners, but also to students who wish to have an idea of the practical issues involved in Data Mining.
We have been witnessing explosive growth in the application of Data Mining (DM) technologies in an increasing number of different areas of business, government and science. Two of the most important business areas are finance, in particular in banks and insurance companies, and e-business, such as web portals, e-commerce and ad management services.
In spite of the close relationship between research and practice in Data Mining, it is not easy to find information on some of the most important issues involved in real-world application of DM technology, from business and data understanding to evaluation and deployment. Papers often describe research that was developed without taking into account constraints imposed by the motivating application. When these issues are taken into account, they are frequently not discussed in detail because the paper must focus on the method. Therefore, knowledge that could be useful for those who would like to apply the same approach to a related problem is not shared.
In 2007, we organized a workshop with the goal of attracting contributions that address some of these issues. The Data Mining for Business workshop was held together with the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), in Nanjing, China.
This book contains extended versions of a selection of papers from that workshop. Due to the importance of the two application areas, we have selected papers that are mostly related to finance and e-business. The chapters of this book cover the whole range of issues involved in the development of DM projects, including the ones mentioned earlier, which are often not described. Some of these papers describe applications, including insights into how domain-specific knowledge was incorporated in the development of the DM solution and the issues involved in integrating this solution into the business process. Other papers illustrate how the fast development of IT, such as blogs or RSS feeds, opens many interesting opportunities for Data Mining and propose solutions to address them.
These papers are complemented with others that describe applications in other important and related areas, such as intrusion detection, economic analysis and business process mining. The successful development of DM applications depends on methodologies that facilitate the integration of domain-specific knowledge and business goals into the more technical tasks. This issue is also addressed in this book.
This book clearly shows that Data Mining projects must not be regarded as independent efforts; rather, they should be integrated into broader projects that are aligned with the company's goals. In most cases, the output of DM projects is a solution that must be integrated into the organization's information system and, therefore, into its (decision-making) processes.
Additionally, the book stresses the need for DM researchers to keep up with the pace of development in IT, identify potential applications and develop suitable solutions. We believe that the flow of new and interesting applications will continue for many years.
Another interesting observation that can be made from this book is the growing maturity of the field of Data Mining in China. In the last few years we have observed spectacular growth in the activity of Chinese researchers both abroad and in China. Some of the contributions in this volume show that this technology is increasingly used by people who do not have a DM background.
To conclude, this book presents a collection of papers that illustrates the importance of maintaining close contact between Data Mining researchers and practitioners. For researchers, it is useful to understand how the application context creates interesting challenges but, simultaneously, imposes constraints which must be taken into account in order for their work to have higher practical impact. For practitioners, it is not only important to be aware of the latest developments in DM technology, but it may also be worthwhile to maintain an ongoing dialogue with the research community in order to identify new opportunities for the application of existing technologies and also for the development of new technologies.
We believe that this book may be of interest not only to Data Mining researchers and practitioners, but also to students who wish to have an idea of the practical issues involved in Data Mining. We hope that our readers will find it useful.
Porto, Bradford, Hangzhou, Osaka and Nanjing – May 2008
Carlos Soares, Yonghong Peng, Jun Meng, Takashi Washio, Zhi-Hua Zhou
This chapter introduces the volume on Applications of Data Mining in E-Business and Finance. It discusses how application-specific issues can affect the development of a data mining project. An overview of the chapters in the book is then given to guide the reader.
Optimizing trading strategies effectively and efficiently is a non-trivial task, all the more so in real-world situations. This paper presents a general definition of this optimization problem, and discusses the application of evolutionary techniques (genetic algorithms in particular) to the optimization of trading strategies. Experimental results show that this approach is promising.
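To make the idea concrete, the following is a minimal sketch of how a genetic algorithm might tune a trading rule. It is not the chapter's method: the moving-average crossover rule, the synthetic price series, the operators (truncation selection, one-point crossover, integer mutation, elitism) and all names and parameter values are assumptions chosen for illustration. The fitness is measured in-sample, so a real study would also need out-of-sample validation.

```python
import random


def moving_average(prices, window, t):
    """Mean of the `window` prices ending at index t (inclusive)."""
    return sum(prices[t - window + 1:t + 1]) / window


def strategy_return(prices, fast, slow):
    """Total return of a long-only moving-average crossover rule."""
    wealth = 1.0
    for t in range(slow - 1, len(prices) - 1):
        # Hold the asset from t to t+1 only while the fast average
        # is above the slow average.
        if moving_average(prices, fast, t) > moving_average(prices, slow, t):
            wealth *= prices[t + 1] / prices[t]
    return wealth - 1.0


def repair(fast, slow, max_window):
    """Clamp the genes so that 1 <= fast < slow <= max_window."""
    fast = min(max(fast, 1), max_window - 1)
    slow = min(max(slow, fast + 1), max_window)
    return (fast, slow)


def evolve(prices, rng, pop_size=20, generations=15, max_window=30,
           mutation_rate=0.3):
    """Return the best (fast, slow) pair and the best fitness per generation."""
    def fitness(individual):
        return strategy_return(prices, *individual)

    population = [
        repair(rng.randint(1, max_window), rng.randint(1, max_window), max_window)
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]   # truncation selection
        children = [population[0]]             # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            fast, slow = a[0], b[1]            # one-point crossover
            if rng.random() < mutation_rate:
                fast += rng.randint(-3, 3)
            if rng.random() < mutation_rate:
                slow += rng.randint(-3, 3)
            children.append(repair(fast, slow, max_window))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Synthetic random-walk prices (illustrative only, not market data).
rng = random.Random(0)
prices = [100.0]
for _ in range(300):
    prices.append(prices[-1] * (1.0 + rng.gauss(0.0005, 0.01)))

best, history = evolve(prices, rng)
print("best (fast, slow) windows:", best)
print("in-sample return:", round(history[-1], 4))
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases from one generation to the next.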
In this study, we analyze the suitability of support vector machines (SVM) for credit risk modeling from two different aspects: credit classification and estimation of probability of default values. First, we compare the credit classification performance of SVM with the widely used technique of logistic regression. Then we propose a cascaded model based on SVM in order to obtain better credit classification accuracy. Finally, we propose a methodology for SVM to estimate the probability of default values for borrowers. We furthermore discuss the advantages and disadvantages of SVM for credit risk modeling.
Client credibility plays an important role in the financial and banking industry. This paper combines C4.5 and Apriori algorithms in the data mining process and discusses the relationship between these two algorithms. It then tests them using the WEKA software to acquire information from the analysis of the historical data of credit card clients. Finally, it offers decision-making support for the evaluation of client credibility.
Traditionally, retail banks have based credit decision-making on scorecards developed for predicting default in a period of six months or more. However, the underlying pay/no pay cycles justify a decision in a 30-day period. In this work several classification models are built on this assumption. We start by assessing binary scorecards, assigning credit applicants to good or bad risk classes according to their record of defaulting. The detection of a critical region between good and bad risk classes, together with the opportunity of manually classifying some of the credit applicants, led us to develop a tripartite scorecard, with a third output class, the review class, in between the good and bad classes. With this model 87% of decisions are automated, which compares favourably with the 79% automation rate of the existing scorecards.
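The tripartite idea can be sketched in a few lines. This is only an illustration under stated assumptions: the chapter does not say how its scores or cut-offs are obtained, so the score scale (an estimated default risk in [0, 1]), the two thresholds `low` and `high`, and the applicant scores below are all hypothetical.

```python
from collections import Counter


def tripartite_decision(score, low, high):
    """Map a default-risk score to 'good', 'review' or 'bad'.

    Scores at or below `low` are accepted automatically, scores at or
    above `high` are rejected automatically, and scores in between are
    sent to manual review.
    """
    if not low < high:
        raise ValueError("low must be strictly less than high")
    if score <= low:
        return "good"
    if score >= high:
        return "bad"
    return "review"


def automation_rate(scores, low, high):
    """Fraction of applicants decided without manual review."""
    decisions = Counter(tripartite_decision(s, low, high) for s in scores)
    return (decisions["good"] + decisions["bad"]) / len(scores)


# Hypothetical risk scores for ten applicants (illustrative values only).
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.70, 0.80, 0.90, 0.95]
rate = automation_rate(scores, low=0.30, high=0.60)
print(f"automated decisions: {rate:.0%}")
```

Widening the gap between the two thresholds sends more applicants to review and lowers the automation rate, so the cut-offs trade manual workload against the risk of automated misclassification.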
On-line advertising is booming. Compared to traditional media, such as press and TV, Web advertising is cheap and offers interesting returns. Thus, it is attracting more and more consideration by the industry. In particular, it is now a consistent part of the marketing mix, that is, the set of different approaches to advertise a product. Data-mining-based optimization of Web advertising can take place at many different levels. From a data miner's perspective, Internet advertising is a very interesting domain as it offers a very large amount of data produced at a fast pace with rich and precise detail. It also offers the valuable possibility of live hypothesis testing. Here we discuss an Apriori-based optimization experiment we performed on live data. We show how effective such optimization is.
Blogs, or weblogs, have rapidly gained in popularity over the past decade. Because of the huge volume of existing blog posts, information in the blogosphere is difficult to access and retrieve. Existing studies have focused on analyzing personal blogs, but few have looked at corporate blogs, the numbers of which are dramatically rising. In this paper, we use probabilistic latent semantic analysis to detect keywords from corporate blogs with respect to certain topics. We then demonstrate how this method can represent the blogosphere in terms of topics with measurable keywords, hence tracking popular conversations and topics in the blogosphere. By applying a probabilistic approach, we can improve information retrieval in blog search and keyword detection, and provide an analytical foundation for the future of corporate blog search and mining.
The RSS technique provides a fast and effective way to publish up-to-date information or renew outdated content for information subscribers. So far, RSS information is mostly managed by content publishers, and Internet users have little initiative in choosing what they really need. More attention needs to be paid to techniques for user-initiated information discovery from RSS feeds. In this paper, a quantitative semantic matchmaking method for RSS-based applications is proposed. The semantic information of an RSS feed can be described by numerical vectors and semantic matching can then be conducted in a quantitative form. An ontology is applied to provide a commonly agreed matching basis for the quantitative comparison. In order to avoid semantic ambiguity of literal statements from distributed RSS publishers, fuzzy inference is used to transform an individual-dependent vector into an individual-independent vector and semantic similarities can then be revealed as the result.
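For readers unfamiliar with Apriori, the sketch below mines frequent itemsets level by level and computes the confidence of one rule. It shows the generic algorithm only, not the experiment described in this chapter; the impression attributes (`evening`, `sports_page`, `click` and so on) and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return frequent[antecedent | consequent] / frequent[antecedent]


# Hypothetical ad impressions, each described by a set of attributes.
impressions = [
    {"evening", "sports_page", "click"},
    {"evening", "sports_page", "click"},
    {"evening", "news_page"},
    {"morning", "sports_page"},
    {"evening", "sports_page", "click"},
    {"morning", "news_page"},
]
frequent = apriori(impressions, min_support=0.5)
rule_confidence = confidence(
    frequent, frozenset({"evening", "sports_page"}), frozenset({"click"})
)
print(len(frequent), "frequent itemsets; rule confidence:", rule_confidence)
```

In this toy data every evening impression on the sports page led to a click, so the rule has confidence 1.0. On live data such rules would be candidates to check with the live hypothesis testing mentioned above.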
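The quantitative matching step can be illustrated with a small sketch. It assumes, purely for illustration, that feeds and user interests are already expressed as weights over a shared list of ontology concepts and that cosine similarity is the matching measure. The chapter does not state which measure it uses, and the fuzzy-inference step that produces individual-independent vectors is not modelled here. All concept names, feed names and weights are hypothetical.

```python
import math


def to_vector(weights, concepts):
    """Turn a {concept: weight} mapping into a vector over `concepts`."""
    return [weights.get(concept, 0.0) for concept in concepts]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical shared concept list standing in for the ontology.
concepts = ["finance", "sports", "technology"]
user_interest = to_vector({"finance": 1.0, "technology": 0.5}, concepts)

# Hypothetical feeds with concept weights.
feeds = {
    "markets": {"finance": 0.9, "technology": 0.2},
    "gadgets": {"technology": 1.0},
    "football": {"sports": 1.0},
}
ranked = sorted(
    feeds,
    key=lambda name: cosine_similarity(user_interest, to_vector(feeds[name], concepts)),
    reverse=True,
)
print(ranked)
```

Ranking feeds by similarity to the user's own interest vector is one way to support the user-initiated discovery that the abstract calls for.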
Negotiation is a process between self-interested agents trying to reach an agreement on one or multiple issues in an e-commerce domain. The knowledge of an agent about the opponents' strategies improves the negotiation outcome. However, an agent negotiates with incomplete information about its opponent. Given this, to detect the opponent's strategy, we can use the similarity between opponents' strategies. In this paper we present a method for measuring the similarity between negotiators' strategies. Offers are generated by the agent's strategy; therefore, our similarity measure is based on the history of offers in negotiation sessions. We extended the Levenshtein distance technique to detect similarity between strategies. We implement this measure and experimentally show that using it improves the recognition of the opponent's strategy.
Viewed in terms of how pattern interestingness has evolved, data mining has gone through two phases: Phase 1, research focused on technical objective interestingness, and Phase 2, studies focused on both technical objective and subjective interestingness. As a result of these efforts, the patterns mined are of significant technical interest. However, technically interesting patterns are not necessarily of interest to business. In fact, real-world experience shows that many mined patterns, which are interesting from the perspective of the data mining method used, fall short of business expectations when they are delivered to the final user. This scenario actually involves a grand challenge in next-generation KDD (Knowledge Discovery in Databases) studies, defined as actionable knowledge discovery. To discover knowledge that can be used to take actions that yield business advantage, this paper presents a framework that extends the evolution process of knowledge evaluation to Phase 3 and Phase 4. In Phase 3, concerns with objective interestingness from a business perspective are added on top of Phase 2, while in Phase 4 both technical and business interestingness should be satisfied in terms of objective and subjective perspectives. The introduction of Phase 4 provides a comprehensive knowledge actionability framework for actionable knowledge discovery. We illustrate applications in governmental data mining showing that considering and adopting the framework described in Phase 4 has the potential to enhance both sides of interestingness and expectation. As a result, the knowledge discovered has a better chance of supporting action-taking in the business world.
This paper addresses the use of an evolutionary algorithm for the optimization of a K-nearest neighbor classifier to be used in the implementation of an intrusion detection system. The inclusion of a diversity maintenance technique embodied in the design of the evolutionary algorithm enables us to obtain different subsets of features extracted from network traffic data that lead to high classification accuracies. The methodology has been preliminarily applied to Denial of Service attack detection, a key issue in maintaining continuity of the services provided by business organizations.
In recent years, academics have been paying increasing attention to economic development in the Yangtze Delta of China, particularly its continuous development driven by growing Foreign Direct Investment (FDI). This article studies, by way of quantitative analysis, the correlation between FDI and economic development in the Yangtze Delta and, on this basis, analyzes the squeezing-in and squeezing-out effects of FDI on regional economic development and draws some conclusions.
We develop techniques for mining labor records from a large number of historical IT consulting projects in order to discover clusters of projects exhibiting similar resource usage over the project life-cycle. The clustering results, together with domain expertise, are used to build a meaningful project taxonomy that can be linked to project resource requirements. Such a linkage is essential for project-based workforce demand forecasting, a key input for more advanced workforce management decision support. We formulate the problem as a sequence clustering problem where each sequence represents a project and each observation in the sequence represents the weekly distribution of project labor hours across job role categories. To solve the problem, we use a model-based clustering algorithm based on explicit state duration left-right hidden semi-Markov models (HsMM) capable of handling high-dimensional, sparse, and noisy Dirichlet-distributed observations and sequences of widely varying lengths. We then present an approach for using the underlying cluster models to estimate future staffing needs. The approach is applied to a set of 250 IT consulting projects and the results are discussed.
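As a point of reference, the sketch below applies the standard Levenshtein distance to offer histories. The chapter's extension of the distance is not described in the abstract and is not reproduced here. The encoding of an offer history as a sequence of bucketed concessions, the bucket size `step`, the normalization into a similarity score and the example offers are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Standard edit distance between two sequences (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # substitute x with y (or match)
            ))
        previous = current
    return previous[-1]


def encode_offers(offers, step=5):
    """Encode an offer history as bucketed changes between successive offers."""
    return [round((later - earlier) / step)
            for earlier, later in zip(offers, offers[1:])]


def strategy_similarity(offers_a, offers_b, step=5):
    """Similarity in [0, 1]; 1.0 means identical concession sequences."""
    seq_a = encode_offers(offers_a, step)
    seq_b = encode_offers(offers_b, step)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(seq_a, seq_b) / longest


# Hypothetical seller offers from two negotiation sessions.
steady = [100, 95, 90, 85, 80]      # concedes the same amount every round
late_mover = [100, 99, 97, 90, 80]  # concedes little at first, a lot at the end
similarity = strategy_similarity(steady, late_mover)
print("similarity:", similarity)
```

Comparing the changes between offers rather than the raw prices makes two sessions with different starting prices comparable, which is one plausible reason to encode the history before measuring distance.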