Ebook: Data Mining for Business Applications
Data mining is already incorporated into the business processes in many sectors such as health, retail, automotive, finance, telecom and insurance as well as in government. This technology is well established in applications such as targeted marketing, customer churn detection and market basket analysis. It is also emerging as an important technology in a wide range of new application areas, such as social media, social networks and sensor networks. These areas pose new challenges both in terms of the nature of available data and the underlying support technology. This book contains extended versions of a selection of papers presented at a series of workshops held between 2005 and 2008 on the subject of data mining for business applications. It covers the entire spectrum of issues involved in the development of data mining systems. Areas covered include methodological issues and research challenges, typical problems for which data mining has proved to be an invaluable tool, and innovative applications of data mining which make this an exciting field to work in. The contributions illustrate the importance of maintaining close contact between researchers and practitioners: it is essential that researchers are exposed to and motivated by the real problems and practical constraints experienced by organizations, and practitioners need to interact with the research community to identify new opportunities to apply the latest technology. This book will be of interest not only to data mining researchers and practitioners, but also to students seeking a better understanding of the practical issues involved in building data mining systems.
The field of data mining is currently experiencing a very dynamic period. It has reached a level of maturity that has enabled it to be incorporated into the IT systems and business processes of companies across a wide range of industries. Information technology and e-commerce companies such as Amazon, Google, Yahoo, Microsoft, IBM, HP and Accenture are naturally at the forefront of these developments. In addition, data mining technologies are becoming well established in other industries and government sectors, such as health, retail, automotive, finance, telecom and insurance, as part of large corporations such as Siemens, Daimler, Walmart, Washington Mutual, Progressive Insurance and Portugal Telecom, as well as in governments across the world.
As data mining becomes a mainstream technology in businesses, data mining research has been experiencing explosive growth. In addition to well established application areas such as targeted marketing, customer churn, and market basket analysis, we are witnessing a wide range of new application areas, such as social media, social networks, and sensor networks. More traditional industries and business processes, such as healthcare, manufacturing, customer relationship management and marketing, are also applying data mining technologies in new and interesting ways. These areas pose new challenges both in terms of the nature of the data available (e.g., complex and dynamic data structures) and in terms of the underlying supporting technology (e.g., low-resource devices). These challenges can sometimes be tackled by adapting existing algorithms, but at other times they require new classes of techniques. This can be observed in the topics covered at existing major data mining conferences and journals, as well as in the introduction of new ones.
A major reason behind the success of the data mining field has been the healthy relationship between the research and the business worlds. This relationship is strong in many companies where researchers and domain experts collaborate to solve practical business problems. Many of the companies that integrate data mining into their products and business processes also employ some of the best researchers and practitioners in the field. Some of the most successful recent data mining companies have also been started by distinguished researchers. Even researchers in universities are getting more connected with businesses and are getting exposed to business problems and real data. Often, new breakthroughs in data mining research have been motivated by the needs and constraints of practical business problems. This can be observed at data mining scientific conferences, where companies are participating very actively and there is a lot of interaction between academia and industry.
As part of our (small) contribution to strengthening the collaboration between companies and universities in data mining, we have been helping to organize a series of workshops on Data Mining for Business Applications, held in conjunction with major conferences in the field:
• “Data Mining for Business” workshop, with ECML/PKDD, organized by Carlos Soares, Luís Moniz (SAS Portugal) and Catarina Duarte (SAS Portugal), which was held in Porto, Portugal, in 2005 (http://www.liaad.up.pt/dmbiz/).
• “Data Mining for Business Applications” workshop, with KDD, organized by Rayid Ghani and Carlos Soares, in Philadelphia, USA, in 2006 (http://labs.accenture.com/kdd2006_workshop/).
• “Practical Data Mining: Applications Experiences and Challenges” workshop, with ECML/PKDD, organized by Markus Ackermann (Univ. of Leipzig), Carlos Soares and Bettina Guidemann (SAS Deutschland), which took place in Berlin, Germany, in 2006 (http://wortschatz.uni-leipzig.de/ macker/dmbiz06/).
• “Data Mining for Business Applications” workshop, with KDD, organized by Rayid Ghani, Carlos Soares, Françoise Soulié-Fogelman (KXEN), Katharina Probst (Accenture Technology Labs) and Patrick Gallinari (Univ. of Paris), which was held in Las Vegas, USA, in 2008 (http://labs.accenture.com/kdd2008_workshop/).
This book contains extended versions of a selection of papers from these workshops. The chapters of this book cover the entire spectrum of issues in the development of data mining systems, with special attention to methodological issues. Although data mining has reached a reasonable level of maturity, and a large number and variety of algorithms, tools and knowledge are available to develop good models and integrate them into business processes, there is still room for research into new data mining methods. Many methodological issues remain open, affecting several phases of data mining projects, from business and data understanding to evaluation and deployment. As data mining is applied to new business problems, new research challenges are encountered, opening up large unexplored areas of research. The chapters in Part 1 discuss some of the most important of those issues. The authors offer diverse perspectives on those issues because of their different backgrounds and experience, which include the automotive industry, the data mining industry and the research community.
The book also covers a wide range of business domains, illustrating both classical applications as well as emerging ones. The chapters in Part 2 describe typical problems for which data mining has proved to be an invaluable tool, such as churn and fraud detection, and customer relationship management (CRM). They also cover some of the more important industries, namely banking, government, energy and healthcare. The issues addressed in these papers include important aspects such as how to incorporate domain-specific knowledge in the development of data mining systems and the integration of data mining technology in larger systems that aim to support core business processes. The applications in this book clearly show that data mining projects must not be regarded as independent efforts. They need to be integrated into larger systems to align with the goals of the organization and those of its customers and partners. Additionally, the output of data mining components must, in most cases, be integrated into the IT systems of the business and, therefore, in its (decision-making) processes, sometimes as part of decision-support systems (DSS).
The chapters in Part 3 are devoted to emerging applications of data mining. These chapters discuss the application of novel methods that deal with complex data like social networks and spatial data, to explore new opportunities in domains such as criminology and marketing intelligence. These chapters illustrate some of the exciting developments going on in the field and identify some of the most challenging opportunities. They stress the need for researchers to keep up with emerging business problems, identify potential applications and develop suitable solutions. They also show that companies must not only pay attention to the latest developments in research but also continuously challenge the research community with new problems. We believe that the flow of new and interesting applications will continue for many years and drive the research community to come up with exciting and useful data mining methods.
This book presents a collection of contributions that illustrates the importance of maintaining close contact between data mining researchers and practitioners. For researchers, it is essential to be exposed to and motivated by real problems and understand how business problems not only provide interesting challenges but also practical constraints which must be taken into account in order for their work to have high practical impact. For practitioners, it is not only important to be aware of the latest technology developments in data mining, but also to have continuous interactions with the research community to identify new opportunities to apply existing technologies and also provide the motivation to develop new ones.
We believe that this book will be interesting not only for data mining researchers and practitioners who are looking for new research and business opportunities in DM, but also for students who wish to get a better understanding of the practical issues involved in building data mining systems and to find further research directions. We hope that our readers will find this book useful.
Porto, Chicago – July 2010,
Carlos Soares and Rayid Ghani
This chapter introduces the volume on Data Mining (DM) for Business Applications. The chapters in this book provide an overview of some of the major advances in the field, namely in terms of methodology and applications, both traditional and emerging. In this introductory paper, we provide a context for the rest of the book. The framework for discussing the contents of the book is the DM methodology, which is suitable both for organizing and for relating the diverse contributions of the chapters selected. The chapter closes with an overview of the chapters in the book to guide the reader.
After nearly two decades of data mining research there are many commercial mining tools available, and a wide range of algorithms can be found in the literature. One might think there is a solution to most of the problems practitioners face. In our application of descriptive induction on warranty data, however, we found a considerable gap between many standard solutions and our practical needs. Confronted with challenging data and requirements such as understandability and support of existing workflows, we tried many things that did not work, ending up with simple solutions that do. We feel that the problems we faced are not so uncommon, and would like to advocate that it is better to focus on simplicity, which allows domain experts to bring in their knowledge, rather than on complex algorithms. Interactivity and simplicity turn out to be key to success.
Predictive analytics is a well known practice among corporations that do business with private consumers (B2C) as a means to achieve competitive advantage. The first part of this article intends to show that corporations operating in a business-to-business (B2B) setting are in a similar position to use predictive analytics in their favor. Predictive analytics can be applied to solve a myriad of business problems. The solutions to some of these problems are well known, while others require a considerable amount of research and innovation. However, predictive analytics professionals tend to solve similar problems in very different ways, even those for which there are known best practices. The second part of this article uses predictive analytics applications identified in a B2B context to describe a set of best practices for solving well known problems (the “let's not re-invent the wheel” attitude) and innovative practices for solving challenging problems.
Rigorous data mining (DM) research has successfully developed advanced data mining techniques and algorithms, and many organizations have great expectations of deriving more benefit from their vast data warehouses in decision making. Although there are some success stories, the current state of practice consists mainly of great expectations that have not yet been fulfilled. DM researchers have recently become interested in utility-based DM (UBDM), starting to consider some economic utility factors (such as the cost of data, cost of measurement, cost of class labels and so forth), yet many other utility factors are left outside the main directions of UBDM. The goal of this position paper is (1) to motivate researchers to consider utility from a broader perspective than is usual in the UBDM context and (2) to introduce a new generic framework for these broader utility considerations in DM research. Besides describing our multi-criteria utility-based framework (MCUF), we present a few hypothetical examples showing how the framework might be used to consider the utilities of some potential DM research stakeholders.
A central need in the emerging business of model-based prediction is to enable customers to validate the accuracy of a predictive product. This paper discusses how analysts can evaluate data mining models and their inferences from the customer viewpoint, where the customer is not particularly knowledgeable in data mining. To date, academia has focused primarily on the validation of algorithms through mathematical metrics and benchmarking studies. This type of validation is not sufficient in the business context, where organizations must validate specific models in terms that customers can understand quickly and effortlessly. We describe our predictive business and our customer validation needs. To that end, we discuss examples of customer needs, review issues associated with model validation, and point out how academic research may help to address these business needs.
This work focuses on one of the central topics in customer relationship management (CRM): the transfer of valuable customers to a competitor. Customer retention rate has a strong impact on customer lifetime value, and understanding the true value of a possible customer churn will help the company in its customer relationship management. Customer value analysis along with customer churn predictions will help marketing programs target more specific groups of customers. We predict customer churn with logistic regression techniques and analyze the churning and nonchurning customers using data from a consumer retail banking company. The results of the case study show that using conventional statistical methods to identify possible churners can be successful.
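To make the idea of predicting churn with logistic regression concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' model: the two features (a scaled product count and a scaled inactivity measure), the toy data and the function names `fit_logistic` and `churn_probability` are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this customer under the current model.
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for k, v in enumerate(x):
                grad_w[k] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def churn_probability(model, x):
    """Estimated probability that a customer with features x churns."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical, hand-made data: (products_held_scaled, inactivity_scaled).
rows = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.8, 0.1), (0.9, 0.2), (0.7, 0.3)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = stayed

model = fit_logistic(rows, labels)
print(churn_probability(model, (0.15, 0.85)))  # inactive customer, few products
print(churn_probability(model, (0.85, 0.15)))  # active customer, many products
```

In a setting like the one described, the estimated probabilities would typically be combined with a customer value estimate so that retention efforts go first to customers who are both valuable and likely to leave.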
This paper describes a methodology for the application of hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but they still have an important impact on the statistics produced by the institute. The detection of these rare errors is a manual, time-consuming task. This type of task is usually constrained by a limited amount of available resources. Our proposal addresses this issue by producing a ranking of outlyingness that allows better management of the available resources by allocating them to the cases which are most different from the others and, thus, have a higher probability of being errors. Our method is based on the output of standard agglomerative hierarchical clustering algorithms, resulting in no significant additional computational costs. Our results show that it enables large savings by selecting a small subset of suspicious transactions for manual inspection, which nevertheless includes most of the erroneous transactions. In this study we compare our proposal to a state-of-the-art outlier ranking method (LOF) and show that our method achieves better results on this particular application. The results of our experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions about the method currently followed at INE for items with a small number of transactions.
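One way to turn the merge sequence of an agglomerative clustering into an outlier ranking can be sketched as follows. This is an illustration only: the abstract does not give the scoring formula, so the score used here (late merges of a small group into a much larger one are treated as suspicious) is an assumption, as are the single-linkage choice, the one-dimensional toy values and the function name `outlier_ranking`.

```python
from itertools import combinations

def outlier_ranking(values):
    """Rank indices of `values` from most to least outlying.

    Runs a naive single-linkage agglomerative clustering and scores the
    members of the smaller group at each merge. The scoring rule is an
    assumed heuristic, not the formula from the chapter.
    """
    n = len(values)
    clusters = [[i] for i in range(n)]  # start with one cluster per value
    scores = [0.0] * n
    step = 0

    def linkage(pair):
        # Single linkage: distance between the closest members.
        a, b = pair
        return min(abs(values[i] - values[j]) for i in a for j in b)

    while len(clusters) > 1:
        step += 1
        a, b = min(combinations(clusters, 2), key=linkage)
        small, large = sorted((a, b), key=len)
        # Later merges and more unbalanced merges give higher scores.
        weight = (step / (n - 1)) * (1 - len(small) / (len(small) + len(large)))
        for i in small:
            scores[i] = max(scores[i], weight)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)

    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Hypothetical unit prices for one traded item; the last one looks like an error.
unit_prices = [10.0, 10.5, 11.0, 9.8, 10.2, 55.0]
ranking = outlier_ranking(unit_prices)
print(ranking[0])  # index of the most suspicious transaction
```

With such a ranking, a fixed inspection budget can be spent on the top-ranked transactions only, which is the resource-allocation idea described above.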
This paper presents an integrated system that helps both retail companies and electricity consumers in defining the best retail contracts and tariffs. This integrated system is composed of a Decision Support System (DSS) based on a Consumer Characterization Framework (CCF). The CCF is based on data mining techniques, applied to obtain useful knowledge about electricity consumers from large amounts of consumption data. This knowledge is acquired following an innovative and systematic approach that identifies different consumer classes, each represented by a load profile, and characterizes them using decision trees. The framework generates inputs for the knowledge base and the database of the DSS. The rule sets derived from the decision trees are integrated into the knowledge base of the DSS. The load profiles, together with the information about contracts and electricity prices, form the database of the DSS. The DSS is able to classify different consumers, present their load profiles and test different electricity tariffs and contracts. The final outputs of the DSS are a comparative economic analysis of different contracts and advice about the most economical contract for each consumer class. The presentation of the DSS is completed with an application example using a real database of consumers from the Portuguese distribution company.
Hospitals are adept at capturing large volumes of highly multi-dimensional data about their activities including clinical, demographic, administrative, financial and, increasingly, outcome data (such as adverse events). Managing and understanding this data is difficult as hospitals typically do not have the staff and/or the expertise to assemble, query, analyse and report on the potential knowledge contained within such data. The Power Knowledge Builder (PKB) project investigated the adaptation of data mining algorithms to the domain of patient costing, with the aim of helping practitioners better understand their data and therefore facilitate best practice.
In criminology research the question arises whether certain types of delinquents can be identified from data, and while there are many cases that cannot be clearly labeled, overlapping taxonomies have been proposed in [1,2,3]. In a recent study, juvenile offenders (N = 1572) from three state systems were assessed on a battery of criminogenic risk and needs factors and their official criminal histories. Cluster analysis methods were applied. One problem we encountered is the large number of hybrid cases that have to belong to two or more classes. To eliminate these cases we propose a method that combines the results of Bagged K-Means and the consistency method [4], a semi-supervised learning technique. A manual interpretation of the results showed very interpretable patterns that were linked to existing criminological research.
We propose a dynamic forecasting model for price in online auctions. One of the key features of our model is that it operates during the live auction, generating real-time forecasts, which makes it different from previous static models. Our model is also different with respect to how information about price is incorporated. While one part of the model is based on the more traditional notion of an auction's price level, another part incorporates its dynamics in the form of price velocity and acceleration. In that sense, it incorporates key features of a dynamic environment such as an online auction. The use of novel functional data methodology allows us to measure, and subsequently include, dynamic price characteristics. We illustrate our model on a diverse set of eBay auctions across many different book categories. It achieves significantly higher prediction accuracy compared to standard approaches.
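Price velocity and acceleration are the first and second derivatives of the price path over time. The chapter estimates them with functional data methods; the sketch below swaps in plain finite differences on unevenly spaced bids, purely to show what the two quantities measure. The function name `price_dynamics` and the bid history are hypothetical.

```python
def price_dynamics(times, prices):
    """Estimate price velocity and acceleration at interior bid times.

    Uses three-point finite differences for unevenly spaced observations
    (exact for quadratic price paths). This is a simple stand-in for the
    smoothing-based functional data approach mentioned in the text.
    """
    velocity, acceleration = [], []
    for k in range(1, len(times) - 1):
        h_prev = times[k] - times[k - 1]
        h_next = times[k + 1] - times[k]
        slope_prev = (prices[k] - prices[k - 1]) / h_prev
        slope_next = (prices[k + 1] - prices[k]) / h_next
        # First derivative: slopes weighted by the opposite interval length.
        velocity.append(
            (h_next * slope_prev + h_prev * slope_next) / (h_prev + h_next)
        )
        # Second derivative: change in slope over the mean interval length.
        acceleration.append(2 * (slope_next - slope_prev) / (h_prev + h_next))
    return velocity, acceleration

# Hypothetical bid history: time in days since auction start, price in dollars.
bid_times = [0.0, 1.0, 3.0, 4.0]
bid_prices = [0.0, 1.0, 9.0, 16.0]  # follows price = time ** 2

velocity, acceleration = price_dynamics(bid_times, bid_prices)
print(velocity)      # rate of price increase at days 1 and 3
print(acceleration)  # change in that rate at days 1 and 3
```

Real bid histories are noisy and irregular, which is one reason a smoothing step is normally applied before derivatives are taken; raw finite differences would amplify that noise.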
In this paper, we present a technology platform that can be customized to create a wide range of corporate radar applications that can turn the Web into a systematic source of business insight. This platform integrates a combination of established AI technologies – i.e. semantic models, natural language processing, and inference engines – in a novel way. We present two prototype corporate radars built using this platform: the Business Event Advisor, which detects threats and opportunities relevant to a decision maker's organization, and the Technology Investment Radar which assesses the maturity of technologies that impact a decision maker's business. The Technology Investment Radar has been piloted with business users, and we present encouraging initial results from this pilot.
Almost any data can be referenced in geographic space. Such data permit advanced analyses that utilize the position and relationships of objects in space as well as geographic background information. Even though spatial data mining is still a young research discipline, research advances in recent years have shown that the particular challenges of spatial data can be mastered and that the technology is ready for practical application when spatial aspects are treated as an integrated part of data mining and model building. In this chapter, we give a detailed description of several customer projects that we have carried out, all of which involve customized data mining solutions for business-relevant tasks. The applications range from customer segmentation to the prediction of traffic frequencies and the analysis of GPS trajectories. They have been selected to demonstrate key challenges, to provide advanced solutions and to raise further research questions.