
Ebook: Multi-Relational Data Mining

With the increased possibilities in modern society for companies and institutions to gather data cheaply and efficiently, the subject of Data Mining has become increasingly important. This interest has inspired a rapidly maturing research field, with developments on both a theoretical and a practical level, the latter including the availability of a range of commercial tools. Unfortunately, the widespread application of this technology has been limited by an important assumption in mainstream Data Mining approaches. This assumption – all data resides, or can be made to reside, in a single table – prevents the use of these Data Mining tools in certain important domains, or requires considerable massaging and altering of the data as a pre-processing step. This limitation has spawned a relatively recent interest in richer Data Mining paradigms that do allow structured data as opposed to the traditional flat representation. This publication covers the different uses of Data Mining, with Multi-Relational Data Mining (MRDM), the approach to Structured Data Mining, as its main subject.
As is customary for a Ph.D. thesis, the road towards completion of this text has been long. Two people have been instrumental in reaching the end successfully, and getting me started in the first place. I am very grateful to Pieter Adriaans for convincing me that my research ideas were a suitable basis for a dissertation, and that getting a degree was a mere formality and would be a matter of one or two years (a slight underestimate). Equal praise to my supervisor Arno Siebes, for supporting my ideas, having the patient conviction all would end well, and letting me do things my way. Whenever my research led me off the beaten track of mainstream Data Mining, he encouraged me to press on.
I would also like to thank my colleagues at the Large Distributed Databases (read ‘Data Mining’) group at Utrecht University, who, for obscure reasons, tended to come up with Spanish nicknames, ranging from Arniño to Pensionado. In particular, Lennart Herlaar, Rainer Malik and Carsten Riggelsen were of great help in getting the document printer-ready. Ad Feelders, also at the LDD group, devoted his time to reading through an early draft. I hope this was as beneficial to him as it was to me.
A greatly appreciated effort was made by Kathy Astrahantseff, who checked the manuscript for typos and bad phrasing. Thanks a lot for spending so much time crossing the t's and dotting the last i's.
The person with probably the most visible impact on the book as you are currently holding it is Lieske Meima, who spent lots of her valuable spare time designing the cover and taking wonderful pictures.
Many of the experimental results in this thesis would have been impossible without the hard work of the team at Kiminkii: Eric Ho, Bart Marseille, Wouter Radder and Michel Schaake. I am particularly indebted to Eric and Bart for helping me implement Safarii and ProSafarii. Even though, at times, they must have been wondering where all their efforts were leading, they can be proud of the end result.
Although at the end of the day, every letter in this thesis was conceived by me, a surprisingly small fraction of these letters was actually typed in person. Many thanks to Karin Klompmakers, Tiddo Evenhuis and Hans van Kampen for typing out endless pages of manuscript and sitting down with me to make corrections and draw tables and diagrams.
I want to express my gratitude to the members of the reading committee, Jean-François Boulicaut, Luc De Raedt, Peter Flach, Joost Kok and Hannu Toivonen, for voluntarily spoiling their summer carefully reading the manuscript and approving its publication.
Thanks also to my two assistants at the public defence of this thesis, Marc de Haas and Leendert van den Berg. Looking like a clown is best done in teams.
The following institutions have supported or contributed to the research reported in this thesis: Perot Systems Nederland B.V., the CWI (the Dutch national research laboratory for mathematics and computer science), the Telematica Institute, Utrecht University and Kiminkii.
Finally I have to mention my dad, Freerk Knobbe, who provided a lot of technical support and still recognizes randomly located paragraphs he claims to have typed. On many occasions, he helped out with tedious jobs such as creating an index or editing formulae in Word. His only complaint was that the randomness of his contributions prevented him from seeing the big picture and understanding the ‘plot’. I guess with the present dissertation in print, he will have to read it start to finish.
But all the technical and scientific support would have been in vain if it hadn't been for the moral support provided by my friends and family, in particular my parents.
Houten, September 2004