At the International Research Workshop on Advanced High Performance Computing Systems, held in Cetraro in July 2014, a number of topics related to the concepts and processing of large data sets were discussed. This volume publishes a selection of contributions that cover, on the one hand, the concept of Big Data and, on the other, the processing and storage of very large data sets.
The advantages offered by Big Data have been expounded in the popular press over recent years. Reports on results obtained by analysing large data sets in various fields, such as marketing, medicine and finance, led to the claim that any important real-world problem could be solved if sufficient data is available. This claim ignores the fact that a deeper understanding of the fundamental characteristics of a problem is essential to ensure that an obtained result is valid and correct.
The collection, processing and storage of large data sets must always be seen in relative terms, compared to the communication, processing and storage capabilities of the computing platforms available at a given point in time. Thus the problem faced by Herman Hollerith in collecting, processing and archiving the data of the 1890 census in the USA was a Big Data problem: the data sets were very large compared to the processing and storage capabilities of his punch-card tabulating machines. The collected data were analysed with the aim of detecting patterns that could suggest answers to diverse questions.
Although researchers today have a large selection of extremely powerful computers available, they are faced with similar problems. The data sets that must be processed, analysed and archived are vastly bigger. Thus the requirement to process large data sets with available equipment has not changed fundamentally over the last century. The way scientific research is conducted has, however, changed through the advent of high performance and large scale computing systems.
With the introduction of the Third Paradigm of Scientific Research, computer simulations have replaced (or complemented) experimental and theoretical procedures. Simulations routinely create massive data sets that need to be analysed in detail, both in situ and in post-processing.
The use of available data for purposes other than those for which they were originally collected introduced the so-called Fourth Paradigm of Scientific Research. As mentioned above, this approach was already used by Herman Hollerith in his various analyses of the collected census data. Recently this approach, popularly called Big Data Analytics, has received widespread attention. Many such analyses detected patterns in data sets that led to significant new insights.
Such successes prompted the collection of data on a massive scale in many areas. Many of these massive data collections are created without specific direction from a well-identified experiment. In astronomy, for example, massive sky surveys replace individual, targeted observations, and the actual discoveries are achieved at a later stage with pure data analysis tools. In other words, the construction of massive data collections has become an independent, preliminary stage of the scientific process. The assumption is that accumulating enough data will automatically result in a repository rich in new events or features of interest. It is therefore hoped that a proper data mining process will find such new elements and lead to novel scientific insight. While in principle this outcome is not guaranteed, experience shows that collecting massive amounts of data has indeed systematically led to new discoveries in astronomy, genomics, climate science, and many other disciplines.
In each case, however, the validity and correctness of such new discoveries must be verified. Such verification is essential to substantiate or refute scientific theories, concerning, for example, the possible cause of an illness, the existence of a particle or the cause of climate change.
The papers selected for publication in this book discuss fundamental aspects of the definition of Big Data as well as considerations from practice, where complex data sets are collected, processed and stored. The concepts, problems, methodologies and solutions presented are of much more general applicability than the particular application areas considered may suggest. In this sense, these papers also contribute to the emerging field of Data Science.
The editors hope that readers will benefit from the theoretical and practical views and experiences presented in this book.
The editors are especially indebted to Dr Maria Teresa Guaglianone for her valuable assistance, as well as to Microsoft for making their CMT system available.
Lucio Grandinetti, Italy
Gerhard Joubert, Netherlands/Germany
Marcel Kunze, Germany
Valerio Pascucci, USA