Ebook: Big Data and High Performance Computing
Big Data has been much in the news in recent years, and the advantages conferred by the collection and analysis of large datasets in fields such as marketing, medicine and finance have led to claims that almost any real-world problem could be solved if sufficient data were available. This is of course a very simplistic view, and the usefulness of collecting, processing and storing large datasets must always be seen in terms of the communication, processing and storage capabilities of the computing platforms available.
This book presents papers from the International Research Workshop, Advanced High Performance Computing Systems, held in Cetraro, Italy, in July 2014. The papers selected for publication here discuss fundamental aspects of the definition of Big Data, as well as considerations from practice where complex datasets are collected, processed and stored.
The concepts, problems, methodologies and solutions presented are of much more general applicability than may be suggested by the particular application areas considered. As a result, the book will be of interest to all those whose work involves the processing of very large data sets, exascale computing and the emerging field of data science.
At the International Research Workshop on Advanced High Performance Computing Systems, held in Cetraro in July 2014, a number of topics related to the concepts and processing of large data sets were discussed. This volume publishes a selection of contributions that cover, on the one hand, the concept of Big Data and, on the other, the processing and storage of very large data sets.
The advantages offered by Big Data have been expounded in the popular press over recent years. Reports on results obtained by analysing large data sets in various fields, such as marketing, medicine and finance, have led to the claim that any important real-world problem could be solved if sufficient data were available. This claim ignores the fact that a deeper understanding of the fundamental characteristics of a problem is essential in order to ensure that the result obtained is valid and correct.
The collection, processing and storage of large data sets must always be seen in relative terms, compared with the communication, processing and storage capabilities of the computing platforms available at a given point in time. Thus the problem faced by Herman Hollerith in collecting, processing and archiving the data of the 1890 census in the USA was a Big Data problem: data sets that were very large compared with the processing and storage capabilities of his punch-card tabulating machine had to be processed. The collected data were analysed with the aim of detecting different patterns in order to suggest answers to diverse questions.
Although researchers today have a large selection of extremely powerful computers available, they are faced with similar problems. The data sets that must be processed, analysed and archived are vastly bigger. Thus the requirement to process large data sets with available equipment has not changed fundamentally over the last century. The way scientific research is conducted has, however, changed through the advent of high performance and large scale computing systems.
With the introduction of the Third Paradigm of Scientific Research, computer simulations have replaced (or complemented) experimental and theoretical procedures. Simulations routinely create massive data sets that need to be analysed in detail, both in situ and in post-processing.
The use of available data for purposes other than those for which they were originally collected introduced the so-called Fourth Paradigm of Scientific Research. As mentioned above, this approach was already used by Herman Hollerith in his various analyses of the collected census data. Recently this approach, popularly called Big Data Analytics, has received widespread attention. Many such data analyses have detected patterns in data sets that led to significant new insights.
Such successes prompted the collection of data on a massive scale in many areas. Many of these massive data collections are created without specific direction from a well-identified experiment. In astronomy, for example, massive sky surveys replace individual, targeted observations, and the actual discoveries are achieved at a later stage with pure data analysis tools. In other words, the construction of massive data collections has become an independent, preliminary stage of the scientific process. The assumption is that accumulating enough data will automatically result in a repository rich in new events or features of interest; it is hoped that a proper data mining process will find such new elements and lead to novel scientific insight. While in principle this outcome is not guaranteed, experience shows that collecting massive amounts of data has indeed systematically led to new discoveries in astronomy, genomics, climate science and many other scientific disciplines.
In each case, the validity and correctness of such new discoveries must, however, be verified. Such verifications are essential in order to substantiate or repudiate scientific theories, such as, for example, the possible cause of an illness, the existence of a particle or the cause of climate change.
The papers selected for publication in this book discuss fundamental aspects of the definition of Big Data as well as considerations from practice where complex data sets are collected, processed and stored. The concepts, problems, methodologies and solutions presented are of much more general applicability than may be suggested by the particular application areas considered. In this sense these papers are also a contribution to the now emerging field of Data Science.
The editors hope that readers can benefit from the contributions on theoretical and practical views and experiences included in this book.
The editors are especially indebted to Dr Maria Teresa Guaglianone for her valuable assistance, as well as to Microsoft for making their CMT system available.
Lucio Grandinetti, Italy
Gerhard Joubert, Netherlands/Germany
Marcel Kunze, Germany
Valerio Pascucci, USA
The general concept of the scientific method, or procedure, consists in systematic observation, experiment and measurement, and the formulation, testing and modification of hypotheses. In many cases a hypothesis is formulated in the form of a model, for example a mathematical or simulation model. The correctness of a solution produced by a model is verified by comparing it with collected data. Alternatively, observational data may be collected without a clearly specified purpose; such data may later apply to the solution of other, unforeseen problems. In such cases data analytics is used to extract relationships from, and detect structures in, data sets. In accordance with the scientific method, the results obtained can then be used to formulate one or more hypotheses and associated models as solutions for such problems. This approach makes it possible to ensure the validity of the solutions obtained. The results thus obtained may lead to deeper insight into such problems and can represent significant progress in scientific research. The increased interest in so-called Big Data has resulted in a growing tendency to consider the structures detected by analysing large data sets as solutions in their own right. A notion is thus developing that the scientific method is becoming obsolete. In this paper it is argued that data, hypotheses and models are all essential to gain deeper insight into the nature of the problems considered and to ensure that plausible solutions have been found. A further aspect to consider is that the processing of increasingly large data sets results in an increased demand for HTC (High Throughput Computing) in contrast to HPC (High Performance Computing). The demand for HTC platforms will impact the future development of parallel computing platforms.
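The danger of treating detected structures as solutions in their own right can be illustrated with a small, hypothetical sketch (plain Python; all data are invented random noise): searching many features for the one most correlated with a target will always "detect" a pattern, but only validation against independent data, in line with the scientific method, reveals whether it is plausible.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_feature(features, target):
    """Index of the feature most strongly correlated with the target."""
    return max(range(len(features)),
               key=lambda i: abs(pearson(features[i], target)))

rng = random.Random(42)
n, p = 200, 50
# Purely random data: there is no true relationship to discover.
train_target = [rng.gauss(0, 1) for _ in range(n)]
train_feats = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

i = best_feature(train_feats, train_target)
detected = pearson(train_feats[i], train_target)

# Verification step: fresh, independent draws of the same quantities.
test_target = [rng.gauss(0, 1) for _ in range(n)]
test_feat = [rng.gauss(0, 1) for _ in range(n)]
verified = pearson(test_feat, test_target)

print(f"detected pattern: r = {detected:+.2f}")
print(f"on fresh data:    r = {verified:+.2f}")
```

The "detected" correlation is merely the maximum over many noisy estimates; a hypothesis built on it must survive the comparison with new data before it counts as a solution.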
Data analysis applications often involve large datasets and complex software systems in which multiple data processing tools are executed in a coordinated way. Data analysis workflows are effective in expressing task coordination, and they can be designed through visual- and script-based programming paradigms. The Data Mining Cloud Framework (DMCF) supports the design and scalable execution of data analysis applications on Cloud platforms. A workflow in DMCF can be developed using a visual or a script-based language. The visual language, called VL4Cloud, is based on a design approach for high-level users, e.g. domain expert analysts with limited knowledge of programming paradigms. The script-based language, JS4Cloud, is provided as a flexible programming paradigm for skilled users who prefer to code their workflows through scripts. Both languages implement a data-driven task parallelism that spawns ready-to-run tasks on Cloud resources. In addition, they exploit implicit parallelism that frees users from duties such as workload partitioning, synchronization and communication. In this chapter we present the DMCF framework and discuss how its workflow paradigm has been integrated with the MapReduce model. In particular, we describe how VL4Cloud/JS4Cloud workflows can include MapReduce tools, and how these workflows are executed in parallel on DMCF, enabling scalable data processing on Clouds.
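The data-driven task parallelism described here can be sketched in plain Python (the workflow, task names and scheduler below are illustrative inventions, not the actual DMCF/JS4Cloud API): a task becomes ready to run as soon as all of its input data are available, and ready tasks are dispatched concurrently with no user-written synchronization.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical workflow: task name -> (input data names, output data name).
WORKFLOW = {
    "partition":  ([], "parts"),
    "classify_a": (["parts"], "model_a"),
    "classify_b": (["parts"], "model_b"),
    "vote":       (["model_a", "model_b"], "result"),
}

def run_task(name):
    # Stand-in for invoking a real data-processing tool on a Cloud resource.
    return f"{name}:done"

def execute(workflow):
    """Data-driven execution: spawn every task whose inputs are available."""
    available, done, submitted, log = set(), set(), set(), []
    with ThreadPoolExecutor(max_workers=4) as pool:
        running = {}  # future -> task name
        while len(done) < len(workflow):
            for name, (inputs, _) in workflow.items():
                if name not in submitted and all(i in available for i in inputs):
                    running[pool.submit(run_task, name)] = name
                    submitted.add(name)
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in finished:
                name = running.pop(fut)
                log.append(fut.result())
                done.add(name)
                available.add(workflow[name][1])  # output now available
    return log

log = execute(WORKFLOW)
print(log)
```

In this toy run the two classifiers execute in parallel as soon as the partitioning step has produced its output, mirroring how ready-to-run workflow tasks are spawned without explicit workload partitioning by the user.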
Scientific simulations often generate massive amounts of data used for debugging, restarts, and scientific analysis and discovery. The challenges that practitioners face in using data at this scale are unique. Of primary importance is the speed of writing data during a simulation, but this need for fast I/O is at odds with other priorities, such as data access time for visualization and analysis, efficient storage, and portability across a variety of supercomputer topologies, configurations, file systems, and storage devices. The computational power of high-performance computing systems continues to increase according to Moore's law, but the same is not true for I/O subsystems, creating a performance gap between computation and I/O. This chapter explores these issues, as well as possible optimization strategies, the use of in situ analytics, and a case study using the PIDX I/O library in a typical simulation.
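A minimal sketch of the in-situ idea (plain Python, with an invented toy "simulation") shows why it helps bridge the computation/I/O gap: instead of writing every full field to disk for later analysis, each step is reduced to a few statistics while the data are still in memory, shrinking the bytes that must cross the slow I/O path.

```python
import random
import struct

def simulate_step(rng, n=10_000):
    """Stand-in for one time step of a simulation producing a field."""
    return [rng.gauss(0, 1) for _ in range(n)]

rng = random.Random(0)
full_bytes = 0        # cost of the naive "write everything" strategy
reduced_records = []  # what in-situ analytics actually keeps

for step in range(8):
    field = simulate_step(rng)
    # Post-hoc approach: the whole field goes to storage (8 bytes/value).
    full_bytes += len(field) * 8
    # In-situ approach: reduce while data are in memory, keep only stats.
    reduced_records.append((step, min(field), max(field),
                            sum(field) / len(field)))

# Each record is one int plus three doubles when packed to disk.
reduced_bytes = len(reduced_records) * struct.calcsize("iddd")
print(f"full dump: {full_bytes} bytes")
print(f"in situ:   {reduced_bytes} bytes")
```

Real in-situ pipelines compute far richer products (histograms, features, visualizations), but the trade-off is the same: analysis moves to where the data already are, and only the reduced result is written.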
This paper reviews the Ogre classification of Big Data applications, with 50 facets divided into four groups or views. These four correspond to the Problem Architecture, the Execution mode, the Data source and style, and the Processing model used. We then look at multiple existing or proposed benchmark suites and analyze their coverage of the different facets, suggesting a process for obtaining a complete set. We illustrate this by looking at parallel data analytics benchmarked on multicore clusters.
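The process of assembling a facet-complete benchmark set can be sketched as a simple set-cover computation (the facet and suite names below are invented for illustration, not the actual Ogre facets or any real suites): greedily add the suite that covers the most still-missing facets until every facet is exercised.

```python
# Hypothetical facets and the subset each benchmark suite exercises.
FACETS = {"pleasingly-parallel", "map-reduce", "iterative", "streaming",
          "graph", "dense-linear-algebra"}

SUITES = {
    "suite_a": {"pleasingly-parallel", "map-reduce"},
    "suite_b": {"iterative", "dense-linear-algebra"},
    "suite_c": {"streaming", "graph", "map-reduce"},
    "suite_d": {"graph"},
}

def complete_set(suites, facets):
    """Greedy set cover: pick suites until all facets are exercised."""
    missing, chosen = set(facets), []
    while missing:
        best = max(suites, key=lambda s: len(suites[s] & missing))
        if not suites[best] & missing:
            raise ValueError(f"facets never covered: {sorted(missing)}")
        chosen.append(best)
        missing -= suites[best]
    return chosen

selection = complete_set(SUITES, FACETS)
print(selection)
```

The same bookkeeping, run over real suites annotated with the 50 facets, makes coverage gaps explicit: any facet left in `missing` signals a workload class no benchmark currently measures.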
This position paper addresses the technical foundation and non-technical framework of Big Data. A new era of data analytics promises tremendous value on the basis of cloud computing technology. Can our models scale to rapidly growing data? Can we perform predictive analytics in real-time? How do we efficiently deal with bad data? Some practical examples as well as open research questions are discussed.
Extreme data science is becoming increasingly important at the U.S. Department of Energy's National Energy Research Scientific Computing Center (NERSC). Many petabytes of data are transferred from experimental facilities to NERSC each year. Applications of importance include high-energy physics, materials science, genomics and climate modeling, with an increasing emphasis on large-scale simulations and data analysis. In response to the emerging data-intensive workloads of its users, NERSC made a number of critical design choices to enhance the usability of its pre-exascale supercomputer, Cori, which is scheduled to be delivered in 2016. These data enhancements include a data partition, a layer of NVRAM for accelerating I/O, user defined images and a customizable gateway for accelerating connections to remote experimental facilities.
The sheer volume of data accumulated in many scientific disciplines, as well as in industry, is a critical point that requires immediate attention. The handling of large data sets will become a limiting factor, even for data-intensive applications running on future Exascale systems. Nowadays, Big Data is more a collection of challenges for data processing at large scale than a tool box of solutions that improve applications, scale well, and handle constantly growing data sets. There is an urgent need for intelligent mechanisms to acquire, process, and analyze data, which have to run and scale efficiently on current and future computing architectures. The complexity of Big Data applications will profit greatly from flexible workflow systems that consider the full data life cycle, from data acquisition to long-term storage and the curation of knowledge. To maximize the applicability of HPC systems for Big Data workflows, several changes in the system architecture and its software need to be considered. First, in order to exploit all available I/O capacities, an adaptable monitoring system needs to collect information about the I/O patterns of applications and workflows, as well as provide information to model the I/O subsystem. The goal is to collect long-term performance data, to evaluate these data, and finally to show how and why resources cannot be used to their full potential. Second, as the complexity of systems is continuously increasing, the level of abstraction presented to the user needs to increase at least at the same rate in order to ensure that the current usability is at least maintained. This is accomplished by employing science gateways as well as workflow and metadata technologies.
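The first point, collecting I/O patterns of applications, can be hinted at with a tiny sketch (plain Python; a production monitoring system of the kind described would sit much deeper in the I/O stack, and the class below is purely illustrative): a wrapper records the size and latency of every write, producing exactly the kind of trace from which an I/O-subsystem model can be built.

```python
import io
import time

class MonitoredFile:
    """Minimal write-monitoring wrapper: logs (bytes, seconds) per call."""

    def __init__(self, raw):
        self.raw = raw
        self.trace = []  # one (size, latency) record per write

    def write(self, data):
        start = time.perf_counter()
        n = self.raw.write(data)
        self.trace.append((n, time.perf_counter() - start))
        return n

    def summary(self):
        """Aggregate the trace into the figures an I/O model starts from."""
        total = sum(size for size, _ in self.trace)
        return {"writes": len(self.trace), "bytes": total}

f = MonitoredFile(io.BytesIO())
for chunk in (b"header", b"x" * 4096, b"y" * 4096):
    f.write(chunk)
print(f.summary())
```

Aggregated over long periods, such traces show access-size distributions and latencies per application, which is precisely the evidence needed to explain why I/O resources are not used to their full potential.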
Advances in both sensor and computing technologies promise new approaches to discovery in materials science and engineering. For example, it appears possible to integrate theoretical modeling and experiment in new ways, test existing models with unprecedented rigor, and infer entirely new models from first principles. But, before these new approaches can become useful in practice, practitioners must be able to work with petabytes and petaflops as intuitively and interactively as they do with gigabytes and gigaflops today. The Discovery Engines for Big Data project at Argonne National Laboratory is tackling key bottlenecks along the end-to-end discovery path, focusing in particular on opportunities at Argonne's Advanced Photon Source. Here, we describe results relating to data acquisition, management, and analysis. For acquisition, we describe automated pipelines based on Globus services that link instruments, computations, and people for rapid and reliable data exchange. For management, we describe digital asset management solutions that enable the capture, management, sharing, publication, and discovery of large quantities of complex and diverse data, along with associated metadata and programs. For analysis, we describe the use of 100K+ supercomputer cores to enable new research modalities based on near-real-time processing and feedback, and the use of Swift parallel scripting to facilitate authoring, understanding, and reuse of data generation, transformation, and analysis software.
The paper starts with a classification of climate modeling within Big Data and presents research activities in DKRZ's two basic climate modeling workflows: climate model development and climate model data production. Research emphasis in climate model development is on code optimization for the efficient use of modern and future multi-core high performance computing architectures. Complementary research relates to increasing the I/O bandwidth between compute nodes and hard discs, as well as the efficient use of storage resources. Research emphasis in climate model data production is on optimization of the end-to-end workflow in its different stages, starting from climate model calculations, through the generation and storage of climate data products, and ending in long-term archiving, interdisciplinary data utilization, and research data publication for the integration of citable data entities in scientific literature articles.