
Ebook: New Frontiers in High Performance Computing and Big Data

For the last four decades, parallel computing platforms have increasingly formed the basis for the development of high performance systems primarily aimed at the solution of intensive computing problems, and the application of parallel computing systems has also become a major factor in furthering scientific research. But such systems also offer the possibility of solving the problems encountered in the processing of large-scale scientific data sets, as well as in the analysis of Big Data in fields such as medicine, social media, marketing, and economics.
This book presents papers from the International Research Workshop on Advanced High Performance Computing Systems, held in Cetraro, Italy, in July 2016. The workshop covered a wide range of topics and new developments related to the solution of intensive and large-scale computing problems, and the contributions included in this volume cover aspects of the evolution of parallel platforms and highlight some of the problems encountered with the development of ever more powerful computing systems. The importance of future large-scale data science applications is also discussed.
The book will be of particular interest to all those involved in the development or application of parallel computing systems.
At the International Research Workshop on Advanced High Performance Computing Systems, held in Cetraro in July 2016, a wide spectrum of topics on new developments related to the solution of compute-intensive and large-scale problems was discussed. A selection of the contributions presented at the workshop is included in this volume.
During the last four decades, parallel compute platforms have increasingly formed the basis for the development of high performance systems primarily aimed at the solution of compute-intensive problems. However, such systems also offer the opportunity to solve large-scale data-intensive problems, including medical, societal, marketing, economic, and geographical problems, all characterized by very large data sets.
The papers collected for publication in this book cover aspects of the development of parallel systems, highlighting some of the problems encountered with the development of ever more powerful compute systems. On the one hand, the energy consumption of millions of processing elements places a limit on the future expansion of parallel systems; on the other, the scheduling of parallel processes to utilise the processing power of a large number of processors becomes a serious obstacle.
The application of parallel compute systems to the processing and analysis of large data sets has become a major factor in furthering scientific research. Data Science aspects are discussed by a number of authors. In addition, a number of case studies underline the importance of future large-scale data science applications.
The editors hope that readers will benefit from the theoretical and practical views and experiences presented in the contributions included in this book.
The editors are especially indebted to Dr Claudia Rotella and Dr Maria Teresa Guaglianone for their valuable assistance as well as to Microsoft for making their CMT system available.
Geoffrey Fox, USA
Vladimir Getov, UK
Lucio Grandinetti, Italy
Gerhard Joubert, Germany
Thomas Sterling, USA
Date: 2017-08-31
This paper focuses primarily on Sandia National Laboratories' Advanced Simulation and Computing (ASC) program progress and accomplishments in support of transformative co-design, but also includes the collaborative efforts with Los Alamos National Laboratory (LANL) and Lawrence Livermore National Laboratory (LLNL). The goal of transformative co-design is to develop future hardware and system architecture designs that ease the burden on ASC application, algorithm and system software developers. We describe the impact of Sandia's architectural simulation framework, the Structural Simulation Toolkit (SST), presenting case studies for co-design analysis that include analysis of multi-level DRAM memory in the ASC Trinity Cray XC40 platform, modern interconnect scaling, and photonic networks. The SST analysis capability has also been a useful vehicle for vendor collaborations and has been used to provide concrete recommendations for how optical device and interconnection network designs can address HPC needs. We describe the deployment and impact of our ASC advanced architecture test beds. We also describe our development of a new computational fluid dynamics (CFD) application, and the associated Kokkos programming model abstractions to achieve performance portability on diverse advanced architectures. Finally, we describe our most recent efforts in performance analysis for next generation architectures.
Recent developments of high-end processors recognize energy monitoring and tuning as one of the main challenges towards achieving higher performance given the growing power and temperature constraints. Our thermal energy model is based on application-specific parameters such as consumed power, execution time, and equilibrium temperature, as well as hardware-specific parameters such as the half time for thermal rise or fall. As observed with the out-of-band instrumentation and monitoring infrastructure on our experimental cluster with air cooling, the temperature changes follow a relatively slow capacitor-style charge-discharge process. Therefore, we use the lumped thermal model, which initiates an exponential process whenever there is a change in the processor's power consumption. Experiments with two codes – Firestarter and Nekbone – validate our approach and demonstrate its use for analyzing and potentially improving the application-specific balance between temperature, power, and performance.
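As a minimal sketch of the lumped thermal model described in this abstract, the exponential charge-discharge behaviour after a change in power consumption can be written as (the notation here is illustrative; the paper defines its own symbols):

    T(t) = T_{\mathrm{eq}} + \left(T_0 - T_{\mathrm{eq}}\right)\, e^{-t/\tau},
    \qquad \tau = \frac{t_{1/2}}{\ln 2}

where T_0 is the temperature at the moment the power level changes, T_eq is the equilibrium temperature associated with the new power level, and t_{1/2} is the hardware-specific half time for thermal rise or fall.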
While the very far future well beyond exaflops computing may encompass such paradigm shifts as quantum computing or neuromorphic computing, a critical window of change exists within the domain of semiconductor digital logic technology but beyond conventional practices of architecture, system software, and programming. As key parameters such as Dennard scaling, nano-scale component densities, clock rates, pin I/O, and voltage reach asymptotic operational regimes, one major area of untapped opportunity is computer architecture, which has been severely limited by conventional practices of organization and control semantics. Mainstream computer architecture in HPC has been inhibited in innovation by the original von Neumann architecture of seven decades ago. Although notably diverse in the forms of parallelism they exploit, the six major epochs of computer architecture through to the present are all von Neumann derivatives. At their core is the use of single instruction issue and the prioritization of floating-point unit (FPU) utilization. However, in the modern age, FPUs consume only a small part of die real estate, while the plethora of mechanisms to achieve maximum floating point efficiency takes up the majority of the chip. The von Neumann bottleneck, the separation of memory and processor, is also retained. A revolution in computer architecture design is possible by undoing the damage of the von Neumann heritage and emphasizing the key challenges of data movement latency and bandwidth, which are the true precious resources, along with operation/instruction issue control. This paper discusses key tradeoffs that should drive computer architecture in what might be called the “Neo-Digital Age”.
A data stream is a sequence of data flowing between source and destination processes. Streaming is widely used for signal, image and video processing for its efficiency in pipelining and its effectiveness in reducing demand for memory. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data analytics applications running on HPC platforms. We introduce MPIStream, an extension to MPI, the de facto programming standard on HPC. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases using MPI streams in HPC applications together with their parallel performance, and we show the convenience of using MPI streams to support the needs of both traditional HPC and emerging data analytics applications running on supercomputers.
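MPIStream's own interface is not reproduced here; the following sketch, written with mpi4py and hypothetical rank roles and tags, only illustrates the producer/consumer streaming pattern that such an extension generalizes on top of plain MPI point-to-point operations:

    # Hypothetical producer/consumer stream built on plain MPI calls
    # (this is not the MPIStream API itself).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    STREAM_TAG, END_TAG = 0, 1            # illustrative tags

    if rank == 0:                          # producer rank (assumption)
        for chunk_id in range(8):
            comm.send({"id": chunk_id, "data": [chunk_id] * 4},
                      dest=1, tag=STREAM_TAG)
        comm.send(None, dest=1, tag=END_TAG)   # signal end of stream
    elif rank == 1:                        # consumer rank (assumption)
        status = MPI.Status()
        while True:
            item = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == END_TAG:
                break
            print("consumed chunk", item["id"])

Run with, e.g., mpirun -n 2; an actual streaming library would add buffering, flow control and many-to-many connectivity on top of this basic pattern.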
Communication networks in recent high performance computing machines often have multi-dimensional torus topologies, which influences the way jobs should be scheduled onto the system. With the rapid growth in the size of modern HPC interconnects, network contention has become a critical issue for the performance of parallel jobs, especially those which are communication-intensive and not tolerant of inter-job interference. Moreover, to improve runtime consistency, a contiguous allocation strategy is usually adopted, in which each job is allocated a convex prism of nodes. However, this strategy introduces internal and external fragmentation, which can degrade system utilization. To this end, in this work we investigate and develop various topology-aware job scheduling strategies for multi-dimensional torus-based systems, with the objective of improving job performance and system utilization.
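As a hedged illustration of the contiguous (convex-prism) allocation strategy mentioned above, the sketch below checks whether an axis-aligned box of nodes is entirely free on a small 3D torus, wrapping coordinates modulo the torus size; the data structures and function names are illustrative assumptions, not the paper's implementation:

    # Illustrative check for a contiguous "convex prism" allocation on a 3D torus.
    # free[x][y][z] is True when node (x, y, z) is idle; names are hypothetical.
    def prism_is_free(free, origin, shape, torus_dims):
        ox, oy, oz = origin
        sx, sy, sz = shape
        dx, dy, dz = torus_dims
        for i in range(sx):
            for j in range(sy):
                for k in range(sz):
                    # wrap around each dimension, since torus links close the loop
                    x, y, z = (ox + i) % dx, (oy + j) % dy, (oz + k) % dz
                    if not free[x][y][z]:
                        return False
        return True

    # Example: a 4x4x4 torus with every node idle accepts a 2x2x1 prism at
    # (3, 0, 0), which wraps around the x dimension.
    free = [[[True] * 4 for _ in range(4)] for _ in range(4)]
    print(prism_is_free(free, (3, 0, 0), (2, 2, 1), (4, 4, 4)))   # True

A real scheduler would search over candidate origins and shapes, which is where the internal and external fragmentation discussed in the paper arises.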
Because of the steadily growing volume and complexity of the data exchanged through the Internet and among connected devices, and the need to rapidly process the incoming information, new challenges have been posed to High Performance Computing (HPC). Several architectures, programming languages, Cloud services and software packages have been proposed to deal efficiently with modern HPC applications, in particular with Big Data, and to solve the related issues that arise. As a consequence, users often find it difficult to select the right platform, application or service fitting their requirements. In order to support them in deploying their applications, Patterns have been adopted and exploited, providing a base canvas on which users can develop their own applications. In this paper we offer an overview of modern Pattern advancements for HPC and Big Data, also providing an insight into semantic-based technologies which have been successfully applied to provide a flexible representation thereof.
While High-Performance Computing (HPC) typically focuses on very large, parallel machines, i.e., Big Iron, running massive numerical codes, the importance of extracting knowledge from massive amounts of information, i.e., Big Data, has been clearly recognized. While many massive data sets can be produced within a single administrative domain, many more massive data sets can be, and must be, assembled from multiple sources. Aggregating data from multiple sources can be a tedious task. First, the locations of the desired data must be known. Second, access to the data sets must be allowed. For publicly accessible data, this may not pose a serious problem. However, many application domains and user groups may wish to facilitate, and have some degree of control over, how their resources are discovered and shared. Such collaboration requirements are addressed by federation management technologies. In this paper, we argue that effective, widely adopted federation management tools, i.e., Big Identity, are critical for enabling many Big Data applications, and will be central to how the Internet of Things is managed. To this end, we revisit the NIST cloud deployment models to extract and identify the fundamental aspects of federation management: crossing trust boundaries, trust topologies, and deployment topologies. We then review possible barriers to adoption and relevant, existing tooling and standards to facilitate the emergence of a common practice for Big Identity.
A main research goal of IT today is to investigate and design new scalable solutions for big data analysis. This goal calls for coupling scalable algorithms with high-performance programming tools and platforms. Addressing these challenges requires a seamless integration of scalable computing techniques with big data analytics research approaches and frameworks. In fact, scalability is a key requirement for big data analysis and machine learning applications. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of HPC systems and clouds, whereas in the near future exascale systems will be used to implement extreme-scale data analysis. This chapter introduces and discusses cloud models that support the design and development of scalable data mining applications, and reports on challenges and issues to be addressed and solved for developing data analysis algorithms on extreme-scale systems.
As applications approach extreme scales, data staging and in-situ/in-transit processing have been proposed to address the data challenges and improve scientific discovery. However, further research is necessary in order to understand how growing data sizes from data-intensive simulations, coupled with the limited DRAM capacity of high-end computing systems, will impact the effectiveness of this approach. Moreover, the complex and dynamic data exchange patterns exhibited by the workflows, coupled with varied data access behaviors, make efficient data placement within the staging area challenging. In this paper, we explore how we can use deep memory levels for data staging and develop a multi-tiered data staging method that spans both DRAM and solid state disks (SSD). This approach allows us to support both code coupling and data management for data-intensive simulation workflows. We also show how adaptive, application-aware data placement mechanisms can dynamically manage and optimize data placement vertically across the DRAM and SSD storage levels and horizontally across different staging nodes in this multi-tiered data staging method. We present an experimental evaluation of our approach using two OLCF resources, an InfiniBand cluster (Sith) and a Cray XK7 system (Titan), and using combustion (S3D) and fusion (XGC1) simulations. The evaluation results demonstrate that our approach can effectively improve data access performance and the overall efficiency of coupled scientific workflows.
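The paper's staging service is not reproduced here; the sketch below only illustrates the kind of application-aware vertical placement decision it describes, keeping frequently accessed objects in DRAM and spilling colder or oversized objects to SSD. The class, thresholds and object names are illustrative assumptions:

    # Illustrative application-aware placement across a DRAM tier and an SSD tier.
    class TieredStagingArea:
        def __init__(self, dram_capacity, ssd_capacity):
            self.capacity = {"dram": dram_capacity, "ssd": ssd_capacity}
            self.used = {"dram": 0, "ssd": 0}
            self.placement = {}                      # object name -> tier

        def place(self, name, size, access_rate, hot_threshold=10.0):
            """Prefer DRAM for hot objects that fit, otherwise fall back to SSD."""
            preferred = "dram" if access_rate >= hot_threshold else "ssd"
            fallback = "ssd" if preferred == "dram" else "dram"
            for tier in (preferred, fallback):
                if self.used[tier] + size <= self.capacity[tier]:
                    self.used[tier] += size
                    self.placement[name] = tier
                    return tier
            raise MemoryError("staging area full")

    staging = TieredStagingArea(dram_capacity=64, ssd_capacity=512)
    print(staging.place("coupling_field", size=16, access_rate=50.0))   # 'dram'
    print(staging.place("checkpoint_042", size=128, access_rate=0.1))   # 'ssd'

The adaptive mechanisms in the paper additionally rebalance placement horizontally across staging nodes, which this single-node sketch does not attempt.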
This paper proposes model rotation as a general approach to parallelizing big data machine learning applications. To solve the big model problem in parallelization, we distribute the model parameters to inter-node workers and rotate different model parts in a ring topology. The advantage of model rotation comes from maximizing the effect of parallel model updates for algorithm convergence while minimizing the overhead of communication. We formulate a solution using computation models, programming interfaces, and system implementations as design principles, and derive a machine learning framework with three algorithms built on top of it: Latent Dirichlet Allocation using Collapsed Gibbs Sampling, and Matrix Factorization using Stochastic Gradient Descent and Cyclic Coordinate Descent. The performance results on an Intel Haswell cluster with up to 60 nodes show that our solution achieves faster model convergence and higher scalability than previous work by others.
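As a hedged sketch of the ring rotation idea (not the authors' framework), the mpi4py fragment below lets each worker update its local model shard against local data and then pass the shard to the next rank in the ring, so every shard visits every worker once per epoch; the update function and shard contents are placeholders:

    # Illustrative model rotation in a ring of MPI workers (not the authors' code).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    shard = {"part": rank, "params": [0.0] * 4}   # this worker's model shard

    def local_update(shard):
        # placeholder for, e.g., a Gibbs sampling or SGD sweep over local data
        shard["params"] = [p + 1.0 for p in shard["params"]]
        return shard

    for step in range(size):                       # one full rotation of the ring
        shard = local_update(shard)
        # hand the shard to the next worker, receive one from the previous worker
        shard = comm.sendrecv(shard,
                              dest=(rank + 1) % size,
                              source=(rank - 1) % size)

Because each shard is updated by exactly one worker at a time, the rotation avoids conflicting parallel updates while keeping communication limited to neighbouring ranks.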
Today, about 55 per cent of the world's population lives in urban areas, a proportion that is expected to increase to 66 per cent by 2050. Such steadily increasing urbanization is already bringing huge social, economic and environmental transformations and, at the same time, poses big challenges for city management, including resource planning (water, electricity), traffic, air and water quality, public policy and public safety services. To face such challenges, the exploitation of information coming from urban environments and the development of Smart City applications that enhance the quality, performance and safety of urban services are key elements. This chapter discusses how the analysis of urban data may be exploited for forecasting crimes and presents an approach, based on seasonal auto-regressive models, for reliably forecasting crime trends in urban areas. In particular, the main goal of this work is to discuss the impact of data mining on urban crime analysis and to design a predictive model to forecast the number of crimes that will happen over rolling time horizons. As a case study, we present the analysis performed on an area of Chicago. Experimental evaluation results show that the proposed methodology can achieve high predictive accuracy for long-term crime forecasting, and thus can be successfully exploited to predict the time evolution of the number of crimes in urban environments.
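A minimal, hedged sketch of fitting a seasonal auto-regressive model to a weekly crime-count series with statsmodels follows; the synthetic data, model orders and forecast horizon are illustrative and are not the configuration used in the chapter:

    # Illustrative seasonal ARIMA fit on a synthetic weekly crime-count series.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    weeks = pd.date_range("2014-01-05", periods=156, freq="W")
    seasonal = 10 * np.sin(2 * np.pi * np.arange(156) / 52)      # yearly cycle
    counts = pd.Series(100 + seasonal + rng.normal(0, 5, 156), index=weeks)

    # order/seasonal_order are illustrative; the chapter selects its own parameters
    model = SARIMAX(counts, order=(1, 0, 1), seasonal_order=(1, 1, 1, 52))
    result = model.fit(disp=False)
    forecast = result.get_forecast(steps=12)       # rolling 12-week horizon
    print(forecast.predicted_mean.head())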
Future exascale computers will have to be suitable both for data science applications and for more “traditional” modeling and simulation. However, data science applications are often posed as questions about discrete objects such as graphs, while problems in modeling and simulation are usually stated initially in terms of classical mathematical analysis. We will present examples and arguments to show that the two points of view are not as distinct as one might think. Recognizing the connections between the two problem sets will be essential to the development of algorithms capable of exascale performance. Our main examples will be from applications of Monte Carlo methods to hard problems of the kind that occur both in data science and in computational modeling of physical phenomena. We will illustrate how taking ideas from both worlds pays wonderful dividends.
In this chapter, we present the optimization strategy developed at NERSC for moving user applications from traditional x86 CPU-based HPC systems to the many-core Xeon Phi-powered Cori system. We target the Xeon Phi's many cores, large vector units and high-bandwidth memory. The strategy we developed is general and intended to be applicable to NERSC's 6,000 users. We present three application case studies from different science areas to illustrate the Cori optimization process.
The utility of satisfiability (SAT) as an application-focused hard computational problem is well established. We explore the potential of quantum annealing to enhance classical SAT solving, especially where sampling from the space of all possible solutions is of interest. We address the formulation of SAT problems to make them suitable for commercial quantum annealers, practical concerns in their implementation, and how the performance of the resulting quantum solver compares to and complements that of classical SAT solvers.
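As a hedged illustration of the kind of formulation step discussed above (not the paper's actual encoding), the fragment below turns a single two-literal clause, (x1 OR NOT x2), into a quadratic penalty that is zero exactly when the clause is satisfied, and checks it by brute force; annealer-ready SAT encodings chain such penalties over all clauses, typically with auxiliary variables for longer clauses:

    # Illustrative QUBO penalty for the clause (x1 OR NOT x2): the penalty is 1
    # only for the single unsatisfying assignment x1 = 0, x2 = 1.
    from itertools import product

    linear = {"x2": 1.0}                   # +x2
    quadratic = {("x1", "x2"): -1.0}       # -x1*x2, so penalty = (1 - x1) * x2

    def penalty(assignment):
        value = sum(coeff * assignment[v] for v, coeff in linear.items())
        value += sum(coeff * assignment[a] * assignment[b]
                     for (a, b), coeff in quadratic.items())
        return value

    for x1, x2 in product([0, 1], repeat=2):
        a = {"x1": x1, "x2": x2}
        print(a, "penalty =", penalty(a), "satisfied =", bool(x1 or not x2))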