
Ebook: Transition of HPC Towards Exascale Computing
The US, Europe, Japan and China are racing to develop the next generation of supercomputers – exascale machines capable of 10¹⁸ calculations per second – by 2020. But the barriers are daunting: the challenge is to change the paradigm of high-performance computing.
The biennial High Performance Computing workshop, held in Cetraro, Italy, in June 2012, focused on the challenges facing the computing research community in reaching exascale performance in the next decade. This book presents papers from this workshop, arranged into four major topics: energy, scalability, new architectural concepts and programming of heterogeneous computing systems.
Chapter 1 introduces the status of present supercomputers, which are still about two orders of magnitude short of the exascale mark. Chapter 2 examines energy demands, a major limiting factor of today's fastest supercomputers; the leap in performance needed for exascale computing will require a shift in architectures and technology. In Chapter 3, scalable computing paradigms for dense linear algebra on massive heterogeneous systems are presented, and Chapter 4 discusses architectural concepts. Finally, Chapter 5 addresses the programming of heterogeneous systems.
This book will be of interest to all those wishing to understand how the development of modern supercomputers is set to advance in the next decade.
The biennial High Performance Computing workshop, held in Cetraro, Italy, in June 2012, focused on the challenges facing the computing research community in reaching exascale performance in the next decade. The contributions illustrated the wish to achieve this feat, but also revealed the need for a coordinated strategy to develop, deploy and program exascale systems. The papers in this book are arranged into four major topics: energy, scalability, new architectural concepts and programming of heterogeneous computing systems.
Chapter 1 introduces the status of present supercomputers. The fastest computers today are still about two orders of magnitude short of the exascale mark. The exascale challenges require a rethinking of computing systems at all levels: hardware, software, algorithms and their interaction.
Energy demands are a major limiting factor for today's fastest supercomputers. In Chapter 2 it is argued that the next quantum leap in performance will require a shift in architectures and technology to achieve the 20 MW target for exascale computing. Energy-saving data caches are proposed to minimize the power consumption of the memory subsystem.
Grand computational problems will run on millions of cores. In Chapter 3, scalable computing paradigms for dense linear algebra on massive heterogeneous systems are presented. Scalability also implies new strategies for fast synchronization between cooperating threads running at different levels of the hardware hierarchy.
Architectural concepts are presented in Chapter 4. Data-centric computing and the blending of an algorithm with the architecture require a new vision of hardware/software co-design and its impact on communication.
Finally, Chapter 5 addresses programming heterogeneous systems. The challenge is to create a uniform programming environment capable of unleashing the maximal performance from accelerators and multicores. A uniform environment allows tuning a program to harvest the benefits of all parallel hardware.
The editors thank the authors for their high-quality research leading to a collection of papers presenting a balanced perspective on the transition of high performance computing towards exascale computing.
Special thanks go to the reviewers for their expert assistance and valuable suggestions.
We are especially grateful to Maria Teresa Guaglianone for her efficient management of the manuscripts and extensive communication with reviewers, editors and authors.
Erik D'Hollander, Belgium
Jack Dongarra, USA
Ian Foster, USA
Lucio Grandinetti, Italy
Gerhard Joubert, Netherlands/Germany
12 July 2013
We describe the development and key characteristics of the “K computer” supercomputer system, which was developed by RIKEN and Fujitsu as part of an initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT) from 2006 to 2012. We highlight the technologies used to create the K computer, a massively parallel supercomputer system with over 80,000 CPUs and eight cores per CPU. Balancing power consumption, reliability and efficiency in application performance was the key design consideration. The K computer achieved the top performance in the TOP500 ranking in June 2011 (8.16 petaflop/s with 548,352 processor cores) and again in November 2011 (10.51 petaflop/s with 705,024 processor cores), becoming the first system in the world to break the 10 petaflop/s barrier. The K computer is now operational at the Advanced Institute for Computational Science (AICS) in Kobe, Japan. We also describe the PRIMEHPC FX10, which utilizes enhanced technologies implemented in the K computer, and the future directions we intend to take.
Development of exascale computers before the end of this decade is an ambitious goal being pursued by the international HPC community. The challenges are numerous, the most important being power consumption and dissipation, application scaling, and reliability. These challenges occur at all levels of abstraction: materials and devices, circuit design, logic design, architecture, the software stack, and algorithms. We need to meet these challenges not only at every level of abstraction, but also to look at the interactions between levels.
The growth rate in energy consumed by data centers in the United States has declined in the past five years compared to its earlier accelerating pace. This reduced growth rate was achieved in large part through energy efficiency improvements. Measuring, monitoring and managing usage has been key to making these improvements. The metric Power Usage Effectiveness (PUE) has been effective in driving the energy efficiency of data centers, but it has limitations: PUE does not account for the power distribution and cooling losses inside the IT equipment, which is particularly problematic for HPC. Similarly, reporting performance and analyzing the amount of power used to run High Performance Linpack for a Top500 and/or Green500 submission has been successful in helping to drive improvements in supercomputing system energy efficiency; power efficiency (megaflops per watt) nearly tripled on average between 2007 and 2011. But just as PUE is not a perfect metric for the data center, there are also problems with the power/energy measurement methodologies, workloads and metrics for supercomputer systems. Work is actively being done on both of these topics. Achieving the 20 MW target for an exascale system is challenging and will require shifts in architecture, technology and application usage models, as well as tighter coupling between the data center infrastructure and the computer system. This paper describes initiatives led by the Energy Efficient High Performance Computing Working Group that will help us hit the 20 MW target.
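To make the two metrics above concrete, here is a minimal sketch in Python; the figures used are hypothetical illustrations, not measurements reported in this chapter.

```python
# Minimal sketch of the two efficiency metrics discussed above.
# All numbers below are hypothetical illustrations, not reported measurements.

def pue(total_facility_power_kw, it_equipment_power_kw):
    """Power Usage Effectiveness: total facility power over IT equipment power.
    Note that losses inside the IT equipment itself are not captured."""
    return total_facility_power_kw / it_equipment_power_kw

def mflops_per_watt(linpack_rmax_gflops, system_power_kw):
    """Energy efficiency as reported by the Green500: MFlop/s per watt."""
    return (linpack_rmax_gflops * 1e3) / (system_power_kw * 1e3)

print(f"PUE = {pue(1200.0, 1000.0):.2f}")                               # 1.20
print(f"{mflops_per_watt(2_000_000.0, 4_000.0):.0f} MFlop/s per watt")  # 500
```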
With each CMOS technology generation, leakage energy has been increasing at an exponential rate. Since modern processors employ large last level caches (LLCs), their leakage energy consumption has become an important concern in modern chip design. Several techniques have been proposed to address this issue, but most of them require offline profiling and hence cannot be used in real-life systems which run trillions of instructions of arbitrary applications. In this paper, we propose Palette, a technique for saving cache leakage energy using cache coloring. Palette uses a small hardware component called the reconfigurable cache emulator to estimate the performance and energy consumption of multiple cache configurations, and then selects the configuration with the least energy consumption. Simulations performed with SPEC2006 benchmarks show the superiority of Palette over an existing cache energy saving technique. With a 2MB baseline cache, the average savings in memory subsystem energy and EDP (energy delay product) are 31.7% and 29.5%, respectively.
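As a rough illustration of the selection step just described, here is a minimal Python sketch; the per-configuration estimates are made up, whereas in Palette they are produced online by the reconfigurable cache emulator.

```python
# Hypothetical per-configuration estimates (normalized); in Palette these are
# produced at runtime by the reconfigurable cache emulator, not hard-coded.
configs = {
    # name: (estimated_energy, estimated_execution_time)
    "2MB_baseline":  (1.00, 1.00),
    "1MB_colored":   (0.78, 1.04),
    "512KB_colored": (0.70, 1.15),
}

def edp(energy, delay):
    """Energy delay product: lower is better."""
    return energy * delay

# Select the configuration with the least estimated energy consumption.
best = min(configs, key=lambda name: configs[name][0])
energy, delay = configs[best]
print(f"selected {best}: energy={energy:.2f}, EDP={edp(energy, delay):.2f}")
```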
The design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s have forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software, and it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Direct Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
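As an illustration of the DAG-based formulation (a plain-Python sketch, not PLASMA, MAGMA or DAGuE code), the fragment below builds the task graph of a tiled Cholesky factorization and asks the standard library for one valid execution order; a runtime scheduler would instead dispatch ready tasks to cores and accelerators as soon as their dependences are satisfied.

```python
# Sketch of the task DAG of a tiled Cholesky factorization (illustration only;
# this is not the PLASMA/DPLASMA runtime).  Task names carry the target tile
# indices; values are the sets of tasks they depend on.
from graphlib import TopologicalSorter

def cholesky_task_dag(nt):
    deps = {}
    for k in range(nt):
        deps[f"POTRF({k})"] = {f"SYRK({k},{k-1})"} if k else set()
        for i in range(k + 1, nt):
            d = {f"POTRF({k})"}
            if k:
                d.add(f"GEMM({i},{k},{k-1})")       # last update of tile (i,k)
            deps[f"TRSM({i},{k})"] = d
        for i in range(k + 1, nt):
            d = {f"TRSM({i},{k})"}
            if k:
                d.add(f"SYRK({i},{k-1})")           # previous update of (i,i)
            deps[f"SYRK({i},{k})"] = d
            for j in range(k + 1, i):
                d = {f"TRSM({i},{k})", f"TRSM({j},{k})"}
                if k:
                    d.add(f"GEMM({i},{j},{k-1})")   # previous update of (i,j)
                deps[f"GEMM({i},{j},{k})"] = d
    return deps

# One valid sequential order; a runtime would execute independent tasks
# concurrently as soon as their predecessors complete.
print(list(TopologicalSorter(cholesky_task_dag(3)).static_order()))
```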
This paper explores asynchrony in the context of high performance computing: where it comes from, why it matters, and how exascale computing affects our consideration of it. We present a list of key concepts that are expected to permit practical exascale computing in the presence of asynchrony. We then introduce the ParalleX execution model along with its open source runtime system implementation, High Performance ParalleX (HPX). Performance examples of HPX using open source benchmarks are then presented.
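For readers unfamiliar with the programming style, the sketch below shows futures-based asynchrony using only the Python standard library; it is not HPX, merely an illustration of expressing work as asynchronous tasks whose results are consumed as they become available rather than in bulk-synchronous phases.

```python
# A minimal sketch of futures-based asynchrony (not HPX): work is submitted as
# independent tasks and results are consumed in completion order.
from concurrent.futures import ThreadPoolExecutor, as_completed
import math

def partial_sum(lo, hi):
    # A stand-in for an arbitrary unit of work.
    return sum(math.sqrt(x) for x in range(lo, hi))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Launch work asynchronously; execution overlaps instead of proceeding
    # in lock-step phases.
    futures = [pool.submit(partial_sum, i * 100_000, (i + 1) * 100_000)
               for i in range(8)]
    # Consume results in completion order rather than submission order.
    total = sum(f.result() for f in as_completed(futures))

print(f"total = {total:.3f}")
```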
Many-core architectures are omnipresent in modern life. They can be found in mobile phones, tablet computers, game consoles, laptops, desktop and server systems. Although many-core systems are common in powerful mainstream systems, their core count is still in the lower tens and has not increased much over recent years. Truly massively parallel systems with core counts in the higher tens or even hundreds have so far only been seen in custom-made architectures for High Performance Computing systems or in innovative new architectures from startup companies. Nevertheless, the core count has already reached a critical mass that exposes the difficulty of increasing performance while reducing power. This requires careful orchestration of the many cores with efficient synchronization constructs that reduce the idle time of waiting cores and use power-efficient synchronization operations.
In this paper we investigate the feasibility, usefulness and trade-offs of different synchronization mechanisms, especially fine-grain in-memory synchronization support, in a real-world large-scale many-core chip (IBM Cyclops-64). We extended the original Cyclops-64 architecture design at the gate level to support the fine-grain in-memory synchronization features. We performed an in-depth study of a well-known kernel code: the wavefront computation. Several optimized versions of the kernel code were used to test the effects of different synchronization constructs using our chip emulation framework. Furthermore, we compared selected SPEC OpenMP kernel loops using these mechanisms against existing well-known software-based synchronization approaches.
In our wavefront benchmark study, the combination of fine-grain dataflow-like in-memory synchronization with non-strict scheduling methods yields a thirty percent improvement over the best optimized traditional synchronization method provided by the original Cyclops-64 design. For the SPEC OpenMP kernel loops, we achieved speedups of three to fourteen times over software-based synchronization methods.
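For readers unfamiliar with the kernel, the sketch below shows the wavefront dependence pattern in plain Python/NumPy (illustration only; it says nothing about the Cyclops-64 hardware): each interior cell depends on its north and west neighbours, so cells on the same anti-diagonal are independent, and a fine-grain scheme lets each cell wait only on its two producers instead of on a whole-diagonal barrier.

```python
# Wavefront dependence pattern: cell (i, j) depends on (i-1, j) and (i, j-1),
# so all cells on one anti-diagonal can be computed in parallel.
import numpy as np

def wavefront(n, m):
    grid = np.zeros((n, m))
    grid[0, :] = 1.0
    grid[:, 0] = 1.0
    for d in range(2, n + m - 1):                       # sweep anti-diagonals
        for i in range(max(1, d - m + 1), min(n, d)):
            j = d - i
            # With fine-grain synchronization, this cell would wait only on
            # its two producers rather than on a whole-diagonal barrier.
            grid[i, j] = 0.5 * (grid[i - 1, j] + grid[i, j - 1])
    return grid

print(wavefront(5, 5))
```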
Preparations for Exascale computing have led to the realization that future computing environments will be significantly different from those that provide Petascale capabilities. This change is driven by energy constraints, which are compelling architects to design systems that will require a significant rethinking of how algorithms are developed and implemented. Co-design has been proposed as a methodology for the scientific application, software and hardware communities to work together. This chapter gives an overview of co-design and discusses the early application of this methodology to High Performance Computing.
Application programming for modern heterogeneous systems, which comprise multi-core CPUs and multiple GPUs, is complex and error-prone. Approaches like OpenCL and CUDA are relatively low-level, as they require explicit handling of parallelism and memory, and they do not offer support for multiple GPUs within a stand-alone computer, nor for distributed systems that integrate several computers. In particular, distributed systems require application developers to use a mix of programming models, e.g., MPI together with OpenCL or CUDA.
We propose a uniform, high-level approach for programming both stand-alone and distributed systems with many cores and multiple GPUs. The approach consists of two parts: 1) the dOpenCL runtime system for the transparent execution of OpenCL programs on several stand-alone computers connected by a network, and 2) the SkelCL library for high-level application programming on heterogeneous stand-alone systems with multi-core CPUs and multiple GPUs. While dOpenCL provides transparent accessibility of arbitrary computing devices (multi-core CPUs and GPUs) across distributed systems, SkelCL offers a set of pre-implemented patterns (skeletons) of parallel computation and communication, which greatly simplify programming these devices. Both parts are built on top of OpenCL, which ensures their high portability across different kinds of processors and GPUs.
We describe dOpenCL and SkelCL, demonstrate how our approach simplifies programming for distributed systems with many cores and multiple GPUs, and report experimental results on a real-world application from the field of medical imaging.
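The fragment below is a generic Python illustration of the skeleton idea, not the SkelCL API: the pattern owns parallelism and data distribution, while the application supplies only the customizing function.

```python
# A generic illustration of an algorithmic skeleton (plain Python, not SkelCL):
# the skeleton encapsulates the parallel pattern, the user supplies only the
# per-element function.
from multiprocessing import Pool

class MapSkeleton:
    """A 'map' pattern: apply a user function to every element in parallel."""
    def __init__(self, func, workers=4):
        self.func = func
        self.workers = workers

    def __call__(self, data):
        with Pool(self.workers) as pool:
            return pool.map(self.func, data)

def scale(x):
    return 2.0 * x

if __name__ == "__main__":
    double = MapSkeleton(scale)
    print(double([1.0, 2.0, 3.0, 4.0]))    # [2.0, 4.0, 6.0, 8.0]
```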
The performance and versatility of today's PCs exceed many times over the power of the fastest number crunchers of the 1990s. Yet the computational hunger of many scientific applications has led to the development of GPU and FPGA accelerator cards. In this paper, the programming environment and the performance analysis of a super desktop with a combined GPU/FPGA architecture are presented. A unified roofline model is used to compare the performance of the GPU and the FPGA, taking into account the computational intensity of the algorithm and the resource consumption. The model is validated with two image processing kernels, which are compiled using OpenCL for the GPU and a C-to-VHDL compiler for the FPGA. It is shown that an FPGA compiler outperforms handwritten code and is highly productive, but also uses more resources. While both the GPU and the FPGA excel in particular applications, both devices suffer from the limited I/O bandwidth to the processor.
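A minimal sketch of a roofline-style estimate in Python follows; the peak and bandwidth figures are hypothetical, and the chapter's unified model additionally accounts for FPGA resource consumption.

```python
# Roofline-style estimate: attainable performance is capped either by the
# compute peak or by memory bandwidth times operational intensity.
def roofline(peak_gflops, bandwidth_gb_s, operational_intensity):
    """Attainable GFlop/s for a kernel with the given operational intensity
    (flops per byte moved)."""
    return min(peak_gflops, bandwidth_gb_s * operational_intensity)

# Hypothetical devices (peak GFlop/s, memory bandwidth GB/s) and kernels.
devices = {"GPU": (1000.0, 150.0), "FPGA": (200.0, 25.0)}
for name, (peak, bw) in devices.items():
    for oi in (0.25, 2.0, 16.0):
        print(f"{name}: OI={oi:5.2f} flop/byte -> "
              f"{roofline(peak, bw, oi):7.1f} GFlop/s attainable")
```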
With the deluge of sequence data resulting from next generation sequencing comes the need to leverage large-scale biological sequence data. The role of high performance computational methods for mining interesting information solely from these sequence data therefore becomes increasingly important. Almost every research issue in bioinformatics relies on the inter-relationship between sequence, structure and function. Although pairwise statistical significance (PSS) has been found to be capable of accurately mining related sequences (homologs), its estimation is both computationally and data intensive. To prevent it from becoming a performance bottleneck, high performance computing (HPC) approaches are used to accelerate the computation. In this chapter, we first present the algorithm for pairwise statistical significance estimation, and then highlight the use of HPC approaches for its acceleration, employing multi-core CPUs and many-core GPUs, both of which enable significant performance improvements for pairwise statistical significance estimation (PSSE).
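The sketch below is a toy, multi-core illustration of permutation-based significance estimation in Python; the chapter's actual method uses proper alignment scoring and extreme-value statistics, whereas this uses a trivial 3-mer score and an empirical p-value.

```python
# Toy sketch of permutation-based pairwise significance (illustration only):
# score the real pair, score many shuffled pairs in parallel across CPU cores,
# and report an empirical p-value.
import random
from multiprocessing import Pool

def kmer_score(a, b, k=3):
    """Toy similarity: number of k-mers the two sequences share."""
    return len({a[i:i+k] for i in range(len(a) - k + 1)} &
               {b[i:i+k] for i in range(len(b) - k + 1)})

def shuffled_score(args):
    a, b, seed = args
    rng = random.Random(seed)
    shuffled = "".join(rng.sample(b, len(b)))   # permute one sequence
    return kmer_score(a, shuffled)

if __name__ == "__main__":
    seq1 = "ACGTGACCTGAAGTCCGTACGTTAGC"
    seq2 = "ACGTGACCTGTTGTCCGTACGATAGC"
    observed = kmer_score(seq1, seq2)
    n_perm = 1000
    with Pool() as pool:        # spread the permutations over all cores
        null = pool.map(shuffled_score,
                        [(seq1, seq2, s) for s in range(n_perm)])
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    print(f"observed score = {observed}, empirical p-value = {p_value:.4f}")
```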