Ebook: Parallel Computing: On the Road to Exascale
As predicted by Gordon E. Moore in 1965, the performance of computer processors increased at an exponential rate. Nevertheless, the increases in computing speeds of single processor machines were eventually curtailed by physical constraints. This led to the development of parallel computing, and whilst progress has been made in this field, the complexities of parallel algorithm design, the deficiencies of the available software development tools and the complexity of scheduling tasks over thousands and even millions of processing nodes represent a major challenge to the construction and use of more powerful parallel systems.
This book presents the proceedings of the biennial International Conference on Parallel Computing (ParCo2015), held in Edinburgh, Scotland, in September 2015. Topics covered include computer architecture and performance, programming models and methods, as well as applications. The book also includes two invited talks and a number of mini-symposia.
Exascale computing holds enormous promise in terms of increasing scientific knowledge acquisition and thus contributing to the future well-being and prosperity of mankind. A number of innovative approaches to the development and use of future high-performance and high-throughput systems are to be found in this book, which will be of interest to all those whose work involves the handling and processing of large amounts of data.
During the first decades of development of integrated circuits the performance of processors increased at an exponential rateas was predicted by Gordon E. Moore when he published his so-called Moore's Law in 1965. It was clear, however, that the increase in compute speeds of single processor machines would eventually be curtailed by physical constraints. The ever increasing demand to solve more complex and larger problems could thus in the long run only be met by harnessing the power of multiple processors by using these in parallel fashion.
As a need arose to stimulate the development of parallel computing technologies, the biennial International Conference series on Parallel Computing (ParCo) was started in 1983. Since then this conference series has played a stimulating role in furthering the development and use of parallel machines. The success of these conferences was continued by ParCo2015, which was held from 1-4 September 2015 in Edinburgh, Scotland, UK.
As was the case with all previous events, ParCo2015 attracted many notable contributions depicting present and future developments in the parallel computing field. The contributed papers illustrate the many different trends, both established and newly emerging, that are influencing parallel computing.
The number of processors incorporated in parallel systems have rapidly increased over the last decade, raising the question of how to efficiently and effectively utilise the combined processing capabilities on a massively parallel scale. The difficulties experienced with the design of algorithms that scale well over a large number of processing elements have become increasingly apparent. The combination of the complexities encountered with parallel algorithm design, the deficiencies of the available software development tools to produce and maintain the resulting complex software and the complexity of scheduling tasks over thousands and even millions of processing nodes, represent a major challenge to constructing and using more powerful systems consisting of ever more processors. These challenges may prove to be even more difficult to overcome than the requirement to build more energy efficient systems.
To reach the goal of exascale computing, the next stage in the development of high performance systems, fundamentally new approaches are needed in order to surmount the aforesaid constraints. Exascale computing holds enormous promise in terms of increasing scientific knowledge acquisition and thus contributing to the future wellbeing and prosperity of humankind. Such powerful systems are, for example, needed for executing complex simulations and large information processing tasks resulting from large-scale scientific experiments. It is therefore vital that the parallel computing community succeeds in overcoming the associated challenges.
Innovative approaches that can assist in solving the problems encountered with the development and use of future high performance and high throughput systems were suggested by a number of conference speakers. Thus, for example, the incorporation of learning capabilities into processors may form the basis for more intelligent systems that can more readily adapt to changing processing requirements. Improved automatic scheduling of processing tasks may lead to greater efficiency and make systems easier to program. More flexible hardware designs, such as are proposed by, for example, FPGAs offer further perspectives. Hardware could be made more reliable by improved monitoring, with automatic action taken if components fail.
This volume is a record of the stimulating exchange of information and innovative ideas to which all attendees contributed. In addition to the contributed papers, a number of mini-symposia were held, each focusing on one research theme. The topics covered by all contributors are reflected in the selection of accepted papers constituting these proceedings.
The organisers wish to thank all organisations and individuals who contributed to the success of this event. In particular the organisers wish to thank Pavlos Petoumenos for his assistance with producing this manuscript.
Gerhard Joubert
Hugh Leather
Mark Parsons
Frans Peters
Mark Sawyer
Date: 2015-12-15
The SpiNNaker (Spiking Neural Network Architecture) project will soon deliver a machine incorporating a million ARM processor cores for real-time modeling of large-scale spiking neural networks. Although the scale of the machine is in the realms of high-performance computing, the technology used to build the machine comes very much from the mobile embedded world, using small integer cores and Network-on-Chip communications both on and between chips. The full machine will use a total of 10 square meters of active silicon area with 57,600 routers using predominantly multicast algorithms to convey real-time spike information through a lightweight asynchronous packet-switched fabric. This paper presents the philosophy behind the machine, and the future prospects for systems with increased cognitive capabilities based on an increasing understanding of how biological brains process information.
The performance of multi-threaded applications depends on efficient scheduling of parallel tasks. Manually selecting schedulers is difficult because the best scheduler depends on the application, machine and input. We present a frame-work that automatically selects the best scheduler based on empirical tuning results. We applied our framework to tune eleven applications parallelized using OpenMP, TBB or the Galois system. Depending on the application and machine, we observed up to 4X performance improvement over the default scheduler. We were also able to prune the search space by an order of magnitude while still achieving performance within 16% of the best scheduler.
In this paper, we address the problem of the efficient parallel exploitation of different types of computing devices inside a single machine, to solve a scientific problem. As a first step, we apply our scheme to the Jacobi relaxation. Despite its simplicity, it is a good example of iterative process for scientific simulation. Then, we evaluate and analyze the performance of our parallel implementation on two configurations of hybrid machine.
We present a model of multithreaded computation with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem so as to estimate execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with fundamental algorithms from scientific computing. For two case studies, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter. For the others, our model is used to compare different algorithms solving the same problem. In each case, the studied algorithms were implemented and the results of their experimental comparison are coherent with the theoretical analysis based on our model.
Performance tuning tools are required to reach a high level of performance in a large-scale parallel system. There are many open source software (OSS) tools that can assist programmers with the performance analysis. Score-P is an efficient OSS tool for analyzing MPI communication and tuning performance. A problem is that it only uses hardware counters to analyze CPU operating states. We developed an interface between Score-P and Fujitsu's advanced profiler (FAPP), which has more functions for analyzing CPU operating states. The key benefit of our interface is that users can find the cost balance among threads in a CPU and investigate the causes of the performance problem by using there individual favorite performance tool. We demonstrated this interface's ability in tuning the CCS quantum chromodynamics (QCD) benchmark program.
Numerical simulations using the Discrete Element Method (DEM) are used at LEAT in the context of several important, energy related particulate flow systems. The focus of these investigations is the understanding of the heat and mass transfer processes on the micro-scale and the prediction of the related macroscopic behaviour. Most of the currently available DEM implementations, especially if the required number of particles is large, only allow for small variations in particle size if the computational effort must be kept within reasonable bounds. This is contrary to the actual requirements of many technically relevant processes where a broad size spectrum must be considered. Parallel processing helps to ease this situation to a certain degree, but the ongoing search for algorithmic improvements has not yet accomplished a definitive solution.
The process of neighbourhood detection, which is required to identify the partners of the pairwise interactions determining momentum fluxes among the particles and between particles and surrounding walls is one common bottleneck. Besides the commonly used Linked-Cell method, hierarchically structured “background” meshes or octrees were proposed in the past and applied in various implementations. A new variant of the octree approach is presented and its performance with respect to particle number, particle size distribution and parallelisation is analysed and compared to conventional approaches. In order to obtain a realistic analysis, for a given code in a typical hardware environment (small engineering companies or university institutes), the benchmark addresses the technical application of particle movement in a section of a rotary drum.
Performance analysis of the ABySS genome sequence assembler (ABYSS-P) executing on the K computer with up to 8192 compute nodes is described which identified issues that limited scalability to less than 1024 compute nodes and required prohibitive message buffer memory with 16384 or more compute nodes. The open-source Scalasca toolset was employed to analyse executions, revealing the impact of massive amounts of MPI point-to-point communication used particularly for master/worker process coordination, and inefficient parallel file operations that manifest as waiting time at later MPI collective synchronisations and communications. Initial remediation via use of collective communication operations and alternate strategies for parallel file handling show large performance and scalability improvements, with partial executions validated on the full 82,944 compute nodes of the K computer.
Performance of memory intensive applications executed on multi-core multi-socket environments is closely related to the utilization of shared resources in the memory hierarchy. Depending on the characteristics of the application, the shared resources utilization can lead to a significant performance degradation. The exploration of different thread affinity configurations allows the selection of a proper configuration that balances the performance improvement obtained by increasing parallelism with the performance degradation due to memory contention. However, as the number of cores per processor and the number of processors per node increases, testing all possible configurations is not reasonable. We propose a performance model to estimate the execution time for a thread configuration (number and affinity distribution) for an OpenMP application parallel region based on runtime hardware counters information and the estimation of performance degradation due to memory contention generated at last level cache (LLC). The latter is estimated considering features obtained from the memory access pattern and the memory footprint. The experimental results show that the proposed methodology identifies the thread configurations which maximize the application performance by preventing memory contention on main memory.
The company Numascale is one of the few companies offering shared memory systems with thousands of cores. To produce machines of such a size, a proprietary cache-coherent interconnect is used to couple standard servers into a large single system. In this work we investigate the ability of such a huge system to run OpenMP applications in an efficient way. Therefore, we use kernel benchmarks to investigate basic performance characteristics of the machine and we present a real world application from the Institute of Combustion Technology at RWTH Aachen University, called TrajSearch, which has been optimized to run efficient on such a large shared memory machine.
The SparseLinSol library for solving large sparse systems of linear algebraic equations is presented in the paper. The key implementation features for multicore CPUs and GPUs are discussed. Performance characteristics are compared for “solve” part of the methods for MPI and hybrid code implementations and GPU-accelerated implementation against the hypre library for a set of matrices of 41–99 mln. unknowns on up to 128 GPU-equipped compute nodes. Preliminary results on coupling the developed library with OpenFOAM package to speedup the hydrodynamic modelling problems are discussed.
We consider the general problem of generating code for the automated selection of the expected best implementation variants for multiple subcomputations on a heterogeneous multicore system, where the program's control flow between the subcomputations is structured by sequencing and loops. A naive greedy approach as applied in previous works on multi-variant selection code generation would determine the locally best variant for each subcomputation instance but might miss globally better solutions. We present a formalization and a fast algorithm for the global variant selection problem for loop-based programs. We also show that loop unrolling can additionally improve performance, and prove an upper bound of the unroll factor which allows to keep the run-time space overhead for the variant-dispatch data structure low. We evaluate our method in case studies using an ARM big.LITTLE based system and a GPU based system where we consider optimization for both energy and performance.
Keeping up with the performance trend of the last decades cannot be achieved anymore by stepping up the clock speed of processors. The usual strategy is nowadays to use lower frequency and to increase the number of cores, where data communication and memory bandwidth can become the main barrier. In this paper, we introduce an MPI design and its implementation on the MPPA-256 (Multi Purpose Processor Array) processor from Kalray Inc., one of the first worldwide actors in the many-core architecture field. A model was developed to evaluate the communication performance and bottlenecks on MPPA. Our achieved result of 1.2 GB/s, e.g. 75% of peak throughput, for on-chip communication shows that the MPPA is a promising architecture for next-generation HPC systems, with its high performance-to-power ratio and high-bandwidth network-on-chip.
PCIe is the common way for integrating high-speed interconnects and manycore accelerators into cluster nodes. In this paper we describe a zero copy approach for communicating between two PCIe devices involving shared DMA memory buffers. Furthermore, we show the requirements for transferring data directly between two PCIe devices without using the main memory of the host computer. We included the support for direct device communication into our PDA (Portable Driver Architecture) which is a library for implementing user-space drivers.
EDGE is a complex application for computational fluid dynamics used e.g. for aerodynamic simulations in avionics industry. In this work we present the portable, high-level parallelization of EDGE for execution on multicore CPU and GPU based systems by using the multi-backend skeleton programming library SkePU. We first expose the challenges of applying portable high-level parallelization to a complex scientific application for a heterogeneous (GPU-based) system using (SkePU) skeletons and discuss the encountered flexibility problems that usually do not show up in skeleton toy programs. We then identify and implement necessary improvements in SkePU to become applicable for applications containing computations on complex data structures and with irregular data access. In particular, we improve the MapArray skeleton and provide a new MultiVector container for operand data that can be used with unstructured grid data structures. Although there is no SkePU skeleton specifically dedicated to handling computations on unstructured grids and its data structures, we still obtain portable speedup of EDGE with both multicore CPU and GPU execution by using the improved MapArray skeleton of SkePU.
In this work we discuss the parallelization of the model selection process for Tree Echo State Networks. We consider two different “structured” parallelization strategies: one based on functional replication of the computations needed to evaluate the different steps in the model selection process, and the other one exposing and exploiting the dependency graph in the aggregate selection process computations. Both parallelizations have been implemented using FastFlow. Experimental results on state-of-the-art multicore architectures are discussed in detail that demonstrate the feasibility and the efficiency of the parallelizations.
Multi-core processors with up to 8 cores and more as well as GPUs with thousands of cores are omnipresent. In order to fully exploit their parallel computing capability, programmers have to deal with low-level concepts of parallel programming. These low-level concepts constitute a high barrier to efficient development of parallel applications. A higher level of abstraction would be desirable.
In order to assist programmers in developing parallel applications Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming.
The main contribution of this paper is the design and implementation of data parallel skeletons with GPU support in Java. Additionally, on the basis of three benchmark applications, including Matrix multiplication, N-Body calculations, and Shortest paths, we evaluate the performance of the presented implementation on a GPU cluster.
We introduce a library supporting execution of data parallel kernels on GPUs from Erlang. The library provides calls with the same semantics of the map and fold functions of the lists Erlang library, where the functions to be computed on the input list(s) are actually provided as OpenCL C kernels. The map and reduce (fold) higher order functions are provided in such a way that subsequent calls may leave temporary data (partial results) on the GPU memory while computing complex, possibly composed data parallel patterns. In addition, data transfers to and from the GPU, from and to the Erlang subsystem, are overlapped with Erlang to C and C to Erlang marshaling, such that the cost of the overall type conversion is minimized. We assess the performances of the data parallel library via simple synthetic benchmarks and real application kernels showing substantial speedups with respect to pure Erlang implementation of the same synthetic benchmarks/application kernels.
Accelerators such as NVIDIA GPUs and Intel MICs are currently provided as co-processor devices, usable only through a CPU host. For Intel MICs it is planned that this constraint will be lifted in the near future: CPU and accelerator(s) will then form a single, many-core, processor capable of peak performance of several Teraflops with high energy efficiency. In order to exploit the available computational power, the user will be compelled to write a code more “hardware-aware”, in contrast to the common philosophy of hiding hardware details as much as possible. The simple two-sided communication approach often used in message-passing applications introduces synchronization costs that may limit the performance on the next generation machines. PGAS languages, like coarray Fortran and UPC, propose a one-sided approach where a process accesses directly the remote memory of another process without interrupting its execution. In this paper, we propose a CUDA-aware coarray implementation, capable of merging the expressive syntax of coarrays with the computational power of GPUs. We propose a new keyword for the Fortran language, which allows the user to map with a high-level syntax some hardware features of the many-core machines. Our hybrid coarray implementation is based on OpenCoarrays, the coarray transport layer currently adopted by the GNU Fortran compiler.
We describe Lapedo, a novel library of hybrid parallel skeletons for programming heterogeneous multi-core/many-core CPU/GPU systems in Erlang. Lapedo's skeletons comprise a mixture of CPU and GPU components, allowing skeletons to be flexibly and dynamically mapped to available resources, with all of the low-level tedious code to divide work between CPUs and GPUs, transfer the data between the main and GPU memory and offload computations to the GPUs provided by the library. We evaluate the effectiveness of Lapedo on three realistic use cases from different domains, demonstrating significant improvements in speedups compared to CPU-only and GPU-only executions.
Accelerators like the Intel Xeon Phi aim to fulfill the computational requirements of modern applications, including stencil computations. Stencils are finite-difference algorithms used in many scientific and engineering applications for solving large-scale and high-dimension partial differential equations. However, programmability on such massively parallel architectures is still a challenge for inexperienced developers.
This paper provides the evaluation of the Intel Xeon Phi (Knights Corner) for 3-D Stencil Codes using different optimization strategies. Our evaluation is based on three kernels that are widely applied to simulate heat, acoustic diffusion as well as isotropic seismic wave equations. Our experimental results yield performance gains over 25x when compared to high-level sequential implementations (e.g., Matlab). Energy measurements show a similar trend to that of performance. Vectorization is the key strategy from our results, from both performance and energy points of view.
In addition, we propose a set of tips to optimize stencil codes based on a C/C++ OpenMP implementation. We guide developers in maximizing the benefits of hardware-software co-design for computing 3-D stencil codes running on the this architecture.
Modern computer hardware provides parallelism at various different levels – most obviously, multiple multicore processors allow many independent threads to execute at once. At a finer-grained level, each core contains a vector unit allowing multiple integer or floating point calculations to be performed with a single instruction. Additionally, GPU hardware is highly parallel and performs best when processing large numbers of independent threads. At the same time, tools such as CUDA have become steadily more abundant and mature, allowing more of this parallelism to be exploited.
In this paper we describe the process of optimising a physical modelling sound synthesis code, the Multiplate 3D code, which models the acoustic response of a number of metal plates embedded within a box of air. This code presented a number of challenges and no single optimisation technique was applicable to all of these. However, by exploiting parallelism at several different levels (multithreading, GPU acceleration, and vectorisation), as well as applying other optimisations, it was possible to speed up the simulation very significantly.