
Parallel Computing: From Multicores and GPU's to Petascale

Parallel computing technologies have brought dramatic changes to mainstream computing; the majority of today's PC's, laptops and even notebooks incorporate multiprocessor chips with up to four processors. Standard components are increasingly combined with GPU's (Graphics Processing Units), originally designed for high-speed graphics processing, and FPGA's (Field Programmable Gate Arrays) to build parallel computers with a wide spectrum of high-speed processing functions. The scale of this powerful hardware is limited only by factors such as energy consumption and thermal control. However, in addition to hardware factors, the practical use of petascale and exascale machines is often hampered by the difficulty of developing software that runs effectively and efficiently on such architectures. This book includes selected and refereed papers presented at the 2009 International Parallel Computing Conference (ParCo2009), which set out to address these problems. It provides a snapshot of the state of the art of parallel computing technologies in hardware, applications and software development. Areas covered include: numerical algorithms, grid and cloud computing, and programming, including GPU and Cell programming. The book also includes papers presented at the six mini-symposia held at the conference.
Parallel computing technologies have brought dramatic changes to mainstream computing. This trend is accelerating as the end of hardware development following Moore's law looms on the horizon. The majority of standard PC's and even notebooks today incorporate multiprocessor chips with up to four processors. This number is expected to reach eight or more soon.
These standard components, COTS (Components Off The Shelf), are increasingly combined with powerful parallel processors originally designed for high-speed graphics processing, GPU's (Graphics Processing Units), and FPGA's (Field Programmable Gate Arrays) to build heterogeneous parallel computers that offer a wide spectrum of high-speed processing functions. The number of processors incorporated in such systems is today of the order of 10⁴ to 10⁶. This vast number of processors allows the construction of high-speed computers in the petascale, and even the exascale, range at a reasonable cost. The limiting factor in constructing more powerful hardware is the energy consumption and thermal control of such systems. Many research efforts concentrate on reducing the overall energy consumed.
In addition to these hardware design and construction limitations, the practical use of petascale or exascale machines is hampered by the difficulty of developing software that runs efficiently and effectively on such architectures. This holds for system as well as application software. The ParCo conference aimed to address many of these issues through contributed papers as well as the various mini-symposia presentations.
In this book, which includes selected and refereed papers presented at the international Parallel Computing conference (ParCo2009) held from 1 to 4 September 2009 at ENS (École Normale Supérieure), Lyon, France, problems associated with the development of high-speed parallel systems using new hardware concepts, and the associated software development issues, are considered. The papers were presented as keynote papers, in regular sessions, in an industrial session and in various mini-symposia covering specialised topics. Overall these give a snapshot of the state of the art of parallel computing technologies, in hardware as well as in application and software development.
This year's highlight is no doubt the increasing number of papers addressing the programming of general-purpose graphics processing units. Considering the main track of the conference as well as the mini-symposium dedicated to GPU's, ParCo2009 became one of the main scientific venues covering this important research topic in 2009.
The editors wish to express their sincere gratitude to all who supported this venture and ultimately made it possible. In particular we wish to thank the many reviewers who, as members of the international Program Committee, not only assessed papers, but also acted as session chairmen during the conference.
Sincere thanks are due to the members of the Organising Committee, and in particular to Laurent Lefèvre, Eddy Caron and Jean-Christophe Mignot, who spent many hours assisting in organising a very successful event. We are also very grateful for the work done by Virginie Mahdi from GENCI, Paris, in attracting a considerable number of sponsors as well as participants in the Industrial Session.
Please note that versions of papers with colour images and diagrams are available in the electronic version of the book on http://www.booksonline.iospress.nl/ under Advances in Parallel Computing.
Barbara Chapman, University of Houston, USA
Frédéric Desprez, INRIA, France
Gerhard Joubert, TU Clausthal, Germany
Alain Lichnewsky, GENCI, France
Frans Peters, Philips Research, Netherlands
Thierry Priol, INRIA, France
31 December 2009
Exascale computing platforms will soon emerge over the horizon. Architectures for such platforms are already on drawing boards. This paper will focus on some of the key drivers of the technology that will be needed on exascale platforms along with the impact that these new technology drivers will have on system architecture. The important implications for users of such exascale systems will also be discussed.
In this paper we introduce work supported by the EU in the Apple-CORE project (http://www.apple-core.info). This project is pushing the boundaries of programming and systems development in multi-core architectures in an attempt to make multi-core go mainstream, i.e. continuing the current trends in low-power, multi-core architecture to thousands of cores on chip and supporting this in the context of the next generations of PCs. This work supports dataflow principles but with a conventional programming style. The paper describes the underlying execution model, a core design based on this model and its emulation in software. We also consider system issues that impact security. The major benefits of this approach include asynchrony, i.e. the ability to tolerate long-latency operations without impacting performance, and binary compatibility. We present results that show very high efficiency and good scalability despite the high memory access latency in the proposed chip architecture.
We present scaling results of the parallel tree code PEPC on an IBM BlueGene/P and identify performance bottlenecks and intrinsic algorithmic issues. Using more than 8192 processors the tree code is capable of simulating more than 100 million particles, but as our analysis shows, fundamental changes will be necessary for porting this code to petaflop systems. However, an efficiency examination reveals a very good ratio between communication and computation in the traversal process. Furthermore, we present a library version of the code, which may act as a ‘black box’ for front-end users.
In this work we are interested in the combination of iterative solvers when solving linear systems of equations in an on-line setting. Our study targets users who may not be able to choose the best solver for a given set of linear systems while minimizing the total execution time. We propose a framework and algorithms in which the combination of solvers depends on information gathered at runtime. The framework is assessed by extensive experiments using 5 SPARSKIT solvers over more than 70 matrices. The results show that the proposed approach is robust for solving linear systems, since we were able to solve more linear systems than each individual solver with an execution time of nearly twice that of the worst individual solver. Moreover, we were able to predict a set of two solvers containing the best solver in more than 80% of cases.
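To make the notion of combining solvers at runtime concrete, a purely illustrative sketch follows. It is not the framework or the SPARSKIT solvers of the paper: it uses SciPy's gmres and bicgstab as stand-ins, in a simple portfolio that tries candidate solvers in order of past success and records which one converged.

# Illustrative sketch only; the solver choices and the success-count
# heuristic are assumptions for this example, not the paper's algorithms.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, bicgstab

SOLVERS = {"gmres": gmres, "bicgstab": bicgstab}
history = {name: 0 for name in SOLVERS}   # successes observed so far at runtime

def solve_with_portfolio(A, b, maxiter=500):
    # Try solvers in decreasing order of past success; stop at the first
    # one that reports convergence (info == 0).
    for name in sorted(SOLVERS, key=lambda n: -history[n]):
        x, info = SOLVERS[name](A, b, maxiter=maxiter)
        if info == 0:
            history[name] += 1
            return name, x
    raise RuntimeError("no solver in the portfolio converged")

A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
b = np.ones(100)
name, x = solve_with_portfolio(A, b)
print("solved by", name)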
This paper presents a comparative study of some distributed solvers on a set of linear systems arising from the Navier-Stokes equations and provided by an industrial software package. The solvers under consideration implement direct, iterative or domain decomposition methods, and most of them are freely available packages. Numerical tests with various parameters are made easier by developing a unified toolbox that links with interface functions provided by these libraries. The intensive numerical tests performed on various sets of processors reveal the good performance achieved by the recently proposed parallel preconditioner for Krylov methods based on an explicit formulation of multiplicative Schwarz [1].
Optimization-based approaches for image deblurring and denoising on Graphics Processing Units (GPU) are considered. In particular, a new GPU implementation of a recent gradient projection method for edge-preserving removal of Poisson noise is presented. The speedups over standard CPU implementations are evaluated on both synthetic data and astronomical and medical imaging problems.
Simulation of large-scale seismic wave propagation is an important tool in seismology for efficient strong motion analysis and risk mitigation. Being particularly CPU-consuming, this three-dimensional problem makes use of parallel computing to improve the performance and the accuracy of the simulations. The trend in parallel computing is to increase the number of cores available at the shared-memory level, with possibly non-uniform cost of memory accesses. We therefore need to consider new approaches more suitable to such parallel systems. In this paper, we first report on the impact of memory affinity on the parallel performance of seismic simulations. We then introduce a methodology combining efficient thread scheduling and careful data placement to overcome the limitations coming from both the parallel algorithm and the memory hierarchy. The MAi (Memory Affinity interface) library is used to smoothly adapt the memory policy to the underlying architecture. We evaluate our methodology on computing nodes with different NUMA characteristics. A maximum gain of 53% is reported in comparison with a classical OpenMP implementation.
New parallel methods, based on the Schwarz and Schur domain decomposition techniques, are proposed for the time domain decomposition of the Cauchy problem. First, the initial value problem is transformed into a boundary value problem in time. Then, the demonstration of the purely linear divergence/convergence of the Schwarz algorithm, applied to systems of linear ODEs, allows us to accelerate the convergence of the solution at the time-slice boundaries with Aitken's convergence acceleration technique. Second, a Schur complement method is developed for the linear problem. Numerical results show the efficiency of the proposed method on the linear system of ODEs modelling the harmonic oscillator problem.
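For reference, the scalar form of the classical Aitken acceleration alluded to above is sketched below; the paper applies a generalisation of this idea to the iterates at the time-slice boundaries, whereas this is just the textbook formula. When the sequence converges (or diverges) purely linearly, the extrapolated value recovers the limit exactly.

% Aitken's \Delta^2 extrapolation from three successive iterates x_n, x_{n+1}, x_{n+2}
\[
  \hat{x}_n \;=\; x_n \;-\; \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.
\]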
We developed a Performance Modeling Tools (PMTOOLS) library to enable simulation-based performance modeling for parallel sparse linear algebra algorithms. The library includes micro-benchmarks for calibrating the system's parameters, functions for collecting and retrieving performance data, and a cache simulator for modeling the detailed memory system activities. Using these tools, we have built simulation modules to model and predict the performance of different variants of parallel sparse LU and Cholesky factorization algorithms. We validated the simulated results against the existing implementation in SuperLU_DIST, and showed that our performance prediction errors are only 6.1% and 6.6% with 64 processors on an IBM Power5 and a Cray XT4, respectively. More importantly, we have successfully used this simulation framework to forecast the performance of different algorithm choices, which helped in prototyping new algorithm implementations.
This paper presents a performance review and reconsideration of the conventional algorithm for the eigenvalue problem for dense-real-symmetric matrices (DRSM's) on multicore parallel computer systems. We examine a narrow-band reduction approach on a multicore based cluster system.
In this work, an extended version of "Hierarchical Interface Decomposition (HID)" for parallel preconditioning has been developed. The extension of overlapped elements between domains and the effect of thicker separators were considered. The proposed method has been applied to finite-element-based simulations of linear elasticity problems for simple cubic geometries with a heterogeneous distribution of the distortion angle of elements. The developed code has been tested on the "T2K Open Supercomputer (T2K/Tokyo)" using up to 512 cores. The extended HID provides more robust and scalable performance than the original HID and a localized block-Jacobi-type BILU with extension of overlapped elements.
Plane-wave based first-principles electronic structure calculations are the most widely used approach for electronic structure calculations in materials science. In this formulation the electronic wavefunctions are expanded in plane waves (Fourier components) in three-dimensional space, and 3D FFTs are used to construct the charge density in real space. Many other scientific application codes in the areas of fluid mechanics, climate research and accelerator design also require efficient parallel 3D FFTs. Due to the large amount of communication required in parallel 3D FFTs, the scaling of these application codes on large parallel machines depends critically on having a 3D FFT that scales efficiently to large processor counts. In this paper we compare different implementations of the communications in a 3D FFT to determine the most scalable method to use for our application. We present results for up to 16K cores on the Cray XT4 and IBM Blue Gene/P, and compare our implementations to publicly available 3D FFTs such as P3DFFT and FFTW. In our application our 3D FFTs significantly outperform any publicly available software. Our 3D FFT has been implemented in many different first-principles codes used for research in materials science, nanoscience, energy technologies, etc., as well as being a stand-alone benchmark code used for the procurement of new machines at the Department of Energy NERSC computing center.
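As a minimal illustration of why the 3D FFT parallelises this way (a NumPy sketch, not the authors' code): the transform factors into independent 1D FFTs along each axis, so each processor can transform the pencils of data it owns locally, and only the data redistribution between axes requires communication.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))

# Direct 3D transform.
full = np.fft.fftn(a)

# Axis-by-axis transform: 1D FFTs along z, then y, then x.  In a distributed
# implementation the array is re-partitioned (all-to-all communication)
# between these three local stages.
step = np.fft.fft(a, axis=2)
step = np.fft.fft(step, axis=1)
step = np.fft.fft(step, axis=0)

assert np.allclose(full, step)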
Mathematical models involving ordinary differential equations (ODEs) arise in many diverse applications, such as fluid flows, physics-based animation, mechanical systems, mathematical finance, or chemical reactions. The realistic simulation of these applications depends on fast methods for the numerical solution of ODEs as well as adequate parallel computation schemes exploiting the potential parallelism in an optimal way. Due to the advent of multicore technology, parallel resources are now widely available in the form of multicore processors or clusters. Parallel computation schemes of ODE solvers therefore need to be revisited for use on these multicore platforms. The objective of this article is a survey and classification of computational techniques for the numerical integration of ODE systems. The emphasis lies on a computational model that captures the specifics of ODE codes as well as on a hierarchical architectural model of multicore systems.
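As one deliberately simple illustration of the potential parallelism mentioned above, the sketch below shows "parallelism across the system" for an explicit Euler step: the components of the right-hand side f(t, y) are mutually independent and can be evaluated concurrently. The toy right-hand side and the thread-based evaluation are assumptions for this example, not a scheme from the article.

from concurrent.futures import ThreadPoolExecutor
import math

def f_component(i, t, y):
    # i-th component of the right-hand side of y' = f(t, y); a toy weakly
    # coupled system chosen purely for illustration.
    n = len(y)
    return -y[i] + 0.1 * y[(i + 1) % n] + math.sin(t)

def euler_step(t, y, h, pool):
    # Evaluate all components of f(t, y) concurrently, then update the state.
    fy = list(pool.map(lambda i: f_component(i, t, y), range(len(y))))
    return [yi + h * fi for yi, fi in zip(y, fy)]

y, t, h = [1.0] * 64, 0.0, 0.01
with ThreadPoolExecutor() as pool:
    for _ in range(100):
        y = euler_step(t, y, h, pool)
        t += h
print(y[:4])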
In this paper we analyze the performance of two parallel sparse matrix partitioning software packages, ParMETIS and PT-SCOTCH. We focus our analysis on the parallel performance of the nested dissection partitioning stage as well as its impact on the performance of the numerical stages of the solution of sparse linear systems via multilevel ILU preconditioned iterative methods. Experimental results on a shared-memory platform with 16 processors show that ParMETIS seems to be the best choice for our approach.
We present a parallel implementation of the Davidson method for the numerical solution of large-scale, sparse, generalized eigenvalue problems. The implementation is done in the context of SLEPc, the Scalable Library for Eigenvalue Problem Computations. In this work, we focus on the Hermitian version of the method, with several optimizations. We compare the developed solver with other available implementations, as well as with Krylov-type eigensolvers already available in SLEPc, particularly in terms of parallel efficiency.
Research on nucleotide and amino acid sequences is highly relevant to molecular biology and bioengineering. An important aspect of the analysis of such sequences is multiple alignment. This article discusses implementations of the MUSCLE and ClustalW programs on multiprocessors and a web interface to them. The modification of the MUSCLE algorithm realizes sequence alignment in a data-flow manner. It uses the PARUS system to build a data-flow graph and execute this graph on a multiprocessor as an MPI program. The data-flow algorithm has been tested on the sequences of human Long Terminal Repeats class five (LTR5) and several other examples.
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Nowadays, with the advent of novel high throughput sequencing technologies, the approximate string matching algorithms are used to identify similarities, molecular functions and abnormalities in DNA sequences. We consider a generalization of this problem, the fixed-length approximate string matching problem: given a text t, a pattern ρ and an integer ℓ, compute the optimal alignment of all substrings of ρ of length ℓ and a substring of t. We present a practical parallel algorithm of comparable simplicity that requires only
Dot plots are a standard method for local comparison of biological sequences. In a dot plot, a substring-to-substring distance is computed for all pairs of fixed-size windows in the input strings. Commonly, the Hamming distance is used, since it can be computed in linear time. However, the Hamming distance is a rather crude measure of string similarity, and using an alignment-based edit distance can greatly improve the sensitivity of the dot plot method. In this paper, we show how to compute alignment plots of the latter type efficiently. Given two strings of length m and n and a window size w, this problem consists of computing the edit distance between all pairs of substrings of length w, one from each input string. The problem can be solved by repeated application of the standard dynamic programming algorithm in time O(mnw²). This paper gives an improved data-parallel algorithm, running in time O(mnw/(γp)) using vector operations that work on γ values in parallel and p processors.
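For orientation, the O(w²)-per-pair baseline mentioned above is sketched below as plain textbook dynamic programming (Wagner-Fischer); the paper's contribution is the improved data-parallel algorithm, which this sketch does not attempt to reproduce.

def edit_distance(a, b):
    # Standard dynamic programming edit distance between two strings,
    # keeping only the previous row (O(w) space for length-w windows).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def alignment_plot(s, t, w):
    # Edit distance for every pair of length-w windows, one from each string:
    # the naive O(mnw^2) computation the abstract refers to.
    return [[edit_distance(s[i:i + w], t[j:j + w])
             for j in range(len(t) - w + 1)]
            for i in range(len(s) - w + 1)]

print(alignment_plot("ACGTACGT", "ACGAACGT", 4))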
In this paper we present four different parallel implementations of the popular LM OSEM medical image reconstruction algorithm. While two of them use libraries such as MPI, OpenMP, or Threading Building Blocks (TBB) directly, the other two implementations use algorithmic skeletons of the Münster Skeleton Library Muesli to hide the parallelism. We compare the implementations with respect to runtime, efficiency, and programming style, and present the benchmark results obtained on a multi-processor, multi-core cluster computer.
This article proposes a scientific data visualization system for high-performance computing. The suggested system has a hierarchical architecture in which the software modules of the system are implemented on different hardware modules. The IBM Blue Gene/P supercomputer is used as the main test system. The visualization system includes parallel modules for video rendering, video enhancement, video compression and video up-sampling. The system supports visualization for 2D displays, 3D displays and multi-display complexes. Results of using the proposed visualization system for a 3D torus network simulation are presented.