
Parallel Computing: From Multicores and GPU's to Petascale

Parallel computing technologies have brought dramatic changes to mainstream computing; the majority of today's PC's, laptops and even notebooks incorporate multiprocessor chips with up to four processors. Standard components are increasingly combined with GPU's (Graphics Processing Units), originally designed for high-speed graphics processing, and FPGA's (Field Programmable Gate Arrays) to build parallel computers with a wide spectrum of high-speed processing functions. The scale of this powerful hardware is limited only by factors such as energy consumption and thermal control. However, in addition to hardware factors, the practical use of petascale and exascale machines is often hampered by the difficulty of developing software that runs effectively and efficiently on such architectures. This book includes selected and refereed papers presented at the 2009 International Parallel Computing Conference (ParCo2009), which set out to address these problems. It provides a snapshot of the state of the art of parallel computing technologies in hardware, applications and software development. Areas covered include: numerical algorithms, grid and cloud computing, and programming, including GPU and Cell programming. The book also includes papers presented at the six mini-symposia held at the conference.
Parallel computing technologies have brought dramatic changes to mainstream computing. This trend is accelerating as the end of hardware development following Moore's law looms on the horizon. The majority of standard PC's and even notebooks today incorporate multiprocessor chips with up to four processors. This number is expected to reach eight or more soon.
These standard components, COTS (Components Off The Shelf), are increasingly combined with powerful parallel processors originally designed for high-speed graphics processing, GPU's (Graphics Processing Units), and FPGA's (Field Programmable Gate Arrays) to build heterogeneous parallel computers that offer a wide spectrum of high-speed processing functions. The number of processors incorporated in such systems is today of the order of 10⁴ to 10⁶. This vast number of processors allows the construction of high-speed computers in the petascale, and even the exascale, range at a reasonable cost. The limiting factor in constructing more powerful hardware is the energy consumption and thermal control of such systems. Many research efforts concentrate on reducing the overall energy consumed.
In addition to these hardware design and construction limitations, the practical use of petascale or exascale machines is hampered by the difficulty of developing software that runs efficiently and effectively on such architectures. This holds for system as well as application software. The ParCo conference aimed to address many of these issues through contributed papers as well as the various mini-symposia presentations.
In this book, which includes selected and refereed papers presented at the international Parallel Computing conference (ParCo2009) held from 1 to 4 September 2009 at ENS (École Normale Supérieure), Lyon, France, problems associated with the development of high-speed parallel systems using new hardware concepts, and the associated software development issues, are considered. The papers were presented as keynote papers, in regular sessions, in an industrial session and in various mini-symposia covering specialised topics. Overall these give a snapshot of the state of the art of parallel computing technologies, in hardware as well as in application and software development.
This year's highlight is no doubt the increasing number of papers addressing the programming of general-purpose graphics processing units. Considering the main track of the conference as well as the mini-symposium dedicated to GPU's, ParCo2009 became one of the main scientific venues covering this important research topic in 2009.
The editors wish to express their sincere gratitude to all who supported this venture and ultimately made it possible. In particular we wish to thank the many reviewers who, as members of the international Program Committee, not only assessed papers, but also acted as session chairmen during the conference.
Sincere thanks are due to the members of the Organising Committee, and in particular to Laurent Lefèvre, Eddy Caron and Jean-Christophe Mignot, who spent many hours assisting in organising a very successful event. We are also very grateful for the work done by Virginie Mahdi from GENCI, Paris, in attracting a considerable number of sponsors as well as participants in the Industrial Session.
Please note that versions of papers with colour images and diagrams are available in the electronic version of the book on http://www.booksonline.iospress.nl/ under Advances in Parallel Computing.
Barbara Chapman, University of Houston, USA
Frédéric Desprez, INRIA, France
Gerhard Joubert, TU Clausthal, Germany
Alain Lichnewsky, GENCI, France
Frans Peters, Philips Research, Netherlands
Thierry Priol, INRIA, France
31 December 2009
Exascale computing platforms will soon emerge over the horizon. Architectures for such platforms are already on drawing boards. This paper will focus on some of the key drivers of the technology that will be needed on exascale platforms along with the impact that these new technology drivers will have on system architecture. The important implications for users of such exascale systems will also be discussed.
In this paper we introduce work supported by the EU in the Apple-CORE project (http://www.apple-core.info). This project is pushing the boundaries of programming and systems development in multi-core architectures in an attempt to make multi-core go mainstream, i.e. continuing the current trends in low-power, multi-core architecture to thousands of cores on chip and supporting this in the context of the next generations of PCs. This work supports dataflow principles but with a conventional programming style. The paper describes the underlying execution model, a core design based on this model and its emulation in software. We also consider system issues that impact security. The major benefits of this approach include asynchrony, i.e. the ability to tolerate long-latency operations without impacting performance, and binary compatibility. We present results that show very high efficiency and good scalability despite the high memory access latency in the proposed chip architecture.
We present scaling results of the parallel tree code PEPC on an IBM BlueGene/P and identify performance bottlenecks and intrinsic algorithmic issues. Using more than 8192 processors the tree code is capable of simulating more than 100 million particles, but as our analysis shows, fundamental changes will be necessary for porting this code to petaflop systems. However, an efficiency examination reveals a very good ratio between communication and computation in the traversal process. Furthermore, we present a library version of the code, which may act as a ‘black box’ for front-end users.
In this work we are interested in the combination of iterative solvers when solving linear systems of equations in an on-line setting. Our study targets users who may not be able to choose the best solver for a given set of linear systems while minimizing the total execution time. We propose a framework and algorithms in which the combination of solvers depends on information gathered at runtime. The framework is assessed by extensive experiments using 5 SPARSKIT solvers over more than 70 matrices. The results show that the proposed approach is robust for solving linear systems, since we were able to solve more linear systems than each individual solver with an execution time of nearly twice that of the worst individual solver. Moreover, we were able to predict a set of two solvers containing the best solver in more than 80% of cases.
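To make the notion of combining solvers at runtime concrete, a purely illustrative sketch follows. It is not the framework or the SPARSKIT solvers of the paper: it uses SciPy's gmres and bicgstab as stand-ins, in a simple portfolio that tries candidate solvers in order of past success and records which one converged.

# Illustrative sketch only; the solver choices and the success-count
# heuristic are assumptions for this example, not the paper's algorithms.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, bicgstab

SOLVERS = {"gmres": gmres, "bicgstab": bicgstab}
history = {name: 0 for name in SOLVERS}   # successes observed so far at runtime

def solve_with_portfolio(A, b, maxiter=500):
    # Try solvers in decreasing order of past success; stop at the first
    # one that reports convergence (info == 0).
    for name in sorted(SOLVERS, key=lambda n: -history[n]):
        x, info = SOLVERS[name](A, b, maxiter=maxiter)
        if info == 0:
            history[name] += 1
            return name, x
    raise RuntimeError("no solver in the portfolio converged")

A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
b = np.ones(100)
name, x = solve_with_portfolio(A, b)
print("solved by", name)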
This paper presents a comparative study of some distributed solvers on a set of linear systems arising from the Navier-Stokes equations and provided by an industrial software package. The solvers under consideration implement direct, iterative or domain decomposition methods, and most of them are freely available packages. Numerical tests with various parameters are made easier by developing a unified toolbox that links with interface functions provided by these libraries. The intensive numerical tests performed on various sets of processors reveal the good performance achieved by the recently proposed parallel preconditioner for Krylov methods based on an explicit formulation of multiplicative Schwarz [1].
Optimization-based approaches for image deblurring and denoising on Graphics Processing Units (GPU) are considered. In particular, a new GPU implementation of a recent gradient projection method for edge-preserving removal of Poisson noise is presented. The speedups over standard CPU implementations are evaluated on both synthetic data and astronomical and medical imaging problems.
Simulation of large-scale seismic wave propagation is an important tool in seismology for efficient strong motion analysis and risk mitigation. Being particularly CPU-consuming, this three-dimensional problem makes use of parallel computing to improve the performance and the accuracy of the simulations. The trend in parallel computing is to increase the number of cores available at the shared-memory level, with possibly non-uniform cost of memory accesses. We therefore need to consider new approaches more suitable to such parallel systems. In this paper, we first report on the impact of memory affinity on the parallel performance of seismic simulations. We then introduce a methodology combining efficient thread scheduling and careful data placement to overcome the limitations coming from both the parallel algorithm and the memory hierarchy. The MAi (Memory Affinity interface) library is used to smoothly adapt the memory policy to the underlying architecture. We evaluate our methodology on computing nodes with different NUMA characteristics. A maximum gain of 53% is reported in comparison with a classical OpenMP implementation.
New parallel methods, based on the Schwarz and Schur domain decomposition techniques, are proposed for the time domain decomposition of the Cauchy problem. First, the initial value problem is transformed into a boundary value problem in time. Then, the demonstration of the purely linear divergence/convergence of the Schwarz algorithm, applied to systems of linear ODEs, allows us to accelerate the convergence of the solution at the time-slice boundaries with Aitken's convergence acceleration technique. Second, a Schur complement method is developed for the linear problem. Numerical results show the efficiency of the proposed method on the linear system of ODEs modelling the harmonic oscillator problem.
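For reference, the scalar form of the classical Aitken acceleration alluded to above is sketched below; the paper applies a generalisation of this idea to the iterates at the time-slice boundaries, whereas this is just the textbook formula. When the sequence converges (or diverges) purely linearly, the extrapolated value recovers the limit exactly.

% Aitken's \Delta^2 extrapolation from three successive iterates x_n, x_{n+1}, x_{n+2}
\[
  \hat{x}_n \;=\; x_n \;-\; \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.
\]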
We developed a Performance Modeling Tools (PMTOOLS) library to enable simulation-based performance modeling for parallel sparse linear algebra algorithms. The library includes micro-benchmarks for calibrating the system's parameters, functions for collecting and retrieving performance data, and a cache simulator for modeling the detailed memory system activities. Using these tools, we have built simulation modules to model and predict the performance of different variants of parallel sparse LU and Cholesky factorization algorithms. We validated the simulated results against the existing implementation in SuperLU_DIST, and showed that our performance prediction errors are only 6.1% and 6.6% with 64 processors on an IBM Power5 and a Cray XT4, respectively. More importantly, we have successfully used this simulation framework to forecast the performance of different algorithm choices, which helped in prototyping new algorithm implementations.
This paper presents a performance review and reconsideration of the conventional algorithm for the eigenvalue problem for dense-real-symmetric matrices (DRSM's) on multicore parallel computer systems. We examine a narrow-band reduction approach on a multicore based cluster system.
In this work, an extended version of "Hierarchical Interface Decomposition (HID)" for parallel preconditioning has been developed. The extension of overlapped elements between domains and the effect of thicker separators were considered. The proposed method has been applied to finite-element-based simulations of linear elasticity problems for simple cubic geometries with a heterogeneous distribution of the distortion angle of elements. The developed code has been tested on the "T2K Open Supercomputer (T2K/Tokyo)" using up to 512 cores. The extended HID provides more robust and scalable performance than the original HID and a localized block-Jacobi-type BILU with extension of overlapped elements.
Plane-wave based first-principles electronic structure calculations are the most widely used approach for electronic structure calculations in materials science. In this formulation the electronic wavefunctions are expanded in plane waves (Fourier components) in three-dimensional space, and 3D FFTs are used to construct the charge density in real space. Many other scientific application codes in the areas of fluid mechanics, climate research and accelerator design also require efficient parallel 3D FFTs. Due to the large amount of communication required in parallel 3D FFTs, the scaling of these application codes on large parallel machines depends critically on having a 3D FFT that scales efficiently to large processor counts. In this paper we compare different implementations of the communications in a 3D FFT to determine the most scalable method to use for our application. We present results for up to 16K cores on the Cray XT4 and IBM Blue Gene/P, and compare our implementations to publicly available 3D FFTs such as P3DFFT and FFTW. In our application our 3D FFTs significantly outperform any publicly available software. Our 3D FFT has been implemented in many different first-principles codes used for research in materials science, nanoscience, energy technologies, etc., as well as being a stand-alone benchmark code used for the procurement of new machines at the Department of Energy NERSC computing center.
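As a minimal illustration of why the 3D FFT parallelises this way (a NumPy sketch, not the authors' code): the transform factors into independent 1D FFTs along each axis, so each processor can transform the pencils of data it owns locally, and only the data redistribution between axes requires communication.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))

# Direct 3D transform.
full = np.fft.fftn(a)

# Axis-by-axis transform: 1D FFTs along z, then y, then x.  In a distributed
# implementation the array is re-partitioned (all-to-all communication)
# between these three local stages.
step = np.fft.fft(a, axis=2)
step = np.fft.fft(step, axis=1)
step = np.fft.fft(step, axis=0)

assert np.allclose(full, step)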
Mathematical models involving ordinary differential equations (ODEs) arise in many diverse applications, such as fluid flows, physics-based animation, mechanical systems, mathematical finance, or chemical reactions. The realistic simulation of these applications depends on fast methods for the numerical solution of ODEs as well as adequate parallel computation schemes exploiting the potential parallelism in an optimal way. Due to the advent of multicore technology, parallel resources are now widely available in the form of multicore processors or clusters. Parallel computation schemes of ODE solvers therefore need to be revisited for use on these multicore platforms. The objective of this article is a survey and classification of computational techniques for the numerical integration of ODE systems. The emphasis lies on a computational model that captures the specifics of ODE codes as well as on a hierarchical architectural model of multicore systems.
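As one deliberately simple illustration of the potential parallelism mentioned above, the sketch below shows "parallelism across the system" for an explicit Euler step: the components of the right-hand side f(t, y) are mutually independent and can be evaluated concurrently. The toy right-hand side and the thread-based evaluation are assumptions for this example, not a scheme from the article.

from concurrent.futures import ThreadPoolExecutor
import math

def f_component(i, t, y):
    # i-th component of the right-hand side of y' = f(t, y); a toy weakly
    # coupled system chosen purely for illustration.
    n = len(y)
    return -y[i] + 0.1 * y[(i + 1) % n] + math.sin(t)

def euler_step(t, y, h, pool):
    # Evaluate all components of f(t, y) concurrently, then update the state.
    fy = list(pool.map(lambda i: f_component(i, t, y), range(len(y))))
    return [yi + h * fi for yi, fi in zip(y, fy)]

y, t, h = [1.0] * 64, 0.0, 0.01
with ThreadPoolExecutor() as pool:
    for _ in range(100):
        y = euler_step(t, y, h, pool)
        t += h
print(y[:4])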
In this paper we analyze the performance of two parallel sparse matrix partitioning software packages, ParMETIS and PT-SCOTCH. We focus our analysis on the parallel performance of the nested dissection partitioning stage as well as its impact on the performance of the numerical stages of the solution of sparse linear systems via multilevel ILU preconditioned iterative methods. Experimental results on a shared-memory platform with 16 processors show that ParMETIS seems to be the best choice for our approach.
We present a parallel implementation of the Davidson method for the numerical solution of large-scale, sparse, generalized eigenvalue problems. The implementation is done in the context of SLEPc, the Scalable Library for Eigenvalue Problem Computations. In this work, we focus on the Hermitian version of the method, with several optimizations. We compare the developed solver with other available implementations, as well as with Krylov-type eigensolvers already available in SLEPc, particularly in terms of parallel efficiency.
Research on nucleotide and amino acid sequences is highly relevant to molecular biology and bioengineering. An important aspect of the analysis of such sequences is multiple alignment. This article discusses implementations of the MUSCLE and ClustalW programs on multiprocessors and a web interface to them. The modification of the MUSCLE algorithm realizes sequence alignment in a data-flow manner. It uses the PARUS system to build a data-flow graph and execute this graph on a multiprocessor as an MPI program. The data-flow algorithm has been tested on the sequences of human Long Terminal Repeats class five (LTR5) and several other examples.
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Nowadays, with the advent of novel high throughput sequencing technologies, the approximate string matching algorithms are used to identify similarities, molecular functions and abnormalities in DNA sequences. We consider a generalization of this problem, the fixed-length approximate string matching problem: given a text t, a pattern ρ and an integer ℓ, compute the optimal alignment of all substrings of ρ of length ℓ and a substring of t. We present a practical parallel algorithm of comparable simplicity that requires only
Dot plots are a standard method for local comparison of biological sequences. In a dot plot, a substring-to-substring distance is computed for all pairs of fixed-size windows in the input strings. Commonly, the Hamming distance is used, since it can be computed in linear time. However, the Hamming distance is a rather crude measure of string similarity, and using an alignment-based edit distance can greatly improve the sensitivity of the dot plot method. In this paper, we show how to compute alignment plots of the latter type efficiently. Given two strings of length m and n and a window size w, this problem consists of computing the edit distance between all pairs of substrings of length w, one from each input string. The problem can be solved by repeated application of the standard dynamic programming algorithm in time O(mnw²). This paper gives an improved data-parallel algorithm, running in time O(mnw/(γp)) using vector operations that work on γ values in parallel and p processors.
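For orientation, the O(w²)-per-pair baseline mentioned above is sketched below as plain textbook dynamic programming (Wagner-Fischer); the paper's contribution is the improved data-parallel algorithm, which this sketch does not attempt to reproduce.

def edit_distance(a, b):
    # Standard dynamic programming edit distance between two strings,
    # keeping only the previous row (O(w) space for length-w windows).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def alignment_plot(s, t, w):
    # Edit distance for every pair of length-w windows, one from each string:
    # the naive O(mnw^2) computation the abstract refers to.
    return [[edit_distance(s[i:i + w], t[j:j + w])
             for j in range(len(t) - w + 1)]
            for i in range(len(s) - w + 1)]

print(alignment_plot("ACGTACGT", "ACGAACGT", 4))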
In this paper we present four different parallel implementations of the popular LM OSEM medical image reconstruction algorithm. While two of them use libraries such as MPI, OpenMP, or Threading Building Blocks (TBB) directly, the other two implementations use algorithmic skeletons of the Münster Skeleton Library Muesli to hide the parallelism. We compare the implementations with respect to runtime, efficiency, and programming style, and present the benchmark results obtained on a multi-processor, multi-core cluster computer.
This article proposes a scientific data visualization system for high-performance computing. The suggested system has a hierarchical architecture in which the software modules of the system are implemented on different hardware modules. The IBM Blue Gene/P supercomputer is used as the main test system. The visualization system includes parallel modules for video rendering, video enhancement, video compression and video up-sampling. The system supports visualization for 2D displays, 3D displays and multi-display complexes. Results of using the proposed visualization system for a 3D torus network simulation are presented.