
Ebook: Parallel Computing: Accelerating Computational Science and Engineering (CSE)

Parallel computing has been the enabling technology of high-end machines for many years. Now, it has finally become the ubiquitous key to the efficient use of any kind of multi-processor computer architecture, from smartphones, tablets, embedded systems and cloud computing up to exascale computers.
This book presents the proceedings of ParCo 2013 – the latest edition of the biennial International Conference on Parallel Computing – held from 10 to 13 September 2013 in Garching, Germany. The conference focused on several key areas of parallel computing. Themes included parallel programming models for multi- and manycore CPUs, GPUs, FPGAs and heterogeneous platforms, the performance engineering processes that must be adapted to use these new and innovative platforms efficiently, novel numerical algorithms, and approaches to large-scale simulations of problems in science and engineering.
The conference programme also included twelve mini-symposia (including an industry session and a special PhD Symposium), which comprehensively represented and intensified the discussion of current hot topics in high performance and parallel computing. These special sessions covered large-scale supercomputing, novel challenges arising from parallel architectures (multi-/manycore, heterogeneous platforms, FPGAs), multi-level algorithms as well as multi-scale, multi-physics and multi-dimensional problems.
It is clear that parallel computing – including the processing of large data sets (“Big Data”) – will remain a persistent driver of research in all fields of innovative computing, which makes this book relevant to all those with an interest in this field.
This volume of the series “Advances in Parallel Computing” contains the proceedings of the International Conference on Parallel Computing – ParCo 2013 – held from 10 to 13 September 2013 in Garching, Germany. The conference was hosted by the Technische Universität München (Department of Informatics) and the Leibniz Supercomputing Centre.
With ParCo 2013, the biennial ParCo conference series now looks back at 30 years of top-level research in parallel algorithms, architectures and applications. The field has finally entered an era in which parallel computing – for many years the enabling technology of high-end machines – is ubiquitous and the key to the efficient use of any kind of computer architecture: from embedded and personal systems up to exascale systems.
The trend towards heterogeneous architectures, multiple levels of parallelism and ever higher core counts of supercomputing platforms, which was already addressed at previous ParCo instances, can now be seen in full bloom. Parallel programming models for multi- and manycore CPUs, GPUs, FPGAs, and heterogeneous platforms were one of the clear focal points at ParCo 2013. In addition, performance engineering processes, including analysis, tools and metrics, must be adapted to these new and innovative platforms. It also becomes apparent from the contributions that novel numerical algorithms are required: for basic tasks in numerical linear algebra as well as for adaptive or space-time parallel simulations. Most importantly, all these aspects need to be combined in the parallelisation and optimisation of large-scale applications, in order to make parallel computing – including the processing of large data sets (“Big Data”) – a persistent driver of research in many fields of science and engineering.
ParCo 2013 profited strongly from its twelve mini-symposia (including an industry session and a special PhD Symposium), which represented and intensified the discussion of current “hot topics” in high performance and parallel computing. At least three mini-symposia were dedicated specifically to large-scale supercomputing. Three mini-symposia focused on novel challenges arising from parallel architectures (multi-/manycore, heterogeneous platforms, FPGAs). A further focal point was formed by the “multi” challenges: multi-level algorithms as well as multi-scale, multi-physics and multi-dimensional problems.
We would like to express our sincerest thanks to ParCo's four keynote speakers – Pete Beckman, Sudip Dosanjh, Wolfgang Nagel and Martin Schulz – who, in their presentations, gave an exciting overview of both promises and challenges for the age of exascale and Big Data. We are equally obliged to all presenters at the conference, all authors and co-authors who contributed to these proceedings, and of course to all attendees at ParCo 2013 – all of them contributed to the excellent scientific quality of the conference and to its inspiring atmosphere. Last, but definitely not least, special thanks go to all (co-)organisers, including the mini-symposium organisers, to the members of the international programme committee, and to all persons who assisted during the conference.
Michael Bader
Arndt Bode
Hans-Joachim Bungartz
Michael Gerndt
Gerhard R. Joubert
Frans Peters
Date: 2013-12-01
Extreme data science is becoming increasingly important at NERSC. Many petabytes of data are transferred from experimental facilities to NERSC every year. Applications of importance include high energy physics, genomics and climate. Computing and storage systems are being deployed to enable the analysis of these large scientific data sets. This paper discusses recent scientific breakthroughs enabled by extreme data science, as well as future science and technology trends related to data.
Efficient and effective performance analysis techniques are critical for the development of future generation systems. They are the drivers behind the required co-design process that helps establish the principles needed for their design. In this paper, we highlight two such approaches: PAVE, a project that investigates the mapping of performance data to more intuitive domains and uses advanced visualization techniques to expose problems, and GREMLIN, a system evaluation environment capable of emulating expected properties of exascale architectures on petascale machines. Combined with other approaches in system modeling and simulation, these projects enable us to gain meaningful insight into a target application's characteristics as well as its expected behavior and, more importantly, its likely bottlenecks on future generation machines.
In a large-scale parallel system, parallel I/O libraries such as MPI-IO are required to perform I/O operations between nodes and file systems. XcalableMP is a PGAS parallel programming language that allows the programmer to define a distributed array over nodes, supporting typical data-parallel programming in a global-view style. To perform I/O operations on a distributed array efficiently, the distribution information described in the program can be exploited by the parallel I/O operations. XMP-IO is a parallel I/O API defined as part of the XcalableMP specification. In this paper, we show that the productivity and performance of XMP-IO are nearly equal to those of MPI-IO. We verified that XMP-IO can be applied to the I/O of a MapReduce application. We evaluated the performance of these I/O approaches on the K computer, which uses the Fujitsu Exabyte File System (FEFS), a high-speed parallel distributed file system.
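As a point of reference for the kind of operation involved, the following minimal C sketch (an assumed illustration only, not the XMP-IO interface) shows how each rank of a block-distributed one-dimensional array might write its local part to a shared file with collective MPI-IO:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int local_n = 1024;                      /* local block length (assumed) */
        double *block = malloc(local_n * sizeof(double));
        for (int i = 0; i < local_n; i++)
            block[i] = rank * local_n + i;             /* fill the local part of the array */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
        /* Collective write: all ranks participate, so the I/O layer can aggregate
           requests before they reach the parallel file system. */
        MPI_File_write_at_all(fh, offset, block, local_n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(block);
        MPI_Finalize();
        return 0;
    }

The point made in the abstract is that the offset and block-length bookkeeping visible here can instead be derived from the distributed-array description in the XcalableMP program.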
The parallel and portable development of programs is becoming an increasingly important subject for the usage of future computing systems. With the introduction of heterogeneous many-core systems and the increasing impact of the memory wall, classical software development paradigms no longer hold true. In particular, the lack of consideration for data access and communication cost poses a crucial performance obstacle. Dataflow models are therefore getting attention from large-scale and performance-oriented developers. Such models, however, are principally error-prone and require reorganization of the code to maximize usability.
This paper outlines a programming approach that annotates the code with its algebraic logic, so that accurate information on data dependencies can be obtained. Using this information not only allows more intuitive and fine-granular programmability, but in particular enables code rearrangement for manipulating the execution behavior, its concurrency and even its destination platform. This model is specifically geared towards mathematical problems and thus addresses High Performance Computing needs in particular.
Image processing is traditionally a very compute-intensive process. In case of stringent timing constraints, traditional approaches scale down application quality, thereby compromising the visual clarity of processed images. In order to overcome such drawbacks, we follow a new resource-aware parallel computing paradigm called invasive computing in this paper, where an application can dynamically claim, execute on, and release resources of a multi-core computing system. In this context, we show how an invasive image processing application is able to make run-time tradeoffs between quality and throughput depending on the requirements and the number of available processing resources. As a target architecture, a class of massively parallel architectures called tightly-coupled processor arrays is chosen to show the adaptivity provided by invasive computing. The applications gain the ability to fulfill constraints in two directions: a) In case of strict throughput requirements, the image quality may be adjusted to the number of available resources by run-time selection and loading of different algorithmic kernels, e.g., image filters, depending on the success of invasion. b) Alternatively, a certain level of quality can be guaranteed by dynamically adjusting the throughput with respect to the available resources.
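A purely illustrative C sketch of the claim/execute/release cycle described above follows; all function names are hypothetical stand-ins, not the actual invasive-computing interface for tightly-coupled processor arrays:

    #include <stdio.h>

    /* Stubs standing in for the resource manager and the image kernels. */
    static int  claim_pes(int requested)      { return requested / 2; }  /* e.g. only half granted */
    static void release_pes(int n)            { (void)n; }
    static void filter_high_quality(int pes)  { printf("HQ filter on %d PEs\n", pes); }
    static void filter_fast(int pes)          { printf("fast filter on %d PEs\n", pes); }

    int main(void) {
        int requested = 16;
        int claimed = claim_pes(requested);   /* claim: try to obtain processing elements */
        if (claimed >= requested)
            filter_high_quality(claimed);     /* enough resources: run the better kernel */
        else
            filter_fast(claimed);             /* fewer PEs granted: keep throughput, lower quality */
        release_pes(claimed);                 /* release: free the PEs for other applications */
        return 0;
    }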
The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desired number of threads, based on its degree of parallelism and the system resources it is allowed to use. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of execution but also the mapping of threads to the execution resources is of great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmark suite on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies, such as those of the SMP Linux scheduler, become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our approach could be implemented as a pragma similar to those in the task-based OpenMP versions, or it could be incorporated as a distributed thread-mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.
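To make the heuristic concrete, a minimal Linux sketch (an illustration of the idea, not the authors' implementation) that maps a new thread to the core with the least accumulated useful-work time might look as follows:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NCORES 64                    /* e.g. the TILEPro64 */
    static double busy_time[NCORES];     /* per-core useful-work time (assumed to be maintained elsewhere) */

    /* Return the core that has done the least useful work so far. */
    static int least_loaded_core(void) {
        int best = 0;
        for (int c = 1; c < NCORES; c++)
            if (busy_time[c] < busy_time[best])
                best = c;
        return best;
    }

    /* Pin the calling thread to the selected core. */
    static void map_current_thread(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(least_loaded_core(), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }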
Nowadays, multi-core processors and GPUs with thousands of cores are omnipresent. Fully exploiting their resources involves dealing with low-level concepts of parallel programming. These low-level concepts still constitute a high barrier to efficient development of parallel applications. That is why we need high-level tools for parallel programming.
In order to assist programmers in developing performant and reliable parallel applications, Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming.
In this paper we address the design and implementation of the well-known Farm skeleton. In order to target heterogeneous computing platforms, we present a multi-tier implementation on top of MPI, OpenMP, and CUDA. On the basis of two benchmark applications, an interacting particle system and a ray tracing application, we illustrate the advantages of both skeletal programming in general and this multi-tier approach in particular.
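The farm pattern itself can be sketched in a few lines of plain C with OpenMP tasks; this shows only the pattern under assumed names, not the skeleton library's interface, which additionally tiers the farm over MPI processes and CUDA devices:

    #include <omp.h>
    #include <stdio.h>

    static double process(int item) { return item * 2.0; }   /* placeholder worker function */

    int main(void) {
        enum { NITEMS = 1000 };
        double results[NITEMS];

        #pragma omp parallel
        #pragma omp single                    /* the "emitter": creates one task per work item */
        for (int i = 0; i < NITEMS; i++) {
            #pragma omp task firstprivate(i) shared(results)
            results[i] = process(i);          /* any idle worker thread picks the task up */
        }                                     /* implicit barrier: all tasks are done here */

        printf("results[0]=%g results[%d]=%g\n", results[0], NITEMS - 1, results[NITEMS - 1]);
        return 0;
    }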
Booleans are the most basic values in computing. Machines, however, store Boolean values in larger compounds such as bytes or integers due to limitations in addressing memory locations. For individual values the relative waste of memory is huge, but the absolute waste is negligible. The latter radically changes if large numbers of Boolean values are processed in (multidimensional) arrays. Most programming languages, however, only provide sparse implementations of Boolean arrays. Thus, they waste large quantities of memory and, potentially, make poor use of cache hierarchies.
In the context of the functional data-parallel array programming language SAC, we investigate dense implementations of Boolean arrays and compare their performance with traditional sparse implementations. A particular challenge arises in data-parallel execution on today's shared-memory multi-core architectures: the scheduling of loops over Boolean arrays is unaware of the non-standard addressing of dense Boolean arrays. We discuss our proposed solution and report on experiments analysing the impact of the runtime representation of Boolean arrays on both sequential performance and scalability, using up to 32 cores of a large ccNUMA multi-core system.
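For readers unfamiliar with the two representations, the essence of a dense (bit-packed) Boolean array is sketched below in C: one byte holds eight Booleans, so element i lives in byte i/8 at bit i%8, which is exactly the non-standard addressing that loop scheduling has to respect (names are illustrative, not SAC's runtime):

    #include <stdlib.h>

    typedef struct { unsigned char *bits; size_t n; } boolarray;

    static boolarray boolarray_create(size_t n) {
        boolarray a = { calloc((n + 7) / 8, 1), n };   /* 1 bit per element, zero-initialised */
        return a;
    }

    static void boolarray_set(boolarray *a, size_t i, int v) {
        if (v) a->bits[i / 8] |= (unsigned char)(1u << (i % 8));
        else   a->bits[i / 8] &= (unsigned char)~(1u << (i % 8));
    }

    static int boolarray_get(const boolarray *a, size_t i) {
        return (a->bits[i / 8] >> (i % 8)) & 1;
    }

A sparse representation, by contrast, stores one byte (or word) per Boolean, which is trivially addressable but eight (or more) times larger.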
The analysis of packet payload is mandatory for network security and traffic monitoring applications. The computational cost of this activity pushed the industry towards hardware-assisted deep packet inspection (DPI) solutions, which have the disadvantage of being more expensive and less flexible.
This paper covers the design and implementation of a new DPI framework using FastFlow, a skeleton-based parallel programming library targeting efficient streaming on multi-core architectures. The experimental results demonstrate the efficiency of the proposed DPI framework, proving the feasibility of performing 10 Gbit DPI analysis on modern commodity hardware.
Task support was introduced into OpenMP to address irregular parallelism on shared memory architectures. Creating extremely fine-grained tasks in applications, however, impedes performance. In this paper, a methodology for analyzing the performance of task-based OpenMP programs and its implementation in Periscope is presented. The paper unveils and concentrates on newly formulated high-level performance properties that formalize typical performance bottlenecks of task-based programs. In addition, the paper reports on experimental results obtained for the codes of the Barcelona OpenMP Tasks Suite (BOTS) using Periscope on the SuperMUC supercomputer at Garching, Germany.
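As a concrete illustration of the granularity problem (a generic example, not taken from BOTS or from Periscope's properties), the usual remedy for overly fine-grained tasks is a cutoff below which the recursion continues serially:

    #include <omp.h>
    #include <stdio.h>

    #define CUTOFF 10                         /* illustrative cutoff depth */

    static long fib(int n, int depth) {
        if (n < 2) return n;
        if (depth >= CUTOFF)                  /* serial leaf: task creation would cost more than it gains */
            return fib(n - 1, depth + 1) + fib(n - 2, depth + 1);
        long a, b;
        #pragma omp task shared(a) firstprivate(n, depth)
        a = fib(n - 1, depth + 1);
        #pragma omp task shared(b) firstprivate(n, depth)
        b = fib(n - 2, depth + 1);
        #pragma omp taskwait                  /* wait for both child tasks */
        return a + b;
    }

    int main(void) {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(35, 0);
        printf("fib(35) = %ld\n", r);
        return 0;
    }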
Recently the latest generation of Blue Gene machines became available. In this paper we introduce general metrics to characterize the performance of applications and apply them to a diverse set of applications running on Blue Gene/Q. The applications range from regular, floating-point-bound codes to irregular, event-simulator-like types. We argue that the proposed metrics are suitable for characterizing the performance of a larger set of computational science applications running on today's massively parallel systems. They therefore not only allow the usability of the Blue Gene/Q architecture to be assessed for the considered (types of) applications; they also provide more general information on application requirements and valuable input for evaluating the usability of various architectural features, i.e. information that is needed for future co-design efforts aiming at exascale performance.
Performance analysis and tuning are important steps during the development of parallel software. There are performance analysis tools that help developers understand application performance. However, these tools do not give any recommendations on how to tune the code to obtain an optimized version. This paper proposes a methodology for automated analysis and tuning based on experimental space searches. The European research project AutoTune will extend Periscope, an automatic online and distributed performance analysis tool developed by Technische Universität München, with automatic online tuning plugins. Each plugin provides knowledge for performance or energy efficiency tuning. The resulting Periscope Tuning Framework (PTF) is able to tune serial and parallel codes for homogeneous and heterogeneous target hardware. The output of the framework is a set of tuning recommendations that can be integrated into the production version of the code. The research results of AutoTune will be integrated into a commercial development environment of a European software vendor and validated with real-world applications.
This paper is devoted to modifications of a two-step algorithm for reducing a matrix to tridiagonal form. In the first stage the matrix is reduced to a symmetric band matrix. A successive band reduction is then used to reduce the symmetric band matrix to tridiagonal form. Both stages apply two-sided orthogonal transformations to the matrix based on Householder reflectors. The underlying idea of our modifications is to use speculative computations of components of the Householder transformations. Performance comparisons between the Intel® Math Kernel Library (Intel® MKL), the PLASMA tridiagonal reduction, and our implementation are provided.
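For reference, each elementary transformation in both stages is a standard Householder reflector (textbook background, not the paper's specific modification):

    H = I - 2\,\frac{v v^{T}}{v^{T} v}, \qquad A \leftarrow H A H,

where H is symmetric and orthogonal, so the two-sided update is a similarity transformation that preserves the eigenvalues of A while annihilating the entries targeted by the vector v. The speculative computation mentioned above concerns components of these reflectors.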
We propose a new implementation of the sign-function-based spectral divide-and-conquer method for the generalized non-symmetric eigenvalue problem. The method scales well on modern multicore architectures and compares favorably to current multithreaded QZ implementations. This is illustrated by numerous examples, with computations performed on different hardware platforms and with satisfactory accuracy of the computed eigenvalues.
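As background on the method (the standard sign-function iteration; the paper's actual implementation for the generalized problem may use a different, e.g. inverse-free, formulation):

    S_{0} = A, \qquad S_{k+1} = \tfrac{1}{2}\left(S_{k} + S_{k}^{-1}\right) \;\longrightarrow\; \operatorname{sign}(A),

and the projectors \tfrac{1}{2}\bigl(I \pm \operatorname{sign}(A)\bigr) split the spectrum at the imaginary axis, so the eigenvalue problem decouples into two smaller subproblems that can be treated independently, which is the source of the method's parallel scalability.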
We study the impact of non-uniform memory access (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We apply these placement strategies and present performance results for a hybrid multicore/GPU LU algorithm as it is implemented in the public domain library MAGMA.
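One common way to realize such a placement on Linux is first-touch allocation, sketched below with OpenMP (a generic illustration of the principle, not MAGMA's code): memory pages end up on the NUMA node of the thread that initializes them, so initialization and the later panel computation should use the same thread-to-data mapping.

    #include <omp.h>
    #include <stdlib.h>

    int main(void) {
        const size_t n = 8192;
        double *A = malloc(n * n * sizeof(double));    /* pages not yet bound to a NUMA node */

        /* First touch: the same static schedule is used here and in the later
           computation, so each thread's columns are placed on its local node. */
        #pragma omp parallel for schedule(static)
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < n; i++)
                A[j * n + i] = 0.0;

        /* ... panel factorization loop using the identical schedule(static) ... */

        free(A);
        return 0;
    }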
We present and discuss the parallel implementation of a multi-elimination incomplete LU factorization method for solving sparse linear systems. The proposed solver exploits any available block structure during the factorization, achieving increased throughput during the computation and improved reliability on realistic applications.
We propose a new iterative method based on two strategies: improving stabilization by applying two-term recurrences, and utilizing the associate residual. The former adopts the coupled two-term recurrences proposed by Rutishauser as stabilizing polynomials; the latter adopts the associate residual vector, instead of the residual vector, for the computation of the recurrence parameters ζ_k, η_k. Through numerical experiments, we show that the proposed method outperforms other methods from the viewpoints of computational and elapsed time and of speed-up on both serial and parallel computers.
The popularity of GPGPUs in high performance platforms for scientific computing has recently renewed interest in approximate inverse preconditioners for Krylov methods. We have recently introduced some new algorithmic variants [6] of popular approximate inverse methods. We now report on the behaviour of these variants in high performance multilevel preconditioning frameworks, and we present the software framework that enables us to easily deal with heterogeneous computing platforms that include GPGPUs.
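For context, the classical Frobenius-norm formulation underlying such approximate inverse preconditioners (standard background; the variants of [6] build on this idea) seeks a sparse M \approx A^{-1} with a prescribed sparsity pattern \mathcal{S} by solving

    \min_{M \in \mathcal{S}} \| I - A M \|_{F}^{2} \;=\; \sum_{k} \min_{m_{k}} \| e_{k} - A m_{k} \|_{2}^{2},

which decouples into independent small least-squares problems for the columns m_k of M; this embarrassingly parallel structure is one reason these preconditioners map well onto GPGPUs.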
On a “Sandy Bridge E5-2670” CPU, we compare different performance metrics (speed, energy, cache traffic) of two multi-threaded sparse matrix-vector multiplication implementations: Intel MKL's CSR and librsb's RSB. We find, primarily, that the benefit of RSB over MKL's CSR varies substantially with the numerical type (12–40%). We also confirm other studies' results in that energy savings are strongly correlated with fast code, and that memory traffic minimization is the primary performance factor.
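For orientation, the baseline kernel being accelerated by both libraries is the plain CSR sparse matrix-vector product shown below (a reference formulation, not the code of either MKL or librsb):

    #include <stddef.h>

    /* y = A*x for a matrix stored in Compressed Sparse Row (CSR) format. */
    void spmv_csr(size_t nrows, const size_t *rowptr, const size_t *colidx,
                  const double *val, const double *x, double *y) {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colidx[k]];      /* indirect access to x drives memory traffic */
            y[i] = sum;
        }
    }

The RSB format, by contrast, stores the nonzeros in a recursive hierarchy of cache-sized sparse blocks, which targets exactly the memory traffic generated by the indirect accesses visible in this loop.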
Hybrid GPU/CPU clusters are becoming very popular in the scientific computing community, as attested by the number of such systems present in the Top 500 list. In this paper, we address one of the key algorithms for scientific applications: the computation of sparse matrix-vector products that lies at the heart of iterative solvers for sparse linear systems.
We detail how design patterns for sparse matrix computations enable us to easily adapt to such a heterogeneous GPU/CPU platform, using several sparse matrix formats in order to achieve the best performance; then, we analyze static load balancing strategies for devising a suitable data decomposition and propose our own approach. We discuss our experience in using different sparse matrix formats and data partitioning algorithms through a number of computational experiments executed on three different hybrid GPU/CPU platforms.
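A minimal sketch of one such static load-balancing rule (an assumed illustration of the general idea, not the partitioning algorithms evaluated in the paper) is to split the matrix rows proportionally to the measured SpMV throughput of each device:

    #include <stddef.h>

    /* Return the number of leading rows assigned to the GPU; the remaining
       rows go to the CPU. Throughputs are measured (or estimated) beforehand. */
    size_t split_rows(size_t nrows, double gpu_throughput, double cpu_throughput) {
        double gpu_share = gpu_throughput / (gpu_throughput + cpu_throughput);
        return (size_t)(gpu_share * (double)nrows);
    }

A pure row-count split ignores the per-row nonzero distribution, which is one of the aspects that more refined partitioning strategies must weigh.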