Ebook: Applications, Tools and Techniques on the Road to Exascale Computing
Single processing units have now reached a point where further major improvements in their performance are restricted by their physical limitations. This is slowing down advances at the same time as new scientific challenges are demanding exascale speed, and it has made parallel processing key to High Performance Computing (HPC). This book contains the proceedings of the 14th biennial ParCo conference, ParCo2011, held in Ghent, Belgium. The ParCo conferences have traditionally concentrated on three main themes: Algorithms, Architectures and Applications. Nowadays, though, the focus has shifted from traditional multiprocessor topologies to heterogeneous manycore systems, incorporating standard CPUs, GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays). These platforms are, at a higher abstraction level, integrated in clusters, grids and clouds. The papers presented here reflect this change of focus. New architectures, programming tools and techniques are also explored, and the need for exascale hardware and software was also discussed in the industrial session of the conference. This book will be of interest to everyone concerned with parallel computing today, and with progress towards the exascale computing of tomorrow.
This volume of the book series “Advances in Parallel Computing” contains the proceedings of ParCo2011, the 14th biennial ParCo Conference, held from 31 August to 3 September 2011, in Ghent, Belgium.
In an era when physical limitations have slowed down advances in the performance of single processing units, and new scientific challenges require exascale speed, parallel processing has gained momentum as a key gateway to HPC (High Performance Computing).
Historically, the ParCo conferences have focused on three main themes: Algorithms, Architectures (both hardware and software) and Applications. Nowadays, the landscape has shifted from traditional multiprocessor topologies to heterogeneous manycores, incorporating standard CPUs, GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays). These platforms are, at a higher abstraction level, integrated in clusters, grids, and clouds. This is reflected in the papers presented at the conference and the contributions included in these proceedings. An increasing number of new algorithms are optimized for heterogeneous platforms and performance tuning is targeting extreme scale computing. Heterogeneous platforms utilising the compute power and energy efficiency of GPGPUs (General Purpose GPUs) are clearly becoming mainstream HPC systems for a large number of applications in a wide spectrum of application areas. These systems excel in areas such as complex system simulation, real-time image processing and visualisation, etc. High performance computing accelerators may well become the cornerstone of exascale computing applications such as 3-D turbulent combustion flows, nuclear energy simulations, brain research, financial and geophysical modelling.
The exploration of new architectures, programming tools and techniques was evidenced by the mini-symposia “Parallel Computing with FPGAs” and “Exascale Programming Models”. The need for exascale hardware and software was also stressed in the industrial session, with contributions from Cray and the European exascale software initiative.
Our sincere appreciation goes to the keynote speakers who gave their perspectives on the impact of parallel computing today and the road to exascale computing tomorrow. Our heartfelt thanks go to the authors for their valuable scientific contributions and to the programme committee who reviewed the papers and provided constructive remarks. The international audience was inspired by the quality of the presentations. Attendance and interaction were high, and the conference proved an agora where many fruitful ideas were exchanged and explored.
We wish to express our sincere thanks to the organizers for the smooth operation of the conference.
The University conference centre Het Pand offered an excellent environment for the conference as it allowed delegates to interact informally and easily. A special word of thanks is due to the management and support staff of Het Pand for their proficient and friendly support.
The organizers managed to put together an extensive social programme. This included a reception at the medieval Town Hall of Ghent as well as a memorable conference dinner. These social events stimulated interaction amongst delegates and resulted in many new contacts being made.
Finally we wish to thank all the many supporters who assisted in the organization and successful running of the event.
Erik D'Hollander, Ghent University, Belgium
Koen De Bosschere, Ghent University, Belgium
Gerhard R. Joubert, TU Clausthal, Germany
David Padua, University of Illinois, USA
Frans Peters, Philips Research, Netherlands
Society nowadays faces grand challenges to the competitiveness of industry and science, and High-Performance Computing (HPC) is a key asset in building European innovation capacity. HPC has been applied to increase discoveries and innovation in industry and academia, to decrease time to production and market, and to lower costs in many industrial sectors. Even more, it is needed to address societal challenges: understanding, curing and preventing diseases to increase the quality of life, forecasting the impact of climate change and helping to locate fuel are only some of the processes facilitated by HPC. A reinforced and coordinated effort in Europe will ensure independent access to HPC systems and services, and an advance in computing technologies and software in the transition period from peta- to exa-flop technology. This will generate benefits for the ICT industry in computing and software in general. Thus, high-performance computing is crucial for Europe's industrial capabilities as well as for every individual citizen.
Since June 2010, the “Partnership for Advanced Computing in Europe” (PRACE) has been established as a persistent pan-European research infrastructure for High Performance Computing (HPC). It represents the leadership level (Tier-0) of the European HPC ecosystem. PRACE was created over the last four years by a consortium of, to date, 22 European countries. A preparatory project, supported by the EU's DG INFSO, built PRACE's legal and technical foundations. Today PRACE is an international non-profit association with its seat in Brussels.
Four member organizations have committed to providing compute cycles worth 100 million Euro each over 5 years, beginning in 2010. Access to the infrastructure is granted exclusively on the basis of scientific quality through a truly European, science-governed peer review process. The provision of CPU time started in August 2010 on the 1.0 petaflop/s supercomputer JUGENE hosted by Forschungszentrum Jülich, a member of the German Gauss Centre for Supercomputing (GCS). At the end of 2011, the 1.6 petaflop/s system CURIE operated by the French société civile “Grand Equipement National de Calcul Intensif” (GENCI) and the GCS 1.0 petaflop/s machine HERMIT at Stuttgart University boosted the PRACE capacity substantially. In 2012 the multi-petaflop/s supercomputers at CINECA in Italy, at LRZ (GCS) in Germany and at BSC in Spain will follow.
As summarized in this contribution, PRACE is continuing its development within the first implementation project (2010-2012). PRACE-1IP is overlapping with the second implementation project PRACE-2IP (2011-2013). The third implementation project PRACE-3IP (2012-2014) is under EU review. As an important element, PRACE's Tier-0 infrastructure will be complemented by provision of supercomputers to European science and industry through national centres and national review committees (Tier-1).
Public cloud infrastructures have recently experienced a great increase in demand. For this reason, cloud service providers need to choose wisely, via Admission Control mechanisms, which services they should accept for provisioning. The impact of faults due to unprovisioned resources has to be minimized while, at the same time, the net income from provisioning is maximized. We propose a service model which allows SLAs (Service Level Agreements) to be defined easily. Additionally, two sets of Admission Control policies are described and compared.
A program execution model (PXM) is a precise specification of the features of a computer system that application programs use to meet goals for users of the system. Program execution models have been studied since the late 1960s.
The PXM that is most prevalent in commercial computation had its beginnings in the operating systems built in the 1950s and 1960s and consists of the concept of a process running in an address space. The features included in the model have grown over the decades as new demands have arisen, especially for support of file access methods and databases. The result is an unplanned agglomeration of features including duplicate access methods for data and support for concurrency based on the hardware facilities easiest for engineers to provide. A valuable property not supported by the conventional PXM is the ability to freely build large programs by combining program modules; the six principles of modular programming support are broken by the conventional PXM.
Attempts to modify the conventional PXM have attracted scant attention because it is essentially impossible to construct a better PXM without invalidating the massive collection of legacy software that depends on it for correct operation. With the arrival of the multi-core era, there is a new opportunity to make a significant move to a new basis for computing – one that supports program composition, including arbitrary parallel programs.
In this article, we recount the history of program execution models and review weaknesses of the current conventional PXM. We motivate and discuss the characteristics of a better programming model for massively parallel computing – one that supports sound program structure concepts, including unconstrained composition of parallel programs, and that can achieve superior performance when implemented on appropriate hardware.
Program execution models are basic to understanding significant aspects of computer systems and the structure of programs that run on them. Further research and exploration of such models is highly warranted.
A Physarum machine is a programmable amorphous biological computing device experimentally implemented in the plasmodium of Physarum polycephalum. A Physarum machine is programmed by configurations of repelling and attracting gradients, and functions on a wide range of substrates and in a broad scope of environmental conditions. A Physarum machine can play the role of a universal and general-purpose computer, yet it can also efficiently solve specialised tasks. We overview a few topics of Physarum machine research, including routing signals with light, plane tessellation, and planar hull approximation.
Parallel vortex particle methods are an efficient technique for massively parallel simulations of turbulent fluid flows. One of their big advantages is the intrinsic adaptivity of vortex particles, since computational elements exist only where the vorticity field is non-zero. To overcome the O(N²) complexity of the corresponding N-body problem, multipole-based fast summation methods can be used. However, the convergence condition of vortex particle methods is only satisfied for very short times, prohibiting long-term simulations. To circumvent this, many recent codes use the concept of remeshing with an underlying mesh structure. In this paper, we demonstrate that the classical remeshing technique can be implemented directly and efficiently in a mesh-free parallel Barnes-Hut tree code. Using a dynamic 3D numerical example, we analyze the scaling behavior of this algorithm from 512 to 16,384 cores on an IBM Blue Gene/P system.
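As background for readers unfamiliar with remeshing, the sketch below shows a 1D particle-to-mesh redistribution using the M4' interpolation kernel, a common choice in vortex particle methods. It is an illustrative assumption rather than the paper's parallel tree-code implementation, and the particle positions, strengths and mesh spacing are placeholders.

```cpp
// 1D remeshing sketch with the M4' interpolation kernel (illustrative only;
// the paper integrates remeshing into a parallel Barnes-Hut tree code).
// Each particle's strength is redistributed onto nearby nodes of a uniform mesh.
#include <vector>
#include <cmath>
#include <cstdio>

// Monaghan's M4' kernel: support of width 4 mesh cells, partition of unity.
double m4prime(double x) {
    const double a = std::fabs(x);
    if (a < 1.0) return 1.0 - 2.5 * a * a + 1.5 * a * a * a;
    if (a < 2.0) return 0.5 * (2.0 - a) * (2.0 - a) * (1.0 - a);
    return 0.0;
}

// Redistribute particle strengths onto mesh nodes located at x = i * h.
std::vector<double> remesh(const std::vector<double>& pos,
                           const std::vector<double>& strength,
                           int nodes, double h) {
    std::vector<double> mesh(nodes, 0.0);
    for (int p = 0; p < static_cast<int>(pos.size()); ++p) {
        const int i0 = static_cast<int>(std::floor(pos[p] / h)) - 1;
        for (int i = i0; i <= i0 + 3; ++i) {          // 4-node support of M4'
            if (i < 0 || i >= nodes) continue;        // simple boundary handling
            mesh[i] += strength[p] * m4prime((pos[p] - i * h) / h);
        }
    }
    return mesh;
}

int main() {
    std::vector<double> pos{1.37, 2.24}, gamma{1.0, 0.5};  // placeholder particles
    std::vector<double> mesh = remesh(pos, gamma, 8, 0.5);
    double total = 0.0;
    for (double m : mesh) total += m;
    std::printf("total strength after remeshing: %f\n", total);  // 1.5: strength conserved
    return 0;
}
```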
This paper analyzes the applicability of the task-programming model to the parallelization of the wavefront pattern. Computations for this type of problem are characterized by a data dependency pattern across a data space. This pattern can produce a variable number of independent tasks through traversing this space. Different implementations of this pattern are studied based on the current state-of-the-art threading frameworks that support tasks. For each implementation, the specific issues are discussed from a programmer's point of view, highlighting any advantageous features in each case. In addition, several experiments are carried out, and the factors that can limit performance in each implementation are identified. Moreover, some optimizations that the programmer can exploit to reduce overheads (task recycling, prioritization of tasks based on locality hints and tiling) are proposed and assessed.
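To make the pattern concrete, here is a minimal C++ sketch of a 2D wavefront expressed with OpenMP task dependences. It is illustrative only: the grid size and the update are placeholders, and it does not reproduce the implementations compared in the paper, which target several different threading frameworks.

```cpp
// Minimal 2D wavefront sketch using OpenMP task dependences (illustrative only).
// Cell (i,j) depends on (i-1,j) and (i,j-1), so independent tasks become ready
// along anti-diagonals of the data space as the traversal proceeds.
#include <cstdio>

constexpr int N = 512;            // placeholder problem size
static double A[N][N];            // placeholder data space

int main() {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] = 1.0;        // boundary/initial values

    #pragma omp parallel
    #pragma omp single
    for (int i = 1; i < N; ++i)
        for (int j = 1; j < N; ++j) {
            #pragma omp task depend(in: A[i-1][j], A[i][j-1]) depend(out: A[i][j])
            A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);   // placeholder update
        }

    std::printf("A[%d][%d] = %f\n", N - 1, N - 1, A[N-1][N-1]);
    return 0;
}
```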
Data mining is used to extract valuable knowledge from vast pools of data. Due to the computational complexity of the algorithms applied and the difficulty of handling large data sets, data mining applications often require days to perform their analysis. This paper presents the design and evaluation of a parallel computation framework for CLEVER, a prototype-based clustering algorithm which has been successfully used for a wide range of application scenarios. The algorithm supports plug-in fitness functions and employs randomized hill climbing to maximize a given fitness function. We explore various parallelization strategies using OpenMP and CUDA, and evaluate the performance of the parallel algorithms for three different data sets. Our results indicate very good scalability of the parallel algorithm on multi-core processors, reducing the execution time and making it possible to solve problems that were considered infeasible with the sequential version of CLEVER.
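As a rough illustration of this style of parallelization, and not of CLEVER itself, the following C++/OpenMP sketch evaluates a batch of randomly perturbed candidate solutions in parallel under a plug-in fitness function and keeps the best one. The solution representation and the toy fitness function are assumptions made purely for the example.

```cpp
// Illustrative sketch of parallel randomized hill climbing with a plug-in
// fitness function (not the CLEVER implementation). Candidate neighbours are
// evaluated concurrently with OpenMP; the best one replaces the current solution.
#include <vector>
#include <random>
#include <functional>
#include <cstdio>

using Solution = std::vector<double>;                     // placeholder representation
using Fitness  = std::function<double(const Solution&)>;  // plug-in fitness function

Solution hill_climb(Solution current, const Fitness& fitness,
                    int iterations, int neighbours_per_step) {
    double best_fit = fitness(current);

    for (int it = 0; it < iterations; ++it) {
        std::vector<Solution> cand(neighbours_per_step, current);
        std::vector<double>   fit(neighbours_per_step);

        // Fitness evaluations are independent, so this step is embarrassingly parallel.
        #pragma omp parallel for
        for (int k = 0; k < neighbours_per_step; ++k) {
            std::mt19937 rng(1234u + 1000u * it + k);     // per-candidate RNG, no shared state
            std::normal_distribution<double> step(0.0, 0.1);
            for (double& x : cand[k]) x += step(rng);     // random perturbation
            fit[k] = fitness(cand[k]);
        }

        // Sequential reduction: keep the best neighbour if it improves the fitness.
        for (int k = 0; k < neighbours_per_step; ++k)
            if (fit[k] > best_fit) { best_fit = fit[k]; current = cand[k]; }
    }
    return current;
}

int main() {
    Fitness f = [](const Solution& s) {                   // toy fitness: maximise -sum(x^2)
        double v = 0.0; for (double x : s) v -= x * x; return v;
    };
    Solution best = hill_climb(Solution(8, 1.0), f, 100, 64);
    std::printf("best fitness: %f\n", f(best));
    return 0;
}
```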
Adaptive mesh refinement (AMR) algorithms require efficient structures to store and operate upon volumetric grid data that will be used to distribute work among processes. When inefficient algorithms are used, AMR-enabled software may have its scaling negatively impacted as processes are starved for work. We present a variable-resolution octree-based structure which efficiently handles the set operations needed for AMR operations such as calculating process communication schedules and buffer zones. Results indicate a marked improvement in execution times when large amounts of volumetric data are processed, at the price of a structure creation and traversal penalty for smaller amounts of data. The BL-octree (“blocktree”) data structure presented in this paper will therefore be of use in massively parallel AMR environments, where the cost of distributing AMR workloads across many tens of thousands or more processes can be reduced.
We describe an approach that automatically parallelizes sequential object-oriented general-purpose programs by adapting well-known analysis and transformation techniques and combining them with context-aware composition. First experiments demonstrate the potential speed-up. This approach allows sequential object-oriented programs to benefit from modern hardware developments without incurring unacceptable re-engineering costs.
In this paper we demonstrate a novel intra-procedural technique for detecting heap dependences in sequential programs that use recursive data structures. The novelty of our technique lies in the way we compute, for each statement, abstract heap access paths that approximate the locations accessed by the statement, and the way we convert these paths into equations that can be solved using traditional tests, e.g. the GCD test, the Banerjee test and the Lamport test. The dependence test also uses a field-sensitive shape analysis to detect dependences among heap locations arising from sharing within the data structure. In the presence of loops, the technique can be used to discover loop dependences, i.e. dependences between two different iterations of the same loop. This information can be used by a parallelizing compiler to transform the sequential input program for better parallel execution.
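For readers unfamiliar with these classical tests, the sketch below illustrates the GCD test on the kind of linear dependence equation such access-path equations reduce to; the subscript coefficients are hypothetical and the code is a textbook illustration, not the paper's analysis.

```cpp
// Classic GCD dependence test (illustration only). For a write to X[a*i + c1]
// and a read of X[b*j + c2], a dependence between iterations i and j requires
// an integer solution of a*i - b*j = c2 - c1, which can exist only if
// gcd(a, b) divides c2 - c1. Failure of that condition proves independence.
#include <numeric>   // std::gcd (C++17)
#include <cstdlib>   // std::abs(long)
#include <cstdio>

bool gcd_test_may_depend(long a, long b, long c1, long c2) {
    long g = std::gcd(std::abs(a), std::abs(b));
    if (g == 0) return c1 == c2;        // both subscripts are constants
    return (c2 - c1) % g == 0;          // divisibility => dependence is possible
}

int main() {
    // X[2*i] written, X[2*j + 1] read: gcd(2,2)=2 does not divide 1 -> independent.
    std::printf("may depend: %d\n", gcd_test_may_depend(2, 2, 0, 1));  // prints 0
    // X[4*i] written, X[2*j + 2] read: gcd(4,2)=2 divides 2 -> dependence possible.
    std::printf("may depend: %d\n", gcd_test_may_depend(4, 2, 0, 2));  // prints 1
    return 0;
}
```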
Green computing denotes energy efficiency in all components of computing systems, i.e. hardware, software, local area networks, etc. Placement of jobs is a critical decision for addressing poor performance, resource contention and high energy consumption. Consolidation policies are one source of information for the effective placement of jobs in any computing paradigm. In this paper, we design two energy-aware consolidation policies, tailored to workload characteristics, that mainly model resource contention among jobs, and we implement them as host selection policies of a scheduler. We evaluate our policies in the HPC and cloud computing paradigms with traces from the Parallel Workloads Archive. The simulation results show improvements in the resource contention metric compared with the greedy host selection policy.
Scientific computing deals with large-scale scientific modelling and simulation in domains such as astrophysics, climate research, mechanical engineering and bio-informatics. Executing large and accurate simulations in these domains requires significant computing resources. As such, scientific computing has always been closely connected to High Performance Computing (HPC) and distributed systems, utilising the computing resources of supercomputers, computer clusters and grids to perform the large-scale calculations needed. The cloud is no different from its predecessors as a source of computing infrastructure, but it provides even easier access to public resources, thanks to the increasing popularity of cloud computing and the success of many Infrastructure as a Service (IaaS) cloud providers, who rent out virtual infrastructure as a utility. Cloud computing frameworks like MapReduce provide tools for implementing algorithms and can automatically parallelise them, which can greatly simplify the work of researchers. However, MapReduce is focused on huge data processing tasks and is less suitable for the complex algorithms of scientific computing. We study the use of different frameworks based on the MapReduce model for scientific computing and compare them to other distributed computing frameworks. In the process we explain our motivation for designing a new distributed scientific computing framework and the preliminary design choices considered.
Future computing environments will integrate cloud platforms with existing computational infrastructures, providing frameworks that combine traditional Grid services with on-demand Cloud services, so that additional resources can be requested during peak loads and released afterwards. Nowadays, workflows are the preferred means of composing services into added-value service chains representing functional business processes or complex scientific experiments. This paper describes Sunflower, an innovative P2P agent-based framework for configuring, enacting, managing and adapting workflows on hybrid computing infrastructures. To orchestrate Grid and Cloud services, Sunflower uses a bio-inspired autonomic choreography model and integrates the scheduling algorithm with a provisioning component that can dynamically launch virtual machines in a Cloud infrastructure to provide on-demand services in peak-load situations.
In the automotive sector, product cycles are far longer than in the mobile market. This often leads to long timeframes for introducing infotainment innovations into automobiles. In this paper, we present an approach that allows on-board ECUs (Electronic Control Units) to be downsized to support the typical case of running a specified number of applications, while retaining the ability to fall back on additional computational power in worst-case scenarios. Furthermore, this concept enables car manufacturers to offer new functionality to customers over the complete product lifetime without the need to upgrade the hardware of ECUs or to have cost-intensive automotive high-performance hardware available on-board. This is achieved by enabling automotive applications to utilize either built-in computational power, CE (Consumer Electronics) devices like the customer's smartphone, or the computational power of cloud servers. To this end, we extend OpenCL, an industry standard supported by many different companies, to work locally on automotive platforms as well as remotely over the network. This becomes possible with the emergence of IP-based in-car networks.
Monte Carlo is a common method in financial engineering for a wide variety of problems, one being option pricing. The growing volume and complexity of the work that needs to be performed, as well as strict requirements for fast responses, make for a pressing demand for high performance computing.
In this paper we study the applicability of GPUs for pricing exotic options (basket, Asian) using Monte Carlo techniques within the environment of a large financial institution. In contrast to many experimental results obtained by analyzing only the computational kernel, we focus on the performance of the end-to-end system that is now in production use at Handelsbanken, one of the major Swedish financial institutions.
We show that graphics cards can outperform CPUs given certain conditions and for reasonable problem sizes we find a 12x improvement over sequential code when pricing options in a production system.
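As a plain CPU reference for the technique, and not the GPU production system described above, the following self-contained C++ sketch prices an arithmetic-average Asian call by Monte Carlo under geometric Brownian motion; all market parameters and path counts are illustrative.

```cpp
// Monte Carlo pricing of an arithmetic-average Asian call (CPU reference sketch
// with illustrative parameters). The asset follows geometric Brownian motion;
// each path averages the asset price over the monitoring dates, and the
// discounted mean payoff estimates the option value.
#include <cmath>
#include <random>
#include <cstdio>

int main() {
    const double S0 = 100.0, K = 100.0;     // spot and strike
    const double r = 0.02, sigma = 0.25;    // risk-free rate, volatility
    const double T = 1.0;                   // maturity in years
    const int    steps = 64, paths = 200000;

    const double dt    = T / steps;
    const double drift = (r - 0.5 * sigma * sigma) * dt;
    const double vol   = sigma * std::sqrt(dt);

    std::mt19937_64 rng(2011);
    std::normal_distribution<double> gauss(0.0, 1.0);

    double payoff_sum = 0.0;
    for (int p = 0; p < paths; ++p) {
        double S = S0, avg = 0.0;
        for (int s = 0; s < steps; ++s) {
            S *= std::exp(drift + vol * gauss(rng));   // one GBM step
            avg += S;
        }
        avg /= steps;
        payoff_sum += (avg > K) ? (avg - K) : 0.0;     // call payoff on the average
    }

    const double price = std::exp(-r * T) * payoff_sum / paths;
    std::printf("Asian call estimate: %.4f\n", price);
    return 0;
}
```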
In this work we propose several parallel algorithms to compute the two-dimensional discrete wavelet transform (2D-DWT), exploiting the available hardware resources. In particular, we explore OpenMP-optimized versions of the 2D-DWT on a multicore platform and we also develop CUDA-based 2D-DWT algorithms which are able to run on GPUs (Graphics Processing Units). The proposed algorithms are based on several 2D-DWT computation approaches, namely (1) filter-bank convolution, (2) the lifting transform and (3) matrix convolution, so that we can determine which of them best adapts to our parallel versions. All proposed algorithms are based on the Daubechies 9/7 filter, which is widely used in image/video compression.
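To illustrate the separable filter-bank structure being parallelised, the sketch below applies a 1D transform row-wise and then column-wise with OpenMP. For brevity the Haar filter pair stands in for the Daubechies 9/7 coefficients, so this is a structural illustration rather than any of the algorithms proposed in the paper.

```cpp
// Separable 2D-DWT sketch (filter-bank approach): a 1D transform is applied to
// every row, then to every column, with OpenMP parallelising over the independent
// rows/columns. The Haar pair stands in for the Daubechies 9/7 filters.
#include <vector>
#include <cmath>
#include <cstdio>

// One 1D analysis step: first half = low-pass (approximation), second half = high-pass (detail).
static void dwt1d(double* x, int n, double* tmp) {
    const double s = 1.0 / std::sqrt(2.0);
    for (int i = 0; i < n / 2; ++i) {
        tmp[i]       = s * (x[2*i] + x[2*i + 1]);   // approximation
        tmp[n/2 + i] = s * (x[2*i] - x[2*i + 1]);   // detail
    }
    for (int i = 0; i < n; ++i) x[i] = tmp[i];
}

void dwt2d(std::vector<double>& img, int n) {       // n x n image, n even
    // Row pass: each row is independent.
    #pragma omp parallel for
    for (int r = 0; r < n; ++r) {
        std::vector<double> tmp(n);
        dwt1d(&img[r * n], n, tmp.data());
    }
    // Column pass: gather a column, transform it, scatter it back.
    #pragma omp parallel for
    for (int c = 0; c < n; ++c) {
        std::vector<double> col(n), tmp(n);
        for (int r = 0; r < n; ++r) col[r] = img[r * n + c];
        dwt1d(col.data(), n, tmp.data());
        for (int r = 0; r < n; ++r) img[r * n + c] = col[r];
    }
}

int main() {
    const int n = 8;
    std::vector<double> img(n * n, 1.0);
    dwt2d(img, n);
    std::printf("LL[0] = %f\n", img[0]);   // constant image: energy concentrates in the LL band
    return 0;
}
```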
SkePU is a skeleton programming framework for multicore CPU and multi-GPU systems. StarPU is a runtime system that provides dynamic scheduling and memory management support for heterogeneous, accelerator-based systems.
We have implemented support for StarPU as a possible backend for SkePU while keeping the generic SkePU interface intact. The mapping of a SkePU skeleton call to one or more StarPU tasks allows StarPU to exploit independence between different skeleton calls as well as within a single skeleton call. Support for different StarPU features, such as data partitioning and different scheduling policies (e.g. history based performance models) is implemented and discussed in this paper.
The integration proved beneficial for both StarPU and SkePU: StarPU gained a high-level interface for running data-parallel computations, while SkePU gained dynamic scheduling and heterogeneous parallelism support. Several benchmarks, including an ODE solver, a separable Gaussian blur filter, Successive Over-Relaxation (SOR) and Coulombic potential, are implemented. Initial experiments show that we can even achieve super-linear speedups for realistic applications, and we observe clear improvements in performance with the simultaneous use of both the CPUs and the GPU (heterogeneous execution).
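As a purely conceptual sketch of what a data-parallel skeleton call looks like, generic C++ is used below rather than the actual SkePU or StarPU APIs: the skeleton wraps a user-supplied element-wise function, so a backend (multicore, GPU, or a task runtime such as StarPU) is free to decide where and how the work runs without changes to user code.

```cpp
// Conceptual map-skeleton sketch in plain C++ (not the SkePU or StarPU API).
// The user supplies only the element-wise function; the skeleton owns the
// parallelisation, so the same call could be dispatched to OpenMP threads,
// a GPU kernel, or runtime-scheduled tasks.
#include <vector>
#include <cstdio>

template <typename F>
class MapSkeleton {
    F f_;
public:
    explicit MapSkeleton(F f) : f_(f) {}

    template <typename T>
    std::vector<T> operator()(const std::vector<T>& a, const std::vector<T>& b) const {
        std::vector<T> out(a.size());
        // CPU backend shown here; a real skeleton library would select a
        // backend (multicore, GPU, task runtime) at this point.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(a.size()); ++i)
            out[i] = f_(a[i], b[i]);
        return out;
    }
};

int main() {
    auto add = [](double x, double y) { return x + y; };
    MapSkeleton<decltype(add)> vector_sum(add);

    std::vector<double> a(1000, 1.0), b(1000, 2.0);
    std::vector<double> c = vector_sum(a, b);   // one skeleton call = one parallel computation
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```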
We describe the enablement of the Ludwig lattice Boltzmann parallel fluid dynamics application, designed specifically for complex problems, for massively parallel GPU-accelerated architectures. NVIDIA CUDA is introduced into the existing C/MPI framework, and we have given careful consideration to maintainability in addition to performance. Significant performance gains are realised on each GPU through restructuring of the data layout to allow memory coalescing and the adaptation of key loops to reduce off-chip memory accesses. The halo-swap communication phase has been designed to efficiently utilise many GPUs in parallel: included is the overlapping of several stages using CUDA stream functionality. The new GPU adaptation is seen to retain the good scaling behaviour of the original CPU code, and scales well up to 256 NVIDIA Fermi GPUs (the largest resource tested). The performance on the NVIDIA Fermi GPU is observed to be up to a factor of 4 greater than the (12-core) AMD Magny-Cours CPU (with all cores utilised) for a binary fluid benchmark.