
Ebook: Parallel Computing: Architectures, Algorithms and Applications

ParCo2007 marks a quarter of a century of the international conferences on parallel computing that started in Berlin in 1983. The aim of the conference is to give an overview of the state of the art of developments, applications and future trends in high performance computing for all platforms. The conference addresses all aspects of parallel computing, including applications, hardware and software technologies as well as languages and development environments. Special emphasis was placed on the role of high performance processing in solving real-life problems in all areas, including scientific, engineering and multidisciplinary applications, and on strategies, experiences and conclusions with respect to parallel computing. The book contains papers covering: (1) Applications: the application of parallel computers to solve computationally challenging problems in the physical and life sciences, engineering, industry and commerce, including the treatment of complex multidisciplinary problems occurring in all application areas. (2) Algorithms: the design, analysis and implementation of generic parallel algorithms, including their scalability, in particular to a large number of processors (MPP), portability and adaptability. (3) Software and Architectures: software engineering for developing and maintaining parallel software, including parallel programming models and paradigms, development environments, and compile-time and run-time tools. A number of mini-symposia on specialized topics formed part of the scientific program. The following topics were covered: Parallel Computing with FPGAs, The Future of OpenMP in the Multi-Core Era, Scalability and Usability of HPC Programming Tools, DEISA: Extreme Computing in an Advanced Supercomputing Environment, and Scaling Science Applications on Blue Gene. The conference was organized by the renowned research and teaching institutions Forschungszentrum Jülich and RWTH Aachen University in Germany.
Parallel processing technologies have become omnipresent in the majority of new processors for a wide spectrum of computing equipment, from game computers and standard PCs to workstations and supercomputers. The main reason for this trend is that parallelism theoretically enables a substantial increase in processing power using standard technologies. This results in a substantial reduction in cost compared to the development of specialised high-performance hardware. Today the processing capacity of a desktop PC with a multicore processor exceeds the compute power of a supercomputer of two decades ago at a fraction of the cost.
The utilisation of such powerful equipment requires suitable software. In practice it appears that the construction of appropriate parallel algorithms and the development of system and application software that can exploit the advantages of parallel hardware are not a simple matter. These problems have been studied for nearly five decades and, although much progress has been made in the areas of parallel architectures, algorithm and software design, major problems remain to be addressed. The increasing replication of processing elements on chips and the use of standard components (COTS) for the relatively easy assembly of parallel systems comprising a large number of processors (MPP) to achieve hitherto unattainable processing capacities highlight the problems associated with the utilisation of such systems. Combined with the fast growth in the number of multi-core processors for PCs, there is an increasing need for methods and tools to support the development of software that effectively and efficiently utilises parallel structures.
The international Parallel Computing conference series (ParCo) has reported on progress and stimulated research in the high speed computing field over the past quarter century. New research results and techniques associated with the development and use of parallel systems were discussed at ParCo2007. This international event brought together a number of the top researchers in the field of parallel computing. Their research interests covered all aspects from architectures and networks to software engineering and application development. The use of FPGAs (Field Programmable Gate Arrays) was discussed in the same vein as the development of software for multi-core processors. Papers on a wide variety of application areas using high performance computers were presented. In contrast to software for specialised high speed computing applications, where specialists spend considerable time optimising a particular piece of code, the challenge for the future is to make software development tools available that allow non-specialists to develop 'good' parallel software with minimum effort. All of these areas are in dire need of fundamentally new ideas to overcome the limitations imposed by existing paradigms.
In addition to the contributed papers, a total of five mini-symposia on special topics and an industrial session formed part of the program. Prior to the conference two well-attended tutorials were offered.
As with all previous conferences in this series, the emphasis in publishing the proceedings of ParCo2007 was on quality rather than quantity. Thus all contributions were reviewed prior to and again during the conference. Organisers of mini-symposia were given the option to publish reviewed papers presented at the conference in these proceedings. In total, two invited papers, 63 contributed papers and 26 papers from mini-symposia are included in this book.
The Editors are greatly indebted to the members of the International Program Committee as well as the organisers of the mini-symposia for their support in selecting and reviewing the large number of papers. The organisers of the conference are also greatly indebted to the members of the various committees for the time they spent in making this conference such a successful event. Special thanks are due to the staff of the Jülich Supercomputing Centre at Research Centre Jülich and the Institute for Scientific Computing, RWTH Aachen University for their enthusiastic support.
Christian Bischof, RWTH Aachen University, Germany; Martin Bücker, RWTH Aachen University, Germany; Paul Gibbon, Forschungszentrum Jülich, Germany; Gerhard Joubert, TU Clausthal, Germany; Thomas Lippert, Forschungszentrum Jülich, Germany; Bernd Mohr, Forschungszentrum Jülich, Germany; Frans Peters, Philips Research, The Netherlands
November 2007
A large body of existing, sequential applications must be migrated to multicore platforms with maximum transparency. To achieve this, techniques for automatic parallelization must be pushed to their limits, and current shared memory programming paradigms must be reconsidered with respect to their ease of use, performance characteristics and ability to support a wide range of programming needs. Despite advances in automated and dynamic exposure of exploitable parallelism, the application developer will frequently need to support the process of code migration by making the parallelism in a program explicit. One of the most straightforward ways in which this can be done is by inserting the directives or pragmas of the portable OpenMP shared memory programming interface, a de facto standard.
In this paper we discuss the programming implications of this radical change in the approach to building mainstream processors. We briefly consider the suitability of existing and recently proposed programming models to satisfy multicore application development needs, arguing that OpenMP is the most broadly appropriate of them for higher level multicore programming. Yet OpenMP was designed before the multicore “revolution”. We discuss some ways in which the implementation of OpenMP might be improved to meet scalability demands posed by emerging platforms and introduce a potential extension that might simplify application development in some cases. We briefly discuss our own OpenMP implementation. Finally, we offer our conclusions on the ability of the state of the art in programming models and their implementations to satisfy the needs of multicore programming.
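As a minimal illustration of the directive-based approach described above (a generic example, not code from the paper), a single OpenMP pragma suffices to make the parallelism of a simple loop explicit while leaving the sequential code intact:

    // Hypothetical serial routine made parallel by one OpenMP directive; compile
    // with OpenMP enabled (e.g. -fopenmp). Compilers without OpenMP support
    // simply ignore the pragma, so the sequential semantics are preserved.
    void saxpy(int n, float a, const float* x, float* y)
    {
        #pragma omp parallel for      // iterations are divided among the threads
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }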
A parallel solver for incompressible fluid flow simulation, used in biomedical device design among other applications, is discussed. The major compute- and communication-intensive portions of the code are described. Using unsteady flow in a complex implantable axial blood pump as a model problem, the scalability characteristics of the solver are briefly examined. The code, which had so far exhibited good scalability on typical PC clusters, is shown to suffer from a specific bottleneck when thousands of processors of a Blue Gene are employed. After resolution of the problem, satisfactory scalability on up to four thousand processors is attained.
We describe a domain decomposition approach applied to the specific context of electronic structure calculations. The approach was introduced in Ref. 1. We briefly present the algorithm and its parallel implementation. Simulation results on linear hydrocarbon chains with up to 2 million carbon atoms are reported. Linear scaling with respect to the number of atoms is observed. High parallel scalability has been obtained on a Blue Gene/L machine with up to 1024 processors.
We propose a new pure Lagrangian method for the parallel, load-balanced simulation of particle-fluid systems with moving boundaries or free surfaces. Our method is completely meshless and models solid objects as well as the fluid as particles. Using an Orthogonal Recursive Bisection we obtain a domain decomposition that is well suited for controller-based load balancing. This controller approach is designed for use on clusters of workstations, as it can cope with load imbalances arising not only from the simulation dynamics but also from competing processes of other users. In this paper we present the most important concepts, such as neighbourhood search, particle interactions, domain decomposition and controller-based load balancing.
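The following sketch outlines the idea of Orthogonal Recursive Bisection (an illustration under simplifying assumptions, not the authors' implementation): the particle set is recursively split at a median along the axis of largest extent until one group per process remains.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Particle { double x[3]; };

    // Assign particles p[first..last) to 'groups' consecutive processes,
    // starting at 'firstGroup', by recursive bisection.
    void orb(std::vector<Particle>& p, std::size_t first, std::size_t last,
             int groups, std::vector<int>& owner, int firstGroup)
    {
        if (groups == 1 || last - first < 2) {   // leaf: this block belongs to one group
            for (std::size_t i = first; i < last; ++i) owner[i] = firstGroup;
            return;
        }
        int axis = 0;                            // choose the axis of largest extent
        double best = -1.0;
        for (int d = 0; d < 3; ++d) {
            double lo = p[first].x[d], hi = lo;
            for (std::size_t i = first; i < last; ++i) {
                lo = std::min(lo, p[i].x[d]);
                hi = std::max(hi, p[i].x[d]);
            }
            if (hi - lo > best) { best = hi - lo; axis = d; }
        }
        // split the block proportionally to the number of groups on each side
        const std::size_t mid = first + (last - first) * (groups / 2) / groups;
        std::nth_element(p.begin() + first, p.begin() + mid, p.begin() + last,
                         [axis](const Particle& a, const Particle& b)
                         { return a.x[axis] < b.x[axis]; });
        orb(p, first, mid, groups / 2, owner, firstGroup);
        orb(p, mid, last, groups - groups / 2, owner, firstGroup + groups / 2);
    }

In a controller-based setting the split positions could additionally be shifted according to measured load rather than particle counts alone.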
A new implementation of a force decomposition method for parallel molecular dynamics simulations is presented. It is based on a geometrical decomposition of the influence matrix, where sections are dynamically reorganized during the simulation in order to maintain a good load balance. Furthermore, space-filling curves are used to sort particles in space, which makes memory access more efficient and also reduces communication between processors due to better locality. Benchmark runs are presented which show an improvement in scalability due to well-balanced work on the processors.
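As an illustration of space-filling-curve ordering (the specific curve used in the paper is not reproduced here; a Morton, i.e. Z-order, key is shown as one common choice), particles can be sorted by a key that interleaves the bits of their cell coordinates so that particles close in space are also close in memory:

    #include <cstdint>

    // Spread the lower 21 bits of v so that two zero bits separate each data bit.
    static std::uint64_t spread3(std::uint64_t v)
    {
        v &= 0x1fffff;
        v = (v | v << 32) & 0x001f00000000ffffULL;
        v = (v | v << 16) & 0x001f0000ff0000ffULL;
        v = (v | v << 8)  & 0x100f00f00f00f00fULL;
        v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
        v = (v | v << 2)  & 0x1249249249249249ULL;
        return v;
    }

    // Interleave three 21-bit cell indices into a single 63-bit Morton key;
    // sorting particles by this key yields the Z-order traversal of the cells.
    static std::uint64_t mortonKey(std::uint32_t ix, std::uint32_t iy, std::uint32_t iz)
    {
        return spread3(ix) | (spread3(iy) << 1) | (spread3(iz) << 2);
    }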
The simulation of fluid flow on the nano-scale in the field of process engineering involves a large number of relatively small molecules. The systems we are simulating can be modelled using rigid molecular models assembled from sites with non-bonded short-range pair potentials. Each of the sites is described by a set of parameters which are required for the calculation of interactions between sites of the same type. For the interaction of unequal sites, mixed parameter sets have to be calculated. This has to be done for each possible pair of sites. We describe an approach to precalculate and store those mixed parameter sets in a stream, which allows efficient access and gives the flexibility to add new site types easily.
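A minimal sketch of such a precomputed parameter stream is shown below (the mixing rule and data layout are illustrative assumptions, not taken from the paper; Lorentz-Berthelot combination rules are used as an example):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct SiteType  { double sigma, epsilon; };   // per-site-type parameters
    struct MixedPair { double sigma, epsilon; };   // precomputed pair parameters

    // Build a flat, row-major table with one entry per ordered pair of site
    // types, so the interaction kernel only performs a single lookup per pair.
    std::vector<MixedPair> buildMixedTable(const std::vector<SiteType>& types)
    {
        const std::size_t n = types.size();
        std::vector<MixedPair> table(n * n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                table[i * n + j].sigma   = 0.5 * (types[i].sigma + types[j].sigma);
                table[i * n + j].epsilon = std::sqrt(types[i].epsilon * types[j].epsilon);
            }
        return table;   // parameters for sites of type i and j: table[i * n + j]
    }

Adding a new site type then only requires appending its parameters and rebuilding the table once, before the simulation starts.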
Another focus of our work has been on software engineering techniques. Using the adapter design pattern, we achieved a complete decoupling of the physical parts of the simulation (e.g. molecule models and interactions) from the data structures and the parallelisation. This eases the further concurrent development of the software and reduces the complexity of the different modules. It also gives us the opportunity to swap modules in a plug-in-like fashion.
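The decoupling can be pictured with the following sketch (class and method names are invented for illustration; this is not the interface of the actual code): the physics only sees an abstract adapter, so the data structures and parallelisation schemes behind it can be exchanged freely.

    #include <cstddef>

    // Interface presented to the physical model; concrete adapters wrap a
    // particular particle data structure (e.g. linked cells) and parallelisation.
    class ParticleContainerAdapter {
    public:
        virtual ~ParticleContainerAdapter() = default;
        virtual std::size_t numPairs() const = 0;
        virtual void getPair(std::size_t k, std::size_t& i, std::size_t& j) const = 0;
        virtual void addForce(std::size_t i, double fx, double fy, double fz) = 0;
    };

    // The interaction routine depends only on the adapter, not on the container.
    void computeForces(ParticleContainerAdapter& c)
    {
        for (std::size_t k = 0; k < c.numPairs(); ++k) {
            std::size_t i, j;
            c.getPair(k, i, j);
            // evaluate the pair potential for (i, j), then accumulate, e.g.:
            c.addForce(i, 0.0, 0.0, 0.0);   // placeholder force values
            c.addForce(j, 0.0, 0.0, 0.0);
        }
    }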
Finally, we demonstrate the advantages of a “pair ownership” of processes for the parallelisation which allows the joint calculation of macroscopic values and the forces on molecules.
We discuss an extension of the Massively Parallel Quantum Computer Simulator by a gate-level error model which covers operational errors and decoherence. Applying this error model to the Quantum Fourier Transformation (QFT, the kernel of Shor's algorithm) and Grover's quantum search algorithm, one finds that the QFT circuit is more robust to operational inaccuracies than Grover's algorithm on comparable scales. Critical parameters can be derived which give a first estimate of tolerable error thresholds. At present, ion traps are regarded as the most promising technology for the realization of quantum computers due to the long coherence time of trapped ions. We discuss Hamiltonian-based dynamical ion-trap simulations which have been developed in collaboration with the experimental working group of Prof. Rainer Blatt. In contrast to standard approaches, no approximations such as the rotating wave approximation or an expansion in the Lamb-Dicke parameter are required, which allows for very accurate simulations. This makes it possible to identify critical system parameters which limit the stability of the experiment.
A typical commodity camera rarely supports selecting a region of interest to reduce bandwidth, and depending on the extent of image processing, a single CPU may not be sufficient to process data from the camera. Further, such cameras often lack support for synchronized inter-camera image capture, making it difficult to relate images from different cameras. This paper presents a scalable, dedicated parallel camera system for detecting objects in front of a wall-sized, high-resolution, tiled display. The system determines the positions of detected objects, and uses them to interact with applications. Since a single camera can saturate either the bus or CPU, depending on its characteristics and the image processing complexity, the system supports configuring the number of cameras per computer according to bandwidth and processing needs. To minimize image processing latency, the system focuses only on detecting where objects are, rather than what they are, thus reducing the problem's complexity. To overcome the lack of synchronized cameras, short periods of waiting are used. An experimental study using 16 cameras has shown that the system achieves acceptable latency for applications such as 3D games.
Virtual Reality has proven to be a useful tool in the visualization of complex flow simulation data sets. Maintaining interactivity for the user exploring unstructured, time-varying data demands parallel computing power. We propose a decoupling of interactive and non-interactive tasks to avoid latencies. We introduce an optimized resampling algorithm for unstructured grids which remains scalable using hybrid parallelization. With a resampled region of interest, interactive execution of arbitrary visualization techniques is made possible. To validate the findings, the user can access different error metrics computed during resampling.
In this work we analyze the performance of the Weather Research and Forecasting (WRF) model using both empirical data and an accurate analytic performance model. WRF is a large-scale mesoscale numerical weather prediction system designed for both operational forecasting and atmospheric research. It is in active development at the National Center for Atmospheric Research (NCAR), and can use thousands of processors in parallel. In this work we compare the performance of WRF on a cluster-based system (AMD Opteron processors interconnected with 4x SDR InfiniBand) to that on a mesh-based system (IBM Blue Gene/L interconnected with a proprietary 3-D torus). In addition, we develop a performance model of WRF that is validated against these two systems and that exhibits high prediction accuracy. The model is then used to examine the performance of a near-term future-generation supercomputer.
This paper presents a framework based on a user-driven methodology to obtain analytical models on parallel systems and, in particular, clusters. This framework consists of two interconnected stages. In the first, the analyst instruments the source code and some performance parameters are monitored. In the second, the monitored data are used to obtain an analytical model using statistical processes. The main functionalities added to the analysis stage include an automatic fit process that provides accurate performance models and the automatic collection of monitoring data. Some examples are used to illustrate the automatic fit process. The accuracy of the models is compared with a complexity study of the selected examples.
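As a much simplified stand-in for the automatic fit stage (the framework's statistical machinery is not reproduced here), the sketch below fits measured execution times against a single explanatory metric, e.g. T ≈ a + b·x with x = n/p, by closed-form least squares:

    #include <cstddef>
    #include <vector>

    struct LinearFit { double a, b; };   // model: T ≈ a + b * x

    // Ordinary least-squares fit of t[i] against x[i]; x could be, for instance,
    // the problem size divided by the number of processors.
    LinearFit fitLinear(const std::vector<double>& x, const std::vector<double>& t)
    {
        const std::size_t m = x.size();
        double sx = 0.0, st = 0.0, sxx = 0.0, sxt = 0.0;
        for (std::size_t i = 0; i < m; ++i) {
            sx  += x[i];
            st  += t[i];
            sxx += x[i] * x[i];
            sxt += x[i] * t[i];
        }
        const double b = (m * sxt - sx * st) / (m * sxx - sx * sx);
        const double a = (st - b * sx) / m;
        return { a, b };
    }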
Computational force, also called computational intensity, is a unifying concept for understanding the performance of parallel numerical algorithms. Dimensional analysis reduces a formula for execution time, from a paper by Stewart, to an exercise in differential geometry for a single efficiency surface. Different machines move on the surface along different paths defined by curvilinear coordinates that depend on ratios of hardware forces to software forces.
Periscope is a distributed automatic online performance analysis system for large-scale parallel systems. It consists of a set of analysis agents distributed on the parallel machine. This article presents the approach taken on the Altix 4700 supercomputer at LRZ to distribute the analysis agents and the application processes onto the set of processors assigned to a parallel job by the batch scheduling system. An optimized mapping that reduces the distance between the analysis agents and the application processes is computed based on the topology information of the processors. This mapping is then implemented via the dplace command on the Altix.
This paper describes case studies with the Eden Trace Viewer (EdenTV), a post-mortem trace analysis and visualisation tool for the parallel functional language Eden. It shows program executions in terms of Eden's abstract units of computation instead of providing a machine-oriented low-level view like common tools for parallelism analysis do. We show how typical inefficiencies in parallel functional programs, due to delayed evaluation or unbalanced workload, can be detected by analysing the trace visualisations.
In recent years, strong efforts have been dedicated to the study of applications' execution phases. In this paper, we present an approach focused on the automatic detection of phases in the execution of MPI applications. This detection is based on Wavelet Analysis, which provides a meaningful and fast methodology. Information derived from phase detection is very useful for studying the performance of an application. To illustrate this, we present a study of the scalability of several applications based on phase detection.
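To give a flavour of how wavelet analysis can expose phase boundaries (a simplified illustration; the paper's actual wavelet family and decision metrics are not reproduced here), a single Haar decomposition step splits a per-time-bin performance signal into a coarse approximation and detail coefficients, where large detail values mark abrupt changes:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One level of the Haar wavelet transform: 'approx' holds local averages,
    // 'detail' holds local differences of consecutive samples of the signal.
    void haarStep(const std::vector<double>& signal,
                  std::vector<double>& approx, std::vector<double>& detail)
    {
        const double s = 1.0 / std::sqrt(2.0);
        approx.clear();
        detail.clear();
        for (std::size_t i = 0; i + 1 < signal.size(); i += 2) {
            approx.push_back(s * (signal[i] + signal[i + 1]));
            detail.push_back(s * (signal[i] - signal[i + 1]));
        }
    }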
In order to share interesting data located in various institutions, it may be necessary to ensure secure access and privacy preservation. This is, for example, the case for biomedical images, since they are related to confidential information corresponding to real patients. This paper describes an approach to enable the use of PIMA(GE)2 Lib, the Parallel IMAGE processing GEnoa Library, in the analysis of distributed medical images complying with the privacy preservation policy imposed by the owner institutions. To achieve this goal, a set of specialized I/O functions was developed to enable secure remote data access and elaboration, exploiting the Grid infrastructure and without copying data from their original location. The work considers the use of the EGEE infrastructure and the GFAL API. The approach was tested in the analysis of images obtained through the Tissue MicroArray technique; the tests gave positive results, also with respect to execution time.
In this paper a parallel workflow for the reconstruction of molecular surfaces based on the isosurface extraction operation is proposed. The input is represented by the atomic coordinates of a molecule; the output is both its volumetric description and the isosurface that, on the basis of the user's selection, corresponds to the Van der Waals, Lee & Richards or Connolly surface. The main feature of the workflow is the efficient production of high-resolution surfaces. This is a key aspect in Bioinformatics applications, considering that the amount of data to process may be huge. This goal is achieved through a parallel implementation of the stages of the workflow.
High performance computer (HPC) simulations provide helpful insights into the process of magnetic resonance image (MRI) generation, e.g. for general pulse sequence design and optimisation, artefact detection, validation of experimental results, hardware optimisation, MRI sample development and for educational purposes. This manuscript presents the recently developed simulator JEMRIS (Jülich Environment for Magnetic Resonance Imaging Simulation). JEMRIS is developed in C++; the message passing is realised with the MPI library. The approach provides generally valid numerical solutions for the case of classical MR spin physics governed by the Bloch equations. The framework already serves as a tool in research projects, as will be shown for the important example of multidimensional, spatially selective excitation.
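For orientation, the Bloch equations describe precession of the magnetisation around the local field together with relaxation; a deliberately simplified explicit Euler step (for illustration only, not the solver used by JEMRIS) looks as follows:

    #include <array>

    using Vec3 = std::array<double, 3>;

    // One explicit Euler step of dM/dt = gamma * (M x B) - relaxation,
    // with transverse decay (T2) and longitudinal recovery towards M0 (T1).
    Vec3 blochStep(const Vec3& M, const Vec3& B, double gamma,
                   double T1, double T2, double M0, double dt)
    {
        const double dMx = gamma * (M[1] * B[2] - M[2] * B[1]) - M[0] / T2;
        const double dMy = gamma * (M[2] * B[0] - M[0] * B[2]) - M[1] / T2;
        const double dMz = gamma * (M[0] * B[1] - M[1] * B[0]) - (M[2] - M0) / T1;
        return { M[0] + dt * dMx, M[1] + dt * dMy, M[2] + dt * dMz };
    }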
Clusters are by now one of the most popular architectures. Hiding latencies as well as achieving an optimal assignment of processors are two issues for many scientific applications, especially if non-dedicated clusters are used. Traditionally, high performance scientific applications are parallelized using MPI libraries. Typical implementations of MPI minimize the dynamic features required to face latencies or shared resource usage. Switching from the MPI process model to a threaded programming model in the parallel environment can help to achieve efficient overlapping and provide abilities for load balancing. Gains achieved by BICAV (our research group's tomographic reconstruction software) in a multithreaded framework are presented.
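A hedged sketch of the kind of overlap such a threaded model enables is shown below (generic code, not taken from BICAV; it assumes MPI was initialised with MPI_Init_thread requesting at least MPI_THREAD_FUNNELED, and the matching send on the neighbouring rank is omitted):

    #include <mpi.h>

    // One thread drives the non-blocking halo exchange while the remaining
    // threads of the team keep computing on the interior data.
    void exchangeAndCompute(double* halo, int haloCount, int neighbour,
                            double* interior, int interiorCount)
    {
        MPI_Request req;
        #pragma omp parallel
        {
            #pragma omp master
            {
                MPI_Irecv(halo, haloCount, MPI_DOUBLE, neighbour, 0,
                          MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
            #pragma omp for nowait          // other threads do not wait for the master
            for (int i = 0; i < interiorCount; ++i)
                interior[i] *= 2.0;         // stand-in for the real interior work
        }
        // implicit barrier of the parallel region: the halo has arrived here,
        // so boundary computations can now safely use it
    }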