
Ebook: Parallel Computing: Technology Trends

The year 2019 marked four decades of cluster computing, a history that began in 1979 when the first cluster systems using Components Off The Shelf (COTS) became operational. This achievement resulted in a rapidly growing interest in affordable parallel computing for solving compute-intensive and large-scale problems. It also directly led to the founding of the ParCo conference series.
Since its start in 1983, the International Conference on Parallel Computing, ParCo, has been a leading venue for discussions of important developments, applications, and future trends in cluster computing, parallel computing, and high-performance computing. ParCo2019, held in Prague, Czech Republic, from 10 to 13 September 2019, was no exception. Its papers, invited talks, and specialized mini-symposia addressed cutting-edge topics in computer architectures, programming methods for specialized devices such as field programmable gate arrays (FPGAs) and graphics processing units (GPUs), innovative applications of parallel computers, approaches to reproducibility in parallel computations, and other relevant areas.
This book presents the proceedings of ParCo2019, with the goal of making the many fascinating topics discussed at the meeting accessible to a broader audience. The proceedings contain 57 contributions in total, all of which were peer-reviewed after their presentation. These papers give a wide-ranging overview of the current status of research, developments, and applications in parallel computing.
The ParCo2019 conference was hosted by Charles University, Prague, Czech Republic. This event marked thirty-six years of ParCo conferences. Over the past decades the conference has proved to be not only a mirror of the advances made in the development and use of parallel computing, but also a stimulus for research into and development of many new technologies.
During the opening session of the conference it was noted that 2019 also marked four decades of cluster computing. In 1979 a cluster system with a simple MIMD (Multiple Instruction Multiple Data) architecture using COTS (Components Off The Shelf) became operational. A short historical overview of the development of this cluster system and of the results obtained with it is given in the note titled Four Decades of Cluster Computing, included in these proceedings. The research done with this system brought about international cooperation involving a number of researchers. These international connections resulted in the start of the international ParCo conferences in 1983, which in turn initiated the journal Parallel Computing (Elsevier) and later the book series Advances in Parallel Computing (Elsevier and IOS Press).
The scientific program of the ParCo2019 conference consisted of invited talks, contributed papers, and papers presented as part of symposia. The invited speakers were:
- Torsten Hoefler: Performance Portability with Data-Centric Parallel Programming
- Thomas Lippert: Scalability, Cost-Effectiveness and Composability of Large-Scale Supercomputers through Modularity
- Ian Foster: Coding the Continuum (slides at www.parco.org/slides/Foster.pdf)
- Jaroslav Cervinka: High Performance Computing at Skoda Auto
- Mikhail Dyakonov: Will we ever have a quantum computer?
- Erik D’Hollander: Empowering Parallel Computing with Field Programmable Gate Arrays
- Jean-Pierre Panziera: Addressing the Exascale Challenges (slides at www.parco.org/slides/Panziera.pdf)
- Jean-Marc Denis: European Processor Initiative: The European Vision for Exascale Ages and Beyond (slides at www.parco.org/slides/Denis.pdf)
Two invited speakers chose to submit a paper for inclusion in the proceedings. In addition, the slides presented by three of the speakers are available on the ParCo website.
Contributed papers presented at the conference were selected on the basis of paper proposals. Submitted papers were reviewed in a first round prior to the conference and again in a second round at the presentation of the papers. The results and recommendations from the two review rounds were communicated to the authors. Authors of accepted papers were given the option to submit a revised version of their paper. The proceedings thus only include papers that were accepted after their presentation at the conference. The papers ultimately selected for publication give a wide-ranging overview of the current status of parallel computing research, developments, and applications.
Four symposia were organised and presented as part of the conference. The organisers of each symposium were responsible for reviewing the papers presented in it and submitted for publication.
The Editors are indebted to all persons who assisted in making the conference a success. These include the staff at the registration desk and the technical staff who ensured the functioning of all equipment. A particular word of thanks is due to Siavash Ghiasvand from the Dresden University of Technology for his support in compiling the final version of the manuscript of these proceedings.
A final note: this is the first volume in the Advances in Parallel Computing book series to be published as an Open Access (OA) book, making the contents of the book freely accessible to everyone. The publication of the proceedings as an OA book does not change the indexing of the published material in any way.
Ian Foster
Gerhard Joubert
Luděk Kučera
Wolfgang Nagel
Frans Peters
During the latter half of the 1970s high-performance computers (HPC) were constructed using specially designed and manufactured hardware. The preferred architectures were vector or array processors, as these allowed for high-speed processing of a large class of scientific/engineering applications. Due to the high cost of the development and construction of such HPC systems, the number of available installations was limited. Researchers often had to apply for compute time on such systems and wait for weeks before being allowed access. Cheaper and more accessible HPC systems were thus greatly needed. The concept of constructing high-performance parallel computers with distributed Multiple Instruction Multiple Data (MIMD) architectures using standard off-the-shelf hardware promised the construction of affordable supercomputers. Considerable scepticism existed at the time about whether MIMD systems could offer significant increases in processing speed. The reasons for this were Amdahl’s Law, coupled with the overheads resulting from slow communication between nodes and the complex scheduling and synchronisation of parallel tasks. In order to investigate the potential of MIMD systems constructed with existing off-the-shelf hardware, a first simple two-processor system was built; it finally became operational in 1979. In this paper, aspects of this system and some of the results achieved with it are reviewed.
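For reference, the scepticism mentioned above can be made concrete with Amdahl's Law, sketched below in its standard form, where p denotes the fraction of the work that can be parallelized and n the number of processors:

```latex
% Amdahl's Law: bound on the speedup attainable with n processors
% when a fraction p of the work is parallelizable.
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
```

Even with p = 0.9 the speedup is capped at 10, which, together with communication and synchronisation overheads, explains the early doubts about COTS-based MIMD clusters.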
In the hypothetical quantum computer one replaces the classical two-state bit by a quantum element (qubit) with two basic states, ↑ and ↓. Its arbitrary state is described by the wave function ψ = a↑ + b↓, where a and b are complex amplitudes satisfying the normalization condition. Unlike the classical bit, which can only be in one of the two states, ↑ or ↓, the qubit can be in a continuum of states defined by the quantum amplitudes a and b. The qubit is a continuous object. At a given moment, the state of a quantum computer with N qubits is characterized by 2^N quantum amplitudes, which are continuous variables restricted only by the normalization condition. Thus, the hypothetical quantum computer is an analog machine characterized by a super-astronomical number of continuous variables (even for N ∼ 100–1000). Their values cannot be arbitrary; they must be under our control. Thus the answer to the question in the title is: when physicists and engineers learn to keep this number of continuous parameters under control, which means – never.
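In standard Dirac notation, the counting argument reads compactly as follows (a sketch for reference):

```latex
% Single-qubit state and its normalization condition
|\psi\rangle = a\,|{\uparrow}\rangle + b\,|{\downarrow}\rangle,
\qquad |a|^2 + |b|^2 = 1 .

% An N-qubit register is described by 2^N complex amplitudes,
% all of which must be kept under control.
|\Psi\rangle = \sum_{x \in \{\uparrow,\downarrow\}^N} c_x\,|x\rangle,
\qquad \sum_{x} |c_x|^2 = 1 .
```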
After more than 30 years, reconfigurable computing has grown from a concept into a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminating the instruction overhead and to optimizing execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis (HLS) support as well as better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, and we discuss different programming models, their HLS language implementations, and the high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
Nowadays, machine learning techniques based on deep neural networks are everywhere, from image classification and recognition or language translation to autonomous driving or stock market prediction. One of the most prominent fields of application is medicine, where AI techniques promote and promise personalized medicine. In this work, we entered this field to study the prediction of prostate cancer from images digitized from hematoxylin and eosin stained biopsies. We chose this illness because prostate cancer is a very common type of cancer and the second leading cause of death in men. We did this work in collaboration with the Hospital Reina Sofia of Murcia. As newcomers, we faced many problems getting started and questioned ourselves about many issues. This paper presents our experiences in developing and training two convolutional neural networks from scratch, highlighting the importance of both the preprocessing steps (cropping raw images into tiles, labeling, and filtering) and the postprocessing steps (i.e., those needed to obtain results understandable to doctors). The paper thus describes lessons learned in building CNN models for prostate cancer detection from biopsy slides.
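As an illustration of the tile-cropping and filtering steps mentioned above, the following minimal NumPy sketch cuts a slide image into square tiles and discards mostly white (background) tiles; the tile size and brightness threshold are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def crop_to_tiles(image, tile=512, stride=512, background_threshold=220):
    """Cut a digitized slide (H x W x 3 uint8 array) into square tiles.

    Assumptions: non-overlapping 512x512 tiles; tiles whose mean intensity
    exceeds the threshold are treated as background and discarded.
    """
    h, w, _ = image.shape
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patch = image[y:y + tile, x:x + tile]
            if patch.mean() < background_threshold:   # keep tiles with tissue
                tiles.append(patch)
    if not tiles:
        return np.empty((0, tile, tile, 3), dtype=image.dtype)
    return np.stack(tiles)

# A synthetic image stands in for a hematoxylin-and-eosin stained biopsy scan.
slide = np.random.randint(0, 256, size=(2048, 2048, 3), dtype=np.uint8)
print(crop_to_tiles(slide).shape)   # (number_of_tiles, 512, 512, 3)
```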
Significant progress in computer hardware and software has enabled molecular dynamics (MD) simulations to model complex biological phenomena such as protein folding. However, enabling MD simulations to access biologically relevant timescales (e.g., beyond milliseconds) still remains challenging. The limitations include (1) quantifying which set of states has already been (sufficiently) sampled in an ensemble of MD runs, and (2) identifying novel states from which simulations can be initiated to sample rare events (e.g., folding events). With the recent success of deep learning and artificial intelligence techniques in analyzing large datasets, we posit that these techniques can also be used to adaptively guide MD simulations to model such complex biological phenomena. Leveraging our recently developed unsupervised deep learning technique for clustering protein folding trajectories into partially folded intermediates, we build an iterative workflow that enables our generative model to be coupled with all-atom MD simulations to fold small protein systems on emerging high-performance computing platforms. We demonstrate our approach by folding the Fs-peptide and the β-β-α (BBA) fold, FSD-EY. Our adaptive workflow enables us to achieve an overall root-mean-square deviation (RMSD) to the native state of 1.6 Å and 4.4 Å for Fs-peptide and FSD-EY, respectively. We also highlight some emerging challenges in the context of designing scalable workflows when data-intensive deep learning techniques are coupled to compute-intensive MD simulations.
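For reference, the quoted RMSD values follow the usual definition over the N atoms considered, with r_i the simulated and r_i^ref the native coordinates after superposition:

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N}
  \left\lVert \mathbf{r}_i - \mathbf{r}_i^{\mathrm{ref}} \right\rVert^{2}}
```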
We propose a novel approach combining vector autoregressive models and data assimilation to conduct econometric inference for high-dimensional problems in cryptocurrency markets. We label this new model TVP-VAR-DA. As the resulting algorithm is computationally very expensive, it mandates the introduction of a problem decomposition and its implementation in a parallel computing environment. We study its scalability and prediction accuracy under various specifications.
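For readers unfamiliar with the acronym, the TVP-VAR part can be sketched as a vector autoregression with time-varying parameters; the lag order p and the law of motion of the coefficients are left unspecified here and are not taken from the paper:

```latex
% Time-varying parameter VAR(p); y_t collects the cryptocurrency market variables
y_t = c_t + \sum_{i=1}^{p} A_{i,t}\, y_{t-i} + \varepsilon_t ,
\qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma_t),
```

with the coefficients (c_t, A_{i,t}, Σ_t) evolving over time and, in this approach, updated from incoming observations by the data assimilation (DA) step.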
Cloud computing has emerged as an interesting alternative for running business applications, but this might not be true for scientific applications. A comparison between HPC systems and cloud infrastructures does not always see the latter winning over the former, especially when only performance and economic aspects are taken into account. But if other factors, such as turnaround time and user preference, come into play, the picture of which platform is more convenient changes. Choosing the right infrastructure can then essentially be seen as a multi-attribute decision-making problem. In this paper we introduce an evaluation model, based on a weighted geometric aggregation function, that takes into account a set of parameters including job geometry, cost, execution time, and turnaround time. The notion of user preference modulates the model and makes it possible to determine which platform, cloud or HPC, might be the best one. The model was then used to evaluate the best architecture for several runs of two applications based on two different communication models. The results show that the model is robust and that there is a non-negligible number of runs for which a cloud infrastructure seems to be the best place to run scientific jobs.
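A weighted geometric aggregation of normalized attribute scores can be sketched as follows; the attribute names, weights, and scores are illustrative assumptions, not the calibration used in the paper:

```python
import math

def weighted_geometric_score(scores, weights):
    """Aggregate normalized scores s_k in (0, 1] with weights w_k summing to 1
    into the weighted geometric mean prod(s_k ** w_k)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return math.prod(scores[k] ** weights[k] for k in weights)

# Illustrative attributes (higher is better, already normalized to (0, 1]).
weights = {"cost": 0.3, "turnaround": 0.4, "performance": 0.3}
cloud   = {"cost": 0.9, "turnaround": 0.8, "performance": 0.5}
hpc     = {"cost": 0.6, "turnaround": 0.4, "performance": 0.9}

best = max([("cloud", cloud), ("HPC", hpc)],
           key=lambda p: weighted_geometric_score(p[1], weights))
print(best[0])   # platform with the higher aggregated score
```

With these made-up numbers the cloud wins; shifting weight towards raw performance rather than turnaround time tips the decision back towards the HPC system, which illustrates the role user preference plays in such a model.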
Real-time coding of high-resolution JPEG2000 video requires specialized hardware architectures such as Field-Programmable Gate Arrays (FPGAs). Commonly, implementations of JPEG2000 on other architectures such as Graphics Processing Units (GPUs) do not attain sufficient throughput because the algorithms employed in the standard are inherently sequential, which prevents the use of the fine-grain parallelism needed to achieve full GPU performance. This paper presents an architecture for an end-to-end wavelet-based video codec that uses the framework of JPEG2000 but introduces distinct modifications that enable the use of fine-grain parallelism for its acceleration on GPUs. The proposed codec partly employs our previous research on the parallelization of two stages of the JPEG2000 coding process. The proposed solution achieves real-time processing of 4K video on commodity GPUs, with much better power-efficiency ratios compared to server-grade Central Processing Unit (CPU) systems running the standard JPEG2000 codec.
GPGPU computation of microscopic pedestrian simulations has been largely restricted to cellular automata and differential equation models, leaving out most agent-based models that rely on sequential updates. We combine a linked-cell data structure, which reduces neighborhood complexity, with a massively parallel filtering technique that identifies agents that can be updated in parallel, thus extending GPGPU computation to one such model, the Optimal Steps Model. We compare two different OpenCL implementations: a parallel event-driven update scheme and a parallel update scheme that violates the event order for the sake of parallelism. We achieve significant speedups for both in two benchmark scenarios, making faster-than-real-time simulations possible even for large-scale scenarios.
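The filtering idea (selecting agents whose updates cannot conflict so that they can be stepped in parallel) can be pictured with the following serial NumPy sketch; this is a minimal illustration under assumed cell sizes and positions, not the authors' OpenCL implementation:

```python
import numpy as np

def cell_index(positions, cell_size):
    """Map 2D positions to integer linked-cell coordinates."""
    return np.floor(positions / cell_size).astype(int)

def parallel_update_set(positions, cell_size):
    """Greedily pick agents so that no two chosen agents lie in the same or
    adjacent cells; such agents are outside each other's neighbourhoods and
    could therefore be updated in parallel."""
    cells = cell_index(positions, cell_size)
    blocked = set()
    chosen = []
    for i, (cx, cy) in enumerate(cells):
        if (cx, cy) in blocked:
            continue
        chosen.append(i)
        for dx in (-1, 0, 1):          # block the 3x3 cell neighbourhood
            for dy in (-1, 0, 1):
                blocked.add((cx + dx, cy + dy))
    return chosen

positions = np.random.rand(1000, 2) * 50.0    # illustrative agent positions
print(len(parallel_update_set(positions, cell_size=1.0)))
```

In a full scheme, the agents left out of one such set would be handled in further filtering rounds until the whole crowd has been updated.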
Exact diagonalization is the most accurate approach for solving the Hubbard model; it calculates the ground state of the Hamiltonian derived exactly from the model. Since the Hamiltonian is a large sparse symmetric matrix, an iterative method is usually utilized. It has been reported that the LOBPCG method is one of the most effective solvers for this problem. Since most operations of the method are linear algebra operations, the method can be executed straightforwardly on CUDA GPUs, which are among the mainstream processors, by using the cuBLAS and cuSPARSE libraries. However, since these routines are executed one after the other, cached data cannot be reused across routines. In this research, we tune the routines by fusing some of their loop operations in order to reuse cached data. Moreover, we propose tuning strategies for the Hamiltonian-vector multiplication using shared memory, taking the character of the Hamiltonian into consideration. A numerical test on the NVIDIA Tesla P100 shows that the tuned LOBPCG code is about 1.5 times faster than the code built from cuBLAS and cuSPARSE routines.
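As a point of reference for the untuned baseline, the LOBPCG method itself can be run on a sparse symmetric matrix with SciPy on the CPU; the matrix below is a synthetic stand-in rather than a Hubbard Hamiltonian, and the block size and tolerances are arbitrary choices:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

n = 10_000
# Synthetic sparse symmetric matrix standing in for the Hamiltonian.
A = sp.random(n, n, density=1e-4, random_state=0)
A = (A + A.T) * 0.5 + sp.diags(np.arange(1, n + 1, dtype=float))

# Random initial block; the smallest eigenpair approximates the ground state.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 4))
eigenvalues, eigenvectors = lobpcg(A, X, largest=False, tol=1e-6, maxiter=200)
print(eigenvalues[0])   # approximate lowest eigenvalue of the stand-in matrix
```

The paper's contribution lies in how the corresponding GPU routines are fused so that cached data can be reused, which a routine-by-routine cuBLAS/cuSPARSE translation cannot achieve.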
Modern-day supercomputers are equipped with sophisticated graphics processing units (GPUs) along with high-performance CPUs. Adapting existing algorithms specifically to the GPU has resulted in under-utilization of CPU computing power. In this respect, we parallelize the Jacobi and successive over-relaxation (SOR) methods, which are used as smoothers in the multigrid method, to maximize the combined utilization of both CPUs and GPUs. We study the performance of the multigrid method in terms of total execution time by employing different hybrid parallel approaches, viz. accelerating the smoothing operation using only the GPU across all multigrid levels, alternately switching between GPU and CPU based on the multigrid level, and our proposed novel approach of using a combination of GPU and CPU across all multigrid levels. Our experiments demonstrate a significant speedup using the hybrid parallel approaches, across different problem sizes and finite element types, as compared to the MPI-only approach. However, the scalability challenge persists for the hybrid parallel multigrid smoothers.
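For orientation, one damped Jacobi sweep and one SOR sweep for a generic system Ax = b look as follows in a serial NumPy sketch; the paper's smoothers operate on distributed finite element matrices split across CPUs and GPUs, which this small example does not attempt to reproduce:

```python
import numpy as np

def jacobi_sweep(A, x, b, omega=2.0 / 3.0):
    """One damped Jacobi smoothing sweep: x <- x + omega * D^-1 (b - A x)."""
    return x + omega * (b - A @ x) / np.diag(A)

def sor_sweep(A, x, b, omega=1.5):
    """One successive over-relaxation sweep (Gauss-Seidel when omega = 1)."""
    x = x.copy()
    for i in range(len(b)):
        sigma = A[i] @ x - A[i, i] * x[i]          # off-diagonal contribution
        x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
    return x

# Tiny 1D Poisson problem as a test case.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)
for _ in range(10):
    x = sor_sweep(A, x, b)
print(np.linalg.norm(b - A @ x))   # residual norm after ten smoothing sweeps
```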
System performance variability is a significant challenge to the scalability of tightly coupled iterative applications. Asynchronous variants perform better, but an imbalance in progress can result in slower convergence or even failure to converge, as old data is used for updates. In shared memory, this can be countered using progressive load balancing (PLB). We present a distributed-memory extension of PLB (DPLB) by running PLB within nodes and adding a balancing layer between nodes. We demonstrate that this method is able to mitigate system performance variation, reducing global progress imbalance by 1.08x–4.05x and time-to-solution variability by 1.11x–2.89x. In addition, the method scales to 100 nodes without significant overhead.
The sparse grid combination technique can be used to mitigate the curse of dimensionality and to gain insight into the physics of hot fusion plasmas with the gyrokinetic code GENE. With the sparse grid combination technique, massively parallel simulations can be performed at target resolutions that would be prohibitively large for standard full-grid simulations. This is achieved by numerically decoupling the target simulation into several smaller ones. Their time-dependent evolution requires load balancing to obtain near-optimal scaling beyond the scaling capabilities of GENE itself, which in turn requires good estimates of the runtimes.
This paper revisits this topic for large-scale nonlinear global simulations and investigates common machine learning techniques, such as support vector regression and neural networks. It is shown that, provided enough data can be collected, load modeling by data-driven techniques can outperform expert knowledge-based fits – the current state-of-the-art approach.
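The data-driven load model can be sketched with scikit-learn's support vector regression; the features, hyperparameters, and runtimes below are made-up placeholders rather than the paper's configuration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Each row: resolution parameters of one component grid (illustrative values).
X = np.array([[64, 32, 16, 8, 4],
              [32, 64, 16, 8, 4],
              [64, 64, 32, 8, 4],
              [128, 32, 16, 8, 4]], dtype=float)
# Measured runtimes in seconds for those grids (made-up numbers).
y = np.array([12.1, 11.8, 25.4, 24.9])

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(np.log(X), y)     # log-scaled resolutions as features (an assumption)

# Predicted runtime of an unseen component grid, to be used for load balancing.
print(model.predict(np.log([[64.0, 32.0, 32.0, 8.0, 4.0]])))
```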
In this work, we combine several previous efforts to simulate large-scale soot particle agglomeration with a dynamic, multi-scale turbulent background flow field. We build upon previous simulations comprising 3.2 million particles, upon the implementation of load balancing in the simulation software used, and upon tests of the load-balancing mechanisms on this scenario. We increase the simulation to 109.85 million particles, superpose a dynamically changing multi-scale background flow field, and use our software enhancements to the molecular dynamics software ESPResSo to simulate this on a Cray XC40 supercomputer. To verify that our setup reproduces the essential physics, we scale down the influence of the flow field to make the scenario mostly homogeneous at the subdomain scale. Finally, we show that even for the homogeneous version of this soot particle agglomeration simulation, load balancing still pays off.
A roadmap for autotuning task-based numerical libraries is presented. Carefully chosen experiments are carried out when the numerical library is being installed to assess its performance. Real and simulated executions are considered to optimize the routine. The discussion is illustrated with a task-based tile Cholesky factorization, and the aim is to find the optimum tile size for any problem size, using the Chameleon numerical linear algebra package on top of the StarPU runtime system and also with the SimGrid simulator. The study shows that combining a smart exploration strategy of the search space with both real and simulated executions results in a fast, reliable autotuning process.
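The installation-time experiments can be pictured as a plain search over candidate tile sizes: run the routine for each candidate, record the time, and keep the fastest. The blocked matrix multiply below is only a stand-in kernel for illustration; it is not the tile Cholesky of Chameleon/StarPU, and no simulated execution is included:

```python
import time
import numpy as np

def tiled_matmul(A, B, tile):
    """Blocked matrix multiply, used here as a stand-in for a tiled kernel."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile])
    return C

def autotune_tile_size(candidates, n=1024, reps=3):
    """Time each candidate tile size and return the fastest, with all timings."""
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    timings = {}
    for tile in candidates:
        start = time.perf_counter()
        for _ in range(reps):
            tiled_matmul(A, B, tile)
        timings[tile] = (time.perf_counter() - start) / reps
    return min(timings, key=timings.get), timings

best, timings = autotune_tile_size([64, 128, 256, 512])
print(best, timings)
```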
This work introduces a new idea for batched 3D-FFT, together with a survey of data decomposition methods and a review of state-of-the-art high-performance parallel FFT libraries. It is further argued that the common usage pattern of computing multiple FFTs maps naturally onto batched execution. The batched 3D-FFT kernel, executed on the K computer, shows a 45.9% speedup when N and P are 2048³ and 128, respectively. The batched FFT allows the developer to take advantage of a flexible internal data layout and scheduling to improve overall performance.
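The batched execution idea, a single call that transforms many 3D arrays sharing one memory layout, can be illustrated with NumPy's FFT interface; the sizes are small stand-ins and this is unrelated to the K computer kernel itself:

```python
import numpy as np

batch, n = 8, 64     # eight independent 64^3 transforms (illustrative sizes)
data = np.random.rand(batch, n, n, n) + 1j * np.random.rand(batch, n, n, n)

# Batched 3D FFT: one call transforms the last three axes of every array
# in the batch and leaves the batch axis untouched.
spectra = np.fft.fftn(data, axes=(1, 2, 3))
print(spectra.shape)   # (8, 64, 64, 64)
```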
The parallel induction algorithm discussed in this paper finds a minimal nondeterministic finite automaton (NFA) consistent with a given sample. The sample consists of examples and counterexamples, i.e., words that are accepted and rejected by the automaton, respectively. The algorithm transforms the problem into a family of constraint satisfaction problems solved in parallel. Only the first solution is sought, which means that upon finding a consistent automaton, the remaining processes terminate their execution. We analyze the parallel algorithm in terms of the achieved speedups and, in particular, discuss the reasons for the observed superlinear speedups. The analysis includes experiments conducted for samples defined over alphabets of different sizes.
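The "first solution wins" control flow can be sketched with Python's concurrent.futures; solve_csp below is a hypothetical placeholder for solving one constraint satisfaction problem of the family, not the solver used in the paper:

```python
import concurrent.futures as cf
import random
import time

def solve_csp(instance_id):
    """Hypothetical stand-in for solving one CSP instance of the family.

    Returns a consistent automaton description, or None if the instance
    turns out to have no solution.
    """
    time.sleep(random.uniform(0.1, 1.0))   # pretend to search
    return f"consistent NFA found by instance {instance_id}"

def first_solution(n_instances=8, workers=4):
    """Solve the family in parallel and accept the first solution found;
    instances that have not yet started are cancelled at that point."""
    with cf.ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(solve_csp, i) for i in range(n_instances)]
        for fut in cf.as_completed(futures):
            result = fut.result()
            if result is not None:
                for other in futures:
                    other.cancel()   # only affects not-yet-started instances
                return result
    return None

if __name__ == "__main__":
    print(first_solution())
```

In a full implementation, already-running solver processes would also have to be terminated once a solution is found, as described in the abstract; ProcessPoolExecutor alone does not provide that.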