# Ebook: Parallel Computing is Everywhere

**Description**

The most powerful computers work by harnessing the combined computational power of millions of processors, but exploiting the full potential of such large-scale systems becomes more difficult with each succeeding generation of parallel computers. Alternative architectures and computing paradigms are increasingly being investigated in an attempt to address these difficulties. Added to this, the pervasive presence of heterogeneous and parallel devices in consumer products such as mobile phones, tablets, personal computers and servers also demands efficient programming environments and applications aimed at small-scale parallel systems as opposed to large-scale supercomputers.

This book presents a selection of papers presented at the conference Parallel Computing (ParCo2017), held in Bologna, Italy, from 12 to 15 September 2017. The conference included contributions about alternative approaches to achieving High Performance Computing (HPC) to potentially surpass exa- and zettascale performance, as well as papers on the application of quantum computers and FPGA processors. These developments are aimed at making available systems more capable of solving computationally intensive scientific/engineering problems such as climate models, security applications and classic NP problems, some of which cannot currently be managed by even the most powerful supercomputers available. New areas of application, such as robotics, AI and learning systems, data science, the Internet of Things (IoT), and in-car systems and autonomous vehicles were also covered.

As always, ParCo2017 attracted a large number of notable contributions covering present and future developments in parallel computing, and the book will be of interest to all those working in the field.

**Preface**

The most powerful computers available and planned for the near future harness the combined compute power of millions of processors. In order to utilise the potential of such large-scale parallel systems, major efforts in algorithm design and software development are required. With each new generation of parallel computers, combining ever more processors in one system, the development of software that can efficiently and effectively exploit the full potential of such massive systems becomes more difficult. Alternative architectures and compute paradigms are thus increasingly being investigated in attempts to alleviate these difficulties.

The pervasive presence of heterogeneous and parallel devices in consumer products such as mobile phones, tablets, personal computers and servers also demands efficient programming environments and applications targeting small-scale parallel systems, in contrast to large-scale supercomputers.

In response to such demands, the Parallel Computing (ParCo2017) conference held at Bologna, Italy in September 2017 also included discussions on alternative approaches to achieve High Performance Computing (HPC) capabilities that could potentially surpass exa- and zettascale performance. Talks on the application of quantum computers and FPGA processors to solve particular compute-intensive problems exemplified future possibilities. These developments are mainly aimed at making available more capable systems for solving compute-intensive scientific/engineering problems, such as climate models, security applications and classic NP problems that currently cannot be managed even with the most powerful supercomputers available.

The fast-expanding and widespread use of parallel computers to solve problems emerging from new application fields is as important for future developments as the expansion of systems to achieve higher processing speeds. New application areas such as robotics, AI and learning systems, data science, the Internet of Things (IoT), in-car systems and autonomous vehicles were discussed. Such applications often do not require extreme processing speeds, but rather a high degree of heterogeneous parallelism. These pose particular challenges for the software engineering aspects of parallel software, in particular efficiency, reliability, quality assurance and maintainability. Often, these very same systems also pose extreme challenges in terms of power/performance trade-offs, mainly related to the limited amount of power available from batteries and/or to problems related to heat dissipation.

A further aspect is that, for example, data science, IoT and large-scale scientific/engineering applications are highly dependent on high-speed, broadband communication to transfer huge quantities of data. High-throughput systems combined with high-performance capabilities are thus increasingly required in practical situations.

As was the case with all previous events in the Parallel Computing series of conferences, ParCo2017 attracted a large collection of notable contributions that depict present and future developments in the parallel computing field. During this event the various trends and research areas mentioned above were discussed, either in keynotes, contributed papers or specialised symposia.

This volume, however, represents a selection of the papers presented at the conference. Not all contributors could submit their contributions in time, and some contributions could not be accepted. As a result, some papers covering the areas mentioned above are not included in these proceedings.

The organisers wish to thank all organisations and individuals who contributed to the success of this event. A particular word of thanks is due to Paola Alberigo, Silvia Monfardini and Claudia Truini for their indispensable role in organising the conference.

Sanzio Bassini

Marco Danelutto

Patrizio Dazzi

Gerhard Joubert

Frans Peters

16 November 2017

**Abstract**

Smart systems and the smart world concept are addressed in the framework of the fourth industrial revolution. New challenges in distributed autonomous robots and computing are considered. A new kind of smart, reconfigurable distributed modular robot system is illustrated, and a prototype is presented together with the associated distributed algorithm.

**Abstract**

We solve linear systems of equations arising from the discretization of realistic stress analysis problems by Krylov subspace (KS) methods such as the Conjugate Gradient (CG) method and the AZMJ variant of ORTHOMIN(2). It is important to use effective preconditioners with KS methods to solve the linear equations rapidly and reliably. However, some preconditioners often do not work well, and the speed-up obtained by parallel computing deteriorates when preconditioning is used. We therefore apply CG and AZMJ with an Eisenstat-type Symmetric SOR preconditioner (abbreviated as E-SSOR) to the realistic stress analysis problems in parallel, and examine the effectiveness of E-SSOR and the parallel performance. The numerical experiments demonstrate that AZMJ and CG using E-SSOR converge rapidly and reliably on the stress analysis problems, and that good speed-up can be obtained with E-SSOR on a parallel computer. The convergence behavior of AZMJ is superior to that of CG, and the parallel performance of AZMJ, which has fewer synchronization points than CG, is higher under hybrid parallelization.
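The abstract above hinges on where the preconditioner enters the CG iteration. The sketch below is a minimal pure-Python preconditioned CG; for simplicity it substitutes a plain Jacobi (diagonal) preconditioner for the E-SSOR preconditioner used in the paper, purely to show the step where M⁻¹ is applied.

```python
# Minimal preconditioned Conjugate Gradient sketch (pure Python).
# NOTE: the paper uses an Eisenstat-type SSOR preconditioner; here a
# simple Jacobi (diagonal) preconditioner stands in to mark where the
# preconditioner solve z = M^{-1} r enters the iteration.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                        # r = b - A x, with x = 0
    z = [ri / A[i][i] for i, ri in enumerate(r)]    # z = M^{-1} r (Jacobi)
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [ri / A[i][i] for i, ri in enumerate(r)]  # preconditioner solve
        rz_new = dot(r, z)
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x

# Small SPD test system
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

The two inner products per iteration (`dot(p, Ap)` and `dot(r, z)`) are the synchronization points the abstract refers to; variants such as AZMJ rearrange the recurrences to reduce them.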

**Abstract**

Exact diagonalization is the most accurate approach for solving the Hubbard model. The approach computes the ground state of the Hamiltonian derived exactly from the model. Since the Hamiltonian is a large sparse symmetric matrix, an iterative method is usually used. It has been reported that the LOBPCG method with a shift-and-invert preconditioner using an approximate eigenvalue can solve the problem efficiently. However, the preconditioner is not effective for Hamiltonians whose off-diagonal elements are predominant. In this research, we apply a preconditioner based on the Neumann expansion and propose a communication-avoiding strategy that takes the physical properties of the Hubbard model into consideration. We show that the preconditioner improves the convergence and decreases the elapsed time for Hamiltonians with predominant off-diagonal elements.
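The key idea in the abstract above is to apply a shifted inverse approximately via a truncated Neumann series rather than an exact solve. The toy below (not the paper's implementation; matrix sizes and the split M = D − R are illustrative) shows the series (D − R)⁻¹ = Σₖ (D⁻¹R)ᵏ D⁻¹ applied to a vector, which converges when the off-diagonal part is not dominant:

```python
# Truncated Neumann-series application of an approximate inverse.
# For M = D - R with D the diagonal part, (D - R)^{-1} v is approximated
# by summing terms of (D^{-1} R)^k D^{-1} v; each extra term costs one
# matrix-vector product, so no factorization is needed.

def neumann_apply(D, R, v, terms=8):
    n = len(v)
    term = [vi / D[i] for i, vi in enumerate(v)]   # D^{-1} v
    z = term[:]
    for _ in range(terms - 1):
        Rt = [sum(R[i][j] * term[j] for j in range(n)) for i in range(n)]
        term = [Rt[i] / D[i] for i in range(n)]    # next series term
        z = [z[i] + term[i] for i in range(n)]
    return z

# Diagonally dominant example: A = D - R = [[4, -1], [-1, 5]]
D = [4.0, 5.0]
R = [[0.0, 1.0], [1.0, 0.0]]
z = neumann_apply(D, R, [1.0, 1.0])   # approximates A^{-1} [1, 1]
```

When off-diagonal elements dominate, the spectral radius of D⁻¹R exceeds one and the series diverges, which is exactly the regime the paper addresses.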

**Abstract**

Molecular Dynamics (MD) codes predict the fundamental properties of matter by following the trajectories of a collection of interacting model particles. To exploit diverse modern manycore hardware, efficient codes must use all available parallelism. At the same time they need to be portable and easily extendible by the domain specialist (physicist/chemist) without detailed knowledge of this hardware. To address this challenge, we recently described a new Domain Specific Language (DSL) for the development of performance portable MD codes based on a “Separation of Concerns”: a Python framework automatically generates efficient parallel code for a range of target architectures.

Electrostatic interactions between charged particles are important in many physical systems and often dominate the runtime. Here we discuss the inclusion of long-range interaction algorithms in our code generation framework. These algorithms require global communications and careful consideration has to be given to any impact on parallel scalability. We implemented an Ewald summation algorithm for electrostatic forces, present scaling comparisons for different system sizes and compare to the performance of existing codes. We also report on further performance optimisations delivered with OpenMP shared memory parallelism.

**Abstract**

Multiplication of two sparse matrices is a key operation in the simulation of the electronic structure of systems containing thousands of atoms and electrons. The highly optimized sparse linear algebra library DBCSR (Distributed Block Compressed Sparse Row) has been specifically designed to efficiently perform such sparse matrix-matrix multiplications. This library is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. It is parallelized using MPI and OpenMP, and can exploit GPU accelerators by means of CUDA. We describe a performance comparison of DBCSR on systems with Intel Xeon Phi Knights Landing (KNL) processors, with respect to systems with Intel Xeon CPUs (including systems with GPUs).
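As a rough illustration of the blocked-sparse format the abstract describes, the toy below multiplies two matrices stored as dictionaries of small dense blocks (a deliberately simplified stand-in for DBCSR's distributed block-CSR layout, not its actual data structure):

```python
# Toy block-sparse matrix-matrix multiply in the spirit of a blocked
# CSR format: each operand is a dict mapping (block_row, block_col) to
# a dense bs-by-bs block (list of lists). Only blocks whose inner
# indices match contribute, so absent (zero) blocks cost nothing.

def block_matmul(A_blocks, B_blocks, bs):
    C = {}
    for (i, k), Ab in A_blocks.items():
        for (k2, j), Bb in B_blocks.items():
            if k != k2:
                continue
            Cb = C.setdefault((i, j), [[0.0] * bs for _ in range(bs)])
            for r in range(bs):
                for c in range(bs):
                    Cb[r][c] += sum(Ab[r][t] * Bb[t][c] for t in range(bs))
    return C

# 2x2 blocks: A has blocks (0,0) and (0,1); B has only block (1,0)
A_blocks = {(0, 0): [[1.0, 0.0], [0.0, 1.0]],
            (0, 1): [[2.0, 0.0], [0.0, 2.0]]}
B_blocks = {(1, 0): [[1.0, 1.0], [1.0, 1.0]]}
C = block_matmul(A_blocks, B_blocks, bs=2)
```

The dense inner kernel on each block pair is where libraries like DBCSR spend their time, and is what gets dispatched to CPUs, KNL or GPUs.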

We find that DBCSR on Cray XC40 KNL-based systems is 11%-14% slower than on a hybrid Cray XC50 with Nvidia P100 cards at the same number of nodes. When compared to a Cray XC40 system equipped with dual-socket Intel Xeon CPUs, the KNL is up to 24% faster.

**Abstract**

Three different Intel-based HPC systems are used to benchmark an application of the LifeV library for running simulations of patient-specific cardiovascular hemodynamics. The targeted Intel architectures rely on the Haswell-Broadwell family of processors. Running times and scalability measures are collected for two real-size experiments. A third, small-size test case is used to profile the code, exposing the effect of compiler vectorization, MPI efficiency and memory footprint. Profiling showed an unexpectedly low degree of floating point functional unit usage and a low percentage of effective vectorization. Extensive code redesign is likely necessary to best exploit the architectural features available in Intel Knights Landing processors.

**Abstract**

Memetic Algorithms represent one of the most promising implementations of Evolutionary Algorithms. Their strength resides in their ability to exploit stochastic and deterministic optimization methods at the same time. A Memetic Algorithm, called Memetic Phase Retrieval, has been applied to the phase retrieval problem in the field of Coherent Diffraction Imaging; it represents a significant improvement in the imaging of matter via coherent diffraction experiments. Memetic Phase Retrieval requires the latest High Performance Computing resources, due to the high dimensionality of the problem and the involvement of the Fourier Transform: an efficient parallel implementation, able to fully exploit multi-core and multi-node hardware, is needed. The implementation of Memetic Phase Retrieval, which exploits the hybrid OpenMP/MPI parallel programming paradigm along with MPI Remote Memory Access communications, is presented. Its scaling performance on recent Intel Knights Landing hardware is shown.

**Abstract**

Pipelined Krylov subspace methods achieve significantly improved parallel scalability compared to standard Krylov methods on present-day HPC hardware. However, this typically comes at the cost of a reduced maximal attainable accuracy. This paper presents and compares several stabilized versions of the communication-hiding pipelined Conjugate Gradients method. The main novel contribution of this work is the reformulation of the multi-term recurrence pipelined CG algorithm by introducing shifts in the recursions for specific auxiliary variables. These shifts reduce the amplification of local rounding errors on the residual. Given a proper choice for the shift parameter, the amplification of local rounding errors is reduced and the resulting algorithm improves the attainable accuracy. Numerical results on a variety of SPD benchmark problems compare different stabilization techniques for the pipelined CG algorithm, showing that the stabilized algorithms are able to attain a high accuracy while displaying excellent parallel performance.

**Abstract**

Coarrays have been part of the Fortran standard since Fortran 2008 and provide a syntactic extension of Fortran to support parallel programming, often called Coarray Fortran (CAF). Although MPI is the de facto standard for parallel programs running on distributed memory systems and little scientific software is written in CAF, many scientific applications could benefit from the use of CAF. We present the migration from MPI to CAF of the libraries PSBLAS and MLD2P4 for the solution of large systems of equations using iterative methods and preconditioners. In this paper, we describe some investigations for implementing the necessary communication steps in PSBLAS and MLD2P4 and provide performance results obtained on linear systems arising from discretization of 2D and 3D PDEs.

**Abstract**

This work introduces a design proposal for the modernization of a high performance numerical library, enabling various modes of parallel execution such as offloading, many-core concurrent execution, and heterogeneous hybrid execution. A prototype implementation, Eigen-G2, employs a parallel task model supported by OpenMP and exclusive control of offloading to a GPU device. In addition, multiple data formats are available by taking advantage of the metaprogramming support of the C++ language. Eigen-G2 exhibits good parallel performance scalability when both multicore CPUs and a GPU are used at the same time.

**Abstract**

The computation of a number of the smallest eigenvalues of large, sparse matrices is crucial in various scientific applications, such as the Finite Element solution of PDEs, electronic structure calculations or graph Laplacians, to mention a few. In this contribution we propose a parallel algorithm based on the spectral low-rank modification of a factorized sparse inverse preconditioner (RFSAI) to accelerate Newton-based iterative eigensolvers. Numerical results on matrices arising from various realistic problems, with sizes up to 5 million unknowns and 2.2×10^{8} nonzero elements, demonstrate the efficiency and scalability of the proposed RFSAI-updated preconditioner.

**Abstract**

In computer-based numerical simulations, some methods used to determine the electronic and optical properties of semiconductor nanostructures require computing the energies that correspond to interior eigenvalues of a Hamiltonian matrix. We study the case in which the Schrödinger equation is expanded into a matrix that has a block-tridiagonal structure. Additionally, the matrix can have two extra non-zero blocks in the corners due to periodic boundary conditions. Given that not the whole eigenspectrum is required, we choose projection methods to compute the necessary set of eigenvalues. The shift-and-invert Lanczos method requires solving a linear system at each iteration. We have developed a parallel code that improves the scalability of this step by exploiting the block structure. Results show that, for these specific cases, this method offers better scalability than a general-purpose solver such as MUMPS.
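The shift-and-invert idea mentioned above can be sketched compactly: each iteration solves a system with A − σI, which makes eigenvalues near the shift σ dominant. The sketch below uses shift-and-invert *power* iteration rather than full Lanczos, and a naive dense Gaussian elimination where the paper uses a block-structure-exploiting parallel solver; both simplifications are purely illustrative.

```python
# Shift-and-invert inverse iteration sketch. The inner solve with
# (A - sigma*I) is the step whose scalability the paper improves by
# exploiting block-tridiagonal structure; here a naive dense Gaussian
# elimination with partial pivoting stands in for it.

def solve(A, b):
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def shift_invert_eig(A, sigma, iters=50):
    n = len(A)
    As = [[A[i][j] - (sigma if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        v = solve(As, v)                       # one inner solve per step
        nrm = sum(vi * vi for vi in v) ** 0.5
        v = [vi / nrm for vi in v]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * Av[i] for i in range(n))  # Rayleigh quotient

# Interior eigenvalue of diag(1, 5, 10) nearest the shift 4.5 is 5
A = [[1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 10.0]]
lam = shift_invert_eig(A, sigma=4.5)
```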

**Abstract**

Solving triangular systems with band structure is one of the main problems in computational linear algebra. This study describes a new approach to improving the parallel performance of this solution step. The experiments are performed on a multicore machine using the OpenMP interface. The analysis and evaluation indicate that the proposed method effectively reduces the execution time compared with related work.
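For readers unfamiliar with the baseline operation, the sequential kernel being parallelized is forward substitution over a banded lower-triangular matrix. The sketch below uses a hypothetical compact band layout (each row stores only its nonzeros up to the diagonal), chosen for illustration; the paper's storage scheme and parallel decomposition are not reproduced here.

```python
# Forward substitution for a lower-triangular banded system L x = b.
# Band storage (illustrative): bands[i] lists the nonzeros of row i in
# columns max(0, i - bw) .. i, so the diagonal entry is bands[i][-1].

def banded_forward_solve(bands, b, bw):
    x = [0.0] * len(b)
    for i in range(len(b)):
        lo = max(0, i - bw)                              # first nonzero column
        s = sum(bands[i][j - lo] * x[j] for j in range(lo, i))
        x[i] = (b[i] - s) / bands[i][i - lo]             # divide by diagonal
    return x

# L = [[2,0,0],[1,2,0],[0,1,2]] with bandwidth 1, b = [2, 4, 5]
x = banded_forward_solve([[2.0], [1.0, 2.0], [1.0, 2.0]],
                         [2.0, 4.0, 5.0], bw=1)
```

Each x[i] depends on at most `bw` earlier entries, which is the dependence structure parallel schemes for banded triangular solves must work around.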

**Abstract**

An original algorithm was developed for ray tracing across unstructured 3D grids. The resulting set of rays provides the basis for the grid-characteristic computation of radiative energy transfer in high-temperature plasma simulations using MPI parallelization and grid decomposition. The algorithm is implemented as a C++ code and incorporated into a 3D magnetohydrodynamic (MHD) Eulerian code. Exact rational calculations using integers of unlimited range are applied for the intersection of rays with grid elements. Tracing of different rays within a single MPI process is carried out in parallel using OpenMP threads. The acceleration and scalability of the implemented algorithms were investigated, including a comparison with other solvers within the MHD code. The developed algorithm enables accounting for the anisotropy of the radiation field in complex multiscale 3D MHD simulations. Proposed applications include: 1) postprocessing the results of 3D simulations (numerical pinhole imaging, estimation of the radiation output in a given direction); 2) modeling the propagation of laser radiation; 3) calculating the thermal radiation field on the basis of the quasi-diffusion model (improving the Eddington factor for the diffusion approximation).

**Abstract**

In a superconducting magnet, the null resistivity of the cable allows a very intense electrical current to flow without dissipating heat. During operation, a region of the superconducting wire can turn to the normal state (quench), with possible damage due to intense heat dissipation. Quench simulation is thus a necessary step during magnet design; it is a multi-physics computational problem that can easily involve tens of millions of equations in as many unknowns. In this paper we attempt an HPC approach to quench simulation.

**Abstract**

Calibration of individual-based models (IBMs), which are successful in modeling complex ecological dynamical systems, is often performed only ad hoc. Bayesian inference can be used for both parameter estimation and uncertainty quantification, but its successful application to realistic scenarios has been hindered by the complex stochastic nature of IBMs. Computationally expensive techniques such as the Particle Filter (PF) provide marginal likelihood estimates, where multiple model simulations (particles) are required to obtain a sample from the state distribution conditional on the observed data. Particle ensembles are re-sampled at each data observation time, requiring particle destruction and replication, which leads to an increase in algorithmic complexity. We present SPUX, a Python implementation of a parallel Particle Markov Chain Monte Carlo (PMCMC) algorithm, which mitigates high computational costs by distributing particles over multiple computational units. Adaptive load re-balancing techniques are used to mitigate computational work imbalances introduced by re-sampling. The framework's performance is investigated, and significant speed-ups are observed for a simple predator-prey IBM.
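The resampling step described above, which destroys low-weight particles and replicates high-weight ones, can be sketched with the standard systematic resampling scheme (a common choice in particle filters generally; whether SPUX uses this exact scheme is not stated in the abstract):

```python
# Systematic resampling: one uniform draw u0 in [0, 1/n) generates n
# evenly spaced points u0 + i/n, which are walked along the cumulative
# normalized weights to pick the indices of particles to replicate.

def systematic_resample(weights, u0):
    n = len(weights)
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    indices, j = [], 0
    for i in range(n):
        u = u0 + i / n
        while cdf[j] < u:          # advance to the interval containing u
            j += 1
        indices.append(j)
    return indices

# Particle 2 carries 80% of the weight, so it is replicated twice;
# in a real filter u0 would be drawn as random.uniform(0, 1/n).
idx = systematic_resample([0.1, 0.1, 0.8], u0=0.05)
```

The returned index list is exactly the replication pattern that creates the work imbalance the abstract's load re-balancing addresses: ranks holding replicated particles get more work than ranks whose particles were destroyed.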

**Abstract**

Multiblock structured grids are widely used in numerical simulations with complex geometries: they provide geometric flexibility while retaining a low computational cost. However, high performance computation of large-scale realistic applications using multiblock structured grids on distributed memory systems faces several communication challenges. Specifying the connectivity among a large number of blocks of structured grids can be a challenge in large-scale simulations, as can devising a proper communication schedule among a large number of subdomains. To address these communication challenges, which are widespread in multiblock structured grid parallel computing, a parallel module for multiblock structured grid computation has been designed and implemented in the JASMIN infrastructure. The module consists of a block-relationship description algorithm and a unified communication schedule. By encapsulating parallel computing strategies such as distributed storage and data communication, and by providing standard interfaces, the module helps users conveniently implement parallel computations on multiblock structured grids. Our test results show that applications based on this module run efficiently on thousands of processor cores, demonstrating the module's satisfactory parallel performance.

**Abstract**

Magnetohydrodynamic (MHD) simulation is often applied to study the global dynamics and configuration of a planetary magnetosphere for space weather. In this paper, the computational performance of an MHD code is evaluated on 128 Xeon Phi KNL nodes of a Cray XC40. The results show that 2D and 3D domain decompositions with an SoA (structure of arrays) layout achieve effective performance, while AoS (array of structures) layouts and hybrid parallel computation perform poorly. After adding performance optimizations for the Xeon Phi to our MHD simulation code, we obtained a 2.4% increase in overall execution efficiency and achieved a 3 TFlops performance gain using 128 nodes.

**Abstract**

Since the beginning of distributed computing, overlapping communication with computation has been an attractive technique for masking high communication costs. Although easy for a human to detect, communication/computation overlapping requires knowledge of architectural and network details in order to be performed effectively.

When low-level details influence performance and productivity, compilers and run-time libraries play the critical role of translating high-level statements understandable by humans into efficient commands suitable for machines.

With the advent of PGAS languages, parallelism becomes part of the programming language and communication can be expressed with simple variable assignments. As with serial programs, PGAS compilers should be able to optimize all aspects of the language, including communication, but unfortunately this is not yet the case.

In this work we consider parallel scientific programs written in Coarray Fortran and focus on how to build a PGAS compiler capable of optimizing communication, in particular by automatically exploiting opportunities for communication/computation overlapping. We also sketch an extension to the Fortran language that allows one to express the relation between data and synchronization events, and we show how this relation can be used by the compiler to perform additional communication optimizations.

**Abstract**

Reconfiguration of parallel applications has gained traction with the increasing emphasis on the energy/performance trade-off. The ability to dynamically change the amount of resources used by an application allows reaction to changes in the environment, in the application's behavior or in the user's requirements. A popular technique consists of changing the number of threads used by the application (Dynamic Concurrency Throttling). Although this provides good control of application performance and power consumption, managing the technique can impose a significant burden on the application programmer, mainly due to state management and redistribution following the addition or removal of a thread. Nevertheless, some common state access patterns have been identified in popular applications. Leveraging this knowledge, we describe how the state management procedures following a Concurrency Throttling operation can be simplified.
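The state redistribution burden described above can be made concrete with a toy: when the worker count changes, a partitioned state must be gathered and re-split. The function names and the block-partitioned access pattern below are illustrative only, not the paper's API; real implementations redistribute in place rather than gathering.

```python
# Toy state redistribution after a Dynamic Concurrency Throttling step:
# a block-partitioned state array is re-split when the number of worker
# threads changes (one of the common state access patterns mentioned).

def partition(state, n_workers):
    # split `state` into n_workers contiguous chunks, sizes differing
    # by at most one element
    q, r = divmod(len(state), n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        size = q + (1 if w < r else 0)
        parts.append(state[start:start + size])
        start += size
    return parts

def rethrottle(parts, new_n):
    # gather the distributed state, then repartition for new_n workers
    flat = [item for p in parts for item in p]
    return partition(flat, new_n)

parts4 = partition(list(range(10)), 4)   # 4 workers
parts2 = rethrottle(parts4, 2)           # throttle down to 2 workers
```

Even in this trivial block-partitioned case, a throttling operation forces every element to be reassigned an owner; this per-pattern bookkeeping is what the proposed simplification automates.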