Christian Neugebauer, Rudolf Berrendorf, Florian Mannuss
445 - 454
For a parallel Sparse Matrix Vector Multiply (SpMV) on a multiprocessor, rather simple and efficient work distributions often produce good results. In cases where this is not true, adaptive load balancing can improve the balance and performance. This paper introduces a low-overhead framework for adaptive load balancing of parallel SpMV operations. It uses statistical filters to gather relevant runtime performance data and to detect imbalance situations. Three different algorithms that adaptively balance the load with high quality and low overhead were compared. Results show that for sparse matrices where adaptive load balancing was enabled, an average speedup of 1.15 (with respect to total execution time) could be achieved with our best algorithm over four different matrix formats and two different NUMA systems.
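The balancing algorithms and matrix formats are specific to the paper above; as a minimal illustration of the operation being balanced, the following sketch shows a CSR sparse matrix-vector multiply where the comment marks the row partitioning that static schemes fix and adaptive schemes redistribute at runtime (all names are illustrative, not the paper's code).

```c
#include <assert.h>
#include <stddef.h>

/* CSR sparse matrix-vector multiply, y = A*x.
 * row_ptr[i] .. row_ptr[i+1] indexes the nonzeros of row i.
 * A static split of the outer loop gives each thread an equal
 * number of rows, but rows with many nonzeros create exactly the
 * imbalance that adaptive load balancing redistributes at runtime. */
void spmv_csr(size_t n_rows, const size_t *row_ptr,
              const size_t *col_idx, const double *val,
              const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {   /* partitioned across threads */
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```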
Steffen Hirschmann, Malte Brunn, Michael Lahnert, Colin W. Glass, Miriam Mehl, Dirk Pflüger
455 - 464
For short-range molecular dynamics (MD) simulations with heterogeneous particle distributions that dynamically change over time, highly flexible and dynamic load balancing methods are mandatory to achieve good parallel scalability. Designing and implementing load balancing algorithms is complex, especially for existing applications which were not designed to support arbitrary domain decompositions. In this paper, we present our approach to incorporate general domain decompositions and dynamic re-balancing into the existing MD software package ESPResSo. We describe the relevant interfaces and abstractions which enable us to reuse the physics algorithms in ESPResSo without major re-implementations. As proof of concept, we show the implementation of a domain decomposition based on space-filling curves and a dynamic re-balancing mechanism using an enhanced version of the p4est library. The results indicate that our load balancing mechanism is capable of reducing the imbalance amongst processes and the total runtime of simulations in simple and complex scenarios. At the same time, the implementation of models and solvers in ESPResSo remains largely unchanged.
Thomas Gonçalves, Marc Pérache, Frédéric Desprez, Jean-François Méhaut
465 - 474
High energy physics scientists rely strongly on numerical simulations and High Performance Computing. In the field of numerical simulation, particle transport is known to be a hard problem, especially when Monte Carlo solvers are used. These applications are difficult to parallelize and optimize due to the non-determinism of the Monte Carlo approach. These simulations have to manage the transport of several billions of particles. Application developers also have to deal with memory consumption in order to fit into the memory available per computing node, and they have to balance the workload among cluster nodes.
Our contribution consists mainly of a new algorithm dedicated to Monte Carlo particle transport that addresses the issues of memory consumption and load balancing. The computing load is modeled with a weighted graph that abstracts the objects of the simulated physical system. This load balancing algorithm is studied on a benchmark set, and the performance results using 4096 CPU cores are very promising. Compared to the fully replicated, perfectly balanced approach, our algorithm provides a significant memory footprint reduction while keeping the imbalance between computing nodes below 15%, and finally obtains better parallel efficiency.
Dai Yang, Josef Weidendorfer, Carsten Trinitis, Tilman Küstner, Sibylle Ziegler
475 - 484
Exascale computing is the next major milestone for the HPC community. Due to a steadily increasing probability of failures, current applications must be made malleable to cope with dynamic resource changes. In this paper, we show first results with LAIK, a lightweight library for dynamically re-distributable application data. This allows compute nodes to be freed from workload before a predicted failure. For a real-world application, we show that LAIK adds negligible overhead. In addition, we show the effect of different re-distribution strategies.
Samantha V. Adams, Olga Abramkina, Yann Meurdesoif, Mike Rezny
485 - 494
This paper describes an investigation into parallel IO using the XIOS framework with the Met Office LFRic Infrastructure. LFRic is currently in development as the replacement for the Met Office Unified Model, to enable weather and climate modelling on HPC platforms of the next decade. This will involve running models on hundreds of thousands of cores; a key aspect of the LFRic project is therefore scalability. At this scale, IO is likely to be problematic, so LFRic has been exploring the XIOS parallel IO framework as a solution. The purpose of our preliminary experiments was twofold: firstly, to assess the general scalability of our benchmark model using XIOS for parallel output; secondly, to explore tuning of this benchmark according to some XIOS and Lustre metrics. We successfully ran jobs on up to 82,944 cores on our Cray XC40 machine and found that XIOS was able to write efficiently to one output file whilst still showing scalability with respect to runtime. An investigation into Lustre file-striping parameters showed that, with appropriate striping, we could use fewer IO servers than originally estimated and still maintain good performance.
For computations where the load is difficult to balance statically, dynamic load balancing is becoming increasingly necessary. In this paper we examine one method of performing dynamic load balancing, known as Task-Based Parallelism.
Many libraries implement Task-Based Parallelism; however, in this paper we examine the OpenMP standard and its implementations, and apply it to the classical Molecular Dynamics code DL_POLY_4, focusing on the two-body force calculations that make up a large percentage of the compute in many simulation runs.
Our results show reasonable performance using OpenMP tasks; however, some of the extensions available in other libraries such as OmpSs or StarPU may help with performance for problems similar to Molecular Dynamics, where avoiding race conditions between tasks can incur substantial scheduling overhead.
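The race-condition trade-off mentioned above can be sketched with a toy task-per-particle force loop (illustrative only, not the DL_POLY_4 code): each task writes only to its own force entry, so no synchronization between tasks is needed, at the cost of evaluating each pair twice. Exploiting Newton's third law would halve the work but reintroduce the write conflicts between tasks that the abstract refers to. The pragmas are harmlessly ignored when compiled without OpenMP.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Task-per-particle pairwise force evaluation (1D toy force).
 * Each task writes only f[i], so tasks are race-free by design;
 * the symmetric "f[j] -= f_ij" update is deliberately omitted. */
void pair_forces(size_t n, const double *x, double *f)
{
    #pragma omp parallel
    #pragma omp single
    for (size_t i = 0; i < n; ++i) {
        #pragma omp task firstprivate(i)
        {
            double fi = 0.0;
            for (size_t j = 0; j < n; ++j) {
                if (j == i) continue;
                double r = x[j] - x[i];
                /* toy inverse-square attraction along the axis */
                fi += (r > 0 ? 1.0 : -1.0) / (r * r);
            }
            f[i] = fi;
        }
    }
}
```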
James S. Willis, Matthieu Schaller, Pedro Gonnet, Richard G. Bower, Peter W. Draper
507 - 516
In particle-based simulations, neighbour finding (i.e. finding pairs of particles that interact within a given range) is the most time consuming part of the computation. One of the best such algorithms, which can be used for both Molecular Dynamics (MD) and Smoothed Particle Hydrodynamics (SPH) simulations, is the pseudo-Verlet list algorithm. This algorithm, however, does not vectorize trivially, which makes it difficult to exploit SIMD-parallel architectures. In this paper, we present several novel modifications as well as a vectorization strategy for the algorithm, which lead to overall speed-ups over the scalar version of the algorithm of 2.24x for the AVX instruction set (SIMD width of 8), 2.43x for AVX2, and 4.07x for AVX-512 (SIMD width of 16).
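A minimal sketch of the sorting idea behind pseudo-Verlet lists may help here (names illustrative, not the paper's code): particles in two neighbouring cells are sorted by their projection onto the axis connecting the cells, and since that 1D axis distance is a lower bound on the true distance, the scan over candidate partners can stop as soon as the projected separation exceeds the interaction range.

```c
#include <assert.h>
#include <stddef.h>

/* Count candidate pairs between two neighbouring cells.
 * d_i and d_j hold the particles' projections onto the cell-pair
 * axis, each sorted in ascending order; h is the interaction range.
 * Because the arrays are sorted and the axis distance bounds the
 * true distance from below, the inner scan terminates early. */
size_t count_candidates(size_t n_i, const double *d_i,
                        size_t n_j, const double *d_j, double h)
{
    size_t count = 0;
    for (size_t i = 0; i < n_i; ++i)
        for (size_t j = 0; j < n_j; ++j) {
            if (d_j[j] - d_i[i] > h)
                break; /* sorted: all remaining j are even farther */
            ++count;
        }
    return count;
}
```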
Modeling thermonuclear supernovae is a premier application for leadership-class supercomputers and requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, for example, the new pressure associated with this temperature increase. In fact, an EOS is needed after thermodynamic conditions are changed by any physics routine. This means it is called many times throughout a simulation, making a fast EOS implementation essential. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using results from the IBM/NVIDIA early test system (Summitdev, a precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we describe a hybrid OpenMP implementation with work offloaded to GPUs. We compare performance results between the two implementations, with a discussion of some of the currently available features of OpenACC and OpenMP 4.5.
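Because each cell's EOS evaluation is independent, the loop maps naturally onto OpenMP 4.5 device offload, as the following sketch shows. The ideal-gas law is a deliberately simple stand-in for the tabulated stellar EOS used in the paper, and all names are illustrative; without an offload-capable compiler the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Per-cell pressure from density and temperature, p = rho * R * T.
 * Each iteration is independent, so the loop can be distributed
 * over GPU teams/threads with OpenMP 4.5 target offload. */
void eos_pressure(size_t n, const double *rho, const double *temp,
                  double r_gas, double *p)
{
    #pragma omp target teams distribute parallel for \
        map(to: rho[0:n], temp[0:n]) map(from: p[0:n])
    for (size_t i = 0; i < n; ++i)
        p[i] = rho[i] * r_gas * temp[i];
}
```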
The paper deals with the OpenMP parallel implementation of a high-order Discontinuous Galerkin solver for computational fluid dynamics (CFD) and computational aeroacoustics (CAA) applications. The use of the shared memory view of the OpenMP paradigm is here explored through three different parallel implementation strategies. The numerical experiments on 2D and 3D test cases, which consider the effects of different platforms, compilers and space discretizations, indicate that all the code versions perform quite satisfactorily. In particular, the OpenMP domain decomposition algorithm reaches the highest level of parallel efficiency at low computational loads, while a colouring approach excels for the largest simulations. The performance gain observed when using a hybrid MPI/OpenMP version of the DG code on large HPC facilities is also demonstrated.
Jack B. Dennis, Lei Huang, Willie Lim, Hsiang-Huang Wu, Yuzhong Yan
539 - 549
This paper discusses how deep neural network simulations can be implemented and executed on a proposed computer system inspired by data flow principles. In particular, we use the cycle-accurate Kiva simulator developed by the MIT CSAIL Fresh Breeze project to simulate the data flow machine used for running the neural networks. An example of a deep neural network and its simulation using Kiva is presented and used to demonstrate the benefits of the tree-of-chunks representation, hardware task scheduling, memory unification, and modular programming. The neural network is written in funJava, a functional subset of Java. Our compiler for funJava programs converts funJava methods to data flow graphs, identifies opportunities for data-parallel implementation, and generates a graph of machine-executable code blocks called codelets, which are executed as tasks by the data flow machine. We report and discuss the behavior and performance of the neural network as the input data size and the number of cores in the data flow machine are varied.
The contributions of the paper are: 1) It is the first attempt to implement the neural network algorithm using the fine-grained data flow execution model; 2) It provides a comprehensive performance study on the scalability of running the neural network on a simulated data flow machine; 3) It demonstrates the expressiveness of funJava for implementing neural network algorithms.
With data volumes doubling every year, data-intensive applications are increasing, as is the demand for high-end resource capacity to analyse the collected data sets. The explosion of analysis applications has become a major driver for revising system architectures and tools, leading to a proliferation of software components and frameworks which may require multi-node and multi-core systems to scale up and provide good performance. In this context, machine learning and deep learning are steadily proving to be successful methods for a variety of use cases, and their popularity has resulted in numerous open-source software tools becoming accessible to the public and popular across different scientific disciplines. But with the growth of applications and tools, it is becoming difficult for researchers to estimate how much resource is needed to run their analyses and to select appropriate software and hardware components. The goal of this paper is to present the results of a preliminary comparative study of state-of-the-art machine and deep learning tools, benchmarked on Cineca HPC systems. The comparison has been done taking into consideration different factors, including the impossibility of benchmarking all tools available on the market, the existence of tools supporting hardware accelerators such as GPUs, and the availability of previous studies [1,2]. Our preliminary results show that the tested tools are able to leverage the underlying system capabilities to achieve significant performance, and that no single tool outperforms all the others, leaving room for further optimisation.
SPar was originally designed to provide high-level abstractions for stream parallelism in C++ programs targeting multi-core systems. This work proposes distributed parallel programming support for SPar targeting cluster environments. The goal is to preserve the original semantics while source-to-source code transformations generate MPI (Message Passing Interface) parallel code. The results of the experiments presented in the paper demonstrate improved programmability without significant performance losses.
Fabian Wrede, Breno Augusto De Melo Menezes, Luis Filipe De Araujo Pessoa, Bernd Hellingrath, Fernando Buarque De Lima Neto, Herbert Kuchen
573 - 582
Swarm Intelligence (SI)-based metaheuristics are frequently used to solve complex optimization problems which are too hard to be solved by classic exact algorithms. Inspired by nature, SI particles move through a search space in pursuit of good solutions. Even using SI, solving some large problems still takes a lot of time, e.g., due to the high number of dimensions and large search spaces. In order to overcome this, parallel implementations of SI algorithms have been investigated. They are typically based on low-level approaches to parallelism, such as MPI, OpenMP, and CUDA, which are tedious and error-prone to use. To overcome these issues, frameworks for high-level parallel programming such as the Muenster Skeleton Library (Muesli) can be used. We show how two SI algorithms, namely Particle Swarm Optimization (PSO) and Fish School Search (FSS), can easily be implemented in Muesli. Experimental results demonstrate the obtained performance and good scalability.
Andrew Brown, David Thomas, Jeff Reeve, Ghaith Tarawneh, Alessandro De Gennaro, Andrey Mokhov, Matthew Naylor, Tom Kazmierski
583 - 592
As computing systems get larger in capability – a good thing – they also get larger in ways less desirable: cost, volume, power requirements and so on. Further, as the data structures necessary to support large computations grow physically, the proportion of wallclock time spent communicating increases dramatically at the expense of the time spent calculating. This state of affairs is currently unacceptable and will only get worse as exascale machines move from the esoteric to the commonplace. As the unit cost of non-trivial cores continues to fall, one powerful approach is to build systems that have immense numbers of relatively small cores embedded (both geometrically and topologically) in a vast distributed network of stored state data: take the compute to the data, rather than the other way round. In this paper, we describe POETS – Partially Ordered Event Triggered Systems. This is a novel kind of computing architecture, built upon the neuromorphic concept that has inspired such machines as SpiNNaker [1,2] and BrainScaleS. The central idea is that a problem is broken down into a large set of interacting devices, which communicate asynchronously via small, hardware-brokered packets (the arrival of which is an event). The set of devices is the task graph. You cannot take a conventional codebase and port it to a POETS architecture; it is necessary to strip the application back to the underlying mathematics and reconstruct the algorithm in a manner sympathetic to the solution capabilities of the machine. However, for the class of problems for which this approach is suitable, POETS has already demonstrated solution speedups of a factor of 200 over conventional techniques.
Streamlined switching between computational resources, in order to select the most suitable computational environment for parallel application execution, is a crucial component of utility-like computing. However, machine heterogeneity obstructs multi-target deployment for complex, multi-dependency scientific parallel codes and threatens to make this aim intractable. We describe a proposal for a meta-deployment toolkit, called ADAPT, based on reusable recipes addressing the appropriate match-up between an application and an execution platform. Our research aims at exploring the challenges posed by transparent application deployment, with all its prerequisites, on heterogeneous resources. As some IaaS clouds and grids accept customized OS images, we explore application-oriented image assembly to further improve deployment for these specific targets. We explain how our approach increases the "usability" of various resources for parallel applications and simplifies arcane build processes.
Fabio Tordini, Marco Aldinucci, Paolo Viviani, Ivan Merelli, Pietro Liò
605 - 614
The cloud environment is increasingly appealing for the HPC community, which has always dealt with scientific applications. However, there is still some skepticism about moving from traditional physical infrastructures to virtual HPC clusters. This mistrust probably originates from some well-known factors, including the effective economy of using cloud services, data and software availability, and the longstanding matter of data stewardship. In this work we discuss the design of a framework (based on Mesos) aimed at achieving cost-effective and efficient usage of heterogeneous Processing Elements (PEs) for workflow execution, which supports hybrid cloud bursting over preemptible cloud Virtual Machines.
The biennial mini-symposium "Parallel computing with FPGAs" brings together research on applications and tools fostering the use of field programmable gate arrays. Key aspects are the efficiency, programmability, scalability and portability of high-level synthesis languages and tools. In particular, this year's contributions present productivity and programmability results of using the HLS languages OpenCL, OmpSs, MATLAB/Octave, OpenSPL and Vivado HLS. The current state and future challenges of HLS within the FPGA landscape are covered in a special keynote on bridging the gap between software and hardware designers.
Modern Systems-on-Chip (SoC) architectures and CPU+FPGA computing platforms are moving towards heterogeneous systems featuring an increasing number of hardware accelerators. These specialized components can deliver energy-efficient high performance, but their design from high-level specifications is usually very complex. Therefore, it is crucial to understand how to design and optimize such components to implement the desired functionality.
This paper discusses the challenges in bridging the gap between software programmers and hardware designers, focusing on state-of-the-art methods based on high-level synthesis (HLS). It also highlights future research lines for simplifying the creation of complex accelerator-based architectures.