In this work, we address the problem of tuning communication libraries using a deep reinforcement learning approach. Reinforcement learning is a machine learning technique that is remarkably effective at solving game-like problems, and tuning the parameters of a communication library to obtain better performance in a parallel application can indeed be expressed as a game: find the combination of parameter settings that yields the best reward. Although AITuning has been designed to work with different run-time libraries, in this work we focus on applying it to the OpenCoarrays run-time communication library, built on top of MPI-3. This work not only shows the potential of reinforcement learning for tuning communication libraries, but also demonstrates how the MPI Tool Information Interface, introduced by the MPI-3 standard, can be used effectively by run-time libraries to improve performance without human intervention.
This paper discusses the problem of reliably benchmarking non-blocking MPI communications, both non-blocking point-to-point and non-blocking MPI-3 collective operations, and in particular the problem of accurately and practically estimating the degree of computation/communication overlap (communication-hiding efficiency). The authors propose an efficiency-estimation approach and methodology that differ from previous well-known work, together with a new IMB-ASYNC benchmark design. Practical tests were run on the Lomonosov-2 supercomputer using the widely used Intel MPI 2017 Update 1 implementation. The results of the tests are discussed, and future work on IMB-ASYNC testing and development is outlined.
MuPAT, an interactive multiple-precision arithmetic toolbox for MATLAB and Scilab, enables users to perform quadruple- and octuple-precision arithmetic operations. MuPAT uses the DD and QD algorithms, which require from 10 to 600 double-precision floating-point operations for each DD or QD operation, with a corresponding cost in execution time. To reduce the execution time of vector and matrix operations, we apply FMA, AVX2, and OpenMP to MuPAT through MATLAB executable (MEX) files. Unit-stride access is required for high performance, and it also makes vectorization with AVX2 easier; larger blocks are better suited to parallelization with OpenMP. That is, AVX2 fits the innermost loop and OpenMP fits the outer loops. With this configuration, matrix multiplication becomes nearly 13 times faster in a four-core environment, and the execution time of some DD vector operations drops to only about twice that of the original double-precision floating-point operations run without parallel processing.
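The cost structure described above comes from error-free transformations: each DD operation is built from several ordinary double operations that together capture the rounding error exactly. A minimal sketch in C of Knuth's two-sum and a DD addition built on it (the textbook construction, not MuPAT's actual code):

```c
/* Error-free transformation (Knuth's two-sum): s + e == a + b exactly,
 * with s the rounded sum and e the rounding error. */
static void two_sum(double a, double b, double *s, double *e) {
    double v;
    *s = a + b;
    v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* Double-double addition: (ah, al) + (bh, bl) -> (ch, cl).
 * Several plain double operations per DD operation -- this multiplication
 * of work is why DD/QD kernels cost 10 to 600 doubles per operation. */
static void dd_add(double ah, double al, double bh, double bl,
                   double *ch, double *cl) {
    double s, e;
    two_sum(ah, bh, &s, &e);
    e += al + bl;
    two_sum(s, e, ch, cl); /* renormalize into high/low parts */
}
```

In a DD vector kernel, the loop over such calls is unit-stride, which is what makes AVX2 vectorization of the innermost loop straightforward.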
Future large-scale high-performance computing clusters will face a power wall, where the peak power draw of the cluster exceeds the maximum power-supplying capability of the surrounding infrastructure. To use the limited power budget efficiently, we developed a dynamic strategy that tackles execution-time imbalance through power shifting and frequency limitation. Applying this strategy to the NPB OpenMP benchmarks, we succeed in continuously enforcing the power draw under a specified power cap. At the same time, execution time is reduced by up to 12.8% and the energy to solution by up to 12.3% compared to a native power strategy.
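As an illustration of the power-shifting idea (not the paper's actual controller; the function and names here are hypothetical), a budget under a fixed cap can be redistributed each iteration in proportion to measured per-thread time, so lagging threads receive headroom to run at higher frequency:

```c
/* Redistribute a node-level power cap across n threads in proportion to
 * each thread's last measured iteration time: slower threads get a larger
 * share, so they can raise their frequency, while the total stays capped. */
static void shift_power(const double *iter_time, int n,
                        double cap_watts, double *budget_watts) {
    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += iter_time[i];
    for (int i = 0; i < n; ++i)
        budget_watts[i] = cap_watts * iter_time[i] / total;
}
```

By construction the per-thread budgets sum to the cap, which is what "continuous enforcement" of the power draw requires.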
We measure the performance of the particle-in-cell (PIC) method for collisionless plasma, focusing on the strong scaling of thread-level parallelism with OpenMP. The conventional program structure of the PIC method, in which a single loop iterates over the list of particles, is compared with a new program structure, in which outer loops iterate over the spatial grid cells and the innermost loop iterates over the particles in each cell. The strong-scaling measurements show that the new program structure improves both the performance and the scalability of the PIC code over the conventional structure. The new code runs about three times faster than the conventional code without sorting of the particle list.
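The restructuring can be sketched as a nearest-grid-point charge deposition in C; the cell-ordered particle layout via `cell_start[]` is our illustrative assumption, not the paper's code:

```c
/* Cell-ordered deposition: the outer loop runs over grid cells and can be
 * parallelized safely because each cell writes only its own rho[c]; the
 * innermost loop runs over that cell's particles. The particle array is
 * assumed sorted by cell, with cell c's particles occupying indices
 * cell_start[c] .. cell_start[c+1]-1. */
static void deposit_charge(const double *q, const int *cell_start,
                           int ncells, double *rho) {
    #pragma omp parallel for
    for (int c = 0; c < ncells; ++c) {
        double acc = 0.0;
        for (int p = cell_start[c]; p < cell_start[c + 1]; ++p)
            acc += q[p];            /* nearest-grid-point weighting */
        rho[c] = acc;
    }
}
```

In the conventional structure, the single particle loop scatters into `rho[]` at arbitrary indices, which forces atomics or thread-private grid copies; the cell-ordered form avoids both.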
We discuss an open-source implementation of Backus's FP formalism in C++. Our implementation preserves all the nice formal properties of the original language. It is fully C++17 compliant, leverages standard concurrency mechanisms, and scales linearly on state-of-the-art shared-memory multi-cores. By preserving the ability to use all the rules of the associated “algebra of programs” described by Backus more than 40 years ago, the C++ FP implementation is a natural candidate for introducing parallel programming concepts in core parallel computing courses.
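Two of Backus's combining forms can be sketched even in plain C (the implementation discussed here is C++17; this is only an illustration of the combinator style, with hypothetical names):

```c
typedef double (*unary_fn)(double);
typedef double (*binary_fn)(double, double);

/* "Apply-to-all" (alpha f): map f over a vector. The independent
 * per-element applications are what make this form trivially parallel. */
static void apply_to_all(unary_fn f, const double *in, double *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = f(in[i]);
}

/* "Insert" (/f): fold a vector with f. For associative f this can be
 * evaluated as a parallel tree reduction -- one of the algebraic rules
 * the algebra of programs lets an implementation exploit. */
static double insert_fold(binary_fn f, const double *in, int n) {
    double acc = in[0];
    for (int i = 1; i < n; ++i)
        acc = f(acc, in[i]);
    return acc;
}

static double twice(double x) { return 2.0 * x; }
static double plus(double a, double b) { return a + b; }
```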
Today’s computer architectures are increasingly specialized, and heterogeneous configurations of computational units are common. To program these systems conveniently while still achieving good performance, including performance portability across platforms, high-level parallel programming libraries and tool-chains such as the skeleton programming framework SkePU are used. SkePU targets heterogeneous systems by automatically generating program components, “user functions”, from a high-level C++ program for the different execution units in the system, such as CPUs and GPUs. This work extends this multi-backend approach by allowing the programmer to supply additional variants of these user functions tailored to different scenarios, such as platform constraints. The paper introduces the overall approach of multi-variant user functions, presents several use cases including explicit SIMD vectorization for supported hardware, and evaluates the optimizations that can be achieved with this extension.
Andrew Brown, Mark Vousden, Alex Rast, Graeme Bragg, David Thomas, Jonny Beaumont, Matthew Naylor, Andrey Mokhov
487 - 496
POETS (Partially Ordered Event Triggered Systems) is a significantly different way of approaching large, compute-intensive problems. The evolution of traditional computer technology has taken us from simple machines with tiny memories and (by today's standards) glacial clock speeds to multi-gigabyte architectures running orders of magnitude faster, but with the same fundamental process at their heart: a central core doing one thing at a time. Over the past few years, architectures containing multiple cores have appeared, but exploiting these efficiently in the general case remains a ‘holy grail’ of computer science. POETS takes an alternative approach, made possible only today by the proliferation of cheap, small cores and massive reconfigurable platforms. Rather than explicitly programming the behaviour of each core and each communication between them, as is done in conventional supercomputers, here the programmer defines a set of relatively small, simple behaviours for the set of cores and leaves them to get on with it – with the right behavioural definitions, the system ‘self-organises’ to produce the desired results.
Dario Dematties, George K. Thiruvathukal, Silvio Rizzi, Alejandro Wainselboim, B. Silvano Zanutto
497 - 506
The interdisciplinary field of neuroscience has made significant progress in recent decades, providing the scientific community with a new level of understanding of how the brain works, beyond the store-and-fire model found in traditional neural networks. Meanwhile, Machine Learning (ML) based on established models has seen a surge of interest in the High Performance Computing (HPC) community, especially through the use of high-end accelerators such as Graphics Processing Units (GPUs), including HPC clusters built from them. In our work, we are motivated to exploit these high-performance computing developments and understand the scaling challenges for new, biologically inspired, learning models on leadership-class HPC resources. These emerging models feature sparse and random connectivity profiles that map to more loosely coupled parallel architectures with a large number of CPU cores per node. In contrast to traditional ML codes, these methods exploit loosely coupled sparse data structures rather than the tightly coupled dense matrix computations that benefit from the SIMD-style parallelism found on GPUs. In this paper we introduce a hybrid Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization scheme to accelerate and scale our computational model based on the dynamics of cortical tissue. We ran computational tests on a leadership-class visualization and analysis cluster at Argonne National Laboratory, including a study of strong and weak scaling in which we obtained parallel efficiencies with a minimum above 87% and a maximum above 97% for simulations of our biologically inspired neural network on up to 64 computing nodes running 8 threads each. This study shows the promise of the MPI+OpenMP hybrid approach for supporting flexible, biologically inspired computational experiments, and demonstrates the viability of applying these strategies on high-end leadership computers in the future.
Scientific visualization tools are essential for understanding physical simulations, as they provide a visual representation of the simulated phenomena. In recent years, the data produced by simulations has joined the big-data trend. To maintain reasonable response times to user commands, many scientific tools introduce parallelism into their software. As the number of cores in any given architecture increases, the need for software that exploits the architecture is inevitable. Thus GraphiX, a scientific visualization tool parallelized in a shared-memory fashion via OpenMP 4.5, was created. We chose Gnuplot as the graphical utility for GraphiX due to its speed, as it is written in C. The work balance of GraphiX's parallelism scheme is nearly perfect, and it scales well both in memory and in number of cores. We achieved a maximum speedup of 560% with 16 cores while visualizing approximately 3 million cells.
Allen D. Malony, Matt Larsen, Kevin Huck, Chad Wood, Sudhanshu Sane, Hank Childs
521 - 530
Large scale parallel applications have evolved beyond the tipping point where there are compelling reasons to analyze, visualize and otherwise process output data from scientific simulations in situ rather than writing data to filesystems for post-processing. This modern approach to in situ integration is served by recently developed technologies such as Ascent, which is purpose-built to transparently integrate runtime analysis and visualization into many different types of scientific domains. The TAU Performance System (TAU) is a comprehensive suite of tools that have been developed to measure the performance of large scale parallel libraries and applications. TAU is widely-adopted and available on leading-edge HPC platforms, but has traditionally relied on post-processing steps to visualize and understand application performance. In this paper, we describe the integration of Ascent and TAU for two complementary purposes: Analyzing Ascent performance as it serves the visualization needs of scientific applications, and visualizing TAU performance data at runtime. We demonstrate the immediate benefits of this in situ integration, reducing the time to insight while presenting performance data in a perspective familiar to the application scientist. In the future, the integration of TAU’s performance observations will enable Ascent to reconfigure its behavior at runtime in order to consistently stay within user-defined performance constraints while processing visualizations for complex and dynamic HPC applications.
Adriano Vogel, Dalvan Griebler, Luiz Gustavo Fernandes, Marco Danelutto
533 - 542
Video streaming applications have critical performance requirements for dealing with fluctuating workloads and providing results in real time. Consequently, the majority of these applications demand parallelism to deliver quality of service to users. Although high-level and structured parallel programming aims at facilitating parallelism exploitation, several issues must still be addressed to improve existing parallel programming abstractions. In this paper, we employ self-adaptivity for stream processing in order to seamlessly manage the application's parallelism configuration at run-time, with a new strategy that relieves application programmers of the need to set time-consuming and error-prone parallelism parameters. The new strategy was implemented and validated on SPar. The results show that the proposed solution increases the level of abstraction while achieving competitive performance.
Dinei A. Rockenbach, Dalvan Griebler, Marco Danelutto, Luiz G. Fernandes
543 - 552
The combined exploitation of stream and data parallelism has shown encouraging performance results in the literature for heterogeneous architectures, which are present in virtually every computer system today. However, providing parallel software that efficiently targets those architectures requires significant programming effort and expertise. The SPar domain-specific language already addresses this problem for multi-core architectures, providing proven high-level programming abstractions. In this paper, we enrich the SPar language with support for GPUs. New transformation rules are designed for generating parallel code using stream and data parallel patterns. Our experiments reveal that these transformation rules improve performance while the high-level programming abstractions are maintained.
In this work we describe a method to estimate the computing performance and energy efficiency to be expected of an FPGA device, motivated by the possible use of FPGAs as accelerators for floating-point-intensive HPC workloads. In the past, FPGA devices were not considered an efficient option for floating-point-intensive computations, but more recently, with the advent of dedicated DSP units and the increased amount of resources in each chip, interest in these devices has risen. Another obstacle to the wide adoption of FPGAs in the HPC field has been the low-level hardware knowledge commonly required to program them using Hardware Description Languages (HDLs). This issue too has recently been mitigated by the introduction of higher-level programming frameworks adopting so-called High-Level Synthesis (HLS) approaches, which reduce development time and narrow the gap between the skills required to program FPGAs and the skills commonly held by HPC software developers. We apply the proposed method to estimate the maximum floating-point performance and energy efficiency of the FPGA embedded in a Xilinx Zynq Ultrascale+ MPSoC hosted on a Trenz board.
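Such an estimation method boils down to simple resource arithmetic: the number of floating-point units that fit on the fabric, times their per-cycle throughput, times the achievable clock. A sketch with illustrative numbers (hypothetical, not the Zynq Ultrascale+ figures measured in the paper):

```c
/* Upper bound on sustained floating-point performance for an FPGA design:
 * units that fit on the fabric x FLOP per cycle per unit x clock rate. */
static double peak_gflops(int fp_units, double flop_per_cycle,
                          double clock_ghz) {
    return fp_units * flop_per_cycle * clock_ghz;
}

/* Energy-efficiency metric used to compare devices. */
static double gflops_per_watt(double gflops, double watts) {
    return gflops / watts;
}
```

For example, a design that fits 1000 fused multiply-add units at 500 MHz would be bounded at 1 TFLOP/s; dividing by measured board power gives the efficiency figure of merit.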
In this work, a new algorithm was developed for calculating the four-point water model TIP4P on graphics accelerators. It was designed as part of the flexible molecular dynamics package LAMMPS, in the “GPU” library module. We describe two approaches to implementing the TIP4P model on GPUs: (1) dividing the related computations between CPU and GPU; and (2) computing the interactions fully on the GPU. We verify, benchmark, and profile the program. The achieved speedup of the interaction computation is about 7×, and the entire calculation is accelerated by about 55%.
Ekaterina Dlinnova, Sergey Biryukov, Vladimir Stegailov
574 - 582
The article presents an energy consumption and efficiency analysis based on data from three small-size supercomputers installed at JIHT RAS: the air-cooled hybrid supercomputer Desmos with AMD FirePro GPUs, and the air-cooled and liquid-cooled segments of the supercomputer Fisher based on AMD Epyc Naples CPUs. To collect the data, we deployed the same real-time analytics infrastructure on all three supercomputers, using a classical molecular-dynamics problem as the benchmarking tool. Our results quantify the energy savings provided by GPU-based calculations in comparison with CPU-only calculations, and by liquid cooling in comparison with air cooling. During the strong-scaling benchmarks, we detect an interesting minimum of energy consumption in the CPU-only case.
David Goz, Georgios Ieronymakis, Vassilis Papaefstathiou, Nikolaos Dimou, Sara Bertocco, Antonio Ragagnin, Luca Tornatore, Giuliano Taffoni, Igor Coretti
583 - 592
The aim of this work is to quantitatively evaluate the impact of computation on the energy consumption of ARM MPSoC platforms, exploiting CPUs, embedded GPUs, and FPGAs. One of these platforms possibly represents the future of High Performance Computing systems: a prototype of an Exascale supercomputer. Performance and energy measurements are made using a state-of-the-art direct N-body code from the astrophysical domain, and we compare the time-to-solution and energy-delay-product metrics for different software configurations. We show that FPGA technologies can be used for application-kernel acceleration and are emerging as a promising alternative to “traditional” HPC technologies, which focus purely on peak performance rather than on power efficiency.
Håvard H. Holm, André R. Brodtkorb, Martin L. Sætra
593 - 604
In this work, we examine performance and energy efficiency when using Python to develop HPC codes running on the GPU. We investigate the portability of performance and energy efficiency between CUDA and OpenCL; between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings show that for some combinations of GPU and GPU code there is a significant speedup for CUDA over OpenCL, but this does not hold in general. Our experiments show that performance generally varies more between different GPUs than between CUDA and OpenCL. Finally, we show that tuning for performance is a good way of tuning for energy efficiency.