In this paper we report results of the analysis of the computational performance and energy efficiency of a Lattice Boltzmann method (LBM) based application on the Intel KNL family of processors. In particular, we analyse the impact of the main memory (DRAM) when optimised memory access patterns are used to access data in the on-chip memory (MCDRAM) configured as a cache for the DRAM, even when the simulation data fits within the capacity of the on-chip memory available on the socket.
In this paper, we evaluate the performance, the power consumption and its variation, and the thermal behavior of the DGX-2 server from Nvidia. We develop specialized synthetic benchmarks to measure the raw performance of the GPUs in single, double, and half precision as well as on the Tensor Core units. With these benchmarks, we were able to reach peak performance and verify the specification provided by Nvidia, achieving 130.79 TFLOPS peak performance in half precision on the Tensor Cores. We also measured the thermal stability of the DGX-2 system. It can hold its peak performance when all 16 GPUs are fully loaded, except for the Tensor Core workload, where thermal throttling occurred with up to a 1% performance penalty. During the single-precision workload we observed a 23% variation in the power consumption of the individual GPUs installed in the system. Finally, we evaluated the behavior of the Tesla V100-SXM3 chip under DVFS tuning. Running at the optimal frequency, the compute-bound workload can save up to 39% energy while the run-time increases by 51%. More importantly, the memory-bound workload can save up to 31% energy with a 2% throughput penalty, and during communication over NVLink one can save up to 26% energy with no penalty.
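As a rough sanity check on the compute-bound figures above (our own arithmetic, not part of the original measurements), energy is average power times runtime, so the reported savings imply the following drop in average power under DVFS:

```latex
E = \bar{P}\,t
\quad\Longrightarrow\quad
\frac{\bar{P}_{\text{tuned}}}{\bar{P}_{\text{default}}}
  = \frac{E_{\text{tuned}}/E_{\text{default}}}{t_{\text{tuned}}/t_{\text{default}}}
  = \frac{1 - 0.39}{1 + 0.51} \approx 0.40
```

That is, a 39% energy saving at a 51% longer runtime corresponds to roughly a 60% reduction in average power draw.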
The Sparse Matrix-Vector Multiplication (SpMV) kernel has been one of the most popular kernels in high-performance computing, as the building block of many iterative solvers. At the same time, it has been one of the most notorious kernels, due to its low flop-per-byte ratio, which leads to under-utilization of modern processing resources and a huge gap between the peak system performance and the observed performance of the kernel. However, moving forward to exascale, performance by itself is no longer the holy grail; the requirement for energy-efficient high-performance computing systems is driving a trend towards processing units with better performance-per-watt ratios. Following this trend, FPGAs have emerged as an alternative, low-power accelerator for high-end systems. In this paper, we implement the SpMV kernel on FPGAs for single-precision floating-point values, towards an accelerated library for sparse matrix computations. Our implementation focuses on optimizing data access for the SpMV kernel and applies common optimizations to improve its parallelism and performance on FPGAs. We evaluate the performance and energy efficiency of our implementation, in comparison to modern CPUs and GPUs, for a diverse set of sparse matrices, and demonstrate that FPGAs can be an energy-efficient solution for the SpMV kernel.
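For reference, the irregular data-access pattern that such FPGA designs must optimize is already visible in a plain CSR formulation of the kernel. The sketch below is a generic illustration, not the paper's FPGA implementation; the array names follow the usual CSR convention.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Reference single-precision SpMV y = A @ x, with A stored in CSR format.

    values  : nonzero values of A (float32), length nnz
    col_idx : column index of each nonzero, length nnz
    row_ptr : start offset of each row in values/col_idx, length nrows + 1
    """
    nrows = len(row_ptr) - 1
    y = np.zeros(nrows, dtype=np.float32)
    for i in range(nrows):
        acc = np.float32(0.0)
        # The data-dependent, irregular reads of x are what drive the
        # low flop-per-byte ratio of the kernel.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```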
In this paper we present an evaluation of the Intel Xeon Broadwell platform in the CINECA Galileo supercomputer when DVFS and UnCore Frequency (UCF) tuning is performed under active power capping using the RAPL powercap registers. This work is an extension of our previous work under the H2020 READEX project, which focused on dynamic tuning of DVFS and UCF for complex HPC applications but with no powercap limit enforced. Power capping is an essential technique that allows system administrators to maintain the power budget of an entire system or data center using either an out-of-band management system or runtime systems such as GEOPM.
In this paper we use two boundary workloads, a Compute Bound Workload (CBW) and a Memory Bound Workload (MBW), to show the behavior of the platform under power capping and the potential for both energy and runtime savings when compared to the default CPU behavior. We show that DVFS and UCF tuning behave differently under a limited power budget. Our results show that when the CPU has a limited power budget, proper tuning can provide both improved energy consumption and reduced runtime, and that it is important to tune both DVFS and UCF.
For MBW we can save up to 22% in both runtime and energy when compared to the default behavior under the powercap. For CBW we can improve both performance, by up to 9.4%, and energy consumption, by up to 14.9%.
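On Linux, the RAPL powercap registers mentioned above are exposed through the powercap sysfs interface. The sketch below shows one minimal way to enforce such a package-level cap; the exact path layout varies by kernel and platform, and this is an illustration rather than the setup used in the paper.

```python
def set_rapl_power_cap(watts, package=0):
    """Set a package-level RAPL power cap via the Linux powercap sysfs interface.

    Requires root privileges; the sysfs path layout may differ between
    kernels and platforms, so treat this as a sketch.
    """
    base = f"/sys/class/powercap/intel-rapl/intel-rapl:{package}"
    with open(f"{base}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1_000_000)))  # the limit is specified in microwatts

# Example: cap CPU package 0 at 90 W before launching the tuned workload.
# set_rapl_power_cap(90)
```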
For symmetric (Hermitian) dense or banded matrices, the computation of the eigenvalues and eigenvectors of the generalized problem Ax = λBx is an important task, e.g. in electronic structure calculations. If a larger number of eigenvectors is needed, direct solvers are often applied. On parallel architectures the ELPA implementation has proven to be very efficient, also compared to other parallel solvers such as EigenExa or MAGMA. The main improvement that allows better parallel efficiency in ELPA is the two-step transformation from dense to banded to tridiagonal form. This was the achievement of the ELPA project. The continuation of this project has targeted additional improvements, such as monitoring and autotuning of the ELPA code, optimizing the code for different architectures, developing curtailed algorithms for banded A and B, and applying the improved code to solve typical examples in electronic structure calculations. In this paper we present the outcome of this project.
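For context, and using only standard linear algebra rather than ELPA's internal implementation, the generalized problem is commonly reduced to a standard eigenproblem via a Cholesky factorization of B before the dense-to-banded-to-tridiagonal reduction is applied:

```latex
A x = \lambda B x,\qquad
B = L L^{\mathsf{H}},\qquad
\left(L^{-1} A L^{-\mathsf{H}}\right) y = \lambda y,\qquad
x = L^{-\mathsf{H}} y .
```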
Graphs are a powerful tool for data representation in a wide range of domains, such as social, biological, and informational networks. However, their extremely large size often makes it computationally infeasible to study the entire graph. Graph sampling provides a solution by generating smaller subgraphs that are computationally feasible to analyze and can be used to infer the properties of the entire graph. In this work, we develop a high-throughput parallel implementation of the Totally Induced Edge Sampling (TIES) algorithm on FPGA. Prior research has shown that TIES performs better than other sampling techniques in terms of preserving the topological properties of the original graph, and thus generates better-quality subgraphs. The algorithm randomly samples edges and inserts the corresponding vertices into the sampled vertex set until the desired number of vertices has been sampled. Then, the edges connecting the sampled vertices are included in the sampled subgraph. We use multiple parallel pipelines to achieve high throughput and faster graph sampling. The parallel pipelines need to access a global dynamic data structure which contains the vertices sampled thus far. To support this, we develop a novel dynamic hash table data structure which supports parallel accesses in each clock cycle. We vary the number of pipelines and the size of the sampled subgraph, and analyze the performance of the design in terms of on-chip FPGA resource utilization, throughput, and total execution time. Our design achieves a throughput as high as 2471 Million Edges Per Second (MEPS) and performs 3.6x better than the state-of-the-art multi-core design.
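The two phases of TIES described above map onto a short sequential sketch (purely illustrative; the paper's contribution lies in the parallel FPGA pipelines and the dynamic hash table that backs the shared vertex set):

```python
import random

def ties_sample(edges, target_vertices, seed=None):
    """Totally Induced Edge Sampling (TIES), following the two phases above.

    edges           : list of (u, v) pairs of the original graph
    target_vertices : stop edge sampling once this many vertices are collected
                      (assumed not to exceed the number of distinct vertices)
    Returns (sampled_vertices, induced_edges).
    """
    rng = random.Random(seed)
    sampled = set()  # the global vertex set shared by all pipelines in the FPGA design
    # Phase 1: sample edges uniformly at random and insert both endpoints.
    while len(sampled) < target_vertices:
        u, v = rng.choice(edges)
        sampled.update((u, v))
    # Phase 2 (induction): keep every original edge whose endpoints were both sampled.
    induced = [(u, v) for (u, v) in edges if u in sampled and v in sampled]
    return sampled, induced
```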
The Non-Local Means (NL-means) algorithm is a robust image denoising algorithm. Its computational complexity, however, is higher than that of other algorithms, which limits its practical use. In this paper, we propose an implementation method for the NL-means algorithm on FPGA. In NL-means, the cross-correlations between small windows are calculated repeatedly, and a large amount of intermediate data has to be held temporarily to reduce the amount of computation. In our approach, the image is scanned in a zigzag order. This zigzag scan increases the computation time because of recalculation at the scan borders, but the required memory size can be drastically reduced. We have implemented the circuit on a Xilinx FPGA and show that real-time processing is possible even with a small FPGA.
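As a point of reference for the computation being accelerated (a textbook per-pixel formulation, not the paper's circuit), the NL-means estimate is a patch-similarity-weighted average over a search window. The function below assumes a padded grayscale float image so that all windows stay within bounds; the parameter names are illustrative.

```python
import numpy as np

def nl_means_pixel(img, i, j, patch=3, search=10, h=10.0):
    """Denoise pixel (i, j) of a padded grayscale float image with NL-means.

    Every candidate pixel in the search window contributes with a weight that
    decays with the squared distance between its patch and the reference patch;
    these repeated patch comparisons are the cost the FPGA design targets.
    """
    p = patch
    ref = img[i - p:i + p + 1, j - p:j + p + 1]
    num, den = 0.0, 0.0
    for y in range(i - search, i + search + 1):
        for x in range(j - search, j + search + 1):
            cand = img[y - p:y + p + 1, x - p:x + p + 1]
            w = np.exp(-np.sum((ref - cand) ** 2) / (h * h))
            num += w * img[y, x]
            den += w
    return num / den
```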
Convolutional Neural Networks (CNNs) currently dominate the fields of artificial intelligence and machine learning due to their high accuracy. However, their computational and memory needs intensify with the complexity of the problems they are deployed to address, frequently requiring highly parallel and/or accelerated solutions. Recent advances in machine learning showcased the potential of CNNs with reduced precision, relying on binarized weights and activations and thereby leading to Binarized Neural Networks (BNNs). Due to the embarrassingly parallel and discrete arithmetic nature of the required operations, BNNs fit FPGA technology well, allowing problem complexity to be scaled up considerably. However, the fixed amount of resources per chip introduces an upper bound on the dimensions of the problems that FPGA-accelerated BNNs can solve. To this end, we explore the potential of remote FPGAs operating in tandem within a disaggregated computing environment to accelerate BNN computations, and exploit dynamic partial reconfiguration (DPR) to boost aggregate system performance. We find that DPR alone boosts the throughput of a fixed set of BNN accelerators deployed on a remote FPGA by up to 3x in comparison with a static design that deploys the same accelerator cores on a local, software-programmable FPGA. In addition, performance increases linearly with the number of remote devices when inter-FPGA communication is reduced. To exploit DPR on remote FPGAs and reduce communication, we adopt a versatile remote-accelerator deployment framework for disaggregated datacenters, thereby boosting BNN performance with negligible development effort.
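To make "binarized weights and activations" concrete (a generic XNOR-popcount formulation, not the accelerator's actual datapath), a dot product over {-1, +1} vectors reduces to bitwise operations, which is why BNNs map so well onto FPGA logic:

```python
def binarized_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1}, each packed
    into an integer bitmask (bit = 1 encodes +1, bit = 0 encodes -1).

    Matching bits contribute +1 and differing bits -1, so the dot product equals
    n - 2 * popcount(a XOR b); in hardware this is an XNOR plus a popcount tree.
    """
    differing = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * differing

# Example: (+1, -1, +1, +1) . (+1, +1, -1, +1) = 0
# (vectors packed LSB-first as 0b1101 and 0b1011)
# assert binarized_dot(0b1101, 0b1011, 4) == 0
```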
Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has attracted great interest from both academia and industry thanks to the possibility of greatly accelerating a variety of applications. This interest has been further boosted by recent developments in FPGA programming frameworks which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches.
In this work we describe our first experiences in porting to FPGAs an HPC application used to simulate the Rayleigh-Taylor instability of fluids with different densities and temperatures using Lattice Boltzmann Methods. This activity is carried out in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system at the exascale level based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and allows FPGA devices to be targeted as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC, which embeds 16nm FinFET+ programmable logic and a multi-core Arm CPU.
Cellular automata constitute a massively parallel programming model capable of solving many algorithmic problems efficiently. The complexity of defining a suitable cell rule for a concrete problem can be overcome by using the extended model of global cellular automata in conjunction with specialized compilers that translate a high-level imperative programming language into cellular automata. Obviously, execution on universal multicore processors does not exploit the full parallel potential of cellular automata, while the workflow for direct hardware implementations is slow and hard to debug. In this paper, we propose a novel processor architecture that can execute a global cellular automaton as software and can still compete with other software or hardware implementations.
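For readers unfamiliar with the global cellular automaton (GCA) model referenced above, a minimal sketch of its synchronous update semantics looks roughly as follows; this is purely illustrative and unrelated to the proposed processor architecture, and the rule interface is an assumption for the example.

```python
def gca_step(states, links, rule):
    """One synchronous step of a one-dimensional global cellular automaton.

    Unlike a classical CA with fixed local neighbors, every cell i may read an
    arbitrary cell links[i] (its current global link); all cells then update
    their state and link simultaneously from the old generation.
    """
    new_states, new_links = [], []
    for i, (s, l) in enumerate(zip(states, links)):
        ns, nl = rule(i, s, states[l])  # rule sees own state and the linked cell's state
        new_states.append(ns)
        new_links.append(nl)
    return new_states, new_links
```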
In this paper, we propose a network crossbar implementation using partial reconfiguration of an FPGA in a multi-FPGA cluster computing system. With the proposed framework, the inter-FPGA network routing can be changed by reconfiguring the crossbar module through the partial reconfiguration mechanism. The purpose of this paper is to compare ordinary crossbar circuits and partially reconfigurable crossbar circuits in terms of resource usage and maximum operating frequency. As a result, when proper bus sizes for the crossbar are selected, partial reconfiguration improves the maximum operating frequency by 1.6 times while reducing the required ALM resources by 13%.
Unreproducibility stemming from a loss of data integrity can be prevented with hash functions, secure sketches, and Benford’s Law when combined with the historical practice of a Pli Cacheté, in which scientific discoveries were archived with a third party in order to later prove the date of discovery. Including the distinct systems of preregistration and data provenance tracking makes this the starting point for the creation of a complete ontology of scientific documentation. The ultimate goals of such a system, ideally a mandated one, would be to rule out several forms of dishonesty, catch computational and database errors, catch honest mistakes, and allow for automated data audits of large collaborative open science projects.
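Two of the ingredients named above can be illustrated with a few lines of code: a content hash for integrity checking and a Benford first-digit test for numeric data. The helper names below are hypothetical and this is a generic sketch, not any specific system proposed in the paper.

```python
import hashlib
from collections import Counter
from math import log10

def sha256_of_file(path, chunk=1 << 20):
    """Content digest of a data file, suitable for later integrity checks
    (e.g. deposited with a third party alongside a sealed record)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def benford_deviation(values):
    """Mean absolute deviation of the observed leading-digit frequencies from
    Benford's law, P(d) = log10(1 + 1/d); a large deviation can flag fabricated
    or corrupted numeric data. Assumes a non-empty list of nonzero numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    return sum(abs(counts[d] / len(digits) - log10(1 + 1 / d))
               for d in range(1, 10)) / 9
```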
Transparency and reproducibility are important aspects of validation for Machine Learning (ML) models; they are not yet fully understood and apply independently of the application domain. We offer a case study of reproducibility that highlights the challenges encountered when attempting to reproduce analyses obtained with Machine Learning methods in materials informatics. Our study explores prediction results obtained with ML models and issues in the training data serving as input. We discuss challenges related to theory-driven and numerical errors in training data, lack of reproducibility across platforms and versions, and the effects of randomness when varying hyperparameters. In addition to model accuracy, a main metric of interest in the ML community, our results show that model sensitivity may be equally important for applying ML in domain applications such as materials science.
Establishing the reproducibility of an experiment often requires repeating the experiment in its native computing environment. Containerization tools provide declarative interfaces for documenting native computing environments. Declarative documentation, however, may not precisely recreate the native computing environment because of human errors or dependency conflicts. An alternative is to trace the native computing environment during application execution. Tracing, however, does not generate declarative documentation.
In this paper, we preserve the native computing environment via tracing and automatically generate declarative documentation from the trace logs. Our method distinguishes among inputs, outputs, and user and system dependencies for a variety of programming languages. It then maps traced dependencies to standard package names and their versions by querying standard package repositories. We use the standard package names to generate comprehensive declarative documentation of the container. We verify the efficacy of this approach by preserving the native computing environments of several scientific projects submitted on Zenodo and GitHub and generating their declarative documentation. We measure precision and recall by comparing against the author-provided documentation. Our approach highlights over- and under-documentation in scientific experiments.
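The overall idea of trace-then-document can be sketched in a few functions: collect the files touched during the native run, map them to distribution packages, and emit declarative documentation. The sketch below is a heavily simplified illustration with assumed file names, regular expression, and a Debian-specific dpkg lookup; it is not the authors' implementation.

```python
import re
import subprocess

OPEN_RE = re.compile(r'openat\(.*?"([^"]+)"')  # paths opened during the traced execution

def traced_paths(strace_log):
    """Collect file paths from an strace-style log of the native run."""
    with open(strace_log) as f:
        return {m.group(1) for line in f if (m := OPEN_RE.search(line))}

def owning_package(path):
    """Map a traced system dependency to its distribution package (Debian/Ubuntu example)."""
    out = subprocess.run(["dpkg", "-S", path], capture_output=True, text=True)
    return out.stdout.split(":")[0] if out.returncode == 0 else None

def emit_dockerfile(packages, base="ubuntu:20.04"):
    """Turn the traced system dependencies into declarative documentation."""
    apt = " ".join(sorted(p for p in packages if p))
    return f"FROM {base}\nRUN apt-get update && apt-get install -y {apt}\n"

# pkgs = {owning_package(p) for p in traced_paths("run.strace")}
# print(emit_dockerfile(pkgs))
```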
Whole Tale (http://wholetale.org) is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of “Tales” for the scientific research community. Tales are executable research objects that capture the code, data, and environment along with the narrative and workflow information needed to re-create computational results from scientific studies. Creating research objects that enable reproducibility, transparency, and re-execution for computational experiments requiring significant compute resources or utilizing massive data is an especially challenging open problem. We describe opportunities, challenges, and solutions to facilitating reproducibility for data- and compute-intensive research, which we call “Tales at Scale,” using the Whole Tale computing platform. We highlight challenges and solutions relating to frontend responsiveness needs, gaps in current middleware design and implementation, network restrictions, containerization, and data access. Finally, we discuss challenges in packaging computational experiment implementations for portable data-intensive Tales and outline future work.