The supercomputers listed in the Top500 and running the HPCG benchmark are usually unable to use more than 1–5 % of their peak computing power, and no system achieves 1 PFlop/s. The paper shows that the Conjugate Gradient algorithm without preconditioning places only weak demands on interconnect-link bandwidth, but, being a problem of low arithmetic intensity (flop/byte ratio), it would need memory bandwidth about 20 times greater than is available.
Based on this, the paper investigates the architectural features of a prospective cost-effective processor whose memory bandwidth matches its computing power when solving problems of low arithmetic intensity.
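As a rough illustration of the bandwidth argument above, the attainable performance of a bandwidth-bound kernel can be estimated with the standard roofline relation; the numerical figures below are assumptions for the example, not values taken from the paper.

```latex
% Roofline-style estimate for a bandwidth-bound kernel (illustrative figures, not from the paper):
%   attainable performance = min(peak performance, arithmetic intensity x memory bandwidth)
\[
  P_{\mathrm{attainable}} \;=\; \min\!\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr),
  \qquad I = \frac{\text{flops performed}}{\text{bytes moved}}
\]
% Example with assumed numbers: a sparse CG iteration with I ~ 0.25 flop/byte on a node
% with B = 200 GB/s is limited to about 50 GFlop/s, a few percent of a multi-TFlop/s peak;
% sustaining the peak would require roughly P_peak / I of memory bandwidth.
```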
The distributed monitoring infrastructure of the Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) records a broad variety of computing and storage logs on a Hadoop infrastructure. These logs represent a valuable source of information for system tuning and capacity planning. In this paper we apply machine learning (ML) techniques to a large amount of traces to discover patterns and correlations useful for classifying the popularity of experiment-related datasets. We implement a scalable pipeline of Spark components which collects the dataset access logs from heterogeneous monitoring sources and groups them into weekly snapshots organized by CMS sites. Predictive models are trained on these snapshots and forecast which datasets will become popular over time. The dataset popularity predictions are then used to evaluate a novel data-caching strategy, called Popularity Prediction Caching (PPC). We compare the hit rates of PPC with those produced by well-known caching policies and demonstrate that the performance improvement is as high as 20% at some sites.
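The following minimal sketch illustrates the kind of hit-rate comparison described above: a plain LRU cache versus a prediction-driven admission policy in the spirit of PPC. The admission rule and the toy trace are illustrative assumptions, not the paper's actual pipeline or data.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Hit rate of a plain LRU cache over a trace of dataset accesses."""
    cache, hits = OrderedDict(), 0
    for ds in accesses:
        if ds in cache:
            hits += 1
            cache.move_to_end(ds)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[ds] = True
    return hits / len(accesses)

def ppc_hit_rate(accesses, capacity, predicted_popular):
    """Popularity-prediction caching sketch: only datasets the model predicts to
    become popular are admitted; other requests bypass the cache (illustrative
    policy, not the paper's exact algorithm)."""
    cache, hits = OrderedDict(), 0
    for ds in accesses:
        if ds in cache:
            hits += 1
            cache.move_to_end(ds)
        elif ds in predicted_popular:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[ds] = True
    return hits / len(accesses)

# Toy weekly trace for one site: dataset "A" is the one predicted to become popular.
trace = ["A", "B", "A", "C", "A", "D", "A", "B", "A"]
print(lru_hit_rate(trace, capacity=1))                              # 0.0
print(ppc_hit_rate(trace, capacity=1, predicted_popular={"A"}))     # ~0.44
```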
We discuss schedulers for a heterogeneous distributed platform designed to execute a variety of tasks in a non-dedicated environment. The platform uses and controls a large number of non-dedicated heterogeneous computational resources in a local network. Several self-scheduling algorithms have been adapted to take into account the computational capacity of each workstation in the network. To evaluate the schedulers we use the platform to execute a software tool for molecular docking. We analyze the performance of the self-scheduling algorithms and their impact on the execution time of the application.
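A minimal sketch of the idea of adapting a self-scheduling rule to per-workstation capacity is shown below; the guided self-scheduling base rule is standard, while the capacity weighting and the toy numbers are illustrative assumptions, not the paper's algorithm.

```python
def next_chunk(remaining, n_workers, capacity, min_chunk=1):
    """Guided self-scheduling chunk, scaled by the relative capacity of the
    requesting workstation (the weighting scheme is an illustrative assumption)."""
    base = max(remaining // n_workers, min_chunk)            # classic GSS rule
    return min(remaining, max(int(base * capacity), min_chunk))

# Toy run: 100 docking tasks, 4 workstations, the first twice as fast as the others.
remaining, capacities = 100, [2.0, 1.0, 1.0, 1.0]
while remaining > 0:
    for w, cap in enumerate(capacities):
        if remaining == 0:
            break
        chunk = next_chunk(remaining, len(capacities), cap)
        remaining -= chunk
        print(f"workstation {w} takes {chunk} tasks, {remaining} left")
```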
Scientific legacy codes usually hold large numbers of physical arrays that are used and updated by the code routines, while the code parameters are set through a simple textual data file. Therefore, when a parallel large-scale parameter sweep is needed, the user must manually replicate the code multiple times, adapt the data file, run the code and monitor its performance in each working directory separately. This is practicable when only a few computations are needed, but impossible for a large-scale parallel parameter sweep on a cluster with thousands of calculations. In this work we present CalCul – a Python-based workspace for high-performance legacy scientific codes – which automatically renders most legacy-code data files into a Python object, allowing the user, among many other functionalities, to control and handle all the needs of a parameter sweep with the simplicity and sophistication of Python libraries, and thus converting the work with such software from manual to automatic. Also, in order not to disrupt the way legacy software works, and to avoid replacing its existing work modules, the CalCul system mirrors changes between the Python object and the legacy data files, thus allowing the software to be driven dually – either from CalCul or from the data files themselves. CalCul also interfaces with the IPython interactive shell, so the user can manage all scientific actions entirely within one self-contained workspace.
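The sketch below conveys the mirroring idea described above: a textual data file rendered into a Python object, with every change written back to disk. The "key = value" file format, the class name and the sweep paths are hypothetical illustrations, not the actual CalCul interface.

```python
import re

class DataFile:
    """Minimal sketch: render a 'key = value' legacy data file into a Python
    object and mirror changes back to the file (hypothetical format and API,
    not the actual CalCul implementation)."""
    def __init__(self, path):
        self._path = path
        self._params = {}
        with open(path) as f:
            for line in f:
                m = re.match(r"\s*(\w+)\s*=\s*(\S+)", line)
                if m:
                    self._params[m.group(1)] = m.group(2)

    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        self._params[key] = str(value)
        self._flush()                       # mirror the change into the legacy file

    def _flush(self):
        with open(self._path, "w") as f:
            for k, v in self._params.items():
                f.write(f"{k} = {v}\n")

# Parameter sweep: one working directory per value (paths are illustrative).
# for i, rho in enumerate([0.1, 0.2, 0.5]):
#     df = DataFile(f"run_{i}/input.dat")
#     df["density"] = rho
```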
Benchmarking and rating constantly running, distributed, message-based systems is very difficult. Time-based benchmarks require high effort and are error-prone and expensive because they need dedicated hardware. Instead, we propose to use the number of messages in such systems as an abstract measure which is independent of physical clusters but nevertheless allows physical behaviour to be inferred using hardware-specific cost vectors. The number of messages and the cost vectors can be systematically extracted from a running instance of a message-based system. Using two real-world examples, we show how results gained with a traditional benchmark correlate with those inferred using our technique.
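A minimal sketch of projecting abstract message counts onto hardware-specific cost vectors is given below; the message categories and unit costs are assumed figures for illustration only.

```python
def estimated_cost(message_counts, cost_vector):
    """Project abstract message counts onto a hardware-specific cost vector
    (illustrative model: total cost = sum over message types of count x unit cost)."""
    return sum(count * cost_vector.get(kind, 0.0) for kind, count in message_counts.items())

# Counts extracted from a running instance (toy values) and two cluster-specific
# cost vectors, e.g. in milliseconds per message (assumed figures).
counts    = {"local": 120_000, "remote": 8_000, "persist": 2_500}
cluster_a = {"local": 0.001,   "remote": 0.050, "persist": 0.400}
cluster_b = {"local": 0.002,   "remote": 0.020, "persist": 0.900}
print(estimated_cost(counts, cluster_a), estimated_cost(counts, cluster_b))
```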
In order to operate within power supply constraints, the next generation of supercomputers must be energy efficient. Both the capabilities of the target HPC system architecture and the characteristics of the workload impact the energy efficiency of parallel applications, and these system and workload factors form a complicated optimization search space. Further, a typical workload may consist of multiple algorithmic kernels, each with a different power consumption pattern. Using the Parallel Research Kernels as a case study, we identify key bottlenecks that change the energy usage pattern and develop strategies that improve energy efficiency by optimizing both workload and system parameters in an automated manner. The method provides significant insight for identifying repeatable, statistically significant energy-saving opportunities for parallel applications at various scales.
With the advent of a new generation of supercomputers characterized by the tightly-coupled integration of a large number of powerful processing cores in the same die, energy and temperature walls are looming threats to the growth in computational power.
Scientific computing is characterized by a single application running in parallel on multiple nodes and cores until termination. The message-passing programming model is a widely adopted paradigm for explicitly handling data sharing between processes of the same application. As an effect of the MPI communication patterns among different processes, the application exhibits phases which can be exploited by the OS power manager. In addition, the large number of cores integrated in the same silicon die introduces a large thermal capacitance as well as on-die thermal heterogeneity. Jointly exploiting local workload unbalance and computational node heterogeneity can open interesting opportunities for advanced thermal and energy management. In this paper, we present an exploratory work to assess these opportunities and their limiting factors. We analyze the application workload and identify opportunities to reduce energy consumption, as well as their impact on performance. We test our methodology on a widely-used quantum-chemistry application, demonstrating the potential benefits of combining the application flow with power and thermal management strategies.
Mesh deformation is a performance-critical part of many problems in fluid dynamics. Radial basis function (RBF) interpolation based methods for mesh deformation have addressed the increasing complexity of larger data sets, and recently a domain decomposition method has been introduced which allows these algorithms to be mapped well to distributed memory systems. Because heterogeneous systems have proven to be more time- and energy-efficient for some applications, the HPC resources available to engineering users and researchers are becoming increasingly heterogeneous.
In this paper, we describe two optimisations performed on an RBF based interpolation solver for mesh deformation. Motivated by a theoretical performance analysis, the existing MPI distributed model was extended to hybrid parallelisation with OpenMP to achieve better scaling efficiency on systems with hundreds of cores. In addition, an auto-tuning step at compile time was introduced which selects a threshold between code paths and yields up to twofold performance compared to a constant-parameter approach across all test systems.
Our results indicate scaling efficiency in excess of 50–70 % for fully utilised dual-socket systems of up to 96 cores, which is above the theoretical ideal performance of the baseline code. Utilisation of a single GPU improves time and energy to solution when a single CPU core is used; constraints of the applied domain decomposition which degrade performance when a single GPU is combined with many CPU cores are identified.
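For reference, RBF mesh-deformation solvers of this kind interpolate prescribed boundary displacements into the volume mesh with the standard form below; the dense solve and volume evaluation it implies are what the domain decomposition and the hybrid parallelisation target.

```latex
% Standard RBF mesh-deformation interpolant: boundary displacements d_b at centres x_j
% are interpolated into the volume (a low-order polynomial term p(x) is optional).
\[
  s(\mathbf{x}) \;=\; \sum_{j=1}^{n_b} \gamma_j\, \phi\!\left(\lVert \mathbf{x} - \mathbf{x}_j \rVert\right) \;+\; p(\mathbf{x})
\]
% The coefficients gamma_j are obtained by enforcing s(x_i) = d_b(x_i) at every boundary
% centre, i.e. by solving a dense n_b x n_b system; the O(n_b^3) solve and the O(n_v n_b)
% evaluation over the volume nodes dominate the cost.
```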
Error-tolerating applications are increasingly common in the emerging field of real-time HPC. Proposals have been made at the hardware level to take advantage of inherent perceptual limitations, redundant data, or reduced-precision input, as well as to reduce system costs or improve power efficiency. At the same time, work on floating-point to fixed-point conversion tools allows us to trade off algorithm exactness for a more efficient implementation. In this work, we aim at leveraging existing, HPC-oriented hardware architectures, while including in the precision tuning an adaptive selection of floating- and fixed-point arithmetic.
Our proposed solution takes advantage of the application domain knowledge of the programmers by involving them in the first step of the interaction chain. We rely on annotations written by the programmer in the input file to know which variables of a computational kernel should be converted to fixed point. The second stage replaces the floating-point variables in the kernel with fixed-point equivalents. It also adds to the original source code the utility functions that perform data type conversions from floating point to fixed point, and vice versa. The output of the second stage is a new version of the kernel source code which exploits fixed-point computation instead of floating-point computation.
As opposed to typical custom-width hardware designs, we only rely on the standard 16-bit, 32-bit and 64-bit types. We also explore the impact of the fixed-point representation on auto-vectorization.
We discuss the effect of our solution in terms of time-to-solution, error and energy-to-solution.
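The sketch below shows the generic float-to-fixed conversion and fixed-point arithmetic that such generated utility functions implement; it is a simplified stand-in under a fixed Q-format assumption, not the tool's actual generated code.

```python
def to_fixed(x, frac_bits, word_bits=32):
    """Quantise a float to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating to the word width (simplified illustration)."""
    scaled = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(x_fixed, frac_bits):
    """Convert back from fixed point to floating point."""
    return x_fixed / (1 << frac_bits)

def fixed_mul(a, b, frac_bits):
    """Fixed-point multiply: the raw product carries 2*frac_bits fractional bits,
    so it is shifted back down to the working format."""
    return (a * b) >> frac_bits

# Example: a kernel value computed in Q16.16 instead of double precision.
a, b = to_fixed(3.14159, 16), to_fixed(0.5, 16)
print(to_float(fixed_mul(a, b, 16), 16))   # ~1.5708, with quantisation error
```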
Energy efficiency and consumption are now among the most important and challenging issues both in current Petascale systems and in the design of future Exascale computing systems. The European Union Horizon 2020 READEX project uses an online approach to exploit application dynamism and tune large-scale HPC applications to improve energy efficiency and performance. The paper presents the READEX methodology, consisting of Design-Time Analysis and Runtime Application Tuning, and describes the pre-analysis steps involving application dynamism and significant region detection. During design time, the READEX tuning plugin evaluates configurations of hardware and software tuning parameters to determine the best settings for instances of application regions. The runtime tuning dynamically switches to the best configuration for an application region during production runs. Finally, the energy savings obtained for LULESH on the Taurus supercomputer highlight the effectiveness of this methodology.
In the context of time-critical applications there is a need to cluster data streams so as to provide approximate solutions in the shortest possible time, in order to capture in real time the evolution of physical or social phenomena. In this work, a nature-inspired algorithm for clustering evolving big data streams is presented, which is designed to be executed on many-core GPU architectures.
The LAPACK routines GEQRT2 and GEQRT3 can be used to compute the QR decomposition of a matrix of size m×n as well as the storage-efficient compact WY representation of its orthogonal factor.
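For reference, the storage-efficient form produced by these routines is the compact WY representation of the block Householder reflector:

```latex
% Compact WY representation of the orthogonal factor computed by GEQRT2/GEQRT3
% (real case shown): V is the m x n unit lower trapezoidal matrix of Householder
% vectors and T is the n x n upper triangular block-reflector factor.
\[
  Q \;=\; I \;-\; V\,T\,V^{T}
\]
```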
In this paper we present an optimization of a spectral finite element method implementation. The improvements consist of modifying the memory layout of the main algorithmic kernels and increasing the arithmetic intensity via loop transformations. The code has been deployed on multi-core SIMD machines and on GPU. Compared to our starting point, i.e. the original scalar sequential code, we achieve a speed-up of 228× on CPU. We present comparisons with the SPECFEM2D code that demonstrate the good performance of our implementation on similar cases. On GPU, a hybrid solution is investigated.
Support for heterogeneous platforms requires multiple specialised devices to collaborate in executing an application. The SYCL standard, published by Khronos, provides a C++ abstraction layer on top of OpenCL that enables single-source programming for a large number of heterogeneous devices. Single-source programming and the task data-flow approach enable SYCL developers to leverage modern programming techniques on heterogeneous platforms. In this paper, we present how SYCL combines expression tree templates and kernel fusion to develop SYCL-BLAS, an efficient BLAS implementation for heterogeneous platforms. The use of templates makes it possible to generate the BLAS kernels related to each BLAS routine, whereas kernel fusion merges the expression trees, enlarging the BLAS kernels. These features show that SYCL can be used to quickly develop libraries for heterogeneous systems by providing sufficient levels of abstraction. Our experiments compare the performance of clBLAS and SYCL-BLAS on a server equipped with an Intel Core i7-6700K CPU and an AMD R9 GPU.
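The following language-neutral sketch conveys the expression-tree-plus-fusion idea mentioned above: operations build a lazy tree and the whole tree is evaluated in a single pass, avoiding temporaries. It is only an illustration of the technique, not SYCL-BLAS code or C++ expression templates.

```python
class Expr:
    """Lazy expression node: operators build a tree instead of evaluating eagerly."""
    def __add__(self, other):  return Add(self, other)
    def __mul__(self, alpha):  return Scale(alpha, self)

class Vec(Expr):
    def __init__(self, data):  self.data = data
    def at(self, i):           return self.data[i]

class Add(Expr):
    def __init__(self, lhs, rhs): self.lhs, self.rhs = lhs, rhs
    def at(self, i):              return self.lhs.at(i) + self.rhs.at(i)

class Scale(Expr):
    def __init__(self, alpha, x): self.alpha, self.x = alpha, x
    def at(self, i):              return self.alpha * self.x.at(i)

def evaluate(expr, n):
    """Single fused 'kernel': the whole expression tree is evaluated element-wise
    in one pass, so y = alpha*x + y needs no intermediate vector."""
    return [expr.at(i) for i in range(n)]

x, y = Vec([1.0, 2.0, 3.0]), Vec([4.0, 5.0, 6.0])
print(evaluate(x * 2.0 + y, 3))   # axpy expressed as a fused expression tree
```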
The k-nearest neighbor (k-NN) search is a fundamental procedure widely used in machine learning and data embedding techniques. Herein we present a new multi-GPU/CUDA implementation of the brute-force (BF) k-NN algorithm. We demonstrate its advantages over the currently fastest GPU/CUDA implementations of BF k-NN, both in terms of computational time and memory requirements. Unlike its competitors, our code scales linearly with the number of GPUs (up to 8 units), which allows much larger high-dimensional datasets to be scrutinized. We also present a new GPU implementation of the approximate k-NN algorithm used in the LargeVis data embedding algorithm, which is more than two orders of magnitude faster than its original CPU version. We discuss its limitations in terms of accuracy, efficiency and usefulness in data embedding.
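A single-node NumPy sketch of the brute-force k-NN computation is shown below; the same pairwise-distance structure is what a multi-GPU implementation distributes across devices. This is an illustration of the BF algorithm itself, not the paper's CUDA code.

```python
import numpy as np

def brute_force_knn(queries, refs, k):
    """Brute-force k-NN with squared Euclidean distances over all query/reference
    pairs (single-node sketch; each GPU would handle a chunk of the references)."""
    # ||q - r||^2 = ||q||^2 - 2 q.r + ||r||^2, computed for all pairs at once
    d2 = (np.sum(queries**2, axis=1)[:, None]
          - 2.0 * queries @ refs.T
          + np.sum(refs**2, axis=1)[None, :])
    idx = np.argpartition(d2, k, axis=1)[:, :k]                    # k smallest, unordered
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)                  # sorted neighbour indices

rng = np.random.default_rng(0)
refs, queries = rng.normal(size=(1000, 32)), rng.normal(size=(5, 32))
print(brute_force_knn(queries, refs, k=3))
```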
Satellite-based remote sensing in the mid-infrared spectral region can deliver a wealth of information on pressure, temperature, clouds and aerosols, and trace gas concentrations in the atmosphere. Interpreting the satellite measurements requires solving an inverse modelling problem based on variational methods and a forward model evaluating the radiative transfer equations. As state-of-the-art satellite measurement campaigns require Petascale systems to process the data in due time, graphics processing units are employed for the high-throughput problem of computing the forward model for a given atmospheric state. We explore features of the considered architecture as well as relevant performance signatures of the different implementations to improve our understanding of the opportunities for efficient exploitation of GPU-accelerated architectures based on the POWER processor for this class of applications. Scalability is a key aspect, as the application is known to scale well on massively-parallel architectures.
Heterogeneous computing is being widely adopted in HPC systems, first through GPGPUs and manycore accelerators, and currently through the early adoption of FPGA systems. The FETHPC MANGO project emerged to analyze future heterogeneous manycore architectures for HPC. In this paper we describe the details of the heterogeneous architecture proposed in the MANGO project. The paper focuses mainly on the description of the interconnect design, the accelerator layout strategy, and the design of manycore accelerators for specific target applications within MANGO. We conclude the paper with the description of one of the targeted applications within MANGO.
In this paper, we suggest a different methodology to shorten the code optimization development time while obtaining a unified code with good performance on the different targeted devices. In the scope of this study, experiments are illustrated on a Discontinuous Galerkin code applied to Computational Fluid Dynamics. Tests are performed on CPUs, KNL Xeon Phi and GPUs, where the performance comparison confirms that the GPU optimization guideline also leads to efficient versions on CPU and Xeon Phi for this kind of scientific application. Based on these results, we finally suggest a methodology to end up with an efficient hybridized CPU–GPU implementation.
This paper presents an efficient parallel and vectorized implementation of three different selection functions (Roulette Wheel, I-Roulette and DS-Roulette) for tour construction (the most time-consuming part of the Ant Colony Optimization bio-inspired metaheuristic), targeting two Intel multi-core processors and the Knights Corner Intel Xeon Phi coprocessor. The results show that our best implementation (with I-Roulette as the selection function) on the Xeon Phi 7120P runs up to 78.98× faster than its sequential counterpart on a Xeon v2 CPU.
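For context, the classic Roulette Wheel selection used in ACO tour construction is sketched below with a cumulative-sum formulation, which is the vectorisation-friendly way to express it; this is the textbook algorithm, not the paper's optimised Xeon Phi code.

```python
import numpy as np

def roulette_wheel(weights, rng):
    """Classic Roulette Wheel selection: the next city is drawn with probability
    proportional to its selection weight (pheromone x heuristic in ACO).
    Formulated with a cumulative sum so it maps well to SIMD hardware."""
    cum = np.cumsum(weights)
    r = rng.random() * cum[-1]
    return int(np.searchsorted(cum, r))

rng = np.random.default_rng(42)
weights = np.array([0.1, 0.4, 0.2, 0.3])    # selection weights (need not be normalised)
print([roulette_wheel(weights, rng) for _ in range(10)])
```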
A complex network is a set of entities in a relationship, modeled by a graph where nodes represent entities and edges between nodes represent relationships. Graph algorithms have inherent characteristics, including data-driven computations and poor locality. These characteristics expose graph algorithms to several challenges, because most well-studied (parallel) abstractions and implementations are not suitable for them. This work shows how to use complex-network properties, including community structure and heterogeneity of node degree, to tackle one of the main challenges in graph analysis applications: improving performance through proper memory management and appropriate thread scheduling. In this paper, we first propose Cn-order, a heuristic that combines the advantages of the most recent algorithms (Gorder, Rabbit and NumBaCo) to reduce cache misses in graph algorithms. Second, we propose deg-scheduling, a degree-aware scheduling policy to ensure load balancing in parallel graph applications. We then propose comm-deg-scheduling, an improved version of deg-scheduling that uses Cn-order to take the graph order into account in scheduling. Experimental results on a 32-core NUMA machine (NUMA4), with PageRank and the LiveJournal graph for example, show that Cn-order used with deg-scheduling (comm-deg-scheduling) outperforms the recent orders: with 32 threads, time is reduced by 26.81% compared to Gorder, 17.28% compared to NumBaCo and 11.53% compared to Rabbit.
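A minimal sketch of the degree-aware scheduling idea is shown below: vertices are distributed so that the accumulated degree per thread stays balanced, which matters when the degree distribution is highly skewed. The greedy heaviest-first rule is an illustrative assumption, not the paper's exact deg-scheduling algorithm.

```python
import heapq

def deg_schedule(degrees, n_threads):
    """Degree-aware scheduling sketch: assign vertices, heaviest first, to the
    thread with the smallest accumulated degree (greedy LPT-style balancing)."""
    heap = [(0, t, []) for t in range(n_threads)]        # (load, thread id, vertices)
    heapq.heapify(heap)
    for v in sorted(range(len(degrees)), key=lambda v: -degrees[v]):
        load, t, verts = heapq.heappop(heap)
        verts.append(v)
        heapq.heappush(heap, (load + degrees[v], t, verts))
    return {t: (load, verts) for load, t, verts in heap}

# Toy skewed (power-law-like) degree distribution across 4 threads.
degrees = [1000, 500, 200, 100, 50, 20, 10, 5, 5, 5, 2, 2, 1, 1, 1, 1]
for t, (load, verts) in sorted(deg_schedule(degrees, 4).items()):
    print(f"thread {t}: total degree {load}, vertices {verts}")
```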
The amount of genomic data produced by DNA sequencing is growing at an unprecedented rate due to the ever greater throughput provided by next-generation sequencing (NGS) platforms. To understand and interpret biological data, a huge amount of metadata and annotations is required. Genomic metadata include very heterogeneous biological and clinical attributes gathered at different levels of detail, such as diseases, genes, proteins, interactions and pathways, but also phenotypic characterizations of patients and clinical evidence or diagnoses. All this information is required to draw a comprehensive picture of the many underlying phenomena, thus contributing to a scientific understanding of the observed data. The complexity of genomic data arises from the number of involved entities (from millions to billions) and the complex relationships between them; biological information is typically highly connected, non-uniform, semi-structured and unpredictable. Relationships and connections may be stored in a relational database, and data can be extracted through traversal-type queries that imply joins. Nevertheless, joins with large tables easily become too cumbersome and computationally expensive to design, execute and maintain. This critical aspect makes relational databases unsuitable for this kind of data operation. Studies have already suggested that graph databases are among the best choices for exploring linked data, due to the design of the core engine, which optimizes performance in exploring connections. They also provide a flexible solution for the integration and exploration of multiple levels of biological information, and some specific sets of biological data are already structured as graphs.
The local Extrapolated Diffusion (EDF) method has previously been studied for a torus network using communication among 4 adjacent neighbors. In the present paper we study the performance of the same method with the extension that we add four additional edges to the central node in position (i,j) of the torus network. We develop the EDF method for an 8-regular torus network. The method uses two sets of parameters for each node in order to increase its rate of convergence. The conventional way to analyze the convergence of the Diffusion method is to use matrix analysis, an approach that depends heavily upon the Laplacian matrix being circulant. However, the Laplacian matrix of our method does not have this property. To circumvent this problem we use Fourier analysis to determine optimum values for the sets of parameters via closed-form formulae, resulting in the maximization of the rate of convergence. It is shown that the optimum value of the convergence factor depends only upon the dimensions of the torus. Moreover, by keeping one dimension of the torus fixed and increasing the other, the rate of convergence of the EDF method with 8 adjacent neighbors is also increased compared to the EDF method with 4 adjacent neighbors.
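For background, local diffusion schemes of this kind are built on the first-order update below, in which each node exchanges load only with its adjacent neighbors; the EDF variant accelerates it with per-node extrapolation parameters whose optimal closed-form values are what the Fourier analysis in the paper derives.

```latex
% First-order local diffusion update (background form, not the paper's full EDF scheme):
% node i exchanges load u_i only with its adjacent neighbours N(i) -- 4 or, here, 8 on the torus.
\[
  u_i^{(t+1)} \;=\; u_i^{(t)} \;+\; \sum_{j \in N(i)} c_{ij}\,\bigl(u_j^{(t)} - u_i^{(t)}\bigr)
\]
% The EDF method additionally applies extrapolation parameters per node to this update
% to increase the rate of convergence.
```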