Ebook: Co-Scheduling of HPC Applications
High-performance computing (HPC) has become an essential tool in the modern world. However, systems frequently run well below theoretical peak performance, with only 5% being reached in many cases. In addition, costly components often remain idle when not required for specific programs, as parts of HPC systems are reserved for and used exclusively by individual applications.
A project funded by the German Federal Ministry of Education and Research (BMBF) was started in 2013 to find ways of improving system utilization by compromising on dedicated reservations for HPC codes and applying co-scheduling of applications instead. The need was recognized for an international discussion to find the best solutions to this HPC utilization issue, and a workshop on co-scheduling in HPC, open to international participants – the COSH workshop – was held for the first time at the European HiPEAC conference in Prague, Czech Republic, in January 2016.
This book presents extended versions of papers submitted to the workshop, reviewed a second time to ensure scientific quality. It also includes an introduction to the main challenges of co-scheduling, a foreword by Arndt Bode, head of LRZ, one of Europe's leading computing centers, and a chapter corresponding to the invited keynote speech by Intel, whose recent processor extensions allow for better control of co-scheduling.
High-performance computing (HPC) has become an important part of the modern world: it is used in almost every industry today to improve products by simulating new prototypes, and in the academic world it is an essential tool for scientific research. However, systems often run far below their theoretical peak performance: in many cases only five percent of a machine's peak performance is reached. In addition, costly components often remain idle while not being required for specific programs, as parts of HPC systems are reserved for and used exclusively by individual applications.
To further improve the state of the art in this research area, a project funded by the German Federal Ministry of Education and Research (BMBF) was started in 2013. The main idea was to improve system utilization by compromising on dedicated reservations for HPC codes and to apply co-scheduling of applications instead.
As key research partners within this project, we observed a need for international discussion to find the best solutions to this utilization issue in high-performance computing: the approach taken by most research groups and hardware vendors is the opposite, as they try to switch off idle components, which can be quite difficult in practice. To this end, we (the editors of this book) started to organize a workshop on Co-Scheduling in HPC, COSH, which is open to international participants and was held for the first time at the European HiPEAC conference in Prague in January 2016.
This book mainly consists of significantly extended versions of all papers submitted to this workshop. They were reviewed a second time to ensure high scientific quality. At COSH 2016 we had an invited keynote speech by Intel on recent extensions to their processors which allow for better control of co-scheduling. We are happy to include a corresponding chapter in this book, as well as a foreword by Arndt Bode, head of one of the leading European computing centers, LRZ. Together with André Brinkmann, the consortium leader of the above-mentioned research project, we open this book with an introduction to the main challenges of co-scheduling as well as related research in the field.
Carsten Trinitis and Josef Weidendorfer
November 2016
In this chapter, we explain our view on the benefits of using co-scheduling for high-performance computing in the future. To this end, we start by describing the issues with the current situation and the motivation for our approach. Then we define what we see as the main requirements for co-scheduling. Finally, we list the challenges we see on the way to effective and beneficial co-scheduling in compute centers.
In this way, we want to make the reader aware of the research required in the context of co-scheduling, and to set the stage for the following chapters, which focus on specific parts of the general topic of this book. It should become clear that we are only at the start of the research needed for effective co-scheduling on current petascale systems, future exascale systems, and beyond.
Moore's law has driven processor development for several decades and has steadily increased the number of transistors that fit on a processor die. Despite this growth, the operating frequency of processors has stagnated or is even decreasing due to energy and thermal constraints. The available transistors are nowadays used to increase the number of cores and the width of SIMD units, and to integrate adjacent technologies such as memory controllers and the network fabric. Except in rare cases, typical applications do not utilize all available resources of large supercomputers: some resources are under-utilized while others are stressed by the application. Co-scheduling strives to solve this problem by scheduling together several applications that demand different components of the same resource, which can lead to better overall utilization of the system. For effective co-scheduling, however, the execution environment has to provide quality-of-service measures to ensure that applications do not inadvertently influence each other. Cache Allocation Technology is one of the building blocks to achieve this isolation.
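As an illustration of this isolation mechanism, the following minimal sketch partitions the shared L3 cache between two co-scheduled applications through the Linux resctrl interface, which exposes Intel's Cache Allocation Technology. It assumes a Linux system with resctrl mounted at /sys/fs/resctrl; the group names, process IDs, and way masks are purely illustrative.

```python
# Minimal sketch: partition the shared L3 cache between two co-scheduled
# applications via the Linux "resctrl" interface (Cache Allocation Technology).
# Assumes resctrl is mounted at /sys/fs/resctrl; PIDs, group names, and way
# masks below are illustrative only and depend on the actual hardware.
import os

RESCTRL = "/sys/fs/resctrl"

def make_partition(name: str, pid: int, l3_mask: str) -> None:
    """Create a resctrl group, restrict it to the given L3 way mask,
    and move the application's PID into it."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Restrict this group to a subset of the L3 ways on cache domain 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")
    # Assign the application's tasks to the group.
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    # Hypothetical PIDs of two co-scheduled applications.
    make_partition("latency_sensitive", pid=12345, l3_mask="ff0")  # upper ways
    make_partition("streaming",         pid=12346, l3_mask="00f")  # lower ways
```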
This paper presents a fast and simple contention-aware scheduling policy for CMP systems that relies on information collected at runtime with no additional hardware support. Our approach is based on a classification scheme that detects activity and possible interference across the entire memory hierarchy, including both shared caches and memory links. We schedule multithreaded applications taking their class into account, targeting both total system throughput and application fairness in terms of a fair distribution of co-execution penalties. We have implemented a user-level scheduler and evaluated our policy in several scenarios with different contention levels and a variety of multiprogrammed, multithreaded workloads. Our results demonstrate that the proposed scheduling policy outperforms the established Linux and Gang schedulers, as well as a number of research contention-aware schedulers, in both system throughput and application fairness.
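To make class-based co-scheduling concrete, the sketch below pairs applications from complementary classes so that two memory-bound applications never share a socket. It is not the policy evaluated in this chapter; the metrics, thresholds, and class names are assumptions chosen for illustration.

```python
# Illustrative sketch (not the authors' actual policy): classify applications
# by their measured pressure on the shared cache and memory link, then pair
# applications from complementary classes on the same socket.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class App:
    name: str
    llc_misses_per_kilo_instr: float   # pressure on the memory link
    llc_refs_per_kilo_instr: float     # pressure on the shared cache

def classify(app: App) -> str:
    # Thresholds are assumptions for the example, not measured values.
    if app.llc_misses_per_kilo_instr > 5.0:
        return "memory-bound"          # stresses the memory link
    if app.llc_refs_per_kilo_instr > 20.0:
        return "cache-sensitive"       # lives in the shared cache
    return "compute-bound"             # mostly core-private resources

def pair_for_coscheduling(apps: List[App]) -> List[Tuple[App, App]]:
    """Greedy pairing: never co-run two memory-bound applications."""
    memory = [a for a in apps if classify(a) == "memory-bound"]
    other = [a for a in apps if classify(a) != "memory-bound"]
    pairs = []
    while memory and other:
        pairs.append((memory.pop(), other.pop()))
    leftovers = memory + other
    while len(leftovers) >= 2:
        pairs.append((leftovers.pop(), leftovers.pop()))
    return pairs
```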
Heading towards exascale, the challenges for process management with respect to flexibility and efficiency grow accordingly. Running more than one application simultaneously on a node can be a solution for better resource utilization. However, co-scheduling can also be a way to gain a degree of flexibility in process management that enables some kind of interactivity even in the domain of high-performance computing. This chapter gives an introduction to such co-scheduling policies for running multiple MPI sessions concurrently and interactively within a single user allocation. The chapter first introduces a taxonomy for classifying the different characteristics of such flexible process management and then discusses concrete manifestations thereof. In doing so, real-world examples are motivated and presented by means of ParaStation MPI, a high-performance MPI library supplemented by a complete framework comprising a scalable and dynamic process manager. In particular, four scheduling policies implemented in ParaStation MPI are detailed and evaluated using a benchmarking tool that has been developed specifically for measuring interactivity and dynamicity metrics of job schedulers and process managers for high-performance computing.
In recent years, the cost of power consumption in HPC systems has become a relevant factor. In addition, most applications running on supercomputers achieve only a fraction of a system's peak performance. It has been demonstrated that co-scheduling applications can improve overall system utilization and energy efficiency. Co-scheduling here means that more than one job is executed simultaneously on the same nodes of a system. However, co-scheduled applications need to fulfill certain criteria so that mutual slowdown is kept to a minimum. We observe that when threads from different applications run on individual cores of the same multi-core processor, any mutual influence is mainly due to sharing the memory hierarchy.
In this paper, we propose a simple approach for assessing the memory access characteristics of an application, which allows us to estimate its mutual influence with other co-scheduled applications. We compare this with the stack reuse distance, another metric for characterizing memory access behavior. Furthermore, we present a set of libraries and a first HPC scheduler prototype that automatically detects an application's main memory bandwidth utilization and prevents the co-scheduling of multiple main-memory-bandwidth-limited applications. We demonstrate that our prototype achieves almost the same performance as the manually tuned co-schedules of our previous work.
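For readers unfamiliar with the stack reuse distance mentioned above, the following sketch computes it for a recorded address trace: for each access, the distance is the number of distinct addresses touched since the previous access to the same address. Real measurements derive the trace from binary instrumentation; the short trace below is a stand-in.

```python
# Sketch: computing the stack reuse distance of every access in an address
# trace. First-time accesses have infinite distance (reported as None).
# The quadratic implementation is for illustration only.
from collections import OrderedDict

def reuse_distances(trace):
    stack = OrderedDict()            # most recently used address last
    for addr in trace:
        if addr in stack:
            # Distance = number of distinct addresses used since last access.
            keys = list(stack.keys())
            dist = len(keys) - 1 - keys.index(addr)
            del stack[addr]
            yield dist
        else:
            yield None
        stack[addr] = True

if __name__ == "__main__":
    trace = [0x10, 0x20, 0x30, 0x10, 0x20, 0x20]
    print(list(reuse_distances(trace)))   # [None, None, None, 2, 2, 0]
```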
Co-scheduling processes on different cores in the same server might lead to excessive slowdowns if they use the same shared resource, like a memory bus. If possible, processes with high shared-resource use should be allocated to different server nodes to avoid contention and thus slowdown. This article proposes the more general principle that twins, i.e. several instances of the same program, should be allocated to different server nodes. The rationale for this is that instances of the same program use the same resources and are thus likely to be either both low or both high resource users. High resource users should obviously not be combined, but, somewhat non-intuitively, it is also shown that low resource users should not be combined either, in order not to miss out on better scheduling opportunities. This is verified both by a probabilistic argument and experimentally, using ten programs from the NAS parallel benchmark suite running on two different systems. By applying the simple rule of forbidding these terrible twins, the average slowdown is shown to decrease from 6.6% to 5.9% for System A and from 9.5% to 8.3% for System B. Furthermore, the worst-case slowdown is lowered from 12.7% to 9.0% and from 19.5% to 13% for Systems A and B, respectively. This indicates a considerable improvement, despite the rule being program-agnostic and having no information about any program's resource usage or slowdown behavior.
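The placement rule itself is simple enough to sketch: spread instances of the same program over different nodes whenever a slot is available. The job representation, node list, and per-node slot count below are assumptions made for the example, not taken from the chapter.

```python
# Sketch of the "no terrible twins" placement rule: instances of the same
# program are spread over different nodes whenever possible.
from collections import defaultdict

def place(jobs, nodes, slots_per_node=2):
    """jobs: list of program names (one entry per instance).
    Returns {node: [program, ...]} avoiding same-program co-location."""
    placement = defaultdict(list)
    for prog in jobs:
        # Prefer a node that still has room and no instance of this program.
        candidates = [n for n in nodes
                      if len(placement[n]) < slots_per_node
                      and prog not in placement[n]]
        if not candidates:  # fall back: any node with a free slot
            candidates = [n for n in nodes if len(placement[n]) < slots_per_node]
        placement[candidates[0]].append(prog)
    return dict(placement)

if __name__ == "__main__":
    print(place(["bt", "bt", "lu", "lu"], nodes=["n0", "n1"]))
    # -> each node gets one "bt" and one "lu" instead of pairing the twins
```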
Only a few high-performance computing (HPC) applications exploit the whole computing power of current supercomputers. This trend is unlikely to change in the near future, since performance increases are expected to stem from explicit parallelism, adding to the burden on application programmers to write efficient code. Co-scheduling is expected to be one means of coping with this issue by revoking the exclusive assignment of jobs to nodes and vice versa. This approach demands dynamic schedules, as jobs coming and going influence the optimal distribution of processes across the cluster. Therefore, we consider virtualization techniques valuable for future supercomputers. On the one hand, they facilitate strong isolation between applications running on the same nodes and enable their seamless migration. On the other hand, recent hardware developments diminish the performance gap between virtualized and native execution.
In this chapter we present a comprehensive study of virtualization applied to HPC. We not only focus on quantitative aspects, estimating the performance impact on common workloads, but also discuss qualitative arguments for and against the deployment of virtualization in this context. Our results reveal that well-configured virtual machines (VMs) can achieve close to native performance. However, the current HPC software stack requires adaptations to support transparent migration and locality-aware communication.
For many years, the number of processing units per compute node has been increasing significantly. To utilize all or most of the available compute resources of a high-performance computing cluster, at least some of its nodes will have to be shared by several applications at the same time. Even if jobs are co-scheduled on a node, compute resources can remain idle although there may be jobs that could make use of them (e.g., if the resource was temporarily blocked when the job was started). Heterogeneous schedulers, which schedule tasks for different devices, can bind jobs to resources in a way that significantly reduces this idle time. Typically, such schedulers make their decisions based on a static strategy.
We investigate the impact of allowing a heterogeneous scheduler to modify its strategy at runtime. For a set of applications, we determine the makespan and show how it is influenced by four different scheduling strategies. A strategy tailored to one use case can be disastrous in another and can consequently even result in a slowdown – in our experiments, by a factor of up to 2.5.
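A minimal sketch of such runtime strategy switching is given below. The concrete strategies and the switching criterion are assumptions for illustration, not the four strategies evaluated in this chapter: the scheduler starts with a static CPU-first placement and switches to a shortest-queue strategy once the device queues diverge.

```python
# Illustrative sketch of a heterogeneous scheduler whose placement strategy
# can be replaced at runtime. Strategies and the switching criterion are
# assumptions chosen for the example.
DEVICES = ["cpu", "gpu"]

def cpu_first(state, task):
    """Static strategy: always place work on the CPU."""
    return "cpu"

def shortest_queue(state, task):
    """Dynamic strategy: place work on the least loaded device."""
    return min(DEVICES, key=lambda d: state["queue_len"][d])

class DynamicScheduler:
    def __init__(self, strategy):
        self.strategy = strategy
        self.state = {"queue_len": {d: 0 for d in DEVICES}}

    def submit(self, task):
        device = self.strategy(self.state, task)
        self.state["queue_len"][device] += 1
        # Runtime adaptation: once the queues diverge too far, switch to a
        # load-balancing strategy instead of keeping the static one.
        q = self.state["queue_len"]
        if max(q.values()) - min(q.values()) > 4:
            self.strategy = shortest_queue
        return device

if __name__ == "__main__":
    sched = DynamicScheduler(cpu_first)
    print([sched.submit(f"task-{i}") for i in range(10)])
    # The first five tasks pile up on the CPU, then the scheduler adapts.
```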