
Ebook: High Performance Computing: From Grids and Clouds to Exascale

In the last decade, parallel computing technologies have transformed high-performance computing. Two trends have emerged: massively parallel computing leading to exascale on the one hand, and moderately parallel applications, which have opened up high-performance computing for the masses, on the other. The availability of commodity hardware components, a wide spectrum of parallel applications in research and industry, and user-friendly management and development tools have enabled access to parallel and high-performance computing for a broad range of end users, from research and academia to mid-market industries. This book presents the proceedings of the biennial High Performance Computing workshop held in Cetraro, Italy, in June 2010. It is divided into four sections: state-of-the-art and future scenarios, grids and clouds, technologies and systems, and lastly, applications. The conference addressed the problems associated with the efficient and effective use of large systems. The papers included cover a wide range of topics, from algorithms and architectures through grid and cloud technologies to applications and infrastructures for e-science.
During the last decade, parallel computing technologies have transformed high-performance computing. Two trends are now clearly visible: massively parallel computing leading to exascale, and moderately parallel applications opening up high-performance computing for the masses. In particular, the availability of commodity hardware components, a wide spectrum of parallel applications in research and industry, and user-friendly management and development tools have enabled access to parallel and high-performance computing for a broad range of end users from research, academia and mid-market industry.
The majority of notebooks and standard PCs today incorporate multiprocessor chips with up to eight processors. This number is expected to reach twelve and more in the near future. These standard components allow the construction of high-speed parallel systems in the petascale range at a reasonable cost. The number of processors incorporated in such systems is of the order of 10^4 to 10^6.
Due to the flexibility offered by parallel systems constructed with commodity components, these can easily be linked through wide area networks, for example the Internet, to realise Grids or Clouds. Such networks can be accessed by a wide community of users from many different disciplines to solve compute-intensive and data-intensive problems requiring high-speed computing resources.
The problems associated with the efficient and effective use of such large systems were the theme of the biennial High Performance Computing workshop held in June 2010 in Cetraro, Italy. A selection of the papers presented at the workshop is collected in this book. They cover a range of topics, from algorithms and architectures through Grid and Cloud technologies to applications and infrastructures for e-science.
The editors wish to cordially thank all the authors for preparing their contributions as well as the reviewers who supported this effort with their constructive recommendations.
Ian Foster, USA
Wolfgang Gentzsch, Germany
Lucio Grandinetti, Italy
Gerhard R. Joubert, Netherlands/Germany
4 June 2011
With the widespread availability of high-speed networks, it becomes feasible to outsource computing to remote providers and to federate resources from many locations. Such observations motivated the development, from the mid-1990s onwards, of a range of innovative Grid technologies, applications, and infrastructures. We review the history, current status, and future prospects for Grid computing.
Recent developments in the field of supercomputing in Germany and Europe are presented. JUGENE, the highly scalable IBM Blue Gene/P system at the Jülich Supercomputing Centre (JSC) and the first petaflop supercomputer in Europe, started operation in July 2009. Additionally, the JSC has installed a 300-teraflop supercomputer, JUROPA, a next-generation general-purpose cluster system following JSC's p690 JUMP. JSC designed and constructed JUROPA in a two-year effort, supported by its industrial partners Bull, SUN, INTEL, Mellanox, Novell and ParTec. One third of JUROPA is named HPC-FF, a system for fusion research. After achieving rank 10 in the TOP500 list and rank 2 in Europe, JUROPA/HPC-FF became operational in August 2009. With these systems, the JSC has finally realized its concept of a dual system complex, which is adapted to meet the requirements of the NIC application portfolio, both flexibility and scalability, as effectively as possible. Additionally, the JSC co-designed, benchmarked and is now hosting a 100-teraflop/s QPACE system developed by the SFB/TR 55 “Hadron Physics”, the most energy-efficient supercomputer of 2009 and 2010. Coordinated by the Jülich Supercomputing Centre, the preparation phase project for the European supercomputing infrastructure managed by the Partnership for Advanced Computing in Europe (PRACE) was finalized by mid-2010. It is followed by the first implementation phase project, again coordinated by the JSC, which started in mid-2010. In spring 2010 the statutes of the European Supercomputing Research Infrastructure were signed, and JUGENE started contributing to PRACE as its first supercomputer in August 2010.
Significant advancements in hardware and software will be required to reach Exascale computing in this decade. The confluence of changes needed in both architectures and applications has created an opportunity for co-design. This paper offers a co-design methodology for high-performance computing and describes several tools that are being developed to enable co-design.
The current dynamic development of heterogeneous (CPU+GPU) computing and its applications to scientific, engineering and business problems owes its success to several factors. One of them is the maturity of parallel computing after many years of struggle and experimentation with different parallel computer architectures. The second is the relatively low price of processors and our ability to put many of them on a single chip. The third, equally important, factor is the structure of very many numerical mathematics algorithms, which contain highly parallelizable operations whose processing can be accelerated by using massively parallel GPUs and multicore CPUs. In this paper we provide an overview of the field and simple but realistic examples. The paper is targeted at beginner CUDA users. We show simple source code for vector addition on the GPU. This example does not cover advanced CUDA usage, such as shared memory accesses, divergent branches, memory coalescing optimization or loop unrolling. To illustrate performance, we present results of matrix-matrix multiplication where some of these optimization techniques were used to gain an impressive speedup.
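For illustration only, a minimal CUDA vector-addition program of the kind referred to above might look like the following sketch; the kernel name, problem size and launch configuration are chosen for exposition and are not taken from the paper.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each thread computes one element of the result: c[i] = a[i] + b[i].
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;                     // illustrative problem size
        const size_t bytes = n * sizeof(float);

        // Host buffers
        float *h_a = (float*)malloc(bytes);
        float *h_b = (float*)malloc(bytes);
        float *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Device buffers and host-to-device transfers
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);             // expected: 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

Each thread handles exactly one element, and the grid size is rounded up so that the whole vector is covered; this is the simplest data-parallel pattern on which the more advanced optimizations mentioned above build.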
Although there are many production-level service and desktop grids, they are usually not able to interoperate. The European FP7 EDGeS project aimed at integrating service grids (SG) and desktop grids (DG) to merge their benefits. The established production infrastructure includes a gLite-based service grid as well as BOINC and XtremWeb desktop grids. The paper describes the bridging technology developed in EDGeS for both the DG→SG and SG→DG directions, at both the middleware and application levels.
The growing number of commercial and scientific clouds strongly suggests that in the near future users will be able to combine cloud services to build new services, a plausible scenario being the case where users need to aggregate capabilities provided by different clouds. In such scenarios, it will be essential to provide virtual networking technologies that enable providers to support crosscloud communication and users to deploy crosscloud applications. This chapter describes one such technology, its salient features and remaining challenges. It also makes the case for crosscloud computing, discussing its requirements, challenges, possible technologies and applications.
Using cloud technologies, it is possible to provision HPC services on demand. Customers of the service are able to provision virtual HPC systems through a self-service portal and to deploy and execute their specific applications without operator intervention. The business model foresees charging only for the resources actually used. Open questions remain in the areas of performance optimization, advanced resource management and fault tolerance. The Open Cirrus cloud computing testbed offers an environment in which we can address these problems.
Nowadays cloud computing is a popular paradigm for providing software, platform and infrastructure as a service to consumers. It has been observed that the capacity of personal computers (PCs) is seldom fully utilized. The desktop cloud is a novel approach to resource harvesting in a heterogeneous, non-dedicated desktop environment. This paper discusses a virtual infrastructure manager and a scheduling framework to leverage idle PCs, with the permission of the PC owners. Prima facie, VirtualBox is the best-suited hypervisor to serve as the backbone of the private desktop cloud architecture. A consumer is able to submit a lease to be deployed on idle resources, launching a computation abstracted as a virtual machine (VM) or a virtual cluster using virtualization. In this approach, the role of the Scheduler is to balance the requirements of both the resource providers and the resource consumers of the cloud in a non-dedicated, heterogeneous environment. In addition, the permission of the PC owners is taken into account, and consumers expect the best possible performance during the whole session. From the consumer's point of view, a prototype implementation of desktop clouds is useful for submitting lease requirements to the scheduler, e.g. for running HPC applications.
This work discusses the scheduling technique for Virtual Infrastructure Management (VIM) and virtual cluster launching in private desktop clouds; in addition, the Virtual Disk Preservation Mode (DPM) and its relation to virtual cluster deployment time are explained. In all, it is quite challenging to harness the power of idle resources in such a non-dedicated, heterogeneous environment.
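Purely as an illustration of the lease concept described above, and not of the authors' actual interface, a lease request submitted to such a scheduler might carry fields like the following; all names and fields are hypothetical.

    #include <cstdint>
    #include <string>

    // Hypothetical lease request a consumer could submit to the desktop-cloud
    // scheduler; the field names are illustrative, not taken from the paper.
    struct LeaseRequest {
        std::string   vm_image;          // VM or virtual-cluster image to deploy
        unsigned      vm_count;          // 1 for a single VM, >1 for a virtual cluster
        unsigned      vcpus_per_vm;      // CPU cores requested per VM
        std::uint64_t mem_mb_per_vm;     // memory per VM in MB
        unsigned      duration_minutes;  // requested lease duration
        bool          preserve_disk;     // keep the virtual disk after the lease (cf. DPM)
    };

    int main() {
        // Example: request a 4-node virtual cluster for an HPC run.
        LeaseRequest lease{"ubuntu-hpc.vdi", 4, 2, 2048, 120, true};
        return lease.vm_count > 0 ? 0 : 1;
    }

The scheduler would then map such a lease onto idle PCs whose owners have granted permission, balancing the provider and consumer requirements discussed above.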
The complexity of high-end computing has been increasing rapidly, following the exponential increase in processing speed of novel electronic digital technologies. Consequently, software development productivity has been attracting greater attention from the professional community because of its increasing importance for the development of complex software systems and applications. At the same time, component-based technologies have emerged as a modern and promising alternative with a clear potential to significantly improve the productivity of software development, particularly for extreme-scale computing. However, the lack of longer-term experience and the increasing complexity of the target systems demand many more research results in the field. In particular, the search for the most appropriate component model and corresponding programming environments is of high interest and importance. The higher level of complexity involves a wider range of requirements and resources, which demand dedicated support for dynamic properties and flexibility; such support could be provided in an elegant way by adopting a component-based methodology for software development.
General-purpose many-core architectures of the future need new scalable operating system designs and abstractions in order to be managed efficiently. As both memory and processing resources will be ubiquitous, concurrency will be the norm. We present a strategy for operating systems on such architectures, describing the approach we take for our SVP (Self-adaptive Virtual Processor) based Microgrid many-core architecture. Aspects such as scheduling, resource management, memory protection and system services are covered, with a detailed discussion of the design of an I/O subsystem for the Microgrid.
Scientific numerical applications always demand more computing and storage capabilities, in order to compute at a finer grain and/or to integrate more phenomena into their computations, even though they are already becoming more complex to develop. Moreover, the continual growth of computing and storage capabilities is achieved at the price of an increased complexity of infrastructures. Thus, there is an important challenge in defining programming abstractions able to deal with both software and hardware complexity. An interesting approach is represented by software component models. This chapter first analyzes how high-performance interactions are only partially supported by specialized component models. Then, it introduces HLCM, a component model that aims at efficiently supporting all kinds of static compositions.
Current desktop computers are heterogeneous systems that integrate different types of processors. For example, general-purpose processors and GPUs not only have different characteristics but also adopt different programming models. Despite these differences, data parallelism can be exploited on both types of processors by using application programming interfaces such as OpenMP and CUDA, respectively. In this work we propose to use all of these processors collaboratively, thus increasing the amount of data parallelism exploited. In this setup, each processor executes its own optimized implementation of a target application. To achieve this goal, a platform has been developed, composed of a task scheduler and an algorithm for runtime dynamic load balancing using online performance models of the different devices. These models are built without relying on any prior assumptions about the target application or system characteristics. The modeling time is negligible when several instances of a class of applications are executed in sequence or for iterative applications. As a case study, a database application is chosen to illustrate the use of the proposed algorithm for building the performance models and achieving dynamic load balancing. Experimental results clearly show the advantage of collaboratively using a quad-core processor along with a GPU. In practice, a performance improvement of about 42% is achieved by applying the proposed techniques and tools to Query Q3 of the TPC-H Decision Support System benchmark.
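As a minimal sketch of the general idea, and not of the authors' platform, the host-side CUDA/C++ fragment below splits a data-parallel workload between CPU and GPU in proportion to throughputs measured at run time; all function and variable names are invented for illustration, and the two workers are placeholders for the device-specific implementations.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>

    // Online performance model: measured throughput (items/second) per device.
    struct PerfModel { double cpu_rate = 1.0; double gpu_rate = 1.0; };

    // Give the GPU a share of the n items proportional to its measured rate,
    // so that both devices finish at roughly the same time.
    static std::size_t gpuShare(const PerfModel &m, std::size_t n) {
        return static_cast<std::size_t>(n * (m.gpu_rate / (m.cpu_rate + m.gpu_rate)));
    }

    // Placeholder workers; a real system would run the CUDA and OpenMP
    // implementations of the same operator here, concurrently rather than in sequence.
    static void runOnGpu(std::size_t count) { volatile double s = 0; for (std::size_t i = 0; i < count; ++i) s += 1.0; }
    static void runOnCpu(std::size_t count) { volatile double s = 0; for (std::size_t i = 0; i < count; ++i) s += 2.0; }

    // Process one chunk and refine the performance model from the observed timings.
    static void processChunk(PerfModel &m, std::size_t n) {
        std::size_t n_gpu = gpuShare(m, n);
        std::size_t n_cpu = n - n_gpu;

        auto t0 = std::chrono::steady_clock::now();
        runOnGpu(n_gpu);
        auto t1 = std::chrono::steady_clock::now();
        runOnCpu(n_cpu);
        auto t2 = std::chrono::steady_clock::now();

        double gpu_s = std::chrono::duration<double>(t1 - t0).count();
        double cpu_s = std::chrono::duration<double>(t2 - t1).count();
        if (n_gpu > 0 && gpu_s > 0) m.gpu_rate = n_gpu / gpu_s;
        if (n_cpu > 0 && cpu_s > 0) m.cpu_rate = n_cpu / cpu_s;
    }

    int main() {
        PerfModel model;
        for (int iter = 0; iter < 5; ++iter) {
            processChunk(model, 1 << 22);
            std::printf("iteration %d: GPU share = %.2f\n",
                        iter, model.gpu_rate / (model.cpu_rate + model.gpu_rate));
        }
        return 0;
    }

With repeated chunks the split converges toward the ratio of the devices' sustained throughputs, which mirrors, in simplified form, the role of the online performance models described above.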
In several scientific domains large data repositories are generated. To find interesting and useful information in those repositories, efficient data mining techniques must be used. Many scientific fields, such as astronomy, biology, medicine, chemistry and earth science, benefit from data mining analysis. The exploitation of data mining techniques in science helps scientists in hypothesis formation and supports their scientific practice by taking advantage of the knowledge that can be extracted from large data sources. Data mining tasks are often distributed, since they involve data and tools located in geographically distributed environments, such as the Grid. Therefore, it is fundamental to exploit effective paradigms, such as services and workflows, to model data mining tasks that are both multi-staged and distributed. This chapter discusses data mining services and workflows for analyzing scientific data in high-performance distributed environments such as Grids and Clouds. It also presents a workflow formalism and a service-oriented programming framework, named DIS3GNO, for designing and running distributed data mining tasks in the Knowledge Grid.
Architectures for supercomputing have evolved rapidly over the past ten years along essentially two, but possibly converging, tracks: firstly, the symmetric multiprocessing cluster-type architectures (e.g. the IBM p-series and the Cray XMT series), and secondly, the fully distributed node architectures exemplified by the IBM Blue Gene series. Both have advantages and disadvantages. Problems in the science and engineering world have expanded in a fashion exemplified by Parkinson's Law (“work fills the time available”). Science problems, especially those in the physiological domain, are themselves defined over vast ranges of scale lengths, as will be shown in the following sections. In order to solve problems whose scale lengths vary substantially, there are two possible solutions. Either discretise down to the smallest scale, with the possibility of producing data sets and numbers of equations so large that the memory requirements become too large for a specific machine, or divide the problem into a set of appropriate length scales and map these discretised sub-domains onto appropriate machine architectures connected by a fast communication link. The definitions of “appropriate” and “fast” here are at present determined on a case-by-case basis. A generic solution to where the “optimum” boundary should lie between differing architectures is a substantial problem in itself. The two architectures each have their own advantages and disadvantages, and in light of this our group at Canterbury has deliberately linked both compute architectures together to solve a single problem involving the simulation of flow in the cerebro-vasculature. We show that certain mappings of large vascular trees have constraints placed upon them when more than 256 Blue Gene/L processors are used.
Real-Time Online Interactive Applications (ROIA), such as massively multi-player online games, e-Learning applications and high-performance real-time simulations, establish a rapidly growing commercial market and may potentially become a killer application class for future parallel and distributed systems including Clouds. ROIA pose several specific challenges as compared to typical (e.g., numerical) high-performance applications: a) huge numbers of users participating in a single application session (up to 10^4), b) low response times in order to deliver a real-time user experience (from 100 ms for fast-paced action games to 1.5 s for role-playing games), and c) high update rate of the global application state (up to 50 Hz).
In this paper, we aim at achieving scalability of ROIA, i.e. accommodating an increasing number of users by employing additional compute resources, on two major classes of target systems: distributed systems with multiple physical servers, and Cloud systems that provide distributed virtual resources. We develop a fine-grained model for the main performance metrics of ROIA, in particular the response time. We present our Real-Time Framework (RTF) – a development and execution platform for ROIA, and report experimental results on the scalability of RTF-based ROIA and their responsiveness. An additional challenge of ROIA is the changing (e.g., daytime-dependent) load: statically assigned servers may either not cope with the peak user numbers or remain underutilized when these numbers are low. We address this challenge by extending RTF with dynamic management tools which allow us to efficiently use virtual Cloud resources, and report our preliminary experiments on the performance of ROIA execution on the Amazon EC2 Cloud system.
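The paper's own fine-grained performance model is not reproduced here; as a rough, generic illustration of the quantities involved, the response time observed by a user can be decomposed along the lines of

    t_response ≈ t_net_up + t_server(N) + t_net_down,   with   t_server(N) <= 1 / f_update,

where N is the number of users handled by a server, t_net_up and t_net_down are the network transfer times of the user action and of the returned state update, and f_update is the global state update rate (e.g. 50 Hz); the response-time targets quoted above (100 ms to 1.5 s) bound the whole sum.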
A Large-scale Wireless Sensor Network (LWSN), such as an environment monitoring system deployed in a city, could yield data on the order of petabytes each year. Storage and computation of such vast quantities of data pose difficult challenges to the LWSN, particularly because sensors are highly constrained by their scarce resources. Distributed storage and parallel processing are solutions that deal with the massive amount of data by utilizing the collective computational power of the large number of sensors, all while keeping inter-node communication minimized to save energy. In this chapter, we conduct a survey of the state of the art of distributed storage and parallel processing in LWSNs. We focus on the LWSN scenario in which vast amounts of data are collected prior to intensive computation on that data. We argue that current research results in this direction fall into three categories: 1) hierarchical system architectures to support parallel and distributed computation; 2) distributed data aggregation and storage strategies; 3) parallel processing, scheduling and programming methods. We highlight some important results for each of these categories and discuss existing problems and future directions.