
Ebook: High Speed and Large Scale Scientific Computing

During the last decade parallel technologies have completely transformed mainstream computing. The majority of standard PCs today incorporate multi-processor chips with up to four processors, and this number will soon reach eight and more. The flexibility that parallel systems constructed with commodity components offer makes them easy to link through wide area networks, such as the Internet, to realize Grids or Clouds. The immediate benefit of these networks is that they can be accessed by a wide community of users, from many different disciplines, to solve compute intensive or data intensive problems requiring high speed computing resources. High Speed and Large Scale Scientific Computing touches upon issues related to the new area of Cloud computing, discusses developments in Grids, Applications and Information Processing, as well as e-Science. The book includes contributions from internationally renowned experts in these advanced technologies. The papers collected in this volume will be of interest to computer scientists, IT engineers and IT managers interested in the future development of Grids, Clouds and Large Scale Computing.
During the last decade parallel computing technologies transformed mainstream computing. The majority of notebooks and standard PCs today incorporate multi-processor chips with up to four processors. This number is expected to soon reach eight and more. These standard components allow the construction of high speed parallel systems in the petascale range at a reasonable cost. The number of processors incorporated in such systems is of the order of 10^4 to 10^6.
Due to the flexibility offered by parallel systems constructed with commodity components, these can easily be linked through wide area networks, for example the Internet, to realise Grids or Clouds. Such networks can be accessed by a wide community of users from many different disciplines to solve compute intensive and/or data intensive problems requiring high speed computing resources.
The problems associated with the efficient and effective use of such large systems were the theme of the biannual High Performance Computing workshop held in July 2008 in Cetraro, Italy. A selection of papers presented at the workshop is combined with a number of invited contributions in this book.
The papers included cover a range of topics, from algorithms and architectures, through Grid and Cloud technologies, to applications and infrastructures for e-Science.
The editors wish to thank all the authors for preparing their contributions as well as the reviewers who supported this effort with their constructive recommendations.
Wolfgang Gentzsch, Germany
Gerhard Joubert, Netherlands/Germany
Lucio Grandinetti, Italy
Date: 2009-09-10
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, Cholesky factorization and QR factorization, using dynamic data-driven execution. Two emerging approaches to implementing coarse-grain dataflow are examined: the model of nested parallelism, represented by the Cilk framework, and the model of parallelism expressed through an arbitrary Directed Acyclic Graph, represented by the SMP Superscalar framework. Performance and coding effort are analyzed and compared against code manually parallelized at the thread level.
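To make the coarse-grain dataflow idea concrete, the sketch below (an illustration only, not the Cilk or SMP Superscalar code examined in the chapter) expresses a blocked Cholesky factorization as tile-level operations whose inputs and outputs define a task DAG; the tile size nb and the small test matrix are arbitrary choices for the example.

import numpy as np

# Each tile operation below (POTRF, TRSM, SYRK/GEMM) would be one node of the
# dataflow DAG that a runtime such as Cilk or SMP Superscalar schedules.
def tiled_cholesky(A, nb):
    """Factor a symmetric positive definite A into lower-triangular L with
    L @ L.T == A, working on nb x nb tiles (sequential reference version)."""
    n = A.shape[0]
    assert n % nb == 0
    nt = n // nb
    L = np.tril(A).astype(float)                                # work on the lower triangle
    T = lambda i, j: L[i*nb:(i+1)*nb, j*nb:(j+1)*nb]            # view of tile (i, j)
    for k in range(nt):
        T(k, k)[:] = np.linalg.cholesky(T(k, k))                # POTRF on the diagonal tile
        for i in range(k + 1, nt):
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T  # TRSM: A_ik <- A_ik L_kk^-T
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                T(i, j)[:] -= T(i, k) @ T(j, k).T               # SYRK/GEMM trailing update
    return L

A = np.random.default_rng(0).standard_normal((8, 8))
A = A @ A.T + 8 * np.eye(8)                                     # make A positive definite
Lf = tiled_cholesky(A, nb=2)
print(np.allclose(Lf @ Lf.T, A))                                # True

In a dataflow execution, the TRSM tasks of a given step are mutually independent, and a trailing update on tile (i, j) can start as soon as the tiles it reads are ready, so work from several steps can overlap; expressing this lookahead is straightforward with an arbitrary DAG but awkward in a strictly nested fork-join structure.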
The main objective of this chapter is to show the need for algorithmic and scheduling techniques. Even if the resources at our disposal were to become abundant and cheap, not to say unlimited and free (a prospect that is by no means guaranteed), we would still need to assign the right task to the right device. We give several examples of situations where careful resource selection and allocation are mandatory. Finally we outline some important algorithmic challenges that need to be addressed in the future.
Field-programmable gate arrays represent an army of logic units which can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates new challenges in finding the right processing paradigm which takes into account the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. In this paper the use of FPGAs as a multiprocessor on a chip is first compared with their use as a highly functional coprocessor, and the programming tools for hardware/software codesign are discussed. Next a number of techniques are presented to maximize the parallelism and optimize the data locality in nested loops. These include unimodular transformations, data-locality-improving loop transformations and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples. The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors.
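As a generic illustration of one data-locality-improving transformation of the kind mentioned above (not code from the paper), the sketch below tiles a nested matrix-multiplication loop so that each small block is reused many times while it is resident in a local working set, the software analogue of keeping a tile in on-chip memory behind a smart buffer; the tile size t is an assumed parameter.

# Original loop nest: for every (i, j) pair, a full row of A and column of B
# are streamed from "external memory".
def matmul_naive(A, B, C, n):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

# Tiled loop nest: each t x t block of A and B is reused t times while it
# stays in a small local buffer (on an FPGA, a block-RAM buffer).
def matmul_tiled(A, B, C, n, t):
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]

# Tiny check that both versions compute the same product (n = 6, t = 2).
import random
n, t = 6, 2
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
C1 = [[0.0] * n for _ in range(n)]
C2 = [[0.0] * n for _ in range(n)]
matmul_naive(A, B, C1, n)
matmul_tiled(A, B, C2, n, t)
print(all(abs(C1[i][j] - C2[i][j]) < 1e-9 for i in range(n) for j in range(n)))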
This paper presents the results obtained in the project S-Net by members of the Compilation Technology and Computer Architecture (CTCA) group at the University of Hertfordshire, U.K. We argue that globally distributed HPC will require tools for the coordination of asynchronous networked components, and that such coordination can be achieved by reducing the vertex in- and out-degrees of the processing nodes to 1, using single-input single-output combinators for network construction, and by externalising the component state. This approach is presented first as a set of language design principles and then in the form of a coordination language. The language is illustrated by an application example.
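The toy sketch below (plain Python, not S-Net syntax or semantics) illustrates the single-input single-output principle: every component maps one stream to one stream, and networks are composed only with SISO combinators, so every node keeps its in-degree and out-degree at 1; the component functions and the example network are invented for illustration.

# Lift a stateless record transformer into a SISO stream transformer.
def component(f):
    def run(stream):
        for record in stream:
            yield f(record)
    return run

# Serial combinator: the output stream of a becomes the input stream of b.
def serial(a, b):
    return lambda stream: b(a(stream))

# Parallel-choice combinator: route each record to one branch and merge the
# branch outputs back into a single stream, keeping the network SISO.
def choice(pred, a, b):
    def run(stream):
        for record in stream:
            yield from (a if pred(record) else b)(iter([record]))
    return run

# Example network: square even numbers, negate odd ones, then add 1.
net = serial(choice(lambda x: x % 2 == 0,
                    component(lambda x: x * x),
                    component(lambda x: -x)),
             component(lambda x: x + 1))
print(list(net(range(5))))   # [1, 0, 5, -2, 17]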
Scientists from many domains desire to address problems within the next decade that, by all estimates, require computer systems that can achieve sustained exaflop computing rates (i.e., 1×10^18 floating point operations per second) with real-world applications. Simply scaling existing designs is insufficient: analysis of current technological trends suggests that only a few architectural components are on track to reach the performance levels needed for exascale computing. The network connecting computer system nodes presents a particularly difficult challenge because of the prevalence of a wide variety of communication patterns and collective communication operations in algorithms used in scientific applications and their tendency to be the most significant limit to application scalability. Researchers at Oak Ridge National Laboratory and elsewhere are actively working to overcome these network-related scalability barriers using advanced hardware and software design, alternative network topologies, and performance prediction using modeling and simulation.
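A simple analytic model of the kind used in such performance-prediction work already shows why collective operations dominate at this scale; the sketch below applies a standard latency-bandwidth estimate for a tree-based allreduce, with parameter values that are illustrative assumptions rather than measurements from the chapter.

import math

# Tree-based allreduce of m bytes across p nodes, estimated with a classic
# latency/bandwidth model: T(p, m) ~ ceil(log2 p) * (alpha + m / beta).
alpha = 1.5e-6   # per-message latency in seconds (assumed)
beta = 10e9      # link bandwidth in bytes per second (assumed)

def allreduce_time(p, m):
    return math.ceil(math.log2(p)) * (alpha + m / beta)

for p in (10**4, 10**5, 10**6):
    print(f"p = {p:>7}: 8-byte allreduce ~ {allreduce_time(p, 8) * 1e6:.1f} microseconds")

Under these assumptions even an 8-byte allreduce costs tens of microseconds at a million nodes from latency alone, independent of how fast each node computes, which is why communication patterns rather than raw node performance tend to set the scalability limit.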
Dynamic Querying (DQ) is a technique adopted in unstructured Peer-to-Peer (P2P) networks to minimize the number of nodes that must be visited to obtain the desired number of results. In this chapter we describe the use of the DQ technique over a Distributed Hash Table (DHT) to implement a scalable Grid information service. The DQ-DHT (Dynamic Querying over a Distributed Hash Table) algorithm has been designed to perform DQ-like searches over DHT-based networks. The aim of DQ-DHT is two-fold: allowing arbitrary queries to be performed in structured P2P networks, and providing dynamic adaptation of the search according to the popularity of the resources to be located. Through the use of the DQ-DHT technique it is possible to implement a scalable Grid information service supporting both structured search and the execution of arbitrary queries for searching Grid resources on the basis of complex criteria or semantic features.
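The sketch below illustrates the dynamic-querying idea in a deliberately simplified form (it is not the published DQ-DHT algorithm and ignores the DHT overlay entirely): a small sample of nodes is probed first, the popularity of the requested resource is estimated from the answers, and further nodes are contacted in batches sized by that estimate until enough results are found; the node counts and match rate are invented for the example.

import random

def dynamic_query(nodes, matches, desired, probe_size=50):
    """nodes: list of node ids; matches(n): True if node n holds a result."""
    random.shuffle(nodes)
    contacted, results = 0, 0
    # Phase 1: probe a small subset to obtain an initial popularity estimate.
    for n in nodes[:probe_size]:
        contacted += 1
        results += matches(n)
    # Phase 2: contact further nodes in batches sized by the current estimate.
    while results < desired and contacted < len(nodes):
        popularity = max(results, 1) / contacted
        batch = int((desired - results) / popularity)
        for n in nodes[contacted:contacted + batch]:
            contacted += 1
            results += matches(n)
            if results >= desired:
                break
    return results, contacted

# Toy run: 5% of 10,000 nodes hold a matching resource and 100 results are wanted.
nodes = list(range(10_000))
print(dynamic_query(nodes, lambda n: n % 20 == 0, desired=100))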
In the process of developing grid applications, people often need to evaluate the robustness of their work. Two common approaches are simulation, where one can evaluate the software and predict its behavior under conditions usually unachievable in a laboratory experiment, and experimentation, where the actual application is launched on a real grid. However, simulation may miss unpredictable behaviors because of the abstractions it introduces, and experimentation does not guarantee a controlled and reproducible environment.
In this chapter, we propose an emulation platform for parallel and distributed systems, including grids, where both the machines and the network are virtualized at a low level. The use of virtual machines allows us to perform highly accurate failure injection, since we can "destroy" virtual machines, while network virtualization provides low-level network emulation. Failure accuracy is a criterion that indicates how realistic an injected fault is. The accuracy of our framework is evaluated through a set of micro-benchmarks and a very stable P2P system called Pastry, since we are particularly interested in the publication and resource discovery mechanisms of grid systems.
This paper presents an overview of the DEISA2 project, its vision, mission and objectives, and the DEISA infrastructure and services offered to the e-science community. The different types of applications that specifically benefit from this infrastructure and these services are discussed, and the DEISA Extreme Computing Initiative for supercomputing applications is highlighted. Finally, we analyse the DEISA sustainability strategy and present lessons learned.
This paper is about UNICORE, a European Grid Technology with more than 10 years of history. Originating from the Supercomputing domain, the latest version UNICORE 6 has matured into a general-purpose Grid technology that follows established Grid and Web services standards and offers a rich set of features to its users. An architectural insight into UNICORE is given, highlighting the workflow features as well as the different client options. The paper closes with a set of example use cases and e-infrastructures where the UNICORE technology is used today.
The use of virtualization, along with efficient virtual machine management, creates a new virtualization layer that isolates the service workload from the resource management. The integration of the cloud within the virtualization layer can be used to support on-demand resource provisioning, providing elasticity in modern Internet-based services and applications, and allowing the service capacity to be adapted dynamically to variable user demands. Cluster and grid computing environments are two examples of services which can obtain great benefit from these technologies. Virtualization can be used to transform a distributed physical infrastructure into a flexible and elastic virtual infrastructure, separating resource provisioning from job execution management, and dynamically adapting the cluster or grid size to the users' computational demands. In particular, in this paper we analyze the deployment of a computing cluster on top of a virtualized infrastructure layer which combines a local virtual machine manager (the OpenNebula engine) and a cloud resource provider (Amazon EC2). The solution is evaluated using the NAS Grid Benchmarks in terms of the processing overhead due to virtualization, the communication overhead due to the management of nodes across different geographic locations, and the elasticity of the cluster processing capacity.
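A minimal sketch of the kind of elasticity policy such a virtualized cluster enables is given below; it does not use the actual OpenNebula or Amazon EC2 interfaces, and the local capacity limit and jobs-per-worker target are assumed values chosen only for illustration.

# Size the virtual cluster from the job backlog, filling local capacity first
# and overflowing to the external cloud provider only when it is exhausted.
LOCAL_CAPACITY = 16    # maximum worker VMs on the local infrastructure (assumed)
JOBS_PER_WORKER = 4    # target number of queued jobs per worker (assumed)

def plan_cluster_size(pending_jobs):
    """Return the desired (local, cloud) worker counts for the current backlog."""
    wanted = max(1, -(-pending_jobs // JOBS_PER_WORKER))   # ceiling division
    local = min(wanted, LOCAL_CAPACITY)
    cloud = max(wanted - LOCAL_CAPACITY, 0)
    return local, cloud

print(plan_cluster_size(10))    # (3, 0): a light load stays on local resources
print(plan_cluster_size(100))   # (16, 9): the overflow goes to the cloud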
This paper examines issues related to the execution of scientific applications, and in particular computational workflows, on Cloud-based infrastructure. The paper describes the layering of application-level schedulers on top of Cloud resources, which enables grid-based applications to run on the Cloud. Finally, the paper examines issues of Cloud data management that support workflow execution. We show how various ways of handling data have an impact on the cost of the overall computation.
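A back-of-the-envelope sketch of why the data-handling strategy affects cost is given below; the prices, data volumes and compute hours are invented for illustration and are not figures from the paper.

# Compare keeping intermediate workflow data in cloud storage between runs
# with deleting it and re-transferring it for every new run.
PRICE_CPU_HOUR = 0.10   # dollars per instance-hour (assumed)
PRICE_TRANSFER = 0.12   # dollars per GB moved into the cloud (assumed)
PRICE_STORAGE = 0.03    # dollars per GB-month of cloud storage (assumed)

def monthly_cost(compute_hours, gb_transferred, gb_stored):
    return (compute_hours * PRICE_CPU_HOUR
            + gb_transferred * PRICE_TRANSFER
            + gb_stored * PRICE_STORAGE)

print("keep data in the cloud:", monthly_cost(200, gb_transferred=50, gb_stored=500))
print("re-transfer every run :", monthly_cost(200, gb_transferred=550, gb_stored=0))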
Cloud computing is an upcoming field in distributed systems. Unlike grid computing, the most prominent drivers in this field come from industry; the best-known cloud computing vendors include Amazon, Google and salesforce.com. Their offerings were originally developed to solve in-house problems, such as how to simplify systems management in extremely large environments. Now they offer their technology as a platform to the public.
While web companies have embraced the new platforms very quickly, the question remains whether they are suitable platforms for enterprise HPC. In this chapter we assess the impact of cloud computing technology on enterprises. We present PHASTGrid, an integration platform that allows enterprises to move their workload seamlessly between different internal and external resources. In addition we present our solution for software license management in virtualized environments.
Based on our experience we conclude that cloud computing with its on-demand, pay-as-you-go offerings can help enterprises to solve computationally intense problems. The current nature of the offerings limits the range of problems that can be executed in the cloud, but different offerings will be available in the future. ISVs need to adjust their software licensing model in order to be able to offer their software in these environments.
Interest in cloud computing has grown significantly over the past few years, both in the commercial and non-profit sectors. In the commercial sector, various companies have advanced economic arguments for the installation of cloud computing systems to service their clients' needs. This paper focuses on non-profit educational institutions and analyzes operational data collected over the past several years from the Virtual Computing Laboratory (VCL) at NC State University. The preliminary analysis of the VCL data suggests a model for designing and configuring a cloud computing system to serve both the educational and research missions of the university in a very economical and cost-efficient manner.
When the individual parts of a large system are described in isolation, they can all seem very different; only when they are combined does the big picture emerge. The same is true for today's technologies: looking at the many facets of Grid and Cloud computing separately, no concrete picture may evolve, yet addressing all of these issues together can lead to a system that is able to manage services on demand. This paper therefore tries to provide the big picture of what is today called Cloud computing, focusing on the different faces of Clouds, their business models, and the handling of services within them.
Aneka is a platform for deploying Clouds and developing applications on top of them. It provides a runtime environment and a set of APIs that allow developers to build .NET applications that run their computations on either public or private clouds. One of the key features of Aneka is its ability to support multiple programming models, that is, ways of expressing the execution logic of applications through specific abstractions. This is accomplished by creating a customizable and extensible service-oriented runtime environment represented by a collection of software containers connected together. By leveraging this architecture, advanced services including resource reservation, persistence, storage management, security, and performance monitoring have been implemented. On top of this infrastructure, different programming models can be plugged in to provide support for different scenarios, as demonstrated by engineering, life science, and industry applications.
A novel approach to scientific investigations, besides the analysis of individual phenomena, integrates different, interdisciplinary sources of knowledge about a complex system to obtain an understanding of the system as a whole. This innovative way of doing research, called system-level science, requires advanced methods and tools for enabling collaboration of research groups. This paper presents a new approach to the development and execution of collaborative applications. These applications are built as experiment plans with a notation based on the Ruby language. The virtual laboratory, which is an integrated system of dedicated tools and servers, provides a common space for planning, building, improving and performing in-silico experiments by a group of developers. The applications are built with elements called gems, which are available on the distributed Web- and Grid-based infrastructure. The process of application development and the functionality of the virtual laboratory are demonstrated with a real-life example of the drug susceptibility ranking application from the HIV treatment domain.
We describe a suite of data mining tools that cover clustering, information retrieval and the mapping of high dimensional data to low dimensions for visualization. Preliminary applications are given to particle physics, bioinformatics and medical informatics. The data vary in dimension from low (2-20) and high (thousands) to undefined (sequences with dissimilarities but no vectors defined). We use deterministic annealing to provide more robust algorithms that are relatively insensitive to local minima. We discuss the structure of the algorithms and their mapping to parallel architectures of different types, and look at the performance of the algorithms on three classes of system: multicore, cluster and Grid, using a MapReduce-style algorithm. Each approach is suitable in different application scenarios. We stress that data analysis/mining of large datasets can be a supercomputer application.
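For readers unfamiliar with the technique, the sketch below is a compact sequential illustration of deterministic annealing clustering in general (not the authors' parallel implementation): assignments are kept soft at a temperature T that is gradually lowered, which makes the result far less sensitive to local minima than plain k-means; the temperature schedule and the toy data are assumptions made only for the example.

import numpy as np

def da_cluster(X, k, T0=10.0, Tmin=0.01, cooling=0.9, iters=20, seed=0):
    """Deterministic annealing clustering: soft (Gibbs) assignments at
    temperature T, with T lowered gradually towards a hard clustering."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    T = T0
    while T > Tmin:
        for _ in range(iters):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances, N x k
            p = np.exp(-(d2 - d2.min(1, keepdims=True)) / T)           # stabilized Gibbs weights
            p /= p.sum(1, keepdims=True)                               # soft assignments
            centers = (p.T @ X) / p.sum(0)[:, None]                    # weighted means
        T *= cooling                                                   # lower the temperature
    return centers, p.argmax(1)

# Toy data: two well-separated Gaussian blobs in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
centers, labels = da_cluster(X, k=2)
print(np.round(centers, 2), np.bincount(labels))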
Today's state-of-the-art cluster supercomputers include commodity components such as multi-core CPUs and graphics processing units. Together, these hardware devices provide unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. High-performance computing applications are always in search of lower execution times, greater system utilization, and better efficiency, which means that developers will need to leverage these disruptive technologies in order to take advantage of modern cluster computers' full potential processing power. New application models and middleware systems are needed to ease the developer's task of writing programs which efficiently use this processing capability. Here, we present the implementation of a biomedical image analysis application which serves as a case study for the development of applications for modern heterogeneous supercomputers. We present detailed application-specific optimizations which we generalize and combine with new programming models into a blueprint for future application development. Our techniques show good success executing on a modern heterogeneous GPU cluster providing 10 TFLOPS of peak processing capability.
An algorithm is presented which can be used to "link" two differing computer architectures together in what has become known as System Level Acceleration (SLA). This allows for a substantial increase in the complexity of the problems that can be solved and in the range of scales that can be treated, a range that often inhibits the scientist or engineer. Two examples of problems that can be solved via SLA are presented: that of coupled cells in a vascular domain, and the modeling of blood auto-regulation in the cerebro-vasculature.