Ebook: High Performance Computing and Grids in Action
Advancement of science and technology and its impact on real life applications is more and more strictly related to the progress and availability of high performance parallel computer systems and grids, the novel networked infrastructures aimed to organize and optimize the use of a huge amount of distributed data processing and computing resources. This book collects in four chapters single monographs related to the fundamental advances in parallel computer systems and their future developments from different points of view (from computer scientists, computer manufacturers, end users) and related to the establishment and evolution of grids fundamentals, implementation and deployment. The aim is to cover different points of view in the field by actors playing different roles, to orchestrate and correlate their interconnection and coherence, and - above all - to show behaviors, impacts, performances of architectures, systems, services and organizations in action. Accordingly, the expected audience would be broad, mainly made up by computer scientists, Ph.D. students, post-doc researchers, specialists of computing and data centers, computer engineers and architects, project leaders, information system planners and professional technologists.
Advancement of Science and Technology and its impact on the real life applications is more and more strictly related to the progress and availability of high performance parallel computer systems and Grids, the novel networked infrastructures aimed to organize and optimize the use of a huge amount of distributed data processing and computing resources.
The book collects in four chapters a selection of papers presented at the 2006 International Advanced Research Workshop on High Performance Computing and Grids in Cetraro, Italy; these papers are related to some fundamental advances in parallel computer systems and their future developments, and to the establishment and evolution of Grid fundamentals, implementation, and deployment.
The aim is to cover different points of view in the field by actors playing different roles, to orchestrate and correlate their interconnection and coherence, and – above all – to show behaviours, impacts, performances of architectures, systems, services and organizations in action.
Accordingly, the expected audience is broad, made up mainly of computer scientists, graduate students, postdoctoral researchers, specialists of computing and data centers, computer engineers and architects, project leaders, information system planners, and professional technologists.
The book structure will help the reader to accomplish the objective mentioned above. An introductory monograph by I. Foster discusses the general and appealing issue of when and how to advance the state of the art in scientific software and infrastructure. Chapters 1 and 2 deal with some fundamental topics on High Performance Computing and Grids. Chapter 3 surveys different aspects of Grid Computing use, in particular tools and services for making grids available and deployable. Chapter 4 deals with applications; these in general are an indispensable motivation for researching and developing new fields of science and related technologies.
It is my pleasure to thank and acknowledge the contributions of the many individuals who have made this book possible.
Gerhard Joubert, the Editor of the IOS Press book series “Advances in Parallel Computing” which now includes this volume, for his support, encouragement and advice on editorial issues and for writing the Foreword.
Jack Dongarra, Ian Foster, Geoffrey Fox, Miron Livny, among others, for advising on scientific issues before, during and after the workshop organization.
My colleagues from the Center of Excellence on High Performance Computing and the Department of Electronics, Informatics and Systems at University of Calabria, Italy are thanked for their constructive comments.
I feel indebted to Maria Teresa Guaglianone for her dedication in handling the book's material and for maintaining good and effective relationships with the contributing Authors and the Publisher.
Finally, the support of Hewlett Packard towards the book publication is acknowledged and in particular the persistent consideration and help given by Frank Baetke.
Professor and Director, Center of Excellence on High Performance Computing, University of Calabria, Italy
Licklider advocated in 1960 the construction of computers capable of working symbiotically with humans to address problems not easily addressed by humans working alone. Since that time, many of the advances that he envisioned have been achieved, yet the time spent by human problem solvers in mundane activities remains large. I propose here four areas in which improved tools can further advance the goal of enhancing human intellect: services, provenance, knowledge communities, and automation of problem-solving protocols.
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures and the Cell BE are presented.
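The idea summarized above can be illustrated with a minimal sketch of mixed-precision iterative refinement. This is not the chapter's code: it uses only the Python standard library, simulates 32-bit arithmetic by rounding through `struct`, and the helper names (`f32`, `solve_low_precision`, `refine`) are assumptions made for illustration. For brevity it repeats the 32-bit elimination at every step, whereas a practical implementation would factor the matrix once in 32-bit and reuse the factors.

```python
import struct

def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting; every operation is rounded to 32 bits."""
    n = len(A)
    # Augmented matrix [A | b], stored in simulated 32-bit precision.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    # Back substitution, still in simulated 32-bit precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def residual(A, x, b):
    """Compute r = b - A x in full 64-bit precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]

def refine(A, b, iterations=5):
    """Start from a 32-bit solution and correct it using 64-bit residuals."""
    x = solve_low_precision(A, b)
    for _ in range(iterations):
        r = residual(A, x, b)           # 64-bit residual
        d = solve_low_precision(A, r)   # cheap 32-bit correction
        x = [xi + di for xi, di in zip(x, d)]  # 64-bit update
    return x
```

For a small, well-conditioned system, `refine(A, b)` should agree with the exact solution far more closely than `solve_low_precision(A, b)` alone, because only the inexpensive residual and update steps are carried out in 64-bit arithmetic.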
This paper describes a machine/programming model for the era of multi-core chips. It is derived from the sequential model but replaces sequential composition with concurrent composition at all levels in the program except at the level where the compiler is able to make deterministic decisions on scheduling instructions. These residual sequences of instructions are called microthreads and they are small code fragments that have blocking semantics. Dependencies that would normally be captured by sequential programming are captured in this model using dataflow synchronisation on variables in the contexts of these microthreads. The resulting model provides a foundation for significant advances in computer architecture as well as operating systems and compiler development. The paper takes a high-level perspective on the field of asynchronous distributed systems and comes to the conclusion that dynamic and concurrent models are the only viable solution but that these should not necessarily be visible to the users of the system.
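As a rough illustration of dataflow synchronisation on variables, the sketch below uses ordinary Python threads as stand-ins for microthreads. It is only an analogy, not the model described in the paper: the class `SyncVar` and the functions `run_fragments` and `demo` are hypothetical names introduced here. Each fragment blocks when it reads a variable that has not yet been written, so the order in which fragments are started does not affect the result.

```python
import threading

class SyncVar:
    """A write-once variable; reads block until a value has been written."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def write(self, value):
        self._value = value
        self._ready.set()      # wake every fragment waiting on this variable

    def read(self):
        self._ready.wait()     # blocking semantics: suspend until written
        return self._value

def run_fragments(fragments):
    """Start all fragments concurrently and wait for them to finish."""
    threads = [threading.Thread(target=f) for f in fragments]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def demo():
    a, b, c, d = SyncVar(), SyncVar(), SyncVar(), SyncVar()
    # Fragments are deliberately listed consumers first; the data
    # dependencies alone determine the effective execution order.
    run_fragments([
        lambda: d.write(c.read() * 2),
        lambda: c.write(a.read() + b.read()),
        lambda: a.write(3),
        lambda: b.write(4),
    ])
    return d.read()   # (3 + 4) * 2 = 14
```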
In this chapter we describe an end-to-end workflow management system that enables scientists to describe their large-scale analysis in abstract terms, then maps and executes the workflows in an efficient and reliable manner on distributed resources. We describe Pegasus and DAGMan and various workflow restructuring and optimizations they perform and demonstrate the scalability and reliability of the approach using applications from astronomy, gravitational-wave physics, and earthquake science.
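To make the notion of executing an abstract workflow concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). It does not use or reproduce the Pegasus or DAGMan interfaces; the function `run_workflow`, its `max_retries` parameter, and the task names in the usage note are assumptions chosen for illustration. Tasks run only after their prerequisites have completed, and a failing task is retried a bounded number of times.

```python
from graphlib import TopologicalSorter

def run_workflow(dependencies, actions, max_retries=2):
    """Run every task after its prerequisites, retrying failures.

    dependencies: dict mapping a task name to the set of tasks it depends on.
    actions:      dict mapping a task name to a zero-argument callable.
    Returns the list of task names in the order they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises CycleError on a cyclic graph
    completed = []
    while sorter.is_active():
        for task in sorter.get_ready():   # tasks whose prerequisites are done
            for attempt in range(max_retries + 1):
                try:
                    actions[task]()
                    break
                except RuntimeError:
                    if attempt == max_retries:
                        raise             # give up after the last retry
            completed.append(task)
            sorter.done(task)             # unlock dependent tasks
    return completed
```

A workflow such as `{"calibrate": {"extract"}, "analyze": {"calibrate"}}` would then run `extract`, `calibrate`, and `analyze` in that order. A production system would additionally map each task to a concrete resource and run independent tasks in parallel, which this sketch omits.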
Large-scale scientific computing is playing an ever-increasing role in critical decision-making and dynamic, event-driven systems. While some computation can simply wait in a job queue until resources become available, key decisions concerning life-threatening situations must be made quickly. A computation to predict the flow of airborne contaminants from a ruptured railcar must guide evacuation rapidly, before the results are meaningless. Although not as urgent, on-demand computing is often required to leverage a scientific opportunity. For example, a burst of data from seismometers could trigger urgent computation that then redirects instruments to focus data collection on specific regions, before the opportunity is lost. This paper describes the challenges of building an infrastructure to support urgent computing. We focus on three areas: the needs and requirements of an urgent computing system, a prototype urgent computing system called SPRUCE currently deployed on the TeraGrid, and future technologies and directions for urgent computing.
In widely distributed systems generally, and in science-oriented Grids in particular, software, CPU time, storage, etc., are treated as “services” – they can be allocated and used with service guarantees that allow them to be integrated into systems that perform complex tasks. Network communication is currently not a service – it is provided, in general, as a “best effort” capability with no guarantees and only statistical predictability.
In order for Grids (and most types of systems with widely distributed components) to be successful in performing the sustained, complex tasks of large-scale science – e.g., the multi-disciplinary simulation of next generation climate modeling and management and analysis of the petabytes of data that will come from the next generation of scientific instruments (which is very soon for the LHC at CERN) – networks must provide communication capability that is service-oriented: that is, it must be configurable, schedulable, predictable, and reliable. In order to accomplish this, the research and education network community is undertaking a strategy that involves changes in network architecture to support multiple classes of service; development and deployment of service-oriented communication services; and monitoring and reporting in a form that is directly useful to the application-oriented system so that it may adapt to communications failures.
In this paper we describe ESnet's approach to each of these – an approach that is part of an international community effort to have intra-distributed system communication be based on a service-oriented capability.
Recent developments in grid middleware and infrastructure have made it possible for a new generation of scientists, e-Scientists, to envisage and design large-scale computational experiments. However, while scheduling and execution of these experiments has become common, developing, deploying and maintaining application software across a large distributed grid remains a difficult and time consuming task. Without simple application deployment, the potential of grids cannot be realized by grid users. In response, this paper presents the motivation, design, development and demonstration of a framework for grid application deployment. Using this framework, e-Scientists can develop platform-independent parallel applications, characterise and identify suitable computational resources and deploy applications easily.
Grids are built by communities who need a shared cyberinfrastructure to make progress on the critical problems they are currently confronting. An e-science portal is a conventional Web portal that sits on top of a rich collection of web services that allow a community of users access to shared data and application resources without exposing them to the details of Grid computing. In this chapter we describe a service-oriented architecture to support this type of portal.
The problem of scheduling a parallel application onto a heterogeneous distributed computing environment is one of the most challenging problems in a grid. Two main problems arise: how to efficiently discover resources capable of providing the demanded computing services and how to schedule them. We propose a new fully decentralized solution for both resource brokering and scheduling in a grid based on a peer-to-peer model, which considers resource brokering and scheduling as parts of a single activity, in order to achieve high resource utilization and low response times. The proposed scheduling solution uses the relationship among neighbours to let each node learn structural and functional information about the resource-sharing environment. This mechanism makes the system an intelligent organism, characterized by the capability of self-learning about its environment.
Scientific computing is moving rapidly from a world of “reliable, secure parallel systems” to a world of distributed software, virtual organizations and high-performance, though unreliable parallel systems with few guarantees of availability and quality of service. This transformation poses daunting scaling and reliability challenges and necessitates new approaches to collaboration, software development, performance measurement, system reliability and coordination. This paper describes Renaissance approaches to solving some of today's most challenging scientific and societal problems using Grids and parallel systems, supported by rich tools for performance analysis, reliability assessment and workflow management.
This paper argues for the need to provide more flexibility in the level of service offered by Grid-enabled high-performance, parallel, supercomputing resources. It is envisaged that such a need could be satisfied by making separate Service Level Agreements (SLAs) between the resource owner and the user who wants to submit and run a job on these resources. A number of issues related to the materialization of this vision are highlighted in the paper.
TeraGrid is a national-scale computational science facility supported through a partnership among thirteen institutions, with funding from the US National Science Foundation. Initially created through a Major Research Equipment Facilities Construction (MREFC) award in 2001, the TeraGrid facility began providing production computing, storage, visualization, and data collections services to the national science, engineering, and education community in January 2004. In August 2005 NSF funded a five-year program to operate, enhance, and expand the capacity and capabilities of the TeraGrid facility to meet the growing needs of the science and engineering community through 2010. This paper describes TeraGrid in terms of the structures, architecture, technologies, and services that are used to provide national-scale, open cyberinfrastructure. The focus of the paper is specifically on the technology approach and use of middleware for the purposes of discussing the impact of such approaches on scientific use of computational infrastructure. While there are many individual science success stories, we do not focus on these in this paper. Similarly, there are many software tools and systems deployed in TeraGrid but our coverage is of the basic system middleware and is not meant to be exhaustive of all technology efforts within TeraGrid. We look in particular at growth and events during 2006 as the user population expanded dramatically and reached an initial “tipping point” with respect to adoption of new “grid” capabilities and usage modalities.
Scientists and, more generally, end users of computer systems need to be able to trust the data they use. Understanding the origin or provenance of data can provide this trust. Attempts have been made to develop systems for recording provenance; however, most are not generic and cannot be applied in a general manner across different systems and different technologies. Moreover, many existing systems confuse the concept of provenance with its representation. In this article, we discuss an open, technology-neutral model for provenance. The model can be applied across many different systems and makes the important distinction between provenance and the way it can be generated from a concrete representation of process. The model is described and applied to a grid-based example bioinformatics application.
We review the emergence of a diverse collection of modern Internet-scale programming approaches, collectively known as Web 2.0, and compare these to the goals of cyberinfrastructure and e-Science. e-Science has had success following the Enterprise development model, which emphasizes sophisticated XML formats, WSDL and SOAP-based Web Services, complex server-side programming tools and models, and qualities of service such as security, reliability, and addressing. Unfortunately, these approaches have limits on deployment and sustainability, as the standards and implementations are difficult to adopt and require developers and support staff with a high degree of specialized expertise. In contrast, Web 2.0 applications have demonstrated that simple approaches such as (mostly) stateless HTTP-based services operating on URLs, simple XML network message formats, and easy to use, high level network programming interfaces can be combined to make very powerful applications. Moreover, these network applications have the very important advantage of enabling “do it yourself” Web application development, which favors general programming knowledge over expertise in specific tools. We may conservatively forecast that the Web 2.0 approach will supplement existing cyberinfrastructure to enable broader outreach. Potentially, however, this approach may transform e-Science endeavors, enabling domain scientists to participate more directly as codevelopers of cyberinfrastructure rather than serving merely as customers.
In this chapter, we describe the BabelPeers project. The idea of this project is to develop a system for Grid resource description and matching, which is semantically rich while maintaining scalability and reliability. This is achieved by the distribution of resource data over a p2p network, combined with sophisticated mechanisms for query processing, reasoning, and load balancing.
We start by describing the benefits of semantically expressive languages for Grid resource description, and then continue to explain our methods, and how they help to maintain scalability. This includes distributed data storage and query processing strategies. Reliability is given by replication in the p2p network. Special emphasis is given on novel methods for load balancing and efficient query processing. Finally, we present benchmarks and simulations to show the good performance of the BabelPeers system.
Data integration is a key issue for exploiting the availability of large, heterogeneous, distributed and highly dynamic data volumes on Grids. In the Grid, a centralized structure for coordinating all the nodes may not be efficient because it becomes a bottleneck when a large number of nodes are involved and, most of all, it does not benefit from the dynamic and distributed nature of Grid resources. In this chapter, we present a decentralized service-based data integration architecture for Grid databases that we refer to as Grid Data Integration System (GDIS). The main concern of GDIS is the reconciliation of schema heterogeneity among data sources. GDIS adopts the decentralized XMAP integration approach that can effectively exploit the available Grid resources and their dynamic allocation. The GDIS architecture offers a set of Grid services that allow users to submit queries over a single database and receive the results from multiple databases that are semantically correlated with the former one. The underlying model of such architecture is discussed and its implementation based on the OGSA Globus architecture is described. Moreover, we show how it fits the XMAP formalism/algorithm.
Grids encourage and promote the publication, sharing and integration of scientific data, distributed across Virtual Organizations. Scientists and researchers work on huge, complex and growing datasets. The complexity of data management within a grid environment comes from the distribution, heterogeneity and number of data sources. Along with coarse-grained services (basically grid storages, replica services, storage resource managers, etc.), there is a strong interest in fine-grained ones concerning, for instance, grid-database access and management. Moreover, as grid computing, technologies and standards evolve, more mature environments (production grids such as EGEE) become available for production-based activities, and tools and services able to access relational databases in a grid are also strongly required. We describe the Grid Relational Catalog (GRelC) Project, a framework for grid data management, highlighting its approach, architecture, components, services and technological issues.
Service-oriented application development has gained wide acceptance in the Grid computing community. A number of projects and community efforts have been developing standards (OGSA and WSRF), tools, and middleware systems to support development, deployment, and execution of Grid Services. In this paper, we present the use of Grid Services in two application scenarios: servicing of large-scale data from simulation-based oil reservoir management studies and computer-aided analysis of large microscopy images for neuroblastoma prognosis. We describe how the high-performance backend systems have been developed for these applications and how these backend systems have been exposed through Grid Service interfaces. Our work illustrates that with the help of appropriate middleware systems, it is possible to develop and deploy high-performance Grid Services that support storage, retrieval and processing of large-scale scientific and biomedical datasets, while enabling remote access to the data via well-defined interfaces and protocols.
A very important topic in the effort of deploying workflow applications in grid environments is finding a means to describe the complex application structures that is simple for users yet expressive enough to provide the workflow scheduler with the information needed to make good scheduling decisions. A workflow description language defines syntax and semantics for specifying workflow tasks and their relationships; thus it provides an abstract and formalized representation of the complex workflow structures in text format. However, the existing workflow description languages focus on expressiveness with respect to describing the data and control flow of workflow structures. They lack the ability to specify resource request information in support of resource allocations for workflow tasks by the scheduler. In addition, we have found that many of the features requested by end users are not, or are only partially, supported in current workflow languages.
In this paper, we present a high-level abstract language for domain scientists to describe workflow applications in grid environments, the Grid Application Modeling and Description Language (GAMDL). GAMDL associates resource request information with a workflow description so that a workflow scheduler can make resource co-allocation requests based only on the workflow description itself. In terms of expressiveness, GAMDL is able to describe data-flow structures of complex domain problems, and also allows the definition of control-flow logic within the data-flow. Designed to be intuitive and suitable for users without a background in grid computing, GAMDL provides features that are not available in other languages.
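The benefit of attaching resource requests to a workflow description can be sketched as follows. This is not GAMDL syntax, which is not shown here; the `Task` class, its `cores` field, and the functions `levels` and `co_allocation_requests` are hypothetical constructs used only to illustrate how a scheduler might derive co-allocation requests from the description alone.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    cores: int                                   # resource request attached to the task
    depends_on: list = field(default_factory=list)

def levels(tasks):
    """Group tasks into levels; tasks within one level do not depend on each other."""
    by_name = {t.name: t for t in tasks}
    depth = {}

    def depth_of(name):
        # Depth is 0 for tasks without dependencies, otherwise one more
        # than the deepest task they depend on.
        if name not in depth:
            deps = by_name[name].depends_on
            depth[name] = 1 + max((depth_of(d) for d in deps), default=-1)
        return depth[name]

    grouped = {}
    for t in tasks:
        grouped.setdefault(depth_of(t.name), []).append(t)
    return [grouped[d] for d in sorted(grouped)]

def co_allocation_requests(tasks):
    """Cores to reserve together for each level of concurrently runnable tasks."""
    return [sum(t.cores for t in level) for level in levels(tasks)]

workflow = [
    Task("prepare", cores=4),
    Task("simulate_a", cores=64, depends_on=["prepare"]),
    Task("simulate_b", cores=32, depends_on=["prepare"]),
    Task("merge", cores=8, depends_on=["simulate_a", "simulate_b"]),
]
print(co_allocation_requests(workflow))   # [4, 96, 8]
```

Because the two simulation tasks can run side by side, the scheduler in this sketch would request 96 cores for the second stage without needing any information beyond the workflow description.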