The distributed computing infrastructure known as “The Grid” has been undoubtedly one of the most successful science-oriented large-scale IT projects in the past 20 years. The basic idea of the Grid is to take individually managed, network-connected computing centres to the next evolutionary level —a “self-managing, seamless large computer entity” sharing various types of resources and open to all research communities. This is done through simple user interfaces and web portals, successfully hiding the underlying structure of the individual computing centres, and owing to a fast network, making access to the resources independent of their geographical location. These once bold-sounding promises and ambitious goals are today a reality and the Grid is viewed as an ordinary, albeit quite powerful, computational tool by its users.
The Grid had the usual growth pattern of a large project: many years elapsed from its inception to full-blown operational status. A rich development history goes back to the first steps of the Condor project in 1988, passing through several important milestones: the Globus project in 1997, the European Union DataGrid in 2001 and its successor, the EGEE project, the gLite Grid middleware release in 2006, and beyond. Today, after a decade of physical growth, enlarging the existing computing centres and constantly adding new ones, it is a fully operational international entity, encompassing several hundred computing sites on all continents, giving access to hundreds of thousands of CPU cores and hundreds of petabytes of storage, all connected through robust national and international scientific networks. It has evolved to become the main computational platform for many scientific communities: mathematical sciences, physics, geology, Earth observation and a number of life-science disciplines.
Several key elements have contributed to the success of the Grid. The primary one is the early adoption of standard interfaces and open-source technologies, allowing a variety of development groups and individual programmers to contribute to a common Grid middleware. Its functionality is further enriched by specific tools serving the needs of individual groups, all of which use the common middleware to access the underlying resources. The second element is its fully distributed nature, presenting no single point of failure and permitting new computing centres to join, or old ones to leave the Grid, with minimal effort and minimal impact on the overall operational status. The third is the commitment of the national and international funding agencies to support the operation and growth of the individual computing centres comprising the Grid, as well as the middleware developer groups. This has resulted in robust capacity growth: the year-on-year increase of CPU and storage has been 25% and 20%, respectively, enough to fulfil the ever-increasing computing needs of the user communities. It also allows the Grid software to continuously evolve, readily and rapidly absorbing new technologies and computational models, for example Cloud and supercomputing resources, and even utilising volunteer desktop spare CPU cycles through projects like LHC@Home. The fourth element is the network, both LAN and WAN. The ever-increasing capabilities of network hardware (NICs, switches and routers) allow the design and deployment of computing centres with high local throughput between CPU and storage nodes, using readily available off-the-shelf components. The performance characteristics of such centres reach and sometimes surpass those of dedicated, purpose-built high-performance computing clusters at a fraction of the price. The WAN aspect has been even more astonishing: through national investment in scientific networking, the computing centres now have unprecedented data connectivity. International projects facilitate network peering between countries, thus enabling the movement of petabytes of data between computing centres with dedicated links, or terabytes between any pair of computing centres around the world. The fast progress of the network was not foreseen in the original Grid charter, where the network paths were functionally linked with specific roles of the computing facilities, thus introducing the tiered centre structure.
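As a rough illustration of what such sustained growth implies (the ten-year horizon in the sketch below is an assumption for the sake of the example, not a figure from the funding agencies), the quoted rates compound into roughly an order of magnitude more capacity over a decade:

    # Illustrative only: compound effect of the quoted year-on-year growth
    # rates (25% for CPU, 20% for storage) over an assumed ten-year period.
    cpu_rate, storage_rate = 0.25, 0.20
    years = 10

    cpu_factor = (1 + cpu_rate) ** years          # ~9.3x more CPU capacity
    storage_factor = (1 + storage_rate) ** years  # ~6.2x more storage capacity

    print(f"CPU capacity after {years} years:     x{cpu_factor:.1f}")
    print(f"Storage capacity after {years} years: x{storage_factor:.1f}")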
The tiered centre structure deserves a bit of elaboration. The MONARC model assigned a specific tiered hierarchy to the individual computing centres. At the top of the pyramid are the Tier 1s with the fastest network connections, serving as data collection and distribution points. Going down the structure are the Tier 2s and Tier 3s, serving progressively less data-intensive tasks and linked to specific regional centres, usually the country's Tier 1. All data paths were strictly prescribed in the model, creating a complex hierarchical infrastructure and requiring special data management tools. The rapid growth of network capacity, which provided the Tier 1s, 2s and 3s with fast connections both within a region and internationally, considerably reduced the dependency of tasks on the tiered structure. All centres could effectively execute any task, almost independently of the data volume involved. Operationally, the Grid structure became almost flat.
The availability of fast networks contributed to two additional positive aspects of the Grid. The first is bridging the “digital divide”: already narrowed by the open model of software development, it is further reduced by ubiquitous networking, which allows many new and emerging countries to deploy and operate Grid-enabled computing resources, thus erasing the boundaries between regions with strong computing traditions and newcomers to the field. The second aspect is the establishment of network-related projects, for example LHCONE, aimed at further improving connectivity and collaboration across country borders. These projects make extensive use of Virtual Routing and Forwarding (VRF) technology to create direct, efficient and secure networks for the Grid centres across the globe.
Ample CPU, storage and network resources, as well as mature and simple-to-use Grid middleware, allow the execution of a wide and complex variety of data- and CPU-intensive tasks. Taking as an example the CERN Large Hadron Collider experiments, since the start of operation in 2010 the amount of processed data has reached volumes of a few EB (10^18 bytes) per year and hundreds of centuries of wall CPU time. This moved Grid CPU and data management into the realm of “Big data”, fully consistent with the characteristics applied to this term. The amount of data stored and analysed by other scientific projects has also seen a manifold increase over the past years. Furthermore, future-generation experiments like the Square Kilometre Array (SKA) radio telescope project and the High Altitude Water Cherenkov (HAWC) Experiment will produce and process data volumes of hundreds of PB yearly, unprecedented by today's standards. These experiments are poised to use Grid technology, suitably evolved, for storage and processing. This assures the existence of the Grid as a main platform for scientific computation well into the next decade and likely beyond.
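To put such volumes in perspective, the short sketch below converts a yearly data volume into the average sustained transfer rate it implies; the two volumes used are round illustrative figures, not numbers reported by any particular experiment.

    # Back-of-the-envelope conversion of a yearly data volume into the
    # average sustained transfer rate the infrastructure must support.
    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    def avg_rate_gbps(bytes_per_year: float) -> float:
        """Average sustained rate in gigabits per second."""
        return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e9

    for label, volume in [("1 EB/year", 1e18), ("100 PB/year", 1e17)]:
        print(f"{label:>12} -> ~{avg_rate_gbps(volume):.0f} Gb/s sustained on average")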
The rapid growth of the individual computing centres forming the Grid presents new challenges for resource management. The standard batch systems and storage management solutions are no longer able to cope with the sheer number of CPU cores and petabytes of disk deployed at each of these centres. Emerging new technologies, usually associated with Cloud computing, are being actively adopted on the Grid. OpenStack and OpenNebula, to mention two of the most actively used Cloud management solutions, are used today to orchestrate the installations of large and small data centres. These open-source projects not only overcome the limitations of the traditional batch systems, but also open new possibilities in resource provisioning with increasing complexity and functionality. Coupled with VM and container technology, they allow the centres to offer “on demand” capabilities, from providing different OS images, memory and disk capacities to specific security and network settings, fulfilling diverse user requirements at the same time and on the same physical hosts. Just a few years ago, some of these capabilities did not exist or required complex, expert-intensive interventions. The ease of deployment of the new management technologies allows resource providers to switch from traditional to Cloud capabilities with minimal downtime, assuring continuous and efficient use of the installed hardware.
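As a minimal sketch of what such on-demand provisioning looks like from the user side, the snippet below boots a virtual machine through the openstacksdk Python client; the cloud profile, image, flavor and network names are hypothetical placeholders rather than the settings of any particular centre.

    # Minimal sketch of on-demand VM provisioning with openstacksdk.
    # The cloud profile, image, flavor and network names are hypothetical
    # placeholders; real values come from the site's clouds.yaml and catalogue.
    import openstack

    conn = openstack.connect(cloud="my-grid-site")        # reads clouds.yaml

    image = conn.compute.find_image("CentOS-7-x86_64")    # requested OS image
    flavor = conn.compute.find_flavor("m1.large")         # CPU/memory/disk profile
    network = conn.network.find_network("grid-internal")  # tenant network

    server = conn.compute.create_server(
        name="worker-node-001",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(f"Provisioned {server.name} with status {server.status}")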
On the storage side, the advent of distributed object store solutions like Hadoop and Ceph allows for better scaling of the installed storage capacities and, at the same time, minimises the possibility of data loss. This is achieved by using block device images, striped and replicated across the entire storage cluster. Both capabilities are becoming increasingly important as storage continues to grow and remains the most expensive item in computing centre installations. Simple methods for data redundancy like RAID arrays or multiple file replicas are no longer affordable. Other novel approaches to storage include federated installations, where the individual storage capacities of many computing centres are managed together and present a single point of entry to the Grid middleware. This is also a Cloud-inspired approach, which simplifies the structure of the Grid. The software tools used for federated storage are either part of the Cloud software stack or purpose-built storage solutions like xrootd and EOS, developed and deployed by the High Energy Physics community.
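The idea behind striping and replication can be conveyed with a toy placement sketch; it is purely conceptual, does not reproduce the actual placement algorithms of Ceph or Hadoop, and the node names and sizes are invented for the example.

    # Toy illustration of striping and replication in an object store.
    # Conceptual only: each stripe of an object is hashed to a primary node
    # and replicated on the following nodes, so no single node loss destroys data.
    import hashlib

    NODES = [f"osd-{i:02d}" for i in range(8)]    # hypothetical storage nodes
    STRIPE_SIZE = 4                               # bytes per stripe (tiny, for demo)
    REPLICAS = 3

    def place(object_name: str, data: bytes):
        """Return a mapping: stripe index -> list of nodes holding its replicas."""
        layout = {}
        stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
        for idx, _stripe in enumerate(stripes):
            key = f"{object_name}:{idx}".encode()
            primary = int(hashlib.sha256(key).hexdigest(), 16) % len(NODES)
            layout[idx] = [NODES[(primary + r) % len(NODES)] for r in range(REPLICAS)]
        return layout

    print(place("run2024/event-data.root", b"some detector payload"))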
The ever-increasing need for computing resources has turned the attention of the Grid developer and user communities to industrial resource providers, and more specifically to Elastic Clouds like Amazon Elastic Compute Cloud, Google Cloud Platform and European Cloud resource providers. Several projects aim at providing Grid middleware interfaces to the various Cloud solutions, and successful pilot runs using Cloud resources have taken place in the past two years. The intended use of these resources is to bring further “elasticity” to the capacity of the Grid, especially in periods of high resource demand. Clouds are also used by scientific projects directly through the Cloud interfaces, usually implemented according to the Open Cloud Computing Interface (OCCI) standard. Computing centres nowadays provide both Grid and Cloud interfaces to their resources, as the software is now mature enough to minimise the burden of supporting various high-level interfaces to the underlying computing capacity.
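A schematic example of talking to such an OCCI endpoint is given below, using the text/occi HTTP rendering; the endpoint URL, authentication token and attribute values are hypothetical, and real deployments differ in detail.

    # Schematic OCCI request to instantiate a compute resource over HTTP.
    # The endpoint, authentication token and attribute values are hypothetical;
    # real OCCI-enabled sites differ in ports, schemes and required mixins.
    import requests

    ENDPOINT = "https://cloud.example.org:11443"   # hypothetical OCCI endpoint

    headers = {
        "Content-Type": "text/occi",
        "X-Auth-Token": "<token>",                 # placeholder credential
        "Category": ('compute; '
                     'scheme="http://schemas.ogf.org/occi/infrastructure#"; '
                     'class="kind"'),
        "X-OCCI-Attribute": "occi.compute.cores=4, occi.compute.memory=8.0",
    }

    resp = requests.post(f"{ENDPOINT}/compute/", headers=headers)
    resp.raise_for_status()
    print("New resource at:", resp.headers.get("Location"))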
To assure the working status of the computing resources, to ensure their efficient use, and to help trace and fix operational issues, the Grid requires comprehensive, real-time monitoring. Several packages, differing in scope and complexity, are in use today. They monitor basic parameters of the installed hardware, the running applications, the storage status and the network. The full list of monitored parameters in today's Grid would be quite long and, after ten years of continuous operation, is stored in many large, TB-sized relational databases. The challenges for the monitoring software are many: it must be unobtrusive, secure, scalable and configurable. The common approach is to deploy monitoring at several levels: the computing centres use fabric monitoring tools, the user communities have higher-level tools for the running applications, and the network operators deploy monitoring along the data paths. Further aggregation of monitoring data is done for resource accounting purposes. New projects are exploring the use of Elasticsearch techniques and tools to go beyond the limitations of the traditional monitoring data representations.
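As a minimal sketch of this direction, the snippet below ships a single monitoring record to an Elasticsearch instance through its REST API; the host, index name and document fields are invented for illustration.

    # Minimal sketch of shipping a monitoring record to Elasticsearch via its
    # REST API. Host, index name and document fields are hypothetical examples.
    import datetime
    import requests

    ES_URL = "http://monitoring.example.org:9200"   # hypothetical Elasticsearch node
    INDEX = "grid-site-metrics"

    record = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "site": "T2_EXAMPLE_SITE",
        "running_jobs": 1842,
        "storage_used_tb": 3120.5,
        "wan_in_gbps": 12.4,
    }

    resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=record)
    resp.raise_for_status()
    print("Indexed document id:", resp.json()["_id"])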
The Varenna Course 192 “Grid and Cloud computing: Concepts and Practical Applications” aimed to cover in depth the conceptual and practical aspects of Grid and Cloud computing, briefly outlined above. The present volume is divided into 8 chapters:
1. LHC computing (WLCG): Past, present, and future, by Ian G. Bird
2. Scientific Clouds, by Davide Salomoni
3. Clouds in biosciences: A journey to high throughput computing in life sciences, by Vincent Breton et al.
4. Monitoring and control of large-scale distributed systems, by Iosif Charles Legrand
5. Big data: Challenges and perspectives, by Dirk Duellmann
6. Advanced networking for scientific applications, by Artur Barczyk
7. Networking for high energy physics, by Harvey B. Newman et al.
8. Towards an OpenStack-based Swiss national research infrastructure, by Sergio Maffioletti et al.
Chapters 1, 2, 3 and 8 cover general applications of Grid and Cloud computing in various scientific fields; chapters 4, 5, 6 and 7 discuss specific technical areas of the Grid and Cloud structures.
Happy reading!