
Ebook: Cloud Computing and Big Data

Cloud computing offers many advantages to researchers and engineers who need access to high performance computing facilities for solving particular compute-intensive and/or large-scale problems, but whose overall high performance computing (HPC) needs do not justify the acquisition and operation of dedicated HPC facilities. There are, however, a number of fundamental problems which must be addressed, such as the limitations imposed by accessibility, security and communication speed, before these advantages can be exploited to the full.
This book presents 14 contributions selected from the International Research Workshop on Advanced High Performance Computing Systems, held in Cetraro, Italy, in June 2012. The papers are arranged in three chapters. Chapter 1 includes five papers on cloud infrastructures, while Chapter 2 discusses cloud applications.
The third chapter in the book deals with big data, which is nothing new – large scientific organizations have been collecting large amounts of data for decades – but what is new is that the focus has now broadened to include sectors such as business analytics, financial analyses, Internet service providers, oil and gas, medicine, automotive and a host of others.
This book will be of interest to all those whose work involves them with aspects of cloud computing and big data applications.
At the International Research Workshop on Advanced High Performance Computing Systems in Cetraro in June 2012, two of the main workshop topics were High Performance Computing (HPC) in the Cloud and Big Data.
Cloud computing offers many advantages to researchers and engineers who need access to high performance computing facilities for solving particular compute-intensive and/or large-scale problems, but whose overall HPC needs do not justify the acquisition and operation of dedicated HPC facilities. The questions surrounding the efficient and effective utilization of HPC cloud facilities are, however, numerous, with perhaps the most fundamental issues being the limitations imposed by accessibility, security and communication speeds.
In order to mobilize the full potential of cloud computing as an HPC platform, a number of fundamental problems must therefore be addressed. On the one hand, it must be determined which classes of problems are amenable to the cloud computing paradigm, given its limitations. On the other hand, it must be clarified which technologies, techniques and tools are needed to enable widely accepted use of cloud computing for HPC.
The second topic, big data, is nothing new. Large scientific organizations have been collecting large amounts of data for decades. What is new, however, is that the focus has now broadened to almost all sectors: business analytics in enterprises, financial analysis, Internet service providers, oil and gas, medicine, automotive, and a long list of others.
This book presents three chapters comprising 14 contributions selected from the International Research Workshop on Advanced High Performance Computing Systems in Cetraro in June 2012. The five contributions of the first chapter, “Cloud Infrastructures”, discuss several important topics in High Performance Computing in the cloud, covering automatic clouds with an open-source and deployable Platform-as-a-Service; QoS-aware cloud application management; building a secure and transparent inter-cloud infrastructure for scientific applications; cloud adoption issues such as interoperability and security; and semantic technology for supporting software portability and interoperability in the cloud.
The second chapter discusses “Cloud Applications”, with a focus on using clouds for technical computing; dynamic job scheduling of parametric computational mechanics studies; the bulk synchronous parallel model; and executing multi-workflow simulations on clouds.
Finally, the articles in the third chapter deal with “Big Data” problems such as ephemeral materialization points in Stratosphere data management on the cloud; a cloud framework for big data analytics workflows; high performance big data clustering; scalable visualization and interactive analysis using massive data streams; and mammoth data in the cloud: clustering social images.
The editors wish to thank all the authors for preparing their contributions, as well as the many reviewers who supported this effort with their constructive recommendations.
Charlie Catlett, USA
Wolfgang Gentzsch, Germany
Lucio Grandinetti, Italy
Gerhard Joubert, Netherlands/Germany
José Luis Vazquez-Poletti, Spain
14 July 2013
Cloud computing is taking a new step towards the long-standing vision of utility computing: the seamless delivery of computing, storage or networks as measurable consumables. However, completely automated processes are not yet in place, and human intervention is still required. This paper intends to provide a snapshot of the current status of automated processes in the Cloud. Moreover, special attention is given to a recently developed Platform-as-a-Service that was designed as an open-source and deployable middleware: by ensuring the portability of applications based on elastic components and consuming infrastructure services, it is a good example of the potential of automated procedures in a Cloud environment.
Cloud computing is attractive to many organizations because of its support for on-demand resources. The processes of deploying and running an application on the cloud need to be simple and efficient in order to justify the costs incurred in moving to the cloud. Unfortunately this is not necessarily the case today. Cloud application management therefore needs to become more provider-independent, autonomic and Quality-of-Service (QoS) aware. The QuARAM framework for QoS-aware autonomic cloud application management supports application developers in selecting a cloud provider, provisioning resources on the provider, deploying the application, and then managing the execution of the application.
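To illustrate the kind of QoS-aware provider selection such a framework automates, the following minimal Python sketch scores candidate providers against an application's QoS requirements; it is not the QuARAM implementation, and all names, weights and thresholds are hypothetical.

# Hypothetical sketch of QoS-aware provider selection (not the QuARAM code).
from dataclasses import dataclass

@dataclass
class Offer:
    name: str
    cost_per_hour: float   # USD per hour
    avail_pct: float       # advertised availability, e.g. 99.95
    latency_ms: float      # measured round-trip latency

def score(offer: Offer, max_cost: float, min_avail: float, max_latency: float) -> float:
    """Return a utility in [0, 1]; 0 if any hard QoS constraint is violated."""
    if (offer.cost_per_hour > max_cost or offer.avail_pct < min_avail
            or offer.latency_ms > max_latency):
        return 0.0
    # Weighted, normalized utility: cheaper, more available, lower latency is better.
    return (0.4 * (1 - offer.cost_per_hour / max_cost)
            + 0.4 * (offer.avail_pct - min_avail) / (100.0 - min_avail)
            + 0.2 * (1 - offer.latency_ms / max_latency))

offers = [Offer("provider-a", 0.12, 99.95, 40), Offer("provider-b", 0.08, 99.5, 90)]
best = max(offers, key=lambda o: score(o, max_cost=0.15, min_avail=99.9, max_latency=100))
print("selected:", best.name)   # prints "provider-a" with these example numbers

In a full framework such a score would feed into provisioning and deployment decisions, and would be re-evaluated at run-time as part of autonomic management.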
On 11 March 2011 Japan suffered a major earthquake. The resulting tsunami caused great loss of life and devastated many buildings, industries, and information services. Similarly, recent flooding in Thailand resulted in the destruction of property and a disruption of services. It is therefore prudent for providers of information technology infrastructure and services to formulate and implement procedures for rapid disaster recovery of key services that become even more critical during catastrophic events. To aid our understanding of how academic resource centers can respond positively to such events, we have prototyped, developed and deployed GEO (Global Earth Observation) Grid applications on a distributed infrastructure made up of physical clusters contributed by our international colleagues. On top of these clusters, each site was able to host virtual machines using its choice of virtualization infrastructure. We were able to realize a secure and transparent Inter-Cloud infrastructure through automated translation of virtual machines (e.g., between different hypervisors) and network virtualization. Cloud interoperation enables the sharing of virtual machine images by private and public Clouds, i.e., a virtual machine image can be shared by different VM hosting environments, including OpenNebula, Rocks, and Amazon EC2. Insights gained through the preliminary experiments indicate that the key issue for the Inter-Cloud is how network virtualization technologies can be used. We used OpenFlow for network virtualization to build a secure (isolated) network for the Inter-Cloud.
The cloud computing paradigm emerged shortly after the introduction of the “invisible” grid concepts, but it has taken only a few years for cloud computing to gain enormous momentum within industry and academia alike. However, providing adequate interoperability and security support in these complex distributed systems is of primary importance for the wide adoption of cloud computing by end users. This paper gives an overview of the main cloud interoperability and security issues and challenges. Existing and proposed solutions are also presented, with particular attention to the security-as-a-service approach. Some of the available directions for future work are also discussed.
Cloud vendor lock-in and interoperability gaps arise (among other reasons) when the semantics of resources, services and Application Programming Interfaces are not shared. Standards and techniques borrowed from the SOA and Semantic Web Services areas might help in obtaining shared, machine-readable descriptions of Cloud offerings (resources, services at the platform and application level, and their API groundings), thus allowing automatic discovery and matchmaking, and thereby supporting selection, brokering, interoperability and even composition of Cloud services across multiple Clouds. The EU-funded mOSAIC project (http://www.mosaic-cloud.eu) aims at designing and developing an innovative open-source API and platform that enables applications to be Cloud-provider neutral and to negotiate Cloud services as requested by their users. In this context, using the mOSAIC Cloud ontology and Semantic Engine, developers of cloud applications will be able to specify their service and resource requirements independently of Cloud providers' specific solutions. In order to update and maintain the platform and the mOSAIC API, the mOSAIC Semantic Discovery Service will, on the other hand, discover Cloud providers' functionalities and resources, and will compare and align them to the mOSAIC API, thus supporting agnostic and interoperable access to Cloud providers' offers.
We discuss the use of cloud computing in technical (scientific) applications and identify characteristics, such as loose coupling and data intensity, that lead to good performance. We give both general principles and several examples, with an emphasis on the use of the Azure cloud.
Parameter Sweep Experiments (PSEs) allow scientists to perform simulations by running the same code with different input data, which typically results in many CPU-intensive jobs; computing environments such as Clouds must therefore be used. Job scheduling is, however, challenging due to its inherent NP-completeness. Consequently, Cloud schedulers based on Swarm Intelligence (SI) techniques, which are good at approximating combinatorial problems, have emerged. We describe a Cloud scheduler based on Ant Colony Optimization (ACO), a popular SI technique, to allocate Virtual Machines to physical resources belonging to a Cloud. Simulated experiments performed with real PSE job data and alternative classical Cloud schedulers show that our scheduler allows a fair assignment of VMs, which are requested by different users, while maximizing the number of jobs executed every time a new user connects to the Cloud. Unlike previous experiments with our algorithm [9], in which batch execution scenarios for jobs were used, the contribution of this paper is to evaluate our proposal in dynamic scheduling scenarios. Results suggest that our scheduler provides a better balance between the number of jobs executed per unit time and the number of serviced users, i.e., the number of Cloud users that the scheduler is able to serve successfully.
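As a rough illustration of how an ant-colony-style heuristic can drive VM placement, the following Python sketch picks a host for one VM by combining pheromone levels with a load-based heuristic and then reinforces the chosen host; it is a generic ACO step under assumed data structures, not the scheduler evaluated in the paper.

# Illustrative ant-colony-style VM-to-host allocation step (generic ACO sketch).
import random

def aco_place_vm(hosts, pheromone, alpha=1.0, beta=2.0):
    """Pick a host for one VM: favour high pheromone and low current load."""
    weights = []
    for h in hosts:
        heuristic = 1.0 / (1.0 + h["load"])          # prefer lightly loaded hosts
        weights.append((pheromone[h["id"]] ** alpha) * (heuristic ** beta))
    return random.choices(hosts, weights=weights, k=1)[0]

def update_pheromone(pheromone, chosen_id, quality, rho=0.1):
    """Evaporate pheromone everywhere, then reinforce the chosen host."""
    for hid in pheromone:
        pheromone[hid] *= (1.0 - rho)
    pheromone[chosen_id] += quality

hosts = [{"id": 0, "load": 0.2}, {"id": 1, "load": 0.7}, {"id": 2, "load": 0.4}]
pheromone = {h["id"]: 1.0 for h in hosts}
host = aco_place_vm(hosts, pheromone)
update_pheromone(pheromone, host["id"], quality=1.0 / (1.0 + host["load"]))
print("VM placed on host", host["id"])

In this sketch each VM request plays the role of an ant, and the reinforcement rewards placements on lightly loaded hosts, biasing later placements towards a balanced allocation.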
Nowadays the concepts and infrastructures of Cloud Computing are becoming a standard for several applications. Scalability is no longer just a buzzword, but is being exploited effectively. However, despite the economic advantages of virtualization and scalability, factors such as latency, bandwidth and processor sharing can be a problem for Parallel Computing on the Cloud.
We will provide an overview of how to tackle these problems using the BSP (Bulk Synchronous Parallel) model. We will introduce the main advantages of the CGM (Coarse Grained Multicomputer) model, whose main goal is to minimize the number of communication rounds, which can have an important impact on the performance of BSP algorithms. We will also briefly present our experience of using BSP in an opportunistic grid computing environment. We will then describe several recent models for distributed computing initiatives based on BSP. Finally, we will present some preliminary experiments on the performance of BSP algorithms on Clouds.
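To make the BSP superstep structure concrete, here is a toy Python sketch that computes a global sum in two supersteps (local computation, communication, barrier synchronization); it simulates BSP processes with threads and is not tied to any particular BSP or CGM library.

# Minimal illustration of the BSP pattern, simulated with Python threads.
import threading

P = 4
data = [list(range(i * 10, i * 10 + 10)) for i in range(P)]   # one block per process
inbox = [[] for _ in range(P)]                                # message buffers
barrier = threading.Barrier(P)
result = [None] * P

def worker(pid):
    # Superstep 1: local computation, then send the partial sum to process 0.
    partial = sum(data[pid])
    inbox[0].append(partial)
    barrier.wait()            # communication completes at the synchronization barrier
    # Superstep 2: process 0 combines the received partial sums.
    if pid == 0:
        result[0] = sum(inbox[0])
    barrier.wait()

threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print("global sum:", result[0])   # 0 + 1 + ... + 39 = 780

The CGM refinement aims to keep the number of such supersteps (communication rounds) small and independent of the input size, which matters on clouds, where latency and bandwidth, rather than raw compute, are often the bottleneck.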
Large simulations require the combination of many different applications, which in many cases are described as scientific workflows. In recent years, significant knowledge has accumulated in the form of scientific workflows. This helps scientists build even more complex simulations, particularly when they combine existing simulations with new ones to create multi-workflow simulations. The SHIWA project created the SHIWA Simulation Platform to help scientists share and combine workflows to build such complex multi-workflow simulations, which also require the use of as many computing resources as possible. The mixed use of grid and cloud infrastructure can provide the amount of resources needed to complete complex multi-workflow simulations in a reasonable time. The SCI-BUS project has created the gateway technology that enables the mixed use of grid and cloud infrastructures. This paper describes the SHIWA and SCI-BUS technologies and how they can be used to run multi-workflow simulations on mixed grid and cloud infrastructures.
Data streaming frameworks like Stratosphere [1] are designed to work in the cloud on a large number of nodes working in parallel. The growing number of nodes, together with the expected long run-times of data processing tasks, increases the probability of failure. Fault tolerance therefore becomes an important issue in these systems. Existing fault tolerance strategies for data streaming systems usually accept full restarts or work in a blocking manner.
In this paper we introduce ephemeral materialization points, a non-blocking materialization strategy for data streaming systems. This strategy selects materialization positions in an uncoordinated manner at run-time. The materialization decision is made on the basis of resource usage and the execution graph, in order to minimize the expected recovery time in the case of a failure. We show how and when to decide whether or not to materialize, and which information can influence that decision.
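The following Python sketch illustrates one plausible form such a run-time decision could take, trading the cost of writing an intermediate result against the expected recomputation saved on failure; the heuristic, thresholds and parameter names are illustrative assumptions, not the strategy defined in the paper.

# Hypothetical materialization decision heuristic (illustrative only).
def should_materialize(upstream_runtime_s: float,
                       output_size_gb: float,
                       failure_prob: float,
                       write_bw_gb_s: float,
                       io_utilization: float,
                       io_threshold: float = 0.7) -> bool:
    if io_utilization > io_threshold:
        return False                        # node is already I/O-bound; skip materialization
    write_cost = output_size_gb / write_bw_gb_s
    expected_saving = failure_prob * upstream_runtime_s   # recomputation avoided on failure
    return expected_saving > write_cost

# Example: 20 minutes of upstream work, 50 GB output, 5% failure probability.
print(should_materialize(upstream_runtime_s=1200, output_size_gb=50,
                         failure_prob=0.05, write_bw_gb_s=1.0,
                         io_utilization=0.3))   # True with these example numbers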
Since digital data repositories are becoming ever more massive and distributed, smart data analysis techniques and scalable architectures are needed to extract useful information from them in a reduced time. Cloud computing infrastructures offer effective support for addressing both the computational and the data storage needs of big data mining applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large and efficient storage facilities together with high performance processors to obtain results in acceptable times. In this chapter we present a Data Mining Cloud Framework designed for developing and executing distributed data analytics applications as workflows of services. In this environment, data sets, analysis tools, data mining algorithms and knowledge models are implemented as single services that can be combined, through a visual programming interface, into distributed workflows to be executed on Clouds. The first implementation of the Data Mining Cloud Framework on Azure is presented and the main features of its graphical programming interface are described.
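As a minimal sketch of the "workflow of services" idea, the following Python fragment wires three toy analysis services into a dependency graph and runs them in topological order; the service names and the execution code are illustrative and do not reflect the actual Data Mining Cloud Framework API or its visual programming interface.

# Tiny workflow of data-analysis "services" expressed as a DAG (illustrative sketch).
from graphlib import TopologicalSorter   # Python 3.9+

def load(_):        return list(range(100))                              # data-set service
def clean(inputs):  return [x for x in inputs["load"] if x % 7]          # filter service
def mine(inputs):   return {"mean": sum(inputs["clean"]) / len(inputs["clean"])}  # mining service

services = {"load": load, "clean": clean, "mine": mine}
workflow = {"clean": {"load"}, "mine": {"clean"}}            # node -> dependencies

results = {}
for node in TopologicalSorter(workflow).static_order():
    deps = {d: results[d] for d in workflow.get(node, set())}
    results[node] = services[node](deps)

print(results["mine"])   # knowledge model produced by the final service

In a cloud setting, each node would be a remote service invocation and independent branches of the DAG could run in parallel on different provisioned resources.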
Scientific advances are collectively exploding the amount, diversity, and complexity of the data becoming available. Our ability to collect huge amounts of data has greatly surpassed our analytical capacity to make sense of it. Efficient use of high performance computing techniques is critical for the success of the data-driven paradigm of scientific discovery. Data clustering is one of the fundamental analytics tasks heavily relied upon in many application domains, such as astrophysics, climate science and bioinformatics. In this book chapter, we illustrate the challenges and opportunities in mining big data using two recently developed scalable parallel clustering algorithms. Experimental results on millions of high-dimensional data points clustered in parallel on thousands of processor cores are also presented.
Historically, data creation and storage have always outpaced the infrastructure for their movement and utilization. This trend is stronger now than ever, with the ever-growing size of scientific simulations, the increased resolution of sensors, and large mosaic images. Effective exploration of massive scientific models demands the combination of data management, analysis, and visualization techniques working together in an interactive setting. The ViSUS application framework has been designed as an environment that allows the interactive exploration and analysis of massive scientific models in a cache-oblivious, hardware-agnostic manner, enabling the processing and visualization of possibly geographically distributed data using many kinds of devices and platforms.
For general-purpose feature segmentation and exploration, we discuss a new paradigm based on topological analysis. This approach enables the extraction of summaries of the features present in the data through abstract models that are orders of magnitude smaller than the raw data, providing enough information to support general queries and to perform a wide range of analyses without access to the original data.
Social image datasets have grown to dramatic size, with images classified in vector spaces of high dimension (512-2048) and with potentially billions of images and corresponding classification vectors. We study the challenging problem of clustering such sets into millions of clusters using Iterative MapReduce. We introduce a new Kmeans algorithm in the Map phase which can tackle the challenge of large cluster and dimension size. Further, we stress that the necessary parallelism of such data-intensive problems is dominated by particular collective operations that are common to MPI and MapReduce, and we study different collective implementations which enable cloud-HPC cluster interoperability. Extensive performance results are presented.
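The dominant pattern described above can be sketched as follows: each map task assigns its block of high-dimensional vectors to the nearest centroid and emits partial sums, and a reduce/allreduce-style collective combines the partial sums into the new centroids broadcast to the next iteration. The NumPy code below is a serial emulation of that pattern under assumed sizes, not the authors' implementation or their collective layer.

# Serial emulation of the iterative MapReduce K-means pattern (illustrative sketch).
import numpy as np

def map_phase(block, centroids):
    """Assign each point in the block to its nearest centroid; return partial sums."""
    d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, dim = centroids.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        members = block[labels == j]
        sums[j], counts[j] = members.sum(axis=0), len(members)
    return sums, counts

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 128))            # high-dimensional image feature vectors
blocks = np.array_split(points, 8)                 # one block per map task
centroids = points[rng.choice(len(points), 16, replace=False)]

for _ in range(5):                                 # a few K-means iterations
    partials = [map_phase(b, centroids) for b in blocks]          # map phase
    total_sums = sum(p[0] for p in partials)                      # collective combine
    total_counts = sum(p[1] for p in partials)
    centroids = total_sums / np.maximum(total_counts, 1)[:, None] # reduce, then broadcast
print("final centroid matrix:", centroids.shape)

In a real deployment the two sum lines would be an MPI or MapReduce collective over partial results, and its implementation dominates communication cost at large cluster counts and dimensions.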