Ebook: Big Data and HPC: Ecosystem and Convergence
Due to the increasing need to solve complex problems, high-performance computing (HPC) is now one of the most fundamental infrastructures for scientific development in all disciplines, and it has progressed massively in recent years as a result. HPC facilitates the processing of big data, but the tremendous research challenges faced in recent years include: the scalability of computing performance for high-velocity, high-variety and high-volume big data; deep learning with massive-scale datasets; big data programming paradigms on multi-core, GPU, and hybrid distributed environments; and unstructured data processing with high-performance computing.
This book presents 19 selected papers from the TopHPC2017 congress on Advances in High-Performance Computing and Big Data Analytics in the Exascale era, held in Tehran, Iran, in April 2017. The book is divided into 3 sections: State of the Art and Future Scenarios, Big Data Challenges, and HPC Challenges, and will be of interest to all those whose work involves the processing of Big Data and the use of HPC.
HPC is one of the most important and fundamental infrastructures for scientific development in all disciplines, and has progressed enormously due to the increasing need to solve complex problems. HPC is expected to reach exascale computing capability of at least one exaFLOPS by 2020. Compared to the first petascale computer to come into operation in 2008, this capacity represents a thousand-fold increase.
Big data is the term used to refer to data sets so large and complex that traditional data processing applications are inadequate to handle them. Its main challenges are data gathering and capture, cleaning and curation, search, sharing, storage, transfer, analysis, visualization, and information privacy. Recent years have witnessed a flood of network data driven by sensors, the IoT, emerging online social media, cameras, M2M communications, mobile sensing, user-generated video content, and global-scale communications, all of which have brought people into the era of big data. The processing of big data requires vast amounts of storage and computing resources. This makes it an exciting time for practitioners working at the intersection of HPC systems and big data.
HPC facilitates the processing of big data, but the tremendous research challenges of recent years have included the scalability of computing performance for high-velocity, high-variety and high-volume big data; deep learning with massive-scale datasets; big data programming paradigms on multi-core, GPU, and hybrid distributed environments; and unstructured data processing with high-performance computing. The tools and cultures of high-performance computing and big-data analytics have diverged, to the detriment of both, and it is clear that they need to come together to effectively address a range of major research areas.
The question now is: what is the biggest technical challenge associated with advancing HPC beyond the exascale operations capability for numeric and big data applications? There are many challenges ahead, including system power consumption and environmentally friendly cooling, massive parallelism and component failures, data and transaction consistency, metadata and ontology management, precision and recall at scale, and multidisciplinary data fusion and preservation. All of these challenges must ultimately be solved to achieve a computing platform that is usable by and accessible to researchers from a wide range of disciplines who may not be computer experts themselves.
Reflecting the importance of this field, the TopHPC2017 congress on Advances in High-Performance Computing and Big Data Analytics in the Exascale Era was held in Tehran, Iran, and selected papers from the congress are gathered together in this book.
Even the highest-scale contemporary conventional HPC system architectures are optimized for the basic operations and access patterns of classical matrix and vector processing. These include an emphasis on FPU utilization, high data reuse requiring temporal and spatial locality, and uniform strides of indexing through regular data structures. Systems in the 100-petaflops performance regime, such as the Chinese Sunway TaihuLight and the US CORAL systems Summit and Sierra to be deployed in 2018, are, in spite of their innovations, still limited by these properties. Emerging classes of new application problems in data analytics, machine learning, and knowledge management demand very different operational properties in response to their highly irregular, sparse, and dynamic behaviors, which exhibit little or no data reuse, random access patterns, and metadata-dominated parallel processing. At the core of these “big data” applications is dynamic adaptive graph processing, which is in some ways diametrically opposite to conventional matrix computing. Of immediate importance is the need to significantly enhance efficiency and scalability as well as user productivity, performance portability, and energy efficiency. Key to success is the introduction of powerful runtime system strategies and software for the exploitation of real-time system information to support dynamic adaptive resource management and task scheduling. But software alone will be insufficient at extreme scale, where fine-grained parallelism is necessary and software overheads will bound efficiency and scalability. A new era of architecture research is beginning in the combined domains of accelerator hardware for both graph processing and runtime systems. This paper discusses the nature of the computational challenges, examples and experiments with the state-of-the-art runtime system software HPX-5, and future directions in hardware architecture support for exascale runtime-assisted big data computation.
The convergence of High Performance Computing (HPC) and Big Data Analytics (BDA) has been the center of attention for the past few years. HPC and BDA have separate software stacks, and from a financial point of view it is impossible to invest in both categories at the same time. HPC is traditionally compute-intensive while BDA is data-intensive. In this paper we investigate the convergence of HPC and BDA from a technical perspective. We first review the state of the art and introduce the common frameworks of HPC and BDA. We also compare these frameworks in terms of scalability, data rate, data size, fault tolerance, and real-time processing. We further compare the software stacks of HPC and BDA as convergence challenges and discuss existing solutions for convergence.
Achieving exaflop capabilities imposes some challenging tasks, such as performance evaluation in HPC, understanding the computational requirements of scientific applications and their relation to power consumption, and finding ways to monitor a wide range of parameters in heterogeneous environments. Considering these challenges, this work proposes a methodology for evaluating the performance of scientific applications in HPC and analyzes the approaches and tools used to monitor performance and power consumption. Additionally, we show how this methodology can assist in evaluating energy-industry applications, and we propose our own performance and power monitoring tool.
High-throughput microscopy techniques generate an ever-growing amount of data that is fundamental to gathering scientifically, biologically, and medically relevant insights. This growing amount of data dramatically affects the scientific workflow at every step. Visualization and analysis tasks are performed with limited interactivity, and the implementations often require HPC skills and lack portability, usability, and maintainability.
In this work we explore a software infrastructure that simplifies end-to-end visualization and analysis of massive data. Data management and movement is performed using a hierarchical streaming data access layer which enables interactive exploration of remote data. The analysis tasks are expressed and performed using a library for rapid prototyping of algorithms based on an Embedded Domain Specific Language, which enables portable deployment in both desktop and HPC environments. Finally, we use a scalable runtime system (Charm++) to automate the mapping of the analysis algorithm to the available computational resources, reducing the complexity of developing scalable algorithms. We present large-scale experiments using tera-scale microscopy data, executing some of the most common neuroscience use cases: data filtering, visualization using two different image compositing algorithms, and image registration.
Learning effective feature representations and similarity measures is critical to the performance of a content-based image retrieval (CBIR) system. Although various techniques have been proposed, this remains one of the most challenging problems in CBIR, mainly due to the “semantic gap” between the low-level image pixels captured by machines and the high-level semantic concepts perceived by humans. One of the most important advances in machine learning is “deep learning”, which attempts to model high-level abstractions in data by employing deep architectures composed of multiple non-linear transformations. We can improve CBIR using state-of-the-art deep learning techniques for learning feature representations and similarity measures. Deep neural networks have recently shown great performance on image classification.
In this paper we take another step and present some proposed methods for object detection and semantic segmentation which can be used for CBIR. In addition, we propose an approach based on a well-known deep CNN architecture, GoogLeNet. One of the most important approaches to object detection is R-CNN (Regions with CNN features). The core idea of R-CNN is to generate multiple object proposals, extract features from each proposal using a CNN, and then classify each candidate window with a category-specific linear SVM. We present the idea of R-CNN and methods that improve on it, including multi-stage and deformable deep CNNs. Other important approaches to object detection are based on DNN regression towards an object mask; in this approach, one or more CNNs can be defined to detect multiple objects. We present the idea of DNN-based regression and compare it with R-CNN. Finally, we propose a CBIR system that uses a deep CNN. In our proposed CBIR, we consider the output of the last convolutional layer as image features and find similar images based on these feature maps.
Deep learning requires extremely large computational resources to train a multi-layer neural network. GPUs are often used to train deep neural networks, because the primary computational steps match their SIMD-style nature and they provide much more raw computing capability than traditional CPU cores. In this paper we explain the role of GPUs and Caffe in building a high-performance computing model for deep learning.
In summary, our goal in this paper is to explain various architectures of deep neural networks, the R-CNN approach to object detection, the DNN-based regression approach to object detection, high-performance computing for deep learning using GPUs, and our proposed CBIR using a deep CNN.
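To make the retrieval step concrete, the following is a minimal sketch of CBIR over pooled convolutional feature maps. The extract_conv_features function is a placeholder for a forward pass through the last convolutional layer of a pre-trained deep CNN (for example a GoogLeNet-style model); it is an illustrative assumption, not the paper's implementation.

    # Minimal sketch of CBIR over pooled convolutional feature maps (NumPy only).
    import numpy as np

    def extract_conv_features(image):
        # Placeholder: run the image through a pre-trained CNN and return the
        # last convolutional layer's output with shape (channels, height, width).
        raise NotImplementedError

    def pool_features(feature_maps):
        # Global average pooling turns (C, H, W) feature maps into a C-dim descriptor.
        return feature_maps.reshape(feature_maps.shape[0], -1).mean(axis=1)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def retrieve(query_image, database_images, top_k=5):
        # Rank database images by descriptor similarity to the query.
        q = pool_features(extract_conv_features(query_image))
        scored = []
        for name, img in database_images.items():
            d = pool_features(extract_conv_features(img))
            scored.append((cosine_similarity(q, d), name))
        scored.sort(reverse=True)
        return scored[:top_k]

In practice the database descriptors would be precomputed once and stored, so that a query only costs one forward pass plus a nearest-neighbor search.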
Graph-based approaches have been shown to be efficient in information extraction, especially in the case of text mining. Compared to methods like vector space models, a graph representation of a document suffers less information loss during feature extraction. However, constructing graph models is more CPU- and memory-intensive, so utilizing HPC solutions seems inevitable in this case. This paper suggests a pipelined method for constructing a graph model that allows for an arbitrary level of parallel processing and distributed computing. This method also enables a wide range of data visualization opportunities. It is shown that big data hardware and software infrastructures can be used without any algorithmic limitation. Results show a significant decrease in runtime.
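As one illustration of such a graph representation, the sketch below builds a word co-occurrence graph from tokenized text; the sliding-window size and the use of networkx are assumptions made for this example, not the paper's exact pipeline. Because every document can be processed independently and the resulting graphs merged, the construction parallelizes naturally.

    # Minimal sketch: word co-occurrence graph of a document.
    import networkx as nx

    def cooccurrence_graph(tokens, window=3):
        # Connect words that appear within `window` positions of each other;
        # edge weights count how often the pair co-occurs.
        g = nx.Graph()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                u, v = tokens[i], tokens[j]
                if u == v:
                    continue
                if g.has_edge(u, v):
                    g[u][v]["weight"] += 1
                else:
                    g.add_edge(u, v, weight=1)
        return g

    doc = "big data needs high performance computing for big data analytics".split()
    g = cooccurrence_graph(doc)
    print(g.number_of_nodes(), g.number_of_edges())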
Big data is rapidly gaining attention in different scientific domains, industries, and business processes. With the Internet of Things, big data is generated continuously by everything around us, and therefore dealing with big data and its challenges is important and requires new ways of thinking as well as new techniques. Signal processing is one of the solutions that is applied to big data in most scientific fields. This paper gives a brief introductory overview of the subjects included in this area and describes some of the challenges and tactics. To show the rapid growth of attention to signal processing for big data, a statistical analysis of the corresponding patents is presented as well.
This work addresses the recent convergence of Operational Technology (OT) and Information Technology (IT) with the use of wireless and telecom technologies that is encouraging the Internet of Things (IoT). The IoT, now at the peak of its hype, has strong predecessors in Machine-to-Machine (M2M) communication and Cyber-Physical Systems (CPS), and successors in the Web of Things (WoT) and Edge Computing (EC). The discussion is extended to an analysis of various hardware platforms, microcontrollers, Single Board Computers (SBCs), sensors, and actuators for IoT applications. We analyze and design an alert-based physical location monitoring system, extended with statistics-based machine learning support for predictive analytics. The primary intention is to design a generalized model which can be applied in any context. Data must be recorded not only from the sensors but also from other sources, such as the mobile devices used as smart objects. The use of Message Queue Telemetry Transport (MQTT) in the system is discussed as one of the methodologies for capturing data from the smart objects. We describe our experience of preparing the Raspberry Pi SBC for IoT-based applications: connecting sensors, accepting readings from them, and sending the data to the AWS IoT Cloud for further processing. The subscription and publication of MQTT messages, notification of data by email and SMS, storage in AWS DynamoDB, the AWS Simple Notification Service (SNS), and the automation of cautionary actions are discussed, as is a text-to-speech approach to notification. Performing edge computation near the source of the data, with timely updates to the cloud, reduces read-write transactions on the cloud and the amount billed by the cloud provider. Securing IoT applications at every stage, from devices to SBCs and microcontrollers and onward to the cloud, requires secure communication; X.509 security certificates and the necessary security measures are discussed. As the IoT captures huge opportunities in various sectors, the existing operational protocols, their interoperability, and low-energy wireless technologies are also taken into consideration.
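As a concrete illustration of the MQTT step, here is a minimal sketch of publishing a sensor reading to a broker over TLS with X.509 client certificates, using the paho-mqtt library; the endpoint, topic, and certificate paths are placeholders for illustration, not the chapter's actual deployment.

    # Minimal sketch: publish one sensor reading over MQTT with TLS/X.509.
    import json
    import ssl
    import time
    import paho.mqtt.client as mqtt

    ENDPOINT = "your-endpoint.iot.us-east-1.amazonaws.com"  # placeholder broker address
    TOPIC = "sensors/location/alerts"                       # placeholder topic

    client = mqtt.Client()
    # X.509 mutual authentication, as required by brokers such as AWS IoT Core.
    client.tls_set(ca_certs="root-CA.pem",
                   certfile="device.cert.pem",
                   keyfile="device.private.key",
                   tls_version=ssl.PROTOCOL_TLSv1_2)
    client.connect(ENDPOINT, port=8883)
    client.loop_start()

    reading = {"device": "rpi-01", "temperature_c": 27.4, "timestamp": time.time()}
    client.publish(TOPIC, json.dumps(reading), qos=1)

    client.loop_stop()
    client.disconnect()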
Today, a huge amount of diverse data is produced every second, at high speed, by sources such as sensors, social networks, stock markets, and electronic businesses. This gives rise to the era of big data. Almost all of this data contains valuable information that can be extracted by processing it. Some data has a short lifetime and loses its value quickly. Moreover, processing large volumes of data is a big challenge. One solution for handling such data is fast, distributed parallel processing. A set of clusters consisting of heterogeneous compute nodes can be used for distributed parallel processing. In such a big system, allocating suitable big data and tasks to the compute nodes plays an important role. Since different tasks for processing various data might have different requirements, the main goal of an autonomous task scheduler is to allocate tasks to appropriate compute nodes for fast big data processing. In this paper, we propose a novel autonomous task scheduler for fast processing of big data on a given cluster with heterogeneous compute nodes. We use Spark as our data processing framework alongside Mesos as a cluster manager that provides efficient resource isolation and sharing across distributed applications and frameworks. We discuss how our scheduler tackles the notable challenges in the way of fast processing of big data on the configured heterogeneous cluster.
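For orientation, the sketch below shows how a Spark application is pointed at a Mesos-managed cluster through the master URL and a few resource settings; the URL and resource values are placeholders, and the paper's custom autonomous scheduling logic is not reproduced here.

    # Minimal sketch: a PySpark job on a Mesos-managed cluster (placeholder settings).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("heterogeneous-cluster-demo")
             .master("mesos://zk://mesos-master:2181/mesos")  # placeholder Mesos master URL
             .config("spark.executor.memory", "4g")           # per-executor memory
             .config("spark.cores.max", "32")                 # cap on total cores used
             .getOrCreate())

    # A toy data-parallel job: word counts over an RDD.
    lines = spark.sparkContext.parallelize(["big data", "fast big data processing"])
    counts = (lines.flatMap(lambda s: s.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    spark.stop()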
Increasingly large datasets make scalable and distributed data analytics necessary. Frameworks such as Spark and Flink help users in efficiently utilizing cluster resources for their data analytics jobs. It is, however, usually difficult to anticipate the runtime behavior and resource demands of these distributed data analytics jobs. Yet, many resource management decisions would benefit from such information.
Addressing this general problem, this chapter presents our vision of adaptive resource management and reviews recent work in this area. The key idea is that workloads should be monitored for trends, patterns, and recurring jobs. These monitoring statistics should be analyzed and used for a cluster resource management calibrated to the actual workload. In this chapter, we motivate and present the idea of adaptive resource management. We also introduce a general system architecture and we review specific adaptive techniques for data placement, resource allocation, and job scheduling in the context of our architecture.
Subgraph counting aims to count the number of occurrences of a subgraph T (also known as a template) in a given graph G. The basic problem has found applications in diverse domains. The problem is known to be computationally challenging: the complexity grows as a function of both T and G. Recent applications have motivated solving such problems on massive networks with billions of vertices.
In this chapter, we study the subgraph counting problem from a parallel computing perspective. We discuss efficient parallel algorithms for approximately solving subgraph counting problems using the color-coding technique. We then present several system-level strategies to substantially improve the overall performance of the algorithm on massive subgraph counting problems. We propose: 1) a novel pipelined Adaptive-Group communication pattern to improve inter-node scalability, 2) a fine-grained pipeline design to effectively reduce the memory footprint of intermediate results, and 3) partitioning of the neighbor lists of subgraph vertices to achieve better thread concurrency and workload balance. Experimentation on an Intel Xeon E5 cluster shows that our implementation achieves a 5x performance speedup over the state-of-the-art work while reducing peak memory utilization by a factor of 2 on large templates of 12 to 15 vertices and input graphs of 2 to 5 billion edges.
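As a pointer to the underlying idea, the sketch below applies color coding to the simplest case, estimating the number of simple paths on k vertices; the chapter's parallel tree-template algorithm, communication patterns, and memory optimizations are not reproduced, and the adjacency-list format is an assumption for the example.

    # Minimal sketch of color coding for approximate path counting.
    import math
    import random
    from collections import defaultdict

    def colorful_path_count(adj, k, seed=None):
        # Randomly color vertices with k colors, then count "colorful" paths
        # (k vertices, all distinct colors) by dynamic programming over
        # (endpoint, used-color bitmask) states.
        rng = random.Random(seed)
        color = {v: rng.randrange(k) for v in adj}
        dp = defaultdict(int)
        for v in adj:
            dp[(v, 1 << color[v])] = 1
        for _ in range(k - 1):
            ndp = defaultdict(int)
            for (v, mask), cnt in dp.items():
                for u in adj[v]:
                    bit = 1 << color[u]
                    if not mask & bit:
                        ndp[(u, mask | bit)] += cnt
            dp = ndp
        total = sum(dp.values())
        # Each undirected path is counted once from each end, and a given simple
        # path is colorful with probability k!/k^k, so rescale to get an estimate.
        return (total / 2) * (k ** k) / math.factorial(k)

In practice the estimate is averaged over many independent random colorings to reduce variance, and it is exactly this embarrassingly parallel repetition, plus the per-coloring dynamic program, that the chapter distributes across nodes and threads.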
This article presents an idea and structure for a final practical assignment on parallel computing, which was given to Master's degree students at the end of their studies at the MSU Faculty of Computational Mathematics and Cybernetics in the fall semester of 2016. The main objective of the assignment is to teach students to follow a proper approach to describing the parallel structure and properties of algorithms, which is the key link in the supercomputer co-design chain. Although the assignment was challenging and labor-intensive for both students and teachers, we rate this experiment as a highly successful one. This assignment is extremely useful, relevant, and important, particularly at the end of the students' education, as it requires combining individual elements of knowledge and skills that they have previously acquired in various courses.
Parallel video coding has recently received considerable attention, since there is a strong need for real-time applications. To address the extremely high computational complexity of motion estimation, the most time-consuming part of video coding, suitable parallel algorithms need to be developed. Recently, the use of Graphics Processing Units (GPUs) for parallel tasks has become popular. Motion estimation, one of the most important parts of the video coding standard, is readily parallelized, and this task is therefore well suited to GPU-based parallel implementation. In this paper, three GPU-based parallel approaches to motion estimation are introduced. In all of these algorithms, the motion information is analyzed and dynamic search areas are employed. In a further approach, we make use of a combined CPU and GPU system in order to achieve more effective results, especially for real-time applications on devices with limited computational resources. The results of the combined method are superior, especially for medium amounts of motion. The experimental results show the good performance of the presented approaches in terms of both PSNR values and speedup.
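For readers unfamiliar with block matching, the following is a minimal sketch of motion estimation with the sum of absolute differences (SAD) criterion over a fixed search window; the paper's dynamic search areas and GPU kernels are not reproduced, and the block and search sizes are illustrative.

    # Minimal sketch: full-search block matching for one block, using SAD.
    import numpy as np

    def best_motion_vector(ref, cur, top, left, block=16, search=8):
        # Find the displacement of one block of the current frame within a
        # +/- `search` pixel window of the reference frame.
        target = cur[top:top + block, left:left + block].astype(np.int32)
        best, best_sad = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                    continue
                cand = ref[y:y + block, x:x + block].astype(np.int32)
                sad = int(np.abs(target - cand).sum())
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best, best_sad

Every block (and every candidate displacement within its window) can be evaluated independently, which is what makes this search easy to map onto GPU threads.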
The Traveling Salesman Problem (TSP) is an NP-hard combinatorial optimization problem. Approximation algorithms have been used successfully to reduce the worst-case factorial time complexity of TSP to polynomial time. However, approximation methods result in a sub-optimal solution, as they do not cover the search space adequately. Further, CPU implementations of approximation methods are too time-consuming for large input instances. On the other hand, GPUs have been shown to be effective in exploiting data- and memory-level parallelism in large, complex problems.
This chapter presents a time-efficient parallel implementation of the Iterative Hill Climbing algorithm (PIHC) that solves TSP on the GPU in a near-optimal manner. Performance evaluation has been carried out on symmetric TSPLIB instances ranging from 105 to 85,900 cities. The PIHC implementation gives a 181× speedup over its sequential counterpart and a 251× speedup over the existing GPU-based TSP solver using the hill climbing approach. The PIHC implementation produces solutions with an error rate of 0.72% in the best case and 8.06% in the worst case, which is better than existing GPU-based TSP solvers using the hill climbing approach.
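To show the shape of the algorithm being parallelized, here is a minimal sequential sketch of iterative (restarted) hill climbing for TSP using 2-opt moves; the chapter's GPU parallelization of the neighborhood evaluation is not reproduced, and the distance matrix format and restart count are assumptions for the example.

    # Minimal sketch: restarted 2-opt hill climbing for TSP (sequential).
    import random

    def tour_length(tour, dist):
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

    def hill_climb(dist, tour):
        improved = True
        while improved:
            improved = False
            n = len(tour)
            for i in range(n - 1):
                for j in range(i + 2, n):
                    # 2-opt move: reverse the segment tour[i+1..j] if it shortens the tour.
                    candidate = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                    if tour_length(candidate, dist) < tour_length(tour, dist):
                        tour, improved = candidate, True
        return tour

    def iterative_hill_climbing(dist, restarts=10, seed=0):
        # Restart from random tours and keep the best local optimum found.
        rng = random.Random(seed)
        cities = list(range(len(dist)))
        best = None
        for _ in range(restarts):
            start = cities[:]
            rng.shuffle(start)
            local = hill_climb(dist, start)
            if best is None or tour_length(local, dist) < tour_length(best, dist):
                best = local
        return best, tour_length(best, dist)

The inner double loop over candidate 2-opt moves is the hot spot; a GPU implementation typically evaluates all candidate moves (or all restarts) in parallel and then applies the best one.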
With technology scaling, multicore and many-core processors have emerged as an alternative for meeting performance demands. Power and thermal constraints leave a considerable area of the chip underutilized, which is called dark silicon in the literature. In this paper, we model a multicore processor based on Amdahl's law while also considering reliability issues and memory overheads. Then, based on accurate empirical results, we suggest effective voltage and frequency scaling under power constraints and different amounts of dark silicon. Since power-efficient small cores result in more active cores in the multicore architecture, we attempt to improve the total performance by introducing more power-efficient multicore architectures along with decreasing the dark silicon percentage. According to the results, voltage scaling of the processor has a negligible effect on the performance of memory-intensive applications. However, the performance of CPU-intensive applications is sensitive to voltage scaling: it improves for parallel applications and diminishes for serial ones. Energy and performance per watt are improved by voltage scaling for all applications.
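For reference, the sketch below shows the classical Amdahl's-law speedup that such multicore models build on, together with a toy performance-per-watt term; the linear power model is an assumption for illustration, and the paper's reliability and memory-overhead extensions are not modeled.

    # Minimal sketch: Amdahl's law and a toy performance-per-watt metric.
    def amdahl_speedup(parallel_fraction, active_cores):
        # S(n) = 1 / ((1 - p) + p / n)
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / active_cores)

    def performance_per_watt(parallel_fraction, active_cores, core_power_w, static_power_w):
        # Toy model: total power grows linearly with the number of active cores.
        speedup = amdahl_speedup(parallel_fraction, active_cores)
        total_power = static_power_w + active_cores * core_power_w
        return speedup / total_power

    # Example: with 90% parallel code, going from 16 to 64 active cores raises the
    # speedup only from 6.4x to about 8.8x, which is why per-core power and the
    # fraction of dark silicon matter so much.
    print(amdahl_speedup(0.9, 16), amdahl_speedup(0.9, 64))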
Random access containers (e.g., arrays) and sequential containers (e.g., lists) are today among the most useful abstractions in sequential programming. They facilitate access to a group of individual objects and provide a uniform structure for realizing different semantics of synchronization in concurrent applications. To overcome the same challenges in parallel programming, we propose a new abstraction called the Time Collection. Parallelism and asynchronism in coupling channels are considered the main challenges to achieving high performance. The first prototype is implemented on the Athapascan parallel programming environment. Our motivation is the coupling of parallel codes, which is one of the main challenges of recent research projects in multi-disciplinary simulation, interactive multimedia environments, and multi-screen visualization.
Microarrays are extensively used in genomic research and have a wide range of applications in biology and medicine, providing a large amount of data. Several different kinds of microarrays are available, distinguishable by characteristics such as the kind of probes, the surface support used, and the method employed for target detection. Although microarrays have been developed and applied in many biological contexts, their management and investigation require advanced computational tools to speed up data analysis and the interpretation of results. To better deal with microarray data sets, whose main characteristic is their huge dimensionality, there is a need to develop analysis tools that are easy to use, produce accurate predictions, and yield comprehensible models. The aim of this paper is to provide a review of software tools that are easy to use even for non-experts in the domain, and that can efficiently deal with microarray data to derive information that discriminates and identifies SNPs associated with genes related to particular drug responses, phenotypes, and the development of complex diseases.
In principle, creating a distributed program is usually much more difficult than creating a sequential program with the same functionality from a source code. Therefore, automatic distribution of a program into a distributed computing environment has always been an open research problem. The purpose of automatic distribution is to achieve the highest speedup by extracting the most concurrency from the source code. The aim of this paper is therefore to propose a new quality function that considers synchronous and asynchronous invocations. It also presents an evolutionary approach that uses a genetic algorithm and hill climbing to minimize the proposed quality function, aiming to extract the distributed architecture from the source code. The results show that the proposed approach adequately extracts the distribution architecture of a source code and improves the speedup.
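To illustrate the evolutionary search in general terms, the sketch below uses a genetic algorithm to assign program modules to nodes while minimizing a placeholder cost, namely the weight of calls that cross node boundaries; this stand-in is not the paper's proposed quality function, and the hill-climbing refinement step is omitted.

    # Minimal sketch: genetic algorithm assigning modules to nodes under a
    # placeholder cost function.
    import random

    def fitness(assignment, calls):
        # Sum the weights of calls that cross node boundaries (lower is better).
        return sum(w for (a, b), w in calls.items() if assignment[a] != assignment[b])

    def genetic_partition(n_modules, n_nodes, calls, pop_size=40, generations=200, seed=0):
        rng = random.Random(seed)
        pop = [[rng.randrange(n_nodes) for _ in range(n_modules)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda ind: fitness(ind, calls))
            survivors = pop[:pop_size // 2]          # elitist selection
            children = []
            while len(survivors) + len(children) < pop_size:
                p1, p2 = rng.sample(survivors, 2)
                cut = rng.randrange(1, n_modules)
                child = p1[:cut] + p2[cut:]          # one-point crossover
                if rng.random() < 0.1:               # mutation
                    child[rng.randrange(n_modules)] = rng.randrange(n_nodes)
                children.append(child)
            pop = survivors + children
        best = min(pop, key=lambda ind: fitness(ind, calls))
        return best, fitness(best, calls)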
Grid computing has emerged as a new computing infrastructure aiming to make scientific advancement and progress a much simpler and more enjoyable task: let everyone share what they have (resources that others find quite valuable), so they can work smarter, do better, and be happier. With the emergence of Cloud computing as yet another hype in the computing arena, Grid computing has slowly been pushed out of the window. However, we believe that the goal and purpose of Grid computing are majestic and graceful. In this paper, we propose an architectural approach to Grid provisioning, bearing in mind that two successful phenomena, namely the Internet and the World Wide Web, have always faced the same problems that the Grid is facing now: (1) heterogeneity, and (2) geographical and administrative scalability. We take advantage of the layered architectural style that has proven to be useful and effective in designing complex systems such as the computer itself. We also use the underlying concepts of the ISO OSI Reference Model as a guideline.