The book ‘Data Intensive Computing Applications for Big Data’ discusses the technical concepts of big data and data intensive computing through machine learning, soft computing and parallel computing paradigms. The volume brings together researchers to report their latest results or progress in the development of the above-mentioned areas. Since there are few books on this specific subject matter, the editors aim to provide a common platform for researchers working in this area to exhibit their novel findings.
The book is organized into twenty-four chapters.
Chapter 1 presents a detailed survey of this diverse domain. It covers issues such as predicting future business possibilities and probabilities, managing big data through NoSQL databases, search and knowledge discovery, real-time data stream analysis, reducing latency when accessing and processing big data in main memory, managing the redundancy and location of distributed big data, integrating data through virtualization (that is, without physically moving high volumes of data), integrating data through physical movement, preparing data for useful analytics, and managing data quality.
Chapter 2 covers major breakthrough technologies like MapReduce, HadoopDB, Cassandra, Chubby Lock service, PLATFORA, SkyTree, Dremel, Pregel, Spanner, Shark, Megastore, Spark, F1, MLBase, NoSQL Databases, HBase, HDFS, YARN, Mahout and Chukwa. Big data technologies find applications in various domains, such as IT, government, defense, manufacturing, earth sciences, healthcare, agriculture, education, media industry, retail, real estate, science & research activities like the Large Hadron Collider (LHC), astronomy, and sports. The application of big data techniques and technologies in these domains is also discussed to emphasize their current importance in leading to data-driven decision making.
Chapter 3 addresses the management of vast quantities of information and explores how risk is assessed today. The rapid consumerization of IT has raised many difficulties. The average end user visits many sites and uses a growing number of operating systems and applications every day on a variety of mobile and desktop devices. This leads to an overwhelming and ever-increasing volume, velocity, and variety of created, shared, and propagated information. The risks implicit in this situation are continually evolving. These growing risks, coupled with the number and complexity of devices and the processing power now available to cybercriminals and the huge expansion in information, mean that software security organizations are grappling with difficulties on an unprecedented scale. Protecting computer users from digital threats is not a simple matter, and weaknesses in current risk detection techniques mean that satisfactory outcomes are lacking. Effective protection depends on the right blend of approach, knowledge, understanding and awareness of the risks, and the efficient handling of enormous amounts of information. Understanding how information is organized, analyzing complex relationships, and using specific search algorithms and custom models are fundamental to success. While not exhaustive, this chapter summarizes how big data is analyzed in the context of digital security to protect the end user.
Chapter 4 discusses various application domains and related security threats. It also discusses several of the big data security solutions available in the literature that safeguard data by ensuring privacy and encryption, such as the expectation-maximization algorithm, portable data binding, and privacy-preserving cost-reducing heuristic algorithms. Big data security analytics tools are also discussed according to five essential factors. Finally, the future prospects and research directions which can contribute to addressing big data security issues are examined.
Chapter 5 aims to motivate beginners, professionals and data analysts to adopt a combination of all these technologies, which could help decision makers from different domains in the decision-making process. Further, this chapter explores state-of-the-art technologies and their benefits for a social/commercial organization.
The objective of Chapter 6 is to provide an overall view of the algorithms developed and the paradigm shifts in current big data analysis that uses a machine learning approach to compute data. The chapter explores how the field of machine learning impacts the cloud computing paradigm. In the first step, various tools, such as libraries and statistical tools, are applied to the cloud. In the second step, plug-ins for current tools are embedded in order to build a Hadoop cluster in the cloud on which working programs can run. In the third step, libraries of machine learning algorithms are deployed and used for data intensive computing.
Chapter 7 discusses parallel architecture, parallel programming models, and MPI, OpenMP and CUDA programming.
In Chapter 8, classification algorithms are compared experimentally for detecting uncertain propositions in Twitter data on the food price crisis. The Naive Bayes and Support Vector Machine approaches are analyzed comparatively on tweets related to the crisis. A model is trained to classify a proposition as certain or uncertain using a training file annotated with cue words available in English language text. The output of the algorithm is the class indicating whether a given proposition is certain or uncertain. The objective of this chapter is to compare text classification approaches for detecting uncertain events in Twitter data on the food price crisis, and to improve the accuracy of uncertainty classification approaches for detecting uncertain events in natural language processing.
Chapter 9 focuses on the history, background and types of parallel computing, memory architectures, message passing, concurrency control, and deadlocks and their possible solutions in parallel computing.
In Chapter 10, the authors describe the attributes of big data, popularly known as the 9Vs. Business intelligence draws on these nine attributes, on the basis of statistical models or hypotheses, to provide better predictions and outcomes for research. Machine learning provides the platform on which big data analysis can be done using cloud computing. The chapter is intended to help academics, business people and researchers find solutions suited to their purposes.
In Chapter 11, cloud computing is discussed as internet-based computing in which application software, infrastructure and platforms are available in the cloud and end users (business people, developers) access them through the internet as clients. Cloud computing is a step on from utility computing. Owing to the increased use of these services by companies, several security issues have emerged, and these challenge cloud computing systems to secure, protect and process data which is the property of the user. High-level authentication protocols must therefore be developed to prevent security threats.
Chapter 12 focuses on understanding the need for, and the features and applications of, Spark SQL. It also includes Spark SQL code snippets to enhance the coding abilities of readers.
In Chapter 13, an extended computing paradigm is introduced, ideas about changing elements of the computing stack are suggested, and some implementation details of both hardware and software are discussed. The resulting new computing stack offers considerably higher computing throughput, a simplified hardware architecture, drastically improved real-time behavior and, in general, a simpler and more efficient computing stack.
Chapter 14 explains the reasons for choosing MongoDB to implement graph structures. A set of possible solutions is then identified. Finally, a novel application is presented which not only simplifies the efficient implementation of graph structures in MongoDB but also visualizes the graphs created by users.
Chapter 15 reviews current HIV/AIDS research activities, issues and solutions, government policies and census information from across the world. It also summarizes information on HIV/AIDS big data analytics, models, security and privacy features, big data management techniques, a quick decision-making approach, algorithms for prediction, classification, visualization, clustering, optimization and distributed processing problems, and future scope.
In Chapter 16, the authors discuss the distributed system as a collection of physically separated homogeneous or heterogeneous computer systems or processes which are networked to provide various resources of the system to connected users. Access to shared resources by connected users increases computation speed, data availability, functionality and reliability. In this chapter, the existing MUTEX and deadlock detection algorithms for distributed systems are described, and the performance of these algorithms is analyzed and compared.
Chapter 17 observes that simple map matching algorithms face difficulties when tested in road trials, and proposes an alternative. The proposed algorithm and approach overcome many of the drawbacks of existing methods and are capable of achieving higher accuracy and precise navigation in critical conditions for high carrier frequency signaling on a road network.
In Chapter 18, the authors use different techniques such as Naïve Bayes, random forest, neural networks, k-NN, C4.5 and decision trees to forecast disease. Depending on the sample size of the data, the accuracy in forecasting diabetes, cardiovascular disease and cancer lies in the ranges 67–100%, 85–100% and 90–98% respectively. Over time, the nature, volume, variety and veracity of data have changed tremendously, so it is difficult to envision how big data may influence the social, corporate and health industries. The effective use of Machine Learning (ML) and Prescriptive Big Data Analytics (PBDA) may assist in developing smart and complete healthcare solutions for the early diagnosis, treatment and prevention of diseases. A smart healthcare framework based on machine learning and prescriptive big data analytics is therefore designed for the accurate prediction of lifestyle-based human disorders. The use of prescriptive analytics is expected to improve the accuracy of different machine learning techniques in diagnosing such disorders. Finally, the novelty lies in the coexistence of big data, data warehouse and machine learning techniques.
Chapter 19 is directed at the use of parallel computing based neuro-feedback (PCBNFB) in a major therapeutic role in difficult areas such as ADD (Attention Deficit Disorder)/ADHD (Attention Deficit Hyperactivity Disorder), anxiety, Obsessive-Compulsive Disorder (OCD), learning disabilities, head injuries, pain reduction, and quality of life for cancer patients undergoing chemotherapy. In a developing country like ours, citing ADD/ADHD as a reason for underperformance or underachievement is not taken seriously, because these conditions are still largely unknown, and patients with ADD/ADHD or OCD spend their whole lives being regarded as lazy, disorganized and incapable. The PCBNFB unit will provide, for each modality, a real-time processing pipeline that handles signal acquisition and all the methods and algorithms required for NFB calculation through an effective and simpler QEEG and LENS approach. Both QEEG and LENS have advantages and limitations, so the proposed study explores QEEG and LENS in parallel for better analysis of various neural disorders. Furthermore, it should provide the flexibility of using multimodal or unimodal NFB.
Chapter 20 discusses S-Array, which is based on a master-slave configuration consisting of processors in multiples of four, where the input is given in the form 22n. S-Array is designed so that even if the amount of input data and the number of processing units increase, the time complexity of sorting the larger input remains the same, i.e. O(log(log(n))/2)+c. Future work lies in evaluating the performance of S-Array when it sorts huge data sets on different architectures.
Chapter 21 considers a knowledge discovery paradigm (classifying humanoid robot gestures correctly) and formulates an algorithm based on the protein synthesis mechanism. The proposed algorithm allows true exploratory knowledge discovery, is time efficient and robust, and requires a simpler input data format than conventional algorithms.
Chapter 22 provides implementation details of recommendation systems. Among the many programming languages available, Scala is object-oriented and also supports functional programming. For processing large amounts of data, functional programming is faster and easier, but in some cases security must be provided for sensitive data. Scala runs on the Java Virtual Machine (JVM) and can execute Java code, since Scala code is converted to Java bytecode when compiled. Scala therefore offers advantages for implementing big data applications.
In Chapter 23, the performance of wireless communication over fading channels is studied, simulated and compared for Maximum Likelihood (ML) receivers with different modulation structures in the presence of Gaussian channel estimation error, for different Doppler shifts. The second part of the study analyzes the performance degradation due to channel estimation error, fading noise, and interference between users for several fading channels in multi-user, multi-antenna systems, using different performance measures.
Chapter 24 discusses how the capabilities of blockchain cultivate a new form of data monitoring, analysis and storage. It identifies nine key challenges to be mindful of before considering a strategic shift to an entire infrastructure based on this new blockchain technology.
We are grateful to our family and friends, who have sacrificed a lot of their time and attention to ensure that we remained motivated to complete this crucial project.
The editors are thankful to all the members of IOS Press BV, especially Gerhard R. Joubert, Maarten Fröhlich, and E.H. Fredriksson, for the opportunity to edit this book.
Valentina E. Balas
D. Jude Hemanth