Ebook: Data Intensive Computing Applications for Big Data
The book ‘Data Intensive Computing Applications for Big Data’ discusses the technical concepts of big data, data intensive computing through machine learning, soft computing and parallel computing paradigms. It brings together researchers to report their latest results or progress in the development of the above mentioned areas. Since there are few books on this specific subject, the editors aim to provide a common platform for researchers working in this area to exhibit their novel findings. The book is intended as a reference work for advanced undergraduates and graduate students, as well as multidisciplinary, interdisciplinary and transdisciplinary research workers and scientists on the subjects of big data and cloud/parallel and distributed computing, and explains didactically many of the core concepts of these approaches for practical applications.
It is organized into 24 chapters providing a comprehensive overview of big data analysis using parallel computing, and addresses the complete data science workflow in the cloud as well as privacy issues and the challenges faced in a data-intensive cloud computing environment.
The book explores both fundamental and high-level concepts, and will serve as a manual for those in the industry, while also helping beginners to understand the basic and advanced aspects of big data and cloud computing.
The book ‘Data Intensive Computing Applications for Big Data’ discusses the technical concepts of big data, data intensive computing through machine learning, soft computing and parallel computing paradigms. The volume brings together researchers to report their latest results or progress in the development of the above-mentioned areas. Since there are few books on this specific subject matter, the editors aim to provide a common platform for researchers working in this area to exhibit their novel findings.
The book is organized into twenty-four chapters.
Chapter 1 presents a detailed survey of this diversified domain, covering issues such as predicting future business possibilities and probabilities, managing big data through NoSQL databases, search and knowledge discovery, real-time data stream analysis, reducing the latency of accessing and processing big data in main memory, managing the redundancy and location of distributed big data, integrating required data through virtualization (i.e. without physically moving the high volume of data), integrating data through physical movement, data preparation for useful analytics, and managing data quality.
Chapter 2 covers major breakthrough technologies like MapReduce, HadoopDB, Cassandra, Chubby Lock service, PLATFORA, SkyTree, Dremel, Pregel, Spanner, Shark, Megastore, Spark, F1, MLBase, NoSQL Databases, HBase, HDFS, YARN, Mahout and Chukwa. Big data technologies find applications in various domains, such as IT, government, defense, manufacturing, earth sciences, healthcare, agriculture, education, media industry, retail, real estate, science & research activities like the Large Hadron Collider (LHC), astronomy, and sports. The application of big data techniques and technologies in these domains is also discussed to emphasize their current importance in leading to data-driven decision making.
Chapter 3 addresses the management of vast quantities of information and the challenge of navigating today's threat landscape. The rapid consumerization of IT has heightened these difficulties: the average end user accesses a multitude of sites and uses a growing number of operating systems and applications every day on a variety of mobile and desktop devices. This leads to an overwhelming and ever-increasing volume, velocity and variety of data being created, shared and propagated. The risks implicit in this situation are continually evolving, and, coupled with the quantity and sophistication of the tools and processing power that cybercriminals now have available to them and the huge expansion in data, they mean that software security organizations are grappling with difficulties on an unprecedented scale. Shielding computer users from cyber threats is not a simple matter, and weak threat-detection techniques produce unsatisfactory outcomes. Effective protection depends on the correct blend of methodology, intelligence, understanding and awareness of the risks, and the efficient handling of enormous amounts of information. Awareness of how information is organized, the analysis of complex relationships, and the use of specialized search algorithms and custom models are fundamental to success. While not exhaustive, this chapter summarizes how big data is analyzed in the context of cyber security to protect the end user.
Chapter 4 discusses various application domains and their related security threats. It also discusses several of the big data security solutions available in the literature that safeguard data by ensuring privacy and encryption, such as the expectation-maximization algorithm, portable data binding, and privacy-preserving cost-reducing heuristic algorithms. Big data security analytics tools are also discussed in terms of five essential factors. Finally, the future prospects and research directions that can contribute to addressing big data security issues are examined.
Chapter 5 aims to motivate beginners, professionals and data analysts to adopt a combination of all these technologies, which could help decision makers from different domains in the decision-making process. Further, this chapter explores state-of-the-art technologies and their benefits for a social/commercial organization.
The objective of Chapter 6 is to provide an overall view of the developed algorithms and the paradigm shifts of current big data analysis using a machine learning approach to compute data. Here we explore how the field of machine learning impacts the cloud computing paradigm. In the first step, various tools, such as libraries and statistical tools, are deployed to the cloud. In the second step, plug-ins are embedded into current tools in order to build a Hadoop cluster in the cloud on which working programs can run. In the third step, libraries of machine learning algorithms are deployed and used for data intensive computing.
Chapter 7 discusses parallel architecture, parallel programming models, and MPI, OpenMP and CUDA programming.
In Chapter 8, classification algorithms are compared experimentally to detect uncertain propositions in Twitter data about the food price crisis. A comparative analysis of the Naive Bayes and Support Vector Machine classification approaches is performed to detect uncertain propositions in tweets related to the food price crisis. A model is trained to classify propositions as certain or uncertain using a training file annotated with cue words from English-language text; the output of the algorithm is the class indicating whether a given proposition is certain or uncertain. The objective of this chapter is to carry out a comparative analysis of text classification approaches for detecting uncertain events in Twitter data about the food price crisis, and to improve the accuracy of uncertainty classification approaches for detecting uncertain events in natural language processing.
Chapter 9 focuses on the history, background and types of parallel computing, memory architectures, message passing, concurrency control, deadlocks and their possible solutions in parallel computing.
In Chapter 10, the authors describe all the attributes of big data, popularly known as the 9Vs. Business intelligence builds on these nine attributes, using statistical models or hypotheses to provide better predictions and outcomes for any research or result. Machine learning provides the platform on which big data analysis can be carried out using cloud computing. With the help of this chapter, academics, business people and researchers can easily find solutions for their particular purposes.
In Chapter 11, cloud computing is discussed as internet-based computing in which the application software, infrastructure and platform are available in the cloud, and end users (business people, developers) can access them through the internet as clients. The cloud is a step on from utility computing. Owing to the increased use of these services by companies, several security issues have emerged, and these challenge the cloud computing system to secure, protect and process the data that is the property of the user. High-level authentication protocols must therefore be developed to prevent security threats.
Chapter 12 focuses on understanding the need, features and applications of Spark SQL. It will also include Spark SQL code snippets to enhance the coding abilities of the readers.
In Chapter 13, an extended computing paradigm is introduced, ideas about changing elements of the computing stack are suggested, and some implementation details of both hardware and software are discussed. The resulting new computing stack offers considerably higher computing throughput, a simplified hardware architecture, drastically improved real-time behavior and, in general, a simpler and more efficient design.
Chapter 14 explains the reasons for choosing MongoDB to implement graph structures, and then identifies a set of possible solutions. Finally, a novel application is presented that not only simplifies the efficient implementation of graph structures in MongoDB but also visualizes the graphs created by users.
Chapter 15 presents current HIV/AIDS research activities, issues and solutions, government policies and census information from across the world. It also summarizes information regarding HIV/AIDS big data analytics, models, security and privacy features, big data management techniques, rapid decision-making approaches, algorithms for prediction, classification, visualization, clustering, optimization and distributed processing problems, and the future scope.
In Chapter 16 the authors discuss the distributed system as a collection of physically separated homogeneous/heterogeneous computer systems or processes which are networked to provide the various resources of the system to connected users. Access to shared resources by connected users increases computation speed, data availability, functionality and reliability. The chapter describes the existing MUTEX and deadlock detection algorithms of distributed systems, and their performance is analyzed and compared.
Chapter 17 notes that simple map matching algorithms face difficulties in on-road trials, and proposes an algorithm and approach that overcome many of the drawbacks of existing methods, achieving higher accuracy and precise navigation in critical conditions for high carrier frequency signaling on road networks.
In Chapter 18 the authors use different techniques such as naïve Bayes, random forest, neural networks, k-NN, C4.5 and decision trees. Depending upon the sample size of the data, the accuracy of forecasting diabetes, cardiovascular disease and cancer lies in the ranges 67–100%, 85–100% and 90–98% respectively. With the passage of time, the nature, volume, variety and veracity of data have changed tremendously, and it is therefore difficult to envision how big data may influence the social, corporate and health industries. The effective use of machine learning (ML) and prescriptive big data analytics (PBDA) may assist in developing a smart and complete healthcare solution for the early diagnosis, treatment and prevention of diseases. A smart healthcare framework based upon machine learning and prescriptive big data analytics is therefore designed for the accurate prediction of lifestyle-based human disorders. The use of prescriptive analytics will improve the accuracy of different machine learning techniques in the diagnosis of different lifestyle-based human disorders. Finally, the novelty lies in the coexistence of big data, data warehouse and machine learning techniques.
Chapter 19 is directed at utilizing parallel computing based neuro-feedback (PCBNFB) in a major therapeutic role in difficult areas such as attention deficit disorder (ADD)/attention deficit hyperactivity disorder (ADHD), anxiety, obsessive-compulsive disorder (OCD), learning disabilities, head injuries, pain reduction, and the quality of life of cancer patients undergoing chemotherapy. In developing countries, citing ADD/ADHD as a reason for underperformance or underachievement is not taken seriously, as awareness of the condition remains low, and patients with ADD/ADHD or obsessive-compulsive disorder may spend their whole lives regarded as lazy, disorganized or dull. The PCBNFB unit will provide, for each modality, a real-time processing pipeline that handles signal acquisition and all the methods/algorithms required for NFB calculation through effective and easier QEEG and LENS approaches. Both QEEG and LENS have their advantages and limitations, so the proposed research study explores QEEG and LENS in parallel for better analysis of various neural disorders. Furthermore, it should provide the flexibility of using multimodal or unimodal NFB.
Chapter 20 discusses S-Array, which is based on a master–slave configuration consisting of processors in multiples of four, where the input is given in the form 2^(2n). S-Array is designed in such a way that, even if the number of input data items and the number of processing units are increased, the time complexity of sorting the increased amount of input data remains the same, i.e. O(log(log n)/2) + c. Future work lies in evaluating the performance of S-Array when it sorts huge amounts of data on different architectures.
Chapter 21 considers a knowledge discovery paradigm (correctly classifying humanoid robot gestures) and formulates an algorithm based on the protein synthesis mechanism. The proposed algorithm allows true exploratory knowledge discovery; it is time-efficient and robust, and requires a simpler input data format than conventional algorithms.
Chapter 22 provides implementation details of recommendation systems. Among the many programming languages available, Scala is object-oriented and also supports functional programming. For processing large amounts of data, functional programming is faster and easier, although in some cases security must be provided for sensitive data. Scala runs on the Java Virtual Machine (JVM) and can execute Java code, since Scala code is compiled to Java bytecode; Scala therefore offers an advantage for implementing big data applications.
In Chapter 23, the performance of Maximum Likelihood (ML) receivers over fading channels in wireless communication is studied, simulated and compared for different modulation structures in the presence of Gaussian channel estimation error at different Doppler shifts. The second part of the study analyses the performance degradation due to channel estimation error, fading noise, and interference between users for several fading channels in multi-user, multiple-antenna systems, using different performance measures.
Chapter 24 discusses how blockchain's capabilities cultivate a new form of data monitoring, analysis and storage. It identifies nine key challenges to be mindful of before considering a strategic shift of an entire infrastructure onto this new blockchain technology.
We are grateful to our families and friends, who sacrificed much of their time and attention to keep us motivated to complete this important project.
The editors are thankful to all the members of IOS Press BV, especially Gerhard R. Joubert, Maarten Fröhlich and E.H. Fredriksson, for the opportunity to edit this book.
Mamta Mittal
Valentina E. Balas
D. Jude Hemanth
Raghvendra Kumar
Big data is a term used for large amounts and varieties of data that are structured, unstructured or hybrid. Such data comes from different sources at high speed, and is too big and complicated to store, analyze and interpret. Big data is broadly characterized by four Vs: Volume, Velocity, Variety and Veracity, which make it a large and highly diversified domain. The processing, management and analysis of big data have opened up a number of research domains. This data is different from traditional relational data, so Relational Database Management Systems (RDBMS) are incapable of handling it: they cannot cope with the growth in data volume, the heterogeneity of data, or the need to store and access it speedily. This necessitated a new type of database functionality to support real-time applications, and big data technologies have emerged as a result.
In this chapter, we carry out a detailed survey of this diversified domain, covering topics such as predicting future business possibilities and probabilities, managing big data through NoSQL databases, search and knowledge discovery, real-time data stream analysis, reducing the latency of accessing and processing big data in main memory, managing the redundancy and location of distributed big data, integrating required data through virtualization (i.e. without physically moving the high volume of data), integrating data through physical movement, data preparation for useful analytics, and managing data quality.
The world today is engulfed in a deluge of data in different formats, generated from innumerable sources such as mobile phones, social media, digital platforms, scientific experiments and enterprise applications. Such huge amounts of unstructured and semi-structured data, coming from various sources and in different formats, are termed “Big Data”. The trillion-sensor world will further add to the explosive growth of the big data segment. According to Gartner, “Big data is high volume, high velocity, and high variety information assets that require new forms of processing to enable enhanced decision making, insight, discovery, and process optimization”. It is quite evident that big data is creating opportunities for effective decision making across several domains and organizations.
For many years, Relational Database Management Systems (RDBMS) have been a comprehensive solution for the storage, processing and analysis of data. But to capture, store, process, analyze and visualize big data, traditional database management techniques are no longer sufficient. This raises the question: what are the limitations of traditional database systems that lead to their inability to handle “big data”? These limitations, discussed in this chapter, have made it essential to shift towards the paradigm of breakthrough big data technologies and techniques in a world drowning in a “data tsunami”. This chapter attempts to elaborate the reasons for this paradigm shift from traditional database processing to big data processing technologies.
Various big data analysis, processing and visualization techniques and technologies that have changed the world of big data are discussed in this chapter. Initially, large firms such as Google developed technologies like MapReduce, the Google File System and Bigtable to meet their own data needs, but eventually most organizations and even governments started gearing up for big data solutions to improve their decision-making processes through value generation based on data trends. There has therefore been a sharp rise in the development of big data techniques. This chapter covers major breakthrough technologies like MapReduce, HadoopDB, Cassandra, Chubby Lock service, PLATFORA, SkyTree, Dremel, Pregel, Spanner, Shark, Megastore, Spark, F1, MLBase, NoSQL databases, HBase, HDFS, YARN, Mahout and Chukwa. Big data technologies find applications in various domains such as IT, government, defence, manufacturing, earth sciences, healthcare, agriculture, education, the media industry, retail, real estate, science and research activities like the Large Hadron Collider, astronomy, and sports. The application of big data techniques and technologies in these domains is also discussed to emphasize their importance in the present world of data-driven decision making.
The amount of data in the world is growing day by day. Data is building up as a direct consequence of the use of the web, smartphones and social networks. Big data is a collection of data sets that are large in size and also complex; typically the size of the data is measured in petabytes and exabytes. Standard database systems are not able to capture, store and analyze this huge amount of data, and as the web grows, the amount of big data keeps increasing. Big data analysis gives organizations and governments better ways to analyze unstructured data. Nowadays, big data is one of the most talked about topics in the IT industry, and it will play an important role in the future. Big data changes the way data is managed and used. Some of its applications are in areas such as healthcare, traffic management, banking, retail and education. Organizations are becoming more flexible and more open, and new types of data will also bring new challenges. The present chapter highlights the basic concepts of big data. When dealing with “big data”, the volume and types of data about IT and the business are too complex to handle in an ad hoc way, and it has become increasingly difficult to extract essential information from the data being collected.
Every digital process and social media exchange produces it; systems, sensors and mobile phones transmit it. With advances in technology, this data is being recorded and enormous value is being extracted from it. Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that can be mined for information.
Security and privacy issues are magnified by the velocity, volume and variety of big data, for instance large-scale cloud infrastructures, the diversity of data sources and formats, the streaming nature of data acquisition, and high-volume inter-cloud migration. Thus, traditional security mechanisms, which are tailored to securing small-scale static (as opposed to streaming) data, are inadequate. In this chapter we highlight some big-data-specific security and privacy challenges. Our hope in highlighting these challenges is that they will bring renewed focus on strengthening big data infrastructures.
Managing big data and navigating today's threat landscape is challenging. The rapid consumerization of IT has heightened these challenges. The average end user accesses a multitude of sites and uses a growing number of operating systems and applications every day on a variety of mobile and desktop devices. This translates into an overwhelming and ever-increasing volume, velocity and variety of data being created, shared and propagated. The threat landscape has evolved at the same time, with the number of threats increasing by orders of magnitude over short periods. This growing threat landscape, the number of sophisticated tools and the computing power that cybercriminals now have at their disposal, and the expansion of big data mean that software security organizations are grappling with challenges on an unprecedented scale. Protecting computer users from the onslaught of cyber threats is no simple task; if threat detection techniques are weak, the outcome is inadequate. Effective protection depends on the correct blend of methodologies, human intelligence, an expert understanding of the threat landscape, and the efficient processing of big data to create actionable intelligence. Understanding how data is organized, analyzing complex relationships, employing specialized search algorithms, and using custom models are essential components. While the details of these components are not fully examined here, this chapter summarizes how big data is analyzed in the context of cyber security, ultimately to benefit the end user.
Many organizations demand efficient solutions for storing and analyzing huge amounts of data. Cloud computing, as an enabler, provides scalable resources and significant economic benefits in the form of reduced operational costs. This paradigm raises a broad range of security and privacy issues that must be taken into consideration. Multi-tenancy, loss of control, and trust are key challenges in cloud computing environments. This chapter reviews the existing technologies and a wide array of both earlier and state-of-the-art projects on cloud security and privacy.
The tremendous rise in the volume, velocity and variety of structured (e.g. RDBMS), semi-structured (e.g. XML, JSON, NoSQL) and unstructured (sensors, social media, audio, video, mobiles) data produced by different industries, governments and organizations has led to the emergence of a new paradigm called “Big Data”. This data is being collected from different sectors, such as open government data, healthcare, defence, scientific experiments (e.g. the Large Hadron Collider, the Sloan Digital Sky Survey, the Human Genome Project and the Square Kilometre Array), media (data journalism and mining), academia, information technology, manufacturing, sports, entertainment, social media and the Internet of Things.
Big data utilizes cloud-computing-based distributed storage rather than local storage because of unpredictable data sizes, unstructured formats and a variety of other reasons. Several big data cloud platforms are currently available for the storage, analysis and processing of big data, such as Google cloud services, AppEngine, BigQuery, Azure, S3, DynamoDB, MapReduce/YARN and Apache Spark, provisioned by tech giants such as Google, Microsoft, Amazon and Cloudera. Owing to the security and privacy concerns of the data being stored, designing appropriate cloud computing platforms is a major challenge for researchers. Domains like defence and healthcare do not share their data on cloud platforms because of the lack of legal frameworks that could ensure the ethics, quality, integrity, security and confidentiality of the data. The security and privacy issues, threats and concerns related to big data storage and analysis that need to be addressed immediately are examined in this chapter, and the implications of these threats are investigated with live examples of security breaches and concerns worldwide. The chapter discusses various application domains and their related security threats, as well as several big data security solutions available in the literature that address these issues and safeguard the data by ensuring privacy and encryption, such as the expectation-maximization algorithm, portable data binding and privacy-preserving cost-reducing heuristic algorithms. Big data security analytics tools are also discussed in terms of five essential factors. Finally, the future prospects and research directions that can contribute to addressing big data security issues are outlined.
The data explosion and the tremendous growth in the amount of data generated by various IT services create an enormous demand for smart analysis of the generated data (structured and unstructured). Since big data comprises three Vs (Volume, Velocity and Variety), machine learning is the best solution for exploiting the information hidden in big data. With far less dependence on manual direction, it provides various methods and techniques to mine useful information from big and disparate data sources. For better and more precise results, machine learning techniques require a huge volume of data and high computation power; cloud computing is therefore the best option for obtaining cost-effective infrastructure. This chapter aims to motivate beginners, professionals and data analysts to adopt a combination of all these technologies. The resulting products would help decision makers from different domains in the decision-making process. Further, this chapter explores state-of-the-art technologies and their benefits for social/commercial organizations.
When machine learning algorithms are applied to huge amounts of data, difficulties arise in processing it. New approaches are being adopted because existing machine learning libraries do not have enough resources to process large datasets, so new libraries (CUDA, MapReduce and Dryad) are being added to support concepts such as parallel computing. Here we take account of GraphLab, Apache Mahout and Jubatus to survey well-known academic and industrial results. With traditional machine learning techniques, tasks such as handling data that is identically distributed or processed in batch mode become impossible, and new algorithms are required to overcome the difficulties faced by these traditional ML algorithms. The objective of this chapter is to provide an overall view of the developed algorithms and the paradigm shifts of current big data analysis using a machine learning approach to compute data. Here we explore how the field of machine learning impacts the cloud computing paradigm. In the first step, various tools, such as libraries and statistical tools, are deployed to the cloud. In the second step, plug-ins are embedded into current tools in order to build a Hadoop cluster in the cloud so that working programs can run on it. In the third step, libraries of machine learning algorithms are deployed and used for data intensive computing.
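As a rough illustration of the third step only, the sketch below runs a library machine learning algorithm (k-means from Spark MLlib) over data held in cluster storage. The HDFS path, column names and parameters are invented for illustration, and this is not the chapter's own code; it assumes a Spark deployment backed by a Hadoop cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

object CloudMlSketch {
  def main(args: Array[String]): Unit = {
    // On a real cluster the master URL is supplied by the launcher (spark-submit).
    val spark = SparkSession.builder().appName("cloud-ml-sketch").getOrCreate()

    // Hypothetical dataset stored on the cluster's HDFS.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/measurements.csv")

    // Assemble two illustrative numeric columns into a feature vector.
    val features = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(raw)

    // Run the library algorithm over the data-intensive workload.
    val model = new KMeans().setK(3).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```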
The massively parallel architecture of GPUs, coming from their graphics heritage, is now delivering transformative results for scientists and researchers all over the world. For some of the world's most challenging problems in medical research, drug discovery, weather modeling and seismic exploration, computation is the ultimate tool; without it, research would still be confined to trial-and-error physical experiments and observation. Parallel computing is the use of multiple computers or processors working together on a common task: each processor works on its own section of the problem, and processors can exchange information. In this chapter we discuss parallel architecture, parallel programming models, and MPI, OpenMP and CUDA programming.
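The chapter itself works with MPI, OpenMP and CUDA; purely as a small, language-neutral taste of the shared idea, each worker handling its own section of the problem before the partial results are combined, here is a minimal sketch using plain Scala futures. The workload and chunking are illustrative only.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartialSums {
  def main(args: Array[String]): Unit = {
    val data = (1L to 1000000L).toVector   // illustrative workload
    val workers = 4                         // number of parallel tasks

    // Each worker sums its own section of the problem.
    val chunkSize = math.ceil(data.size.toDouble / workers).toInt
    val partials = data.grouped(chunkSize).toVector.map(chunk => Future(chunk.sum))

    // "Exchange information": combine the partial results into the final answer.
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"total = $total")
  }
}
```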
It is a world full of information, and data is one of the important elements of this era. One of the major sources of data is social media platforms such as Twitter and Facebook. Every day, social media generates a great deal of data. It is a free form of communication where people communicate with each other without any restriction, and users can post anything. However, the human tendency to speak about non-existent or unsourced things makes such content unclear or useless, and the information becomes unreliable. This casual, word-of-mouth form of communication generates uncertain data, and the quality of information from a factuality point of view becomes a primary concern in social media. Uncertainty detection is therefore important in social media, and uncertainty detection in natural language text is challenging because dealing with natural language text is complicated. Uncertainty is an important field of linguistics; basically, it means “lack of information”, so a statement whose truth value cannot be determined is considered uncertain. Linguistics and natural language processing aim at classifying factual and uncertain propositions. In this chapter, classification algorithms are compared experimentally to detect uncertain propositions in Twitter data about the food price crisis. A comparative analysis of the Naive Bayes and Support Vector Machine classification approaches is performed to detect uncertain propositions in tweets related to the food price crisis. A model is trained to classify propositions as certain or uncertain using a training file annotated with cue words from English-language text; the output of the algorithm is the class indicating whether a given proposition is certain or uncertain. The objective of this chapter is to carry out a comparative analysis of text classification approaches for detecting uncertain events in Twitter data about the food price crisis, and to improve the accuracy of uncertainty classification approaches for detecting uncertain events in natural language processing.
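As a hedged sketch of the kind of comparison described, not the authors' exact pipeline or data, the snippet below trains Naive Bayes and linear SVM text classifiers on a tiny hand-labelled sample using Spark ML. The example tweets, labels and feature settings are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, HashingTF}
import org.apache.spark.ml.classification.{NaiveBayes, LinearSVC}

object UncertaintySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("uncertainty-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented examples: label 1.0 = uncertain proposition, 0.0 = certain.
    val train = Seq(
      (1.0, "food prices may rise sharply next month"),
      (1.0, "reports suggest a possible wheat shortage"),
      (0.0, "the ministry published official food price data today"),
      (0.0, "rice prices increased by 12 percent last week")
    ).toDF("label", "text")

    // Shared text-to-features steps: tokenize, then hash terms into a frequency vector.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")

    // Train the two classifiers to be compared.
    val nbModel  = new Pipeline().setStages(Array(tokenizer, tf, new NaiveBayes())).fit(train)
    val svmModel = new Pipeline().setStages(Array(tokenizer, tf, new LinearSVC())).fit(train)

    val test = Seq((1.0, "prices might spike if the harvest fails")).toDF("label", "text")
    nbModel.transform(test).select("text", "prediction").show(false)
    svmModel.transform(test).select("text", "prediction").show(false)

    spark.stop()
  }
}
```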
Parallel computing has had a place in the world of computation since 1842, beginning with the design of the Analytical Engine by the English mathematician and computer pioneer Charles Babbage. It took a great deal of brainstorming by researchers such as John Cocke, Daniel Slotnick and Gene Amdahl, and by organizations such as IBM, Burroughs Corporation and Honeywell, over the following century to coin various laws and set the background for today's parallel computing platforms. Since then, parallel computing has successfully replaced the notion of sequential computing in terms of instruction-level, task-level and thread-level parallelism. From the use of SISD programs to MIMD programs, the evolution of parallelism in computing has been phenomenal. When a job is divided into several parts, each of which is capable of running in parallel with the others, this not only increases the overall speed of execution of the job, but also ensures the optimal utilization of computer resources such as memory and processor time. On this basis, computers can be categorized by the hardware level at which they support parallelism, which includes multi-core/many-core computers, multi-processor computers, clusters, grids and massively parallel supercomputers. Alongside parallelization come concepts such as concurrency control, deadlocks and the synchronization of tasks, and much work has been done over the years to address these concepts in the context of parallel computing. This chapter focuses on the history, background and types of parallel computing, memory architectures, message passing, concurrency control, deadlocks and their possible solutions in parallel computing.
The analysis of big data can be done in many ways. To define trends in big data we need to concentrate on the biggest challenges faced by this technology, and various strategies have been developed in order to process such large data efficiently. Big data is usually described by three factors, popularly known as the 3Vs (Volume, Velocity and Variety), which cover most of the features of the data. The goal of this chapter is to provide the traditional meanings and definitions of big data and to show how this technology has evolved to meet today's requirements and challenges. In this chapter we describe all the attributes of big data, popularly known as the 9Vs. Business intelligence builds on these nine attributes, using statistical models or hypotheses to provide better predictions and outcomes for any research or result. Machine learning provides the platform on which big data analysis can be carried out using cloud computing. With the help of this chapter, academics, business people and researchers can easily find opportunities and solutions for their particular purposes.
In the current era, cloud computing is gaining great popularity for its many characteristics, which are divided into common characteristics and essential characteristics. The common characteristics are massive scale, homogeneity, virtualization, low-cost software, resilient computing, geographic distribution, service orientation and advanced security. The essential characteristics are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. Cloud computing is internet-based computing in which the application software, infrastructure and platform are available in the cloud, and end users (business people, developers) can access them through the internet as clients. The cloud is a step on from utility computing. Owing to the increased use of these services by companies, several security issues have emerged, and these challenge the cloud computing system to secure, protect and process the data that is the property of the user. High-level authentication protocols must therefore be developed to prevent security threats.
CCMP (Counter Mode with Cipher Block Chaining Message Authentication Code Protocol) is a two-cycle authenticated encryption (AE) mode. The first cycle is used to perform confidentiality computations, and the other cycle is used to compute authentication and integrity; both cycles use the same encryption primitive. It is already known that CCM/CCMP is a combination of two modes, namely AES counter mode and cipher block chaining MAC (CBC-MAC) mode. The counter mode is used to perform encryption, which guarantees data privacy, whereas CBC-MAC is used to achieve data validity and integrity. In this chapter, an investigation and critical analysis of a CCMP-based Secure Cloud (SecC) mechanism for cloud data management is presented, which further addresses security issues in cloud networks.
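To make the two-cycle idea concrete, here is a minimal JVM-crypto sketch in Scala: one pass encrypts with AES in counter mode for confidentiality, and a second pass computes an integrity tag over the ciphertext. Real CCMP derives its tag with CBC-MAC and a specific nonce/counter-block format; HMAC-SHA256 is used below only as a stand-in for the integrity cycle, and the key handling is purely illustrative, not the SecC mechanism described in the chapter.

```scala
import javax.crypto.{Cipher, KeyGenerator, Mac}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}
import java.security.SecureRandom

object TwoCycleAeSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative 128-bit AES key (a real system would use a managed key hierarchy).
    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val key = keyGen.generateKey()

    // Counter-mode IV (CCMP builds its counter blocks from a nonce; this is simplified).
    val iv = new Array[Byte](16)
    new SecureRandom().nextBytes(iv)

    // Cycle 1: confidentiality via AES in counter (CTR) mode.
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
    val ciphertext = cipher.doFinal("sensitive cloud record".getBytes("UTF-8"))

    // Cycle 2: integrity/authenticity tag over the ciphertext
    // (HMAC-SHA256 here as a stand-in for CCMP's CBC-MAC).
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(key.getEncoded, "HmacSHA256"))
    val tag = mac.doFinal(ciphertext)

    println(s"ciphertext bytes: ${ciphertext.length}, tag bytes: ${tag.length}")
  }
}
```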
Recent advances in big data have made it possible to analyze huge dumps of readily available transactional data to predict patterns and trends. The Hadoop framework was developed on the basis of MapReduce to exploit parallelism to the fullest, and it has indeed made computing mechanisms more robust, flexible, scalable and efficient. At the same time, this has unearthed many new limitations of existing databases and computational algorithms, such as processing speed versus waiting times and the parallelizability of a query. In this chapter, we focus on understanding the need for, the features of and the applications of Spark SQL. The chapter also includes Spark SQL code snippets to enhance the coding abilities of readers.
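In the spirit of the code snippets the chapter promises, here is a minimal Spark SQL example (the JSON file path and its fields are illustrative, not taken from the chapter): it registers a DataFrame as a temporary view and queries it with ordinary SQL.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-sketch").master("local[*]").getOrCreate()

    // Illustrative input: one JSON object per line, e.g. {"name":"Ada","age":36,"city":"Pune"}.
    val people = spark.read.json("people.json")

    // Expose the DataFrame to SQL and run a plain SQL query against it.
    people.createOrReplaceTempView("people")
    spark.sql(
      """SELECT city, COUNT(*) AS residents, AVG(age) AS avg_age
        |FROM people
        |WHERE age >= 18
        |GROUP BY city
        |ORDER BY residents DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```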
Computing is still based on the 70-year-old paradigms introduced by von Neumann. The need for more performant, comfortable and safe computing has forced the development and use of several tricks in both hardware and software. Until now, technology has enabled performance increases without changing the basic computing paradigms. The recent stalling of single-threaded computing performance, however, requires computing to be redesigned in order to provide the expected performance, and to do so the computing paradigms themselves must be scrutinized. The limitations caused by an overly restrictive interpretation of the computing paradigms are demonstrated, an extended computing paradigm is introduced, ideas about changing elements of the computing stack are suggested, and some implementation details of both hardware and software are discussed. The resulting new computing stack offers considerably higher computing throughput, a simplified hardware architecture, drastically improved real-time behavior and, in general, a simpler and more efficient design.
This chapter explores the ways in which a MongoDB database can be used as a graph database. We identify the advantages of designing real-world applications using one of the dynamic set of Not Only SQL (NoSQL) databases: graph databases. MongoDB, being a document-oriented database, is not capable of processing graphs by default. We explain the reasons for choosing MongoDB to implement graph structures and then identify a set of possible solutions. Finally, a novel application is presented that not only simplifies the efficient implementation of graph structures in MongoDB but also visualizes the graphs created by users.
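The chapter's own MongoDB solution is not reproduced here, but the underlying modelling idea, a graph kept as a collection of edge documents and traversed by repeatedly following edges, can be sketched in plain Scala. The field names and data are invented; in MongoDB itself such a traversal would be expressed with repeated queries or an aggregation stage over the edge collection.

```scala
object EdgeDocumentSketch {
  // An edge "document": MongoDB would store these as records such as {from: ..., to: ...}.
  final case class Edge(from: String, to: String)

  // Breadth-first traversal over the edge collection, analogous to repeatedly
  // looking up outgoing edges of the current frontier.
  def reachable(edges: Seq[Edge], start: String): Set[String] = {
    val adjacency: Map[String, Seq[String]] =
      edges.groupBy(_.from).map { case (node, es) => node -> es.map(_.to) }

    @annotation.tailrec
    def loop(frontier: List[String], seen: Set[String]): Set[String] = frontier match {
      case Nil => seen
      case node :: rest =>
        val next = adjacency.getOrElse(node, Nil).filterNot(seen)
        loop(rest ++ next, seen ++ next)
    }
    loop(List(start), Set(start))
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge("a", "b"), Edge("b", "c"), Edge("c", "d"), Edge("x", "y"))
    println(reachable(edges, "a"))   // Set(a, b, c, d)
  }
}
```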
This chapter is intended as a key reference for researchers in the field of HIV/AIDS research, drawing on the huge amount of data collected across the world together with medical and general information related to HIV/AIDS research. It addresses the data management and processing problems of existing systems by means of big data analytics. The chapter presents current HIV/AIDS research activities, issues and solutions, government policies and census information from across the world. It also summarizes information regarding HIV/AIDS big data analytics, models, security and privacy features, big data management techniques, rapid decision-making approaches, algorithms for prediction, classification, visualization, clustering, optimization and distributed processing problems, and the future scope.
In a normal multiprogramming environment, several processes may compete for resources. If a process requests a resource and the resource is not available at that time, the process enters a waiting state. Sometimes, however, a waiting process never leaves that state, and it enters a situation called indefinite blocking, or deadlock. Hence, a deadlock or starvation state occurs in a distributed system whenever two or more processes wait indefinitely for an event that can only be caused by one of the waiting processes.
Mutual exclusion (MUTEX) is a condition in which two or more processes want to access the same non-sharable resource, e.g. a printer. MUTEX therefore ensures that at most one process can access a non-sharable resource at any time (safety). Sharable resources, on the other hand, do not require mutually exclusive access and therefore never enter a deadlock or starvation state. A distributed system is a collection of physically separated homogeneous/heterogeneous computer systems or processes which are networked to provide the various resources of the system to connected users. Access to shared resources by connected users increases computation speed, data availability, functionality and reliability. In this chapter, the existing MUTEX and deadlock detection algorithms of distributed systems are described, and their performance is analyzed and compared.
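As a small illustration of how a deadlock can arise, and be avoided, when two processes each need two non-sharable resources, consider the sketch below. The resource names are invented, and the distributed MUTEX algorithms surveyed in the chapter rely on message passing rather than JVM locks; this is only the single-machine analogue of the circular-wait condition.

```scala
object DeadlockSketch {
  // Two non-sharable "resources" represented as lock objects.
  private val printer = new Object
  private val scanner = new Object

  // Deadlock-prone: the two workers acquire the resources in opposite order,
  // so each may end up holding one lock while waiting forever for the other.
  def worker1(): Unit = printer.synchronized { Thread.sleep(10); scanner.synchronized { println("w1 done") } }
  def worker2(): Unit = scanner.synchronized { Thread.sleep(10); printer.synchronized { println("w2 done") } }

  // Classic remedy: impose a global ordering on resource acquisition,
  // so the circular-wait condition needed for deadlock cannot occur.
  def worker2Safe(): Unit = printer.synchronized { scanner.synchronized { println("w2 done safely") } }

  def main(args: Array[String]): Unit = {
    val a = new Thread(() => worker1())
    val b = new Thread(() => worker2Safe())   // swap in worker2() to risk a deadlock
    a.start(); b.start(); a.join(); b.join()
  }
}
```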
The accurate prediction of vehicle location is essential for various Intelligent Transport System (ITS) applications. Existing vehicle location systems work with different position sensors and smart computing devices to execute map matching algorithms. There is a pressing need to enhance the positioning performance of receivers with respect to the error characteristics of Global Positioning System (GPS) signals. Simple map matching algorithms face difficulties in on-road trials. The proposed algorithm and approach overcome many of the drawbacks of existing methods and are capable of achieving higher accuracy and precise navigation in critical conditions for high carrier frequency signaling on road networks.
The IT-based healthcare industry is advancing in leaps and bounds, and a massive volume and variety of healthcare data is now available. A number of procedures have been designed for the early diagnosis of human disorders. Here, a critical analysis of different machine learning algorithms for the early diagnosis of lifestyle-based diseases is carried out. The authors use techniques such as naïve Bayes, random forest, neural networks, k-NN, C4.5 and decision trees. Depending upon the sample size of the data, the accuracy of forecasting diabetes, cardiovascular disease and cancer lies in the ranges 67–100%, 85–100% and 90–98% respectively. With the passage of time, the nature, volume, variety and veracity of data have changed tremendously, and it is therefore difficult to envision how big data may influence the social, corporate and health industries. The effective use of machine learning (ML) and prescriptive big data analytics (PBDA) may assist in developing a smart and complete healthcare solution for the early diagnosis, treatment and prevention of diseases. A smart healthcare framework based upon machine learning and prescriptive big data analytics is therefore designed for the accurate prediction of lifestyle-based human disorders. The use of prescriptive analytics will improve the accuracy of different machine learning techniques in the diagnosis of different lifestyle-based human disorders. Finally, the novelty lies in the coexistence of big data, data warehouse and machine learning techniques.
Neuro-feedback (NFB) has been considered a cognitive-science technique for retraining brainwave patterns. Various brain waves occur at different frequencies, some fast and some quite slow, specified as the classic EEG bands: delta (0.5–3.5 Hz, very slow, high amplitude, experienced in deep restorative sleep), theta (4–8 Hz, representing daydreaming, correlated with mental inefficiency, a twilight zone between waking and sleep), alpha (8–12 Hz, slower and larger, correlated with a state of relaxation), beta (13–30 Hz, fast brainwaves correlated with intellectual activity) and gamma (above 30 Hz, very fast EEG activity). The proposed work focuses on utilizing parallel computing based neuro-feedback (PCBNFB) as a rehabilitation means that directly retrains the electrical activity (waves) in the human brain. The brain's ability to make sense of different stimuli at the same time (simultaneously) is referred to here as parallel computing, so the proposed research study is directed at utilizing PCBNFB in a major therapeutic role in difficult areas such as attention deficit disorder (ADD)/attention deficit hyperactivity disorder (ADHD), anxiety, obsessive-compulsive disorder (OCD), learning disabilities, head injuries, pain reduction, and the quality of life of cancer patients undergoing chemotherapy. In developing countries, citing ADD/ADHD as a reason for underperformance or underachievement is not taken seriously, as awareness of the condition remains low, and patients with ADD/ADHD or obsessive-compulsive disorder may spend their whole lives regarded as lazy, disorganized or dull. The PCBNFB unit will provide, for each modality, a real-time processing pipeline that handles signal acquisition and all the methods/algorithms required for NFB calculation through effective and easier QEEG and LENS approaches. Both QEEG and LENS have their advantages and limitations, so the proposed research study explores QEEG and LENS in parallel for better analysis of various neural disorders. Furthermore, it should provide the flexibility of using multimodal or uni-modal NFB.
The problem of sorting a sequence of N elements on a parallel computer with K processors has been discussed since 1975, when parallel computing came into existence. Presently, with the exponential increase in the amount of data and in computing power, a sorting algorithm is required in which all the available processors are utilized without an increase in the overhead of merging. Because of the increase in computation power, research into new sorting algorithms has slowed down. Our proposed S-Array algorithm is an approach to the parallel execution of the sort operation, based on the “divide and conquer” principle, in a general multiprocessor architecture such as NUMA (Non-Uniform Memory Access). S-Array is based on a master–slave configuration consisting of processors in multiples of four, where the input is given in the form 2^(2n). S-Array is designed in such a way that, even if the number of input data items and the number of processing units are increased, the time complexity of sorting the increased amount of input data remains the same, i.e. O(log(log n)/2) + c. Our future work lies in evaluating the performance of S-Array when it sorts huge amounts of data on different architectures.
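S-Array itself is defined in the chapter; purely as a generic illustration of divide-and-conquer parallel sorting (not the S-Array algorithm, its master–slave layout or its stated complexity), the sketch below sorts chunks concurrently on worker tasks and then merges the sorted chunks.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSortSketch {
  // Merge two already-sorted vectors into one sorted vector.
  def merge(a: Vector[Int], b: Vector[Int]): Vector[Int] = {
    val out = Vector.newBuilder[Int]
    var i = 0; var j = 0
    while (i < a.length && j < b.length) {
      if (a(i) <= b(j)) { out += a(i); i += 1 } else { out += b(j); j += 1 }
    }
    out ++= a.drop(i); out ++= b.drop(j)
    out.result()
  }

  // Divide the input among worker tasks, sort each chunk in parallel, conquer by merging.
  def parallelSort(data: Vector[Int], workers: Int): Vector[Int] = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / workers).toInt)
    val sortedChunks = Future.sequence(data.grouped(chunkSize).toVector.map(c => Future(c.sorted)))
    Await.result(sortedChunks, 1.minute).foldLeft(Vector.empty[Int])(merge)
  }

  def main(args: Array[String]): Unit = {
    val data = Vector.fill(16)(scala.util.Random.nextInt(100))
    println(parallelSort(data, workers = 4))
  }
}
```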
Pattern mining and knowledge discovery is an important area of research, especially for huge temporal datasets. The present chapter discusses a biologically inspired classification method that helps discover meaningful hidden patterns in a dataset for classification. The process emulates the protein synthesis methodology based on the information encoded in DNA/RNA (deoxyribonucleic acid/ribonucleic acid) sequences, which encompasses three stages: transcription, splicing and translation. More specifically, we consider a knowledge discovery paradigm (correctly classifying humanoid robot gestures) and formulate an algorithm based on the protein synthesis mechanism. Our proposed algorithm allows true exploratory knowledge discovery; it is time-efficient and robust, and requires a simpler input data format than conventional algorithms.
Big data has several applications, such as the PageRank algorithm, recommendation systems, sentiment analysis, and many more. Recommendation is the key to success for the discovery and retrieval of content in this era of huge data [1]. The Association Rule Mining (ARM) algorithm is mostly used for recommendation systems. User behaviour is given as input to ARM in key/value pair format, which can be generated by a MapReduce function [2]. ARM attempts to find frequent item sets in large datasets and describes the association relationships among different attributes. Market-basket analysis is the major application of ARM, used to predict customers' purchasing behaviour and recommend products to customers [3]. Several programming languages can be used to implement such recommendation systems; Scala is one of them, being object-oriented while also supporting functional programming [4]. For processing large amounts of data, functional programming is faster and easier, although in some cases security must be provided for sensitive data. Scala runs on the Java Virtual Machine (JVM) and can execute Java code, since Scala code is compiled to Java bytecode. Scala therefore offers an advantage for implementing big data applications.
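To show why the functional style suits this workload, here is a tiny plain-Scala market-basket sketch with invented transactions. It counts item-pair co-occurrences, the kind of key/value aggregation that ARM and MapReduce scale up to large datasets; it is only an illustration, not the chapter's recommendation system.

```scala
object MarketBasketSketch {
  def main(args: Array[String]): Unit = {
    // Invented transactions: each is the set of items in one basket.
    val transactions = List(
      Set("milk", "bread", "eggs"),
      Set("milk", "bread"),
      Set("bread", "butter"),
      Set("milk", "eggs")
    )

    // Emit every unordered item pair per basket and count occurrences functionally.
    val pairCounts: Map[(String, String), Int] = transactions
      .flatMap(t => t.toList.sorted.combinations(2).map { case List(a, b) => (a, b) })
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    // A crude recommendation: items most often bought together with "milk".
    val withMilk = pairCounts.collect {
      case ((a, b), n) if a == "milk" || b == "milk" => (if (a == "milk") b else a, n)
    }
    println(withMilk.toList.sortBy(-_._2))   // e.g. List((bread,2), (eggs,2))
  }
}
```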
In a wireless communication system, the transmitted signal is distorted by various phenomena that are intrinsic to the structure and contents of the wireless channel. Among these, fading and interference have been the dominant sources of distortion and performance degradation. Fading can be defined as the fluctuation in amplitude, phase and multipath delay over very short travel distances or very short time durations. To achieve high data rates in wireless communication, Multiple-Input Multiple-Output (MIMO) and Orthogonal Frequency Division Multiplexing (OFDM) techniques are considered the best, as they have high spectral efficiency. In this study, the performance of Maximum Likelihood (ML) receivers over fading channels in wireless communication is studied, simulated and compared for different modulation structures in the presence of Gaussian channel estimation error at different Doppler shifts. The second part of this study analyses the performance degradation due to channel estimation error, fading noise, and interference between users for several fading channels in multi-user, multiple-antenna systems, using different performance measures.
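As background for this analysis, the standard flat-fading signal model (generic textbook notation, not the chapter's own simulation equations) can be written as follows, with imperfect channel knowledge at the ML receiver captured by a Gaussian estimation error term:

```latex
% Received sample: channel gain h, transmitted symbol x, additive white Gaussian noise n
y = h\,x + n, \qquad n \sim \mathcal{CN}(0, N_0)

% Imperfect channel knowledge at the receiver (Gaussian estimation error)
\hat{h} = h + \epsilon, \qquad \epsilon \sim \mathcal{CN}(0, \sigma_e^{2})

% Maximum Doppler shift for carrier frequency f_c, terminal speed v and speed of light c
f_d = \frac{v}{c}\, f_c
```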
Blockchain has become one of the hottest technologies on the market and has immense potential to reshape current business operations in multiple industries, particularly the banking sector. Its impact comes through the facilitation of fund transactions, securities trading, data analytics and risk management. This study identifies the key characteristics of blockchain that cultivate competitive advantages in the banking sector. We discuss how blockchain's capabilities cultivate a new form of data monitoring, analysis and storage, and we identify nine key challenges to be mindful of before considering a strategic shift of an entire infrastructure onto this new blockchain technology.