Ebook: Global Data Management
In the vision of pervasive communications and computing, information and communication technologies seamlessly and invisibly pervade everyday objects and environments, delivering services adapted to the person and the context of their use. The communication and computing landscape will sense the physical world via a huge variety of sensors and control it via a plethora of actuators. Applications and services will therefore have to be largely based on the notions of context and knowledge.
In such foreseeable technology-rich environments, the role of content providers and content consumers is being reshaped by their immense and unprecedented number, and by the way they generate, preserve, discover, use and abandon information. Pervasive communications call for new architectures based on device autonomy, fragmented connectivity, spatial awareness and data harnessing inside each network node. The realisation of this vision will then depend on the ability to access decentralized data, with demanding performance, scalability and security requirements that cannot be matched by centralized approaches. One of the key research challenges is therefore how to design a distributed data management infrastructure capable of handling very high levels of complexity in the management of distributed, highly heterogeneous data and knowledge sources (such as those found on the Web), the integration of continuously changing data flows, and, in general, the management of multimedia data (e.g. personal, cultural heritage, educational).
Global Data Management is playing a crucial role in the development of our networked distributed society. Its importance has been recognised in the IST programme for several years, in particular in its long-term research part, Future and Emerging Technologies (FET). Many of the papers included in this book refer to IST and FET projects currently running or recently completed. This subject is also one of the focal points identified for long-term FET research in the 7th Framework Programme for Community Research. The basic principles identified in the areas “Pervasive Computing and Communications” and “Managing Diversity in Knowledge” (see http://cordis.europa.eu/ist/fet/), as summarised in this foreword, are very much in line with the goals of this book.

Managing Diversity in Knowledge
An unforeseen growth in the volume and diversity of data, content and knowledge is taking place all over the globe. Several factors contribute to this growing complexity, among them: Size (the sheer increase in the number of knowledge producers and users, and in their production/use capabilities); Pervasiveness (in space and time of knowledge, knowledge producers and users); Dynamicity (new and old knowledge items will appear and disappear virtually at any moment); and Unpredictability (the future dynamics of knowledge are unknown not only at design time but also at run time). The situation is made worse by the fact that the complexity of knowledge grows exponentially with the number of interconnected components.
The traditional approach of knowledge management and engineering is top-down and centralised, and depends on fixing at design time what can be expressed and how. The key idea is to design a “general enough” reference representation model. Examples of this top-down approach are the work on (relational) databases, the work on distributed databases, and, lately, the work on information integration (both with databases and ontologies).
There are many reasons why this approach has been, and still largely is, successful. From a technological point of view it is conceptually simple, and it is also the most natural way to extend the technology developed for relational databases and single information systems. From an organisational point of view, this approach satisfies companies' desire to centralise and, consequently, to be in control of their data. Finally, from a cultural point of view, this approach is very much in line with the way knowledge is thought of in western culture and philosophy, and in particular with the basic principle (rooted in ancient Greek philosophy) that it must be possible to say whether a knowledge statement is (universally) true or false. This property is reassuring and also efficient from an organisational point of view, in that it makes it “easy” to decide what is “right” and what is “wrong”.
However, as applications become increasingly open, complex and distributed, the knowledge they contain can no longer be managed in this way, as the requirements are only partially known at design time. The standard solution so far has been to handle the problems which arise during the lifetime of a knowledge system as part of the maintenance process. This, however, comes at a high price: the increased cost of maintenance (exponentially more complex than the knowledge parts being integrated), the decreased lifetime of systems, and the increased load on users, who must take charge of the complexity that cannot be managed by the system. In several cases this approach has failed simply because people did not come to an agreement on the specifics of the unique global representation.
In pervasive distributed systems, the top-down approach must be combined with a new, bottom-up approach in which the different knowledge parts are designed and kept ‘locally’ and independently, and new knowledge is obtained by adaptation and combination of such items.
The key idea is to make a paradigm shift and to consider diversity as a feature which must be maintained and exploited and not as a defect that must be absorbed in some general schema. People, organisations, communities, populations, cultures build diverse representations of the world for a reason, and this reason lies in the local context, representing a notion of contextual, local knowledge which satisfies, in an optimal way, the (diverse) needs of the knowledge producer and knowledge user.
The bottom-up approach provides a flexible, incremental solution where diverse knowledge parts can be built and used independently, with some degree of complexity arising in their integration.
A second paradigm shift moves from the view where knowledge is mainly assembled by combining basic building blocks to a view where new knowledge is obtained by the design-time or run-time adaptation of existing, independently designed, knowledge parts. Knowledge will no longer be produced ab initio, but more and more as adaptations of other, existing knowledge parts, often performed at run time as a result of a process of evolution. This process will not always be controlled or planned externally, but will often be induced by changes perceived in the environment in which systems are embedded.
The challenge is to develop theories, methods, algorithms and tools for harnessing, controlling and using the emergent properties of large, distributed and heterogeneous collections of knowledge, as well as knowledge parts that are created through combination of others. The ability to manage diversity in knowledge will allow the creation of adaptive and, when necessary, self-adaptive knowledge systems.
The complexity in knowledge is a consequence of the complexity resulting from globalisation and the virtualisation of space and time produced by current computing and networking technology, and of the effects that this has on the organisation and social structure of knowledge producers and users. This includes the following focus issues:
• Local vs. global knowledge. The key issue will be to find the right balance and interplay between operations for deriving local knowledge and operations which construct global knowledge.
• Autonomy vs. coordination, namely how peer knowledge producers and users find the right balance between their desired level of autonomy and the need to achieve coordination with others.
• Change and adaptation, developing organisation models which facilitate the combination and coordination of knowledge and which can effectively adapt to unpredictable dynamics.
• Quality, namely how to maintain good-enough quality, e.g. through self-certifying algorithms, able to demonstrate correct answers (or answers with measurable incorrectness) in the presence of inconsistent, incomplete, or conflicting knowledge components.
• Trust, reputation, and security of knowledge and knowledge communities, for instance as a function of the measured quality; how to guard against deliberate introduction of falsified data.
Europe is very well positioned, given the investments already made in many of these areas. This book represents a further step in the right direction.
Fabrizio Sestini
This text presents solely the opinions of the author, which do not prejudice in any way those of the European Commission.
During the last decade, we have witnessed an astounding revolution in the computing world. The widespread use of internet-enabled applications, together with the advent of community-based interactions, has completely changed our concept of collaborative work. One of the most important steps in this direction is the development of new technologies for data object storage, able to guarantee high degrees of reliability while permitting access from a nomadic environment through heterogeneous devices. In this chapter, we will study the problem of implementing a global data object storage system, exploring the current state of the art for related technologies and surveying the most interesting proposals.
Building efficient internet-scale data management services is the main focus of this chapter. In particular, we aim to show how to leverage DHT technology and extend it with novel algorithms and architectures in order to (i) improve efficiency and reliability for traditional DHT (exact-match) queries, particularly by exploiting the abundance of altruism witnessed in real-life P2P networks, (ii) speed up range queries for data stored on DHTs, and (iii) efficiently and scalably support the publish/subscribe paradigm over DHTs, which crucially depends on algorithms for supporting rich queries on string-attribute data.
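As a minimal illustration of the exact-match (put/get) interface that DHTs expose, the sketch below hashes keys and node identifiers onto the same ring and assigns each key to its clockwise successor. It is a single-process toy under assumed node names, not any of the systems or algorithms presented in the chapter.

```python
import hashlib
from bisect import bisect_right

def ring_position(value: str, bits: int = 32) -> int:
    """Map a string onto a 2**bits identifier ring via (truncated) SHA-1."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** bits)

class ToyDHT:
    """Consistent-hashing sketch: each key is stored on the first node
    whose ring position follows the key's position (its successor)."""

    def __init__(self, node_names):
        self.ring = sorted((ring_position(n), n) for n in node_names)
        self.store = {name: {} for name in node_names}

    def successor(self, key: str) -> str:
        pos = ring_position(key)
        idx = bisect_right([p for p, _ in self.ring], pos) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key: str, value) -> None:
        self.store[self.successor(key)][key] = value

    def get(self, key: str):
        return self.store[self.successor(key)].get(key)

# Exact-match lookups resolve to a single responsible node.
dht = ToyDHT(["node-A", "node-B", "node-C", "node-D"])
dht.put("report.pdf", b"...")
print(dht.successor("report.pdf"), dht.get("report.pdf"))
```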
Data aggregation, in its most basic definition, is the ability to summarize information. It is highly relevant to distributed systems that collect and process information from many sources, such as Internet-scale information systems and peer-to-peer data management systems. Several architectures and techniques have recently been explored (in the context of Internet services, peer-to-peer, or wireless sensor network research), aiming at the design of general-purpose, highly scalable data aggregation services. In this chapter, we describe application scenarios, and review and compare several proposed designs for aggregation services. In the second part of the chapter, we present a complete case study of aggregation services in the context of a DHT-based peer data management architecture. We argue that aggregation services would benefit from a design that is consistent with, and integrated into, a general-purpose data indexing functionality, and present GREG, an architecture based on this assumption.
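To make the idea of a decentralized aggregation service more concrete, here is a hedged sketch of classic pairwise gossip averaging, in which every exchange replaces two local estimates by their mean so that all nodes converge towards the global average; it is a generic textbook scheme, not the GREG architecture described in the chapter.

```python
import random

def gossip_average(values, rounds=500, seed=0):
    """Pairwise averaging gossip: at each step two random nodes replace
    their local estimates with the mean of the two. The global average is
    preserved by every exchange, and all estimates converge towards it."""
    rng = random.Random(seed)
    estimates = list(values)
    for _ in range(rounds):
        i, j = rng.sample(range(len(estimates)), 2)
        mean = (estimates[i] + estimates[j]) / 2.0
        estimates[i] = estimates[j] = mean
    return estimates

readings = [21.0, 19.5, 23.2, 20.1, 18.7, 22.4]
print(sum(readings) / len(readings))      # true global average
print(gossip_average(readings)[:3])       # local estimates after gossiping
```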
During the last decade the publish/subscribe communication paradigm has gained a central role in the design and development of a large class of applications, ranging from stock exchange systems to news tickers, from air traffic control to defense systems. This success is mainly due to the capacity of publish/subscribe to completely decouple communication participants, thus allowing the development of applications that are more tolerant of communication asynchrony. This chapter introduces the publish/subscribe communication paradigm, stressing those characteristics that have the strongest impact on the quality of service provided to participants. The chapter also introduces the reader to two widely recognized industrial standards for publish/subscribe systems: the Java Message Service (JMS) and the Data Distribution Service (DDS).
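The decoupling mentioned above can be seen in miniature in the sketch below: publishers and subscribers never reference each other, only a topic name, and the broker alone dispatches notifications. This is a generic topic-based toy, not the JMS or DDS APIs themselves.

```python
from collections import defaultdict
from typing import Any, Callable

class ToyBroker:
    """Minimal topic-based publish/subscribe broker: publishers and
    subscribers are decoupled in space, sharing only a topic name."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # The publisher does not know how many subscribers exist, if any.
        for callback in self._subscribers[topic]:
            callback(message)

broker = ToyBroker()
broker.subscribe("stock/ACME", lambda m: print("trader got:", m))
broker.subscribe("stock/ACME", lambda m: print("ticker got:", m))
broker.publish("stock/ACME", {"price": 101.5})
```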
Peer-to-peer (P2P) computing offers new opportunities for building highly distributed data systems. Unlike client-server computing, P2P is a very dynamic environment where peers can join and leave the network at any time; it offers important advantages such as operation without central coordination, peer autonomy, and scalability to large numbers of peers. However, providing high-level data management services (schema, queries, replication, availability, etc.) in a P2P system implies revisiting distributed database technology in major ways. In this chapter, we discuss the design and implementation of high-level data management services in APPA (Atlas Peer-to-Peer Architecture). APPA has a network-independent architecture that can be implemented over various structured and super-peer P2P networks. It uses novel solutions for persistent data management with updates, data replication with semantic-based reconciliation, and query processing. APPA's services are implemented using the JXTA framework.
In Mobile Ad-Hoc Networks (MANETs) we are often faced with the problem of sharing information among a (potentially large) set of nodes. The replication of data items among different nodes of a MANET is an efficient technique to increase data availability and improve the latency of data access. However, an efficient replication scheme requires a scalable method to disseminate updates. The robustness and scalability of gossip (or epidemic) protocols make them an efficient tool for message dissemination in large-scale wired and wireless networks. This chapter describes a novel algorithm to replicate and retrieve data items among nodes in a MANET that is based on an epidemic dissemination scheme. Our approach is tailored to the concrete network environment of MANETs and, while embedding several ideas from existing gossip protocols, takes into account the topology, the scarcity of resources, and the limited availability of both the devices and the network links in this kind of network.
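As a rough illustration of why epidemic schemes scale, the following sketch simulates push-based dissemination in which every node that already holds an update forwards it to a few randomly chosen peers per round; full coverage is typically reached in a logarithmic number of rounds. The fanout and network size are arbitrary, and none of the MANET-specific adaptations of the chapter (topology, resource and link constraints) are modelled here.

```python
import random

def epidemic_rounds(num_nodes=100, fanout=3, seed=1):
    """Push-based epidemic dissemination: in every round, each node holding
    the update forwards it to `fanout` peers chosen uniformly at random.
    Returns the number of rounds until every node has received the update."""
    rng = random.Random(seed)
    infected = {0}            # node 0 injects the update
    rounds = 0
    while len(infected) < num_nodes:
        rounds += 1
        for node in list(infected):
            infected.update(rng.sample(range(num_nodes), fanout))
    return rounds

print(epidemic_rounds())      # typically on the order of log(num_nodes) rounds
```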
Wireless Sensor Networks are emerging as one of the most promising research directions, due to the possibility of sensing the physical world at a previously unimaginable granularity. In this chapter, we address some of the major challenges related to the collection and elaboration of data originating from such networks of distributed devices. In particular, we describe the concepts of data storage, data retrieval and data processing. We discuss how such data management techniques will be able to sustain a novel class of data-intensive applications, which use the network as an interface to the physical world. We then identify some threats to the deployment of such networks on a large scale. In particular, we argue that, however appealing, the composition of very large and heterogeneous wireless sensor networks poses enormous engineering challenges, calling for innovative design paradigms. Finally, we discuss an alternative solution, and the related data management mechanisms, for the provisioning of sensor-based services in future pervasive environments.
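One simple form of the in-network data processing discussed above is aggregation along a routing tree: each sensor combines its own reading with the partial results of its children and forwards a single value towards the sink. The sketch below is a hedged, purely local simulation of that idea with an invented tree, invented readings and a maximum aggregate; it does not reproduce the mechanisms proposed in the chapter.

```python
def aggregate_to_sink(children, readings, node):
    """In-network aggregation along a routing tree: each node combines its
    own reading with its children's partial aggregates (here: the maximum)
    before forwarding one value towards the sink."""
    partials = [aggregate_to_sink(children, readings, child)
                for child in children.get(node, [])]
    return max([readings[node]] + partials)

# Hypothetical routing tree rooted at the sink (node 0) and local readings.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
readings = {0: 19.5, 1: 21.0, 2: 18.2, 3: 25.3, 4: 20.4, 5: 22.8}
print(aggregate_to_sink(children, readings, 0))   # 25.3, one message per link
```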
While several peer-to-peer (p2p) schemes have emerged for distributed data storage, there has also been a growing undercurrent towards the invention of design methodologies underlying these peer-to-peer systems. A design methodology is a systematic technique that helps us not only to design and create new p2p systems (e.g., for data storage) in a quick and predictable manner, but also to increase our understanding of existing systems. This chapter brings together in one place previous and existing work by several authors on design methodologies intended to support the design of p2p algorithms, keeping our focus centered around (but not restricted to) data storage systems. As design methodologies grow in number and in power, researchers are increasingly likely to rely on them to design new p2p systems.
Decentralized data management has been addressed over the years by means of several technical solutions, ranging from distributed DBMSs to mediator-based data integration systems. Recently, this issue has been investigated in the context of Peer-to-Peer (P2P) architectures. In this chapter we focus on P2P data integration systems, which are characterized by a number of autonomous peers, each peer being essentially an autonomous information system that holds data and is linked to other peers by means of P2P mappings. P2P data integration does not rely on the notion of a global schema, as in traditional mediator-based data integration. Rather, it computes answers to users' queries, posed to any peer of the system, on the basis of both local data and the P2P mappings, thus overcoming the main drawbacks of centralized mediator-based data integration systems and providing the foundations of effective data management in virtual organizations.
In this chapter we first survey the most significant approaches proposed in the literature for both mediator-based data integration and P2P data management. Then, we focus on advanced schema-based P2P systems for which the aim is semantic integration of data, and analyze the commonly adopted approach of interpreting such systems using a first-order semantics. We show some weaknesses of this approach, and compare it with an alternative approach, based on multi-modal epistemic semantics, which reflects the idea that each peer is conceived as a rational agent that exchanges knowledge/belief with other peers. We consider several central properties of P2P data integration systems: modularity, generality, and decidability. We argue that the approach based on epistemic logic is superior with respect to all the above properties.
Until recently, most data integration techniques involved central components, e.g., global schemas, to enable transparent access to heterogeneous databases. Today, however, with the democratization of tools facilitating knowledge elicitation in machine-processable formats, one cannot rely on global, centralized schemas anymore as knowledge creation and consumption are getting more and more dynamic and decentralized. Peer Data Management Systems (PDMS) provide an answer to this problem by eliminating the central semantic component and considering instead compositions of local, pair-wise mappings to propagate queries from one database to the others.
In the following, we give an overview of various PDMS approaches; all the approaches proposed so far make the implicit assumption that all schema mappings used to reformulate a query are correct. This obviously cannot be taken for granted in typical PDMS settings, where mappings can be created (semi-)automatically by independent parties. Thus, we propose a totally decentralized, efficient message-passing scheme to automatically detect erroneous schema mappings in a PDMS. Our scheme is based on a probabilistic model in which we take advantage of transitive closures of mapping operations to confront local beliefs about the correctness of a mapping with evidence gathered around the network. We show that our scheme can be efficiently embedded in any PDMS and provide an evaluation of our techniques on large sets of automatically generated schemas.
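The intuition behind exploiting transitive closures can be illustrated at the attribute level: composing the mappings along a cycle of peers and checking whether attributes return to themselves yields evidence for or against the mappings on that cycle. The sketch below uses invented schemas and dictionary mappings and omits the probabilistic message-passing machinery of the actual scheme.

```python
def compose(mappings):
    """Compose attribute-level mappings, each given as a dictionary
    (source attribute -> target attribute); unmapped attributes are lost."""
    def composed(attr):
        for mapping in mappings:
            if attr not in mapping:
                return None
            attr = mapping[attr]
        return attr
    return composed

def cycle_consistent(mappings, attributes):
    """A cycle of mappings supports its members when every attribute is
    mapped back onto itself after the round trip; a mismatch indicates
    that at least one mapping on the cycle is erroneous."""
    round_trip = compose(mappings)
    return all(round_trip(a) == a for a in attributes)

# Hypothetical peers P1 -> P2 -> P3 -> P1, with one faulty variant of m31.
m12 = {"author": "creator", "title": "name"}
m23 = {"creator": "writer", "name": "label"}
m31_good = {"writer": "author", "label": "title"}
m31_bad = {"writer": "title", "label": "author"}   # attributes swapped

print(cycle_consistent([m12, m23, m31_good], ["author", "title"]))  # True
print(cycle_consistent([m12, m23, m31_bad], ["author", "title"]))   # False
```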
In p2p-based data management applications, it is unrealistic to rely upon a centralized schema or ontology. The p2p paradigm is more than a new underlying infrastructure. It supports an emergent approach to data management where data is generated and inserted into the network in a decentralized fashion. Thus, each peer or group of peers will have its own schema to store the data. Moreover, the user querying the data will use yet another schema to formulate the request. The vision of emergent schema management is to resolve these heterogeneities automatically in a self-organizing, emergent way by taking advantage of overlaps and mediators scattered over the network. The emerging schema information can be used in various ways, e.g. to drive the construction of an overlay network and to route queries through the network.
In this article, we start by explaining the various challenges. We look at the problem both from the viewpoint of the database community, describing schemas as entity-relationship models, and from the viewpoint of the knowledge representation community, using logic-based formalisms. We then survey existing p2p-based approaches dealing with semantics, schemas, and mediation. After describing our own approach to p2p schema management, we conclude with an outlook on open problems in the field.
It is appealing, yet challenging, to provide a set of geographically separated users with the same computing environment despite differences in underlying hardware or software.
This paper addresses the question of how to provide type interoperability: namely, the ability for types representing the same software module, but possibly defined by different programmers, in different languages and running on different distributed platforms, to be treated as one single type.
We present a pragmatic approach to deal with type interoperability in a dynamic distributed environment. Our approach is based on an optimistic transport protocol for passing objects by value (or by reference) between remote sites and on a set of implicit type interoperability rules. We experiment with the approach on the .NET platform, which we thereby indirectly evaluate.
With the increasing number of applications that base searching on similarity rather than on exact matching, novel index structures are needed to speed up the execution of similarity queries. An important stream of research in this direction uses the metric space as a model of similarity. We explain the principles and survey the most important representatives of such index structures. We put most emphasis on distributed similarity search architectures, which try to solve the difficult problem of the scalability of similarity searching. Actual achievements are demonstrated by practical experiments. Future research directions are outlined in the conclusions.
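Most metric index structures ultimately rest on the triangle inequality: if distances from all objects to a pivot are precomputed, many candidates can be discarded without ever computing their distance to the query. The sketch below shows this single pruning rule with an invented one-dimensional data set; real structures combine many pivots, partitions and distributed nodes.

```python
def range_search(objects, distance, pivot, query, radius):
    """Pivot-based range search: the triangle inequality gives
    |d(q, p) - d(o, p)| <= d(q, o), so any object o with
    |d(q, p) - d(o, p)| > radius cannot be within `radius` of the query
    and is pruned without computing d(q, o)."""
    pivot_dist = {o: distance(o, pivot) for o in objects}   # precomputed at index time
    query_to_pivot = distance(query, pivot)
    candidates = [o for o in objects
                  if abs(query_to_pivot - pivot_dist[o]) <= radius]
    # Only the surviving candidates require an actual distance computation.
    return [o for o in candidates if distance(query, o) <= radius]

# Usage with a trivial 1-D metric (absolute difference); any metric works.
data = [1.0, 2.5, 4.0, 7.5, 9.0, 12.0]
print(range_search(data, lambda a, b: abs(a - b), pivot=0.0, query=3.0, radius=1.5))
```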
Peer-to-peer (P2P) computing is an intriguing paradigm for Web search for several reasons: 1) the computational resources of a huge computer network can facilitate richer mathematical and linguistic models for ranked retrieval, 2) the network provides a collaborative infrastructure where the recommendations of many users and the community behavior can be leveraged for better search result quality, and 3) the decentralized architecture of a P2P search engine is a great alternative to the de facto monopoly of the few large-scale commercial search services, with the potential risk of information bias or even censorship. The challenges of implementing this visionary approach lie in coping with the huge scale and high dynamics of P2P networks. This paper discusses the architectural design space for a scalable P2P Web search engine and presents two specific architectures in more detail. The paper's focus is on query routing and query execution and their performance as the network grows to larger scales.
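A toy version of statistics-based query routing looks as follows: each peer publishes a compact per-term summary of its local index, and a query is forwarded only to the few peers whose summaries promise the best coverage of the query terms, instead of being flooded. The peer names, statistics and the simple scoring function are invented for illustration and do not reproduce the routing strategies evaluated in the paper.

```python
def route_query(query_terms, peer_summaries, k=2):
    """Rank peers by a naive quality score (sum of per-term document counts
    taken from their published summaries) and return the top-k peers to
    which the query should be forwarded."""
    def score(peer):
        summary = peer_summaries[peer]
        return sum(summary.get(term, 0) for term in query_terms)
    return sorted(peer_summaries, key=score, reverse=True)[:k]

# Hypothetical per-peer statistics: term -> number of local documents.
summaries = {
    "peer-1": {"p2p": 120, "search": 40},
    "peer-2": {"semantic": 300, "web": 15},
    "peer-3": {"p2p": 10, "web": 200, "search": 90},
}
print(route_query(["p2p", "search"], summaries))   # ['peer-1', 'peer-3']
```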
Hosting a Web site at a single server creates performance and reliability issues when the request load increases, when availability is at stake, and, in general, when quality-of-service demands rise. A common approach to these problems is to make use of a content delivery network (CDN) that supports distribution and replication of (parts of) a Web site. The nodes of such networks are dispersed across the Internet, allowing clients to be redirected to the nearest copy of a requested document, or to balance access loads among several servers. Also, if documents are replicated, the availability of a site increases. The design space for constructing a CDN is large and involves decisions concerning replica placement and client redirection policies, but also decentralization. We discuss the principles of various types of distributed Web hosting platforms and show where trade-offs need to be made when it comes to supporting robustness, flexibility, and performance.
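One of the redirection decisions mentioned above can be reduced to a few lines: among the replica servers that actually hold the requested document, send the client to the one with the lowest estimated network distance. The server names and latency estimates below are invented, and a production CDN would combine proximity with load and availability information.

```python
def redirect(client, document, replicas, latency):
    """Redirect a client to the replica server with the lowest estimated
    latency among those holding the requested document."""
    holders = [server for server, docs in replicas.items() if document in docs]
    if not holders:
        return None   # a real CDN would fall back to the origin server
    return min(holders, key=lambda server: latency[(client, server)])

replicas = {"edge-eu": {"/index.html", "/logo.png"},
            "edge-us": {"/index.html"},
            "edge-asia": {"/logo.png"}}
latency = {("client-paris", "edge-eu"): 12,
           ("client-paris", "edge-us"): 95,
           ("client-paris", "edge-asia"): 180}
print(redirect("client-paris", "/index.html", replicas, latency))   # edge-eu
```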
The use of peer-to-peer (P2P) networks for distributing content has been widely discussed in the last few years, and the most important properties have been identified: scalability, efficiency and reliability. With CROSSFLUX we propose a P2P system for media streaming that incorporates these properties from the design stage. In addition, reliability is coupled with fairness by rewarding peers that contribute more with a higher number of backup links. This coupling is achieved by using links (1) for content distribution in one direction and (2) as backups in the opposite direction. To maximize throughput and distribute the load among the participating nodes, an adaptive join procedure and reorganization algorithms are used. Our evaluation of CROSSFLUX shows that recovery from node failures is fast and that efficiency is increased with the help of our techniques.