
Ebook: Databases and Information Systems VIII

Databases and information systems are the backbone of modern information technology and are crucial to the IT systems which support all aspects of our everyday life: from government, education and healthcare to business processes and the storage of our personal photos and archives.
This book presents 22 of the best revised papers accepted following stringent peer review for the 11th International Baltic Conference on Databases and Information Systems (Baltic DB&IS 2014), held in Tallinn, Estonia, in June 2014. The conference provided a forum for the exchange of scientific achievements between the research communities of the Baltic countries and the rest of the world in the area of databases and information systems, bringing together researchers, practitioners and Ph.D. students from many countries.
The subject areas covered at the conference focused on big data processing, data warehouses, data integration and services, data and knowledge management, e-government, as well as e-services and e-learning.
The 11th International Baltic Conference on Databases and Information Systems (Baltic DB&IS'2014) took place on 8–11 June 2014 in Tallinn, Estonia. The Baltic DB&IS series of biennial conferences celebrated its 20th anniversary year in 2014 as the first conference was held in Trakai (Lithuania) in 1994. Former Baltic DB&IS conferences have been held in Tallinn (1996, 2002, 2008), Riga (1998, 2004, 2010), and Vilnius (2000, 2006, 2012).
The Baltic DB&IS'2014 conference was organised by the Institute of Cybernetics at Tallinn University of Technology and the Department of Computer Engineering of Tallinn University of Technology, and supported by the European Social Fund through the Estonian National Doctoral School in Information and Communication Technology, the European Union Regional Development Fund through the Centre of Excellence in Computer Science (EXCS), and the Department of State Information Systems of Estonia.
The aim of the Baltic DB&IS series of conferences is to bring together researchers, practitioners and PhD students in the fields of advanced database and IS research. In 2014, the focus of the conference was on new research areas like big data management, linking data and knowledge, and new data usage scenarios.
The International Programme Committee had representatives from 20 countries. Altogether, 61 submissions were received from 13 countries. Each conference paper was assigned for review to at least three referees from different countries. As a result, 36 regular papers were accepted for presentation at the conference. From the presented papers, the 22 best papers were selected for revision and included in this volume.
The original research results presented in the papers mostly belong to novel fields of research such as big data processing, data warehouses, data integration and services, data and knowledge management, e-government, e-services and e-learning.
This volume also includes three invited talks. Prof. Sören Auer and Dr. Christoph Lange discussed research challenges of applying the Linked Data paradigm in various domains. Prof. Bela Stantic addressed problems of efficient management of big data. Dr. Audrone Lupeikiene and Prof. Albertas Caplinskas devoted their talk to quality requirements negotiation in service-oriented systems.
Finally, we would like to thank the authors for their contributions and the invited speakers for sharing their views with us. We are very grateful to members of the Programme Committee and the additional referees for carefully reviewing the submissions and making their recommendations.
We wish to thank the whole organising team and our sponsors. We express our gratitude, in memoriam, to Dr. Helena Kruus for the excellent organisation of the conference. Last, but not least, we thank all the participants, who truly made the conference.
Hele-Mai Haav
Ahto Kalja
Tarmo Robal
September 2014
The Linked Data paradigm has emerged as a powerful enabler for data and knowledge interlinking and exchange using standardised Web technologies. In this article, we discuss our vision of how the Linked Data paradigm can be employed to evolve the intranets of large organisations – be it enterprises, research organisations or governmental and public administrations – into networks of internal data and knowledge. For large enterprises in particular, data integration is still a key challenge, and the Linked Data paradigm seems a promising approach to integrating enterprise data. Like the Web of Data, which now complements the original document-centred Web, data intranets may help to enhance and add flexibility to the intranets and service-oriented architectures that exist in large organisations. Furthermore, using Linked Data gives enterprises access to 50+ billion facts from the growing Linked Open Data (LOD) cloud. As a result, a data intranet can help to bridge the gap between structured data management (in ERP, CRM or SCM systems) and semi-structured or unstructured information in documents, wikis or web portals, and make all of these sources searchable in a coherent way.
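As a minimal illustration of the paradigm the talk builds on (not taken from the article; all URIs, names and the schema below are hypothetical), the following sketch publishes an internal ERP record as RDF, links it to the LOD cloud via owl:sameAs, and queries the resulting intranet graph with SPARQL using the rdflib library.

```python
# Minimal sketch: an internal record as Linked Data, linked to the LOD cloud.
# All URIs and the schema are hypothetical illustrations.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

INTRA = Namespace("http://intranet.example.org/data/")   # hypothetical intranet namespace
SCHEMA = Namespace("http://schema.org/")

g = Graph()
supplier = INTRA["supplier/4711"]                         # e.g. a CRM/ERP supplier record
g.add((supplier, RDF.type, SCHEMA.Organization))
g.add((supplier, RDFS.label, Literal("ACME GmbH")))
# Link the internal record to an external LOD resource (assumed to exist in DBpedia).
g.add((supplier, OWL.sameAs, URIRef("http://dbpedia.org/resource/ACME")))

# A SPARQL query over the intranet graph; in a real data intranet this would be
# federated with other internal sources and the LOD cloud.
query = """
    SELECT ?s ?label
    WHERE { ?s a <http://schema.org/Organization> ; rdfs:label ?label }
"""
for row in g.query(query, initNs={"rdfs": RDFS}):
    print(row.s, row.label)
```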
Every day we witness new forms of data in various formats. Some examples include structured data from the transactions we make, unstructured data such as text communications of different kinds, and varieties of multimedia files and video streams. To ensure efficient processing of this data, often called ‘Big Data’, the use of highly distributed and scalable systems and new data management architectures, e.g. distributed file systems and NoSQL databases, has been widely considered. However, the volume, variety and velocity of Big Data and the demands of data analytics indicate that these tools only partially solve Big Data problems. To make effective use of such data, novel management concepts are required, in particular the ability to efficiently access data within a tolerable time not only from the user's own domain, but also from other domains which the user is authorized to access. This paper elaborates on some issues related to the efficient management of Big Data, discusses current trends, identifies the challenges and suggests possible research directions.
Service-oriented systems engineering (SoSE) now has a two-decade history. Nowadays it is a way of developing and deploying applications as well as whole enterprise systems. Service-oriented requirements engineering (SoRE), as an integral part of SoSE and as a new requirements engineering subdiscipline, faces a number of different kinds of challenges. The early SoRE approaches were derived from the initial phases of traditional software development methodologies, while the later ones are original, taking into account the specific characteristics of service-oriented systems. This paper discusses the specifics of service-oriented systems, describes the paradigm-related SoRE issues and provides an overview of the characteristics of service-oriented enterprise systems. All these specifics and issues entail different methodological approaches to SoRE. Special attention is given to the requirements negotiation activity. The paper presents a view-based approach to deriving balanced service quality requirements from an initial set of stakeholders' needs.
It is a very typical situation in many organizations that the data gathered by their information systems are not used to their full potential. They can be processed only by IT professionals, which limits the scope of their usage. In this paper we propose a novel solution to the problem of how to get answers to ad hoc queries over the collected data in a way convenient to a domain expert. We exploit the concept of Star ontologies in order to develop a new kind of graphical query language that can be used by non-programmers. We explain the functioning of the language on an example from the hospital domain.
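The paper's query language is graphical; purely as an illustration of the general idea of a star-shaped query (a central concept with attached conditions) being translated into a textual query, the sketch below maps such a star onto a SPARQL string. The class and property names and the translation itself are hypothetical and are not the authors' design.

```python
# Illustrative sketch only: mapping a star-shaped query -- a central concept with
# attached property conditions -- to a textual SPARQL query. Names are hypothetical.
def star_to_sparql(center_class, conditions):
    """Build a SPARQL query from a star: center class + {property: filter} edges."""
    lines = ["SELECT ?x WHERE {", f"  ?x a <{center_class}> ."]
    for i, (prop, flt) in enumerate(conditions.items()):
        var = f"?v{i}"
        lines.append(f"  ?x <{prop}> {var} .")
        lines.append(f"  FILTER({var} {flt})")
    lines.append("}")
    return "\n".join(lines)

# A domain expert's query: "patients older than 65 treated in the cardiology ward".
print(star_to_sparql(
    "http://example.org/hospital#Patient",
    {
        "http://example.org/hospital#age": "> 65",
        "http://example.org/hospital#ward": '= "cardiology"',
    },
))
```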
With the appearance of massive databases, ever-changing data sets are becoming increasingly important. The storage of massive data sets is always accompanied by changes not only to the actual data but also to the meta-data that describes its structure. Meta-data is the foremost element for understanding the information in databases when designing data storage and its change over long periods of time, yet there is no higher-level system for comprehending the unstructured data that ends up in document-oriented databases. We present a meta-model that documents, manages, and creates itself using generic or universal meta-modelling constructs. The introduced method allows a flexible document structure that evolves over time and reduces model creation time at the application level. We implement a novel meta-model prototype in which the complexity of model growth is proportional to its expansion scope.
Massively Multiplayer Online Role-Playing Games (MMORPGs) are very sophisticated applications which have grown significantly in popularity since their early days in the mid-90s. Along with growing numbers of users, the requirements on these systems have reached a point where technical problems become a severe risk for commercial success. Within the CloudCraft project we investigate how Cloud-based architectures and data management can help to solve some of the most critical problems regarding scalability and consistency. In this article, we describe an implemented working environment based on the Cassandra DBMS and some of the key findings outlining its advantages and shortcomings for the given application scenario.
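For readers unfamiliar with Cassandra, the following sketch shows the kind of data model and tunable consistency choices such an environment involves. The keyspace, table and consistency levels are assumptions for illustration, not the CloudCraft project's actual schema; a reachable Cassandra node and the cassandra-driver package are presumed.

```python
# Sketch (assumed schema, not CloudCraft's): player state in Cassandra, partitioned
# by player id so reads and writes scale horizontally across nodes.
import uuid
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS game
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS game.player_state (
        player_id uuid,
        region    text,
        x         double,
        y         double,
        hp        int,
        PRIMARY KEY (player_id)
    )
""")

# Tunable consistency is one of the trade-offs for MMORPG workloads: e.g. QUORUM
# for critical state, ONE for frequent position updates.
stmt = SimpleStatement(
    "INSERT INTO game.player_state (player_id, region, x, y, hp) VALUES (%s, %s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(stmt, (uuid.uuid4(), "northshire", 12.5, 48.0, 100))
```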
Data in various formats are everywhere, and the need to be able to analyze them is obvious. Although there are many useful tools for manipulating and analyzing data, it is very difficult to ensure interoperation between the tools in order to use the whole range of their capabilities. Data galaxies is a new concept intended to provide a common space for different kinds of data manipulations and visualizations, where existing tools can be used together in a data flow leading to the desired result. The power of the approach is demonstrated by several use cases. Implementation details are also provided.
Spatial data from different geographic databases can show a high degree of diversity in terms of object modelling, thematic information, completeness, or currentness of data. Thus, between datasets, database objects representing, e.g., the same real-world road can show strong differences. Integrating two or more spatial data sources requires a matching on the instance level to identify and link multiple representations of the same real-world entities. Spatial matching has to analyze complex objects with respect to their geometry and other attributes, as well as the dataset topology, in order to calculate a matching between geographic objects based on similarity.
In this paper, we propose an iterative object matching process called SimMatching which strongly relies on attribute and relational similarity measures. Starting from geometric and thematic attribute similarity, relational similarity increases when neighboring objects have been matched. Additionally, strong constraints specifying allowed or forbidden matchings help to improve runtime and result quality. The main goals of our algorithm are adaptability to different input data and efficiency in handling complex objects while still achieving high-quality results. Scalability to large datasets is supported by using a partitioning framework and parallel processing.
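A toy sketch of the general idea described above, not the SimMatching algorithm itself: attribute similarity seeds the process, and the relational term of a candidate pair grows as its neighbours get matched. The function signature, weighting scheme and threshold are illustrative assumptions.

```python
# Toy iterative matcher: combine attribute similarity with a relational term that
# rewards pairs whose neighbours are already matched to each other.
def iterative_match(objs_a, objs_b, neighbours_a, neighbours_b,
                    attr_sim, threshold=0.7, alpha=0.7, rounds=5):
    """objs_*: ids; neighbours_*: id -> set of ids; attr_sim: (a, b) -> [0, 1]."""
    matches = {}
    for _ in range(rounds):
        changed = False
        for a in objs_a:
            if a in matches:
                continue
            best_b, best_score = None, 0.0
            for b in objs_b:
                if b in matches.values():
                    continue
                # relational similarity: share of a's neighbours already matched
                # to a neighbour of b
                na = neighbours_a.get(a, set())
                rel = (sum(1 for n in na
                           if matches.get(n) in neighbours_b.get(b, set())) / len(na)
                       ) if na else 0.0
                score = alpha * attr_sim(a, b) + (1 - alpha) * rel
                if score > best_score:
                    best_b, best_score = b, score
            if best_b is not None and best_score >= threshold:
                matches[a] = best_b
                changed = True
        if not changed:
            break
    return matches
```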
In this paper we describe an experiment on using message-level schema matching for Web services network construction. The aim of this study is to empirically determine a similarity threshold for schema matching which yields the same Web service message annotation quality as domain experts achieve through manual effort. Since we use message annotations for the construction of Web services networks, determining the proper threshold is essential for constructing a dataset for large-scale Web services network studies in realistic settings.
First, we apply the schema matching system COMA to Web service operations in the SAWSDL-TC1 dataset. Then suitable upper and lower bounds of the threshold are determined by comparing the resulting matches at various thresholds with the matches of manually crafted annotations from the SAWSDL-TC1 dataset. We construct Web services networks at various thresholds within the identified upper and lower bounds, compute a selection of commonly used network metrics for them and align them with other findings in the literature on Web services networks to select the most appropriate threshold value. Finally, we extend this experiment to a bigger real-world dataset of 8000+ operations. The study showed that automatically constructed Web service networks exhibit topological properties comparable to those of manually constructed networks.
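A sketch of the kind of threshold-driven network construction and metric computation described above, under assumed inputs (a pairwise similarity function and a list of operation identifiers); it is not the paper's pipeline, and COMA itself is not invoked here.

```python
# Build a Web services network by linking operations whose message-schema similarity
# exceeds a threshold, then compute a few common topology metrics per threshold.
import itertools
import networkx as nx

def build_network(similarity, operations, threshold):
    """similarity: (op1, op2) -> [0, 1]; operations: iterable of operation ids."""
    g = nx.Graph()
    g.add_nodes_from(operations)
    for a, b in itertools.combinations(operations, 2):
        if similarity(a, b) >= threshold:
            g.add_edge(a, b)
    return g

def topology_metrics(g):
    return {
        "density": nx.density(g),
        "avg_clustering": nx.average_clustering(g),
        "components": nx.number_connected_components(g),
    }

# Scan candidate thresholds between the identified lower and upper bounds and
# inspect how the network topology changes, e.g.:
# for t in (0.5, 0.6, 0.7, 0.8):
#     print(t, topology_metrics(build_network(sim, ops, t)))
```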
Studies of software project failures have identified problems in capturing requirements and in managing complexity and dynamic changes in the environment when traditional software engineering is used, where requirements capture is static and prolonged. This issue is especially important for decision-making in dynamically changing business. The paper offers a modernization of information system development methods used for implementing automated information-, rule-, knowledge- and model-based decision processes. It proposes an information systems development framework that assists these processes through early separation and development of a business logic model and through the implementation of decision-making and knowledge discovery process models and business process analysis using probabilistic models. The advantages of such an approach are the early separation and development of a business logic model and further support for business people in modifying the business logic without involving software developers, minimizing their involvement in the later exploitation stages. Finally, the paper presents experimental results for stochastic decision extraction from the system database using process mining and probabilistic models to ease framework implementation.
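As a minimal illustration of extracting stochastic decisions from recorded process data (not the paper's framework), the sketch below estimates first-order transition probabilities from an event log; the activity names are hypothetical.

```python
# Estimate decision probabilities from an event log with a simple first-order
# Markov model over consecutive activities in each trace.
from collections import Counter, defaultdict

def transition_probabilities(traces):
    """traces: list of activity sequences, e.g. [["receive", "check", "approve"], ...]."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    return {
        act: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for act, followers in counts.items()
    }

log = [
    ["receive", "check", "approve"],
    ["receive", "check", "reject"],
    ["receive", "check", "approve"],
]
print(transition_probabilities(log)["check"])   # approve ~0.67, reject ~0.33
```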
Trading data as a commodity has become increasingly popular. To obtain a better understanding of the emerging area of data marketplaces, we have conducted two surveys to systematically gather and evaluate their characteristics. This paper essentially continues and enhances a survey we conducted in 2012; it describes our findings from a second round carried out in 2013. Our study shows that the market is lively, with numerous exits and changes in its core business. We try to identify trends in this young field and explain them. Most notably, there is a definite trend towards high-quality data.
Skyline query processing and the more general preference queries are becoming a reality in current database systems. Preference queries select those tuples from a database that are optimal with respect to a set of designated preference attributes. In a Skyline query these preferences refer only to minimum and maximum, whereas the more general approach of preference queries allows a more granular specification of user wishes as well as the specification of the relative importance of individual preferences. The incorporation of preferences into practical relational database engines necessitates an efficient and effective selectivity estimation module: a better understanding of preference selectivity is useful for the better design of algorithms and necessary to extend a database query optimizer's cost model to accommodate preference queries. This paper presents a survey on selectivity and cardinality estimation for arbitrary preference queries. It presents current approaches and discusses their advantages and disadvantages, so that one can decide which model should be used in a database engine to estimate optimization costs.
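To make the estimated quantity concrete, the following sketch (not taken from the survey) computes a naive two-dimensional skyline and reports the observed selectivity, i.e. the fraction of tuples that survive the preference; an optimizer has to estimate exactly this fraction before execution.

```python
# Naive skyline operator for a minimisation preference on every dimension.
import random

def dominates(p, q):
    """p dominates q if p is <= in every dimension and strictly < in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(tuples):
    return [p for p in tuples if not any(dominates(q, p) for q in tuples)]

data = [(random.random(), random.random()) for _ in range(1000)]
sky = skyline(data)
print(f"skyline size: {len(sky)}, observed selectivity: {len(sky) / len(data):.4f}")
# For d independent, uniformly distributed dimensions the expected skyline size grows
# roughly like (ln n)^(d-1) / (d-1)!, a classical result such estimation models build on.
```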
User behaviour prediction and analysis have become one of the most important approaches in the research of adaptive user interfaces (UI). Web personalisation, automatic UI adaptation, recommender engines, frameworks that are flexible and responsive to user actions, and user behaviour prediction provide automatic solutions to enhance sophisticated user interfaces and serve as an alternative to conservative approaches in UI development. Nevertheless, the study and analysis of erroneous and improper user behaviour in sophisticated user interfaces has partly been left without the attention it needs. The objective of this paper is to focus on this kind of user behaviour and error analysis. In this paper we show that the rate of user mistakes while using a graphical UI depends on whether advanced UI development techniques such as prototyping and user tests were applied during UI development or were disregarded. Moreover, the results demonstrate that graphical user interfaces without additional support for tablets and mobile devices have a higher rate of user mistakes. For the latter, user tests were performed in order to analyse the importance and scale of the problem of clicking around UI elements. The paper also delivers solutions to the user click misbehaviour problem that significantly decrease the rate of the studied user interaction mistakes.
Providing adaptivity in web-based e-learning systems is nowadays a topical issue. Adaptivity can be organized by grouping users into previously defined learner groups and offering each group the appropriate learning scenario. In this article a detailed description of the group creation algorithm is presented. A learner group module based on the creation algorithm has been built and implemented in the Moodle system. It creates course learning groups and classifies learners into already existing groups. Group creation is done using a feature tree. Learner grouping helps to ensure a fast adaptive reaction of the system according to learner characteristics and learning course features. The system offers each learner group its own course learning scenario appropriate to the created group. The created learner group module can be used both across the whole knowledge area of the adaptive learning system and in individual courses and course categories.
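Purely as an illustration of the grouping behaviour described above (classify a learner into the closest existing group or create a new one), and not the module's actual feature-tree algorithm, consider the following sketch; the features, distance measure and threshold are hypothetical.

```python
# Assign a learner to the most similar existing group, or create a new group when
# no existing group is close enough.
def group_learner(learner, groups, max_distance=2):
    """learner: dict of feature -> value; groups: list of representative feature dicts."""
    def distance(a, b):
        keys = set(a) | set(b)
        return sum(a.get(k) != b.get(k) for k in keys)   # simple mismatch count

    best = min(groups, key=lambda g: distance(learner, g), default=None)
    if best is not None and distance(learner, best) <= max_distance:
        return best                              # join the existing group
    groups.append(dict(learner))                 # otherwise start a new group
    return groups[-1]

groups = [{"pace": "fast", "style": "visual", "level": "advanced"}]
print(group_learner({"pace": "slow", "style": "textual", "level": "beginner"}, groups))
print(len(groups))   # 2: a new group was created for the dissimilar learner
```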
Systematic investigation of e-learning technological tools and resources to support tutors' design of educational content is at the core of ongoing research in Higher Education. Training ODL tutors in chunking and restructuring educational content with added pedagogical value is a key factor in ODL strategic planning. As institutions and enterprises are actively collecting data and storing large databases, data mining is one step in the knowledge discovery process, dealing with the extraction of patterns and their relationships from large amounts of data. The basic characteristics and methodology of the Design for Pedagogy (D4P) tool, a learning-design-oriented tool developed at the Hellenic Open University (HOU), are presented: the aim has been to support HOU tutors in designing learning activities and to provide space for storing educational material and activity structures. The methodology rationale, the background design of the ODL toolkit, and a report on preliminary usability testing are presented with the aim of providing scaffolding and support to educators for embedding Learning Technology tools into their Open and Distance Learning courses.
The aim of the current article is to analyse and identify the satisfaction with e-government services among small and medium-sized enterprises (SMEs), mostly in two countries of the Baltic Sea Region: Estonia and Germany. Survey data are obtained from SMEs in order to determine their usage of and satisfaction with selected e-government services (e.g. seeking useful information from government websites; VAT; submission of data to the statistical office; attending public procurement) that are available in the two countries. For comparison, Sweden is included in the analysis of enterprises' satisfaction with seeking information from government websites. The findings of this study imply that SMEs in urban areas and manufacturing enterprises use e-government services more than SMEs in rural areas and service sectors, but there are differences between countries. There is a moderate correlation between the external pressure and social influence on SMEs and their satisfaction with the e-government services they use.
Contemporary DBMSs already use data partitioning and data-flow analysis for intra-query parallelism. We study the problem of identifying data-partitioning targets. To rank candidates, we propose a simple cost model that relies on plan structure, operator cost and selectivity for a given base table. We evaluate this model in various optimization schemes and observe how it affects degrees of parallelism and query execution latencies across all TPC-H queries: when compared with the existing naïve model, which partitions the largest physical table in the query, our approach identifies significantly better partitioning targets, resulting in a significantly higher degree of resource utilization and intra-query parallelism for most queries while having little impact on the remaining queries in the TPC-H benchmark.
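A sketch of the ranking idea under assumed inputs (the paper's actual cost model may differ): candidates are scored by the estimated work flowing through their plan subtrees rather than by raw table size. The table names and numbers are illustrative, loosely following TPC-H cardinalities.

```python
# Rank candidate base tables for partitioning by operator cost weighted by
# selectivity, instead of naively picking the largest table.
def rank_partitioning_targets(candidates):
    """candidates: list of dicts with 'table', 'rows', 'operator_cost', 'selectivity'."""
    def score(c):
        # estimated work above the scan, i.e. what a partitioned scan would parallelise
        return c["rows"] * c["selectivity"] * c["operator_cost"]
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"table": "lineitem", "rows": 6_000_000, "operator_cost": 1.0, "selectivity": 0.02},
    {"table": "orders",   "rows": 1_500_000, "operator_cost": 4.0, "selectivity": 0.30},
]
for c in rank_partitioning_targets(candidates):
    print(c["table"])
# The naive model would always pick `lineitem` (the largest table); with selectivity
# and operator cost factored in, `orders` ranks higher in this example.
```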
Big data is one of the great challenges for future Internet of Things applications. There is a vision of networking millions of devices into a comprehensive network; however, there are currently no good solutions for doing this, nor are there feasible methods for exploiting the collected data. Collecting data on servers does not offer exploitation scenarios for the huge amounts of data that can be gathered using the billions of sensing devices that may be deployed once the Internet of Things vision becomes a reality. Instead of collecting data on servers, the data could be processed by the devices collecting the data, utilizing the processing results right where they are created and used. Processing fewer data items autonomously in a large number of computing nodes results in much less complexity for a single computing node than a scenario where all data are collected for processing on one capable computing node. However, in a distributed scenario the complexity increases in other areas, such as communication and data validation, as we cannot assume a synchronous system with a fixed architecture and well-defined data paths. The current paper presents an architecture based on the proactive middleware ProWare, together with some application examples, that enables the construction of systems where the computation is distributed among individual computing nodes, thus alleviating the big data problem.
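As a minimal illustration of the distribution principle described above (not the ProWare API), the sketch below keeps raw readings on the sensing nodes and exchanges only compact summaries; all names and values are hypothetical.

```python
# Each sensing node processes its own raw readings and only publishes a small summary,
# so raw data never has to be shipped to a central server.
class SensingNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.readings = []

    def sense(self, value):
        self.readings.append(value)          # raw data stays on the node

    def summary(self):
        # the only thing that crosses the network: a few aggregate values
        return {
            "node": self.node_id,
            "count": len(self.readings),
            "mean": sum(self.readings) / len(self.readings),
            "max": max(self.readings),
        }

nodes = [SensingNode(f"temp-{i}") for i in range(3)]
for i, node in enumerate(nodes):
    for v in (20 + i, 21 + i, 19 + i):
        node.sense(v)

# A coordinator (or a peer node) consumes summaries instead of raw streams.
print([n.summary() for n in nodes])
```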