Ebook: Global Healthgrid: e-Science Meets Biomedical Informatics
HealthGrid 2008 is the sixth conference in this series of open forums for the integration of grid technologies and their applications in the biomedical, medical and biological domains to pave the path to an international research area in healthgrids. The main objective of the HealthGrid conference and the HealthGrid Association is the exchange and discussion of ideas, technologies, solutions and requirements that interest the grid and the life-sciences communities to foster the integration of grids into health. Subjects in this publication reflect the diversity of mature practice: Advancing Virtual Communities, offering a glimpse of the kind of communities that are brought together by means of collaboration grids; Public Health Informatics, exploring the diffusion of grid concepts and technologies in health informatics; Translational Bioinformatics, the contact point between medicine, healthcare and genomics; and Knowledge Management and Decision Support, one direction that is confidently expected to grow as the synergy of grids and ‘evidence-based practice’ in healthcare is exploited.
HealthGrid 2008 (http://chicago2008.healthgrid.org) is the sixth conference in this series of open forums for the integration of grid technologies and their applications in the biomedical, medical and biological domains to pave the path to an international research area in healthgrids. The main objective of the HealthGrid conference and the HealthGrid Association is the exchange and discussion of ideas, technologies, solutions and requirements that interest the grid and the life-sciences communities to foster the integration of grids into health. Grid middleware and grid applications developers, biomedical and health informatics users, and security and policy makers are encouraged to participate in a set of multidisciplinary sessions with a common concern for applications to health. HealthGrid 2008 marks a new level of maturity for this event, migrating for the first time outside Europe to the city of Chicago. HealthGrid's sister organization, HealthGrid.US, has gathered together an impressive array of grid and biomedical informatics experts from both sides of the Atlantic – and beyond – to its conference in Chicago. This is indeed an auspicious occasion: there are similarities and differences between the European and American approaches, from nomenclature – ‘cyberinfrastructure’ in the US and ‘grids’ in Europe – through the variety of healthcare economies to the very style of biomedical research. Each has the potential to benefit the other, and each has the potential to benefit from the other. The conference is an occasion to celebrate differences and to explore points of contact, just as much as it is an occasion to celebrate similarities and to exploit the contrasts.
HealthGrid conferences have been organized on an annual basis. The first conference, held in 2003 in Lyon (http://lyon2003.healthgrid.org), reflected the need to involve all actors – physicians, scientists and technologists – who might play a role in the application of grid technology to health, whether healthcare or bio-medical research. The second conference, held in Clermont-Ferrand in January 2004 (http://clermont2004.healthgrid.org), reported research and work in progress from a large number of projects. The third conference in Oxford (http://oxford2005.healthgrid.org) had a major focus on the results and deployment strategies in healthcare. The fourth conference in Valencia (http://valencia2006.healthgrid.org) aimed at consolidating the collaboration among biologists, healthcare professionals and grid technology experts. The fifth conference focused on five pre-eminent domains viewed as application areas for grids in the biomedical field (molecules, cells, organs, individuals, and populations) and aimed to show potential users that grids had already gone beyond hype to concrete applications that demonstrate the success of the technology. As befits a diverse community and a maturing technology, the themes in 2008 reflect the diversity of mature practice: Advancing Virtual Communities, offering a glimpse of the kind of communities that are brought together by means of collaboration grids; Public Health Informatics, exploring the diffusion of grid concepts and technologies in health informatics; Translational Bioinformatics, the contact point between medicine, healthcare and genomics; and Knowledge Management and Decision Support, one direction that is confidently expected to grow as the synergy of grids and ‘evidence-based practice’ in healthcare is exploited.
Thus, the nineteen papers selected from some 40 submissions of papers, demonstrations and posters have been organized in four sections, complemented by a fifth section of research road maps which relate, mostly indirectly, to keynotes and other events at the conference.
In the first section on Advancing Virtual Communities, Andrew Simpson et al. report on the development of a service-oriented interoperability framework, sif, in the context of a broader project, Generic Infrastructure for Medical Informatics (GIMI), which facilitates secure access to data in a variety of forms throughout a collaboration. Andrew Branson et al. address the information integration problem through the provision of an integrated data model with links to and from ontologies to homogenize biomedical data at different levels in the context of the EC Framework 6 Health-e-Child project; they identify clinical requirements and provide a detailed design approach. Nabil Abdennadher et al. present XWCH, an easy-to-use middleware which they demonstrate can be exploited to gridify applications in domains as diverse as phylogeny inference on the one hand and neuron connectivity on the other, thus providing evidence of the adaptability of their approach. M. Diarena et al. describe HOPE, a secure data integration platform taking its inspiration from a number of existing projects and building on EGEE gLite, the metadata catalogue AMGA and the portal GridSphere. Johan Montagnat et al. detail their work in the NeuroLOG project, in which they have assembled an ambitious middleware by analysing past experience and using existing components; they adopt a user-centred perspective in the domain of neuroscience and report on the project's design study and data integration strategy based on local schemas in the various sites.
The section on Public Health Informatics begins with a paper by Mikko Pitkanen et al. on the reasons for limited grid adoption in healthcare settings. Although security considerations and issues of IT management were significant barriers, the authors were pleasantly surprised by the high level of knowledge of grid technologies among those surveyed and also by the number of tentative experiments currently being undertaken. On the other hand, Silvia Olabarriaga et al. focus on usability, exploring the gap between developers' perceptions and those of users. By taking a retrospective look at various efforts to make results of functional MRI available to scientific users, the authors identify three stages in maturity, from ‘low-hanging fruit’ through ‘trying out’ to ‘end-user ready’. Exposing results from an Ecos-Colciencias project, Alexandra Pomares et al. report from Colombia on an independent approach to data integration in health virtual organizations through a high conceptual level of virtual data objects that can be queried independently of the logical or physical data sources behind them; the approach introduces a number of innovations in ‘query cartography’ and a semantic caching strategy to optimize network performance. Finally in this section, Professor Richard Sinnott et al. address what they describe as ‘arguably the greatest challenge facing the rollout and adoption of grid technologies to meet the changing face of postgenomic clinical research, especially with regard to information governance, ethics and hence security solutions’, namely the data requirements of the clinical domain. They describe how solutions from the Virtual Organizations for Trials and Epidemiological Studies (VOTES) project are being refactored to meet the needs of the clinical domain.
The papers on Translational Bioinformatics start from the observation that the cost of certain bioinformatic problems is growing faster than the resources to address them, either because of the cost of harvesting enough target DNA or because computation is so expensive. The paper by Aparicio et al. introduces the idea of environmental genomic, or metagenomic, studies, in which fragments extracted from a mix of target and environmental DNA are compared to sequences whose function is known as an aid to their classification. The paper describes various steps in the process and suggests optimizations. On the other hand, Mario Cannataro et al. explore the means to optimize protein-to-protein interaction (PPI) prediction. For example, predicting the configuration of a PPI network is one way to study the generation of protein complexes. However, owing to the enormous number of possible configurations of protein interactions, automatic computation tools are essential. The approach taken here is to integrate the outputs of a number of existing predictors; these are individually highly sensitive to input configuration, but it is suggested that integrated results are stable. Mohammad Shahid et al. report on a sequel to the successful WISDOM malaria challenge (see HealthGrid 2006 and 2007 for reports) by porting the challenge (not the solution) to the VIOLA optical grid environment using Unicore; apart from lessons learned on the use of grids in docking problems, the authors also identify an approach to reduce the size of the compound database in order to improve efficiency. Maria Mirto et al. consider the problem of protein folding or, as it is formally known, the tertiary structure problem: primary structure is the sequence and secondary the alpha-helix or beta-sheet form. Folding, or precise 3-dimensional shape beyond secondary structure, is a compromise of electrical forces, geometry, and other constraints, and determines the function of the protein.
The ProGenGrid (Proteomics and Genomics Grid) project has implemented a protein tertiary structure prediction service in a grid environment. The service has been used for predicting the dicarboxylate carrier of Saccharomyces cerevisiae by using the homology modelling approach. A virtual reality environment is then used for 3D visualization. The section concludes with Xuan Liu's paper on BioNessieG, a grid version of an existing biochemical network simulator which was developed at the University of Glasgow. The paper describes the simulator and focuses in particular on how it has been extended to benefit from a wide variety of high performance computational resources across the UK through grid technologies to support larger scale simulations.
The fourth section, on Knowledge Management and Decision Support, kicks off with three papers from the @neurIST consortium, an evidently fruitful multidisciplinary project on the risk of aneurysm rupture. This risk in a given patient is determined by a multiplicity of variables, from the molecular level through cellular processes and tissue remodelling to the pathophysiology of the disease, not ignoring population level epidemiological aspects of the disease. In the first paper, Jimison Iavindrasana et al. discuss the management of patient data, with a focus on considerations of patient privacy and confidentiality, as well as the security features of the clinical information system. Christoph Friedrich et al., in the second paper, present @neuLink, a service-oriented environment used to extract relevant information from structured and unstructured information sources and to link genetics (molecular genetics as well as epidemiological factors) with the disease, thus allowing the interpretation of molecular data within a clinical research environment. Through the integration of multiple, complex data sources – clinical data, individual risk factors provided by @neuLink – the @neuRisk clinical decision support (CDS) system classifies patients with high aneurysm rupture risk and proposes suitable, individualized preventive therapy; this is introduced in the third paper by Dunlop et al. Jesus Luna et al., reporting from Cyprus, take grid computing precepts as normally given and analyse what they would mean in a real healthgrid situation. This includes issues of patient data communication through public networks, storage in nodes out of the hospital's control, and so on. Analysis of the Intensive Care Grid system reveals potential sources of attack and provides a solution that would be in line with legal requirements and security mechanisms.
Peter Sloot et al. take the problem of decision support in the treatment of HIV drug resistance and explore, through the ViroLab project, the complexity of multilayered, non-linear, multiply-connected networks of influences in the decision process. The data cascades from -omics to health record, ascending or descending physical and temporal scales, disciplines, orders of magnitude. The result is an improved drug ranking system which has been successfully used in its prototype form.
The fifth and final section of this volume presents three research road maps. These are in different stages of development, and were commissioned for different reasons and in different contexts. However, collectively, they represent a remarkable set of proposals with many overlaps and contrasts.
The outcome of an ‘Integrated Research Team’ event on Healthgrid: Grid technology and biomedicine convened by the US Army's Telemedicine & Advanced Technology Research Centre (TATRC), the first road map “serves to communicate expectations and requirements to all parties – end users, policymakers, scientists, as well as technology leaders. The roadmap provides a foundation upon which TATRC may organize its strategic priorities and commit to resource allocations.”
caGrid is a middleware system which combines grid computing, service oriented architecture and the model-driven design paradigm to support development of interoperable data and analytical resources and federation of such resources in a Grid environment. The functionality provided by caGrid is an essential and integral component of the cancer Biomedical Informatics Grid (caBIG™) program, established by the National Cancer Institute as a nationwide effort to develop enabling informatics technologies for collaborative, multi-institutional biomedical research with the overarching goal of accelerating translational cancer research.
Starting with the HealthGrid White Paper (2005), the EU funded SHARE road map project has aimed at identifying the most important steps and significant milestones towards wide deployment and adoption of healthgrids in Europe. The project has sought to reconcile likely conflicts between technological developments and regulatory frameworks by bringing together the project's technical road map and conceptual map of ethical and legal issues and socioeconomic prospects. A key tool in this process was a collection of case studies of healthgrid applications.
We report upon the development of sif (for service-oriented interoperability framework), a platform that has been developed to support the secure aggregation of medical data from disparate sources. By taking a data-agnostic approach to data access and transfer, sif provides a generic interface to data sources, which allows the current version to expose data from any relational database and any file system in a secure fashion. Application developers may then access and utilise such data via a simple API. sif is being developed within the GIMI (Generic Infrastructure for Medical Informatics) project; as such, we discuss its various applications within that context.
There has been much research activity in recent times on providing the data infrastructures needed for the provision of personalised healthcare. In particular the requirement of integrating multiple, potentially distributed, heterogeneous data sources in the medical domain for the use of clinicians has set challenging goals for the healthgrid community. The approach advocated in this paper centres on the provision of an Integrated Data Model plus links to/from ontologies to homogenize biomedical (from genomic, through cellular, disease, patient and population-related) data in the context of the EC Framework 6 Health-e-Child project. Clinical requirements are identified, the design approach in constructing the model is detailed and the integrated model described in the context of examples taken from that project. Pointers are given to future work relating the model to medical ontologies and challenges to the use of fully integrated models and ontologies are identified.
XtremWeb-CH (XWCH) is a volunteer computing middleware that makes it easy for scientists and industrials to deploy and execute their parallel and distributed applications on a public-resource computing infrastructure. XWCH supports various high performance applications, including those having large storage and communication requirements.
Two high performance applications were ported and deployed on an XWCH platform. The first one is the Phylip package of programs that is employed for inferring phylogenies (evolutionary trees). It is the most widely distributed phylogeny package and has been used to build the largest number of published trees. Some modules of Phylip are CPU time consuming; their sequential version cannot be applied to a large number of sequences. The second application ported on XWCH is a medical application used to generate temporal dynamic neuronal maps. The application, named NeuroWeb, is used to better understand the connectivity and activity of neurons. NeuroWeb is a data and CPU intensive application.
This paper describes the different components of an XWCH platform and the lessons learned from gridifying Phylip and NeuroWeb. It also details the new features and extensions, which are being added to XWCH in order to support new types of applications.
The paper describes a platform developed for the secure management and analysis of medical data and images in a grid environment. Designed for telemedicine and built upon the EGEE gLite middleware and particularly the metadata catalogue AMGA as well as the GridSphere web portal, the platform provides to healthcare professionals the capacity to upload and query medical information stored over distributed servers. A job submission environment is also available for data analysis. Security features include authentication and authorization by grid certificates, anonymization of medical data and image encryption. The platform is currently deployed on several sites in Europe and Asia and is being customized for applications in the field of telemedicine and medical physics.
The NeuroLOG project designs an ambitious neurosciences middleware, gaining from many existing components and learning from past project experiences. It is targeting a focused application area and adopting a user-centric perspective to meet the neuroscientists' expectations. It aims at fostering the adoption of HealthGrids in a pre-clinical community. This paper details the project's design study and methodology which were proposed to achieve the integration of heterogeneous site data schemas and the definition of a site-centric policy. The NeuroLOG middleware will bridge HealthGrid and local resources to match user desires to control their resources and provide a transitional model towards HealthGrids.
Computational and data grids have been developed to manage large amounts of data produced in scientific collaborations. While the developed technologies could be employed to help the daily work in the public sector and in healthcare, their wide adoption has not yet been seen. In this paper we present a survey which we conducted to find out how the decision makers and system specialists in the public service sector see the role of Grid technologies in their future work. The respondents of the survey work as decision makers and systems specialists in the healthcare and the public services domain in Switzerland.
Grids offer powerful infrastructures and promising concepts for the development and deployment of advanced applications in medical research and healthcare. The construction of HealthGrids in practice, however, is challenging for scientific, technical, and cultural reasons, among them the large gap between the communities that develop the technology and those that use it. Whereas grid developments focus mostly on functionality, usability issues are also very important to enable the potential of grids to be fully exploited by those who could benefit most from them, the end-users. In this paper we present a retrospective of our efforts to develop the Virtual Lab for functional Magnetic Resonance Imaging (fMRI). This project aims at providing end-users with a grid-based system to facilitate research and clinical usage of fMRI data for the study of brain activation. We present the evolution of this project in three phases coined “low hanging fruit”, “trying out” and “end-user ready”, and the lessons learnt in each one. The evolution of the software architecture, which had a large impact on the user front-end, is discussed in more detail. The current architecture facilitates the construction of front-ends that enable users to access the grid infrastructure from a single user-friendly GUI. All (local and grid) resources are accessed directly by the users from a virtual desktop implemented by the Virtual Resource Browser (VBrowser).
This paper presents an approach to data integration in a Health Virtual Organization (HVO). It targets large scale contexts where high distribution and autonomy of sources produce complex integration scenarios. The principle is to provide a high conceptual level composed of Virtual Data Objects (VDO) that can be queried independently of the data sources that are behind them and without requiring technical skills. The approach uses a mediation architecture improved to be viable in large scale contexts through the concept of query cartography and by applying a semantic caching strategy specific to VDOs that reduces network latency.
This research is supported by the project Ecos-Colciencias C06M02
Grid technologies provide an infrastructure through which, amongst other things, data access and integration is facilitated across highly distributed and heterogeneous resources. Different domains have their own requirements on the nature of this data access and integration. The clinical domain offers arguably the greatest challenges facing the roll-out and adoption of Grid technologies to meet the changing face of post-genomic clinical research, especially with regard to information governance, ethics and hence security solutions. This paper outlines a novel system design for secure anonymous data access and linkage that meets the needs of key stakeholders in this space including end user researchers, data providers and owners and ethical oversight bodies amongst others. We identify how existing solutions developed within the Medical Research Council funded Virtual Organisations for Trials and Epidemiological Studies (VOTES) project are being re-factored to meet the needs of these players and to address information governance criteria.
Computational resources and computationally expensive processes are not growing at the same rate. The availability of large amounts of computing resources in Grid infrastructures does not mean that efficiency is not an important issue. It is necessary to analyze the whole process to improve partitioning and submission schemas, especially in the most critical experiments. This is the case of metagenomic analysis, and this text shows the work done in order to optimize a Grid deployment, which has led to a reduction of the response time and the failure rates. Metagenomic studies aim at processing samples of multiple specimens to extract the genes and proteins that belong to the different species. In many cases, the sequencing of the DNA of many microorganisms is hindered by the impossibility of growing significant samples of isolated specimens. Many bacteria cannot survive alone, and require interaction with other organisms. In such cases, the DNA information available belongs to different kinds of organisms. One important stage in metagenomic analysis consists of the extraction of fragments, followed by the comparison and analysis of their function. By comparison to existing chains whose function is well known, fragments can be classified. This process is computationally intensive and requires several iterations of alignment and phylogeny classification steps. Source samples reach several millions of sequences, which could reach up to thousands of nucleotides each. These sequences are compared to a selected part of the “Non-redundant” database which involves only the information from eukaryotic species. From this first analysis, a refining process is performed and alignment analysis is restarted from the results. This process involves several CPU years. The article describes and analyzes the difficulties of fragmenting, automating and checking the above operations in current Grid production environments.
This environment has been tuned on the basis of an experimental study which has tested the most efficient and reliable resources, the optimal job size, and the data transfer and database reindexing overhead. The environment should re-submit faulty jobs, detect endless tasks and ensure that the results are correctly retrieved and the workflow synchronised. The paper will give an outline of the structure of the system, and the preparation steps performed to deal with this experiment.
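The partitioning and re-submission behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `partition` and `run_with_resubmission`, the `submit` callable, the use of `RuntimeError` to signal a faulty job and the retry limit are all hypothetical, and no real grid middleware is called.

```python
from typing import Callable, Dict, List, Sequence


def partition(sequences: Sequence[str], job_size: int) -> List[List[str]]:
    """Split the input sequences into consecutive jobs of at most job_size items."""
    if job_size <= 0:
        raise ValueError("job_size must be positive")
    return [list(sequences[i:i + job_size])
            for i in range(0, len(sequences), job_size)]


def run_with_resubmission(
    jobs: List[List[str]],
    submit: Callable[[List[str]], List[str]],
    max_attempts: int = 3,
) -> Dict[int, List[str]]:
    """Run every job, resubmitting a job that raises, up to max_attempts times.

    `submit` is a hypothetical stand-in for a grid submission call; it is
    assumed to raise RuntimeError when a job fails. Jobs that never succeed
    are left out of the result so the caller can see which ones are missing.
    """
    results: Dict[int, List[str]] = {}
    for index, job in enumerate(jobs):
        for _attempt in range(max_attempts):
            try:
                results[index] = submit(job)
                break  # job succeeded, move on to the next one
            except RuntimeError:
                continue  # faulty job: try again
    return results
```

Keeping the job size as an explicit parameter reflects the point made in the abstract that the optimal size was found experimentally; detecting endless tasks would additionally need a timeout, which this sketch leaves out.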
Proteins interact with one another, and different interactions form a huge number of possible combinations representable as protein to protein interaction (PPI) networks that are mapped into graph structures. The interest in analyzing PPI networks is related to the possibility of predicting PPI properties, starting from a set of known proteins interacting with each other. For example, predicting the configuration of a subset of nodes in a graph (representing a PPI network) allows the study of the generation of protein complexes. Nevertheless, due to the huge number of possible configurations of protein interactions, automatic computation tools are required.
Available prediction tools are able to analyze and predict possible combinations of proteins in a PPI network which have biological meanings. Once obtained, the protein interactions are analyzed with respect to biological meanings representing quality measures. Nevertheless, such tools strictly depend on input configuration and require biological validation. In this paper we propose a new grid-based prediction tool that integrates different prediction results.
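One simple way to integrate the outputs of several predictors is a voting scheme, sketched below in Python. The abstract does not say how the tool combines results, so the voting rule, the function name `integrate_predictions` and the `min_votes` threshold are assumptions made purely for illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set, Tuple

# An undirected interaction: the pair (A, B) is the same as (B, A).
Pair = FrozenSet[str]


def integrate_predictions(
    predictions: Iterable[Iterable[Tuple[str, str]]],
    min_votes: int,
) -> Set[Pair]:
    """Keep the protein pairs predicted by at least min_votes predictors.

    Each element of `predictions` is the list of pairs reported by one
    (hypothetical) predictor.
    """
    votes: Counter = Counter()
    for predicted_pairs in predictions:
        # Deduplicate within one predictor so it casts at most one vote per pair.
        unique_pairs = {frozenset(pair) for pair in predicted_pairs}
        votes.update(unique_pairs)
    return {pair for pair, count in votes.items() if count >= min_votes}
```

With three predictors and `min_votes=2`, a pair reported by only one predictor is dropped, which is the sense in which an integrated result can be less sensitive to the input configuration of any single tool.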
Malaria remains a global health concern, which kills over a million people each year. In this paper we present work extending the approach of the WISDOM initiative by focusing on the problems noticed during the first WISDOM challenge against malaria and testing the newly established, high bandwidth optical Grid environment VIOLA for advanced bioinformatics applications using the UNICORE middleware service. In addition we present an approach to reduce the size of the compound database to improve the efficiency of the screening.
This paper describes a protein tertiary structure prediction service implemented in a Grid Environment. The service has been used for predicting the dicarboxylate carrier (DIC) of Saccharomyces cerevisiae by using the homology modelling approach. The visualization of the predicted model is made possible by using an interactive virtual reality environment based on X3D and Ajax3d technologies.
The simulation of biochemical networks provides insight and understanding about the underlying biochemical processes and pathways used by cells and organisms. BioNessie is a biochemical network simulator which has been developed at the University of Glasgow. This paper describes the simulator and focuses in particular on how it has been extended to benefit from a wide variety of high performance compute resources across the UK through Grid technologies to support larger scale simulations.
We introduce the architecture of @neuLink, a service-oriented environment for biomedical knowledge discovery which has been developed in the course of the EU Integrated Project @neurIST. The application integrates data from databases with information extracted from unstructured text sources. Moreover, @neuLink supports the analysis of primary biomolecular data associated with individual patients and thus enables the interpretation of molecular data inside a clinical research environment. Based on an assembly of data services, @neuLink interacts with the complex @neurIST grid infrastructure through a dedicated data access and data mediation service. Data types integrated by @neuLink cover the entire span of biomolecular entities: from gene names in text to entries in EntrezGene; from mentions of drugs to Drugbank; from information on allelic variants in scientific literature to entries in dbSNP. The architecture of @neuLink allows easy integration of other webservice-based applications and thus the spectrum of analysis capabilities of @neuLink can be extended following the requirements of the users of the @neurIST system.
This paper presents an overview of computerised decision support for clinical practice. The concept of computer-interpretable guidelines is introduced in the context of the @neurIST project, which aims at supporting the research and treatment of asymptomatic unruptured cerebral aneurysms by bringing together heterogeneous data, computing and complex processing services. The architecture is generic enough to adapt it to the treatment of other diseases beyond cerebral aneurysms. The paper reviews the generic requirements of the @neurIST system and presents the innovative work in distributing executable clinical guidelines.
Novel eHealth systems are being designed to provide a citizen-centered health system; however, the ever more demanding need for computing and data resources has required the adoption of Grid technologies. In most cases, this novel Health Grid requires not only conveying patients' personal data through public networks, but also storing it in shared resources outside the hospital premises. These features introduce new security concerns, in particular related to privacy. In this paper we survey current legal and technological approaches that have been taken to protect a patient's personal data in eHealth systems, with a particular focus on Intensive Care Grids. However, through a security analysis of the Intensive Care Grid system (ICGrid), we show that these security mechanisms are not enough to provide a comprehensive solution, mainly because data at rest is still vulnerable to attacks coming from untrusted Storage Elements, where an attacker may access it directly. To cope with these issues, we propose a new privacy-oriented protocol which uses a combination of encryption and fragmentation to improve data assurance while keeping compatibility with current legislation and Health Grid security mechanisms.
The complete cascade from genome, proteome, metabolome, and physiome, to health forms multiscale, multiscience systems and crosses many orders of magnitude in temporal and spatial scales. The interactions between these systems create exquisite multitiered networks, with each component in nonlinear contact with many interaction partners. Understanding, quantifying, and handling this complexity is one of the biggest scientific challenges of our time. In this paper we argue that computer science in general, and Grid computing in particular, provide the language needed to study and understand these systems, and discuss a case study in decision support for HIV drug resistance treatment within the European ViroLab project.