Ebook: From Genes to Personalized HealthCare: Grid Solutions for the Life Sciences
The main focus of this publication is on technologies, solutions and requirements that interest the grid and the life-science communities to foster the integration of grids into health. The proceedings are especially interesting for grid middleware and grid application developers, biomedical and health informatics users, and security and policy makers with a common focus on the application in the health domain. Topics in this publication are: State-of-the-art of the grid research and use at molecule, cell, organ, individual and population levels; and security and imaging. In security, data protection and pseudonymization are being discussed. In imaging, there’s Globus MEDICUS, which federates DICOM devices through a grid architecture and KnowARC on facilitating grid networks for the biomedical research community. Finally, there’s a report on the successful use of multimodal workflows in diabetic retinopathy research.
HealthGrid 2007 (http://geneva2007.healthgrid.org) is the fifth edition of this open forum for the integration of grid technologies and their applications in the biomedical, medical, and biological domains to pave the path towards an international research area in the HealthGrid field. The main objective of the HealthGrid conference and of the HealthGrid Association is the exchange and discussion of ideas, technologies, solutions and requirements that interest the grid and the life-science communities to foster the integration of grids into health. Grid middleware and grid application developers, biomedical and health informatics users, and security and policy makers are encouraged to participate in a set of multidisciplinary sessions with a common focus on the application in the health domain.
HealthGrid conferences have been organized on an annual basis. The first conference, held in 2003 in Lyon (http://lyon2003.healthgrid.org), reflected the need to involve all actors (physicians, scientists and technologists) who might play a role in the application of grid technology to health, whether healthcare or bio-medical research. The second conference, held in Clermont-Ferrand in January 2004 (http://clermont2004.healthgrid.org), reported research and work in progress from a large number of projects. The third conference in Oxford (http://oxford2005.healthgrid.org) had a major focus on the results and deployment strategies in healthcare. The fourth conference in Valencia (http://valencia2006.healthgrid.org) aimed at consolidating the collaboration among biologists, healthcare professionals and grid technology experts. This fifth conference will focus on the five domains defined by the European Commission as application areas for grids in the biomedical field: molecules, cells, organs, individuals, and populations. For each of these five domains, an invited speaker will give a state-of-the-art address followed by concrete projects. The conference venue in a hospital setting should also help to locate healthgrid research where it undoubtedly belongs, in the biomedical field. Potential users need to be shown that grids have now gone beyond hype and can show concrete applications that demonstrate the success of the technology.
The conference includes a number of high-profile keynote presentations complemented by a set of high-quality peer-reviewed papers. The number of contributions to this conference has increased from previous occasions, reaching 55 submissions of papers, demonstrations and posters from principal authors coming from 21 countries (ordered by the number of contributions: Italy, Switzerland, United Kingdom, Germany, USA, France, Spain, Poland, Russia, Taiwan, Australia, Belgium, Brazil, Cuba, Cyprus, Czech Republic, Greece, Hungary, Portugal, Romania, and Ukraine). Considering the affiliations of all the authors of the papers, the number of contributing countries rises to 22 with the addition of Venezuela. These proceedings have been organized in eight chapters. Five chapters focus on the state of the art of grid research and use at molecule, cell, organ, individual and population levels. Two chapters present security and imaging papers. The last chapter includes the best poster contributions.
From among the themes of the conference, it may be thought that molecules present the most amenable target for a healthgrid approach. Although this may be true, it is far from obvious. Even in applications such as BLAST, which treat symbolic ‘words’ in a molecular alphabet, there are a great many problems to be addressed, both when searching for large query strings and in seeking to maintain an appropriate ‘rollback’ distributed database, as Trombetti et al. show. Two papers, by Andreas Quandt et al. and by Zosso et al., apply grids in ‘tandem mass spectrometry’, a highly demanding application for protein identification. It is equally good to see applications to screening in Malaria (Jean Salzemann et al.) and early detection of Alzheimer's (Nabil Abdennadher et al.), while the application to sleep disorders in their full breadth (Sebastian Canisius et al.) demonstrates the considerable range of applications that is possible on a grid platform. Hernandez et al. provide an overview of the substantial progress that has been made in the field in the wake of large platform projects.
By contrast, applications in the cellular and organ domains appear to present as much difficulty in conceiving possible projects as in devising grid deployment schemes. Sinnott et al. identify and are in part motivated by the intrinsic value of microarray data as a major issue, while Roberta Alfieri et al. consider a mathematical model of the cell cycle – two extremes in the exploration of cell issues. Marienne Hibbert et al. describe a molecular medicine model and Emerson and Rossi discuss a simulation of the human immune system. Among organs, Andrés Gómez et al. demonstrate a radiotherapy tool and Blanquer Espert et al. explore the management of DICOM objects on a grid.
The domain of the individual is highly attractive, since it provides an entry point to the whole HealthGrid project and the possible grid health record. Two papers by the SHARE collaboration provide outline ‘road maps’ for the creation of viable healthgrids. Jenny Ure et al. took schizophrenia as a case study for a series of workshops to understand what it might take to create appropriate ontologies for data integration. Michal Kosiedowski et al. discuss a grid-based Electronic Medical Library and Shu-Hui Hung et al. consider the merits of the treatment of asthma in a grid system. Stefan Rüping et al. seek to extend workflow management for knowledge discovery from combined clinical and genomic data towards a fuller electronic health record. Giovanni Aloisio et al. treat the paradigm of Service-Oriented Architecture in bioinformatics.
Thus we arrive at the population domain, where Fabricio Silva et al. report on a grid-based epidemic surveillance system in Brazil for essentially slow evolving diseases. Tiberiu Stef-Praun et al. present the Swift system of workflow description and execution.
Apart from these five domains, two topical areas are considered in depth: security and imaging. In security, in two separate papers Jean Herveg and Luigi Lo Iacono explore familiar questions of data protection and of pseudonymization respectively, providing a theoretical perspective against which some practical proposals by Petr Holub et al. on the Access Grid model and Harald Gjermundrød et al. on the EGEE model can be compared. In imaging, we have papers on Globus MEDICUS (Stephan G. Erberich et al.) which federates DICOM devices through a grid architecture and KnowARC (Henning Müller et al.) on facilitating grid networks for the biomedical research community. Finally, Adina Riposan et al. report on the successful use of multimodal workflows in diabetic retinopathy research.
ACKNOWLEDGEMENTS
The editors express their gratitude to the program committee and the reviewers; each paper was read by at least two reviewers, including the editors. The editors thank the staff of the HealthGrid association for the remarkable work they have invested in these conference proceedings and in the organisation of the conference, particularly Yannick Legré.
Opinions expressed in these proceedings are those of individual authors and editors, and not necessarily those of their institutions.
We present a method to grid-enable tandem mass spectrometry protein identification. The implemented parallelization strategy embeds the open-source x!tandem tool in a grid-enabled workflow. This allows rapid analysis of large-scale mass spectrometry experiments on existing heterogeneous hardware. We have explored different data-splitting schemes, considering both splitting spectra datasets and protein databases, and examine the impact of the different schemes on scoring and computation time. While resulting peptide e-values exhibit fluctuation, we show that these variations are small, caused by statistical rather than numerical instability, and are not specific to the grid environment. The correlation coefficient of results obtained on a standalone machine versus the grid environment is found to be better than 0.933 for spectra and 0.984 for protein identification, demonstrating the validity of our approach. Finally, we examine the effect of different splitting schemes of spectra and protein data on CPU time and overall wall clock time, revealing that judicious splitting of both data sets yields best overall performance.
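The two splitting dimensions mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_evenly` and `plan_jobs`, the chunk counts, and the placeholder identifiers are assumptions made for illustration only. The sketch just shows how splitting both the spectra and the protein database yields a grid of independent search jobs.

```python
from itertools import product


def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def plan_jobs(spectra, proteins, n_spectra_parts, n_protein_parts):
    """Pair every spectra chunk with every protein-database chunk.

    Each pair is one independent search job that a grid node could run.
    """
    spectra_chunks = split_evenly(spectra, n_spectra_parts)
    protein_chunks = split_evenly(proteins, n_protein_parts)
    return list(product(spectra_chunks, protein_chunks))


# Hypothetical identifiers standing in for real spectra and protein entries.
spectra = [f"spectrum_{i}" for i in range(10)]
proteins = [f"protein_{i}" for i in range(7)]
jobs = plan_jobs(spectra, proteins, n_spectra_parts=3, n_protein_parts=2)
print(len(jobs))  # number of independent jobs: 3 spectra chunks x 2 database chunks
```

The sketch leaves out the step of merging per-job results. As the abstract notes, splitting affects scoring, so a real workflow would have to combine and compare scores across chunks.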
Biomarker detection is one of the greatest challenges in Clinical Proteomics. Today, great hopes are placed in tandem mass spectrometry (MS/MS) to discover potential biomarkers. MS/MS is a technique that allows large-scale data analysis, including the identification, characterization, and quantification of molecules. The identification process in particular, which involves comparing experimental spectra with theoretical amino acid sequences stored in specialized databases, has been the subject of extensive research in bioinformatics for many years. Dozens of identification programs have been developed, addressing different aspects of the identification process, but in general clinicians use only a single tool for their data analysis along with a single set of specific parameters. Hence, a significant proportion of the experimental spectra do not lead to a confident identification score due to inappropriate parameters or scoring schemes of the applied analysis software. The swissPIT (Swiss Protein Identification Toolbox) project was initiated to provide the scientific community with an expandable multi-tool platform for automated and in-depth analysis of mass spectrometry data. swissPIT uses multiple identification tools to analyze mass spectra automatically. The tools are chained into analysis workflows. To run these calculation-intensive workflows we use the Swiss Bio Grid infrastructure. A first version of the web-based front-end is available (http://www.swisspit.cscs.ch) and can be freely accessed after requesting an account. The source code of the project will also be made available in the near future.
BLAST is probably the most used application in bioinformatics teams. BLAST complexity tends to be a concern when the query sequence sets and reference databases are large. Here we present BGBlast: an approach for handling the computational complexity of large BLAST executions by porting BLAST to the Grid platform, leveraging the power of the thousands of CPUs which compose the EGEE infrastructure. BGBlast provides innovative features for efficiently managing BLAST databases in the distributed Grid environment. The system (1) keeps the databases constantly up to date while still allowing the user to regress to earlier versions, (2) stores the older versions of databases on the Grid with a time and space efficient delta encoding and (3) manages the number of replicas for each database over the Grid with an adaptive algorithm, dynamically balancing between execution parallelism and storage costs.
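Storing older database versions as deltas while still allowing rollback can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe BGBlast's actual encoding, so the "reverse delta" scheme, the function names `make_delta` and `roll_back`, and the record strings are all hypothetical. The database is treated as a set of unique records.

```python
def make_delta(new_records, old_records):
    """Describe how to turn the newer version back into the older one."""
    new_set, old_set = set(new_records), set(old_records)
    return {
        "drop": sorted(new_set - old_set),     # records added since the old version
        "restore": sorted(old_set - new_set),  # records removed since the old version
    }


def roll_back(new_records, delta):
    """Rebuild the older version from the newer one plus its reverse delta."""
    records = set(new_records) - set(delta["drop"])
    records |= set(delta["restore"])
    return sorted(records)


# Hypothetical database versions; each record is "identifier|sequence".
version_1 = ["P1|MKTAY", "P2|GAVLI", "P3|MSTNP"]
version_2 = ["P1|MKTAY", "P3|MSTNPK", "P4|QERTY"]  # P2 removed, P3 changed, P4 added

delta = make_delta(version_2, version_1)
print(roll_back(version_2, delta) == sorted(version_1))
```

Only the current version and the small delta need to be kept. The delta grows with the number of changed records, not with the size of the whole database.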
In recent years an increasing demand for Grid infrastructures has resulted in several international collaborations. One example is the EELA Project, which has brought together collaborating groups from Latin America and Europe. One year ago we presented this e-infrastructure, used among others by biomedical groups for studies in oncological analysis, neglected diseases, sequence alignments and computational phylogenetics. This paper summarises the advances achieved since then.
Sleep medicine is gaining more and more interest and importance both within medical research and clinical routine. The investigation of sleep and associated disorders requires the overnight acquisition of a huge amount of biosignal data derived from various sensors (polysomnographic recording) as well as subsequent time-consuming manual analysis (polysomnographic analysis). Therefore, the development of automatic analysis systems has become a major focus in sleep research in recent years, resulting in the development of algorithms for the analysis of different biosignals (EEG, ECG, EMG, breathing signals). In this study, an open source algorithm published by Hamilton et al. was used for ECG analysis, whereas the analysis of breathing signals was done using an algorithm published by Clark et al., which also uses variations of the intra-thoracic pressure for the detection of breathing disorders. The electromyogram (EMG) analysis was done with an in-house algorithm, whereas EEG analyses are currently under development, using both frequency analysis modules and pattern recognition procedures. Although all these algorithms have proved to be quite useful, their validity and reliability still need to be verified in future studies. Given that a standard polysomnographic recording collects data from approximately 8 hours of sleep, it is easy to see that processing this amount of data with the described algorithms very often exceeds the computing capacity of current standard computers. Using Grid technology, this limitation can be overcome by splitting biosignal data and distributing it to several analysis computers. Therefore, Grid-based automatic analysis systems may improve the effectiveness of polysomnographic investigations and thereby reduce the costs for health care providers.
After deploying a first data challenge on malaria and a second one on avian flu, in summer 2005 and spring 2006 respectively, we demonstrate here again how efficiently computational grids can be used to produce massive docking data at high throughput. Over more than two and a half months, we achieved at least 140 million dockings, representing an average throughput of almost 80,000 dockings per hour. This was made possible by the availability of thousands of CPUs through different infrastructures worldwide. Through the acquired experience, the WISDOM production environment is evolving to enable an easy and fault-tolerant deployment of biological tools; in this case it is the FlexX commercial docking software which is used to dock the whole ZINC database against 4 different targets.
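As a quick consistency check on the figures quoted in this abstract, the short calculation below relates the total number of dockings to the average throughput. Both inputs are the rounded values from the text ("at least" 140 million and "almost" 80,000), so the result is only an approximate lower bound on the run time. The variable names are illustrative.

```python
dockings = 140_000_000   # "at least 140 million dockings"
rate_per_hour = 80_000   # "almost 80,000 dockings per hour"

hours = dockings / rate_per_hour
days = hours / 24

print(f"{hours:.0f} hours, about {days:.0f} days")
```

This gives roughly 73 days of continuous operation. Since the true total was at least 140 million and the true rate slightly below 80,000 per hour, the actual duration would be somewhat longer, which is in line with the stated period of more than two and a half months.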
This paper describes the parallelization (gridification) of the phylogenetic package PHYLIP on a desktop GRID platform termed XtremWeb-CH.
PHYLIP is a package of programs for inferring phylogenies (evolutionary trees). It is the most widely distributed phylogeny package and has been used to build the largest number of published trees. Some PHYLIP modules are known to be CPU-intensive; their sequential versions cannot be applied to a large number of sequences.
XtremWeb-CH (XWCH) is a software system that makes it easier for scientists and industrial users to deploy and execute their parallel and distributed applications on a public-resource computing infrastructure. Universities, research centres and private companies can create their own XWCH platform, while anonymous PC owners can participate in these platforms. PC owners can specify how and when their resources may be used. The objective of XWCH is to develop a real High Performance Peer-To-Peer platform with a distributed scheduling and communication system. The main idea is to build a completely symmetric model where nodes can be providers and consumers at the same time.
In this paper we describe the porting, deployment, and execution of some PHYLIP modules on the XWCH platform. The parallelized version of PHYLIP is used to generate evolutionary trees related to HIV.
Microarray experiments are one of the key ways in which gene activity can be identified and measured, thereby shedding light on, for example, biological processes. The BBSRC funded Grid enabled Microarray Expression Profile Search (GEMEPS) project has developed an infrastructure which allows post-genomic life science researchers to ask and answer the following questions: who has undertaken microarray experiments that are in some way similar or relevant to mine; and how similar were these relevant experiments? Given that microarray experiments are expensive to undertake and may possess crucial information for future exploitation (both academically and commercially), scientists are wary of allowing unrestricted access to their data by the wider community until fully exploited locally. A key requirement is thus to have fine-grained security that is easy to establish and simple (or ideally transparent) to use across inter-institutional virtual organisations. In this paper we present an enhanced security-oriented data Grid infrastructure that supports the definition of these kinds of queries and the analysis and comparison of microarray experiment results.
In 2005, a major collaboration in Melbourne Australia successfully completed implementing a major medical informatics infrastructure – this is now being used for discovery research and has won significant expansion funding for 2006 - 2009. The convergence of life sciences, healthcare, and information technology is now driving research into the fundamentals of disease causation. Key to enabling this is collating data in sufficient numbers of patients to ensure studies are adequately powered. The Molecular Medicine Informatics Model (MMIM) is a ‘virtual’ research repository of clinical, laboratory and genetic data sets. Integrated data, physically located within independent hospital and research organisations can be searched and queried seamlessly via a federated data integrator. Researchers must gain authorisation to access data, and inform/obtain permission from the data owners, before the data can be accessed. The legal and ethical issues surrounding the use of this health data have been addressed so data complies with privacy requirements. The MMIM platform has also solved the issue of record linking individual cases and integrating data sources across multiple institutions and multiple clinical specialties. Significant research outcomes already enabled by the MMIM research platform include epilepsy seizure analyses for responders / non responders to therapy; sensitivity of faecal occult blood testing for asymptomatic colorectal cancer and advanced adenomas over a 25-year experience in colorectal cancer screening; subsite-specific colorectal cancer in diabetic and non diabetic patients; and the influence of language spoken on colorectal cancer diagnosis, management and outcomes. Ultimately the infrastructure of MMIM enables discovery research to be accessible via the Web with security, intellectual property and privacy addressed.
ImmunoGrid is a 3 year project funded by the European Union which began in February 2006 and establishes an infrastructure for the simulation of the immune system that integrates processes at molecular, cellular and organ levels. It is designed for applications that support clinical outcomes such as the design of vaccines, immunotherapies and optimization of immunization protocols. The first phase of the project concentrated on improving and extending current models of the immune system. We are now entering the second phase which will design and implement a human immune system simulator. Since the new models are orders of magnitude more complex than the previous ones, grid technologies will be essential in providing the necessary computer infrastructure. The final phase of the project will validate the simulator with pre-clinical trials using mouse models.
The cell cycle is one of the most intensively investigated biological processes of recent years, owing to its importance in cancer studies and drug discovery. The complexity of this biological process is revealed every time a mathematical simulation of it is carried out. We propose an automated approach that mathematically simulates the cell cycle process with the aim of finding the best estimate of the model. We have implemented a system that, starting from a cell cycle model, retrieves from a dedicated database, the Cell Cycle Database, the mathematical information needed to perform simulations using a grid approach and to identify the best model for a specific dataset of experimental results from the real biological system. Our system allows the visualization of mathematical expressions, such as the kinetic rate law of a reaction, and the direct simulation of the models, so that the user can interact with the simulation system. The parameter estimation process usually involves time-consuming computations due to linear regression algorithms and stochastic methods. In particular, in the case of a stochastic approach based on evolutionary algorithms, the iterative selection process involves many different computations. A large number of ODE system simulations are therefore required, and the grid infrastructure allows them to be distributed in order to obtain the model that best fits the experimental data. The computation of many ODE systems can be distributed over different grid nodes so that the execution time for the estimation of the best model is reduced. This system will be useful for the comparison of models with different initial conditions related to normal and deregulated cell cycles.
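The reason evolutionary parameter estimation needs so many ODE simulations can be seen in a minimal sketch. It is a toy stand-in, not the system described in the abstract: the single-equation model dx/dt = -k x, the explicit Euler integrator, the elitist selection scheme, and all names and settings (`simulate`, `error`, `estimate`, population size, mutation width) are assumptions chosen for brevity.

```python
import random


def simulate(k, x0=1.0, dt=0.01, steps=200):
    """Integrate dx/dt = -k * x with the explicit Euler method."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        x += dt * (-k * x)
        trajectory.append(x)
    return trajectory


def error(k, observed):
    """Sum of squared differences between a simulation and the observations."""
    return sum((s - o) ** 2 for s, o in zip(simulate(k), observed))


def estimate(observed, generations=40, population=20, keep=5, seed=0):
    """Elitist evolutionary search for the rate constant k."""
    rng = random.Random(seed)
    candidates = [rng.uniform(0.0, 5.0) for _ in range(population)]
    for _ in range(generations):
        # Each error() call is one independent ODE simulation; on a grid
        # these evaluations could run on different nodes.
        parents = sorted(candidates, key=lambda k: error(k, observed))[:keep]
        children = [
            max(0.0, rng.choice(parents) + rng.gauss(0.0, 0.1))
            for _ in range(population - keep)
        ]
        candidates = parents + children
    return min(candidates, key=lambda k: error(k, observed))


true_k = 1.5
observed = simulate(true_k)  # synthetic stand-in for experimental data
best_k = estimate(observed)
print(round(best_k, 2))
```

With these toy settings the search already runs several hundred simulations for a single parameter. A realistic cell cycle model has many coupled equations and parameters, which is where distributing the independent simulations across grid nodes pays off.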
The eIMRT project is producing new remote computational tools to help radiotherapists plan and deliver treatments. The first available tool will be the IMRT treatment verification using Monte Carlo, which is a computationally expensive problem that can be executed remotely on a GRID. In this paper, the current implementation of this process using GRID and SOA technologies is presented, describing the remote execution environment and the client.
Today most European healthcare centers use the digital format for their databases of images. TRENCADIS is a software architecture comprising a set of services as a solution for interconnecting, managing and sharing selected parts of medical DICOM data for the development of training and decision support tools. The organization of the distributed information in virtual repositories is based on semantic criteria. Different groups of researchers could organize themselves to propose a Virtual Organization (VO). These VOs will be interested in specific target areas, and will share information concerning each area. Although the private part of the information to be shared will be removed, special considerations will be taken into account to prevent access by unauthorized users. This paper describes the security model implemented as part of TRENCADIS. The paper is organized as follows. It first introduces the problem and presents our motivations. Section 1 defines the objectives. Section 2 presents an overview of the existing proposals per objective. Section 3 outlines the overall architecture. Section 4 describes how TRENCADIS is architected to realize the security goals discussed in the previous sections. The different security services and components of the infrastructure are briefly explained, as well as the exposed interfaces. Finally, Section 5 concludes and gives some remarks on our future work.
Secure, flexible and efficient storage of and access to digital medical data is one of the key elements for delivering successful telemedical systems. To this end, grid technologies designed and developed over recent years, and the grid infrastructures deployed with them, seem to provide an excellent opportunity for the creation of a powerful environment capable of delivering tools and services for medical data storage, access and processing. In this paper we present the early results of our work towards establishing a Medical Digital Library supported by grid technologies and discuss future directions of its development. This work is part of the “Telemedycyna Wielkopolska” project, which aims to develop a telemedical system for the support of regional healthcare.
The primary goal of the Care for Asthma via Mobile Phone (CAMP) service is to provide an effective method by which Taiwan's asthma patients can easily monitor their asthma symptoms using a common mobile phone. With the CAMP service, the patient uses his own cellular phone to submit his daily peak expiratory flow rate (PEFR) and answer a simple questionnaire regarding daily activities. The CAMP service participant then receives an asthma symptom assessment and care suggestion message immediately after inputting his information. This assessment, which is in accordance with the World Health Organization's (WHO) Global Initiative for Asthma (GINA) standard, includes weather conditions that might adversely affect the asthma patient (e.g. temperature, pollen count, etc.). This information is, in turn, used to advise the asthma patient how to avoid a severe asthmatic attack.
The paper documents a series of data integration workshops held in 2006 at the UK National e-Science Centre, summarizing a range of the problem/solution scenarios in multi-site and multi-scale data integration with six HealthGrid projects using schizophrenia as a domain-specific test case. It outlines emerging strategies, recommendations and objectives for collaboration on shared ontology-building and harmonization of data for multi-site trials in this domain.
This paper proposes a 10-year roadmap towards the goal of offering healthcare professionals an environment, created through the sharing of resources, in which heterogeneous and dispersed health data as well as applications can be accessed by all users as a tailored information-providing system, according to their authorisation and without loss of information. The paper identifies milestones and presents short-term objectives on the road to this healthgrid.
We present the ‘HealthGrid’ initiative and review work carried out in various European projects. Since the European Commission's Information Society Technologies programme funded the first grid-based health and medical projects, the HealthGrid movement has flourished in Europe. Many projects have now been completed and ‘Healthgrid’ consulted a number of experts to compile and publish a ‘White Paper’ which establishes the foundations, potential scope and prospects of an approach to health informatics based on a grid infrastructure. The White Paper demonstrates the ways in which the healthgrid approach supports many modern trends in medicine and healthcare, such as evidence-based practice, integration across levels, from molecules and cells, through tissues and organs to the whole person and community, and the promise of individualized health care. A second generation of projects have now been funded, and the EC has commissioned a study to define a research roadmap for a ‘healthgrid for Europe’, seen as the preferred infrastructure for medical and health care projects in the European Research Area.
This paper describes the evolution of the main services of the ProGenGrid (Proteomics & Genomics Grid) system, a distributed and ubiquitous grid environment (“virtual laboratory”), based on Workflow and supporting the design, execution and monitoring of “in silico” experiments in bioinformatics.
ProGenGrid is a Grid-based Problem Solving Environment that allows the composition of data sources and bioinformatics programs wrapped as Web Services (WS). The use of WS provides ease of use and fosters re-use. The resulting workflow of WS is then scheduled on the Grid, leveraging Grid-middleware services. In particular, ProGenGrid offers a modular bag of services and currently is focused on the biological simulation of two important bioinformatics problems: prediction of the secondary structure of proteins, and sequence alignment of proteins. Both services are based on an enhanced data access service.
Recent advances in research methods and technologies have resulted in an explosion of information and knowledge about cancers and their treatment. Knowledge Discovery (KD) is a key technique for dealing with this massive amount of data and the challenges of managing the steadily growing amount of available knowledge. In this paper, we present the ACGT integrated project, which is to contribute to the resolution of these problems by developing semantic grid services in support of multi-centric, post-genomic clinical trials. In particular, we describe the challenges of KD in clinico-genomic data in a collaborative Grid framework, and present our approach to overcome these difficulties by improving workflow management, construction and managing workflow results and provenance information. Our approach combines several techniques into a framework that is suitable to address the problems of interactivity and multiple dependencies between workflows, services, and data.