Ebook: German Medical Data Sciences: Bringing Data to Life
The COVID-19 pandemic has brought into sharp focus the need for the collection of data. Such data cannot be collected or provided without medical informatics, documentation and health data management. Nor can health data be evaluated and converted into a useful tool for making the important decisions that affect us all without biometrics and epidemiology, bioinformatics and systems biology.
This book presents full papers from GMDS & CEN-IBS 2020, the first joint online conference of the German Association of Medical Informatics, Biometry and Epidemiology (GMDS) and the Central European Network & the International Biometric Society (CEN-IBS), held online between 6 and 11 September 2020. The title of the conference was Bringing Data to Life, a reference to the increasing amount of data in medical research which is inextricably related to the fast-developing digitalization of the health system. Many challenges must be addressed in order to make use of and benefit from these increasing sources of data, and these can only be faced if all disciplines related to data science work together. The conference aimed to bring together the diverse disciplines within data science, including medical informatics, bioinformatics, biostatistics, epidemiology, public health and medical documentation. Topics covered in the book include central themes relevant to society in general and advances in data technology which support innovations in medical research in particular.
The book brings together many topics related to the provision and analysis of data in medicine, and will be of interest to all those working in the field.
On behalf of all organizers, I’m pleased to present to you the full paper proceedings of the first joint online conference of the GMDS & CEN-IBS 2020. Our theme, Bringing Data to Life, takes account of the fact that, due to the fast developing digitalization of the health system, those of us who work in medical research are faced with an increasing amount of data. Of course, many challenges must be addressed if we are to make use of these data and benefit from this increase in sources of evidence, and these can only be faced if all data-science disciplines work together. The aim of our conference was to bring together these different disciplines within the data sciences, including medical informatics, bioinformatics, biostatistics, epidemiology, public health and medical documentation.
The key topics of the conference covered central themes, relevant to society in general and to supporting innovations in medical research in particular.
Prof. Dr. Geraldine Rauch
Institute of Biometry and Clinical Epidemiology, Charité, Berlin, Germany
The detection of cardiac arrhythmias has a long history in medicine, with current developments focusing on early detection using mobile devices. In basic research, however, the use cases and data differ greatly from the experimental setup. We developed a Python-based system to ease detection and analysis of arrhythmic sections in signals measured on extracted and stimulated cardiac myocytes. Multiple algorithms were integrated into the system, tested and evaluated. The best algorithm resulted in an F1-score of 0.97 and was primarily provided in the application.
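For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The following is a minimal sketch in Python with made-up labels (1 = arrhythmic section, 0 = normal section); the function name and the data are illustrative assumptions and are not taken from the authors' system.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six signal sections
expert = [1, 1, 1, 0, 0, 1]
detector = [1, 1, 0, 0, 1, 1]
score = f1_score(expert, detector)
```

Here three sections are correctly flagged, one is missed and one is a false alarm, so precision and recall are both 0.75 and the F1-score is 0.75.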
This manuscript investigates sample sizes for interim analyses in group sequential designs. Traditional group sequential designs (GSD) rely on “information fraction” arguments to define the interim sample sizes. Then, interim maximum likelihood estimators (MLEs) are used to decide whether to stop early or continue the data collection until the next interim analysis. The possibility of early stopping changes the distribution of interim and final MLEs: possible interim decisions on trial stopping exclude some sample space elements. At each interim analysis the distribution of an interim MLE is a mixture of truncated and untruncated distributions. The distributional form of an MLE becomes more and more complicated with each additional interim analysis. Test statistics that are asymptotically normal without a possibility of early stopping become mixtures of truncated normal distributions under local alternatives. Stage-specific information ratios are equivalent to sample size ratios for independent and identically distributed data. This equivalence is used to justify interim sample sizes in GSDs. Because stage-specific information ratios derived from normally distributed data differ from those derived from non-normally distributed data, the former equivalence is invalid when there is a possibility of early stopping. Tarima and Flournoy have proposed a new GSD where interim sample sizes are determined by a pre-defined sequence of ordered alternative hypotheses, and the calculation of information fractions is not needed. This innovation allows researchers to prescribe interim analyses based on desired power properties. This work compares interim power properties of a classical one-sided three-stage Pocock design with a one-sided three-stage design driven by three ordered alternatives.
The objectives of this paper are to analyze the terminologies SNOMED CT and Logical Observation Identifiers Names and Codes (LOINC) and to provide a guideline for the translation of LOINC concepts to SNOMED CT. Verified research data sets were used for this study, so this experiment is replicable with other research data. 50 LOINC concepts of frequently performed laboratory services were translated to SNOMED CT. Information would be lost with pre-coordinated mapping but the compositional grammar of SNOMED CT allows for the linking of individual concepts into complicated postcoordinated expressions including all embedded information in LOINC concepts. All information can thus be transferred smoothly to SNOMED CT.
Reading is an important ability, especially for patients during their medical treatment. It is needed, for instance, to complete administrative forms and patient-reported outcome questionnaires in clinical routine. Unfortunately, not every patient is able to read, whether because of illiteracy, low vision or simply speaking another language. Such patients need a helper to support them with these reading tasks. Providing patients with the possibility to read and understand texts without additional help is an important factor in improving their self-empowerment. Digital voice pens can be programmed to play prerecorded audio files when tapped on predefined areas of interactive paper. They can serve as a tool that reads texts aloud to impaired patients in multiple languages. In this work, we wanted to evaluate the possibilities of these digital voice pens. A feasibility study was conducted using the commercially available tiptoi digital voice pen by Ravensburger AG and the tttool application by Joachim Breitner for programming the pen. Focusing on the use case of questionnaires, a schematic questionnaire was implemented which enforced the usage of a digital voice pen. To simulate foreign languages or illiteracy, questions and answers of the document were represented by placeholders and the digital voice pen was required to read aloud the question texts. The correctness of the given answers was documented and the usability of the digital voice pen was measured by the System Usability Scale. The evaluation was performed by 15 volunteers (8 male/7 female) between 24 and 35 years old. The usability and acceptance of digital voice pens were rated as “Good” in our constructed setting.
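As background to the usability measurement mentioned above, the System Usability Scale is scored from ten items answered on a 1 to 5 scale: odd-numbered items contribute (answer - 1), even-numbered items contribute (5 - answer), and the sum is multiplied by 2.5 to give a value between 0 and 100. The sketch below implements this standard scoring rule; the function name and the example answers are illustrative and do not come from the study.

```python
def sus_score(responses):
    """Compute the System Usability Scale score (0-100) from ten answers on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item responses")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        if item_number % 2 == 1:
            total += answer - 1   # positively worded (odd) items
        else:
            total += 5 - answer   # negatively worded (even) items
    return total * 2.5


# Hypothetical answers of a single participant
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
```

Each item in the hypothetical answer sheet contributes 3 points, giving 30 x 2.5 = 75.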
The Operational Data Model (ODM) is a data standard for interchanging clinical trial data. ODM contains the metadata definition of a study, i.e., case report forms, as well as the clinical data, i.e., the answers of the participants. The portal of medical data models is an infrastructure for creation, exchange, and analysis of medical metadata models. There, over 23000 metadata definitions can be downloaded in ODM format. Due to data protection law and privacy issues, clinical data is not contained in these files. Access to exemplary clinical test data in the desired metadata definition is necessary in order to evaluate systems claiming to support ODM or to evaluate if a planned statistical analysis can be performed with the defined data types. In this work, we present a web application, which generates syntactically correct clinical data in ODM format based on an uploaded ODM metadata definition. Data types and range constraints are taken into account. Data for up to one million participants can be generated in a reasonable amount of time. Thus, in combination with the portal of medical data models, a large number of ODM files including metadata definition and clinical data can be provided for testing of any ODM supporting system. The current version of the application can be tested at https://cdgen.uni-muenster.de and source code is available, under MIT license, at https://imigitlab.uni-muenster.de/published/odm-clinical-data-generator.
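The idea of deriving clinical test data from a metadata definition can be sketched as follows. This is a minimal illustration, not the code of the web application: the metadata snippet is heavily simplified (no XML namespace, several attributes that real ODM files carry are omitted), only integer and boolean items with GE/LE range checks are handled, and all OIDs and function names are hypothetical.

```python
import random
import xml.etree.ElementTree as ET

# Simplified, namespace-free metadata used only for illustration.
METADATA = """<ODM>
  <Study OID="S.1">
    <MetaDataVersion OID="MDV.1">
      <StudyEventDef OID="SE.1"><FormRef FormOID="F.1"/></StudyEventDef>
      <FormDef OID="F.1"><ItemGroupRef ItemGroupOID="IG.1"/></FormDef>
      <ItemGroupDef OID="IG.1">
        <ItemRef ItemOID="I.AGE"/>
        <ItemRef ItemOID="I.SMOKER"/>
      </ItemGroupDef>
      <ItemDef OID="I.AGE" Name="Age" DataType="integer">
        <RangeCheck Comparator="GE"><CheckValue>18</CheckValue></RangeCheck>
        <RangeCheck Comparator="LE"><CheckValue>90</CheckValue></RangeCheck>
      </ItemDef>
      <ItemDef OID="I.SMOKER" Name="Smoker" DataType="boolean"/>
    </MetaDataVersion>
  </Study>
</ODM>"""


def random_value(item_def, rng):
    """Return a random value (as string) respecting DataType and GE/LE range checks."""
    data_type = item_def.get("DataType")
    if data_type == "integer":
        low, high = 0, 100  # assumed defaults when no RangeCheck is present
        for check in item_def.findall("RangeCheck"):
            bound = int(check.findtext("CheckValue"))
            if check.get("Comparator") == "GE":
                low = bound
            elif check.get("Comparator") == "LE":
                high = bound
        return str(rng.randint(low, high))
    if data_type == "boolean":
        return rng.choice(["true", "false"])
    return "example text"  # fallback for data types not covered by this sketch


def generate_clinical_data(metadata_xml, n_subjects, seed=0):
    """Build a ClinicalData element with random ItemData for n_subjects subjects."""
    rng = random.Random(seed)
    root = ET.fromstring(metadata_xml)
    study = root.find("Study")
    mdv = study.find("MetaDataVersion")
    forms = {f.get("OID"): f for f in mdv.findall("FormDef")}
    groups = {g.get("OID"): g for g in mdv.findall("ItemGroupDef")}
    items = {i.get("OID"): i for i in mdv.findall("ItemDef")}

    clinical = ET.Element("ClinicalData", StudyOID=study.get("OID"),
                          MetaDataVersionOID=mdv.get("OID"))
    for number in range(1, n_subjects + 1):
        subject = ET.SubElement(clinical, "SubjectData", SubjectKey=str(number))
        for event in mdv.findall("StudyEventDef"):
            event_data = ET.SubElement(subject, "StudyEventData",
                                       StudyEventOID=event.get("OID"))
            for form_ref in event.findall("FormRef"):
                form_oid = form_ref.get("FormOID")
                form_data = ET.SubElement(event_data, "FormData", FormOID=form_oid)
                for group_ref in forms[form_oid].findall("ItemGroupRef"):
                    group_oid = group_ref.get("ItemGroupOID")
                    group_data = ET.SubElement(form_data, "ItemGroupData",
                                               ItemGroupOID=group_oid)
                    for item_ref in groups[group_oid].findall("ItemRef"):
                        item_oid = item_ref.get("ItemOID")
                        ET.SubElement(group_data, "ItemData", ItemOID=item_oid,
                                      Value=random_value(items[item_oid], rng))
    return clinical
```

Calling ET.tostring(generate_clinical_data(METADATA, 2), encoding="unicode") returns the generated fragment as text. A fixed seed keeps the output reproducible, which is convenient when the data is used for system tests.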
Rare lung diseases affect 1.5–3 million people in Europe and are associated with poor prognosis or early death. The European Reference Network for Respiratory Diseases (ERN-Lung) is a patient-centric network funded by the European Union (EU). The aim of ERN-Lung is to improve healthcare and research regarding rare respiratory diseases. An initial requirement for cross-border healthcare and research is the use of registries and databases. A typical problem in registries for rare diseases is data exchange, since the registries use different kinds of data with different types or descriptions. Therefore, ERN-Lung decided to create a new Registry Data-Warehouse (RDW) in which different existing registries are connected to enable cross-border healthcare within ERN-Lung. This work presents the aims, conception and implementation of the RDW, following a semantic interoperability approach. We created a common dataset (CDS) to provide a common description of patients with respiratory diseases within the ERN registries. We further developed the RDW based on the Open Source Registry System for Rare Diseases (OSSE), which includes a metadata repository, Samply.MDR, to describe the data of the minimal dataset uniquely. Within the RDW, data from existing registries is not stored in a central database. The RDW uses the “Decentral Search” approach and can send requests to the connected registries; only aggregated data is returned, stating how many patients with specific characteristics are available. However, further work is needed to connect the different existing registries to the RDW and to perform first studies.
The diagnosis of patients with rare diseases is often delayed. A Clinical Decision Support System using similarity analysis of patient-based data may have the potential to support the diagnosis of patients with rare diseases. The objective of this qualitative study is to investigate how the result of a patient similarity analysis should be presented to a physician to enable diagnosis support. We conducted a focus group with physicians practicing in rare diseases as well as medical informatics researchers. To prepare the focus group, a literature search was performed to check the current state of research regarding visualization of similar patients. We then created software mockups of these visualization methods for discussion within the focus group. Two persons independently took field notes to collect the data of the focus group. A questionnaire was distributed to the participants to rate the visualization methods. The results show that four visualization methods are promising for the visualization of similar patients: “Patient on demand table”, “Criteria selection”, “Time-Series chart” and “Patient timeline”. “Patient on demand table” shows a direct comparison of patient characteristics, whereas “Criteria selection” allows the selection of different patient criteria to get deeper insights into the data. The “Time-Series chart” shows the time course of clinical parameters (e.g. blood pressure) whereas a “Patient timeline” indicates which time events exist for a patient (e.g. several symptoms on different dates). In the future, we will develop a software prototype of the Clinical Decision Support System to include the visualization methods and evaluate the clinical usage.
Clinical data, and above all individual patient data, are highly sensitive. It is all the more important to protect this critical information while analyzing and exploring it for further research. However, in order to enable students and other researchers to develop decision support systems and to use modern data analysis methods such as intelligent pattern recognition, the provision of clinical data is essential. In order to allow this while completely protecting the privacy of patients, we present a mixed approach to generate semantically and clinically realistic data: (1) We use available synthetic data, extract information on patient visits and diagnoses and adapt them to the encoding systems of German claims data; (2) based on a statistical analysis of real German hospital data, we identify distributions of procedures, laboratory data and other measurements and transfer them to the synthetic patients’ visits and diagnoses in a semi-automated way. This enabled us to provide students with a data set that is as semantically and clinically realistic as possible, so that they can apply patient-level prediction algorithms within the development of clinical decision support systems without putting patient data at any risk.
Sharing data is of great importance for research in medical sciences. It is the basis for reproducibility and reuse of already generated outcomes in new projects and in new contexts. FAIR data principles are the basis for sharing data. The Leipzig Health Atlas (LHA) platform follows these principles and provides data, describing metadata, and models that have been implemented in novel software tools and are available as demonstrators. LHA reuses and extends three different major components that have been previously developed by other projects. The SEEK management platform is the foundation, providing a repository for archiving, presenting and securely sharing a wide range of publication results, such as published reports, (bio)medical data as well as interactive models and tools. The LHA Data Portal manages study metadata and data, allowing users to search for data of interest. Finally, PhenoMan is an ontological framework for phenotype modelling. This paper describes the interrelation of these three components. In particular, we first use PhenoMan to model and represent phenotypes within the LHA platform. The ontological phenotype representation can then be used to generate search queries that are executed by the LHA Data Portal. PhenoMan generates the queries in a novel domain-specific query language (SDQL), which is specific to data management systems based on the CDISC ODM standard, such as the LHA Data Portal. Our approach was successfully applied to represent phenotypes in the Leipzig Health Atlas with the possibility to execute corresponding queries within the LHA Data Portal.
Data integration is a necessary and important step to perform translational research and improve the sample size beyond single data collections. For health information, the most recently established communication standard is HL7 FHIR. To bridge the concepts of “minimal invasive” data integration and open standards, we propose a generic ETL framework to process arbitrary patient-related data collections into HL7 FHIR, which in turn can then be used for loading into target data warehouses. The proposed algorithm is able to read any relational delimited text exports and produce a standard HL7 FHIR bundle collection. We evaluated an implementation of the algorithm using different lung research registries and used the resulting FHIR resources to fill our i2b2-based data warehouse as well as an OMOP common data model repository.
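The step from a delimited text export to a FHIR bundle can be illustrated with a small sketch. It is not the proposed framework, which is generic and handles arbitrary collections; here the column names (patient_id, sex, birth_date), the semicolon delimiter and the fixed mapping to Patient resources are assumptions made only for the example.

```python
import csv
import io

# Assumed mapping from registry codes to FHIR administrative gender codes
GENDER_CODES = {"m": "male", "f": "female"}


def rows_to_bundle(delimited_text, delimiter=";"):
    """Turn each row of a delimited export into a Patient resource inside a collection Bundle."""
    reader = csv.DictReader(io.StringIO(delimited_text), delimiter=delimiter)
    entries = []
    for row in reader:
        patient = {
            "resourceType": "Patient",
            "id": row["patient_id"],
            "gender": GENDER_CODES.get(row["sex"].lower(), "unknown"),
            "birthDate": row["birth_date"],  # expected as YYYY-MM-DD
        }
        entries.append({"resource": patient})
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}


# Hypothetical registry export
EXPORT = "patient_id;sex;birth_date\nP001;f;1961-04-12\nP002;m;1975-09-30\n"
bundle = rows_to_bundle(EXPORT)
```

The resulting dictionary can be serialized with json.dumps and passed on to a loader for the target data warehouse.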
The archiving and exchange interface for practice management systems of the Kassenärztliche Bundesvereinigung, defined by FHIR (Fast Healthcare Interoperability Resources) profiles with extensions, offers medical practitioners a new opportunity to change their system provider. The expectation is that an entire database can be transferred from a legacy system to another system without data loss. In this paper the potential loss of data is analyzed by comparing parameters. The results show that during an import on average 75% of the parameters per profile are supported and on average only 49% of the reviewed parameters, existing in the exporting system, could be represented based on the interface specification.
Precision medicine is an emerging and important field for health care. Molecular tumor boards use a combination of clinical and molecular data, such as somatic tumor mutations, to decide on personalized therapies for patients who have run out of standard treatment options. Personalized treatment decisions require clinical data from the hospital information system and mutation data to be accessible in a structured way. Here we introduce an open data platform to meet these requirements. We use the openEHR standard to create an expert-curated data model that is stored in a vendor-neutral format. Clinical and molecular patient data is integrated into cBioPortal, a warehousing solution for cancer genomic studies that is extended for use in clinical routine for molecular tumor boards. For data integration, we developed openEHR Mapper, a tool that allows users to (i) process input data, (ii) communicate with the openEHR repository, and (iii) export the data to cBioPortal. We benchmarked the mapper performance using XML and JSON as serialization formats and added caching capabilities as well as multi-threading to the openEHR Mapper.
Metadata repositories are an indispensable component of data integration infrastructures and support semantic interoperability between knowledge organization systems. Standards for metadata representation like the ISO/IEC 11179 as well as the Resource Description Framework (RDF) and the Simple Knowledge Organization System (SKOS) by the World Wide Web Consortium were published to ensure metadata interoperability, maintainability and sustainability. The FAIR guidelines were composed to explicate those aspects in four principles divided in fifteen sub-principles. The ISO/IEC 21526 standard extends the 11179 standard for the domain of health care and mandates that SKOS be used for certain scenarios. In medical informatics, the composition of health care SKOS classification schemes is often managed by documentalists and data scientists. They use editors, which support them in producing comprehensive and valid metadata. Current metadata editors either do not properly support the SKOS resource annotations, require server applications or make use of additional databases for metadata storage. These characteristics are contrary to the application independency and versatility of raw Unicode SKOS files, e.g. the custom text arrangement, extensibility or copy & paste editing. We provide an application that adds navigation, auto completion and validity check capabilities on top of a regular Unicode text editor.
In cancer registries, record linkage procedures are used to link records of the same patient from different health care providers. In the Clinical Cancer Registry of Lower Saxony, a multi-level combination of exact assignment using the statutory health insurance number and a probabilistic procedure with control numbers and address data is applied. The procedure implemented in the registry application thus assigns incoming messages automatically as far as possible. The aim of this investigation was to check the efficiency of the match variables and of the threshold values used, above which manual assignment is required. Weak points were identified and approaches to solutions were developed.
Health-related quality of life (HR-QoL) as a parameter for patient well-being is becoming increasingly important. Nevertheless, it is mainly used as an endpoint in studies rather than as an indicator for adjustments in therapy. In this paper we will present an approach to gradually integrate quality of life (QoL) as a control element into the care delivery of oncology.
Acceptance, usability, interoperability and data protection were identified and integrated as key indicators for the development. As an initial approach, a questionnaire tool was developed to make answering questionnaires simpler for patients and to give physicians a clearer presentation of the results.
HL7 FHIR was used as the communication standard, and established security concepts such as OpenID Connect were integrated. In a usability study, first results were obtained by asking patients in the waiting room to answer a questionnaire, which was then to be discussed with the physician during the appointment. This study was conducted in 2019 at the SLK Clinics Heilbronn and achieved 86% participation of all respondents with an average age of 67 years.
Although the evaluation study showed positive results in usability and acceptance, it is necessary to aim for longitudinal surveys in order to include QoL as a control element in the therapy. However, a longitudinal survey through questionnaires leads to decreasing compliance and increasing response bias. For this reason, the concept needs to be expanded. With sensors, continuous monitoring can be carried out, and the data can be mapped to the individual and interpreted by machine learning.
Questionnaires are a concept that has been successfully applied in studies for years. However, since care delivery poses different challenges, the integration of new concepts is inevitable. The authors are currently working on an extension of the use of questionnaires with patient generated data through sensors.
The main goal of this project was to define and evaluate a new unsupervised deep learning approach that can differentiate between normal and anomalous intervals of signals like the electrical activity of the heart (ECG). Denoising autoencoders based on recurrent neural networks with gated recurrent units were used for the semantic encoding of such time frames. A subsequent cluster analysis conducted in the code space served as the decision mechanism labelling samples as anomalies or normal intervals, respectively. The cluster ensemble method called cluster-based similarity partitioning proved itself well suited for this task when used in combination with density-based spatial clustering of applications with noise. The best performing system reached an adjusted Rand index of 0.11 on real-world ECG signals labelled by medical experts. This corresponds to a precision and recall regarding the detection task of around 0.72. The new general approach outperformed several state-of-the-art outlier recognition methods and can be applied to all kinds of (medical) time series data. It can serve as a basis for more specific detectors that work in an unsupervised fashion or that are partially guided by medical experts.
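The adjusted Rand index used above compares two partitions of the same samples and corrects the raw pair agreement for chance: identical partitions score 1, and random labellings score around 0. The following is a minimal sketch of the standard formula based on pair counts from the contingency table; the function name and example labels are illustrative and not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same samples."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label lists of equal length with at least two samples")
    # Pairs of samples that share a cluster in both partitions
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster within each partition separately
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions consist of a single cluster
    return (together_in_both - expected) / (maximum - expected)


# Cluster names do not matter, only which samples are grouped together
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call returns 1.0 because both labellings group the samples identically; the second returns -0.5 because the two labellings agree less than would be expected by chance.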
Several standards and frameworks have been described in existing literature and technical manuals that contribute to solving the interoperability problem. Their data models usually focus on clinical data and only support healthcare delivery processes. Research processes including cross organizational cohort size estimation, approvals and reviews of research proposals, consent checks, record linkage and pseudonymization need to be supported within the HiGHmed medical informatics consortium. The open source HiGHmed Data Sharing Framework implements a distributed business process engine for executing arbitrary biomedical research and healthcare processes modeled and executed using BPMN 2.0 while exchanging information using FHIR R4 resources. The proposed reference implementation is currently being rolled out to eight university hospitals in Germany as well as a trusted third party and available open source under the Apache 2.0 license.
Medical routine data promises to add value for research. However, the transfer of this data into a research context is difficult. Therefore, Medical Data Integration Centers are being set up to merge data from primary information systems in a central repository. Yet data from one organization is rarely sufficient to answer a research question. The data must be merged beyond institutional boundaries. In order to use this data in a specific research project, a researcher must have the possibility to query available cohort sizes across institutions. A possible solution for this requirement is presented in this paper, using a process for fully automated and distributed feasibility queries (i.e. cohort size estimations). This process is executed according to the open standard BPMN 2.0, and the underlying process data model is based on HL7 FHIR R4 resources. The proposed solution is currently being deployed at eight university hospitals and one trusted third party across Germany.
The process of consolidating medical records from multiple institutions into one data set makes privacy-preserving record linkage (PPRL) a necessity. Most PPRL approaches, however, are only designed to link records from two institutions, and existing multi-party approaches tend to discard non-matching records, leading to incomplete result sets. In this paper, we propose a new algorithm for federated record linkage between multiple parties by a trusted third party using record-level Bloom filters to preserve patient data privacy. We conduct a study to find optimal weights for linkage-relevant data fields and are able to achieve 99.5% linkage accuracy testing on the Febrl record linkage dataset. This approach is integrated into an end-to-end pseudonymization framework for medical data sharing.
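The principle behind Bloom-filter-based linkage can be sketched as follows: each field is split into character bigrams, every bigram is hashed several times with a keyed hash into one record-level bit set, and two records are compared by the Dice coefficient of their bit sets. This is a simplified illustration and not the proposed algorithm; field weighting and the multi-party protocol are left out, and the secret key, filter size and number of hash functions are arbitrary assumptions.

```python
import hashlib
import hmac

SECRET = b"shared-linkage-key"  # assumed secret shared by the data holders


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def record_bloom_filter(fields, size=1024, num_hashes=4):
    """Hash the bigrams of all fields into one record-level set of bit positions."""
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            for k in range(num_hashes):
                message = f"{k}:{gram}".encode("utf-8")
                digest = hmac.new(SECRET, message, hashlib.sha256).digest()
                bits.add(int.from_bytes(digest, "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two bit-position sets (1.0 means identical filters)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Hypothetical records: surname and date of birth
record_1 = record_bloom_filter(["Meyer", "1980-02-01"])
record_2 = record_bloom_filter(["Meier", "1980-02-01"])   # spelling variant
record_3 = record_bloom_filter(["Schulz", "1975-11-30"])  # different person
```

Because the spelling variant shares most bigrams with the first record, dice_similarity(record_1, record_2) is considerably higher than dice_similarity(record_1, record_3); a linkage unit would compare such scores against a threshold without ever seeing the plaintext fields.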
Publicly available datasets – for example via cBioPortal for Cancer Genomics – could be a valuable source for benchmarks and comparisons with local patient records. However, such an approach is only valid if patient cohorts are comparable to each other and if the documentation is complete and sufficient. In this paper, records from exocrine pancreatic cancer patients documented in a local cancer registry are compared with two public datasets to calculate overall survival. Several data preprocessing steps were necessary to ensure comparability of the different datasets and a common database schema was created. Our assumption that the public datasets could be used to augment the data of the local cancer registry could not be validated, since the analysis on overall survival showed a significant difference. We discuss several reasons and explanations for this finding. So far, comparing different datasets with each other and drawing medical conclusions on such comparisons should be conducted with great caution.
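Overall survival in cohorts with censored follow-up is commonly estimated with the Kaplan-Meier product-limit estimator; the abstract does not state which method the authors used, so the sketch below is only an assumed illustration of how a survival curve could be computed for each dataset before comparison. The function name and the example data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival probability) at each death time.

    times  -- follow-up time of each patient
    events -- 1 if the patient died at that time, 0 if follow-up was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Collect all patients whose follow-up ends at the current time
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up in months for five patients
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

For the hypothetical data the estimate drops to 0.8 at month 2, to 0.6 at month 3 and to 0.3 at month 5; the censored observation at month 8 does not change the curve.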