

The adoption of Electronic Health Records is growing rapidly, making very large quantities of patient clinical information available in electronic format. This availability offers tremendous potential but also raises growing concern about breaches of patient confidentiality. De-identification of patient information has been proposed as a way both to facilitate secondary uses of clinical information and to protect patient confidentiality. Automated approaches based on Natural Language Processing have been implemented and evaluated, and they de-identify text much faster than manual approaches. A U.S. Veterans Affairs clinical text de-identification project focused on investigating the current state of the art in automatic clinical text de-identification, on developing a best-of-breed de-identification application for clinical documents, and on evaluating its impact on subsequent uses of the text and the risk of re-identification. To evaluate this risk, we de-identified discharge summaries from 86 patients using our ‘best-of-breed’ text de-identification application, with resynthesis of the detected identifiers. We then asked physicians working in the ward where these patients had been hospitalized whether they could recognize the patients when reading the de-identified documents. Each document was examined by at least one resident and one attending physician. For 4.65% of the documents, physicians thought they recognized the patient because of specific clinical information, but after verification, none of the patients was correctly re-identified.