Unsupervised Extraction of Body-Text from Clinical PDF Documents

Bensahla, Adel; Zaghir, Jamil; Gaudet-Blavignac, Christophe; Lovis, Christian

doi:10.3233/SHTI240382

Pages

214 - 215

Category

Research Article

Series

Studies in Health Technology and Informatics

Ebook

Volume 316: Digital Health and Informatics Innovations for Sustainable Health Care Systems

Abstract

Automatic extraction of body-text from clinical PDF documents is necessary to enhance downstream NLP tasks but remains a challenge. This study presents an unsupervised algorithm designed to extract body-text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks using their content and coordinates. Evaluation results show precision ranging from 0.82 to 0.98, recall from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across sources from various medical specialties. Future work includes dynamic parameter adjustment for improved accuracy and evaluation on larger datasets.
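
The abstract does not give implementation details, so the following is only an illustrative sketch of how density-based clustering over pooled pages might separate repeated page furniture from body-text. Everything in it is an assumption rather than the authors' method: the `Block` record, the `extract_body_text` helper, the `eps`, `min_samples` and `repeat_ratio` values, the top-down y axis, and the rule that a positional cluster whose members mostly share the same text is treated as a header or footer. DBSCAN is written out in plain Python so the example has no dependencies.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from math import hypot

NOISE = -1


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: page index, top-left corner, and content."""
    page: int
    x: float
    y: float  # assumed to grow downwards from the top of the page
    text: str


def dbscan(points, eps, min_samples):
    """Plain DBSCAN over 2-D points; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        # The point itself counts as one of its own neighbours.
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels


def extract_body_text(blocks, eps=5.0, min_samples=3, repeat_ratio=0.8):
    """Drop blocks in positional clusters whose text repeats across pages."""
    labels = dbscan([(b.x, b.y) for b in blocks], eps, min_samples)
    members = defaultdict(list)
    for block, label in zip(blocks, labels):
        members[label].append(block)

    boilerplate = set()
    for label, group in members.items():
        if label == NOISE:
            continue
        top_count = Counter(b.text for b in group).most_common(1)[0][1]
        if top_count / len(group) >= repeat_ratio:
            boilerplate.add(label)

    kept = [b for b, lab in zip(blocks, labels) if lab not in boilerplate]
    # Restore reading order: by page, then top to bottom, then left to right.
    return sorted(kept, key=lambda b: (b.page, b.y, b.x))


# Made-up blocks from three pages pooled together (not data from the study).
SAMPLE = [
    Block(0, 50, 20, "University Hospital"),
    Block(0, 50, 100, "Patient admitted for chest pain."),
    Block(0, 50, 140, "ECG showed no abnormality."),
    Block(0, 50, 800, "Confidential"),
    Block(1, 51, 21, "University Hospital"),
    Block(1, 50, 100, "Follow-up visit after two weeks."),
    Block(1, 50, 799, "Confidential"),
    Block(2, 50, 20, "University Hospital"),
    Block(2, 50, 102, "Discharged in stable condition."),
    Block(2, 50, 800, "Confidential"),
]

if __name__ == "__main__":
    for block in extract_body_text(SAMPLE):
        print(block.page, block.text)
```

In this toy input the header and footer each form a dense cluster with identical text and are dropped, while the paragraphs that happen to share a position are kept because their content differs. Fixed `eps` and `min_samples` values like these are the kind of parameter the abstract suggests adjusting dynamically in future work.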

IOS Press Copyright 2024