In this article, we compare the performance of a state-of-the-art segmentation network (UNet) on two glioblastoma (GB) segmentation datasets. Our experiments show that the same training procedure yields results almost twice as bad, in terms of Dice score, on retrospective clinical data as on the BraTS challenge data. We discuss possible reasons for this outcome, including inter-rater variability and high variability in magnetic resonance imaging (MRI) scanners and scanner settings. High segmentation performance demonstrated on preselected imaging data does not bring the community closer to using these algorithms in clinical settings. We believe that a clinically applicable deep learning architecture requires a shift from unified datasets to heterogeneous data.