Word embeddings have become the predominant token-level representation scheme for various clinical natural language processing (NLP) tasks. More recently, character-level neural language models based on recurrent neural networks have again received attention because they achieved comparable performance on various NLP benchmarks. We investigated to what extent character-based language models can be applied to the clinical domain and whether they can capture reasonable lexical semantics using this maximally fine-grained representation scheme. We trained a long short-term memory network on an excerpt from a table of de-identified, 50-character-long problem list entries in German, each of which is assigned an ICD-10 code. We modelled the task as a time series of one-hot encoded single-character inputs. After the training phase, we retrieved the 10 character-induced word embeddings most similar to a given clinical concept via a nearest neighbour search and evaluated whether they showed the expected semantic relatedness. The results showed that traceable semantics were captured at a syntactic level above single characters, addressing the idiosyncratic nature of clinical language. They support recent work on general language modelling that raised the question of whether token-based representation schemes are still necessary for specific NLP tasks.
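The input encoding and the retrieval step described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the LSTM itself is omitted, and the function names, the space padding, the example entries, and the choice of cosine similarity as the distance measure are all assumptions made for illustration.

```python
import numpy as np

PAD = " "  # assumption: entries shorter than 50 characters are padded with spaces


def build_vocab(entries):
    """Map every character seen in the entries (plus the pad character) to an index."""
    chars = sorted(set("".join(entries)) | {PAD})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_sequence(entry, vocab, length=50):
    """Encode one entry as a (length, vocab_size) matrix, one one-hot row per time step."""
    padded = entry[:length].ljust(length, PAD)
    seq = np.zeros((length, len(vocab)), dtype=np.float32)
    for t, ch in enumerate(padded):
        seq[t, vocab[ch]] = 1.0
    return seq


def top_k_neighbours(query, words, vectors, k=10):
    """Return the k words whose vectors are closest to the query word.

    Assumes cosine similarity and non-zero vectors; the metric used in the
    study is not stated in the abstract.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(words[i], float(sims[i])) for i in order if words[i] != query][:k]


if __name__ == "__main__":
    # Hypothetical problem list entries, invented for this sketch.
    entries = ["Diabetes mellitus Typ 2", "Arterielle Hypertonie"]
    vocab = build_vocab(entries)
    x = one_hot_sequence(entries[0], vocab)
    print(x.shape)  # (time steps, vocabulary size)
```

In such a setup, the `vectors` passed to `top_k_neighbours` would come from the trained network (for example, a hidden state read out after the last character of a word); how the character-induced word embeddings were actually derived is not specified in the abstract.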