De-identification of Swedish medical chat messages with transformers
Type
Master's thesis
Program
Data science and AI (MPDSC), MSc
Published
2022
Authors
Arvidsson, David
Gerle, William
Abstract
Healthcare in Sweden is becoming increasingly digital, and while new technology
can improve healthcare, it also presents risks. In this thesis, which was
conducted together with Visiba Care Sweden AB, data security and privacy risks
are of special interest. Visiba Care offers a virtual care platform where
patients and healthcare professionals can chat. If these chat messages could be
de-identified, they could be stored and used to improve healthcare for patients.
De-identification is widely studied within machine learning, but research on
Swedish medical corpora is limited, particularly for corpora consisting of chat
messages. Using KB-BERT for named entity recognition (NER), this thesis
investigated whether it is possible to reach the same performance on Swedish
medical chat messages as the current state-of-the-art NER model reaches on
Swedish electronic patient records. Furthermore, the thesis investigated the
importance of training data size within this domain, and whether a KB-BERT NER
model trained on rule-based annotated data could reach higher performance than
the rules it had been trained on.
Data was collected from two of Visiba Care's customers. The annotation process
followed strict annotation rules: a rule-based script first annotated the data,
after which a manual review was conducted. KB-BERT was accessed through the
open-source library Hugging Face, and the hyperparameters were tuned using
random search to optimize performance. Furthermore, the decision threshold was
tuned to improve recall, since this metric was considered more important than
precision in the given domain.
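The recall-oriented threshold tuning described above can be sketched as
follows. This is a minimal illustration, not the thesis's implementation: the
class labels, logit values, and threshold are invented for the example. The
idea is that a token is only labelled as an entity if the most probable entity
class exceeds a threshold, so lowering the threshold trades precision for
recall.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_entities(token_logits, threshold=0.5, outside_idx=0):
    """For each token, pick the best entity class if its probability
    exceeds `threshold`; otherwise fall back to the 'O' (outside) class.
    Lowering `threshold` increases recall at the cost of precision."""
    preds = []
    for logits in token_logits:
        probs = softmax(logits)
        # Best class among the entity classes, i.e. ignoring 'O'.
        best_idx = max(
            (i for i in range(len(probs)) if i != outside_idx),
            key=lambda i: probs[i],
        )
        preds.append(best_idx if probs[best_idx] >= threshold else outside_idx)
    return preds

# Per-token logits over illustrative classes [O, NAME, LOCATION].
logits = [
    [2.0, 1.5, 0.1],  # 'O' most likely, NAME close behind
    [0.1, 3.0, 0.2],  # clearly NAME
    [1.0, 0.9, 0.8],  # ambiguous
]

print(predict_entities(logits, threshold=0.5))  # → [0, 1, 0]
print(predict_entities(logits, threshold=0.3))  # → [1, 1, 1]
```

With the default threshold only the unambiguous NAME token is tagged; lowering
the threshold tags the two borderline tokens as well, which is the desired
behaviour when missed identifiers are costlier than false alarms.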
The results showed that it was possible to exceed current state-of-the-art
performance, and that using a single class for all entities led to a further
performance increase. Regarding training data size, the results showed that not
only the size but also the format of the entities matters. Lastly, we failed to
create a KB-BERT model trained on rule-based annotated data that reached higher
performance than the rules it had been trained on. A potential explanation is
that the rule-based script did not produce annotations of high enough quality.
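A rule-based annotator of the kind discussed above can be sketched as follows.
The patterns are illustrative examples only, not the thesis's actual rules: the
first matches the common Swedish personal identity number format (YYMMDD-XXXX),
the second a simple mobile phone number shape.

```python
import re

# Hypothetical example rules; real de-identification scripts would
# need many more patterns and careful validation.
PATTERNS = {
    "ID_NUMBER": re.compile(r"\b\d{6}[-+]\d{4}\b"),          # e.g. 850709-1234
    "PHONE": re.compile(r"\b07\d[-\s]?\d{3}\s?\d{2}\s?\d{2}\b"),
}

def annotate(text):
    """Return (start, end, label) spans for every rule match in `text`."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(annotate("Mitt personnummer är 850709-1234."))
```

Spans produced this way can serve as silver-standard training labels, but as
the results above suggest, a model trained on them can only be as good as the
rules unless the annotations are of high quality.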
Subject/keywords
BERT, Named Entity Recognition, de-identification