De-identification of Swedish medical chat messages with transformers

Type
Master's thesis
Programme
Data science and AI (MPDSC), MSc
Published
2022
Authors
Arvidsson, David
Gerle, William
Abstract
Healthcare in Sweden is becoming increasingly digital, and although new technology can enable better care, it also introduces risks. This thesis, conducted together with Visiba Care Sweden AB, focuses on data security and privacy risks. Visiba Care offers a virtual care platform on which patients and healthcare professionals can chat. If chat messages could be de-identified, they could be stored and used to improve healthcare for patients. De-identification is widely studied within machine learning; however, research on Swedish medical corpora is limited, particularly for corpora consisting of chat messages. Using KB-BERT for named entity recognition (NER), this thesis investigated whether the same performance could be reached on Swedish medical chat messages as the current state-of-the-art NER model reaches on Swedish electronic patient records. Furthermore, it investigated the importance of training data size within this domain, and whether a KB-BERT NER model trained on rule-based annotated data could outperform the rules it had been trained on. Data was collected from two of Visiba Care's customers. Annotation followed strict rules: a rule-based script first annotated the data, after which a manual review was conducted. KB-BERT was accessed through the open-source Hugging Face library, and the hyperparameters were tuned using random search. Furthermore, the decision threshold was tuned to improve recall, since this metric was considered more important than precision in the given domain. The results showed that it was possible to exceed current state-of-the-art performance, and that using one class for all entities led to a further performance increase. Regarding training data size, the results showed that not only the size but also the format of the entities is important.
Lastly, we failed to create a KB-BERT model trained on rule-based annotated data that outperformed the rules it had been trained on. A potential explanation for this is that the rule-based script did not produce annotations of high enough quality.
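The abstract does not describe how the decision threshold was tuned. As a rough illustration of the general idea only, the sketch below assumes a single-class setting in which each token has a predicted probability of being a sensitive entity, and picks the highest threshold that still meets a target recall. The function names, candidate thresholds, and data are all hypothetical and are not taken from the thesis.

```python
from typing import List, Optional, Sequence, Tuple


def tag_tokens(entity_probs: Sequence[float], threshold: float) -> List[int]:
    """Mark a token as sensitive (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in entity_probs]


def precision_recall(pred: Sequence[int], gold: Sequence[int]) -> Tuple[float, float]:
    """Token-level precision and recall for the single 'sensitive' class."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_threshold(
    entity_probs: Sequence[float],
    gold: Sequence[int],
    min_recall: float,
    candidates: Sequence[float],
) -> Optional[float]:
    """Return the highest candidate threshold whose recall meets min_recall.

    Trying thresholds from high to low keeps precision as high as possible
    while satisfying the recall target. Returns None if no candidate works.
    """
    for t in sorted(candidates, reverse=True):
        _, recall = precision_recall(tag_tokens(entity_probs, t), gold)
        if recall >= min_recall:
            return t
    return None


# Made-up validation data: model probabilities and gold labels per token.
probs = [0.95, 0.40, 0.10, 0.70, 0.30, 0.05]
gold = [1, 1, 0, 1, 0, 0]

default_pr = precision_recall(tag_tokens(probs, 0.5), gold)
chosen = pick_threshold(probs, gold, min_recall=1.0, candidates=[0.5, 0.35, 0.25])
print("precision/recall at 0.5:", default_pr)
print("chosen threshold:", chosen)
```

In this toy data, the default threshold of 0.5 misses one sensitive token, while a lower threshold recovers it. In a de-identification setting that trade-off favours recall, because a missed identifier is costlier than an over-masked ordinary word.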
Subject/keywords
BERT, Named Entity Recognition, de-identification