De-identification of Swedish medical chat messages with transformers
Type
Master's thesis
Program
Data science and AI (MPDSC), MSc
Published
2022
Authors
Arvidsson, David
Gerle, William
Abstract
Healthcare in Sweden is becoming increasingly digital, and while new technology
can improve healthcare, it also presents risks. In this thesis, which was
conducted together with Visiba Care Sweden AB, data security and privacy risks
are of special interest. Visiba Care offers a virtual care platform where
patients and healthcare professionals can chat. If these chat messages could be
de-identified, they could be stored and used to improve healthcare for patients.
De-identification is widely studied within machine learning, but research on
Swedish medical corpora is limited, particularly for corpora consisting of chat
messages. Using KB-BERT for named entity recognition (NER), this thesis
investigated whether it is possible to reach the same performance on Swedish
medical chat messages as the current state-of-the-art NER model reaches on
Swedish electronic patient records. Furthermore, the thesis investigated the
importance of training data size within this domain, and whether a KB-BERT NER
model trained on rule-based annotated data could reach higher performance than
the rules it had been trained on.
Data was collected from two of Visiba Care's customers. The annotation process
followed strict annotation rules: a rule-based script first annotated the data,
after which a manual review was conducted. KB-BERT was accessed through the
open-source library Hugging Face, and the hyperparameters were tuned using
random search to optimize performance. Furthermore, the decision threshold was
tuned to improve recall, since this metric was considered more important than
precision in the given domain.
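The recall-oriented threshold tuning described above can be sketched as
follows. This is a minimal illustration, not the thesis's implementation: the
class labels, logit values, and threshold are invented for the example. The
idea is that a token is only labelled as an entity if the most probable entity
class exceeds a threshold, so lowering the threshold trades precision for
recall.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_entities(token_logits, threshold=0.5, outside_idx=0):
    """For each token, pick the best entity class if its probability
    exceeds `threshold`; otherwise fall back to the 'O' (outside) class.
    Lowering `threshold` increases recall at the cost of precision."""
    preds = []
    for logits in token_logits:
        probs = softmax(logits)
        # Best class among the entity classes, i.e. ignoring 'O'.
        best_idx = max(
            (i for i in range(len(probs)) if i != outside_idx),
            key=lambda i: probs[i],
        )
        preds.append(best_idx if probs[best_idx] >= threshold else outside_idx)
    return preds

# Per-token logits over illustrative classes [O, NAME, LOCATION].
logits = [
    [2.0, 1.5, 0.1],  # 'O' most likely, NAME close behind
    [0.1, 3.0, 0.2],  # clearly NAME
    [1.0, 0.9, 0.8],  # ambiguous
]

print(predict_entities(logits, threshold=0.5))  # → [0, 1, 0]
print(predict_entities(logits, threshold=0.3))  # → [1, 1, 1]
```

With the default threshold only the unambiguous NAME token is tagged; lowering
the threshold tags the two borderline tokens as well, which is the desired
behaviour when missed identifiers are costlier than false alarms.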
The results showed that it was possible to exceed current state-of-the-art
performance, and that using a single class for all entities led to a further
performance increase. Regarding training data size, the results showed that not
only the size but also the format of the entities matters. Lastly, we failed to
create a KB-BERT model trained on rule-based annotated data that reached higher
performance than the rules it had been trained on. A potential explanation is
that the rule-based script did not produce annotations of high enough quality.
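A rule-based annotator of the kind discussed above can be sketched as follows.
The patterns are illustrative examples only, not the thesis's actual rules: the
first matches the common Swedish personal identity number format (YYMMDD-XXXX),
the second a simple mobile phone number shape.

```python
import re

# Hypothetical example rules; real de-identification scripts would
# need many more patterns and careful validation.
PATTERNS = {
    "ID_NUMBER": re.compile(r"\b\d{6}[-+]\d{4}\b"),          # e.g. 850709-1234
    "PHONE": re.compile(r"\b07\d[-\s]?\d{3}\s?\d{2}\s?\d{2}\b"),
}

def annotate(text):
    """Return (start, end, label) spans for every rule match in `text`."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(annotate("Mitt personnummer är 850709-1234."))
```

Spans produced this way can serve as silver-standard training labels, but as
the results above suggest, a model trained on them can only be as good as the
rules unless the annotations are of high quality.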
Subject/keywords
BERT, Named Entity Recognition, de-identification