Semi-Supervised Named Entity Recognition of Medical Entities in Swedish

Type
Master's thesis
Program
Computer science – algorithms, languages and logic (MPALG), MSc
Published
2016
Authors
Almgren, Simon
Pavlov, Sean
Abstract
A major opportunity in today's society lies in the vast amounts of data generated each day. This is especially true in the health-care sector, where many journals are written daily and need to be processed in some way to properly identify their content. Enter the field of Named Entity Recognition (NER), where text is analyzed to locate and classify entities into predefined classes; in our case Disorder & Finding, Pharmaceutical Drug, and Body Structure. With a model that can do this with high accuracy, the analysis of medical texts could be automated, removing the strain on people who would otherwise have to read through them manually. Since journals and other medical texts are often highly sensitive and must be handled with care for privacy reasons, a method for constructing such models without the need for real annotated journals would be a big step in the right direction.

During this thesis we implemented two models for NER on medical texts in Swedish. Both models were created from lists of seed terms, consisting of words and phrases found in medical taxonomies that we assume belong to one of the three categories. Training data were extracted from the health-care magazine Läkartidningen as well as a subset of Swedish Wikipedia. The first model is based on the work of Zhang and Elhadad [23], where a vector representation is calculated for each candidate word and compared against vectors calculated the same way for the different categories. The results of our implementation are on par with those reported by Zhang and Elhadad, which suggests that this method works as well for Swedish as it does for English. The second model is based on recurrent neural networks and is built from the same seed terms as the first, but instead of relying only on vector calculations for classification, the network is trained to classify words at the character level, reading the text both forwards and backwards at the same time.
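The classification step of the first model (comparing a word's vector against per-category centroids built from seed terms, as in the Zhang and Elhadad approach described above) can be sketched roughly as follows. This is a minimal illustration: the embeddings and seed-term lists below are toy, hand-made values, not data from the thesis; real vectors would be learned from Läkartidningen and Swedish Wikipedia.

```python
import math

# Toy word vectors standing in for embeddings learned from the corpora;
# all values here are illustrative only.
EMBEDDINGS = {
    "feber": [0.8, 0.2, 0.1],        # "fever"
    "huvudvärk": [0.9, 0.1, 0.0],    # "headache"
    "paracetamol": [0.1, 0.9, 0.0],
    "ibuprofen": [0.2, 0.8, 0.1],
    "lever": [0.0, 0.1, 0.9],        # "liver"
    "njure": [0.1, 0.0, 0.8],        # "kidney"
}

# Hypothetical seed-term lists for the three categories in the thesis.
SEED_TERMS = {
    "Disorder & Finding": ["feber", "huvudvärk"],
    "Pharmaceutical Drug": ["paracetamol", "ibuprofen"],
    "Body Structure": ["lever", "njure"],
}

def centroid(terms):
    """Average the embeddings of a category's seed terms element-wise."""
    vectors = [EMBEDDINGS[t] for t in terms]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def classify(word):
    """Assign the category whose seed-term centroid is most similar."""
    v = EMBEDDINGS[word]
    return max(SEED_TERMS, key=lambda c: cosine(v, centroid(SEED_TERMS[c])))
```

With these toy vectors, `classify("feber")` returns "Disorder & Finding" because its vector lies closest to that category's centroid. The thesis also applies further refinements (and the second model replaces this vector comparison with a character-level bidirectional network), so this sketch captures only the core similarity idea.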
Solving NER using only unsupervised methods is inherently hard, and current techniques do not yet fully solve the problem. However, improving them bit by bit will in the end lead to strong results.
Description
Subject/keywords
Computer and Information Science
Citation