Semi-Supervised Named Entity Recognition of Medical Entities in Swedish

Master's thesis

Type: Master's thesis
Title: Semi-Supervised Named Entity Recognition of Medical Entities in Swedish
Authors: Almgren, Simon; Pavlov, Sean
Abstract: A big opportunity in today's society lies in the vast amounts of data generated each day, especially in the health-care sector, where many journals are written daily and need to be processed in order to properly identify their content. Enter the field of Named Entity Recognition (NER), where text is analyzed to locate and classify entities into predefined classes; in our case Disorder & Finding, Pharmaceutical Drug, and Body Structure. With a model that can do this with high accuracy, the analysis of medical texts could be automated, relieving people of having to read through them manually. Since journals and other medical texts are often very sensitive and must be handled with care for privacy reasons, a method for constructing such models without the need for real annotated journals would be a big step in the right direction. In this thesis we have implemented two models for solving the problem of NER for medical texts in Swedish. Both models were created from lists of seed terms, which consist of words and phrases found in medical taxonomies that we assume belong to one of the three categories. Training data were extracted from the health-care magazine Läkartidningen as well as a subset of Swedish Wikipedia. The first model is based on the work of Zhang and Elhadad [23], where a vector representation is calculated for each candidate word and compared against vectors calculated the same way for the different categories. The results of our implementation are on par with those reported by Zhang and Elhadad, which suggests that this method works as well for Swedish as it does for English. The second model is based on recurrent neural networks and is built from the same seed terms as the first model, but instead of relying only on vector calculations for classification, the network is trained to classify words automatically on a character basis, reading the text both forwards and backwards at the same time.
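The first model's core idea — comparing a word's vector against per-category vectors — can be sketched as a cosine-similarity lookup. The toy embeddings and the threshold below are illustrative assumptions, not the thesis's actual vectors, which are derived from Läkartidningen and Wikipedia text:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional category vectors; in the thesis these are built
# from seed terms found in medical taxonomies.
category_vectors = {
    "Disorder & Finding": np.array([1.0, 0.1, 0.0]),
    "Pharmaceutical Drug": np.array([0.1, 1.0, 0.1]),
    "Body Structure": np.array([0.0, 0.1, 1.0]),
}

def classify(word_vec, categories, threshold=0.5):
    # Compare the candidate word's vector against each category vector and
    # return the best-matching category if its similarity clears the threshold.
    best_cat, best_sim = None, threshold
    for cat, vec in categories.items():
        sim = cosine(word_vec, vec)
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat

print(classify(np.array([0.9, 0.2, 0.05]), category_vectors))
# → Disorder & Finding
```

Words whose best similarity falls below the threshold are left unclassified, which is one plausible way to keep precision up when seed lists are noisy.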
Solving the problem of NER using only unsupervised methods is inherently hard, and the techniques for doing so are not quite there yet. However, improving them bit by bit will in the end lead to great results.
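The second model's bidirectional, character-level reading can be illustrated with a minimal untrained RNN pass in plain NumPy. This is only a sketch of the mechanism — random weights, toy hidden size, and a simplified alphabet are all assumptions, whereas the thesis trains the network on seed-term data:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = "abcdefghijklmnopqrstuvwxyzåäö "  # toy character alphabet
H = 8  # toy hidden-state size

# Random, untrained parameters; a real model learns these during training.
Wx = rng.standard_normal((H, len(VOCAB))) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

def one_hot(ch):
    v = np.zeros(len(VOCAB))
    v[VOCAB.index(ch)] = 1.0
    return v

def rnn_pass(chars):
    # Simple tanh RNN scanned over a character sequence.
    h = np.zeros(H)
    states = []
    for ch in chars:
        h = np.tanh(Wx @ one_hot(ch) + Wh @ h)
        states.append(h)
    return states

def bidirectional_features(word):
    fwd = rnn_pass(word)              # read the characters left to right
    bwd = rnn_pass(word[::-1])[::-1]  # read right to left, then realign
    # Each character position is represented by both directions' states,
    # so it sees context from both sides of the word.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

feats = bidirectional_features("hjärta")
print(len(feats), feats[0].shape)  # one 2H-dimensional feature per character
```

The concatenated forward and backward states would then feed a classification layer over the three entity categories; that layer is omitted here.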
Keywords: Computer and Information Science
Issue date: 2016
Publisher: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)
URI: https://hdl.handle.net/20.500.12380/248967
Collection: Master Theses



The material in Chalmers' open archive is protected by copyright and may not be used for commercial purposes!