Semi-Supervised Named Entity Recognition of Medical Entities in Swedish

Master's thesis (Examensarbete för masterexamen)

Authors: Almgren, Simon
Pavlov, Sean
Abstract: A major opportunity in today's society lies in the vast amounts of data generated each day, especially in the health-care sector, where many journals are written daily and need to be processed in some way to properly identify their content. Enter the field of Named Entity Recognition (NER), where text is analyzed to locate and classify entities into predefined classes; in our case Disorder & Finding, Pharmaceutical Drug and Body Structure. With a model that can do this with high accuracy, the analysis of medical texts could be automated, relieving people of having to read through them manually. Since journals and other medical texts are often highly sensitive and must be handled with care due to privacy concerns, a method for constructing such models without the need for real annotated journals would be a big step in the right direction.

In this thesis we have implemented two models for solving the problem of NER for medical texts in Swedish. Both models were created from lists of seed terms, consisting of words and phrases found in medical taxonomies that we assume belong to one of the three categories. Training data were extracted from the health-care magazine Läkartidningen as well as a subset of Swedish Wikipedia. The first model is based on the work of Zhang and Elhadad [23], in which a vector representation is calculated for each candidate word and compared against vectors calculated the same way for the different categories. The results of our implementation are on par with those reported by Zhang and Elhadad, which suggests that this method works as well for Swedish as it does for English. The second model is based on recurrent neural networks and is built from the same seed terms as the first, but instead of using only vector calculations for classification, the network is trained to classify words automatically on a character basis, reading the text both forwards and backwards at the same time.
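The idea behind the first model — comparing a candidate word's vector against per-category vectors built from seed terms — can be sketched as follows. This is a minimal illustration, not the thesis implementation: the embeddings and seed terms below are invented toy values (real ones would come from training on Läkartidningen and Swedish Wikipedia and from the medical taxonomies), and the cosine-similarity-to-centroid rule is an assumed simplification of the Zhang and Elhadad approach.

```python
import numpy as np

# Toy word embeddings; in practice these would be learned from the corpora.
embeddings = {
    "huvudvärk":   np.array([0.9, 0.1, 0.0]),  # "headache"
    "feber":       np.array([0.8, 0.2, 0.1]),  # "fever"
    "ibuprofen":   np.array([0.1, 0.9, 0.0]),
    "paracetamol": np.array([0.2, 0.8, 0.1]),
    "hjärta":      np.array([0.0, 0.1, 0.9]),  # "heart"
    "lunga":       np.array([0.1, 0.0, 0.8]),  # "lung"
}

# Illustrative seed terms per category (stand-ins for taxonomy entries).
seeds = {
    "Disorder & Finding":  ["huvudvärk", "feber"],
    "Pharmaceutical Drug": ["ibuprofen", "paracetamol"],
    "Body Structure":      ["hjärta", "lunga"],
}

def category_vector(terms):
    """Average the embeddings of a category's seed terms into one vector."""
    return np.mean([embeddings[t] for t in terms], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(word, threshold=0.5):
    """Assign the word to its most similar category vector, or None."""
    vec = embeddings[word]
    scores = {cat: cosine(vec, category_vector(t)) for cat, t in seeds.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(classify("feber"))  # → Disorder & Finding
```

The threshold keeps clearly out-of-domain words unclassified; its value here is arbitrary.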
Solving the problem of NER using only unsupervised methods is inherently hard, and the techniques for doing so are not quite there yet. However, improving them bit by bit will in the end lead to strong results.
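The second model's "reading the text both forwards and backwards at the same time" refers to a bidirectional recurrent network over characters. The forward pass of such an architecture can be sketched as below; the weights here are random and untrained (the thesis trains them from seed-term supervision), and the plain tanh Elman cells are an assumed simplification of whatever recurrent cell the thesis actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # Swedish lowercase letters
V, H = len(ALPHABET), 8                     # vocabulary size, hidden size

# Randomly initialised weights for the forward- and backward-reading RNNs.
Wf, Uf = rng.normal(0, 0.1, (H, V)), rng.normal(0, 0.1, (H, H))
Wb, Ub = rng.normal(0, 0.1, (H, V)), rng.normal(0, 0.1, (H, H))

def one_hot(ch):
    """Encode a single character as a one-hot vector."""
    v = np.zeros(V)
    v[ALPHABET.index(ch)] = 1.0
    return v

def rnn_pass(chars, W, U):
    """Run a simple tanh RNN over a character sequence, one state per step."""
    h = np.zeros(H)
    states = []
    for ch in chars:
        h = np.tanh(W @ one_hot(ch) + U @ h)
        states.append(h)
    return states

def encode(word):
    """Bidirectional encoding: concatenate the forward state with the
    backward state (computed over the reversed word) at each position."""
    fwd = rnn_pass(word, Wf, Uf)
    bwd = rnn_pass(word[::-1], Wb, Ub)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

states = encode("feber")
print(states.shape)  # (5, 16): one 2H-dimensional state per character
```

Each position thus sees context from both directions of the word, which is what lets a trained classifier label words on a character basis.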
Keywords: Computer and Information Science
Issue Date: 2016
Publisher: Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers)
Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)
Collection: Master Theses
