Exploring Supervision Levels for Patent Classification

dc.contributor.author: van Hoewijk, Adam
dc.contributor.author: Holmström, Henrik
dc.contributor.department: Chalmers tekniska högskola / Institutionen för matematiska vetenskaper
dc.contributor.examiner: Axelson-Fisk, Marina
dc.contributor.supervisor: Dannélls, Dana
dc.date.accessioned: 2022-06-29T13:50:20Z
dc.date.available: 2022-06-29T13:50:20Z
dc.date.issued: 2022
dc.date.submitted: 2020
dc.description.abstract: Machine learning can help automate monotonous work. However, most approaches use supervised learning, which requires a labeled dataset. The consulting firm Konsert Strategy & IP AB (Konsert) sees great value in automating its task of manually classifying patents into a custom technology tree, but its ever-changing categories mean no pre-labeled dataset is available. Can other forms of supervision let machine learning excel without extensive labeled data? This thesis explores how weakly supervised, semi-supervised, and supervised learning can help Konsert classify patents with minimal hand-labeling. It also examines what effect class granularity has on performance and whether exploiting patents’ unique characteristics can help. Two existing state-of-the-art methods at two supervision levels are employed: LOTClass, a keyword-based weakly supervised approach, and MixText, a semi-supervised approach. We also propose LabelLR, a supervised approach based on patents’ cooperative patent classification (CPC) labels. Each method is tested on all granularity levels of a technology tree provided by Konsert, alongside a combined ensemble of the three methods. MixText receives all unlabeled patent abstracts together with the same ten labeled documents per class that LabelLR receives; LOTClass, on the other hand, receives the unlabeled abstracts along with class keywords. Results reveal that the small training dataset of around 4 200 patents leaves LOTClass struggling while MixText excels. LabelLR outperforms MixText only on the rare occasions when the CPC labels closely match the target classes. The ensemble proves more consistent than LabelLR but only outperforms MixText on some granular classes. In conclusion, a semi-supervised approach appears to offer the best balance of minimal manual work and classification proficiency, reaching an accuracy of 60.7% on 33 classes using only ten labeled patents per class.
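To make the LabelLR idea in the abstract concrete, below is a minimal, hypothetical sketch of a supervised classifier that uses a patent's CPC codes as features. The thesis does not publish its implementation, so the library choice, feature encoding, CPC codes, class names, and hyperparameters here are illustrative assumptions only, not the authors' actual method.

```python
# Hypothetical LabelLR-style baseline: logistic regression over CPC-code features.
# All data below is toy data; real experiments used Konsert's technology tree.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each patent is represented by its space-separated CPC codes,
# and mapped to a (made-up) technology-tree class.
train_cpc = ["G06F16/35 G06N20/00", "H01M10/052 H01M4/38", "G06F16/35"]
train_labels = ["ml_software", "batteries", "ml_software"]

clf = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),  # treat each CPC code as one token
    LogisticRegression(max_iter=1000),
)
clf.fit(train_cpc, train_labels)

# Classify a new patent by its CPC codes.
print(clf.predict(["H01M4/38 H01M10/052"]))
```

This only illustrates why such an approach wins when CPC labels align closely with the target classes and struggles otherwise: the model sees nothing but the CPC codes themselves.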
dc.identifier.coursecode: MVEX03
dc.identifier.uri: https://hdl.handle.net/20.500.12380/304947
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: Patent, Weakly supervised learning, Semi-supervised learning, Supervised learning, BERT
dc.title: Exploring Supervision Levels for Patent Classification
dc.type.degree: Examensarbete för masterexamen (Master's thesis)
dc.type.uppsok: H
local.programme: Data science and AI (MPDSC), MSc
Original bundle: Master's_Thesis_Adam_van_Hoewijk_and_Henrik_Holmström_2022.pdf (2.3 MB, Adobe Portable Document Format)
License bundle: license.txt (1.51 KB, item-specific license agreed upon at submission)