Exploring Supervision Levels for Patent Classification

van Hoewijk, Adam; Holmström, Henrik

Download

Master's_Thesis_Adam_van_Hoewijk_and_Henrik_Holmström_2022.pdf (2.3 MB)

Published

2022

Authors

van Hoewijk, Adam

Holmström, Henrik

Type

Master's thesis

Programme

Data science and AI (MPDSC), MSc

Abstract

Machine learning can help automate monotonous work. However, most approaches use supervised learning, which requires a labeled dataset. The consulting firm Konsert Strategy & IP AB (Konsert) sees great value in automating its task of manually classifying patents into a custom technology tree, but the ever-changing categories mean that no pre-labeled dataset is available. Can other forms of supervision let machine learning excel without extensive data? This thesis explores how weakly supervised, semi-supervised, and supervised learning can help Konsert classify patents with minimal hand-labeling. It also explores how class granularity affects performance and whether exploiting patents’ unique characteristics can help. Two existing state-of-the-art methods at two supervision levels are employed: LOTClass, a keyword-based weakly supervised approach, and MixText, a semi-supervised approach. We also propose LabelLR, a supervised approach based on patents’ Cooperative Patent Classification (CPC) labels. Each method, along with an ensemble of the three, is tested on all granularity levels of a technology tree provided by Konsert. MixText receives all unlabeled patent abstracts together with the same ten labeled documents per class that LabelLR receives. LOTClass, on the other hand, receives the unlabeled abstracts along with class keywords. Results reveal that the small training dataset of around 4,200 patents leaves LOTClass struggling, while MixText excels. LabelLR outperforms MixText on the rare occasions when the CPC labels and the classifications closely match. The ensemble proves more consistent than LabelLR but only outperforms MixText on some granular classes. In conclusion, a semi-supervised approach appears to offer the best balance of minimal manual work and classification proficiency, reaching an accuracy of 60.7% on 33 classes using only ten labeled patents per class.

Subject/keywords

Patent, Weakly supervised learning, Semi-supervised learning, Supervised learning, BERT

URI

https://hdl.handle.net/20.500.12380/304947

Collections

Master's theses
