Exploring Supervision Levels for Patent Classification

Type
Master's thesis
Programme
Data science and AI (MPDSC), MSc
Published
2022
Authors
van Hoewijk, Adam
Holmström, Henrik
Abstract
Machine learning can help automate monotonous work. However, most approaches rely on supervised learning, which requires a labeled dataset. The consulting firm Konsert Strategy & IP AB (Konsert) sees great value in automating its task of manually classifying patents into a custom technology tree, but the ever-changing categories mean that no pre-labeled dataset is available. Can other forms of supervision allow machine learning to excel without extensive labeled data? This thesis explores how weakly supervised, semi-supervised, and supervised learning can help Konsert classify patents with minimal hand-labeling. It also examines how class granularity affects performance and whether exploiting patents’ unique characteristics helps. Two existing state-of-the-art methods, one at each of two supervision levels, are employed: LOTClass, a keyword-based weakly supervised approach, and MixText, a semi-supervised approach. We also propose LabelLR, a supervised approach based on patents’ Cooperative Patent Classification (CPC) labels. Each method, along with an ensemble combining all three, is tested on every granularity level of a technology tree provided by Konsert. MixText receives all unlabeled patent abstracts together with the same ten labeled documents per class that LabelLR receives. LOTClass, on the other hand, receives the unlabeled abstracts along with class keywords. The results reveal that the small training dataset of around 4,200 patents leaves LOTClass struggling, while MixText excels. LabelLR outperforms MixText on the rare occasions when the CPC labels and the classifications closely match. The ensemble proves more consistent than LabelLR but only outperforms MixText on some granular classes. In conclusion, a semi-supervised approach appears to offer the best balance between minimal manual work and classification proficiency, reaching an accuracy of 60.7% on 33 classes using only ten labeled patents per class.
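The data setup described in the abstract (ten labeled documents per class, with all remaining abstracts treated as unlabeled) can be illustrated with a minimal sketch. This is not the thesis code: the function name `split_few_shot`, the fixed random seed, and the toy class names are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def split_few_shot(labels, k=10, seed=0):
    """Pick k labeled examples per class; everything else is unlabeled.

    `labels` holds one class label per document. Returns two sorted
    lists of document indices: (labeled, unlabeled).
    """
    rng = random.Random(seed)

    # Group document indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    labeled = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < k:
            raise ValueError(f"class {label!r} has fewer than {k} documents")
        # Sample k distinct documents from this class.
        labeled.extend(rng.sample(indices, k))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Toy corpus with three hypothetical classes (not Konsert's technology tree).
toy_labels = ["battery"] * 40 + ["sensor"] * 30 + ["coating"] * 30
labeled_idx, unlabeled_idx = split_few_shot(toy_labels, k=10)
```

In this sketch the same `labeled_idx` would be passed to both the semi-supervised and the supervised method, while only the semi-supervised one would also use the documents in `unlabeled_idx`.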
Subject/keywords
Patent, Weakly supervised learning, Semi-supervised learning, Supervised learning, BERT