Multi-Modal Learning for Threat Analysis
Type
Master's thesis
Published
2022
Authors
Andreasson, Kajsa
Dass Raj, Ria
Abstract
In recent years, multi-modality has gained immense interest in computer
vision, where it has proven powerful for letting models learn visual concepts
from raw text instead of from manual annotations. One specific model using this
concept is CLIP [1], which has shown state-of-the-art performance on general
zero-shot image classification tasks. However, few works have explored how
competitive CLIP is in specialized tasks. To fill this gap, this report explores
whether a CLIP model can be successfully adapted to the domain of security
intelligence using threat-associated data collected from social media, while
using the same training task as in the original article. In addition, we explore
how CLIP's Image-Text Alignment abilities can be used for multi-modal event
classification. We present a novel approach to using CLIP's zero-shot
capabilities for event classification, in addition to a traditional, supervised
approach where CLIP is used for feature extraction. Our fine-tuned model and the
pre-trained CLIP model are used side by side in both approaches to compare
performance.
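To make the two uses of CLIP concrete, the sketch below shows zero-shot image-caption matching and frozen feature extraction, assuming the Hugging Face transformers implementation and the public ViT-B/32 checkpoint; the image file and candidate captions are hypothetical stand-ins, not the thesis's actual data or labels.

```python
# Minimal sketch of CLIP zero-shot image-caption matching, assuming the
# Hugging Face `transformers` API and the public ViT-B/32 checkpoint.
# "post_image.jpg" and the caption list are hypothetical examples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("post_image.jpg")   # hypothetical social media image
candidate_captions = [                 # hypothetical event descriptions
    "a photo of a protest",
    "a photo of a traffic accident",
    "a photo of a natural disaster",
]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-caption similarities; a softmax
# turns them into matching probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(candidate_captions, probs[0].tolist())))

# For the supervised approach, the same encoder can act as a frozen
# feature extractor whose image embeddings feed a downstream classifier.
image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
```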
Our results show that CLIP can be successfully fine-tuned on social media data,
improving its zero-shot image-caption matching abilities by 2%. We furthermore
show that our novel approach achieves an AUC score of 22% and the traditional
approach 74%, which leads to the conclusion that using CLIP's innate zero-shot
capabilities for event classification requires far more work to be competitive
with a traditional approach. Finally, we conclude that our fine-tuning does not
affect performance in the event classification setup.
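The "same training task as in the original article" refers to CLIP's symmetric contrastive objective. The following is a minimal PyTorch sketch of that loss under the standard formulation from the CLIP paper; the function name and embedding inputs are illustrative, not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss from the CLIP paper: each image should match
    its own caption among all captions in the batch, and vice versa."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise cosine similarities, scaled by the learned temperature
    # (stored as a log-scale parameter in CLIP, hence the exp()).
    logits = logit_scale.exp() * image_embeds @ text_embeds.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)     # image -> text
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_images + loss_texts) / 2
```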
Subject/keywords
Multimodality, ITA-Models, CLIP, Event Detection, Fine-tuning, Continued Training, Classification