Don’t Judge a Malware by its Binary
Type
Master's Thesis
Abstract
Cyberattacks are projected to cost the global economy more than $10 trillion annually
by 2025, driven in large part by malware that remains difficult to detect,
classify, and contain. Today, most malware classification still relies on manually
engineered binary-level features, an expensive and brittle process. In this work, we
ask whether it is possible to predict how a Windows executable will behave without
first running it in a sandbox. By placing Windows PE samples into a continuous
“behavior space”, we aim to enable finer-grained distinctions than existing malware
family labels provide. EMBER feature vectors were paired with dynamic behavior
reports from the sandbox Recorded Future Triage. A deep metric learning model
with triplet loss (FaceNet-style) was trained to project EMBER vectors into clusters
defined by behavioral similarity. The model could create valuable embedding spaces
for classifying malware by family. Text embeddings for the reports were computed
with both BM25 combined with cosine similarity and a transformer encoder. When
we projected sandbox reports into text-embedding space, both BM25 combined with
cosine similarity and a transformer encoder revealed finer-grained behavioral structure.
In contrast, static EMBER feature vectors showed almost no alignment with
dynamic behavior, indicating that they carry insufficient behavioral features. Rich
behavioral embeddings can be built directly from sandbox reports using transformer
encoders, scaling more efficiently with corpus size than BM25 combined with cosine
similarity.
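
The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy two-dimensional vectors, and the margin of 0.2 are assumptions chosen for illustration. In the setting described above, the anchor and positive would be embeddings of samples with similar sandbox behavior and the negative an embedding of a behaviorally dissimilar sample.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length, as FaceNet-style embeddings are constrained to be.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(u, v):
    # Squared Euclidean distance between two vectors of equal length.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Loss is zero once the negative is farther from the anchor than the
    # positive by at least `margin` (a hypothetical value, not from the thesis).
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(squared_distance(a, p) - squared_distance(a, n) + margin, 0.0)

# Toy embeddings: a well-separated triplet and a violated one.
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
violated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

A well-separated triplet contributes no loss, so training focuses on triplets where a behaviorally dissimilar sample still lies too close to the anchor.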
Subject/keywords
Malware, Metric Learning, Triplet Loss, EMBER, Text Embedding, Sandbox, BM25, Dynamic Malware Analysis.