Don’t Judge a Malware by its Binary

dc.contributor.author: Klynne, August
dc.contributor.author: Åqvist, Malte
dc.contributor.department: Chalmers tekniska högskola / Institutionen för fysik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Physics [en]
dc.contributor.examiner: Granath, Mats
dc.contributor.supervisor: Hansson, Anders
dc.date.accessioned: 2025-06-11T08:04:36Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: Cyberattacks are projected to cost the global economy more than $10 trillion annually by 2025, driven in large part by malware that remains difficult to detect, classify, and contain. Today, most malware classification still relies on manually engineered binary-level features, an expensive and brittle process. In this work, we ask whether it is possible to predict how a Windows executable will behave without first running it in a sandbox. By placing Windows PE samples into a continuous “behavior space”, we aim to enable finer-grained distinctions than existing malware family labels provide. EMBER feature vectors were paired with dynamic behavior reports from the Recorded Future Triage sandbox. A deep metric learning model with triplet loss (FaceNet-style) was trained to project EMBER vectors into clusters defined by behavioral similarity. The model produced embedding spaces that were useful for classifying malware by family. Text embeddings for the reports were computed in two ways: with BM25 combined with cosine similarity, and with a transformer encoder. When the sandbox reports were projected into text-embedding space, both methods revealed finer-grained behavioral structure. In contrast, static EMBER feature vectors showed almost no alignment with dynamic behavior, indicating that they carry insufficient behavioral information. Rich behavioral embeddings can be built directly from sandbox reports using transformer encoders, which scale more efficiently with corpus size than BM25 combined with cosine similarity.
dc.identifier.coursecode: TIFX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/309374
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: Malware, Metric Learning, Triplet Loss, EMBER, Text Embedding, Sandbox, BM25, Dynamic Malware Analysis.
dc.title: Don’t Judge a Malware by its Binary
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc
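
The abstract refers to a FaceNet-style triplet loss. As a reading aid, the sketch below shows the general form of that loss for a single (anchor, positive, negative) triplet of embedding vectors. It is not taken from the thesis: the function names, the L2 normalization step, and the margin value of 0.2 are assumptions made for illustration, and the thesis may define or sample triplets differently.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet of embeddings.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (in squared distance on the unit sphere).
    The margin of 0.2 is an illustrative assumption, not a value from the thesis.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)


# Hypothetical 2-D embeddings: the "positive" shares the anchor's behavior.
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_separated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a setting like the one described, the anchor and positive would be samples with similar sandbox behavior and the negative a sample with dissimilar behavior; a well-separated triplet contributes no loss, while a violated one is penalized in proportion to the violation.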

Download

Original bundle

Name: Klynne_Åqvist.pdf
Size: 7.13 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission