Malicious Traffic Generator for ML-Based Network Anomaly Detection
Type
Master's Thesis
Abstract
High-quality labelled network traffic remains a key bottleneck for research in machine
learning-based intrusion detection. The Canadian Institute for Cybersecurity Intrusion
Detection System 2017 (CICIDS2017) and University of New South Wales Network-Based
2015 (UNSW-NB15) public benchmarks have played an important role in evaluation;
however, these datasets are mostly static and are difficult to reproduce, adapt, or modify
to fit the new requirements posed by containerization and service-oriented architectures.
This thesis addresses this problem by proposing a reproducible framework to construct,
replay, capture, and label malicious traffic from Packet Capture (PCAP) traces inside a
Docker-based testbed.
The framework identifies communicating hosts, protocol edges, DNS names, and Dynamic
Host Configuration Protocol (DHCP) metadata from the input traces. These elements
are mapped into a synthetic multi-zone topology, for which a Docker Compose
configuration is generated automatically. Traffic is then rewritten and replayed from
simulated source containers via a Scapy-based replay engine. A routed gateway serves as
the observation point, the delay-injection point, and the capture point. Metadata about
the replay process is stored as ground truth, the captured traffic is converted to Zeek
connection logs, and flow labels are derived from replay time windows and traffic class
metadata. An additional packet-to-flow mapping step is performed to improve data
traceability.
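The labelling step described above, where flow labels follow from replay time windows and traffic class metadata, can be illustrated with a minimal sketch. It is not the thesis implementation: the `ReplayWindow` and `Flow` types, the `slack` tolerance, the `benign` default, and the sample addresses are all hypothetical. The flow fields only loosely mirror the `ts` and `id.orig_h` columns of a Zeek connection log.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReplayWindow:
    """Hypothetical ground-truth record written by a replay engine."""
    start: float   # epoch seconds when the replay began
    end: float     # epoch seconds when the replay finished
    src_ip: str    # rewritten source address used during replay
    label: str     # traffic class of the replayed trace


@dataclass(frozen=True)
class Flow:
    """Hypothetical subset of a connection-log record."""
    ts: float      # flow start time (epoch seconds)
    src_ip: str    # originator address
    dst_ip: str    # responder address


def label_flow(flow: Flow, windows: Sequence[ReplayWindow],
               slack: float = 1.0, default: str = "benign") -> str:
    """Return the label of the first window matching the flow.

    A flow matches when its originator equals the window's source and its
    start time lies inside the window, widened by `slack` seconds on both
    sides. The default label is an assumption made for this sketch.
    """
    for w in windows:
        if flow.src_ip == w.src_ip and w.start - slack <= flow.ts <= w.end + slack:
            return w.label
    return default


windows = [
    ReplayWindow(100.0, 160.0, "10.0.1.5", "port_scan"),
    ReplayWindow(200.0, 260.0, "10.0.2.7", "botnet_c2"),
]
flows = [
    Flow(120.5, "10.0.1.5", "10.0.3.2"),  # inside the first window
    Flow(180.0, "10.0.1.5", "10.0.3.2"),  # between windows
    Flow(259.9, "10.0.2.7", "10.0.3.9"),  # inside the second window
]
labels = [label_flow(f, windows) for f in flows]
```

Matching on both the source address and the time window keeps unrelated traffic that happens to overlap a replay in time from inheriting its label. A real pipeline would likely also need to handle overlapping windows and clock offsets between the replay containers and the capture point.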
A new intrusion detection model is not a key contribution of this thesis; instead, the
thesis introduces a reproducible pipeline for constructing a malicious traffic dataset.
After early live-execution trials, the project shifted to a replay-based design to improve
reproducibility, containment, and experimental control.
Preliminary machine learning (ML) evaluation using Zeek connection-level features and an
Extreme Gradient Boosting (XGBoost) classifier showed that replay-generated datasets
achieved classification performance close to that of datasets generated directly from the
original PCAP traces. The results suggest that the replay process preserved many of the
flow-level statistical properties relevant for ML-based intrusion-detection tasks.
Subject/keywords
malicious traffic generation, PCAP replay, Docker testbed, Zeek flow labelling, packet-to-flow traceability, intrusion detection, machine learning.
