Flow-Based Detection of Linux Backdoor
Communication

A NetFlow Based ML-Approach to Backdoor Detection in Linux
Environments

Master’s Thesis in Computer Science and Engineering

NAOMI ESPINOSA & LENIA MALKI

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2024



© Naomi Espinosa & Lenia Malki, 2024.

Examiner & Supervisor: Andrei Sabelfeld, Department of Computer Science and
Engineering
Advisor: Martin Forssén, Recorded Future

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024




Abstract
The increasing prevalence of Linux-based systems and their susceptibility to malware
attacks necessitate effective detection mechanisms for backdoor communication.
This thesis explores the application of machine learning (ML) models to detect
backdoor communication in Linux environments using flow-based data. Specifically,
it leverages NetFlow traffic data. The study aims to determine the effectiveness of
ML techniques in identifying malicious patterns associated with backdoor
communication without inspecting the actual payload. Linux systems are underrepresented
in existing benchmark datasets, which predominantly focus on Windows environments.
To address this gap, our research trains models on flow data specific to Linux
malware and environments. After data preprocessing steps including feature mapping,
aggregation, and scaling, and feature selection with methods such as the ANOVA F-test,
models were trained and evaluated on both benign and malicious traffic datasets.
The results indicate that ensemble models such as Random Forest (RF) and Extreme
Gradient Boosting (XGBoost) can effectively distinguish between normal and
anomalous traffic patterns, highlighting the potential of flow-based detection systems
in enhancing network security. The Synthetic Minority Over-sampling Technique
(SMOTE) was applied to address class imbalance, further improving detection
performance, though only in terms of precision. We conclude that flow-based data is a
valuable tool for training models to classify malicious traffic in Linux environments.
Future work will focus on acquiring or creating higher-quality datasets of Linux
malware traffic to improve the capabilities of detection systems.

Keywords: backdoor detection, machine learning, NetFlow, Linux, malware, network
security, anomaly detection, flow-based data, big data



Acknowledgements
We would like to extend our gratitude to our supervisor and examiner from Chalmers,
Andrei Sabelfeld, for his invaluable guidance and support throughout our work.
Additionally, we are grateful to Martin Forssén, our supervisor at Recorded Future,
and the Analytics team at Recorded Future for their assistance in obtaining the
data essential for this project.

Naomi Espinosa, Gothenburg, 2024-06-02
Lenia Malki, Gothenburg, 2024-06-02



Contents

List of Figures

List of Tables

1 Introduction
1.1 Research Aim and Question
1.1.1 Scope
1.2 Structure of Thesis

2 Background
2.1 Network Traffic Patterns Across Operating Systems
2.2 NetFlow
2.2.1 Network Traffic Collection Tools
2.2.2 Triage - Sandbox Environment
2.3 Backdoors and C2 Communication
2.4 Machine Learning
2.4.1 ML Models
2.4.2 Feature Selection Methodologies
2.4.3 Techniques For Imbalanced Datasets
2.4.4 Evaluation Metrics & Validation Techniques
2.4.5 Stratified K-Fold Cross Validation
2.4.6 Limitations of Machine Learning

3 Related Works
3.1 Flow-Based Data for Malware Traffic Detection
3.2 ML Approaches for Network Classification
3.3 Features Used in ML-Based Network Traffic Classification

4 Method
4.1 Model Selection
4.2 Data Collection
4.2.1 Collection of Malicious Data
4.2.2 Collection of Benign Data
4.3 Data Preprocessing
4.4 Feature Selection
4.5 Model Training and Evaluation

5 Results
5.1 Data Statistics
5.1.1 Malicious Data Behaviour
5.1.2 Flow Distribution Amongst Malicious Samples
5.1.3 Mean and Modes of Features in Data
5.1.4 Comparison of Incoming and Outgoing Packet Sizes
5.1.5 Distribution of Protocol in Data
5.1.6 Distribution of Flow Duration in Data
5.1.7 Comparison of Outgoing vs. Incoming Packets
5.2 Model Performances
5.2.1 Average Performance Metrics Per Model
5.2.2 ROC Curves and AUC scores
5.2.3 FNR, FPR, Precision and Recall Curves

6 Discussion
6.1 Analysis of Model Results
6.1.1 Feature Selection
6.2 Identified Backdoor Communication Behaviour
6.3 Quality of Data
6.3.1 Period of Recorded Data
6.3.2 Variation of Benign Data
6.3.3 Variation of Malicious Data
6.3.4 The Need of an ML-Based Detection Model
6.4 Flow-Based Versus Packet-Based Data
6.5 Practical Implications
6.5.1 Vulnerability to Evasion Attacks
6.5.2 Deployment and Cost
6.5.3 Ethical Aspects

7 Conclusion

Bibliography


List of Figures

2.1 NetFlow output sample.
2.2 Segment of a dynamic report generated for a malware sample submitted to Triage.
2.3 Details of Triage’s mapping of dynamic analysis behaviour to known MITRE TTPs.

4.1 Overview of method section.
4.2 Overview of pcap file extraction and conversion.
4.3 Correlation matrix for input variables.

5.1 Number of flows per sample ID distribution.
5.2 Incoming and outgoing packet size in malicious and benign data.
5.3 Overview of protocol distribution in benign and malicious data.
5.4 Distribution of flow-duration within interquartile range (25th to 75th percentiles).
5.5 Number of outgoing vs incoming packets per data point in malicious and benign data.
5.6 Performance metrics of models with SMOTE applied.
5.7 Performance metrics of models without SMOTE applied.
5.8 Interpolated ROC curves of models with SMOTE applied.
5.9 Interpolated ROC curves of models without SMOTE applied.
5.10 Average FNR & FPR by model, with and without SMOTE.
5.11 Average precision & recall by model, with and without SMOTE.


List of Tables

2.1 NetFlow version 5 header format.
2.2 NetFlow version 5 flow record format.
2.3 A subset of the nfdump toolset.
2.4 Confusion matrix for anomaly traffic classification.

3.1 Results from paper [3] detailing the results of backdoor detection using different NetFlow datasets.
3.2 Remaining NetFlow features grouped into 5 main categories as described in the paper [40].
3.3 Summary of network features used for RAT detection of paper A [39] and paper B [41].

4.1 Selection of machine learning algorithms.
4.2 Selection of nfdump fields before statistical feature selection.
4.3 Mapping of flg values to integers of base 2.
4.4 Mapping of protocols to integers.
4.5 F-statistic for each feature and P-values associated with each F-statistic rounded to three decimal points and sorted in descending order of F-statistic.
4.6 Original class distribution.
4.7 Class distribution with SMOTE applied.

5.1 Overview of Linux backdoor samples analysis.
5.2 Summary of techniques identified in Linux backdoor malware samples.
5.3 Mean and mode of numerical features in data.
5.4 Normalised mean and mode of flow durations for malicious and benign data.

6.1 Number of false alarms with 2 980 741 flows daily.
6.2 Time elapsed for preprocessing and predictions for each model.


1 Introduction

In the digital age, the security of information systems is under constant threat
from various types of cyber attacks, with malware being a particularly formidable
threat. A backdoor is a technique used by different types of malware to establish
unauthorised access to a system. This technique may be designed to achieve several
malicious objectives, such as extracting sensitive data, monitoring and spying on the
victim, deleting crucial data, or even introducing additional malicious software [1].
The specific goals vary according to the type of malware that employs the backdoor
technique. Once a backdoor has been deployed, it establishes communication with
a Command and Control (C2) server that can send instructions or simply receive
data. The consequences of such access are significant, as it allows attackers to
conduct a wide array of harmful activities [2]. Detecting and neutralising this type
of malicious technique is crucial for protecting the integrity, confidentiality, and
operational functionality of information systems.

To effectively counter these threats, it is important to recognise patterns in malware
communication, especially through the C2 channels that backdoors typically
use. Understanding these patterns can lead to breakthroughs in detecting malicious
activity using network data.

In the past decade, there has been substantial research interest in leveraging machine
learning (ML) models for detecting and classifying anomalies in malicious network
traffic and malware. Lately, this trend has been extended to include the detection
of backdoors [3]. A common approach to training network detection models is to use
traces of network traffic, in either packet-based or flow-based format. A typical
flow-based format is NetFlow, a Cisco IOS technology that provides statistics on packets
as they traverse a router [4]. Common features recorded by NetFlow include bytes
and packets sent and received, the duration of connections, protocol, and more.
Given the nature of the aggregated data and statistics that NetFlow captures, it
can be strategically leveraged to make informed decisions about the data traversing
our networks, such as pinpointing instances of malicious traffic, including the C2
communication of backdoors.

Research has demonstrated that the behaviour of malware, such as Trojans, varies
between different operating systems [5]. This highlights the ability of malware to
exhibit distinct behaviours depending on the operating environment. To broaden the
research landscape in this area, unlike many existing malware detection frameworks,
we have chosen to focus on Linux-based malware. The intention of this is to balance
the research on this topic with respect to the choice of operating system for malware
analysis.

Using NetFlow data to detect backdoors in Linux-based malware is a gap in the
research domain. It presents an opportunity for further exploration and development
to address the specific challenges associated with identifying active backdoors on
Linux systems using NetFlow logs.

Studying these two areas is crucial due to the widespread utilisation of NetFlow
and the prevalence of Linux-based systems, particularly within corporate
environments [6][7]. NetFlow, being extensively employed for network monitoring, presents
a valuable avenue for enhancing backdoor detection capabilities. Although there
are alternative logging data sources, it is important to assess the potential of NetFlow
for detecting backdoor network communication in isolation before exploring hybrid
approaches. Furthermore, given the significant presence of Linux machines on the
internet, predominantly within corporate infrastructures, investigating the specific
challenges associated with identifying active backdoors on Linux systems becomes
imperative for comprehensive cybersecurity measures.

1.1 Research Aim and Question

This study explores the effectiveness of employing flow-based data for the detection
of backdoor communication through machine learning methodologies. The research
concentrates specifically on NetFlow traffic data gathered exclusively within Linux
environments. In essence, this study seeks to answer the question:
How effective is machine learning in identifying Linux backdoor communication
patterns from NetFlow traffic data?

1.1.1 Scope
The limitations of our research are as follows. Only IPv4 traffic is used as a
basis for model training and evaluation. Furthermore, only ingress network traffic
is considered. Lastly, our research is confined to examining backdoor communication
in Linux-based malware. Hence, all the datasets used in this research are derived
from systems running Linux-based kernels.

1.2 Structure of Thesis

This thesis is structured into seven chapters, each designed to incrementally build
upon the last, culminating in a discussion of the data quality, challenges, and
practical implications of the detection of Linux backdoors using machine learning and
NetFlow data. The layout is as follows:


Chapter 1: Introduction

The opening chapter provides an overview of the cyber threat landscape in corporate
environments, focusing specifically on the backdoor technique employed by various
malware families. It defines the problem of detecting such communications within
network traffic and outlines the research objectives and limitations.

Chapter 2: Background

This chapter explores the mechanics of backdoor network communication, the role
of NetFlow data in network security monitoring, and the application of machine
learning within this domain. It critically reviews recent advancements and identifies
gaps in current research, thereby justifying the necessity of this study.

Chapter 3: Related Works

This chapter reviews prior research relevant to our thesis, highlighting studies that
may contribute to addressing the problem at hand. It also provides the foundational
context upon which our methodology is built.

Chapter 4: Method

Here we describe the preprocessing steps required to prepare the data for machine
learning analysis and explain the rationale behind the selection of specific machine
learning models. This chapter also describes the metrics used to evaluate the models’
performance. All steps are detailed in full, and any software dependencies are listed.

Chapter 5: Results

Here, the results of the applied models are presented. We examine the overall data,
then dive deeper into the statistics of each subset and interpret them in the context
of backdoor communication to understand what may have produced our results.

Chapter 6: Discussion

This chapter discusses the thesis results in relation to the research objectives, dives
into the challenges and successes encountered during the implementation of the
framework, and examines how these factors influenced the outcomes.

Chapter 7: Conclusion

Concluding the thesis, this chapter draws out the most important findings of our
project and highlights the conclusions reached in answering the research question.
Lastly, it details suggestions for improvement and future work within the research
area of backdoor detection.



2 Background

This chapter provides the background essential for understanding the complexities
involved in detecting malware C2 communication across different network
environments. Initially, we explore the distinct behaviours of network traffic across various
operating systems (OSs) to motivate the importance of Linux-based research in this
domain. This sets the stage for an in-depth discussion of NetFlow, a critical tool
in network monitoring, detailing its capabilities to capture and represent network
traffic data effectively. Furthermore, we introduce ML and its capabilities in the
domain of network anomaly detection. By examining the methodologies and
justifications for employing ML techniques, this chapter lays the foundational knowledge
necessary to understand the methodology employed in subsequent chapters.

2.1 Network Traffic Patterns Across Operating Systems

Research suggests that network traffic exhibits different patterns depending on the
OS running on the host. The findings of Barath, in the article "Network behaviour
analysis of selected operating systems" [8], support the claim that different OSs
manifest unique network communication patterns. The study used a controlled test
environment to monitor the network behaviour of systems running Windows, macOS,
and Linux, using the NetFlow protocol to analyse the network traffic data. It
demonstrated that these systems have distinct communication profiles, with specific
contacted addresses and protocols, which can be passively monitored to identify the
OS type [8].

Building on the assertion that OSs exhibit distinct network traffic patterns, several
studies have demonstrated the feasibility of passively identifying an OS based on
its network communication. In passive OS fingerprinting, "passive" means identifying
an OS by observing traffic without interacting with the host. In the work "A
Machine-Learning-Based Tool for Passive OS Fingerprinting with TCP Variant as
a Novel Feature" [9], Hagos proposes and evaluates an advanced classification
approach to passive OS fingerprinting using classical ML and deep learning techniques.
Their work examines network traffic at a more granular level, inspecting the packet
headers of TCP traffic. Nevertheless, it emphasises the importance of OS-specific
characteristics in network behaviour, such as the underlying TCP variant. In contrast
to the aforementioned study "Network behaviour analysis of selected operating
systems", Hagos used a combination of system-generated and user-generated network
traffic in a successful attempt to passively identify the OS. This means that the OS
can be deduced not only from system-generated network traffic alone, but also from
a mix of system-generated and user-generated traffic.

Similarly, in the study "Operating System Fingerprinting via Automated Network
Traffic Analysis" [10], the authors introduce another method for OS fingerprinting
using genetic algorithms and ML. Unlike Hagos, this method looks at header
information from different protocols in the TCP/IP model, rather than TCP alone.
Thus, their approach is capable of classifying any packet using TCP/IP
headers. These findings reinforce the notion that it is feasible to identify OS-specific
network behaviours through passive analysis of network traffic using packet header
data.

Research demonstrates that passive network data contains distinct patterns specific
to each OS. Additionally, ML methods can effectively detect these patterns. This
underscores the importance of recognising OS-specific differences in network traffic
patterns. Moreover, this would suggest that benign and malicious traffic from
hosts with different OSs would also show distinctions, highlighting the necessity of
accounting for OS-specific nuances when training ML models for classifying network
traffic.

2.2 NetFlow
Network traffic monitoring plays an important role in modern cybersecurity efforts,
enabling the identification of potential threats and the optimisation of network
performance. Two common formats for acquiring and representing network traffic are
packet-based and flow-based formats. In a packet-based format, network traffic is
captured and analysed at the level of individual packets. Detailed information on
individual packets is thus available, including payload data. Flow-based formats
provide much less granularity: packets that share a common set of attributes, such
as source and destination IP addresses, port numbers, and protocol, are aggregated
into a single flow record. Payload data is not included in flow-based data [11].
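
The aggregation step described above can be illustrated with a minimal Python sketch. It is not the method used by any particular exporter: the `Packet` tuple, its field names, and the sample values are hypothetical, and the flow key is simplified to the 5-tuple of addresses, ports, and protocol. Real exporters typically also apply timeouts and may include further key fields.

```python
from collections import namedtuple

# Hypothetical packet representation; field names are illustrative only.
Packet = namedtuple("Packet", "ts src dst sport dport proto size tcp_flags")


def aggregate_flows(packets):
    """Group packets that share a 5-tuple into unidirectional flow summaries."""
    flows = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flow = flows.get(key)
        if flow is None:
            # First packet of a new flow: initialise the counters.
            flows[key] = {"first": p.ts, "last": p.ts, "packets": 1,
                          "bytes": p.size, "tcp_flags": p.tcp_flags}
        else:
            flow["first"] = min(flow["first"], p.ts)
            flow["last"] = max(flow["last"], p.ts)
            flow["packets"] += 1
            flow["bytes"] += p.size
            flow["tcp_flags"] |= p.tcp_flags  # cumulative OR of TCP flags
    return flows


# Synthetic packets (documentation address ranges), no payload needed.
packets = [
    Packet(0.00, "10.0.0.5", "203.0.113.7", 40000, 443, 6, 60, 0x02),   # SYN
    Packet(0.05, "10.0.0.5", "203.0.113.7", 40000, 443, 6, 52, 0x10),   # ACK
    Packet(0.30, "10.0.0.5", "203.0.113.7", 40000, 443, 6, 517, 0x18),  # PSH+ACK
    Packet(0.10, "10.0.0.5", "198.51.100.1", 53000, 53, 17, 74, 0),     # UDP
]
flows = aggregate_flows(packets)
```

Only header-level attributes and sizes enter the summary, which is why payload content is unavailable once traffic has been reduced to flows.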

Analysing packet payloads is often impractical because of the high CPU demand
of real-time analysis. Flow monitoring therefore serves as a practical alternative
for collecting network traffic for security purposes. Cisco’s NetFlow protocol
is an example of a flow-monitoring technology. It enables the collection
of network traffic statistics, such as the volume of data transmitted, the frequency
of communication between nodes, the types of protocols used, and other relevant
network parameters, which can be used to detect network anomalies [4].
Consequently, NetFlow data can be used to assess the feasibility of identifying remote
backdoor communication. The versatility of NetFlow also extends beyond
security applications to network performance optimisation and capacity planning.
By providing insights into traffic patterns and resource utilisation,
NetFlow facilitates proactive network management, enabling organisations to
allocate resources efficiently and maintain optimal network performance.

Cisco has released several NetFlow versions, each with its own set of features and
improvements. Among these versions is NetFlow v5, which is the most commonly
used [12]. Each version also comes with its own datagram format [13]. NetFlow
v5 supports neither IPv6 nor templates for flexible flow record definition. Instead,
it uses a fixed header and record format with predefined fields, as opposed to the
template-based NetFlow v9. Detailed insights into the composition of NetFlow
v5 datagrams are presented in Table 2.1 and Table 2.2.

Table 2.1: NetFlow version 5 header format.
Bytes  Contents           Description
0-1    version            NetFlow export format version number
2-3    count              Number of flows exported in this packet (1-30)
4-7    SysUptime          Current time in milliseconds since the export device booted
8-11   unix_secs          Current count of seconds since 0000 UTC 1970
12-15  unix_nsecs         Residual nanoseconds since 0000 UTC 1970
16-19  flow_sequence      Sequence counter of total flows seen
20     engine_type        Type of flow-switching engine
21     engine_id          Slot number of the flow-switching engine
22-23  sampling_interval  First two bits hold the sampling mode; remaining 14 bits hold the value of the sampling interval

Table 2.2: NetFlow version 5 flow record format.
Bytes  Contents   Description
0-3    srcaddr    Source IP address
4-7    dstaddr    Destination IP address
8-11   nexthop    IP address of next hop router
12-13  input      SNMP index of input interface
14-15  output     SNMP index of output interface
16-19  dPkts      Packets in the flow
20-23  dOctets    Total number of Layer 3 bytes in the packets of the flow
24-27  First      SysUptime at start of flow
28-31  Last       SysUptime at the time the last packet of the flow was received
32-33  srcport    TCP/UDP source port number or equivalent
34-35  dstport    TCP/UDP destination port number or equivalent
36     pad1       Unused (zero) bytes
37     tcp_flags  Cumulative OR of TCP flags
38     prot       IP protocol type (e.g., TCP = 6; UDP = 17)
39     tos        IP type of service (ToS)
40-41  src_as     Autonomous system number of the source, either origin or peer
42-43  dst_as     Autonomous system number of the destination, either origin or peer
44     src_mask   Source address prefix mask bits
45     dst_mask   Destination address prefix mask bits
46-47  pad2       Unused (zero) bytes
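
To make the byte layouts in Table 2.1 and Table 2.2 concrete, the following Python sketch decodes a NetFlow v5 export datagram with the standard `struct` module. It is an illustrative sketch only and not the tooling used in this thesis: the function name `parse_netflow_v5`, the output dictionary keys, and the synthetic datagram are assumptions, and the derived `duration_ms` value assumes that the SysUptime counter did not wrap during the flow.

```python
import socket
import struct

# Layouts follow Tables 2.1 and 2.2; "!" selects network byte order, no padding.
HEADER = struct.Struct("!HHIIIIBBH")                # 24 bytes
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")  # 48 bytes


def parse_netflow_v5(datagram):
    """Return (header, records) parsed from one NetFlow v5 export datagram."""
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = HEADER.unpack_from(datagram, 0)
    if version != 5:
        raise ValueError(f"expected NetFlow v5, got version {version}")
    header = {
        "count": count,
        "sys_uptime_ms": sys_uptime,
        "unix_secs": unix_secs,
        "flow_sequence": flow_sequence,
        "sampling_mode": sampling >> 14,         # first two bits
        "sampling_interval": sampling & 0x3FFF,  # remaining 14 bits
    }
    records = []
    for i in range(count):
        (src, dst, nexthop, in_if, out_if, d_pkts, d_octets, first, last,
         srcport, dstport, _pad1, tcp_flags, prot, tos,
         src_as, dst_as, src_mask, dst_mask, _pad2) = RECORD.unpack_from(
            datagram, HEADER.size + i * RECORD.size)
        records.append({
            "srcaddr": socket.inet_ntoa(src),
            "dstaddr": socket.inet_ntoa(dst),
            "srcport": srcport,
            "dstport": dstport,
            "prot": prot,
            "tcp_flags": tcp_flags,
            "dPkts": d_pkts,
            "dOctets": d_octets,
            # Derived value: Last - First, both SysUptime in milliseconds.
            "duration_ms": last - first,
        })
    return header, records


# Synthetic datagram built purely for illustration (not real traffic).
example = HEADER.pack(5, 1, 120000, 1717286400, 0, 1, 0, 0, 0) + RECORD.pack(
    socket.inet_aton("10.0.0.5"), socket.inet_aton("203.0.113.7"),
    socket.inet_aton("0.0.0.0"), 1, 2, 3, 629, 100000, 100300,
    40000, 443, 0, 0x1A, 6, 0, 0, 0, 24, 24, 0)
header, records = parse_netflow_v5(example)
```

Because the v5 layout is fixed, a single format string per structure is sufficient; a template-based version such as v9 would instead require reading the template before any record can be decoded.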


2.2.1 Network Traffic Collection Tools
Various software tools exist to capture, process, and analyse network data. One
example is nfdump, a toolset designed for collecting and processing flow data,
including but not restricted to NetFlow, from compatible devices [14]. It comprises
several collectors, including nfcapd for NetFlow versions 1, 5, 7, and 9 as well as
nfpcapd for converting PCAP data into NetFlow format. Collected data is stored in
files for subsequent processing, where nfdump offers extensive capabilities for flow
filtering, aggregation, and analysis. Table 2.3 shows an overview of the subset of the
toolset relevant to this project. Figure 2.1 shows sample output obtained by reading
an nfcapd file with nfdump.

Figure 2.1: NetFlow output sample.

The nfpcapd tool bridges the gap between packet-level and flow-level data by
converting raw packet data into flow records. While packet-level data offers granular
insights into individual network interactions, flow-level data provides aggregated
information about communication patterns between endpoints. This conversion is
particularly valuable in scenarios where detailed packet-level analysis is impractical
due to volume or complexity, yet extensive flow-level insights are necessary for
effective network monitoring, troubleshooting, and security analysis.

Tool     Description
nfdump   NetFlow display and analysing program
nfcapd   NetFlow collector daemon
nfpcapd  PCAP to NetFlow collector daemon

Table 2.3: A subset of the nfdump toolset.

2.2.2 Triage - Sandbox Environment
Malware samples can be analysed through static and dynamic analysis. Static
analysis involves examining the malware’s code, structure, and properties without
executing it. Dynamic analysis involves executing the malware within a controlled
environment, such as a virtual machine or sandbox, and observing its behaviour in
real time.

Recorded Future Triage’s Malware Analysis Sandbox is a tool designed to analyse
and dissect potentially malicious software samples in a controlled environment [15].
This sandbox is free and publicly available. Operating within a secure and isolated
sandbox environment, it executes suspicious files or URLs and monitors their
behaviour, allowing security analysts to observe and understand their actions without
risking harm to their network or systems.

Each malware sample is assigned unique identifiers: sample_id and task_id. If
the same malware sample undergoes analysis on multiple OSs, a distinct tuple of
sample_id, task_id is created for each analysis instance. It is possible to retrieve
a generated PCAP from any analysis instance. These PCAP files are created when
the sandbox environment detects network connection attempts by malware samples
during dynamic analysis. Additionally, each individual instance undergoing analysis
will be assigned a score ranging from 0 to 10. Instances scoring between 2 and
5 are labelled as ‘Likely benign’, those scoring between 6 and 7 are labelled as
‘Exhibiting suspicious behaviour’, and those with scores between 8 and 9 are labelled
as ‘Likely malicious’. Instances receiving a score of 10 are categorised as ‘Known to
be malicious’. This may happen if a malware family is detected, as in Figure 2.2,
which displays a segment of the dynamic report generated for a malware sample
submitted to Triage. The malware family in this example is "BPFDOOR".

Figure 2.2: Segment of a dynamic report generated for a malware sample submitted
to Triage.
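
The score bands described in this section can be written as a small lookup function. This is a sketch of the bands as stated in the text, not Triage's own implementation; the function name `triage_label` is hypothetical, and because the text does not describe scores 0 and 1, the sketch returns `None` for them.

```python
def triage_label(score):
    """Map a Triage score (0-10) to the label bands described in the text."""
    if not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer between 0 and 10")
    if score == 10:
        return "Known to be malicious"
    if score >= 8:
        return "Likely malicious"
    if score >= 6:
        return "Exhibiting suspicious behaviour"
    if score >= 2:
        return "Likely benign"
    return None  # scores 0 and 1 are not described in the text


labels = {score: triage_label(score) for score in range(11)}
```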

To attribute a malware family or a broader malware category label to a sample ID,
Triage employs several methods. All of Triage’s labelling is done internally to ensure
that those who submit samples cannot label them arbitrarily, preventing data
poisoning through mislabelling. Triage uses signatures to match the actions or payloads
of samples with known malware families. YARA, a tool for identifying and
classifying malware based on textual or binary patterns, is often used for this purpose [16].
Malware payloads are identified via YARA rules crafted for specific samples. File
hashes are also compared against third-party lists of known malware hashes. If a
sample cannot be attributed to a family, Triage maps behaviours identified in
dynamic analyses to specific Tactics, Techniques and Procedures (TTPs) in MITRE’s
ATT&CK Matrix [17]. Figure 2.3 depicts a sample labelled with the backdoor tag,
showing the specific MITRE techniques linked to this tag and the processes run by
the sample that exhibited this behaviour. The idea is that TTPs linked to a certain
category of malware result in that malware category being attributed to the sample.

Figure 2.3: Details of Triage’s mapping of dynamic analysis behaviour to known
MITRE TTPs.
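
Of the labelling steps described in this section, the comparison of file hashes against lists of known malware hashes is the simplest to illustrate. The sketch below is a generic example and does not reflect how Triage implements the check: the choice of SHA-256, the function names, and the placeholder hash list are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(sample: bytes, known_hashes: set) -> bool:
    """Check a sample's digest against a set of known malware hashes."""
    return sha256_hex(sample) in known_hashes


# Hypothetical hash list built from placeholder bytes, not real malware.
known_hashes = {sha256_hex(b"placeholder-malware-bytes")}
match = is_known_malware(b"placeholder-malware-bytes", known_hashes)
no_match = is_known_malware(b"some-other-file", known_hashes)
```

An exact hash match only recognises byte-identical files, which is consistent with the text describing signature matching and behaviour-based TTP mapping as additional attribution methods.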

Due to the project’s time constraints, the malicious data’s source was derived from
Triage via the available PCAP files. An alternative methodology would involve
creating a system of virtual environments, deploying a network traffic collection tool
like nfcapd between designated endpoints, and executing malware samples within
these environments to collect the transmitted data. However, this approach operates
under the assumption that the C2 server, with which the malware communicates,
is operational, a circumstance that is not always guaranteed. Consequently, in the
absence of a functioning C2 server, the data captured from such a setup would
largely consist of Address Resolution Protocol (ARP) requests. More on how the
malicious data was obtained is outlined in Section 4.2.1.

2.3 Backdoors and C2 Communication
A backdoor is a technique that malware can use to establish C2 communication or
to open a covert channel for data exfiltration.
The MITRE ATT&CK framework, a database of TTPs derived
from real-world observations of malware and threat actors, details 18 command
and control techniques and sub-techniques that adversaries may use to communi-
cate with compromised systems [18]. Additionally, MITRE outlines 9 methods for
data exfiltration, noting that those methods do not always utilise the standard C2
communication channels. Instead, malware may open separate covert channels or
backdoors specifically for this purpose [19]. Since both C2 channels and potential
data exfiltration channels constitute backdoors, the techniques detailed in MITRE’s
chapters on exfiltration and C2 communication tactics are crucial for understanding
what backdoors look like in the real world. Also notable is that a significant portion
of the techniques and sub-techniques in this chapter of the MITRE ATT&CK frame-
work utilises application layer protocols such as HTTPS, DNS, mail, and file transfer
protocols. In many cases, HTTPS or other encrypted channels are used for data ex-
filtration, and in other instances, the payload may be obfuscated, making detection
techniques that attempt to examine the contents of the payload challenging.

In the context of detecting backdoor communication, certain features extracted from
network traffic can be particularly revealing. These features include, but are not
limited to, unusual outbound traffic volumes, atypical protocol usage, and connections
to suspicious IP addresses or domains. Anomalies in traffic timing, such as signifi-
cant activity occurring during off-peak hours, and the presence of unrecognised or
rarely used ports, can also indicate malicious activity [20].

The consistency of data packet sizes and intervals between communications might
also provide clues about automated or scripted data transfers typical of backdoor
exploits. NetFlow is particularly well-suited to providing these features. It captures
data about ingress and egress points, traffic volume, and timing, making it an in-
valuable source of data for developing ML models aimed at detecting these stealthy
communications. By leveraging NetFlow data, researchers can create more effective
detection systems that help identify backdoor communications by analysing network
flow characteristics without needing to inspect the actual payload. ML models could
be used to analyse the duration and frequency of network sessions to detect patterns
that deviate from the norm, which are often indicative of C2 communications or data
exfiltration attempts.
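
As a rough illustration of the timing features discussed above, the following Python sketch computes the coefficient of variation of the gaps between successive flow start times for a single source and destination pair. The function name and the example timestamps are hypothetical and are not taken from this thesis; a value close to zero only suggests regular, possibly automated, communication and is not proof of a backdoor.

```python
from statistics import mean, pstdev

def interval_regularity(start_times):
    """Coefficient of variation of gaps between consecutive flow start times.

    start_times: flow start timestamps in seconds (hypothetical input).
    Returns None when fewer than three flows are available.
    """
    times = sorted(start_times)
    if len(times) < 3:
        return None
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return None
    # Standard deviation relative to the mean: 0 means perfectly periodic.
    return pstdev(gaps) / average_gap

beacon_like = [0, 60, 120, 180, 240]   # one flow every 60 seconds
bursty = [0, 2, 3, 50, 400]            # irregular, human-like activity
```

Under these assumptions, the periodic series yields a value of zero while the irregular series yields a value above one, so the statistic could serve as one candidate flow-level feature among many.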

2.4 Machine Learning
The subsequent sections provide an overview of ML approaches and a selection of
ML algorithms relevant to our project.

ML models can be broadly categorised into three main groups: supervised, unsu-
pervised, and semi-supervised learning. Supervised learning entails training models
on labelled data, unsupervised learning uncovers patterns in unlabelled data, and
semi-supervised learning combines aspects of both approaches.

Detecting anomalies or classifying instances into multiple categories can be done us-
ing ML. An anomaly is a pattern deviating from the norm. Both anomaly detection
and binary classification can employ supervised or unsupervised learning, depending
on data availability and problem requirements [21]. Supervised anomaly detection
can be construed as a form of binary classification task, wherein the objective is to
categorise instances into one of two classes: normal or anomalous. In this approach,
the algorithm is trained on labelled data, explicitly identifying anomalies as the positive
class (class 1) and normal instances as the negative class (class 0). As there is
a vast number of different ML algorithms to choose from, selecting the appropriate
one for a given task is important.

2.4.1 ML Models
Factors such as the volume and nature of the data, including considerations of
data imbalance and the potential presence of outliers, require careful consideration
when selecting ML algorithms for a specific task. Certain ML algorithms exhibit
stronger robustness to outliers or possess superior scalability for handling expansive
datasets. Among the array of algorithms available, several have demonstrated
distinct advantages in this context, as presented in Section 3.2 (Related Works).
The rest of this subsection gives some background information on these models.

Adaptive Boosting (ADABoost): ADABoost works by iteratively training a
sequence of weak learners, typically decision trees with a single split, each focusing
on the mistakes of the previous one. It assigns higher weights to the instances that
were misclassified in the previous round, thus forcing subsequent weak learners to
concentrate more on those instances. The final prediction is made by combining
the weak learners’ predictions through a weighted sum, where the weights are de-
termined by their individual performance during training. It’s commonly used for
classification tasks, particularly when dealing with binary classification problems.
However, it can also be adapted for regression tasks [22].
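
The re-weighting step described above can be sketched in a few lines of Python. This is a minimal illustration of one round of the classic discrete AdaBoost update with labels in {-1, +1}; the function name and the toy predictions are assumptions made for the example and do not reflect the implementation used in this thesis.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One re-weighting round of discrete AdaBoost; labels are -1 or +1."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 1:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # Learner weight: larger when the weighted error is small.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified instances (t * p == -1) gain weight, the rest lose weight.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the last instance is misclassified
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
```

After this round the single misclassified instance carries half of the total weight, which is what forces the next weak learner to concentrate on it.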

Decision Trees (DT): Decision trees partition the feature space into regions, mak-
ing predictions based on simple rules inferred from the training data. At each step,
the algorithm selects the feature that best splits the data into homogeneous subsets
with respect to the target variable. This process continues recursively until a stop-
ping criterion is met, such as reaching a maximum depth or no further improvement
in purity. Decision trees are flexible and can be used for both classification and re-
gression tasks. They are particularly useful when the relationships between features
and the target variable are non-linear or complex [23].
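
To make the notion of a split that produces homogeneous subsets concrete, the sketch below searches for the threshold on a single numeric feature that minimises the weighted Gini impurity. Gini impurity is one common purity measure and is assumed here for illustration; the feature name and the values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, gini(labels))
    for thr in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical feature: bytes per flow; label 1 = malicious, 0 = benign.
bytes_per_flow = [120, 150, 180, 900, 950, 1000]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(bytes_per_flow, labels)
```

A full decision tree would apply this search to every feature and recurse into the two resulting subsets until a stopping criterion is met.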

Extreme Gradient Boosting (XGBoost): XGBoost is an advanced implemen-
tation of gradient boosting, designed for speed and performance. Like ADABoost,
it builds an ensemble of weak learners sequentially, but it differs in its optimisation
objective and regularisation techniques. XGBoost employs gradient descent optimi-
sation to minimise a differentiable loss function, using second-order derivatives for
more accurate updates. It also incorporates regularisation terms to prevent over-
fitting, making it particularly effective for large-scale datasets. It’s widely used for
classification and regression tasks, especially when dealing with structured/tabular
data. XGBoost is known for its high performance and is often used in competitions
and production environments where speed and accuracy are crucial [22].
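
For reference, the role of the second-order derivatives can be written out as follows. This is the standard second-order approximation of the boosting objective at round t with constant terms dropped; the symbols (f_t for the tree added at round t, g_i and h_i for the first and second derivatives of the loss l, T for the number of leaves, w_j for the leaf weights, and the regularisation parameters gamma and lambda) are notation introduced here and are not used elsewhere in this thesis.

```latex
\[
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\]
\[
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}
\]
```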

Gaussian Naive Bayes (GNB): This is a probabilistic classifier based on Bayes’
theorem, particularly effective for classification problems with continuous features.
It assumes that the features follow a normal (Gaussian) distribution and are con-
ditionally independent given the class label. Key concepts involve calculating the
posterior probability of each class by combining the prior probability of the class and
the likelihood of the data given the class, derived from the Gaussian distribution.
GNB is especially useful for real-time prediction due to its simplicity and efficiency,
handling large datasets well even with high-dimensional data [23].
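
The posterior calculation described above can be sketched in plain Python. This is a minimal illustration under the stated Gaussian and conditional independence assumptions, not the library implementation; the function names, the variance floor of 1e-9 and the two toy features are assumptions made for the example.

```python
import math

def fit_gnb(X, y):
    """Estimate the prior, feature means and feature variances per class."""
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # The small constant avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[cls] = (n / len(y), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(prior, means, variances):
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
        return math.log(prior) + log_likelihood
    return max(model, key=lambda cls: log_posterior(*model[cls]))

# Hypothetical features: (flow duration in seconds, bytes); 1 = malicious.
X = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0],
     [30.0, 5000.0], [28.0, 5200.0], [33.0, 4800.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
```

Working with log probabilities, as done here, avoids numerical underflow when many per-feature likelihoods are multiplied together.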

K-Nearest Neighbours (KNN): KNN works by identifying the ’k’ closest data
points to a given input, based on a distance metric like Euclidean distance, and
then predicting the output based on the majority label (for classification) or average
value (for regression) of these neighbours. KNN is particularly effective for problems
where the data distribution is unknown, and it requires little to no training [23].
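
A minimal sketch of KNN classification, assuming Euclidean distance and a simple majority vote, is shown below. The function name and the two-dimensional toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["benign"] * 3 + ["malicious"] * 3
```

Because all work happens at prediction time, each call scans the whole training set, which is why KNN needs little training but can be slow on large datasets.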

Random Forest (RF): Random Forest is an ensemble method that constructs
multiple decision trees during training and combines their predictions to improve
generalisation and robustness. Each tree is built from a bootstrap sample of the
training data, and at each split, a random subset of features is considered. This
randomness helps decorrelate the individual trees, reducing the risk of overfitting.
The final prediction is then determined by aggregating the predictions of all the
trees, often through a simple majority or averaging scheme. RFs are commonly
used for classification and regression tasks [23].

2.4.2 Feature Selection Methodologies

Feature selection is a critical step in the process of network traffic classification, as
it helps in identifying the most relevant attributes that contribute to distinguishing
different types of network traffic. Furthermore, feature selection yields a smaller
dataset, which can reduce the computational cost of
training ML models.

When dealing with numerical input variables and categorical target variables, a
suitable feature selection technique is the Analysis of Variance (ANOVA) F-test
[24]. It is a type of filter method, where features are evaluated independently of
the ML model. ANOVA F-test is a statistical method used to compare the means
of two or more groups to determine if there are statistically significant differences
among them. In the context of feature selection for ML models, ANOVA F-test
helps identify the features that are most relevant for predicting the target variable.

The SelectKBest function in the scikit-learn library is a handy tool for applying filter
methods in Python [25]. It operates by computing the F-statistic and associated
p-values for each feature. The p-value is the probability of observing an F-statistic
at least as large as the one computed, under the assumption that there is no real
relationship between the feature and the target variable. In simpler terms, a low p-value implies
that the observed relationship is unlikely to be due to random fluctuations, whilst a high
p-value suggests that the observed relationship could reasonably occur by chance.
Features with extremely low p-values (close to 0) and high F-statistics are considered
highly significant.
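
To show what the F-statistic measures, the sketch below computes the one-way ANOVA F-value of each feature in plain Python and keeps the k highest-scoring ones. It mirrors the idea behind SelectKBest with an ANOVA scoring function but is not the scikit-learn implementation; the function names, feature names and values are hypothetical, p-values are omitted because they require the F-distribution, and every feature is assumed to have non-zero within-class variation.

```python
def anova_f(feature, labels):
    """One-way ANOVA F-statistic of one numeric feature across class labels."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(feature), len(groups)
    grand_mean = sum(feature) / n
    # Variation between the class means, and variation within each class.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_k_best(columns, labels, k):
    """Names of the k features with the highest F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

labels = [0, 0, 0, 1, 1, 1]
columns = {
    "bytes": [100, 110, 90, 900, 950, 1000],
    "ttl": [64, 63, 64, 64, 63, 64],
    "duration": [1.0, 2.0, 1.5, 4.0, 3.0, 5.0],
}
top_two = select_k_best(columns, labels, 2)
```

In this toy data the ttl column has identical class means, so its F-statistic is close to zero and it is the feature that gets dropped.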

Another method for feature selection is Correlation-based Feature Selection (CFS).
It is used to identify and retain the most relevant features for predicting a target
variable by analysing the correlation between features. The CFS algorithm begins
by evaluating the correlation between each feature and the target variable, followed
by assessing the correlation among feature pairs. It then identifies a subset of
features that exhibit a strong correlation with the target variable while maintaining
minimal correlation with one another. This is useful because some ML
algorithms, such as those based on Naive Bayes, assume that the features are not
correlated with each other [26].
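
The trade-off that CFS makes can be illustrated with the commonly used merit score, which rewards feature-target correlation and penalises feature-feature correlation. The sketch below assumes the absolute Pearson correlation as the correlation measure, which is a simplification chosen for this example; the function names and the toy columns are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def cfs_merit(subset, columns, target):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for a feature subset."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

target = [0, 0, 0, 1, 1, 1]
columns = {
    "bytes": [100, 110, 90, 900, 950, 1000],
    "packets": [10, 11, 9, 90, 95, 100],   # fully redundant with bytes
}
single = cfs_merit(["bytes"], columns, target)
redundant_pair = cfs_merit(["bytes", "packets"], columns, target)
```

Here the second feature is perfectly correlated with the first, so adding it leaves the merit unchanged: a redundant feature contributes nothing to the subset's score.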

2.4.3 Techniques For Imbalanced Datasets
When dealing with imbalanced datasets where one class is significantly more preva-
lent than the others, traditional ML algorithms may struggle to accurately predict
the minority class. To address this issue, several techniques can be employed:

Over-sampling: Over-sampling involves increasing the number of instances in the
minority class by randomly duplicating existing instances or generating synthetic
samples. This technique helps balance the class distribution and provides more
information for the model to learn from. However, it may also lead to overfitting if
not carefully implemented.

Under-sampling: Under-sampling aims to reduce the number of instances in the
majority class to match the minority class. This can be done by randomly removing
instances from the majority class or selecting a subset of instances using various
criteria. While under-sampling can help balance the dataset, it may also discard
valuable information and lead to loss of predictive power.

SMOTE (Synthetic Minority Over-sampling Technique): SMOTE is a popular
technique for generating synthetic samples in the minority class. It works by
creating synthetic instances along the line segments joining a minority class instance
to its k nearest minority class neighbours. This helps alleviate the class imbalance problem while reducing the
overfitting risk associated with over-sampling by simple duplication.
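
The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration rather than a reference implementation: the function name, the default of k = 2 and the fixed random seed are assumptions made for the example, and the minority points are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote(minority, n_new=5)
```

Every synthetic point lies on a segment between two existing minority samples, so the new data stays inside the region already occupied by the minority class.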

These techniques offer different approaches to handling imbalanced datasets, and
the choice of method depends on the specific characteristics of the dataset and the
ML algorithm being used [27].

2.4.4 Evaluation Metrics & Validation Techniques
ML models can be evaluated in various ways. The most common ways for measuring
their performance include accuracy, precision, recall, F1-score, confusion matrix, and
area under the ROC curve (AUC-ROC) [28].

Accuracy: Accuracy measures the proportion of correctly classified instances among
all instances. It is calculated as:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]

Precision: Precision measures the proportion of true positive predictions among
all positive predictions made by the model. It is calculated as:

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

Recall (Sensitivity): Recall, also known as sensitivity or true positive rate, mea-
sures the proportion of actual positive instances that were correctly classified by the
model. It is calculated as:

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

F1-score: F1-score is the harmonic mean of precision and recall. It provides a
balance between precision and recall and is useful when classes are imbalanced. It
is calculated as:

\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Confusion Matrix: A confusion matrix is a table that visualises the performance of
a classification model. It summarises the predictions of a classifier in a tabular format,
with rows representing the actual classes and columns representing the predicted
classes, as shown in Table 2.4. The main entries in a confusion matrix are:

• True Positive (TP): Instances correctly predicted as positive.

• True Negative (TN): Instances correctly predicted as negative.

• False Positive (FP): Instances incorrectly predicted as positive (Type I error).

• False Negative (FN): Instances incorrectly predicted as negative (Type II er-
ror).

Table 2.4: Confusion matrix for anomaly traffic classification.

Actual \ Predicted   Benign   Malicious
Benign               TN       FP
Malicious            FN       TP

Area Under the ROC Curve (AUC-ROC): AUC-ROC is a performance mea-
surement for classification problems at various threshold settings. It represents the
area under the receiver operating characteristic (ROC) curve, which plots the true
positive rate against the false positive rate. AUC-ROC values range from 0 to 1,
where a higher value indicates better performance of the model.
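
The metrics defined in this section follow directly from the four confusion-matrix counts, as the short Python sketch below illustrates. The function names and the example predictions are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    def safe_div(a, b):
        return a / b if b else 0.0  # convention: undefined ratios become 0.0
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Ten flows, three of them malicious (label 1).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = metrics(*confusion_counts(y_true, y_pred))
```

In this example the accuracy is 0.8 while precision, recall and F1-score are all two thirds, which shows how accuracy can look favourable on imbalanced data even when a third of the malicious flows are missed.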

2.4.5 Stratified K-Fold Cross Validation
Stratified k-fold cross-validation is a variant of k-fold cross-validation (CV) that
ensures each fold is representative of the entire dataset, particularly with respect to
the class distribution. This method is especially beneficial for classification problems
with imbalanced class distributions [29].

The process of a standard k-fold CV is as follows:

1. Dataset Division: The dataset is divided into k equally (or nearly equally)
sized folds.

2. Training and Validation: For each of the k folds:

• Use k − 1 folds as the training set.

• Use the remaining fold as the validation set.

3. Model Training: Train the model on the training set and evaluate it on the
validation set.

4. Performance Aggregation: Aggregate the performance metrics across all
k folds to estimate the model’s overall performance.

Stratified k-fold CV follows the same basic procedure as standard k-fold cross-
validation with an important modification to the fold creation process:

• Stratification of Classes: During the division of data into k folds, each
fold is ensured to have approximately the same percentage of samples of each
target class as the entire dataset. For example, if the dataset comprises 70%
class 0 and 30% class 1, each fold will maintain this 70/30 distribution.

• Repeat the k-Fold Process: Each fold is used once as the validation set
while the remaining k − 1 folds form the training set. This process is repeated
k times.

• Aggregate Performance: The performance metrics are aggregated across
all folds to provide a comprehensive performance measure.

Stratified k-fold CV ensures that each fold is representative of the overall dataset,
maintaining the same distribution of classes. This is particularly important for
datasets with imbalanced classes. Furthermore, it can help to avoid bias in the per-
formance metrics that can arise if some folds are not representative of the dataset’s
class distribution. A trade-off to consider is that the use of any k-fold CV can be
computationally expensive and time-consuming, particularly for large datasets and
complex models.
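
The fold creation described above can be sketched in plain Python. This is a simplified, deterministic illustration without shuffling; the function names are hypothetical, and the 70/30 example mirrors the class distribution used in the text.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, so that
    every fold keeps roughly the class proportions of the full dataset."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def train_validation_splits(labels, k):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    folds = stratified_folds(labels, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

labels = [0] * 14 + [1] * 6          # 70% class 0, 30% class 1
folds = stratified_folds(labels, 2)
splits = list(train_validation_splits(labels, 2))
```

With two folds, each fold receives seven samples of class 0 and three of class 1, so the 70/30 distribution is preserved in every training and validation set.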

2.4.6 Limitations of Machine Learning
A significant challenge with ML models is the limited availability of publicly accessible
and up-to-date datasets, particularly labelled ones, needed for training [30],
[31], [32]. To solve this problem one could generate and collect new network data
in a controlled environment where it is possible to separate malicious traffic from
benign traffic and thus label the data. Whereas normal traffic may be easier to
acquire by collecting traffic data from a real-world network context, it becomes hard
to verify that there are no instances of malicious traffic in the data. Data verification
directly impacts the reliability and performance of ML models in anomaly detection.
Accurately verified data ensures that the models are trained on correct and repre-
sentative examples, which is critical for the models to perform well in real-world
scenarios. Poor data verification can lead to models that are either over-sensitive
(high false positives) or under-sensitive (high false negatives), both of which are
undesirable in critical applications like network security.

Another problem at hand is the choice between supervised and unsupervised learning.
Unsupervised anomaly detection algorithms are more flexible when it comes to an
imbalance between anomaly and non-anomaly classes in the data and may also be
fitting when the labelled data is difficult to produce with high accuracy. On the other
hand, the use of unsupervised ML algorithms may also introduce a greater difficulty
in terms of interpretation of results, as they operate solely based on patterns in the
data without reference to labelled examples of malicious behaviour [32]. Supervised
ML algorithms allow for more accurate detection and classification of anomalies
though they require a large amount of labelled training data, which may not always
be available in real-world scenarios. Furthermore, they may face challenges with the
class imbalance problem, in which a model trained on an imbalanced dataset
fails to capture the minority class of the dataset [33].

3 Related Works

The presence of backdoors in Linux-based systems represents a particularly insid-
ious risk, allowing unauthorised access and control over sensitive information and
resources. The detection of such backdoors is crucial for safeguarding the confiden-
tiality, integrity, and availability of data and services. Over the years, researchers
and practitioners have explored various techniques and methodologies to address this
challenge, more recently with a particular focus on leveraging ML algorithms and NetFlow
data analysis for effective detection and mitigation. In this related works section,
we review the existing literature on detecting malware network communication
using ML and NetFlow data, examining key findings and methodologies in the field.

The first paper [1] to examine the problem of network-based backdoor detection
was written by Yin Zhang from Cornell University and published in 2000 at
the 9th USENIX Security Symposium. The paper details many of the problems with
detecting backdoor network communication that persist to this day: how to define
what constitutes a backdoor, and how to distinguish backdoor traffic from large quantities
of legitimate traffic. Zhang highlights that one of the main challenges is distinguishing
legitimate traffic that resembles backdoors from actual malicious backdoors. He
emphasises that well-defined policies regarding normal traffic behaviour, such as
the typical ports and services used, can aid in differentiating malicious and benign
traffic. Zhang also notes that creating detection mechanisms or algorithms for all
types of backdoors and normal traffic can be challenging, but specialised solutions
can be developed for more specific network segments or familiar networks.

Naturally, the field of network security has evolved significantly since 2000, and
the research field has increasingly come to be dominated by ML-based detection
methods in recent years [34].

3.1 Flow-Based Data for Malware Traffic Detec-
tion

Typically, backdoor detection is done by Network Intrusion Detection Systems (NIDS)
inspecting packets (or their aggregated flows) as they traverse the network. The data
is processed by a detection algorithm or framework that classifies a flow or a packet
as malicious or benign [34]. While packet-based data may provide more information
to perform analysis on, it is less scalable and adapts poorly to implementation in
corporate infrastructures [35].

Notably, the paper titled ’NetFlow Datasets for Machine Learning-based Network In-
trusion Detection Systems’ [3] is especially important to our research and offered in-
sights into using flow-based datasets for model training. The paper created NetFlow
versions of four benchmark NIDS packet-based datasets: UNSW-NB15, BoT-IoT,
ToN-IoT, and CSE-CIC-IDS2018. The generated NetFlow datasets were named NF-
UNSW-NB15, NF-BoT-IoT, NF-ToN-IoT, and NF-CSE-CIC-IDS2018. Although
many of the original datasets do contain data points labelled as backdoor (UNSW-
NB15, CSE-CIC-IDS2018), they either fail to mention the OS of the client side
susceptible to the backdoor (UNSW-NB15) [36], do not have backdoor-labelled
data (NF-CSE-CIC-IDS2018) [3], or only have backdoor attacks for Windows and
macOS clients (CSE-CIC-IDS2018) [37].

Given the potential differences in network traffic patterns among various operating
systems discussed in Section 2.1, the absence of Linux network data in available
benchmark datasets is problematic for our research. Ideally, backdoor traffic from
Linux hosts would be needed to develop detection models that can accurately detect
backdoor communications in Linux environments and thereby fill a critical gap in current
research methodologies.

Furthermore, the study ’NetFlow Datasets for Machine Learning-based Network
Intrusion Detection Systems’ [3] trained models using the created datasets and pro-
vided a detailed analysis of the results. Of particular interest to our research is
the detection of backdoor attacks, which are detailed in Table 3.1. The findings
specifically demonstrate the effectiveness of using NetFlow data to detect backdoors
in traffic originating from Windows clients/hosts. Despite varying results across the
datasets, there was a notable increase in accuracy for detecting backdoor activity
compared to the traditional packet-based counterparts.

Table 3.1: Results from paper [3] detailing the results of backdoor detection using
different NetFlow datasets.

Dataset            NetFlow DR   NetFlow F1   Packet-Based DR   Packet-Based F1
UNSW-NB15          39.17%       0.17         13.96%            0.08
ToN-IoT            99.22%       0.98         98.05%            0.31
CSE-CIC-IDS2018    N/A          N/A          N/A               N/A
NF-UQ-NIDS         90.95%       0.92         N/A               N/A

3.2 ML Approaches for Network Classification
Network anomaly detection has been the subject of extensive research in the field of
cybersecurity, as evidenced by a comprehensive survey conducted by the authors of
[34]. The survey covers 290 research articles, published between 2000 and 2020, that used ML methodologies
for anomaly detection. Among these articles, the most common applications were
Intrusion Detection Systems (IDSs) and network anomaly detection, comprising 68
and 66 articles, respectively. This paper highlights the use of ML algorithms as a
popular and feasible method to develop anomaly detection models based on network
traffic.

Nassif et al. [34] highlight that while unsupervised learning is frequently applied due
to its suitability in environments with limited labelled data, the evolving complexity
of anomaly detection calls for more sophisticated approaches in which supervised or
semi-supervised approaches would be ideal. The authors also looked at what types of
models and feature selection were suitable for different types of applications. Nassif
et al. point out the growing reliance on hybrid models that combine multiple ML
techniques, and ensemble methods that aggregate predictions from several models
to improve accuracy and robustness. These approaches are particularly effective
in handling the diverse and complex nature of modern anomalies. The critical
role of feature selection and extraction is emphasised, with techniques like Principal
Component Analysis (PCA) and CFS being pivotal in reducing noise and enhancing
model performance. However, the selection of a suitable ML algorithm for a given
task constitutes a central phase in constructing an ML model with the capability to
effectively address the task at hand.

In the context of ML techniques for traffic-flow-based intrusion detection, "methods
based on decision trees [...] have turned out to be the most efficient". Examples
of such algorithms are RF and DT. This was concluded by the authors of [38],
who analysed various ML techniques to ascertain which ones yielded optimal traffic
classification outcomes. On the other hand, algorithms such as ADABoost, GNB
and KNN have repeatedly shown good performance in other network classification
tasks where ML has been applied [39], [34].

3.3 Features Used in ML-Based Network Traffic
Classification

In a study on network anomaly detection using sampled NetFlow data, significant
emphasis was placed on the optimisation of feature selection from NetFlow version
5 data. The researchers focused on the features that would potentially yield the
highest information gain for anomaly detection models. This process reduced the
number of features from the original twenty-four to eleven. The selected features,
which can be seen in Table 3.2, include both traffic descriptors and network interface
data, which are crucial for analysing network flows [40].

Table 3.2: Remaining NetFlow features grouped into 5
main categories as described in the paper [40].

Retained Features
Source IP
Destination IP
Input interface
Output interface
Packets
Bytes
Source port
Destination port
Flags
Protocol
Type of Service (ToS)

Subsequently, to manage the complexity and dimensionality of the data, PCA was
employed. This statistical technique was utilised to reduce the dimensionality of
the feature set from eleven to five principal components. This reduction was made
to preserve 95% of the variance in the data, thus maintaining the integrity and the
predictive power of the models while significantly reducing computational overhead
[40].
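
A common way to choose the number of principal components for a given variance target can be stated as follows. The notation is introduced here for illustration: the lambda_i are the eigenvalues of the covariance matrix of the d features in descending order, and m is the number of components kept. Assuming the study in [40] applied this kind of criterion, d = 11 and m = 5 would correspond to the reduction described above.

```latex
\[
m = \min \left\{ j \;:\; \frac{\sum_{i=1}^{j} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95 \right\},
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
\]
```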

The paper [39] "An Approach to Detect Remote Access Trojan in the Early Stage
of Communication" employs a unique approach to data formatting for anomaly de-
tection, which, although not strictly adhering to standard flow-based formats like
IPFIX or NetFlow, is derived from packet headers in pcap data transformed into
a custom format. This custom approach to feature extraction from early-stage
TCP session data is particularly relevant to our research interests, as it provides a
refined method for identifying potential malicious activities through initial traffic
behaviours. While the study concentrates specifically on detecting Remote Access
Trojans (RATs) rather than backdoors, the nature of RAT traffic, which can be
considered a specific instance of backdoor traffic, makes their findings relevant. The
methodology focuses on capturing a combination of packet and data size metrics
that reflect the typical operational patterns of RATs, which often involve significant
outbound data transfer with minimal inbound communication. A summary of
the key features used in their detection model is given in the Paper A column of Table 3.3. The five
original features (PacNum, OutByte, OutPac, InByte and InPac) were chosen based
on existing works. To gain more information from these features, the authors
calculated two additional features (O/Ipac and OB/OP).

Additionally, the paper [41] "Optimal Remote Access Trojans Detection Based on
Network Behavior" by Khin Swe Yin and May Aye Khine also contributes to this
discussion by exploring the network behaviour-based detection model. This study
addresses the challenge of detecting RATs in their early stages by focusing on the
first twenty packets from the SYN of the TCP three-way handshake to the twentieth
packet. In Table 3.3, a comparison of the features used by both papers is shown,
providing an overview of the various metrics utilised in semi-flow-based detection
models for RATs. This paper [41] will be referred to as paper B in the study.

Table 3.3: Summary of network features used for RAT detection of paper A [39] and
paper B [41]. X marks a feature that is used, - one that is not.

Feature    Paper A   Paper B   Description
PacNum     X         -         Number of packets in the early stage of communication
Duration   -         X         Duration of a flow
OutByte    X         X         Total size of outbound data
OutPac     X         -         Number of outbound packets
InByte     X         X         Total size of inbound data
InPac      X         -         Number of inbound packets
O/Ipac     X         X         Ratio of outbound to inbound packets
OB/OP      X         X         Average size of outbound packets
IB/IP      -         X         Average size of inbound packets
OB/IB      -         X         Ratio of outbound data to inbound data
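
As an illustration of how the count and ratio features in Table 3.3 relate to one another, the sketch below derives them from a list of packets. The input format (direction and size per packet), the function name and the default limit of twenty packets are assumptions made for this example and do not reproduce the extraction code of either paper; Duration is omitted because the hypothetical input carries no timestamps.

```python
def early_stage_features(packets, limit=20):
    """packets: list of (direction, size_in_bytes) with direction 'out' or 'in'.
    Only the first `limit` packets of the session are considered."""
    early = packets[:limit]
    out_sizes = [size for direction, size in early if direction == "out"]
    in_sizes = [size for direction, size in early if direction == "in"]

    def ratio(a, b):
        return a / b if b else None  # undefined when the denominator is zero

    return {
        "PacNum": len(early),
        "OutByte": sum(out_sizes),
        "OutPac": len(out_sizes),
        "InByte": sum(in_sizes),
        "InPac": len(in_sizes),
        "O/Ipac": ratio(len(out_sizes), len(in_sizes)),
        "OB/OP": ratio(sum(out_sizes), len(out_sizes)),
        "IB/IP": ratio(sum(in_sizes), len(in_sizes)),
        "OB/IB": ratio(sum(out_sizes), sum(in_sizes)),
    }

session = [("out", 60), ("in", 60), ("out", 500), ("out", 700), ("in", 40)]
features = early_stage_features(session)
```

For this toy session the outbound volume is more than twelve times the inbound volume, the kind of asymmetry that the ratio features are meant to expose.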

4 Method

The following chapter presents the methodology followed to address our research
question. In the section on Model Selection, ML algorithms were chosen based on
findings from related works. The Data Collection section explains the process of col-
lecting both malicious and benign data, with specific filtering criteria detailed. Data
Preprocessing outlines procedures for transforming and refining the dataset, includ-
ing conversion to CSV format, flow aggregation, and feature engineering. Feature
Selection utilised ANOVA F-values and p-values to select top features for analysis.
Model Training and Evaluation employed stratified k-fold CV, with and without
SMOTE oversampling, to evaluate each model’s performance. Finally, the Limita-
tions section addresses constraints and considerations regarding data collection and
methodology.

[Flowchart: Model Selection -> Data Collection -> Data Preprocessing -> Feature Selection -> Model Training and Evaluation -> Limitations]

Figure 4.1: Overview of method section.

4.1 Model Selection
Based on the findings presented in Section 3.2, the ML algorithms in Table 4.1
were chosen for the task of binary network traffic classification.

Table 4.1: Selection of machine learning algorithms.

Machine Learning Algorithm Abbreviation
Adaptive Boosting ADABoost
Bayesian Naive Bayes BNB
Decision Trees DT
Extreme Gradient Boosting XGBoost
Random Forest RF

4.2 Data Collection
The following section explains how data was collected to train and evaluate the
selection of machine learning algorithms.

4.2.1 Collection of Malicious Data
PCAP files from Triage were collected based on a set of search filters. The pseu-
docode in Algorithm 1 shows the criteria for extracting these PCAP files.

Algorithm 1: Pseudocode for extracting PCAP files.
for each (sample_id, task_id) in submitted_samples do
    if OS = 'Linux' and tag = 'backdoor' and Score > 5 and
       Network_Traffic_Available = TRUE then
        extract_pcap(sample_id, task_id);
    end
end

Triage’s scoring parameter was used to filter out samples which did not appear to
exhibit any malicious behaviour according to the platform’s dynamic analysis. A
score above five is considered to indicate suspicious and/or malicious behaviour.
In short, PCAP files were extracted for all malware samples which (1) had been
successfully run on a Linux platform, (2) had been tagged with the backdoor tag,
(3) had a dynamic analysis score above five and (4) had been able to establish a
network connection.
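The selection criteria of Algorithm 1 can be sketched as a simple predicate over
submission records. The sketch below is illustrative only: the Submission record and
its field names are hypothetical and do not correspond to the Triage API, and the
actual download step (extract_pcap) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical record layout; field names are illustrative only.
    sample_id: str
    task_id: str
    os: str
    tags: frozenset
    score: int
    has_network_traffic: bool


def should_extract(sub: Submission) -> bool:
    """Return True when a submission meets all four criteria of Algorithm 1."""
    return (
        sub.os == "Linux"
        and "backdoor" in sub.tags
        and sub.score > 5
        and sub.has_network_traffic
    )


def select_for_extraction(submissions):
    """Return the (sample_id, task_id) pairs whose PCAP would be extracted."""
    return [(s.sample_id, s.task_id) for s in submissions if should_extract(s)]
```

Note that the score criterion is strict: a sample with a score of exactly five is not
selected.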

The SHA256 hash associated with each obtained sample_id was compared against those
of the other samples, and duplicates of the same malware sample file were removed.
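A minimal sketch of such hash-based deduplication is given below. It assumes that
the raw bytes of each sample file are available locally, which is an assumption made
for illustration; in practice the digest may equally be taken from the platform's
metadata.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(samples):
    """Keep the first sample_id seen for each distinct SHA-256 digest.

    `samples` is an iterable of (sample_id, file_bytes) pairs.
    """
    seen = set()
    unique_ids = []
    for sample_id, content in samples:
        digest = sha256_hex(content)
        if digest not in seen:
            seen.add(digest)
            unique_ids.append(sample_id)
    return unique_ids
```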

4.2.2 Collection of Benign Data
Collection of benign data must ensure the integrity of the data. One way to achieve
this is to generate the benign data oneself, maintaining full control over the network
traffic contained within. A benefit of this approach is the ability to tailor the
patterns and behaviours of the normal traffic. Another way is to use real-world flow
data, provided that its integrity is not compromised. The risk of compromise in
real-world data must be analysed, which can only be done with knowledge of the
traffic sent over the network. This data should then undergo thorough vetting to
eliminate any potential threats. An effective approach is to cross-reference the
collected data against a comprehensive list of known malicious IPs and domains, so
that flows involving known-bad endpoints are excluded from the dataset.

The benign data for this project was collected from a network segment within
Recorded Future’s corporate infrastructure. This network segment is specifically
designated for calling external APIs and downloading packages from external
repositories, implying that its hosts should only communicate with predefined
trusted sources. However, there remains a small risk that these sources could be
compromised, potentially introducing vulnerabilities into the systems they interact
with.

To mitigate this risk and ensure the benign nature of the data, any potentially
malicious data points were filtered out by comparing them against Recorded Future’s
extensive IP risk list. Recorded Future generates risk scores for IPs by combining
intelligence gathered from their automatic analysis of unstructured text, integrating
threat intelligence from multiple sources such as threat feeds, security reports, and
dark web monitoring. These sources provide a comprehensive overview of risky IP
addresses, making the list reliable for identifying and filtering out potential threats
[43].
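The filtering step can be sketched as a set lookup on both endpoints of each flow.
The sketch assumes that the risk list is available as a plain collection of IP
address strings and that flows are dictionaries with hypothetical keys "sa" (source
address) and "da" (destination address); any use of a risk-score threshold is not
modelled here.

```python
import ipaddress


def filter_benign_flows(flows, risk_list):
    """Drop every flow whose source or destination IP is on the risk list.

    `flows` is an iterable of dicts with the (assumed) keys "sa" and "da";
    `risk_list` is an iterable of IP address strings.
    """
    risky = {ipaddress.ip_address(ip) for ip in risk_list}
    kept = []
    for flow in flows:
        src = ipaddress.ip_address(flow["sa"])
        dst = ipaddress.ip_address(flow["da"])
        if src not in risky and dst not in risky:
            kept.append(flow)
    return kept
```

Parsing the addresses with the ipaddress module, rather than comparing raw strings,
avoids mismatches caused by different textual spellings of the same address.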

4.3 Data Preprocessing

This section outlines the procedures undertaken to prepare and refine the dataset
for subsequent model training.

The extracted PCAP files from Triage were converted into nfcapd files using nfpcapd.
The nfcapd files were further transformed into CSV files using nfdump, while
simultaneously aggregating them into bidirectional flows. Aggregation occurs at the
connection level, keyed on the 5-tuple (protocol type, source IP, destination IP,
source port, and destination port), so that both directions of a connection are
merged into a single record. This also reduces the size of the data. An overview of
the extraction and conversion process is shown in Figure 4.2.

Figure 4.2: Overview of pcap file extraction and conversion.
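The idea behind bidirectional aggregation on the 5-tuple can be illustrated with the
following sketch. It is not how nfdump is implemented; it only shows the principle of
mapping both directions of a connection to one key. The record layout and the
counter names (fwd_*, rev_*) are assumptions made for illustration.

```python
def biflow_key(proto, src_ip, src_port, dst_ip, dst_port):
    """Direction-independent key: both directions of a connection share it."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    low, high = (a, b) if a <= b else (b, a)
    return (proto, low, high)


def aggregate_bidirectional(records):
    """Merge unidirectional records into bidirectional flows.

    Each record is a tuple
    (proto, src_ip, src_port, dst_ip, dst_port, packets, bytes).
    The endpoint that sends the first record of a connection defines the
    "fwd" direction; traffic from the other endpoint is counted as "rev".
    """
    flows = {}
    for proto, sip, sport, dip, dport, pkts, byts in records:
        key = biflow_key(proto, sip, sport, dip, dport)
        flow = flows.setdefault(key, {
            "proto": proto,
            "first_sender": (sip, sport),
            "fwd_pkts": 0, "fwd_bytes": 0,
            "rev_pkts": 0, "rev_bytes": 0,
        })
        direction = "fwd" if (sip, sport) == flow["first_sender"] else "rev"
        flow[direction + "_pkts"] += pkts
        flow[direction + "_bytes"] += byts
    return list(flows.values())
```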

To mitigate potential bias towards specific nodes, both source and destination IP
addresses, along with source and destination ports, were deliberately omitted from
the dataset. Network flows associated with malicious activity were given the label
"1", while benign flows were labelled "0". In addition, four supplementary fields
(the ratios in rows 8 to 11 of Table 4.2) were generated and incorporated into the
dataset. The eleven attributes, selected prior to any statistical feature selection
process, are enumerated in Table 4.2. Finally, any NaN values were dropped and Inf
values were replaced by 0.
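A sketch of how the four ratio fields and the NaN/Inf handling fit together is shown
below. It rests on two assumptions that are not stated above: that NaN and Inf values
arise from divisions by zero in the ratios, and that "dropping NaN values" means
dropping the whole row.

```python
import math


def safe_ratio(numerator, denominator):
    """IEEE-style division: x/0 gives inf for x > 0, and 0/0 gives NaN."""
    if denominator == 0:
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator


def add_ratio_features(flow):
    """Return a copy of the flow with the four derived ratio fields added."""
    out = dict(flow)
    out["IbytByIpkt"] = safe_ratio(flow["ibyt"], flow["ipkt"])
    out["ObytByOpkt"] = safe_ratio(flow["obyt"], flow["opkt"])
    out["ObytByIbyt"] = safe_ratio(flow["obyt"], flow["ibyt"])
    out["OpktByIpkt"] = safe_ratio(flow["opkt"], flow["ipkt"])
    return out


def clean(flows):
    """Drop rows containing NaN, then replace infinite values with 0."""
    cleaned = []
    for flow in flows:
        floats = [v for v in flow.values() if isinstance(v, float)]
        if any(math.isnan(v) for v in floats):
            continue  # assumed: the whole row is discarded
        cleaned.append({
            k: (0.0 if isinstance(v, float) and math.isinf(v) else v)
            for k, v in flow.items()
        })
    return cleaned
```

Under these assumptions, a flow with packets but no bytes in one direction keeps its
row with the affected ratio set to 0, whereas a flow with neither packets nor bytes in
one direction is removed.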

Table 4.2: Selection of nfdump fields before statistical feature selection.

No.  Field        Description
1    td           Duration of the flow in seconds and milliseconds
2    pr           Protocol used in the connection
3    flg          TCP flags of the connection, ORed together
4    ipkt         Input packets
5    opkt         Output packets
6    ibyt         Input bytes
7    obyt         Output bytes
8    IbytByIpkt   Input bytes per input packet (ibyt / ipkt)
9    ObytByOpkt   Output bytes per output packet (obyt / opkt)
10   ObytByIbyt   Ratio of output bytes to input bytes (obyt / ibyt)
11   OpktByIpkt   Ratio of output packets to input packets (opkt / ipkt)

The flg and pr fields were mapped to integers. Each individual flag is assigned a
unique numerical value that corresponds to a power of 2, as shown in Table 4.3; the
exception is X, which is mapped to 63, the sum of the six individual flag values. The
empty string is mapped to 0, indicating no flags present. The sum of flg values thus
represents a unique combination of flags being set. Each protocol is likewise
assigned a unique numerical value, as shown in Table 4.4. The mapping of the flg and
pr fields was based on the documentation provided in the manual pages of NFDUMP [14].

Table 4.3: Mapping of flg values to integers based on powers of 2.

Flag      X    U    P    R    F    S    A
Mapping   63   32   16   8    4    2    1

Table 4.4: Mapping of protocols to integers.

Protocol  RSVP  AH  ESP  GRE  ICMP  UDP  TCP
Mapping   6     5   4    3    2     1    0
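The two mappings can be expressed compactly as lookup tables. The sketch below uses
the values of Tables 4.3 and 4.4; treating any character without a mapping (such as a
"." placeholder) as "flag not set" is an assumption about the input format.

```python
# Values taken from Table 4.3 and Table 4.4.
FLAG_VALUES = {"U": 32, "P": 16, "R": 8, "F": 4, "S": 2, "A": 1}
PROTOCOL_VALUES = {
    "TCP": 0, "UDP": 1, "ICMP": 2, "GRE": 3, "ESP": 4, "AH": 5, "RSVP": 6,
}


def encode_flags(flg: str) -> int:
    """Map a flag string to an integer by summing the per-flag values."""
    if "X" in flg:
        return 63  # X is mapped to 63 in Table 4.3
    # Assumed: characters without a mapping (e.g. ".") mean "flag not set".
    return sum(FLAG_VALUES.get(ch, 0) for ch in set(flg))


def encode_protocol(pr: str) -> int:
    """Map a protocol name to its integer code."""
    return PROTOCOL_VALUES[pr.strip().upper()]
```

Because each individual flag contributes a distinct power of 2, two different flag
combinations can never yield the same sum; X coincides with the combination in which
all six flags are set.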

The data was scaled to improve the stability and generalisation performance of the
machine learning models. Two scaling methods were considered, StandardScaler and
RobustScaler, both of which are provided by scikit-learn. StandardScaler centres each
feature on its mean and divides by its standard deviation, both of which are
sensitive to extreme values. RobustScaler instead centres each feature on its median
and divides by its interquartile range, which reduces the influence of outliers.
Because the flow features contain extreme values, RobustScaler was chosen and applied
to the dataset.
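The difference between the two scaling approaches can be illustrated with a small
pure-Python sketch for a single feature. This is a re-implementation of the idea for
illustration, not scikit-learn's code; the quartiles are computed with linear
interpolation between data points.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # constant feature: avoid division by zero
    return [(v - median) / iqr for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0
    return [(v - mean) / std for v in values]
```

For the values 1, 2, 3, 4 and 100, the median and IQR are determined by the four
typical values alone, so these keep a spread of 0.5 per step after robust scaling and
only the outlier is mapped far away. With mean and standard deviation, the outlier
inflates both statistics and the typical values are compressed into a narrow band.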

4.4 Feature Selection
Feature selection was performed using the SelectKBest class from the scikit-learn
library. The selection of SelectKBest with ANOVA is grounded in its ability to handle
high-dimensional data and its effectiveness in identifying features that contribute
most to the variance between classes. This method simplifies the model and enhances
performance by reducing the dimensionality of the input data.

The ANOVA F-values between labels and features, along with their respective p-values,
were computed and are detailed in Table 4.5. Subsequently, the top six features were
selected for further analysis. These features are: pr, flg, ObytByOpkt, IbytByIpkt,
OpktByIpkt, and td.

Table 4.5: F-statistic for each feature and the p-value associated with each
F-statistic, rounded to three decimal places and sorted in descending order of
F-statistic.

Feature      F-statistic   P-value
pr           288113.664    0.000
flg          11972.300     0.000
ObytByOpkt   97.611        0.000
IbytByIpkt   77.446        0.000
OpktByIpkt   49.140        0.000
td           5.605         0.018
ibyt         1.040         0.308
obyt         0.961         0.327
ipkt         0.856         0.355
ObytByIbyt   0.479         0.489
opkt         0.082         0.775

The ANOVA F-statistic is the ratio of the variance between the class means to the
variance within the classes. The top features therefore display the greatest
separation between benign and malicious data relative to their spread within each
class. In terms of model training and classification ability, a higher F-value
indicates a greater ability to distinguish between classes. The p-values associated
with the F-values of the selected features further support their significance, as
they are all below the conventional threshold of 0.05. A p-value below this threshold
indicates that a difference in class means of the observed size would be unlikely if
the feature were unrelated to the class label.
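To make the F-statistic concrete, the following sketch computes the one-way ANOVA
F-value of a single feature from its values grouped by class. It is a textbook
formulation written for illustration and is not taken from the scikit-learn source.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` is a list of lists, one list of feature values per class.
    F = (between-class mean square) / (within-class mean square).
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

As a worked example, the classes [1, 2, 3] and [4, 5, 6] have means 2 and 5 around a
grand mean of 3.5, giving a between-class sum of squares of 13.5 with one degree of
freedom and a within-class sum of squares of 4 with four degrees of freedom, hence
F = 13.5 / 1 = 13.5. Two classes with identical means give F = 0.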

The top six features were chosen because they exhibited the highest F-values, making
them the most informative for our classification model. Specifically, features pr,
flg, ObytByOpkt, IbytByIpkt, OpktByIpkt had particularly high F-values, all with
corresponding p-values of 0.000, indicating strong statistical significance. The sixth
feature, td, while having a slightly higher p-value of 0.018, still met the threshold
for statistical significance and was included because it was seen as a telling and
informative feature regarding malicious activity.

In order not to include features which were highly correlated with each other, CFS
was also implemented. Highly correlated pairs of features could lead to overfitting,
since such features provide largely redundant information to the model. Furthermore,
some of the selected models assume no correlation between features, so highly
correlated features must be removed before training them. As no features were highly
correlated with each other (see Figure 4.3), no further features were removed.
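A pairwise correlation check of this kind can be sketched as follows. The sketch uses
the Pearson correlation coefficient and a cut-off of 0.9; both are assumptions made
for illustration, as the coefficient and threshold are not specified above, and a
simple pairwise check is only a stand-in for a full CFS procedure.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps a feature name to its list of values.
    The threshold of 0.9 is an illustrative assumption.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

An empty result corresponds to the situation described above, in which no feature
needs to be removed.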

Figure 4.3: Correlation matrix for input variables.

4.5 Model Training and Evaluation
All models were trained and evaluated using stratified k-fold CV with k = 5. The
models were trained and evaluated under two sampling strategies. In the first
strategy, SMOTE was employed to oversample the minority class, i.e. the malicious
class, by synthetically generating new samples. This strategy aimed to establish a
ratio of benign to malicious samples of approximately 80 : 20. The second strategy
refrained from any form of resampling. Tables 4.6 and 4.7 show the class
distributions in the original data set (after preprocessing) and when SMOTE has been
applied. The approximate 80 : 20 ratio was chosen over a 50 : 50 ratio because, in
real-life scenarios, it is not realistic for benign and malicious data to occur in
equal proportions.
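The core idea of SMOTE, generating a synthetic sample on the line segment between a
minority sample and one of its nearest minority neighbours, can be sketched as
follows. This is a simplified illustration and not the implementation used in this
work: base points are drawn at random, neighbours are found by brute force, and the
helper n_synthetic_needed is a hypothetical convenience function.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbours.

    `minority` is a list of equally long numeric tuples (at least two).
    """
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (brute force).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def n_synthetic_needed(n_majority, n_minority, ratio):
    """Synthetic samples required so that minority / majority reaches `ratio`."""
    return max(0, math.ceil(ratio * n_majority) - n_minority)
```

A benign to malicious ratio of 80 : 20 corresponds to a minority-to-majority ratio of
0.25; for example, 80 benign and 5 malicious samples would require 15 synthetic
samples. Since every synthetic point lies between two existing minority samples, the
method adds no information beyond the region already covered by the minority class.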

Performance metrics were collected for each fold of each model, and their averages
across all folds were computed. To plot an average ROC curve per model, the TPRs of
each fold were interpolated over a common set of FPRs and averaged; the AUC reported
with the curve is the average AUC over the folds.
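Averaging ROC curves across folds requires a common FPR grid, because each fold
yields its own set of (FPR, TPR) points. A minimal sketch of this step is given
below; the grid size and the use of linear interpolation are assumptions made for
illustration.

```python
def interpolate(x, xs, ys):
    """Linear interpolation of y at x; xs must be sorted in ascending order."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            weight = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + weight * (ys[i] - ys[i - 1])


def mean_roc(fold_curves, grid_size=101):
    """Average per-fold ROC curves on a common FPR grid.

    `fold_curves` is a list of (fprs, tprs) pairs, one per fold.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    mean_tpr = [
        sum(interpolate(f, fprs, tprs) for fprs, tprs in fold_curves) / len(fold_curves)
        for f in grid
    ]
    mean_tpr[0] = 0.0  # every ROC curve starts at the origin
    return grid, mean_tpr


def auc(xs, ys):
    """Area under a curve by the trapezoidal rule."""
    return sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs))
    )
```

The average AUC is then the mean of auc(fprs, tprs) over the folds, computed on the
original per-fold points rather than on the interpolated mean curve.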

Table 4.6: Original class distribution.

Label   Count     Percentage (%)
0       2980741   99.975
1       740       0.025

Table 4.7: Class distribution with SMOTE applied.

Label   Count     Percentage (%)
0       2384593   83.46
1       476918    16.54
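Stratification keeps the class proportions of Table 4.6 in every fold, which matters
when only 740 of roughly three million flows are malicious. The following sketch
shows one simple way to build stratified folds; it is an illustration under the
assumption of a round-robin assignment per class and is not the scikit-learn
implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round-robin assignment spreads each class evenly over the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 740 malicious flows and k = 5, each test fold holds 148 malicious flows, which
matches the sum FN + TP = 148 in the result tables of Figures 5.6 and 5.7. The benign
count in Table 4.7 (2384593) is about four fifths of the original 2980741, which is
consistent with oversampling being applied to the training folds only.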


5
Results

5.1 Data Statistics
The following section presents statistics on the data which was used to train and
evaluate the machine learning models.

5.1.1 Malicious Data Behaviour
This project’s malicious data comprised 43 Linux backdoor samples. 23 of these
samples had an unknown malware family; they were nevertheless labelled as backdoors
based on the activity identified in Triage’s sandbox. The remaining 20 samples were
attributed to the backdoor malware families Iptablez, BPFDoor, XZutil, and Metasploit
backdoors. The distribution amongst our samples is shown in Table 5.1.

Table 5.1: Overview of Linux backdoor samples analysis.

Description Count
Total Linux Backdoor Samples Analysed 43
Labels from Specific TTPs 23/43
Samples Attributed to Malware Families 20/43

A detailed summary of the key MITRE ATT&CK techniques and their occurrences within
the collected samples is presented in Table 5.2 below.

Table 5.2: Summary of techniques identified in Linux backdoor malware samples.

Technique   Description                               Count out of 43
T1016       System Network Configuration Discovery    38
T1049       System Network Connections Discovery      26
T1568       Dynamic Resolution                        14
T1053       Scheduled Task/Job                        12
T1574       Hijack Execution Flow                     11
T1082       System Information Discovery              29
T1497       Virtualisation/Sandbox Evasion            5
T1562       Impair Defences                           4

One of the notable techniques employed by malware samples in our data is Dynamic
Resolution (T1568), which complicates the tracking of C2 infrastructure by
dynamically resolving domain names. Dynamic DNS (DDNS) is a method used within this
technique where domain names are frequently updated with new IP addresses. This
approach allows the malware to maintain consistent contact with its C2 servers even
if the IP addresses change, making it more resilient against takedowns and blocking
efforts [44].

Furthermore, we see techniques aimed at virtualisation and sandbox evasion (T1497)
and hijacking execution flow (T1574) to modify how programs execute, redirecting
execution to attacker-controlled code. Lastly, a notable trend is the network and
system discovery techniques utilised for identifying active network connections, lis-
tening ports and network configurations (T1049, T1016).

5.1.2 Flow Distribution Amongst Malicious Samples
Figure 5.1 illustrates the distribution of flow occurrences per individual malware
sample. Flows belonging to the same malware sample but to a different behavioural
task, i.e. a run on a different Linux platform, have all been aggregated.

Figure 5.1: Number of flows per sample ID distribution.

5.1.3 Mean and Median of Features in Data
Table 5.3 provides an overview of the mean and median values for key features, which
include total duration (td), outbound bytes (obyt), inbound bytes (ibyt), outbound
packets (opkt), inbound packets (ipkt), the ratio of inbound bytes to inbound packets
(IbytByIpkt), the ratio of outbound bytes to outbound packets (ObytByOpkt), and the
ratio of outbound packets to inbound packets (OpktByIpkt).

These statistics are critical for understanding the underlying patterns of network
traffic associated with both benign and malicious flows. The mean and median val-
ues help highlight the typical behaviour observed in the data, whereas discrepancies
between these measures can indicate the presence of outliers or skewed data distri-
butions. Throughout the results section, we will refer back to these statistics to
make comparisons on trends shown in other figures.

Table 5.3: Mean and median of numerical features in data.

Feature      Mean Malicious   Median Malicious   Mean Benign   Median Benign
td           8304.74          79.00              14092.85      112.00
obyt         187452.07        195.50             1220782.49    6150.00
ibyt         2881.24          164.00             99423.22      2763.00
opkt         137.99           1.00               177.81        10.00
ipkt         41.52            2.00               155.64        10.00
IbytByIpkt   91.69            75.00              354.39        219.60
ObytByOpkt   354.02           170.00             881.23        648.56
OpktByIpkt   1.06             1.00               0.94          0.91

In reviewing the non-normalised mean and median values for flow duration (td)
presented in Table 5.3, the different durations for which benign and malicious data
were captured must be considered. These initial measurements might not be directly
comparable, as the length of the observation periods significantly varies between the
two datasets. However, by normalising these flow durations, details of which are
provided in Table 5.4, a more accurate understanding of the data is gained. The
normalised durations show that, when adjusted for the length of their respective
capture periods, malicious activities engage in sustained, longer-duration flows to a
much greater extent than benign activities.

Table 5.4: Normalised mean and mode of flow durations for malicious and benign
data.

Type Normalised Mean (ms) Normalised Mode (ms)
Malicious 2.306 0.022
Benign 0.163 0.001

5.1.4 Comparison of Incoming and Outgoing Packet Sizes
Figure 5.2 reveals consistent trends across both datasets. It is observed that, on
average, the size of outgoing packets surpasses that of incoming packets for both
benign and malicious data. This indicates a general pattern where outgoing packets
contain more data than incoming ones. Notably, benign data typically features larger
packets compared to malicious data, suggesting differences in data transmission
behaviours between the two types.

Figure 5.2: Incoming and outgoing packet size in malicious and benign data.

5.1.5 Distribution of Protocol in Data

The proportion of protocols in the data is illustrated by Figure 5.3. It is observed
that the malicious data comprised approximately equal parts of the transport layer
protocols, TCP and UDP, whereas the benign data consisted almost entirely (99.9%)
of TCP flows. This distinct contrast in protocol distribution, with clear homogene-
ity in the benign dataset against the variance in the malicious data, significantly
enhances the protocol feature’s impact as a determinant in model classification deci-
sions. This difference in variance is a key contributor to the high F-statistic for the
protocol feature observed in the ANOVA feature selection, as detailed in Section 4.4
and shown in Table 4.5.

Figure 5.3: Overview of protocol distribution in benign and malicious data.

5.1.6 Distribution of Flow Duration in Data
Figure 5.4 presents a density plot comparing flow durations in malicious and benign
data. The plot focuses on the interquartile range, which encompasses the middle
50% of each dataset. Examining the axes, we observe significant differences in
flow durations within this range. Specifically, all flow durations in the interquartile
range for benign data are below 600 ms, whereas those for malicious data extend
up to approximately 3000 ms. This indicates that the range of flow duration for
malicious data within this interval is about five times greater than that for benign
data, suggesting more variability in the duration of malicious flows.

Figure 5.4: Distribution of flow-duration within interquartile range (25th to 75th
percentiles).

5.1.7 Comparison of Outgoing vs. Incoming Packets
Analysis of the trend lines depicted in Figure 5.5 shows that the slope for the
malicious data is noticeably steeper than that of the benign data. This steeper slope
suggests a more pronounced increase in outgoing packets relative to incoming packets
in the malicious dataset. Such a trend indicates that malicious communications send
more outgoing packets per incoming packet than is observed in the benign data.
Supporting this observation, Table 5.3 shows that the OpktByIpkt feature, the ratio
of outgoing to incoming packets, has a higher mean and median in the malicious
dataset (1.06 and 1.00) than in the benign dataset (0.94 and 0.91).

Figure 5.5: Number of outgoing vs incoming packets per data point in malicious
and benign data.

5.2 Model Performances
The following section presents the results of the evaluation metrics for each model.
These evaluation metrics are True Negatives (TN), False Positives (FP), False
Negatives (FN), True Positives (TP), False Negative Rate (FNR), False Positive Rate
(FPR), Precision and Recall, as well as the ROC curve (receiver operating
characteristic curve) and the corresponding Area under the ROC Curve (AUC). It is
important to point out that the results presented in Figures 5.6 and 5.7 are averages
based on stratified k-fold CV with five folds. Furthermore, the ROC curves presented
in Figures 5.8 and 5.9 have been interpolated for each model, and the AUC represents
the average AUC score of all five folds per model.

5.2.1 Average Performance Metrics Per Model

Figure 5.6: Performance metrics of models with SMOTE applied.

Model     TN      FP    FN  TP   FNR       FPR       Precision  Recall
ADABoost  595991  157   20  128  0.135135  0.000263  0.449123   0.864865
DT        595989  159   20  128  0.135135  0.000267  0.445993   0.864865
GNB       590090  6058  49  99   0.331081  0.010162  0.016079   0.668919
KNN       595839  310   21  127  0.141892  0.000520  0.290618   0.858108
RF        596038  110   19  129  0.128378  0.000185  0.539749   0.871622
XGBoost   593198  2951  19  129  0.128378  0.004950  0.041883   0.871622

Figure 5.7: Performance metrics of models without SMOTE applied.

Model     TN      FP    FN  TP   FNR       FPR       Precision  Recall
ADABoost  596132  17    29  119  0.195946  0.000029  0.875000   0.804054
DT        596126  22    30  118  0.202703  0.000037  0.842857   0.797297
GNB       590497  5652  56  92   0.378378  0.009481  0.016017   0.621622
KNN       596086  62    50  98   0.337838  0.000104  0.612500   0.662162
RF        596144  4     35  113  0.236486  0.000007  0.965812   0.763514
XGBoost   596147  1     81  67   0.547297  0.000002  0.985294   0.452703
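The rate columns follow directly from the four confusion-matrix counts. The sketch
below states the standard definitions; the function name is chosen for illustration.

```python
def rates(tn, fp, fn, tp):
    """Derive FNR, FPR, precision and recall from confusion-matrix counts."""
    return {
        "FNR": fn / (fn + tp),        # share of malicious flows that were missed
        "FPR": fp / (fp + tn),        # share of benign flows that were flagged
        "Precision": tp / (tp + fp),  # share of flagged flows that were malicious
        "Recall": tp / (tp + fn),     # share of malicious flows that were found
    }
```

Applied to the RF row with SMOTE (TN = 596038, FP = 110, FN = 19, TP = 129), this
gives a precision of 129 / 239 and a recall of 129 / 148, which reproduces the
tabulated values of 0.539749 and 0.871622. The example also shows why precision can
be low even at a very small FPR: 110 false positives are few relative to almost
600000 benign flows, but many relative to 129 true positives.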

5.2.2 ROC Curves and AUC scores
Looking solely at the ROC curves presented in Figures 5.8 and 5.9, all of the models
perform better than random guessing (AUC = 0.50). Furthermore, all models have higher
AUC scores when SMOTE is applied to address the imbalanced dataset.

Figure 5.8: Interpolated ROC curves of models with SMOTE applied.

Figure 5.9: Interpolated ROC curves of models without SMOTE applied.

5.2.3 FNR, FPR, Precision and Recall Curves
In the following two figures, each model’s FNR, FPR, precision and recall scores
have been plotted for the two cases when SMOTE has been applied and when only the
original data set has been used. The GNB classifier stands out among the models for
its low precision (approximately 0.016 in both cases) and, when SMOTE is applied, for
the highest FNR (0.331). In terms of recall, the scores are relatively similar among
all models except for XGBoost, which falls behind on the original data set (0.453).
FPRs are low for all models, at most roughly 0.01 (GNB).

Figure 5.10: Average FNR & FPR by model, with and without SMOTE.

Figure 5.11: Average precision & recall by model, with and without SMOTE.


6
Discussion

This chapter analyses the performance of various ML models in classifying network
traffic as benign or malicious using NetFlow data. Stratified k-fold cross-validation
and SMOTE were employed to address data imbalance, revealing significant varia-
tions in model performance, particularly in precision and recall.

We discuss the impact of SMOTE on class distribution and model accuracy. En-
semble methods like Random Forest and XGBoost showed superior performance,
especially in handling imbalanced datasets.

Feature selection was performed using ANOVA F-values and CFS to enhance model
interpretability and performance. We highlight the challenges posed by feature
correlation, particularly for models assuming feature independence, such as Gaussian
Naive Bayes.

The implications of our findings are discussed in the context of practical network
traffic classification, emphasising the effectiveness of the best-performing models in
real-world scenarios. This chapter provides a concise evaluation of model strengths
and weaknesses, guiding future improvements in cybersecurity measures.

6.1 Analysis of Model Results
This section evaluates the performance of the ML models in classifying network
traffic as either benign or malicious using NetFlow data. Stratified k-fold CV was
employed to ensure robust evaluation, and SMOTE was applied to address the class
imbalance. The results demonstrate significant variations in model performance, par-
ticularly in precision and recall when trained on original versus SMOTE-enhanced
datasets.

The significant difference in class distribution between the two scenarios must be
taken into account. Table 4.7 illustrates the class distribution after the
application of SMOTE. When dealing with imbalanced datasets, traditional evaluation
metrics such as accuracy can be misleading because they do not account for the
imbalance between classes. It is therefore essential to use evaluation metrics that
provide a more nuanced view of model performance, particularly on the minority class.
Precision and recall have thus been used to evaluate the models, focusing on the
minority class.
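A short calculation on the class counts of Table 4.6 illustrates why accuracy is
uninformative here. The sketch considers a trivial classifier that labels every flow
as benign; the function names are chosen for illustration.

```python
def accuracy(tn, fp, fn, tp):
    """Share of all flows that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


def recall(fn, tp):
    """Share of malicious flows that were detected."""
    return tp / (tp + fn)


# Class counts from Table 4.6; the classifier predicts "benign" for every flow.
benign, malicious = 2980741, 740
always_benign_accuracy = accuracy(tn=benign, fp=0, fn=malicious, tp=0)
always_benign_recall = recall(fn=malicious, tp=0)
```

Such a classifier is correct for 2980741 of 2981481 flows, i.e. an accuracy of about
99.975 %, while detecting none of the malicious flows (recall of 0).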

Random Forest (RF): RF showed robust performance due to its ensemble nature,
reducing overfitting and increasing stability. The model’s high AUC and precision on
the original dataset highlight its effectiveness. However, SMOTE reduced precision
significantly, likely due to the introduction of synthetic samples that added noise,
albeit improving recall and reducing the FNR. The increase in FPR after SMOTE
indicates some benign flows were misclassified as malicious.

ADABoost: ADABoost’s boosting technique, which focuses on difficult-to-classify
instances, demonstrated sensitivity to class imbalance. The increase in recall with
SMOTE indicates a positive effect from additional minority samples, but the re-
duced precision and increased FPR suggest potential overfitting to these synthetic
instances.

Decision Tree (DT): DTs are prone to overfitting, especially with imbalanced
datasets. SMOTE improved recall by providing more minority class samples but
decreased precision due to less informative synthetic samples. The increase in FPR
post-SMOTE indicates a trade-off between correctly identifying malicious flows and
misclassifying benign ones.

Gaussian Naive Bayes (GNB): GNB assumes normality and feature indepen-
dence, which might not hold for network traffic data. Its poor performance and
negligible improvement with SMOTE highlight the limitations of these assumptions
in handling imbalanced data. The slight improvement in recall and a corresponding
decrease in FNR came at the cost of a higher FPR.

K-Nearest Neighbours (KNN): KNN’s instance-based learning is sensitive to
class imbalance. SMOTE significantly increased recall but decreased precision,
indicating that synthetic samples helped cover the minority class but introduced
noise. The lower FNR post-SMOTE suggests better identification of malicious flows,
though the increased FPR indicates more benign flows were incorrectly classified as
malicious.

XGBoost: XGBoost’s gradient boosting technique and regularisation parameters
make it robust to different data distributions. SMOTE substantially improved recall
and reduced FNR, but drastically reduced precision and significantly increased FPR,
suggesting overfitting to synthetic samples.

In conclusion, ensemble methods such as RF, XGBoost and ADABoost demonstrate
superior performance both with and without resampling. This is consistent with the
findings from related works. In practical network traffic classification scenarios,
the distribution of malicious and benign data instances often exhibits a significant
class imbalance. The ability of a model to handle imbalanced datasets and to capture
nonlinear relationships within the data is therefore crucial, and models with this
ability offer a more viable solution for accurate network traffic classification.

It is important to consider that there are specific situations where applying SMOTE
may not be appropriate. Its applicability is contingent upon several factors: it may
cause overfitting on small datasets, exacerbate noise issues, become less effective
in high-dimensional data, fail to represent complex data distributions accurately,
require significant computational resources, prove impractical for real-time
applications, and mislead models when minority class outliers are present. These
considerations underscore the importance of judiciously assessing the suitability of
SMOTE for a given dataset and problem context.

6.1.1 Feature Selection
The importance of feature selection has previously been mentioned in section 2.4.2.
Models such as RFs and DTs are less dependent on prior feature selection because of
their internal mechanisms for handling features, although prior feature selection can
still improve their performance by reducing noise and improving interpretability.
This makes them particularly useful in situations where automated feature selection
is challenging. However, models based on Bayes’ theorem, such as GNB, are sensitive
to correlated features. This sensitivity arises because the Naive Bayes algorithm
assumes that the features are conditionally independent given the class label, an
assumption that rarely holds in real-world datasets. This is a possible reason why
GNB performs worse than the other models.

Some works, as mentioned in Section 3.2, have opted to include features such as
source and destination addresses, as well as source and destination ports, as part of
their feature set. However, we dropped these features before applying any statistical
feature selection, for two reasons. Firstly, relying on addresses and ports for
classification may introduce noise and reduce the generalisation capability of the
model. By excluding these features, the model becomes more robust to changes in
network topology and configuration, enhancing its stability and performance across
different network environments. Secondly, addresses and ports may not always be
indicative of malicious activity or meaningful patterns in network traffic. While
certain IP addresses or port numbers may be associated with known malicious entities
or services, relying solely on this information for classification can lead to false
positives or missed anomalies.

In our case, ANOVA F-values were used for feature selection because this lightweight,
filter-based approach preserves interpretability, unlike PCA. ANOVA is particularly
beneficial as it evaluates the significance of each feature in relation to the target vari-
able, making it easier to understand which features contribute most to the model’s
predictions. However, as it does not accommodate the correlation between features
themselves, CFS was used as a complementary method.

6.2 Identified Backdoor Communication Behaviour

The insights from the data visualisations and statistics in Section 5.1 draw a picture
of the behavioural differences between benign and malicious network flows.
Specifically, the longer flow durations and larger sizes of outgoing packets compared
to incoming ones in malicious data align with backdoor behaviours designed to
maintain persistent connections and facilitate data exfiltration.

The significantly steeper slope observed for outgoing versus incoming packets in
malicious data, as illustrated in Figure 5.5, supports the hypothesis that malicious
entities predominantly send data rather than receive it. This observation not only
supports the feature selection strategy for the ML models but also underscores the
importance of incorporating metrics such as the OpktByIpkt ratio, a key indicator of
malicious activity, as validated by several studies discussed in Section 3.3.

Looking at the techniques utilised by malicious samples in our data, as presented in
Section 5.1.1 and Table 5.2, we can make some interesting inferences. The utilisation
of Dynamic Resolution (T1568) underscores the increasing sophistication of C2
communication strategies. By leveraging Dynamic DNS (DDNS), malware can ensure
uninterrupted contact with C2 servers despite IP address changes, thus enhancing its
resilience against conventional IP blocking and takedown efforts [44]. This
adaptability complicates efforts to track and dismantle malicious infrastructure,
underscoring the need for detection mechanisms that go beyond IP blocklists and
analyse patterns in network traffic as a way to counteract threat actors’ evasion
strategies.

Additionally, the deployment of virtualisation and sandbox evasion techniques (T1497)
[45] and execution flow hijacking (T1574) [46] signifies a deliberate effort by attack-
ers to evade detection and maintain control over infected systems. The prevalent
use of network and system discovery techniques (T1049, T1016) further illustrates
the comprehensive approach malware authors take to map out and exploit network
environments, facilitating lateral movement and deeper infiltration.

These insights into the sampled malware’s operational tactics and behavioural pat-
terns reveal the complexity and sophistication of modern malware strategies. The
demonstrated adaptability in communication methods and evasion tactics calls for
detection systems that analyse traffic patterns beyond traditional signature-based
and IP-based methods. Notably, since specific patterns emerge from the malicious
network data, our detection mechanisms must leverage this information.

6.3 Quality of Data

This section addresses key challenges and limitations associated with data quality,
exploring how these factors influence the integrity and efficacy of our analysis. We
will discuss the impact of data capture duration and variability in data collection,
which are critical for developing reliable ML models and ensuring accurate anomaly
detection.

6.3.1 Period of Recorded Data
The duration of data capture significantly impacts the quality of benign data used in
our analysis. While capturing data over 24 hours provides a comprehensive dataset
beneficial for robust ML training, it also substantially increases resource demands.
Since most benign network traffic consists of short flows, often smaller than 10 KB
and lasting a few hundred milliseconds [47], a 24-hour capture period may not
always be necessary.

Several factors influence the decision on capture duration. For endpoint user traffic,
capturing data over an entire day or more may be essential to accommodate daily
behavioural variations and ensure comprehensive coverage. In controlled
environments like corporate network segments with repetitive tasks, such as the one
we examine for Recorded Future, 24 hours or less may be optimal for capturing a
complete cycle of network interactions.

In contrast, the analysis of malicious activities, particularly those involving per-
sistent backdoor connections, may benefit from extended capture periods to fully
observe potential data exfiltration and keep-alive connections. The primary goal
is to ensure that the capture duration aligns with the full range of observable traf-
fic patterns, thereby improving the accuracy and reliability of anomaly detection
models. However, the limitations imposed by Triage’s sandbox environment, which
restrict our malicious samples to a maximum of 60 minutes of runtime, can truncate
longer-lasting behaviours, potentially skewing our analysis.

Disparities in data collection durations across datasets can further influence the
variance and comparability of flow durations, as seen when comparing normalised with
non-normalised durations in Table 5.4 and Table 5.3. The non-normalised durations
would indicate that benign flows last longer than malicious flows on average. This
is to be expected, since the benign data was captured over a period 24 times longer.
After normalising the durations, the opposite trend emerges. This reversal
highlights the need to account for capture durations when examining flow data and
underscores the need for customised capture strategies based on expected network
behaviour.
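
The effect of normalisation can be illustrated with a small Python sketch. It assumes that a flow duration is normalised by dividing it by the length of the capture window in which it was recorded. This is one plausible scheme and not necessarily the exact procedure behind Table 5.4. The two raw durations are hypothetical values chosen only to show how the ordering can flip. The window lengths follow the 24-hour benign capture and the 60-minute sandbox limit described above.

```python
def normalise_duration(flow_duration_s: float, capture_window_s: float) -> float:
    """Express a flow duration as a fraction of its capture window.

    Assumed scheme for illustration: duration / window length.
    """
    if capture_window_s <= 0:
        raise ValueError("capture window must be positive")
    return flow_duration_s / capture_window_s


BENIGN_WINDOW_S = 24 * 60 * 60   # 24-hour benign capture
MALICIOUS_WINDOW_S = 60 * 60     # 60-minute sandbox run

# Hypothetical mean flow durations in seconds, not values from our tables.
benign_raw = 1200.0
malicious_raw = 300.0

benign_norm = normalise_duration(benign_raw, BENIGN_WINDOW_S)
malicious_norm = normalise_duration(malicious_raw, MALICIOUS_WINDOW_S)

# Raw durations suggest benign flows are longer; normalised values reverse this.
print(f"raw:        benign={benign_raw:.0f}s malicious={malicious_raw:.0f}s")
print(f"normalised: benign={benign_norm:.4f} malicious={malicious_norm:.4f}")
```

With these example values the benign flow is four times longer in absolute terms, yet it occupies a smaller share of its capture window than the malicious flow does.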

6.3.2 Variation of Benign Data
A major challenge in developing an optimal detection model is the variability of
benign data. Detection frameworks tend to perform better when network traffic
patterns are predictable and consistent [1]. Such predictability aids in training
ML models by clearly defining "normal" traffic. Because backdoor communications
can mimic a range of benign traffic types, which inherently complicates detection,
focusing on specific network subnets with uniform traffic would produce the best
outcome. For example, the benign data we utilised from Recorded Future originated
from two specific network segments designed to perform well-defined tasks. This
intentional segmentation within their network infrastructure allowed us to obtain
data with well-defined characteristics of normal traffic. Consequently, this
coherent data simplified the task of training our model to accurately identify what
constitutes normal traffic patterns.

6.3.3 Variation of Malicious Data
A major hurdle in this project was obtaining high-quality network data from Linux-
based backdoor malware. To mitigate the risk of using outdated malware with
inactive C