High-Speed, Low-Latency, and Secure
Networking with P4
A study of the programming language P4 and its potential use
at Saab Surveillance

Master’s thesis in Communication Engineering

Oskar Claeson & William Kruse

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2021





© Oskar Claeson & William Kruse, 2021.

Supervisor: David Olsson, SAAB Surveillance, X Innovations Lab.
Supervisor: Carl Kylin, Division of Communications, Antennas, and Optical Networks.
Examiner: Erik Ström, Division of Communications, Antennas, and Optical Networks.

Master’s thesis EENX30
Department of Electrical Engineering
Division of Communications, Antennas, and Optical Networks.
Chalmers University of Technology
SE-412 96 Göteborg
Sweden
Telephone +46 (0)31 772 1000

Cover: Graphical representation of a P4 enabled switch, created in Blender by William Kruse

Typeset in LaTeX
Gothenburg, Sweden 2021




Abstract
The scientific world is rapidly evolving, pulling society more and more toward digitalization. This
puts pressure on the digital infrastructures such as communications networks. Not only are the
requirements more demanding concerning latency and bandwidth with modern machine- and deep-
learning solutions, but the increasing threat in cyber-security also demands safer and more robust
network solutions.

In this project, both a literature review and an experimental study of the programming language P4
are carried out. The purpose is to investigate whether P4 can help optimize network solutions with
respect to these more demanding requirements, and to assess its potential use in Saab’s radar
solutions. Three concepts utilizing P4 were implemented, evaluated, and discussed. First, the
possibility of writing custom-made protocol stacks was evaluated by creating a custom protocol stack,
denoted Shorternet, which modifies a standard link-layer protocol and completely removes the internet
layer. Second, In-Band Network Telemetry, a way to gather data about network usage and traffic,
was implemented such that the last switch in a flow creates a telemetry report with information
collected from the switches that were traversed. Third, a data plane firewall was added to switches
inside a network, denying access to network devices that were not part of the initial network.
Shorternet provided promising results, with not only increased effective bandwidth, mainly for smaller
packets, but also reduced latency and jitter. In-Band Network Telemetry proved successful, but the
increased overhead introduced a significant trade-off with bandwidth and latency. The firewall was
easily implemented within the switches and managed to block access at the link-layer level using a
single rule. This showed potential for further expansion and suggests that the addition of
a controller could provide easier management and flexibility. P4 showed great potential in several
areas and may well become a staple in modern networking solutions with its added flexibility for
development.

Index Terms: Data plane programming, P4, TCP/IP, Network Telemetry, Congestion, Bandwidth,
Latency, Security, Software Firewall.



Acknowledgements
We would like to give our thanks to SAAB Surveillance and Chalmers University of Technology for
the opportunity to do this master’s thesis.

We would also like to give thanks to a couple of people at SAAB Surveillance. David Olsson, our
supervisor, for the help and guidance during the project and Sven Nilsson for providing the support
and resources needed to complete the project. Also, to our supervisor at Chalmers, Carl Kylin, we
give thanks for the valuable help with the thesis writing.

Oskar Claeson & William Kruse, Göteborg, August, 2021



List of abbreviations
This section contains a list of the most commonly occurring abbreviations in the report, so that the
reader can become familiar with them beforehand and refer back to them when needed.

Abbreviation   Term

Bmv2           Behavioral Model V2
CC             Congestion Control
DNS            Domain Name System
IETF           Internet Engineering Task Force
INT            In-Band Network Telemetry
IP             Internet Protocol
MAC            Media Access Control
MTU            Maximum Transmission Unit
PISA           Protocol-Independent Switch Architecture
PSA            Portable Switch Architecture
SDN            Software Defined Network(ing)
TCP            Transmission Control Protocol
TTL            Time-To-Live
UDP            User Datagram Protocol



Contents

1 Introduction
1.1 Background
1.2 Scope
1.3 Ethics
1.4 Related research

2 Theory
2.1 TCP/IP model
2.1.1 Ethernet and IPv4 headers
2.2 P4
2.2.1 Architectures
2.2.2 Behavioural model
2.2.3 In-band network telemetry
2.2.3a INT modes of operation
2.2.3b INT-header placement for INT-MD
2.2.3c Applications with INT
2.2.4 Firewall

3 Method
3.1 Testing environment
3.2 Removing unused information in packets
3.2.1 Benchmark test
3.3 In-Band Network Telemetry
3.3.1 Switch logic
3.3.2 Benchmark
3.4 Data plane firewall

4 Results and Discussion
4.1 Shorternet
4.1.1 Effective bandwidth
4.1.2 Latency and jitter
4.1.3 General discussion about Shorternet
4.2 INT
4.3 General security aspects
4.3.1 Security concerns and bugs
4.3.2 Cryptographic Hash
4.4 Firewall
4.5 Future work

5 Conclusion

References



1 Introduction

The project was performed together with Saab Surveillance and evaluated the idea of implementing
P4 programs in network infrastructure to optimize traffic flow and improve security within Saab’s
radar solutions.

1.1 Background

There is a notable increase of interest within the scientific world in using as much of the raw data
provided as possible. This differs considerably from the more traditional approach, which focuses on
extracting the relevant data and discarding the rest. Signal and data processing in radar systems is
no exception: with the addition of modern machine- and deep-learning solutions, the amount of
transmitted data is increasing. For Saab Surveillance this has led to new demands on the networking
infrastructure, which must be able to handle massive amounts of data with low latency while retaining
high levels of security. One possible way to update this infrastructure is to implement custom-made
protocols using a relatively new domain-specific programming language known as P4. The traditional
way of implementing network infrastructure involves a bottom-up design, where the hardware defines the
network functions due to its limitations. With P4, however, a top-down design becomes possible, where
the network is designed according to what the developers want or need rather than being limited by
the hardware. Multiple network hardware manufacturers have already launched programmable hardware
and are continuously developing new versions. This has enabled the use of P4 to create custom-made
networking solutions deep in the TCP/IP stack, which may lead to improved performance of future
computer networks.

1.2 Scope

The report aims to investigate and evaluate whether the programming language P4 can optimize
and secure network traffic in radar systems. Custom-made P4 protocols and applications are tested,
evaluated, and discussed with regard to bandwidth, latency, jitter, and security, and compared
to traditional TCP/IP networks. Furthermore, a literature review is made to research potential
applications of P4. The project is limited to comparing custom-made protocols with TCP/IP networks
based on Ethernet (802.3-2018) and IPv4. The layers investigated during the project are the
transport, network, and link layers. The project is also limited to software-based solutions.



1.3 Ethics
The purpose of the project is to evaluate P4, which is a programming language for software-defined
data planes in computer networks. Traditionally, the data plane functionality is hardware defined,
and network devices need to be replaced in order to implement new network functionality. With
software-defined data planes there may be fewer occasions where network devices need to be replaced,
which could result in less electronic waste. At the same time, obtaining software-defined data plane
functionality in a traditional network requires that the corresponding network device or devices
be replaced.

Regarding the method in this project, there are no clear ethical problems at hand. The project is
entirely based on simulations and emulations of network functionality; thus, no one is affected by
the method, nor is any waste produced.

The results from this project may lead to further studies of software defined data plane programming
as the intended purpose of this project is to evaluate if P4 can be utilized to improve network
performance and security. Since this project is carried out in collaboration with the defense company
Saab AB, there may be derived works intended for military purposes from this project.

1.4 Related research
As P4 is a relatively new and trending language that is seemingly becoming the de facto language for
data plane programming, several research articles, short papers, etc. are being published about P4.
The working group developing P4 has released several publications together with specifications of the
language. These can be found at [1].



2 Theory

In this chapter some theoretical knowledge related to the project is presented. First, the TCP/IP
communication model is explained. Secondly, Ethernet and IPv4 headers are presented explicitly.
Lastly, P4 and its components and applications are described.

2.1 TCP/IP model

The TCP/IP model as described by the IETF [2] is layered as shown in figure 2.1. Each layer
incorporates different protocols depending on the needs of the host. The application layer is the top
layer, in which user-specific protocols reside, e.g., HTTPS and FTP, but also system functions such as
DNS. The transport layer primarily uses two protocols, TCP and UDP. TCP [3] is a reliable
connection-oriented protocol which ensures the arrival of the packet, where connection-oriented
means that the protocol establishes a connection before sending data. UDP [4] is a best-effort
"datagram" transport service that strives to minimize the protocol overhead, thus providing a fast but
unreliable service. IP is one of the protocols that reside in the Internet layer [2]. IP comes in
two versions, IPv4 [5] and IPv6. IPv6 has multiple improvements compared to IPv4, of which the most
considerable include an increased address field size and a simpler header format [6]. The
link layer [2] houses protocols that provide the required interface to the connected network. Ethernet
(IEEE 802.3) is a widely used link-layer protocol family, which uses MAC addresses to communicate
with other devices.

Figure 2.1: The communication layers of the TCP/IP model. The Link layer is often referred to as
layer two (L2), the Internet layer as layer three (L3), and the Transport layer as layer four (L4).



2.1.1 Ethernet and IPv4 headers
The frame format of the Ethernet standard [7, p. 118] can be seen in figure 2.2. The full packet
structure is not described, see the Ethernet standard for details. The frame fields are described as
follows in the Ethernet standard:

• The Destination Address and Source Address fields specify the destination and source of a
packet. Each field is 48 bits long and consists of two flag bits and 46 address bits. In the
Destination Address field the first flag bit indicates whether the address is an individual address
or a group address. In the Source Address field this bit is always 0. The second flag bit tells
whether the MAC address is locally or globally administered.

• The Length/Type field specifies either the length of the data field or the Ethertype. The value of
the field indicates how it should be interpreted: if the value is less than or equal to 0x05DC (1500)
the field specifies the length, and if it is greater than or equal to 0x0600 (1536) it specifies the
Ethertype. The Ethertype indicates which protocol is above Ethernet in the protocol stack.

• The Payload field contains the data.

• The Pad field is appended if the frame would otherwise be shorter than 64 bytes. Taking the header
and FCS sizes into account, the payload and pad together need to be at least 46 bytes.

• The Frame Check Sequence (FCS) contains a 4-byte cyclic redundancy check (CRC) value,
which is used to verify that the received packet is correct.

Figure 2.2: Ethernet frame structure, separated in bytes. The Destination Address and Source
Address fields use six bytes each. The Length/Type field uses two bytes. The Payload and Pad fields
are of variable length. The last four bytes contain the Frame Check Sequence (FCS).
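
As a rough illustration of the header fields described above, the following Python sketch splits the first 14 bytes of a frame into the two addresses and the Length/Type field and applies the 0x05DC/0x0600 rule. The function name, the dictionary keys and the sample frame are made up for this example. The sketch assumes the common byte representation in which the two flag bits are the two least-significant bits of the first address byte, and it does not handle the preamble, padding or FCS.

```python
import struct


def parse_ethernet_header(frame: bytes) -> dict:
    """Split the first 14 bytes of an Ethernet frame into its header fields."""
    if len(frame) < 14:
        raise ValueError("an Ethernet header needs at least 14 bytes")
    dst, src, length_or_type = struct.unpack("!6s6sH", frame[:14])

    # Interpret the Length/Type field according to its value.
    if length_or_type <= 0x05DC:
        kind = "length"
    elif length_or_type >= 0x0600:
        kind = "ethertype"
    else:
        kind = "undefined"  # values in between are not covered by the rule above

    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "length_or_type": length_or_type,
        "kind": kind,
        # Assumption: flag bits are the two low bits of the first address byte.
        "dst_is_group": bool(dst[0] & 0x01),
        "dst_is_local": bool(dst[0] & 0x02),
    }


# Hypothetical sample: broadcast destination, locally administered source, IPv4 Ethertype.
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"\x00" * 46
header = parse_ethernet_header(frame)
print(header)
```

With the sample frame, the Length/Type value 0x0800 is above 0x0600 and is therefore reported as an Ethertype rather than a length.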

The IPv4 datagram [5], see figure 2.3, consists of several fields. The maximum allowed length is
65,535 bytes, but hosts are only required to be able to accept datagrams of up to 576 bytes. The
header portion of the IPv4 datagram has a maximum size of 60 bytes, but the typical size is 20
bytes according to [5].

The specific fields and their lengths are defined in [5] and are as follows:

• The Version field specifies the version of the IP header.

• The IHL field specifies the length of the IP header in 32-bit words.

• The Type of Service field specifies what quality of service may be desired by the host. This
includes but is not limited to: precedence, delay, throughput and reliability. The use of these
parameters is up to the particular network to implement.

• The Total Length field specifies the total length of the header and data in bytes.

• The Identification field is assigned by the sender to aid in reassembling the fragments of a
packet.

• The Flags field includes control flags that indicate whether a packet is allowed to be fragmented
and whether more fragments follow.

• The Fragment Offset field specifies the position of the current fragment relative to the start of
the original packet.

• The Time to Live field specifies how many hops the packet has available. This value is set by the
sender and then decremented at each hop.

• The Protocol field specifies the protocol used for the data in the IPv4 datagram, see [8] for the
full list.

• The Header Checksum field includes a checksum of the header only, which is recomputed at each
hop since header fields such as Time to Live change.

• The Source Address and Destination Address fields contain the 32-bit IP addresses of the
source and destination, respectively.

• The Options field is a variable-length field, which may optionally be used to provide more
functionality. For a full description of the Options field, see [5].

• The Padding field ensures that the IPv4 header length is a multiple of 32 bits when the Options
field is present.

Figure 2.3: IPv4 header structure, separated in bits. Each row represents four bytes. The Options
field is optional and of varying length. The Padding field is added to ensure that the IPv4 header
length is a multiple of 32 bits when the Options field is present.
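
The field list above can be made concrete with a small Python sketch that unpacks the fixed 20-byte part of an IPv4 header and checks the header checksum. This is only an illustration under stated assumptions: the function names, dictionary keys and sample addresses are invented for the example, options are not decoded, and the checksum routine is the usual one's-complement sum of 16-bit words rather than code taken from the thesis.

```python
import struct

IPV4_FORMAT = "!BBHHHBBH4s4s"  # the fixed 20-byte part of the header


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed header fields; options are skipped, not interpreted."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header needs at least 20 bytes")
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack(IPV4_FORMAT, packet[:20])
    header_bytes = (ver_ihl & 0x0F) * 4  # IHL counts 32-bit words
    if header_bytes < 20 or len(packet) < header_bytes:
        raise ValueError("invalid IHL")
    return {
        "version": ver_ihl >> 4,
        "header_bytes": header_bytes,
        "total_length": total_length,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": protocol,
        # A header with a correct checksum sums to zero after complementing.
        "checksum_ok": internet_checksum(packet[:header_bytes]) == 0,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


def build_sample_header(ttl: int = 64) -> bytes:
    """Build a header without options; all field values are made up."""
    fields = [0x45, 0, 28, 1, 0x4000, ttl, 17, 0,
              bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])]
    fields[7] = internet_checksum(struct.pack(IPV4_FORMAT, *fields))
    return struct.pack(IPV4_FORMAT, *fields)


header = build_sample_header()
info = parse_ipv4_header(header)

# A hop that decrements TTL without updating the checksum leaves a stale checksum.
stale = bytearray(header)
stale[8] -= 1
stale_info = parse_ipv4_header(bytes(stale))
print(info["checksum_ok"], stale_info["checksum_ok"])
```

The second parse shows why the checksum has to be updated at each hop: changing only the Time to Live field makes the stored checksum invalid.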



2.2 P4
The programming language P4, which stands for Programming Protocol-independent Packet
Processors, is designed as a means to further improve the functionality of Software-Defined
Networking (SDN) by working in conjunction with SDN control protocols. The design of P4 was
proposed with three main goals in mind [9]. Firstly, to be able to modify the behavior of network
hardware after deployment. Secondly, to introduce protocol independence of the hardware in
order not to be limited to specific network protocols. Lastly, target independence, where the P4
programmer need not consider the specifics of the hardware that is to be controlled. Currently, there
have been two releases of the P4 specification, the first one now known as P4_14 [9] and an updated
version known as P4_16 [10]. The P4_16 version intends to stabilize the language using P4_14 as a
fixed core together with a set of specific libraries suitable for evolution as well as other
flexibility mechanisms. The core library should be included in every P4 program by adding the
definition file core.p4.

2.2.1 Architectures
When writing a program in P4, a definition of the target is needed in order for a compiler to
properly compile the program to the hardware. This is where the concept of architectures comes into
play. An architecture, as described in the P4_16 specification [10], is a .p4 definition file similar
to core.p4, which vendors are intended to provide to programmers so that they can compile to the
hardware. Having access to several architecture definitions can be seen as enabling portability of a
P4 program to multiple different targets.

The programmable pipeline of a switch is often referred to on an architectural level as a
Protocol-Independent Switch Architecture (PISA). The PISA is based on three parts, all of which are
programmable: a parser, a match-action pipeline, and a deparser. A simplified view of the model is
shown in figure 2.4. Stephen Ibanez, a PhD student with connections to the P4 consortium, briefly
discusses the PISA at the beginning of an introduction video made by the consortium [11]. The parser
can be seen as a finite state machine that reads the headers of incoming packets and extracts
identifying fields which are to be matched in the next stage. In the match-action pipeline the
extracted data is passed through three stages. Firstly, a checksum verification which is intended to
catch erroneous packets. Secondly, the packets are passed along one or more match-action units, which
are programmed to perform a match on the fields and potentially follow up with an action. Thirdly, a
checksum update is performed, since the match-action stages may have modified the packet. Finally,
packets arrive at the deparser, which re-serializes the headers together with the payload before the
packet is passed through to the output.

Figure 2.4: Simple description of the block structure flow for the PISA. First, a packet arrives at the
programmable parser and afterwards it is forwarded to the programmable match+action pipeline,
which includes a checksum verification, match+action table, and checksum update. Lastly, the
packet is forwarded to the programmable deparser.

To illustrate the workflow through the PISA, a simple example is presented: an IP forwarding
procedure. A packet consisting of an Ethernet header, an IPv4 header, and a payload arrives
at the parser. The parser extracts the first header, which is the Ethernet header, and reads the
information. One of the fields in the Ethernet header is the type field containing information about
the following protocol, in this case IPv4, which tells the parser that the next state is to parse the
IPv4 header. After the IPv4 header is extracted, any following protocol such as TCP or UDP is usually
parsed as well; in this simplified case, however, this is ignored since only the IPv4 and Ethernet
headers are of interest. Instead the packet is directly accepted after IPv4 is parsed and is then
passed through to the match-action pipeline. Here the packet together with its metadata is processed,
and potentially some actions are performed depending on matching table entries. In this case the
forwarding is supposed to send the packet to the correct output port connected to the next hop
according to the forwarding table lookup. Thus, the matching is based on the IPv4 destination address.
The packet is now either dropped due to a faulty address or forwarded accordingly. At the deparser the
two headers, Ethernet and IPv4, are then re-serialized in the correct order and passed to the output
of the switch, ending the forwarding workflow example.
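
The forwarding example can be mimicked in ordinary software to make the three PISA parts more tangible. The Python sketch below is not P4 and not the implementation used in this project; it is a minimal model under stated assumptions: the table entries, MAC addresses, port numbers and function names are all hypothetical, the lookup uses longest-prefix matching, the action also rewrites the destination MAC (an assumption beyond the text), and the checksum verification and update stages are left out for brevity.

```python
import ipaddress
import struct

ETHERTYPE_IPV4 = 0x0800

# Hypothetical forwarding table: (prefix, next-hop MAC, output port).
FORWARDING_TABLE = [
    (ipaddress.ip_network("10.0.1.0/24"), bytes.fromhex("020000000101"), 1),
    (ipaddress.ip_network("10.0.0.0/16"), bytes.fromhex("020000000202"), 2),
]


def parse(packet: bytes):
    """Parser: a small state machine, start -> ethernet -> (ipv4) -> accept."""
    dst, src, ethertype = struct.unpack("!6s6sH", packet[:14])
    headers = {"ethernet": {"dst": dst, "src": src, "type": ethertype}}
    offset = 14
    if ethertype == ETHERTYPE_IPV4:
        # Assumes a 20-byte IPv4 header without options.
        headers["ipv4"] = bytearray(packet[offset:offset + 20])
        offset += 20
    return headers, packet[offset:]


def match_action(headers):
    """Match on the IPv4 destination; return an output port, or None to drop."""
    if "ipv4" not in headers:
        return None
    dst_ip = ipaddress.ip_address(bytes(headers["ipv4"][16:20]))
    matches = [entry for entry in FORWARDING_TABLE if dst_ip in entry[0]]
    if not matches:
        return None  # no table entry: drop the packet
    _, next_hop_mac, port = max(matches, key=lambda entry: entry[0].prefixlen)
    headers["ethernet"]["dst"] = next_hop_mac  # action: rewrite destination MAC
    return port


def deparse(headers, payload: bytes) -> bytes:
    """Deparser: re-serialize the headers in order, followed by the payload."""
    eth = headers["ethernet"]
    out = struct.pack("!6s6sH", eth["dst"], eth["src"], eth["type"])
    if "ipv4" in headers:
        out += bytes(headers["ipv4"])
    return out + payload


def switch(packet: bytes):
    headers, payload = parse(packet)
    port = match_action(headers)
    if port is None:
        return None
    return port, deparse(headers, payload)


def make_packet(dst_ip: str) -> bytes:
    """Build a made-up Ethernet + IPv4 packet for the example."""
    eth = struct.pack("!6s6sH", bytes.fromhex("0200000000aa"),
                      bytes.fromhex("0200000000bb"), ETHERTYPE_IPV4)
    ipv4 = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 25, 0, 0, 64, 17, 0,
                       ipaddress.ip_address("10.0.0.1").packed,
                       ipaddress.ip_address(dst_ip).packed)
    return eth + ipv4 + b"hello"


forwarded = switch(make_packet("10.0.1.7"))
dropped = switch(make_packet("192.168.0.1"))
print(forwarded[0], dropped)
```

In a real P4 program the same structure appears as a parser block, one or more tables with actions, and a deparser block, with the table entries installed by the control plane rather than hard-coded.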

The P4 consortium has been working on a way of abstracting the hardware pipeline, which is needed
in order not to be bound to any specific switching chip. The work-in-progress specification for the
target architecture is denoted Portable Switch Architecture (PSA) [12], which consists of six P4
programmable blocks together with two fixed-function blocks. The two fixed-function blocks are
called "Packet buffer and Replication Engine (PRE)" and "Buffer Queuing Engine (BQE)", both of
which are target dependent. Comparing the PSA with the simple PISA, see figures 2.4 & 2.5, it can
be noted that the programmable blocks in the PSA can be seen as two PISA structures, one for ingress
and the other for egress. This means that the first section takes packets arriving at the switch and
processes them before sending them to the PRE; after the PRE the packets are processed once again
before being sent to the BQE.

Figure 2.5: Simple description of the block structure flow for the PSA. A packet first arrives at an
ingress process consisting of a parser, match+action pipeline, and deparser. Then it is forwarded to
the fixed function PRE. The PRE then forwards the packet to an egress process similar to the ingress
process before it is forwarded to the fixed function BQE.

While the P4 consortium is working on the PSA, which is intended to be a one-size-fits-all solution,
a simpler architecture that is commonly used by P4 programmers at the moment is the V1model. The
V1model architecture is based on P4_16 but is designed to mimic the functionality of a P4_14 switch,
in order to be able to translate older programs to the later version. V1model is structured with six
programmable blocks and a fixed-function traffic manager, which are visualized in figure 2.6. The
traffic manager handles packet queuing, replication and scheduling, similar to the purpose of the
fixed-function blocks in the PSA model. The idea behind the V1model is for programmers to be able to
utilize and translate P4_14 programs until the PSA, which will provide more functionality, has been
fully defined. As the V1model is not the focus of the P4 developers, there are some flaws and
"illegal" actions that are accepted by compilers but will not work properly in the hardware, which
may cause some issues.

Figure 2.6: Simple description of the block structure flow for the V1model architecture. First, a
packet arrives at an ingress process consisting of a parser and match+action pipeline, which does not
include the checksum update. Then the packet is forwarded to a traffic manager before it is
forwarded to an egress process. The egress process consists of a match+action pipeline without a
checksum verification and ends with the packet arriving at the deparser.

Currently, switch vendors are working on defining their own architectures. For example, the company
Barefoot (owned by Intel since 2019) has developed its own architecture called Tofino Native
Architecture (TNA) for its Tofino switches, which is not described here. The language
specification of P4_16 [10] states that manufacturers of P4-enabled switches are expected to
provide an accompanying architecture definition with their product. The reasoning behind this is that
vendors can then express the uniqueness of their chips and build upon them for newer releases
without being bound by standards. The architecture definition, however, does not need to express every
functionality of the target; the manufacturer may even opt to have several definitions for the same
target but with different capabilities.

2.2.2 Behavioural model
The behavioural model version 2 (Bmv2) switch [13] is a reference P4 software switch that is available
as different P4 targets such as simple_switch (V1Model), simple_switch_grpc (V1Model) and
psa_switch (PSA). Bmv2 targets use a JSON file, which is generated when a .p4 program is compiled,
to apply the behavior of said program. Mininet, which is a network emulator [14], can then
be used to emulate networks consisting of Bmv2 switches loaded with the programs.

The performance of Bmv2 can potentially be affected by a number of factors [15]. These include,
but are not limited to, system hardware, the number of match-action tables and entries in the .p4
program installed on the switch, the host OS, and the version of the Bmv2 switch. The P4 consortium
claims that Bmv2 is rated up to around 1 Gb/s. Antonin Bas, a developer for P4, mentions in a GitHub
issue [16] that the ingress pipeline is often the bottleneck of the program. Bas also mentions that
the throughput depends linearly on the number of match-action tables.

2.2.3 In-band network telemetry
The In-Band Network Telemetry (INT) specification [17] was initially introduced by the P4.org
Applications Working Group in 2015. The current version 2.1 was released in 2020. It proposes
a new way to gather telemetry data in networks without involving the control plane. This data can be
used for multiple applications, which are mentioned in section 2.2.3c. The specification defines data
that can be gathered in each switch, such as node ID, Egress/Ingress interface ID, hop latency and
Egress/Ingress timestamps.

There are multiple headers defined in the INT format. INT-shim is a header that indicates which mode
is used and what protocol is next in the protocol stack. The different modes are presented in the next
section. INT-header can contain information such as version, per-hop length of headers and data,
remaining hops and instructions. INT-data contains the data from each switch. INT report
header is a header which precedes the telemetry report data. The telemetry report format is specified
in [18], which is published by the P4.org Applications Working Group.

There are three different switch roles defined in an INT-flow. INT-source is the first P4 switch in a
flow. The INT-source can add the initial INT-headers and INT-data. Depending on the mode, which
is explained in the next sub-section, INT-source sends reports to the monitoring system or adds the
initial INT-headers and INT-data to the packets. INT transit-nodes are the subsequent P4 switches in
the flow, which can read INT-headers and act accordingly. INT transit-nodes send reports directly to
the monitoring system or append their data depending on the mode. INT-sink is the last P4 switch in a
flow. The INT-sink is able to remove all INT-headers and INT-data that may be present and forwards the
original packet to the destination.

2.2.3a INT modes of operation

The INT specification [17] defines three modes of operation: INT-XD (eXport Data), INT-MX
(eMbed instruct(X)ions) and INT-MD (eMbed Data). A visualization of INT-XD, INT-MX and INT-MD
can be seen in figures 2.7, 2.8 and 2.9, respectively. With INT-XD each node in the network sends
reports to the monitoring system and the packet from the source host is not altered at any point. For
INT-MX, an INT header is added at the source node, which includes some protocol information and
instructions that successive nodes can read and act upon. With INT-MD, an INT header and the data
from each node are appended to the packet; a report is then created at the sink and sent to the
monitoring system. INT-MD is the mode focused on in this project.

Figure 2.7: A description of the consecutive packet header formats and how reports are sent to the
monitoring system when using the INT-XD mode of operation.



Figure 2.8: A description of the consecutive packet header formats and how reports are sent to the
monitoring system when using the INT-MX mode of operation, where INT headers are added at
INT-source. INT headers may include INT-shim and INT-header.

Figure 2.9: A description of the consecutive packet header formats and how reports are sent to the
monitoring system when using the INT-MD mode of operation, where INT headers and INT-data are
added to the packet at each hop. INT headers may include INT-shim and INT-header.
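
To make the INT-MD flow more concrete, the following Python sketch models a packet passing through a source, a transit node and a sink. It is a simplified illustration, not the header layout of the specification: the class and function names, the metadata fields (node ID and hop latency only), the hop limit and the report format are all assumptions made for this example, and the sink is assumed to add its own data before creating the report.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Packet:
    payload: bytes
    int_header: Optional[dict] = None              # present only while INT is embedded
    int_data: list = field(default_factory=list)   # one entry per traversed node


def add_metadata(packet: Packet, node_id: int, hop_latency_ns: int) -> Packet:
    """Append this node's data if INT is present and hops remain."""
    if packet.int_header is None or packet.int_header["remaining_hops"] == 0:
        return packet
    packet.int_data.append({"node_id": node_id, "hop_latency_ns": hop_latency_ns})
    packet.int_header["remaining_hops"] -= 1
    return packet


def int_source(packet: Packet, node_id: int, hop_latency_ns: int,
               max_hops: int = 8) -> Packet:
    """First switch: add the INT header, then its own data."""
    packet.int_header = {"remaining_hops": max_hops}
    return add_metadata(packet, node_id, hop_latency_ns)


def int_transit(packet: Packet, node_id: int, hop_latency_ns: int) -> Packet:
    """Intermediate switch: only appends its own data."""
    return add_metadata(packet, node_id, hop_latency_ns)


def int_sink(packet: Packet, node_id: int, hop_latency_ns: int):
    """Last switch: add data, build the report, and restore the original packet."""
    add_metadata(packet, node_id, hop_latency_ns)
    report = {
        "hops": list(packet.int_data),
        "total_latency_ns": sum(h["hop_latency_ns"] for h in packet.int_data),
    }
    restored = Packet(payload=packet.payload)  # INT header and data removed
    return restored, report


# Hypothetical three-switch path with made-up latencies.
pkt = Packet(payload=b"radar sample")
pkt = int_source(pkt, node_id=1, hop_latency_ns=1200)
pkt = int_transit(pkt, node_id=2, hop_latency_ns=800)
delivered, report = int_sink(pkt, node_id=3, hop_latency_ns=1500)
print(report)
```

The sketch also hints at the cost of this mode: the amount of data carried in the packet grows with every hop until the sink removes it.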

2.2.3b INT-header placement for INT-MD

INT headers and INT-data can be placed at different places in packets. A list of examples is available
in the INT-specification [17], however, it is entirely up to the network programmer to decide where
INT is placed and what headers to use. In all cases the INT-source adds the INT headers and the
headers that correspond to a certain mode of operation to the packet. A couple of commonly used
INT-MD placements over TCP/UDP from the specification can be seen in figure 2.10.

In figure 2.10a INT is placed between the TCP-header and TCP-payload. The DSCP field is set to
0x17 in the IPv4 header to indicate that INT is present. In figure 2.10b INT is placed between the
UDP-header and UDP-payload. In this case the destination port field in the UDP-header is used to
indicate that INT is present. The specification does not, however, currently prescribe a value for
this port.

The specification uses a couple of terms differently from this thesis. INT-metadata in the
specification refers to the INT-header combined with the INT-data. INT-header in the specification
also refers to all the INT headers and data, i.e., INT-shim, INT-header and INT-data combined.

(a) INT placed between TCP header and payload. (b) INT placed between UDP header and payload.

Figure 2.10: Figures with examples of different INT-MD header placements, as specified in the INT
specification [17].

2.2.3c Applications with INT

It is possible to leverage INT in many different applications. Some examples that are presented in this
thesis are Congestion control (CC), load balancing and anomaly detection.

CC aims to limit the bandwidth of the senders to avoid large queues in network devices. Congestion
typically occurs when one or several flows exceed the bandwidth a device is capable of processing
or outputting. When a switch experiences congestion, the latency increases due to longer
queues. With CC the queue build-up can be reduced, thus reducing latency at the cost of bandwidth.
CC was first applied after the internet congestion collapse in 1986, when V. Jacobson released a
congestion avoidance algorithm [19]. With today’s fine-grained metrics, the reason for congestion
and other faults can be pinpointed in detail. For example, Intel Deep Insight claims to be able
to resolve the latency induced on a packet down to nanoseconds [20]. Li et al. proposed HPCC:
High Precision Congestion Control [21], which leverages INT with P4. They implemented a new
CC algorithm and reported up to 95% shorter flow-completion time compared to DCQCN [22] and
TIMELY [23]. DCQCN and TIMELY are earlier congestion control systems for Remote Direct
Memory Access (RDMA) networks. Flow-completion time refers to the time it takes for a flow of
multiple packets to traverse the network. The authors implement a custom form of INT where the
sink sends back a report to the initial sender. The sender can then use this information to control
its bandwidth. By default the algorithm causes a 5% loss in maximum bandwidth, which results in
almost zero queues. The maximum bandwidth is defined as the bandwidth of the link with the lowest
bandwidth capability, i.e. the bottleneck link. The bandwidth loss can be configured if bandwidth is
crucial, but the sent bandwidth cannot exceed the bandwidth capability of the bottleneck link. The
HPCC protocol appends a header of 2 bytes at the source, and 8 bytes of INT data are added at each
hop. The algorithm was designed to improve the congestion algorithms already present in RDMA
over Converged Ethernet Version 2 networks.

Another form of congestion control is load balancing. The difference is that with load
balancing the flows are spread across several switches. A comparison can be made to using two
traffic lanes instead of one. J. Kim et al. implement load balancing with INT [24]. Each switch
gathers and saves network metrics with INT, and the switches then route flows accordingly. They
compare it to Equal-Cost Multi-Path (ECMP) [25] routing and conclude that their system has lower
throughput and longer flow-completion time when congestion is not present in the network. They
explain that this is due to the increased overhead. However, during congestion the throughput is
significantly higher and the flow-completion time is lower.

Load balancing can also be applied by a controller. J. Hyun, N. V. Tu and J. W. Hong discuss and
implement parts of knowledge-defined networking [26]. The idea is to feed the data generated by
INT to machine learning algorithms in a knowledge plane. Insights from the knowledge plane can
then be handed over to the control plane, which can take appropriate actions in the network. The
authors argue that the control plane has difficulties handling the large amount of INT data alone.
The knowledge plane should thus run on another machine to offload the controller. The authors claim
that this system can be used for traffic engineering and anomaly detection.

2.2.4 Firewall
The purpose of a firewall is to filter out unwanted or harmful traffic to provide security within the
network. Generally, a firewall operates at the internet and transport layers of the TCP/IP model,
but may also be able to analyze application-level traffic. Traditionally, firewalls are either hardware
or software based. A hardware-based firewall, also known as a perimeter firewall, protects the network
and the traffic going in or out through the perimeter. A network may have several hardware firewalls
inside the network, and not solely at the edges, to create more boundaries and separate access within
the network. This is useful for managing and controlling how the network functions, but it also
provides security for devices without built-in firewalls, for example printers. Software firewalls, on
the other hand, work as a backup on individual systems for when a breach in the network has already
occurred. The software firewall may detect these breaches by consulting a database, which indicates
whether the traffic is of a malicious nature or not. With the rise of SDN and virtualization, physical
hardware firewalls can instead be virtualized as software implementations to safeguard cloud-based
networks or virtual networks. An example of a virtualized firewall is
VNGuard, proposed by Deng et al. [27], which adaptively finds and places virtual firewalls at entry
points. Hu et al. proposed another SDN-motivated firewall called FlowGuard [28], which intends to
provide firewall protection for OpenFlow-based networks. FlowGuard, however, is required to follow
the constraints of the OpenFlow protocol. With the P4 language, a firewall can be programmed
directly on P4-enabled switches, allowing custom network solutions to be protected at the data
plane level. This may also provide firewalls within the network, which can protect against potential
attacks launched from inside the network. One such firewall implementation is P4Guard, by Datta
et al. [29]. Datta et al. state that P4Guard is an easily configurable software implementation that can
update functionalities and install dynamic firewall rules on-the-fly through data plane programming
utilizing the P4 language. P4Guard leverages a controller which is able to both add and remove
software firewalls dynamically throughout the network. P4Guard is based on the Bmv2 switch, which
limits the performance compared to VNGuard, in a similar fashion as that described in section 2.2.2.



3
Method

This chapter presents the approach and methods used to fulfill the scope of the project. The intention
is to also provide a base upon which readers may understand and replicate the tests and provided
results. First, the testing environment used in the project is described. Second, an experiment to test
the use case of removing redundant information and how it was performed is explained. Third, the
concept of INT and how it was implemented during the project is described. Lastly, a proof of concept
firewall idea is presented and described.

3.1 Testing environment

To be able to conduct the tests and evaluate the programming language and its potential, a virtual
environment was constructed. The environment was emulated using Mininet and consisted of Bmv2
simple_switch_grpc switches that were loaded with P4 software. No SDN controller was used; instead,
table entries were added to the switches at startup with simple_switch_CLI. New entries could also
be added at runtime with simple_switch_CLI if needed. Mininet ran on a virtual machine running
Ubuntu 18.04 with 8 GB of RAM and 4 CPUs. The host machine ran Ubuntu 20.04 with an Intel
i7-10510U processor and 16 GB of 2667 MHz RAM. The software versions used were
Bmv2 (1.14.0), Mininet (20.3.0b2), protobuf (3.6.1), p4c (at commit 24895c1), and PI (at
commit c65fe2e).

3.2 Removing unused information in packets

One of the potential use cases of P4 is the flexibility to easily change, add, and remove protocols
in the standard protocol stack of a packet-switching network. In a high-performance application
where every byte counts, this flexibility could potentially improve the efficiency of the traffic by
removing redundant information. One such case that was tested was to modify the standard layer two
Ethernet frame and to completely remove the layer three protocols. Ethernet was chosen as this
is the most commonly used standard for link-layer communication, especially concerning IP packets.
The motivation behind the modifications was that in an air-gapped and statically defined network
with a limited number of switches and hosts there may not be a need to use full MAC addresses to
uniquely identify the network devices. Furthermore, if the network structure is known, then a routing
process, which utilizes IP, is no longer needed. The modified protocol stack, which is denoted by the
authors as Shorternet, together with the standard TCP/IP stack are visualized in figure 3.1.



With the idea to bypass the standard IPv4/IPv6 protocols, the layer four protocol was placed directly
on top of the modified layer two protocol. This way the header size of the packets is reduced,
which leaves more space for raw payload. However, with this removal the packet no longer has a
Time-to-Live (TTL) field nor a transport layer type field, which usually reside in the IP header.
Therefore, a TTL field is included in the new layer two protocol in order to protect the network from
endlessly propagating packets, and the standard etherType field is changed to identify the layer
four protocol instead. The standard Ethernet header consists of two MAC address fields of 6
bytes each and an etherType field of two bytes. The new and modified layer two protocol instead has
four single-byte fields: two address fields, a new protocol field, and lastly, the TTL field. Thus the
new layer two header has been reduced from 14 bytes to 4 bytes. With the removal of
the IP protocol, at least another 20 bytes have been removed from the protocol stack (due to optional
headers this can be larger). In total a minimum of 30 header bytes are removed from each packet, see
figure 3.1. The reason why, for example, the TTL field is 1 byte when 4 bits or fewer would be enough
is that when using the V1model, some limitations of the older version P4_14 are retained, where the
total header size needs to be a whole number of bytes.
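The byte counts above can be checked with a small Python sketch. It is only an illustration and not the P4 code used in the thesis: the ordering of the two address fields (destination before source, as in Ethernet), the function names, and the example field values are assumptions.

```python
import struct

# Four single-byte fields: two addresses, a protocol field and a TTL.
# Destination-before-source ordering is an assumption borrowed from Ethernet.
SHORTERNET_FORMAT = "!BBBB"
SHORTERNET_LEN = struct.calcsize(SHORTERNET_FORMAT)  # 4 bytes

ETHERNET_LEN = 6 + 6 + 2  # two MAC addresses + etherType
IPV4_MIN_LEN = 20         # IPv4 header without options


def pack_shorternet(dst, src, protocol, ttl):
    """Pack the four single-byte Shorternet fields into a 4-byte header."""
    return struct.pack(SHORTERNET_FORMAT, dst, src, protocol, ttl)


def unpack_shorternet(raw):
    """Return (dst, src, protocol, ttl) from the first 4 bytes of raw."""
    return struct.unpack(SHORTERNET_FORMAT, raw[:SHORTERNET_LEN])


# Header bytes saved per packet relative to Ethernet + minimal IPv4.
saved_bytes = (ETHERNET_LEN + IPV4_MIN_LEN) - SHORTERNET_LEN

# Illustrative values only; the thesis does not list concrete field values.
header = pack_shorternet(dst=4, src=1, protocol=17, ttl=8)
```

With these definitions, `saved_bytes` evaluates to the 30 bytes stated in the text.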

Figure 3.1: Simple representation of the differences between Standard TCP/IP protocol stack (top)
vs the Shorternet protocol stack (bottom). The MAC part of the TCP/IP structure has been reduced
and includes a TTL field instead in the Shorternet structure, leaving more room for the Data section.
The IP frame has been removed in Shorternet, leaving more room for Payload.

The potential gain in effective bandwidth should decrease as the packet length increases. In order to
test this hypothesis, the payload length was increased during the benchmark tests to gather insight
into how the payload length affects the results. The tests were performed on an emulated network using
Mininet as described in section 2.2.2. The structure of the network is visualized in figure 3.2 where
Host 1 is the sender and Host 4 the receiver. Each test consisted of a single flow of 100 000 packets
traversing the network, with the packet length varied between tests.



Figure 3.2: Structure of the network used for the benchmark test of Shorternet vs Standard TCP/IP
protocol stack. The path that packets traversed during the tests was: Host 1 → Switch 1 → Switch 3
→ Switch 2 → Host 4.

3.2.1 Benchmark test

In order to create a fair comparison between the benchmark tests of the standard and Shorternet
protocol stacks, custom test scripts were written. These were needed as the existing tools to measure
network statistics such as received bandwidth, latency, jitter, and packet loss are not compatible with
the custom-written Shorternet protocol stack. The benchmark test scripts are written in Python 2.7 and
send timestamps and sequence numbers together with a string of randomized text as raw
payload. At the receiving side these are then used to calculate the network performance metrics. The
flowchart of these scripts is presented in figure 3.3.

The idea of the test scripts was to create packets of equal packet length, where 100 000 packets are
sent with a similar packet rate to ensure fairness. Due to the fact that the network is emulated, the
performance is potentially reduced by a number of factors, as mentioned in section 2.2.2. In early
tests it was concluded that on the test setup the bandwidth of the network peaked at around
40 Mbit/s. This bandwidth was reduced even further when adding more switches to the emulation. To
ensure the network could keep up with the flow while still retaining fairness, the packet rate was fixed
to approximately 1000 packets per second (pps) during all tests using a sleep function in the scripts.
By tweaking the sleep function, the packet rate could be changed to a slightly higher value and still
see good performance from the switches. However, due to the way packets were created with Scapy,
which is a Python program that enables packet manipulation [30], the fairness was compromised
when tweaking the sleep function, because the packet rates were then no longer similar between the
two protocol stacks. Therefore, the packet rate of 1000 pps was used. The tests were conducted by
sending 100 000 packets for each packet length. This gives a large enough sample base from which
accurate averages and standard deviations could be derived.
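As an illustration of the sender side, the following Python sketch builds payloads that carry a sequence number and a send timestamp followed by random text, and paces them with a sleep call. It is not the script used in the thesis: the payload layout (4-byte sequence number, 8-byte timestamp), the function names, and the omission of the actual Scapy send call are assumptions made to keep the example self-contained.

```python
import random
import string
import struct
import time

# Hypothetical payload prefix: 4-byte sequence number and an 8-byte
# send timestamp (seconds as a double), in network byte order.
PREFIX_FORMAT = "!Id"
PREFIX_LEN = struct.calcsize(PREFIX_FORMAT)  # 12 bytes


def build_payload(seq, payload_len, now=None):
    """Return payload_len bytes: sequence number, timestamp, random text."""
    if payload_len < PREFIX_LEN:
        raise ValueError("payload too short for sequence number and timestamp")
    if now is None:
        now = time.time()
    filler = "".join(random.choice(string.ascii_letters)
                     for _ in range(payload_len - PREFIX_LEN))
    return struct.pack(PREFIX_FORMAT, seq, now) + filler.encode("ascii")


def paced_payloads(count, payload_len, rate_pps=1000.0, sleep=time.sleep):
    """Yield count payloads, sleeping 1/rate_pps seconds after each one."""
    interval = 1.0 / rate_pps
    for seq in range(count):
        yield build_payload(seq, payload_len)
        sleep(interval)  # the real send call would go before this line
```

Because the time spent building and sending each packet adds to the sleep interval, a fixed sleep of this kind gives a rate slightly below the nominal value, which is consistent with the rate being described as approximate.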

Packet loss is calculated via the sequence numbers: after a packet has been received, its
sequence number is read and checked against the expected number. If the received
number is higher than the expected one then packet loss has occurred, which is kept track of with a
counter. Conversely, if a received sequence number is lower than the expected one then the packet
is late, and the loss counter is updated since this packet is no longer lost but is instead out of order.
Neither the sender nor the switches are able to resend the same packet, which ensures that no
duplicate packets traverse the network. The latency metric is modeled as a stochastic variable named
L, which is defined as the difference between the transmission timestamp carried in the packet and a
timestamp taken at the receiver side when the packet has arrived:

$$L = t_{\text{arrival}} - t_{\text{transmitted}}. \quad [\text{ms}] \tag{3.1}$$
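The sequence-number bookkeeping described before equation (3.1) can be sketched in Python as follows. The class and attribute names are hypothetical, and two details are assumptions rather than statements from the thesis: a gap of several sequence numbers increases the loss counter by the size of the gap, and a late packet decreases it by one.

```python
class LossTracker(object):
    """Track lost and out-of-order packets from received sequence numbers."""

    def __init__(self):
        self.expected = 0      # next sequence number we expect to see
        self.lost = 0          # packets currently considered lost
        self.out_of_order = 0  # packets that arrived late

    def on_packet(self, seq):
        if seq > self.expected:
            # Gap: packets expected..seq-1 are counted as lost for now.
            self.lost += seq - self.expected
            self.expected = seq + 1
        elif seq < self.expected:
            # Late packet: it was counted as lost earlier, so correct that.
            self.lost -= 1
            self.out_of_order += 1
        else:
            self.expected = seq + 1
```

The correction for late packets relies on the property stated above that no duplicate packets traverse the network; with duplicates the counter could be decremented more than once for the same packet.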

The arrival timestamp is taken directly on arrival in order to not include the processing time of the
script itself. A latency measurement, li, i ∈ [1,n], is generated for each packet that arrives at the
receiver. The latency for a specific packet length is estimated as an average over all n measurements
according to:

$$\mu_L = \frac{\sum_{i=1}^{n} l_i}{n}, \quad [\text{ms}] \tag{3.2}$$

where n in this case is 100 000 packets minus the number of packets lost. If a packet has been lost,
the latency of that packet is simply not included; thus, the number of packets lost is presented
together with the latency. There are different ways to define jitter. For computer networks it can
be seen as a metric describing variations in latency and is sometimes referred to as Packet Delay
Variation. In this test it is assumed that the network channel does not vary with time during an
iteration. For instance, the channel does not need a stabilization period and the transmission flow is
neither bursty nor varying in size due to how the test is carried out. Therefore, jitter is defined as the
standard deviation σ of the latency and is estimated according to:

$$\sigma_L = \sqrt{\frac{\sum_{i=1}^{n} (l_i - \mu_L)^2}{n-1}}. \quad [\text{ms}] \tag{3.3}$$
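Equations (3.2) and (3.3) translate directly into a few lines of Python. The sketch below is an illustration with hypothetical function names, operating on a list of per-packet latencies in milliseconds.

```python
import math


def mean_latency(latencies):
    """Average latency over all received packets, equation (3.2)."""
    return sum(latencies) / float(len(latencies))


def jitter(latencies):
    """Sample standard deviation of the latency, equation (3.3)."""
    n = len(latencies)
    if n < 2:
        raise ValueError("at least two latency samples are needed")
    mu = mean_latency(latencies)
    return math.sqrt(sum((l - mu) ** 2 for l in latencies) / float(n - 1))
```

Lost packets are handled simply by never appending a latency for them, so `len(latencies)` corresponds to n in the equations.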

The final metric that is estimated is the received effective bandwidth. This was done via the following
equation:

$$B_{\text{eff}} = \frac{\text{bytes}_{\text{tot}}}{t_{\text{flow}}} \cdot \frac{\text{payload length}}{\text{packet length}}, \quad [\text{bytes/s}] \tag{3.4}$$

where a counter, bytes_tot, keeps track of the number of processed bytes, and the first and last arrival
timestamps are used to find the total flow time, t_flow. The payload length and packet
length variables indicate the length of the raw payload of a packet in bytes and the total length of a
packet in bytes, respectively.
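A minimal Python sketch of equation (3.4) is given below. The function and parameter names are hypothetical; the timestamps are assumed to be in seconds so that the result is in bytes per second.

```python
def effective_bandwidth(bytes_total, first_arrival, last_arrival,
                        payload_len, packet_len):
    """Received effective bandwidth in bytes/s, equation (3.4)."""
    t_flow = last_arrival - first_arrival
    if t_flow <= 0:
        raise ValueError("flow time must be positive")
    return (bytes_total / float(t_flow)) * (payload_len / float(packet_len))


# Example: 1000 packets of 300 bytes received over one second, with the
# 12 header bytes of Shorternet leaving 288 bytes of payload per packet.
example = effective_bandwidth(1000 * 300, 0.0, 1.0, 288, 300)
```

In this example 96% of the received bytes are payload, so the effective bandwidth is 288 000 bytes/s out of 300 000 bytes/s received.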



Figure 3.3: Flowchart describing the logic of the scripts used for the sender and receiver during the
benchmark test implementation. In this case Host 1 ran the Sender script to send 100 000 packets
with an artificial packet rate created via a sleep call at the end of the loop. Host 4 ran the Receiver
script, which listens for incoming traffic and, if a correct packet is received, processes it
accordingly.



3.3 In-Band Network Telemetry

A network with INT was implemented and compared against a traditional network without INT in
order to gather insight into how INT affects the network. Bandwidth, latency, and jitter were compared
between the two networks.

The implemented INT protocol follows the structure of INT-MD over UDP as specified by the INT
specification [17]. The focus on UDP traffic was chosen because applications in radar
systems often use UDP or similar best-effort protocols to stream data from the antenna arrays. The
INT protocol was constructed with CC in mind, the goal being to achieve high-bandwidth networks
while maintaining low latency. However, there was not enough time to properly implement and
evaluate a CC algorithm. In contrast to the INT specification, where reports are sent to a
monitoring system, the INT report generated by the INT-sink is sent
back to the sender instead. This is done because the sender should apply CC on its own output, which
uses INT-data, and in a sense becomes the monitoring system for its own flows. The INT-shim header
was not used at all. This was possible because only INT-MD over UDP was present in the network.
If other INT protocols were to be used, INT-shim would be needed to separate these.

The INT-header format can be seen in figure 3.4. This header is added at the INT-source. The hop
count specifies how many hops an INT packet has traversed and is incremented at each hop.
The hop count is used when parsing INT-data at consecutive nodes. The UDP destination port number
is copied from the UDP header to the UDP Port field. The INT-source then changes the UDP destination
port number to a known number to indicate that INT is present. The INT-sink uses the saved UDP
port number when reconstructing the original packet. The INT report header uses the same format as
the INT-header.

The INT-data header structure can be seen in figure 3.5. This header is appended to the packet at
each hop. Timestamp saves the egress timestamp. Transmitted bytes saves the total sent bytes from
the egress port. Queue length saves the queue length at the egress port. Switch ID saves the id of a
switch. Port saves the egress port the packet is forwarded through.

Figure 3.4: Structure of the implemented INT-header fields. The first byte contains the Hop count
and the remaining two bytes contain the UDP Port. The total size of the INT-header is three
bytes.

The system was then evaluated in two tests, both conducted using a linear network topology.
In the first test the diameter of the network was increased to gather insight into how the flow length
affects the network. In the second test one additional UDP flow was introduced to the network to
gather data on how multiple flows affect the network.



Figure 3.5: Structure of the implemented INT-data fields. The first four bytes contain the Timestamp,
and bytes 4-7 contain the value of total Transmitted bytes. Bytes 8-9 represent the Queue length and
the last two bytes represent the Switch ID and Port respectively. The total size of the INT-data header
is 12 bytes.
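The field layouts in figures 3.4 and 3.5 can be expressed with Python's struct module, which is convenient when decoding INT reports on a host. The sketch below is an illustration under stated assumptions: network byte order, INT-data blocks placed directly after the INT-header, and a hop count equal to the number of INT-data blocks present. The function names are hypothetical.

```python
import struct

INT_HEADER_FORMAT = "!BH"   # hop count (1 byte), saved UDP port (2 bytes)
INT_DATA_FORMAT = "!IIHBB"  # timestamp, transmitted bytes, queue length,
                            # switch ID, port

INT_HEADER_LEN = struct.calcsize(INT_HEADER_FORMAT)  # 3 bytes
INT_DATA_LEN = struct.calcsize(INT_DATA_FORMAT)      # 12 bytes


def parse_int(raw):
    """Split raw bytes into (hop_count, udp_port, hops, remaining payload).

    Assumes hop_count INT-data blocks directly follow the INT-header.
    """
    hop_count, udp_port = struct.unpack(INT_HEADER_FORMAT,
                                        raw[:INT_HEADER_LEN])
    hops = []
    offset = INT_HEADER_LEN
    for _ in range(hop_count):
        hops.append(struct.unpack(INT_DATA_FORMAT,
                                  raw[offset:offset + INT_DATA_LEN]))
        offset += INT_DATA_LEN
    return hop_count, udp_port, hops, raw[offset:]
```

Each entry in `hops` is a tuple (timestamp, transmitted bytes, queue length, switch ID, port), matching the field order in figure 3.5.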

3.3.1 Switch logic
All of the following logic assumes that the packet being processed is a UDP packet. Otherwise the
process follows a basic forwarding logic. In figure 3.6 the complete flow of a packet through a switch
is visualized. Each switch parses all the headers present in the packet. If INT-data is present, the
payload is then parsed as a variable-length header. This enables the switch to remove the packet data
later on when creating a report, if the switch is a sink. At the end of the ingress pipeline the switch
checks if it is the last one in a flow. This is done by matching the destination address against a table;
if there is a match, the packet is cloned. The clone is used to create an INT report in the egress
pipeline. The packets are then passed to the traffic manager and then to the egress pipeline.

The egress pipeline has two different parts, one for a potential cloned packet and one where INT-data
and/or INT-header is to be appended. If the packet is not a clone, the switch continues to check if it
is a sink. If it is not a sink, the INT-header and INT-data are added as usual. Should the
switch be a sink, the INT-header and INT-data are removed to ensure that INT is not present at the
receiving end for the original packet. If the packet is a clone, INT-data is added from the sink switch.
Then the payload is removed and the report is forwarded towards the host that sent the packet in the
first place.

In section 2.2.3b it is mentioned that the UDP destination port is used to indicate INT in a packet.
However, the value of this parameter is not decided by the specification. The port number to indicate
that INT-metadata is present in a packet was chosen as 1337 and the port number to indicate that
the packet is a report was chosen as 1338. This information is used in multiple places in the logic.
In the parser the UDP destination port is checked to determine if the INT-header and INT-data should
be parsed. Furthermore, if the packet is a report then INT-data should not be appended to that packet.
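The egress decisions described in this section can be summarized in a short Python model. This is a simplified sketch and not the P4 implementation: the action names are hypothetical labels, the handling of a sink that receives a packet without INT is an assumption, and only the port numbers 1337 and 1338 are taken from the text.

```python
INT_PORT = 1337     # UDP destination port indicating INT is present
REPORT_PORT = 1338  # UDP destination port indicating an INT report


def egress_actions(udp_dst_port, is_clone, is_sink):
    """Return the ordered list of egress actions for a UDP packet."""
    if udp_dst_port == REPORT_PORT:
        # Reports travel back to the sender without further INT-data.
        return ["forward"]
    has_int = udp_dst_port == INT_PORT
    if is_clone:
        # The clone becomes the report: add the sink's own INT-data,
        # drop the payload and send it back towards the sender.
        return ["add_int_data", "remove_payload", "mark_as_report",
                "forward_to_sender"]
    if is_sink:
        # The original packet must leave the network without INT.
        if has_int:
            return ["remove_int", "restore_udp_port", "forward"]
        return ["forward"]
    actions = []
    if not has_int:
        # INT-source: save the UDP port and insert the INT-header.
        actions.append("add_int_header")
    actions.extend(["add_int_data", "forward"])
    return actions
```

For example, a packet arriving at the first switch with an ordinary destination port gets both the INT-header and INT-data, while the same packet at a transit switch (now carrying port 1337) only gets another INT-data block.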



Figure 3.6: Flowchart of the P4 switch logic for the INT implementation. The ingress block
describes the ingress pipeline for INT. If cloning occurs, two packets are sent out of the ingress
pipeline. The egress block describes the egress pipeline for INT. The results are then sent to a
checksum updater, and lastly, to the deparser. If cloning occurred in the ingress pipeline the egress
pipeline is invoked twice for one ingress packet.



3.3.2 Benchmark
In the first test with increasing network diameter, three sub-tests were performed using three, four and
five switches respectively. The topology with four switches can be seen in figure 3.7. A UDP flow
was introduced with iperf2 [31], a free software network measurement tool for measuring
bandwidth, packet loss, and jitter. Specifically, the jitter measurements of iperf2 are, in contrast
to those of the Shorternet benchmark, done as specified in section 6.4.1 of RFC 3550 [32]. The jitter,
in this case, is re-estimated at every newly received packet by combining the new sample with a
weighted value of the previous estimate.
The result is a smoothed value of the deviation of the latency between packets. This method is more
accurate for a channel varying with time, which is the case with INT. This is partly because there is
a stabilization period at the start of a flow, since the report flow starts only after the first packet
arrives at the last switch, and partly because congestion is expected to occur.
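The estimator in section 6.4.1 of RFC 3550 updates the jitter J for each new packet as J = J + (|D| - J)/16, where D is the difference in transit time between two consecutive packets. A minimal Python sketch of this formula is shown below; the function name is hypothetical and the code illustrates the RFC formula rather than the iperf2 source code.

```python
def rfc3550_jitter(send_times, recv_times):
    """Smoothed interarrival jitter as defined in RFC 3550, section 6.4.1.

    send_times and recv_times are per-packet timestamps in the same unit.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # D: change in transit time between packet i-1 and packet i.
        d = ((recv_times[i] - recv_times[i - 1])
             - (send_times[i] - send_times[i - 1]))
        # Exponential smoothing with gain 1/16.
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

Because each new sample only contributes one sixteenth of its deviation, a single delayed packet moves the estimate slightly, whereas sustained queue build-up moves it steadily, which suits the time-varying conditions described above.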

The payload length was fixed to ensure that the Maximum Transmission Unit (MTU) was not exceeded
in any of the topologies. For each topology, multiple tests with increasing bandwidth were
conducted to detect when congestion occurred in the network. The performance of the network is
reduced with an increasing number of switches, as mentioned in section 2.2.2. Because of this, the
maximum bandwidth was configured to 6 Mbit/s for all hosts and switches in order to reduce the
effect of limited system performance. The tests ran for 120 seconds for each bandwidth, split into
six 20-second sub-tests. This means each flow was 20 seconds long, which ensures that congestion
occurs if the switches are not able to handle the incoming bandwidths. If flows are too short, packets
only experience longer queues and no packets are dropped.

In the second test the topology was fixed at four switches. The bandwidth was also limited to 6 Mbit/s
to ensure fairness. In this test the bandwidth was increased as in test one, but at the same time another
iperf2 UDP flow was introduced between two other hosts at a fixed bandwidth of 2 Mbit/s. This
was done to gather better insight into how multiple flows with INT affect the network, since there is
seldom only a single flow residing in the network at a time. The flows are visualized with black arrows
in figure 3.7. The test ran for 120 seconds for each bandwidth, split into six 20-second sub-tests.

In both tests the metrics were gathered from an iperf2 server which was active at the receiver. Latency
was gathered in both tests by using ping [33]. The interval for ping was configured to 0.1 seconds.

Figure 3.7: INT benchmark network setup with four switches and two flows. Each black line
represents one flow.



3.4 Data plane firewall
A proof-of-concept data plane firewall was designed and implemented. The idea of the firewall was to
block outgoing and incoming traffic towards or from a specific network device. The firewall works by
associating MAC addresses with port numbers, meaning that each switch needs a rule that matches
the incoming or outgoing port with the MAC source or destination address respectively. These
rules are implemented as table entries. If a packet does not have matching table entries, that packet is
dropped directly at the switch. Packets that should be dropped are marked by using the mark_to_drop
action, which is provided by simple_switch. The firewall logic is visualized in figure 3.8. The firewall
logic only resides in the ingress pipeline. After the parser, the MAC source address is matched against
the table. If the incoming port is valid for that address, the next step is basic forwarding with IPv4;
other protocols are not supported. The resulting MAC destination address is then matched against
the outgoing port, and if it is valid the packet is sent to the traffic manager. The table entries that were
needed for each switch were added with simple_switch_CLI after startup.

Figure 3.8: Flowchart of the P4 firewall logic. The logic completely resides in the ingress pipeline.
Rules in the table decide if a packet should be dropped or not.
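The ingress logic of the firewall can be modeled in a few lines of Python. The sketch below is an illustration only: the data structures and names are hypothetical, and IPv4 forwarding is simplified to an exact match on the destination address instead of a P4 match-action table.

```python
DROP = None  # stands in for the mark_to_drop action


def firewall_ingress(packet, ingress_port, src_rules, dst_rules, ipv4_routes):
    """Return the egress port for an allowed packet, or DROP.

    src_rules:   set of allowed (source MAC, ingress port) pairs
    dst_rules:   set of allowed (destination MAC, egress port) pairs
    ipv4_routes: dict mapping destination IP to (next-hop MAC, egress port)
    """
    # Step 1: the source MAC must be allowed on the port it arrived on.
    if (packet["src_mac"], ingress_port) not in src_rules:
        return DROP
    # Step 2: basic IPv4 forwarding selects next-hop MAC and egress port.
    route = ipv4_routes.get(packet["dst_ip"])
    if route is None:
        return DROP
    dst_mac, egress_port = route
    # Step 3: the resulting destination MAC must be allowed on that port.
    if (dst_mac, egress_port) not in dst_rules:
        return DROP
    return egress_port
```

A packet that reuses a known source MAC address but arrives on a different port fails the first check, which is the case exercised when injecting forged packets.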

To test that the firewall worked as intended, false packets generated with Scapy [30] were injected
with MAC addresses that were already present in the network. The network structure was the same
as when testing Shorternet, see figure 3.2. In this case, however, Host 4 is initially not part of the
internal network and should therefore be blocked by the firewall until a table entry is written which
allows packets to flow to and from Host 4.



4
Results and Discussion

In this chapter, the given results from the three conducted tests are presented and discussed.
Furthermore, general concerns regarding security related to P4 are discussed. Lastly, potential ideas
for further development are discussed.

4.1 Shorternet

After conducting the custom-made tests to benchmark the performances of the standard and
Shorternet protocol stacks it was noted that Shorternet provided an improvement in not only
bandwidth, which was expected, but also in latency and jitter.

4.1.1 Effective bandwidth
The results concerning the received effective bandwidth of both tests are visualized in figure 4.1,
and a comparison between the two is shown in figure 4.2. Here it is noted that the bandwidth
received while using Shorternet is slightly larger than for the standard solution
throughout the test. It is also noted that the effective bandwidth gain of using Shorternet over the
standard solution falls off roughly inversely with the packet length. For small packets the effective
bandwidth is almost doubled, whereas for the larger packets the gain is below 5%. Theoretically, if the
packet rate and packet length are the same for both protocol stacks, then the received effective
bandwidth and the gain follow from:

$$B_{\text{eff}} = \text{packet rate} \cdot \text{packet length} \cdot \frac{\text{payload length}}{\text{packet length}} \quad [\text{bytes/s}] \tag{4.1}$$

For Shorternet the headers account for 12 bytes of the packet length whereas for the standard solution
the headers account for 42 bytes. Thus, the payload lengths are equal to packet length - 12 for
Shorternet and packet length - 42 for the standard.

$$B_{\text{gain}} = \frac{B_{\text{Shorternet}}}{B_{\text{Standard}}} = \frac{\text{payload length}_{\text{Shorternet}}}{\text{payload length}_{\text{Standard}}} \tag{4.2}$$

Assuming a packet rate of 1000 pps the calculated gain in effective bandwidth for varying packet
lengths is presented in table 4.1.



Packet length (bytes)    Gain (%)
74                       93.75
300                      11.62
850                      3.713
1514                     2.038
9000                     0.335

Table 4.1: Table of theoretically calculated gain in effective bandwidth for different packet lengths
using a set packet rate of 1000 pps
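The theoretical gain follows from equation (4.2) and the header sizes of 12 and 42 bytes given above. The following Python sketch recomputes it for the packet lengths in table 4.1; the function name is hypothetical and the gain is expressed as the percentage by which the ratio in (4.2) exceeds one.

```python
SHORTERNET_HEADER_BYTES = 12
STANDARD_HEADER_BYTES = 42


def theoretical_gain_percent(packet_len):
    """Gain in effective bandwidth of Shorternet over the standard stack."""
    payload_shorternet = packet_len - SHORTERNET_HEADER_BYTES
    payload_standard = packet_len - STANDARD_HEADER_BYTES
    if payload_standard <= 0:
        raise ValueError("packet too short to carry any standard payload")
    return (payload_shorternet / float(payload_standard) - 1.0) * 100.0


for length in (74, 300, 850, 1514, 9000):
    print("%5d bytes: %.3f %%" % (length, theoretical_gain_percent(length)))
```

Since the packet rate appears in both the numerator and the denominator of (4.2), it cancels, so the computed values do not depend on the assumed 1000 pps.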

Figure 4.1: Plot of the measured effective bandwidth as a function of packet length for Shorternet
(blue) and standard TCP/IP (orange).

Figure 4.2: Plot of the measured, together with the theoretically calculated, gain in effective
bandwidth from using Shorternet compared to TCP/IP as a function of packet length. The blue curve
represents the measured gain and the orange curve represents the theoretical gain.


Here it is noted that for small packets, Shorternet sees a large gain in effective bandwidth, whereas for
larger packets this gain reduces towards zero with increasing packet lengths. This is to be expected,
as it is the relation between payload length and packet length that determines the gain in effective
bandwidth when the packet rate is fixed, according to (4.2). Since Shorternet always has 30 more
bytes of payload for equal packet lengths, it can be derived from (4.2) that the gain can be written as
$\frac{30+x}{x} = \frac{30}{x} + 1$, where x is the payload length of the standard solution and the +1
indicates that Shorternet has a larger payload and thus a positive gain. The resulting $\frac{30}{x}$
term is inversely proportional to the payload length of the standard solution and approaches zero for
larger packet lengths. By comparing the practical test results
with the theoretical ones, see figure 4.2, it is shown that the measured results closely resemble the
theoretical calculations. Both the theoretical and the experimental results thus show an improvement in
effective bandwidth when removing redundant information within a protocol stack.
However, one must consider how this relates to actual data transmissions within a network. There is an
obvious potential for increased bandwidth within networks handling small packets, and even an increase
of 5% may prove beneficial for the commercial success of a product.

4.1.2 Latency and jitter

Figure 4.3 shows the latency of the two protocol stacks. The latency of the Shorternet stack appears
more stable and is about 3 ms lower on average than that of the standard stack. The latency values
are derived by estimating the average latency of every packet traversing the network during the test,
as described in section 3.2.1. Regarding packet losses, see figure 4.4, the packet loss rate was less
than 0.01% for both protocol stacks, except for the standard protocol stack with a packet length of
694 bytes, which reached about 0.025%. In comparison, the standard protocol stack experienced packet
loss on more flows and also dropped a larger number of packets than Shorternet. Since the latencies
of dropped packets are not included, every dropped packet reduces the sample size used to estimate
the average latency. Nevertheless, the number of lost packets is small and should not affect the
estimated average latency to the extent seen in figure 4.3. The test scripts for both protocol stacks
derive the latency in the same way and use the same emulated network; thus, it is reasonable to
attribute the improved latency to the reduced headers. The estimated jitter, plotted in figure 4.5,
is significantly lower for Shorternet, with a difference of about 7 ms on average between the two.
Additionally, the jitter, like the latency, does not seem to depend on the packet length for either
protocol stack. As jitter is the estimated standard deviation of the latency, this suggests that
Shorternet provides a more stable latency. This can also be seen in the latency results in figure
4.3. The estimated average latency of Shorternet is almost constant, whereas the latency of the
standard protocol stack varies considerably, assuming that the latency is not dependent on packet
length.
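
The metrics discussed above can be summarized per flow as in the following Python sketch. It is an
illustration rather than the test script used in the project: the function name and the timestamp
dictionaries are hypothetical, and the use of the sample standard deviation as the jitter estimator
is an assumption based on the description of jitter as the estimated standard deviation of the
latency.

```python
import statistics


def summarize_flow(sent, received):
    """Summarize one flow from per-packet timestamps.

    sent:     {sequence number: send time in seconds}
    received: {sequence number: receive time in seconds}
    Dropped packets are missing from `received` and contribute no latency sample.
    """
    latencies = [received[seq] - sent[seq] for seq in sent if seq in received]
    lost = len(sent) - len(latencies)
    return {
        "avg_latency": statistics.mean(latencies),
        "jitter": statistics.stdev(latencies),  # needs at least two delivered packets
        "loss_pct": 100.0 * lost / len(sent),
    }
```

Note that a dropped packet only lowers the number of latency samples; it does not enter the average
itself, which matches the reasoning about sample size above.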

The throughput of an emulated Bmv2 P4 switch partly depends on the number of match-action tables
and table entries, see section 2.2.2. During the tests comparing Shorternet with standard UDP over
IP, the number of match-action tables and keys was equal and the actions taken were similar. These
included one table lookup and one forwarding action in which one versus four lines of code were
executed, which produces a minuscule difference in computing time. Therefore, the reduced latency is
more likely due to the switches parsing less information. For the Shorternet protocol stack the
parser no longer parses any layer three information, since it is no longer present in the packets.
Another aspect that further suggests that the improvement is related to the switch processing
Shorternet packets more quickly is that the latency does not seem to change with increased packet
lengths or when fewer packets are dropped. Thus, the reduced latency in the tests depends on the
fixed number of header bytes that have been removed, which once again corresponds to less information
being processed at the parser, since the raw payload is not processed at the switches.

Figure 4.3: Plot of the estimated average measured latency as a function of packet length for
Shorternet (blue) and standard TCP/IP (orange).

Figure 4.4: Plot of the calculated packet loss in percent (%) when using Shorternet (blue) and
standard TCP/IP (orange) for each flow of varying packet lengths.

It would have been interesting to test the effects of adding or removing more headers, had there
been more time. A test could have been conducted with a fixed packet length but a varying number of
header bytes to parse, while still performing the same ingress pipeline processing. This could have
helped to identify more clearly what causes the reduced latency. There is a possibility that the
latency improvement of Shorternet is related to the characteristics of the emulated Bmv2 switches
rather than the P4 program itself. In a study by H. Hasanin et al. [34], one of the tests regarding
P4 examined the effect of parsing more headers on different network hardware. Starting from parsing
Ethernet, IPv4, UDP, and Precision Time Protocol (PTP) headers, they increased the number of headers
parsed by adding dummy headers to the stack. Their results showed that parsing more headers
increases the latency on the order of single microseconds, depending on which P4 target was used.
Their results also showed that increased packet lengths increase the latency by approximately 0.1-5
microseconds, depending on which P4 target was used. Given their results, where the latency increase
from parsing more headers is far smaller than the difference of several milliseconds observed here,
it is likely that the results from the tests performed in this project depend on the Bmv2 switches.
In a test on actual hardware, the latency improvement of using Shorternet over standard protocols
may look similar to the results from the study by H. Hasanin et al.
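
The proposed test with a fixed packet length and a varying number of header bytes could be prepared
along the lines of the following Python sketch. It was not part of the project; the function names,
the dummy header size of four bytes and the filler byte values are hypothetical choices, and a
matching P4 parser that extracts the dummy headers would still be needed on the switch.

```python
DUMMY_HEADER_LEN = 4  # hypothetical size of one dummy header in bytes


def build_test_packet(total_len, base_header, n_dummy):
    """Return a packet of exactly total_len bytes carrying n_dummy dummy headers."""
    dummy = b"\xaa" * (DUMMY_HEADER_LEN * n_dummy)
    header_len = len(base_header) + len(dummy)
    if header_len > total_len:
        raise ValueError("headers do not fit in the requested packet length")
    # The payload shrinks as headers are added, so the packet length stays fixed.
    payload = bytes(total_len - header_len)
    return base_header + dummy + payload


def header_sweep(total_len, base_header, max_dummy):
    """One packet per header count, all with the same total length."""
    return [build_test_packet(total_len, base_header, n) for n in range(max_dummy + 1)]
```

Keeping the total length constant separates the cost of parsing from the cost of moving bytes
through the switch, which is the distinction the proposed test aims to make.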

Figure 4.5: Plot of the estimated jitter as a function of packet length for Shorternet (blue) and
standard TCP/IP (orange).

4.1.3 General discussion about Shorternet
To get a better understanding of how the gain in effective bandwidth relates to actual transmissions
within a network, the following example is presented. Assume there is a large transmission of 10
GB of data. In this case, an increase of 5% in effective bandwidth equates to 500 MB of data, which
corresponds to roughly 667,000 packets of 750 bytes (disregarding headers). In certain scenarios
packets are sent on the fly, meaning that the sender does not wait for packets to be filled up to a
certain point but rather sends out packets as the data arrives. For such scenarios this improvement
is negligible. In other scenarios this could lead to fewer packets being transmitted, potentially
resulting in fewer packets lost and retransmitted and also less packet processing at the receiver.
Paired with the potential improvement in latency, the total transmission and processing time is
reduced by a noticeable amount, which suggests that the concept is somewhat successful.
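
The arithmetic behind this example can be reproduced with the short Python sketch below. The
function name is hypothetical and decimal units are assumed (1 GB = 10^9 bytes); with binary units
the numbers would differ slightly.

```python
def packets_saved(transfer_bytes, gain_percent, payload_bytes):
    """Bytes gained by a relative bandwidth gain and the equivalent number of full payloads."""
    saved_bytes = transfer_bytes * gain_percent // 100
    return saved_bytes, saved_bytes // payload_bytes


saved, packets = packets_saved(10 * 10**9, 5, 750)
print(saved, packets)  # 500000000 666666
```

Under these assumptions the result lands close to the packet count quoted above.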

As these tests were done in an emulated environment on a single laptop, this should be kept in mind
when discussing the calculated metrics. Another aspect to consider is that the packet rate had to be
limited artificially with a sleep call in the test scripts, due to the slow processing of the
switches when emulated on a laptop. While this ensured a fairer comparison between the two protocol
stacks, it almost completely removed queue build-ups at the switches. The traffic behaviour might,
therefore, be unrepresentative of a real-world scenario.
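
A sleep-based rate limit of the kind described above can be sketched in Python as follows. This is
an assumed reconstruction, not the test script itself: the function names and the `send` callback
are hypothetical, and the time spent in `send` is ignored, so the achieved rate will be somewhat
lower than the target.

```python
import time


def inter_packet_interval(packet_bytes, rate_bit_per_s):
    """Seconds each packet occupies at the target transmitted bandwidth."""
    return packet_bytes * 8 / rate_bit_per_s


def send_paced(send, packets, rate_bit_per_s, sleep=time.sleep):
    """Send each packet, then sleep for the interval that packet occupies."""
    for packet in packets:
        send(packet)
        sleep(inter_packet_interval(len(packet), rate_bit_per_s))
```

Because every packet is followed by a fixed pause, packets arrive evenly spaced at the switches,
which is consistent with the near absence of queue build-up noted above.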

From the presented results it seems that the Shorternet protocol stack improves effective
bandwidth, latency, and jitter. In retrospect, a test consisting of several flows could have been
more suitable to provide data representative of network performance in a real-world scenario.
Nonetheless, the test provided data that could be used to compare two different types of flows and
their effect in a controlled environment. Furthermore, the removal of the third layer in the stack
removes some functionality, which must be considered. In a traditional setup, layer two switching is
responsible for physical addressing and error detection, whereas layer three is responsible for
logical addressing and routing, error handling, packet sequencing, etc. For instance, without layer
three, IP packet fragmentation is no longer available. This functionality could potentially be moved
to a layer two field instead, if it is deemed necessary. Moreover, the network is no longer able to
run applications which assume that the traditional protocol stack is used, e.g. Internet access, as
this depends on IP routing capabilities, or iperf and similar tools. The question of whether the
removal of layer three protocols is a suitable modification boils down to whether layer three is
needed. In this case, the modification that shortens the MAC addresses relies on the network being
relatively small. Additionally, if the network is well-defined, IP is not needed to keep track of
potential new hosts. P4 is flexible and enables the system designer to choose, e.g., address lengths
and which protocols are needed. In this case Shorternet is more or less a proof of concept, rather
than a solution for all systems. This shows the utility of P4 as a tool, since P4 makes it easier to
adapt and mold the network to what the system designer requires.

4.2 INT

The results from the first test with INT enabled or disabled can be seen in figures 4.6, 4.7, 4.8 and 4.9.
The figures show that there is almost no difference for networks with a small diameter when
congestion is avoided. For higher transmitted bandwidths the network experiences congestion, which
is worsened with INT enabled. This is expected because the number of packets doubles, as the sink
generates a report for each packet, and the header size is increased for INT packets. The bandwidth
for five switches differs considerably from that of the smaller network diameters, see figure 4.6.
The reason for this can be found in figure 4.8, which shows that the packet loss for five switches
with INT enabled is more severe compared to the other cases. For five switches with INT enabled the
packet loss also starts to increase at a lower transmitted bandwidth. Furthermore, the average
packet loss is around 0-2% until congestion occurs in the network. INT had quite a large impact on
latency, see figure 4.7. The increased latency is a sign of congestion in the switches. The results
showed that the latency increased both with INT enabled and disabled. However, when INT was enabled
the congestion was more severe due to the increased amount of overhead. The network performance
declined with the network diameter, which points to the added header size since the number of
packets was similar for each topology. When the transmitted bandwidth reached 5 Mbit/s the bandwidth
decreased for five switches, due to severe congestion. Jitter, in contrast, saw no major difference,
see figure 4.9. However, there were some outliers for jitter, especially when INT was enabled for
three switches. The reason why these outliers occurred was never identified and they were therefore
not removed. It should be noted that the latency when INT was disabled saw no major difference with
more switches added.

Figure 4.6: Plots of the average measured bandwidths for a network diameter of (a) three, (b) four
and (c) five switches. For (a) and (b) the results are similar, but in (c) the bandwidth when INT is
enabled is lower than when INT is disabled. The error bars represent one standard deviation of the
measurements.

Figure 4.7: Plots of the average measured latency for a network diameter of (a) three, (b) four and
(c) five switches. The latency is higher when INT is enabled. Furthermore, the latency increases
with the network diameter when INT is enabled. The error bars represent one standard deviation of
the measurements.

Figure 4.8: Plot of the packet loss percentage for each network diameter with INT enabled and
disabled. The packet loss is the total number of packets dropped relative to the total number of
packets sent for a given transmitted bandwidth.

Figure 4.9: Plots of the average measured jitter for different network diameters. The jitter is
similar for all network diameters. There are some outliers which affect the results. However, the
outliers are present both with INT enabled and disabled. The error bars represent one standard
deviation of the measurements.

In the second test, where an additional flow was introduced, see figure 4.10, the results showed
that the bandwidths with INT enabled and disabled, see figure 4.10a, were similar until the
transmitted bandwidth reached 5 Mbit/s. After that point the gap increases, for the same reason as
in the first test: packets were dropped due to congestion. However, a given transmitted bandwidth
should match the scenarios of the first test with a 2 Mbit/s offset, because the number of packets
is roughly the same. The larger gap could be related to the fact that in the first test each packet
was sent at a consistent frequency, which is favorable in a networking environment. With two flows
the sending frequencies of the flows are not necessarily matched, which can cause more packets to
arrive within a certain time frame. This will, in theory, increase the jitter as the interval
between packets changes. Comparing figure 4.10e with 4.10f, it is apparent that the jitter increased
with multiple flows. The latency of the network, see figure 4.10c, showed similar effects from
enabling INT as in the first test, see figure 4.10d. However, this time the congestion started at
a lower transmitted bandwidth of 3.5 Mbit/s, which is natural since the second flow of 2 Mbit/s is
present. It should be noted that the performance of the network regarding bandwidth, latency, and
jitter is good while congestion is avoided, for example at 3-3.5 Mbit/s.

Figure 4.10: Plots of the measured bandwidth (a, b), latency (c, d), and jitter (e, f) for two flows
with four switches (left) and for one flow and four switches (right). The error bars represent one
standard deviation of the measurements.

Figure 4.11: Plot of the packet loss percentage with two flows and four switches with INT enabled
and disabled. The packet loss is the total amount of packets dropped versus the total amount of
packets sent for a transmitted bandwidth.

An important aspect to consider is that the network performance depends on the performance of the
system running the emulation. The maximum bandwidth was limited for each switch and host to reduce
the effect of this dependency. However, because all switches shared the same system resources, they
still affected each other to some extent. The Bmv2 target performance also depends on the program
complexity. While the number of table entries was the same for both tests, the P4 program with INT
disabled is much less complex than the P4 program with INT enabled. The performance of a hardware
switch might not be affected by the program itself as it was for the Bmv2 switch. Therefore, there
are multiple reasons to repeat the same tests on hardware switches to get the full picture.

The INT protocol that was implemented is quite simple in design. If legacy switches were present in
the network, there might be complications with the INT headers. Additionally, in this project, only
UDP traffic was instrumented with INT. Because the programmer has full control, INT can be added to
any other protocol so that flows other than UDP traffic can also provide vital network information.
Also, INT was applied to every single UDP packet. It might be interesting to investigate whether
that is needed and whether the number of INT packets that circulate in the network can be lowered to
reduce queue build-up. For example, adding INT to every other packet would halve the number of
packets with added headers, and the number of reports would be halved as well.
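
The effect of such sampling can be sketched with a few lines of Python. This is a model of the idea
only, not the P4 implementation from this project: the function names are hypothetical, and a
per-flow packet counter is assumed to be available at the INT source.

```python
def should_add_int(packet_index, every_nth=2):
    """Instrument only every n-th packet of a flow (index 0 is instrumented)."""
    return packet_index % every_nth == 0


def int_packet_counts(n_packets, every_nth=2):
    """Number of instrumented packets and of reports generated at the sink."""
    instrumented = sum(should_add_int(i, every_nth) for i in range(n_packets))
    reports = instrumented  # the sink generates one report per INT packet
    return instrumented, reports


print(int_packet_counts(1000, every_nth=1))  # (1000, 1000)
print(int_packet_counts(1000, every_nth=2))  # (500, 500)
```

The price of a larger sampling interval is coarser telemetry, so the interval would have to be
chosen with the consuming application in mind.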

The results from both tests show that enabling INT on its own has a negative impact on the network.
The idea was to use INT to deliver fine-grained metrics to a CC algorithm which would avoid the
queue build-up that occurred during the tests. For applications where latency is crucial, CC is a
way to keep the queue build-ups at a minimum. Low latency does not come without a cost; in this case
the trade-off is lower bandwidth. As mentioned in section 2.2.3c, HPCC caused almost zero queues but
had a 5% loss in bandwidth. As a result HPCC achieved 95% faster flow-completion time compared
to earlier implementations. HPCC uses 2 bytes for the header and adds 8 bytes of data at each hop.
This is one byte less of INT header and 4 bytes less of INT data per hop than the implementation in
this project. The header sizes depend on which metrics are gathered and what resolution they have.
R. B. Basat et al. proposed PINT [35], which lowers the INT header overhead to as low as two bytes
per packet and was evaluated together with HPCC. In this project the header sizes were not
optimized, which leaves room for improvement.
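
The header sizes above translate into per-packet overhead as in the following Python sketch. The
sizes for this project (3 bytes of INT header and 12 bytes of INT data per hop) are derived from the
comparison with HPCC in the text. The function name is hypothetical, and it is assumed that every
switch on the path adds its data exactly once.

```python
def int_overhead_bytes(hops, header_bytes, per_hop_bytes):
    """Bytes added to one packet after traversing `hops` switches."""
    return header_bytes + hops * per_hop_bytes


PROJECT = {"header_bytes": 3, "per_hop_bytes": 12}  # derived from the text above
HPCC = {"header_bytes": 2, "per_hop_bytes": 8}

for hops in (3, 4, 5):
    print(hops, int_overhead_bytes(hops, **PROJECT), int_overhead_bytes(hops, **HPCC))
# 3 39 26
# 4 51 34
# 5 63 42
```

Under these assumptions the overhead grows linearly with the network diameter, which is in line
with the observation that performance declined as more switches were added.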

There are multiple use cases for INT besides CC. Load balancing and anomaly detection were two
other examples mentioned in section 2.2.3c. Load balancing is similar to CC, but instead of limiting
the bandwidth, a flow can be separated into multiple flows distributed between switches or hosts.
INT brings higher resolution in metrics such as the load of different switches and hosts, which
enables better decisions about where traffic should be forwarded. Anomaly detection can be used to
find patterns of attacks such as DDoS and network breaches. The telemetry data could be provided to
a controller which can take appropriate action to stop or mitigate these attacks. Furthermore,
accountability could be provided if data from INT is saved, since it can be used at later stages in
investigations of attacks such as breaches. Data from INT could also be used to optimize networks
because it makes it possible to pinpoint where there might be bottlenecks or security risks.
However, INT produces a lot of data and it might not be feasible to store such large amounts. To
summarize, there are many applications that could be built on INT. Depending on which application
is to be implemented, it is up to the developers to investigate whether the benefits outweigh the
added overhead.

4.3 General security aspects

In this section some concerns regarding security flaws and improvements are presented and discussed
for programmable data planes with P4. These include bugs, debugging of P4 code, and the idea of
added cryptography at the Link layer level.

4.3.1 Security concerns and bugs
Agape et al. investigate in their position paper [36] which new security challenges P4 introduces.
They point out that security studies regarding OpenFlow still apply; for more information regarding
OpenFlow, see [37]. Agape et al. describe a number of security flaws. For example, switches that run
P4Runtime are vulnerable to man-in-the-middle attacks and channel flooding. In the case of
man-in-the-middle attacks, important non-encrypted gRPC messages which contain information such as
configuration files, tables and control instructions can be captured between the switch and
controller. This information can be used to spoof the controller. In the case of channel flooding,
the P4Runtime agents on the controller and the switches can flood each other, resulting in slow
response times or denial of service at the controller, or higher latency in the network.

P4 enables the programmer to define the data plane completely, and with faster development, the
risk of runtime bugs inherently increases. This might not be the case for programming languages
which include the necessary checks to find bugs. P4, however, does not include such checks, and the
network developer additionally needs to think about how the P4 logic handles an unexpected value or
protocol. Otherwise, bugs and exploits might be possible as a result. M. Dumitru, D. Dumitrescu
and C. Raiciu investigate how bugs in P4 programs can be exploited on different targets [38]. The
authors point out that P4 suffers from weak points similar to those of the programming language C.
They further explain that there are some clear differences that result in weaker attacks against P4
compared to C. These are as follows:

• In P4, if a read or write operation is invalid when running the program then only the value
(reads) or the memory location (writes) of that operation are invalid. In C if these faults occur,
then the whole behaviour of the program is undefined after that point resulting in an execution
error.

• Code injections that are present in other languages are not possible in P4 since the code is
immutable after deployment.

• In P4 the execution order is immutable, which provides control-flow integrity [39] by default.

They uncover in testing that the Bmv2 switch leaks information from prior packets. Regardless, Bmv2
is not meant as a production target [13], which the authors also note.

It is clear that with expanded programmability, there is a greater risk of bugs slipping through to
production networks. These bugs could cause security flaws, faults and downtime. As a result,
multiple debugging methods and tools have been proposed [40], [41], [42], [43]. ASSERT-P4 [42],
proposed by L. Freire et al., and Vera [41], proposed by R. Stoenescu et al., both use symbolic
execution to find bugs such as parsing/deparsing errors, loops, etc. Both work by translating the P4
program into a different representation on which an engine can search for bugs. In essence, the
engines model the different parts of a P4 program and then inject different packets into the model.
P4Tester [40], proposed by Y. Zhou et al., finds bugs by injecting probes into a network. These
probes use different headers (Ethernet, IPv4, and so forth) to invoke table rules and then save
information in the same packet that can be used for debugging. Y. Zhou et al. proposed P4DB [43],
which lets the programmer add "debugging snippets" to the code while developing. These snippets
are similar to standard breakpoints. Breakpoints enable the programmer to pause code during runtime
and inspect variables. With P4DB the execution is not paused; instead, the programmer gets the state
of different elements at the points of interest.

4.3.2 Cryptographic Hash

By providing network security at lower levels such as the link layer, the network can be safeguarded
against cyberattacks, e.g. man-in-the-middle attacks and eavesdropping, which often exploit
vulnerabilities of the link layer. One way of providing such security is MACsec (802.1AE-2006),
a security protocol introduced by the IEEE 802.1 working group in 2006 [44]. MACsec uses
encryption to ensure confidentiality, integrity and authenticity between peers in the network. F.
Hauser et al. propose P4-MACsec, which implements MACsec in an SDN with P4 switches [45].
The authors point out that automated deployment of MACsec in legacy switches is not feasible.
P4-MACsec leverages an SDN controller for automated deployment. Furthermore, P4 is used to
implement MACsec functionality in the data plane. They further point out that they used the
Advanced Encryption Standard in Galois/Counter Mode (AES-GCM) for encryption and decryption.
However, cryptographic hashes are not natively supported by P4 targets at the current time. D. Scholz
et al. investigate the feasibility of implementing cryptographic hash algorithms on different P4
targets such as CPUs, NPUs and FPGAs [46]. They concluded that none of the hash algorithms they
tested delivers enough performance on every platform. They recommend that each P4 target implement
cryptographic hash functions that suit the target, chosen from a family recommended by the P4
specification. P4 thus demonstrates the potential of new network applications, which can motivate
switch manufacturers to develop and implement new features.

4.4 Firewall

During the test, the firewall blocked packets with spoofed MAC addresses and Host 4 had no access
to the network. However, the switch and controller (simple_switch_CLI) run P4Runtime, which is
vulnerable to man-in-the-middle attacks, see section 4.3.1. If an intruder gains access to the
controller, new table entries that give access to the network can be installed. Also, if the
attacker has physical access to the switch, gaining entry to the network is not a difficult task.

For a complete solution, controller logic needs to be implemented together with the switch P4
behavior. This is currently not implemented. How hosts and network devices should be verified and
how the rules are distributed to the network's switches was not investigated. The impact on
performance was not measured either. Currently, this implementation of a data plane firewall adds
two rules for each individual host and switch: one rule for outgoing traffic and one rule for
incoming traffic. The number of table entries should affect the performance of the Bmv2 switch
according to the factors described in section 2.2.2. However, a production switch might not have the
same limitations as the Bmv2.
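
The matching logic described above can be modelled in Python as in the following sketch. It is a
simplified model and not the P4 code of this project: the class and method names are hypothetical,
and the table layout, with one incoming and one outgoing rule per device keyed on physical port and
MAC address, is an assumption based on the description.

```python
class PortMacFirewall:
    """Allow a frame only if both endpoints match a known (port, MAC) pair."""

    def __init__(self):
        self.incoming = set()  # rules for traffic entering from a device
        self.outgoing = set()  # rules for traffic leaving towards a device

    def add_device(self, port, mac):
        """Install the two rules for one host or switch."""
        key = (port, mac.lower())
        self.incoming.add(key)
        self.outgoing.add(key)

    def allows(self, in_port, src_mac, out_port, dst_mac):
        return ((in_port, src_mac.lower()) in self.incoming
                and (out_port, dst_mac.lower()) in self.outgoing)

    def rule_count(self):
        return len(self.incoming) + len(self.outgoing)
```

In this model a frame with a spoofed source MAC address is dropped because the address does not
match the physical port it arrived on, and a host without installed rules has no access at all.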

Currently, this firewall implementation only checks the MAC addresses and port numbers. However,
other rules could be implemented; for example, the switches could check the destination port of the
transport layer headers and drop packets accordingly. There is also potential for other kinds of
rules. One such rule could be that a new host may only initiate communication with already
present hosts if these have previously initiated communication with the new host.
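
One possible reading of this last rule is sketched below in Python. The class and its names are
hypothetical and the rule was not implemented in this project. On a P4 target the state would have
to be kept somewhere, for example in registers or in table entries installed by a controller, which
is not explored here.

```python
class InitiationPolicy:
    """A new host may contact a present host only after that host contacted it first."""

    def __init__(self, present_hosts):
        self.present = set(present_hosts)
        self.contacted = set()  # (present host, new host) pairs already seen

    def allows(self, src, dst):
        if src in self.present:
            # Present hosts may send freely; remember contact with new hosts.
            if dst not in self.present:
                self.contacted.add((src, dst))
            return True
        # A new host is only allowed to answer present hosts that reached out to it.
        return dst in self.present and (dst, src) in self.contacted
```

With this policy a newly connected host stays silent towards the network until one of the present
hosts has addressed it.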

4.5 Future work

This thesis implemented prototypes of different applications for P4 with a "proof of concept"
methodology rather than as actual implementations. These prototypes showed potential and could be
expanded upon, and could also be tested on real hardware and not only in emulated systems.
Performance metrics from tests on real hardware could bring more insight into how these protocols
impact the network. Some ideas on how these concepts can be built upon, which would have been
interesting to evaluate and analyze, are presented below.

The idea of Shorternet was to conceptualize custom-made protocol stacks that differ considerably
from the standardized model. However, as was discussed, this removes some functionality. An
interesting idea for further study is whether a hybrid network is feasible, which may utilize both
completely custom-made models and the standards. This may prove useful for networks that have
a well-defined and fixed local network structure but also connection points outwards to either a WAN
or another distant LAN. A hybrid network could also prove useful for combining tailor-made
protocols with standard functionality for applications that expect the standard protocols.
Furthermore, Shorternet showed the flexibility of P4 and its easy-to-use concept for packet
processing. This brings possibilities for new protocols that expand upon current standardized
protocols, such as adding packet segmentation or packet sequencing in lower layers than before, to
remove redundant information while retaining functionality.

INT was implemented as a tool, and many applications can be built using the data INT offers, for
example congestion control and load balancing. The INT protocol could, however, be optimized in
multiple ways, for example with shorter headers and fewer INT packets in the network. It is also
worth pointing out that INT can be applied to other protocols, for example TCP, if there is a need
for feedback and control of these protocols.

As technology continues to evolve together with digitization, cyber-security is becoming more crucial
than ever for companies to survive. With the concept of data plane firewalls there is potential for
increased levels of security within computer networks. In this project one such firewall concept was
presented, a simple yet effective one. However, a controller needs to be constructed before the
firewall can be applied in networks. The concept could be expanded with a way of dynamically adding
and removing further sets of rules, and new ideas for rules could be explored.

5 Conclusion

To summarize, three different applications of P4 were implemented and investigated. Shorternet
showed gains in bandwidth, latency, and jitter. The bandwidth gain was significant for small
payloads but decreased for larger payloads, as the gain is inversely proportional to the packet
length. Latency and jitter were on average approximately 75% and 86% lower, respectively, with
Shorternet, but these results may have been affected by the characteristics of the emulated network.
Although network performance improved, there are some downsides worth considering. Applications in
the network that assume that standard communication protocols are present will not work as intended.
This calls for completely custom-made solutions involving the full protocol stack, which might not
be feasible. For example, a new benchmarking tool had to be constructed because legacy tools like
iperf2 could not be used. In-band Network Telemetry (INT) showed great potential and, together with
congestion control or other applications such as anomaly detection, could provide low-latency or
more secure networks at the expense of reduced bandwidth. The results showed that the added overhead
decreases the performance as more traffic and/or larger data packets are introduced to the network.
However, this decrease may prove insignificant if the INT data can be utilized by more advanced CC
algorithms or other implementations. Lastly, a data plane firewall which matched physical port
numbers with MAC addresses was implemented. The firewall proved to be effective and could easily
be expanded with new rules. However, a controller that is able to verify devices and securely
distribute rules to the switches needs to be added before the firewall is deployed in real networks.
The firewall is not completely foolproof: if an attacker is able to spoof controller messages or can
access the controller, the firewall could be compromised.

Programmable data planes open up a new front of security threats. P4 as a tool provides programmers
with more flexibility, which, paired with faster innovation cycles, may introduce more bugs
into production. Because of this, several debuggers have been developed; P4DB is one such
example, which helps the developer find runtime bugs. Some existing security flaws were presented,
and because Software Defined Networks (SDN) and programmable data planes coexist,
security concerns from SDN carry over as well.

During the project, P4 was found to be a powerful and easy-to-use tool. Prototypes were fast to
develop and implement. With the help of the behavioral model version 2 (Bmv2) and Mininet,
prototypes could be implemented and tested in an emulated environment, which accelerated
development, albeit without fully representing a production network. The network could be customised
at a very low level, which provides considerable flexibility. Many applications which utilize P4 are
emerging, which points to scientific interest in the language as well as to its practicality.

References

[1] P4 Consortium, “P4 Language and Related Specifications.” 2021. [Online]. Available:
https://p4.org/specs/ (accessed on: 2021-05-14).

[2] Requirements for Internet Hosts - Communication Layers, RFC 1122, R. Braden, Internet
Engineering Task Force, Los Angeles, USA, 1989. Available:
http://www.rfc-editor.org/rfc/rfc1122.txt.

[3] Transmission Control Protocol, RFC 793, J. Postel, Information Sciences Institute, Los Angeles,
USA, 1981. Available: https://datatracker.ietf.org/doc/html/rfc793.

[4] User Datagram Protocol, RFC 768, J. Postel, Information Sciences Institute, 1980. Available:
https://datatracker.ietf.org/doc/html/rfc768.

[5] Internet Protocol, RFC 791, J. Postel, Information Sciences Institute, Los Angeles, USA, 1981.
Available: https://datatracker.ietf.org/doc/html/rfc791.

[6] Internet Protocol, Version 6 (IPv6) Specification, RFC 8200, S. Deering, R. Hinden, Internet
Engineering Task Force, 2017. Available: https://datatracker.ietf.org/doc/html/rfc8200.

[7] IEEE Standard for Ethernet, IEEE 802.3-2018, IEEE Computer Society, New York, USA, 2018.
Available: https://ieeexplore.ieee.org/document/8457469.

[8] Internet Assigned Numbers Authority (IANA), “Protocol Numbers,” 2021. [Online]. Available:
https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
(accessed on: 2021-02-17).

[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco,
A. Vahdat, G. Varghese, and D. Walker, “P4: Programming protocol-independent packet
processors,” SIGCOMM Comput. Commun. Rev., vol. 44, p. 87–95, July 2014. [Online].
Available: https://dl.acm.org/doi/10.1145/2656877.2656890 (accessed on: 2021-01-26).

[10] M. Budiu and C. Dodd, “The P4_16 Programming Language,” SIGOPS Oper. Syst. Rev., vol. 51,
p. 5–14, Sept. 2017. [Online]. Available: https://dl.acm.org/doi/10.1145/3139645.3139648
(accessed on: 2021-01-26).

[11] P4 Language Consortium. "01 Introduction to Data Plane Programming (Stephen Ibanez),"
YouTube, Nov. 30, 2017. [Video file]. Available: https://www.youtube.com/watch?v=
qxT7DKOIk7Q (accessed on 2021-05-05).

[12] The P4.org Architecture Working Group, "P4_16 Portable Switch Architecture (PSA),"
2018. [Online]. Available: https://p4lang.github.io/p4-spec/docs/PSA-v1.1.0.pdf
(accessed on 2021-02-03).

[13] Behavioral model, Version 1.14.0, [Software], P4 Language Consortium, 2021. Available:
https://github.com/p4lang/behavioral-model (accessed on: 2021-05-14).

[14] Mininet, Version 2.3.0, [Software], 2021. Available: https://github.com/mininet/
mininet (accessed on: 2021-02-14).

[15] A. Bas, “Performance of bmv2,” 2019. [Online]. Available: https://github.com/p4lang/
behavioral-model/blob/main/docs/performance.md (accessed on: 2021-05-05).

[16] A. Bas, V. Kumar, H. Hu, C.W. Cen, "P4 forwarding large packets," 2018. [Online]. Available:
https://github.com/p4lang/behavioral-model/issues/567 (accessed on: 2021-03-
18).

[17] The P4.org Applications Working Group, “In-band Network Telemetry (INT) Dataplane
Specification,” 2020. [Online]. Available: https://raw.githubusercontent.com/p4lang/
p4-applications/master/docs/INT_v2_1.pdf (accessed on: 2021-02-07).

[18] The P4.org Applications Working Group, “Telemetry Report Format Specification,” 2020.
[Online]. Available: https://github.com/p4lang/p4-applications/blob/master/
docs/telemetry_report_v2_0.pdf (accessed on: 2021-02-07).

[19] V. Jacobson, “Congestion avoidance and control,” SIGCOMM Comput. Commun. Rev., vol. 18,
p. 314–329, Aug. 1988. [Online]. Available: https://dl.acm.org/doi/10.1145/52324.
52356, Accessed on: 2021-05-17.

[20] Intel Corporation, “Intel Deep Insight Network Analytic Software,” 2020. [Online]. Available:
https://www.intel.com/content/www/us/en/products/network-io/
programmable-ethernet-switch/network-analytics/deep-insight.html (accessed
on: 2021-05-12).

[21] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly,
M. Alizadeh, and M. Yu, “HPCC: High Precision Congestion Control,” in Proceedings of
the ACM Special Interest Group on Data Communication, SIGCOMM ’19, (New York, NY,
USA), p. 44–58, Association for Computing Machinery, 2019. [Online]. Available: https:
//dl.acm.org/doi/10.1145/3341302.3342085, Accessed on: 2021-03-11.

[22] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel,
M. H. Yahia, and M. Zhang, “Congestion Control for Large-Scale RDMA Deployments,”
in Proceedings of the 2015 ACM Conference on Special Interest Group on Data
Communication, SIGCOMM ’15, (New York, NY, USA), p. 523–536, Association for
Computing Machinery, 2015. [Online]. Available: https://dl.acm.org/doi/10.1145/
