Minimizing read bottlenecks in I/O bound systems
Effective and adaptive I/O using C++

Master's thesis in Computer Science and Engineering

Tobias Bäckemo
Konrad Olsson

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2022

© Tobias Bäckemo, Konrad Olsson, 2022.

Supervisor, company: Patrik Ellrén, Carmenta
Supervisor, department: Ahmed Ali-Eldin Hassan, Department of Computer Science and Engineering
Examiner: Carl-Johan Seger, Department of Computer Science and Engineering

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2022

Abstract

The overall performance of a computer system that reads a lot of data is directly linked to its ability to extract performance from the storage hardware. This thesis, conducted together with the company Carmenta, examines how to optimize a system's data reads. The targeted storage device is an NVMe SSD. Several experiments simulating different workloads were conducted and laid the foundation for a new set of guidelines. Lastly, the guidelines were evaluated in a custom-built benchmark. Based on the evaluation, it was concluded that by using the guidelines, a system's performance could be increased.

Keywords: I/O, read, C++, SSD, NVMe, Windows, GIS, mmap.

Acknowledgements

Firstly, we would like to thank all the people at Carmenta for creating such a fun and friendly work environment. In particular, we would like to thank Patrik Ellrén for all his guidance and assistance during this thesis. We would also like to thank Magnus Åkerstedt Bergsten for providing us with technical assistance.

Furthermore, we would like to thank our supervisor from Chalmers, Ahmed Ali-Eldin Hassan, for his guidance during this thesis. We would also like to thank our examiner, Carl-Johan Seger, for valuable feedback.

We had to reach out to some experts during the thesis regarding memory-mapped files. So thank you, Dr. Alexandra (Sasha) Fedorova, University of British Columbia, and Dr. Andrew Crotty, Carnegie Mellon University, for answering our emails.

Last but not least, a great thank you to all of our friends at Chalmers for making our time here so amazing. If you come across this thesis, know that you are not forgotten.

Tobias Bäckemo, Konrad Olsson, Gothenburg, June 2022

Contents

1 Introduction
  1.1 Background
    1.1.1 Previous work
  1.2 Purpose
  1.3 Formulation of the problem
  1.4 Limitations
  1.5 Disposition
2 Theory
  2.1 NVMe SSD
  2.2 The iostream
  2.3 Memory-mapped file
  2.4 I/O metrics
  2.5 File systems
    2.5.1 File system block size
    2.5.2 Optimization techniques used by file systems
      2.5.2.1 Buffer cache
      2.5.2.2 Prefetching
  2.6 Reading block size
  2.7 Parallel I/O
  2.8 Synchronous and asynchronous read
3 Research methodology
  3.1 Prestudy phase
    3.1.1 Parameters
    3.1.2 Reading techniques
  3.2 Development phase
    3.2.1 Reader overview
    3.2.2 Instance variables
    3.2.3 Functions
    3.2.4 Developed readers
      3.2.4.1 StreamReader
      3.2.4.2 MappedReader
      3.2.4.3 BufferReader
      3.2.4.4 AsyncReader
    3.2.5 Testing methodology
    3.2.6 Considering the cache
    3.2.7 Simulating random reads
    3.2.8 Testing parallel and asynchronous reading
    3.2.9 Synchronous single-threaded tests
    3.2.10 Asynchronous tests
    3.2.11 Parallel tests
  3.3 Experiment phase
  3.4 Evaluation phase
4 Experiments
  4.1 Block size impact on performance
    4.1.1 Sequential reads using different block sizes
    4.1.2 Random reads using different block sizes
    4.1.3 Latency using different block sizes
    4.1.4 Choosing the optimal block size
  4.2 Queue depth impact on performance
    4.2.1 Sequential reads using different queue depths
    4.2.2 Random reads using different queue depths
    4.2.3 Choosing the optimal queue depth
  4.3 MappedReader chunk size
  4.4 Comparing readers
    4.4.1 Sequential reads
      4.4.1.1 File size
      4.4.1.2 Total bytes read
      4.4.1.3 Discussion on sequential reads experiments
    4.4.2 Random reads
      4.4.2.1 Read request size
      4.4.2.2 File size
      4.4.2.3 Total bytes read
      4.4.2.4 Reading more than memory
      4.4.2.5 Discussion on random reads experiments
  4.5 Parallel reads
    4.5.1 Parallel sequential reads
    4.5.2 Parallel random reads
    4.5.3 Discussion on parallel reads experiments
  4.6 Integrating asynchronous reading
  4.7 Guidelines
5 Benchmarks
  5.1 AdaptiveReader
  5.2 Carmenta test-suite
    5.2.1 Carmenta test-suite benchmark
    5.2.2 Animation files benchmark
  5.3 Interactive map
  5.4 Discussion of benchmark results
6 Conclusion
7 Future Work
  7.1 Create a better model for cache hits
  7.2 Testing other storage hardware
  7.3 Exploring more options in Windows
  7.4 Testing a different OS
  7.5 Explore DirectStorage
Bibliography
A Test computer
B Guidelines

1 Introduction

Many modern computer systems read and process large amounts of data while running. Although storage devices have become faster, the ability to scale processing surpasses the ability to scale Input/Output (I/O). For this reason, it is of utmost importance that storage I/O is optimized [15].

At the application level, a developer has multiple options when deciding how to read data from secondary storage [37]. Which option is most suitable depends on several factors. Two such factors are what kind of secondary memory the host system is using and the system's workload.

With the introduction of non-volatile memory express solid-state drives (NVMe SSDs), latency has drastically decreased. Compared to a hard disk drive (HDD), the NVMe SSD is orders of magnitude faster [34]. Due to the significant performance improvement, latency from CPU tasks is no longer negligible. For example, unnecessary memory copies can now noticeably affect a system's performance [9]. Moreover, the internal architecture of SSDs differs from an HDD's architecture. The best method for reading from one type of storage device is not necessarily the best method for reading from another [13].

Another factor that impacts which option is most suitable for reading data is a system's workload [32]. A system can access data in many different ways.
Two examples are accessing data in small chunks at several offsets in a file and accessing data through large sequential reads. If the workload and the targeted storage device are not considered, reading data can become a bottleneck for a system's overall performance. Based on this consideration, this thesis introduces a set of guidelines that a developer can use to optimize reading in their system.

1.1 Background

An example of a computer system that reads a lot of data is a Geographic Information System (GIS). A GIS's purpose is to display and analyze geospatial data. Displaying the data is done by combining multiple layers of information. These layers can be, for example, elevation information, roads, rivers, buildings, and geographic names [11].

Carmenta, the company at which this thesis was produced, provides the tools to build Geographic Information Systems. The main tool Carmenta provides for this is called Carmenta Engine, which is a software development kit for creating geospatial applications. These geospatial applications are often run on personal computers where all the data needed resides locally [4]. An improvement in reading performance will benefit the performance of the geospatial application as a whole, providing a better user experience with a shorter startup time and smoother interaction with the application.

1.1.1 Previous work

Regarding previous work, three types of papers were of particular interest. Firstly, papers containing discussions on how the advantages of using an SSD can be utilized. There exist several papers of this type [40, 38, 36]. These papers highlight how suboptimal decisions are made due to a lack of awareness of the internal architecture of the SSD and provide possible solutions.

However, what the previously mentioned papers have in common, and what differs from the scope of this thesis, is that all of the presented solutions are based on modifying or building a new file system. In contrast to those papers, this thesis only presents guidelines for optimization that can be implemented at the application level. Since one aim of this thesis was to utilize NVMe SSDs, understanding how they operate is crucial. Luckily, such papers exist in abundance, providing information about access latency, physical design, possibilities for parallelism, internal page sizes, and more [33, 39].

Secondly, papers that discuss how one can optimize against different workloads were of interest. Saif, Nussbaum, and Song (2020) [33] have written a paper in which they examine how I/O access patterns impact the performance of an SSD. Inspiration was taken from the experiments and analyses included in their report. However, what differs between their project and this project is that they did not use different reading techniques. For example, they did not compare buffered reading with memory-mapped reading. Moreover, all of their experiments were conducted on a Linux machine.

Lastly, papers that evaluate and compare different read methods were of interest. For this thesis, experiments that compared memory-mapped file reading and traditional buffered reading were of particular interest. Andrew Crotty, Viktor Leis, and Andrew Pavlo (2022) [7] are very critical of using memory-mapped file reading in the context of Database Management Systems. Their paper points out some of the flaws of using memory-mapped file reading.
Even though this thesis is not focused on database management systems, the insights they share were of great value when analyzing results. Moreover, what differs between their work and this thesis is that their experiments were conducted on a Linux machine.

However, memory-mapped file reading has also been pointed out as an excellent alternative to traditional read calls. Alexandra (Sasha) Fedorova, professor at the University of British Columbia, gives an overview of how memory-mapped files work and some of the advantages they provide [10]. Although all of her experiments were conducted on Linux, she shares knowledge that was used to analyze the results when comparing memory-mapped files in Windows with other I/O techniques.

This thesis is unique because it researches how a developer can optimize reading at the application layer by choosing the preferable I/O technique given a system's workload.

1.2 Purpose

The purpose of this project was to examine how a read-heavy system can be optimized at the application level. The increase in performance has been gained by utilizing the NVMe SSD's advantages and optimizing against different workloads. The main idea was to select the most suitable read strategy given a workload, provided that the host system is using an NVMe SSD. By being smart about the choice of reading technique, one can extract more performance from the storage hardware and hence remove bottlenecks.

1.3 Formulation of the problem

As mentioned above, the goal of this project was to examine how a read-heavy system can be optimized by considering its workload and utilizing the advantages that come with NVMe SSDs. To achieve the goal of this thesis, the following three questions had to be answered.

• How can reading be optimized at the application level by utilizing knowledge about a system's workload?
• How can the advantages of using an NVMe SSD be utilized in the context of reading data?
• How should the developed guidelines be evaluated?

1.4 Limitations

The following four limitations were set to keep the project's scope reasonable.

1. It was decided to formulate and evaluate the new guidelines on a computer that runs Windows 10, 64-bit version. Specifications can be seen in Appendix A. Developing guidelines for other operating systems such as macOS or Linux would also be interesting. However, due to lack of time, it was decided to limit the project's scope to Windows 10.
2. Data can be stored in many different ways: in databases, on the network, on USB drives, etc. However, this thesis only considers data stored locally in secondary memory.
3. All code is written in C++. Due to this choice, implementations in other languages have not been evaluated.
4. The different I/O techniques were tested in experiments with synthetic data. However, when it comes to benchmarking, it was decided only to evaluate the guidelines for file formats that handle geospatial data.

1.5 Disposition

The remainder of this thesis is organized as follows. The following chapter is a theory chapter, giving all the background knowledge necessary for understanding the rest of this thesis. Subsequently, a chapter about the research methodology presents the crucial parameters for testing. The chapter about research methodology also includes a description of the implemented testing functions. After that comes the chapter in which the experiments on synthetic data are presented and discussed.
Following the experiments comes a chapter where the gained knowledge is used to improve the performance of a real-life system. Finally, conclusions and suggestions for future work are presented.

2 Theory

This chapter presents the theory about hardware and software relevant to understanding this thesis, starting with the targeted storage device for this project, the NVMe SSD.

2.1 NVMe SSD

A solid-state drive (SSD) is a flash memory-based storage device. An SSD is made up of an array of flash chips. Each chip is made up of two or more planes. The planes contain multiple blocks, where each block, in turn, is made up of 64 or 128 pages. A page is the smallest unit of data that can be read or written. The data area of a page contains the stored data and can be 2KB, 4KB, 8KB, or 16KB in size. Figure 2.1 shows the architecture of an SSD. The list that follows briefly describes the components that an SSD consists of and their tasks [18, 20, 13].

• Host interface: The SSD and host computer communicate over the host interface.
• SSD controller: The main task of the SSD controller is to translate the I/O operations into flash memory operations.
• RAM: The SSD has internal RAM in which it can cache relevant data such as read commands or recently accessed data. The internal RAM is also used to store the map table, which contains information about where data is stored on the device.
• Processor: The processor gives commands and tasks to the flash controller.
• Flash controller: Manages flash components. There can be multiple flash controllers in a single SSD.
• DMA controller: Responsible for transferring data when receiving commands from the processor.
• Channel: Flash memory packages are connected to different channels. Multiple channels can be accessed in parallel.
• Flash memory packages: Store data.

Figure 2.1: The internal architecture of a solid-state drive.

In contrast to a hard disk drive (HDD), an SSD contains no moving parts. Since the memory in an SSD is not accessed by a physical arm, memory access latency does not, in theory, depend on whether memory is accessed sequentially or randomly. SSDs are also highly parallel because of their many flash chips, meaning that one SSD can service multiple I/O requests simultaneously. Another vital distinction between HDDs and SSDs is that while HDDs use the same block size as the page size that the OS typically uses, namely 4KB, an SSD uses what is called a clustered page. A clustered page is an internal unit of read or write operations between 4KB and 4MB [33, 18]. The idea is that when the SSD receives a write request for some number of pages, it will write them to different chips. When the same data is later accessed, the SSD can fetch it in parallel, increasing throughput. See Figure 2.2 for a visual representation.

When a host wants to send an I/O request to the SSD, it does so via the host interface. The type of interface has a significant impact on the overall I/O performance [39]. In personal computers, there are two common ways to connect SSDs: through Serial Advanced Technology Attachment (SATA) or Non-Volatile Memory Express (NVMe). The NVMe interface is used to access flash-based storage over the Peripheral Component Interconnect Express (PCIe) bus. NVMe is a newer interface that has become common in personal computers during the last couple of years. There are two main benefits of the NVMe interface compared to the SATA interface.

The first benefit is the increased throughput, enabled partly by the bandwidth of the PCIe bus but also by features of the NVMe controller. The NVMe SSD connects to 4 PCIe lanes, which gives a theoretical bandwidth of about 3.94 GB/s and 7.88 GB/s on PCIe 3.0 and PCIe 4.0, respectively, while SATA 3.0 speed is about 550 MB/s. When it comes to the interface itself, SATA supports a single I/O queue with a depth of 32, while NVMe offers as many queues as CPU cores, with a maximum of 64K queues and a depth of 64K per queue [35]. The second advantage is access latency. The I/O path in both hardware and software is shorter for NVMe devices compared to SATA [39].

Figure 2.2: Parallelism of an SSD.

2.2 The iostream

The iostream is a C++ class included in the C++ standard library [6]. The class provides functions that can be used to read from and write to standard output, as well as for other tasks such as file I/O. The functions are built on top of system calls as an abstraction to make I/O easier to use and faster to implement for programmers. A downside of iostream is that the implementation includes unnecessary memory copies, which affects overall performance if the storage device is fast [9].
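For reference, a minimal sketch of buffered file reading with the iostream library is shown below; the file name and the 4KB chunk size are illustrative assumptions, not recommendations.

```cpp
#include <fstream>
#include <vector>

int main() {
    // Open the file in binary mode (the file name is a placeholder).
    std::ifstream file("example.bin", std::ios::binary);
    if (!file) return 1;

    std::vector<char> buffer(4096);  // iostream also buffers internally
    // read() requests a full buffer; gcount() reports how many bytes
    // actually arrived (the last chunk may be shorter).
    while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           file.gcount() > 0) {
        std::streamsize n = file.gcount();
        // ... process n bytes here ...
        (void)n;
    }
    return 0;
}
```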
2.3 Memory-mapped file

Memory-mapping is a technique that can, among other things, be used for reading and writing files. The technique is available in many operating systems, two examples being Linux and Windows [37]. The main idea is to use the operating system's paging to perform reading and writing.

When a calling process asks the kernel to memory-map a file, the kernel will allocate a region of that process's virtual address space equal to the size of the file and return a pointer to the beginning of this allocation to the calling process. At this point, no data is loaded. When the process uses the given pointer to access a page of mapped file data, i.e., read the file, a page fault will be triggered, and the operating system will swap in the demanded data [7].

It has been hinted that there is an internal parameter in Windows called the memory-mapped file chunk size. This internal parameter defines how many pages are paged in upon a page fault. The chunk size can differ between versions of Windows [5].

Three advantages of using memory-mapped files in Windows for accessing files are listed below [14, 17].

• Because no explicit read system calls need to be issued, a system can gain performance by avoiding unnecessary memory copies and context switches.
• Using memory-mapped files is very convenient. Once the memory-mapped file is open, reading from the file becomes a simple dereferencing of the pointer.
• In contrast to buffered reading, there is no need to manage buffers when using memory-mapped files. Instead, the operating system handles this task.

One disadvantage of using memory-mapped files is that on a 32-bit machine, the virtual address space is limited, and hence it is not possible to map files that are larger than 2-3 GB [14]. Another disadvantage is that asynchronous I/O is not possible with memory-mapped files. If a page fault occurs, the operating system will block the calling process until the data is fetched from secondary memory [30].
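The following sketch illustrates how a file can be memory-mapped for reading in Windows with the system calls named above. It is a minimal example with simplified error handling, and the file path is a placeholder.

```cpp
#include <windows.h>
#include <iostream>

int main() {
    // Open the file for reading (the path is a placeholder).
    HANDLE file = CreateFileW(L"C:\\data\\example.bin", GENERIC_READ,
                              FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Create a read-only mapping object covering the entire file.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping == nullptr) { CloseHandle(file); return 1; }

    // Map a view of the whole file into the process's virtual address
    // space. No data is read yet; pages are faulted in on first access.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    if (data != nullptr) {
        // Dereferencing the pointer triggers a page fault, and the OS
        // swaps in the demanded page(s) from the file.
        std::cout << "First byte: " << static_cast<int>(data[0]) << '\n';
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```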
2.4 I/O metrics

How well a specific reading strategy performs is described by an I/O metric. Two commonly used I/O metrics are latency and throughput [13]. Latency is the time that it takes to perform a specific I/O task, for example, how long it takes to read a 1GB file. Throughput is defined as the number of bytes transferred divided by the latency, in B/s. This metric can be interpreted as the speed at which the system can read or write data. Using different I/O metrics will describe the distinct characteristics of a reading strategy [13].

2.5 File systems

The file system defines how files are named, stored, and read from the storage device. In this report, file system refers to the storage partition's format. Some examples of such formats are the File Allocation Table (FAT) [27], the New Technology File System (NTFS) [25], and the Apple File System (APFS) [2].

2.5.1 File system block size

Most of a system's files are usually stored in secondary memory [37]. One way to manage the memory space could be to allocate memory on the disk for the given file size. However, this strategy has a major defect. If the file grows, for example, if a user appends new text to a text file, then the file may have to be copied to another place in storage since the allocated space is no longer sufficient. Copying large amounts of data from one place on the disk to another takes time and is not something you want to do unnecessarily. Another strategy, which copes with this drawback, is to instead divide the file into some number of blocks of a given block size. The blocks do not have to be adjacent.

A file system's selected fixed block size impacts read performance and space efficiency. A very large block size will result in unused memory, since small files will only take up a small portion of the total block size. On the other hand, a small block size will instead require many seeks to gather all of the data, since it is spread out over several blocks. The default block size of Microsoft's file system NTFS depends on the volume size of the disk. Disks with a volume smaller than 16 terabytes have a block size of 4096 bytes [23].

2.5.2 Optimization techniques used by file systems

To best utilize the available hardware and hence get better overall performance, several file systems have been implemented with different optimization techniques [37]. Below follow two commonly used optimization techniques.

2.5.2.1 Buffer cache

The buffer cache is a portion of data that resides in secondary storage but is temporarily also located in main memory, even though the data is not currently in use [37]. The idea is to minimize the number of reads from secondary storage. The file system will first look in the buffer cache when a read operation is executed. If the required data is located in memory, the data can be accessed much faster than if it had to be fetched from secondary storage [37].

2.5.2.2 Prefetching

Another optimization technique is called prefetching. Prefetching, like the buffer cache, relies on the idea of storing data in memory, since accessing it there is faster than accessing the disk. However, what differentiates prefetching from buffer caching is that the data is loaded into memory before being demanded. For sequential reads, prefetching can increase a system's performance. For example, if a block is read, there is a high probability that a neighboring block will soon be demanded. However, for random reading, prefetching can be more complex, and if the prefetched data is never used, it can harm the system's performance [37].

2.6 Reading block size

When reading data from secondary memory using buffered reading, it is done in chunks of a predetermined size. The block size that is used for buffered reads has an impact on both latency and throughput [34]. Reading in larger chunks will naturally take up more main memory, but it can also lead to greater throughput. Previous research shows that for the NVMe SSD, a larger block size is always preferable [34]. However, it also shows that one cannot fully utilize the drive's capabilities even with a block size of 64MB. To do that, one must use multiple threads or asynchronous reading, which are discussed in the following sections.
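To make the block size parameter concrete, the sketch below reads an entire file in fixed-size chunks with the Windows ReadFile() call (introduced further in Chapter 3); the block size passed in is an illustrative knob, not a recommendation.

```cpp
#include <windows.h>
#include <vector>

// Reads an entire file in blockSize-sized chunks; returns total bytes read.
// A larger blockSize means fewer system calls, at the cost of more memory
// and higher per-request latency.
size_t readInBlocks(const wchar_t* path, size_t blockSize) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 0;

    std::vector<char> buffer(blockSize);
    size_t total = 0;
    DWORD bytesRead = 0;
    // Each ReadFile call advances the file pointer by bytesRead;
    // at end of file the call succeeds with bytesRead == 0.
    while (ReadFile(file, buffer.data(), static_cast<DWORD>(blockSize),
                    &bytesRead, nullptr) && bytesRead > 0) {
        total += bytesRead;  // the data would be processed here
    }
    CloseHandle(file);
    return total;
}
```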
2.7 Parallel I/O

Parallel I/O, in this thesis, refers to when multiple threads are used to make I/O requests. This is not to be confused with the parallel architecture and mechanisms used by an SSD mentioned before. The idea is that using multiple threads to make read requests will increase performance. Furthermore, simultaneously sending multiple I/O requests allows the kernel to reorder them, improving performance even on an HDD. Prior studies also show that NVMe SSDs are better at utilizing parallel I/O than HDDs [34].

2.8 Synchronous and asynchronous read

In Windows, it is possible to do either synchronous or asynchronous reads [30]. When using buffered reads synchronously, the kernel will block the calling process until all demanded bytes are read. However, in some cases, this way of doing I/O can harm a system's performance, since it could, for example, block available resources from performing other computations.

An alternative way of doing I/O is asynchronously. The kernel will not block the system when reading asynchronously, even though the data has not yet been read into memory. Using asynchronous reads has some benefits compared to synchronous reads. Firstly, instead of waiting for the reading to complete, the calling process can continue executing other instructions before checking if the result has been returned. Secondly, creating numerous I/O requests enables the kernel drivers to reorder, split up, or overlap multiple I/O requests for increased throughput [29].

Similar to the parallel I/O approach, asynchronous I/O allows several read requests to be handled simultaneously. A difference between asynchronous and parallel reading is that in the case of asynchronous reading, a single thread can start up many requests but will have to, in some later part of the program, gather the results by listening to multiple events. The number of simultaneous requests is called the queue depth [34].

3 Research methodology

The research was conducted in four phases: the prestudy, the development, the experiment, and the evaluation phase.

3.1 Prestudy phase

In this thesis, it was decided to define the workload as a set of parameters. A decision had to be made on which parameters should be included in the workload. Moreover, the tools available at the application layer for reading data had to be investigated.

3.1.1 Parameters

Several parameters impact reading performance. However, a decision had to be made regarding which parameters should be examined. This decision was based on two sources of information. The first source was available literature, such as books and scientific papers regarding reading performance. Examining the structure of the SSD and how reading is handled in Windows gives hints regarding which parameters might impact reading performance. Secondly, discussions were held with supervisors at Carmenta, in which they described their system's workload. The following parameters were examined.
File size: The file size is the measure of how much data a file holds [37]. A system can read files of many different sizes. The spectrum is vast, and files can differ in size by factors of hundreds of millions.

Read request size: The read request size is the amount of data read in a single read operation. The size typically ranges from a few bytes up to megabyte-sized reads [19].

Total bytes read: Total bytes read, in this context, means the amount of data read in one file. This size is not limited to the file size, since a system can read a part of the file more than once.

State of cache: A cache hit is when a read operation is executed, but the system can find the data in memory and does not have to access the secondary memory. Due to the significant difference in latency between disk and memory access, read operations performed in a cache-hit situation will have greater throughput. When conducting a benchmark for a particular reading technique, it is essential to consider the state of the cache; otherwise, the results will be misleading. Therefore, it was decided to run all experiments on a cold cache. In other words, all cache is cleared before the execution of the experiment starts [9].

Available threads: Generally, PCs today have a multi-core processor and, by extension, multiple available threads for performing actions. Some applications require heavy computations, leaving few CPU resources available for I/O, while other programs do only little computation and therefore have more resources to put toward I/O. If one wants to read a lot of data, it might be worth using more CPU resources, i.e., multiple threads, on the reading of data to increase throughput.

Access pattern: When reading data, the access pattern impacts reading performance [34]. Therefore, it was decided to categorize an access pattern as one of two general types. The first type considered was sequential read. As the name implies, sequential read means that the data is accessed in sequential order. An example of a use case where this pattern shows up is when reading an image as a whole. The second type of access pattern considered was random read. In contrast to sequential read, random read is when data is read from several different locations in a file, not necessarily in order. An example of where this pattern shows up is when a system wants to draw a map and only needs some parts of the geographic data available in a file [32].

3.1.2 Reading techniques

There are several alternatives for how one can read data from a disk in application space. The goal of the developed guidelines is to select the most suitable reading technique given the parameters previously described. Through literature studies and discussions with the supervisors at Carmenta, the decision was made to examine four different techniques. The selected reading techniques are described in the next section.

3.2 Development phase

The developed guidelines presented in the next chapter were based on results from several experiments. However, in order to conduct the experiments, a reliable test environment had to be developed. The designed test environment consists of test functions that test how a specific reading technique performs for a particular workload. The components that build up the test environment, and the design decisions made to make the tests reliable, are described below.
3.2.1 Reader overview

Each synchronous reading technique is implemented as a class with a common interface called IReader. Using a common interface has several advantages when comparing reading techniques. One advantage of this design decision is that time is saved by avoiding redundant code: common tests could be used instead of developing a test function for each reading technique. Another advantage is that the readers' evaluation is more reliable when common test functions are used. By using the same test code, all readers have the same preconditions.

The interface is inspired by how reading is done in a typical read-heavy system, namely Carmenta Engine. The idea was that by following the interface, the readers would be easy to integrate into a real-life system. The interface forces the class to have two instance variables and implement a set of required functions. Below is a quick overview of the interface.

3.2.2 Instance variables

The IReader interface has two instance variables. The first is the path_ variable, which is a string containing the path to the file to be opened. The second variable is the maxExtent_ variable, which stores the length of the file.

3.2.3 Functions

size_t readBytes(size_t offset, size_t amount, char* buf)
This method is used to read from the currently opened file. The function returns how many bytes were read. readBytes takes three arguments.
• offset: Where the reading should begin.
• amount: The number of bytes to be read.
• buf: A pointer to a buffer. After the read operation is executed, the demanded data will be located in the buffer.

bool isOpen()
Returns whether a file is open or not.

bool openNewFile(std::string path)
This method opens a new file handle for the provided file path path. If the file handle is already opened, it will not be reopened, since this would clear the cache related to the file. Returns whether the file could be opened or not.

bool close()
This function closes the currently open file handle. Returns whether it successfully closed the file or not.

std::string getName()
This utility function returns the name of the reader class as a string. It is mainly used when one wants to print information to the console.
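Put together, the interface can be sketched as follows. This is a reconstruction from the descriptions above, not the exact declaration used in the implementation.

```cpp
#include <string>

// Common interface implemented by all synchronous readers.
class IReader {
public:
    virtual ~IReader() = default;

    // Reads `amount` bytes starting at `offset` into `buf`;
    // returns the number of bytes actually read.
    virtual size_t readBytes(size_t offset, size_t amount, char* buf) = 0;

    virtual bool isOpen() = 0;
    virtual bool openNewFile(std::string path) = 0;
    virtual bool close() = 0;
    virtual std::string getName() = 0;

protected:
    std::string path_;      // path of the currently opened file
    size_t maxExtent_ = 0;  // length of the currently opened file
};
```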
3.2.4 Developed readers

Based on the insights gained from the earlier mentioned literature studies regarding reading techniques, it was decided to implement the following four readers.

3.2.4.1 StreamReader

The StreamReader uses functions provided by iostream to execute reading operations. This reader was created to serve as a comparison against the low-level methods.

3.2.4.2 MappedReader

The MappedReader uses the available Windows system calls CreateFileW(), CreateFileMappingW(), and MapViewOfFile() to create a memory mapping of the desired file [12, 22, 24]. As explained in the theory chapter, once the file is mapped, reading some data from the file becomes a simple dereference of a pointer. Reads are done by simply performing a memory copy from the mapped region to the provided buffer. This is not how one would use a memory-mapped file in production, but the memory copying forces the pages to be fetched by the OS.

3.2.4.3 BufferReader

The BufferReader uses the available Windows system calls CreateFileW() and ReadFile() to read files [12, 28]. To use the BufferReader, one important internal component has to be set: the block size. The block size is the largest amount of data read at once. If the read request size is less than the block size, BufferReader will use the Windows system call to read all of the requested data at once. However, there is a maximum limit on how big the buffer size can be: 4GB per read operation. The limitation comes from the fact that the parameter in the Windows system call that determines the amount to read is only 32 bits long [28].

3.2.4.4 AsyncReader

The AsyncReader uses the same Windows system calls as BufferReader, but with different arguments, to make asynchronous read requests. AsyncReader has a limit of 64 simultaneous requests, so if the reader needs to make more, it will first wait for the current ones to finish before creating new requests.

AsyncReader does not implement the IReader interface, since the interface is not compatible with how the asynchronous calls work. Therefore, the asynchronous reader is both implemented and tested a bit differently from the synchronous ones. The main difference between the AsyncReader and the synchronous implementations is the function responsible for reading data. In the synchronous readers, the function returns when the read has been performed, but making the asynchronous reader behave the same way would ruin the whole purpose of reading data asynchronously. Instead, when AsyncReader reads data, it returns an OVERLAPPED structure. An OVERLAPPED structure contains the offset from which to start reading the data and an event that gets triggered once the read has been performed [26]. Therefore, the caller of the read method is responsible for checking that the read has been successfully performed.
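A condensed sketch of this asynchronous pattern is shown below. It assumes a file handle opened with FILE_FLAG_OVERLAPPED and simplifies error handling; it illustrates the mechanism rather than the actual AsyncReader implementation.

```cpp
#include <windows.h>
#include <memory>

// Issues one asynchronous read at the given offset. The returned
// OVERLAPPED must stay alive at a stable address until the request
// completes, hence the heap allocation.
std::unique_ptr<OVERLAPPED> startAsyncRead(HANDLE file, char* buf,
                                           DWORD amount, size_t offset) {
    auto ov = std::make_unique<OVERLAPPED>();
    ov->Offset     = static_cast<DWORD>(offset);        // low 32 bits of the offset
    ov->OffsetHigh = static_cast<DWORD>(offset >> 32);  // high 32 bits
    ov->hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);  // manual-reset event
    // With FILE_FLAG_OVERLAPPED the call returns immediately;
    // ERROR_IO_PENDING means the request was successfully queued.
    if (!ReadFile(file, buf, amount, nullptr, ov.get()) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov->hEvent);
        return nullptr;  // the read could not be started
    }
    return ov;
}

// The caller later waits on the event and collects the result:
//   WaitForSingleObject(ov->hEvent, INFINITE);
//   DWORD bytesRead = 0;
//   GetOverlappedResult(file, ov.get(), &bytesRead, FALSE);
```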
3.2.5 Testing methodology

To get reliable data, each data point was created by taking the average of multiple tests. The number of trials was ten. For example, to see what throughput was achieved by reading an entire 1GB file in 4KB blocks, the file was read ten times with the 4KB block size, and the average latency was then used to calculate the throughput. The reason for running so many tests for a single data point is to minimize the impact of outliers on the final result.

3.2.6 Considering the cache

Issues can arise if one is not careful about handling the cache when conducting reading experiments. One would expect similar results for an experiment that runs several times. However, if one does not consider the state of the cache, performance can increase significantly after the first run of the experiment. The speedup comes from data being cached during the first run. When the experiment is rerun, the requested data can be found in memory, and hence access time drops.

Therefore, it was decided to run all experiments on a cold cache. A cold cache is a state of the system in which no file data is already present in memory at read time [9]. Two actions have been taken to ensure the cache is clear before conducting any experiments. Firstly, a Windows tool called RAMMap is used to clear the buffer cache [31]. Secondly, the file handle is closed and re-opened between tests, since it was observed during testing that some cached data is connected to the file handle. By closing the file handle, this data is cleared.

3.2.7 Simulating random reads

There are a couple of ways one can go about simulating random reads. The first, intuitive option is to generate some random offsets within the file and then read a certain number of bytes at each of them. This approach is versatile, since the number of random reads and the amount to read at each offset can vary. However, this approach was not chosen, for two reasons. Firstly, it does not simulate a real-world scenario well, since it reads from completely random offsets. In reality, a file will contain blocks of data, and it does not make sense to issue a read at an offset that is not located at the beginning of a block. Secondly, it can be hard to predict cache hits when the block size and the number of reads are randomly chosen. For example, if one wants to run an experiment determining throughput on a cold cache using random reads, doing many reads with a large block size will result in many cache hits on the last reads.

Instead, a block size is chosen, representing the chunk of data to be read at each random read. The file is then split into a set of offsets based on the block size. Due to this design decision, the number of random reads in a file will depend on the block size and the file size. Shuffling these offsets and doing block-sized reads from them creates a random read scenario more grounded in reality. This method also ensures that the same data is not read multiple times, avoiding cache hits when suitable.
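A minimal sketch of this offset-shuffling scheme follows; the function name is illustrative.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Splits a file of fileSize bytes into blockSize-aligned offsets and
// shuffles them, producing the random-read order used by the tests.
// The number of offsets depends on both the block size and the file size.
std::vector<size_t> shuffledOffsets(size_t fileSize, size_t blockSize) {
    std::vector<size_t> offsets;
    for (size_t off = 0; off + blockSize <= fileSize; off += blockSize)
        offsets.push_back(off);

    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(offsets.begin(), offsets.end(), rng);
    return offsets;  // each offset is read exactly once, avoiding cache hits
}
```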
3.2.8 Testing parallel and asynchronous reading

Previous work has shown that more performance can be extracted from an NVMe SSD when parallel or asynchronous reading is used [34]. For this thesis, a set of test functions was developed to evaluate the impact of asynchronous and parallel reading on performance. In the parallel experiments, the work was split between threads, and in the asynchronous experiments, the work was split between asynchronous calls. When executing a sequential read, the work is divided into block-sized calls. When reading many random offsets, the work is divided by the individual random reads. Of course, one could combine the parallel and asynchronous approaches to optimize performance, although this possibility is not explored in this thesis due to time constraints.

To enable parallel execution, the work was divided into blocks. The tests used the OpenMP API, an API for parallel programming, to divide the responsibility of reading the blocks between threads [3]. The chosen block size decides the total number of blocks to read. For example, if the file size is 24KB and the block size is 4KB, there will be six blocks. The next design decision is how to schedule the reads between the threads. For simplicity, OpenMP's default scheduling is used, which is quite simple: the reading operations are split evenly and in order between the threads. So, for example, if there are six reads to be done and two threads, the first thread will execute the first three reads, and the second thread will execute the last three reads.

3.2.9 Synchronous single-threaded tests

The performance of a reader is measured for different workloads with the help of test functions. The test functions take a pointer to an IReader as input and use the interface's visible functions to perform the reading tasks. Below is a description of the implemented synchronous single-threaded test functions.

size_t readEntireFile(IReader* reader, std::string path)
This function uses a reader to read an entire file into a buffer. First, it starts a timer. Then it opens the file and allocates a buffer of appropriate size. After that, the read is performed. Lastly, it stops the timer, deallocates the buffer, and returns the measured time.

size_t readSeq(IReader* reader, std::string path, size_t amount)
The readSeq function works exactly like readEntireFile but with a configurable number of bytes to read.

size_t readManyShuffled(IReader* reader, int nrRandomReads, std::string path, size_t amount)
In the readManyShuffled function, the file is divided into blocks of size amount. The blocks are then shuffled with a standard library function. Then, the timer is started, and all the blocks of data are read by the provided reader. Lastly, the timer is stopped, and the duration is returned.

size_t readOneBlock(IReader* reader, std::string path, size_t blockSize)
The function uses a reader to read a single block of blockSize bytes at offset zero. Returns the time it took to read the block.

3.2.10 Asynchronous tests

As mentioned earlier, the asynchronous reader is different, since it returns an OVERLAPPED structure instead of a result when issuing a request for reading data. Therefore, waiting for the request to return a result must be done within the testing function. There are two asynchronous test functions: readManySeq and readManyRandom.

size_t readManySeq(AsyncReader* reader, std::string path, size_t amount)
readManySeq works as follows. The file is divided into blocks, and the read operations are then divided between asynchronous calls. Since the number of blocks to read might be larger than the number of possible simultaneous calls, many reads are started and then waited on before starting another batch. This process continues until all blocks have been read. The timer is started at the beginning of this reading process and stopped when it is done. Lastly, the duration is returned.

size_t readManyRandom(AsyncReader* reader, std::string path, size_t amount)
The function readManyRandom works in the same way as readManySeq, but the offsets are shuffled so that the reads are done in random order.

3.2.11 Parallel tests

The parallel tests are very similar to the single-threaded ones in most regards, but with some differences. There are two functions used to make parallel read tests: readSequentialParallel and readManyShuffledParallel.

size_t readSequentialParallel(char reader_type, std::string path, size_t amount)
The function readSequentialParallel is used to read a file sequentially and works much like readManySeq, but divides the work between threads instead of asynchronous calls. It also comes with an extra consideration. Even though each thread will read from exactly the same file, the threads cannot use the same file handle unless it is a memory-mapped file, since that would lead to race conditions. Since reading a memory-mapped file is just accessing an address in memory, there is no race condition if multiple threads want to access the same mapped file. Therefore, if the reader to be used is the memory-mapping one, just one file handle is opened, but for any other reader, one handle is opened per thread. The timer is started before the threads are created, so that thread-creation overhead is included in the measurement. The threads read their designated parts of the file, then the threads are terminated, the timer stops, and the duration is returned.

size_t readManyShuffledParallel(char reader_type, std::string path, size_t amount)
readManyShuffledParallel works precisely as the sequential version but with the shuffling included.
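As an illustration, a parallel sequential read along these lines might be structured as in the following sketch (compiled with OpenMP enabled, e.g., /openmp in MSVC). The makeReader factory is a hypothetical helper that opens one file handle per thread; it is not part of the actual test code.

```cpp
#include <chrono>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Sketch of a parallel sequential read. The file is divided into
// block-sized reads that OpenMP's default (static) schedule splits
// evenly, in order, between the threads.
long long readSequentialParallelSketch(
        const std::string& path, size_t fileSize, size_t blockSize,
        const std::function<std::unique_ptr<IReader>(const std::string&)>& makeReader) {
    const long long nBlocks = static_cast<long long>(fileSize / blockSize);
    auto start = std::chrono::steady_clock::now();  // includes thread creation

    #pragma omp parallel
    {
        auto reader = makeReader(path);   // one file handle per thread
        std::vector<char> buf(blockSize);
        #pragma omp for schedule(static)
        for (long long i = 0; i < nBlocks; ++i) {
            reader->readBytes(static_cast<size_t>(i) * blockSize,
                              blockSize, buf.data());
        }
    }

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```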
3.3 Experiment phase

After the test environment was developed, several experiments were conducted for different workloads. All of the experiments were run on synthetic data. Synthetic data is data that is created artificially rather than taken from any real-world system [8]. Based on the results from these experiments, guidelines were developed. The guidelines are based on utilizing knowledge about a system's workload and are tailored for a system that uses an NVMe SSD as its storage device.

3.4 Evaluation phase

As previously mentioned, the span of parameters that impact the final performance is enormous. Hence, it is essential to do in-house benchmarking for the actual workload that the targeted system will have to handle. To strengthen belief in the recommended guidelines, they were tested on a benchmark that mimics Carmenta Engine's workload. The hypothesis was that by using the developed guidelines, one would be able to see a performance gain compared to running the same benchmark on a system that uses a naive way of reading data.

The data received from Carmenta was created by recording all executed read calls in some selected use cases. These reads were recorded into Comma-Separated Values (CSV) files. Since all file reads made by the Carmenta I/O system are done using 4KB blocks, all entries in the CSV files were 4KB reads. One could determine which reads were larger sequential ones by comparing subsequent entries. Therefore, all CSV files belonging to a load were parsed and combined into a set of reading instructions. The benchmark iterates through the list of operations and has the examined reader execute each read operation. The performance is measured as the time it takes to run through all operations.
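A sketch of how such a replay benchmark might look is shown below; the ReadOp structure and the function name are hypothetical stand-ins for the actual benchmark code.

```cpp
#include <chrono>
#include <vector>

// One recorded read operation parsed from the CSV files
// (hypothetical format: adjacent 4KB entries merged into one read).
struct ReadOp {
    size_t offset;  // where in the file the read starts
    size_t amount;  // number of bytes to read
};

// Replays all recorded operations through a reader and returns the
// total time in microseconds. The buffer is assumed to be large
// enough for the largest recorded read.
long long replayBenchmark(IReader* reader, const std::vector<ReadOp>& ops,
                          std::vector<char>& buffer) {
    auto start = std::chrono::steady_clock::now();
    for (const ReadOp& op : ops) {
        reader->readBytes(op.offset, op.amount, buffer.data());
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```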
4 Experiments

In the following chapter, the results gained from several experiments are presented. Firstly, the results from experiments regarding the internal components of the readers are presented and discussed. Secondly, results from comparisons between readers are presented. Thirdly, it is examined how throughput can be increased by using multiple threads. Lastly, the guidelines are presented.

4.1 Block size impact on performance

To use buffered reading, one has to set a block size. As will be shown, the choice of block size significantly impacts reading performance. In the following section, it is examined how the choice of block size impacts throughput for both sequential and random reads. Moreover, the latency for different block sizes is also examined. BufferReader has been used to test the different block sizes. All the data points displayed in the following three graphs have been computed by taking the average of multiple tests, as described in Section 3.2.5. Notice that the Y-axes have a logarithmic scale.

4.1.1 Sequential reads using different block sizes

In the first experiment regarding the choice of block size, an entire 1GB file is read sequentially using the readEntireFile function. The results of the experiment are shown in Figure 4.1. Notice that using a larger block size gives higher throughput when reading sequentially, except when using a block size of 256KB. Choosing a block size of 4MB instead of the common preference of 4KB will, for this specific use case, improve the throughput by a factor of approximately two. Moreover, one can see that the increase in throughput starts to stabilize when using block sizes greater than 4MB.

Figure 4.1: Block size impact on sequential read performance. The file size is 1GB. The entire file is read. The Y-axis is logarithmic.

4.1.2 Random reads using different block sizes

In the second experiment, it was examined how the throughput is affected by the block size when doing random reads. This was achieved using the readManyShuffled function. The results from the experiment are shown in Figure 4.2. As the results show, using a larger block size for random reads will, as for sequential reads, result in higher throughput up to a specific threshold. For random reads, one can see that the gain in throughput stabilizes after choosing a block size greater than 4MB.

Figure 4.2: Block size impact on random read performance. The file size is 1GB. The entire file is read. The Y-axis is logarithmic.

4.1.3 Latency using different block sizes

In the last experiment regarding the choice of block size, the single-block latency was examined. The experiment was carried out using the readOneBlock function. It was measured how long it takes to read blocks of different sizes with buffered reads. The results gained from the experiment can be seen in Figure 4.3. From the results, one can notice that using larger block sizes results in higher latency for almost all block sizes. However, reading a single block of size 4KB takes less time than reading a block of size 1KB or 256B.

Figure 4.3: Block size impact on single-block latency. The file size is 1GB. The Y-axis is logarithmic.

4.1.4 Choosing the optimal block size

Although the workloads were different, the same pattern emerged: using a larger block size results in higher throughput up to some threshold. In these experiments, this threshold was approximately 4MB. Bear in mind that this threshold can be different if the experiments are run on a computer system with different software and hardware. However, the throughput does not tell the whole story. One must also consider latency. As the block size increases, so does the latency. Overhead will be introduced if an unnecessarily large block size is used for a small read request. For example, if the system needs to read a total of 256KB in a file, using a block size of 4MB would result in a latency 30 times larger than what one would have gotten if a block size of 256KB had been used.

When using buffered reads, the following guidelines are suggested based on the results gained from the three previously mentioned experiments.

• If the request size ≤ 4KB, then use a block size of 4KB.
• If the request size ≥ 4MB, then use a block size of 4MB.
• Else, use a block size equal to the request size.

Later in this chapter, different readers are compared. Two readers will then be used to represent buffered reading. The first one is BufferReader4KB, which uses a block size of 4KB. Since Carmenta uses a block size of 4KB when performing reads, BufferReader4KB can be seen as a representation of how reading is done in their system today. The second one is BufferReader4MB, which uses a block size in line with the guidelines above.
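Expressed in code, the block size selection amounts to clamping the request size between the two measured thresholds. The function name is illustrative, and the 4KB/4MB thresholds are the values measured on the test computer; they may differ on other systems.

```cpp
#include <algorithm>
#include <cstddef>

// Picks a buffered-read block size following the guidelines above.
size_t chooseBlockSize(size_t requestSize) {
    constexpr size_t kMin = 4 * 1024;         // 4KB threshold
    constexpr size_t kMax = 4 * 1024 * 1024;  // 4MB threshold
    // Request sizes below 4KB get 4KB blocks, above 4MB get 4MB blocks,
    // and anything in between uses a block size equal to the request.
    return std::clamp(requestSize, kMin, kMax);
}
```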
4.2 Queue depth impact on performance

When doing asynchronous calls, one can issue multiple read requests simultaneously. As mentioned in the theory chapter, the queue depth is the number of simultaneous calls. This section examines how the choice of queue depth impacts throughput. Experiments were conducted for both sequential and random reads.

4.2.1 Sequential reads using different queue depths

First up are sequential asynchronous reads. The file size is fixed to 4GB and the block size to 4MB. 4MB was chosen since it was one of the best-performing block sizes for buffered reads. The experiment was performed using the readManySeq function. The result can be seen in Figure 4.4. Notice that the Y-axis is linear. From the gained results, one can see a positive correlation between queue depth and throughput for sequential asynchronous reads. Therefore, by using a larger queue depth, throughput can be improved. However, as with the block size, there is a threshold. Using a queue depth greater than 16 will not further increase the throughput for sequential asynchronous reads.

Figure 4.4: Queue depth impact on sequential read performance. The file size is 4GB. The entire file is read. The Y-axis is linear.

4.2.2 Random reads using different queue depths

When using asynchronous reading for conducting random reads, one must also consider how the read request size influences the choice of queue depth. Since it might occur that one queue depth is optimal for one request size but not for another, both the read request size and the queue depth were examined in this test. This experiment made use of the readManyRandom function and a 1GB file. The result can be seen in Figure 4.5. The Y-axis has a logarithmic scale. The result shows that the performance of multiple queue depths was very similar for larger random reads, but the difference in performance was larger for smaller read requests. Overall, queue depths of 32 and 64 had similar results throughout the entire test.

Figure 4.5: Queue depth and read request size impact on random read performance. The file size is 1GB. The entire file is read. The Y-axis is logarithmic.

4.2.3 Choosing the optimal queue depth

Since the results show that queue depths of 16, 32, and 64 gave similar performance for sequential reads, and queue depths of 32 and 64 gave similar performance for random reads, it is logical to apply a queue depth of 32 when using the asynchronous reader. Therefore, in all following experiments, a queue depth of 32 is used. However, an important thing to note is that this result might have been different if other hardware or another OS had been used.

4.3 MappedReader chunk size

As mentioned in the theory chapter, Windows has an internal parameter called the memory-mapped file chunk size. This parameter can differ between versions of Windows. Even though a developer cannot change this internal parameter, knowing its value helps when analyzing the outcomes of other experiments. To estimate the memory-mapped file chunk size for the Windows version on the test computer, an experiment was executed using the readOneBlock function. In this experiment, the single-block latency was measured for MappedReader. The results can be seen in Figure 4.6. In this graph, the Y-axis scales linearly.

By looking at the result, one can notice that the increase in latency follows a pattern: every time the block size is increased by 32KB, the latency increases by 250 microseconds. Based on the results, it can be concluded that the memory-mapped file chunk size is 32KB. This means that data will always be swapped in as 32KB chunks, and it is not possible to swap in less, since this is an internal parameter in Windows. Later, this knowledge is used to analyze the results from the other experiments.

Figure 4.6: Finding the memory-mapped file chunk size. The file size is 1GB. The Y-axis is linear.

4.4 Comparing readers

As mentioned earlier in this thesis, several reading techniques can be used to read data. These techniques have been implemented as readers such that they can be compared.
4.4 Comparing readers

As mentioned earlier in this paper, several reading techniques can be used to read data. These techniques have been implemented as readers so that they can be compared. However, which reader is best to use is not a simple question to answer. In the following part, results gained from several experiments will be presented. In each experiment, one specific parameter is in focus. The results of the single-threaded reading experiments are divided into two sections: in the first section, the readers are compared for a sequential access pattern, and in the second, for a random access pattern.

4.4.1 Sequential reads

This section examines how the different readers compare against each other for sequential reads. Note that both graphs presented in this section use a logarithmic scale on the Y-axes.

4.4.1.1 File size

The purpose of the file size experiment was to examine how file size affects throughput for sequential reads. An experiment was conducted using the readSeq function, in which a fixed amount of bytes was sequentially read from files of different sizes. Files of sizes from 1MB to 4GB were used, and the first 1MB was read from each file. The result can be seen in Figure 4.7. In this experiment, the results were inconsistent; therefore, the graph looks noisy.

By looking at the results in Figure 4.7, one can see that file size impacts MappedReader, BufferReader4KB and StreamReader. Furthermore, one can also see that BufferReader4MB and AsyncReader have the highest throughput for all file sizes that were tested.

Figure 4.7: File size impact on sequential read performance. The total bytes read are 1MB. The Y-axis is logarithmic.

4.4.1.2 Total bytes read

An experiment was conducted to examine how the total bytes read in a file impacts throughput. In the experiment, different amounts of a 1GB file were read. The readSeq function was used for this experiment. The results are shown in Figure 4.8.

The results show that, in contrast to file size, the total bytes read in a file significantly impacts sequential read performance: reading more data in a file results in increased throughput for all readers. Another aspect worth pointing out is that, similar to the result gained from the file size experiment, AsyncReader and BufferReader4MB showed the best sequential read performance. Regardless of the total bytes read, the results show that these two readers consistently outperform the other readers.

4.4.1.3 Discussion on sequential reads experiments

Two parameters were examined to compare the readers for sequential reads. The results gained from the file size experiment show that file size seems to impact performance for some readers.
After this point AsyncReader began issuing multi- ple simultaneous requests which resulted in AsyncReader having a higher through- put compared to BufferReader4MB. Although MappedReader has advantages over buffered reading, such as no context switches and no unnecessary memory copies, it is never faster than using buffered reads with large block sizes. In none of the experiments regarding sequential reads did BufferReader4KB per- form best. Since BufferReader4KB, will always use a block size of 4KB, it will never be able to outperform BufferReader4MB for sequential reads. StreamReader did not outperform any of the other readers in any sequential experiment. Previous 28 4. Experiments Figure 4.8: Total bytes read impact on sequential read performance. The file size is 1GB. The Y-axis is logarithmic. studies have shown that StreamReader in certain circumstances is slower than Buf- ferReader4KB [9]. The reason that StreamReader was slower than BufferReader4KB was that it has a greater CPU overhead. The CPU cost becomes noticeable if the cache is warm or the disk is very fast. However, since all of our experiments are run on cold cache, the effect of StreamReader ’s unnecessary CPU cost became less notable. 4.4.2 Random reads The results gained from the sequential experiments showed that using buffered read with a large block size had the best performance for sequential reads. In none of the experiments did MappedReader, StreamReader, and BufferReader4KB perform better. However, this is not the case for random reads. As will be seen in this section, using MappedReader is beneficial in certain situations. Notice the Y-axis has a logarithmic scale in all five graphs belonging to this section. 4.4.2.1 Read request size When examining how different readers perform for random reads, an additional parameter must be considered. That parameter is the read request size, which is how much data is requested for each random read. The following experiment was conducted to investigate how the different readers perform for different read request sizes. For each read request size, the function readManyShuffled was used, and an 29 4. Experiments entire 1GB file was read. The results are presented in Figure 4.9. By looking at the results from the read request experiment, one can see that in con- trast to the experiments regarding sequential reads, which reading technique that has the highest throughput depends on the examined parameter. In this experiment setup, using memory-mapped files for random reading chunks less than or equal to 4KB outperforms synchronous buffered reading. At 1KB or less, memory-mapped file reading also outperforms asynchronous reading. Somewhere between 1KB and 4KB, asynchronous reading begins to outperform all other readers. At 16KB, Buffer- Reader4KB and StreamReader do a big leap in throughput and outperforms both MappedReader and BufferReader4MB. For all examined request sizes larger than 64KB, BufferReader4MB was the fastest of the synchronous readers. Figure 4.9: Request size impact on random read performance. File size is 1GB. Entire file read. Y-axis is logarithmic. 4.4.2.2 File size In the previous experiment, it was shown that MappedReader outperformed the other readers when the read request was less than or equal to 4KB. However, both the file size and the total bytes read were fixed in that experiment. The file size, as well as the total bytes read, were set to 1GB. Therefore, these two parameters also have to be examined. 
An experiment was conducted in which only a total of 1MB was read from files of different sizes. Moreover, the read request size was set to 4KB, in order to examine if MappedReader would still outperform the other synchronous readers as the file size changed. In this experiment, readManyShuffled was once again used. The results can be seen in Figure 4.10.

Although the graphs are noisy, it is easy to notice a trend. MappedReader is impacted by file size, since its throughput drops when reading larger files. However, this is not true for the other readers, which have approximately the same throughput regardless of the file size.

Figure 4.10: File size impact on random read performance. Read request size is 4KB. Total bytes read are 1MB. Y-axis is logarithmic.

4.4.2.3 Total bytes read

To examine how the total bytes read in a file impacts throughput for random reads, two experiments were conducted using the readManyShuffled function. In both experiments, a 1GB file was used, and the total amount of data read changed between tests.

For the first experiment, a read request size of 4KB was used to see how the performance of MappedReader would be affected in comparison to the other readers. The results can be seen in Figure 4.11. In the results gained from the first total bytes read experiment, one can see that, similar to the file size experiment, the parameter has a big impact on the throughput of MappedReader: reading more bytes in a file improves MappedReader's throughput.

In the second experiment, another read request size was used. As was shown in the read request size experiment, for request sizes between 4KB and 64KB, BufferReader4KB and StreamReader performed best out of the synchronous readers. To examine if this is always the case, the request size was set to 12KB. The results are presented in Figure 4.12.

In the second experiment, a pattern similar to the one in the first total bytes read experiment appears: reading a larger portion of the file improves the throughput of MappedReader. But in this case, it also improves the throughput of StreamReader and BufferReader4KB. However, in comparison to the first experiment, BufferReader4MB is surpassed at a larger amount read, at about a quarter of the file size.

Figure 4.11: Total bytes read impact on performance. Read request size is 4KB. The file size is 1GB. Y-axis is logarithmic.

Figure 4.12: Total bytes read impact on performance. Read request size is 12KB. The file size is 1GB. Y-axis is logarithmic.

4.4.2.4 Reading more than memory

The largest file read in the previous experiment was of size 1GB. Another use case for a system can be to read vastly larger files. Prior studies on using memory-mapped files for reading data have shown that the throughput given by memory-mapped files drops when reading a file larger than memory [7]. However, that experiment was run on a Linux machine. A similar experiment was run on Windows to examine if memory-mapped reading has the same flaw. Forty files of size 1GB were randomly read using memory-mapped files, and after each file was read, the throughput was calculated. Since the files were not closed after the reading was done, they remained in memory. After the memory-mapped reading was done, the same procedure was performed with a buffered reader. The result from the experiment can be seen in Figure 4.13.
From the graph, one can see that before the memory is full, the throughput of MappedReader stays at about 150MB/s, but as soon as the memory gets full, the throughput drops to around 30MB/s. This is just 1/5 of MappedReader's initial throughput.

Figure 4.13: Reading more than memory impact on performance. 40 files of size 1GB are read. Read request size is 4KB. Y-axis is logarithmic.

4.4.2.5 Discussion on random reads experiments

The results gained from the sequential experiments implied that using asynchronous and synchronous buffered reading is the best option for sequential reads. In contrast, which reading technique is best to use for random reads depends on several parameters. Nevertheless, the results have shown that MappedReader is a clear competitor under certain circumstances.

An advantage that BufferReader4MB and AsyncReader have over the other readers is that they can read data in large chunks. However, it is not always possible to read in large chunks when doing random reads, since the read request size limits the block size. For example, if the system has to read small records of 4KB in a large file, BufferReader4MB and AsyncReader will not be able to utilize the advantages gained from using a larger block size. This can be seen in the request size experiment (Figure 4.9): when the read requests get larger, BufferReader4MB and AsyncReader can use a larger block size, and hence the throughput increases.

MappedReader was able to perform better in the read request size experiment for smaller request sizes. However, as the file size and total bytes read experiments show, a small read request is not the only parameter affecting the throughput of MappedReader; one also needs to read a large portion of the file. In the experiments using a 4KB request size, approximately 13% of the file had to be read for MappedReader to surpass BufferReader4MB in throughput.

Why does reading a larger portion of the file positively influence the throughput for MappedReader? Most likely because data is already in memory. At the beginning of this chapter, it was shown that MappedReader will always swap in at least 32KB of data. In both the first total bytes read experiment and the file size experiment, a read request size of 4KB was used. When reading the entire file, that is, when the total bytes read equals the file size, MappedReader peaks in throughput. In that scenario, for each 32KB block in the file, there will be only one actual read from secondary memory; the remaining seven 4KB blocks are already in memory. When reading a smaller portion of the file, the chance of finding the requested 4KB in memory becomes smaller. In the worst-case scenario, MappedReader has to swap in 32KB for each 4KB read request and never finds the requested data in memory. If data can be found in memory, the read operation goes very fast. Moreover, MappedReader does not have to do any unnecessary memory copy from kernel space to user space, nor any context switches. Hence, MappedReader is faster when the read request is small and a larger portion of the file is read.
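This reasoning can be made concrete with a rough back-of-the-envelope model (an illustration of ours, not part of the experiments). Assume 4KB requests at uniformly distributed offsets, 32KB swap-in chunks, and that a chunk stays cached once it has been swapped in. Each touched chunk then costs one actual read, and every other access to it is a hit:

#include <cmath>
#include <cstdio>

// Illustrative model (an assumption, not measured data): expected cache-hit
// ratio for MappedReader when a fraction f of a file is read in 4KB requests
// at uniformly distributed offsets, with 32KB (= 8 pages of 4KB) swapped in
// per page fault and chunks staying cached once read.
double expectedHitRatio(double f) {
    const double pagesPerChunk = 8.0; // 32KB chunk / 4KB request
    // Expected fraction of 32KB chunks touched by at least one request.
    double touchedChunks = 1.0 - std::pow(1.0 - f, pagesPerChunk);
    // One fault per touched chunk; everything else is served from memory.
    return 1.0 - touchedChunks / (pagesPerChunk * f);
}

int main() {
    for (double f : {0.05, 0.13, 0.5, 1.0})
        std::printf("read %3.0f%% of file -> ~%2.0f%% of requests hit memory\n",
                    100 * f, 100 * expectedHitRatio(f));
}

At f = 1, the model gives the seven-in-eight hit ratio described above; as f approaches zero, the hit ratio approaches zero, matching the worst-case scenario.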
Moreover, a similar behavior was found in the results gained from the second total bytes read experiment, with a 12KB request size. In that scenario, reading a more significant portion of the file positively influenced BufferReader4KB and StreamReader. The likely reason has to do with prefetching: when reading many subsequent 4KB blocks, the OS prefetches a more significant portion of the file. This can be seen in the second total bytes read experiment (Figure 4.12). When a larger portion of the file is read, BufferReader4KB and StreamReader are more likely to find the prefetched data in memory.

Lastly, it was examined how MappedReader and BufferReader4KB performed when there was more data to be read than available memory. The results gained from the reading more than memory experiment were similar to the results of previous studies: when reading more than memory, MappedReader becomes significantly slower [7].

4.5 Parallel reads

The sequential and random reads experiments were conducted using only a single thread. This section examines whether using multiple threads can increase the performance of the readers. Similar to previous sections, the experiments are run for both sequential and random reads. Note that all Y-axes in the following figures use a logarithmic scale.

4.5.1 Parallel sequential reads

Firstly, it is interesting to see if sequential reads can be sped up using multiple threads. The function readSequentialParallel is used to read an entire 1GB file with a block size of 4MB. The results can be seen in Figure 4.14. From the graph, one can see that BufferReader4MB always performs best and that the throughput peaks when using two to four threads. Using more than four threads leads to a decrease in throughput. The other readers are not even close to BufferReader4MB in terms of throughput.

Figure 4.14: The number of threads impact on parallel sequential read performance. File size is 1GB. Entire file is read. Y-axis is logarithmic.

4.5.2 Parallel random reads

Next up are the two parallel random reads experiments. As in the sequential test, multiple threads are used to see if throughput can be increased. This is done using the test function readManyShuffledParallel. The results when using a read request size of 4KB can be seen in Figure 4.15, and the results when using a read request size of 4MB can be found in Figure 4.16.

The results show that when the read request size is small, using the memory-mapped technique is always better, and throughput increases almost linearly with the number of threads for all readers. As for a large read request size of 4MB, the graph looks very similar to the parallel sequential reads experiment: buffered reading is preferable, and throughput peaks when using four threads.

Figure 4.15: Number of threads impact on parallel random read performance. File size is 1GB. Entire file is read. Request size is 4KB. Y-axis is logarithmic.

Figure 4.16: Number of threads impact on parallel random read performance. File size is 1GB. Entire file is read. Request size is 4MB. Y-axis is logarithmic.
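The parallel test functions themselves are not listed here. Purely as an illustration of the structure, a parallel sequential read can be organized as below, with each thread reading its own contiguous range of the file through its own handle (function and parameter names are ours, not the thesis code):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <thread>
#include <vector>

// Illustrative sketch: split the file into contiguous ranges, one per thread,
// and let each thread read its range with its own stream handle, using the
// 4MB block size suggested by the guidelines.
void parallelSequentialRead(const char* path, char* dst, std::size_t fileSize,
                            unsigned threadCount) {
    const std::size_t range = fileSize / threadCount;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([=] {
            const std::size_t begin = t * range;
            const std::size_t end =
                (t + 1 == threadCount) ? fileSize : begin + range;
            std::ifstream in(path, std::ios::binary);
            in.seekg(static_cast<std::streamoff>(begin));
            constexpr std::size_t block = 4 * 1024 * 1024; // 4MB blocks
            for (std::size_t pos = begin; pos < end; pos += block)
                in.read(dst + pos,
                        static_cast<std::streamsize>(std::min(block, end - pos)));
        });
    }
    for (auto& th : threads) th.join();
}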
4.5.3 Discussion on parallel reads experiments

As noted in the experiments, buffered reading performs best for sequential reads and benefits from using two to four threads. This is because simultaneous requests enable the SSD to better utilize its internal parallelism. The throughput drops when using more than four threads: a higher number of threads leads to higher thread-handling overhead, and the maximum possible throughput of the SSD has already been reached at four threads. The conclusion drawn from this is that one can amplify the throughput of sequential reads significantly by just using one extra thread.

When looking at the random reads with a small read request size, one can see that memory-mapped reading always results in the highest throughput. However, the peak throughput is never reached, and performance keeps increasing with the number of threads used. A higher-performing processor might reach the highest possible throughput, but one must then consider whether all that processing power would not be put to better use elsewhere. In the case of larger read request sizes, the result is, as noted in the experiment, very similar to the sequential one. Once again, buffered reading performs best, although in this case, peak throughput is reached when using four threads.

The takeaway from these experiments is that which reading technique has the highest throughput for a specific workload does not change when using multiple threads. Instead, using multiple threads to read files can be an excellent way to increase throughput further. Furthermore, if one wants to fully utilize the performance of the NVMe SSD using synchronous reads, taking advantage of multiple threads is required.

4.6 Integrating asynchronous reading

Asynchronous reading was shown to be the best alternative for multiple workloads. It was the only way to fully utilize the NVMe SSD's capabilities on a single thread, and it often outperformed all other options. When it comes to sequential reads, asynchronous reading can be used much like parallel reading to increase performance. It is, however, essential to note that the random reads experiments offer ideal conditions for asynchronous reading, since all the read operations are known beforehand. In addition to read requests not being known beforehand in a real-world application, using asynchronous reading comes with many other considerations that need to be handled to reach good performance. For example, one might want to interleave asynchronous read requests with computation to make the system more effective. A related question is how to handle the asynchronous read queue: should requests be issued as soon as they come in, or should multiple requests be collected and issued together? When and how should one wait for a request to finish? Providing an optimal answer to these questions is out of scope for this thesis. Additionally, asynchronous reading is more complicated to integrate than synchronous reading, particularly if one wants to switch between reading techniques based on the situation.
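To illustrate the mechanics involved, the sketch below issues overlapped reads on Windows in batches of a fixed queue depth and then waits for each batch to complete. It is a minimal example only (error handling omitted, and a deliberately naive batch-and-wait policy), not the AsyncReader used in the experiments:

#include <windows.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of asynchronous (overlapped) reading on Windows with a fixed
// queue depth. It only shows how several requests can be in flight at once.
void asyncRead(const wchar_t* path, char* dst, std::size_t size) {
    constexpr DWORD kQueueDepth = 32;            // per section 4.2.3
    constexpr std::size_t kBlockSize = 4u << 20; // 4MB blocks
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return;

    std::size_t offset = 0;
    while (offset < size) {
        // Issue up to kQueueDepth simultaneous read requests.
        std::vector<OVERLAPPED> ops;
        ops.reserve(kQueueDepth); // keep OVERLAPPED addresses stable
        for (DWORD q = 0; q < kQueueDepth && offset < size; ++q) {
            OVERLAPPED ov{};
            ov.Offset = static_cast<DWORD>(offset);
            ov.OffsetHigh = static_cast<DWORD>(offset >> 32);
            ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
            ops.push_back(ov);
            DWORD len = static_cast<DWORD>(std::min(kBlockSize, size - offset));
            ReadFile(file, dst + offset, len, nullptr, &ops.back());
            offset += len;
        }
        // Wait for every request in this batch to finish.
        for (auto& ov : ops) {
            DWORD bytes = 0;
            GetOverlappedResult(file, &ov, &bytes, TRUE);
            CloseHandle(ov.hEvent);
        }
    }
    CloseHandle(file);
}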
4.7 Guidelines

The results gained from the experiments show that by using asynchronous reading instead of synchronous reading, one can significantly improve performance. AsyncReader showed the best performance in all experiments except the scenario of randomly reading a larger portion of a file with a read request size of less than 1KB. However, as pointed out in the previous section, it can be difficult and sometimes impossible to utilize asynchronous reading to its fullest. Hence, it is recommended that the demanding developer examine if asynchronous reading can be integrated into their system. The developer should keep in mind that, based on the gained results, using memory-mapped files is faster than asynchronous reading for randomly reading a larger portion of a file with a read request size smaller than 1KB. Therefore, the best overall reading strategy is to swap between memory-mapped reading and asynchronous reading depending on the workload.

Even if a developer cannot or does not want to use asynchronous reading, it is still possible to optimize reading. Therefore, a set of general guidelines is now presented for optimizing a read-heavy system by considering its workload and utilizing the advantages that come with NVMe SSDs. The guidelines are not dependent on the application and are therefore easy to implement in a real-life system.

The results regarding single-threaded sequential reads have shown that BufferReader4MB consistently outperforms the other synchronous readers regardless of the value of the examined parameter. Therefore, based on the results from the experiments, the following is suggested for sequential reads.

• Use buffered read for sequential reads.
• Choose block size according to the conditions given in Section 4.1.4.

In contrast to the guidelines for sequential reads, for random reads the choice of reading technique depends on the workload. The discussion regarding single-threaded random reads concluded that using memory-mapped files is only preferable over buffered read if the read request size is small. Moreover, it was shown that cache hits were very important for memory-mapped file reading. Predicting how often one gets a cache hit is difficult. One way is to assume that the offsets are uniformly distributed and that the read request size is 4KB. As mentioned in the previous discussions, under those assumptions, memory-mapped file reading is preferable when reading more than 13% of the file; otherwise, a buffered read with a large block size is preferable. Although this is a very rough estimate, it gives an idea of how one can reason about the number of cache hits. Because of these strong assumptions, in-house benchmarking is recommended.

Moreover, it was shown that for random reads, StreamReader and BufferReader4KB performed equally well as or better than MappedReader for read requests between 4KB and 32KB. However, for BufferReader4KB and StreamReader to be as good as MappedReader, a large portion of the file had to be read. Hence, for simplicity's sake, the guidelines suggest using MappedReader in that interval, since that reader is most often the best alternative.

There was also one particular case in which MappedReader performed worse: when more data was read than available memory. The OS won't evict memory-mapped pages until there is no memory left, and once this happens, performance drastically decreases.

Based on the abovementioned points, the conditions below have been formulated as guidance. These conditions should be considered every time a file is opened for reading. If all conditions evaluate to true, memory-mapped reading should be used; otherwise, one should use buffered reading with the same block size rules as for sequential reading (the conditions are also sketched as code after the flowchart discussion below).

• The amount of data to be read from the file is less than the available memory.
• The reads to be made are at non-sequential offsets, i.e., random.
• It is expected that a larger portion of the file will be read.
• The read request size is less than or equal to 32KB.

Once the choice of reading technique has been made, one can decide if multiple threads can be used to execute the read. The guidelines have been visualized as a flowchart, which can be seen in Appendix B. In the flowchart, the conditions have been grouped into three levels: system, file, and request. To evaluate a condition at the system level, the developer needs knowledge about the system as a whole. At the file level, the developer needs to know how the file is accessed; at the request level, only knowledge about the particular request is needed.
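Expressed as code, the conditions could take the following shape (names are ours and purely illustrative; the first three inputs correspond to the system- and file-level knowledge described above, while the request size is checked per read):

#include <cstddef>

// Illustrative encoding of the guideline conditions above. If it returns true
// for a request, use memory-mapped reading; otherwise fall back to buffered
// reading with the block size rules from Section 4.1.4.
struct FileAccessInfo {
    bool totalReadFitsInMemory;  // system level: data to read < available memory
    bool accessIsRandom;         // file level: non-sequential offsets
    bool largePortionWillBeRead; // file level: large expected fraction of file
};

bool shouldUseMemoryMapping(const FileAccessInfo& info, std::size_t requestSize) {
    return info.totalReadFitsInMemory &&
           info.accessIsRandom &&
           info.largePortionWillBeRead &&
           requestSize <= 32 * 1024; // request level: at most 32KB
}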
Parallel reading can, as shown experimentally, increase the throughput in many cases. Multiple threads do not change which read technique to use but rather amplify the throughput. If there is a big sequential read to be done, use two to four threads to maximize the throughput. For random reads, if the reads are small, which often is the case, use all the threads available in the application's thread pool; for larger random reads, four threads are enough. These recommendations apply when only one file is handled at a time. However, many applications might want to process multiple files at a time; in that case, one must do in-house benchmarks to see how to best divide the work between the threads. Hence, how one uses multiple threads depends on the system at hand.

5 Benchmarks

To evaluate the guidelines, a new reader has been implemented, called AdaptiveReader. The reader is based on the guidelines presented in the previous chapter. Moreover, a custom-built benchmark has been developed. The benchmark mimics the behavior of a read-heavy system, namely the Carmenta Engine. The new AdaptiveReader has, together with the other readers, been evaluated for three different Carmenta Engine use cases. The outline of this chapter is as follows: firstly, AdaptiveReader will be described; after that, the results gained from the benchmarks will be presented.

5.1 AdaptiveReader

The main idea behind AdaptiveReader is to swap between memory-mapped file reading and buffered reading by following the guidelines presented earlier. However, the conditions that are not on the request level cannot be evaluated by merely observing a read request. For example, when receiving a request, the reader cannot know if the requested file will be accessed several times in a random read pattern. Therefore, in contrast to the other readers, a flag is passed as an argument when the developer opens a file with an AdaptiveReader. The flag is called the swapFlag and should be set if and only if the following three statements are all true.

• The amount of data to be read from the file is less than the available memory.
• The reads to be made are at non-sequential offsets, i.e., random.
• It is expected that a larger portion of the file will be read.

If AdaptiveReader opens a file with the swapFlag set to false, it works exactly like BufferReader4MB. However, if the swapFlag is set to true, AdaptiveReader decides which reading technique to use by looking at the read request: if the request size is less than or equal to 32KB, AdaptiveReader uses memory-mapped file reading; if the request size is larger than 32KB, a buffered read with a large block size is used. A skeleton of this dispatch is sketched below.
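The sketch is illustrative only; the mappedRead and bufferedRead stubs stand in for MappedReader- and BufferReader4MB-style implementations, whose details are not shown here.

#include <cstddef>

// Skeleton of the AdaptiveReader dispatch described above (illustrative; the
// private stubs stand in for the real memory-mapped and buffered readers).
class AdaptiveReader {
public:
    // swapFlag is set by the developer if and only if the three statements
    // above are all true for this file.
    AdaptiveReader(const wchar_t* path, bool swapFlag)
        : path_(path), swapFlag_(swapFlag) {}

    void read(void* dst, std::size_t offset, std::size_t size) {
        if (swapFlag_ && size <= 32 * 1024)
            mappedRead(dst, offset, size);   // small random request: mmap
        else
            bufferedRead(dst, offset, size); // otherwise: buffered, 4MB blocks
    }

private:
    void mappedRead(void*, std::size_t, std::size_t) {
        // Memory-mapped file read, cf. the sketch in Section 4.3.
    }
    void bufferedRead(void*, std::size_t, std::size_t) {
        // Synchronous buffered read with the block size rules from 4.1.4.
    }
    const wchar_t* path_;
    bool swapFlag_;
};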
In a real-world scenario, the developer would set the swapFlag based on their knowledge about their system's workload. In a benchmark, however, the flag has to be set programmatically. To set the flag, an analysis of the log files was conducted. Since the total size of all files read in the benchmark is less than memory, it is impossible to read more than memory. Moreover, the analysis of the log files showed that, for the most part, a larger portion of each file was randomly read. Hence, for simplicity's sake, the swapFlag was set to true each time AdaptiveReader opened a file.

5.2 Carmenta test-suite

The Carmenta test suite is a set of tests used to validate and evaluate the Carmenta Engine. Two benchmarks based on log files from the Carmenta test suite were used to evaluate the readers. The first benchmark is called the "Carmenta test-suite benchmark" (CT), and the second one is called the "Animation files benchmark" (AM). Note that the Y-axes in the following two graphs are linear.

5.2.1 Carmenta test-suite benchmark

CT contains all of the read operations executed in the Carmenta test suite. A total of 652.88MB is read. Of the total bytes read, 9.2% were requested with read request sizes larger than 32KB. The results gained from the benchmark can be seen in Figure 5.1. The results show that MappedReader and AdaptiveReader had the lowest latency, while BufferReader4KB and StreamReader had the highest. In this benchmark, AdaptiveReader was approximately 2.5 times faster than BufferReader4KB.

Figure 5.1: Results gained from Carmenta test-suite benchmark. Y-axis is linear.

5.2.2 Animation files benchmark

Like CT, AM is based on a log file from the Carmenta test suite. What differs is that AM only contains a small part of all the read operations executed in the Carmenta test suite; the process examined concerns reading animation files. In AM, a total of 7MB is read. In comparison to CT, however, almost all data is requested with read request sizes larger than 32KB: of the total bytes read, 99.8% were requested with such sizes. In contrast to the results from the previous benchmark, BufferReader4MB performed best. AdaptiveReader performed as well as BufferReader4MB, while MappedReader, BufferReader4KB and StreamReader performed significantly worse.

Figure 5.2: Results gained from Animation files. Y-axis is linear.

5.3 Interactive map

Interactive map (IM) mimics the use case in which a user interacts with a geospatial application. The logs that the benchmark is based on come from an application in which a developer at Carmenta interacted with a map. A total of 2415MB was read during the benchmark. Of that amount, 0.6% was read in chunks larger than 32KB. The results are shown in Figure 5.3. The results gained from the Interactive map benchmark are similar to the ones gained from the Carmenta test suite: AdaptiveReader and MappedReader performed best.

Figure 5.3: Results gained from Interactive map. Y-axis is linear.

5.4 Discussion of benchmark results

The analysis of the log files showed that the first three statements of the guidelines were all true. Hence, the choice of reading technique depended only on the read request size. Since CT and IM contain many read requests smaller than 32KB and the other three statements were true, the guidelines suggest that MappedReader should be used. By looking at the results gained from CT and IM, it is clear that memory-mapped reading is better than buffered reading for this use case, since it achieves a lower latency. In contrast to CT and IM, AM contains larger read requests; as the guidelines suggest, in that scenario it is better to use BufferReader4MB than MappedReader.
AdaptiveReader performed as well as the best non-adaptive reader in all three examined use cases. However, it did not perform significantly better than the best non-adaptive reader, because each benchmark contained almost only small reads or almost only large reads. In CT and IM, only 9.2% and 0.6% of the total bytes read were read in chunks larger than 32KB; hence, AdaptiveReader used memory-mapped file reading for almost all read operations and thus got a latency very similar to MappedReader's. For AM, 99.8% of all bytes read were read in chunks larger than 32KB; thus, AdaptiveReader almost exclusively used buffered read and got a latency very close to that of BufferReader4MB. To see a significantly lower latency for AdaptiveReader compared to the best non-adaptive reader, the workload has to contain a mix of larger and smaller read request sizes. In that scenario, neither MappedReader nor BufferReader4MB will be the best choice, and hence switching between techniques becomes more crucial for reaching the highest performance.

Whilst the non-adaptive readers had their weaknesses for different use cases, AdaptiveReader was always at least as good as the best non-adaptive reader. Therefore, it can be concluded that reading can be optimized by utilizing knowledge about a system's workload and the available hardware.

6 Conclusion

In this paper, guidelines have been presented that can be used to reduce the I/O bottleneck in read-heavy systems. The guidelines are based on results from several experiments examining how multiple reading techniques perform for different workloads. Furthermore, a new reader was created based on the guidelines and evaluated with a custom-built benchmark that mimics the behavior of a real read-heavy system. The results gained from the benchmark indicate that the new reader, which adapts its reading strategy based on the guidelines, performs significantly better than a reader using a single static strategy. Hence, it was concluded that reading can be optimized at the application level by utilizing knowledge about a system's workload and utilizing the advantages of an NVMe SSD.

7 Future Work

Optimizing data reading is a vast area of research, and there is much work that could expand on this thesis. Below are some suggestions for areas to explore further.

7.1 Create a better model for cache hits

It was shown in the experiments that the number of cache hits has an impact on which reading technique performs best. Moreover, the guidelines suggest that it is beneficial to use memory-mapped files if the system will read a larger portion of the file. The suggestion is based on the idea that cache hits are more likely if a larger part of the file is read. However, this is only true if the read request size is 4KB and the memory-mapped chunk size is 32KB, which is not always the case. A "larger portion of the file" is also an unspecific amount. It would be beneficial to have a better model for reasoning about cache hits. This is, however, not a simple problem, and there was not enough time to explore it in this thesis.

7.2 Testing other storage hardware

Another suggestion for future work is to perform the tests made in this thesis on storage media other than an NVMe SSD, for example an HDD, a SATA SSD, or a USB-connected SSD. One could also perform the experiments on different NVMe SSDs to see if there are differences between brands, capacities, etc.
7.3 Exploring more options in Windows

When opening a file in Windows, the system's caching of data read from the file can be disabled using the FILE_FLAG_NO_BUFFERING option [1]. Since the behavior of the cache has been a major consideration when designing the guidelines, completely disabling the cache could potentially lead to different results. However, the FILE_FLAG_NO_BUFFERING option comes with additional constraints regarding alignment and file access [1]. Due to these constraints, the flag's impact on reading performance was not examined.

Even if the cache is kept activated, hints about how the file is about to be accessed can be given to the OS when creating the file handle [21]. The hint for sequential read is FILE_FLAG_SEQUENTIAL_SCAN, and the hint for random read is FILE_FLAG_RANDOM_ACCESS. By using these hints, the OS can optimize how it caches data.
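For reference, such a hint is passed as part of the flags argument when the handle is created, as in the minimal sketch below (error handling omitted; the function name is ours):

#include <windows.h>

// Opening a file with an access-pattern hint. The flag tells the Windows
// cache manager how the file is about to be read so it can adapt its caching
// and prefetching.
HANDLE openForSequentialRead(const wchar_t* path) {
    return CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING,
                       FILE_FLAG_SEQUENTIAL_SCAN, // or FILE_FLAG_RANDOM_ACCESS
                       nullptr);
}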
The reason that these hint flags were not explored in this thesis was mainly due to time constraints.

7.4 Testing a different OS

Another interesting addition to this research would be to execute the experiments on a Linux machine to see if any guidelines would change. Since Linux is free open-source software, it provides more overall control over the system and the low-level software. On Windows, it was shown that 32KB is swapped in at each page fault when using memory-mapped files, which shaped the guidelines substantially. If this parameter can be manipulated in Linux, the guidelines might look different. Linux also provides a different interface for asynchronous I/O [34]; comparing it to what is available on Windows might change the reasoning around asynchronous reading. If the hardware were the same, conducting the experiments on Linux would simultaneously provide a good comparison of reading performance between Linux and Windows.

7.5 Explore DirectStorage

Microsoft DirectStorage for Windows was released in spring 2022 [16]. It is a technique to reduce overhead and CPU workload when loading data from secondary storage into a system's graphics processing unit (GPU). It is mainly intended for games but can perhaps also be used in other kinds of applications. For example, if a system utilizes the GPU to accelerate computations, it might benefit from using DirectStorage to load data into the GPU.

Bibliography

[1] alvinashcraft, v-kents, DCtheGeek, drewbatgit, mijacobs, and msatranjr. File buffering. https://docs.microsoft.com/en-us/windows/win32/fileio/file-buffering. Accessed: 2022-05-27.

[2] Apple. File system formats available in Disk Utility on Mac. https://docs.microsoft.com/en-us/troubleshoot/windows-client/backup-and-storage/fat-hpfs-and-ntfs-file-systems. Accessed: 2022-03-18.

[3] OpenMP ARB. The OpenMP API specification for parallel programming. https://www.openmp.org/. Accessed: 2022-05-24.

[4] Carmenta. Carmenta Engine. https://carmenta.com/en/geospatial-technologies/carmenta-engine. Accessed: 2022-05-17.

[5] Raymond Chen. How do I prefetch data into my memory-mapped file? https://devblogs.microsoft.com/oldnewthing/20120601-00/?p=7483. Accessed: 2022-03-16.

[6] cplusplus.com. <iostream>. https://www.cplusplus.com/reference/iostream/. Accessed: 2022-05-18.

[7] Andrew Crotty, Viktor Leis, and Andrew Pavlo. Are you sure you want to use mmap in your database management system? 2022.

[8] Cem Dilmegani. In-depth synthetic data guide: What is it? How does it enable AI? https://research.aimultiple.com/synthetic-data/. Accessed: 2022-05-30.

[9] Niall Douglas. P1031R2: Low level file I/O library. Technical report, C++ Standards Committee, 2019.

[10] Alexandra (Sasha) Fedorova. Why mmap is faster than system calls. https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37. Accessed: 2022-05-18.