Hardware BVH builder based on the PLOC++ algorithm
Master's thesis in Computer Science and Engineering

Keivan Saberian

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023

© Keivan Saberian, 2023.

Supervisor: Ulf Assarsson, Embedded Electronics Systems and Computer Graphics
Examiner: Erik Sintorn, Embedded Electronics Systems and Computer Graphics

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

The demand for high-quality visual effects in real-time 3D rendering is on the rise. To meet this demand, researchers have focused on integrating ray-tracing support into graphics hardware. However, support for dynamic scenes still poses a significant challenge, because the underlying spatial data structure, most commonly the bounding volume hierarchy, must in the worst case be rebuilt every frame. This thesis introduces a hardware accelerator for the construction of bounding volume hierarchies. The proposed hardware is based on the state-of-the-art PLOC++ algorithm, and aims to address the memory-intensive construction through a bandwidth-economical approach in which most external memory traffic is converted into on-chip streaming traffic, similar to PLOCTree [Viitanen et al. 2018].
The proposed unit is on average 2.19 times faster in simulation, with a 3.94 times improvement in memory traffic.

Keywords: computer science, engineering, computer graphics, ray tracing, bounding volume hierarchy, graphics hardware, thesis.

Acknowledgements

This thesis was made possible thanks to Chalmers for providing the necessary tools. I would also like to extend thanks to my supervisor and examiner for providing both feedback and advice during various stages of the project. Finally, I would like to thank Carsten Benthin for providing me with the PLOC++ repository.

Keivan Saberian, Gothenburg, 2023-06-13

Contents

List of Figures
List of Tables
1 Introduction
   1.1 Related Work
   1.2 Delimitation
   1.3 Ethical Considerations
2 Background
   2.1 Ray Tracing
   2.2 Bounding Volume Hierarchy
   2.3 Morton Code
   2.4 Hardware Description Design
   2.5 Field-programmable Gate Array
3 Algorithm
   3.1 Parallel Locally-Ordered Clustering
   3.2 PLOC++
   3.3 From Software to Hardware
4 Hardware Architecture
5 Evaluation
6 Results
7 Conclusion
Bibliography

List of Figures

4.1 System Architecture of a hardware unit with N sweep units. As indicated by the figure, both sweep units and memory reside inside the control module, similar to how various PC components reside on a motherboard.
Note that the figure is meant as a simplified overview for visualization purposes, as opposed to a detailed schematic of the logic inside the control unit.
4.2 Architecture of a PLOC SWEEP unit with a search radius of 16. Here, DME stands for Distance Metric Evaluator, and CMP refers to a comparator unit.
6.1 Relative performance speedup of the proposed hardware unit over the GPU implementation.

List of Tables

6.1 Summary of results.
6.2 Area and power breakdown of a single PLOC SWEEP unit.

1 Introduction

Recent developments in GPUs (such as RTX cores in modern NVIDIA GPUs) [Burgess 2020], alongside an increasing demand for high-quality visual effects in real-time 3D rendering, have put an emphasis on research into achieving high-performance real-time ray tracing [Deng et al. 2017]. A key component in realising such performance is the need to organize scene geometry into spatial acceleration data structures, the most common being the Bounding Volume Hierarchy (BVH) [Vinkler et al. 2017; Viitanen et al. 2018; Meister et al. 2021; Benthin et al. 2022]. Dynamic scenes, as in real-time applications, require the BVH to be reconstructed in the worst case every frame in order to maintain high visual quality [Viitanen et al. 2018; Benthin et al. 2022]. For this reason, much development has gone into offloading BVH construction onto hardware accelerators. However, Viitanen et al. [2018] note that state-of-the-art accelerators lack foundation in modern algorithms. They therefore identified and advocated for the more recent Parallel Locally-Ordered Clustering (PLOC) algorithm as more suitable for hardware acceleration.
An improved version of PLOC was published in July 2022, titled PLOC++, which outperforms the original in both computational and memory requirements [Benthin et al. 2022]. This thesis project aims to design a hardware accelerator for the state-of-the-art BVH construction algorithm PLOC++ in order to improve its build time, memory traffic, and power consumption for use in real-time ray tracing. In particular, the following questions will be explored:
1. How does the chosen hardware implementation of PLOC++ compare to the original software version, as well as the hardware-accelerated PLOC algorithm by Viitanen et al. [2018]?
2. How well can the proposed design be integrated into desktop, mobile, and game-console GPUs for use in real-time ray tracing?

1.1 Related Work

BVH construction is a widely researched area spawning numerous algorithms, and in turn hardware accelerators [Meister et al. 2021]. This raises the need to compare builder quality, which is commonly done through the Surface Area Heuristic (SAH) metric [Viitanen et al. 2018]. SAH provides an estimation of a given tree node's ray-intersection probability through its bounding box's surface area; it was introduced alongside the top-down SAH sweep BVH construction algorithm [MacDonald and Booth 1990]. SAH sweep has since been extended into the binned SAH sweep algorithm [Wald 2007], which in turn has become the basis for previous hardware accelerators [Doyle et al. 2013]. As for other accelerators, there exist those based on the Linear Bounding Volume Hierarchy (LBVH) algorithm [Viitanen et al. 2017] introduced by Lauterbach et al. [2009]. Furthermore, some accelerators have utilized refitting [Woop et al. 2006; Nah et al. 2015], i.e. algorithms that reuse BVH topology as opposed to constructing new BVHs from scratch, and some are based on k-d trees instead of BVHs [Nah et al. 2014; Liu et al. 2015].
Finally, a hardware accelerator exists for the original PLOC algorithm [Viitanen et al. 2018] that has been shown to produce large performance improvements, while also reducing memory bandwidth requirements, compared to the original software implementation. This has led the authors of PLOC++ [Benthin et al. 2022] to note that an implementation of PLOC++ could similarly benefit from having the merged nearest-neighbor search and hierarchy construction phase offloaded to a dedicated hardware unit. Additionally, they suggest that a hardware unit for sorting key-value integer pairs would also be beneficial, since sorting the Morton codes is starting to have a run-time cost similar to that of the merged phase. The advent of PLOC++, alongside the apparent improvements possible through hardware acceleration, has become the motivation for this thesis. This motivation is furthered by the observation that previous builders, as discussed above, lack foundation in modern construction algorithms [Viitanen et al. 2018].

1.2 Delimitation

As mentioned in Section 1.1, the authors of PLOC++ note that dedicated hardware units for offloading:
1. the merged nearest-neighbor search and hierarchy construction phase
2. the sorting of Morton code key-value pairs
can both lead to large benefits in algorithmic performance. This thesis will focus only on 1), i.e. the merged nearest-neighbor search and hierarchy construction phase, as this constitutes the core part of PLOC++ and is also its main separation point from similar algorithms such as PLOC.

1.3 Ethical Considerations

The work of this thesis, its purpose and final product, is done in the hopes of contributing to the vast field of research in real-time ray tracing, such that the technology can become accessible to a wider audience through consumer-grade GPUs, smartphones, and video-game consoles.
That said, it can be speculated that such accessibility might be gated by high product pricing, and thus only be available to part of the population until said pricing drops as time passes since commercial launch. This kind of speculation can be made for all sorts of technologies; however, to bridge the gap between low-end and high-end products, technological advancements are likely needed in the first place. It is also important to recognize the possibilities for future technology, and user experiences in various media, that this kind of research can enable. Furthermore, the hardware can be verified using software simulation as well as synthesis on an FPGA (see Section 2.4). The act of simulation, coupled with the reprogrammability of FPGAs, makes it possible to conduct this research without producing hardware that will go to waste.

2 Background

This chapter presents the necessary background for understanding the thesis work.

2.1 Ray Tracing

Ray casting is a fundamental process in computer graphics that involves firing a ray from a specific location to determine the objects that are present in a particular direction. Ray tracing, on the other hand, is a technique built on ray casting in order to determine the light transport between various scene elements. These concepts constitute an inherent component of many modern rendering algorithms. In its most basic form, ray tracing involves emitting rays from the camera through the pixel grid into the scene. The closest object hit by each ray is then identified, at which point shadowing is computed by firing a ray from the intersection point towards each light source to determine whether any objects are present in between. Additional rays can also be spawned from an intersection point to further refine the lighting and shading of the scene, and the same is true for the number of rays sent per pixel [Akenine-Möller et al. 2018].

2.2 Bounding Volume Hierarchy

A bounding volume is a volume enclosing a set of objects.
It allows for a simpler geometric representation of its contained objects, and is used as their proxy to speed up computations related to, for example, rendering or collision detection. Different types of bounding volumes exist. Relevant to this thesis is the axis-aligned bounding box (AABB). An AABB is a rectangular volume whose faces have normals aligned with the axes of the coordinate system [Akenine-Möller et al. 2018]. For real-time rendering, the scene can be organized into bounding volumes forming a tree data structure. The root node of this tree represents a bounding volume which encompasses the entire scene. An internal node is a bounding volume of other bounding volumes, referred to as its children. The root node is an example of an internal node, with the caveat of not having a parent. Finally, leaf nodes contain the actual geometry to render, and thus have no children. This data structure is known as a bounding volume hierarchy (BVH) [Akenine-Möller et al. 2018], and is the most commonly used data structure for achieving high-performance real-time ray tracing [Vinkler et al. 2017; Viitanen et al. 2018; Meister et al. 2021; Benthin et al. 2022]. In particular, this thesis concerns itself with binary tree BVHs, which means that each internal node has at most two children [Akenine-Möller et al. 2018].

2.3 Morton Code

Morton code [Morton 1966] is a system for assigning integer values to sets of n-dimensional data. This is done by creating a number pattern that orders these sets based on their spatial proximity. This system is sometimes referred to as a Z-curve due to the pattern created in two-dimensional sets. The Morton code is popular because it is easy to calculate with simple bit-interleaving operations. For example, the n-bit Morton code of a vector v = (x, y, z) is created by interleaving the bits of v* = (x*, y*, z*) ∈ [0, 2^(n/3))^3.
In other words, the coordinates of v are first mapped into discrete integer values in the range [0, 2^(n/3)). The bit representations of these coordinates are then interleaved: x_0 y_0 z_0 x_1 y_1 z_1 ... x_(n/3) y_(n/3) z_(n/3) [Vinkler et al. 2017].

2.4 Hardware Description Design

Designing digital circuits and systems at a high level of abstraction is usually done through Hardware Description Languages (HDLs). One such language is SystemVerilog, which can be used in conjunction with design-suite tools such as Xilinx Vivado for synthesis and analysis of the hardware description. Synthesis can be thought of as the process of translating a high-level description of a digital circuit or system into a low-level implementation that can be programmed onto an FPGA (Section 2.5). This is done by converting the HDL design into gate-level representations and logic components. Synthesis, in other words, means that the abstract description of hardware provided in a high-level language will wire the FPGA circuitry to implement the specified behaviour, similar to how a compiler translates software code into CPU-readable instructions [Hauck and DeHon 2008]. To note is that SystemVerilog is not only an HDL, but also a Hardware Verification Language (HVL). This means that the language provides additional constructs which can only be used for verification purposes, and not synthesis. The specific subset of SystemVerilog that can be used for synthesis is written at the so-called register transfer level (RTL) [IEEE 2018], and the same is often true for other HDLs [Hauck and DeHon 2008].

2.5 Field-programmable Gate Array

A field-programmable gate array (FPGA) is a hardware device that can be configured, and at any point re-programmed, to implement a wide variety of tasks.
The term re-programmed in this sense means that configurations are not permanent, but rather programmed into the chip in such a way that new specifications can be provided to override the configuration with possibly a whole new set of functionality. To clarify: reprogramming can be done for something as simple as small bug fixes, to extend behaviour with new features, or even to implement something entirely different. This is done through so-called programming points in a specific portion of the FPGA memory. All logic and routing elements are controlled by these memory bits, and can be configured by supplying a file or bitstream to the device. As opposed to microprocessors, computations in an FPGA are implemented spatially, which allows a large number of operations to be processed in parallel across the silicon chip. This gives potential for large savings in terms of execution time, power, and silicon area [Hauck and DeHon 2008].

3 Algorithm

This chapter starts with an overview of the PLOC and PLOC++ algorithms, before proceeding to explain the process of translating these into a hardware-suitable implementation.

3.1 Parallel Locally-Ordered Clustering

Parallel Locally-Ordered Clustering (PLOC) [Meister and Bittner 2018] is a bottom-up BVH construction algorithm, where clusters of scene triangles are iteratively merged until only one cluster (the top-level BVH node) remains. Each scene triangle initially corresponds to one unique cluster. These initial clusters are, in a pre-processing step, sorted in 1D along the Morton codes of their centroids.
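The Morton codes used for this sort (Section 2.3) can be computed with a handful of bit operations. Below is a minimal sketch for a 30-bit code with 10 bits per axis, assuming the coordinates have already been quantized to integers; the helper names are illustrative, not taken from the thesis code:

```python
def expand_bits(v: int) -> int:
    """Spread the lower 10 bits of v so each lands at every third bit position."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3(x: int, y: int, z: int) -> int:
    """30-bit Morton code for quantized coordinates in [0, 1024).
    The x bit is placed highest in each 3-bit group, matching the
    x0 y0 z0 ... interleaving pattern of Section 2.3."""
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z)

print(bin(morton3(3, 3, 3)))  # 0b111111: nearby coordinates share high bits
```

Sorting clusters by this key is what gives the locality property that the nearest-neighbor search relies on.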
The main part of the algorithm then consists of repeated applications of the following three phases:
• Nearest-neighbor search: Based on the property that nearby indices in the Morton code sorted array are likely to be close in 3D space, the algorithm finds and stores, for each cluster, the closest neighboring cluster within a search radius R of neighboring indices in the sorted array. In other words, for each index i in the array, find the index in the range [i - R, i + R] that minimizes the distance function. In particular, the distance between two clusters is defined as the surface area of their merged bounding volume, with ties broken by lower index. This function follows a non-decreasing property, as explored by Walter et al. [2008], which guarantees that no better neighbor will emerge in a later iteration for currently mutually agreeing nearest neighbors.
• Merging of clusters: After finding the nearest neighbor of each cluster, a second CUDA kernel is launched in which any mutually agreeing nearest neighbors are merged into one AABB. The merged cluster overwrites the lower index of the two merged nodes, while the upper index is marked as invalid. Nodes without a mutual nearest neighbor are left unchanged. Valid/invalid metadata is stored in a separate array.
• Compaction: The final phase removes any invalid entries from the array. This is done through three kernel launches, and employs a prefix sum to determine the positions of valid node indices. A prefix sum of a sequence A is a second sequence B such that each element at index x in B is the sum of all elements up to and including index x in A.
The array does not get re-sorted in Morton code order after each iteration, due to the cost this would incur, and through the observation that the Morton code ordering is sufficiently preserved between iterations for the approximate nature of the nearest-neighbor search.

3.2 PLOC++

The authors of PLOC identify three main bottlenecks of the algorithm.
The first is the complexity of the nearest-neighbor search. The second is the overhead associated with the five kernel launches required for each iteration of the algorithm, and the third is iterating through, and performing memory operations on, auxiliary arrays (for example, for keeping track of valid/invalid states and performing the prefix sum during the compaction phase). PLOC++ [Benthin et al. 2022] is an extension of PLOC that improves performance by addressing these bottlenecks. This thesis work concerns itself only with PLOC++ Two-Level, an alternate version of PLOC++, which is the one to be implemented in hardware. For brevity, PLOC++ Two-Level will henceforth be referred to simply as PLOC++. The algorithm splits the Morton code sorted array into independent sequences such that each sequence independently constructs a BVH within an OpenCL work group. This allows each sequence to build up to its respective root node in only one kernel launch, which follows from the assumption that sequences are independent. Viewing each sequence as independent means that the output of the nearest-neighbor search in one such sequence is the only data required for merging its clusters. This data can therefore be stored in local memory, as opposed to global memory, because it is irrelevant to other sequences. The nearest-neighbor search and merging phase can thus be done in a single kernel launch. Furthermore, the compaction phase can happen in the same kernel launch as well, because the assumption of independence means that changes in one sequence do not affect other sequences. However, to maintain BVH quality, each independent sequence does not fully construct its BVH. Instead, the iterative process stops once the total number of clusters within a sequence reaches a given threshold. When all sequences have reached this threshold, they are merged into one range, on which a second kernel is launched to build the top-level bounding volumes.
The total number of kernel launches is thus reduced from five times the number of PLOC iterations to only two. Efficiently partitioning the Morton code sorted array into independent ranges is done by viewing the Morton codes as forming a binary tree, such that the root node corresponds to the full range over all clusters. The two children of each node can then be viewed as splitting the range into two independent sequences. This tree is traversed top-down, breadth-first, splitting the largest ranges until the desired number of ranges has been made. Furthermore, changes have been made to the nearest-neighbor search algorithm itself in order to reduce its impact on the overall performance. Recall from Section 3.1 that the search is done in a range [i - R, i + R] for each position i of the input array, with search radius R. The search complexity is thus N × R × 2, where N is the number of elements in the array. This is reduced by half to N × R in PLOC++ by utilizing the commutative property of the distance function, distance(x, y) = distance(y, x), which means that computing distance(x, y) contributes to the nearest-neighbor search of both clusters x and y. Exact implementation details will however be omitted, as they are not necessary to understand the hardware implementation.

3.3 From Software to Hardware

Listing 1 shows an abstract overview of the PLOC++ algorithm. Each of the components inside the PLOC SWEEP loop, i.e. the NN-search, merge, and compaction phases, constitutes its own independent loop across each cluster in the sequence. In other words, a phase does not start until the previous one has completed for every element.
Preprocessing:
    Compute Morton codes for each cluster
    Sort clusters according to Morton codes
    Split the Morton code sorted array into independent sequences

Main Algorithm:
    For each independent sequence:
        while (number of clusters > threshold):
            PLOC SWEEP:
                Find nearest neighbor of each cluster in radius R
                Merge mutually agreeing nearest neighbors
                Remove empty indices after merging
    Concatenate independent sequences into one range
    Build top-level BVH by repeatedly applying PLOC SWEEPs

Listing 1: Overview of PLOC++

While PLOC++ performs fewer operations on auxiliary data buffers than the original PLOC, the overview in Listing 1 still requires many such operations in each iteration of the PLOC SWEEP loop. Performing these operations on data buffers residing within GPU memory becomes problematic, because the data sizes involved exceed the permissible cache limits, resulting in inefficiencies that compromise the overall system's performance. When designing a dedicated hardware unit, the latency might even increase unless a large enough storage is present locally within the unit. Note, however, that the data size is in the worst case equal to the scene's primitive count. Designing a hardware unit with enough local memory to store such an amount becomes infeasible, because the primitive count differs between scenes and is therefore difficult to predict. It might also not be desirable to accommodate such storage due to its large area and energy requirements. Finally, providing the storage does not remedy the inherent latency problem associated with performing memory operations. The hardware should therefore present an approach that minimizes memory buffer traffic and favors local data transfers, thereby enhancing throughput. This necessitates a two-part solution, with the initial step involving the transformation of single PLOC SWEEP units into autonomous, pipelined, streaming processes.
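For reference, the three phases that a PLOC SWEEP applies to its sequence (Listing 1) can be modelled in plain software. The sketch below is illustrative only; the cluster representation (AABBs as min/max corner pairs), the helper names, and the surface-area distance follow Section 3.1 but are not the thesis code:

```python
def merge(a, b):
    # Smallest AABB enclosing both; a box is a ((xmin,ymin,zmin), (xmax,ymax,zmax)) pair.
    (alo, ahi), (blo, bhi) = a, b
    return (tuple(map(min, alo, blo)), tuple(map(max, ahi, bhi)))

def area(box):
    # Surface area of an AABB: the distance metric used by PLOC.
    lo, hi = box
    dx, dy, dz = (h - l for h, l in zip(hi, lo))
    return 2 * (dx * dy + dy * dz + dz * dx)

def ploc_sweep(clusters, radius):
    n = len(clusters)
    # Phase 1: nearest-neighbour search within +-radius indices,
    # ties broken by lower index.
    nearest = []
    for i in range(n):
        window = [j for j in range(max(0, i - radius), min(n, i + radius + 1)) if j != i]
        nearest.append(min(window, key=lambda j: (area(merge(clusters[i], clusters[j])), j)))
    # Phase 2: merge mutually agreeing pairs into the lower index.
    out, valid = list(clusters), [True] * n
    for i, j in enumerate(nearest):
        if nearest[j] == i and i < j:
            out[i] = merge(clusters[i], clusters[j])  # a BVH node would be emitted here
            valid[j] = False
    # Phase 3: compaction; an exclusive prefix sum over the valid
    # flags gives each survivor its output position.
    pos, total = [0] * n, 0
    for i in range(n):
        pos[i], total = total, total + valid[i]
    compacted = [None] * total
    for i in range(n):
        if valid[i]:
            compacted[pos[i]] = out[i]
    return compacted
```

Repeating ploc_sweep until a single cluster remains yields the root; the real implementation additionally records the BVH topology as clusters are merged.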
This step minimizes latency between the NN-search, merging, and compaction phases by starting the next phase for a given cluster as soon as its independent requirements are met. The pipelined streaming approach also makes it possible for clusters to move between phases without additional memory operations. The second step involves performing multiple on-chip sweeps, further reducing traffic between sweep units and increasing the system's reliance on on-chip data streams. Listing 2 shows an overview of the algorithm encapsulating the desired behaviour of the PLOC SWEEP units. The algorithm is designed for a throughput of one AABB per clock cycle. It takes input from a FIFO memory and performs the entire PLOC SWEEP loop from Listing 1 on its independent AABB sequence using only local circular buffers of size B, where B must be a power of 2 in order to take advantage of the quick modulo operation performed by the mask() function. The phases have been merged into one loop through the observation by Viitanen et al. [2018] that once nearest[i] has been computed by the NN-search, there is enough information to perform the merging/compaction phase up to element i - search_radius. This is because individual clusters only depend on clusters within the search radius. As such, this algorithm is similar to the one presented in PLOCTree [Viitanen et al. 2018], however with some slight modifications, as well as the addition of the improved NN-search algorithm of PLOC++. As mentioned in Section 3.2, the NN-search takes advantage of the commutative property of the distance function. After initializing the nearest[] array to a large default value, the search is performed in the range [i + 1, i + search_radius] for each cluster i. The 32-bit distance of each pair (i, i + r) is then left-shifted to allow the relative offset to the neighbor to be stored in the lower 32 bits.
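This packed encoding can be sketched as a minimal software model (names ours): the quantized distance occupies the upper 32 bits and the relative offset the lower 32, so a single unsigned min() compares first by distance and, on equal distances, by the encoded offset. Note that a masked negative offset compares as a large unsigned value, so the exact tie-breaking convention of the hardware may differ:

```python
MASK32 = (1 << 32) - 1

def pack(dist32: int, offset: int) -> int:
    # Quantized 32-bit distance in bits 63..32; relative neighbour
    # offset (two's complement, masked to 32 bits) in bits 31..0.
    return (dist32 << 32) | (offset & MASK32)

def unpack(word: int):
    return word >> 32, word & MASK32

a = pack(100, 3)   # candidate three positions ahead, distance 100
b = pack(250, 1)   # closer in the array, but farther by the metric
print(unpack(min(a, b)))  # (100, 3): one integer min selects the nearer neighbour
```

Keeping distance and offset in one word is what lets the sweep update a cluster's best candidate with a single comparison per pair.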
In actuality, this can be implemented in SystemVerilog without explicit bit manipulation through the use of packed structs. Finally, once each sweep pipeline has finished building its BVH, the top-level BVH is built by concatenating the sequences into one range that is given as input to a single sweep pipeline. This mirrors the implementation of PLOC++, where independent BVHs are first built and then collected into one range for constructing the top-level BVH, in a total of two kernel launches.

input[B]; nearest[B] = MAX_VALUE; is_merged[B] = false;
search_radius = 16;
encode_mask = NOT((1 << 32) - 1);
c_idx = 0;

while (threshold not reached)
    // ---- Nearest Neighbor Search ----
    min_dist_and_offset = MAX_VALUE;
    input[mask(c_idx + search_radius)] = fifo_in.pop();
    nearest[mask(c_idx + search_radius)] = MAX_VALUE;
    for (r = 1; r <= search_radius; r++)
        dist = distance(input[c_idx], input[mask(c_idx + r)]);
        my_offset = r;
        other_offset = (-r) AND NOT(encode_mask); // -r in 32-bit two's complement
        encoded_dist = (dist << 32) AND encode_mask;
        my_dist_and_offset = encoded_dist OR my_offset;
        other_dist_and_offset = encoded_dist OR other_offset;
        min_dist_and_offset = min(min_dist_and_offset, my_dist_and_offset);
        nearest[mask(c_idx + r)] = min(nearest[mask(c_idx + r)], other_dist_and_offset);
    end for
    nearest[c_idx] = min(nearest[c_idx], min_dist_and_offset);

    // ---- Merging & Compaction ----
    l_idx = mask(c_idx - search_radius);
    l_nearest = mask(l_idx + lower 32 bits of nearest[l_idx]);
    if (l_idx == mask(l_nearest + lower 32 bits of nearest[l_nearest]) && l_idx < l_nearest)
        output <= merge(input[l_idx], input[l_nearest]);
        is_merged[l_nearest] = true;
    else if (!is_merged[l_idx])
        output <= input[l_idx];
    else
        // merged in a previous iteration; nothing is output
    c_idx = mask(c_idx + 1);

mask(i32 value) {
    return value AND (B - 1); // B = size of input[]
}

Listing 2: Algorithm describing the behaviour of the PLOC SWEEP units.

4 Hardware Architecture

An overview of the architecture is shown in Figure 4.1.
Its main components are: memory, PLOC SWEEP units, as well as a control unit for handling input and output to each sweep unit.

Figure 4.1: System Architecture of a hardware unit with N sweep units. As indicated by the figure, both sweep units and memory reside inside the control module, similar to how various PC components reside on a motherboard. Note that the figure is meant as a simplified overview for visualization purposes, as opposed to a detailed schematic of the logic inside the control unit.

The control unit receives external input in the form of scene primitives and supplies them to the PLOC SWEEP units in parallel. The input primitives are sorted externally before being received, such that the control unit can correctly supply them to their corresponding sweeps through simple indexing. This continues for each sweep until its entire external input has been provided, after which the control unit proceeds to send input to the sweep through its associated FIFO memory. The PLOC SWEEP units are designed as a translation of the algorithm in Listing 2, and can be seen in Figure 4.2. A sweep unit consists of distance metric evaluators (DMEs), comparator units, a merge unit, as well as local circular buffers for storing auxiliary data of the AABBs currently in its pipeline. Each unit is assigned an independent sequence of the Morton code sorted array, which is stored inside its own FIFO section of the memory, and is populated in accordance with the Morton code ordering.

Figure 4.2: Architecture of a PLOC SWEEP unit with a search radius of 16. Here, DME stands for Distance Metric Evaluator, and CMP refers to a comparator unit.

At the start of every clock cycle, the control unit supplies the sweep unit with batches of 1 + search_radius AABBs, and the following events take place in order:
1. The input AABBs are sent to search_radius DME units, where distance(Ci, Ci+r) is computed for Ci = the lowest-index AABB in the batch, and r ∈ {1, ..., search_radius}.
Recall from Section 3.1 that the distance between two AABBs is defined as the area of their merged bounding volume.
2. The DME output of the cluster pairs (Ci, Ci+r) is checked against the currently saved distance of the nearest neighbor of Ci+r, while also being sent to a tree of comparator units. The comparator units compare the outputs of each DME to find the lowest distance. This distance is checked against the distance of the currently saved nearest neighbor of Ci. At this point, the nearest neighbor of Ci in the range [Ci-r, Ci+r] has been found.
3. Cluster Ci-r is checked for any mutually agreeing nearest neighbor. If one is found, and if not merged in a previous pass, then the two are sent to the merge unit before being output from the pipeline to be stored in the FIFO memory. If there is no mutually agreeing nearest neighbor, then the cluster is output from the pipeline as is. If merged in a previous pass, then the cluster will not be output, and will instead eventually be overwritten in a future iteration of the pipeline.

Once the sweep units have built their BVHs up to the threshold as in Listing 2, the top-level BVH is constructed by one sweep pipeline. This process is handled by the control unit, and must respect the Morton code ordering of the input. Note that this process can start as soon as the sweep unit whose input sequence contains the lowest-index AABBs has finished building, and thus said sweep unit can be appointed to construct the top-level BVH.

5 Evaluation

The hardware, as described in Chapter 4, was implemented at the register transfer level in SystemVerilog. The number of PLOC SWEEP units was set to 32, and, similarly to PLOC++ [Benthin et al. 2022], the search radius was set to 16; the iterative build process for each independent sweep terminates once search_radius or fewer clusters remain.
Build simulations of various scenes were run on the design, from which performance metrics in terms of execution time, power consumption, and memory traffic were measured and compared to the GPU implementation (https://github.com/embree/embree/tree/ploc) running on an Intel Arc A750. The simulation was configured to resemble the Intel Arc A750 in order to make the comparison fairer.

PLOCTree [Viitanen et al. 2018], on the other hand, does not have a public repository available. As such, the proposed hardware will only have its area and power consumption compared to the values reported in the original PLOCTree paper.

In pre-processing, the scene primitives were extracted from OBJ files, then sorted in Morton code order, and finally split into independent sequences in Python. These sequences were then input to the hardware unit in parallel. Thus, all performance measurements for both the proposed hardware design and the GPU implementation exclude any pre-processing and regard only the construction from primitives to root node.

As a side note: while the implementation and its resulting BVH have been thoroughly tested for correctness in simulation, as well as compared to the GPU implementation, the actual output BVH of a synthesized implementation was not loaded into a software ray tracer for validating correctness of the resulting trees, nor for measuring their ray tracing performance. This is due to a lack of available FPGAs at Chalmers this semester. That said, the algorithm in Listing 2 has been implemented in software and has had its results rendered for verification purposes prior to being realized as hardware.

6 Results

The main results are summarized in Table 6.1. The detailed performance results can be found in Figure 6.1, while area and power are shown in Table 6.2. Here, TDP stands for thermal design power.
TDP is the amount of power a processor is designed to dissipate without exceeding the highest possible operating temperature of its transistors. It is often used to estimate chip power consumption [Esmaeilzadeh et al. 2011].

As mentioned in Section 5, the proposed design was evaluated using a search radius of 16 and consists of 32 sweep units in total. While the results depend heavily on these parameters, it should be made clear that the hardware has been designed to work independently of their values. Both parameters can be changed before synthesis without requiring changes in the code. In this case, the search radius was chosen to be the same as in the original PLOC++ for the sake of comparison, and the number of sweep units was chosen to provide an appropriate trade-off between performance, power consumption, and area.

Table 6.1: Summary of results.

                 GPU PLOC++        PLOCTree                  HW PLOC++
                 Intel Arc A750    [Viitanen et al. 2018]    (simulated)
Area (mm2)       10962             2.43                      15.1
Clock (GHz)      2.05              1.0                       2.05
TDP (W)          225               0.362                     5.16
Build time       219%              -                         100%
Mem. traffic     394%              -                         100%

Figure 6.1: Relative performance speedup of the proposed hardware unit over the GPU implementation.

The hardware boasts an average improvement of 2.19x and 3.94x for build time and memory traffic respectively over the GPU implementation during simulation.

Initially, results after synthesis were also planned to be measured. However, as of yet, no FPGA has been available for use due to high demand for FPGA resources at Chalmers this semester. Once synthesized, the observed performance numbers are expected to be lower. This is because:

1. The clock frequency of modern FPGAs tends to be much lower than the selected value of 2.05 GHz. In particular, the PYNQ-Z2, which was originally meant to be used, has a reported maximum frequency of 650 MHz [Xilinx 2018].

2. The run-time behaviour of actual hardware is unlikely to be as ideal as in simulation.
Furthermore, memory traffic will likely differ between architectures due to their differences in available memory resources.

That said, it is difficult to predict by how much the results will differ after synthesis, because it depends on the actual hardware used. In this regard, simulation might actually prove to give a fairer comparison, as it can be tailored to match the GPU's specifications.

It should also be kept in mind that, as previously mentioned, the number of sweep units, as well as the search radius, can be changed independently of the design specification. Different values will have different performance results. For example, increasing the number of sweep units will lead to faster build times and better memory traffic, but will require more area while also increasing TDP.

Additionally, while the simulated design shows a decent improvement in memory traffic, there is still much more room for further optimization. The following lists a smorgasbord of techniques for further improvements in memory efficiency, which becomes increasingly important as the number of scene primitives increases:

• Increased on-chip memory.
• More sophisticated memory management as opposed to the simplistic FIFO.
• Compression techniques.
• Increased number of sweep units.
• Pipelined sweep units such that the output of one becomes input to another, similar to [Viitanen et al. 2018].
• Increased number of independent sequences per sweep unit.

Note also that the last item on the list is done during pre-processing, as opposed to requiring changes in the design. Furthermore, any combination of these techniques can be used independently of each other. That said, as before, these changes come with their own trade-offs in terms of build performance, power consumption, area usage, and even tree quality.

Table 6.2: Area and power breakdown of a single PLOC SWEEP unit.
Component                      Area (mm2)    Power (mW)
FIFO memory                    0.2           63
Distance metric evaluators     0.16          39
Comparator tree                0.03          40
Merge unit                     0.02          6
Data buffers                   0.04          12
Total                          0.45          160

As shown in Table 6.2, a single sweep unit has a rough power estimate of 160 mW. Given the large number of sweep units used, this results in quite a high TDP for the hardware unit (5.16 W). For reference, game benchmarks have shown a power consumption of 1–3 W for smartphones in 2014 [Pathania et al. 2014]. While recent smartphones have become more advanced since, and thus likely consume more power, the current configuration of hardware PLOC++ would still constitute a large percentage of the power budget. If performance is of less concern, then the TDP can be reduced simply by reducing the number of sweep units. However, depending on how many sweep units must be removed, PLOCTree [Viitanen et al. 2018] might become more suitable due to its smaller TDP requirements. The same is also true for any area concerns.

On desktop or console GPUs, however, the proposed hardware would only account for a very small percentage of the overall TDP and silicon area. In this case, there might even be enough room to increase the number of sweep units for further performance gains without affecting the overall TDP by much.

7 Conclusion

The proposed hardware accelerator for BVH construction based on the PLOC++ algorithm has shown an average improvement of 2.19x and 3.94x during simulation in terms of build time and memory traffic respectively, as compared to the GPU implementation. This result depends, however, on the search radius size, the number of sweep units, and the number of independent sequences assigned to each sweep unit. These parameters can be changed without impacting the design specification. Similarly, both power consumption and silicon area are also dependent on the number of sweep units.
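As a back-of-the-envelope check of that dependence (my own arithmetic, under the assumption that the chip totals in Table 6.1 are roughly the per-unit figures of Table 6.2 scaled by the 32 sweep units plus control-unit overhead):

```python
UNITS = 32                  # number of sweep units in the evaluated design
AREA_PER_UNIT_MM2 = 0.45    # per-unit total area from Table 6.2
POWER_PER_UNIT_W = 0.160    # 160 mW per-unit power from Table 6.2

sweep_area_mm2 = UNITS * AREA_PER_UNIT_MM2  # 14.4 mm^2
sweep_power_w = UNITS * POWER_PER_UNIT_W    # 5.12 W

# Table 6.1 reports 15.1 mm^2 and 5.16 W for the whole unit, leaving
# roughly 0.7 mm^2 and 0.04 W for the control unit and interconnect.
print(sweep_area_mm2, sweep_power_w)
```

The sweep units thus dominate both budgets, which is why varying their count is the primary lever for trading performance against area and TDP.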
A reduced number could allow the hardware to fit into mobile GPUs, much like PLOCTree [Viitanen et al. 2018], at the cost of build performance. On the other hand, the current design is small enough to fit into desktop and console GPUs. In this case, the number of sweep units could even be increased for further performance gains while still remaining a small percentage of the overall power consumption.

This thesis has shown an implementation of a hardware BVH builder based on the state-of-the-art construction algorithm PLOC++. While the current configuration has yet to be synthesized, the overall design has potential for performance improvements over the original for ray tracing in desktops, consoles, and possibly smartphones.

Bibliography

Tomas Akenine-Möller, Eric Haines, Naty Hoffman, Angelo Pesce, Michał Iwanicki, and Sébastien Hillaire. Real-Time Rendering, Fourth Edition. A. K. Peters, Ltd., USA, 4th edition, 2018. ISBN 0134997832.

Carsten Benthin, Radoslaw Drabinski, Lorenzo Tessari, and Addis Dittebrandt. PLOC++: Parallel locally-ordered clustering for bounding volume hierarchy construction revisited. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(3):1–13, 2022.

John Burgess. RTX on the NVIDIA Turing GPU. IEEE Micro, 40(2):36–44, 2020. doi: 10.1109/MM.2020.2971677.

Yangdong Deng, Yufei Ni, Zonghui Li, Shuai Mu, and Wenjun Zhang. Toward real-time ray tracing: A survey on hardware acceleration and microarchitecture techniques. ACM Comput. Surv., 50(4), aug 2017. ISSN 0360-0300. doi: 10.1145/3104067. URL https://doi.org/10.1145/3104067.

Michael J. Doyle, Colin Fowler, and Michael Manzke. A hardware unit for fast SAH-optimised BVH construction. ACM Trans. Graph., 32(4), jul 2013. ISSN 0730-0301. doi: 10.1145/2461912.2462025. URL https://doi.org/10.1145/2461912.2462025.

Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley.
Looking back on the language and hardware revolutions: Measured power, performance, and scaling. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 319–332, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450302661. doi: 10.1145/1950365.1950402. URL https://doi.org/10.1145/1950365.1950402.

Scott Hauck and André DeHon. Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. Morgan Kaufmann, Amsterdam, 1st edition, 2008. ISBN 9780123705228.

IEEE. IEEE standard for SystemVerilog–unified hardware design, specification, and verification language. IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012), pages 1–1315, 2018. doi: 10.1109/IEEESTD.2018.8299595.

C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH construction on GPUs. Computer Graphics Forum, 2009. ISSN 1467-8659. doi: 10.1111/j.1467-8659.2009.01377.x.

Xingyu Liu, Yangdong Deng, Yufei Ni, and Zonghui Li. FastTree: A hardware KD-tree construction acceleration engine for real-time ray tracing. In 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1595–1598, 2015.

David J. MacDonald and Kellogg S. Booth. Heuristics for ray tracing using space subdivision. The Visual Computer, 6:153–166, 1990.

Daniel Meister and Jiří Bittner. Parallel locally-ordered clustering for bounding volume hierarchy construction. IEEE Transactions on Visualization and Computer Graphics, 24(3):1345–1353, 2018. doi: 10.1109/TVCG.2017.2669983.

Daniel Meister, Shinji Ogaki, Carsten Benthin, Michael J. Doyle, Michael Guthe, and Jiří Bittner. A survey on bounding volume hierarchies for ray tracing. Computer Graphics Forum, 40(2):683–712, 2021. doi: 10.1111/cgf.142662.
URL https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.142662.

Guy M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. 1966.

Jae-Ho Nah, Hyuck-Joo Kwon, Dong-Seok Kim, Cheol-Ho Jeong, Jinhong Park, Tack-Don Han, Dinesh Manocha, and Woo-Chan Park. RayCore: A ray-tracing hardware architecture for mobile devices. ACM Trans. Graph., 33(5), sep 2014. ISSN 0730-0301. doi: 10.1145/2629634. URL https://doi.org/10.1145/2629634.

Jae-Ho Nah, Jin-Woo Kim, Junho Park, Won-Jong Lee, Jeong-Soo Park, Seok-Yoon Jung, Woo-Chan Park, Dinesh Manocha, and Tack-Don Han. HART: A hybrid architecture for ray tracing animated scenes. IEEE Transactions on Visualization and Computer Graphics, 21(3):389–401, 2015. doi: 10.1109/TVCG.2014.2371855.

Anuj Pathania, Qing Jiao, Alok Prakash, and Tulika Mitra. Integrated CPU-GPU power management for 3D mobile games. In Proceedings of the 51st Annual Design Automation Conference, DAC '14, page 16, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450327305. doi: 10.1145/2593069.2593151. URL https://doi.org/10.1145/2593069.2593151.

Timo Viitanen, Matias Koskela, Pekka Jääskeläinen, Heikki Kultala, and Jarmo Takala. MergeTree: A fast hardware HLBVH constructor for animated ray tracing. ACM Trans. Graph., 36(5), oct 2017. ISSN 0730-0301. doi: 10.1145/3132702. URL https://doi.org/10.1145/3132702.

Timo Viitanen, Matias Koskela, Pekka Jääskeläinen, Aleksi Tervo, and Jarmo Takala. PLOCTree: A fast, high-quality hardware BVH builder. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(2):1–19, 2018.

Marek Vinkler, Jiri Bittner, and Vlastimil Havran. Extended Morton codes for high performance bounding volume hierarchy construction. In Proceedings of High Performance Graphics, HPG '17, New York, NY, USA, 2017.
Association for Computing Machinery. ISBN 9781450351010. doi: 10.1145/3105762.3105782. URL https://doi.org/10.1145/3105762.3105782.

Ingo Wald. On fast construction of SAH-based bounding volume hierarchies. In 2007 IEEE Symposium on Interactive Ray Tracing, pages 33–40, 2007. doi: 10.1109/RT.2007.4342588.

Bruce Walter, Kavita Bala, Milind Kulkarni, and Keshav Pingali. Fast agglomerative clustering for rendering. In 2008 IEEE Symposium on Interactive Ray Tracing, pages 81–86, 2008. doi: 10.1109/RT.2008.4634626.

Sven Woop, Erik Brunvand, and Philipp Slusallek. Estimating performance of a ray-tracing ASIC design. In 2006 IEEE Symposium on Interactive Ray Tracing, pages 7–14, 2006. doi: 10.1109/RT.2006.280209.

Xilinx. PYNQ-Z2 User Manual, 2018. URL https://www.mouser.com/datasheet/2/744/pynqz2_user_manual_v1_0-1525725.pdf.