Efficient Path Rendering For
Low-End GPUs

Master’s thesis in Computer science and engineering

Linus Wallman

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025


Master’s thesis 2025

Efficient Path Rendering For
Low-End GPUs

Linus Wallman

Department of Computer Science and Engineering
Chalmers University of Technology

University of Gothenburg
Gothenburg, Sweden 2025


Efficient Path Rendering For Low-End GPUs
Linus Wallman

© Linus Wallman, 2025.

Supervisor: Erik Sintorn, Department of Computer Science and Engineering
Advisor: Daniel Gunnarsson, Kollmorgen
Examiner: Ulf Assarsson, Department of Computer Science and Engineering

Master’s Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LATEX
Gothenburg, Sweden 2025

iv


Efficient Path Rendering For Low-End GPUs
Linus Wallman
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
Modern discrete graphics cards are powerful, and succeed in massively rendering
paths for varying domains, one being path-based layouts for automated guided ve-
hicles (AGVs). However, integrated GPUs (iGPUs) are not as powerful as discrete
graphics cards, and therefore there is a need for algorithms to accelerate the process
of massively rendering paths on iGPUs. This thesis presents a method to massively
render paths, more specifically B-spline curves, efficiently and scalably on iGPUs.
Instead of rasterizing the polylines, we determine which pixels should be colored
based on their intersections with line segments sampled from B-spline curves. We
show that our method can massively render paths efficiently and scalably, but only
for static layouts. To extend our method to dynamic scenarios, we suggest using a
linear quad tree or AABB tree instead of a uniform grid.

Keywords: Computer science, computer graphics, path rendering, integrated GPU.

v


Acknowledgements
Firstly, I would like to give a thank you to my supervisor, Erik Sintorn, for his
invaluable guidance in shaping this thesis and for his support throughout the project.

Secondly, I am grateful for Daniel Gunnarsson and Henrik Persson at Kollmorgen for
making this thesis possible. Furthermore, I would like to give an additional thank
you to my company advisor, Daniel Gunnarsson, for his guidance throughout this
project.

Lastly, I would also like to thank my colleagues at Kollmorgen, including Ali Majidi,
Mohammadamin Mohammadi, and the rest of Purple Team for their stimulating
discussions, support, encouragement, and for making the work environment enjoy-
able.

Linus Wallman, Gothenburg, 2025-06-10

vii


Contents

List of Figures xi

List of Tables xiii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.1 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Dynamic Layouts . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Previous Work 5
2.1 CPU and GPU Path Rendering . . . . . . . . . . . . . . . . . . . . . 5
2.2 Path Rendering in Vector Graphics . . . . . . . . . . . . . . . . . . . 5
2.3 Rendering Paths Using Hardware Tessellation . . . . . . . . . . . . . 6
2.4 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Theory 9
3.1 Graphics Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 B-Spline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.1 B-Spline Basis Function . . . . . . . . . . . . . . . . . . . . . 10
3.2.2 Cox-De Boor Recurrence Basis Function Construction . . . . . 11
3.2.3 Varying B-Spline Behavior via Knot Vector . . . . . . . . . . . 12
3.2.4 Cubic B-Spline Curve . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 Hardware Tessellation . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Tessellation Shaders . . . . . . . . . . . . . . . . . . . . . . . 13

3.4 Acceleration Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Method 15
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Tessellation Shader Implementation . . . . . . . . . . . . . . . . . . . 16

4.2.1 Dynamic Adaptive Sampling . . . . . . . . . . . . . . . . . . . 16
4.2.2 Sampling in Tessellation Evaluation Shader . . . . . . . . . . 17
4.2.3 View Frustum Culling . . . . . . . . . . . . . . . . . . . . . . 17
4.2.4 Pre-Pass Compute Shader Implementation . . . . . . . . . . . 18
4.2.5 B-Spline Bounding Box . . . . . . . . . . . . . . . . . . . . . . 19

ix


Contents

4.3 Per-Pixel Intersection Implementation . . . . . . . . . . . . . . . . . 20
4.3.1 Uniform Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.3 Counting Segments in Grid Cells . . . . . . . . . . . . . . . . 21
4.3.4 Computing Prefix Sum . . . . . . . . . . . . . . . . . . . . . . 22
4.3.5 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.6 Per-Pixel Intersection Testing . . . . . . . . . . . . . . . . . . 22

5 Results 25
5.1 Environment and Hardware Setup . . . . . . . . . . . . . . . . . . . . 25
5.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.3.1 Tessellation Shader Implementation Performance . . . . . . . 29
5.3.2 Per-Pixel Intersection Testing Implementation

Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Path Appearance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6 Conclusion 33
6.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.1.1 Dynamic Layouts . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.2 Anti-Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6.2 Ethics and Risk Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2.1 Intellectual Property . . . . . . . . . . . . . . . . . . . . . . . 34
6.2.2 Conflict of Interest . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2.3 Social and Environmental Responsibility . . . . . . . . . . . . 34

6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Bibliography 37

x


List of Figures

1.1 Example of a B-spline curve with 8 control points representing a path. 1

3.1 Overview of OpenGL’s rendering pipeline. Dashed lines indicate op-
tional stages. Green stages are programmable, and yellow are non-
programmable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 B-Spline parametric equation. . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Four B-Spline basis functions of third degree with a domain defined

in the normalized knot vector span t ∈ [0, 1]. . . . . . . . . . . . . . . 10
3.4 Expanded recursion tree for the basis function N0,4. . . . . . . . . . . 11
3.5 Left: Open uniform B-Spline curve. Middle: Clamped uniform B-

Spline curve. Right: Clamped closed uniform B-Spline curve. . . . . 12
3.6 Cubic B-Spline curve with 4 control points. . . . . . . . . . . . . . . . 12
3.7 Tessellation stages in the rendering pipeline. . . . . . . . . . . . . . . 13

4.1 Tessellation Evaluation Shader B-spline segment sampling implemen-
tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 View frustum culling compute shader pseudocode. . . . . . . . . . . . 18
4.3 Overview of compute shader dispatch flow. A four phase single-

dispatch is executed in the order depicted in the figure. Then, using
the result from the binning step, pixel’s resulting color are set. . . . . 20

4.4 Pseudocode to convert world positions to grid cell indices. . . . . . . 21
4.5 Pseudocode to increment the number of segments per grid cell. . . . . 22
4.6 Naive parallel prefix sum pseudocode. . . . . . . . . . . . . . . . . . . 22
4.7 Screen-space to world space computation. . . . . . . . . . . . . . . . . 23

5.1 Dataset D1, rendered using our proposed per-pixel intersection testing
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.2 Dataset D2, rendered using our proposed per-pixel intersection testing
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.3 Plot with four different zoom-levels showcasing how performance is
increased when non-visible paths are culled. The four samples show-
cased in the figure are from the dataset, D1. . . . . . . . . . . . . . . 28

5.4 View frustum culling plotted alongside tessellation and rendering times
as a function of input size. . . . . . . . . . . . . . . . . . . . . . . . . 29

5.5 Four phase compute initialization comparison for datasets D1 and D2. 30

xi


List of Figures

5.6 Paths rendered using per-pixel method. Left: smooth lines by cal-
culating coverage and fading depending on distance to pixel. Right:
non-smooth lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.7 Clear visual aliasing and jaggedness rendered using per-pixel intersec-
tion testing method.Left: Jaggedness and aliasing in non-smoothing/fading
solution. Right: Less visual aliasing by smoothing and fading edges. 31

xii


List of Tables

5.1 Maximum number of B-spline curves rendered at 60 FPS across two
datasets, D1 and D2, using the two proposed methods. Per-pixel
intersection enables rendering more paths at the expense of higher
memory usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.2 Per-pixel intersection test performance under varying grid occupancy
using dataset D1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.3 Per-pixel intersection test performance under varying grid occupancy
using dataset D2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

xiii


List of Tables

xiv


1
Introduction

Figure 1.1: Example of a B-spline curve with 8 control points representing a path.

In AGV (Automated Guided Vehicle) applications, application engineers specialize
in designing path-based layouts that define vehicle routes within logistics environ-
ments such as warehouses. These layouts rely on B-spline curves, as shown in
Figure 1.1, to model smooth and precise paths. Since layout visualization often
involves interactive panning and zooming during design, real-time performance is
essential. The system must remain efficient and scalable to prevent bottlenecks that
could hinder usability.

While rendering a few high-quality paths with a small number of control points can
be performed efficiently on the CPU, the computational demand increases signifi-
cantly as the number of paths and control points grows. In such cases, leveraging
the GPU becomes essential to achieve real-time performance.

In this thesis, we investigate how to massively render paths, more precisely, cubic
basis-splines (B-Splines), efficiently and scalably. We present a new method that can
massively render smooth curves efficiently using the iGPU. As a comparison, a naïve
method of rendering paths is implemented using tessellation shaders. Furthermore,
throughout this report, the terms path and B-spline curve are used interchangeably.

1


1. Introduction

1.1 Motivation
There is currently no hardware support for rendering paths on the GPU. However,
there exist several acceleration techniques using the GPU to render paths efficiently
(see Chapter 2). Unfortunately, application engineers designing layouts do not typ-
ically have access to discrete graphics cards that these techniques assume. They
do, however, have iGPUs that we could target, but the acceleration techniques pre-
sented in Chapter 2 are not suitable for iGPUs due to limited memory bandwidth,
compute power, and available memory. This motivates the investigation into how
to maximize iGPU utilization in massively rendering paths.

2


1. Introduction

1.2 Problem Statement
This thesis aims to answer the following question: How are paths rendered efficiently
and scalably on iGPUs? To answer that question, the following sub-questions will
be explored:

• What are the current limitations of iGPUs in efficiently rendering paths?

• How can we maximize GPU utilization in efficiently rendering paths, and what
techniques can be adapted, combined, or improved to enhance path rendering
on iGPUs?

1.3 Limitations

1.3.1 Paths
We only consider constructing paths using cubic B-splines in this thesis. Typically,
cubic degree NURBS, Bézier or B-splines are sufficient to describe paths given their
continuity properties, simplicity, affine transformation invariance, and more [1]. Ar-
bitrary degree NURBS offer more generality, however, such as being able to accu-
rately model conic objects [2] which is crucial in CAD tools.

1.3.2 Dynamic Layouts
In this thesis, we only consider visualizing paths, and not the scenario where new
paths are created or existing paths changed. However, the use case of how to
efficiently visualize paths is still interesting, and therefore propose how to extend or
improve our current method towards a full solution allowing dynamic layouts. This
is further discussed in the discussion section, and methods of addressing this issue
are proposed.

3


1. Introduction

4


2
Previous Work

Path rendering is an active research area, often with a focus on vector graphics and
CAD-related tools. In this chapter we motivate and describe how our contribution
fits into the research of path rendering.

2.1 CPU and GPU Path Rendering
Gatilov [3] shows how to vectorize NURBS curves on the CPU using Horner’s rule
to reduce additions and multiplications when evaluating polynomials. Additionally,
Gatilov states that pre-computing blending between control points and basis func-
tions significantly increases performance. While vectorization on the CPU itself falls
outside our scope, identifying other sources of speed-up is relevant to our purpose;
as Gatilov suggests, pre-computing blending between control points and basis func-
tions is beneficial as we seek to improve performance in the evaluation of B-spline
curves.

Path rendering is not limited to the CPU. Krishnamurthy et al. present a method for
optimizing the evaluation of NURBS curves on the GPU [4]. In their paper, they
show how to efficiently evaluate arbitrary degree NURBS curves by packing the
knot values and basis functions into textures. By storing the knot values in the four
channels (RGBA) and basis functions in the height direction based on the degree of
the curve, they can efficiently and uniformly calculate any degree with one unified
fragment shader. However, this method requires building these textures on the CPU
in a pre-processing step, which is computationally expensive and memory-intensive.

2.2 Path Rendering in Vector Graphics
Loop and Blinn [5] propose a method to render resolution independent curves on
the GPU, which is later adapted and used by the same authors to render vector
art [6]. Typically, one either tessellates the path representation into triangles or
takes sufficient samples to render paths as polylines. Loop and Blinn propose to
instead exploit that the Bézier curve lies inside of the convex hull region created by
its control points. The control points are then used as vertices of a triangle and a
set of texture coordinates are mapped to its vertices. In the fragment shader they
evaluate f(u, v) = u2 − v < 0 and consider a pixel to be inside the curve if the

5


2. Previous Work

equation is true. To achieve anti-aliased curves, they approximate a signed distance
function in the fragment shader. They then compute the pixel’s alpha value by
either looking up in a precomputed texture with alpha values based on the distance
or calculate a blending function using the distance on the fly.

Santina [7] uses a similar approach to Loop and Blinn [5] to render paths, but
for NURBS curves. Additionally, Santina converts the curved parts of the path
to triplets of vertices, where two of the vertices are on the curve, and the third is
not. The remaining parts of the paths are triangulated using a modified Delaunay
triangulation. As the number of paths grow, this method will suffer from potential
overdraw with expensive per-fragment operations. Furthermore, the method fails to
scale as the number of paths increase — for every segment, they will generate one
triangle, and paths typically consist of more than one segment. Thus, rendering 1
million paths requires generating at least 1 million triangles.

Kilgard and Bolz [8] suggest an alternative method called Stencil, then Cover to
render filled or stroked paths. Contrary to other methods, they do not use any
intersection tests to determine coverage; instead these are implicit via stencil buffer
usage.

Batra et al. [9] build on the work of Kilgard and Bolz [8] where they present a method
using hardware tessellation to shade paths and a novel method of repurposing RGBA
buffers to render CMYK colors.

Scanline rasterization is another method of rendering paths introduced by Li et al.
[10]. They propose a GPU scanline approach that first generates boundary fragments
(pixels on Bézier or segment boundaries); first for all monotonic segments, and then
for all curved segments. These are then sorted by path index, merged per pixel that
is covered by the same path, and lastly rendered with a coverage mask.

2.3 Rendering Paths Using Hardware Tessellation
Kumar and Sud presents a method using tessellation shaders to render strokes [11].
In their proposed method they instrument the tessellation shader to output polylines
representing cubic Bézier curves, and then approximate these as quadratic Bézier
curves in the geometry stage by pairing together successive points from the polyline.
Then to render the strokes they use the same method as proposed by Nehab and
Hoppe [12].

2.4 Contribution
We aim to render a very large number of paths (on the order of hundreds of thou-
sands) at real-time frame rates. Although modern desktop CPUs often have 8 cores,
16 threads, and support SIMD (Single Instruction Multiple Data) vectorization, even
a modest iGPU typically exposes 32 or more execution units (EUs) with wide vector
pipelines [13]. As a result, when rendering hundreds of thousands of paths or more,
a CPU-only approach becomes infeasible: it simply cannot match the amount of

6


2. Previous Work

parallel threads that an iGPU can launch. In contrast, for scenes with only a small
number of paths, the CPU may still be faster due to higher clock frequencies.

Pre-computing the blending between control points and basis functions, as concluded
by Gatilov [3], yields negligible performance gain for NURBS curves when sampling
the curves on the GPU, which Gatilov also concluded for the CPU.

In this thesis, we focus on cubic B-splines, which is one reason why the method
proposed is not suitable [4]. We want to reduce bandwidth and achieve that by
exploiting that the knot values are uniform and repeated per B-spline in our case,
and the basis polynomials are constant since we only consider cubic B-splines.

Paths in CAD-related tools are rarely changed extensively and could therefore be
considered mostly static. Therefore, our contribution proposes to use a technique
where we, instead of using any of the techniques mentioned above, decide each
pixel’s color directly by testing if a pixel is on any of all sampled line segments.
Consequently, we are able to render one million paths in 60 frames per second.

7


2. Previous Work

8


3
Theory

3.1 Graphics Pipeline

Figure 3.1: Overview of OpenGL’s rendering pipeline. Dashed lines indicate optional
stages. Green stages are programmable, and yellow are non-programmable.

The graphics pipeline consists of a set of stages that primitives, often triangles, pass
through when requested to be rendered using a graphics API such as OpenGL [14]
to generate an image displayed on the screen. Every stage plays a core role in the
pipeline with its own responsibilities, initiated by a draw call on the CPU side,
and subsequently processed in parallel in the order depicted in Figure 3.1. These
stages consist of both programmable and non-programmable stages, where the pro-
grammable ones are called shaders, which are programs that are executed on the
GPU [15].

The core idea with having a graphics pipeline is to enable high-performance, but
also to provide a programming model that enables programmers to configure the
pipeline to suit their needs, instead of only providing fixed-function hardware [16].
There are stages that are fixed-function hardware and non-programmable, as shown
in Figure 3.1. One example is the rasterization stage where inputted geometry is
sent to the rasterizer which computes which fragments are covered by the inputted
geometry [15].

9


3. Theory

3.2 B-Spline

C(t) =
n∑

i=0
Ni,p(t)Pi

Figure 3.2: B-Spline parametric equation.

A basis-spline (B-Spline) is a spline, which is a way to mathematically model a
smooth curve that is determined by a set of polynomials partitioned over the sub-
domains of the spline domain [1]. The partitioning is defined by a set of values,
called the knot vector, a sequence of knots with uniform spacing which subdivide
the domain. Formally, the knot vector is defined as:

T = {t0, t1, . . . , tm}, t0 ≤ t1 ≤ · · · ≤ tm.

m = n + d + 1.

where d is the degree of the B-Spline, and n is the number of control points.

3.2.1 B-Spline Basis Function

Figure 3.3: Four B-Spline basis functions of third degree with a domain defined in
the normalized knot vector span t ∈ [0, 1].

In Figure 3.2 the parametric equation for B-Splines is depicted, where Ni, p(t) is a
basis function, and Pi is a control point. This is expressed as a linear combination,
where the basis function acts as a weight that is blended with every control point [1].
Basis functions are polynomials with degree p − 1 (p is the order of the B-Spline)
that are non-zero in the knot span [ ti, ti+d+1). For this reason, moving a control
point affects only a local part of the B-Spline.

B-Splines have the following properties: (1) local support, (2) linear independence,
(3) Cd−1 continuity, and (4) partition of unity [1]. Cubic B-Splines are typically used

10


3. Theory

because they have C2 continuity, and thus are smooth at every joint between the
polynomial curves. Partition of unity means that:

C(t) =
n∑

i=0
Ni,p(t) = 1

on the knot span [ti, ti+1). Consequently, every B-Spline is contained by the convex
hull created by its control polygon [2].

Rendering a B-Spline is done by first decomposing it into polylines (line segments).
Decomposing the B-Spline into polylines is typically done by sampling the paramet-
ric space with sufficiently high sampling rate to draw a good looking B-Spline [2].

3.2.2 Cox-De Boor Recurrence Basis Function Construction

Ni,0(t) =

1 if ti ≤ t < ti + 1
0 otherwise

Ni,p(t) = t − ti

ti+p − ti

Ni,p−1(t) + ti+p+1 − t

ti+p+1 − ti+1
Ni+1,p−1(t)

(3.1)

N0,4

N0,3

N0,2

N0,1

N0,0 N1,0

N1,1

N1,0 N2,0

N1,2

N1,1

N1,0 N2,0

N2,1

N2,0 N3,0

N1,3

N1,2

N1,1

N1,0 N2,0

N2,1

N2,0 N3,0

N2,2

N2,1

N2,0 N3,0

N3,1

N3,0 N4,0

Figure 3.4: Expanded recursion tree for the basis function N0,4.

Basis functions can be constructed for arbitrary degrees using Cox-De Boor recursion
formula 3.1 in an iterative top-down fashion by considering the recursion tree [2],
[17] shown in Figure 3.4. For example, to evaluate N0,3 we first have to evaluate
N0,2 and N1,2 recursively until we reach the the leaf nodes.

As shown in Figure 3.3, the base case, Ni,0(t), is a step function defined over the
knot span. In the general case, Ni,p(t), is a recursive linear interpolation of interpo-
lations [2].

11


3. Theory

3.2.3 Varying B-Spline Behavior via Knot Vector

Figure 3.5: Left: Open uniform B-Spline curve. Middle: Clamped uniform B-
Spline curve. Right: Clamped closed uniform B-Spline curve.

There are three variants of B-Splines that can be constructed using different knot
vectors: open, clamped and closed form, shown in fig 3.5. An open uniform B-
Spline curve can be constructed by distributing knot values with uniform spacing
over the parametric domain, and the clamped uniform B-Spline curve is constructed
by repeating knot values in the knot vector such that:

t0 = t1 = · · · = tk−1, tn−k+1 = tn−k+2 = · · · = tn.

holds. Clamped closed uniform B-Spline curves are loops, meaning the start and
end points are joined. This is typically achieved by repeating start control points at
the end of the B-Spline [2].

3.2.4 Cubic B-Spline Curve

Figure 3.6: Cubic B-Spline curve with 4 control points.

Cubic B-Spline curves are typically sufficient for most computer aided design related
modeling [1]. By normalizing the parameter t ∈ [0, 1] and analytically computing
the basis matrix, we then arrive at the resulting equation shown in Equation 3.2.

12


3. Theory

C(t) = 1
6

[ t3 t2 t 1 ]


−1 3 −3 1
3 −6 3 0

−3 0 3 0
1 4 1 0



P0
P1
P2
P3

 (3.2)

Furthermore, to model a B-Spline curve consisting of more than one B-Spline seg-
ment, we exploit that B-Splines can be modeled as a linear combination of open
uniform B-Splines because of its linear independency property [1]. Thus, we model
every segment as eq. 3.3 where every Bj corresponds to the j:th row in the basis
matrix as shown in Figure 3.2.

Si(t) =
3∑

j=0
Bj(t) Pi−2+j , t ∈ [0, 1] (3.3)

3.3 Hardware Tessellation
Tessellation in GPUs is the process of taking simple geometry and dynamically
deciding level of detail using fixed function hardware on the GPU. Furthermore, we
can for example, input a B-Spline curve description to the tessellation stages, sample
the parametric equation 3.2 and thereby avoid storing the sampled points in memory.
Instead, these are passed on through the pipeline as they are processed [15].

Hardware tessellation effectively reduces memory bus transactions between CPU
and GPU in the case where e.g., a curve should have more detail when more of it
is visible, while it should have less detail when less of it is visible. This is because
for every frame, the CPU would have to compute level of detail given for example
distance, and transfer it to the GPU. Using hardware tessellation, or tessellation
shaders as described in detail in Section 3.3.1, the programmer can compute the
distance and dynamically change level of detail [16].

3.3.1 Tessellation Shaders

Figure 3.7: Tessellation stages in the rendering pipeline.

Existing graphics APIs expose access to the tessellation stages of GPUs in the same
way, but with different names [15]. The names for every stage depicted in Figure 3.7
are the names used in OpenGL’s API [14].

The tessellation stages operate on abstract patches, configured by a specified number
of vertices. Every vertex in every patch is processed in the vertex shader, and
inputted into the Tessellation Control Shader (TCS) where per patch operations

13


3. Theory

are done. It is in this stage that the programmer can define logic to dynamically
determine level of detail by modifying built in variables. The outputted variables
are then used by the fixed-function hardware tessellator to generate primitives that
the Tessellation Evaluation Shader (TES) operates on [16].

3.4 Acceleration Structures
Performing intersection testing can be computationally expensive when dealing with
detailed or complex geometry. Naively performing intersection testing means com-
paring every object in a scene against every other object. This translates to a time
complexity of O(n2), and for large n, this becomes infeasible for even the cheapest
intersection testing such as an AABB vs AABB test [15]. However, there are ways to
reduce the number of needed intersection tests. Spatially structuring data reduces
the number of necessary tests because we only have to compare the objects that
are spatially grouped together. For example, in a tree data structure this would be
equivalent to only checking a node against the sub-tree that it is in. This effectively
reduces the necessary number of tests to a time complexity of O(log n) [18].

There exist many algorithms and data structures to spatially organize data, one
being trees and grid. If we divide the whole scene into a 2D grid and then map
every object in the scene onto a grid cell, then we only have to check every object
against all others objects in the same cell [18].

14


4
Method

In this chapter, we present two methods used to address the research questions
outlined in Section 1.2.

4.1 Overview
We first describe an implementation of rendering B-spline curves, using tessellation
shaders in OpenGL. We identify two primary issues with this implementation: first,
the GPU’s hardware-imposed tessellation factor prevents us from simplifying to ar-
bitrary numbers of generated isolines, something that is necessary to achieve high
performance while keeping sufficient visual quality. More critically, we notice that
the throughput is the main bottleneck. For example, 250k cubic B-spline curves con-
sist of at least one million control points, which translates to 250k abstract patches.
To overcome these limitations, we propose a second method which we call per-pixel
intersection algorithm, in which each pixel’s final color is determined by testing pixel
intersections against polylines sampled from B-spline curves. More importantly, the
spline segments are only sampled once in a pre-render loop phase, which reduces
the problem into figuring out which pixels are on the resulting polylines. To further
reduce the number of intersection tests, we introduced a uniform grid acceleration
structure, which significantly reduces the number of intersection tests.

15


4. Method

4.2 Tessellation Shader Implementation
We render B-spline curves as one pixel wide polylines on the GPU by using a modern
OpenGL technique called indirect rendering together with OpenGLs tessellation
stages. Indirect rendering is a concept in OpenGL and other graphics APIs where
instead of creating draw calls for every mesh or object, we instead batch draw
commands. As a benefit, the driver is not throttled by draw commands and can
instead determine how to schedule them internally.

The tessellation shader implementation approach has two key benefits:

1. Reduced memory usage and memory bandwidth, and

2. dynamic control of subdivision on the GPU, effectively reducing CPU-GPU
memory transactions.

We group control points into the abstract path primitives expected by the Tessel-
lation control Shader (TCS). Since it is not known how many control points there
are per B-spline curve, we define every patch as 4 vertices (4 control points). This
means that we effectively process one B-spline segment per Tessellation Evaluation
Shader (TES) invocation.

4.2.1 Dynamic Adaptive Sampling
When rendering the paths, we want to render at high quality while keeping main-
taining high performance. If we take too few samples with respect to the path’s
coverage in screen-space, then the rendered path will have visible edges. On the
other hand, if we take too many samples, we lose performance. Therefore, we adap-
tively determine a tessellation factor, analogous to determining how many samples
to take per B-spline segment, depending on its coverage in screen-space.

To calculate the coverage, we considered three alternatives: use a rough estimate
by either considering the area of the convex hull created by the control points,
calculating the total distance between every consecutive control point, or by taking
a sufficient number of samples and computing the distance similar as to the second
alternative.

We tested, and realized that sampling a B-spline is fast, and also provides a more
accurate representation of the polyline considering the pixel coverage than the other
two alternatives. We therefore use the third alternative: take 5 uniformly-spaced
samples per segment (t ∈ 0.0, 0.25, 0.5, 0.75, 1.0), project them to clip space, and
accumulate the distance between consecutive samples. Finally, we clamp that ac-
cumulated value to the range [1, 64]. All this is done once in the TCS per patch
primitive.

16


4. Method

4.2.2 Sampling in Tessellation Evaluation Shader

The TES is straightforward in the sense that we only need to implement a simple
algorithm that samples Equation 3.2. The parameter t in the Equation. 3.2 is
passed to the TES from the TCS signifying where along the B-spline segment we
should sample.

Figure 4.1: Tessellation Evaluation Shader B-spline segment sampling implementa-
tion.

1 coefficient = 1.0 / 6.0
2 cubic_basis_polynomials = 1.0 / 6.0 * mat4(
3 -1, 3, -3, 1,
4 3, -6, 3, 0,
5 -3, 0, 3, 0,
6 1, 4, 1, 0
7 );
8

9 sample_bspline(control_points, t)
10 // Compute the coefficients
11 ts = [t*t*t, t*t, t, 1.0]
12 basis = basis_polynomials * ts
13 // Multiply basis function with control points vector
14 x = dot(basis, points)
15 y = dot(basis, points)
16 return x, y

4.2.3 View Frustum Culling

We only want to render visible curves. To do so, we employed a view frustum culling
pre-pass compute shader where we build a buffer of indirect draw commands based
on visible paths. View frustum culling is a culling technique, meaning non-visible
objects are not processed during the rendering step for every frame. Since we use
orthographic projection, the view frustum will actually be a box instead of a cone
shape.

To cull paths we also needed a container to do intersection tests against. For this,
we exploited that B-splines are contained by the convex hull created by its control
polygon. We create these in a first step when all control points are loaded into the
program. We also saved metadata for every B-spline curve. This metadata contains
the number of control points, an index into the control point buffer, and lastly a
bounding box containing the B-spline curve.

17


4. Method

4.2.4 Pre-Pass Compute Shader Implementation

Figure 4.2: View frustum culling compute shader pseudocode.

1

2 // Assume SSBO definitions of atomic_index, spline_metadata_buffer,
and indirect_command_buffer.↪→

3

4 // Fetch index pointing into spline curve metadata buffer
5 curve_id = gl_GlobalInvocationID.x
6

7 // Ensure curve id is within the buffer range
8 if curve_id >= len(spline_metadata_buffer)
9 // if outside, exit.

10 return
11

12 // Load B-spline curve metadata
13 curve = spline_metadata_buffer[curve_id]
14

15 // Check if the B-spline curve associated AABB is within the
viewport↪→

16 if visible(curve.abbb)
17 // If so, use an atomic counter to find
18 // where in the indirect command buffer
19 // to write the indirect command.
20 index = atomicInc(atomic_index)
21 // Build an indirect command from curve metadata
22 cmd = create_indirect_command(curve)
23 // .. and write it to the indirect command buffer
24 indirect_command_buffer[index] = cmd

We implemented a pre-pass frustum culling in a compute shader shown in Figure 4.2
that is executed every frame before rendering. We assign a GPU thread per B-spline
curve to render, and check if the curve is visible by testing if its AABB is contained
within the viewport by AABB vs AABB intersection testing [18]. Since all GPU
threads write to the same buffer, and possibly same indices if a counter is shared,
we use an atomic counter to ensure that there are no race conditions. A different
approach could be to use shared memory, and allocate a range of indices assigned
for the thread block sharing memory.

18


4. Method

4.2.5 B-Spline Bounding Box
We save memory by computing a single bounding box per B-spline curve, and not
per B-spline segment. A B-spline curve can consist of a variable amount of B-spline
segments. For our data set, it averages to 8 B-spline segments per B-spline curve. We
could compute the bounding boxes on the fly and thereby reduce memory usage, but
chose instead to compute and store only a single bounding box. Unfortunately, this
leads to some extra B-spline segments being rendered even if not visible. However,
this is not the main bottleneck with this technique, because it is a common scenario
that all curves are visible at once.

We populated both the control point buffer and the index buffer by reading control
points from a file exported from an application where B-spline curves are used to
model paths. The indirect command buffer was populated in a pre-pass frustum
culling compute shader where only visible B-spline curves were written to the indirect
command buffer. We could not figure out a different way to implement this compute
shader without having an additional buffer, a metadata buffer, containing an index
to where the first control point is, the number of control points, and lastly a bounding
box. We construct the bounding box by exploiting that every B-spline curve is
contained within the convex hull of its control points, as described in Section 3.2.
This does not create a perfectly tight fit, but we realized empirically that it is
sufficient for our purpose.

19


4. Method

4.3 Per-Pixel Intersection Implementation

Figure 4.3: Overview of compute shader dispatch flow. A four phase single-dispatch
is executed in the order depicted in the figure. Then, using the result from the
binning step, pixel’s resulting color are set.

The tessellation shader method, as previously presented, is a rasterization prob-
lem. Instead of rasterizing polylines, we propose a method that we call per-pixel
intersection testing that computes which color every pixel should have.

In this section we describe the steps in implementing our per-pixel intersection
testing method. The algorithm uses a pre-compute step with four dispatch phases,
with the purpose of assigning line segments, sampled from B-spline curves, in a
grid. After the initialization phase, the grid is used to intersection test every pixel’s
corresponding world space coordinate against all line segments that fall inside of the
same cell as the pixel.

20


4. Method

4.3.1 Uniform Grid
One key performance gain in this method is from using a uniform grid. As described
in Section 3.4, we use this to spatially assign our primitives, line segments, and
efficiently access them. Therefore, the goal with the initial four phase, as depicted
in 4.3, is to build a grid that we can use to accelerate our per-pixel intersection tests.

For our test cases, we used a 128x128 grid size where every cell has a uniform size
determined by a boundary where all B-spline curves are in world space.

4.3.2 Sampling
Sampling is done using a similar method to how we did it in Section. 4.2.2. In-
stead of using tessellation shaders, we dispatch a compute shader. Additionally,
the samples are uniformly spaced depending on the sample count. For example, if
we set sample count per B-spline segment to 10 would mean sampling such that
t ∈ [ 0.0, 0.11, 0.22, 0.33, 0.44, 0.55, 0.66, 0.77, 0.88, 1.0 ]. Lastly, we aggregate the
samples into line segments.

One issue with this method of sampling is that we are writing to a buffer per GPU
thread (thread block of size 32), meaning there are data races if we do not use any
synchronization. To ensure there are no data races, we use an atomic counter to
get a unique index into the output line segment buffer. There are other ways to
efficiently implement this by allocating a range of indices that is shared within a
thread block group using shared memory, but we reasoned that this is executed only
once, and decided therefore to not use this method.

4.3.3 Counting Segments in Grid Cells
For our per-pixel intersection testing to work, we need an index to know where in
a global buffer that every line segment associated with a cell can be accessed. In
this step of the four-phase initialization, we keep a counter for every cell in a global
buffer corresponding to the number of line segments that intersect that cell.

We find every intersecting line segment with the corresponding grid cells by first
computing the bounding box of the line segments on the fly, by finding the minimum
and maximum points of the line segment. This bounding box is then mapped onto
the corresponding grid cells by computing the displacement to our grid origin and
then divide by cell size as shown in Figure 4.4. The grid origin is defined as the
bottom left point of the global grid bounding box.

Figure 4.4: Pseudocode to convert world positions to grid cell indices.

1 min_cell_box = floor((min_bbox - grid_origin) / cell_size)
2 max_cell_box = floor((max_bbox - grid_origin) / cell_size)
3

4 min_cell_box = clamp(min_cell_box, 0, 128 - 1)
5 max_cell_box = clamp(max_cell_box, 0, 128 - 1)

21


4. Method

Once the bounding box is mapped onto corresponding grid cells, we can iterate over
the grid cell indices to increment the counter for every cell that intersects with the
line segment as shown in Figure 4.5.

Figure 4.5: Pseudocode to increment the number of segments per grid cell.

1 For every min_cell_bbox.y <= max_cell_bbox.y
2 For every min_cell_bbox.x <= max_cell_bbox.x
3 cell_id = y * grid_size + x
4 grid_cells[cell_id] += 1

4.3.4 Computing Prefix Sum

Figure 4.6: Naive parallel prefix sum pseudocode.

1 For every i in parallel
2 For j = 0 to cell_index
3 prefix_sum_buffer[cell_index] += grid_cell_segment_count[j]

Prefix sum is an algorithm that given a set of input values, calculates the i-th value
given all preceding values by adding them and itself up in e.g an array of integers [19].
More formally, we define the prefix sum for the i-th element as: yi = yi−1 + xi.

We implemented a O(n2) naive parallel prefix sum in a compute shader to calculate
the starting position in the line segment buffer for every cell. The shader uses the
buffer from the previous phase, Count Segments, as input. Then, we assign one
GPU thread per element in a new output buffer where every thread is responsible to
compute its prefix sum using the input buffer, by executing the algorithm specified
in Figure 4.6.

4.3.5 Binning
The final phase of the initialization is the binning step where we assign line segments
to a buffer. We use the same algorithm as previously mentioned in Section. 4.3.3,
but in addition use two additional buffers: the prefix sum buffer and an atomic buffer
with same size as the grid dimensions. The atomic buffer ensures that writes to the
same grid cell are done atomically.

4.3.6 Per-Pixel Intersection Testing
The last phase of this algorithm is to determine which color every pixel should have.
To do so, we use the uniform grid from the previous initialization phase to access
line segments depending on which cell a pixel in world space resides in.

22


4. Method

Figure 4.7: Screen-space to world space computation.

1 screen_to_world(pixel_pos)
2 pixel_center = (pixel_pos + 0.5) / screen_size
3 to_ndc(pixel_center)
4 return inverse_projection_matrix * pixel_center

To get the world space coordinates for every pixel, we compute the pixels’ center
point, transform it to normalized device coordinates (NDC), and then apply the
inverse projection matrix as shown in Figure 4.7. The inverse projection matrix is
only computed once and passed as a uniform. Other uniforms passed to the shader
are the size of the grid, the size of every cell, a grid origin, pixel size, and line radius.
The pixel size is computed by taking the differential with respect to x and y. Our
line radius uniform is used to determine the width of the final rendered line, and
is empirically determined to 1.5 ∗ pixelsize and clamped to [2.0, 150.0] to adjust its
visual appearance at different zoom levels.

After computing a pixel’s world space coordinate, we have an early exit test to
avoid unnecessary point vs line tests. This early exit exploits the bounding box of
the uniform grid, and simply tests if the pixel is inside and otherwise early exits. If
a pixel is inside of the boundaries of the grid, then we find the corresponding cell
that it resides in, similarly as described in Section. 4.3.3 with the difference that we
are mapping a point and not a bounding box into grid cell indices.

When the affected grid cell is known, we use the two buffers constructed from Prefix
Sum and Count Segments steps to retrieve the start index of the line segments for
the cell and the number of line segments. We then loop and access every segment
and perform point vs line intersection test, and calculate a max coverage that we
later use to render smooth lines.

projv(w) = v · w
v · v v (4.1)

The point vs line intersection test uses the projection formula described in Equation.
4.1. We use this because it consists of few inexpensive operations and we can
implement it using a threshold factor to compute a width. We consider the width
as a empirically decided softness factor, which determines how fast the surrounding
pixels fade out. Lastly, we use a glsl function called smoothstep, which implements
cubic Hermite interpolation to smoothly fade the edges of the final lines. The
smoothstep function is defined on an interval [0, 1] where the tangents are zero
at x ≤ 0 and 1 ≤ x [15].

23


4. Method

24


5
Results

In this chapter, we present the results from our proposed method to render paths
on iGPUs. We present the performance using the methods presented in Chapter 4
and present visual figures of smooth and non-smooth lines.

To evaluate the performance of the two methods presented in Chapter 4, we used two
different datasets with varying densities; the first dataset is taken from a practical
real-world example where paths are uncommonly overlapped but still dense, while
the second dataset is generated to stress-test our methods by being distributed such
that paths are overlapping (see Figure 5.1 and Figure 5.2). Every time unit is an
average that is computed by either sampling the FPS or compute time (ms) over a
minute, and then computing the average.

5.1 Environment and Hardware Setup
All experiments were performed on a modern laptop with 32 GB RAM, an Intel Core
Ultra 7 155H processor, and integrated Intel Arc Graphics.

We used OpenGL 4.6 to enable modern programmable graphics features, includ-
ing tessellation and compute shaders, which are essential for the implementation
and comparison of both rendering methods. All benchmarks were performed at a
resolution of 1920x1080 unless otherwise stated.

All figures in this chapter are from an OpenGL application which implements the
methods as described in chapter 4. Every B-spline curve is rendered in 2D, and can
be visualized by scrolling and panning. Since the zoom-level affects the number of
visual curves, we only compare the test cases where all B-spline curves are visible and
at the same zoom-level. This is particularly important for our per-pixel intersection
method results, as it performs worse when all pixels on screen are in the grid.

25


5. Results

5.2 Datasets

Figure 5.1: Dataset D1, rendered using our proposed per-pixel intersection testing
method.

Figure 5.2: Dataset D2, rendered using our proposed per-pixel intersection testing
method.

We used two different datasets to evaluate the rendering performance under practical
and stress-test conditions. One dataset, D1, is used from a real-world application in-
volving B-spline curves used to define paths for Automated Guided Vehicles (AGVs).
The second dataset, D2, is synthetically constructed by combining and duplicating
B-spline curves to create high-density overlapping regions, designed to stress-test
the GPU utilization on an iGPU. These two datasets are visualized in Figure 5.1
and Figure 5.2.

26


5. Results

5.3 Performance

Table 5.1: Maximum number of B-spline curves rendered at 60 FPS across two
datasets, D1 and D2, using the two proposed methods. Per-pixel intersection enables
rendering more paths at the expense of higher memory usage.

Dataset Method Max Paths @60 FPS Memory Usage (MB)

D1
Tessellation Shader 225,000 128
Per-Pixel Intersection 1,020,350 1090

D2
Tessellation Shader 222,000 126
Per-Pixel Intersection 756,000 893

As shown in Table 5.1 we noticed an expected memory usage increase when testing
our method with rendering more paths. This follows our intuition, since the advan-
tage with using tessellation shaders is that the geometry is tessellated on hardware
without us explicitly using a buffer in memory to store the resulting polylines for
future rendering. Furthermore, we note that the per-pixel intersection method strug-
gles in scenarios when all pixels are covered by multiple paths, which is the case in
dataset D2. However, we stress that this is not a typical example of a real-world
example, but exists to stress-test our methods.

A straight comparison as shown in Table 5.1 is not fair. The tessellation shader
method and per-pixel method excel at different things: when the tessellation shader
is fully zoomed in, we effectively cull the majority of non-visible paths making it a
suitable method. In the opposite case, when all paths are visible the tessellation
shader performs worse than the per-pixel intersection method. We still present this
data in this way since they are two methods that we used to render paths, to put our
per-pixel intersection testing method into perspective with a naive implementation
to understand how much better or worse it is.

27


5. Results

Figure 5.3: Plot with four different zoom-levels showcasing how performance is
increased when non-visible paths are culled. The four samples showcased in the
figure are from the dataset, D1.

As shown in Figure 5.3, there is a clear correlation between compute time and
magnification. As zoom-levels change, or if the viewport is panned, non-visible
paths are culled. When the magnification is 10%, most of the dataset is visible,
leading to higher compute time. This pattern tapers off at 100% magnification, and
there is very little performance to gain which can be seen by comparing the data
point at 200% magnification with 100%.

28


5. Results

5.3.1 Tessellation Shader Implementation Performance
The tessellation shader implementation performs similarly in both datasets, as ex-
pected. Geometry throughput is the limiting factor — as the number of control
points grows, so does the number of patches that need to be processed, and we lose
performance.

Figure 5.4: View frustum culling plotted alongside tessellation and rendering times
as a function of input size.

View frustum culling has a small impact on the total compute time when the num-
ber of paths grow, as shown in Figure 5.4. In the worst-case, the view frustum
culling compute time is around 1 ms. The Tessellation & Render plot shown in Fig-
ure 5.4 depicts the total time for the pre-pass view frustum culling compute shader,
tessellation and render time.

We also measured the rasterization times for the same number of paths, but found
that rasterization had a marginal effect on the total compute time. We think the
reason for this is that since we are rendering isolines (OpenGL terminology for
polylines), the rasterizer will not create fragments that have large coverage. We
further verified that the rasterizer was not a bottleneck by using a profiling tool called
RenderDoc. RenderDoc allowed us to capture a frame, analyze pipeline statistics,
and more importantly, rasterization invocations and times.

5.3.2 Per-Pixel Intersection Testing Implementation
Performance

To test the performance of our proposed per-pixel intersection testing method, we
used the two previously presented datasets, D1 and D2. For both datasets, we
duplicated the number of paths such that they contained the same numbers, 90k,
180k, 360k, 540k and 720k paths. Then, we re-ran the initialization phase 10 times
per dataset and averaged the result shown in Figure 5.6

29


5. Results

Figure 5.5: Four phase compute initialization comparison for datasets D1 and D2.

We used OpenGL timestamp queries to get as close to accurate measurements for
every phase, meaning for every compute shader dispatch. Sampling scales linearly
with the number of paths, and is similar in both datasets as shown in Figure 5.6.
The most interesting part is that phase 2 and phase 4 takes much longer, which is
especially noticeable in D2. In our implementation we use atomics for writing to
cells. This seems to be the main reason to why phase 2 and phase 4 are slower in
D2, since if there are more paths intersecting a cell in the grid, then every thread
will compete against writing to the same index.

Scenario % Inside Grid Total Time (ms)

Best-case 0% 1.42
Average ∼50% 4.90
Worst-case ∼100% 7.50

Table 5.2: Per-pixel intersection test performance under varying grid occupancy
using dataset D1.

Scenario % Inside Grid Total Time (ms)

Best-case 0% 1.44
Average ∼50% 10.82
Worst-case ∼100% 16.92

Table 5.3: Per-pixel intersection test performance under varying grid occupancy
using dataset D2.

Our per-pixel intersection testing method performance depends on the percentage
of pixels on the grid. Therefore, we present rough estimates in percentage of pixel
coverage on the grid, and how it varies depending on the coverage as shown in
Table 5.2 and Table 5.3. We conclude that our per-pixel intersection testing method

30


5. Results

performs worse in high grid occupancy cases, since every pixel has to perform more
intersection tests.

5.4 Path Appearance

Figure 5.6: Paths rendered using per-pixel method. Left: smooth lines by calculat-
ing coverage and fading depending on distance to pixel. Right: non-smooth lines.

Figure 5.6

Figure 5.7: Clear visual aliasing and jaggedness rendered using per-pixel intersection
testing method.Left: Jaggedness and aliasing in non-smoothing/fading solution.
Right: Less visual aliasing by smoothing and fading edges.

The right image in Figure 5.6 illustrates the original, and unsmoothed path. When
viewing the same image at higher magnification the jaggedness is more evident along
the edges, as shown to the left in Figure 5.7. The other images in Figure 5.6 and
Figure 5.7 we can see that the edges are blurry in the latter figure to the right, while
not visible in the path in Figure 5.6.

31


5. Results

32


6
Conclusion

6.1 Discussion
In this section, we discuss possible alternatives, suggested improvements, and issues
with the proposed per-pixel intersection method.

6.1.1 Dynamic Layouts
Visualizing paths is not the only use case that users typically have. Paths are
typically edited, transformed, or deleted at large scale. This is not a feature that our
per-pixel intersection method supports, since the initialization phase where the grid
is built is too slow to run in real-time; see Figure 5.6 for exact numbers. We cannot
avoid sampling the paths with our per-pixel intersection testing method. However,
sampling is only required once at large scale, and afterwards it is considered cheap.
For example, if a user transforms a large number of paths at once, we only have to
resample the transformed paths to render them. Unfortunately, this method still
suffers from sampling and rebuilding compute times as the number of paths moved
grows making this method infeasible for dynamic layouts. There exist other grid
building methods that we could extend or adapt that for example focuses on memory
efficiency [20].

A uniform grid does not lend itself well to dynamic layouts. Instead, we propose ex-
ploring alternative acceleration structurese.g., an AABB tree or a linear quadtreeto
mitigate overlap issues and adapt cell size at runtime. In particular, a linear quadtree
can be built efficiently on the GPU and automatically adjusts both grid resolution
and extent.

Ultimately, we believe that a uniform grid is not suitable for dynamic layouts. In-
stead, we propose exploring alternative acceleration structures such as an AABB
(Axis-Aligned Bounding Box) tree [21] or a linear quadtree [22] to mitigate overlap
issues and adapt cell size at runtime. In particular, a linear quadtree can be built
efficiently on the GPU and automatically adjusts grid size. AABB trees are also
suitable because they inherently solve the issues as previously mentioned as with lin-
ear quadtrees, but can instead suffer from becoming unbalanced because of varying
path density distributions.

33


6. Conclusion

6.1.2 Anti-Aliasing
Instead of using smoothstep in our per-pixel intersection testing compute shader to
fade out the edges and calculate maximum coverage, we could rely on calculating
sub-pixel coverage by storing a sufficient number of line segments that intersect each
pixel. Then, we can test the known intersecting line segments against sub-pixels to
calculate a sub-pixel coverage in a similar manner to super-sampling anti-aliasing
(SSAA) [23].

6.2 Ethics and Risk Analysis
This thesis is conducted in close cooperation with a company and therefore an ethics
and risk analysis is necessary.

6.2.1 Intellectual Property
All code will be implemented independently of proprietary data or applications,
allowing results to be shared freely.

6.2.2 Conflict of Interest
There might be a conflict of interest in terms of the academic objective and the
company’s interest. To address the risks, the thesis will have a strong foundation in
objectivity while developing results that align with the company’s interests.

6.2.3 Social and Environmental Responsibility
This thesis addresses maximizing the utilization of the iGPU in path rendering,
resulting in increased power consumption. However, an iGPU is a System on Chip
(SoC) solution that is designed for low-power consumption [13], meaning, if we
can find a solution that allows application engineers to fully utilize their iGPU for
rendering paths, this might decrease the need for discrete GPUs which have a much
higher power consumption.

6.3 Future Work
As previously discussed in the discussion section, we think the next step is to replace
the uniform grid with a more fitting acceleration structure, e.g a linear quadtree to
support dynamic layouts. Linear quadtrees, as mentioned in Section 6.1.1, are fast
to build but slower to query [22]. We would also like to investigate how to best
implement an anti-aliasing technique, as mentioned in Section 6.1.2 to replace or
extend the technique of using the smoothstep function to fade out the edges.

The prefix sum phase, as shown in Figure 5.6, is the fastest out of all four phases.
There is performance to be gained; for example, by using a parallel prefix-sum
algorithm such as the one proposed by Harris et al. [24].

34


6. Conclusion

6.4 Conclusion
To summarize, we have answered our research questions by introducing a method
called per-pixel intersection testing. This technique launches one GPU thread per
pixel and computes the final color by performing intersection tests against line seg-
ments, effectively enabling the rendering of massive numbers of paths on iGPUs.
While currently limited to visualizing static scenes, we propose several improvements
to reduce or eliminate these limitations.

35


6. Conclusion

36


Bibliography

[1] A. Chaudhuri, B-splines, 2021. arXiv: 2108.06617 [cs.CV]. [Online]. Avail-
able: https://arxiv.org/abs/2108.06617.

[2] L. Piegl and W. Tiller, The NURBS book. Springer Science & Business Media,
2012.

[3] S. Y. Gatilov, “Vectorizing nurbs surface evaluation with basis functions in
power basis,” Computer-Aided Design, vol. 73, pp. 26–35, 2016, issn: 0010-
4485. doi: https://doi.org/10.1016/j.cad.2015.10.006. [Online].
Available: https : / / www . sciencedirect . com / science / article / pii /
S0010448515001591.

[4] A. Krishnamurthy, R. Khardekar, and S. McMains, “Optimized gpu evaluation
of arbitrary degree nurbs curves and surfaces,” Comput. Aided Des., vol. 41,
no. 12, pp. 971–980, Dec. 2009, issn: 0010-4485. doi: 10.1016/j.cad.2009.
06.015. [Online]. Available: https://doi.org/10.1016/j.cad.2009.06.
015.

[5] C. Loop and J. Blinn, “Resolution independent curve rendering using pro-
grammable graphics hardware,” in ACM Transactions on Graphics (TOG),
Proceedings of ACM SIGGRAPH 2005, [Accessed: Jan. 30, 2025], 2005. [On-
line]. Available: https : / / www . microsoft . com / en - us / research / wp -
content/uploads/2005/01/p1000-loop.pdf.

[6] C. Loop and J. Blinn, “Rendering vector art on the gpu,” in GPU Gems 3, H.
Nguyen, Ed., Addison-Wesley Professional, 2007, ch. 25. [Online]. Available:
https://developer.nvidia.com/gpugems/gpugems3/part- iv- image-
effects/chapter-25-rendering-vector-art-gpu.

[7] R. Santina, “Resolution independent nurbs curves rendering using programmable
graphics pipeline,” Sep. 2011.

[8] M. J. Kilgard and J. Bolz, “Gpu-accelerated path rendering,” ACM Trans.
Graph., vol. 31, no. 6, Nov. 2012, issn: 0730-0301. doi: 10.1145/2366145.
2366191. [Online]. Available: https://doi.org/10.1145/2366145.2366191.

[9] V. Batra, M. J. Kilgard, H. Kumar, and T. Lorach, “Accelerating vector graph-
ics rendering using the graphics hardware pipeline,” ACM Trans. Graph.,
vol. 34, no. 4, Jul. 2015, issn: 0730-0301. doi: 10.1145/2766968. [Online].
Available: https://doi.org/10.1145/2766968.

[10] R. Li, Q. Hou, and K. Zhou, “Efficient gpu path rendering using scanline
rasterization,” ACM Trans. Graph., vol. 35, no. 6, Dec. 2016, issn: 0730-0301.
doi: 10.1145/2980179.2982434. [Online]. Available: https://doi.org/10.
1145/2980179.2982434.

37

https://arxiv.org/abs/2108.06617
https://arxiv.org/abs/2108.06617
https://doi.org/https://doi.org/10.1016/j.cad.2015.10.006
https://www.sciencedirect.com/science/article/pii/S0010448515001591
https://www.sciencedirect.com/science/article/pii/S0010448515001591
https://doi.org/10.1016/j.cad.2009.06.015
https://doi.org/10.1016/j.cad.2009.06.015
https://doi.org/10.1016/j.cad.2009.06.015
https://doi.org/10.1016/j.cad.2009.06.015
https://www.microsoft.com/en-us/research/wp-content/uploads/2005/01/p1000-loop.pdf
https://www.microsoft.com/en-us/research/wp-content/uploads/2005/01/p1000-loop.pdf
https://developer.nvidia.com/gpugems/gpugems3/part-iv-image-effects/chapter-25-rendering-vector-art-gpu
https://developer.nvidia.com/gpugems/gpugems3/part-iv-image-effects/chapter-25-rendering-vector-art-gpu
https://doi.org/10.1145/2366145.2366191
https://doi.org/10.1145/2366145.2366191
https://doi.org/10.1145/2366145.2366191
https://doi.org/10.1145/2766968
https://doi.org/10.1145/2766968
https://doi.org/10.1145/2980179.2982434
https://doi.org/10.1145/2980179.2982434
https://doi.org/10.1145/2980179.2982434


Bibliography

[11] H. Kumar and A. Sud, “Fast, memory efficient and resolution independent
rendering of cubic bezier curves using tessellation shaders,” in SIGGRAPH
Asia 2019 Posters, ser. SA ’19, Brisbane, QLD, Australia: Association for
Computing Machinery, 2019, isbn: 9781450369435. doi: 10.1145/3355056.
3364548. [Online]. Available: https://doi.org/10.1145/3355056.3364548.

[12] D. Nehab and H. Hoppe, “Random-access rendering of general vector graph-
ics,” ACM Trans. Graph., vol. 27, no. 5, Dec. 2008, issn: 0730-0301. doi:
10.1145/1409060.1409088. [Online]. Available: https://doi.org/10.1145/
1409060.1409088.

[13] P. Gera, H. Kim, H. Kim, S. Hong, V. George, and C.-K. Luk, “Performance
characterisation and simulation of intel’s integrated gpu architecture,” in 2018
IEEE International Symposium on Performance Analysis of Systems and Soft-
ware (ISPASS), 2018, pp. 139–148. doi: 10.1109/ISPASS.2018.00027.

[14] J. Kessenich, G. Sellers, and D. Shreiner, OpenGL Programming Guide: The
official guide to learning OpenGL, version 4.5 with SPIR-V. Addison-Wesley
Professional, 2016.

[15] T. Akenine-Moller, E. Haines, and N. Hoffman, Real-time rendering. AK Pe-
ters/crc Press, 2019.

[16] M. NieSSner, B. Keinert, M. Fisher, M. Stamminger, C. Loop, and H. Schäfer,
“Real-time rendering techniques with hardware tessellation,” Computer Graph-
ics Forum, vol. 35, no. 1, pp. 113–137, 2016. doi: https://doi.org/10.1111/
cgf.12714. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/
cgf.12714. [Online]. Available: https://onlinelibrary.wiley.com/doi/
abs/10.1111/cgf.12714.

[17] C. De Boor, “On calculating with b-splines,” Journal of Approximation theory,
vol. 6, no. 1, pp. 50–62, 1972.

[18] C. Ericson, Real-Time Collision Detection. CRC Press, 2004, pp. 79–80.
[19] G. E. Blelloch, “Prefix sums and their applications,” 1990.
[20] E. M. T. II and S. N. Pattanaik, “A memory efficient uniform grid build

process for gpus,” Journal of Computer Graphics Techniques (JCGT), vol. 5,
no. 3, pp. 50–67, Sep. 2016, issn: 2331-7418. [Online]. Available: http://jcgt.
org/published/0005/03/04/.

[21] G. van den Bergen and, “Efficient collision detection of complex deformable
models using aabb trees,” Journal of Graphics Tools, vol. 2, no. 4, pp. 1–13,
1997. doi: 10.1080/10867651.1997.10487480. eprint: https://doi.org/
10.1080/10867651.1997.10487480. [Online]. Available: https://doi.org/
10.1080/10867651.1997.10487480.

[22] J. Dupuy, J.-C. Iehl, and P. Poulin, “Quadtrees on the gpu,” in GPU PRO
360 Guide to GPGPU, AK Peters/CRC Press, 2018, pp. 211–222.

[23] M. Holländer, T. Boubekeur, and E. Eisemann, “Adaptive supersampling for
deferred anti-aliasing,” Journal of Computer Graphics Techniques, vol. 2, no. 1,
pp. 1–14, 2013.

[24] M. Harris, S. Sengupta, and J. D. Owens, “Parallel prefix sum (scan) with
cuda,” GPU gems, vol. 3, no. 39, pp. 851–876, 2007.

38

https://doi.org/10.1145/3355056.3364548
https://doi.org/10.1145/3355056.3364548
https://doi.org/10.1145/3355056.3364548
https://doi.org/10.1145/1409060.1409088
https://doi.org/10.1145/1409060.1409088
https://doi.org/10.1145/1409060.1409088
https://doi.org/10.1109/ISPASS.2018.00027
https://doi.org/https://doi.org/10.1111/cgf.12714
https://doi.org/https://doi.org/10.1111/cgf.12714
https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12714
https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12714
https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12714
https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12714
http://jcgt.org/published/0005/03/04/
http://jcgt.org/published/0005/03/04/
https://doi.org/10.1080/10867651.1997.10487480
https://doi.org/10.1080/10867651.1997.10487480
https://doi.org/10.1080/10867651.1997.10487480
https://doi.org/10.1080/10867651.1997.10487480
https://doi.org/10.1080/10867651.1997.10487480

	List of Figures
	List of Tables
	Introduction
	Motivation
	Problem Statement
	Limitations
	Paths
	Dynamic Layouts


	Previous Work
	CPU and GPU Path Rendering
	Path Rendering in Vector Graphics
	Rendering Paths Using Hardware Tessellation
	Contribution

	Theory
	Graphics Pipeline
	B-Spline
	B-Spline Basis Function
	Cox-De Boor Recurrence Basis Function Construction
	Varying B-Spline Behavior via Knot Vector
	Cubic B-Spline Curve

	Hardware Tessellation
	Tessellation Shaders

	Acceleration Structures

	Method
	Overview
	Tessellation Shader Implementation
	Dynamic Adaptive Sampling
	Sampling in Tessellation Evaluation Shader
	View Frustum Culling
	Pre-Pass Compute Shader Implementation
	B-Spline Bounding Box

	Per-Pixel Intersection Implementation
	Uniform Grid
	Sampling
	Counting Segments in Grid Cells
	Computing Prefix Sum
	Binning
	Per-Pixel Intersection Testing


	Results
	Environment and Hardware Setup
	Datasets
	Performance
	Tessellation Shader Implementation Performance
	Per-Pixel Intersection Testing Implementation Performance

	Path Appearance

	Conclusion
	Discussion
	Dynamic Layouts
	Anti-Aliasing

	Ethics and Risk Analysis
	Intellectual Property
	Conflict of Interest
	Social and Environmental Responsibility

	Future Work
	Conclusion

	Bibliography