Stereoscopic Depth Imaging for Improved Point Cloud Generation of Skeletal Surfaces in IR-guided Orthopaedic Surgery

Master's thesis in computer science and engineering

MATS CEDERVALL
OSCAR DAHLQVIST

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024

Master's Thesis 2024

© MATS CEDERVALL, 2024.
© OSCAR DAHLQVIST, 2024.

Advisor: Hassan Nemati, Ortoma AB
Supervisor: Ulf Assarsson, Department of Computer Science and Engineering, Chalmers
Examiner: Erik Sintorn, Department of Computer Science and Engineering, Chalmers

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: The image depicts three components: a surgical navigation system, a depth camera, and a 3D reconstruction of intraoperative knee geometry.

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

This thesis proposes a flexible approach that combines traditional optical navigation systems with a visible light-based depth camera to align preoperative CT models with the corresponding bone exposed during active surgery, providing real-time feedback on implant positioning to the surgeon based on a preoperative plan. By leveraging the Intel® RealSense™ D405 depth camera, the study investigates the benefits and challenges of using stereoscopic depth imaging for 3D reconstruction of musculoskeletal surfaces. The primary goal is to evaluate how this technology can be used to reduce surgery duration in orthopaedic total knee arthroplasty (TKA) surgery. The proposed method transforms the depth images captured using the depth camera into a navigation system's frame of reference, accomplished through a series of preoperative calibration steps. The reconstructed intraoperative 3D model is then aligned with the CT model using iterative closest point (ICP) algorithms. This thesis also includes an investigation into using light polarisation filters, an analysis of hyperparameter tuning, and an accuracy evaluation of the Intel® RealSense™ D405 camera's depth estimation capabilities. Experimental validation includes a mock surgery on pig cadaver parts to simulate intraoperative conditions. Results demonstrate that the proposed approach achieves good alignment accuracy at around 0.5 mm, though it exhibits higher variability compared to competing methods. In conclusion, the integration of this class of depth cameras with optical navigation systems is a viable solution for improving the speed of knee registration in orthopaedic surgery, but future work is encouraged to address depth estimation inaccuracies.
Keywords: surgical navigation tools, image-guided orthopaedic surgery, total knee arthroplasty, stereoscopic RGBD camera, Intel® RealSense™ D405

Acknowledgements

We would like to express our sincere thanks to our mentor, Hassan Nemati, for his unwavering support and insightful guidance throughout the project. Our appreciation also extends to the exceptional team at Ortoma AB, who generously shared their expertise and warmly welcomed us during our time at their office. We are also grateful to our supervisor Ulf Assarsson at Chalmers, for his support and contributions to the project. Lastly, we acknowledge everyone else who contributed to the project in various ways, helping us finalise this thesis.

Mats Cedervall & Oscar Dahlqvist, Gothenburg, 2024-06-24

Contents

Glossary xi
1 Introduction 1
  1.1 Problem Description 2
  1.2 Purpose of the Study 4
    1.2.1 Research Questions 4
  1.3 Delimitations 4
  1.4 Ethical Considerations and Risks 5
2 Background 7
  2.1 Traditional IGOS Method 7
  2.2 Total Knee Arthroplasty 7
    2.2.1 Anatomy of the Knee 8
  2.3 Computed Tomography (CT) 8
  2.4 3D Representations 9
    2.4.1 Point Clouds and Meshes 9
    2.4.2 Homogeneous Coordinates and Transformations 9
  2.5 Medical Navigation Systems 10
    2.5.1 Patient Space 11
    2.5.2 Navigation System Tool Calibration 11
    2.5.3 Using Navigation Systems To Find Bone 11
  2.6 Cameras 12
    2.6.1 Intrinsics 12
    2.6.2 Extrinsics 12
    2.6.3 Camera Space 12
    2.6.4 Perspective-n-Point (PnP) 13
    2.6.5 Stereoscopic Depth Imaging 13
  2.7 Registration Algorithms 14
    2.7.1 Least Squares Point Alignment 14
    2.7.2 Iterative Closest Point (ICP) 15
      2.7.2.1 Local Registration 15
      2.7.2.2 Global Registration 15
  2.8 RANSAC 15
  2.9 Polarised Light and Reflective Surfaces 16
3 Related work 17
  3.1 Summary 19
4 Methodology 21
  4.1 Experiments 22
    4.1.1 Equipment 22
  4.2 Camera Tracking 23
    4.2.1 Camera Space 24
    4.2.2 Camera Space to Navigation Space 24
      4.2.2.1 Shared Geometry Calibration 25
      4.2.2.2 Multi Frame Calibration 26
    4.2.3 Camera-Navigation Time Synchronisation 27
  4.3 CT Alignment 27
    4.3.1 Filtering 28
    4.3.2 Frame Averaging 28
    4.3.3 ICP Registration 29
  4.4 Hyperparameters 29
  4.5 Polarisation 30
  4.6 Data Evaluation 31
    4.6.1 Baseline Evaluation 31
    4.6.2 Alignment Evaluation 32
5 Results 33
  5.1 Depth Estimation 33
  5.2 Camera Marker Calibration 35
  5.3 Frame Synchronisation 36
  5.4 Manual Pointer Tip Positions 36
  5.5 Alignment 36
    5.5.1 Hyperparameters 36
    5.5.2 Calibration Sensitivity 38
    5.5.3 Comparison to Traditional Method 39
    5.5.4 Depth Accuracy 41
6 Analysis & Discussion 43
  6.1 Calibration 43
  6.2 Depth Estimation 43
  6.3 Hyperparameters 44
  6.4 Depth Accuracy 45
  6.5 Comparison to Traditional Alignment 46
  6.6 Time 46
  6.7 Future work 46
  6.8 Summary 47
7 Conclusion 49
Bibliography 51

Glossary

Affine: A geometric transformation that maintains lines and parallelism.

Cartesian Product: The set of all ordered pairs that can be obtained by combining each element of one set with each element of another set.

CT: Computed tomography; uses X-rays to create detailed images and 3D models of the internal structure of the human body. Detailed in Section 2.3.

Distal femur: The lower end of the femur, part of the knee. Illustrated in Figure 2.2.1.

Fiducials: Easily recognisable shapes, such as dots, crosses, or patterned targets, designed to be clearly identifiable for computers.

Fill rate: The percentage of a scene's depth information successfully captured, affected by occlusions and surface properties. Detailed in Section 2.6.5.

Frobenius norm: The square root of the sum of the squared elements of a matrix.

ICP: Iterative closest point. An algorithm to compute the affine transformation between two similar structures. Described further in Section 2.7.2.

IGOS: Image-guided orthopaedic surgery.

IR: Infrared.

Least Squares Point Alignment: An algorithm to compute the affine transformation between two differently oriented structures. Described further in Section 2.7.1.

Marker: A uniquely shaped object equipped with IR reflective surfaces, tracked by IR cameras. Illustrated in Figure 2.5.1a.

Patellar surface: A region at the lower end of the femur, part of the knee. Illustrated in Figure 2.2.1.

Patient space: A frame of reference relative to the patient during surgery. Detailed in Section 2.5.1.

PnP: Perspective-n-Point. An algorithm that computes the pose of a camera in some world space. Described further in Section 2.6.4.

RANSAC: Random sample consensus. An algorithm for handling data sets with outliers. Described further in Section 2.8.

RGBD: An image that also contains depth information (red, green, blue, depth).

RGBD camera: A camera that, in addition to capturing ordinary colour data, also estimates depth information for each pixel, producing RGBD images instead of typical RGB images.

ROI: Region of interest.

Sobel filter: A commonly used type of edge-detection filter.

Temporal noise: A metric quantifying fluctuations in depth data over time, defined as the standard deviation of repeated depth captures. Detailed in Section 2.6.5.

TKA: Total knee arthroplasty, also known as joint replacement surgery.

Voxel: A 3D unit of space, representing a value in a grid, used in 3D modelling and imaging.

1 Introduction

Modern surgical navigation tools have been developed over the past 30 years and have become increasingly important since their introduction [1]. In addition to enabling significantly safer, more accurate, and less invasive procedures, the technology has allowed new sophisticated techniques to be integrated into the surgical routine [1]–[3]. While surgical navigation tools span a broad range of technologies, this paper focuses on a limited part of image-guided orthopaedic surgery (IGOS) in the context of total knee arthroplasty (TKA), also known as total knee replacement [4].
Specifically, it addresses the task of determining the pose of the target bone during ongoing surgery.

The central aim of orthopaedic surgery is to restore the physiological function of the musculoskeletal system; it includes tasks such as reconstruction of torn ligaments, treatment for arthritis, dislocations, or joint replacement [3], [5]. TKA is but one example of joint replacement surgery, in which the damaged cartilage and bone in the joint are removed and replaced with artificial parts [4], [6]; see Figure 1.0.1.

Figure 1.0.1: Illustrates the results of common orthopaedic joint replacement surgeries. From [6] CC-BY 4.0 / Cropped from original.

While the typical surgical routine can be split into three stages (pre-, intra-, and postoperative), IGOS is primarily concerned with the first two. The preoperative stage involves preparation and planning for the surgery, and in IGOS this typically includes the collection of 3D anatomic skeletal data of the patient using, for example, computed tomography (CT) [3]. The 3D model of the skeletal region of interest (ROI) obtained from CT scans will hereinafter be referred to as the CT model.

An orthopaedic surgical plan typically involves a number of procedures performed on the skeletal ROI, which are mapped out in varying detail using the CT model. In IGOS systems, the intraoperative stage heavily relies on this CT model in order to provide accurate and real-time feedback to the surgeon [3]. The specific type of feedback provided by IGOS systems can vary, but may include visualisation of tool positioning in relation to pre-planned incision points, or instantaneous assessment of implant placement compared to the preoperative plan [7].

The intent of TKA surgery is to alleviate the patient's pain and improve their range of motion, such that they may resume daily activities without being hindered by significant pain [4], [8]. However, with increasing numbers of younger and more active individuals opting for TKA surgery, stricter requirements are being placed on implant longevity and natural movement [7]. Thus, in pursuit of improved surgical outcomes, personalised alignment strategies are becoming more widely adopted, elevating the importance of precise customised implant positioning and the techniques for achieving it [9], [10].

In this paper, we propose a practical approach that combines traditional optical navigation systems with a visible light-based depth camera, presenting a fast and versatile solution for aligning a preoperative CT model of a patient's knee with the patient's physical knee during surgery. Furthermore, we investigate the effects of leveraging polarised light filters to improve the depth estimation quality of the depth camera for reflective musculoskeletal surfaces, combined with a detailed analysis of the impact of several key parameters in the aforementioned alignment procedure.

1.1 Problem Description

A key element of IGOS systems is the image registration process, which involves aligning preoperative 3D models with the patient's corresponding surgical area during the operation [3]; see Figure 1.1.1 for an illustrative example. The precision of the registration is of paramount importance for the preoperative plan to be of proper use, since any misalignment introduces erroneous offsets between the procedures planned out using the CT model and the actual physical positioning of the patient during surgery [11].
Figure 1.1.1: Illustrative example of alignment between a pig knee and its CT-scanned counterpart: (a) none, (b) good, (c) bad.

While modern TKA encompasses multiple schools of thought regarding the optimal approach for performing the procedure, the base requirement of achieving good implant positioning remains a common goal [9], [12]. The required precision of the implant angle is regularly expressed as being in the range of ±3 to 5°, motivating the need for high-precision feedback systems [10], [12].

The first part of leveraging the CT-scanned anatomical 3D model is collecting intraoperative data to match it with. In this work, we employ the Intel® RealSense™ D405 stereoscopic depth camera (also referred to as an RGBD camera) to 3D capture the musculoskeletal surfaces of the surgical incision site. Using a stereoscopic depth camera for 3D object reconstruction is, however, not without its own host of problems. Among the issues are the quality of the depth estimation, temporal noise, fill rate, frame aggregation, and the computation time for generating the final 3D reconstructed model. Please refer to Section 2.6.5 for detailed explanations of these terms.

Following the 3D surface reconstruction procedure, the subsequent step of aligning the captured 3D anatomical surfaces with the preoperative CT model also presents several non-trivial challenges. These most notably originate from the fact that the CT model and the 3D surfaces captured during surgery are always dissimilar, and the differences can occasionally be substantial. The asymmetry stems from the differences in 3D reconstruction methodology: whereas the CT model is derived from the application of computed tomography (CT), the 3D model during surgery is obtained using a stereoscopic depth camera.

The key differences between the CT model and the intraoperative 3D model are the types of tissues captured and the context in which they appear. While CT uses X-rays to capture anatomical data of the inside of a patient (see Section 2.3), it struggles to accurately capture cartilage, which results in a fundamental difference between the preoperative CT model and what the RGBD camera sees during surgery. Additionally, the view of the RGBD camera is strictly limited to the musculoskeletal surfaces exposed through the surgical incision, meaning only part of the bone will be represented in the point cloud produced by the RGBD camera.

Importantly, the RGBD camera is also limited in terms of depth accuracy, and the estimated depth will fluctuate due to noise even with a stationary camera and scene. This variation is called temporal noise and introduces errors between frames that need to be accounted for when computing a 3D model from the produced point clouds [13], [14]. Furthermore, the RGBD camera is highly sensitive to surface textures, for reasons that will be expanded upon in Section 2.6.5, and will have a hard time capturing larger regions with homogeneous surfaces, such as smooth sections of bone and cartilage, making it harder to accurately capture the anatomical regions we are looking for. Together, these discrepancies between the CT model and the 3D model reconstruction from the RGBD camera make for a non-trivial 3D model-to-model alignment procedure.
Further, the alignment between the CT model and the camera-captured surfaces is to be performed in a reference space that is relative to the patient, such that if the patient is moved during surgery, the overlaid CT model is automatically moved by the corresponding amount. This is generally achieved using a surgical navigation system. In this paper, we employ a medical navigation system that uses infrared (IR) cameras and IR reflective objects, generally referred to as fiducials, attached in a unique shape to form a marker that can be tracked in the 3D space in view of the IR cameras. This system and its markers allow tools and objects in the scene to have their position and orientation tracked in real-time, and by attaching such a marker to the patient we can ensure objects are tracked in relation to the patient. We call this reference frame the patient space.

To leverage this navigation system for tracking the surgical incision site relative to the patient, we have attached one of the aforementioned markers to the RGBD camera. Through a series of calibration steps, the captured 3D point clouds from the RGBD camera are transformed to be expressed in patient space, allowing the CT model to be aligned relative to the patient and be robust against limb movements during surgery. This study focuses on evaluating the applicability of using the Intel® RealSense™ D405 depth camera as a point cloud source for the CT model alignment procedure, in particular in combination with a medical navigation system for external tracking of the patient and the handheld RGBD camera.

1.2 Purpose of the Study

This work aims to evaluate the efficacy of employing an affordable stereoscopic RGBD camera for 3D reconstruction of musculoskeletal surfaces exposed intraoperatively in orthopaedic surgery. Specifically, the study uses the Intel® RealSense™ D405 depth camera in the context of total knee arthroplasty (TKA) surgery. Additionally, the effects of leveraging polarised light filters will be evaluated; these are applied to remove surface specular highlights and thereby improve the 3D reconstruction quality of the target surfaces. This was motivated by concerns about the reflective nature of musculoskeletal surfaces.

The primary intent is to propose a faster methodology for intrasurgical CT model alignment compared to a handheld manual pointer-sampling approach. A faster alignment procedure makes the routine more appealing for the surgical personnel, and reduces the inherent risks and inconvenience associated with longer surgery durations. Moreover, due to reduced user interaction during data acquisition compared to handheld pointer sampling, this approach is more appropriate for a sterile environment. That is, the proposed methodology carries benefits for both the surgeon and the patient.

The gap in the literature identified in this study is the lack of 3D reconstruction quality evaluation for depth cameras applied to capture reflective and homogeneous musculoskeletal surfaces during orthopaedic surgery. This study seeks to address this gap by providing comprehensive data on the performance and potential advantages of using an RGBD camera in this context.

1.2.1 Research Questions

Below is a list of the primary research questions for this study, condensed from the discussion in the previous Section 1.2.

• RQ1: Is it viable to replace a few high quality sampled points with a large number of low quality points using the Intel® RealSense™ D405 depth camera ...
  – RQ1.1: ...
in terms of point cloud to CT model alignment accuracy?
  – RQ1.2: ... while keeping the total compute time from the last RGBD frame acquisition to completed alignment under 60 seconds?
• RQ2: How does the application of polarised light filtering affect the Intel® RealSense™ D405 depth camera's point cloud acquisition for musculoskeletal surfaces, and in turn impact the point cloud to CT model alignment accuracy?
• RQ3: What is the depth estimation accuracy of the Intel® RealSense™ D405 depth camera for musculoskeletal surfaces?

1.3 Delimitations

This work will not improve on existing registration methods, but rather utilise off-the-shelf iterative closest point (ICP) algorithms provided by the open-source library Open3D [15], combined in specific configurations to fit the problem at hand. In addition, the proposed methodology will mainly be compared to the traditional method of manual handheld pointer sampling. The points collected for the traditional method were aligned manually by visual inspection, assisted by an experienced professional at Ortoma AB. It should be noted that this introduces certain limitations regarding the generalisability and reproducibility of some of our findings. Furthermore, the study is confined to a single set of pig cadaver parts with corresponding CT models, mainly due to time and cost constraints.

1.4 Ethical Considerations and Risks

This research focuses on designing a solution meant for use during complex human surgery; hence it is of great importance that the solution accounts for reasonable diversity and differences between patients. The solution must be robust to diverse body compositions in terms of muscle mass, fat, and, most importantly, differences in bone and cartilage in the target hip and knee regions. Without proper care, a solution could needlessly fail to handle the diverse conditions expected in real surgery and risk poor prosthetic alignment for the patient. The proposed methodology has therefore been designed under the requirement of being generalisable from the start.

Moreover, the project involves a limited amount of analysis of images and videos captured during real-patient orthopaedic surgeries, which naturally demands great responsibility in the handling of the material, to preserve and respect patient privacy and integrity. Precautions include offline data processing, a strict no-share policy, and thorough removal of the related data from utilised devices upon project completion.

Additionally, the work includes tests performed on pig parts sourced from a local butcher, which would otherwise have been used for food. The ethical considerations are thus greatly limited, as the animal had already been slaughtered for another purpose.

Lastly, any work targeting the domain of human surgery typically involves similar risks to those discussed above, making them an inherent aspect of these types of works. We believe modern surgery should continue to develop and employ digital technologies to further improve surgery outcomes and decrease patient suffering, hence justifying the need for this paper.

2 Background

This chapter delves into the technicalities of what image-guided orthopaedic surgery (IGOS) entails, with a specific focus on its application in total knee arthroplasty (TKA). It aims to provide an appropriate level of understanding of the domain and equip the reader with the necessary background to comprehend the subsequent chapters of the thesis.
2.1 Traditional IGOS Method

The commonly used method for IGOS has remained largely unchanged since it was pioneered by Merloz et al. in 1995 [2]. This technique is used in a similar manner regardless of the specific bones being treated or the complexity of the skeletal intervention, and its general steps are outlined below:

1. Some time before the operation, the patient undergoes a CT scan. This scan is utilised to plan the upcoming procedure, identifying what is damaged, determining the necessary implants, and deciding their placement.
2. The operating room is fitted with a navigation system that tracks the position and orientation of purpose-calibrated surgical tools equipped with markers. Navigation systems and their markers will be further detailed in Section 2.5.
3. A reference navigation system marker is fitted to the bone that is to be tracked, creating a frame of reference relative to the patient. This is useful when the patient is moved during surgery.
4. A navigation system pointer, shown in Figure 2.5.1b, is repeatedly positioned on the bone to create a collection of points that correspond to the bone's surface.
5. A computer derives a transformation matrix that transforms the CT scan into the navigation system's frame of reference.
6. Some kind of mounting tool is used to position implants at the pre-planned locations and orientations. The correct position and orientation are generally indicated in real-time by a screen attached to the navigation system.

2.2 Total Knee Arthroplasty

TKA is a type of joint replacement surgery in which damaged cartilage and bone are removed and replaced with artificial parts, with the intent of relieving pain and restoring joint functionality [4], [6]. In Sweden, approximately 17,000 TKA surgeries were performed in 2022, with arthritis being the predominant reason for the majority of cases [16]. TKA is, in fact, one of the most widely performed orthopaedic surgeries in the world, hence unsurprisingly representing a notable portion of all orthopaedic procedures performed in Sweden in 2022 [5], [12].

The procedure is performed through a series of steps beginning with a detailed preoperative plan, often involving imaging techniques such as CT to create a 3D model of the patient's knee [1], [12]. This model is used for precise planning of bone cuts and implant placement to achieve optimal performance of the implant and restoration of joint functionality [10]. Ensuring the accuracy of implant positioning is critical for reducing the risk of postoperative complications and enhancing the longevity of the implant [9], [11]. To achieve this, intraoperative navigation systems are employed, providing real-time feedback to the surgeon on how closely the implant aligns with its pre-planned location and rotation [1], [3].

2.2.1 Anatomy of the Knee

For the purposes of this study, only a very limited understanding of the anatomy of the knee is necessary: specifically, the distal femur region, which is exposed during TKA surgery and illustrated in Figure 2.2.1. The two highlighted regions in the figure correspond to the patellar surface and the tibial surfaces, respectively.

Figure 2.2.1: Illustrates the key regions of the distal femur for this study [17]: (a) patellar surface of the femur, (b) tibial surfaces of the femur. From [6] CC-BY 4.0 / Edited from original.

2.3 Computed Tomography (CT)

In the context of medical imaging, computed tomography (CT) uses X-rays to create detailed images of the internal structure of the human body. It involves capturing a series of 2D slices (called radiographs) from multiple angles, which a reconstruction algorithm combines into a single 3D grid of data, often referred to as a tomogram [18]. Each voxel (i.e. sub-volume) in the 3D spatial data grid (the tomogram) is assigned a so-called CT number that relates to the level of attenuation of the X-ray passing through that location [18], [19]. Attenuation is the intensity reduction of an X-ray passing through an object due to the absorption or scattering of photons; it depends roughly on the energy of the photons and the properties of the material being passed through, especially its density and atomic numbers [18], [19]. The CT number of each voxel is then compared to threshold values to categorise the corresponding tissue type, enabling the segmentation of, for instance, bone and soft tissue [18].
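To make the thresholding step concrete, the following is a minimal NumPy sketch, not taken from this thesis, of threshold-based bone segmentation on a tomogram array. The synthetic volume, threshold value, and voxel spacing are illustrative assumptions only.

```python
import numpy as np

# Hypothetical tomogram: a 3D grid of CT numbers. Here a synthetic block
# of high-attenuation voxels stands in for bone inside soft tissue.
tomogram = np.zeros((64, 64, 32))
tomogram[20:40, 20:40, 10:20] = 700  # synthetic "bone" region

# Threshold-based segmentation: voxels whose CT number exceeds the
# threshold are classified as bone. The exact value is scanner- and
# protocol-dependent; a few hundred is a common ballpark for cortical bone.
BONE_THRESHOLD = 300
bone_mask = tomogram > BONE_THRESHOLD  # boolean voxel grid

# Voxel indices of segmented bone, scaled to 3D coordinates given an
# assumed isotropic voxel spacing of 0.5 mm.
voxel_spacing_mm = 0.5
bone_points_mm = np.argwhere(bone_mask) * voxel_spacing_mm
```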
2.4 3D Representations

This section provides an overview of 3D data representation and the use of homogeneous coordinates to transform 3D data between different frames of reference.

2.4.1 Point Clouds and Meshes

Point clouds are collections of 3D points, each defined by (x, y, z) coordinates relative to an origin, and often also include some associated colour data. Figure 2.4.1a illustrates an example of a point cloud. Meshes, on the other hand, consist of 3D points that are interconnected to form triangles, which define the surfaces of a 3D object. Figure 2.4.1b illustrates an example of a 3D mesh.

Figure 2.4.1: Examples of the two types of 3D geometries relevant for this project: (a) a 3D point cloud, with points visualised as individual squares; (b) a 3D mesh.

2.4.2 Homogeneous Coordinates and Transformations

Representing transformations in 3D space is straightforward when using matrix multiplication on homogeneous coordinates. Homogeneous 3D coordinates are represented as a 4D vector, with the constant 1 appended to the [x, y, z] positions, making it an element of $\mathbb{R}^{4\times1}$:

$$P = \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

Using homogeneous coordinates simplifies rotation and translation operations, as they can be performed using ordinary matrix multiplication. The $4\times4$ affine transformation matrices $T$ used for this purpose, when multiplied by a point in homogeneous coordinate form, transform the point into $T$'s frame of reference. $T$ is composed of three key parts: a $3\times3$ rotation matrix that handles rotation, a $3\times1$ translation vector that manages translation, and a padding row to enable the use of matrix multiplication. Multiplying a point by a transformation matrix thus rotates and translates the point in a single operation.

$$T = \begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}$$

All coordinates are expressed in a particular frame of reference, denoted as ${}^{A}P$, where $A$ is the frame of reference and $P$ is the point. To convert a point to a different frame of reference, you multiply it by the transformation matrix of the target frame:

$$ {}^{A}P = T_A \cdot P = T_A \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $$

The previous equations apply not only to points but also to transformations. Transforming a transformation can be conceptualised as transforming a point with an attached direction vector. Frames of reference can be converted between each other. The transformation matrix that converts positions from frame $A$ to frame $B$ is denoted as $T_{A\to B}$. For all frames of reference and all points, the following equation holds (this is a reformulation of the change-of-basis equation, which is true for all matrices, not just 3D homogeneous coordinate transformations):

$$ {}^{B}P = T_{A\to B} \cdot {}^{A}P $$

Finally, matrix algebra can be performed as normal and produce meaningful results. Here is an example showcasing that, by knowing the transformation matrices of $A$ and $B$ to some shared space $X$, it is possible to transform points from $A$ to $B$ without ever knowing the points' positions in $X$:

$$ {}^{X}P = T_{A\to X} \cdot {}^{A}P $$
$$ {}^{X}P = T_{B\to X} \cdot {}^{B}P $$
$$ T_{B\to X} \cdot {}^{B}P = T_{A\to X} \cdot {}^{A}P $$
$$ {}^{B}P = T_{B\to X}^{-1} \cdot T_{A\to X} \cdot {}^{A}P $$
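As an illustration of this change-of-frame chain, here is a minimal NumPy sketch (not part of the thesis); the rotations, translations, and the point are arbitrary placeholder values.

```python
import numpy as np

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Placeholder poses of frames A and B expressed in a shared frame X.
T_A_to_X = make_transform(np.eye(3), np.array([10.0, 0.0, 0.0]))
T_B_to_X = make_transform(np.eye(3), np.array([0.0, 5.0, 0.0]))

# A point known in frame A, in homogeneous form.
P_A = np.array([1.0, 2.0, 3.0, 1.0])

# Transform A -> B without ever materialising the point in X.
T_A_to_B = np.linalg.inv(T_B_to_X) @ T_A_to_X
P_B = T_A_to_B @ P_A
```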
2.5 Medical Navigation Systems

Navigation systems used in intraoperative orthopaedic imaging consist of hardware and software that allow tracking of surgical equipment and assist in delicate procedures during surgery. At its most basic, a navigation system uses some kind of imaging technique to see where surgical equipment is located with high precision. The choice of imaging technique varies; it can be infrared, visible light-based, electromagnetic, or X-ray-based [3]. The reason for tracking surgical equipment is to ensure that incisions and procedures are performed to a level of accuracy and precision that surpasses traditional handheld methods in categories such as cutting, drilling, and implant installation [3]. For this study, we focus on a particular class of navigation systems: a system comprising an IR camera that observes the surgical scene by tracking small IR-reflecting surfaces attached to specific shapes. Figure 2.5.1 illustrates the components of this type of navigation system.

Figure 2.5.1: Components of a typical navigation system used for orthopaedic surgery. Images courtesy of Ortoma AB, used with explicit permission. Not to scale.

(a) Marker: A shape equipped with fiducials that can be tracked by the navigation system camera (d). These fiducials come in various forms, including optical patterns, IR reflective surfaces, light sources, or other features detectable in some arbitrary spectra.
(b) Pointer: A lightweight and nimble tool used to sample points on the surface of the bone.
(c) Mounting Tool: Used to prepare for and insert the prosthetic; varies greatly depending on the type of surgery and method. A navigation system can include multiple other tools, such as drills and cutting devices.
(d) Navigation System Camera: A specialised stereoscopic camera that detects the location of markers and provides real-time information on the position of the tools.
(e) Navigation System Screen: Displays real-time information regarding the position of tools and surgical steps; often connected to a standard PC running associated software.

2.5.1 Patient Space

Typically, during surgery, patient movement necessitates attaching a second marker to the patient [9]. This approach ensures consistent navigation by incorporating the frame of reference of the reference marker, which moves with the patient, rather than relying solely on the assumption that the bone remains fixed in the navigation system's frame of reference.

2.5.2 Navigation System Tool Calibration

A crucial feature of navigation systems is their ability to track the geometry of surgical instruments equipped with markers, rather than just the markers themselves. This requires identifying which specific tool is associated with each marker and understanding the tool's geometry. By knowing the marker's position and the tool's geometry, the system can accurately determine the location of specific parts, such as the tip of a scalpel or the end of a probe, relative to the navigation system [20].
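To make this concrete, here is a small hypothetical NumPy sketch (not from the thesis): given a marker pose reported by the navigation system and a tip offset known from tool calibration, the tip's position in the navigation frame follows from one matrix multiplication. All numeric values are placeholders.

```python
import numpy as np

# Pose of the tool's marker reported by the navigation system
# (marker frame -> navigation frame). Placeholder values, millimetres.
T_marker_to_nav = np.eye(4)
T_marker_to_nav[:3, 3] = [120.0, -40.0, 850.0]

# Tip offset expressed in the marker's own frame, known from calibration.
tip_in_marker = np.array([0.0, 0.0, 160.0, 1.0])  # homogeneous, mm

# The tip's real-time position in the navigation system's frame.
tip_in_nav = T_marker_to_nav @ tip_in_marker
```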
2.5.3 Using Navigation Systems To Find Bone

The first step in finding the intraoperative position of the bone is calibrating a pointer, as depicted in Figure 2.5.1b. By putting the pointer's tip directly onto the bone during surgery, it is possible to calculate exactly where that point of the bone is in the navigation system's frame of reference [21]. By sampling multiple points, an accurate understanding of both the position and rotation of the intraoperative bone can be obtained.

2.6 Cameras

This subsection introduces the camera concepts and algorithms required to convert images into a meaningful 3D space.

2.6.1 Intrinsics

Camera intrinsics refer to the internal parameters of a camera that define how it projects a 3D scene onto a 2D image [22]. These parameters are typically combined into a $3\times3$ matrix and include:

Focal Length $f_x, f_y$: The horizontal and vertical distances between the camera's lens and its image sensor, determining how much of the scene is captured and the size of objects in the image.

Principal Point $c_x, c_y$: The coordinates where the camera's optical axis intersects the image plane, representing the centre of the image.

Skew $s$: Accounts for non-orthogonality between the image sensor's pixel axes; for typical cameras with rectangular pixels, the skew is 0.

Together, these parameters form the intrinsics matrix

$$ I = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} $$

2.6.2 Extrinsics

Camera extrinsics define a camera's pose within a coordinate space, often referred to as world space. These parameters describe the camera's location and its orientation relative to the scene it captures [23]. Typically, extrinsics are represented as a $4\times4$ affine transformation matrix $T_{W\to C}$, which transforms coordinates from world space to camera space. To transform a point $P = [x, y, z]$ from world space to camera space, you first convert the point into homogeneous coordinates by appending a 1 to the vector. Then, you multiply this homogeneous vector by the extrinsics matrix (2A). Alternatively, to transform points from camera space back to world space, you multiply by the inverse of the extrinsics matrix (2B), (2C) [23].

$$ {}^{C}P = T_{W\to C} \cdot {}^{W}P \qquad (2A) $$
$$ {}^{W}P = T_{W\to C}^{-1} \cdot {}^{C}P \qquad (2B) $$
$$ {}^{W}P = T_{C\to W} \cdot {}^{C}P \qquad (2C) $$

where ${}^{C}P \in \mathbb{R}^{4\times1}$ is the position of some point in camera space, and ${}^{W}P \in \mathbb{R}^{4\times1}$ is the position of some point in world space.

It is important to note that, unlike camera intrinsics, camera extrinsics are not fixed properties of the camera. Instead, camera extrinsics represent the relative pose of the camera at the moment the image is captured within a chosen world space. The world space serves as a frame of reference, which can be defined in relation to a point on a table, another camera, a room, or even the entire world, as long as the coordinate system is Euclidean.

2.6.3 Camera Space

Camera space refers to the three-dimensional frame of reference of the camera, where x, y, and z represent the horizontal, vertical, and depth offsets from the camera's origin. Consequently, moving the camera also shifts the camera space relative to the surrounding environment. If depth can be estimated using, for instance, a depth camera, a flat 2D image can be transformed into a 3D point cloud. This conversion is achieved by combining the depth image with the camera's intrinsic parameters, enabling the accurate mapping of the point cloud within the camera's frame of reference. Assuming there is no distortion or skew in the camera, Equation (2D) can be applied for this conversion [15]:

$$ P_{x,y} = \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix} = \begin{bmatrix} (x - c_x)\, D_{x,y}/f_x \\ (y - c_y)\, D_{x,y}/f_y \\ D_{x,y} \\ 1 \end{bmatrix} \qquad (2D) $$

where $f_x, f_y \in \mathbb{R}$ is the focal length (part of the camera intrinsics), $c_x, c_y \in \mathbb{R}$ the principal point (part of the camera intrinsics), $D \in \mathbb{R}^{\text{width}\times\text{height}}$ the depth image, and $P_{x,y} \in \mathbb{R}^{4\times1}$ the position in camera space.
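Equation (2D) maps directly onto a vectorised implementation. Below is a minimal NumPy sketch (not the thesis code); the intrinsics and the synthetic depth image are placeholder assumptions.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into camera space following Eq. (2D).

    depth: (H, W) array of depth values (e.g. metres), 0 where invalid.
    Returns an (N, 3) array of 3D points in the camera's frame.
    """
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                 # drop pixels without a depth estimate
    z = depth[valid]
    x = (xs[valid] - cx) * z / fx
    y = (ys[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Placeholder intrinsics and a flat synthetic depth image for illustration.
points = depth_to_point_cloud(np.full((480, 640), 0.5),
                              fx=430.0, fy=430.0, cx=320.0, cy=240.0)
```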
2.6.4 Perspective-n-Point (PnP)

Perspective-n-Point (PnP) is an algorithm that computes the extrinsics matrix $T_{W\to C}$ given 2D points on a flat camera image and corresponding 3D points in world space [24]. PnP determines the extrinsics matrix that minimises the error between the positions of features on the 2D camera image and their corresponding 3D ground truth as projected onto the camera plane [24]. As with any projection method, knowledge of the camera intrinsics is required in this process. Equation (2E) shows the arguments of the PnP algorithm:

$$ T_{W\to C} = \mathrm{solvePnP}\big(I, \{({}^{2D}p_1, {}^{W}P_1), ({}^{2D}p_2, {}^{W}P_2), \ldots\}\big) \qquad (2E) $$

where ${}^{2D}p_i \in \mathbb{R}^{2\times1}$ is the 2D coordinate of a point on the camera image, ${}^{W}P_i \in \mathbb{R}^{3\times1}$ the 3D coordinate of the corresponding point in world space, $T_{W\to C} \in \mathbb{R}^{4\times4}$ the camera extrinsics matrix, and $I \in \mathbb{R}^{3\times3}$ the camera intrinsics matrix.
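OpenCV ships a widely used implementation, cv2.solvePnP. The following minimal sketch (not the thesis code) recovers the extrinsics matrix from four hypothetical coplanar fiducial correspondences; all coordinates and intrinsics are placeholder values.

```python
import numpy as np
import cv2

# Known 3D fiducial positions in world space (hypothetical planar target, mm).
object_points = np.array([[0.0, 0.0, 0.0],
                          [50.0, 0.0, 0.0],
                          [50.0, 50.0, 0.0],
                          [0.0, 50.0, 0.0]], dtype=np.float64)

# Their detected 2D pixel coordinates in the camera image (placeholders).
image_points = np.array([[310.0, 245.0],
                         [415.0, 243.0],
                         [417.0, 348.0],
                         [312.0, 350.0]], dtype=np.float64)

# Camera intrinsics matrix I (assumed known from calibration); no distortion.
K = np.array([[430.0, 0.0, 320.0],
              [0.0, 430.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, distCoeffs=None)

# Assemble the 4x4 extrinsics matrix T_{W->C} from rotation and translation.
R, _ = cv2.Rodrigues(rvec)
T_world_to_cam = np.eye(4)
T_world_to_cam[:3, :3] = R
T_world_to_cam[:3, 3] = tvec.ravel()
```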
2.6.5 Stereoscopic Depth Imaging

Stereoscopic depth imaging, often referred to as stereo vision, is a technique used to infer depth information from two slightly different perspectives, mimicking human vision [25]. This method involves capturing two images from two slightly offset cameras (left and right) and comparing them to determine depth. Achieving high-quality stereoscopic imaging is challenging because even minor and subtle differences are critical: small, sometimes imperceptible, changes in the environment can lead to significant variations in the calculated depth. No stereoscopic depth camera or algorithm is perfect, and common metrics used to evaluate their performance include:

Temporal Noise: A metric that quantifies fluctuation in depth data over time. It is defined as the standard deviation of the depth data captured across a series of captures [13]. High temporal noise indicates that the depth measurements are inconsistent, which can affect the reliability of the depth information.

Fill rate: This metric measures the proportion of the scene that the camera successfully captures depth information for [13]. A low fill rate can occur if parts of the scene are not visible to both cameras or if the reconstruction algorithm fails to generate accurate depth data for certain regions. This can be due to occlusion, low-texture areas, or reflective surfaces.

By analysing these metrics, one can assess the performance and reliability of a stereoscopic depth camera under different conditions. The effectiveness of stereoscopic imaging is greatly dependent on the ability to find correspondences between images [25]. Richly textured surfaces facilitate this process by providing distinct features that can be matched between the left and right images, enabling accurate depth calculations. In contrast, plain, monochromatic surfaces present challenges because they lack distinct features for establishing correspondences. In such cases, depth reconstruction must rely on interpolating data from surrounding better-textured areas, which can result in significant inaccuracies. To address this weakness, a common solution is to introduce more texture in poorly textured areas [25], for instance by placing distinctly textured markers, coating the surfaces of the object with a more textured material, or projecting light patterns onto the surfaces.

Naturally, the sterile nature of surgeries imposes constraints on what can be utilised in orthopaedic surgery, making, for example, coating-based methods impracticable.

Apart from stereoscopic depth imaging, there are numerous other methods for 3D reconstruction. For instance, structured-light-based techniques involve projecting a known light pattern onto the surface of an object, where the pattern deforms according to the object's geometry. These deformations are captured by a scanner's sensors and used to reconstruct a detailed 3D model of the object [25]. Another common method for depth estimation is time-of-flight, including techniques such as SONAR and RADAR, which, in short, use a transmitter and receiver pair that determines the distance to an object by measuring the time from sending a signal to receiving it again after it reflects off the object [25].

2.7 Registration Algorithms

A registration algorithm is used to align points from two different frames of reference by determining an affine transformation between them [11]. This process ensures that features in one frame align with the corresponding regions in the other, enabling, for instance, the merging of different data sources or the seamless combination of multiple 3D scans of a common scene into a single cohesive scene. In image-guided orthopaedic surgery (IGOS), registration algorithms are essential for accurately mapping preoperative CT scans into the navigation system space, enabling precise localisation and guidance during surgery.

2.7.1 Least Squares Point Alignment

Least squares point alignment is a simple form of alignment algorithm that relies on having pairs of corresponding points, where each item in a pair describes the same point from two different frames of reference. Using this collection of pairs, it finds a transformation matrix $T_{A\to B}$ that transforms the points ${}^{A}P$ to ${}^{B}P$ with as little error as possible. Specifically, the objective is to find the $T_{A\to B}$ that minimises the sum of the squared distances between the coordinates in each pair [26]:

$$ T_{A\to B} = \mathrm{LeastSquares}\big(\{({}^{A}P_1, {}^{B}P_1), ({}^{A}P_2, {}^{B}P_2), \ldots\}\big) \qquad (2F) $$

2.7.2 Iterative Closest Point (ICP)

ICP is a very popular and widely used point cloud registration method, and the original paper by Besl and McKay [27] has been cited over 12,000 times. ICP is an iterative, robust, and stable algorithm [28] that, unlike least squares point alignment, does not require points to have predefined correspondences at the start, instead finding these on its own.

The general ICP algorithm starts with two point clouds and an initial guess for the relative transformation matrix positioning the source point cloud onto the target point cloud [29]. Note that the initial guess can be the identity matrix, but a better guess is most often preferred, since the algorithm converges to the closest local minimum [27]. Next, a correspondence set of point pairs between the two point clouds is created by finding the closest point in the target point cloud for each point in the source point cloud. An improved guess of the relative transformation matrix is then computed using the set of correspondences and least squares, minimising the summed distance of all correspondence pairs [27]. The algorithm reiterates these two steps of populating a correspondence set and minimising the distances between corresponding points until the summed distance falls below a target threshold, or a maximum number of iterations is reached.
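Since this work relies on the off-the-shelf ICP implementations in Open3D (see Section 1.3), a minimal sketch of a point-to-point ICP call is shown below. This is not the thesis's exact configuration: the clouds, correspondence threshold, and initial guess are placeholders, and a recent Open3D version with the pipelines API is assumed.

```python
import numpy as np
import open3d as o3d

# Placeholder clouds: source = intraoperative capture, target = CT model surface.
source = o3d.geometry.PointCloud()
source.points = o3d.utility.Vector3dVector(np.random.rand(1000, 3))
target = o3d.geometry.PointCloud()
target.points = o3d.utility.Vector3dVector(np.random.rand(1000, 3))

init_guess = np.eye(4)    # initial transformation (identity if nothing better)
max_corr_dist = 0.005     # max correspondence distance, here 5 mm in metres

result = o3d.pipelines.registration.registration_icp(
    source, target, max_corr_dist, init_guess,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=100))

T_source_to_target = result.transformation  # refined 4x4 alignment matrix
print(result.fitness, result.inlier_rmse)   # registration quality indicators
```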
2.7.2.1 Local Registration

General ICP is not without problems: it only works if the initial alignment is reasonably correct; otherwise, ICP ends up finding a local minimum that is likely to be considerably different from the global minimum [28]. However, if the initial alignment is rather good, then a better registration can commonly be achieved by fine-tuning it using general ICP. This type of fine-tuning using ICP is sometimes called local ICP.

2.7.2.2 Global Registration

An important class of ICP variants is those meant for global registration. These are used to provide an initial rough alignment guess for the local registration to start out with. A popular version is the one developed by Rusu et al. [30], which instead of 3D coordinates uses a higher-dimensional coordinate space. These coordinates are called feature vectors and describe the local geometric characteristics of a 3D point. Open3D [15] implements a feature-matching registration adaptation of ICP. It is largely the same as normal feature-vector-based registration, but with some extra outlier checks using RANSAC.

2.8 RANSAC

RANdom SAmple Consensus (RANSAC) is an algorithm that estimates characteristics of datasets containing many outliers [31]. Although it was originally devised by Fischler and Bolles [32] for the purposes of image detection, RANSAC has proven very useful for a large variety of tasks. The algorithm works by iteratively selecting random subsets of the data points and fitting a model to these subsets, evaluating how many data points fit the model within a predefined tolerance to identify inliers. By repeating this process and selecting the model with the most inliers, RANSAC effectively isolates the best-fit model even in the presence of significant noise and outliers. The model can be described by any number of objective functions, and RANSAC is widely applicable in fields such as computer vision, 3D reconstruction, and various other forms of data fitting.
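To illustrate the sample-fit-count structure described above, here is a minimal NumPy sketch of the RANSAC loop, fitting a plane to a noisy 3D point set. This generic example is not from the thesis; the model choice, iteration count, and tolerance are arbitrary.

```python
import numpy as np

def ransac_plane(points, iterations=500, tolerance=0.01, rng=None):
    """Fit a plane n.x + d = 0 to a noisy (N, 3) point set with RANSAC."""
    rng = np.random.default_rng() if rng is None else rng
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        # 1. Fit a candidate model to a minimal random subset (3 points).
        a, b, c = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(b - a, c - a)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # degenerate (near-collinear) sample, skip
            continue
        normal /= norm
        d = -normal @ a
        # 2. Count inliers: points within `tolerance` of the candidate plane.
        inliers = np.abs(points @ normal + d) < tolerance
        # 3. Keep the model with the largest consensus set so far.
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (normal, d), inliers
    return best_model, best_inliers
```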
2.9 Polarised Light and Reflective Surfaces

When light reflects off many types of surfaces, it becomes partially polarised in the direction parallel to the surface, effectively turning the surface into a polariser [33]. Some non-reflective surfaces do the opposite: when polarised light interacts with them, it is partly absorbed and then re-emitted with a random polarisation [33]. Reflections are important because reflective materials cause specular highlights in images, which is generally undesirable, since the regions corresponding to the specular highlights become overexposed, making their details hard to discern. By placing a linear polarisation filter in front of the camera lens, or lenses in the case of stereoscopic cameras, it is possible to remove some of these highlights by ensuring the direction of the polarisation filter is orthogonal to the polarisation direction of the highlights [33]. Notably, because the polarisation of the reflected light depends on the angle of the surface, complex scenes will contain surface reflections with varying polarisation, making complete exclusion of specular highlights infeasible for practical applications.

Specular highlights are of particular interest in this study because feature-based 3D scanning techniques, such as those utilised in stereoscopic depth cameras, have been shown to be susceptible to holes and artefacts in the generated 3D point clouds, due to the reflective and homogeneous nature of skeletal surfaces [34].

3 Related work

This chapter provides an overview of relevant previous work conducted in the domain of image-guided orthopaedic surgery (IGOS), summarising a number of earlier papers that are particularly significant for framing the context of this study.

He et al. [21] addressed the time-consuming process of intraoperative patient bone to CT registration in orthopaedic surgery by developing a novel approach utilising a compact 3D structured light surface scanner, a technique briefly described in Section 2.6.5. They employed the EinScan™ SP desktop scanner from Shining 3D Tech Co., Ltd., a device priced around $2,000 at the time of writing. The scanner produced accurate high-density point clouds of the target bone geometry in about 30 seconds without using ionising radiation. However, this duration only includes the capturing of the bone's 3D surface, not the registration process itself. The study evaluated this method on 24 CT-scanned femur sawbones, which are synthetic anatomical models commonly used in medical training [35]. Accuracy was measured by comparing known points at pre-drilled holes on the sawbones to their corresponding positions on the registered CT model, achieving an accuracy of 0.44 ± 0.22 mm, surpassing the traditional pointer sampling technique's 0.68 ± 0.14 mm. This new method proved faster and more precise than conventional techniques. However, the authors caution that this technology should be tested in scenarios more realistic than sawbone models before being used clinically.

Other recent studies have, in addition to structured light scanners, employed other 3D reconstruction devices. Wein et al. [36] used a handheld ultrasound scanner combined with a real-time bone detection solution based on a novel bone-specific feature descriptor, using filtering and confidence maps computed with a random walk framework [37]. The detected bone surfaces were then globally aligned using a controlled random search (CRS) algorithm with local mutation, followed by a fine adjustment using soft tissue-aware intensity-based registration. Evaluation was performed on the femur, tibia, and fibula of a human cadaver leg with implanted electromagnetic bone markers and calibrated tracking sensors for use as ground truth information. The median system error was reported at 3.7 mm across 11 datasets. The authors attributed most of this error to the two electromagnetic sensors used for ground truth, while noting that the visual alignment appeared excellent.

Similarly, Ciganovic et al. [38] used a handheld ultrasound scanner for 3D capturing of bone surfaces in orthopaedic procedures targeting the forearm, employing phase symmetry to locate bone surfaces and a factor graph-based outlier removal strategy. The study did not involve intraoperative scanning and instead used the ultrasound scanner to capture the bones inside the forearm without involving surgery. Registration of the non-exposed patient bone was carried out against a 3D model sourced from magnetic resonance imaging (MRI). In total, three registration methods were tested: coherent point drift (CPD), point-to-plane ICP, and a Gaussian mixture model-based algorithm. Of these, only point-to-plane ICP achieved good registration accuracy within a reasonable amount of time. Although not directly comparable to knee anatomy, the accuracy in forearm registration was 0.57 ± 0.08 mm in about 15 seconds of compute time.

Hu et al. [39] recently published a proof of concept study that combined the use of an Intel®
RealSense™ D415 stereoscopic depth camera, priced around $270 at the time of writing, with an IR navigation system for TKA surgery, using the depth camera to capture the 3D surface of the distal femur exposed during surgery. This was combined with machine learning (ML) based bone segmentation and an improved ICP algorithm, called bounded iterative closest point (BICP), for the intraoperative patient bone to CT registration. Evaluated on a synthetic model of a human knee, the system achieved an average drilling error of 3.64 ± 1.49 mm in position and 2.13 ± 0.81° in orientation. The authors conclude that future work should incorporate a higher accuracy depth camera and retraining of the bone segmentation models. Furthermore, it is important to note that the automatic bone segmentation ML model was tested solely on a synthetic knee. Consequently, real surgical conditions are anticipated to introduce new challenges, both in collecting training data and in achieving the robustness required to handle the more varied conditions expected in actual TKA surgeries.

In a comparison study by Burger et al. [13], the low-cost Intel® RealSense™ D415 stereoscopic depth camera, also used in the aforementioned study [39], was evaluated against two other similarly priced stereoscopic depth cameras: the Intel® RealSense™ D405 and the Stereolabs® ZED-Mini. These tests were designed to evaluate their performance in surgical simulation environments. The Intel® RealSense™ D405 depth camera outperformed the other two devices in four out of five metrics, including depth accuracy and fill rate. However, it did not excel in temporal noise, where the RealSense™ D415 performed better. The study also noted poor performance for thin structures and reflective surfaces, such as surgical tools. The authors suggest further tests using light field cameras, which may be more robust to reflections and occlusions than the stereoscopic depth cameras used in the study.

In addition to utilising different imaging techniques and devices, significant efforts have been made to improve the registration procedure by enhancing the ICP algorithm. An example is the aforementioned study by Hu et al. [39], which used bounded iterative closest point (BICP) instead of the general ICP algorithm outlined in Section 2.7.2. BICP enhances traditional ICP in femur registration tasks by using the rotational centre of the femur-pelvis joint to constrain its convergence [11]. The method requires four inputs: (1) 3D point samples on the distal femur during surgery, (2) reasonably corresponding 3D points in the CT model, (3) the 3D point of the hip centre of rotation during surgery, and (4) the corresponding estimated hip centre in the CT model [39]. The primary addition compared to traditional ICP is the hip centre of rotation, which can be estimated by attaching a tracker to the femur, rotating the leg, and fitting a sphere to the collected 3D points. The sphere's centre then provides a reasonable estimate of the rotational centre of the patient's leg during surgery [39]. BICP first aligns the corresponding hip centres of rotation and then incorporates additional steps in the iterative alignment process. A key step involves rotation around the hip centre to align the axis between the rotational centre and points on the distal femur, before essentially performing the typical alignment of the corresponding distal femur points. Hu et al.
BICP first aligns the corresponding hip centres of rotation and then incorporates additional steps in the iterative alignment process. A key step involves rotation around the hip centre to align the axis between the rotational centre and points on the distal femur, before essentially performing the typical alignment of the corresponding distal femur points. Hu et al. [39] reported significant improvements in rotational accuracy with BICP compared to the traditional ICP algorithm, achieving a 4-8x reduction in rotational error and slightly shorter computation times. However, while improved robustness in both positional and rotational accuracy was also observed, only minor improvements were found for positional accuracy. Rodriguez y Baena et al. [11] noted that BICP is also suited for other long bone orthopaedic surgeries, as long as there is an estimable point on the axis to be aligned that is sufficiently distant from the other sampled points.

In their paper, Zhang et al. [40] introduce a hybrid feature-based registration method named CR-RAMSICP. This approach incorporates a two-step process: a coarse global registration using the singular value decomposition (SVD) algorithm, followed by a fine local registration with an improved ICP algorithm. The objective was to develop a straightforward yet effective intraoperative patient bone to CT registration technique for robot-assisted orthopaedic surgeries. The method specifically tackles the issue of convergence to local minima, which arises from the asymmetry between dense point clouds from CT scans and sparse point clouds collected from patients using traditional pointer sampling techniques; the challenge lies in aligning a densely populated point cloud with one that has significantly fewer points. The experimental setup was nearly identical to the traditional IGOS method described in Section 2.1, with the addition of an orthopaedic robot system. The tool of the robot system was calibrated so that its position could be expressed in the frame of reference of the surgical navigation system. The ICP algorithm was improved primarily by replacing the fixed distance threshold for valid correspondence pairs with an adaptive threshold that accounts for the local distribution around each source point. This was achieved by calculating a separate distance threshold for each source point based on the distance to its nearest neighbour in the target point cloud, making the algorithm more robust against outliers and poor coarse registrations (one plausible reading of this rule is sketched below). Alignment experiments conducted on a pig cadaver pelvis demonstrated an alignment accuracy of 0.56 ± 0.08 mm, compared to 0.89 ± 0.14 mm for standard ICP, with a computation time of approximately 10 seconds. The authors conclude that this method is suitable for other types of orthopaedic surgery, with the next step being evaluation in human phantom experiments, i.e., using models mimicking the human anatomy.
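The paper's exact thresholding rule is not reproduced here, but one plausible interpretation of an adaptive per-point threshold can be sketched as follows; the neighbourhood size k and the slack factor are illustrative choices of ours, not values from [40]:

```python
import numpy as np
from scipy.spatial import cKDTree

def adaptive_correspondences(source, target, k=8, scale=3.0):
    """One-NN ICP correspondences with an adaptive per-point threshold.

    Each candidate pair (source point, nearest target point) is kept only
    if its distance is small relative to the local point spacing of the
    target cloud around the matched point, instead of applying one fixed
    global threshold.
    """
    tree = cKDTree(target)
    dist, idx = tree.query(source, k=1)
    # Local spacing: median distance of each matched target point to its
    # own k nearest neighbours (k + 1 because the query includes itself).
    local, _ = tree.query(target[idx], k=k + 1)
    thresholds = scale * np.median(local[:, 1:], axis=1)
    keep = dist <= thresholds
    return np.flatnonzero(keep), idx[keep]
```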
3.1 Summary

The previous work in image-guided orthopaedic surgery (IGOS) has explored various imaging techniques, devices, and enhancements to registration algorithms to improve the applicability and accuracy of these types of systems in real surgery. Studies have demonstrated the utility of structured light 3D scanners [21] and handheld ultrasound devices [36], [38] for accurate bone detection and alignment, though each comes with trade-offs such as capture speed and complexity of integration into surgical workflows. Research comparing registration methods [11], [38]–[40] has highlighted the potential of improved algorithms, such as bounded iterative closest point (BICP) [39], to enhance accuracy and robustness over traditional ICP, especially in rotational alignment.

The cost of imaging devices naturally influences depth estimation accuracy, with higher-end devices typically providing superior performance [21]. However, the strategic use of advanced registration algorithms can offset these disparities, making high-precision outcomes achievable even with more affordable equipment [39]. While standard ICP is widely used, customised improvements can provide substantial benefits for specific surgical applications like total knee arthroplasty (TKA) [39]. Evaluations of various depth cameras [13], [21], [39] have shown differing performance levels, emphasising the importance of selecting appropriate imaging technology for optimal results. These studies collectively highlight the necessity for advanced registration techniques and the integration of high-performance imaging devices with surgical navigation systems to enhance the precision and efficiency of orthopaedic procedures.

4 Methodology

This study intends to evaluate the performance of a specific depth camera, the Intel® RealSense™ D405, in CT alignment as part of an IGOS system, compared to the traditional manual point-taking method. The traditional point-taking technique, outlined in Section 2.1, will be modified by replacing steps 4 and 5 with a process using the aforementioned RGBD camera. A flowchart of our proposed method is depicted in Figure 4.0.1.

Figure 4.0.1: A flowchart of how our method obtains an alignment of CT data in navigation space.

The method can be divided into two parts: preoperative and intraoperative. The preoperative process consists of the following CT data preparation and system calibration steps:

1. Filter the CT data to isolate regions important for alignment. Described in Section 4.3.1.
2. Calibrate the camera-space position of the navigation system marker that is attached to the camera, CC. Described in Sections 4.2.2.1 and 4.2.2.2.
3. Synchronise the system clocks between the navigation system and the RGBD camera. Described in Section 4.2.3.

The intraoperative parts of our proposed method are more complex. The goal is to utilise the CT scan and the calibration data, and to combine the simultaneous real-time streams from both the navigation system and the RGBD camera to produce an accurate CT alignment. The high-level steps are as follows:

1. Synchronise the streams of the navigation system and the depth camera system.
2. Transform camera images into point clouds in navigation space. Described in Section 4.2.
3. Filter out noise and view-dependent effects. Described in Section 4.3.2.
4. Align the CT model in navigation space. Described in Section 4.3.3.

4.1 Experiments

To develop and evaluate our method, we required recordings of scenarios that closely mimic intraoperative conditions for knee surgery. These recordings were generated through two distinct approaches. Firstly, we conducted continuous testing in an office environment equipped with a medical navigation system and 3D-printed bone models. This setup facilitated ongoing experimentation and iterative refinement of our method in a controlled setting. Secondly, a thorough testing session was carried out at Sahlgrenska University Hospital, where a surgeon performed a mock surgery on the knee region of a pig. The primary goal of this mock surgery was to expose the knee bones to a similar extent as in a real surgical procedure, without involving any steps related to attaching or preparing for a prosthetic. Once the bones were appropriately positioned, we conducted multiple passes with the camera to record depth camera frames and navigation system communication.
Additionally, we performed several repetitions of the traditional method to serve as a benchmark for comparison.

4.1.1 Equipment

The following equipment was used during the thorough testing at Sahlgrenska University Hospital, in addition to the preliminary testing in an office setting.

Intel® RealSense™ D405: A compact RGBD camera that uses two visible-light cameras to estimate depth through stereoscopic imaging. We selected this depth camera because it is affordable, referenced in previous work, and supports a close working range of 7 cm to 100 cm, making it an excellent fit for our needs [41]. As of the time of writing, it is priced at approximately $270 [42].

Medical navigation system: An optical IR-based medical navigation system comprising a pair of stereoscopic IR cameras. It operates at 60 Hz, providing continuous updates on the location of markers with IR-reflective fiducials.

Navigation system marker: A marker tracked by the navigation system, shown alongside the Intel® RealSense™ D405 depth camera in Figure 4.1.1a.

Navigation pointer tool: A navigation system tool with a pointed tip, used for calibration and for the traditional method comparison, similar to the one shown in Figure 2.5.1b.

A medium-power laptop: A consumer-grade laptop equipped with an Intel® Core™ i7-13700H CPU and 32 GB RAM.

Camera polarisation filter: A camera-attachable 3D-printed ring, fitted with a polarisation filter and depicted in Figure 4.1.1b.

Light source polarisation filter: A box with a hole fitted with polarisation filters, depicted in Figure 4.1.1c.

Figure 4.1.1: Selection of equipment used in the project. (a) The Intel® RealSense™ D405 fitted with a custom 3D-printed mount along with a navigation system marker. (b) Custom 3D-printed polarisation filter, attachable to the depth camera. (c) A box with a hole fitted with polarisation filters.

4.2 Camera Tracking

This section describes the process of converting a series of depth camera images into point clouds within the navigation system's frame of reference. The objective of this project is to replace manual point-taking with automated methods. Therefore, it is essential that the data captured by the camera can be transformed into the navigation system's frame of reference. This is crucial for aligning the CT-scanned bone with the patient, allowing the mounting tool (as described in Figure 2.5.1c) to be accurately positioned according to the preoperative plan.

For our study, we define the world space using the navigation system, which we will refer to as "Navigation Space" or "Navspace". Since the navigation system camera and the bone remain stationary during surgery, this provides a stable frame of reference. For the sake of conciseness and clarity, we will not discuss conversions to patient space. Our experiments involve stationary, non-living materials, so converting to patient space has no practical implications. However, adapting our method to use patient space would only require an additional matrix multiplication.

4.2.1 Camera Space

The Intel® RealSense™ D405 camera produces two 2D images: a standard 8-bit RGB image matrix and a depth image matrix. The depth image matrix is half the resolution of the colour image and contains 16-bit integer scalars, where each step represents a tenth of a millimetre. Since the camera comes pre-calibrated with intrinsics and exhibits negligible distortion, we can use Equation (2D) to convert these images into a 3D point cloud in camera space.
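As a minimal sketch of this conversion, the following applies the standard pinhole back-projection that Equation (2D) refers to. The intrinsics fx, fy, cx, cy are assumed to belong to the depth stream (the camera SDK exposes these); the 0.1 mm depth unit matches the description above:

```python
import numpy as np

def depth_to_camera_points(depth: np.ndarray, fx: float, fy: float,
                           cx: float, cy: float) -> np.ndarray:
    """Back-project a D405 depth image into a camera-space point cloud.

    depth: (H, W) uint16 image where one unit is 0.1 mm.
    Returns an (N, 3) array in millimetres; invalid (zero) pixels dropped.
    """
    z = depth.astype(np.float64) * 0.1          # 16-bit steps -> mm
    v, u = np.indices(depth.shape)              # pixel row/column grid
    x = (u - cx) * z / fx                       # pinhole model
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                   # zero depth = no data
```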
4.2.2 Camera Space to Navigation Space

To compute the transformation between two different frames of reference, we can use one of the following methods:

A) Using a shared world space: establish the relationship of each coordinate space relative to a common world space.
B) Identifying common points: determine the positions of multiple (linearly independent) points in both frames of reference.
C) Using an affine transformation: identify the affine transformation of a shared object in both frames of reference.

For our purposes, the third option is the most practical. Using a shared world space (option A) would require either an additional navigation system or the placement of fiducials within the operating scene that are visible to both the camera and the navigation system. This approach is complicated by the different wavelengths used by the camera and the navigation system, and by the challenge of placing fiducials in the limited field of view of the Intel® RealSense™ D405 camera. The need to position the camera close to the incision for optimal resolution and parallax, combined with the restricted visibility beyond the patient's body, further complicates this method. Identifying common points (option B) faces similar visibility challenges. However, both this method and using an affine transformation (option C) allow fiducials to be attached directly to the camera, making their transformations constant in camera space. Given that the navigation system directly provides the transformation of markers, the third option is the more efficient, as it avoids the extra work of identifying common points.

To implement the affine transformation method, we attach a navigation system marker to the camera. The transformation of this marker in camera space (CC) remains constant regardless of the camera's position in navigation space. Retrieving the transformation in navigation space (NC) is straightforward, as the navigation system provides this information directly. By knowing the transformation matrix of the marker in both camera space and navigation space, we can compute the transformation between these spaces using the equations in (4A).

\[
\begin{aligned}
NC^{-1} \cdot NP &= CC^{-1} \cdot CP \\
NP &= NC \cdot CC^{-1} \cdot CP \\
NP &= T_{C \to N} \cdot CP
\end{aligned}
\tag{4A}
\]

where
NC ∈ ℝ^{4×4}: transformation of the marker attached to the camera, in navigation space.
CC ∈ ℝ^{4×4}: transformation of the marker attached to the camera, in camera space.
T_{C→N} ∈ ℝ^{4×4}: transformation converting camera space to navigation space.
CP ∈ ℝ^{4×1}: point in camera space.
NP ∈ ℝ^{4×1}: point in navigation space.

Thus, with NC and CC, we can easily convert the camera-space point cloud to navigation space (a sketch follows below). We obtain NC directly from the navigation system, while CC requires some calculation. The crucial point is that CC is constant because the marker is fixed to the camera. Although we cannot measure CC directly, we have developed two methods to compute it, described in the following subsections.
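A direct rendering of Equation (4A) in code, assuming NC and CC are available as 4×4 NumPy matrices:

```python
import numpy as np

def camera_to_navigation(points_cam: np.ndarray, NC: np.ndarray,
                         CC: np.ndarray) -> np.ndarray:
    """Map camera-space points into navigation space via Equation (4A).

    points_cam: (N, 3) camera-space points; NC, CC: 4x4 transforms of the
    camera-mounted marker in navigation space and camera space.
    """
    T_c_to_n = NC @ np.linalg.inv(CC)           # T_{C->N} from (4A)
    homogeneous = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (T_c_to_n @ homogeneous.T).T[:, :3]
```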
4.2.2.1 Shared Geometry Calibration

This method is largely an implementation of method B as described in Section 4.2.2: identifying the location of points in both frames of reference. In essence, it uses a machined calibration block with distinct features that are easily visible in the camera's RGBD image as well as measurable in navigation space. By using the block's manufacturing specifications as ground truth, we can determine transformation matrices where the block's frame of reference is the world space. This was accomplished using a metal block with machine-drilled divots that are clearly visible to the camera and have well-defined geometry that can be sampled with a navigation system pointer. The block's manufacturing specifications define the block's frame of reference, where one of the divots is set as the origin at [0, 0, 0], with the remaining divots having coordinates given as millimetre offsets from the origin. For clarity, we refer to this frame of reference as block space. The following steps describe the process of (1) finding the transformation of navigation space relative to block space, (2) obtaining features in camera space and determining the relationship between camera space and block space, and (3) calculating CC, the transformation of the camera marker in camera space.

1) Navigation space to shared geometry: The navigation markers were too large to fit inside the calibration divots. To address this, we placed the tip of a navigation system pointer in the divot positions and extracted the tip positions in navigation space. By comparing the navigation-space and block-space coordinates of the divots, we computed the transformation matrix T_{N→Block} between the two spaces using least squares point alignment, described in Section 2.7.1.

2) Camera space to shared geometry: Finding the transformation between camera space and block space, T_{C→Block}, is more challenging because we do not know any inherent extrinsics. Instead, we relied on analysing the camera image and computing the extrinsics that could have produced the image of the block. This was solved using PnP, described in Section 2.6.4. To run PnP, we needed to locate the machined block's divots in the 2D RGB image. Unlike traditional fiducials, such as grids or QR codes, there is no pre-made solution for finding divots on this calibration block. Therefore, we performed a series of manually adjusted steps, including edge detection, box filters, convolutions, circle finders, and image alignment, to locate the holes.

3) Camera marker in camera space: With the transformation matrices from camera space and navigation space to block space determined, we used Equation (4B) to solve for the CC matrix.

\[
\forall M \in \mathbb{R}_A:\quad
{}^{Block}M = T_{N \to Block} \cdot {}^{N}M
\;\wedge\;
{}^{Block}M = T_{C \to Block} \cdot {}^{C}M
\;\implies\;
CC = T_{C \to Block}^{-1} \cdot T_{N \to Block} \cdot NC
\tag{4B}
\]

where
ℝ_A ⊂ ℝ^{4×4}: the set of all affine transformations.
CC ∈ ℝ^{4×4}: transformation of the marker attached to the camera, in camera space.
NC ∈ ℝ^{4×4}: transformation of the marker attached to the camera, in navigation space.
T_{C→Block} ∈ ℝ^{4×4}: transformation converting camera space to block space.
T_{N→Block} ∈ ℝ^{4×4}: transformation converting navigation space to block space.
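A condensed sketch of steps 1–3, under the assumption that the divot detection pipeline has already produced pixel coordinates. Step 1 uses an SVD-based least squares point alignment (Section 2.7.1), step 2 uses OpenCV's solvePnP (Section 2.6.4), and step 3 applies Equation (4B); the function names are ours:

```python
import numpy as np
import cv2

def least_squares_alignment(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Rigid transform T (4x4) minimising ||dst - (R @ src + t)|| (Kabsch)."""
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                    # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, dc - R @ sc
    return T

def calibrate_cc(divots_block, divots_nav, divot_pixels, K, NC):
    """CC from one view of the calibration block, per Equation (4B).

    divots_block: (n, 3) divot coordinates from the manufacturing spec;
    divots_nav: (n, 3) pointer-tip samples of the same divots;
    divot_pixels: (n, 2) detected divot centres in the RGB image;
    K: 3x3 camera intrinsics; NC: 4x4 camera-marker pose in nav space.
    """
    T_nav_to_block = least_squares_alignment(divots_nav, divots_block)
    # solvePnP yields the block -> camera transform, which is exactly
    # the T_{C->Block}^{-1} factor in Equation (4B).
    _, rvec, tvec = cv2.solvePnP(divots_block.astype(np.float64),
                                 divot_pixels.astype(np.float64), K, None)
    T_block_to_cam = np.eye(4)
    T_block_to_cam[:3, :3] = cv2.Rodrigues(rvec)[0]
    T_block_to_cam[:3, 3] = tvec.ravel()
    return T_block_to_cam @ T_nav_to_block @ NC
```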
4.2.2.2 Multi Frame Calibration

This method provides a shortcut that bypasses the need to know the location of points in navigation space. We record camera images and navigation positions while viewing well-defined fiducials, in this case a chess grid. Such patterns are ideal for determining the camera's extrinsics relative to the grid as the world space. The advantage of this method is that it eliminates the need to measure a world space using the navigation system. The only assumption made is that the chess grid remains stationary and fixed in navigation space. Using the navigation space transformation formula (4A), we can set up a system of equations where CC is the only unknown variable. For each camera pose, the unknown navigation transform should result in the grid points being transformed to the same navigation-space positions. This constraint is described in Equation (4C).

\[
\forall (a, b) \in [1 \ldots \#\text{Frames}]^2:\quad
NC_a \cdot CC^{-1} \cdot G_a = NC_b \cdot CC^{-1} \cdot G_b
\tag{4C}
\]

where
G_i ∈ ℝ^{4×n}: camera-space positions of the chess grid corners for frame i, in homogeneous coordinates.
NC_i ∈ ℝ^{4×4}: camera marker in navigation space for frame i.
CC ∈ ℝ^{4×4}: the unknown camera-marker-in-camera-space matrix.

With a set of three linearly independent frames, it is possible to find an exact solution for CC. However, this solution is susceptible to noise and variance in the camera-to-grid extrinsics, leading to slight differences depending on which three frames are selected. To improve calibration consistency, we investigate solving an optimisation problem over an over-determined system that includes significantly more than three frames and a greater variance in viewing angles. This approach aims to reduce the effects of noise, inaccuracies in the extrinsics, and potential view-dependent effects. We use the open-source convex optimisation solver CVXPY [43], [44], with the goal of minimising the Frobenius norm of the difference between each pair. Instead of adding optimisation criteria for the full Cartesian product of pairs, we examine the results for combinations, permutations, and consecutive pairs.
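A minimal CVXPY sketch of this optimisation, shown here for the consecutive-pairs variant. Substituting X = CC⁻¹ keeps Equation (4C) linear in the unknown, so the problem stays convex; note that X is only constrained to be affine, not rigid (a rigidity constraint would be non-convex), so the recovered matrix may need re-orthogonalisation of its rotation part, a detail not spelled out here:

```python
import cvxpy as cp
import numpy as np

def calibrate_cc_multiframe(NCs, grids):
    """Multi-frame CC estimation via Equation (4C).

    NCs: list of 4x4 camera-marker poses in navigation space, one per
    frame; grids: list of 4xn homogeneous camera-space chess grid
    corner matrices G_i.
    """
    X = cp.Variable((4, 4))                     # X approximates CC^{-1}
    cost = 0
    for i in range(len(NCs) - 1):
        # Consecutive frames should map their grid corners to the same
        # navigation-space positions: NC_a X G_a = NC_b X G_b.
        cost += cp.norm(NCs[i] @ X @ grids[i]
                        - NCs[i + 1] @ X @ grids[i + 1], "fro")
    constraints = [X[3, :] == np.array([0.0, 0.0, 0.0, 1.0])]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return np.linalg.inv(X.value)               # CC = X^{-1}
```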
4.2.3 Camera-Navigation Time Synchronisation

A final aspect of transforming the camera's view into navigation space is ensuring that the two systems' data streams are synchronised. This involves making sure that recorded frames from both systems are paired up properly, so that the transformation uses frame pairs taken at the same time. Both the navigation system and the depth camera have internal clocks and add timestamps to their respective frames. By determining the difference between these clocks, we can match the corresponding camera and navigation system frames by caching the most recent frames. This clock difference is referred to as Δsync.

Calibrating the clock discrepancy is achieved by observing a shared event, which in our case is moving the camera. Detecting this event in the navigation system is straightforward; we simply examine the length of the translation vector of the tracked depth camera between frames. For the depth camera, event detection is accomplished by comparing consecutive frames. This comparison involves applying a Sobel filter and then evaluating the average absolute difference between the frames. The Sobel filter is used to amplify small differences, as demonstrated in Figure 4.2.1.

Figure 4.2.1: Two slightly different images of a synthetic knee model, showcasing Sobel filtering and how it amplifies differences. (Panels: the two input images, their absolute difference, the Sobel-filtered images, and the absolute difference of the Sobel-filtered images.)
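The detection logic just described can be sketched as follows; the event thresholds are illustrative placeholders, and the sketch assumes the motion event is present in both recordings:

```python
import numpy as np
from scipy import ndimage

def motion_signal(frames):
    """Per-frame motion score: Sobel-filter each image, then take the mean
    absolute difference between consecutive filtered frames."""
    grads = [np.hypot(ndimage.sobel(f.astype(float), axis=0),
                      ndimage.sobel(f.astype(float), axis=1))
             for f in frames]
    return np.array([np.abs(a - b).mean() for a, b in zip(grads, grads[1:])])

def estimate_delta_sync(cam_times, cam_frames, nav_times, nav_translations):
    """Estimate the clock offset Δsync from a shared camera-motion event.

    cam_times / nav_times: per-frame timestamps from each device's own
    clock; nav_translations: marker translation length (mm) between
    consecutive navigation frames.
    """
    score = motion_signal(cam_frames)
    t_cam = cam_times[1:][score > score.mean() + 3 * score.std()][0]
    t_nav = nav_times[1:][nav_translations > 1.0][0]   # > 1 mm jump
    return t_nav - t_cam                               # Δsync
```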
4.3 CT Alignment

The overarching goal of our project is to find an accurate transformation matrix that maps the preoperative CT data onto the patient during surgery, specifically the T_{CT→N} matrix. The process to achieve this involves the following steps:

1. Filtering: filter the CT scan to include primarily the areas that should be visible and accurate in the camera view.
2. Frame averaging: combine multiple images to produce a more consistent and reliable representation than an individual frame.
3. Registration: utilise ICP for precise point-to-point alignment and registration.

The next few sections expand on these steps.

4.3.1 Filtering

To enhance alignment efficiency and reduce false solutions, it is necessary to exclude certain regions of the CT scan that are not visible from the camera's perspective during the ICP computation. Furthermore, even visible regions may differ from the CT model due to the aforementioned differences in the 3D reconstruction method. Figure 4.3.1 provides a preview of the regions considered important for inclusion in the alignment procedure. Notably, the upper centre of the patellar surface of the femur is excluded due to its thicker cartilage layer.

Figure 4.3.1: Comparison of the intraoperative view during a 3D capture and its CT model counterpart, where only the red regions are used in the alignment procedure. (a) Picture of an in-progress intraoperative 3D capture, taken during the animal lab at Sahlgrenska University Hospital in Mölndal. (b) Corresponding CT model with important regions marked in red.

4.3.2 Frame Averaging

To address noise and view-dependent effects, we combine multiple frames of the navigation-space point clouds of the bone. However, simply merging these point clouds results in poor ICP performance, because the algorithm is sensitive to noisy data [45]. Misaligned aggregated point clouds lead to numerous incorrect point-to-point correspondences, making it difficult for the ICP algorithm to converge to an accurate solution. Instead, we use a density-based averaging method, which conceptually identifies a curve through the point cloud where the point density is highest. This approach, however, tends to smooth out sharp edges and fine features, because the density is typically higher on the inside of corners than on the outside, as illustrated in Figure 4.3.2. We chose Poisson surface reconstruction for its robustness against noisy data [46]. This technique allows us to create a smooth, continuous surface from the noisy point cloud data. After reconstruction, the Poisson surface is uniformly down-sampled to match the point cloud density of the CT scan. With the frame averaging complete, the Poisson mesh is ready for CT alignment.

Figure 4.3.2: Depiction of how density estimation of a sharp corner does not match the real shape: the inside of the corner ends up with a higher density than the outside, resulting in a smoothing of the corner.

4.3.3 ICP Registration

The point-cloud-to-point-cloud registration between the CT model and the Poisson surface is performed in two stages using ICP. The first stage involves a global coarse alignment to ensure the CT scan is approximately positioned and oriented relative to the bone, achieving accuracy within the order of a centimetre. The second stage utilises traditional ICP for optimal fit refinement. Both the global and local ICP algorithms are constrained to prevent scaling, as the camera and navigation system scales are known to be identical; any ICP solution introducing scaling in T_{CT→N} would thus be invalid. The ICP registration is implemented using Open3D, beginning with global registration as described in Section 2.7.2.2, followed by local registration as outlined in Section 2.7.2.1. For the global registration, we do not use the full 33-dimensional feature vectors; instead, we exclude attributes related to point orientation, as this information is not inherently available from the depth images.
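A minimal sketch of such a two-stage Open3D pipeline, using plain FPFH features for the global stage; the modification above, dropping orientation-related attributes from the 33-dimensional feature vectors, is not reproduced here, and the parameter names mirror the hyperparameters of Section 4.4 with values supplied by the caller:

```python
import open3d as o3d

def register_ct_to_capture(ct_pcd, capture_pcd, voxel, feature_radius,
                           distance_threshold):
    """Global RANSAC registration on FPFH features, then local ICP."""
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(
            radius=2 * voxel, max_nn=30))       # FPFH requires normals
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(
                radius=feature_radius, max_nn=100))
        return down, fpfh

    src, src_fpfh = preprocess(ct_pcd)
    tgt, tgt_fpfh = preprocess(capture_pcd)
    # with_scaling=False: scales are known to be identical (see above).
    estimation = o3d.pipelines.registration.\
        TransformationEstimationPointToPoint(with_scaling=False)
    coarse = o3d.pipelines.registration.\
        registration_ransac_based_on_feature_matching(
            src, tgt, src_fpfh, tgt_fpfh, True, 3 * voxel, estimation, 3,
            [], o3d.pipelines.registration.RANSACConvergenceCriteria(
                100000, 0.999))
    fine = o3d.pipelines.registration.registration_icp(
        ct_pcd, capture_pcd, distance_threshold,
        coarse.transformation, estimation)
    return fine.transformation                  # candidate T_{CT->N}
```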
4.4 Hyperparameters

Our approach depends on various hyperparameters. To achieve optimal results, we analyse the impact of these hyperparameters and select the most effective values. These include both algorithmic parameters and those that vary based on the recording. Below is a comprehensive list of all relevant hyperparameters, organised into categories.

Global Registration

Feature Vector Voxel Size: To ensure that the feature vectors capture large-scale geometric shapes, we apply substantial, uniform voxel down-sampling before calculating them. Without this step, the feature vectors would fail to capture the macro-scale changes we are interested in. This parameter sets the millimetre spacing between the voxels.

Feature Vector Feature Radius: Feature vectors are computed based on local geometric characteristics, and the feature radius determines the extent of what is "local".

RANSAC Confidence: The confidence threshold of RANSAC in the feature vector ICP; it represents the desired probability that the inlier subset is representative. Higher values increase computation time, reduce the inlier subset's size, and increase accuracy.

RANSAC Max Iteration: The maximum number of iterations allowed for the ICP algorithm before it stops, even if the desired RANSAC confidence level has not been reached.

Local Registration

Distance Threshold: The maximum distance allowed between a source and target pair of points for them to be considered as corresponding in the ICP algorithm.

Other Computation

Aggregation Random Down-Sampling Percent: To avoid creating excessively large point clouds that slow down computation, we apply random down-sampling to each frame before running Poisson averaging.

Poisson Depth: The depth parameter in the Poisson surface reconstruction algorithm controls the level of detail captured. Lower depth settings result in greater smoothing, which can eliminate fine details and sharp corners. Higher depth settings, on the other hand, can misinterpret noise as detail while also significantly increasing computation time.

Recording Attributes

Number of Frames: The number of frames included in Poisson averaging, uniformly distributed over a recording.

Polarisation: Indicates whether polarisation filters are used. Details on the methodology for applying and testing polarisation can be found in Section 4.5.

4.5 Polarisation

We also explore the effects of polarising both the light in the scene and the light received by the depth camera. This investigation is inspired by an Intel white paper by Sweetser and Grunnet-Jepsen [47], which suggests that polarising filters on their RealSense® devices can enhance depth estimation accuracy, especially for reflective surfaces. Given the reflective nature of organic tissue and bone, we consider it valuable to examine the potential benefits of adding polarisation filters.

To utilise polarisation on the camera, we designed a jig that attaches a rotating filter to the camera lenses. This setup, illustrated in Figures 4.5.1a and 4.5.1b, allows the filter to be adjusted on the fly, effectively reducing reflective highlights in the scene. Applying polarisation to the scene lighting is relatively straightforward in a surgical environment, as such scenes are typically illuminated by a primary light source on a movable arm. By polarising this main light source, the majority of the light on the bone will be polarised. Figures 4.5.1c and 4.5.1d depict our crude polarisation filter fitted onto a surgical light. For completeness, we also investigate the effect of polarising only the light source, without attaching a filter to the camera, to identify any effects introduced by the decreased light intensity caused by using linear polarisation filters.

Figure 4.5.1: Equipment and experimental setup for the polarised light experiment. (a) Custom 3D-printed polarisation filter, attachable to the depth camera. (b) The Intel® RealSense™ D405 depth camera fitted with a custom 3D-printed mount along with a navigation system marker. (c) A box with a hole fitted with polarisation filters, used to filter the surgical spotlight. (d) Light source polarisation filter mounted to the surgical spotlight.

4.6 Data Evaluation

Our main goal is to achieve accurate intraoperative pose alignment and compare this with the traditional method detailed in Section 2.1. The following subsections describe the metrics used for evaluation and their significance.

4.6.1 Baseline Evaluation

Baseline metrics, while not directly used in alignment, provide valuable insights into the reliability of our technology and calibration methods, demonstrating the consistency and accuracy of our equipment and procedures.

Temporal Noise: Evaluates the consistency of the Intel® RealSense™ D405 camera's depth measurements over time in an intraoperative scenario. It identifies regions with varying levels of uncertainty and assesses the impact of polarisation.

Sync Calibration: Measures the variability of the synchronisation calibration between the camera and the navigation system.

Camera Marker Calibration: Assesses the variability of CC and evaluates the time required for different calibration methods to execute and compute.

Traditional Method Tip Accuracy: Examines the variability of sampled points using our equipment, to ensure the reliability of our implementation of the traditional method.

4.6.2 Alignment Evaluation

A robust metric for evaluating alignment is essential. Alignment measures how accurately our solution calculates the transformation of a CT scan into navigation space, specifically evaluating the correctness of the T_{CT→N} matrix in transforming the CT scan. To compare transformations, we use a scalar metric based on the point-to-point distances between pairs of feature-filtered CT point clouds transformed into navigation space. This metric, referred to as TFscore, is formally defined in Equation (4D). It quantifies how far each point in the point cloud is displaced under one transformation compared to another.

\[
\text{TFscore}(CTP,\ T_{CT \to N},\ U_{CT \to N}) =
\left\lVert T_{CT \to N} \cdot CTP - U_{CT \to N} \cdot CTP \right\rVert_{Frobenius}
\tag{4D}
\]

where
CTP ∈ ℝ^{4×m}: the feature-filtered point cloud in homogeneous coordinates in CT space, where m is the number of points in the cloud.
T_{CT→N} ∈ ℝ^{4×4}: first CT to navigation space transformation.
U_{CT→N} ∈ ℝ^{4×4}: second CT to navigation space transformation.

Instead of only comparing pairs of CT-to-navigation transforms, TFscore is also used to evaluate the consistency of an alignment. Ideally, the same alignment should be achieved given the same recording, even if the recording is cropped or down-sampled. In practice, however, ICP-based registration is significantly affected by randomness. The consistency of alignment is evaluated by comparing the TFscore of each transformation against the transformation that produces the median point cloud across all configurations and recordings. This approach helps us understand the distribution and variability of the alignment results.
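A direct implementation of this metric is only a few lines:

```python
import numpy as np

def tf_score(ct_points_h: np.ndarray, T: np.ndarray, U: np.ndarray) -> float:
    """Compare two candidate CT-to-navigation transforms, per Equation (4D).

    ct_points_h: 4xm feature-filtered CT cloud in homogeneous coordinates;
    T, U: two 4x4 CT-to-navigation transformations.
    """
    diff = T @ ct_points_h - U @ ct_points_h
    # Equation (4D) writes this difference as a Frobenius norm; averaging
    # the per-point Euclidean norms gives the average point-to-point
    # distance described in the text.
    return float(np.linalg.norm(diff[:3], axis=0).mean())
```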
With this alignment metric established, we address the following questions:

How do hyperparameters affect alignment? Algorithm parameters will be evaluated through individual parameter sweeps. The ranges were chosen as either linear, geometric, or hard-coded, based on initial small-scale tests. Polarisation and the number of frames are given special handling and tested against all parameter sweeps instead of on their own.

How does calibration affect alignment? Without external measurements to verify the accuracy of the Δsync and CC calibrations, we investigate the impact of intentionally incorrect calibration. If a small additional calibration error worsens alignment, we could conclude that our calibration method is insufficient.

How does polarisation affect alignment? As described in Section 4.5, we study how alignment is influenced by the polarisation of the camera and the light source.

5 Results

The following chapter presents the findings of the study, organised into a series of subsections: first the evaluation of the RGBD camera's depth estimation capabilities, followed by an investigation of the sensitivity of the camera calibration procedure and the results of hyperparameter tuning, and concluding with the alignment quality achieved between the CT model and the intraoperative 3D model captured using the Intel® RealSense™ D405 camera.

5.1 Depth Estimation

The depth data estimated by the Intel® RealSense™ D405 camera is prone to a considerable amount of temporal noise and to coverage gaps in difficult-to-capture regions. The results of the experiments conducted are presented in Figure 5.1.1. They show tangible improvements in fill rate when properly using polarised filters, as well as higher levels of temporal noise in regions with reflective surfaces and homogeneous texture, especially apparent on the patellar surface of the femur.

Figure 5.1.1: Analysis of the depth uncertainty of the knee region, based on three distinct angles.
Polarisation: the type of polarisation setup used. 'Both' indicates that polarisation filters were applied to both the surgical light and the camera; 'Lamp' signifies that only the surgical light was equipped with a polarisation filter; 'None' denotes that no polarisation filters were used. Depictions of the equipment used to test polarisation can be found in Figure 4.5.1.
Fill rate: illustrates how often the camera manages to capture depth information for each pixel. Higher is better; yellow indicates poor values.
Temporal noise: illustrates the per-pixel camera depth confidence, defined for each pixel as the standard deviation of that position's depth value. Pixels with less than 50% fill rate are excluded. Lower is better; yellow indicates poor values.
Summary: using polarisation filters on both the camera and the light source reduces data loss but increases temporal noise. This conclusion is supported by the observed decrease of yellow regions in the fill rate column and a corresponding increase of yellow regions in the temporal noise column when polarised filters are used. In contrast, polarising only the light source has a negligible effect.
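Both maps in Figure 5.1.1 follow directly from a stack of depth frames of a static scene; a sketch of how such per-pixel statistics can be computed:

```python
import numpy as np

def depth_stream_statistics(depth_stack: np.ndarray):
    """Per-pixel fill rate and temporal noise for a static scene.

    depth_stack: (T, H, W) depth images of an unmoving scene, zero where
    the camera produced no estimate. Fill rate is the fraction of frames
    with valid depth; temporal noise is the standard deviation of the
    valid samples, with pixels under 50% fill rate excluded, matching
    the definitions used in Figure 5.1.1.
    """
    valid = depth_stack > 0
    fill_rate = valid.mean(axis=0)
    masked = np.where(valid, depth_stack.astype(float), np.nan)
    noise = np.nanstd(masked, axis=0)
    noise[fill_rate < 0.5] = np.nan             # excluded in the figure
    return fill_rate, noise
```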
5.2 Camera Marker Calibration

This section presents our calibration results from determining the transformation of the navigation system marker attached to the camera, expressed in camera space and referred to as CC.

Table 5.2.1: How different camera-marker-in-camera-space (CC) calibration methods affect the consistency of the transformation matrix, and how long they take to compute and perform. "Combinations", "Permutations", and "Pairs" are Multi Frame variants using the optimisation solver with different pairing strategies.

                                                      Multi Frame
  Metric                            Shared Geometry   Set of 3   Combinations   Permutations   Pairs
  Standard deviation of
  translation [mm]                        1.8            4.3         1.7            1.7         2.3
  Time to compute [s]                    ~10             0           3.2            8.7         0.18
  Time to perform [s]                   ~100            ~10         ~10            ~10         ~10

Shared Geometry: implementation of the method described in Section 4.2.2.1. The results were derived from three repetitions of the shared geometry calibration, using five different angles, performed during the animal test at Sahlgrenska.
Multi Frame: implementation of the method outlined in Section 4.2.2.2. Results are compiled from 20 repetitions of calibrations performed in the office setting. This calibration was not used for any results other than this table.
Standard Deviation of Translation: the standard deviation of the length of the translation vector making up CC.
Time to Compute: the time spent by our medium-power laptop to solve the method.
Time to Perform: the time a person spent performing the calibration steps, excluding the computation time.

5.3 Frame Synchronisation

This section presents our calibration results from determining the time difference between the system clocks of the navigation system and the Intel® RealSense™ D405, referred to as Δsync.

Table 5.3.1: Consistency of our sync calibration, where Δsync is the difference between the internal clocks of the navigation system and the depth camera. Results were derived from two repetitions of sync calibration performed during the animal test at Sahlgrenska. Each repetition consisted of 10 instances of Δsync calibration and took ~5 seconds.

                  Standard Deviation of Δsync [s]
  Repetition 1      0.010
  Repetition 2      0.011
  Average           0.011

5.4 Manual Pointer Tip Positions

This section presents the consistency achieved during point sampling using the traditional method. Points were collected by placing the tip of a pointer on bone. The handheld pointer was kept stationary for around three seconds for each of the ~30 sampled points.

Table 5.4.1: Consistency of our point sampling using the traditional method described in Section 2.1. Data is collected from three repetitions of alignment using the traditional method during the animal test at Sahlgrenska. The tip position for each sample was recorded over ~5 seconds. Standard Deviation of Translation is the average standard deviation of the length of the tip-position vectors.

                  Standard Deviation of Translation [mm]   Time to perform alignment
  Repetition 1      0.049                                    ~5 minutes
  Repetition 2      0.038                                    ~5 minutes
  Repetition 3      0.057                                    ~5 minutes
  Average           0.048                                    ~5 minutes

5.5 Alignment

This is the most important results section, as it describes how well our solution calculated the transformation of a CT scan into navigation space; more specifically, we evaluate how correct the T_{CT→N} matrix is. We do this by comparing the consistency of the TFscore against the median transform of all alignments achieved with default hyperparameters.

5.5.1 Hyperparameters

This section presents the impact of hyperparameters on alignment and computation time. The results highlight the alignment distribution achieved with various hyperparameter values.

Figure 5.5.1: Parameter sweep of Feature Vector Voxel Size. Each curve consists of 54 repetitions.
Figure 5.5.2: Parameter sweep of Feature Vector Feature Radius. Each curve consists of 54 repetitions.

Figure 5.5.3: Parameter sweep of RANSAC Confidence. Each curve consists of 54 repetitions.

Figure 5.5.4: Parameter sweep of Distance Threshold. Each curve consists of 54 repetitions.

Figure 5.5.5: Parameter sweep of Aggregation Random Down-Sampling Percent. Each curve consists of 54 repetitions.

Figure 5.5.6: Parameter sweep of Poisson Depth. Each curve consists of 54 repetitions.

Figure 5.5.7: Parameter sweep of Number of Frames. Each curve consists of 54 repetitions.

Figures 5.5.1–5.5.7 share the same legend. Value: the value of the parameter being swept. Time: the total execution time to finish the alignment. TFscore: the average point-to-point distance to the median alignment; lower values and lower variance are better.