Deep Learning-based Segmentation of
Kidneys from MR Images
A Multi-Channel Approach for Automated Segmentation and
Quantification in Multi-parametric MRI

Master’s Thesis in Biomedical Engineering

CECILIA NORDBERG
VIKTOR LINDFORS

DEPARTMENT OF ELECTRICAL ENGINEERING (E2)

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

© CECILIA NORDBERG, 2025.
© VIKTOR LINDFORS, 2025.

Supervisor: Bettina Selig, Antaros Medical AB
Supervisor: Kanishka Sharma, Antaros Medical AB
Examiner: Ida Häggström, Electrical Engineering (E2)

Master’s Thesis 2025
Department of Electrical Engineering (E2)
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Segmentation of the renal cortex (green) and medulla (magenta) using a 2D
ResUNet. The model was trained on multi-channel input from T1-MOLLI images
acquired at varying inversion times.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Abstract
Chronic kidney disease (CKD) is a progressive condition affecting millions world-
wide, and accurate assessment of kidney structure is essential for early diagnosis and
monitoring disease progression. Magnetic Resonance Imaging (MRI) has emerged
as a powerful, non-invasive technique for visualizing subtle structural and functional
changes of the kidneys, providing insights into disease progression and severity. How-
ever, manual segmentation of MRI data is both time-consuming and prone to inter-
and intra-observer variability, highlighting the need for automated methods.

This thesis presents a deep learning-based approach for automated segmentation of
the renal parenchyma, cortex, and medulla using multi-channel and multi-modal
MRI data. A 2D ResUNet architecture was implemented with the Medical Open
Network for AI (MONAI) framework and trained on a dataset of 37 MRI scans from
CKD patients. Two approaches were evaluated: a multi-channel model utilizing T1-
weighted Modified Look-Locker Inversion Recovery (T1-MOLLI) images at multiple
inversion times, and a multi-modal model incorporating diffusion-weighted imaging
(DWI) and T2*-weighted image data. While the multi-channel T1-MOLLI model
demonstrated strong agreement with manual annotations, achieving Dice scores of
0.9089 for parenchyma and 0.8552 for cortex, the multi-modal approach underper-
formed due to spatial misalignment between input images and reference labels.

The proposed segmentation pipeline also enabled reliable quantification of renal
parenchyma and cortex volumes, and showed potential for quantifying tissue-specific
parametric values relevant to CKD monitoring. However, the reliability of these
measurements was highly dependent on the model’s segmentation performance. Overall,
the findings highlight the potential of deep learning models with multi-channel
MRI input for improving kidney segmentation, serving as a tool to support
clinical image analysis workflows and reduce manual effort.

Keywords: deep learning, image segmentation, MRI, kidneys, convolutional neural
networks, ResUNet, multi-channel images, chronic kidney disease.



Acknowledgements
We would like to thank Antaros Medical AB for giving us the opportunity to carry
out our thesis project within their organization. A special thanks to our super-
visors, Bettina Selig and Kanishka Sharma, for your ongoing support throughout
the project. You have continuously encouraged us to explore different perspectives
and shared your knowledge with us during this spring, which we truly appreciated.
Additionally, we would like to extend our thanks to Carl Sjöberg for making this
thesis possible, and for the warm welcome and support from the start. To all the
colleagues at Antaros, thank you for welcoming us and making us feel like part of
the team.

We would also like to thank our examiner, Ida Häggström, for guiding us during
our thesis work. Finally, as this thesis marks the conclusion of our five years at
Chalmers University of Technology, we would like to give a big thank you to our
friends and family, and to everyone who has contributed to our experience during
this time.

Cecilia Nordberg & Viktor Lindfors, Gothenburg, June 2025



List of Acronyms

Below is the list of acronyms used throughout this thesis, in alphabetical order:

ADC Apparent Diffusion Coefficient
CKD Chronic Kidney Disease
CMD Corticomedullary Differentiation
CNN Convolutional Neural Network
CT Computed Tomography
DNN Deep Neural Network
DWI Diffusion-Weighted Imaging
DX Dextral
ESRD End-stage Renal Disease
FN False Negative
FOV Field of View
FP False Positive
GPU Graphics Processing Unit
IoU Intersection over Union
MOLLI Modified Look-Locker Inversion Recovery
MONAI Medical Open Network for Artificial Intelligence
MR Magnetic Resonance
MRI Magnetic Resonance Imaging
ReLU Rectified Linear Unit
RF Radio Frequency
ROI Region of Interest
SD Standard Deviation
SIN Sinistral
T1w T1-weighted
T2*w T2*-weighted
TE Echo Time
TI Inversion Time
TP True Positive
TR Repetition Time



Contents

List of Acronyms

List of Figures

List of Tables

1 Introduction
1.1 Aim
1.2 Scope and Limitations
1.3 Related Work

2 Theory
2.1 Kidney Anatomy and Physiology
2.1.1 Chronic Kidney Disease
2.2 Magnetic Resonance Imaging
2.2.1 Physical Principles of MRI
2.2.2 MRI Sequences
2.2.3 MRI Artifacts
2.3 Deep Learning for Image Segmentation
2.3.1 Basic Concepts of Deep Learning
2.3.2 Convolutional Neural Networks
2.3.3 U-Net
2.3.4 ResUNet
2.3.5 Loss Functions for Medical Image Segmentation
2.3.6 Evaluation Metrics
2.4 Multi-channel Image Segmentation

3 Methods
3.1 Data Description
3.2 Data Exclusion
3.3 Dataset Split
3.4 Data Pre-processing
3.4.1 Creation of Ground Truth Masks
3.4.2 Brightness and Intensity Scaling
3.4.3 Splitting into 2D Slices
3.4.4 Augmentations
3.5 Network Configuration
3.6 Model Implementation and Experimental Setup
3.6.1 Single-Channel T1-MOLLI-based Segmentation
3.6.2 Multi-Channel T1-MOLLI-based Segmentation
3.6.3 Multi-Modal Segmentation Integrating DWI and T2*
3.7 Training and Evaluation
3.8 Post-processing
3.9 Volume Quantification
3.10 Quantification of Parametric Values

4 Results
4.1 Data Exploration
4.2 Single-channel T1 MOLLI-based Segmentation
4.3 Multi-channel T1 MOLLI-based Segmentation
4.3.1 Performance of Fine-tuned T1-MOLLI Model
4.3.2 Quantification of Kidney Volume
4.3.3 Quantification of Median T1 Value
4.4 Multi-Modal Kidney Segmentation
4.4.1 Quantification of ADC and T2* value

5 Discussion
5.1 Comparison of 2D and 3D Segmentation
5.2 The Effect of Multi-Channel Inputs
5.3 The Effect of Multi-modal Integration
5.4 Model Strengths and Limitations
5.5 Quantifying Volume and Tissue Parameters
5.6 Comparison to Existing Works
5.7 Future Work

6 Conclusion

Bibliography

A Appendix A

B Appendix B


List of Figures

2.1 Anatomical illustration of the normal renal structure from a coronal
orientation of the left kidney. The renal parenchyma tissue includes
cortex and pyramids of medullas. Created with BioRender.com.

2.2 Examples of common MRI artifacts that include motion artifact, zebra
stripe artifact, and partial volume effect. Images (a) and (b) are
adapted from Stadler et al. [28], while (c) is courtesy of Antaros
Medical AB.

2.3 An example of a convolutional layer with kernel size 3×3, a stride of 1,
and no padding.

2.4 Max pooling layer with kernel size 2×2 and a stride of 2.

2.5 Comparison of activation functions: (a) ReLU, and (b) PReLU.

2.6 Illustration of activation functions commonly used for classification:
(a) Sigmoid function, and (b) Softmax function.

2.7 Illustration of the relationship between ground truth and predicted
classifications. The overlapping region represents true positives (TP),
where the prediction correctly matches the ground truth. The left
(blue) region denotes false negatives (FN), where actual positives were
missed. The right (red) region indicates false positives (FP), where
the background is incorrectly segmented as foreground.

2.8 Example of a multi-channel input image with five different image
channels from renal MRI data. The channels provide different views
of the same anatomy with complementary tissue contrast information.

3.1 Overview of proposed pipeline for kidney segmentation.

3.2 Example of ground truth segmentation and corresponding binary
masks after applied transformations. From left to right: parenchyma
ground truth, parenchyma binary mask, cortex ground truth, and
cortex binary mask. The manually segmented ground truth image shows
the left kidney (cyan) and right kidney (green), annotated by image
analysts.

3.3 Overview of proposed 2D ResUNet architecture. The left side shows
the downsampling path with residual units, where convolutional channels
increase from 16 to 256 to extract features. The right side corresponds
to the upsampling path, where channels decrease from 256 to
the number of output classes, combined with skip connections.

3.4 Visualization of the RemoveSmallObjects transform in MONAI, here
removing clusters with 100 pixels or less. Figure adapted from MONAI
[45].

3.5 Effect of erosion using a 3×3 structuring element.

4.1 Examples of MRI artifacts and image distortions observed in the
dataset. Yellow arrows highlight regions impacted by zebra striping
and poorly defined kidney boundaries.

4.2 Comparison of segmentation results for (a) renal parenchyma and (b)
renal cortex using 2D ResUNet and 3D ResUNet models. The figures
illustrate the original image, ground truth segmentation label, and
overlay of image, label and model predictions, where true positives
(TP) are represented in green, false positives (FP) in red, and false
negatives (FN) in blue.

4.3 Training and validation performance of the proposed multi-label 2D
ResUNet, showing Dice loss and evaluation metrics progress during
training (Adam optimizer, learning rate of 10^-4). Dice score and IoU
are reported separately for the parenchyma and cortex, as well as
averaged across both regions.

4.4 Representative examples of parenchyma segmentation results from
two test cases: a) scan acquired at site A, and b) scan from site B.
Each figure shows selected image slices with the predicted segmentation
mask, as well as an overlay on the TI 1400 ms image. The color
coding in the overlays is as follows: green - true positives, blue - false
negatives, and red - false positives.

4.5 Examples of renal parenchyma segmentation results, illustrating
classification outcomes where the model oversegments in the renal pelvis
region. The color coding is as follows: green - true positives, blue -
false negatives, and red - false positives.

4.6 Examples of renal parenchyma segmentation results, illustrating
classification outcomes where the model undersegments in the most
anterior and posterior parts of the kidney. The color coding is as follows:
green - true positives, blue - false negatives, and red - false positives.

4.7 Examples of renal parenchyma segmentation results, illustrating
classification outcomes where the model falsely predicts other structures
in background-only slices. Red indicates false positives.

4.8 Representative examples of cortex segmentation results from two test
cases: a) a scan acquired at site A, and b) a scan from site B. Each
figure shows selected image slices with the predicted segmentation
mask, as well as an overlay on the TI 1400 ms image. The color
coding in the overlays is as follows: green - true positives, blue - false
negatives, and red - false positives.

4.9 Correlation of ground truth and predicted volume for parenchyma
and cortex. The identity line represents perfect correlation between
the CNN-predicted and ground-truth segmentation. Grey area corresponds
to a 5% volume difference.

4.10 Bland-Altman plots of agreement for volume prediction and ground
truth volume for parenchyma and cortex. Mean and standard deviation
(SD) are calculated globally across all data sets. The solid
line represents the mean, while the dashed lines indicate the limits of
agreement, calculated as the mean ±1.96 times the SD.

4.11 Example of post-processing by erosion of the segmentation mask for (a)
cortex and (b) medulla. From left to right: original segmentation
mask and resulting segmentation mask after erosion with a square
structuring element of increasing size.

4.12 Correlation of predicted and delivered median T1 value for cortex and
medulla. The identity line represents perfect correlation between the
CNN-predicted and delivered values. Grey area corresponds to a 5%
difference in predicted value.

4.13 Example of renal parenchyma segmentation results from multi-modal
input, illustrating test cases with good image-label alignment. The
ground truth labels closely match the kidney boundaries visible in
the underlying T2*w image, resulting in fewer false detections. The
color coding in the overlays is as follows: green - true positives, blue
- false negatives, and red - false positives.

4.14 Example of renal parenchyma segmentation results from multi-modal
input, illustrating test cases with poor image-label alignment. The
segmentation overlays show how the ground truth label either extends
beyond the visible kidney region or covers only a portion of the kidney
as seen in the underlying T2*w image. The color coding in the
overlays is as follows: green - true positives, blue - false negatives,
and red - false positives.

4.15 Examples of cortex segmentation results using the multi-modal
segmentation model, demonstrating classification performance on test
images. Images show an overlay of ground truth label and prediction
output for several slices, superimposed on T2*w images acquired at
TE = 3 ms. The color coding is as follows: green - true positives,
blue - false negatives, and red - false positives.

4.16 Example of manually delineated regions and generated regions: (a)
Manually delineated regions for cortex and medulla on T2* map, (b)
generated mask using erosion for cortex, (c) generated mask using
erosion for medulla.

4.17 Correlation of predicted and delivered median ADC value for cortex
and medulla. The model is the proposed multi-modal model (2D ResUNet).
The identity line represents perfect correlation between the
CNN-predicted and delivered values. Grey area corresponds to a 5%
difference in predicted value.

4.18 Correlation of predicted and delivered median T2* value for cortex
and medulla. The model is the proposed multi-modal model (2D ResUNet).
The identity line represents perfect correlation between the
CNN-predicted and delivered values. Grey area corresponds to a 5%
difference in predicted value.

A.1 Example T1-MOLLI images acquired with increasing inversion times,
ranging from 174 ms to 2574 ms.

A.2 Example T2*w images acquired with increasing echo times,
ranging from 3 ms to 62 ms.

A.3 Example DWI images acquired with increasing b-values, ranging
from 0 to 500 s/mm^2. Left to right: b=0, b=50, b=200, b=500.

A.4 Examples of parametric maps derived from different MRI sequences.
From left to right: T1 map from T1-MOLLI, ADC map from DWI
images, and T2* map from T2* mapping.


List of Tables

3.1 Summary of imaging parameters by modality, reflecting the variability
in acquisition protocols depending on the modality and the scanner
setup.

4.1 6-fold cross-validation results for 2D and 3D ResUNet models.
Performance metrics are shown as mean ± standard deviation across all
folds.

4.2 Parenchyma segmentation performance using single-channel input images
at different inversion times (TI).

4.3 Cortex segmentation performance using single-channel input images
at different inversion times (TI).

4.4 Comparison of the top five multi-channel input combinations for
segmentation of the renal parenchyma.

4.5 Comparison of the top five multi-channel input combinations for
segmentation of the renal cortex.

4.6 Comparison of the performance of single-label and multi-label
segmentation models trained on multi-channel T1-MOLLI input.

4.7 Mean ± SD of volume difference (in %) between predicted and delivered
volumes, across kidney regions and dataset splits.

4.8 Overview of input channels used in the multi-modal segmentation
model. DWI inputs correspond to multiple b-values and T2* mapping
inputs correspond to multiple echo times.

4.9 Comparison of the performance of multi-label segmentation models
using multi-channel T1-MOLLI or DWI and T2* as input. Models
are based on a 2D ResUNet structure, segmenting both cortex and
parenchyma simultaneously.

5.1 Summary of related work on kidney segmentation.

B.1 R^2 correlation values between predicted and actual T1 values for SIN
and DX cortex, using erosion with a square structuring element of
various sizes.

B.2 R^2 correlation values between predicted and actual T1 values for SIN
and DX medulla, using erosion with a square structuring element of
various sizes.


1 Introduction

Accurate and precise segmentation of anatomical structures in medical images is a
critical challenge in the field of medical image analysis. With rapid advancements
in deep learning, new and powerful tools for processing and interpreting medical
images have been introduced, including tools for segmenting vital organs such as the kidneys.
Precise delineation of kidney regions is essential for advancing the understanding
of kidney function and pathology, as well as for supporting effective diagnosis and
treatment planning. For instance, segmenting internal kidney structures such as the
cortex and medulla allows for measurement of anatomical and functional changes
associated with pathology. Such information could support early detection of condi-
tions like chronic kidney disease, a progressive and widespread condition that often
goes undetected in its early stages [1]. Among medical imaging techniques, Mag-
netic Resonance Imaging (MRI) has emerged as a powerful, non-invasive technique
for visualizing subtle structural and functional changes of the kidneys, providing
insights into disease progression and severity [2].
However, manual analysis and segmentation of kidney regions from medical image
data is a time-consuming process that requires high expertise. It becomes even
more challenging when there is inter- and intra-subject anatomical variability, lim-
ited image contrast between internal tissue types due to limited image quality, and
inconsistencies across imaging protocols. Together, these factors make manual
annotation not only slow but also prone to variability. Deep learning offers a promising way
to automate the renal segmentation process, saving time and manual effort while
also reducing operator bias.
While recent advances in deep learning have led to significant improvements in
medical image analysis, most existing approaches focus on whole-kidney segmentation
and rarely address the challenge of segmenting anatomical substructures of
the kidneys. Methods that do attempt to resolve internal regions often struggle with
low contrast, leading to insufficient segmentation performance and limited clinical
relevance. Thus, there is a clear need for deep learning solutions that can handle
the complexity of renal segmentation from imaging data.
This thesis focuses on developing a deep learning-based segmentation pipeline to
address the challenge of segmenting the parenchyma and internal renal structures in
kidney MRI. By providing segmentations with high accuracy and precision,
the need for manual correction can be reduced or, ideally, eliminated. Such advancements
have the potential to make the process of analyzing kidney images more efficient,
thereby reducing time, effort, and operator dependency.

1.1 Aim
The aim of this thesis is to develop a deep learning-based solution for automated
segmentation of kidney parenchyma, cortex, and medulla from MRI data. Specifi-
cally, the objective is to implement a multi-channel approach capable of generating
reliable segmentation suggestions for these anatomical regions by leveraging infor-
mation from multi-parametric MRI. Furthermore, this study aims to evaluate the
feasibility of using the proposed deep neural network to estimate renal volume and
tissue-specific parametric values, thereby enabling automatic assessment of renal
tissue health and disease progression.
This project is carried out in collaboration with Antaros Medical AB, a company
specializing in advanced medical imaging technologies for clinical trials and drug
development. The desired outcome is a reliable segmentation pipeline that assists
their image analysts in the manual segmentation workflow.

1.2 Scope and Limitations
The scope of this study is limited to the automated segmentation of the kidney
parenchyma, cortex, and medulla using MRI data. Only these specific renal regions
are considered, and no additional structures will be included in the segmentation
process. Similarly, the image data is restricted to MRI, and other modalities such
as computed tomography (CT) are not within the scope.
The MRI dataset used in the study is pre-acquired and provided by Antaros Medical
AB. As a result, this work does not involve any additional image data acquisition.
Motion artifact reduction and image registration techniques are likewise
excluded from the scope.
Given that image segmentation and deep learning are broad research areas, this thesis
focuses on investigating a limited set of techniques rather than attempting to explore
the entire field. The proposed solution is evaluated solely on data from the
same clinical study, which limits how far the results can be compared with other medical
image data. Lastly, while algorithm performance in terms of efficiency and processing time
is discussed, direct comparisons with manual segmentation in terms of time savings
are not included in the scope of this thesis.

1.3 Related Work
Automated medical image segmentation has become an increasingly active area of
research, driven by advancements in deep learning and computer vision. Early efforts
in automatic kidney segmentation focused primarily on the kidney as a whole,
with less attention paid to the internal structures, such as the cortex and medulla.
However, research highlights the importance of detailed segmentation of these inter-
nal structures due to their clinical importance in diagnosing and monitoring renal
diseases [3]. An early study from 2014 emphasized the significance of segmenting
the kidney cortex and medulla and proposed a method based on traditional
thresholding techniques applied to MRI modalities [4]. With the advancements in deep
learning, particularly in the field of convolutional neural networks (CNNs), more
refined approaches have been developed to improve segmentation accuracy.
In the context of kidney segmentation, a number of efforts have explored using CNN-
based networks to automate and improve segmentation performance in MRI and CT
imaging. For example, a study from 2022 introduced a deep learning model based on
a modified U-Net architecture for the automated segmentation of the kidney cortex
and medulla in abdominal CT scans [5]. Although this approach demonstrated
improved accuracy over traditional methods, challenges remain due to the minimal
contrast differences between the cortex and medulla. Additional research has applied
CNNs for classification and segmentation tasks to support chronic kidney disease
(CKD) diagnosis and kidney volume estimation [6], [7], yet poor tissue contrast is
consistently highlighted as a challenge. Segmentation accuracy is often reduced near
anatomical boundaries where adjacent organs, such as the spleen and liver, blur the
distinction between tissues. In patients with CKD, additional challenges arise due
to irregular kidney shape and altered contrast, which further complicate delineation
of tissue boundaries. Despite these challenges, CNN-based methods demonstrate
high accuracy in segmenting kidneys, often outperforming manual segmentation
and reducing inter- and intra-observer variability [5], [8].
Deep learning-based segmentation models are typically limited by the quality of in-
put data. For renal segmentation, existing methods rely on either contrast-enhanced
or high-resolution imaging. Still, challenges remain due to limited contrast within
the kidneys or surrounding tissues. Additionally, models trained on a single imag-
ing modality often experience limited generalizability. Such models tend to become
biased toward specific imaging data, reducing their effectiveness when applied to
other modalities or diverse clinical datasets. This presents challenges for clinical ap-
plications, where robust performance across various imaging techniques is essential.
To overcome such challenges, multi-modal and multi-channel approaches have gained
attention. Recently, researchers have explored the integration of multiple imaging
modalities with deep learning techniques to improve segmentation performance and
robustness. For example, a study from 2024 proposed incorporating multiple imag-
ing modalities into the segmentation process by treating them as separate input
channels. This method demonstrated improved segmentation accuracy and better
generalization across varying datasets [9]. Such input-level fusion is widely adopted
in deep learning-based medical segmentation [10], offering a straightforward yet ef-
fective way to combine diverse tissue contrasts within a single model. Other methods
involve fusion at a later stage, where integration of different imaging modalities oc-
curs within the network’s internal layers or output stage, offering more flexibility
but with higher complexity [11].
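As a minimal illustration of input-level fusion, the sketch below stacks several co-registered 2D images into a single multi-channel array and normalizes each channel independently. It is not the implementation of any of the cited studies. The function names `stack_channels` and `normalize_per_channel`, the synthetic input data, and the choice of per-channel z-score normalization are assumptions made purely for illustration.

```python
import numpy as np


def stack_channels(images):
    """Stack co-registered 2D images of shape (H, W) into one (C, H, W) array.

    Hypothetical helper: input-level fusion only makes sense when all
    inputs share the same spatial grid, so mismatched shapes are rejected.
    """
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"all inputs must share one shape, got {shapes}")
    return np.stack(images, axis=0).astype(np.float32)


def normalize_per_channel(x, eps=1e-8):
    """Scale each channel to zero mean and unit variance independently.

    Assumed pre-processing step: different contrasts have different
    intensity ranges, so each channel is normalized on its own.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)


# Three synthetic "contrasts" of the same 64x64 slice. These are random
# placeholders standing in for real, spatially aligned MRI images.
rng = np.random.default_rng(0)
contrasts = [
    rng.normal(loc=m, scale=s, size=(64, 64))
    for m, s in [(100, 20), (400, 50), (50, 5)]
]

fused = normalize_per_channel(stack_channels(contrasts))
print(fused.shape)  # (3, 64, 64)
```

The resulting array has one channel per contrast and could be passed to a network whose first convolutional layer expects that number of input channels. The shape check reflects the main prerequisite of this fusion strategy: the images must already be spatially aligned.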
Various approaches are represented in the literature, with a general agreement that
incorporating data from multiple imaging modalities improves segmentation accu-
racy and model generalization. Moreover, multi-dimensional U-Net variants have
demonstrated strong performance across diverse medical datasets [12]–[14]. Despite
the clear advantages, the application of multi-channel and multi-modal approaches
to renal image segmentation appears to be limited.

This thesis builds on previous research by using a multi-channel approach that
combines data from multiple MRI sequences to segment not only the kidney as a
whole, but also its internal structures like the cortex and medulla. Unlike many
of the earlier methods that focus on CT or single-modality MRI to segment the
entire kidney, this approach combines data from different MRI acquisitions to help
separate internal kidney structures, which are often hard to distinguish due to low
tissue contrast.


2 Theory

This chapter provides the theoretical background relevant to this study on kidney
segmentation using deep learning. It begins with an introduction to kidney anatomy,
renal function, and the clinical relevance of conditions like CKD. Following this, key
concepts related to magnetic resonance imaging are introduced, highlighting its
role in renal imaging and functional assessment, as well as common artifacts.
Furthermore, the fundamentals of deep learning and neural networks, specifically
for the task of semantic segmentation, are explained. This includes an introduction
to state-of-the-art approaches, focusing on convolutional neural networks and
architectures like the U-Net and ResUNet. Lastly, this chapter addresses training
strategies for deep neural networks, including optimization algorithms, loss functions
specific to medical segmentation, and commonly used evaluation metrics.

2.1 Kidney Anatomy and Physiology
The kidneys are paired, bean-shaped organs that are located on either side of the
spinal column in the retroperitoneal space, posterior to the abdominal cavity [15].
They play a crucial role in maintaining homeostasis by filtering blood, regulating
electrolyte balance, producing hormones, and excreting metabolic waste products
and excess water from the bloodstream.
Normally, the human body has two kidneys, identical in structure and function.
Each kidney consists of distinct anatomical regions, which are essential for its fil-
tration and regulatory functions. As illustrated in Figure 2.1, the outermost layer
consists of the renal cortex, which is where blood filtration begins. The cortex re-
gion contains the glomeruli, clusters of tiny blood vessels that initiate the filtration
process, and the proximal tubules, which reabsorb essential nutrients and regulate
fluid balance [15].
Beneath the cortex lies the renal medulla, organized into cone-shaped renal pyra-
mids. These structures contain loops of Henle and collecting ducts, both important
for concentrating urine and regulating water reabsorption. At the tip of each pyra-
mid, urine is drained into the minor calyces, which merge into the major calyces
and ultimately form the renal pelvis. The renal pelvis acts as a central collection
area, directing urine into the ureter for excretion. Together, the cortex and medulla
form the renal parenchyma, the kidney’s functional tissue.
Although the kidneys are bilaterally symmetrical in structure, they are positioned
asymmetrically. The left kidney, or sinistral (SIN) kidney, is located slightly higher
than the right, or dextral (DX) kidney, because the liver on the right side pushes the
DX kidney downward. This anatomical asymmetry, along with natural variations
in kidney size and shape due to individual differences or pathological conditions,
presents challenges for automated renal segmentation [3], [16].

Figure 2.1: Anatomical illustration of the normal renal structure, shown as a coronal
view of the left kidney. The renal parenchyma comprises the cortex and the
medullary pyramids. Created with BioRender.com.

2.1.1 Chronic Kidney Disease
CKD is a global health challenge, claiming more than 1 million lives annually [2].
This irreversible and progressive condition affects millions worldwide, severely im-
pairing kidney function and, consequently, quality of life. Despite its increasing
prevalence, CKD symptoms are subtle, and by the time they become apparent, the
disease has often progressed significantly. As a result, early-stage CKD frequently
goes undiagnosed, making timely intervention a critical challenge. The increasing
incidence and mortality rates underscore the need for better diagnostic tools and
more effective management strategies.
As CKD progresses, it can lead to end-stage renal disease (ESRD), which requires
expensive kidney replacement treatments, such as dialysis or transplantation. ESRD
presents a significant financial burden on healthcare systems worldwide, with dial-
ysis programs growing annually by 6% to 12% over the past two decades [6]. A key
barrier to early diagnosis of CKD is the lack of symptoms in the early stages of
the disease, highlighting the need for appropriate screening and diagnostic methods.
Early detection is crucial for preventing the disease’s progression and the develop-
ment of complications. MRI has emerged as a useful non-invasive tool in assessing
CKD progression. For instance, MRI helps detect tissue changes associated with
processes such as cyst formation and fibrosis, providing important information on
the extent of kidney damage and the effectiveness of treatment [17].


Renal volume is another important biomarker, as its reduction due to glomerular loss
correlates with decreased nephron function and impaired kidney filtration, leading to
complications like fluid overload and ESRD [18]. Monitoring renal volume provides
insights into disease progression, enabling clinicians to adjust treatment plans.

2.2 Magnetic Resonance Imaging
MRI is a widely used, non-invasive medical imaging technique that relies on strong
magnetic fields and radio waves to produce detailed 2D or 3D images of internal
organs and structures. It is particularly useful for imaging soft tissues such as the
brain, liver, and kidneys. By providing high-resolution images, MRI plays an important role
in diagnosing and monitoring various medical conditions, as well as understanding
anatomical and physiological changes. In the context of renal imaging, MRI enables
non-invasive tissue characterization and early detection of renal disease progression,
helping to predict clinical outcomes and guide treatment decisions [17].

2.2.1 Physical Principles of MRI
The principles behind MRI rely on the interaction between hydrogen nuclei and a
magnetic field. During an MRI scan, the strong magnetic field generated by the
scanner causes protons in the body to align along the direction of the magnetic
field, creating a net magnetization known as equilibrium magnetization. When a
radiofrequency (RF) pulse is applied, it temporarily disturbs this equilibrium by
tipping the magnetization away from the direction of the main field. As a result,
the longitudinal magnetization (Mz) is reduced, while magnetization develops in
the transverse plane (Mxy). It is this transverse
component that generates a detectable signal in the MRI receiver coils. Once the
RF pulse is turned off, the system undergoes a process known as relaxation, during
which the magnetization gradually returns to its equilibrium state due to interactions
between the hydrogen nuclei and their surrounding environment.
Relaxation occurs on two distinct time scales: Mxy decays with the T2
time constant, while Mz recovers with the T1 time constant [19]. The relaxation
times vary across tissue types and are influenced by factors such as magnetic field
strength, tissue composition, and water content. Because these relaxation times are
tissue-dependent, they form the basis for the contrast seen in the resulting images.
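
The two relaxation processes can be illustrated with the standard mono-exponential models. The sketch below is a simplified illustration that assumes a 90° excitation, so that Mz starts at zero and Mxy at its maximum. The function names and the T1 and T2 values are hypothetical and chosen only for demonstration.

```python
import math


def longitudinal_recovery(t_ms, t1_ms, m0=1.0):
    """Mz(t) after a 90-degree pulse: recovers from 0 toward M0 with time constant T1."""
    return m0 * (1.0 - math.exp(-t_ms / t1_ms))


def transverse_decay(t_ms, t2_ms, mxy0=1.0):
    """Mxy(t): decays from its initial value toward 0 with time constant T2."""
    return mxy0 * math.exp(-t_ms / t2_ms)


# Hypothetical tissue values, for illustration only.
T1_MS, T2_MS = 1000.0, 80.0

mz_at_t1 = longitudinal_recovery(T1_MS, T1_MS)  # fraction of M0 recovered at t = T1
mxy_at_t2 = transverse_decay(T2_MS, T2_MS)      # fraction of signal left at t = T2
```

After one time constant, Mz has recovered to 1 − 1/e (about 63%) of its equilibrium value, while Mxy has decayed to 1/e (about 37%) of its initial value.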

2.2.2 MRI Sequences
MRI uses certain types of sequences to highlight different structural and functional
characteristics of tissues. Each sequence is characterized by a specific combination
of timing, RF pulses, and gradient fields that manipulate the magnetic properties
of protons in the body and enable acquisition of images with precise resolution
and contrast. The choice of imaging sequence determines the type of contrast and,
consequently, the kind of physiological or anatomical information captured, such as
tissue structure, composition, or water content.

Modified Look-Locker Inversion Recovery

This project utilizes data from several MRI sequences to capture distinct features
of kidney anatomy and function. One of them is Modified Look-Locker Inversion
Recovery (MOLLI), which is a widely used sequence for T1-weighted (T1w) imaging.
Image acquisition is synchronized with the subject's heartbeat, which helps to
reduce artifacts caused by cardiac pulsation or respiratory motion [20]. As the technique
is widely available across imaging sites, MOLLI is now a well-established method in
renal T1 mapping [17].
In T1 mapping, the MOLLI sequence generates a series of images at different
inversion times (TI), where TI is the time between the inversion pulse and signal
acquisition. These images enable voxel-wise calculation of a quantitative T1 map,
in which each voxel reflects the T1 relaxation time of the underlying
tissue [21]. Shorter T1 values appear brighter, while longer values appear darker,
enabling detailed tissue characterization. In the context of renal imaging, T1 map-
ping is useful for non-invasive assessment of renal microstructure. Changes in T1
relaxation times can indicate pathological changes such as fibrosis or inflammation,
making it a promising biomarker for early-stage CKD [2].
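
To illustrate why images acquired at different TIs carry different contrast, the following sketch evaluates the ideal inversion recovery signal for a single voxel. This is a simplified model and not the fitting procedure used in this thesis: MOLLI-based T1 estimation typically fits a three-parameter model and applies a Look-Locker correction, which is not shown here. The T1 value and inversion times are hypothetical.

```python
import math


def inversion_recovery_signal(ti_ms, t1_ms, m0=1.0):
    """Ideal inversion recovery: Mz(TI) = M0 * (1 - 2 * exp(-TI / T1))."""
    return m0 * (1.0 - 2.0 * math.exp(-ti_ms / t1_ms))


T1_MS = 1200.0  # hypothetical tissue T1
inversion_times_ms = [100, 200, 400, 800, 1600, 3200]  # hypothetical TIs

# One signal value per TI; stacked over all voxels these form the image series.
signals = [inversion_recovery_signal(ti, T1_MS) for ti in inversion_times_ms]

# The signal crosses zero at TI = T1 * ln(2), so tissues with different T1
# are nulled at different inversion times.
null_ti_ms = T1_MS * math.log(2.0)
```

Because each tissue passes through its null point at a different TI, the relative brightness of two tissues changes across the image series, which is what makes the individual TI images complementary.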

T2*-weighted Imaging

Another useful MRI sequence for renal imaging is T2*-weighted (T2*w) imaging,
which is a pulse sequence that measures and displays differences in T2* relaxation
times across various tissues [22]. The main difference between the T2* relaxation
and the conventional T2 relaxation parameter is that T2* relaxation also accounts
for magnetic field inhomogeneities from susceptibility differences in various tissues.
By acquiring a series of T2*w images with varying T2* sensitivities and estimating
the T2* relaxation times through pixel-wise modeling, a T2* map can be generated
[23]. This parametric map visualizes the varying T2* values for each tissue in the
image, with fluids such as water appearing bright. The T2* relaxation time serves as
a marker of tissue oxygenation, making T2* mapping valuable for monitoring renal
oxygenation. This can in turn be useful for indicating progression of renal diseases
or evaluating the effects of drugs or treatments [24].
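
The pixel-wise modeling step can be sketched for a single voxel as follows. The example assumes a mono-exponential decay, S(TE) = S0 · exp(−TE / T2*), sampled at several echo times (TE), and estimates T2* with a log-linear least-squares fit. The model choice, the function name `fit_t2star`, and all numerical values are assumptions made for illustration; the source does not specify the fitting method.

```python
import math


def fit_t2star(echo_times_ms, signals):
    """Least-squares line fit of ln(S) against TE; the slope equals -1 / T2*."""
    n = len(echo_times_ms)
    logs = [math.log(s) for s in signals]
    mean_te = sum(echo_times_ms) / n
    mean_log = sum(logs) / n
    sxy = sum((te - mean_te) * (ls - mean_log) for te, ls in zip(echo_times_ms, logs))
    sxx = sum((te - mean_te) ** 2 for te in echo_times_ms)
    slope = sxy / sxx
    s0 = math.exp(mean_log - slope * mean_te)
    return -1.0 / slope, s0


# Synthetic, noise-free signal for one voxel (hypothetical values).
echo_times_ms = [5.0, 10.0, 15.0, 20.0, 25.0]
true_t2star_ms, true_s0 = 50.0, 1000.0
signals = [true_s0 * math.exp(-te / true_t2star_ms) for te in echo_times_ms]

t2star_ms, s0 = fit_t2star(echo_times_ms, signals)
```

Repeating this fit for every voxel yields the parametric T2* map. With noisy data, a non-linear fit or a weighted fit may be preferable to the log-linear approach shown here.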

Diffusion-weighted Imaging

Diffusion-weighted imaging (DWI) is an imaging modality that can provide additional
functional data in kidney analyses. This sequence is sensitive to the random Brownian
motion of water molecules within tissues, offering insights into tissue cell density
and structural integrity [25]. The general principle behind the DWI sequence is that
it measures diffusion-related attenuation of the MR signal. Diffusion-sensitizing
gradients are applied to make the signal sensitive to the movement of water molecules
in different directions [26]. The strength and timing of these gradients are quantified
by the b-value, a parameter that determines the degree of diffusion weighting. A
higher b-value produces stronger diffusion-related signal attenuation [25]. Darker
pixels in the DWI image thus indicate greater signal loss caused by more motion of
the water molecules.
The resulting data are used to calculate a parametric map that highlights the appar-
ent diffusion coefficient (ADC) within the tissues. The ADC map allows quantifi-
cation of water movement as contrast in the images reflects differences in diffusion,
where a higher ADC value indicates areas with less restricted diffusion and more
motion [27]. This can provide functional information that complements anatomical
imaging from other imaging sequences. Renal DWI serves as a valuable tool
for assessing renal microstructure and circulation, with ADC being a widely used
diffusion biomarker and a prognostic indicator for various kidney diseases [2].
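
As a minimal sketch of how an ADC value follows from the signal attenuation, the example below assumes the common mono-exponential model S(b) = S0 · exp(−b · ADC) and two acquisitions at different b-values. The function name and all numerical values are hypothetical, and in practice more than two b-values are often used.

```python
import math


def adc_two_point(s_low, s_high, b_low, b_high):
    """ADC from two b-values, assuming S(b) = S0 * exp(-b * ADC)."""
    return math.log(s_low / s_high) / (b_high - b_low)


# Hypothetical acquisition: b-values in s/mm^2, ADC in mm^2/s.
b_low, b_high = 0.0, 800.0
true_adc = 2.0e-3
s_low = 1000.0
s_high = s_low * math.exp(-(b_high - b_low) * true_adc)

adc = adc_two_point(s_low, s_high, b_low, b_high)
```

A voxel with freely moving water loses more signal between the two b-values and therefore receives a higher ADC, consistent with the interpretation given above.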

2.2.3 MRI Artifacts
In magnetic resonance imaging, artifacts frequently arise from equipment malfunc-
tions, choice of imaging technique, or the inherent physics of the modality [28].
Artifacts are defined as signals that do not correspond to true anatomy or as dis-
tortions and deletions of anatomical information.
There are many distinct types of MRI artifacts, each with its own characteristic
appearance. One of the most common is the motion artifact, which, as its name
implies, occurs when sudden movement in the region of interest (ROI) produces
ghosting or blurring of structures. Another frequent artifact is aliasing, which
arises when anatomy extends beyond the field of view (FOV) and its signal is wrapped
back into the image, potentially obscuring underlying tissues. When aliasing combines
with magnetic field inhomogeneities, it can give rise to the so-called zebra stripe
artifact, with alternating bright and dark bands across the FOV that may mask
anatomical detail. Additionally, the partial volume artifact arises when a single
voxel contains multiple tissue types, often due to limited resolution or minor motion,
so that the resulting signal is an average of those tissues, reducing apparent
resolution and potentially obscuring small structures. Examples of these artifacts
are shown in Figure 2.2.

(a) Motion artifact (b) Zebra stripes (c) Partial volume

Figure 2.2: Examples of common MRI artifacts that include motion artifact, zebra
stripe artifact, and partial volume effect. Images (a) and (b) are adapted from
Stadler et al. [28], while (c) is courtesy of Antaros Medical AB.


2.3 Deep Learning for Image Segmentation
Deep learning, a subset of machine learning, allows systems to learn from experience
rather than relying on pre-programmed knowledge. Deep neural networks (DNNs)
have proven successful in various computer vision tasks, such as image segmenta-
tion, where models can automatically delineate regions of interest (ROIs) [29]. This
is valuable in medical imaging, such as for organ segmentation. By training DNNs
on labeled image data, these models can efficiently segment areas in medical im-
ages, providing benefits and guidance for both diagnosis and treatment planning.
The process of training an algorithm using labeled datasets is known as supervised
learning. One of its advantages is that it generally achieves high accuracy and, with
sufficient training data, tends to outperform other approaches such as unsupervised
or semi-supervised learning in medical segmentation [30].

Among the various image segmentation techniques, semantic segmentation is one
of the most widely used in the medical field. Semantic segmentation refers to al-
gorithms that classify each pixel in an image by assigning it to a specific object
class, effectively grouping pixels that belong to the same category [31]. This pixel-
level classification enables clear identification and delineation of distinct structures
within an image. Semantic segmentation has numerous applications, particularly
in medical image analysis. It plays a crucial role in enhancing diagnostic accuracy
by identifying and segmenting ROIs, such as tumors, lesions, and anatomical struc-
tures [32]. By leveraging deep learning approaches, it enables automation of tasks
like annotation and boundary delineation, thereby improving workflow efficiency
and reducing the risk of human error.

2.3.1 Basic Concepts of Deep Learning
Deep learning algorithms are loosely inspired by the way the human brain processes
information. They use a multi-layered architecture in which artificial neurons are
interconnected between layers to learn hierarchical representations of data [33].
The artificial neuron is an essential component in a DNN that transforms an input vector into a scalar
output through a weighted sum followed by a non-linear activation. Mathematically,
it can be expressed as

\[
z_k = \sum_{i=1}^{n} w_{ki} x_i + b_k, \tag{2.1}
\]

\[
a_k = f(z_k), \tag{2.2}
\]

where $x_i$ is the $i$-th element of the input vector, $w_{ki}$ is the weight connecting input
$i$ to neuron $k$, and $b_k$ is the bias term for neuron $k$. The function $f$ denotes
the activation function, and $a_k$ is the scalar output of the neuron. In a multi-layer
network, passing information between neurons is known as forward propagation, and
is repeated across layers $l$:

\[
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{2.3}
\]

\[
a^{(l)} = f\left(z^{(l)}\right). \tag{2.4}
\]
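
Forward propagation through a small network can be sketched in a few lines of plain Python. The example below applies the layer-wise update of Equations 2.3 and 2.4 to a two-layer network. The weights, biases, and input are hypothetical values chosen so that the result is easy to verify by hand.

```python
def relu(z):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in z]


def identity(z):
    """Linear output layer (no non-linearity)."""
    return list(z)


def dense_forward(weights, bias, activations, f):
    """One layer: z = W a + b, followed by a = f(z)."""
    z = [
        sum(w * a for w, a in zip(row, activations)) + b
        for row, b in zip(weights, bias)
    ]
    return f(z)


def forward(layers, x):
    """Repeat the layer update for each (W, b, f) triple in order."""
    a = x
    for weights, bias, f in layers:
        a = dense_forward(weights, bias, a, f)
    return a


# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, 2.0]], [0.1], identity),
]
output = forward(layers, [2.0, 1.0])
```

For the input (2, 1), the hidden pre-activations are 1 and 0.5, both unchanged by ReLU, and the output is 1·1 + 2·0.5 + 0.1 = 2.1.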

Once forward propagation is completed across the entire network, the model’s output
is computed and compared to the actual target value using a function called the loss
function, denoted as L. This error function measures the difference between the
predicted and actual values, providing information that guides the model during
training to find an optimal set of parameters.
In order to train the neural network and minimize the loss L, an operation called
backpropagation is typically performed, where the gradient of the loss with respect
to the weights is computed using the chain rule:

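\[
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}. \tag{2.5}
\]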

These gradients, computed through backpropagation, are used to update the weights
in the model. By iteratively updating the weights to minimize the loss, the model
learns and improves. Training converges when the gradients approach zero, which
indicates a stationary point of the loss function, ideally a local or global minimum.
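
The chain rule in Equation 2.5 can be made concrete for a single neuron. The sketch below assumes a sigmoid activation and a squared-error loss, both chosen only for illustration, and compares the analytical gradient with a numerical finite-difference estimate. All values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, b, x, y):
    """Forward pass and dL/dw for one neuron with L = 0.5 * (a - y)^2."""
    z = w * x + b
    a = sigmoid(z)
    loss = 0.5 * (a - y) ** 2
    dl_da = a - y          # dL/da
    da_dz = a * (1.0 - a)  # derivative of the sigmoid
    dz_dw = x              # dz/dw
    return loss, dl_da * da_dz * dz_dw


w, b, x, y = 0.5, 0.1, 2.0, 1.0
loss, grad = loss_and_grad(w, b, x, y)

# Central finite difference as an independent check of the chain rule.
h = 1e-6
numeric_grad = (loss_and_grad(w + h, b, x, y)[0] - loss_and_grad(w - h, b, x, y)[0]) / (2 * h)

# One plain gradient descent step with a hypothetical learning rate.
w_new = w - 0.1 * grad
new_loss = loss_and_grad(w_new, b, x, y)[0]
```

The analytical and numerical gradients agree to within numerical precision, and a single update step in the negative gradient direction lowers the loss.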
For these gradients to be applied during training, they are processed by a component
called an optimizer, which adjusts the model’s weights to minimize the loss function.
Optimizers are typically based on variants of gradient descent, which updates the
parameters in the direction of decreasing error. Among the many available optimiz-
ers, one of the most commonly used with CNNs and image segmentation tasks is
the Adam optimizer. Adam is a type of stochastic gradient descent optimizer that
adaptively adjusts the learning rate for each parameter using estimates of the first
and second moments of the gradients [34]. The algorithm maintains running averages
of both the gradients and their squared values, corrects the bias introduced by
initializing these averages at zero, and uses the corrected estimates to scale the parameter updates.
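
A single Adam update for one scalar parameter can be sketched as follows. The default values of `beta1`, `beta2`, and `eps` follow the commonly used settings, while the learning rate and the toy objective f(w) = (w − 3)² are assumptions made for illustration and are not the settings used in this thesis.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (starting from t = 1)."""
    m = beta1 * m + (1.0 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction (averages start at zero)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
first_step_w = None
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    if t == 1:
        first_step_w = w
```

Thanks to the bias correction, the very first update has a magnitude close to the learning rate regardless of the gradient scale, and the parameter then settles near the minimum at w = 3.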

2.3.2 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a type of artificial neural network that
has become especially important for imaging purposes [35]. A CNN consists of
several building blocks called layers, each with a specific role in processing
data for the network.

Convolutional Layer

The convolutional layer is the primary component of a CNN and the origin of its
name [35]. The main purpose of this layer is feature extraction, which is achieved by
applying a convolution operation to the input data using a set of learnable filters,
also called kernels. As each kernel slides over the input and produces a feature map
by capturing local patterns, the network captures spatial hierarchies of features,
allowing it to recognize increasingly complex patterns.


The convolution operation, illustrated in Figure 2.3, is a linear operation that in-
volves taking the element-wise product of the kernel and a local region of the input.
These values are summed up into a single output value at the corresponding position
in the resulting feature map. By applying multiple kernels, the network can extract
different types of features from the input image in each layer, such as edges,
textures, or shapes. In a standard convolutional layer, each kernel spans all input
channels, and the per-channel responses are summed so that each kernel produces a
single feature map. A convolutional layer is characterized by its kernel size,
stride, and padding, with stride determining how many pixels the kernel moves at
each step, and padding adding extra borders around the input to control the spatial
dimensions of the output.
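
The convolution operation on a single-channel image can be sketched in plain Python. As in most deep learning frameworks, the kernel is applied without flipping (strictly speaking a cross-correlation). The function below assumes a square kernel and zero padding; the image and the vertical-edge kernel are hypothetical examples.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution (no kernel flip) with zero padding."""
    if padding > 0:
        width = len(image[0]) + 2 * padding
        padded = [[0] * width for _ in range(padding)]
        padded += [[0] * padding + list(row) + [0] * padding for row in image]
        padded += [[0] * width for _ in range(padding)]
        image = padded
    k = len(kernel)  # square kernel assumed
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            # Element-wise product of kernel and local region, summed to one value.
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
vertical_edge_kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d(image, vertical_edge_kernel)          # 2x2 output
same_size_map = conv2d(image, vertical_edge_kernel, padding=1)  # 4x4 output
```

With a 3×3 kernel, a stride of 1, and no padding, the 4×4 input shrinks to a 2×2 feature map, while a padding of 1 preserves the input size.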

Figure 2.3: An example of a convolutional layer with a kernel size of 3×3, a stride
of 1, and no padding.

Pooling Layer

Pooling layers are another essential component of CNN architectures [35]. Their
primary function is to reduce the spatial dimensions of feature maps, while keeping
the most important information. This reduces the number of values the network
needs to process in subsequent layers, which lowers the computational load and
makes the network less likely to overfit.
One commonly used pooling layer is max pooling, illustrated in Figure 2.4. It
works by dividing the input feature map into smaller patches and keeping only the
maximum value from each patch in the output. The stride controls how far the patch
moves at each step; larger strides result in greater downsampling and smaller output
dimensions. Downsampling can also be achieved by using strided convolutions.
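
Max pooling can be sketched in the same style. The feature map below is a hypothetical 4×4 example, pooled with a 2×2 window and a stride of 2.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size patch, moving by `stride` pixels."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool2d(feature_map)
```

Each 2×2 patch is reduced to its largest value, so the 4×4 input becomes the 2×2 output [[6, 4], [8, 9]].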

Figure 2.4: Max pooling layer with kernel size 2×2 and a stride of 2.


Dropout Layer

Overfitting occurs when a model learns the training data too closely and fails to
generalize to unseen data. The dropout layer is an important architectural component
for counteracting this. Dropout, introduced as a regularization technique by Hinton et
al. [36], works by randomly omitting hidden units in the network during training
with a fixed probability. This prevents any single hidden unit from relying too
heavily on the presence of others, which helps improve generalization and reduce
overfitting.
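
The following sketch shows the so-called inverted dropout variant, which is common in modern frameworks: retained activations are rescaled by 1 / (1 − p) during training so that no adjustment is needed at inference time. This is an assumption about the variant; the original formulation instead rescales the weights at test time. The drop probability and input values are hypothetical.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed for reproducibility
activations = [1.0] * 10000

dropped = dropout(activations, p=0.5, training=True, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
inference_output = dropout(activations, p=0.5, training=False, rng=rng)
```

During training, roughly half of the units are set to zero and the remaining ones are doubled, so the expected activation is preserved. At inference time the input passes through unchanged.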

Activation Functions

Activation functions are applied after convolutional layers to introduce non-linearity
into the model, allowing the neural network to capture complex relationships within
the data [35]. Among the various activation functions, the Rectified Linear Unit
(ReLU) is widely used in CNNs, as it helps reduce the vanishing gradient problem,
accelerates training, and enables the model to learn more complex patterns. The
ReLU function outputs the input value for positive inputs and zero for negative
inputs, mathematically defined as

f(x) = max(0, x). (2.6)

To address the issue of zero gradients in the negative input range of ReLU, as shown
in Equation 2.6, the Parametric ReLU (PReLU) activation function can be used.
The PReLU function is defined as

\[
f(x) =
\begin{cases}
x, & \text{if } x > 0 \\
\alpha x, & \text{if } x \le 0
\end{cases}, \tag{2.7}
\]

where α is a learnable parameter. Unlike the standard ReLU, which outputs zero
for all negative inputs, PReLU introduces a small, non-zero slope for negative input
values through α, enabling gradients to propagate and improving training stabil-
ity. This adaptive mechanism has been shown to enhance model performance with
minimal additional computational cost [37]. The difference between the ReLU and
PReLU activation functions is illustrated in Figure 2.5.
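
The difference between the two functions, and in particular between their gradients for negative inputs, can be sketched as follows. The value α = 0.25 is only an example; in PReLU, α is learned during training.

```python
def relu(x):
    """Equation 2.6: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: zero for all non-positive inputs."""
    return 1.0 if x > 0 else 0.0


def prelu(x, alpha):
    """Equation 2.7: identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x


def prelu_grad(x, alpha):
    """Gradient of PReLU with respect to x: non-zero also for negative inputs."""
    return 1.0 if x > 0 else alpha


alpha = 0.25  # example value; learnable in practice
inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]
prelu_out = [prelu(x, alpha) for x in inputs]
```

For the input −2, ReLU returns 0 with zero gradient, whereas PReLU returns −0.5 with gradient 0.25, so the weights feeding this unit can still be updated.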
In the final classification layer, the activation function typically differs from those
used in earlier layers, as it determines the predicted class of each input [35]. In
binary classification tasks, the sigmoid function is commonly used, producing a single
probability value that indicates the likelihood of the input belonging to the
foreground class. For multi-class problems, the softmax function is used
instead. This function normalizes the raw output of the final layer, converting
it into multiple class probabilities, where each value ranges between 0 and 1
and the sum of all values equals 1. The predicted class is then the one with the
highest probability. Visual
representations of both activation functions can be seen in Figure 2.6.
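
Both output activations can be sketched for a single pixel. The softmax implementation subtracts the maximum value before exponentiation, a standard trick for numerical stability that does not change the result. The raw output values and the class order (background, cortex, medulla) are hypothetical.

```python
import math


def sigmoid(z):
    """Maps a single raw output to a foreground probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(raw_outputs):
    """Maps raw outputs to class probabilities that sum to 1."""
    shift = max(raw_outputs)  # for numerical stability
    exps = [math.exp(v - shift) for v in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for one pixel: background, cortex, medulla.
raw_outputs = [2.0, 1.0, 0.1]
probabilities = softmax(raw_outputs)
predicted_class = probabilities.index(max(probabilities))

foreground_probability = sigmoid(0.0)
```

Here the first class receives the highest probability and is selected as the prediction, while a raw output of 0 passed through the sigmoid corresponds to a foreground probability of exactly 0.5.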


Figure 2.5: Comparison of activation functions: (a) ReLU, and (b) PReLU.


Figure 2.6: Illustration of activation functions commonly used for classification:
(a) Sigmoid function, and (b) Softmax function.

2.3.3 U-Net
Among the most commonly used CNN architectures for semantic segmentation is
the U-Net, first introduced in 2015 [38]. Originally developed for biomedical image
analysis, U-Net was designed to perform well even with limited training data, making
it particularly useful in medical imaging applications where annotated datasets are
often scarce. U-Net has since become a popular choice for segmenting both 2D and
3D medical images from different modalities, including MRI and CT [39].
The U-Net has a characteristic U-shaped architecture consisting of a contracting
path (encoder) and an expansive path (decoder). The encoder extracts features to learn a
more compressed representation of the input. It is built from stacked encoder blocks,
which help the network learn increasingly abstract representations of the input
image at each level. Each encoder block consists of a series of convolutional layers,
with a ReLU activation function following each convolution. Max-pooling operations
are applied at the end of each block to downsample the feature maps.


The decoder then reconstructs the segmentation mask by gradually increasing the
resolution using transposed convolutions. Skip connections link each encoder block
to its corresponding decoder block, so that the network combines both high- and
low-level features at each level. This preserves spatial information and improves
segmentation accuracy. Each decoder block mirrors its encoder counterpart, consisting
of a series of convolutional layers with ReLU activations. Finally, an additional
1×1 convolutional layer is used to produce pixel-wise predictions, resulting in the
predicted segmentation mask.
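
To make the data flow through the encoder, the skip connections, and the decoder more tangible, the sketch below tracks only the feature map shapes of a U-Net, without any learned weights. It rests on several assumptions that are not taken from the source: padded convolutions that preserve the spatial size (the original U-Net used unpadded convolutions), 2×2 pooling, a doubling of channels at each level, and an input size divisible by 2 to the power of the depth. The function name and configuration are hypothetical.

```python
def unet_feature_shapes(input_size, base_channels, depth):
    """Track (channels, height, width) through a U-Net with padded convolutions."""
    encoder = []
    size, channels = input_size, base_channels
    for _ in range(depth):
        encoder.append((channels, size, size))  # block output, reused as skip connection
        size //= 2        # 2x2 max pooling halves the resolution
        channels *= 2     # channel count doubles at the next level
    bottleneck = (channels, size, size)

    decoder = []
    for skip_channels, skip_size, _ in reversed(encoder):
        # The transposed convolution doubles the resolution and halves the channels;
        # concatenation with the skip connection doubles the channel count again.
        concatenated = (2 * skip_channels, skip_size, skip_size)
        output = (skip_channels, skip_size, skip_size)
        decoder.append((concatenated, output))
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_feature_shapes(input_size=256, base_channels=32, depth=4)
```

With these settings, the encoder reduces a 256×256 input to a 16×16 bottleneck with 512 channels, and the decoder restores the full 256×256 resolution, at which the final 1×1 convolution would produce one output channel per class.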

2.3.4 ResUNet
ResUNet is a deep learning model that builds on the U-Net architecture by incorpo-
rating residual units. This model was first introduced by Diakogiannis et al. in 2020
[40]. ResUNet retains the encoder–decoder structure of U-Net but replaces stan-
dard convolutional blocks with residual blocks. These residual units enable better
gradient flow during training and make it possible to construct deeper architectures
without suffering from vanishing gradient issues. The result is a more stable and
efficient model capable of capturing both fine-grained details and high-level semantic features.
As the depth of a neural network increases, the training process can become more
difficult due to issues such as vanishing gradients and model degradation. Residual
connections help address these challenges by allowing the input of a set of layers to
bypass those layers and be directly added to the output [41]. The residual connection
can be represented by the equation

H(x) = F (x) + x, (2.8)

where H(x) represents the desired mapping, and F (x) is the residual function
learned by the network.
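
Equation 2.8 can be sketched with a residual function acting on a vector. The two residual functions below are hypothetical stand-ins for the convolutional layers inside a residual block.

```python
def residual_block(x, residual_fn):
    """H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A residual function that has learned nothing yet: F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical learned correction: F(x) = 0.5 * x."""
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_output = residual_block(x, zero_residual)
adjusted_output = residual_block(x, small_residual)
```

When the residual function outputs zero, the block reduces to the identity mapping, so adding such a block cannot make the representation worse. The layers only need to learn the correction relative to the input.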
There are many advantages to using residual units. They help reduce training error
in deeper architectures, as the identity mapping increases the likelihood of finding
suitable initial parameters. An additional benefit of residual connections is that
they provide a shortcut for gradients during the backpropagation process, allowing
deeper networks to learn more effectively.

2.3.5 Loss Functions for Medical Image Segmentation
The loss function is an essential part of training neural networks, as it quantifies the
error between predicted and actual values. Different problems require different loss
functions, and in medical image segmentation, several loss functions are commonly
used, including cross entropy, Dice loss, Tversky loss, and their variants [42]. Choos-
ing the right loss function is particularly important in medical image segmentation,
where datasets often have a class imbalance. This imbalance occurs when there are
far fewer foreground pixels than background pixels, and is common when
the ROI is small in comparison to the full image volume. A suitable loss function
helps the model focus on both the foreground and background, which helps improve
the accuracy of the segmentation.
The Dice loss, introduced by Milletari et al. [43], is commonly used in medical image
segmentation because it effectively handles class imbalances. The loss function is
based on the Dice coefficient (explained further in Section 2.3.6), which is a statistical
measure of overlap between two sets. The Dice loss, $L_{\text{Dice}}$, is calculated as follows:

\[
L_{\text{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{y}_i^2 + \epsilon}, \tag{2.9}
\]

where $N$ corresponds to the total number of voxels in the image, $\hat{y}_i$ is the predicted
probability of voxel $i$ belonging to the foreground, and $y_i$ is the ground truth value
for voxel $i$. The term $\epsilon$ is a small smoothing term added to avoid division by zero in
background-only cases, where both the prediction and the ground truth are empty.
The Dice loss is typically calculated for each class separately, and the average loss
across all classes is then used during training.
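
A direct implementation of Equation 2.9 on flattened voxel lists is sketched below, together with the class-wise averaging described above. The value of ϵ and the example masks are hypothetical.

```python
def dice_loss(y_true, y_pred, eps=1e-5):
    """Equation 2.9 for one class on flattened voxel lists."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    denominator = sum(t * t for t in y_true) + sum(p * p for p in y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def mean_dice_loss(y_true_per_class, y_pred_per_class, eps=1e-5):
    """Average of the class-wise Dice losses."""
    losses = [
        dice_loss(t, p, eps) for t, p in zip(y_true_per_class, y_pred_per_class)
    ]
    return sum(losses) / len(losses)


ground_truth = [1.0, 1.0, 0.0, 0.0]

perfect_loss = dice_loss(ground_truth, [1.0, 1.0, 0.0, 0.0])
disjoint_loss = dice_loss(ground_truth, [0.0, 0.0, 1.0, 1.0])
empty_loss = dice_loss([0.0] * 4, [0.0] * 4)  # background-only case
soft_loss = dice_loss(ground_truth, [0.8, 0.6, 0.3, 0.1])
```

A perfect prediction gives a loss of 0 and a completely disjoint prediction a loss close to 1. In the background-only case, the smoothing term makes the fraction equal to 1, so the loss is 0 rather than undefined.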

2.3.6 Evaluation Metrics
Evaluation metrics are used to quantitatively assess the performance of segmen-
tation algorithms. In the context of medical image segmentation, various metrics
exist, each suited to different types of segmentation tasks and applications. For eval-
uating segmentation overlap, metrics are typically based on pixel-wise comparisons
between the predicted segmentation mask and the ground truth, based on certain
classification outcomes. True positives (TP) refer to the number of pixels correctly
identified as belonging to the foreground object of interest, false positives (FP) are
background pixels that are incorrectly labeled as foreground, and false negatives
(FN) are foreground pixels missed by the model, misclassified as background. The
relationship between these classification outcomes is visualized in Figure 2.7.

Figure 2.7: Illustration of the relationship between ground truth and predicted
classifications. The overlapping region represents true positives (TP), where the
prediction correctly matches the ground truth. The left (blue) region denotes false
negatives (FN), where actual positives were missed. The right (red) region indicates
false positives (FP), where the background is incorrectly segmented as foreground.


Two of the most commonly used metrics that rely on these classification outcomes
are the Dice score and the Jaccard index, also known as Intersection over Union
(IoU). Both metrics quantify the similarity between predicted and ground truth
masks. The Dice score is defined as twice the area of overlap between the predicted
and ground truth masks, divided by the total pixel area of both masks:

\[
\text{Dice score} = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|} = \frac{2TP}{2TP + FP + FN}. \tag{2.10}
\]

The IoU provides a slightly stricter measure of overlap and is defined as the area of
overlap divided by the area of the union of the predicted and ground truth masks:

\[
\text{IoU} = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|} = \frac{TP}{TP + FP + FN}. \tag{2.11}
\]

While both metrics assess how closely the prediction matches the ground truth, the
IoU penalizes over-segmentation and under-segmentation more strongly than the
Dice score, making it particularly useful in applications requiring precise delineation.
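
Both metrics can be computed from the classification outcomes of two binary masks, as sketched below with hypothetical flattened masks. The sketch assumes that at least one foreground pixel is present in either mask; otherwise the denominators are zero and the metrics need special handling.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def dice_score(tp, fp, fn):
    """Equation 2.10."""
    return 2 * tp / (2 * tp + fp + fn)


def iou(tp, fp, fn):
    """Equation 2.11."""
    return tp / (tp + fp + fn)


# Hypothetical flattened masks (1 = foreground, 0 = background).
ground_truth = [1, 1, 1, 1, 0, 0, 0, 0]
prediction = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, fn = confusion_counts(ground_truth, prediction)
dice = dice_score(tp, fp, fn)
jaccard = iou(tp, fp, fn)
```

With three true positives, one false positive, and one false negative, the Dice score is 0.75 and the IoU is 0.6. The two metrics are related by IoU = Dice / (2 − Dice), which is why the IoU is never higher than the Dice score for the same prediction.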

2.4 Multi-channel Image Segmentation
In deep learning-based image segmentation, multi-channel and multi-modal input
strategies have proven effective in enhancing model performance [10]. A multi-
channel image consists of multiple correlated image channels, each capturing differ-
ent aspects of the same scene or object. For instance, this could be images acquired
with different cameras, time points, or acquisition parameters. These channels are
stacked as separate input layers, allowing the network to process them simultaneously.
By combining multiple input channels, each pixel is represented by a multi-dimensional
vector instead of a single intensity value, which helps CNNs learn
complementary feature representations.
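
The idea of stacking channels can be sketched with nested lists in a channels-first layout (channels, height, width), which is the layout commonly used by deep learning frameworks. The three 2×2 images below are hypothetical and stand for the same anatomy acquired with three different acquisition settings.

```python
def stack_channels(channels):
    """Stack equally sized 2D images into a (C, H, W) nested list."""
    height, width = len(channels[0]), len(channels[0][0])
    for image in channels:
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError("all channels must have the same spatial size")
    return [[list(row) for row in image] for image in channels]


def pixel_vector(stacked, row, col):
    """Return the multi-dimensional vector describing one pixel."""
    return [image[row][col] for image in stacked]


# Hypothetical intensities of the same 2x2 region in three acquisitions.
channel_a = [[10, 12], [50, 52]]
channel_b = [[80, 82], [40, 42]]
channel_c = [[95, 97], [90, 92]]

stacked = stack_channels([channel_a, channel_b, channel_c])
top_left = pixel_vector(stacked, 0, 0)
bottom_left = pixel_vector(stacked, 1, 0)
```

In this example, the two pixels differ only moderately in the last channel (95 versus 90) but have clearly different three-component vectors, [10, 80, 95] versus [50, 40, 90]. This is the kind of complementary information the first convolutional layer can exploit when it combines all input channels.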
In medical image analysis, this becomes particularly useful. Many anatomical struc-
tures are complex, overlapping, or poorly defined in single-channel images. To over-
come this, scans are often acquired with varying acquisition parameters. These
variations help enhance tissue contrast and allow for the extraction of quantitative
measurements like parametric mapping. Multi-modal input further extends the con-
cept of multi-channel input by combining data from different imaging modalities,
or from MR sequences with different acquisition parameters. As each image pro-
vides unique and complementary information about different tissues, the network
learns from images with richer contrast information, which could be useful for
differentiating between complex structures. Research indicates that networks trained
on multi-channel or multi-modal data often generalize better, particularly across
patients and imaging conditions [10], [32]. They reduce the risk of overfitting to
channel- or modality-specific features and can even be advantageous for training on
smaller datasets, thanks to the diversity of input information to learn from.


Figure 2.8: Example of a multi-channel input image with five different image
channels from renal MRI data. The channels provide different views of the same
anatomy with complementary tissue contrast information.

3
Methods

To address the challenges of manual kidney segmentation, this thesis proposes a
deep learning-based method for automated segmentation of the kidney parenchyma,
cortex, and medulla. Due to the often limited contrast between these internal kid-
ney structures in certain MRI sequences, a multi-channel approach is employed to
leverage complementary information from multiple modalities or images acquired
with different parameter settings.
The method was implemented using the Medical Open Network for AI (MONAI),
an open-source platform optimized for deep learning in medical imaging [44]. The
experimental workflow included three main experiments: single-channel segmenta-
tion of parenchyma and cortex from T1-MOLLI images, extension to multi-channel
using multiple TIs, and finally, multi-modal integration of DWI and T2* mapping.
An overview of the proposed segmentation pipeline is visualized in Figure 3.1.

Figure 3.1: Overview of proposed pipeline for kidney segmentation.


3.1 Data Description
The dataset used in this study consists of kidney MR images provided by Antaros
Medical AB. The clinical data includes a total of 30 patients, each diagnosed with
CKD at varying stages. The patients were divided between two scanner sites, re-
ferred to as site A and site B. All patients underwent MRI scans at two separate
visits, with each patient being scanned at the same site for both visits. Hence, there
were 60 scans available for the study.
MRI acquisitions were performed using 1.5 Tesla scanners (Siemens for site A and
GE for site B). Each site acquired scans for three different sequences, T1-MOLLI,
DWI, and T2* mapping, with slightly varying acquisition parameters between the
two sites. All scans consisted of coronal 2D slices, acquired in sequence, with in-plane
pixel resolution ranging from 1.5×1.5 mm² to 1.9531×1.9531 mm², reflecting
differences in scanner settings and FOV requirements. These slice-wise acquisitions
were stacked into a 3D image volume, which was then stored and provided in VTK
format. Table 3.1 provides a full overview of the imaging parameters for each
modality used in this study, including the number of slices, slice spacing, image
resolution, and the corresponding parametric maps generated.

Table 3.1: Summary of imaging parameters by modality, reflecting the variability
in acquisition protocols depending on the modality and the scanner setup.

Modality   Number of slices   Slice spacing   Image resolution (site)     Parametric map
T1-MOLLI   9-19               5 mm            288×288 (A), 256×256 (B)    T1
DWI        5                  10 mm           210×210 (A), 256×256 (B)    ADC
T2*        5                  10 mm           288×288 (A), 512×512 (B)    T2*

The T1-MOLLI images were acquired at multiple TIs, ranging from 174 ms to 4452
ms. DWI images were acquired at multiple b-values, ranging from 0 to 500 s/mm².
The T2* mapping image data were acquired at multiple TEs, ranging from 3 ms to
62 ms. For each patient scan, the dataset includes MRI data from all sequences, their
respective parametric maps, and manually segmented ground truth labels for the
cortex and parenchyma. These segmentations were manually delineated by trained
image analysts. Additionally, all images and parametric maps were registered prior
to this study to ensure spatial alignment across modalities. Specifically, images
were registered using the T1-MOLLI image at TI = 1300 ms as the fixed reference
image. Example images from the T1-MOLLI, DWI, and T2* sequences acquired
with varying parameters can be found in Appendix A.


3.2 Data Exclusion
The initial step in the data processing involved handling the input data to support
downstream training of the segmentation algorithm. Images from the dataset
were manually reviewed to assess image quality and to take an inventory of which
imaging modalities and parametric maps were available for each patient visit. Some
visits lacked one or more MRI sequences or were missing corresponding segmentation
labels, which are essential for supervised learning. Thus, as a first processing
step, the data were inspected and scans with incomplete image data were excluded.
Additionally, images were excluded due to image quality issues, such as low
contrast, large imaging artifacts, or abnormalities affecting overall clarity or
visualization of the kidney. These exclusion criteria can be summarized as follows:

1. incomplete MRI data (e.g. missing registered T1-MOLLI images or T1 map)
2. missing or incomplete ground truth segmentation maps (e.g. missing cortex segmentation)
3. inconsistent slice count between images and corresponding segmentation map
4. poor image quality, such as pronounced artifacts (e.g. zebra stripes) or low contrast
5. presence of large renal lesions (∅>30 mm)

After applying these exclusion criteria, the final dataset consisted of 37 scans (8
from site A and 29 from site B).

3.3 Dataset Split
Given the limited size of the dataset after exclusion, the split was set to include
as many training samples as possible, while still having data for validation and
test available. To prevent data leakage, the dataset was split scan-wise, meaning
that no data from the same patient scan appears in more than one subset. The
MRI dataset was randomly split into three subsets for training (27 scans, approximately 75%),
validation (6 scans, approximately 15%), and testing (4 scans, approximately 10%). Efforts were made to ensure
a balanced distribution of renal volumes across the data splits.
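A scan-wise random split of this kind can be sketched as below. The scan identifiers, the function name, and the fixed seed are hypothetical, and the sketch does not reproduce the additional balancing of renal volumes mentioned above.

```python
import random


def split_scans(scan_ids, n_val, n_test, seed=0):
    """Randomly assign whole scans to train/val/test so that no scan is shared."""
    ids = list(scan_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# Hypothetical identifiers for the 37 scans remaining after exclusion.
scans = [f"scan_{i:02d}" for i in range(37)]
train, val, test = split_scans(scans, n_val=6, n_test=4)
print(len(train), len(val), len(test))  # 27 6 4
```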

3.4 Data Pre-processing
Before being input into the segmentation network, the image volumes went through
several pre-processing steps. This pipeline included generating binary ground truth
masks from manual segmentations, scaling and normalizing brightness and intensity
values, and applying data augmentation techniques to increase variability in the
dataset using the MONAI library. To further address the limited size of the dataset,
a patch-based approach was used during training. This involved randomly cropping
the original images into smaller patches, helping the model learn from a greater
variety of spatial contexts.


3.4.1 Creation of Ground Truth Masks
The manually segmented ground truth masks were created by trained image analysts
at Antaros Medical, who annotated the original MR images using green and cyan
overlays to indicate the parenchyma and cortex in the left and right kidneys. These
color-coded annotations served as the reference for generating binary segmentation
labels. To convert the color segmentations into segmentation labels that can serve
as ground truth during training of the CNN, a series of transforms was applied.
The first transform used was ForegroundMaskd, with the input being the color-annotated
ground-truth segmentation. Within this transform, a custom HSV filter
was applied to isolate the green and cyan annotations from the grayscale background,
effectively removing non-relevant pixels. This process generated a binary mask for
each endpoint, which was then used as a label for the network.
An example of a manually segmented ground truth mask and the resulting binary
mask for both parenchyma and cortex is shown in Figure 3.2.
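The principle behind the HSV filter can be sketched as follows. This is a dependency-free illustration and not the MONAI transform itself; the saturation threshold and hue range are assumed values chosen so that pure green and cyan pass while gray pixels are rejected.

```python
import colorsys


def annotation_mask(rgb_image, min_saturation=0.5, hue_range=(0.25, 0.60)):
    """Binary mask of strongly colored pixels in an RGB image (values in 0..1).

    Grayscale background pixels have a saturation close to 0 and are rejected.
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            hue, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
            is_overlay = (saturation >= min_saturation
                          and hue_range[0] <= hue <= hue_range[1])
            mask_row.append(1 if is_overlay else 0)
        mask.append(mask_row)
    return mask


gray, green, cyan = (0.4, 0.4, 0.4), (0.0, 1.0, 0.0), (0.0, 1.0, 1.0)
image = [[gray, green], [cyan, gray]]
print(annotation_mask(image))  # [[0, 1], [1, 0]]
```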

Figure 3.2: Example of ground truth segmentation and corresponding binary
masks after applied transformations. From left to right: parenchyma ground truth,
parenchyma binary mask, cortex ground truth, and cortex binary mask. The manually
segmented ground truth image shows the left kidney (cyan) and right kidney (green),
annotated by image analysts.

3.4.2 Brightness and Intensity Scaling
The MR acquisition and processing technique assigns intensity values to each voxel,
reflecting the strength of the signal in the specific region. However, these values can
vary significantly between scans due to various external factors, including scanner
settings, patient movement, or background noise. Tissue pathology
also influences image contrast. For example, patients with CKD often exhibit reduced
tissue contrast compared to healthy individuals. This inconsistency can affect
overall brightness and intensity in acquired images, and generate differences even
across scans from the same patient.
When combining data from different MRI sequences or modalities, these variations
become even more pronounced. Each sequence can produce images with varying
intensity distributions and contrasts, even for the same anatomical structures. To
address this, and to be able to combine images with various intensities as multi-channel
input in this work, histogram normalization was applied to the MRI data
using HistogramNormalized. This technique redistributes intensity values to enhance
contrast and rescales them to the range [0, 1], which improved the visibility of
renal structures like cortex and medulla, while making intensity values more consis-
tent across the dataset.
For the parametric maps, which have very different intensity distributions compared
to standard MR images, a different strategy was used. The ScaleIntensityRanged
transform was applied individually to each parametric map. Based on inspection of
the corresponding intensity histograms, specific input ranges were chosen and scaled
to the [0, 1] interval. This approach enabled customized normalization, ensuring
consistent scaling across maps despite large variations in raw intensity values.
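The linear mapping applied to each parametric map can be written out as a short sketch. The argument names follow those of the MONAI transform, while the example T1 values and the 0 to 3000 ms input range are hypothetical and not the ranges used in this work.

```python
def scale_intensity_range(values, a_min, a_max, b_min=0.0, b_max=1.0, clip=True):
    """Linearly map [a_min, a_max] onto [b_min, b_max], optionally clipping."""
    scaled = [(v - a_min) / (a_max - a_min) * (b_max - b_min) + b_min
              for v in values]
    if clip:
        scaled = [min(max(v, b_min), b_max) for v in scaled]
    return scaled


# Hypothetical T1 values in ms, scaled from an assumed range of 0-3000 ms.
print(scale_intensity_range([0.0, 1500.0, 3000.0, 4500.0], a_min=0.0, a_max=3000.0))
# [0.0, 0.5, 1.0, 1.0]
```

Values above the chosen input range are clipped to 1, which keeps outliers in the raw maps from compressing the range occupied by renal tissue.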

3.4.3 Splitting into 2D Slices
The proposed model operates on 2D input slices extracted from volumetric (3D)
images. To accommodate this input format, a pre-processing step was required to
convert each 3D volume into a series of 2D slices. This was achieved by applying
a custom function that systematically iterates through the depth dimension of each
volume, extracting one slice at a time. As a result, each 3D image was converted into
a series of 2D slices matching the original acquisitions, significantly increasing the
total number of training samples available for the model. Additionally, all slices were
resized to 256×256 pixels to ensure consistent image dimensions throughout the dataset.

3.4.4 Augmentations
Data augmentation is a widely used technique in machine learning, particularly use-
ful when working with limited datasets. It helps improve model generalization by
introducing variability into the training data. In this work, augmentations were
implemented using MONAI’s dictionary-based transforms, ensuring consistent ap-
plication to both images and their corresponding labels.
A key strategy to address the limited dataset was using a patch-based approach,
where the original training images were randomly cropped into smaller patches. This
not only increased the number of training samples but also ensured the patches were
small enough to fit into GPU memory. Specifically, the RandCropByPosNegLabeld
transform was used to extract patches based on a defined ratio of foreground and
background. For each training image, four patches of size 160×160 were generated.
Given the relatively small size of the kidneys in the full input image, each patch
had a probability of p = 0.5 of being centered on a foreground pixel, which ensured
sufficient coverage of the ROI.
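The sampling rule can be illustrated with a small dependency-free sketch. It is not the MONAI implementation; the function name, the 8×8 label, and the 4×4 patch size are hypothetical and only demonstrate the foreground/background choice and the clamping of patches to the image.

```python
import random


def sample_patch_origin(label, patch, p_foreground=0.5, rng=random):
    """Pick the top-left corner of a patch centered on a fg or bg pixel.

    The center is drawn from foreground pixels with probability p_foreground,
    otherwise from background, and the patch is shifted to stay inside the image.
    """
    height, width = len(label), len(label[0])
    fg = [(r, c) for r in range(height) for c in range(width) if label[r][c]]
    bg = [(r, c) for r in range(height) for c in range(width) if not label[r][c]]
    use_fg = (bool(fg) and rng.random() < p_foreground) or not bg
    center_r, center_c = rng.choice(fg if use_fg else bg)
    top = min(max(center_r - patch[0] // 2, 0), height - patch[0])
    left = min(max(center_c - patch[1] // 2, 0), width - patch[1])
    return top, left


rng = random.Random(0)
label = [[0] * 8 for _ in range(8)]
label[5][6] = 1  # a single hypothetical foreground pixel
origins = [sample_patch_origin(label, (4, 4), rng=rng) for _ in range(4)]
print(origins)
```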
To further increase the size and variability of the training data, additional augmen-
tations were applied to the extracted patches. These included random scaling, rota-
tion and zooming to simulate anatomical differences. Additionally, random contrast
adjustments were introduced to account for variations in acquisition conditions.


3.5 Network Configuration
A 3D ResUNet architecture proposed by Inoue et al. (2023) was identified as a strong
baseline for this thesis in kidney segmentation, due to its demonstrated performance
on a similar task and its use of residual connections to enhance feature learning [7].
The proposed architecture, seen in Figure 3.3, follows a U-Net structure with four
downsampling steps. Each step includes a max pooling layer that reduces the spatial
dimensions by a factor of two, resulting in a total downsampling factor of 2⁴ = 16. The
number of filters in the convolutional layers increases progressively, using 16, 32,
128, and 256 filters, respectively. A key feature of the network is the inclusion of
two residual connections for each convolutional block, which, as mentioned in the
introduction, helps maintain gradient flow and improve training stability. While the
original ResUNet was designed for 3D segmentation, this project implements a 2D
adaptation of the architecture. Additionally, dropout layers were added to help
prevent overfitting during training, which was particularly important given the lim-
ited size of the dataset. The proposed network configuration is designed to handle
multi-dimensional inputs, making it suitable for processing multi-channel data.

Figure 3.3: Overview of proposed 2D ResUNet architecture. The left side shows
the downsampling path with residual units, where convolutional channels increase
from 16 to 256 to extract features. The right side corresponds to the upsampling
path, where channels decrease from 256 to the number of output classes, combined
with skip connections.


3.6 Model Implementation and Experimental Setup

The following subsections describe the experimental setup and implementation details
of the proposed segmentation models based on the above-mentioned network architecture.
This includes the design of three main experiments conducted to evaluate
different segmentation approaches, and the types of MRI input data used, including
single-modality, multi-channel, and multi-modal setups.

3.6.1 Single-Channel T1-MOLLI-based Segmentation
The primary experiment evaluated the performance of the proposed network configu-
ration on T1w kidney images, specifically its ability to accurately segment parenchyma
and cortex. The T1-MOLLI sequence was selected as a starting point due to its
higher corticomedullary differentiation (CMD), providing visible contrast between
the cortex and medulla.
Using the above-mentioned network architecture, two distinct segmentation strate-
gies were implemented and compared. The 2D slice-wise approach was compared
to a 3D implementation of the network, similar to the one proposed in the original
reference work. This setup allowed for a comparative study between a volumetric
segmentation method and a slice-wise approach. Experiments were performed to
investigate whether spatial relationships between adjacent slices could be leveraged
to improve segmentation performance.
To ensure robustness in performance evaluation, a 6-fold cross-validation was used
instead of the conventional train-validation-test split. This offered a more robust
assessment of the overall network performance in the early experimental phase by
minimizing bias in model evaluation. Specifically, the full dataset was divided into
six equally sized folds. In each iteration, one fold served as the validation set while
the remaining five were used for training. This iterative process ensures that each
subset serves as the validation set exactly once, providing insight into the network
performance on data with varying distribution.
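The fold construction can be sketched as follows, assuming the 37 scans that remained after exclusion; with that count the folds can only be approximately equal in size. The function name is hypothetical and the sketch omits shuffling.

```python
def k_fold_indices(n_samples, k):
    """Return k (train, val) index lists; every sample is validated exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


splits = k_fold_indices(n_samples=37, k=6)
print([len(val) for _, val in splits])  # [7, 6, 6, 6, 6, 6]
```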

3.6.2 Multi-Channel T1-MOLLI-based Segmentation
After evaluating the network’s overall performance on the dataset, an experiment
was conducted using combinations of multiple TIs as channel-wise input to the
model. This approach aimed to determine whether segmentation performance and
generalizability could be improved by leveraging the varying contrasts provided by
different TIs. To identify the optimal combination, each TI was individually input
into the model and evaluated based on segmentation performance. These were then
combined into a multi-channel input, with the goal of utilizing the unique features of
each TI. The proposed method utilizes early fusion of the inputs into a multi-channel
input image that can be fed into the segmentation network.
The best-performing configuration was then fine-tuned to maximize the model per-
formance, incorporating additional data augmentations and dropout layers to en-
hance generalization.


3.6.3 Multi-Modal Segmentation Integrating DWI and T2*
As a final experiment, the initial segmentation approach, using only T1-MOLLI
image data as input, was extended to incorporate multi-modal MRI data. Instead
of relying on a single imaging sequence, the approach was extended to combine both
DWI and T2* mapping images as inputs to evaluate whether integration of multiple
image modalities could improve segmentation performance and generalizability.
Following a similar strategy to the proposed multi-channel T1-MOLLI approach,
multi-channel inputs were constructed by concatenating DWI images acquired at
different b-values and T2* mapping images captured at varying echo times. Additionally,
parametric maps derived from all three modalities were added as inputs. The
aim was to investigate whether a CNN that processes these inputs simultaneously
can exploit complementary information across modalities to learn
a richer and more diverse representation of tissue characteristics.
Since ground truth volume segmentations were not available specifically for the DWI
and T2* modalities, segmentation masks were generated by resampling existing
MOLLI segmentations. Five MOLLI ground truth slices were selected to match
corresponding slices in the DWI and T2* datasets. These served as a new ground
truth set for training and evaluating the multi-modal segmentation model.

3.7 Training and Evaluation
All models were trained on an NVIDIA GeForce RTX 2080 Ti GPU until convergence,
for up to 4000 epochs. To accelerate the training process, MONAI’s CacheDataset
was used, which caches transformed data in memory. The Adam optimizer
was used with a fixed learning rate of 10⁻⁴. Model performance was continuously
monitored using the validation loss, and the model with the lowest validation loss
during training was saved as the best-performing model.
For evaluation of segmentation performance, the segmentation output was compared
to manual ground truth annotations. The network outputs were passed through
a sigmoid activation and thresholded to obtain binary masks. The segmentation overlap was then
evaluated using both Dice score and IoU. These metrics were computed slice-wise
during training and validation, and then aggregated to provide an overall mean.
Both metrics were computed separately for the renal parenchyma and cortex, offering
insight into segmentation accuracy and precision across the two target regions.
During evaluation, the entire image slice was processed step by step using a sliding
window with the shape 160×160 and a 50% overlap along each spatial dimension.
In overlapping regions, outputs were averaged to produce the final prediction. Zero-
padding was applied to the edges of each slice as needed.
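The window placement along one image axis can be sketched as below. This illustrates the principle rather than the exact MONAI routine; the rule of clamping the last window to the image border and the function names are assumptions.

```python
def window_starts(length, window, overlap=0.5):
    """Start offsets of windows covering one axis; the last one is clamped."""
    if length <= window:
        return [0]  # the axis would be zero-padded up to the window size
    step = int(window * (1 - overlap))
    starts = list(range(0, length - window, step))
    starts.append(length - window)
    return starts


def coverage(length, window, overlap=0.5):
    """Number of windows covering each position, used to average predictions."""
    counts = [0] * length
    for start in window_starts(length, window, overlap):
        for pos in range(start, min(start + window, length)):
            counts[pos] += 1
    return counts


print(window_starts(256, 160))  # [0, 80, 96]
```

For a 256×256 slice this gives three window positions per axis, and the summed predictions at each pixel are divided by the corresponding coverage count to obtain the average.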

3.8 Post-processing
To refine the raw model predictions and prepare them for downstream analysis, a
series of post-processing steps was applied. First, small objects were removed from
the output segmentation masks using the RemoveSmallObjects transform, which
filters out isolated clusters of pixels below a given size threshold. This helped reduce
noise and false positives in the resulting segmentation output. An example of this
transformation is illustrated in Figure 3.4.
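The underlying operation is a connected-component filter, sketched here without dependencies. The function name, the 4-connectivity, and the "smaller than min_size" rule are assumptions made for illustration and may differ from the settings of the MONAI transform.

```python
from collections import deque


def drop_small_components(mask, min_size):
    """Zero out 4-connected foreground components smaller than min_size pixels."""
    height, width = len(mask), len(mask[0])
    cleaned = [row[:] for row in mask]
    seen = [[False] * width for _ in range(height)]
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(component) < min_size:
                for r, c in component:
                    cleaned[r][c] = 0
    return cleaned


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(drop_small_components(mask, min_size=2))
# [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
```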

Figure 3.4: Visualization of the RemoveSmallObjects transform in MONAI, here
removing clusters with 100 pixels or less. Figure adapted from MONAI [45].

Since 2D slices were segmented independently, an essential post-processing step
was the aggregation of the predicted slices into a reconstructed 3D volume. This
volumetric assembly ensured the predicted segmentation aligned spatially with the
original input scan, and was necessary for a more accurate quantification of the
endpoints. Since the model segments both kidneys bilaterally, the final prediction
volume was divided at the midline to separate the left (SIN) and right (DX) kidneys.
This allowed for computation of kidney volume and tissue-specific parameters (T1,
ADC, T2*) for each kidney separately.

3.9 Volume Quantification
Once the predicted masks for each scan were generated and post-processed, kidney
volume measurements were obtained from the predicted masks by summing the total
number of voxels corresponding to parenchyma and cortex separately for each scan.
The voxel count was then multiplied by the voxel volume of the respective scan,
resulting in the total kidney volume in milliliters for both the SIN and DX kidneys.
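The conversion from voxel count to volume is a single multiplication, sketched here. The voxel count is hypothetical; the spacing corresponds to the smallest in-plane resolution and the T1-MOLLI slice spacing listed in Table 3.1.

```python
def mask_volume_ml(mask_voxel_count, spacing_mm):
    """Volume in mL from a voxel count and (dx, dy, dz) voxel spacing in mm."""
    dx, dy, dz = spacing_mm
    voxel_volume_mm3 = dx * dy * dz
    return mask_voxel_count * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical example: 12,000 parenchyma voxels at 1.5 x 1.5 x 5 mm spacing.
print(mask_volume_ml(12_000, (1.5, 1.5, 5.0)))  # 135.0
```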
The quality of the segmentation and the model’s ability to automatically quantify
renal volumes were assessed by evaluating how well the predicted endpoint values
matched the reference values, computed from manually created segmentations. This
was quantified using the coefficient of determination (R²), which provides a measure
of the agreement between predicted and reference volumes. The volumetric
analysis was performed across all data splits, highlighting differences in how well the
model had adapted to the training data, the distribution of training data volumes,
and the accuracy of volume predictions on the validation and test data.
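One common definition of the coefficient of determination is sketched below. The text does not state whether this form or the squared Pearson correlation was used, so the choice of 1 − SS_res/SS_tot is an assumption, and the volumes are hypothetical.

```python
def r_squared(reference, predicted):
    """Coefficient of determination of predictions against reference values."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((y - p) ** 2 for y, p in zip(reference, predicted))
    ss_tot = sum((y - mean_ref) ** 2 for y in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical kidney volumes in mL.
reference = [100.0, 150.0, 200.0]
predicted = [110.0, 150.0, 190.0]
print(round(r_squared(reference, predicted), 2))  # 0.96
```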


3.10 Quantification of Parametric Values
To quantify the parametric values, median values in the renal cortex and medulla
respectively were extracted from parametric maps using the predicted segmentation
masks. The medulla region was derived by subtracting the predicted cortex mask
from the parenchyma mask, both of which were obtained from the model’s output.
This resulted in a binary medulla mask, excluding the cortical regions.
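The mask subtraction can be sketched with a logical formulation, which also avoids negative values where the cortex prediction extends outside the parenchyma prediction. The function name and the one-row masks are hypothetical.

```python
def medulla_mask(parenchyma, cortex):
    """Medulla = parenchyma pixels that are not labeled as cortex."""
    return [
        [1 if p and not c else 0 for p, c in zip(p_row, c_row)]
        for p_row, c_row in zip(parenchyma, cortex)
    ]


parenchyma = [[0, 1, 1, 1, 0]]
cortex = [[1, 1, 0, 1, 0]]  # the first cortex pixel lies outside the parenchyma
print(medulla_mask(parenchyma, cortex))  # [[0, 0, 1, 0, 0]]
```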
Before extracting parametric values of the two tissues, both the cortex and medulla
masks were refined using binary erosion. Erosion was applied with a square structuring
element to shrink the ROI, removing edge pixels that may contain noise or partial
volume effects. This step ensured that the extracted values were representative
of the central, more reliable tissue regions, and not influenced by misclassifications
at the outer MR slices. An example of erosion using a 3×3 structuring element is
illustrated in Figure 3.5.
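Binary erosion with a square element can be sketched in a few lines. Treating positions outside the image as background is an assumption of this sketch, and the 5×5 example mask is hypothetical.

```python
def erode(mask, size=3):
    """Binary erosion with a size x size square structuring element.

    A pixel is kept only if every pixel in its neighborhood is foreground;
    positions outside the image are treated as background.
    """
    height, width, half = len(mask), len(mask[0]), size // 2

    def is_foreground(r, c):
        return 0 <= r < height and 0 <= c < width and bool(mask[r][c])

    return [
        [
            1 if all(is_foreground(r + dr, c + dc)
                     for dr in range(-half, half + 1)
                     for dc in range(-half, half + 1)) else 0
            for c in range(width)
        ]
        for r in range(height)
    ]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(mask)
print(eroded[2])  # [0, 0, 1, 0, 0]
```

A 3×3 block of foreground pixels shrinks to its single center pixel, which shows how one erosion step removes a one-pixel rim from the ROI.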

Figure 3.5: Effect of erosion using a 3×3 structuring element.

After applying erosion, the refined cortex and medulla masks were applied to the
original parametric maps (T1, ADC, and T2*) to calculate the median of the in-
tensity values within the mask. As the parametric maps are sensitive to variations
outside the ROI, where intensity values can vary significantly, extracting the me-
dian value ensures that the resulting values are more representative of the true tis-
sue characteristics and less sensitive to outliers. For each image volume, pixel-wise
parametric values were aggregated across a number of consecutive centered slices to
reduce slice-to-slice variability. Lastly, the resulting median values were compared
against reference values provided by image analysts using R² values.
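The extraction of a median value within a mask can be sketched as follows. The T1 values and the mask are hypothetical and include one extreme value to show why the median is preferred over the mean.

```python
from statistics import mean, median


def masked_values(param_map, mask):
    """Collect the parametric values of all pixels inside a binary mask."""
    return [
        value
        for value_row, mask_row in zip(param_map, mask)
        for value, inside in zip(value_row, mask_row)
        if inside
    ]


# Hypothetical T1 values in ms with one outlier inside the cortex mask.
t1_map = [[1200.0, 1250.0, 5000.0], [1300.0, 0.0, 1280.0]]
cortex = [[1, 1, 1], [1, 0, 1]]
values = masked_values(t1_map, cortex)
print(median(values), mean(values))  # 1280.0 2006.0
```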

4
Results

In this chapter, the results from the training and evaluation of the implemented seg-
mentation network are presented. It begins with a brief overview of the dataset and
key findings from data exploration. The subsequent sections follow the structure of
the experimental setup. First, results from single-channel segmentation of the renal
parenchyma and cortex using T1-MOLLI images are presented. This is followed by
results from multi-channel segmentation experiments using multiple inversion times
from T1-MOLLI, along with the quantification of kidney volume and median T1
values. Finally, the chapter concludes with results from multi-modal segmentation
that integrates DWI and T2* input data, including the quantification of ADC and
T2* values for both cortex and medulla.

4.1 Data Exploration
Exploration of the dataset revealed the presence of multiple artifacts that affected
image quality and the visibility of key anatomical structures of the kidney. Zebra
stripe artifacts were among the most prevalent, appearing with differing severity
across scans. While images with severe distortions that obscured renal boundaries
were excluded due to potential impairment of segmentation performance, small ar-
tifacts with minimal impact on the ROI were retained. Examples of such images
are presented in Figure 4.1, which illustrates the range of artifacts observed.

Figure 4.1: Examples of MRI artifacts and image distortions observed in the
dataset. Yellow arrows highlight regions impacted by zebra striping and poorly
defined kidney boundaries.


The leftmost image shows visible zebra striping. In this case, the artifact was deter-
mined to have little impact on the ROI, and the image was retained in the dataset.
The second image shows blurring of the upper boundaries of the kidneys, likely
caused by banding artifacts in outer slices. The third image shows poorly defined
kidney edges, likely from partial volume effects. These distortions were commonly
observed throughout the dataset, posing challenges for the segmentation models.

4.2 Single-channel T1-MOLLI-based Segmentation
To evaluate the performance of the 2D and 3D segmentation approaches for renal
parenchyma and cortex, four models were trained using 6-fold cross-validation and
TI1400 as input. Each model was evaluated individually and the results were av-
eraged across folds. The results are presented in Table 4.1, showing the average
performance and standard deviation across the cross-validation folds.

Table 4.1: 6-fold cross-validation results for 2D and 3D ResUNet models. Performance
metrics are shown as mean ± standard deviation across all folds.

Network      Endpoint     Dice              IoU
3D ResUNet   Parenchyma   0.8290 ± 0.0254   0.7112 ± 0.0372
3D ResUNet   Cortex       0.6925 ± 0.0321   0.5335 ± 0.0372
2D ResUNet   Parenchyma   0.8721 ± 0.0292   0.7930 ± 0.0365
2D ResUNet   Cortex       0.7942 ± 0.0248   0.6753 ± 0.0280

As shown in Table 4.1, the 2D ResUNet outperformed the 3D ResUNet in terms of
both Dice and IoU. For parenchyma segmentation, the 2D model achieved a mean
Dice of 0.8721 ± 0.0292 compared to 0.8290 ± 0.0254 for the 3D model. A similar
trend was observed for cortex segmentation, where the 2D model reached a mean
Dice of 0.7942 ± 0.0248, notably higher than the 3D model’s 0.6925 ± 0.0321. The
IoU scores followed the same pattern, further suggesting that the 2D ResUNet was
more effective at both segmentation tasks.
Figure 4.2 illustrates a comparison of example segmentation outputs for the center
slice of two patient scans. While the parenchyma segmentations are comparable for
both 2D and 3D ResUNet, the cortex segmentation output clearly differs between
the two architectures. In the parenchyma case, both models struggle to accurately
delineate the outer kidney boundary. These challenges become more apparent in
the cortex segmentation task, where the renal cortex’s complex structure introduces
more intricate boundary regions. The segmentation overlays demonstrate that the
2D ResUNet provides more accurate cortex delineation compared to the 3D Re-
sUNet. In the 3D model overlays, a higher occurrence of false positives is observed,
particularly in regions corresponding to the medullary pyramids. This leads to
outputs that more closely resemble parenchyma segmentation, indicating a reduced
performance in distinguishing cortical boundaries. These observations are consistent
with the quantitative performance metrics shown in Table 4.1, which confirm the
superior performance of the 2D model in segmenting the parenchyma and cortex.


Figure 4.2: Comparison of segmentation results for (a) renal parenchyma and
(b) renal cortex using 2D ResUNet and 3D ResUNet models. The figures illustrate
the original image, ground truth segmentation label, and overlay of image, label and
model predictions, where true positives (TP) are represented in green, false positives
(FP) in red, and false negatives (FN) in blue.


4.3 Multi-channel T1-MOLLI-based Segmentation
To explore the impact of multi-dimensional input data, each available TI image and
the T1 map were first evaluated individually as single-channel inputs. This experi-
ment aimed to determine the standalone contribution of each input to segmentation
performance. Segmentation performance was assessed for both renal parenchyma
and cortex using standard performance metrics on the validation and test sets. The
results for each input channel are summarized in Tables 4.2 and 4.3.

Table 4.2: Parenchyma segmentation performance using single-channel input im-
ages of different inversion time (TI).

Input Val Dice Val IoU Test Dice Test IoU
TI200 0.8148 0.7109 0.8310 0.7125
TI800 0.8443 0.7676 0.7878 0.7050
TI1400 0.8584 0.7955 0.8488 0.7717
TI2000 0.8517 0.7773 0.7982 0.7037
TI2500 0.8291 0.7513 0.8019 0.7068
T1-map 0.8911 0.8204 0.8520 0.7740

As can be seen in the results, the T1 map achieved the best overall
performance among the single-channel inputs. For parenchyma segmentation, the T1
map yielded the highest Dice scores on both the validation and test sets (0.8911 and
0.8520, respectively). The best-performing individual TI was 1400 ms, which also
demonstrated strong segmentation accuracy, with validation and test Dice scores
of 0.8584 and 0.8488. A similar pattern can be observed for cortex segmentation,
where the T1 map again led to the best performance on the validation set, achieving
a Dice score of 0.8155 and IoU of 0.7013. However, the best performance on the test
set was obtained with TI1400, which reached a Dice score of 0.8196.

Table 4.3: Cortex segmentation performance using single-channel input images of
different inversion time (TI).

Input Val Dice Val IoU Test Dice Test IoU
TI200 0.7024 0.5508 0.6970 0.5498
TI800 0.7491 0.6211 0.7432 0.6163
TI1400 0.7868 0.6749 0.8196 0.7090
TI2000 0.7710 0.6511 0.7588 0.6404
TI2500 0.7543 0.6292 0.7382 0.6134
T1-map 0.8155 0.7013 0.7898 0.6787

Various combinations of inversion times were evaluated as multi-channel inputs to
the 2D ResUNet. Starting with TI1400 and T1 map as a base input, additional
input channels were systematically added or removed to enhance segmentation per-
formance. The results from the top five performing models for parenchyma and
cortex segmentation are presented in Table 4.4 and Table 4.5, respectively.


Table 4.4: Comparison of the top five multi-channel input combinations for seg-
mentation of the renal parenchyma.

Input Val Dice Val IoU Test Dice Test IoU
TI1400 + T1map 0.8799 0.8158 0.8691 0.8011
TI1400, 2000 + T1map 0.9018 0.8399 0.8914 0.8217
TI200, 800, 1400, 2000 + T1map 0.9055 0.8454 0.8959 0.8275
TI800, 1400, 2000, 2500 + T1map 0.9139 0.8504 0.8735 0.8027
All input channels 0.9085 0.8470 0.8410 0.7650

Increasing the number of input channels improved model performance, enhancing
segmentation accuracy on both validation and test sets. One of the best-performing
multi-channel input configurations was the combination of dropping the inversion
time of 2500 ms. The model trained on this input combination achieved a validation
Dice score of 0.9055 and a test Dice score of 0.8959 for parenchyma segmentation,
along with the second-highest IoU scores for both validation and test data.
The model trained with all available input channels showed high validation
performance (Dice 0.9085) but notably lower test performance (Dice 0.8410). In
contrast, models trained on selected subsets of the channels achieved higher test
Dice and IoU scores. A similar gap between validation and test performance was
observed for another top-performing model, in which the 200 ms inversion time was
excluded. This model achieved the highest validation Dice score (0.9139) and the
highest validation IoU in Table 4.4. However, its test Dice score of 0.8735 was
lower than that of the two best-performing configurations.
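
The difference between validation and test performance can be made explicit with a small calculation. The sketch below uses only the Dice scores listed in Table 4.4; the dictionary and function names are chosen for this example.

```python
# Dice scores copied from Table 4.4 (parenchyma): (validation, test).
table_4_4 = {
    "TI1400 + T1map": (0.8799, 0.8691),
    "TI1400, 2000 + T1map": (0.9018, 0.8914),
    "TI200, 800, 1400, 2000 + T1map": (0.9055, 0.8959),
    "TI800, 1400, 2000, 2500 + T1map": (0.9139, 0.8735),
    "All input channels": (0.9085, 0.8410),
}

def dice_gap(scores):
    """Return the validation-minus-test Dice difference per configuration."""
    return {name: round(val - test, 4) for name, (val, test) in scores.items()}

gaps = dice_gap(table_4_4)

# List configurations from the smallest to the largest gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:34s} gap = {gap:.4f}")
```

With these values, the configuration without the 2500 ms image has the smallest gap (0.0096), whereas the model using all input channels has the largest (0.0675).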

Table 4.5: Comparison of the top five multi-channel input combinations for seg-
mentation of the renal cortex.

Input                               Val Dice   Val IoU   Test Dice   Test IoU
TI1400 + T1map                      0.7992     0.6219    0.8212      0.7110
TI1400, 2000 + T1map                0.8253     0.7150    0.8182      0.7123
TI800, 1400, 2000 + T1map           0.8280     0.7219    0.8223      0.7106
TI200, 800, 1400, 2000 + T1map      0.8242     0.7166    0.8392      0.7396
All input channels                  0.8267     0.7196    0.8272      0.7184

For cortex segmentation, the multi-channel experiments showed that combining the
inversion times that individually yielded the highest segmentation performance
improved accuracy. As for the parenchyma, including all channels led to one of
the top-performing models, but with a lower test Dice score than the best
configuration. Notably, the model that included all inversion times except the
2500 ms image performed consistently well on both validation and test data for
cortex segmentation as well. This input configuration achieved a validation Dice
score of 0.8242 and IoU of 0.7166, as well as a test Dice score of 0.8392 and IoU
of 0.7396 (Table 4.5). These results indicate a good balance between model
accuracy and generalization to unseen data.


4.3.1 Performance of Fine-tuned T1-MOLLI Model
The 2D ResUNet architecture was fine-tuned with the goal of improving the accuracy
and robustness of the model. A five-channel input, representing different
inversion times and the parametric T1 map and previously identified as optimal
for both anatomical regions, was used as input to the network. Because the same
input combination suited both parenchyma and cortex, a multi-label approach was
also explored, in which a single model was trained to segment both ROIs
simultaneously. Its performance was compared against single-label models that
segment parenchyma and cortex separately, as summarized in Table 4.6. To enhance
generalization, additional data augmentation was applied, along with dropout
(p = 0.1) for regularization. Both model configurations were then trained to
convergence using the Adam optimizer with a learning rate of $10^{-4}$.
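
To illustrate what a multi-label target may look like, the sketch below converts an integer label map into two overlapping binary channels, one for the parenchyma and one for the cortex. The label convention (0 = background, 1 = cortex, 2 = medulla), the treatment of the parenchyma as the union of cortex and medulla, and the function name are assumptions for this example; the thesis does not specify the encoding at this point.

```python
import numpy as np

# Hypothetical label convention for this sketch (not taken from the thesis).
BACKGROUND, CORTEX, MEDULLA = 0, 1, 2

def to_multilabel_target(label_map):
    """Convert an integer label map (H, W) into a (2, H, W) binary target.

    Channel 0: parenchyma (cortex or medulla); channel 1: cortex only.
    The channels overlap, so each would be predicted by its own sigmoid output.
    """
    parenchyma = (label_map == CORTEX) | (label_map == MEDULLA)
    cortex = label_map == CORTEX
    return np.stack([parenchyma, cortex], axis=0).astype(np.float32)

# Toy 3 x 4 label map: a medulla core surrounded by cortex pixels.
label_map = np.array([
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [0, 1, 1, 0],
])
target = to_multilabel_target(label_map)
print(target.shape)  # (2, 3, 4)
```

Under this encoding every cortex pixel is also a parenchyma pixel, which is one way a single network can learn both regions from a shared representation.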

Table 4.6: Comparison of the performance of single-label and multi-label segmen-
tation models trained on multi-channel T1-MOLLI input.

Network                       Val Dice   Val IoU   Test Dice   Test IoU
Single-label (2D ResUNet)
  Parenchyma                  0.9130     0.8534    0.8962      0.8254
  Cortex                      0.8352     0.7260    0.8466      0.7428
Multi-label (2D ResUNet)
  Parenchyma                  0.9298     0.8737    0.9089      0.8453
  Cortex                      0.8614     0.7598    0.8552      0.7557

As shown in Table 4.6, the multi-label network outperformed the single-label
models trained to segment each region separately. For instance, the multi-label
model achieved validation Dice scores of 0.9298 and 0.8614 for parenchyma and
cortex, respectively, compared to 0.9130 and 0.8352 for the single-label
models. Notably, the multi-label model also generalized well to unseen test data,
with Dice scores of 0.9089 (parenchyma) and 0.8552 (cortex). The Dice loss and
segmentation performance during training of the multi-label network are shown in
Figure 4.3.


Figure 4.3: Training and validation performance of the proposed multi-label 2D
ResUNet, showing the Dice loss and evaluation metrics during training (Adam
optimizer, learning rate of $10^{-4}$). Dice score and IoU are reported separately
for the parenchyma and cortex, as well as averaged across both regions.
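
The per-region and averaged metrics described in the caption can be sketched as follows, using the standard definitions of the Dice score and IoU for binary masks. The function names, the two-channel mask layout, and the small `eps` term guarding against empty masks are assumptions for this example rather than the evaluation code used in the thesis.

```python
import numpy as np

def dice_and_iou(pred, truth, eps=1e-8):
    """Dice score and IoU for a pair of binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2.0 * intersection / (pred.sum() + truth.sum() + eps)
    iou = intersection / (union + eps)
    return float(dice), float(iou)

def per_label_metrics(pred, truth, names=("parenchyma", "cortex")):
    """Compute (Dice, IoU) per label channel and their mean across labels."""
    results = {n: dice_and_iou(pred[i], truth[i]) for i, n in enumerate(names)}
    mean = np.mean([results[n] for n in names], axis=0)
    results["mean"] = (float(mean[0]), float(mean[1]))
    return results

# Toy masks of shape (labels, H, W).
truth = np.zeros((2, 4, 4), dtype=bool)
truth[0, 1:3, 1:3] = True        # parenchyma: 4 pixels
truth[1, 1, 1:3] = True          # cortex: 2 pixels

pred = truth.copy()
pred[0, 1, 1] = False            # one missed parenchyma pixel
pred[0, 0, 0] = True             # one false detection

metrics = per_label_metrics(pred, truth)
for name, (dice, iou) in metrics.items():
    print(f"{name:11s} Dice = {dice:.3f}  IoU = {iou:.3f}")
```

In this toy case the parenchyma prediction shares three of four pixels with the reference (Dice 0.75, IoU 0.60), while the cortex prediction matches exactly.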

Figure 4.4 shows example segmentations of the renal parenchyma using the proposed
multi-channel model. Qualitatively, the model demonstrates a high degree of accu-
racy in delineating the whole parenchymal region. Despite false detections along the
k