Deep Learning-based Segmentation of Kidneys from MR Images
A Multi-Channel Approach for Automated Segmentation and Quantification in Multi-parametric MRI
Master’s Thesis in Biomedical Engineering
CECILIA NORDBERG
VIKTOR LINDFORS
Department of Electrical Engineering (E2)
Chalmers University of Technology
Gothenburg, Sweden 2025
www.chalmers.se

© CECILIA NORDBERG, 2025.
© VIKTOR LINDFORS, 2025.
Supervisor: Bettina Selig, Antaros Medical AB
Supervisor: Kanishka Sharma, Antaros Medical AB
Examiner: Ida Häggström, Electrical Engineering (E2)
Master’s Thesis 2025
Department of Electrical Engineering (E2)
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Segmentation of the renal cortex (green) and medulla (magenta) using a 2D ResUNet. The model was trained on multi-channel input from T1-MOLLI images acquired at varying inversion times.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Abstract
Chronic kidney disease (CKD) is a progressive condition affecting millions worldwide, and accurate assessment of kidney structure is essential for early diagnosis and monitoring disease progression.
Magnetic Resonance Imaging (MRI) has emerged as a powerful, non-invasive technique for visualizing subtle structural and functional changes of the kidneys, providing insights into disease progression and severity. However, manual segmentation of MRI data is both time-consuming and prone to inter- and intra-observer variability, highlighting the need for automated methods. This thesis presents a deep learning-based approach for automated segmentation of the renal parenchyma, cortex, and medulla using multi-channel and multi-modal MRI data. A 2D ResUNet architecture was implemented with the Medical Open Network for AI (MONAI) framework and trained on a dataset of 37 MRI scans from CKD patients. Two approaches were evaluated: a multi-channel model utilizing T1-weighted Modified Look-Locker Inversion Recovery (T1-MOLLI) images at multiple inversion times, and a multi-modal model incorporating diffusion-weighted imaging (DWI) and T2*-weighted image data. While the multi-channel T1-MOLLI model demonstrated strong agreement with manual annotations, achieving Dice scores of 0.9089 for parenchyma and 0.8552 for cortex, the multi-modal approach underperformed due to spatial misalignment between input images and reference labels. The proposed segmentation pipeline also enabled reliable quantification of renal parenchyma and cortex volumes, and showed potential for quantifying tissue-specific parametric values relevant to CKD monitoring. However, the reliability of these measurements was highly dependent on the model’s segmentation performance. Overall, the findings highlight the potential of using deep learning models with multi-channel MRI input for improving kidney segmentation, serving as a tool to support clinical image analysis workflows and reduce manual effort.

Keywords: deep learning, image segmentation, MRI, kidneys, convolutional neural networks, ResUNet, multi-channel images, chronic kidney disease.
Acknowledgements
We would like to thank Antaros Medical AB for giving us the opportunity to carry out our thesis project within their organization. A special thanks to our supervisors, Bettina Selig and Kanishka Sharma, for your ongoing support throughout the project. You have continuously encouraged us to explore different perspectives and shared your knowledge with us during this spring, which we truly appreciated. Additionally, we would like to extend our thanks to Carl Sjöberg for making this thesis possible, and for the warm welcome and support from the start. To all the colleagues at Antaros, thank you for welcoming us and making us feel like part of the team. We would also like to thank our examiner, Ida Häggström, for guiding us during our thesis work. Finally, as this thesis marks the conclusion of our five years at Chalmers University of Technology, we would like to give a big thank you to our friends and family, and to everyone who has contributed to our experience during this time.
Cecilia Nordberg & Viktor Lindfors, Gothenburg, June 2025

List of Acronyms
The acronyms used throughout this thesis are listed below in alphabetical order:
ADC Apparent Diffusion Coefficient
CKD Chronic Kidney Disease
CMD Corticomedullary Differentiation
CNN Convolutional Neural Network
CT Computed Tomography
DNN Deep Neural Network
DWI Diffusion-Weighted Imaging
DX Dextral
ESRD End-stage Renal Disease
FN False Negative
FOV Field of View
FP False Positive
GPU Graphics Processing Unit
IoU Intersection over Union
MOLLI Modified Look-Locker Inversion Recovery
MONAI Medical Open Network for Artificial Intelligence
MR Magnetic Resonance
MRI Magnetic Resonance Imaging
ReLU Rectified Linear Unit
RF Radio Frequency
ROI Region of Interest
SD Standard Deviation
SIN Sinistral
T1w T1-weighted
T2*w T2*-weighted
TE Echo Time
TI Inversion Time
TP True Positive
TR Repetition Time

Contents
List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Aim
1.2 Scope and Limitations
1.3 Related Work
2 Theory
2.1 Kidney Anatomy and Physiology
2.1.1 Chronic Kidney Disease
2.2 Magnetic Resonance Imaging
2.2.1 Physical Principles of MRI
2.2.2 MRI Sequences
2.2.3 MRI Artifacts
2.3 Deep Learning for Image Segmentation
2.3.1 Basic Concepts of Deep Learning
2.3.2 Convolutional Neural Networks
2.3.3 U-Net
2.3.4 ResUNet
2.3.5 Loss Functions for Medical Image Segmentation
2.3.6 Evaluation Metrics
2.4 Multi-channel Image Segmentation
3 Methods
3.1 Data Description
3.2 Data Exclusion
3.3 Dataset Split
3.4 Data Pre-processing
3.4.1 Creation of Ground Truth Masks
3.4.2 Brightness and Intensity Scaling
3.4.3 Splitting into 2D Slices
3.4.4 Augmentations
3.5 Network Configuration
3.6 Model Implementation and Experimental Setup
3.6.1 Single-Channel T1-MOLLI-based Segmentation
3.6.2 Multi-Channel T1-MOLLI-based Segmentation
3.6.3 Multi-Modal Segmentation Integrating DWI and T2*
3.7 Training and Evaluation
3.8 Post-processing
3.9 Volume Quantification
3.10 Quantification of Parametric Values
4 Results
4.1 Data Exploration
4.2 Single-channel T1 MOLLI-based Segmentation
4.3 Multi-channel T1 MOLLI-based Segmentation
4.3.1 Performance of Fine-tuned T1-MOLLI Model
4.3.2 Quantification of Kidney Volume
4.3.3 Quantification of Median T1 Value
4.4 Multi-Modal Kidney Segmentation
4.4.1 Quantification of ADC and T2* value
5 Discussion
5.1 Comparison of 2D and 3D Segmentation
5.2 The Effect of Multi-Channel Inputs
5.3 The Effect of Multi-modal Integration
5.4 Model Strengths and Limitations
5.5 Quantifying Volume and Tissue Parameters
5.6 Comparison to Existing Works
5.7 Future Work
6 Conclusion
Bibliography
A Appendix A
B Appendix B

List of Figures
2.1 Anatomical illustration of the normal renal structure from a coronal orientation of the left kidney. The renal parenchyma includes the cortex and the medullary pyramids. Created with BioRender.com.
2.2 Examples of common MRI artifacts that include motion artifact, zebra stripe artifact, and partial volume effect. Images (a) and (b) are adapted from Stadler et al. [28], while (c) is courtesy of Antaros Medical AB.
2.3 An example of a convolutional layer with kernel size 3×3, a stride of 1, and no padding.
2.4 Max pooling layer with kernel size 2×2 and a stride of 2.
2.5 Comparison of activation functions: (a) ReLU, and (b) PReLU.
2.6 Illustration of activation functions commonly used for classification: (a) Sigmoid function, and (b) Softmax function.
2.7 Illustration of the relationship between ground truth and predicted classifications.
The overlapping region represents true positives (TP), where the prediction correctly matches the ground truth. The left (blue) region denotes false negatives (FN), where actual positives were missed. The right (red) region indicates false positives (FP), where the background is incorrectly segmented as foreground.
2.8 Example of a multi-channel input image with five different image channels from renal MRI data. The channels provide different views of the same anatomy with complementary tissue contrast information.
3.1 Overview of proposed pipeline for kidney segmentation.
3.2 Example of ground truth segmentation and corresponding binary masks after applied transformations. From left to right: parenchyma ground truth, parenchyma binary mask, cortex ground truth, and cortex binary mask. The manually segmented ground truth image shows the left kidney (cyan) and right kidney (green), annotated by image analysts.
3.3 Overview of proposed 2D ResUNet architecture. The left side shows the downsampling path with residual units, where convolutional channels increase from 16 to 256 to extract features. The right side corresponds to the upsampling path, where channels decrease from 256 to the number of output classes, combined with skip connections.
3.4 Visualization of the RemoveSmallObjects transform in MONAI, here removing clusters with 100 pixels or less. Figure adapted from MONAI [45].
3.5 Effect of erosion using a 3×3 structural element.
4.1 Examples of MRI artifacts and image distortions observed in the dataset. Yellow arrows highlight regions impacted by zebra striping and poorly defined kidney boundaries.
4.2 Comparison of segmentation results for (a) renal parenchyma and (b) renal cortex using 2D ResUNet and 3D ResUNet models. The figures illustrate the original image, ground truth segmentation label, and overlay of image, label and model predictions, where true positives (TP) are represented in green, false positives (FP) in red, and false negatives (FN) in blue.
4.3 Training and validation performance of the proposed multi-label 2D ResUNet, showing Dice loss and evaluation metrics progress during training (Adam optimizer, learning rate of 10⁻⁴). Dice score and IoU are reported separately for the parenchyma and cortex, as well as averaged across both regions.
4.4 Representative examples of parenchyma segmentation results from two test cases: a) scan acquired at site A, and b) scan from site B. Each figure shows selected image slices with the predicted segmentation mask, as well as an overlay on the TI 1400 ms image. The color coding in the overlays is as follows: green - true positives, blue - false negatives, and red - false positives.
4.5 Examples of renal parenchyma segmentation results, illustrating classification outcomes where the model oversegments in the renal pelvis region. The color coding is as follows: green - true positives, blue - false negatives, and red - false positives.
4.6 Examples of renal parenchyma segmentation results, illustrating classification outcomes where the model undersegments in the most anterior and posterior parts of the kidney. The color coding is as follows: green - true positives, blue - false negatives, and red - false positives.
4.7 Examples of renal parenchyma segmentation results, illustrating classification outcomes where the model falsely predicts other structures in background-only slices. Red indicates false positives.
4.8 Representative examples of cortex segmentation results from two test cases: a) a scan acquired at site A, and b) a scan from site B. Each figure shows selected image slices with the predicted segmentation mask, as well as an overlay on the TI 1400 ms image. The color coding in the overlays is as follows: green - true positives, blue - false negatives, and red - false positives.
4.9 Correlation of ground truth and predicted volume for parenchyma and cortex. The identity line represents perfect correlation between the CNN-predicted and ground-truth segmentation. Grey area corresponds to a 5% volume difference.
4.10 Bland-Altman plots of agreement for volume prediction and ground truth volume for parenchyma and cortex. Mean and standard deviation (SD) are calculated globally across all data sets. The solid line represents the mean, while the dashed lines indicate the limits of agreement, calculated as the mean ±1.96 times the SD.
4.11 Example of post-processing (erosion) of the segmentation mask for (a) cortex and (b) medulla. From left to right: original segmentation mask and resulting segmentation mask after erosion with a square structuring element of increasing size.
4.12 Correlation of predicted and delivered median T1 value for cortex and medulla. The identity line represents perfect correlation between the CNN-predicted and delivered values. Grey area corresponds to a 5% difference in predicted value.
4.13 Example of renal parenchyma segmentation results from multi-modal input, illustrating test cases with good image-label alignment. The ground truth labels closely match the kidney boundaries visible in the underlying T2*w image, resulting in fewer false detections.
The color coding in the overlays is as follows: green - true positives, blue - false negatives, and red - false positives.
4.14 Example of renal parenchyma segmentation results from multi-modal input, illustrating test cases with poor image-label alignment. The segmentation overlays show how the ground truth label either extends beyond the visible kidney region or covers only a portion of the kidney as seen in the underlying T2*w image. The color coding in the overlays is as follows: green - true positives, blue - false negatives, and red - false positives.
4.15 Examples of cortex segmentation results using the multi-modal segmentation model, demonstrating classification performance on test images. Images show an overlay of ground truth label and prediction output for several slices, superimposed on T2*w images acquired at TE = 3 ms. The color coding is as follows: green - true positives, blue - false negatives, and red - false positives.
4.16 Example of manually delineated regions and generated regions: (a) Manually delineated regions for cortex and medulla on T2* map, (b) generated mask using erosion for cortex, (c) generated mask using erosion for medulla.
4.17 Correlation of predicted and delivered median ADC value for cortex and medulla. The model is the proposed multi-modal model (2D ResUNet). The identity line represents perfect correlation between the CNN-predicted and delivered values. Grey area corresponds to a 5% difference in predicted value.
4.18 Correlation of predicted and delivered median T2* value for cortex and medulla. The model is the proposed multi-modal model (2D ResUNet). The identity line represents perfect correlation between the CNN-predicted and delivered values.
Grey area corresponds to a 5% difference in predicted value.
A.1 Example images for T1-MOLLI images acquired with increasing inversion times, ranging from 174 ms to 2574 ms.
A.2 Example images for T2*w images acquired with increasing echo times, ranging from 3 ms to 62 ms.
A.3 Example images for DWI acquired with increasing b-values, ranging from 0 to 500 s/mm². Left to right: b=0, b=50, b=200, b=500.
A.4 Examples of parametric maps derived from different MRI sequences. From left to right: T1 map from T1-MOLLI, ADC map from DWI images, and T2* map from T2* mapping.

List of Tables
3.1 Summary of imaging parameters by modality, reflecting the variability in acquisition protocols depending on the modality and the scanner setup.
4.1 6-fold cross-validation results for 2D and 3D ResUNet models. Performance metrics are shown as mean ± standard deviation across all folds.
4.2 Parenchyma segmentation performance using single-channel input images of different inversion time (TI).
4.3 Cortex segmentation performance using single-channel input images of different inversion time (TI).
4.4 Comparison of the top five multi-channel input combinations for segmentation of the renal parenchyma.
4.5 Comparison of the top five multi-channel input combinations for segmentation of the renal cortex.
4.6 Comparison of the performance of single-label and multi-label segmentation models trained on multi-channel T1-MOLLI input.
4.7 Mean ± SD of volume difference (in %) between predicted and delivered volumes, across kidney regions and dataset splits.
4.8 Overview of input channels used in the multi-modal segmentation model. DWI inputs correspond to multiple b-values and T2* mapping inputs correspond to multiple echo times.
4.9 Comparison of the performance of multi-label segmentation models using multi-channel T1-MOLLI or DWI and T2* as input. Models are based on a 2D ResUNet structure, segmenting both cortex and parenchyma simultaneously.
5.1 Summary of related work on kidney segmentation.
B.1 R² correlation values between predicted and actual T1 values for SIN and DX cortex, using erosion with a square structural element of various sizes.
B.2 R² correlation values between predicted and actual T1 values for SIN and DX medulla, using erosion with a square structural element of various sizes.

1 Introduction
Accurate and precise segmentation of anatomical structures in medical images is a critical challenge in the field of medical image analysis. Rapid advancements in deep learning have introduced new and powerful tools for processing and interpreting medical images, including tools for segmenting vital organs such as the kidneys. Precise delineation of kidney regions is essential for advancing the understanding of kidney function and pathology, as well as for supporting effective diagnosis and treatment planning. For instance, segmenting internal kidney structures such as the cortex and medulla allows for measurement of anatomical and functional changes associated with pathology.
Such information could support early detection of conditions like chronic kidney disease, a progressive and widespread condition that often goes undetected in its early stages [1]. Among medical imaging techniques, Magnetic Resonance Imaging (MRI) has emerged as a powerful, non-invasive technique for visualizing subtle structural and functional changes of the kidneys, providing insights into disease progression and severity [2].

However, manual analysis and segmentation of kidney regions from medical image data is a time-consuming process that requires high expertise. It becomes even more challenging when there is inter- and intra-subject anatomical variability, limited image contrast between internal tissue types due to limited image quality, and inconsistencies across imaging protocols. These factors make manual annotation both time-consuming and prone to variability. Deep learning offers a promising way to automate the renal segmentation process, saving time and manual effort while also reducing operator bias.

While recent advances in deep learning have led to significant improvements in medical image analysis, most existing approaches focus on whole-kidney segmentation and rarely address the challenge of segmenting anatomical substructures of the kidneys. Methods that do attempt to resolve internal regions often struggle with low contrast, leading to insufficient segmentation performance and limited clinical relevance. Thus, there is a clear need for deep learning solutions that can handle the complexity of renal segmentation from imaging data.

This thesis focuses on developing a deep learning-based segmentation pipeline to address the challenge of segmenting the parenchyma and internal renal structures in kidney MRI. By providing predicted segmentations with high accuracy and precision, the need for manual corrections, or for any manual segmentation at all, can be minimized.
Such advancements have the potential to make the process of analyzing kidney images more efficient, thereby reducing time, effort, and operator dependency.

1.1 Aim
The aim of this thesis is to develop a deep learning-based solution for automated segmentation of kidney parenchyma, cortex, and medulla from MRI data. Specifically, the objective is to implement a multi-channel approach capable of generating reliable segmentation suggestions for these anatomical regions by leveraging information from multi-parametric MRI. Furthermore, this study aims to evaluate the feasibility of using the proposed deep neural network to estimate renal volume and tissue-specific parametric values, thereby enabling automatic assessment of renal tissue health and disease progression.

This project is carried out in collaboration with Antaros Medical AB, a company specializing in advanced medical imaging technologies for clinical trials and drug development. The desired outcome is a reliable segmentation pipeline that assists their image analysts in the manual segmentation workflow.

1.2 Scope and Limitations
The scope of this study is limited to the automated segmentation of the kidney parenchyma, cortex, and medulla using MRI data. Only these specific renal regions are considered, and no additional structures will be included in the segmentation process. Similarly, the image data is restricted to MRI, and other modalities such as computed tomography (CT) are not within the scope.

The MRI dataset used in the study is pre-acquired and provided by Antaros Medical AB. As a result, this work does not involve any additional image data acquisition. Areas such as motion artifact reduction and image registration techniques are also excluded from the scope.

Given that image segmentation and deep learning are wide research areas, this thesis focuses on investigating a limited set of techniques rather than attempting to explore the entire field.
The proposed solution is evaluated solely on data from the same clinical study, which limits comparison of the results with other medical image data. Lastly, while algorithm performance in terms of efficiency and processing time is discussed, direct comparisons with manual segmentation in terms of time savings are not included in the scope of this thesis.

1.3 Related Work
Automated medical image segmentation has become an increasingly active area of research, driven by advancements in deep learning and computer vision. Early efforts in automatic kidney segmentation focused primarily on the kidney as a whole, with less attention paid to the internal structures, such as the cortex and medulla. However, research highlights the importance of detailed segmentation of these internal structures due to their clinical importance in diagnosing and monitoring renal diseases [3]. An early study from 2014 emphasized the significance of segmenting the kidney cortex and medulla and proposed a method based on traditional thresholding techniques applied to MRI modalities [4]. With the advancements in deep learning, particularly in the field of convolutional neural networks (CNNs), more refined approaches have been developed to improve segmentation accuracy.

In the context of kidney segmentation, a number of efforts have explored using CNN-based networks to automate and improve segmentation performance in MRI and CT imaging. For example, a study from 2022 introduced a deep learning model based on a modified U-Net architecture for the automated segmentation of the kidney cortex and medulla in abdominal CT scans [5]. Although this approach demonstrated improved accuracy over traditional methods, challenges remain due to the minimal contrast differences between the cortex and medulla.
Additional research has applied CNNs for classification and segmentation tasks to support chronic kidney disease (CKD) diagnosis and kidney volume estimation [6], [7], yet poor tissue contrast is consistently highlighted as a challenge. Segmentation accuracy is often reduced near anatomical boundaries where adjacent organs, such as the spleen and liver, blur the distinction between tissues. In patients with CKD, additional challenges arise due to irregular kidney shape and altered contrast, which further complicate delineation of tissue boundaries. Despite these challenges, CNN-based methods demonstrate high accuracy in segmenting kidneys, often outperforming manual segmentation and reducing inter- and intra-observer variability [5], [8].

Deep learning-based segmentation models are typically limited by the quality of input data. For renal segmentation, existing methods rely on either contrast-enhanced or high-resolution imaging. Still, challenges remain due to limited contrast within the kidneys or surrounding tissues. Additionally, models trained on a single imaging modality often experience limited generalizability. Such models tend to become biased toward specific imaging data, reducing their effectiveness when applied to other modalities or diverse clinical datasets. This presents challenges for clinical applications, where robust performance across various imaging techniques is essential.

To overcome such challenges, multi-modal and multi-channel approaches have gained attention. Recently, researchers have explored the integration of multiple imaging modalities with deep learning techniques to improve segmentation performance and robustness. For example, a study from 2024 proposed incorporating multiple imaging modalities into the segmentation process by treating them as separate input channels. This method demonstrated improved segmentation accuracy and better generalization across varying datasets [9].
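Treating modalities as separate input channels can be illustrated with a minimal sketch. It assumes the acquisitions are already co-registered 2D slices of identical size; the function name `stack_channels`, the synthetic data, and the per-channel min-max scaling are illustrative assumptions and are not taken from the cited study or from the pipeline in this thesis.

```python
import numpy as np


def stack_channels(images):
    """Stack co-registered 2D images into one (C, H, W) array.

    Each channel is min-max scaled to [0, 1] independently so that
    acquisitions with different intensity ranges contribute comparably.
    (Scaling choice is an assumption made for this sketch.)
    """
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"all channels must share one shape, got {shapes}")

    channels = []
    for img in images:
        img = img.astype(np.float32)
        lo, hi = img.min(), img.max()
        # Guard against constant images to avoid division by zero.
        scaled = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
        channels.append(scaled)
    # Channel-first layout, the usual input convention for 2D CNNs.
    return np.stack(channels, axis=0)


# Hypothetical example: three synthetic 4x4 "acquisitions" of the same slice,
# each with a different intensity range.
rng = np.random.default_rng(0)
slices = [rng.uniform(0, scale, size=(4, 4)) for scale in (100.0, 1500.0, 4000.0)]
fused = stack_channels(slices)
print(fused.shape)  # (3, 4, 4)
```

The network then sees one multi-channel image per slice, so no architectural change is needed beyond setting the number of input channels of the first convolution.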
Such input-level fusion is widely adopted in deep learning-based medical segmentation [10], offering a straightforward yet effective way to combine diverse tissue contrasts within a single model. Other methods involve fusion at a later stage, where integration of different imaging modalities occurs within the network’s internal layers or output stage, offering more flexibility but with higher complexity [11].

Various approaches are represented in the literature, with a general agreement that incorporating data from multiple imaging modalities improves segmentation accuracy and model generalization. Moreover, multi-dimensional U-Net variants have demonstrated strong performance across diverse medical datasets [12]–[14]. Despite the clear advantages, the application of multi-channel and multi-modal approaches to renal image segmentation appears to be limited.

This thesis builds on previous research by using a multi-channel approach that combines data from multiple MRI sequences to segment not only the kidney as a whole, but also its internal structures like the cortex and medulla. Unlike many of the earlier methods that focus on CT or single-modality MRI to segment the entire kidney, this approach combines data from different MRI acquisitions to help separate internal kidney structures, which are often hard to distinguish due to low tissue contrast.

2 Theory
This chapter provides the theoretical background relevant to this study on kidney segmentation using deep learning. It begins with an introduction to kidney anatomy, renal function, and clinical relevance of conditions like CKD. Following this, key concepts related to magnetic resonance imaging will be introduced, highlighting its role in renal imaging and functionality assessments, as well as common artifacts. Furthermore, the fundamentals of deep learning and neural networks, specifically for the task of semantic segmentation, will be explained.
This includes an introduction to state-of-the-art approaches, focusing on convolutional neural networks and architectures like the U-Net and ResUNet. Lastly, this chapter addresses training strategies for deep neural networks, including optimization algorithms, loss functions specific to medical segmentation, and commonly used evaluation metrics.

2.1 Kidney Anatomy and Physiology

The kidneys are paired, bean-shaped organs located on either side of the spinal column in the retroperitoneal space, posterior to the abdominal cavity [15]. They play a crucial role in maintaining homeostasis by filtering blood, regulating electrolyte balance, producing hormones, and excreting metabolic waste products and excess water from the bloodstream. Normally, the human body has two kidneys, identical in structure and function.

Each kidney consists of distinct anatomical regions, which are essential for its filtration and regulatory functions. As illustrated in Figure 2.1, the outermost layer is the renal cortex, where blood filtration begins. The cortex contains the glomeruli, clusters of tiny blood vessels that initiate the filtration process, and the proximal tubules, which reabsorb essential nutrients and regulate fluid balance [15].

Beneath the cortex lies the renal medulla, organized into cone-shaped renal pyramids. These structures contain loops of Henle and collecting ducts, both important for concentrating urine and regulating water reabsorption. At the tip of each pyramid, urine drains into the minor calyces, which merge into the major calyces and ultimately form the renal pelvis. The renal pelvis acts as a central collection area, directing urine into the ureter for excretion. Together, the cortex and medulla form the renal parenchyma, the kidney's functional tissue.

Although the kidneys are bilaterally symmetrical in structure, they are positioned asymmetrically.
The left kidney, or sinistral (SIN) kidney, is located slightly higher than the right, or dextral (DX), kidney, because the liver on the right side pushes the DX kidney downward. This anatomical asymmetry, along with natural variations in kidney size and shape due to individual differences or pathological conditions, presents challenges for automated renal segmentation [3], [16].

Figure 2.1: Anatomical illustration of the normal renal structure, showing the left kidney in coronal orientation. The renal parenchyma comprises the cortex and the medullary pyramids. Created with BioRender.com.

2.1.1 Chronic Kidney Disease

CKD is a global health challenge, claiming more than 1 million lives annually [2]. This irreversible and progressive condition affects millions worldwide, severely impairing kidney function and, consequently, quality of life. Despite its increasing prevalence, CKD symptoms are subtle, and by the time they become apparent, the disease has often progressed significantly. As a result, early-stage CKD frequently goes undiagnosed, making timely intervention a critical challenge. The increasing incidence and mortality rates underscore the need for better diagnostic tools and more effective management strategies.

As CKD progresses, it can lead to end-stage renal disease (ESRD), which requires expensive kidney replacement treatments, such as dialysis or transplantation. ESRD places a significant financial burden on healthcare systems worldwide, with dialysis programs growing annually by 6% to 12% over the past two decades [6]. A key barrier to early diagnosis of CKD is the lack of symptoms in the early stages of the disease, highlighting the need for appropriate screening and diagnostic methods. Early detection is crucial for preventing disease progression and the development of complications.

MRI has emerged as a useful non-invasive tool in assessing CKD progression.
For instance, MRI helps detect tissue changes associated with processes such as cyst formation and fibrosis, providing important information on the extent of kidney damage and the effectiveness of treatment [17].

Renal volume is another important biomarker, as its reduction due to glomerular loss correlates with decreased nephron function and impaired kidney filtration, leading to complications like fluid overload and ESRD [18]. Monitoring renal volume provides insights into disease progression, enabling clinicians to adjust treatment plans.

2.2 Magnetic Resonance Imaging

MRI is a widely used, non-invasive medical imaging technique that relies on strong magnetic fields and radio waves to produce detailed 2D or 3D images of internal organs and structures. It is particularly useful for imaging soft tissues such as the brain, liver, and kidneys. By providing high-resolution images, MRI plays an important role in diagnosing and monitoring various medical conditions, as well as in understanding anatomical and physiological changes. In the context of renal imaging, MRI enables non-invasive tissue characterization and early detection of renal disease progression, helping to predict clinical outcomes and guide treatment decisions [17].

2.2.1 Physical Principles of MRI

The principles behind MRI rely on the interaction between hydrogen nuclei and a magnetic field. During an MRI scan, the strong magnetic field generated by the scanner causes protons in the body to align along the direction of the field, creating a net magnetization known as the equilibrium magnetization. When a radiofrequency (RF) pulse is applied, it temporarily disturbs this equilibrium by tipping the magnetization away from its aligned state, leading to a temporary loss of longitudinal alignment and the development of magnetization in the transverse plane.
These changes produce measurable alterations in the net magnetization. Following the RF pulse, the longitudinal magnetization ($M_z$) is reduced, while the transverse magnetization ($M_{xy}$) increases. It is this transverse component that generates a detectable signal in the MRI receiver coils.

Once the RF pulse is turned off, the system undergoes a process known as relaxation, during which the magnetization gradually returns to its equilibrium state due to interactions between the hydrogen nuclei and their surrounding environment. Relaxation occurs on two distinct time scales: $M_{xy}$ decays according to the T2 time constant, while $M_z$ recovers according to the T1 time constant [19]. The relaxation times vary across tissue types and are influenced by factors such as magnetic field strength, tissue composition, and water content. Because these relaxation times are tissue-dependent, they form the basis for the contrast seen in the resulting images.

2.2.2 MRI Sequences

MRI uses different pulse sequences to highlight different structural and functional characteristics of tissues. Each sequence is characterized by a specific combination of timing, RF pulses, and gradient fields that manipulate the magnetic properties of protons in the body and enable the acquisition of images with the desired resolution and contrast. The choice of imaging sequence determines the type of contrast and, consequently, the kind of physiological or anatomical information captured, such as tissue structure, composition, or water content.

Modified Look-Locker Inversion Recovery

This project utilizes data from several MRI sequences to capture distinct features of kidney anatomy and function. One of them is Modified Look-Locker Inversion Recovery (MOLLI), a widely used sequence for T1-weighted (T1w) imaging. The images are acquired in synchrony with the subject's heartbeat, which helps to reduce artifacts caused by cardiac pulsation or respiratory motion [20].
As the technique is widely available across imaging sites, MOLLI is now a well-established method in renal T1 mapping [17].

In T1 mapping, the MOLLI sequence generates a series of images with different inversion times (TIs), where the TI is the time between the inversion pulse and signal acquisition. These images enable voxel-wise calculations to derive a quantitative T1 map, in which each voxel reflects the T1 relaxation time of the underlying tissue [21]. In conventional T1-weighted images, tissues with shorter T1 values appear brighter and those with longer T1 values appear darker; the T1 map quantifies these differences, enabling detailed tissue characterization. In the context of renal imaging, T1 mapping is useful for non-invasive assessment of renal microstructure. Changes in T1 relaxation times can indicate pathological changes such as fibrosis or inflammation, making T1 a promising biomarker for early-stage CKD [2].

T2*-weighted Imaging

Another useful MRI sequence for renal imaging is T2*-weighted (T2*w) imaging, a pulse sequence that measures and displays differences in T2* relaxation times across tissues [22]. The main difference between T2* relaxation and conventional T2 relaxation is that T2* also accounts for magnetic field inhomogeneities arising from susceptibility differences between tissues. By acquiring a series of T2*w images with varying T2* sensitivities and estimating the T2* relaxation times through pixel-wise modeling, a T2* map can be generated [23]. This parametric map visualizes the T2* values of each tissue in the image, with fluids such as water appearing bright. The T2* relaxation time serves as a marker of tissue oxygenation, making T2* mapping valuable for monitoring renal oxygenation. This can in turn be useful for indicating progression of renal diseases or evaluating the effects of drugs or treatments [24].

Diffusion-weighted Imaging

Diffusion-weighted imaging (DWI) is an imaging modality that can provide additional functional data in kidney analyses.
This sequence measures the diffusion properties, that is, the random Brownian motion, of water molecules within tissues, offering insights into cell density and structural integrity [25]. The general principle behind the DWI sequence is that it measures diffusion-related attenuation of the MR signal. Diffusion-sensitizing gradients are applied, which make the signal sensitive to the movement of water molecules along the gradient directions [26]. The strength and timing of these gradients are quantified by the b-value, a parameter that determines the degree of diffusion weighting. A higher b-value gives stronger diffusion weighting and therefore greater diffusion-related signal attenuation [25]. Darker pixels in the DWI image are thus the result of greater signal loss caused by more motion of the water molecules.

The resulting data are used to calculate a parametric map of the apparent diffusion coefficient (ADC) within the tissues. The ADC map allows quantification of water movement, as contrast in the map reflects differences in diffusion: a higher ADC value indicates areas with less restricted diffusion and more motion [27]. This provides functional information that complements the anatomical information from other imaging sequences. Renal DWI serves as a valuable tool for assessing renal microstructure and circulation, with ADC being a widely used diffusion biomarker and a prognostic indicator for various kidney diseases [2].

2.2.3 MRI Artifacts

In magnetic resonance imaging, artifacts frequently arise from equipment malfunctions, the choice of imaging technique, or the inherent physics of the modality [28]. Artifacts are defined as signals that do not correspond to true anatomy, or as distortions and deletions of anatomical information. There are many distinct types of MRI artifacts, each with its own characteristic appearance.
One of the most common is the motion artifact, which, as its name implies, occurs when sudden movement in the region of interest (ROI) produces ghosting or blurring of structures. Another frequent artifact is aliasing, which arises when anatomy extends beyond the field of view (FOV) and its signal is wrapped back into the image, potentially obscuring underlying tissues. When aliasing combines with magnetic field inhomogeneities, it can give rise to the so-called zebra stripe artifact, with alternating bright and dark bands across the FOV that may mask anatomical detail. Additionally, the partial volume artifact arises when a single voxel contains multiple tissue types, often due to limited resolution or minor motion, so that the resulting signal is an average of those tissues, reducing apparent resolution and potentially obscuring small structures. Examples of these artifacts are shown in Figure 2.2.

Figure 2.2: Examples of common MRI artifacts: (a) motion artifact, (b) zebra stripe artifact, and (c) partial volume effect. Images (a) and (b) are adapted from Stadler et al. [28], while (c) is courtesy of Antaros Medical AB.

2.3 Deep Learning for Image Segmentation

Deep learning, a subset of machine learning, allows systems to learn from experience rather than relying on pre-programmed knowledge. Deep neural networks (DNNs) have proven successful in various computer vision tasks, such as image segmentation, where models can automatically delineate ROIs [29]. This is valuable in medical imaging, for example for organ segmentation. DNNs trained on labeled image data can efficiently segment areas in medical images, providing guidance for both diagnosis and treatment planning. The process of training an algorithm using labeled datasets is known as supervised learning.
One of its advantages is that it generally achieves high accuracy and, with sufficient training data, tends to outperform other approaches such as unsupervised or semi-supervised learning in medical segmentation [30].

Among the various image segmentation techniques, semantic segmentation is one of the most widely used in the medical field. Semantic segmentation refers to algorithms that classify each pixel in an image by assigning it to a specific object class, effectively grouping pixels that belong to the same category [31]. This pixel-level classification enables clear identification and delineation of distinct structures within an image. Semantic segmentation has numerous applications, particularly in medical image analysis. It plays a crucial role in enhancing diagnostic accuracy by identifying and segmenting ROIs, such as tumors, lesions, and anatomical structures [32]. By leveraging deep learning approaches, it enables automation of tasks like annotation and boundary delineation, thereby improving workflow efficiency and reducing the risk of human error.

2.3.1 Basic Concepts of Deep Learning

Deep learning algorithms are loosely inspired by the way the human brain works. They use a multi-layered architecture in which artificial neurons are interconnected between layers to learn hierarchical representations of data [33]. The artificial neuron is the essential component of a DNN; it transforms an input vector into a scalar output through a weighted sum followed by a non-linear activation. Mathematically, it can be expressed as

\[ z_k = \sum_{i=1}^{n} w_{ki} x_i + b_k, \tag{2.1} \]

\[ a_k = f(z_k), \tag{2.2} \]

where $x_i$ is the $i$-th element of the input vector, $w_{ki}$ is the weight connecting input $i$ to neuron $k$, and $b_k$ is the bias term for neuron $k$. The function $f$ denotes the activation function, and $a_k$ is the scalar output of the neuron. In a multi-layer network, passing information between neurons is known as forward propagation, and is repeated across layers $l$:

\[ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{2.3} \]

\[ a^{(l)} = f(z^{(l)}). \tag{2.4} \]

Once forward propagation is completed across the entire network, the model's output is computed and compared to the actual target value using a loss function, denoted $L$. This function measures the difference between the predicted and actual values, providing information that guides the model during training toward an optimal set of parameters.

To train the neural network and minimize the loss $L$, an operation called backpropagation is typically performed, where the gradient of the loss with respect to the weights is computed using the chain rule:

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}. \tag{2.5} \]

These gradients are used to update the weights in the model. By iteratively updating the weights to minimize the loss, the model learns and improves. Training has converged when the gradients approach zero, indicating a local or global minimum of the loss function.

The gradients are applied by a component called an optimizer, which adjusts the model's weights to minimize the loss function. Optimizers are typically based on variants of gradient descent, which updates the parameters in the direction of decreasing error. Among the many available optimizers, one of the most commonly used with CNNs and image segmentation tasks is the Adam optimizer. Adam is a stochastic gradient descent method that adaptively adjusts the learning rate for each parameter using estimates of the first and second moments of the gradients [34]. The algorithm maintains running averages of both the gradients and their squared values, corrects the bias introduced by initializing these averages at zero, and uses the corrected estimates to scale the parameter updates.
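As an illustration of the update rule just described, the following is a minimal pure-Python sketch of a single Adam step, applied to a toy one-parameter problem. It is not the implementation used in this thesis: the function names, the toy loss $L(w) = (w - 3)^2$, and the hyperparameter values (commonly cited defaults for the decay rates) are assumptions chosen for illustration only.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a flat list of parameters.

    m and v hold the running averages of the gradients and squared gradients;
    t is the 1-based step counter used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # correct zero-initialization bias
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_toy_loss(steps=2000, lr=0.01):
    """Minimize the illustrative loss L(w) = (w - 3)^2 with Adam."""
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * (w[0] - 3.0)]                  # dL/dw
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w[0]


if __name__ == "__main__":
    print(minimize_toy_loss())
```

Because the update divides the averaged gradient by the square root of the averaged squared gradient, the very first step has a magnitude close to the learning rate regardless of the gradient's scale, which is one reason Adam is comparatively insensitive to gradient magnitude.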
2.3.2 Convolutional Neural Networks

Convolutional neural networks (CNNs) are a type of artificial neural network that has become especially important for imaging purposes [35]. A CNN consists of several building blocks called layers, each with a specific role in processing data for the network.

Convolutional Layer

The convolutional layer is the primary component of a CNN and the origin of its name [35]. The main purpose of this layer is feature extraction, which is achieved by applying a convolution operation to the input data using a set of learnable filters, also called kernels. Each kernel slides over the input and produces a feature map that captures local patterns; by stacking such layers, the network builds spatial hierarchies of features, allowing it to recognize increasingly complex patterns.

The convolution operation, illustrated in Figure 2.3, is a linear operation that involves taking the element-wise product of the kernel and a local region of the input. These values are summed into a single output value at the corresponding position in the resulting feature map. By applying multiple kernels, the network can extract different types of features from the input image in each layer, such as edges, textures, or shapes. In a standard convolutional layer, each kernel spans all input channels: the channel-wise products are summed, so each kernel yields one feature map that combines information from every input channel. A convolutional layer is characterized by its kernel size, stride, and padding, with stride determining how many pixels the kernel moves at each step, and padding adding extra borders around the input to control the spatial dimensions of the output.

Figure 2.3: An example of a convolutional layer with a kernel size of 3×3, a stride of 1, and no padding.

Pooling Layer

Pooling layers are another essential component of CNN architectures [35]. Their primary function is to reduce the spatial dimensions of feature maps while keeping the most important information.
Pooling layers reduce the number of values the network needs to process in subsequent layers, which lowers the computational load and makes the network less likely to overfit. One commonly used pooling layer is max pooling, illustrated in Figure 2.4. It works by dividing the input feature map into smaller patches and keeping only the maximum value from each patch in the output. The stride controls how far the patch moves at each step; larger strides result in greater downsampling and smaller output dimensions. This downsampling can also be achieved by using strided convolutions.

Figure 2.4: Max pooling layer with a kernel size of 2×2 and a stride of 2.

Dropout Layer

Overfitting occurs when a model learns the training data too closely and fails to generalize. The dropout layer is an important architectural component for combating it. Dropout, introduced as a regularization technique by Hinton et al. [36], works by randomly omitting hidden units in the network during training with a fixed probability. This prevents any single hidden unit from relying too heavily on the presence of others, which helps improve generalization and reduce overfitting.

Activation Functions

Activation functions are applied after convolutional layers to introduce non-linearity into the model, allowing the neural network to capture complex relationships within the data [35]. Among the various activation functions, the Rectified Linear Unit (ReLU) is widely used in CNNs, as it helps reduce the vanishing gradient problem, accelerates training, and enables the model to learn more complex patterns. The ReLU function outputs the input value for positive inputs and zero for negative inputs, and is mathematically defined as

\[ f(x) = \max(0, x). \tag{2.6} \]

To address the issue of zero gradients in the negative input range of ReLU, as shown in Equation 2.6, the Parametric ReLU (PReLU) activation function can be used.
The PReLU function is defined as

\[ f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \le 0 \end{cases} \tag{2.7} \]

where $\alpha$ is a learnable parameter. Unlike the standard ReLU, which outputs zero for all negative inputs, PReLU introduces a learnable, non-zero slope for negative input values through $\alpha$, enabling gradients to propagate and improving training stability. This adaptive mechanism has been shown to enhance model performance at minimal additional computational cost [37]. The difference between the ReLU and PReLU activation functions is illustrated in Figure 2.5.

In the final classification layer, the activation function typically differs from those used in earlier layers, as it determines the predicted class of each input [35]. In binary classification tasks, the sigmoid function is typically used, producing a single probability value that indicates the likelihood of the input belonging to the foreground class. For multi-class problems, the softmax function is typically used instead. This function normalizes the raw output of the final layer, converting it into class probabilities, where each value ranges between 0 and 1 and the sum of all values equals 1. These normalized probabilities support classification by identifying the class with the highest predicted probability. Visual representations of both activation functions can be seen in Figure 2.6.

Figure 2.5: Comparison of activation functions: (a) ReLU, and (b) PReLU.

Figure 2.6: Illustration of activation functions commonly used for classification: (a) sigmoid function, and (b) softmax function.

2.3.3 U-Net

Among the most commonly used CNN architectures for semantic segmentation is the U-Net, first introduced in 2015 [38]. Originally developed for biomedical image analysis, U-Net was designed to perform well even with limited training data, making it particularly useful in medical imaging applications, where annotated datasets are often scarce.
U-Net has since become a popular choice for segmenting both 2D and 3D medical images from different modalities, including MRI and CT [39].

The U-Net has a characteristic U-shaped architecture consisting of a contracting path (encoder) and an expansive path (decoder). The encoder extracts features to learn a more compressed representation of the input. It is built from stacked encoder blocks, which help the network learn increasingly abstract representations of the input image at each level. Each encoder block consists of a series of convolutional layers, with a ReLU activation function following each convolution. Max-pooling operations are applied at the end of each block to downsample the feature maps.

The decoder then reconstructs the segmentation mask by gradually increasing the resolution using transposed convolutions. Each decoder block mirrors its encoder counterpart, consisting of a series of convolutional layers with ReLU activations. Skip connections link each encoder block to its corresponding decoder block, so that the network combines high- and low-level features at each level. This preserves spatial information and improves segmentation accuracy. Finally, a 1×1 convolutional layer produces pixel-wise predictions, resulting in the predicted segmentation mask.

2.3.4 ResUNet

ResUNet is a deep learning model that builds on the U-Net architecture by incorporating residual units. This model was first introduced by Diakogiannis et al. in 2020 [40]. ResUNet retains the encoder–decoder structure of U-Net but replaces standard convolutional blocks with residual blocks. These residual units enable better gradient flow during training and make it possible to construct deeper architectures without suffering from vanishing gradient issues.
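The core idea of a residual unit, adding a block's input back onto its output so that $H(x) = F(x) + x$, can be sketched in a few lines of pure Python. This is only a toy illustration on flat lists of numbers, not the convolutional residual block used in ResUNet; `residual_block` and `toy_branch` are hypothetical names, and the scale-then-ReLU branch is an assumption standing in for the learned layers.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x element-wise, where residual_fn computes the branch F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for the identity shortcut")
    return [f_i + x_i for f_i, x_i in zip(fx, x)]


def toy_branch(x, weight):
    """Hypothetical residual branch: scale each value, then apply ReLU."""
    return [max(0.0, weight * x_i) for x_i in x]


x = [1.0, -2.0, 3.0]

# With a zero-weight branch, F(x) = 0 and the block reduces to the identity.
identity_out = residual_block(x, lambda values: toy_branch(values, 0.0))

# With a non-zero branch, the block only has to learn the change relative to x.
scaled_out = residual_block(x, lambda values: toy_branch(values, 0.5))

print(identity_out)
print(scaled_out)
```

The zero-weight case shows why the shortcut helps deeper networks: a block that has learned nothing still passes its input through unchanged, rather than degrading it.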
The result is a more stable and efficient model capable of capturing both fine and high-level semantic features.

As the depth of a neural network increases, the training process can become more difficult due to issues such as vanishing gradients and model degradation. Residual connections help address these challenges by allowing the input of a set of layers to bypass those layers and be added directly to their output [41]. The residual connection can be represented by the equation

\[ H(x) = F(x) + x, \tag{2.8} \]

where $H(x)$ represents the desired mapping, and $F(x)$ is the residual function learned by the network. There are many advantages to using residual units. They help reduce training error in deeper architectures, as the identity mapping increases the likelihood of finding suitable initial parameters. An additional benefit of residual connections is that they provide a shortcut for gradients during backpropagation, allowing deeper networks to learn more effectively.

2.3.5 Loss Functions for Medical Image Segmentation

The loss function is an essential part of training neural networks, as it quantifies the error between predicted and actual values. Different problems require different loss functions, and in medical image segmentation, several loss functions are commonly used, including cross entropy, Dice loss, Tversky loss, and their variants [42]. Choosing the right loss function is particularly important in medical image segmentation, where datasets often have a class imbalance. This imbalance occurs when there are far fewer foreground pixels than background pixels, and is common when the ROI is small in comparison to the full image volume. A suitable loss function helps the model focus on both the foreground and background, which helps improve the accuracy of the segmentation.

The Dice loss, introduced by Milletari et al. [43], is commonly used in medical image segmentation because it effectively handles class imbalances.
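To make the overlap-based loss concrete, here is a minimal pure-Python sketch of a soft Dice loss for a single foreground class, following the form of Equation 2.9 with squared terms in the denominator and a small smoothing constant. It operates on flattened lists of values and is not the MONAI loss used in this thesis; the function name `soft_dice_loss` and the default value of `eps` are assumptions for illustration.

```python
def soft_dice_loss(y_true, y_pred, eps=1e-5):
    """Soft Dice loss for one class on flattened voxel lists.

    y_true holds ground truth values (0 or 1); y_pred holds predicted
    foreground probabilities in [0, 1]. eps avoids division by zero
    when both the ground truth and the prediction are empty.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    denominator = sum(t * t for t in y_true) + sum(p * p for p in y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Perfect overlap gives a loss of 0; no overlap gives a loss close to 1.
perfect = soft_dice_loss([1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0])
disjoint = soft_dice_loss([1, 0], [0.0, 1.0])

# A background-only slice with an empty prediction is not penalized.
empty = soft_dice_loss([0, 0, 0], [0.0, 0.0, 0.0])

print(perfect, disjoint, empty)
```

Because the loss depends only on the overlap relative to the sizes of the two masks, the large number of correctly predicted background voxels does not dominate it, which is why it suits small ROIs.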
The loss function is based on the Dice coefficient (explained further in Section 2.3.6), which is a statistical measure of overlap between two sets. The Dice loss, $L_{\text{Dice}}$, is calculated as follows:

\[ L_{\text{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{y}_i^2 + \epsilon}, \tag{2.9} \]

where $N$ is the total number of voxels in the image, $\hat{y}_i$ is the predicted probability of voxel $i$ belonging to the foreground, and $y_i$ is the ground truth value for voxel $i$. The term $\epsilon$ is a small smoothing constant added to avoid division by zero in background-only images, where both the prediction and the ground truth are empty. The Dice loss is typically calculated for each class separately, and the average loss across all classes is then used during training.

2.3.6 Evaluation Metrics

Evaluation metrics are used to quantitatively assess the performance of segmentation algorithms. In the context of medical image segmentation, various metrics exist, each suited to different types of segmentation tasks and applications. For evaluating segmentation overlap, metrics are typically based on pixel-wise comparisons between the predicted segmentation mask and the ground truth, expressed in terms of classification outcomes. True positives (TP) are pixels correctly identified as belonging to the foreground object of interest, false positives (FP) are background pixels incorrectly labeled as foreground, and false negatives (FN) are foreground pixels missed by the model and misclassified as background. The relationship between these classification outcomes is visualized in Figure 2.7.

Figure 2.7: Illustration of the relationship between ground truth and predicted classifications. The overlapping region represents true positives (TP), where the prediction correctly matches the ground truth. The left (blue) region denotes false negatives (FN), where actual positives were missed.
The right (red) region indicates false positives (FP), where the background is incorrectly segmented as foreground.

Two of the most commonly used metrics that rely on these classification outcomes are the Dice score and the Jaccard index, also known as Intersection over Union (IoU). Both metrics quantify the similarity between predicted and ground truth masks. The Dice score is defined as twice the area of overlap between the predicted and ground truth masks, divided by the total pixel area of both masks:

\[ \text{Dice score} = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}. \tag{2.10} \]

The IoU provides a slightly stricter measure of overlap and is defined as the area of overlap divided by the area of the union of the predicted and ground truth masks:

\[ \text{IoU} = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}. \tag{2.11} \]

While both metrics assess how closely the prediction matches the ground truth, the IoU penalizes over-segmentation and under-segmentation more strongly than the Dice score, making it particularly useful in applications requiring precise delineation.

2.4 Multi-channel Image Segmentation

In deep learning-based image segmentation, multi-channel and multi-modal input strategies have proven effective in enhancing model performance [10]. A multi-channel image consists of multiple correlated image channels, each capturing different aspects of the same scene or object. For instance, these could be images acquired with different cameras, time points, or acquisition parameters. The channels are stacked as separate input layers, allowing the network to process them simultaneously. By combining multiple input channels, each pixel is represented by a multi-dimensional vector instead of a single intensity value, which helps CNNs learn complementary feature representations.

In medical image analysis, this becomes particularly useful. Many anatomical structures are complex, overlapping, or poorly defined in single-channel images.
To overcome this, scans are often acquired with varying acquisition parameters. These variations help enhance tissue contrast and allow for the extraction of quantitative measurements such as parametric maps. Multi-modal input further extends the concept of multi-channel input by combining data from different imaging modalities, or from MR sequences with different acquisition parameters. As each image provides unique and complementary information about different tissues, the network learns from images with richer contrast information, which can be useful for differentiating between complex structures. Research indicates that networks trained on multi-channel or multi-modal data often generalize better, particularly across patients and imaging conditions [10], [32]. They reduce the risk of overfitting to channel- or modality-specific features and can even be advantageous for training on smaller datasets, thanks to the diversity of input information to learn from.

Figure 2.8: Example of a multi-channel input image with five different image channels from renal MRI data. The channels provide different views of the same anatomy with complementary tissue contrast information.

3 Methods

To address the challenges of manual kidney segmentation, this thesis proposes a deep learning-based method for automated segmentation of the kidney parenchyma, cortex, and medulla. Because the contrast between these internal kidney structures is often limited in certain MRI sequences, a multi-channel approach is employed to leverage complementary information from multiple modalities or from images acquired with different parameter settings.

The method was implemented using the Medical Open Network for AI (MONAI), an open-source platform optimized for deep learning in medical imaging [44].
The experimental workflow included three main experiments: single-channel segmentation of parenchyma and cortex from T1-MOLLI images, extension to multi-channel input using multiple TIs, and finally, multi-modal integration of DWI and T2* mapping. An overview of the proposed segmentation pipeline is visualized in Figure 3.1.

Figure 3.1: Overview of proposed pipeline for kidney segmentation.

3.1 Data Description

The dataset used in this study consists of kidney MR images provided by Antaros Medical AB. The clinical data includes a total of 30 patients, each diagnosed with CKD at varying stages. The patients were divided between two scanner sites, referred to as site A and site B. All patients underwent MRI scans at two separate visits, with each patient being scanned at the same site for both visits. Hence, there were 60 scans available for the study.

MRI acquisitions were performed using 1.5 Tesla scanners (Siemens for site A and GE for site B). Each site acquired scans for three different sequences, T1-MOLLI, DWI, and T2* mapping, with slightly varying acquisition parameters between the two sites. All scans consisted of coronal 2D slices, acquired in sequence, with in-plane pixel resolution ranging from 1.5×1.5 mm² to 1.9531×1.9531 mm², reflecting differences in scanner settings and FOV requirements. These slice-wise acquisitions were stacked into a 3D image volume, which was then stored and provided in VTK format. Table 3.1 provides a full overview of the imaging parameters for each modality used in this study, including the number of slices, slice spacing, image resolution, and the corresponding parametric maps generated.

Table 3.1: Summary of imaging parameters by modality, reflecting the variability in acquisition protocols depending on the modality and the scanner setup.
| Modality | Number of slices | Slice spacing | Image resolution | Parametric map |
|---|---|---|---|---|
| T1-MOLLI | 9-19 | 5 mm | 288×288 (A), 256×256 (B) | T1 |
| DWI | 5 | 10 mm | 210×210 (A), 256×256 (B) | ADC |
| T2* | 5 | 10 mm | 288×288 (A), 512×512 (B) | T2* |

The T1-MOLLI images were acquired at multiple TIs, ranging from 174 ms to 4452 ms. DWI images were acquired at multiple b-values, ranging from 0 to 500 s/mm². The T2* mapping image data were acquired at multiple TEs, ranging from 3 ms to 62 ms.

For each patient scan, the dataset includes MRI data from all sequences, their respective parametric maps, and manually segmented ground truth labels for the cortex and parenchyma. These segmentations were manually delineated by trained image analysts. Additionally, all images and parametric maps were registered prior to this study to ensure spatial alignment across modalities. Specifically, images were registered using the T1-MOLLI image at TI = 1300 ms as the fixed reference image. Example images from the T1-MOLLI, DWI, and T2* sequences acquired with varying parameters can be found in Appendix A.

3.2 Data Exclusion

The initial step in the data processing involved handling the input data to support downstream training of the segmentation algorithm. Images from the dataset were manually reviewed to assess image quality and to take an inventory of which imaging modalities and parametric maps were available for each patient visit. Some visits lacked one or more MRI sequences or were missing corresponding segmentation labels, which are essential for supervised learning. Thus, as a first processing step, data were inspected to exclude subjects with incomplete image data. Additionally, images were excluded due to image quality issues, such as low contrast, larger imaging artifacts, or abnormalities affecting overall clarity or visualization of the kidney. These exclusion criteria can be summarized as follows:

1. incomplete MRI data (e.g. missing registered T1-MOLLI images or T1 map)
2. missing or incomplete ground truth segmentation maps (e.g. missing cortex segmentation)
3. inconsistent slice count between images and corresponding segmentation map
4. poor image quality, such as pronounced artifacts (e.g. zebra stripes) or low contrast
5. presence of large renal lesions (diameter > 30 mm)

After applying these exclusion criteria, the final dataset consisted of 37 scans (8 from site A and 29 from site B).

3.3 Dataset Split

Given the limited size of the dataset after exclusion, the split was set to include as many training samples as possible, while still having data available for validation and testing. To prevent data leakage, the dataset was split scan-wise, meaning that no data from the same patient scan appears in more than one subset. The MRI dataset was randomly split into three subsets for training (27 scans, 75%), validation (6 scans, 15%), and testing (4 scans, 10%). Efforts were made to ensure a balanced distribution of renal volumes across the data splits.

3.4 Data Pre-processing

Before being input into the segmentation network, the image volumes went through several pre-processing steps. This pipeline included generating binary ground truth masks from manual segmentations, scaling and normalizing brightness and intensity values, and applying data augmentation techniques to increase variability in the dataset using the MONAI library. To further address the limited size of the dataset, a patch-based approach was used during training. This involved randomly cropping the original images into smaller patches, helping the model learn from a greater variety of spatial contexts.

3.4.1 Creation of Ground Truth Masks

The manually segmented ground truth masks were created by trained image analysts at Antaros Medical, who annotated the original MR images using green and cyan overlays to indicate the parenchyma and cortex in the left and right kidneys. These color-coded annotations served as the reference for generating binary segmentation labels.
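As a rough illustration of how color-coded overlays can be separated from a grayscale background, the sketch below converts each RGB pixel to HSV with Python's standard `colorsys` module and keeps pixels that are both saturated and within a green-to-cyan hue band. The function name `overlay_to_binary_mask`, the hue band, and the saturation threshold are illustrative assumptions and are not the values or the implementation used in this work.

```python
import colorsys


def overlay_to_binary_mask(rgb_image, hue_range=(0.25, 0.55), min_saturation=0.5):
    """Return a 0/1 mask of saturated pixels whose hue lies inside hue_range.

    rgb_image: nested list [row][col] of (r, g, b) tuples with values 0..255.
    hue_range, min_saturation: hypothetical defaults (hue is in 0..1, where
    green is about 0.33 and cyan about 0.50). Grayscale pixels have zero
    saturation and are therefore rejected.
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            is_overlay = s >= min_saturation and hue_range[0] <= h <= hue_range[1]
            mask_row.append(1 if is_overlay else 0)
        mask.append(mask_row)
    return mask


# Toy 2x3 image: gray background, one green pixel, one cyan pixel, one red pixel.
toy_image = [
    [(90, 90, 90), (0, 255, 0), (200, 200, 200)],
    [(0, 255, 255), (30, 30, 30), (255, 0, 0)],
]
print(overlay_to_binary_mask(toy_image))  # [[0, 1, 0], [1, 0, 0]]
```

Narrowing the hue band to only green or only cyan would give one mask per kidney. For real images a vectorized array implementation would be preferable to this per-pixel loop.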
To convert the color segmentations into segmentation labels that can serve as ground truth during training of the CNN, a series of transforms were applied. The first transform used was ForegroundMaskd, with the input being the color-annotated ground truth segmentation. Within this transform, a custom HSV filter was applied to isolate the green and cyan annotations from the grayscale background, effectively removing non-relevant pixels. This process generated a binary mask for each endpoint, which was then used as a label for the network. An example of the manually segmented ground truth mask and the resulting binary mask for both parenchyma and cortex is shown in Figure 3.2.

Figure 3.2: Example of ground truth segmentation and corresponding binary masks after applied transformations. From left to right: parenchyma ground truth, parenchyma binary mask, cortex ground truth, and cortex binary mask. The manually segmented ground truth image shows the left kidney (cyan) and right kidney (green), annotated by image analysts.

3.4.2 Brightness and Intensity Scaling

The MR acquisition and processing technique assigns intensity values to each voxel, reflecting the strength of the signal in the specific region. However, these values can vary significantly between scans due to various external factors, including scanner settings, patient movement, or background noise. Tissue pathology also influences image contrast. For example, patients with CKD often exhibit reduced tissue contrast compared to healthy individuals. This inconsistency can affect overall brightness and intensity in acquired images, and generate differences even across scans from the same patient.

When combining data from different MRI sequences or modalities, these variations become even more pronounced. Each sequence can produce images with varying intensity distributions and contrasts, even for the same anatomical structures.
To address this, and to be able to combine images with various intensities as multi-channel input in this work, histogram normalization was applied to the MRI data using HistogramNormalized. This technique redistributes intensity values to enhance contrast and rescales them to the range [0, 1]. This enhanced the visibility of renal structures like cortex and medulla, while making intensity values more consistent across the dataset.

For the parametric maps, which have very different intensity distributions compared to standard MR images, a different strategy was used. The ScaleIntensityRanged transform was applied individually to each parametric map. Based on inspection of the corresponding intensity histograms, specific input ranges were chosen and scaled to the [0, 1] interval. This approach enabled customized normalization, ensuring consistent scaling across maps despite large variations in raw intensity values.

3.4.3 Splitting into 2D Slices

The proposed model operates on 2D input slices extracted from volumetric (3D) images. To accommodate this input format, a pre-processing step was required to convert each 3D volume into a series of 2D slices. This was achieved by applying a custom function that systematically iterates through the depth dimension of each volume, extracting one slice at a time. As a result, each 3D image was converted into a series of 2D slices matching the original acquisitions, significantly increasing the total number of training samples available for the model. Additionally, all slices were resized to 256×256 to ensure consistent input dimensions throughout the dataset.

3.4.4 Augmentations

Data augmentation is a widely used technique in machine learning, particularly useful when working with limited datasets. It helps improve model generalization by introducing variability into the training data.
In this work, augmentations were implemented using MONAI's dictionary-based transforms, ensuring consistent application to both images and their corresponding labels.

A key strategy to address the limited dataset was using a patch-based approach, where the original training images were randomly cropped into smaller patches. This not only increased the number of training samples but also ensured the patches were small enough to fit into GPU memory. Specifically, the RandCropByPosNegLabeld transform was used to extract patches based on a defined ratio of foreground and background. For each training image, four patches of size 160×160 were generated. Given the relatively small size of the kidneys in the full input image, patches were sampled with a probability (p = 0.5) of being centered on foreground regions to ensure sufficient coverage of the ROI.

To further increase the size and variability of the training data, additional augmentations were applied to the extracted patches. These included random scaling, rotation and zooming to simulate anatomical differences. Additionally, random contrast adjustments were introduced to account for variations in acquisition conditions.

3.5 Network Configuration

A 3D ResUNet architecture proposed by Inoue et al. (2023) was identified as a strong baseline for kidney segmentation in this thesis, due to its demonstrated performance on a similar task and its use of residual connections to enhance feature learning [7].

The proposed architecture, seen in Figure 3.3, follows a U-Net structure with four downsampling steps. Each step includes a max pooling layer that reduces the spatial dimensions by a factor of two, resulting in a total downsampling factor of 2^4 = 16. The number of filters in the convolutional layers increases progressively, using 16, 32, 128, and 256 filters, respectively.
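As a quick check of the arithmetic behind the downsampling path, the following sketch lists the spatial size after each of the four pooling steps for the 160×160 training patches and the 256×256 slices mentioned earlier. The helper name `feature_map_sizes` is hypothetical. It only illustrates the factor-of-two reduction per step and is not part of the network implementation.

```python
def feature_map_sizes(input_size, num_downsampling_steps=4, factor=2):
    """List the spatial size at each level of the encoder (input size first)."""
    sizes = [input_size]
    for _ in range(num_downsampling_steps):
        if sizes[-1] % factor != 0:
            # Odd sizes would need padding or cropping in the skip connections.
            raise ValueError("size is not divisible by the pooling factor")
        sizes.append(sizes[-1] // factor)
    return sizes


print(feature_map_sizes(160))  # [160, 80, 40, 20, 10]
print(feature_map_sizes(256))  # [256, 128, 64, 32, 16]
```

Both sizes are divisible by 2^4 = 16, so the feature maps in the upsampling path can be matched to the skip connections without extra padding.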
A key feature of the network is the inclusion of two residual connections for each convolutional block, which, as mentioned in the introduction, helps maintain gradient flow and improve training stability.

While the original ResUNet was designed for 3D segmentation, this project implements a 2D adaptation of the architecture. Additionally, dropout layers were added to help prevent overfitting during training, which was particularly important given the limited size of the dataset. The proposed network configuration is designed to handle multi-dimensional inputs, making it suitable for processing multi-channel data.

Figure 3.3: Overview of proposed 2D ResUNet architecture. The left side shows the downsampling path with residual units, where convolutional channels increase from 16 to 256 to extract features. The right side corresponds to the upsampling path, where channels decrease from 256 to the number of output classes, combined with skip connections.

3.6 Model Implementation and Experimental Setup

The following subsections describe the experimental setup and implementation details of the proposed segmentation models based on the above-mentioned network architecture. This includes the design of three main experiments conducted to evaluate different segmentation approaches, and the types of MRI input data used, including single-modality, multi-channel, and multi-modal setups.

3.6.1 Single-Channel T1-MOLLI-based Segmentation

The primary experiment evaluated the performance of the proposed network configuration on T1w kidney images, specifically its ability to accurately segment parenchyma and cortex. The T1-MOLLI sequence was selected as a starting point due to its higher corticomedullary differentiation (CMD), providing visible contrast between the cortex and medulla.

Using the above-mentioned network architecture, two distinct segmentation strategies were implemented and compared.
The 2D slice-wise approach was compared to a 3D implementation of the network, similar to the one proposed in the original reference work. This setup allowed for a comparative study between a volumetric segmentation method and a slice-wise approach. Experiments were performed to investigate whether spatial relationships between adjacent slices could be leveraged to improve segmentation performance.

To ensure robustness in performance evaluation, 6-fold cross-validation was used instead of the conventional train-validation-test split. This offered a more reliable assessment of the overall network performance in the early experimental phase by minimizing bias in model evaluation. Specifically, the full dataset was divided into six equally sized folds. In each iteration, one fold served as the validation set while the remaining five were used for training. This iterative process ensures that each subset serves as the validation set exactly once, providing insight into the network performance on data with varying distribution.

3.6.2 Multi-Channel T1-MOLLI-based Segmentation

After evaluating the network's overall performance on the dataset, an experiment was conducted using combinations of multiple TIs as channel-wise input to the model. This approach aimed to determine whether segmentation performance and generalizability could be improved by leveraging the varying contrasts provided by different TIs.

To identify the optimal combination, each TI was individually input into the model and evaluated based on segmentation performance. These were then combined into a multi-channel input, with the goal of utilizing the unique features of each TI. The proposed method uses early fusion, in which the inputs are combined into a single multi-channel image that is fed into the segmentation network.

The best-performing configuration was then fine-tuned to maximize the model performance, incorporating additional data augmentations and dropout layers to enhance generalization.

3.6.3 Multi-Modal Segmentation Integrating DWI and T2*

As a final experiment, the initial segmentation approach, using only T1-MOLLI image data as input, was extended to incorporate multi-modal MRI data. Instead of relying on a single imaging sequence, the approach combined both DWI and T2* mapping images as inputs to evaluate whether integration of multiple image modalities could improve segmentation performance and generalizability.

Following a similar strategy to the proposed multi-channel T1-MOLLI approach, multi-channel inputs were constructed by concatenating DWI images acquired at different b-values and T2* mapping images captured at varying echo times. Additionally, parametric maps derived from all three modalities were added as inputs. The aim was to investigate whether a CNN that processes these inputs simultaneously can exploit complementary information across modalities to learn a richer and more diverse representation of tissue characteristics.

Since ground truth volume segmentations were not available specifically for the DWI and T2* modalities, segmentation masks were generated by resampling existing MOLLI segmentations. Five MOLLI ground truth slices were selected to match corresponding slices in the DWI and T2* datasets. These served as a new ground truth set for training and evaluating the multi-modal segmentation model.

3.7 Training and Evaluation

All models were trained on an NVIDIA GeForce RTX 2080 Ti GPU until convergence, for up to 4000 epochs. To accelerate the training process, MONAI's CacheDataset was used, which caches transformed data into memory during training. The Adam optimizer was used with a fixed learning rate of 10^-4. Model performance was continuously monitored using validation loss, and the model with the lowest validation loss during training was saved as the best-performing model.

For evaluation of segmentation performance, the segmentation output was compared to manual ground truth annotations.
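The overlap metrics used for this comparison follow Equations 2.10 and 2.11. A minimal sketch of how they can be computed for one binary slice is given below. The function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention rather than something specified in this work.

```python
def overlap_counts(pred, truth):
    """Count TP, FP and FN between two binary masks given as nested 0/1 lists."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def dice_score(tp, fp, fn):
    """Dice = 2TP / (2TP + FP + FN), as in Equation 2.10."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom  # assumption: empty vs. empty = 1


def iou_score(tp, fp, fn):
    """IoU = TP / (TP + FP + FN), as in Equation 2.11."""
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom  # assumption: empty vs. empty = 1


truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
pred = [[0, 1, 1, 1],
        [0, 0, 1, 0]]
tp, fp, fn = overlap_counts(pred, truth)
print(tp, fp, fn)              # 3 1 1
print(dice_score(tp, fp, fn))  # 0.75
print(iou_score(tp, fp, fn))   # 0.6
```

For a single mask pair the two metrics are related by IoU = Dice / (2 - Dice), which is why the IoU is never larger than the Dice score.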
The network outputs were passed through a sigmoid activation and thresholded to obtain binary masks. The segmentation overlap was then evaluated using both Dice score and IoU. These metrics were computed slice-wise during training and validation, and then aggregated to provide an overall mean. Both metrics were computed separately for the renal parenchyma and cortex, offering insight into segmentation accuracy and precision across the two target regions.

During evaluation, the entire image slice was processed step by step using a sliding window with the shape 160×160 and a 50% overlap along each spatial dimension. In overlapping regions, outputs were averaged to produce the final prediction. Zero-padding was applied to the edges of each slice as needed.

3.8 Post-processing

To refine the raw model predictions and prepare them for downstream analysis, a series of post-processing steps were applied. First, small objects were removed from the output segmentation masks using the RemoveSmallObjects transform, which filters out isolated clusters of pixels below a given size threshold. This helped reduce noise and false positives in the resulting segmentation output. An example of this transformation is illustrated in Figure 3.4.

Figure 3.4: Visualization of the RemoveSmallObjects transform in MONAI, here removing clusters with 100 pixels or less. Figure adapted from MONAI [45].

Since 2D slices were segmented independently, an essential post-processing step was the aggregation of the predicted slices into a reconstructed 3D volume. This volumetric assembly ensured the predicted segmentation aligned spatially with the original input scan, and was necessary for a more accurate quantification of the endpoints. Since the model segments both kidneys in the same image, the final prediction volume was divided at the midline to separate the left (SIN) and right (DX) kidneys.
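The midline split and the per-kidney voxel counting it enables can be sketched as follows. The function names are hypothetical. The example voxel size uses the 1.5×1.5 mm in-plane resolution and 5 mm slice spacing reported for the T1-MOLLI data, and treating the slice spacing as the voxel depth is an assumption. Which image half corresponds to SIN or DX depends on the image orientation and is deliberately left open here.

```python
def split_at_midline(volume):
    """Split a binary volume [slice][row][col] into two halves along the columns.

    Returns (first_half, second_half). Each has the input shape, with the
    opposite half set to zero. Mapping the halves to SIN/DX is left to the caller.
    """
    first_half, second_half = [], []
    for slice_2d in volume:
        mid = len(slice_2d[0]) // 2
        first_half.append(
            [[v if c < mid else 0 for c, v in enumerate(row)] for row in slice_2d]
        )
        second_half.append(
            [[v if c >= mid else 0 for c, v in enumerate(row)] for row in slice_2d]
        )
    return first_half, second_half


def volume_ml(volume, voxel_size_mm):
    """Volume in millilitres: number of foreground voxels times voxel volume."""
    n_voxels = sum(v for slice_2d in volume for row in slice_2d for v in row)
    dx, dy, dz = voxel_size_mm
    return n_voxels * dx * dy * dz / 1000.0  # 1 mL = 1000 mm^3


# Toy prediction: one slice, 2 rows x 4 columns.
toy_volume = [[[1, 1, 0, 1],
               [0, 1, 1, 1]]]
first, second = split_at_midline(toy_volume)
print(first)   # [[[1, 1, 0, 0], [0, 1, 0, 0]]]
print(second)  # [[[0, 0, 0, 1], [0, 0, 1, 1]]]
print(volume_ml(first, (1.5, 1.5, 5.0)))
```

A fixed split at the image center assumes that the two kidneys lie on opposite sides of the midline in every slice. A connected-component analysis would be an alternative if that does not hold.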
This separation allowed for computation of kidney volume and tissue-specific parameters (T1, ADC, T2*) for each kidney separately.

3.9 Volume Quantification

Once the predicted masks for each scan were generated and post-processed, kidney volume measurements were obtained from the predicted masks by summing the total number of voxels corresponding to parenchyma and cortex separately for each scan. The sum of voxels was then scaled by the voxel dimensions of the respective scan, resulting in the total kidney volume in milliliters for both the SIN and DX kidneys.

The quality of the segmentation and the model's ability to automatically quantify renal volumes were assessed by evaluating how well the predicted endpoint values matched the reference values, computed from manually created segmentations. This was quantified using the coefficient of determination (R²), which provides a measure of the correlation between predicted and reference volumes. The volumetric analysis was performed across all data splits, highlighting differences in how well the model had adapted to the training data, the distribution of training data volumes, and the accuracy of volume predictions on the validation and test data.

3.10 Quantification of Parametric Values

To quantify the parametric values, median values in the renal cortex and medulla, respectively, were extracted from parametric maps using the predicted segmentation masks. The medulla region was derived by subtracting the predicted cortex mask from the parenchyma mask, both of which were obtained from the model's output. This resulted in a binary medulla mask, excluding the cortical regions.

Before extracting parametric values of the two tissues, both the cortex and medulla masks were refined using binary erosion. Erosion was applied with a square structuring element to shrink the ROI, removing edge pixels that may contain noise or partial volume effects.
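The mask subtraction, erosion, and median extraction described in this section can be sketched in plain Python as shown below. The function names are hypothetical, the toy masks and map values are made up for illustration, and treating pixels outside the image as background during erosion is an assumed boundary convention.

```python
from statistics import median


def subtract_masks(a, b):
    """Return a mask that is 1 where a is 1 and b is 0 (e.g. parenchyma minus cortex)."""
    return [[1 if x and not y else 0 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def binary_erosion(mask, size=3):
    """Erode a 2D 0/1 mask with a size x size square structuring element.

    A pixel stays 1 only if every pixel in its neighbourhood is 1. Pixels
    outside the image count as background (an assumption of this sketch).
    """
    half = size // 2
    rows, cols = len(mask), len(mask[0])
    eroded = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            keep = True
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols) or mask[rr][cc] == 0:
                        keep = False
            eroded[r][c] = 1 if keep else 0
    return eroded


def masked_median(param_map, mask):
    """Median of the map values under the mask, or None if the mask is empty."""
    values = [v for map_row, mask_row in zip(param_map, mask)
              for v, m in zip(map_row, mask_row) if m]
    return median(values) if values else None


# Toy 5x5 example: the cortex is the outer ring, the medulla the inner 3x3 block.
parenchyma = [[1] * 5 for _ in range(5)]
cortex = [[1 if r in (0, 4) or c in (0, 4) else 0 for c in range(5)] for r in range(5)]
medulla = subtract_masks(parenchyma, cortex)
eroded_medulla = binary_erosion(medulla, size=3)  # only the center pixel survives
toy_map = [[10 * r + c for c in range(5)] for r in range(5)]
print(sum(map(sum, eroded_medulla)))           # 1
print(masked_median(toy_map, eroded_medulla))  # 22
```

Note that a thin region can disappear entirely under erosion, in which case `masked_median` returns `None` here. How such cases should be handled in practice is not covered by this sketch.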
This step ensured that the extracted values were representative of the central, more reliable tissue regions, and not influenced by misclassifications at the outer MR slices. An example of erosion using a 3×3 structuring element is illustrated in Figure 3.5.

Figure 3.5: Effect of erosion using a 3×3 structuring element.

After applying erosion, the refined cortex and medulla masks were applied to the original parametric maps (T1, ADC, and T2*) to calculate the median of the intensity values within the mask. As the parametric maps are sensitive to variations outside the ROI, where intensity values can vary significantly, extracting the median value ensures that the resulting values are more representative of the true tissue characteristics and less sensitive to outliers. For each image volume, pixel-wise parametric values were aggregated across a number of consecutive centered slices to reduce slice-to-slice variability. Lastly, the resulting median values were compared against reference values provided by image analysts using R² correlation values.

4 Results

In this chapter, the results from the training and evaluation of the implemented segmentation network are presented. It begins with a brief overview of the dataset and key findings from data exploration. The subsequent sections follow the structure of the experimental setup. First, results from single-channel segmentation of the renal parenchyma and cortex using T1-MOLLI images are presented. This is followed by results from multi-channel segmentation experiments using multiple inversion times from T1-MOLLI, along with the quantification of kidney volume and median T1 values. Finally, the chapter concludes with results from multi-modal segmentation that integrates DWI and T2* input data, including the quantification of ADC and T2* values for both cortex and medulla.
4.1 Data Exploration

Exploration of the dataset revealed the presence of multiple artifacts that affected image quality and the visibility of key anatomical structures of the kidney. Zebra stripe artifacts were among the most prevalent, appearing with differing severity across scans. While images with severe distortions that obscured renal boundaries were excluded due to potential impairment of segmentation performance, small artifacts with minimal impact on the ROI were retained. Examples of such images are presented in Figure 4.1, which illustrates the range of artifacts observed.

Figure 4.1: Examples of MRI artifacts and image distortions observed in the dataset. Yellow arrows highlight regions impacted by zebra striping and poorly defined kidney boundaries.

The leftmost image shows visible zebra striping. In this case, the artifact was determined to have little impact on the ROI, and the image was retained in the dataset. The second image shows blurring of the upper boundaries of the kidneys, likely caused by banding artifacts in outer slices. The third image shows poorly defined kidney edges, likely from partial volume effects. These distortions were commonly observed throughout the dataset, posing challenges for the segmentation models.

4.2 Single-channel T1-MOLLI-based Segmentation

To evaluate the performance of the 2D and 3D segmentation approaches for renal parenchyma and cortex, four models were trained using 6-fold cross-validation and TI1400 as input. Each model was evaluated individually and the results were averaged across folds. The results are presented in Table 4.1, showing the average performance and standard deviation across the cross-validation experiments.

Table 4.1: 6-fold cross-validation results for 2D and 3D ResUNet models. Performance metrics are shown as mean ± standard deviation across all folds.
| Network | Endpoint | Dice | IoU |
|---|---|---|---|
| 3D ResUNet | Parenchyma | 0.8290 ± 0.0254 | 0.7112 ± 0.0372 |
| 3D ResUNet | Cortex | 0.6925 ± 0.0321 | 0.5335 ± 0.0372 |
| 2D ResUNet | Parenchyma | 0.8721 ± 0.0292 | 0.7930 ± 0.0365 |
| 2D ResUNet | Cortex | 0.7942 ± 0.0248 | 0.6753 ± 0.0280 |

As shown in Table 4.1, the 2D ResUNet outperformed the 3D ResUNet in terms of both Dice and IoU. For parenchyma segmentation, the 2D model achieved a mean Dice of 0.8721 ± 0.0292 compared to 0.8290 ± 0.0254 for the 3D model. A similar trend was observed for cortex segmentation, where the 2D model reached a mean Dice of 0.7942 ± 0.0248, notably higher than the 3D model's 0.6925 ± 0.0321. The IoU scores followed the same pattern, further suggesting that the 2D ResUNet was more effective at both segmentation tasks.

Figure 4.2 illustrates a comparison of example segmentation outputs for the center slice of two patient scans. While the parenchyma segmentations are comparable for both 2D and 3D ResUNet, the cortex segmentation output clearly differs between the two architectures. In the parenchyma case, both models struggle to accurately delineate the outer kidney boundary. These challenges become more apparent in the cortex segmentation task, where the renal cortex's complex structure introduces more intricate boundary regions. The segmentation overlays demonstrate that the 2D ResUNet provides more accurate cortex delineation compared to the 3D ResUNet. In the 3D model overlays, a higher occurrence of false positives is observed, particularly in regions corresponding to the medullary pyramids. This leads to outputs that more closely resemble parenchyma segmentation, indicating a reduced performance in distinguishing cortical boundaries. These observations are consistent with the quantitative performance metrics shown in Table 4.1, which confirm the superior performance of the 2D model in segmenting the parenchyma and cortex.

Figure 4.2: Comparison of segmentation results for (a) renal parenchyma and (b) renal cortex using 2D ResUNet and 3D ResUNet models. The figures illustrate the original image, ground truth segmentation label, and overlay of image, label and model predictions, where true positives (TP) are represented in green, false positives (FP) in red, and false negatives (FN) in blue.

4.3 Multi-channel T1-MOLLI-based Segmentation

To explore the impact of multi-dimensional input data, each available TI image and the T1 map were first evaluated individually as single-channel inputs. This experiment aimed to determine the standalone contribution of each input to segmentation performance. Segmentation performance was assessed for both renal parenchyma and cortex using standard performance metrics on the validation and test sets. The results for each input channel are summarized in Tables 4.2 and 4.3.

Table 4.2: Parenchyma segmentation performance using single-channel input images of different inversion time (TI).

| Input | Val Dice | Val IoU | Test Dice | Test IoU |
|---|---|---|---|---|
| TI200 | 0.8148 | 0.7109 | 0.8310 | 0.7125 |
| TI800 | 0.8443 | 0.7676 | 0.7878 | 0.7050 |
| TI1400 | 0.8584 | 0.7955 | 0.8488 | 0.7717 |
| TI2000 | 0.8517 | 0.7773 | 0.7982 | 0.7037 |
| TI2500 | 0.8291 | 0.7513 | 0.8019 | 0.7068 |
| T1-map | 0.8911 | 0.8204 | 0.8520 | 0.7740 |

As can be seen in the results, the T1 map achieved the best overall performance among the single-channel inputs. For parenchyma segmentation, the T1 map yielded the highest Dice scores on both the validation and test sets (0.8911 and 0.8520, respectively). The best-performing individual TI was 1400 ms, which also demonstrated strong segmentation accuracy, with validation and test Dice scores of 0.8584 and 0.8488. A similar pattern can be observed for cortex segmentation, where the T1 map again led to the best performance on the validation set, achieving a Dice score of 0.8155 and IoU of 0.7013.
However, the best performance on the test set was obtained with TI1400, which reached a Dice score of 0.8196.

Table 4.3: Cortex segmentation performance using single-channel input images of different inversion time (TI).

| Input | Val Dice | Val IoU | Test Dice | Test IoU |
|---|---|---|---|---|
| TI200 | 0.7024 | 0.5508 | 0.6970 | 0.5498 |
| TI800 | 0.7491 | 0.6211 | 0.7432 | 0.6163 |
| TI1400 | 0.7868 | 0.6749 | 0.8196 | 0.7090 |
| TI2000 | 0.7710 | 0.6511 | 0.7588 | 0.6404 |
| TI2500 | 0.7543 | 0.6292 | 0.7382 | 0.6134 |
| T1-map | 0.8155 | 0.7013 | 0.7898 | 0.6787 |

Various combinations of inversion times were evaluated as multi-channel inputs to the 2D ResUNet. Starting with TI1400 and the T1 map as a base input, additional input channels were systematically added or removed to enhance segmentation performance. The results from the top five performing models for parenchyma and cortex segmentation are presented in Table 4.4 and Table 4.5, respectively.

Table 4.4: Comparison of the top five multi-channel input combinations for segmentation of the renal parenchyma.

| Input | Val Dice | Val IoU | Test Dice | Test IoU |
|---|---|---|---|---|
| TI1400 + T1map | 0.8799 | 0.8158 | 0.8691 | 0.8011 |
| TI1400, 2000 + T1map | 0.9018 | 0.8399 | 0.8914 | 0.8217 |
| TI200, 800, 1400, 2000 + T1map | 0.9055 | 0.8454 | 0.8959 | 0.8275 |
| TI800, 1400, 2000, 2500 + T1map | 0.9139 | 0.8504 | 0.8735 | 0.8027 |
| All input channels | 0.9085 | 0.8470 | 0.8410 | 0.7650 |

Increasing the number of input channels generally improved model performance compared to the two-channel base input. One of the best-performing multi-channel input configurations was the combination that excluded the 2500 ms inversion time (TI200, 800, 1400, 2000 + T1map). The model trained on this input combination achieved a validation Dice score of 0.9055 and a test Dice score of 0.8959 for parenchyma segmentation, along with a validation IoU of 0.8454 and the highest test IoU (0.8275).

The model trained with all available input channels showed high validation performance but notably lower test performance.
In contrast, models trained with carefully selected input combinations achieved better test results, as indicated by improved evaluation metrics. A similar trend was observed in another top-performing model where the 200 ms inversion time was excluded. This model achieved the highest validation Dice score of 0.9139 and a validation IoU of 0.8504. However, its performance on test data was lower compared to other models.

Table 4.5: Comparison of the top five multi-channel input combinations for segmentation of the renal cortex.

| Input | Val Dice | Val IoU | Test Dice | Test IoU |
|---|---|---|---|---|
| TI1400 + T1map | 0.7992 | 0.6219 | 0.8212 | 0.7110 |
| TI1400, 2000 + T1map | 0.8253 | 0.7150 | 0.8182 | 0.7123 |
| TI800, 1400, 2000 + T1map | 0.8280 | 0.7219 | 0.8223 | 0.7106 |
| TI200, 800, 1400, 2000 + T1map | 0.8242 | 0.7166 | 0.8392 | 0.7396 |
| All input channels | 0.8267 | 0.7196 | 0.8272 | 0.7184 |

For cortex segmentation, the multi-channel experiments showed that combining the inversion times that individually yielded the highest segmentation performance improved accuracy. Similar to segmentation of the parenchyma, including all channels led to one of the top-performing models, but with slightly reduced performance on the test data. Notably, the model that included all inversion times except the 2500 ms image demonstrated consistently high performance across both validation and test datasets for cortex segmentation as well. According to Table 4.5, this input configuration achieved a validation Dice score of 0.8242 and IoU of 0.7166, as well as a test Dice score of 0.8392 and IoU of 0.7396. These results indicate a good balance between model accuracy and generalization to unseen data.

4.3.1 Performance of Fine-tuned T1-MOLLI Model

Fine-tuning of the 2D ResUNet architecture was done with the goal of improving accuracy and robustness of the model. A multi-channel input of five channels representing different inversion times and the parametric T1 map, previously identified as optimal for both anatomical regions, was used as input to the network.
Because this input combination proved to be optimal for both parenchyma and cortex, a multi-label approach was also explored, in which a single model was trained to segment both ROIs simultaneously. Its performance was compared against single-label models that segment parenchyma and cortex separately, as summarized in Table 4.6. To enhance generalization, additional data augmentation was applied, along with dropout (p = 0.1) for regularization. Both model configurations were then trained to convergence using the Adam optimizer with a learning rate of $10^{-4}$.

Table 4.6: Comparison of the performance of single-label and multi-label segmentation models trained on multi-channel T1-MOLLI input.

Network                       Val Dice   Val IoU   Test Dice   Test IoU
Single-label (2D ResUNet)
  Parenchyma                  0.9130     0.8534    0.8962      0.8254
  Cortex                      0.8352     0.7260    0.8466      0.7428
Multi-label (2D ResUNet)
  Parenchyma                  0.9298     0.8737    0.9089      0.8453
  Cortex                      0.8614     0.7598    0.8552      0.7557

As shown in Table 4.6, the multi-label network outperformed the single-label models trained to segment each region separately. For instance, the multi-label model achieved Dice scores of 0.9298 and 0.8614 for parenchyma and cortex, respectively, on the validation set, compared to 0.9130 and 0.8352 for the separate segmentation models. Notably, the multi-label model also demonstrated strong generalization to unseen test data, with Dice scores of 0.9089 (parenchyma) and 0.8552 (cortex). The Dice loss and segmentation performance during training of the multi-label network are shown in Figure 4.3.

Figure 4.3: Training and validation performance of the proposed multi-label 2D ResUNet, showing the progress of the Dice loss and evaluation metrics during training (Adam optimizer, learning rate of $10^{-4}$). Dice score and IoU are reported separately for the parenchyma and cortex, as well as averaged across both regions.

Figure 4.4 shows example segmentations of the renal parenchyma using the proposed multi-channel model.
Qualitatively, the model demonstrates a high degree of accuracy in delineating the whole parenchymal region. Despite false detections along the k