Generate Realistic Image Segmentation Pair Using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
Master's thesis in Biomedical Engineering
WENYIN ZHOU
DEPARTMENT OF ELECTRICAL ENGINEERING
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2023
www.chalmers.se

Master's thesis 2023
Generate Realistic Image and Segmentation Pair using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
WENYIN ZHOU
A thesis for the degree of Master of Science in Biomedical Engineering
Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Computer Vision and Medical Image Analysis Group
Chalmers University of Technology
Gothenburg, Sweden 2023

Generate Realistic Image Segmentation Pair using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
Wenyin Zhou
© Wenyin Zhou, 2023.
Supervisor: Roman Naeem, Department of Electrical Engineering
Examiner: Fredrik Kahl, Department of Electrical Engineering
Master's Thesis 2023
Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Computer Vision and Medical Image Analysis Group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Generate Realistic Image Segmentation Pair using Diffusion Probabilistic Model.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Declaration
I affirm that I am the sole author of this thesis and that it has not been previously submitted, either in its entirety or in part, for any other degree application.

Abstract
Diffusion Models (DMs) [22, 54, 55] are newly emerged deep generative models which produce data via progressive denoising and have achieved state-of-the-art results for image synthesis. Medical image segmentation, a dense prediction task that assigns a pixel-wise classification to medical images, severely lacks data to achieve great performance in the context of deep learning. In this thesis, we seek to improve the segmentation model using generative modeling, particularly the DMs. We systematically review the DMs from the perspective of variational inference, together with several key neural network architectures for both the DMs and biomedical segmentation. Building upon the classical DMs, we propose a training scheme of joint modeling for paired data generation and conditional sampling. To that end, the Joint Diffusion Model (JDM) learns the mutual dependence among multimodal data and is orthogonal to various accelerated sampling methods and inverse problem solvers using the DMs. The qualitative results present the great visual quality and diversity of the generated samples, showcasing the capability of the JDM in modeling the joint distribution even with a limited number of data samples. The quantitative results suggest that the appropriate use of synthetic data can improve the discriminative model in recognizing fine-grained image features, leading to a performance increase for biomedical segmentation. More specifically, we first pretrain the model on synthetic data using supervised learning and then finetune the model on the real data. We ablate the number of synthetic samples used.
The experiments are conducted on five publicly available polyp segmentation datasets and one nuclei segmentation dataset. They show an average increase of +2.5% in mean Dice score on polyp segmentation and set new state-of-the-art results on MoNuSeg [31] using nnU-Net [25].
Keywords: generative model, diffusion model, medical image segmentation, data augmentation.

Acknowledgements
First and foremost, I want to thank my parents for their eternal love and constant support for my cause. I want to thank my supervisor Roman Naeem for his great guidance and for introducing me to this interesting topic. This work has been a long journey, starting from our discussions in October 2022, and would have been impossible without his groundwork, patience and devotion. I want to thank David Hagerman Olzon, Professor Lennart Svensson, and Professor Fredrik Kahl for the useful discussions.
Wenyin Zhou, Gothenburg, July 2023

List of Acronyms
Below is the list of acronyms used throughout this thesis, in alphabetical order:
CDM Conditional Diffusion Model
CNN Convolutional Neural Network
DMs Diffusion Models
DDPM Denoising Diffusion Probabilistic Model
FID Fréchet Inception Distance
GANs Generative Adversarial Networks
JDM Joint Diffusion Model
mDice Mean Dice Coefficient
mIoU Mean Intersection over Union
KL Kullback–Leibler
VAEs Variational Autoencoders

Contents
List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Main Contribution
1.2 Outline
2 Background
2.1 Diffusion Models
2.1.1 Generative Modeling and Diffusion Process
2.1.2 Training Objectives
2.1.3 Simplified Training Objective
2.1.4 Learned Variance
2.1.5 Reverse Sampling
2.1.6 Diffusion Network
2.2 Posterior Sampling using Diffusion Models
2.2.1 Conditional Sampling
2.2.2 Conditional Model
2.3 Diffusion Model for Zero-shot Data Editing
2.4 Medical Segmentation Network
2.4.1 U-Net
2.4.2 nnU-Net
3 Joint Diffusion Model for Paired Data Synthesis
3.1 Joint Diffusion Model
3.2 Conditional Sampling
4 Experiments
4.1 Experimental Setups
4.2 Datasets
4.3 Hyperparameters
4.3.1 Joint Diffusion Model (JDM)
4.4 Metrics
4.5 Baselines
5 Results and Discussion
5.1 Visual Comparison
5.1.1 Pair Data Synthesis
5.1.2 Mask Conditioned Synthesis
5.1.3 Discussion
5.2 Medical Segmentation on Synthetic Data
5.2.1 Medical Segmentation on Polyps
5.2.2 Medical Segmentation on Nuclei
5.2.3 Discussion
5.3 Transfer Learning on Synthetic Data
5.3.1 Transfer Learning on Polyps
5.3.2 Transfer Learning on Nuclei
5.3.3 Discussion
6 Conclusion and Limitations
6.0.1 Limitations and Future Work
Bibliography
A Appendix 1: Properties of the Diffusion Model
A.1 Proofs of Equations (2.4) and (2.5)
A.2 A Proof of Equation (2.6)
A.3 A Proof of Equation (2.9)
A.4 A Proof of Equation (2.10)
A.5 Variational Bound
B Appendix 2: More Generated Samples

List of Figures
1.1 Medical Image Segmentation with Swin UNETR [19] on BRATS2021 [2]. The segmentation network produces segmentation masks for the Enhancing Tumor, Whole Tumor and Tumor Core. From left to right: T1, T1c, T2 and FLAIR MRI axial slice. Image from [19].
2.1 The directed graphical model of DDPM. The latent variables along the sampling path share the same dimension and are hypothesized to be Markovian. Noise is gradually added to the data in the forward diffusion process, while the reverse process recovers the lost semantic information and pushes one step closer to the clean data with each network evaluation. Demo image from [22].
2.2 The residual block used in the diffusion models.
2.3 The overall network design of the diffusion network. RB: residual block, Attn: efficient self-attention block. The residual block is the basic unit for image downsampling and upsampling, with skip connections introduced. See text for more description. The efficient self-attention block is applied at the deepest level and improves feature extraction via a QK product with linear complexity. We refer the reader to [48] for more details about efficient attention.
2.4 SegDiff [1]: a diffusion-based segmentation method. Two separate encoders (G & F) are employed for denoising the noisy segmentation mask and extracting features from the image. The encodings are integrated by direct sum in the latent space.
2.5 Qualitative results on ImageNet 512 × 512 for thin masks from RePaint [35].
2.6 The U-Net architecture: a powerful deep learning framework for semantic segmentation tasks in medical imaging. It comprises a U-shaped network structure with both contracting and expanding paths. The contracting path consists of convolutional and pooling layers to capture context, while the expanding path uses transposed convolutions for precise localization. Skip connections facilitate the integration of multi-scale features, enabling different sizes of receptive fields.
2.7 nnU-Net's interdependence among the hyperparameter setting, data pre-/post-processing, and model selection.
3.1 The gradual denoising procedure of the Joint Diffusion Model (JDM). A paired sample of a medical image and segmentation mask is produced with a single reverse run.
5.1 Randomly generated samples of nuclei and polyp images and their associated segmentation masks. All images and masks are resized to 256 for display.
5.2 Randomly generated samples and the closest training data in terms of mutual information of the segmentation mask. From left to right: synthetic image, associated synthetic segmentation mask, closest real image, closest real segmentation mask.
5.3 Synthetic medical images guided by the training mask using the joint diffusion model. The significant difference in contrast to the real image demonstrates the sample diversity and model distribution.
5.4 mDice vs. the number of synthetic data. The dashed line denotes the result of training on real data. From left to right and top to bottom: Kvasir, Clinic-DB, Colon-DB, CVC-300, ETIS.
5.5 mDice vs. the number of synthetic data on MoNuSeg. The dashed line denotes the result of training on real data.
5.6 Qualitative results. The red box highlights the improvement of the segmentation model pretrained on synthetic data. The results are taken from the best case on ETIS [49] and MoNuSeg [31].
A.1 Variational bound vs. diffusion steps. Image from [41].
B.1 Kvasir generated samples.
B.2 GlaS generated samples.
B.3 MoNuSeg generated samples.
B.4 MoNuSeg generated samples.
B.5 MoNuSeg generated samples.
B.6 Kvasir mask-conditioned generated samples. From top to bottom: real image, real segmentation mask, four generated samples.
B.7 MoNuSeg mask-conditioned generated samples. From top to bottom: real image, real segmentation mask, four generated samples.

List of Tables
4.1 The hyperparameter setting of the JDM. (*) The number of residual blocks is set to 2 and 3 for Kvasir and MoNuSeg, respectively.
5.1 Results on validation sets of polyp segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.2 Results on generalization sets of polyp segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference compared to the result on real data.
5.3 Results on MoNuSeg for nuclei segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.4 Results on the validation set with pretraining on different numbers of synthetic data. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.5 Results on the generalization set with pretraining on different numbers of synthetic data. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.6 Results on MoNuSeg [31]. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

1 Introduction
Medical image segmentation is a discipline that partitions a medical image into multiple non-overlapping segments, each of which represents a specific anatomical structure or tissue type (See Fig. 1.1). This process plays an important role in modern healthcare and is embedded into many clinical decision workflows, such as diagnosis, treatment, image-guided surgery, etc. Automated segmentation aims to provide a segmentation solution for medical images by designing computer or machine learning algorithms¹ without significant manual effort. Compared with manual segmentation, the automated counterpart has the advantage of prediction speed, which may alleviate the heavy burden on medical experts and improve diagnostic efficiency.
With the availability of large amounts of data, advanced network architectures, the increasing computing power of graphics processing units, and many other important factors, deep learning [17] has become the de facto choice of machine learning methods in the last decade. In deep learning, neural networks are designed with multiple layers of interconnected nodes, known as artificial neurons or units. Each layer processes and transforms the input data and passes it on to the next layer, gradually extracting higher-level features and representations. The weights of these interconnected nodes, usually referred to as parameters, are updated through stochastic gradient descent. It has been demonstrated that neural networks such as convolutional neural networks² (CNNs) can model complex representations for image classification [20], object detection [6], semantic segmentation [7], etc.
¹In this thesis, we refer to machine learning algorithms as mathematical models or computer procedures that learn the pattern of the input data and make predictions without explicit programming [36].
²In this work, a CNN refers to a type of deep learning neural network designed for visual data analysis, usually utilizing convolutional layers to extract features and pooling layers to downsample the data.
Figure 1.1: Medical Image Segmentation with Swin UNETR [19] on BRATS2021 [2]. The segmentation network produces segmentation masks for the Enhancing Tumor, Whole Tumor and Tumor Core. From left to right: T1, T1c, T2 and FLAIR MRI axial slice. Image from [19].
As for automated medical segmentation, most developed methods rely on supervised deep learning, in which both the data point and the label are accessible during model training. Empirically, deep learning based approaches heavily demand a huge amount of heterogeneous data to obtain generalizable results. However, in contrast to natural images, sizable medical imaging datasets are more difficult to collect, owing to the specialized medical equipment, the required devotion of medical experts, strict regulation of patient data privacy, and many other reasons preventing active data sharing across multiple institutes.
In this thesis, we consider approaching this issue from the perspective of data augmentation using generative modeling. Data augmentation is a common tactic to improve the generalization ability of a discriminative model. In practice, it can be accomplished by simple transformations (e.g. rotations or random flips) or by synthetic images from a generative model. While simple transformations can be effective in most scenarios, they are limited in generating diverse and realistic samples because they do not model the data distribution. On the other hand, based on the observations from the training set, a generative model can be constructed to directly approximate either the underlying data distribution p_data or the data generation process, from which novel data points can be produced. Two well-known representatives are likelihood-based approaches [16, 59, 13, 39] and generative adversarial networks (GANs) [18]. GANs support efficient sampling and high sample quality with carefully designed architectures, but training GANs can be unstable and sometimes results in training failure. This can largely be attributed to the imbalance between the generator and the discriminator. For example, if the discriminator is much better than the generator, the generator might not make progress in fooling the discriminator. Besides, mode coverage issues and the demand for a multitude of data are also major drawbacks of GANs, preventing them from broader application. As suggested by [51], GANs' effectiveness as a source of medical imaging data was found to not always be reliable, even if the generated images are nearly indistinguishable from real data. Models that have a tractable likelihood, such as autoregressive models [16, 59] or flow-based models [13, 39], give competitive likelihoods, potentially leading to great diversity. However, these likelihood-based approaches have limitations stemming from certain assumptions about the model distribution. For instance, autoregressive models assume a factorization into multiple conditional distributions, and flow-based models interpret the generation process as an invertible transformation from a base distribution to the data density. Other cases, such as Variational Autoencoders (VAEs) [28], which only optimize a bound on the likelihood, assume the data can be projected to a tractable distribution in the latent space, and the model is trained for projection and recovery.
These imposed restrictions on the model distribution can be unnatural and limit the capability of neural networks, which might partially explain why these methods have not established the same sample quality as GANs.
Recently, a newly emerged class of generative models called diffusion models (DMs) or score-based generative models [52, 22, 54, 55] has attracted much interest in image synthesis and inverse problem solving. The images produced by the DMs achieve great sample quality and diversity on several benchmarks, such as CIFAR-10 [29] and Celeb-A [33]. Owing to their substantial generation capability, the DMs may potentially serve as an efficacious tool for medical images, a domain severely suffering from data sparsity problems. They may also promote active data sharing across institutes and improve model performance on downstream tasks. Furthermore, since the DMs also support exact likelihood computation, they may help to enhance explainability in the context of clinical practice.
In this thesis, we study the most prevailing instance of the DMs, the denoising diffusion probabilistic model (DDPM) [22], as an out-of-the-box tool for 2D biomedical segmentation. Applying DDPM to medical images, one can generate novel data samples that might fill in the gaps of the training data manifold.

1.1 Main Contribution
The main contribution of this thesis is to give a thorough review of DMs for practitioners to apply this technique to medical problems. The technical novelty is summarized as follows:
1. We propose a new training scheme of joint modeling for the DMs with paired data, as opposed to conditional modeling. We prove that joint modeling implicitly supports conditional sampling and is orthogonal to various inverse problem solvers.
2. We show that applying the Joint Diffusion Model (JDM) to 2D biomedical segmentation is capable of generating an unlimited number of meaningful paired samples. The generated samples deviate significantly from the training data.
3. Empirical results on supervised learning and transfer learning suggest that the generated samples are useful for improving the segmentation model even if the training set is severely sparse.

1.2 Outline
In Chapter 2, we review the background of this thesis, encompassing the DMs and the medical segmentation network. In Chapter 3, we elaborate on how to construct a joint diffusion probabilistic model to generate paired data. This chapter also covers conditional sampling through joint modeling. In Chapter 4, we present the experimental setups of this work, including the working environment, hyperparameter setting, performance metrics, and compared baselines. In Chapter 5, we visually inspect the generated samples and investigate the effect of data augmentation using synthetic data. This encompasses evaluating the segmentation model trained purely on synthetic data, and pretrained on synthetic data and then finetuned on the real data.

2 Background

2.1 Diffusion Models
Diffusion Models (DMs) are newly emerged deep generative models [22, 52, 54, 55] which produce samples by reversing a data corruption process via iterative refinement. DMs have seen wide application in image [54, 55, 22, 45, 41, 14], video [32, 8], music [37], and molecular synthesis [65], and in various image-to-image translation tasks such as super-resolution [46], inpainting [55, 35, 10], etc.
In the forward diffusion process, noise is gradually injected into the data, transforming the complex data distribution into a tractable prior distribution. Once an adequate amount of noise has been added, the perturbed data can be equivalently considered a sample from a prior distribution (e.g. a Gaussian distribution) that is easy to sample from. For reverse sampling, one can train a deep neural network¹ [17] that progressively eliminates noise from the Gaussian sample to produce data. Since the DMs can be further generalized to be continuous [55], there are extensive design options available for the DMs. This work primarily concentrates on the DDPM [22], a particular discretization of the DMs, building upon the advancements introduced in recent work on improved DDPM [41]. The objective is to enhance the mode coverage and stationary training of the DDPM framework. The following subsections introduce the DMs in detail. The derivations and proofs of this section are presented in a manner similar to that in [22].
¹In this work, we consider a deep neural network as a type of computer model inspired by the human brain that learns to recognize patterns in data by building multiple layers of interconnected artificial neurons, allowing it to understand complex information and make accurate predictions.

2.1.1 Generative Modeling and Diffusion Process
Generative modeling is a machine learning problem with many applications and is usually accomplished by density estimation, a statistical technique used to estimate the underlying probability distribution of a set of data points. It enables us to gain insights into the data's underlying structure, identify patterns, and make informed decisions in various fields, such as machine learning, signal processing, and data analysis. However, it is commonly assumed that the data is governed by a complex data distribution, which might be difficult to approximate directly. To circumvent this, one can leverage a tractable prior p(z) (e.g. a Gaussian distribution) and move samples from the prior distribution to the target data density. In this sense, the data density is approximated by a posterior distribution given the prior: p_data(x) ≈ p_model(x)² = p(x|z). Here, we denote q(z|x) and p(x|z) as the forward and backward mappings. Fortunately, the backward mapping can be modeled using a deep neural network by minimizing the Kullback–Leibler (KL) divergence between the posteriors of the forward and backward mappings, D_KL(q(x|z) ∥ p(x|z)). This has led to a series of works posing different assumptions on the tractable prior and the model distribution. For example, VAEs [28] assume the tractable prior to be a low-dimensional Gaussian, and the forward and backward mappings are achieved by an autoencoder. In this context, DMs can be viewed as an instance where the prior is a standard Gaussian distribution sharing the same dimension as the data distribution and the forward mapping (the diffusion process) is a gradual noise injection process (See Fig. 2.1).
Specifically, the DDPM consists of a chain of latent variables with the Markov property and can be decomposed into the product of a series of conditional distributions of the same dimension (See Fig. 2.1). The forward joint distribution or forward mapping can be described as follows:

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}).    (2.1)

where $\{x_t\}_{t=1...T}$ are the intermediate latent variables along the forward diffusion process.
²In this work, we refer to the model distribution p_model(x) as the approximation to the data density p_data(x).
Figure 2.1: The directed graphical model of DDPM. The latent variables along the sampling path share the same dimension and are hypothesized to be Markovian. Noise is gradually added to the data in the forward diffusion process, while the reverse process recovers the lost semantic information and pushes one step closer to the clean data with each network evaluation. Demo image from [22].

The latent variables x_t depend only on x_{t-1} and are perturbed by a pre-defined noise injection process:

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z, \quad z \sim \mathcal{N}(0, I).    (2.2)
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I).    (2.3)

where $\{\beta_t\}_{t=1...T}$ is the variance schedule controlling the noise level.
Theoretically, $\{\beta_t\}_{t=1...T}$ can be designed in many ways, tailored to the speed of forward data corruption. In this work, we only examine the linear schedule proposed in [22], in accordance with most existing work, where $\beta_t = \frac{T-t}{T-1}\beta_1 + \frac{t-1}{T-1}\beta_T$ with T set to 1000, so that $\beta_t$ is linearly interpolated between $\beta_1$ and $\beta_T$. It should be noted that the mean of the forward diffusion q(x_t | x_{t-1}) is scaled by $\sqrt{1-\beta_t}$ to bound the variance, so that x_T approximates a standard Gaussian as T goes to infinity. From Eq. (2.2), one can show that any noisy intermediate variable can be obtained via direct sampling (See Section A.1):

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, z, \quad z \sim \mathcal{N}(0, I).    (2.4)
q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I).    (2.5)

where $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$.
Again, as shown by Eq. (2.5), as $\bar{\alpha}_t$ approaches 0 with large T, the Gaussian perturbation kernel progressively transforms the data into a standard Gaussian distribution denoted by x_T. Interestingly, the posterior of the forward diffusion q(x_{t-1} | x_t, x_0) is tractable under this circumstance and can be computed using the Bayes rule (See Section A.2):

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I).
\tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t.
\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t.    (2.6)

where $\tilde{\mu}_t, \tilde{\beta}_t$ are the posterior mean and variance of the forward diffusion process.
From Eq. (2.6), it is noted that the posterior of the forward diffusion q(x_{t-1} | x_t, x_0) shares the same analytical form of a Gaussian distribution. Therefore, it is natural to assume the intermediate variables of the reverse process to also be Gaussian, with a tiny amount of noise removed at each timestep. Practically, the reverse process is approximated by a deep neural network $p_\theta(x_{t-1} \mid x_t)$:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).    (2.7)

where $\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)$ denote the mean and variance of the denoised distribution of the reverse process, and $\theta$ represents the model parameters.
Heuristically, a neural network is trained to denoise images blurred with Gaussian noise at multiple noise levels, and thus the network captures both coarse and fine-grained image features. The reverse mean $\mu_\theta(x_t, t)$ is parameterized as the model output, while the reverse variance $\Sigma_\theta(x_t, t)$ can either be learned or approximated by the posterior variance of the forward process $\tilde{\beta}_t$ or the noise schedule $\beta_t$. In the same spirit as the forward diffusion, the joint distribution of the reverse process can also be depicted as a series of transitions starting from $x_T \sim \mathcal{N}(x_T; 0, I)$:

p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).    (2.8)
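To make Eqs. (2.2)-(2.5) concrete, the following is a minimal PyTorch sketch (not the thesis code) of the linear schedule and the one-step sampling of x_t from x_0 in Eq. (2.4); the endpoint values β₁ = 10⁻⁴ and β_T = 0.02 follow [22], and the tensor shapes are illustrative.

import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)       # linear schedule from beta_1 to beta_T [22]
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    # Sample x_t ~ q(x_t | x_0) in a single step via Eq. (2.4).
    z = torch.randn_like(x0)                         # z ~ N(0, I)
    ab = alpha_bar[t].view(-1, 1, 1, 1)              # broadcast over (B, C, H, W)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z
    return xt, z                                     # z is kept as the regression target

# Usage: corrupt a batch of images to random timesteps.
x0 = torch.randn(4, 3, 64, 64)                       # stand-in for training images
t = torch.randint(0, T, (4,))
xt, noise = q_sample(x0, t)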
2.1.2 Training Objectives
DDPM is trained to optimize a variational bound on the negative log-likelihood of the data distribution (See Section A.3):

\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\left[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] = \mathbb{E}_q\left[ -\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} \right] := L_{vlb}.    (2.9)

The upper bound $L_{vlb}$ in Eq. (2.9) can be further reduced to tractable KL divergences [30] (See Section A.4 for the complete derivation):

\mathbb{E}_q\left[ D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t > 1} D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right].    (2.10)

where $D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T))$ is denoted as $L_T$, $D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))$ as $L_{t-1}$, and $-\log p_\theta(x_0 \mid x_1)$ as $L_0$.
$L_T$ is free from the model parameters and can be neglected, as the distribution q(x_T | x_0) converges to the Gaussian p(x_T) when T is large enough. Optimizing the term $L_{t-1}$ pulls the denoised distribution toward the ideal posterior of the forward diffusion. In this sense, $L_0$ can also be seen as a special case of $L_{t-1}$, as shown in Eq. (2.11), where q(x_0) is known and does not depend on the model.

D_{KL}(q(x_0) \,\|\, p_\theta(x_0 \mid x_1)) = \int q(x_0) \log \frac{q(x_0)}{p_\theta(x_0 \mid x_1)} \, dx_0.    (2.11)

2.1.3 Simplified Training Objective
This section mainly discusses the reparameterization of the reverse diffusion. Note that Eq. (2.4) provides an efficient way of directly sampling noisy data at arbitrary steps. Consequently, training can be performed by uniformly sampling t from {1, ..., T} and corrupting the data to x_t to compute $L_{vlb}$. To model the denoised distribution $p_\theta(x_{t-1} \mid x_t)$, there are many ways to parameterize the model output. The most apparent choice is to estimate $\mu_\theta(x_t, t)$ directly, or one can predict the noise ϵ and leverage the reparameterization to obtain x_{t-1}. In [22], Ho et al. compare ϵ-prediction and µ-prediction and argue that both achieve relatively equal sample quality, while predicting the noise ϵ allows for a simplified objective and resembles denoising score matching [54], bridging a theoretical link to the score-based models [54, 55]. Additionally, one can also predict the clean image x_0, but this was found to lead to degraded sample quality in [22]. Therefore, to simplify the overall design choice, we follow [22, 41] and train a noise estimation network $\epsilon_\theta(x_t, t)$. Nonetheless, recent work [3] proposes to train a dual-output network to predict both ϵ and x_0, and their linear combination improves the generation quality, especially when a limited number of sampling steps is considered.
Below, we show how the training objective term $L_{t-1}$ in Eq. (2.10) boils down to the final simplified objective. We first derive the objective for the mean of the reverse process, supposing a fixed variance schedule, and then deal with the learned variance in Section 2.1.4. Note that the gradients are detached from each other when both the mean and variance of the reverse process are estimated by the network. Based on the result of the KL divergence between two Gaussians, $L_{t-1}$ can be written as:

L_{t-1} = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \|^2 \right] + C.    (2.12)

where C is a constant with respect to θ, $\sigma_t^2$ is the variance of the reverse process, $\tilde{\mu}_t$ is the posterior mean of the forward process, and $\mu_\theta$ is the mean of the reverse process.
From Eq. (2.12), we can again see that the most obvious option is to predict the mean of the reverse process $\mu_\theta$. Yet, the parameterization $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t(x_0, \epsilon) - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$ from Eq. (2.2) and Eq. (2.6) can be leveraged to simplify further:

L_{t-1} - C = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon \right) - \mu_\theta(x_t, t) \right\|^2 \right].    (2.13)
Eq. (2.13) reveals that $\tilde{\mu}_t$ is accessible given x_t. With x_t set to be the network input, a similar reparameterization of the reverse mean can be considered:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right).    (2.14)

where the network output $\epsilon_\theta(x_t, t)$ is the approximator of the sampled noise. Then we plug Eq. (2.14) into Eq. (2.13) and arrive at the true variational bound:

L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right].    (2.15)

Practically, Ho et al. [22] find that simply dropping the coefficient in Eq. (2.15) is beneficial to the sample quality and stabilizes training when predicting the noise term $\epsilon_\theta(x_t, t)$, while predicting $\mu_\theta(x_t, t)$ only works when training with the true variational bound (2.15). The simplified objective $L_{simple}$ is as follows:

L_{simple} := \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right].    (2.16)

The simplified objective in Eq. (2.16) can be perceived as a reweighted variational bound that re-emphasizes the training of different noise levels. With small t being down-weighted, the denoising quality at intermediate and large t can be improved significantly. Besides, as small t mainly concentrates on recovering imperceptible details, the overall sample quality is not severely deteriorated while the training period is shortened. Nevertheless, as the beginning steps (small t) also contribute massively to the true variational bound, the log-likelihood can be negatively impacted, potentially causing mode coverage issues [41] (See Section A.5).

2.1.4 Learned Variance
The reverse variance is crucial for density estimation to achieve a competitive log-likelihood, since both the reverse mean and variance determine the denoised distribution $p_\theta(x_{t-1} \mid x_t)$. It is generally believed that the likelihood performance relates to how well the model captures the modes of the data distribution [43]. In other words, sample diversity is greatly tied to the performance of the likelihood. Ho et al. [22] adopt a fixed variance schedule, where the same variance used to inject the noise in the forward diffusion serves as the reverse variance. As mentioned earlier, bad mode coverage might be caused by the fixed variance and the simplified objective, and thereby we may improve the log-likelihood by learning $\Sigma_\theta(x_t, t)$. Empirically, the possible range of $\Sigma_\theta(x_t, t)$ is very small, since only a tiny amount of noise is removed at each iteration, and thus predicting $\Sigma_\theta(x_t, t)$ directly with the network might be suboptimal even in the log domain. To solve this, in [41], a linear interpolation is done between $\beta_t$ and $\tilde{\beta}_t$, with the network only deciding the weighting term v. The computation is performed in the log domain so that there is no constraint on the network output for v:

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1 - v) \log \tilde{\beta}_t).    (2.17)

Here we kindly remind the reader that $\beta_t$ is the linear variance schedule (See Section 2.1.1) and $\tilde{\beta}_t$ is the variance of the posterior of the forward mapping in Eq. (2.6). Then we can define a new training objective combining $L_{simple}$ and $L_{vlb}$. The former only updates the reverse mean, while the latter contributes to the true variational bound for likelihood improvement:

L_{hybrid} = L_{simple} + \lambda L_{vlb}.    (2.18)

where λ weights the multiple losses. Following [41], λ is set to 0.001 and the gradient of $L_{vlb}$ is detached from the output of $\mu_\theta(x_t, t)$ to prevent $L_{vlb}$ from overwhelming $L_{simple}$; thus, sample diversity is traded off with sample quality.
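As a companion to Eqs. (2.16)-(2.18), below is a minimal sketch of one training step under L_simple. Here `model` stands for any ϵ-prediction network of the kind described in Section 2.1.6, and `alpha_bar` is the cumulative schedule sketched after Eq. (2.8); both names are ours. The hybrid objective of Eq. (2.18) would add λ·L_vlb on top, with the gradient to the mean detached as in [41].

import torch
import torch.nn.functional as F

def simple_loss_step(model, x0, alpha_bar, optimizer):
    # One optimization step of L_simple (Eq. 2.16).
    alpha_bar = alpha_bar.to(x0.device)
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],), device=x0.device)  # uniform t
    z = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z     # forward corruption, Eq. (2.4)
    loss = F.mse_loss(model(xt, t), z)              # ||eps - eps_theta||^2, coefficient dropped
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()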
2.1.5 Reverse Sampling
After training a noise estimation network, samples can be generated by iterative denoising through $p_\theta(x_{t-1} \mid x_t)$, from T to 0, where $x_T \sim \mathcal{N}(x_T; 0, I)$. Specifically, the sampling formula via the parameterization can be written as follows:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right) + \sigma_t z.    (2.19)

where $\sigma_t$ corresponds to the standard deviation of the reverse variance $\Sigma_\theta(x_t, t)$, and z is the sampled noise.

2.1.6 Diffusion Network
The primary obstacle in the diffusion model lies in developing a robust parameter-sharing network capable of effectively denoising images with varying levels of noise. One commonly employed network in the diffusion model is the U-Net [44] architecture, originally designed for image segmentation, owing to the similarity of the image-understanding tasks and the identical spatial resolution of the network's input and output. It has been emphasized that without sophisticated architectures (e.g. U-Net), the performance of the diffusion model may deteriorate considerably [41, 55]. Considering the importance of the diffusion network, in this section we review the Res-UNet architecture proposed in [41]. Since the U-Net [44] is the most commonly used framework for medical image segmentation, we refer the reader to Section 2.4.1 for the U-Net [44].
Res-UNet modifies the U-Net architecture by adding a skip connection to each convolutional block (See Fig. 2.2a), which is commonly helpful for feature aggregation and gradient flow. First, the fusion of feature maps at multiple layers enhances the learned representation and mitigates gradient issues. Second, skip connections also allow multiple paths for gradient propagation, which might accelerate convergence and enhance training stability.

(a) A residual block; the block input x is fused with the block output F(x) via direct sum. Image from [20]. (b) The residual block with time encoding t (skip connection omitted).
Figure 2.2: The residual block used in the diffusion models.

The exact residual block design follows PixelCNN++ [47], consisting of Group Normalization [63], a 3 × 3 convolutional block, and the GeLU activation function [21]. The overall network is visible in Fig. 2.3, taking both x_t and t as input. Each timestep is projected into a unique sinusoidal encoding, passed as Fourier features for aggregation:

\sin(t) := [\sin(2w_1 t), \sin(2w_2 t), \dots, \sin(2w_d t), \cos(2w_1 t), \cos(2w_2 t), \dots, \cos(2w_d t)].    (2.20)

where $w_1, w_2, \dots, w_d$ are pre-defined frequencies free from the parameters, and d is the dimension of the projected vector. Fig. 2.2b shows that the sinusoidal time encoding t goes through a linear time module to adjust the channels for the feature maps of varying depths. Eventually, the time encoding is fused with the feature maps between the first and second conv blocks of each residual block via direct sum.

Figure 2.3: The overall network design of the diffusion network. RB: residual block, Attn: efficient self-attention block. The residual block is the basic unit for image downsampling and upsampling, with skip connections introduced. See text for more description. The efficient self-attention block is applied at the deepest level and improves feature extraction via a QK product with linear complexity. We refer the reader to [48] for more details about efficient attention.
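A minimal sketch of the sinusoidal time encoding of Eq. (2.20). The geometric spacing of the frequencies follows the common Transformer-style convention and is an assumption on our part, since the text only states that the frequencies are pre-defined.

import math
import torch

def timestep_embedding(t, dim, max_period=10000.0):
    # Map integer timesteps t of shape (B,) to (B, dim) Fourier features:
    # [sin(w_1 t), ..., sin(w_d t), cos(w_1 t), ..., cos(w_d t)] as in Eq. (2.20).
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]      # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]), dim=128)
# A per-depth linear layer then adjusts the channels before the direct sum.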
2.2 Posterior Sampling using Diffusion Models
As the impressiveness of DDPM for image generation has been revealed, a natural next step is to investigate its performance for controllable generation. In this section, we briefly review two lines of work on posterior sampling p(y | x) using DMs.

2.2.1 Conditional Sampling
Using the Bayes rule, one can easily show that the posterior is proportional to the product of the prior and the forward model:

p(y_{t-1} \mid y_t, x) \propto p(y_{t-1} \mid y_t)\, p(x \mid y_t).    (2.21)

In the context of the DMs, as in Eq. (2.21), in order to sample from the posterior p(y_{t-1} | y_t, x), conditional sampling concentrates on leveraging a pre-trained diffusion model to capture the generative prior p(y_{t-1} | y_t) and incorporates the conditional information during sampling. The key insight is to "hijack" the intermediate latent variables y_t or y_{t-1} and enforce consistency with the guidance signal x. In practice, the conditional information is only integrated during sampling through the forward model p(x | y_t), a reverse mapping contrary to our target. If the forward model is linear (e.g. x = Ay), it is not hard to estimate without training. This has led to a series of works leveraging pre-trained diffusion models for solving linear inverse problems in an unsupervised manner [9, 55, 56]. However, if the forward model is nonlinear, a certain approximation is inevitable. For example, Dhariwal and Nichol [14] train a separate classifier to perform class-conditional sampling, referred to as classifier guidance, in which the gradient of the forward model is incorporated for consistency. The result balances sample diversity and fidelity.

2.2.2 Conditional Model
Another line of work focuses on training a conditional diffusion model (CDM) once paired data has been collected. In this scenario, the conditional distribution p(y|x) needs to be modeled given a dataset of paired data $D = \{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of data points. Specifically, the target distribution p(y|x) is a one-to-many mapping: multiple target data y can align with the same source information x. To achieve this, the denoising network takes both the noisy input y_t and the conditional input x to predict the sampled noise. This leaves much room for the design of the conditioning strategy to build an effective CDM. Existing work [46] for super-resolution employs simple concatenation along the channel dimension between the noisy input and the low-resolution image (net(cat[y_t, x], t)), obtaining great results on the recovery of fine-grained details. Since the CDM is based upon a more general relation between the condition and the target, it is much easier to apply directly to the case of multimodal generation. For instance, the latent diffusion model [45] can be applied to class-, segmentation-map- and text-guided synthesis via a separate encoder. In this case, a conditional encoder is leveraged to encode the features of x, and the representation is then aggregated with the denoising network [1] (See Fig. 2.4).

2.3 Diffusion Model for Zero-shot Data Editing
A remarkable property of DMs is that zero-shot inverse problem solving can be attained without problem-specific training. As the forward diffusion process is non-parametric, a rough initialization can be refined by adding noise and applying the diffusion denoiser. For instance, for image colorization [55, 11, 9], the grayscale image can be replicated three times along the channel dimension and then injected with noise up to a certain intermediate timestep before being passed to the denoiser.
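The colorization example can be written in a few lines. A minimal sketch as the text describes it, assuming a helper `sample_from(x_t, t0)` that runs the reverse chain of Section 2.1.5 from timestep t0 down to 0; both helper names are hypothetical.

import torch

def zero_shot_colorize(gray, t0, alpha_bar, sample_from):
    # Replicate the grayscale image across channels, diffuse it to an
    # intermediate timestep t0, then let the pretrained denoiser finish.
    x0 = gray.repeat(1, 3, 1, 1)                    # (B, 1, H, W) -> (B, 3, H, W)
    ab = alpha_bar[t0]
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * torch.randn_like(x0)  # forward corruption
    return sample_from(xt, t0)                      # reverse diffusion from t0 to 0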
In this section, we review how a pretrained diffusion model performs zero-shot image inpainting; we will see later how the same logic generalizes from the pixel level to the image level, in the sense that training a joint diffusion model with paired data resembles conditional sampling (Section 3.1). The initial idea of inpainting was proposed by Song et al. [54] to illustrate the representation learned by the DMs, and was followed by a series of works [55, 9, 35, 27] exploring robust sampling algorithms that give the most visually harmonious examples and the highest quantitative sample quality.

Figure 2.4: SegDiff [1]: a diffusion-based segmentation method. Two separate encoders (G & F) are employed for denoising the noisy segmentation mask and extracting features from the image. The encodings are integrated by direct sum in the latent space.

As for the vanilla sampling method, the image is subdivided into known regions for conditioning and unknown regions for generation. The intermediate noisy image comes from two directions: the forward diffusion of the known regions, and the reverse diffusion of the unknown regions:

x_{t-1}^{known} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I).
x_{t-1}^{unknown} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).
x_{t-1} = m \odot x_{t-1}^{known} + (1 - m) \odot x_{t-1}^{unknown}.    (2.22)

where m is the inpainting mask, $x_{t-1}^{known}$ samples from the forward process, and $x_{t-1}^{unknown}$ samples from the reverse process.
The combined output via masking enforces harmony between the known and unknown regions for the next step's evaluation. The pseudocode³ and examples can be seen in Alg. 1 and Fig. 2.5. We refer the reader to RePaint [35] for more illustrations and examples.
³In RePaint [35], an additional resampling strategy is introduced to enhance the quality of the output. This strategy brings about certain differences in the algorithm in contrast to Alg. 1, while the core ideas behind the algorithms remain unchanged.

Algorithm 1 Diffusion Model for Image Inpainting
1: x_T ∼ N(0, I)
2: for t = T, ..., 1 do
3:   ϵ ∼ N(0, I) if t > 1 else ϵ = 0
4:   x_{t-1}^{known} = √ᾱ_t x_0 + √(1 − ᾱ_t) ϵ
5:   z ∼ N(0, I) if t > 1 else z = 0
6:   x_{t-1}^{unknown} = (1/√α_t) (x_t − (β_t/√(1 − ᾱ_t)) ϵ_θ(x_t, t)) + σ_t z
7:   x_{t-1} = m ⊙ x_{t-1}^{known} + (1 − m) ⊙ x_{t-1}^{unknown}
8: end for
9: Output: x_0

Figure 2.5: Qualitative results on ImageNet 512 × 512 for thin masks from RePaint [35].
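A minimal PyTorch transcription of Algorithm 1 (without RePaint's resampling strategy), assuming a fixed reverse variance σ_t² = β_t for brevity; `eps_model` is the trained noise estimator and the schedule tensors are those sketched in Section 2.1.1.

import torch

@torch.no_grad()
def inpaint(eps_model, x0, m, beta, alpha, alpha_bar):
    # Zero-shot inpainting (Alg. 1): m == 1 marks the known pixels of x0.
    x = torch.randn_like(x0)                        # x_T ~ N(0, I)
    for t in reversed(range(beta.numel())):
        eps = torch.randn_like(x0) if t > 0 else torch.zeros_like(x0)
        known = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps  # forward-diffused knowns
        z = torch.randn_like(x0) if t > 0 else torch.zeros_like(x0)
        tb = torch.full((x0.shape[0],), t, device=x0.device)
        mean = (x - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_model(x, tb)) / alpha[t].sqrt()
        unknown = mean + beta[t].sqrt() * z         # reverse step, Eq. (2.19)
        x = m * known + (1 - m) * unknown           # Eq. (2.22)
    return x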
2.4 Medical Segmentation Network
The medical segmentation network is a special type of neural network $f_\theta(x)$ for partitioning the image pixels into desired categories. The network projects the image into a probability map over the targeted classes, and the class with the maximum probability is then taken as the final output:

y = \arg\max(f_\theta(X), \text{dim} = 1).    (2.23)

where X ∈ (H, W, 3), $f_\theta(X)$ ∈ (C, H, W), y ∈ (H, W), and C is the number of classes.
Eq. (2.23) can be applied to both 2D and 3D biomedical segmentation tasks; the key distinction lies in the additional spatial dimension present in 3D data. It is also apparent that directly training a 3D network is more computationally expensive than training a 2D one. As the neural network is key to the medical segmentation performance, in this section we review the biomedical segmentation network architecture. It is also noted that, coincidentally, the same network is utilized to train the DDPM, owing to task similarity.

Figure 2.6: The U-Net architecture: a powerful deep learning framework for semantic segmentation tasks in medical imaging. It comprises a U-shaped network structure with both contracting and expanding paths. The contracting path consists of convolutional and pooling layers to capture context, while the expanding path uses transposed convolutions for precise localization. Skip connections facilitate the integration of multi-scale features, enabling different sizes of receptive fields.

2.4.1 U-Net
The U-Net (See Fig. 2.6), proposed by Ronneberger et al. [44], is a CNN architecture originally designed for biomedical segmentation, and it was soon extended to many applications where the model's input and output share the same spatial resolution. The name "U-Net" originates from the overall U-shaped design of the architecture, which resembles an encoder-decoder framework. Today, the encoder and decoder are no longer limited to CNN structures but can be general feature extractors and interpreters. The encoder-decoder structure is highly effective for tasks where pixel-wise understanding is required for the output, such as semantic segmentation, denoising, etc. Overall, the entire architecture consists of two main parts: the contracting path for downsampling and the expanding path for reconstruction.
The contracting path resembles a typical CNN encoder and is responsible for capturing context and extracting features from the input image. It is composed of a series of convolutional and pooling layers, which progressively reduce the spatial dimensions of the input while increasing the number of feature channels. The pooling operations are used for downsampling, contracting the image into a lower-resolution feature map. These operations, including max pooling, average pooling or strided convolutional kernels, capture the context and extract higher-level features.
The expanding path serves as the decoder and reconstructs the segmented image from the lower-resolution feature map. It consists of a series of up-convolutional layers, also known as transposed convolutions or deconvolutions [42]. The up-convolutional layers perform upsampling, implemented by either parameter-free interpolation or a convolutional kernel, increasing the spatial dimensions of the feature map while reducing the number of feature channels. To compensate for the information loss during downsampling and to fuse the multi-scale feature maps at different spatial levels, skip connections are adopted to aggregate the upsampled features with the corresponding feature maps from the contracting path. The skip connection is critical for recognizing objects of various sizes, as the fused multi-scale features have different receptive fields, and it also supports active gradient flow with more paths for backpropagation. In the original design [44], the skip connection is implemented by simple concatenation. The concatenated features are then further processed through convolutional layers to refine the segmentation. The contracting and expanding paths are symmetric, with skip connections between them.
Overall, the U-Net architecture has proven to be very successful for various image segmentation tasks, particularly in the medical domain. Its ability to learn from limited training data and produce accurate pixel-level predictions has made it a popular choice in the field of image analysis.
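To make the contracting/expanding description concrete, here is a deliberately small U-Net sketch with two resolution levels and concatenation skips. It illustrates the structure of [44] but is not the thesis's segmentation network; all sizes are illustrative.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Two 3x3 convolutions with ReLU, as in the original U-Net design.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)                        # contracting path
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)  # expanding path
        self.dec1 = conv_block(base * 2, base)                     # base*2 after the skip concat
        self.head = nn.Conv2d(base, n_classes, 1)                  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))        # skip connection
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 3, 256, 256))
mask = logits.argmax(dim=1)     # per-pixel class map, i.e. Eq. (2.23)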
2.4.2 nnU-Net
nnU-Net [25] (See Fig. 2.7) is a deep learning framework specifically designed for volumetric medical segmentation tasks. It is purely based on the U-Net [44] architecture, but heavily engineered with a multitude of training recipes built upon the data format, the exact memory constraints, etc.

Figure 2.7: nnU-Net's interdependence among the hyperparameter setting, data pre-/post-processing, and model selection.

This results in a highly generalized and self-configuring framework that makes practical training decisions via heuristic rules in terms of hyperparameters and data processing. nnU-Net achieves state-of-the-art results on a majority of tasks without detailed network architecture design, instead systematizing the complex process of manual method configuration, which was previously addressed either by cumbersome manual tuning or by purely empirical approaches with practical limitations [25]. nnU-Net is a comprehensive toolbox with high flexibility and robustness, making it a valuable tool and baseline for medical imaging researchers and practitioners. Because of this, it is also an appropriate training instance for assessing the effectiveness of new plug-and-play methods (e.g. data augmentation) for volumetric segmentation. The whole design of distilling the domain knowledge encompasses three groups of parameters: fixed, rule-based and empirical parameters. In the rest of this section, we briefly review the core design. The full details are available in [25].
The fixed parameters, involving the architecture, optimizer, learning rate, etc. (See Fig. 2.7), are optimized for joint configuration across multiple datasets. This part is determined manually by the designer and stays fixed regardless of the dataset being applied to.
The rule-based parameters manage the data pre-processing and the setting of some of the hyperparameters. This decision is a set of heuristic if-else rules inspired by the experience of the designer. For instance, the distribution of spacing decides the image resampling strategy (See Fig. 2.7): if the collected data is anisotropic⁴, cubic spline interpolation is employed within the image plane and nearest-neighbor interpolation along the out-of-plane axis, while cubic spline interpolation is applied to all planes otherwise. A minimal sketch of this rule follows at the end of this section.
The empirical parameters mainly decide the data post-processing and the model inference. For example, the best-configured model among the 2D, 3D, and 3D cascade models is chosen for inference according to their validation performance. Besides, nnU-Net utilizes an ensembling approach during the inference phase to further boost performance. It combines the predictions from multiple models trained on different training sets or with different settings. This ensemble of models helps to capture diverse information and improve the segmentation results.
⁴According to [5], anisotropic refers to a situation where the pixel or voxel dimensions are not equal in all directions. In the context of medical imaging, it means that the spacing between pixels or voxels differs along different axes.
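The spacing rule above might look like the following sketch. The anisotropy threshold and the use of scipy.ndimage.zoom are illustrative assumptions on our part, not nnU-Net's actual implementation.

import numpy as np
from scipy.ndimage import zoom

def resample(volume, spacing, target_spacing, aniso_ratio=3.0):
    # Heuristic resampling in the spirit of nnU-Net's rule-based parameters:
    # anisotropic data -> cubic spline in-plane, nearest neighbor out-of-plane.
    spacing = np.asarray(spacing, dtype=float)
    factors = spacing / np.asarray(target_spacing, dtype=float)
    if spacing.max() / spacing.min() > aniso_ratio:     # anisotropic case
        axis = int(spacing.argmax())                    # low-resolution, out-of-plane axis
        out = volume
        for ax in range(3):                             # resample one axis at a time
            f = [1.0, 1.0, 1.0]
            f[ax] = factors[ax]
            out = zoom(out, f, order=0 if ax == axis else 3)
        return out
    return zoom(volume, factors, order=3)               # isotropic: cubic spline everywhere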
3 Joint Diffusion Model for Paired Data Synthesis
In this chapter, we illustrate the Joint Diffusion Model (JDM), a generative framework for paired data synthesis in parallel with the classical DMs and the CDM. We show how a joint model is trained and sampled to produce data, and how to leverage a joint model for conditional sampling, inspired by zero-shot data editing using pretrained DMs (See Section 2.3). Generally speaking, the JDM still follows the training paradigm of the DMs; however, instead of injecting noise and predicting the sampled noise for a single-modality input, the same operation is performed on multiple modalities simultaneously. To that end, joint modeling incorporates the marginal distributions of single-modality data as well as conditional sampling, resembling image inpainting using pretrained DMs [35]. Section 3.1 discusses the framework of the JDM in detail. Section 3.2 further illustrates the theory of conditional sampling.

3.1 Joint Diffusion Model
Suppose paired data is collected as $\{(x_{i1}, x_{i2}, \dots, x_{im})\}_{i=1}^{N}$, where N is the number of data points and m is the number of modalities (e.g. a text-image pair corresponds to m = 2). Rather than modeling the conditional distribution with parts of the data adopted for conditioning, a JDM can be trained to model the joint distribution. Under this circumstance, extending from the classical DMs, the target data distribution switches from the unimodal x to the joint variable $(x_1, x_2, \dots, x_m)$:

p(x) \Rightarrow p(x_1, x_2, \dots, x_m).    (3.1)

In spite of the huge shift from unimodal to multimodal data, the mathematical framework remains the same as the DDPM [22], because we can simply define a joint variable, which results in the same mathematical derivation as the DMs:

x := (x_1, x_2, \dots, x_m).    (3.2)
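Concretely, for m = 2 (medical image and segmentation mask), joint training only changes what the diffusion operates on: the two modalities are concatenated along the channel dimension and diffused together, exactly as in the unimodal DDPM. A minimal sketch under that assumption (the channel layout is illustrative):

import torch
import torch.nn.functional as F

def jdm_loss(eps_model, img, mask, alpha_bar):
    # One JDM training loss: diffuse the joint variable x = (image, mask)
    # and regress the joint noise, i.e. L_simple applied to Eq. (3.2).
    x0 = torch.cat([img, mask], dim=1)              # joint variable, e.g. (B, 3+1, H, W)
    alpha_bar = alpha_bar.to(x0.device)
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],), device=x0.device)
    z = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * z       # both modalities corrupted at once
    return F.mse_loss(eps_model(xt, t), z)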
3.2 Conditional Sampling

There are several advantages to treating the paired data entirely as the target in the context of DMs. First, joint modeling incorporates the marginal distributions: if the joint distribution p(x_1, x_2, \ldots, x_m) is correctly modeled, we can sample any uni-modal data x_i by simply dropping the remaining modalities \{x_\tau\}_{\tau \neq i}. Second, it also permits conditional sampling given an arbitrary number of modalities. For instance, if a joint distribution is modeled for CT and MRI images of the same patient, one can use either the CT or the MRI to induce the generation of the other.

Theoretically, assuming the number of available modalities is k, we reorder the multimodal data as (x_1, \ldots, x_{m-k}) and (x_{m-k+1}, \ldots, x_m), with the former for sampling and the latter for conditioning. The conditional distribution can be decomposed by the multiplication rule:

    p(x_1, \ldots, x_{m-k} \mid x_{m-k+1}, \ldots, x_m) = \frac{p(x_1, \ldots, x_m)}{p(x_{m-k+1}, \ldots, x_m)}    (3.3)

where m is the number of modalities. In the diffusion setting, the distribution is characterized as the progressively denoised distribution. Therefore, the practical distribution for sampling is

    \frac{p(x_1, \ldots, x_m)}{p(x_{m-k+1}, \ldots, x_m)} \Rightarrow \frac{p(x_{t-1,1}, \ldots, x_{t-1,m} \mid x_{t,1}, \ldots, x_{t,m})}{p(x_{t-1,m-k+1}, \ldots, x_{t-1,m})}    (3.4)

where t is the timestep of the gradual denoising process. Note that this results in two easy-to-sample distributions in practice. For the denominator in Eq. (3.4), we can use the forward diffusion to corrupt the data, since (x_{m-k+1}, \ldots, x_m) is known. For the numerator in Eq. (3.4), the denoised joint distribution can be sampled in reverse with a trained joint model at hand:

    p(x_{t-1,1}, \ldots, x_{t-1,m} \mid x_{t,1}, \ldots, x_{t,m}) \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
    p(x_{t-1,m-k+1}, \ldots, x_{t-1,m}) \sim \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1 - \bar{\alpha}_{t-1}) I)    (3.5)

where \theta denotes the model parameters and \bar{\alpha}_{t-1} comes from the pre-defined noise schedule (see Section 2.1.1). Notably, this resembles zero-shot image inpainting with a pretrained diffusion model (Section 2.3), in which the image is subdivided into known and unknown regions. In effect, training a JDM and performing conditional sampling lifts the target from the pixel level to the image level. Hence, the JDM is also orthogonal to various zero-shot inverse problem solvers using the DMs [55, 10, 9]. Empirically, the sampling can be achieved by replacing the generated intermediate noisy data of the known modality with the known data corrupted by the forward process.

However, extending from uni-modal data to higher-dimensional paired data inevitably poses a challenge to the neural network. It may lead to failures in mode capturing, unstable training, etc., as the network shoulders the extra burden of learning the mutual dependence among the multimodal data rather than the unilateral dependence in the conditional model. Hence, in this thesis, we study the simple case where m = 2 and train the JDM on medical images and segmentation masks, a task that usually suffers from data sparsity. A trained JDM can ideally produce an unlimited amount of data. Given the great sample quality and diversity, we expect the JDM to serve as a data augmentation tool that improves the performance of medical image segmentation. As for mask-conditioned sampling, we can add an extra step of substituting the generated mask with the true mask injected with noise, which amounts to only a few lines of additional code (see Alg. 2). Theoretically, a trained model could also be used for medical segmentation itself, but this is beyond the scope of this work.

Figure 3.1: The gradual denoising procedure of the Joint Diffusion Model (JDM). A paired sample of medical image and segmentation mask is produced with a single reverse run.

Algorithm 2 Conditional Sampling
 1: Initialization:
 2: x_T ∼ N(0, I), m_0 := mask_true
 3: for t = T, ..., 1 do
 4:     z ∼ N(0, I) if t > 1 else z = 0
 5:     x_{t-1} = μ_θ(x_t, t) + Σ_θ(x_t, t)^{1/2} z
 6:     img_{t-1}, mask_{t-1} = chunk(x_{t-1}, dim = 1)
 7:     mask_{t-1} ∼ N(√(ᾱ_{t-1}) m_0, (1 − ᾱ_{t-1}) I)
 8:     x_{t-1} = cat([img_{t-1}, mask_{t-1}], dim = 1)
 9: end for
10: Output: x_0
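As a complement to Alg. 2, the following is a minimal PyTorch sketch of mask-conditioned sampling. Here `posterior` is assumed to wrap the trained noise-prediction network and return the reverse-step mean and variance (the learned-variance parameterization of Section 2.1.4); the channel sizes follow our m = 2 setting with 3 image channels and 1 mask channel.

    import torch

    @torch.no_grad()
    def conditional_sample(posterior, mask_true, alphas_cumprod, shape):
        """Mask-conditioned reverse sampling (sketch of Alg. 2).

        mask_true: (B, 1, H, W) clean conditioning mask m_0 in [-1, 1];
        shape: (B, 4, H, W), i.e. 3 image channels plus 1 mask channel.
        """
        T = len(alphas_cumprod)
        x = torch.randn(shape)                          # x_T ~ N(0, I)
        for t in range(T, 0, -1):
            mean, var = posterior(x, t - 1)             # 0-based schedule index
            z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
            x = mean + var.sqrt() * z                   # reverse step to x_{t-1}
            img, mask = x.split([3, 1], dim=1)
            if t > 1:
                # Replace the generated mask with the forward-corrupted true mask.
                a_bar = alphas_cumprod[t - 2]           # \bar{alpha}_{t-1}
                mask = a_bar.sqrt() * mask_true \
                       + (1 - a_bar).sqrt() * torch.randn_like(mask_true)
            else:
                mask = mask_true                        # \bar{alpha}_0 = 1: clean mask
            x = torch.cat([img, mask], dim=1)
        return x                                        # x_0 = (image, mask) pair

Dropping the mask-replacement branch recovers unconditional paired sampling, and replacing the image channels instead would condition on the image, illustrating the symmetry of the joint model.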
4 Experiments

In this chapter, we discuss the experimental setup of this work. This involves the dataset descriptions, the experimental environment, the training and evaluation setups, and the performance metrics.

4.1 Experimental Setups

Software Setup: We use PyTorch [40], an open-source deep learning framework that provides a flexible and efficient platform for building and training various types of deep learning models. The experiments on the JDM are based on the code¹ from [64]. For biomedical segmentation, we use the training instance nnU-Net². The code and scripts to reproduce this work are accessible at https://github.com/CaviarLover/diffAug.

Hardware Setup: All experiments are executed on two NVIDIA GeForce RTX 3090 GPUs and an AMD Ryzen 9 3950X 16-core processor.

¹ https://github.com/JuliaWolleb/Diffusion-based-Segmentation
² https://github.com/MIC-DKFZ/nnUNet

Transfer Learning on Synthetic Data: We adopt a two-step process to perform supervised transfer learning on synthetic data, sketched in code after the list:

1. Pretraining with Synthetic Data: We begin by pretraining the segmentation model using supervised learning on the synthetic data samples. This initial training step allows the model to gain insights from the generated data and learn relevant features.

2. Fine-tuning with Real Data: After the pretraining phase, we proceed to fine-tune the segmentation model on real data. By fine-tuning on real data, the model can adapt and refine its learned features to better align with the characteristics of real-world data.
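The following sketch outlines this two-step scheme. The optimizer, learning rates, and loaders are illustrative placeholders rather than the exact nnU-Net training pipeline.

    import torch

    def pretrain_then_finetune(model, synthetic_loader, real_loader, loss_fn,
                               pretrain_epochs=400, finetune_epochs=150):
        """Two-step transfer learning (sketch): pretrain on synthetic
        (image, mask) pairs, then fine-tune the same weights on real pairs."""
        # Step 1: supervised pretraining on synthetic pairs.
        opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.99)
        for _ in range(pretrain_epochs):
            for image, mask in synthetic_loader:
                opt.zero_grad()
                loss_fn(model(image), mask).backward()
                opt.step()
        # Step 2: fine-tuning on real data, reusing the pretrained weights.
        opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
        for _ in range(finetune_epochs):
            for image, mask in real_loader:
                opt.zero_grad()
                loss_fn(model(image), mask).backward()
                opt.step()
        return model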
It is worth noting that an alternative strategy could involve mixing real and synthetic data during training. However, our preliminary experiments show that the success of this scheme depends heavily on the mixing proportion and varies significantly across datasets. Due to the complexity involved and the potential difficulty of finding the optimal mixing ratio, we do not consider this strategy in this work. By adopting the two-step approach of pretraining with synthetic data and then fine-tuning on real data, we aim to improve the segmentation performance and leverage the benefits of synthetic data in a more controlled and effective manner.

4.2 Datasets

Kvasir is a polyp segmentation dataset composed of consecutive video frames of endoscopy. The train-test split is adopted from [15], where 900 randomly selected images from Kvasir and 450 images from Clinic-DB form the training set.

GlaS comprises microscopic images of slides stained with Hematoxylin and Eosin (H&E). These images, with corresponding ground truth annotations by expert pathologists, are used for training and testing. The dataset contains 85 images for training and 80 images for testing.

MoNuSeg is created from images of H&E-stained tissue captured at 40x magnification, covering multiple organs and patients. It encompasses 30 training images with approximately 22,000 nuclear boundary annotations and 14 testing images with over 7,000 nuclear boundary annotations.

Testsets: Kvasir [26], Clinic-DB [4], CVC-ColonDB [57], CVC-300 [60] and ETIS [49] are five polyp segmentation datasets commonly used to demonstrate the effectiveness of polyp segmentation models. Since the generative model is trained on the mixed dataset of Kvasir and Clinic-DB as mentioned earlier, we designate Kvasir and Clinic-DB as the validation set, and CVC-ColonDB, CVC-300, and ETIS as the generalization set. The test set of MoNuSeg with 14 images is used to evaluate nuclei segmentation. The segmentation model is trained on all images resized to 256 (polyps) or 512 (nuclei); inference is performed at the same resolution, and the predictions are upsampled to the original resolution via nearest-neighbor upsampling.
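To make the resizing convention concrete, the following is a small sketch of the inference wrapper implied by the description above. The nearest-neighbor upsampling of the prediction follows the text; the bilinear downsampling of the input and all names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def segment_full_resolution(model, image, train_size=256):
        """Run inference at the training resolution and restore the original size.

        image: (B, C, H, W). The predicted mask is upsampled back to (H, W)
        with nearest-neighbor interpolation, as done for the test sets.
        """
        orig_h, orig_w = image.shape[-2:]
        x = F.interpolate(image, size=(train_size, train_size),
                          mode="bilinear", align_corners=False)
        logits = model(x)
        mask = logits.argmax(dim=1, keepdim=True).float()
        return F.interpolate(mask, size=(orig_h, orig_w), mode="nearest")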
4.3 Hyperparameters

4.3.1 Joint Diffusion Model (JDM)

Key hyperparameters of the JDM are summarized in Table 4.1. Aligned with [41], we train the generative model with the number of residual blocks at each level set to 2. The number of base channels is 128, and attention blocks with 4 heads are applied at the U-Net levels with downsampling factors of 8 and 16. We do not use dropout during training. For more detailed information, we recommend the publication [41] and the GitHub repository at https://github.com/openai/improved-diffusion, which provide additional details and resources related to this setting.

Train Period. Since there is no suggested number of training steps for a diffusion model, we train the model for 60k steps on Kvasir, 40k on MoNuSeg, and 40k on GlaS. After training, the checkpoints from the second half of training are loaded to visualize the generated samples. We pinpoint the number of steps that gives the most harmonious samples, i.e. consistency between the generated segmentation mask and the medical image; the early checkpoints lack fine-grained image features and are less harmonious. We do not quantitatively check a sample-quality metric such as the Fréchet Inception Distance (FID) [23] for picking the checkpoint.

Table 4.1: The hyperparameter setting of the JDM. (*) The number of residual blocks is set to 2 and 3 for Kvasir and MoNuSeg, respectively.

setting                                      value
number of diffusion steps                    1000
learned variance                             True
dropout                                      0.0
batch size                                   10
learning rate                                1e-4
number of base channels                      128
channel multiplier                           1, 1, 2, 2, 4, 4
number of residual blocks                    2 (3)*
number of input channels                     4
number of heads per attention block          4
depth levels applied with attention block    16, 8

4.4 Metrics

To evaluate the segmentation results, we adopt the two most commonly used metrics: the mean Dice score (mDice) and the mean Intersection over Union (mIoU). mDice, also known as the Dice coefficient or Dice similarity coefficient, measures the similarity between the predicted segmentation and the ground truth: it quantifies the overlap between the two by computing the ratio of twice the intersection of the predicted and ground truth regions to the sum of their areas. mIoU is another widely used metric that evaluates the accuracy of segmentation algorithms: it calculates the ratio of the intersection area to the union area between the predicted and ground truth regions, averaged over all classes in the dataset. The two metrics are computed as follows:

    \mathrm{mDice} = \frac{2 \sum_{i=1}^{c} TP_i}{\sum_{i=1}^{c} (TP_i + FP_i) + \sum_{i=1}^{c} (TP_i + FN_i)}    (4.1)

    \mathrm{mIoU} = \frac{1}{c} \sum_{i=1}^{c} \frac{TP_i}{TP_i + FP_i + FN_i}    (4.2)

where TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives for class i, and c is the total number of classes.
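For concreteness, a NumPy sketch of the two metrics computed from integer label maps is given below; the aggregation follows Eqs. (4.1) and (4.2), while the epsilon guard against empty classes is an illustrative addition.

    import numpy as np

    def mdice_miou(pred, target, num_classes, eps=1e-8):
        """Compute mDice (Eq. 4.1) and mIoU (Eq. 4.2) from label maps."""
        tp = np.zeros(num_classes)
        fp = np.zeros(num_classes)
        fn = np.zeros(num_classes)
        for i in range(num_classes):
            tp[i] = np.sum((pred == i) & (target == i))
            fp[i] = np.sum((pred == i) & (target != i))
            fn[i] = np.sum((pred != i) & (target == i))
        mdice = 2 * tp.sum() / (np.sum(tp + fp) + np.sum(tp + fn) + eps)
        miou = np.mean(tp / (tp + fp + fn + eps))
        return mdice, miou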
4.5 Baselines

nnU-Net. The training of nnU-Net [25] strictly adheres to the guidelines presented in the official implementation. For detailed instructions, please refer to the nnU-Net repository on GitHub: https://github.com/MIC-DKFZ/nnUNet.

Polyp Segmentation. Besides nnU-Net [25], we include other methods for comparison. PraNet [15] and CaraNet [24] are two methods designed specifically for polyp segmentation. PraNet set the best performance back in 2020, and CaraNet is known for its ability to effectively detect small medical objects by combining various modules to achieve this capability. The results can be found on the leaderboard of paperswithcode.com³, which showcases the latest state of the art on computer vision benchmarks.

Nuclei Segmentation. We incorporate several state-of-the-art methods for comparison on the MoNuSeg [31] dataset. Each of these methods introduces unique advancements to the U-Net [44] architecture to improve its performance in medical image segmentation. The results are extracted from the leaderboard of paperswithcode.com⁴.

1. U-Net++ [66]: enhances the standard U-Net by deeply supervising the intermediate feature maps. This strategy compels the low-level features to become more discriminative, leading to improved segmentation results.

2. Attention U-Net [38]: incorporates an attention gate mechanism on the skip connections to selectively highlight the salient features of the encoder. This pruning process helps the model focus on relevant information.

3. MedT [58]: combines transformer and CNN structures, creating two separate branches to process medical images. The transformer branch captures long-range dependencies while the CNN branch handles local features, improving the model's ability to capture both global and local context.

4. UCTransNet [61]: builds upon the U-Net architecture with a global feature aggregation module that effectively fuses representations from different levels of the encoder and decoder, leading to a more comprehensive understanding of the input.

5. SMESwin Unet [62]: leverages a superpixel segmentation branch to identify and exclude irrelevant features from the original image, so that the Swin Transformer [34] based U-Net focuses solely on important regions.

³ https://paperswithcode.com/sota/medical-image-segmentation-on-kvasir-seg
⁴ https://paperswithcode.com/sota/medical-image-segmentation-on-monuseg

5 Results and Discussion

In this chapter, we present the main results of this thesis. We explore the capability of the Joint Diffusion Model (JDM) to produce paired medical image segmentation data, and we train the segmentation model based upon nnU-Net [25] to investigate the usefulness of the synthetic data. In Section 5.1, we visualize the generated samples on several datasets, including paired data synthesis and mask-conditioned synthesis. In Section 5.2, we train the segmentation model in a supervised manner using purely synthetic data for polyp and nuclei segmentation. In Section 5.3, we further leverage the pretrained checkpoints and fine-tune the segmentation model on the real data. In both cases, we compare different amounts of synthetic data.

5.1 Visual Comparison

In this section, we train the JDM on three medical image segmentation datasets: Kvasir [26], MoNuSeg [31] and GlaS [50]. The generative model is trained from scratch. Full descriptions of the datasets can be found in [26], [50], and [31].

5.1.1 Pair Data Synthesis

We highlight some of our generated samples and their associated segmentation masks (see Fig. 5.1). We further qualitatively investigate the similarity between each synthetic sample and the closest sample in the training set in terms of mutual information.

As shown in Fig. 5.1, the synthetic samples demonstrate the effectiveness of the JDM in generating high-quality images with intricate details. They showcase the model's capability to capture the nuances of light contrast in endoscopy and to accurately render the small text overlays on patient images, particularly for the polyp images (see the last two rows of Fig. 5.1). Furthermore, the generated segmentation masks, together with their corresponding images, exhibit remarkable visual consistency and are appropriately positioned. Notably, the results display significant diversity in the generated samples, despite being trained on a limited number of large-resolution images (512 × 512 for MoNuSeg with only 30 images). The first two rows of Fig. 5.1 for MoNuSeg are particularly distinct from each other in various aspects, including the background, color composition, size of the nuclei, and compactness of the nuclear structures. This demonstrates the model's ability to produce diverse and varied outputs even when trained on a relatively small set of high-resolution images.

To assess the true effectiveness of the model distribution, it is necessary to verify that the model does not simply memorize the training set, especially for datasets with a limited number of samples. This verification can be accomplished by first computing a distance between each generated sample and every image used for training. The real image with the smallest distance is then visually compared with the synthetic one: the validity of the model distribution is supported if clear visual differences are observed. Following [12], a similar work on retina image and vessel network synthesis, we adopt the mutual information measure to quantify the distance between the training and synthetic segmentation masks. We denote a generated segmentation mask as s̃ and a true mask as s. The metric is computed as follows:

    \mathrm{MI}(s, \tilde{s}) = \sum_{s_i \in s} \sum_{\tilde{s}_i \in \tilde{s}} p(s_i, \tilde{s}_i) \log \frac{p(s_i, \tilde{s}_i)}{p(s_i) \, p(\tilde{s}_i)}    (5.1)

where s_i and s̃_i are the pixel values of s and s̃, p(s_i, s̃_i) is the joint histogram of the true and synthetic segmentation masks, and p(s_i), p(s̃_i) are the marginals. For each synthetic mask, the closest sample is extracted from the training set for visual comparison. Randomly selected examples are shown in Fig. 5.2.
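A small NumPy sketch of Eq. (5.1) and of the closest-sample retrieval follows; the 256-bin joint histogram is an illustrative choice.

    import numpy as np

    def mutual_information(s, s_tilde, bins=256):
        """Mutual information between two masks via their joint histogram (Eq. 5.1)."""
        joint, _, _ = np.histogram2d(s.ravel(), s_tilde.ravel(), bins=bins)
        p_joint = joint / joint.sum()
        p_s = p_joint.sum(axis=1, keepdims=True)    # marginal of the true mask
        p_t = p_joint.sum(axis=0, keepdims=True)    # marginal of the synthetic mask
        nz = p_joint > 0
        return np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_s @ p_t)[nz]))

    def closest_training_mask(synthetic_mask, training_masks):
        """Index of the training mask most similar to the synthetic one."""
        scores = [mutual_information(m, synthetic_mask) for m in training_masks]
        return int(np.argmax(scores))

Note that a higher mutual information indicates a more similar mask, so the closest sample is the one maximizing the score.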
The findings indicate noticeable distinctions between the synthetic images and the real images, in spite of the high structural similarity between the segmentation masks. This is particularly evident for the polyp images and the histopathological images from the MoNuSeg dataset (e.g. nuclei of different sizes, background color, etc.). We will see later that mask-conditioned synthesis (Section 5.1.2) yields a great diversity of images, which further reinforces the observation that synthetic samples differ from real ones. In the case of the GlaS dataset, the synthetic images primarily differ from the real images in style and fine-grained details, while the main content of the images remains relatively consistent. This overall similarity might be partly attributed to the size of the training set and the low resolution of the training images: the limited training set may obscure the overall interdependence among different regions of the images, and the low resolution together with the small dataset size might miss some intricate details.

5.1.2 Mask Conditioned Synthesis

We investigate the outcomes of mask-conditioned synthesis using the JDM (see Section 3.2) for polyp and nuclei images. To demonstrate that our method goes beyond simple memorization of the training set, we apply the training masks for conditional sampling. The qualitative results can be found in Fig. 5.3. Figs. 5.3a and 5.3b provide visual evidence that the generated images induced by the training masks exhibit significant differences compared to the original training images; the main distinctions lie in context, texture, and contrast. The substantial deviation between the generated and training images again indicates the validity of the model distribution, the image representation, and the mutual consistency.

5.1.3 Discussion

In this section, we visually evaluated the generated samples on both the polyp and nuclei segmentation datasets. The analysis reveals that the synthetic paired data generated by the JDM exhibits visual harmony and a high level of consistency between the generated images and their associated segmentation masks. The model is capable of producing paired samples that possess both quality and diversity.

Figure 5.1: Randomly generated samples of nuclei and polyp images and their associated segmentation masks. All images and masks are resized to 256 for display.

Figure 5.2: Randomly generated samples and the closest training data in terms of mutual information between segmentation masks. From left to right: synthetic image, associated synthetic segmentation mask, closest real image, closest real segmentation mask.

Figure 5.3: Synthetic medical images guided by the training masks of (a) MoNuSeg [31] and (b) Kvasir [26] using the joint diffusion model. The significant difference in contrast to the real images demonstrates the sample diversity and the validity of the model distribution.

For polyp segmentation, the generated samples appropriately position the polyps within the images; however, the boundaries of the polyps may not be entirely smooth. The model successfully avoids simple memorization of the training images, as evidenced by comparing the closest training samples with the synthetic ones. As for the GlaS dataset, the results exhibit a high degree of structural similarity to the real images. This outcome might be partly attributed to the size of the training set and the relatively low resolution of the images used, which might lower the learning difficulty for the generative model and thus prevent it from capturing the interdependence among different image regions. Hence, we suggest training the DDPM at a large resolution under data sparsity. For future work, we plan to quantitatively explore how the training time impacts the sample quality and diversity. In the next section, we investigate the generated images for training the segmentation model.

5.2 Medical Segmentation on Synthetic Data

In this section, we investigate the segmentation performance when training on purely synthetic data. Given that an unlimited amount of data can be produced, we compare different numbers of synthetic images used for training. The results also indicate how many images should be used for transfer learning in Section 5.3. We start with the same number of synthetic data points as the original dataset and then gradually increase the data percentage; the amount of synthetic data is quantified as the proportion relative to the training set.

5.2.1 Medical Segmentation on Polyps

We set the number of training epochs to 400, as the validation curve saturates after roughly 300 epochs. The test data is split into two subgroups (validation set and generalization set) for Results 1 & 2.

Result 1: We compare different numbers of synthetic samples for training the segmentation model. The results are presented in Table 5.1 and the first row of Fig. 5.4, in contrast to training on real data.
Table 5.1: Results on the validation sets of polyp segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model, relative to the size of the real training set. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Training images      Kvasir                        Clinic-DB
                     mDice(%)       mIoU(%)        mDice(%)        mIoU(%)
Real                 90.81          85.16          88.54           83.91
Synthetic - 100%     85.87          78.71          77.40           70.66
Synthetic - 150%     86.50          79.76          77.46           70.28
Synthetic - 200%     86.36          79.60          77.69           71.06
Synthetic - 250%     87.08 (-3.73)  80.41 (-4.75)  78.01           71.49
Synthetic - 300%     86.00          78.96          78.21 (-10.33)  71.70 (-12.21)
Synthetic - 400%     84.52          77.28          77.19           70.39
Synthetic - 500%     83.84          76.69          77.93           71.35

Note that the results are quite close to those of training on real data, achieving competitive performance on both datasets. Both performance metrics vary with the number of synthetic images: the two metrics peak when 250% or 300% of images are leveraged for training and decline afterwards (see the first row of Fig. 5.4). The performance is more sensitive to the amount of synthetic data on Kvasir than on Clinic-DB, where larger variations are witnessed. Besides, the performance gap between synthetic and real data is much smaller for Kvasir than for Clinic-DB, with the best cases (250% and 300%, respectively) showing differences of 3.73% and 10.33% in mDice. This suggests that the model distribution approximates the distribution of Kvasir more closely than that of Clinic-DB.

Result 2: The results on the generalization sets can be found in Table 5.2 and the last two rows of Fig. 5.4. Unexpectedly, the best outcome (Synthetic - 100%) is achieved with the same number of synthetic images as real ones. Generally, Fig. 5.4 indicates a declining trend as more synthetic data is used to train the model, although all three test sets experience a small bump when the amount of data is 250%. Similar to the validation sets, both mDice and mIoU decrease as the amount of data exceeds 250%. Additionally, the performance drop is large for ETIS at 8.2%, as opposed to 4.4% for Colon-DB and 3.2% for CVC-300.

Figure 5.4: mDice vs. the number of synthetic data. The dashed line denotes the result of training on real data. From left to right and top to bottom: Kvasir, Clinic-DB, Colon-DB, CVC-300, ETIS.

Table 5.2: Results on the generalization sets of polyp segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference compared to the result on real data.

Training images      Colon-DB                      CVC-300                       ETIS
                     mDice(%)       mIoU(%)        mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
Real                 76.62          69.39          85.35          77.84          69.21          62.14
Synthetic - 100%     72.26 (-4.36)  65.53 (-3.86)  82.11 (-3.24)  74.56 (-3.28)  60.96 (-8.25)  53.28 (-8.86)
Synthetic - 150%     71.50          64.77          81.93          74.24          58.19          51.22
Synthetic - 200%     70.77          64.10          81.38          73.63          57.66          50.60
Synthetic - 250%     72.07          65.44          81.94          74.28          60.40          53.18
Synthetic - 300%     71.15          64.25          81.56          73.67          59.05          51.75
Synthetic - 400%     69.94          63.15          81.47          73.16          58.08          50.43
Synthetic - 500%     68.50          62.14          81.27          73.41          56.93          49.14

5.2.2 Medical Segmentation on Nuclei

We set the number of epochs to the default of 1000, as the computational cost is bearable and the general trend of the validation curve only stabilizes between 900 and 1000 training epochs. The other hyperparameter settings are determined by nnU-Net [25] without modification.

Result: We perform an ablation study on the amount of synthetic data, as in Section 5.2.1.
The results of these experiments are presented in Table 5.3 and visualized in Fig. 5.5. Table 5.3 demonstrates that the segmentation performance is only marginally lower than the real-data counterpart (with a difference of 2.44% in mDice and 3.32% in mIoU) when training with 150% of synthetic data. This finding suggests that the JDM can effectively model complex data distributions even when the available data is limited. Fig. 5.5 further illustrates the segmentation model's robustness, showing a slight tendency of performance drop as the amount of synthetic data increases, while overall remaining resilient. Notably, the mDice when using 100% synthetic data is significantly lower than the result obtained on the real data, indicating potential challenges in modeling the small dataset.

Table 5.3: Results on MoNuSeg for nuclei segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Training images      MoNuSeg
                     mDice(%)       mIoU(%)
Real                 81.68          69.10
Synthetic - 100%     75.47          61.15
Synthetic - 150%     79.24 (-2.44)  65.78 (-3.32)
Synthetic - 200%     77.71          63.91
Synthetic - 250%     78.26          64.57
Synthetic - 300%     78.30          64.74
Synthetic - 400%     78.32          64.91
Synthetic - 500%     78.24          64.68
Synthetic - 1000%    77.56          63.77

5.2.3 Discussion

In this section, we evaluated the segmentation model's performance when trained exclusively on various quantities of synthetic images, for two tasks: polyp segmentation and nuclei segmentation.

When a model is trained on synthetic data, its overall performance is encouraging, surpassing that of a randomly initialized model and trailing the real-data results by only a single-digit percentage. This again demonstrates the effectiveness of the model distribution in approximating the data distribution. However, as the quantity of synthetic data increases, the model's performance begins to decline, albeit at varying rates depending on the dataset. This decline can be attributed to the growing deviation between the imperfect model distribution and the true data distribution caused by the increased use of synthetic data. In practice, the optimal "sweet spot" for the amount of synthetic data lies within the range of 100% to 300% of the original dataset, and its exact value should be determined empirically.

Figure 5.5: mDice vs. the number of synthetic data on MoNuSeg. The dashed line denotes the result of training on real data.

Regarding the segmentation gap observed between real and synthetic data for polyp segmentation, it is noteworthy that the performance drop is more pronounced on the validation set, especially on the Clinic-DB dataset, than on the generalization set, even though the generative model is partially constructed using Clinic-DB data. We explain this phenomenon by the fact that the model distribution is actually a mixture of the data distributions of Kvasir and Clinic-DB, but one that leans much closer to the distribution of Kvasir. Despite incorporating Clinic-DB data in the generative model, the synthetic data still exhibits a more significant performance drop on Clinic-DB due to the dominant influence of the Kvasir data distribution in the mixture model.
Furthermore, this mixture model distribution acts as a mitigating factor to some extent, reducing the gap between the validation set and the generalization set and leading to better performance on the generalization set than on Clinic-DB. This finding sheds light on the intricacies of data distributions and their impact on the performance of generative models. For future work, we propose investigating the potential benefits of training a new generative model based on a different mixture of training data. By leveraging diverse datasets and possibly different combinations, such a generative model could offer valuable insights into improving the usability of synthetic data to generalize better across various datasets. In the next section, we focus on evaluating the effectiveness of synthetic data through the lens of transfer learning.

5.3 Transfer Learning on Synthetic Data

In this section, we delve deeper into finding an effective strategy to enhance segmentation performance using synthetic data. The approach involves first pretraining on the synthetic data and then fine-tuning on the real data (see Section 4.1).

5.3.1 Transfer Learning on Polyps

We compare different amounts of pretraining data, using 100%, 250%, 500%, and 1000% of synthetic data. The pretrained checkpoints utilized in these experiments are derived from the segmentation models in Section 5.2.1. The fine-tuning process involves training the model on real data for 150 epochs. To determine the optimal learning rate for fine-tuning, we perform a grid search and find that a learning rate of 1e-2 yields the best results. We carry out the training and fine-tuning for all five folds, employing the same pretraining checkpoint; each experiment is conducted five times and the average is reported. The evaluation is presented in Table 5.4 for the validation set and in Table 5.5 for the generalization set, and the results are again discussed in Results 1 and 2, respectively.

Result 1: The noticeable improvements in model performance are evident from Table 5.4, showcasing the benefits of pretraining with synthetic data. In general, the segmentation performance surpasses that of PraNet and narrows the performance gap with CaraNet, highlighting the potential of properly utilized synthetic data to bridge the differences with task-specific networks. While the enhancement on Kvasir [26] is relatively marginal, there is a substantial increase in performance on Clinic-DB [4], with a remarkable improvement of +3.31% mDice and +3.27% mIoU when leveraging 500% of synthetic data. As anticipated, models pretrained with larger amounts of synthetic data generally tend to perform better. It is worth noting, however, that models pretrained with 250% of synthetic data perform worse than those pretrained with only 100%. This suggests that there may be an optimal amount of synthetic data for pretraining, and that exceeding that threshold can have an unexpected impact on model performance. Hence, careful consideration of the quantity of synthetic data used for pretraining is essential for achieving optimal results.
Table 5.4: Results on the validation set after pretraining on different amounts of synthetic data. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods          # Syn. Pretrain   Kvasir                        Clinic-DB
                                   mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
U-Net [44]       -                 81.80          74.60          82.30          75.50
ResUNet++        -                 81.30          79.30          79.60          79.60
PraNet [15]      -                 89.80          84.00          89.90          84.90
CaraNet [24]     -                 91.80          86.50          93.60          88.70
nnU-Net          -                 90.81          85.16          88.54          83.91
nnU-Net          100%              91.02          85.60          90.21          85.61
nnU-Net          250%              90.81          85.51          89.31          84.87
nnU-Net          500%              91.22 (+0.41)  85.89 (+0.73)  91.85 (+3.31)  87.18 (+3.27)
nnU-Net          1000%             91.09          85.77          91.55          86.96

Result 2: Table 5.5 presents the results on the generalization set, demonstrating that pretraining enhances the segmentation model's ability to generalize. The pretrained nnU-Net outperforms PraNet on all three datasets and performs comparably to CaraNet on the challenging ETIS [49] dataset. Specifically, on Colon-DB [4] the pretrained nnU-Net outperforms the task-specific network, and a notable increase of +5.4% is observed on ETIS when utilizing 500% of pretraining data. The improvement is particularly significant for ETIS, which contains numerous images with small medical objects, showing how pretraining with diverse synthetic data enhances the recognition of such small objects. Figure 5.6 visually demonstrates a more accurate segmentation of small polyps after pretraining. Notably, the most substantial increase for CVC-300 [60] is achieved when using 250% of pretraining data. These findings further disclose that the improvement achieved through pretraining is influenced by the specific dataset and the quantity of synthetic data employed.

Table 5.5: Results on the generalization set after pretraining on different amounts of synthetic data. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods          # Syn. Pretrain   Colon-DB                      CVC-300                       ETIS
                                   mDice(%)       mIoU(%)        mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
U-Net [44]       -                 51.20          44.40          71.00          62.70          39.80          33.50
PraNet [15]      -                 70.90          64.00          87.10          79.71          62.80          56.70
CaraNet [24]     -                 77.30          68.90          90.30          83.80          74.70          67.20
nnU-Net          -                 76.62          69.39          85.45          77.84          69.21          62.14
nnU-Net          100%              77.58          70.80          87.49          80.81          71.71          65.06
nnU-Net          250%              77.90          71.16          87.61 (+2.16)  81.09 (+3.25)  72.10          65.21
nnU-Net          500%              78.19 (+1.57)  70.89 (+1.50)  86.73          79.93          74.61 (+5.40)  67.67 (+5.53)
nnU-Net          1000%             77.65          70.84          87.05          80.46          73.63          66.68

Figure 5.6: Qualitative results. The red boxes highlight the improvement of the segmentation model pretrained on synthetic data. The results are taken from the best cases on ETIS [49] and MoNuSeg [31].

5.3.2 Transfer Learning on Nuclei

Following Section 5.3.1, we conduct the experiments using 100%, 250%, 500%, and 1000% of synthetic data and the checkpoints from Section 5.2.2. We set the number of training epochs to 100. The grid search over the learning rate yields 5e-6 as the optimal value.
Each experiment is conducted five times and the average is reported. The results can be found in Table 5.6.

Table 5.6: Results on MoNuSeg [31]. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods            # Syn. Pretrain   mDice(%)       mIoU(%)
U-Net++            -                 81.17          69.29
Attention U-Net    -                 79.74          67.19
MedT               -                 79.55          66.17
UCTransNet         -                 80.73          68.42
SMESwin Unet       -                 81.13          68.35
nnU-Net            -                 81.68          69.09
nnU-Net            100%              81.13          68.36
nnU-Net            250%              81.57          68.98
nnU-Net            500%              81.80          69.30
nnU-Net            1000%             82.31 (+0.51)  70.03 (+0.73)

Result: As shown in Table 5.6, nnU-Net is already among the top-performing methods on this dataset. Further leveraging the power of pretraining on a much larger dataset comprising ten times more synthetic data, a new state-of-the-art result is achieved, with increases of 0.51% in mDice and 0.73% in mIoU. The first row of Fig. 5.6 shows that the pretrained model excels at capturing fine-grained details present in the medical images, outperforming models trained directly on real data. However, the benefits of pretraining are not universally consistent. When the amount of synthetic pretraining data is kept relatively small (100% and 250%), the performance of the segmentation model actually degrades compared to training on real data alone. We attribute this degradation to the scale of the synthetic dataset (only 30 images at 100%, as opposed to the 100% of Kvasir consisting of 1450 images). With insufficient data for pretraining, the model fails to learn a representation for recognizing the image features, leading to overfitting on the synthetic data and an imperfect initialization.

5.3.3 Discussion

In this section, we further explored the effectiveness of the synthetic data through supervised transfer learning on the polyp and nuclei segmentation test sets, again comparing the amount of synthetic data used for pretraining.

We found a substantial overall increase in performance after pretraining with synthetic data. Moreover, there is a consistent trend of performance improvement with the utilization of more synthetic data, up to a saturation point beyond which further increases show diminishing returns. In essence, pretraining with synthetic data proves to be an effective strategy for enhancing model performance, but there is an optimal balance to be struck in the amount of synthetic data employed to achieve the best results.

On ETIS, the model improves significantly at recognizing small medical objects. We explain this by the better feature perception and the larger number of small objects seen during pretraining. On MoNuSeg, the model outperforms all the state-of-the-art methods in Table 5.6. We believe that the generated samples from the JDM alleviate the problem of imbalanced examples in the training set and enhance the diversity of the training data.

For future work, we aim to delve into more advanced strategies to fully harness the potential of synthetic data, such as a student-teacher model to enhance the segmentation model.
6 Conclusion and Limitations

Diffusion models (DMs) have incited the latest trend in image synthesis thanks to their great sample quality and diversity, accomplished by moving a Gaussian sample towards the data distribution through progressive denoising. This work explores the potential of DMs as a data augmentation tool for biomedical segmentation.

A theoretical framework, the Joint Diffusion Model (JDM), is proposed to synthesize paired samples. We show that joint modeling resembles the classical DMs for conditional sampling and is orthogonal to various methods of solving inverse problems using the DMs, including image inpainting, super-resolution, colorization, etc. The qualitative results on the medical image segmentation datasets present the realism and diversity of the synthetic samples, demonstrating the capability of the JDM even with only a small number of data samples. By using the training segmentation masks for conditional sampling and by identifying the closest matching samples from the training set, we have confirmed that the JDM effectively learns the underlying data distribution rather than merely memorizing the specific training data.

We further investigated the use of synthetic data for biomedical segmentation. Training with synthetic data can lead to segmentation accuracy comparable to that of real data. However, increasing the amount of synthetic data may cause performance to decline. We believe this is caused by the growing deviation between the model distribution and the data distribution; this deviation affects segmentation performance differently across datasets.

For transfer learning, the pipeline involves pretraining on synthetic data and then fine-tuning on real data. This process improves segmentation performance by emphasizing sample diversity, which helps the model recognize fine-grained image features effectively. The implicit likelihood training of the DMs encourages the model to generate diverse samples and thus balances the training data, leading to better generalization and robustness of the segmentation model.

6.0.1 Limitations and Future Work

Joint modeling poses a severe challenge to the network architecture as the number of modalities increases. Hence, it still remains unclear whether the method is solid in the context of multi-class