Generate Realistic Image Segmentation Pair Using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
Master's thesis in Biomedical Engineering
WENYIN ZHOU
DEPARTMENT OF ELECTRICAL ENGINEERING
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2023
www.chalmers.se

Master's thesis 2023
Generate Realistic Image and Segmentation Pair using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
WENYIN ZHOU
A thesis for the degree of Master of Science in Biomedical Engineering
Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Computer Vision and Medical Image Analysis Group
Chalmers University of Technology
Gothenburg, Sweden 2023

Generate Realistic Image Segmentation Pair using Diffusion Probabilistic Model
A Data Augmentation Approach for Biomedical Segmentation
Wenyin Zhou
© Wenyin Zhou, 2023.
Supervisor: Roman Naeem, Department of Electrical Engineering
Examiner: Fredrik Kahl, Department of Electrical Engineering
Master's Thesis 2023
Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Computer Vision and Medical Image Analysis Group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Generate Realistic Image Segmentation Pair using Diffusion Probabilistic Model.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Declaration
I affirm that I am the sole author of this thesis and that it has not been previously submitted, either in its entirety or in part, for any other degree application.

Abstract
Diffusion Models (DMs) [22, 54, 55] are newly emerged deep generative models which produce data via progressive denoising and have achieved state-of-the-art results for image synthesis. Medical image segmentation, a dense prediction task that assigns a pixel-wise classification to medical images, severely lacks data to achieve great performance in the context of deep learning. In this thesis, we seek to improve the segmentation model using generative modeling, particularly the DMs. We systematically review the DMs from the perspective of variational inference, together with several key neural network architectures for both the DMs and biomedical segmentation. Building upon the classical DMs, we propose a training scheme of joint modeling for paired data generation and conditional sampling. To that end, the Joint Diffusion Model (JDM) learns the mutual dependence among multimodal data and is orthogonal to various accelerated sampling methods and inverse problem solvers using the DMs. The qualitative results present the great visual quality and diversity of the generated samples, showcasing the capability of the JDM in modeling the joint distribution even with a limited number of data samples. The quantitative results suggest that the appropriate use of synthetic data can improve the discriminative model in recognizing fine-grained image features, leading to a performance increase for biomedical segmentation. More specifically, we first pretrain the model on synthetic data using supervised learning and then finetune the model on the real data. We ablate the number of synthetic samples used.
The experiments are conducted on five publicly available polyp segmentation datasets and one nuclei segmentation dataset. They show an average increase of +2.5% in mean Dice score on polyp segmentation and set new state-of-the-art results on MoNuSeg [31] using nnU-Net [25].
Keywords: generative model, diffusion model, medical image segmentation, data augmentation.

Acknowledgements
First and foremost, I want to thank my parents for their eternal love and constant support for my cause. I want to thank my supervisor Roman Naeem for his great guidance and for introducing me to this interesting topic. This work has been a long journey, starting from our discussions in October 2022, and would have been impossible without his groundwork, patience and devotion. I want to thank David Hagerman Olzon, Professor Lennart Svensson, and Professor Fredrik Kahl for the useful discussions.
Wenyin Zhou, Gothenburg, July 2023

List of Acronyms
Below is the list of acronyms used throughout this thesis, in alphabetical order:
CDM Conditional Diffusion Model
CNN Convolutional Neural Network
DMs Diffusion Models
DDPM Denoising Diffusion Probabilistic Model
FID Fréchet Inception Distance
GANs Generative Adversarial Networks
JDM Joint Diffusion Model
mDice Mean Dice Coefficient
mIoU Mean Intersection over Union
KL Kullback–Leibler
VAEs Variational Autoencoders

Contents
List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Main Contribution
1.2 Outline
2 Background
2.1 Diffusion Models
2.1.1 Generative Modeling and Diffusion Process
2.1.2 Training Objectives
2.1.3 Simplified Training Objective
2.1.4 Learned Variance
2.1.5 Reverse Sampling
2.1.6 Diffusion Network
2.2 Posterior Sampling using Diffusion Models
2.2.1 Conditional Sampling
2.2.2 Conditional Model
2.3 Diffusion Model for Zero-shot Data Editing
2.4 Medical Segmentation Network
2.4.1 U-Net
2.4.2 nnU-Net
3 Joint Diffusion Model for Paired Data Synthesis
3.1 Joint Diffusion Model
3.2 Conditional Sampling
4 Experiments
4.1 Experimental Setups
4.2 Datasets
4.3 Hyperparameters
4.3.1 Joint Diffusion Model (JDM)
4.4 Metrics
4.5 Baselines
5 Results and Discussion
5.1 Visual Comparison
5.1.1 Pair Data Synthesis
5.1.2 Mask Conditioned Synthesis
5.1.3 Discussion
5.2 Medical Segmentation on Synthetic Data
5.2.1 Medical Segmentation on Polyps
5.2.2 Medical Segmentation on Nuclei
5.2.3 Discussion
5.3 Transfer Learning on Synthetic Data
5.3.1 Transfer Learning on Polyps
5.3.2 Transfer Learning on Nuclei
5.3.3 Discussion
6 Conclusion and Limitations
6.0.1 Limitations and Future Work
Bibliography
A Appendix 1: Properties of the Diffusion Model
A.1 Proofs of Equations (2.4) and (2.5)
A.2 A Proof of Equation (2.6)
A.3 A Proof of Equation (2.9)
A.4 A Proof of Equation (2.10)
A.5 Variational Bound
B Appendix 2: More Generated Samples

List of Figures
1.1 Medical Image Segmentation with Swin UNETR [19] on BRATS2021 [2]. The segmentation network produces segmentation masks for the Enhancing Tumor, Whole Tumor and Tumor Core. From left to right: T1, T1c, T2 and FLAIR MRI axial slice. Image from [19].
2.1 The directed graphical model of DDPM. The latent variables along the sampling path share the same dimension and are hypothesized to be Markovian. Noise is gradually added to the data in the forward diffusion process, while the reverse process recovers the lost semantic information and pushes one step closer to the clean data with each network evaluation. Demo image from [22].
2.2 The residual block used in the diffusion models.
2.3 The overall network design of the diffusion network. RB: residual block, Attn: efficient self-attention block. The residual block is the basic unit for image downsampling and upsampling, with skip connections introduced. See text for more description. The efficient self-attention block is applied at the deepest level and improves feature extraction via a QK product with linear complexity. We refer the reader to [48] for more details about efficient attention.
2.4 SegDiff [1]: a diffusion-based segmentation method. Two separate encoders (G & F) are employed for denoising the noisy segmentation mask and extracting features from the image. The encodings are integrated by direct sum in the latent space.
2.5 Qualitative results on ImageNet 512 × 512 for thin masks from RePaint [35].
2.6 The U-Net architecture: a powerful deep learning framework for semantic segmentation tasks in medical imaging. It comprises a U-shaped network structure with both contracting and expanding paths. The contracting path consists of convolutional and pooling layers to capture context, while the expanding path uses transposed convolutions for precise localization. Skip connections facilitate the integration of multi-scale features, enabling different sizes of receptive fields.
2.7 nnU-Net's interdependence among the hyperparameter setting, data pre-/post-processing, and model selection.
3.1 The gradual denoising procedure of the Joint Diffusion Model (JDM). A paired sample of a medical image and segmentation mask is produced with a single reverse run.
5.1 Randomly generated samples of nuclei and polyp images and their associated segmentation masks. All images and masks are resized to 256 for display.
5.2 Randomly generated samples and the closest training data in terms of mutual information of the segmentation mask. From left to right: synthetic image, associated synthetic segmentation mask, closest real image, closest real segmentation mask.
5.3 Synthetic medical images guided by the training mask using the joint diffusion model. The significant difference in contrast to the real image demonstrates the sample diversity and model distribution.
5.4 mDice vs. the number of synthetic data. The dashed line denotes the result of training on real data. From left to right and top to bottom: Kvasir, Clinic-DB, Colon-DB, CVC-300, ETIS.
5.5 mDice vs. the number of synthetic data on MoNuSeg. The dashed line denotes the result of training on real data.
5.6 Qualitative results. The red box highlights the improvement of the segmentation model pretrained on synthetic data. The results are taken from the best case on ETIS [49] and MoNuSeg [31].
A.1 Variational bound vs. diffusion steps. Image from [41].
B.1 Kvasir generated samples.
B.2 GlaS generated samples.
B.3 MoNuSeg generated samples.
B.4 MoNuSeg generated samples.
B.5 MoNuSeg generated samples.
B.6 Kvasir mask-conditioned generated samples. From top to bottom: real image, real segmentation mask, four generated samples.
B.7 MoNuSeg mask-conditioned generated samples. From top to bottom: real image, real segmentation mask, four generated samples.

List of Tables
4.1 The hyperparameter setting of the JDM. (*) The number of residual blocks is set to 2 and 3 for Kvasir and MoNuSeg, respectively.
5.1 Results on validation sets of polyp segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.2 Results on generalization sets of polyp segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference compared to the result on real data.
5.3 Results on MoNuSeg for nuclei segmentation. (# Synthetic - %) represents the number of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.4 Results on the validation set with pretraining on different numbers of synthetic data. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.5 Results on the generalization set with pretraining on different numbers of synthetic data. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.
5.6 Results on MoNuSeg [31]. (# Syn. Pretrain) represents the number of synthetic data used in pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

1 Introduction
Medical image segmentation is a discipline that partitions a medical image into multiple non-overlapping segments, each of which represents a specific anatomical structure or tissue type (See Fig. 1.1). This process plays an important role in modern healthcare and is embedded into many clinical decision workflows, such as diagnosis, treatment, image-guided surgery, etc. Automated segmentation aims to provide a segmentation solution for medical images by designing computer or machine learning algorithms¹ without significant manual effort. Compared with manual segmentation, the automated counterpart has the advantage of prediction speed, which may alleviate the heavy burden on medical experts and improve diagnostic efficiency.
With the availability of large amounts of data, advanced network architectures, the increasing computing power of graphics processing units, and many other important factors, deep learning [17] has become the de facto choice of machine learning methods in the last decade. In deep learning, neural networks are designed with multiple layers of interconnected nodes, known as artificial neurons or units. Each layer processes and transforms the input data and passes it on to the next layer, gradually extracting higher-level features and representations. The weights of these interconnected nodes, usually referred to as parameters, are updated through stochastic gradient descent. It has been demonstrated that neural networks such as convolutional neural networks² (CNNs) can model complex representations for image classification [20], object detection [6], semantic segmentation [7], etc.
¹In this thesis, we refer to machine learning algorithms as mathematical models or computer procedures that learn the pattern of the input data and make predictions without explicit programming [36].
²In this work, a CNN refers to a type of deep learning neural network designed for visual data analysis, usually utilizing convolutional layers to extract features and pooling layers to downsample the data.
Figure 1.1: Medical Image Segmentation with Swin UNETR [19] on BRATS2021 [2]. The segmentation network produces segmentation masks for the Enhancing Tumor, Whole Tumor and Tumor Core. From left to right: T1, T1c, T2 and FLAIR MRI axial slice. Image from [19].
As for automated medical segmentation, most developed methods rely on supervised deep learning, in which both the data point and the label are accessible during model training. Empirically, deep learning based approaches heavily demand a huge amount of heterogeneous data to obtain generalizable results. However, in contrast to natural images, sizable medical imaging datasets are more difficult to collect, owing to the specialized medical equipment, the required devotion of medical experts, strict regulation of patient data privacy, and many other reasons preventing active data sharing across multiple institutes.
In this thesis, we consider approaching this issue from the perspective of data augmentation using generative modeling. Data augmentation is a common tactic to improve the generalization ability of a discriminative model. In practice, it can be accomplished by simple transformations (e.g. rotations or random flips) or by synthetic images from a generative model. While simple transformations can be effective in most scenarios, they are limited in generating diverse and realistic samples because they do not model the data distribution. On the other hand, based on the observations from the training set, a generative model can be constructed to directly approximate either the underlying data distribution p_data or the data generation process, from which novel data points can be produced. Two well-known representatives are likelihood-based approaches [16, 59, 13, 39] and generative adversarial networks (GANs) [18]. GANs support efficient sampling and high sample quality with carefully designed architectures, but training GANs can be unstable and sometimes results in training failure. This can largely be attributed to the imbalance between the generator and the discriminator. For example, if the discriminator is much better than the generator, the generator might not make progress in fooling the discriminator. Besides, mode coverage issues and the demand for a multitude of data are also major drawbacks of GANs, preventing them from broader application. As suggested by [51], GANs' effectiveness as a source of medical imaging data was found to not always be reliable, even if the generated images are nearly indistinguishable from real data. Models that have a tractable likelihood, such as autoregressive models [16, 59] or flow-based models [13, 39], give competitive likelihoods, potentially leading to great diversity. However, these likelihood-based approaches have limitations stemming from certain assumptions about the model distribution. For instance, autoregressive models assume a factorization into multiple conditional distributions, and flow-based models interpret the generation process as an invertible transformation from a base distribution to the data density. Other cases, such as Variational Autoencoders (VAEs) [28], which only optimize a bound on the likelihood, assume the data can be projected to a tractable distribution in the latent space, and the model is trained for projection and recovery.
These imposed restrictions on the model distribution can be unnatural and limit the capability of neural networks, which might partially explain why these methods have not established the same sample quality as GANs.
Recently, a newly emerged class of generative models called diffusion models (DMs) or score-based generative models [52, 22, 54, 55] has attracted much interest in image synthesis and inverse problem solving. The images produced by the DMs achieve great sample quality and diversity on several benchmarks, such as CIFAR-10 [29] and Celeb-A [33]. Owing to their substantial generation capability, the DMs may potentially serve as an efficacious tool for medical images, a domain severely suffering from data sparsity problems. They may also promote active data sharing across institutes and improve model performance on downstream tasks. Furthermore, since the DMs also support exact likelihood computation, they may help to enhance explainability in the context of clinical practice.
In this thesis, we study the most prevailing instance of the DMs, the denoising diffusion probabilistic model (DDPM) [22], as an out-of-the-box tool for 2D biomedical segmentation. Applying DDPM to medical images, one can generate novel data samples that might fill in the gaps of the training data manifold.

1.1 Main Contribution
The main contribution of this thesis is to give a thorough review of DMs for practitioners to apply this technique to medical problems. The technical novelty is summarized as follows:
1. We propose a new training scheme of joint modeling for the DMs with paired data, as opposed to conditional modeling. We prove that joint modeling implicitly supports conditional sampling and is orthogonal to various inverse problem solvers.
2. We show that applying the Joint Diffusion Model (JDM) to 2D biomedical segmentation is capable of generating an unlimited number of meaningful paired samples. The generated samples deviate significantly from the training data.
3. Empirical results on supervised learning and transfer learning suggest that the generated samples are useful for improving the segmentation model even if the training set is severely sparse.

1.2 Outline
In Chapter 2, we review the background of this thesis, encompassing the DMs and the medical segmentation network. In Chapter 3, we elaborate on how to construct a joint diffusion probabilistic model to generate paired data. This chapter also covers conditional sampling through joint modeling. In Chapter 4, we present the experimental setups of this work, including the working environment, hyperparameter setting, performance metrics, and compared baselines. In Chapter 5, we visually inspect the generated samples and investigate the effect of data augmentation using synthetic data. This encompasses evaluating the segmentation model trained purely on synthetic data, and pretrained on synthetic data and then finetuned on the real data.

2 Background

2.1 Diffusion Models
Diffusion Models (DMs) are newly emerged deep generative models [22, 52, 54, 55] which produce samples by reversing a data corruption process via iterative refinement. DMs have seen wide application in image [54, 55, 22, 45, 41, 14], video [32, 8], music [37], and molecular synthesis [65], and in various image-to-image translation tasks such as super-resolution [46], inpainting [55, 35, 10], etc.
In the forward diffusion process, noise is gradually injected into the data, transforming the complex data distribution into a tractable prior distribution. Once an adequate amount of noise has been added, the perturbed data can be equivalently considered a sample from a prior distribution (e.g. a Gaussian distribution) that is easy to sample from. For reverse sampling, one can train a deep neural network¹ [17] that progressively eliminates noise from the Gaussian sample to produce data. Since the DMs can be further generalized to be continuous [55], there are extensive design options available for the DMs. This work primarily concentrates on the DDPM [22], a particular discretization of the DMs, building upon the advancements introduced in recent work on improved DDPM [41]. The objective is to enhance the mode coverage and stationary training of the DDPM framework. The following subsections introduce the DMs in detail. The derivations and proofs of this section are presented in a manner similar to that in [22].
¹In this work, we consider a deep neural network as a type of computer model inspired by the human brain that learns to recognize patterns in data by building multiple layers of interconnected artificial neurons, allowing it to understand complex information and make accurate predictions.

2.1.1 Generative Modeling and Diffusion Process
Generative modeling is a machine learning problem with many applications and is usually accomplished by density estimation, a statistical technique used to estimate the underlying probability distribution of a set of data points. It enables us to gain insights into the data's underlying structure, identify patterns, and make informed decisions in various fields, such as machine learning, signal processing, and data analysis. However, it is commonly assumed that the data is governed by a complex data distribution, which might be difficult to approximate directly. To circumvent this, one can leverage a tractable prior p(z) (e.g. a Gaussian distribution) and move samples from the prior distribution to the target data density. In this sense, the data density is approximated by a posterior distribution given the prior: p_data(x) ≈ p_model(x)² = p(x|z). Here, we denote q(z|x) and p(x|z) as the forward and backward mappings. Fortunately, the backward mapping can be modeled using a deep neural network by minimizing the Kullback–Leibler (KL) divergence between the posteriors of the forward and backward mappings, D_KL(q(x|z) ∥ p(x|z)). This has led to a series of works posing different assumptions on the tractable prior and the model distribution. For example, VAEs [28] assume the tractable prior to be a low-dimensional Gaussian, and the forward and backward mappings are achieved by an autoencoder. In this context, DMs can be viewed as an instance where the prior is a standard Gaussian distribution sharing the same dimension as the data distribution and the forward mapping (the diffusion process) is a gradual noise injection process (See Fig. 2.1).
Specifically, the DDPM consists of a chain of latent variables with the Markov property and can be decomposed into the product of a series of conditional distributions of the same dimension (See Fig. 2.1). The forward joint distribution or forward mapping can be described as follows:

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}).    (2.1)

where $\{x_t\}_{t=1...T}$ are the intermediate latent variables along the forward diffusion process.
²In this work, we refer to the model distribution p_model(x) as the approximation to the data density p_data(x).
Figure 2.1: The directed graphical model of DDPM. The latent variables along the sampling path share the same dimension and are hypothesized to be Markovian. Noise is gradually added to the data in the forward diffusion process, while the reverse process recovers the lost semantic information and pushes one step closer to the clean data with each network evaluation. Demo image from [22].

The latent variables x_t depend only on x_{t-1} and are perturbed by a pre-defined noise injection process:

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z, \quad z \sim \mathcal{N}(0, I).    (2.2)
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I).    (2.3)

where $\{\beta_t\}_{t=1...T}$ is the variance schedule controlling the noise level.
Theoretically, $\{\beta_t\}_{t=1...T}$ can be designed in many ways, tailored to the speed of forward data corruption. In this work, we only examine the linear schedule proposed in [22], in accordance with most existing work, where $\beta_t = \frac{T-t}{T-1}\beta_1 + \frac{t-1}{T-1}\beta_T$ with T set to 1000, so that $\beta_t$ is linearly interpolated between $\beta_1$ and $\beta_T$. It should be noted that the mean of the forward diffusion q(x_t | x_{t-1}) is scaled by $\sqrt{1-\beta_t}$ to bound the variance, so that x_T approximates a standard Gaussian as T goes to infinity. From Eq. (2.2), one can show that any noisy intermediate variable can be obtained via direct sampling (See Section A.1):

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, z, \quad z \sim \mathcal{N}(0, I).    (2.4)
q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I).    (2.5)

where $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$.
Again, as shown by Eq. (2.5), as $\bar{\alpha}_t$ approaches 0 with large T, the Gaussian perturbation kernel progressively transforms the data into a standard Gaussian distribution denoted by x_T. Interestingly, the posterior of the forward diffusion q(x_{t-1} | x_t, x_0) is tractable under this circumstance and can be computed using the Bayes rule (See Section A.2):

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I).
\tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t.
\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t.    (2.6)

where $\tilde{\mu}_t, \tilde{\beta}_t$ are the posterior mean and variance of the forward diffusion process.
From Eq. (2.6), it is noted that the posterior of the forward diffusion q(x_{t-1} | x_t, x_0) shares the same analytical form of a Gaussian distribution. Therefore, it is natural to assume the intermediate variables of the reverse process to also be Gaussian, with a tiny amount of noise removed at each timestep. Practically, the reverse process is approximated by a deep neural network $p_\theta(x_{t-1} \mid x_t)$:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).    (2.7)

where $\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)$ denote the mean and variance of the denoised distribution of the reverse process, and $\theta$ represents the model parameters.
Heuristically, a neural network is trained to denoise images blurred with Gaussian noise at multiple noise levels, and thus the network captures both coarse and fine-grained image features. The reverse mean $\mu_\theta(x_t, t)$ is parameterized as the model output, while the reverse variance $\Sigma_\theta(x_t, t)$ can either be learned or approximated by the posterior variance of the forward process $\tilde{\beta}_t$ or the noise schedule $\beta_t$. In the same spirit as the forward diffusion, the joint distribution of the reverse process can also be depicted as a series of transitions starting from $x_T \sim \mathcal{N}(x_T; 0, I)$:

p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).    (2.8)
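To make Eqs. (2.2)-(2.5) concrete, the following is a minimal PyTorch sketch (not the thesis code) of the linear schedule and the one-step sampling of x_t from x_0 in Eq. (2.4); the endpoint values β₁ = 10⁻⁴ and β_T = 0.02 follow [22], and the tensor shapes are illustrative.

import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)       # linear schedule from beta_1 to beta_T [22]
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    # Sample x_t ~ q(x_t | x_0) in a single step via Eq. (2.4).
    z = torch.randn_like(x0)                         # z ~ N(0, I)
    ab = alpha_bar[t].view(-1, 1, 1, 1)              # broadcast over (B, C, H, W)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z
    return xt, z                                     # z is kept as the regression target

# Usage: corrupt a batch of images to random timesteps.
x0 = torch.randn(4, 3, 64, 64)                       # stand-in for training images
t = torch.randint(0, T, (4,))
xt, noise = q_sample(x0, t)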
2.1.2 Training Objectives
DDPM is trained to optimize a variational bound on the negative log-likelihood of the data distribution (See Section A.3):

\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\left[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] = \mathbb{E}_q\left[ -\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} \right] := L_{vlb}.    (2.9)

The upper bound $L_{vlb}$ in Eq. (2.9) can be further reduced to tractable KL divergences [30] (See Section A.4 for the complete derivation):

\mathbb{E}_q\left[ D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t > 1} D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right].    (2.10)

where $D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T))$ is denoted as $L_T$, $D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))$ as $L_{t-1}$, and $-\log p_\theta(x_0 \mid x_1)$ as $L_0$.
$L_T$ is free from the model parameters and can be neglected, as the distribution q(x_T | x_0) converges to the Gaussian p(x_T) when T is large enough. Optimizing the term $L_{t-1}$ pulls the denoised distribution toward the ideal posterior of the forward diffusion. In this sense, $L_0$ can also be seen as a special case of $L_{t-1}$, as shown in Eq. (2.11), where q(x_0) is known and does not depend on the model.

D_{KL}(q(x_0) \,\|\, p_\theta(x_0 \mid x_1)) = \int q(x_0) \log \frac{q(x_0)}{p_\theta(x_0 \mid x_1)} \, dx_0.    (2.11)

2.1.3 Simplified Training Objective
This section mainly discusses the reparameterization of the reverse diffusion. Note that Eq. (2.4) provides an efficient way of directly sampling noisy data at arbitrary steps. Consequently, training can be performed by uniformly sampling t from {1, ..., T} and corrupting the data to x_t to compute $L_{vlb}$. To model the denoised distribution $p_\theta(x_{t-1} \mid x_t)$, there are many ways to parameterize the model output. The most apparent choice is to estimate $\mu_\theta(x_t, t)$ directly, or one can predict the noise ϵ and leverage the reparameterization to obtain x_{t-1}. In [22], Ho et al. compare ϵ-prediction and µ-prediction and argue that both achieve relatively equal sample quality, while predicting the noise ϵ allows for a simplified objective and resembles denoising score matching [54], bridging a theoretical link to the score-based models [54, 55]. Additionally, one can also predict the clean image x_0, but this was found to lead to degraded sample quality in [22]. Therefore, to simplify the overall design choice, we follow [22, 41] and train a noise estimation network $\epsilon_\theta(x_t, t)$. Nonetheless, recent work [3] proposes to train a dual-output network to predict both ϵ and x_0, and their linear combination improves the generation quality, especially when a limited number of sampling steps is considered.
Below, we show how the training objective term $L_{t-1}$ in Eq. (2.10) boils down to the final simplified objective. We first derive the objective for the mean of the reverse process, supposing a fixed variance schedule, and then deal with the learned variance in Section 2.1.4. Note that the gradients are detached from each other when both the mean and variance of the reverse process are estimated by the network. Based on the result of the KL divergence between two Gaussians, $L_{t-1}$ can be written as:

L_{t-1} = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \|^2 \right] + C.    (2.12)

where C is a constant with respect to θ, $\sigma_t^2$ is the variance of the reverse process, $\tilde{\mu}_t$ is the posterior mean of the forward process, and $\mu_\theta$ is the mean of the reverse process.
From Eq. (2.12), we can again see that the most obvious option is to predict the mean of the reverse process $\mu_\theta$. Yet, the parameterization $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t(x_0, \epsilon) - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$ from Eq. (2.2) and Eq. (2.6) can be leveraged to simplify further:

L_{t-1} - C = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon \right) - \mu_\theta(x_t, t) \right\|^2 \right].    (2.13)
Eq. (2.13) reveals that $\tilde{\mu}_t$ is accessible given x_t. With x_t set to be the network input, a similar reparameterization of the reverse mean can be considered:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right).    (2.14)

where the network output $\epsilon_\theta(x_t, t)$ is the approximator of the sampled noise. Then we plug Eq. (2.14) into Eq. (2.13) and arrive at the true variational bound:

L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right].    (2.15)

Practically, Ho et al. [22] find that simply dropping the coefficient in Eq. (2.15) is beneficial to the sample quality and stabilizes training when predicting the noise term $\epsilon_\theta(x_t, t)$, while predicting $\mu_\theta(x_t, t)$ only works when training with the true variational bound (2.15). The simplified objective $L_{simple}$ is as follows:

L_{simple} := \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right].    (2.16)

The simplified objective in Eq. (2.16) can be perceived as a reweighted variational bound that re-emphasizes the training of different noise levels. With small t being down-weighted, the denoising quality at intermediate and large t can be improved significantly. Besides, as small t mainly concentrates on recovering imperceptible details, the overall sample quality is not severely deteriorated while the training period is shortened. Nevertheless, as the beginning steps (small t) also contribute massively to the true variational bound, the log-likelihood can be negatively impacted, potentially causing mode coverage issues [41] (See Section A.5).

2.1.4 Learned Variance
The reverse variance is crucial for density estimation to achieve a competitive log-likelihood, since both the reverse mean and variance determine the denoised distribution $p_\theta(x_{t-1} \mid x_t)$. It is generally believed that the likelihood performance relates to how well the model captures the modes of the data distribution [43]. In other words, sample diversity is greatly tied to the performance of the likelihood. Ho et al. [22] adopt a fixed variance schedule, where the same variance used to inject the noise in the forward diffusion serves as the reverse variance. As mentioned earlier, bad mode coverage might be caused by the fixed variance and the simplified objective, and thereby we may improve the log-likelihood by learning $\Sigma_\theta(x_t, t)$. Empirically, the possible range of $\Sigma_\theta(x_t, t)$ is very small, since only a tiny amount of noise is removed at each iteration, and thus predicting $\Sigma_\theta(x_t, t)$ directly with the network might be suboptimal even in the log domain. To solve this, in [41], a linear interpolation is done between $\beta_t$ and $\tilde{\beta}_t$, with the network only deciding the weighting term v. The computation is performed in the log domain so that there is no constraint on the network output for v:

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1 - v) \log \tilde{\beta}_t).    (2.17)

Here we kindly remind the reader that $\beta_t$ is the linear variance schedule (See Section 2.1.1) and $\tilde{\beta}_t$ is the variance of the posterior of the forward mapping in Eq. (2.6). Then we can define a new training objective combining $L_{simple}$ and $L_{vlb}$. The former only updates the reverse mean, while the latter contributes to the true variational bound for likelihood improvement:

L_{hybrid} = L_{simple} + \lambda L_{vlb}.    (2.18)

where λ weights the multiple losses. Following [41], λ is set to 0.001 and the gradient of $L_{vlb}$ is detached from the output of $\mu_\theta(x_t, t)$ to prevent $L_{vlb}$ from overwhelming $L_{simple}$; thus, sample diversity is traded off with sample quality.
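As a companion to Eqs. (2.16)-(2.18), below is a minimal sketch of one training step under L_simple. Here `model` stands for any ϵ-prediction network of the kind described in Section 2.1.6, and `alpha_bar` is the cumulative schedule sketched after Eq. (2.8); both names are ours. The hybrid objective of Eq. (2.18) would add λ·L_vlb on top, with the gradient to the mean detached as in [41].

import torch
import torch.nn.functional as F

def simple_loss_step(model, x0, alpha_bar, optimizer):
    # One optimization step of L_simple (Eq. 2.16).
    alpha_bar = alpha_bar.to(x0.device)
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],), device=x0.device)  # uniform t
    z = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z     # forward corruption, Eq. (2.4)
    loss = F.mse_loss(model(xt, t), z)              # ||eps - eps_theta||^2, coefficient dropped
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()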
2.1.5 Reverse Sampling
After training a noise estimation network, samples can be generated by iterative denoising through $p_\theta(x_{t-1} \mid x_t)$, from T to 0, where $x_T \sim \mathcal{N}(x_T; 0, I)$. Specifically, the sampling formula via the parameterization can be written as follows:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right) + \sigma_t z.    (2.19)

where $\sigma_t$ corresponds to the standard deviation of the reverse variance $\Sigma_\theta(x_t, t)$, and z is the sampled noise.

2.1.6 Diffusion Network
The primary obstacle in the diffusion model lies in developing a robust parameter-sharing network capable of effectively denoising images with varying levels of noise. One commonly employed network in the diffusion model is the U-Net [44] architecture, originally designed for image segmentation, owing to the similarity of the image-understanding tasks and the identical spatial resolution of the network's input and output. It has been emphasized that without sophisticated architectures (e.g. U-Net), the performance of the diffusion model may deteriorate considerably [41, 55]. Considering the importance of the diffusion network, in this section we review the Res-UNet architecture proposed in [41]. Since the U-Net [44] is the most commonly used framework for medical image segmentation, we refer the reader to Section 2.4.1 for the U-Net [44].
Res-UNet modifies the U-Net architecture by adding a skip connection to each convolutional block (See Fig. 2.2a), which is commonly helpful for feature aggregation and gradient flow. First, the fusion of feature maps at multiple layers enhances the learned representation and mitigates gradient issues. Second, skip connections also allow multiple paths for gradient propagation, which might accelerate convergence and enhance training stability.

(a) A residual block; the block input x is fused with the block output F(x) via direct sum. Image from [20]. (b) The residual block with time encoding t (skip connection omitted).
Figure 2.2: The residual block used in the diffusion models.

The exact residual block design follows PixelCNN++ [47], consisting of Group Normalization [63], a 3 × 3 convolutional block, and the GeLU activation function [21]. The overall network is visible in Fig. 2.3, taking both x_t and t as input. Each timestep is projected into a unique sinusoidal encoding, passed as Fourier features for aggregation:

\sin(t) := [\sin(2w_1 t), \sin(2w_2 t), \dots, \sin(2w_d t), \cos(2w_1 t), \cos(2w_2 t), \dots, \cos(2w_d t)].    (2.20)

where $w_1, w_2, \dots, w_d$ are pre-defined frequencies free from the parameters, and d is the dimension of the projected vector. Fig. 2.2b shows that the sinusoidal time encoding t goes through a linear time module to adjust the channels for the feature maps of varying depths. Eventually, the time encoding is fused with the feature maps between the first and second conv blocks of each residual block via direct sum.

Figure 2.3: The overall network design of the diffusion network. RB: residual block, Attn: efficient self-attention block. The residual block is the basic unit for image downsampling and upsampling, with skip connections introduced. See text for more description. The efficient self-attention block is applied at the deepest level and improves feature extraction via a QK product with linear complexity. We refer the reader to [48] for more details about efficient attention.
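A minimal sketch of the sinusoidal time encoding of Eq. (2.20). The geometric spacing of the frequencies follows the common Transformer-style convention and is an assumption on our part, since the text only states that the frequencies are pre-defined.

import math
import torch

def timestep_embedding(t, dim, max_period=10000.0):
    # Map integer timesteps t of shape (B,) to (B, dim) Fourier features:
    # [sin(w_1 t), ..., sin(w_d t), cos(w_1 t), ..., cos(w_d t)] as in Eq. (2.20).
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]      # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]), dim=128)
# A per-depth linear layer then adjusts the channels before the direct sum.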
2.2 Posterior Sampling using Diffusion Models
As the impressiveness of DDPM for image generation has been revealed, a natural next step is to investigate its performance for controllable generation. In this section, we briefly review two lines of work on posterior sampling p(y | x) using DMs.

2.2.1 Conditional Sampling
Using the Bayes rule, one can easily show that the posterior is proportional to the product of the prior and the forward model:

p(y_{t-1} \mid y_t, x) \propto p(y_{t-1} \mid y_t)\, p(x \mid y_t).    (2.21)

In the context of the DMs, as in Eq. (2.21), in order to sample from the posterior p(y_{t-1} | y_t, x), conditional sampling concentrates on leveraging a pre-trained diffusion model to capture the generative prior p(y_{t-1} | y_t) and incorporates the conditional information during sampling. The key insight is to "hijack" the intermediate latent variables y_t or y_{t-1} and enforce consistency with the guidance signal x. In practice, the conditional information is only integrated during sampling through the forward model p(x | y_t), a reverse mapping contrary to our target. If the forward model is linear (e.g. x = Ay), it is not hard to estimate without training. This has led to a series of works leveraging pre-trained diffusion models for solving linear inverse problems in an unsupervised manner [9, 55, 56]. However, if the forward model is nonlinear, a certain approximation is inevitable. For example, Dhariwal and Nichol [14] train a separate classifier to perform class-conditional sampling, referred to as classifier guidance, in which the gradient of the forward model is incorporated for consistency. The result balances sample diversity and fidelity.

2.2.2 Conditional Model
Another line of work focuses on training a conditional diffusion model (CDM) once paired data has been collected. In this scenario, the conditional distribution p(y|x) needs to be modeled given a dataset of paired data $D = \{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of data points. Specifically, the target distribution p(y|x) is a one-to-many mapping: multiple target data y can align with the same source information x. To achieve this, the denoising network takes both the noisy input y_t and the conditional input x to predict the sampled noise. This leaves much room for the design of the conditioning strategy to build an effective CDM. Existing work [46] for super-resolution employs simple concatenation along the channel dimension between the noisy input and the low-resolution image (net(cat[y_t, x], t)), obtaining great results on the recovery of fine-grained details. Since the CDM is based upon a more general relation between the condition and the target, it is much easier to apply directly to the case of multimodal generation. For instance, the latent diffusion model [45] can be applied to class-, segmentation-map- and text-guided synthesis via a separate encoder. In this case, a conditional encoder is leveraged to encode the features of x, and the representation is then aggregated with the denoising network [1] (See Fig. 2.4).

2.3 Diffusion Model for Zero-shot Data Editing
A remarkable property of DMs is that zero-shot inverse problem solving can be attained without problem-specific training. As the forward diffusion process is non-parametric, a rough initialization can be refined by adding noise and applying the diffusion denoiser. For instance, for image colorization [55, 11, 9], the grayscale image can be replicated three times along the channel dimension and then injected with noise up to a certain intermediate timestep before being passed to the denoiser.
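The colorization example can be written in a few lines. A minimal sketch as the text describes it, assuming a helper `sample_from(x_t, t0)` that runs the reverse chain of Section 2.1.5 from timestep t0 down to 0; both helper names are hypothetical.

import torch

def zero_shot_colorize(gray, t0, alpha_bar, sample_from):
    # Replicate the grayscale image across channels, diffuse it to an
    # intermediate timestep t0, then let the pretrained denoiser finish.
    x0 = gray.repeat(1, 3, 1, 1)                    # (B, 1, H, W) -> (B, 3, H, W)
    ab = alpha_bar[t0]
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * torch.randn_like(x0)  # forward corruption
    return sample_from(xt, t0)                      # reverse diffusion from t0 to 0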
In this section, we review how a pretrained diffusion model performs zero-shot image inpainting; we will see later how the same logic generalizes from the pixel level to the image level, in the sense that training a joint diffusion model with paired data resembles conditional sampling (Section 3.1). The initial idea of inpainting was proposed by Song et al. [54] to illustrate the representation learned by the DMs, and was followed by a series of works [55, 9, 35, 27] exploring robust sampling algorithms that give the most visually harmonious examples and the highest quantitative sample quality.

Figure 2.4: SegDiff [1]: a diffusion-based segmentation method. Two separate encoders (G & F) are employed for denoising the noisy segmentation mask and extracting features from the image. The encodings are integrated by direct sum in the latent space.

As for the vanilla sampling method, the image is subdivided into known regions for conditioning and unknown regions for generation. The intermediate noisy image comes from two directions: the forward diffusion of the known regions, and the reverse diffusion of the unknown regions:

x_{t-1}^{known} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I).
x_{t-1}^{unknown} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).
x_{t-1} = m \odot x_{t-1}^{known} + (1 - m) \odot x_{t-1}^{unknown}.    (2.22)

where m is the inpainting mask, $x_{t-1}^{known}$ samples from the forward process, and $x_{t-1}^{unknown}$ samples from the reverse process.
The combined output via masking enforces harmony between the known and unknown regions for the next step's evaluation. The pseudocode³ and examples can be seen in Alg. 1 and Fig. 2.5. We refer the reader to RePaint [35] for more illustrations and examples.
³In RePaint [35], an additional resampling strategy is introduced to enhance the quality of the output. This strategy brings about certain differences in the algorithm in contrast to Alg. 1, while the core ideas behind the algorithms remain unchanged.

Algorithm 1 Diffusion Model for Image Inpainting
1: x_T ∼ N(0, I)
2: for t = T, ..., 1 do
3:   ϵ ∼ N(0, I) if t > 1 else ϵ = 0
4:   x_{t-1}^{known} = √ᾱ_t x_0 + √(1 − ᾱ_t) ϵ
5:   z ∼ N(0, I) if t > 1 else z = 0
6:   x_{t-1}^{unknown} = (1/√α_t) (x_t − (β_t/√(1 − ᾱ_t)) ϵ_θ(x_t, t)) + σ_t z
7:   x_{t-1} = m ⊙ x_{t-1}^{known} + (1 − m) ⊙ x_{t-1}^{unknown}
8: end for
9: Output: x_0

Figure 2.5: Qualitative results on ImageNet 512 × 512 for thin masks from RePaint [35].
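A minimal PyTorch transcription of Algorithm 1 (without RePaint's resampling strategy), assuming a fixed reverse variance σ_t² = β_t for brevity; `eps_model` is the trained noise estimator and the schedule tensors are those sketched in Section 2.1.1.

import torch

@torch.no_grad()
def inpaint(eps_model, x0, m, beta, alpha, alpha_bar):
    # Zero-shot inpainting (Alg. 1): m == 1 marks the known pixels of x0.
    x = torch.randn_like(x0)                        # x_T ~ N(0, I)
    for t in reversed(range(beta.numel())):
        eps = torch.randn_like(x0) if t > 0 else torch.zeros_like(x0)
        known = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps  # forward-diffused knowns
        z = torch.randn_like(x0) if t > 0 else torch.zeros_like(x0)
        tb = torch.full((x0.shape[0],), t, device=x0.device)
        mean = (x - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_model(x, tb)) / alpha[t].sqrt()
        unknown = mean + beta[t].sqrt() * z         # reverse step, Eq. (2.19)
        x = m * known + (1 - m) * unknown           # Eq. (2.22)
    return x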
2.4 Medical Segmentation Network
The medical segmentation network is a special type of neural network $f_\theta(x)$ for partitioning the image pixels into desired categories. The network projects the image into a probability map over the targeted classes, and the class with the maximum probability is then taken as the final output:

y = \arg\max(f_\theta(X), \text{dim} = 1).    (2.23)

where X ∈ (H, W, 3), $f_\theta(X)$ ∈ (C, H, W), y ∈ (H, W), and C is the number of classes.
Eq. (2.23) can be applied to both 2D and 3D biomedical segmentation tasks; the key distinction lies in the additional spatial dimension present in 3D data. It is also apparent that directly training a 3D network is more computationally expensive than training a 2D one. As the neural network is key to the medical segmentation performance, in this section we review the biomedical segmentation network architecture. It is also noted that, coincidentally, the same network is utilized to train the DDPM, owing to task similarity.

Figure 2.6: The U-Net architecture: a powerful deep learning framework for semantic segmentation tasks in medical imaging. It comprises a U-shaped network structure with both contracting and expanding paths. The contracting path consists of convolutional and pooling layers to capture context, while the expanding path uses transposed convolutions for precise localization. Skip connections facilitate the integration of multi-scale features, enabling different sizes of receptive fields.

2.4.1 U-Net
The U-Net (See Fig. 2.6), proposed by Ronneberger et al. [44], is a CNN architecture originally designed for biomedical segmentation, and it was soon extended to many applications where the model's input and output share the same spatial resolution. The name "U-Net" originates from the overall U-shaped design of the architecture, which resembles an encoder-decoder framework. Today, the encoder and decoder are no longer limited to CNN structures but can be general feature extractors and interpreters. The encoder-decoder structure is highly effective for tasks where pixel-wise understanding is required for the output, such as semantic segmentation, denoising, etc. Overall, the entire architecture consists of two main parts: the contracting path for downsampling and the expanding path for reconstruction.
The contracting path resembles a typical CNN encoder and is responsible for capturing context and extracting features from the input image. It is composed of a series of convolutional and pooling layers, which progressively reduce the spatial dimensions of the input while increasing the number of feature channels. The pooling operations are used for downsampling, contracting the image into a lower-resolution feature map. These operations, including max pooling, average pooling or strided convolutional kernels, capture the context and extract higher-level features.
The expanding path serves as the decoder and reconstructs the segmented image from the lower-resolution feature map. It consists of a series of up-convolutional layers, also known as transposed convolutions or deconvolutions [42]. The up-convolutional layers perform upsampling, implemented by either parameter-free interpolation or a convolutional kernel, increasing the spatial dimensions of the feature map while reducing the number of feature channels. To compensate for the information loss during downsampling and to fuse the multi-scale feature maps at different spatial levels, skip connections are adopted to aggregate the upsampled features with the corresponding feature maps from the contracting path. The skip connection is critical for recognizing objects of various sizes, as the fused multi-scale features have different receptive fields, and it also supports active gradient flow with more paths for backpropagation. In the original design [44], the skip connection is implemented by simple concatenation. The concatenated features are then further processed through convolutional layers to refine the segmentation. The contracting and expanding paths are symmetric, with skip connections between them.
Overall, the U-Net architecture has proven to be very successful for various image segmentation tasks, particularly in the medical domain. Its ability to learn from limited training data and produce accurate pixel-level predictions has made it a popular choice in the field of image analysis.
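To make the contracting/expanding description concrete, here is a deliberately small U-Net sketch with two resolution levels and concatenation skips. It illustrates the structure of [44] but is not the thesis's segmentation network; all sizes are illustrative.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Two 3x3 convolutions with ReLU, as in the original U-Net design.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)                        # contracting path
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)  # expanding path
        self.dec1 = conv_block(base * 2, base)                     # base*2 after the skip concat
        self.head = nn.Conv2d(base, n_classes, 1)                  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))        # skip connection
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 3, 256, 256))
mask = logits.argmax(dim=1)     # per-pixel class map, i.e. Eq. (2.23)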
2.4.2 nnU-Net
nnU-Net [25] (See Fig. 2.7) is a deep learning framework specifically designed for volumetric medical segmentation tasks. It is purely based on the U-Net [44] architecture, but heavily engineered with a multitude of training recipes built upon the data format, the exact memory constraints, etc.

Figure 2.7: nnU-Net's interdependence among the hyperparameter setting, data pre-/post-processing, and model selection.

This results in a highly generalized and self-configuring framework that makes practical training decisions via heuristic rules in terms of hyperparameters and data processing. nnU-Net achieves state-of-the-art results on a majority of tasks without detailed network architecture design, instead systematizing the complex process of manual method configuration, which was previously addressed either by cumbersome manual tuning or by purely empirical approaches with practical limitations [25]. nnU-Net is a comprehensive toolbox with high flexibility and robustness, making it a valuable tool and baseline for medical imaging researchers and practitioners. Because of this, it is also an appropriate training instance for assessing the effectiveness of new plug-and-play methods (e.g. data augmentation) for volumetric segmentation. The whole design of distilling the domain knowledge encompasses three groups of parameters: fixed, rule-based and empirical parameters. In the rest of this section, we briefly review the core design. The full details are available in [25].
The fixed parameters, involving the architecture, optimizer, learning rate, etc. (See Fig. 2.7), are optimized for joint configuration across multiple datasets. This part is determined manually by the designer and stays fixed regardless of the dataset being applied to.
The rule-based parameters manage the data pre-processing and the setting of some of the hyperparameters. This decision is a set of heuristic if-else rules inspired by the experience of the designer. For instance, the distribution of spacing decides the image resampling strategy (See Fig. 2.7): if the collected data is anisotropic⁴, cubic spline interpolation is employed within the image plane and nearest-neighbor interpolation along the out-of-plane axis, while cubic spline interpolation is applied to all planes otherwise. A minimal sketch of this rule follows at the end of this section.
The empirical parameters mainly decide the data post-processing and the model inference. For example, the best-configured model among the 2D, 3D, and 3D cascade models is chosen for inference according to their validation performance. Besides, nnU-Net utilizes an ensembling approach during the inference phase to further boost performance. It combines the predictions from multiple models trained on different training sets or with different settings. This ensemble of models helps to capture diverse information and improve the segmentation results.
⁴According to [5], anisotropic refers to a situation where the pixel or voxel dimensions are not equal in all directions. In the context of medical imaging, it means that the spacing between pixels or voxels differs along different axes.
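The spacing rule above might look like the following sketch. The anisotropy threshold and the use of scipy.ndimage.zoom are illustrative assumptions on our part, not nnU-Net's actual implementation.

import numpy as np
from scipy.ndimage import zoom

def resample(volume, spacing, target_spacing, aniso_ratio=3.0):
    # Heuristic resampling in the spirit of nnU-Net's rule-based parameters:
    # anisotropic data -> cubic spline in-plane, nearest neighbor out-of-plane.
    spacing = np.asarray(spacing, dtype=float)
    factors = spacing / np.asarray(target_spacing, dtype=float)
    if spacing.max() / spacing.min() > aniso_ratio:     # anisotropic case
        axis = int(spacing.argmax())                    # low-resolution, out-of-plane axis
        out = volume
        for ax in range(3):                             # resample one axis at a time
            f = [1.0, 1.0, 1.0]
            f[ax] = factors[ax]
            out = zoom(out, f, order=0 if ax == axis else 3)
        return out
    return zoom(volume, factors, order=3)               # isotropic: cubic spline everywhere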
3 Joint Diffusion Model for Paired Data Synthesis
In this chapter, we illustrate the Joint Diffusion Model (JDM), a generative framework for paired data synthesis in parallel with the classical DMs and the CDM. We show how a joint model is trained and sampled to produce data, and how to leverage a joint model for conditional sampling, inspired by zero-shot data editing using pretrained DMs (See Section 2.3). Generally speaking, the JDM still follows the training paradigm of the DMs; however, instead of injecting noise and predicting the sampled noise for a single-modality input, the same operation is performed on multiple modalities simultaneously. To that end, joint modeling incorporates the marginal distributions of single-modality data as well as conditional sampling, resembling image inpainting using pretrained DMs [35]. Section 3.1 discusses the framework of the JDM in detail. Section 3.2 further illustrates the theory of conditional sampling.

3.1 Joint Diffusion Model
Suppose paired data is collected as $\{(x_{i1}, x_{i2}, \dots, x_{im})\}_{i=1}^{N}$, where N is the number of data points and m is the number of modalities (e.g. a text-image pair corresponds to m = 2). Rather than modeling the conditional distribution with parts of the data adopted for conditioning, a JDM can be trained to model the joint distribution. Under this circumstance, extending from the classical DMs, the target data distribution switches from the unimodal x to the joint variable $(x_1, x_2, \dots, x_m)$:

p(x) \Rightarrow p(x_1, x_2, \dots, x_m).    (3.1)

In spite of the huge shift from unimodal to multimodal data, the mathematical framework remains the same as the DDPM [22], because we can simply define a joint variable, which results in the same mathematical derivation as the DMs:

x := (x_1, x_2, \dots, x_m).    (3.2)
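Concretely, for m = 2 (medical image and segmentation mask), joint training only changes what the diffusion operates on: the two modalities are concatenated along the channel dimension and diffused together, exactly as in the unimodal DDPM. A minimal sketch under that assumption (the channel layout is illustrative):

import torch
import torch.nn.functional as F

def jdm_loss(eps_model, img, mask, alpha_bar):
    # One JDM training loss: diffuse the joint variable x = (image, mask)
    # and regress the joint noise, i.e. L_simple applied to Eq. (3.2).
    x0 = torch.cat([img, mask], dim=1)              # joint variable, e.g. (B, 3+1, H, W)
    alpha_bar = alpha_bar.to(x0.device)
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],), device=x0.device)
    z = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * z       # both modalities corrupted at once
    return F.mse_loss(eps_model(xt, t), z)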
3.2 Conditional Sampling

There are several advantages to treating the paired data entirely as the target in the context of DMs. First, joint modeling incorporates the marginal distributions: if the joint distribution p(x_1, x_2, \ldots, x_m) is correctly modeled, we can sample any uni-modal data x_i by simply dropping the remaining modalities \{x_\tau\}_{\tau \neq i}. Second, it also permits conditional sampling given an arbitrary number of modalities. For instance, if a joint distribution is modeled for CT and MRI images of the same patient, one can use either the CT or the MRI to induce the generation of the other.

Theoretically, assuming the number of available modalities is k, we reorder the multimodal data as (x_1, \ldots, x_{m-k}) and (x_{m-k+1}, \ldots, x_m), with the former for sampling and the latter for conditioning. The conditional distribution can be decomposed by the multiplication rule:

    p(x_1, \ldots, x_{m-k} \mid x_{m-k+1}, \ldots, x_m) = \frac{p(x_1, \ldots, x_m)}{p(x_{m-k+1}, \ldots, x_m)}    (3.3)

where m is the number of modalities. In the diffusion setting, the distribution is characterized as the progressively denoised distribution. Therefore, the practical distribution for sampling is

    \frac{p(x_1, \ldots, x_m)}{p(x_{m-k+1}, \ldots, x_m)} \Rightarrow \frac{p(x_{t-1,1}, \ldots, x_{t-1,m} \mid x_{t,1}, \ldots, x_{t,m})}{p(x_{t-1,m-k+1}, \ldots, x_{t-1,m})}    (3.4)

where t is the timestep of the gradual denoising process. Note that this results in two easy-to-sample distributions in practice. For the denominator in Eq. (3.4), we can use the forward diffusion to corrupt the data, since (x_{m-k+1}, \ldots, x_m) is known. For the numerator in Eq. (3.4), the denoised joint distribution can be sampled in reverse with a trained joint model at hand:

    p(x_{t-1,1}, \ldots, x_{t-1,m} \mid x_{t,1}, \ldots, x_{t,m}) \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
    p(x_{t-1,m-k+1}, \ldots, x_{t-1,m}) \sim \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1 - \bar{\alpha}_{t-1}) I)    (3.5)

where \theta denotes the model parameters and \bar{\alpha}_{t-1} comes from the pre-defined noise schedule (see Section 2.1.1). Notably, this resembles zero-shot image inpainting with a pretrained diffusion model (Section 2.3), in which the image is subdivided into known and unknown regions. In effect, training a JDM and performing conditional sampling lifts the target from the pixel level to the image level. Hence, the JDM is also orthogonal to various zero-shot inverse problem solvers using the DMs [55, 10, 9]. Empirically, the sampling can be achieved by replacing the generated intermediate noisy data of the known modality with the known data corrupted by the forward process.

However, extending from uni-modal data to higher-dimensional paired data inevitably poses a challenge to the neural network. It may lead to failures in mode capturing, unstable training, etc., as the network shoulders the extra burden of learning the mutual dependence among the multimodal data rather than the unilateral dependence in the conditional model. Hence, in this thesis, we study the simple case where m = 2 and train the JDM on medical images and segmentation masks, a task that usually suffers from data sparsity. A trained JDM can ideally produce an unlimited amount of data. Given the great sample quality and diversity, we expect the JDM to serve as a data augmentation tool that improves the performance of medical image segmentation. As for mask-conditioned sampling, we can add an extra step of substituting the generated mask with the true mask injected with noise, which amounts to only a few lines of additional code (see Alg. 2). Theoretically, a trained model could also be used for medical segmentation itself, but this is beyond the scope of this work.

Figure 3.1: The gradual denoising procedure of the Joint Diffusion Model (JDM). A paired sample of medical image and segmentation mask is produced with a single reverse run.

Algorithm 2 Conditional Sampling
 1: Initialization:
 2: x_T ∼ N(0, I), m_0 := mask_true
 3: for t = T, ..., 1 do
 4:     z ∼ N(0, I) if t > 1 else z = 0
 5:     x_{t-1} = μ_θ(x_t, t) + Σ_θ(x_t, t)^{1/2} z
 6:     img_{t-1}, mask_{t-1} = chunk(x_{t-1}, dim = 1)
 7:     mask_{t-1} ∼ N(√(ᾱ_{t-1}) m_0, (1 − ᾱ_{t-1}) I)
 8:     x_{t-1} = cat([img_{t-1}, mask_{t-1}], dim = 1)
 9: end for
10: Output: x_0
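As a complement to Alg. 2, the following is a minimal PyTorch sketch of mask-conditioned sampling. Here `posterior` is assumed to wrap the trained noise-prediction network and return the reverse-step mean and variance (the learned-variance parameterization of Section 2.1.4); the channel sizes follow our m = 2 setting with 3 image channels and 1 mask channel.

    import torch

    @torch.no_grad()
    def conditional_sample(posterior, mask_true, alphas_cumprod, shape):
        """Mask-conditioned reverse sampling (sketch of Alg. 2).

        mask_true: (B, 1, H, W) clean conditioning mask m_0 in [-1, 1];
        shape: (B, 4, H, W), i.e. 3 image channels plus 1 mask channel.
        """
        T = len(alphas_cumprod)
        x = torch.randn(shape)                          # x_T ~ N(0, I)
        for t in range(T, 0, -1):
            mean, var = posterior(x, t - 1)             # 0-based schedule index
            z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
            x = mean + var.sqrt() * z                   # reverse step to x_{t-1}
            img, mask = x.split([3, 1], dim=1)
            if t > 1:
                # Replace the generated mask with the forward-corrupted true mask.
                a_bar = alphas_cumprod[t - 2]           # \bar{alpha}_{t-1}
                mask = a_bar.sqrt() * mask_true \
                       + (1 - a_bar).sqrt() * torch.randn_like(mask_true)
            else:
                mask = mask_true                        # \bar{alpha}_0 = 1: clean mask
            x = torch.cat([img, mask], dim=1)
        return x                                        # x_0 = (image, mask) pair

Dropping the mask-replacement branch recovers unconditional paired sampling, and replacing the image channels instead would condition on the image, illustrating the symmetry of the joint model.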
4 Experiments

In this chapter, we discuss the experimental setup of this work. This involves the dataset descriptions, the experimental environment, the training and evaluation setups, and the performance metrics.

4.1 Experimental Setups

Software Setup: We use PyTorch [40], an open-source deep learning framework that provides a flexible and efficient platform for building and training various types of deep learning models. The experiments on the JDM are based on the code¹ from [64]. For biomedical segmentation, we use the training instance nnU-Net². The code and scripts to reproduce this work are accessible at https://github.com/CaviarLover/diffAug.

Hardware Setup: All experiments are executed on two NVIDIA GeForce RTX 3090 GPUs and an AMD Ryzen 9 3950X 16-core processor.

¹ https://github.com/JuliaWolleb/Diffusion-based-Segmentation
² https://github.com/MIC-DKFZ/nnUNet

Transfer Learning on Synthetic Data: We adopt a two-step process to perform supervised transfer learning on synthetic data, sketched in code after the list:

1. Pretraining with Synthetic Data: We begin by pretraining the segmentation model using supervised learning on the synthetic data samples. This initial training step allows the model to gain insights from the generated data and learn relevant features.

2. Fine-tuning with Real Data: After the pretraining phase, we proceed to fine-tune the segmentation model on real data. By fine-tuning on real data, the model can adapt and refine its learned features to better align with the characteristics of real-world data.
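The following sketch outlines this two-step scheme. The optimizer, learning rates, and loaders are illustrative placeholders rather than the exact nnU-Net training pipeline.

    import torch

    def pretrain_then_finetune(model, synthetic_loader, real_loader, loss_fn,
                               pretrain_epochs=400, finetune_epochs=150):
        """Two-step transfer learning (sketch): pretrain on synthetic
        (image, mask) pairs, then fine-tune the same weights on real pairs."""
        # Step 1: supervised pretraining on synthetic pairs.
        opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.99)
        for _ in range(pretrain_epochs):
            for image, mask in synthetic_loader:
                opt.zero_grad()
                loss_fn(model(image), mask).backward()
                opt.step()
        # Step 2: fine-tuning on real data, reusing the pretrained weights.
        opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
        for _ in range(finetune_epochs):
            for image, mask in real_loader:
                opt.zero_grad()
                loss_fn(model(image), mask).backward()
                opt.step()
        return model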
It is worth noting that an alternative strategy could involve mixing real and synthetic data during training. However, our preliminary experiments show that the success of this scheme depends heavily on the mixing proportion and varies significantly across datasets. Due to the complexity involved and the potential difficulty of finding the optimal mixing ratio, we do not consider this strategy in this work. By adopting the two-step approach of pretraining with synthetic data and then fine-tuning on real data, we aim to improve the segmentation performance and leverage the benefits of synthetic data in a more controlled and effective manner.

4.2 Datasets

Kvasir is a polyp segmentation dataset composed of consecutive video frames of endoscopy. The train-test split is adopted from [15], where 900 randomly selected images from Kvasir and 450 images from Clinic-DB form the training set.

GlaS comprises microscopic images of slides stained with Hematoxylin and Eosin (H&E). These images, with corresponding ground truth annotations by expert pathologists, are used for training and testing. The dataset contains 85 images for training and 80 images for testing.

MoNuSeg is created from images of H&E-stained tissue captured at 40x magnification, covering multiple organs and patients. It encompasses 30 training images with approximately 22,000 nuclear boundary annotations and 14 testing images with over 7,000 nuclear boundary annotations.

Testsets: Kvasir [26], Clinic-DB [4], CVC-ColonDB [57], CVC-300 [60] and ETIS [49] are five polyp segmentation datasets commonly used to demonstrate the effectiveness of polyp segmentation models. Since the generative model is trained on the mixed dataset of Kvasir and Clinic-DB as mentioned earlier, we designate Kvasir and Clinic-DB as the validation set, and CVC-ColonDB, CVC-300, and ETIS as the generalization set. The test set of MoNuSeg with 14 images is used to evaluate nuclei segmentation. The segmentation model is trained on all images resized to 256 (polyps) or 512 (nuclei); inference is performed at the same resolution, and the predictions are upsampled to the original resolution via nearest-neighbor upsampling.
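To make the resizing convention concrete, the following is a small sketch of the inference wrapper implied by the description above. The nearest-neighbor upsampling of the prediction follows the text; the bilinear downsampling of the input and all names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def segment_full_resolution(model, image, train_size=256):
        """Run inference at the training resolution and restore the original size.

        image: (B, C, H, W). The predicted mask is upsampled back to (H, W)
        with nearest-neighbor interpolation, as done for the test sets.
        """
        orig_h, orig_w = image.shape[-2:]
        x = F.interpolate(image, size=(train_size, train_size),
                          mode="bilinear", align_corners=False)
        logits = model(x)
        mask = logits.argmax(dim=1, keepdim=True).float()
        return F.interpolate(mask, size=(orig_h, orig_w), mode="nearest")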
4.3 Hyperparameters

4.3.1 Joint Diffusion Model (JDM)

Key hyperparameters of the JDM are summarized in Table 4.1. Aligned with [41], we train the generative model with the number of residual blocks at each level set to 2. The number of base channels is 128, and attention blocks with 4 heads are applied at the U-Net levels with downsampling factors of 8 and 16. We do not use dropout during training. For more detailed information, we recommend the publication [41] and the GitHub repository at https://github.com/openai/improved-diffusion, which provide additional details and resources related to this setting.

Train Period. Since there is no suggested number of training steps for a diffusion model, we train the model for 60k steps on Kvasir, 40k on MoNuSeg, and 40k on GlaS. After training, the checkpoints from the second half of training are loaded to visualize the generated samples. We pinpoint the number of steps that gives the most harmonious samples, i.e. consistency between the generated segmentation mask and the medical image; the early checkpoints lack fine-grained image features and are less harmonious. We do not quantitatively check a sample-quality metric such as the Fréchet Inception Distance (FID) [23] for picking the checkpoint.

Table 4.1: The hyperparameter setting of the JDM. (*) The number of residual blocks is set to 2 and 3 for Kvasir and MoNuSeg, respectively.

setting                                      value
number of diffusion steps                    1000
learned variance                             True
dropout                                      0.0
batch size                                   10
learning rate                                1e-4
number of base channels                      128
channel multiplier                           1, 1, 2, 2, 4, 4
number of residual blocks                    2 (3)*
number of input channels                     4
number of heads per attention block          4
depth levels applied with attention block    16, 8

4.4 Metrics

To evaluate the segmentation results, we adopt the two most commonly used metrics: the mean Dice score (mDice) and the mean Intersection over Union (mIoU). mDice, also known as the Dice coefficient or Dice similarity coefficient, measures the similarity between the predicted segmentation and the ground truth: it quantifies the overlap between the two by computing the ratio of twice the intersection of the predicted and ground truth regions to the sum of their areas. mIoU is another widely used metric that evaluates the accuracy of segmentation algorithms: it calculates the ratio of the intersection area to the union area between the predicted and ground truth regions, averaged over all classes in the dataset. The two metrics are computed as follows:

    \mathrm{mDice} = \frac{2 \sum_{i=1}^{c} TP_i}{\sum_{i=1}^{c} (TP_i + FP_i) + \sum_{i=1}^{c} (TP_i + FN_i)}    (4.1)

    \mathrm{mIoU} = \frac{1}{c} \sum_{i=1}^{c} \frac{TP_i}{TP_i + FP_i + FN_i}    (4.2)

where TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives for class i, and c is the total number of classes.
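For concreteness, a NumPy sketch of the two metrics computed from integer label maps is given below; the aggregation follows Eqs. (4.1) and (4.2), while the epsilon guard against empty classes is an illustrative addition.

    import numpy as np

    def mdice_miou(pred, target, num_classes, eps=1e-8):
        """Compute mDice (Eq. 4.1) and mIoU (Eq. 4.2) from label maps."""
        tp = np.zeros(num_classes)
        fp = np.zeros(num_classes)
        fn = np.zeros(num_classes)
        for i in range(num_classes):
            tp[i] = np.sum((pred == i) & (target == i))
            fp[i] = np.sum((pred == i) & (target != i))
            fn[i] = np.sum((pred != i) & (target == i))
        mdice = 2 * tp.sum() / (np.sum(tp + fp) + np.sum(tp + fn) + eps)
        miou = np.mean(tp / (tp + fp + fn + eps))
        return mdice, miou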
4.5 Baselines

nnU-Net. The training of nnU-Net [25] strictly adheres to the guidelines presented in the official implementation. For detailed instructions, please refer to the nnU-Net repository on GitHub: https://github.com/MIC-DKFZ/nnUNet.

Polyp Segmentation. Besides nnU-Net [25], we include other methods for comparison. PraNet [15] and CaraNet [24] are two methods designed specifically for polyp segmentation. PraNet set the best performance back in 2020, and CaraNet is known for its ability to effectively detect small medical objects by combining various modules to achieve this capability. The results can be found on the leaderboard of paperswithcode.com³, which showcases the latest state of the art on computer vision benchmarks.

Nuclei Segmentation. We incorporate several state-of-the-art methods for comparison on the MoNuSeg [31] dataset. Each of these methods introduces unique advancements to the U-Net [44] architecture to improve its performance in medical image segmentation. The results are extracted from the leaderboard of paperswithcode.com⁴.

1. U-Net++ [66]: enhances the standard U-Net by deeply supervising the intermediate feature maps. This strategy compels the low-level features to become more discriminative, leading to improved segmentation results.

2. Attention U-Net [38]: incorporates an attention gate mechanism on the skip connections to selectively highlight the salient features of the encoder. This pruning process helps the model focus on relevant information.

3. MedT [58]: combines transformer and CNN structures, creating two separate branches to process medical images. The transformer branch captures long-range dependencies while the CNN branch handles local features, improving the model's ability to capture both global and local context.

4. UCTransNet [61]: builds upon the U-Net architecture with a global feature aggregation module that effectively fuses representations from different levels of the encoder and decoder, leading to a more comprehensive understanding of the input.

5. SMESwin Unet [62]: leverages a superpixel segmentation branch to identify and exclude irrelevant features from the original image, so that the Swin Transformer [34] based U-Net focuses solely on important regions.

³ https://paperswithcode.com/sota/medical-image-segmentation-on-kvasir-seg
⁴ https://paperswithcode.com/sota/medical-image-segmentation-on-monuseg

5 Results and Discussion

In this chapter, we present the main results of this thesis. We explore the capability of the Joint Diffusion Model (JDM) to produce paired medical image segmentation data, and we train the segmentation model based upon nnU-Net [25] to investigate the usefulness of the synthetic data. In Section 5.1, we visualize the generated samples on several datasets, including paired data synthesis and mask-conditioned synthesis. In Section 5.2, we train the segmentation model in a supervised manner using purely synthetic data for polyp and nuclei segmentation. In Section 5.3, we further leverage the pretrained checkpoints and fine-tune the segmentation model on the real data. In both cases, we compare different amounts of synthetic data.

5.1 Visual Comparison

In this section, we train the JDM on three medical image segmentation datasets: Kvasir [26], MoNuSeg [31] and GlaS [50]. The generative model is trained from scratch. Full descriptions of the datasets can be found in [26], [50], and [31].

5.1.1 Pair Data Synthesis

We highlight some of our generated samples and their associated segmentation masks (see Fig. 5.1). We further qualitatively investigate the similarity between each synthetic sample and the closest sample in the training set in terms of mutual information.

As shown in Fig. 5.1, the synthetic samples demonstrate the effectiveness of the JDM in generating high-quality images with intricate details. They showcase the model's capability to capture the nuances of light contrast in endoscopy and to accurately render the small text overlays on patient images, particularly for the polyp images (see the last two rows of Fig. 5.1). Furthermore, the generated segmentation masks, together with their corresponding images, exhibit remarkable visual consistency and are appropriately positioned. Notably, the results display significant diversity in the generated samples, despite being trained on a limited number of large-resolution images (512 × 512 for MoNuSeg with only 30 images). The first two rows of Fig. 5.1 for MoNuSeg are particularly distinct from each other in various aspects, including the background, color composition, size of the nuclei, and compactness of the nuclear structures. This demonstrates the model's ability to produce diverse and varied outputs even when trained on a relatively small set of high-resolution images.

To assess the true effectiveness of the model distribution, it is necessary to verify that the model does not simply memorize the training set, especially for datasets with a limited number of samples. This verification can be accomplished by first computing a distance between each generated sample and every image used for training. The real image with the smallest distance is then visually compared with the synthetic one: the validity of the model distribution is supported if clear visual differences are observed. Following [12], a similar work on retina image and vessel network synthesis, we adopt the mutual information measure to quantify the distance between the training and synthetic segmentation masks. We denote a generated segmentation mask as s̃ and a true mask as s. The metric is computed as follows:

    \mathrm{MI}(s, \tilde{s}) = \sum_{s_i \in s} \sum_{\tilde{s}_i \in \tilde{s}} p(s_i, \tilde{s}_i) \log \frac{p(s_i, \tilde{s}_i)}{p(s_i) \, p(\tilde{s}_i)}    (5.1)

where s_i and s̃_i are the pixel values of s and s̃, p(s_i, s̃_i) is the joint histogram of the true and synthetic segmentation masks, and p(s_i), p(s̃_i) are the marginals. For each synthetic mask, the closest sample is extracted from the training set for visual comparison. Randomly selected examples are shown in Fig. 5.2.
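A small NumPy sketch of Eq. (5.1) and of the closest-sample retrieval follows; the 256-bin joint histogram is an illustrative choice.

    import numpy as np

    def mutual_information(s, s_tilde, bins=256):
        """Mutual information between two masks via their joint histogram (Eq. 5.1)."""
        joint, _, _ = np.histogram2d(s.ravel(), s_tilde.ravel(), bins=bins)
        p_joint = joint / joint.sum()
        p_s = p_joint.sum(axis=1, keepdims=True)    # marginal of the true mask
        p_t = p_joint.sum(axis=0, keepdims=True)    # marginal of the synthetic mask
        nz = p_joint > 0
        return np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_s @ p_t)[nz]))

    def closest_training_mask(synthetic_mask, training_masks):
        """Index of the training mask most similar to the synthetic one."""
        scores = [mutual_information(m, synthetic_mask) for m in training_masks]
        return int(np.argmax(scores))

Note that a higher mutual information indicates a more similar mask, so the closest sample is the one maximizing the score.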
The findings indicate noticeable distinctions between the synthetic images and the real images, in spite of the high structural similarity between the segmentation masks. This is particularly evident for the polyp images and the histopathological images from the MoNuSeg dataset (e.g. nuclei of different sizes, background color, etc.). We will see later that mask-conditioned synthesis (Section 5.1.2) yields a great diversity of images, which further reinforces the observation that synthetic samples differ from real ones. In the case of the GlaS dataset, the synthetic images primarily differ from the real images in style and fine-grained details, while the main content of the images remains relatively consistent. This overall similarity might be partly attributed to the size of the training set and the low resolution of the training images: the limited training set may obscure the overall interdependence among different regions of the images, and the low resolution together with the small dataset size might miss some intricate details.

5.1.2 Mask Conditioned Synthesis

We investigate the outcomes of mask-conditioned synthesis using the JDM (see Section 3.2) for polyp and nuclei images. To demonstrate that our method goes beyond simple memorization of the training set, we apply the training masks for conditional sampling. The qualitative results can be found in Fig. 5.3. Figs. 5.3a and 5.3b provide visual evidence that the generated images induced by the training masks exhibit significant differences compared to the original training images; the main distinctions lie in context, texture, and contrast. The substantial deviation between the generated and training images again indicates the validity of the model distribution, the image representation, and the mutual consistency.

5.1.3 Discussion

In this section, we visually evaluated the generated samples on both the polyp and nuclei segmentation datasets. The analysis reveals that the synthetic paired data generated by the JDM exhibits visual harmony and a high level of consistency between the generated images and their associated segmentation masks. The model is capable of producing paired samples that possess both quality and diversity.

Figure 5.1: Randomly generated samples of nuclei and polyp images and their associated segmentation masks. All images and masks are resized to 256 for display.

Figure 5.2: Randomly generated samples and the closest training data in terms of mutual information between segmentation masks. From left to right: synthetic image, associated synthetic segmentation mask, closest real image, closest real segmentation mask.

Figure 5.3: Synthetic medical images guided by the training masks of (a) MoNuSeg [31] and (b) Kvasir [26] using the joint diffusion model. The significant difference in contrast to the real images demonstrates the sample diversity and the validity of the model distribution.

For polyp segmentation, the generated samples appropriately position the polyps within the images; however, the boundaries of the polyps may not be entirely smooth. The model successfully avoids simple memorization of the training images, as evidenced by comparing the closest training samples with the synthetic ones. As for the GlaS dataset, the results exhibit a high degree of structural similarity to the real images. This outcome might be partly attributed to the size of the training set and the relatively low resolution of the images used, which might lower the learning difficulty for the generative model and thus prevent it from capturing the interdependence among different image regions. Hence, we suggest training the DDPM at a large resolution under data sparsity. For future work, we plan to quantitatively explore how the training time impacts the sample quality and diversity. In the next section, we investigate the generated images for training the segmentation model.

5.2 Medical Segmentation on Synthetic Data

In this section, we investigate the segmentation performance when training on purely synthetic data. Given that an unlimited amount of data can be produced, we compare different numbers of synthetic images used for training. The results also indicate how many images should be used for transfer learning in Section 5.3. We start with the same number of synthetic data points as the original dataset and then gradually increase the data percentage; the amount of synthetic data is quantified as the proportion relative to the training set.

5.2.1 Medical Segmentation on Polyps

We set the number of training epochs to 400, as the validation curve saturates after roughly 300 epochs. The test data is split into two subgroups (validation set and generalization set) for Results 1 & 2.

Result 1: We compare different numbers of synthetic samples for training the segmentation model. The results are presented in Table 5.1 and the first row of Fig. 5.4, in contrast to training on real data.
Table 5.1: Results on the validation sets of polyp segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model, relative to the size of the real training set. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Training images      Kvasir                        Clinic-DB
                     mDice(%)       mIoU(%)        mDice(%)        mIoU(%)
Real                 90.81          85.16          88.54           83.91
Synthetic - 100%     85.87          78.71          77.40           70.66
Synthetic - 150%     86.50          79.76          77.46           70.28
Synthetic - 200%     86.36          79.60          77.69           71.06
Synthetic - 250%     87.08 (-3.73)  80.41 (-4.75)  78.01           71.49
Synthetic - 300%     86.00          78.96          78.21 (-10.33)  71.70 (-12.21)
Synthetic - 400%     84.52          77.28          77.19           70.39
Synthetic - 500%     83.84          76.69          77.93           71.35

Note that the results are quite close to those of training on real data, achieving competitive performance on both datasets. Both performance metrics vary with the number of synthetic images: the two metrics peak when 250% or 300% of images are leveraged for training and decline afterwards (see the first row of Fig. 5.4). The performance is more sensitive to the amount of synthetic data on Kvasir than on Clinic-DB, where larger variations are witnessed. Besides, the performance gap between synthetic and real data is much smaller for Kvasir than for Clinic-DB, with the best cases (250% and 300%, respectively) showing differences of 3.73% and 10.33% in mDice. This suggests that the model distribution approximates the distribution of Kvasir more closely than that of Clinic-DB.

Result 2: The results on the generalization sets can be found in Table 5.2 and the last two rows of Fig. 5.4. Unexpectedly, the best outcome (Synthetic - 100%) is achieved with the same number of synthetic images as real ones. Generally, Fig. 5.4 indicates a declining trend as more synthetic data is used to train the model, although all three test sets experience a small bump when the amount of data is 250%. Similar to the validation sets, both mDice and mIoU decrease as the amount of data exceeds 250%. Additionally, the performance drop is large for ETIS at 8.2%, as opposed to 4.4% for Colon-DB and 3.2% for CVC-300.

Figure 5.4: mDice vs. the number of synthetic data. The dashed line denotes the result of training on real data. From left to right and top to bottom: Kvasir, Clinic-DB, Colon-DB, CVC-300, ETIS.

Table 5.2: Results on the generalization sets of polyp segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference compared to the result on real data.

Training images      Colon-DB                      CVC-300                       ETIS
                     mDice(%)       mIoU(%)        mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
Real                 76.62          69.39          85.35          77.84          69.21          62.14
Synthetic - 100%     72.26 (-4.36)  65.53 (-3.86)  82.11 (-3.24)  74.56 (-3.28)  60.96 (-8.25)  53.28 (-8.86)
Synthetic - 150%     71.50          64.77          81.93          74.24          58.19          51.22
Synthetic - 200%     70.77          64.10          81.38          73.63          57.66          50.60
Synthetic - 250%     72.07          65.44          81.94          74.28          60.40          53.18
Synthetic - 300%     71.15          64.25          81.56          73.67          59.05          51.75
Synthetic - 400%     69.94          63.15          81.47          73.16          58.08          50.43
Synthetic - 500%     68.50          62.14          81.27          73.41          56.93          49.14

5.2.2 Medical Segmentation on Nuclei

We set the number of epochs to the default of 1000, as the computational cost is bearable and the general trend of the validation curve only stabilizes between 900 and 1000 training epochs. The other hyperparameter settings are determined by nnU-Net [25] without modification.

Result: We perform an ablation study on the amount of synthetic data, as in Section 5.2.1.
The results of these experiments are presented in Table 5.3 and visualized in Fig. 5.5. Table 5.3 demonstrates that the segmentation performance is only marginally lower than the real-data counterpart (with a difference of 2.44% in mDice and 3.32% in mIoU) when training with 150% of synthetic data. This finding suggests that the JDM can effectively model complex data distributions even when the available data is limited. Fig. 5.5 further illustrates the segmentation model's robustness, showing a slight tendency of performance drop as the amount of synthetic data increases, while overall remaining resilient. Notably, the mDice when using 100% synthetic data is significantly lower than the result obtained on the real data, indicating potential challenges in modeling the small dataset.

Table 5.3: Results on MoNuSeg for nuclei segmentation. (Synthetic - %) represents the amount of synthetic data used in training the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Training images      MoNuSeg
                     mDice(%)       mIoU(%)
Real                 81.68          69.10
Synthetic - 100%     75.47          61.15
Synthetic - 150%     79.24 (-2.44)  65.78 (-3.32)
Synthetic - 200%     77.71          63.91
Synthetic - 250%     78.26          64.57
Synthetic - 300%     78.30          64.74
Synthetic - 400%     78.32          64.91
Synthetic - 500%     78.24          64.68
Synthetic - 1000%    77.56          63.77

5.2.3 Discussion

In this section, we evaluated the segmentation model's performance when trained exclusively on various quantities of synthetic images, for two tasks: polyp segmentation and nuclei segmentation.

When a model is trained on synthetic data, its overall performance is encouraging, surpassing that of a randomly initialized model and trailing the real-data results by only a single-digit percentage. This again demonstrates the effectiveness of the model distribution in approximating the data distribution. However, as the quantity of synthetic data increases, the model's performance begins to decline, albeit at varying rates depending on the dataset. This decline can be attributed to the growing deviation between the imperfect model distribution and the true data distribution caused by the increased use of synthetic data. In practice, the optimal "sweet spot" for the amount of synthetic data lies within the range of 100% to 300% of the original dataset, and its exact value should be determined empirically.

Figure 5.5: mDice vs. the number of synthetic data on MoNuSeg. The dashed line denotes the result of training on real data.

Regarding the segmentation gap observed between real and synthetic data for polyp segmentation, it is noteworthy that the performance drop is more pronounced on the validation set, especially on the Clinic-DB dataset, than on the generalization set, even though the generative model is partially constructed using Clinic-DB data. We explain this phenomenon by the fact that the model distribution is actually a mixture of the data distributions of Kvasir and Clinic-DB, but one that leans much closer to the distribution of Kvasir. Despite incorporating Clinic-DB data in the generative model, the synthetic data still exhibits a more significant performance drop on Clinic-DB due to the dominant influence of the Kvasir data distribution in the mixture model.
Furthermore, this mixture model distribution acts as a mitigating factor to some extent, reducing the gap between the validation set and the generalization set and leading to better performance on the generalization set than on Clinic-DB. This finding sheds light on the intricacies of data distributions and their impact on the performance of generative models. For future work, we propose investigating the potential benefits of training a new generative model based on a different mixture of training data. By leveraging diverse datasets and possibly different combinations, such a generative model could offer valuable insights into improving the usability of synthetic data to generalize better across various datasets. In the next section, we focus on evaluating the effectiveness of synthetic data through the lens of transfer learning.

5.3 Transfer Learning on Synthetic Data

In this section, we delve deeper into finding an effective strategy to enhance segmentation performance using synthetic data. The approach involves first pretraining on the synthetic data and then fine-tuning on the real data (see Section 4.1).

5.3.1 Transfer Learning on Polyps

We compare different amounts of pretraining data, using 100%, 250%, 500%, and 1000% of synthetic data. The pretrained checkpoints utilized in these experiments are derived from the segmentation models in Section 5.2.1. The fine-tuning process involves training the model on real data for 150 epochs. To determine the optimal learning rate for fine-tuning, we perform a grid search and find that a learning rate of 1e-2 yields the best results. We carry out the training and fine-tuning for all five folds, employing the same pretraining checkpoint; each experiment is conducted five times and the average is reported. The evaluation is presented in Table 5.4 for the validation set and in Table 5.5 for the generalization set, and the results are again discussed in Results 1 and 2, respectively.

Result 1: The noticeable improvements in model performance are evident from Table 5.4, showcasing the benefits of pretraining with synthetic data. In general, the segmentation performance surpasses that of PraNet and narrows the performance gap with CaraNet, highlighting the potential of properly utilized synthetic data to bridge the differences with task-specific networks. While the enhancement on Kvasir [26] is relatively marginal, there is a substantial increase in performance on Clinic-DB [4], with a remarkable improvement of +3.31% mDice and +3.27% mIoU when leveraging 500% of synthetic data. As anticipated, models pretrained with larger amounts of synthetic data generally tend to perform better. It is worth noting, however, that models pretrained with 250% of synthetic data perform worse than those pretrained with only 100%. This suggests that there may be an optimal amount of synthetic data for pretraining, and that exceeding that threshold can have an unexpected impact on model performance. Hence, careful consideration of the quantity of synthetic data used for pretraining is essential for achieving optimal results.
Table 5.4: Results on the validation set after pretraining on different amounts of synthetic data. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods          # Syn. Pretrain   Kvasir                        Clinic-DB
                                   mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
U-Net [44]       -                 81.80          74.60          82.30          75.50
ResUNet++        -                 81.30          79.30          79.60          79.60
PraNet [15]      -                 89.80          84.00          89.90          84.90
CaraNet [24]     -                 91.80          86.50          93.60          88.70
nnU-Net          -                 90.81          85.16          88.54          83.91
nnU-Net          100%              91.02          85.60          90.21          85.61
nnU-Net          250%              90.81          85.51          89.31          84.87
nnU-Net          500%              91.22 (+0.41)  85.89 (+0.73)  91.85 (+3.31)  87.18 (+3.27)
nnU-Net          1000%             91.09          85.77          91.55          86.96

Result 2: Table 5.5 presents the results on the generalization set, demonstrating that pretraining enhances the segmentation model's ability to generalize. The pretrained nnU-Net outperforms PraNet on all three datasets and performs comparably to CaraNet on the challenging ETIS [49] dataset. Specifically, on Colon-DB [4] the pretrained nnU-Net outperforms the task-specific network, and a notable increase of +5.4% is observed on ETIS when utilizing 500% of pretraining data. The improvement is particularly significant for ETIS, which contains numerous images with small medical objects, showing how pretraining with diverse synthetic data enhances the recognition of such small objects. Figure 5.6 visually demonstrates a more accurate segmentation of small polyps after pretraining. Notably, the most substantial increase for CVC-300 [60] is achieved when using 250% of pretraining data. These findings further disclose that the improvement achieved through pretraining is influenced by the specific dataset and the quantity of synthetic data employed.

Table 5.5: Results on the generalization set after pretraining on different amounts of synthetic data. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods          # Syn. Pretrain   Colon-DB                      CVC-300                       ETIS
                                   mDice(%)       mIoU(%)        mDice(%)       mIoU(%)        mDice(%)       mIoU(%)
U-Net [44]       -                 51.20          44.40          71.00          62.70          39.80          33.50
PraNet [15]      -                 70.90          64.00          87.10          79.71          62.80          56.70
CaraNet [24]     -                 77.30          68.90          90.30          83.80          74.70          67.20
nnU-Net          -                 76.62          69.39          85.45          77.84          69.21          62.14
nnU-Net          100%              77.58          70.80          87.49          80.81          71.71          65.06
nnU-Net          250%              77.90          71.16          87.61 (+2.16)  81.09 (+3.25)  72.10          65.21
nnU-Net          500%              78.19 (+1.57)  70.89 (+1.50)  86.73          79.93          74.61 (+5.40)  67.67 (+5.53)
nnU-Net          1000%             77.65          70.84          87.05          80.46          73.63          66.68

Figure 5.6: Qualitative results. The red boxes highlight the improvement of the segmentation model pretrained on synthetic data. The results are taken from the best cases on ETIS [49] and MoNuSeg [31].

5.3.2 Transfer Learning on Nuclei

Following Section 5.3.1, we conduct the experiments using 100%, 250%, 500%, and 1000% of synthetic data and the checkpoints from Section 5.2.2. We set the number of training epochs to 100. The grid search over the learning rate yields 5e-6 as the optimal value.
Each experiment is conducted five times and the average is reported. The results can be found in Table 5.6.

Table 5.6: Results on MoNuSeg [31]. (# Syn. Pretrain) represents the amount of synthetic data used for pretraining the model. The first and second best results are in bold and underlined, respectively. The number in blue quantifies the performance difference in contrast to the result on real data.

Methods            # Syn. Pretrain   mDice(%)       mIoU(%)
U-Net++            -                 81.17          69.29
Attention U-Net    -                 79.74          67.19
MedT               -                 79.55          66.17
UCTransNet         -                 80.73          68.42
SMESwin Unet       -                 81.13          68.35
nnU-Net            -                 81.68          69.09
nnU-Net            100%              81.13          68.36
nnU-Net            250%              81.57          68.98
nnU-Net            500%              81.80          69.30
nnU-Net            1000%             82.31 (+0.51)  70.03 (+0.73)

Result: As shown in Table 5.6, nnU-Net is already among the top-performing methods on this dataset. Further leveraging the power of pretraining on a much larger dataset comprising ten times more synthetic data, a new state-of-the-art result is achieved, with increases of 0.51% in mDice and 0.73% in mIoU. The first row of Fig. 5.6 shows that the pretrained model excels at capturing fine-grained details present in the medical images, outperforming models trained directly on real data. However, the benefits of pretraining are not universally consistent. When the amount of synthetic pretraining data is kept relatively small (100% and 250%), the performance of the segmentation model actually degrades compared to training on real data alone. We attribute this degradation to the scale of the synthetic dataset (only 30 images at 100%, as opposed to the 100% of Kvasir consisting of 1450 images). With insufficient data for pretraining, the model fails to learn a representation for recognizing the image features, leading to overfitting on the synthetic data and an imperfect initialization.

5.3.3 Discussion

In this section, we further explored the effectiveness of the synthetic data through supervised transfer learning on the polyp and nuclei segmentation test sets, again comparing the amount of synthetic data used for pretraining.

We found a substantial overall increase in performance after pretraining with synthetic data. Moreover, there is a consistent trend of performance improvement with the utilization of more synthetic data, up to a saturation point beyond which further increases show diminishing returns. In essence, pretraining with synthetic data proves to be an effective strategy for enhancing model performance, but there is an optimal balance to be struck in the amount of synthetic data employed to achieve the best results.

On ETIS, the model improves significantly at recognizing small medical objects. We explain this by the better feature perception and the larger number of small objects seen during pretraining. On MoNuSeg, the model outperforms all the state-of-the-art methods in Table 5.6. We believe that the generated samples from the JDM alleviate the problem of imbalanced examples in the training set and enhance the diversity of the training data.

For future work, we aim to delve into more advanced strategies to fully harness the potential of synthetic data, such as a student-teacher model to enhance the segmentation model.
6 Conclusion and Limitations

Diffusion models (DMs) have incited the latest trend in image synthesis thanks to their great sample quality and diversity, accomplished by moving a Gaussian sample towards the data distribution through progressive denoising. This work explores the potential of DMs as a data augmentation tool for biomedical segmentation.

A theoretical framework, the Joint Diffusion Model (JDM), is proposed to synthesize paired samples. We show that joint modeling resembles the classical DMs for conditional sampling and is orthogonal to various methods of solving inverse problems using the DMs, including image inpainting, super-resolution, colorization, etc. The qualitative results on the medical image segmentation datasets present the realism and diversity of the synthetic samples, demonstrating the capability of the JDM even with only a small number of data samples. By using the training segmentation masks for conditional sampling and by identifying the closest matching samples from the training set, we have confirmed that the JDM effectively learns the underlying data distribution rather than merely memorizing the specific training data.

We further investigated the use of synthetic data for biomedical segmentation. Training with synthetic data can lead to segmentation accuracy comparable to that of real data. However, increasing the amount of synthetic data may cause performance to decline. We believe this is caused by the growing deviation between the model distribution and the data distribution; this deviation affects segmentation performance differently across datasets.

For transfer learning, the pipeline involves pretraining on synthetic data and then fine-tuning on real data. This process improves segmentation performance by emphasizing sample diversity, which helps the model recognize fine-grained image features effectively. The implicit likelihood training of the DMs encourages the model to generate diverse samples and thus balances the training data, leading to better generalization and robustness of the segmentation model.

6.0.1 Limitations and Future Work

Joint modeling poses a severe challenge to the network architecture as the number of modalities increases. Hence, it still remains unclear whether the method is solid in the context of multi-class