DF Constrained space MCMC methods for nested sampling Bayesian computations Master’s thesis in Physics and Astronomy JACOB OLANDER Department of Physics CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2020 Thesis for the degree of Master of Science Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER DF Department of Physics Division of Subatomic, High Energy and Plasma Physics Theoretical Subatomic Physics Chalmers University of Technology Gothenburg, Sweden 2020 Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER © JACOB OLANDER, 2020. Supervisor and examiner: Christian Forssén, Department of Physics Department of Physics Division of Subatomic, High Energy and Plasma Physics Theoretical Subatomic Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Visualizations of likelihood-constrained MCMC random walks for prior sampling in two dimensions. Three different methods are shown: constrained Metropolis (left), Galilean Monte Carlo (middle) and the constrained stretch move (right). The likelihood constraint is illustrated by the ellipsoid-shaped bound and the prior by the underlying density color map. Typeset in LATEX, template by David Frisk Printed by Chalmers Reproservice Gothenburg, Sweden 2020 iv Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER Department of Physics Chalmers University of Technology Abstract Natural phenomena can in general be described using several different scientific models, which creates a need for systematic model selection. Bayesian model comparison as- signs relative probabilities to a set of possible models using the model evidence (marginal likelihood), obtained by an integral that in general needs to be evaluated numerically. Nested sampling is a conceptual framework that efficiently estimates the model evidence and, additionally, provides samples from the model parameter posterior distribution used in Bayesian parameter estimation. A vital step of nested sampling is the likelihood- constrained sampling of the model parameter prior distribution, a task that has proven particularly difficult and that is subject to ongoing research. In this thesis we implement, evaluate and compare three methods for constrained sampling in conjunction with a nested sampling framework. The methods are variants of Markov chain Monte Carlo algorithms: Metropolis, Galilean Monte Carlo and the affine-invariant stretch move, respectively. The latter is applied in the context of nested sampling for the first time in this work. The performances of the methods are assessed by their application to a reference problem that has a known analytical solution. The problem is inspired by effective field theories in subatomic physics where the model parameters take the form of coefficients that are of natural size. We conclude that the efficiency and computational accuracy of nested sampling is strongly dependent on the choice of sampling method and the settings of its associated hyperparameters. In certain cases, especially for high-dimensional parameter spaces, the implementations of this work are seen to achieve better computational accu- racy than MultiNest, a state-of-the-art nested sampling implementation extensively used in astronomy and cosmology. Generally for nested sampling, we observe that it is possible to obtain an inaccurate result without receiving any clear warning signs indicating that this is the case. 
However, we demonstrate that the validity of the computational results can be assessed by monitoring the sampling process. Keywords: Bayesian inference, parameter estimation, model comparison, evidence, nested sampling, MCMC, Metropolis, Galilean Monte Carlo, affine-invariant sampling v Acknowledgements I wish to express my deepest gratitude to my supervisor Christian Forssén for invaluable guidance and for sharing his endless knowledge and wisdom. Furthermore, I would like to thank all members of the Theoretical Subatomic Physics group for interesting discussion and fruitful input during the course of my project. Lastly, I want to recognize the tireless support from my parents, without which this thesis could never have been completed. Jacob Olander, Gothenburg, June 2020 vii Contents List of Figures xi List of Tables xv 1 Introduction 1 1.1 Purpose and scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Bayesian data analysis and numerical methods 3 2.1 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 EFT toy example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.3 Model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.4 EFT toy example revisited . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Markov chain Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Sampling with affine invariance: the stretch move . . . . . . . . . . . 11 3 Nested sampling 13 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Prior mass fraction and sorted likelihood function . . . . . . . . . . 13 3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Statistical properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.2 Obtaining the evidence and posterior . . . . . . . . . . . . . . . . . . 17 3.2.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.4 Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.5 Uncertainty estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Generating new points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.1 Constrained Metropolis . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.2 Galilean Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.3 Constrained stretch move . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.4 Ellipsoidal nested sampling . . . . . . . . . . . . . . . . . . . . . . . 23 4 Constrained prior sampling implementations 25 4.1 Choice of coordinates: the unit hypercube . . . . . . . . . . . . . . . . . . . 25 4.2 Constrained Metropolis implementation . . . . . . . . . . . . . . . . . . . . 25 4.3 Galilean Monte Carlo implementation . . . . . . . . . . . . . . . . . . . . . 27 4.4 Constrained stretch move implementation . . . . . . . . . . . . . . . . . . . 29 ix Contents 4.5 Monitoring the sampling progress . . . . . . . . . . . . . . . . . . . . . . . . 30 4.6 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
33 4.6.1 Sensitivity to the scale parameter value . . . . . . . . . . . . . . . . 33 4.6.2 Choosing a sufficient number of steps . . . . . . . . . . . . . . . . . 36 5 Convergence 37 5.1 The curse of dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2 Divergence in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.1 Debunking the uncertainty estimate . . . . . . . . . . . . . . . . . . 42 6 Conclusion 45 References 49 A Posterior probability distributions I x List of Figures 2.1 The true function g(x) from which synthetic data is generated in the EFT toy example. Data consists of 10 points evenly distributed in 0 < x ≤ 1/π with a relative error c = 5% according to Equation (2.7). . . . . . . . . . . . 7 2.2 Marginal distributions of the analytically derived posterior in the EFT ex- ample for a model with n = 3 parameters. Two-dimensional distributions for pairs of parameters are located below the diagonal on which the one- dimensional distributions are displayed. Point estimates are given for each parameter in terms of their mean values. Error estimates are given in terms of standard deviations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 The model evidence computed for models Mn defined by their parameter space dimensionality n. It is clear that the evidence is maximized for n = 3 which means thatM3 is the model favoured by the data. The evidence for n = 1 is ∼ 0 and is therefore not included in the figure. . . . . . . . . . . . 9 3.1 Illustrations of the relationship between the likelihoods L(θ) and L(ξ), the prior mass fraction ξ and the model evidence Z. . . . . . . . . . . . . . . . 15 3.2 Evolution of active points in a nested sampling iteration. In (a) the worst point is identified and removed. The limit ξ∗ is updated in (b) to the position the point was removed from. A new point is generated in (c), nested within the current limit. . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Sequence of discarded points obtained by estimating the prior mass fractions ξk statistically. The data is artificially produced. . . . . . . . . . . . . . . . 18 3.4 Illustrations of the basic principles of Galilean Monte Carlo. . . . . . . . . . 23 4.1 Example of a Metropolis random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the walk and the end of the walk, respectively. The black dots indicate intermediate steps. The average number of steps for this walk is 〈Ns〉 = 20 and the scale parameter is s = 0.5. . . . . . . . . . . . . . . . . . 27 4.2 Example of a GMC trajectory in a likelihood-constrained region of a two- dimensional parameter space. The trajectory is captured at a nested sam- pling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the trajectory and the end of the walk, respectively. The black dots indicate intermediate time steps. The reflections do not occur exactly at the boundary but rather at proxy surfaces just outside the boundary (not in figure) as described in Section 3.3.2. . . . . . . . . . . . . . . . . . . . . . 29 xi List of Figures 4.3 Example of a stretch move random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the walk and the end of the walk, respectively. 
The black dots indicate intermediate steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 Progress of three key quantities over the course of a nested sampling run for the three different methods. The sampling is performed in n = 3 dimensions using 〈Ns〉 = 40 with sampling parameters N = 1000 and fln = 0.01. The top panel shows the likelihood L for the worst point of each iteration, the middle panel shows the posterior weights w, used to compute the evidence, and the bottom panel shows the moving median of the acceptance rate ra for each iteration. The methods distinguish themselves clearly in the variation of the acceptance rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.5 The (log) evidence error (top row), the bulk median acceptance rate (mid- dle row) and the number of likelihood evaluations (bottom row) in n = 3 dimensions for the three methods applied in the nested sampling frame- work to the toy problem in Section 2.1.1 for different values of the scale parameters s, τ and a. Sampling parameters were N = 1000 active points, fln = 0.01 tolerance and 〈Ns〉 = 40 exploration steps per iteration. . . . . . 35 4.6 Error in the computed (log) evidence with different choices of the average number of steps in each nested sampling iteration. Sampling parameters are N = 1000 active points and fln = 0.01 and the scale parameters are as indicated by the legend. The bands are the standard deviations of five identical runs with different random seeds. The general trend is that the error approaches zero and fluctuates less for larger 〈Ns〉. . . . . . . . . . . . 36 5.1 Illustration of the exponential increase of adjacent points with the dimen- sionality n of a Euclidean space. In (a): one, (b): two and (c): three dimensions, a grid point (filled circle) has 2, 8 and 26 neighbouring points (unfilled circles), respectively. In arbitrary dimensions, n, the correspond- ing number is 3n − 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 The evidence error (top row), the bulk median acceptance rate (middle row) and the number of likelihood evaluations (bottom row) for varying parameter space dimensionality. Each point is the average of five identical runs and the bands are corresponding standard deviations. The left and right columns display two different choices of hyperparameters, respectively, as specified in the figure. Sampling parameters areN = 1000 and fln = 0.01. The overall trend is that performance worsens and computations become less accurate for higher dimensions, as expected. . . . . . . . . . . . . . . . 41 5.3 Tracked progress for the three methods, displayed in terms of the likelihood L, the posterior weights w and the acceptance rate ra in n = 24 dimensions using 〈Ns〉 = 40. In contrast to the n = 3 case in Figure 4.4, the methods are seen to disagree on where the bulk of the posterior is located in terms of ξ, resulting in significant differences in the computed evidences. The vertical lines indicate the computed information H. Sampling parameters are N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 xii List of Figures A.1 Marginal posterior distributions obtained using the constrained Metropolis method with s = 0.5 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. 
The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.2 Marginal posterior distributions obtained using the GMC method with τ = 0.1 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . III A.3 Marginal posterior distributions obtained using the constrained stretch move method with a = 2.0 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coef- ficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III A.4 Marginal posterior distributions obtained using PyMultiNest. The distri- butions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . IV xiii List of Figures xiv List of Tables 4.1 Nested sampling results for the three methods applied to the same problem in n = 3 dimensions. Sampling parameters are N = 1000 active points and fln = 0.01 tolerance. The true evidence value for this model is lnZtrue = 8.09 as was analytically obtained in Figure 2.3. The number of likelihood evaluations, NL, does in the GMC case, include the number of evaluations of the gradient. The fraction of gradient evaluations is given in parenthesis. 32 5.1 The computed (log) evidence lnZ along with the uncertainty estimate ± √ H/N and the number of likelihood evaluations NL required as functions of the parameter space dimensionality n. The analytical values lnZtrue are given for reference. Hyperparameter settings are the same as in the left col- umn of Figure 5.2. Every value is the average obtained from five identical runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 xv List of Tables xvi 1 Introduction Scientific models are typically associated with a set of model parameters used to make ex- plicit theoretical predictions. A theoretical framework may in fact contain one or a set of competing models {Mn}, each with their own collection of model parameters whose values need to be calibrated against experimental observation. A suitable example in subatomic physics regards effective field theories (EFT), e.g. chiral effective field theory (χEFT) [1], which are described by observable coefficients that parametrize the theory at the scale at which it is valid [2]. In this type of scenario we would like to make well-informed state- ments about which model and associated model parameter values to prefer, given a set of experimental data. To this end we employ Bayesian statistical methods [3, 4] in order to perform inductive inference, including parameter estimation and model comparison. 
For parameter estimation, the central object is the posterior probability distribution for the parameters, which in general needs to be represented by a set of random samples. The key quantity for model comparison is the Bayesian model evidence (or marginal likelihood), obtained by integration over the space of model parameters. By evaluating the evidence, competing models in the set can be assigned a relative probability, indicating which model is favoured by the data. Bayesian inference generally requires demanding numerical computations in terms of high- dimensional evidence integrals and sampling of complex posterior probability distributions. For this purpose one mainly resorts to Markov chain Monte Carlo (MCMC) methods [5] which explore the model parameter space in order to find regions of high probability. However, although MCMC methods, such as the Metropolis-Hastings algorithm [6, 7], theoretically could be used to estimate the evidence integral, in practice they fail to do so with acceptable efficiency. A less established Monte Carlo method compared to MCMC is nested sampling, intro- duced by Skilling [8, 9] and further described by Sivia and Skilling [4]. Nested sampling was specifically developed to efficiently provide an estimate of the model evidence. As a by-product it also generates a set of samples from the posterior. The algorithm has been successfully applied in parameter estimation and model comparison applications in a broad variety of fields, from astronomy, cosmology and particle physics [10–15] to biomath- ematics [16–18]. The basic idea of nested sampling is to transform the high-dimensional evidence integral to a one-dimensional integral over unit range by considering secluded and closed regions of parameter space, nested within each other. A most crucial step in the nested sampling procedure is to generate independent samples from within these constrained regions and a variety of methods for performing this step have been proposed and implemented [10, 19–23]. Producing high-quality constrained samples has proven to be a difficult task, to say the least, and is subject to ongoing research. This work aims at contributing to this research by implementing and evaluating three different constrained 1 1. Introduction space sampling methods which are integrated into a nested sampling framework. The implementations are in themselves MCMC based although their surrounding nested sam- pling environment is not. Two of the methods are inspired by previous work on the topic whereas the third method is introduced to the context of nested sampling for the first time in this work. The implemented methods are applied to an EFT-inspired example problem and evaluated based on their respective performance. 1.1 Purpose and scope The purpose of this thesis is to increase the understanding of the nested sampling algorithm in general and constrained space sampling in particular. This is achieved by proposing the specific designs, implementing and evaluating three different methods for constrained sampling in conjunction with nested sampling. Focus will mainly be on the performance based on the specifics of the three methods and not on the design of the nested sampling algorithm in general. In contrast to sampling of unconstrained probability distributions, the process of constrained sampling is poorly understood. A specific goal of this work is to present modified versions of ordinary MCMC methods, adjusted to suit the requirements introduced by nested sampling. 
1.2 Outline
In order to understand the context in which nested sampling is used it is necessary to grasp the concepts of the Bayesian statistical analysis framework. This background is provided in Chapter 2 and includes the basics of Bayesian inference as well as a practical example of parameter estimation and model comparison. Furthermore, the chapter ends with a brief description of a few MCMC sampling methods that are conventionally used in the context of Bayesian inference. The underlying ideas and a full description of the nested sampling algorithm are given in Chapter 3. The chapter additionally introduces the principles of the constrained sampling methods implemented in this work and describes their place in relation to nested sampling. The specifics of the design choices for the implementations are described in Chapter 4, which further presents diagnostic results used to optimize the methods. The first method presented is a constrained version of the Metropolis algorithm, the second is a version of Galilean Monte Carlo and the third is a constrained version of an affine-invariant algorithm referred to as the stretch move. The latter method is to our knowledge unique to this work. In Chapter 5 the methods are observed when pushed past their tipping points and broken down by applying them to models with increasingly large parameter space dimensionalities. Here we also discuss uncertainty estimates and make a comparison to the widely used, state-of-the-art nested sampling software MultiNest [10, 11, 24], which the methods of this work are seen to outperform in some cases. The thesis is concluded in Chapter 6 by summarizing the main findings and by proposing starting points for further related work.

2 Bayesian data analysis and numerical methods
Bayesian statistics and its applications form a vast subject that is relevant to most scientific research, and indeed to any field of work where making predictions from data and previous experience is important. Here follows a brief description of the aspects of Bayesian inference relevant to this work. For an extensive review of the subject, see e.g. MacKay [3] or Sivia and Skilling [4].

2.1 Bayesian inference
One of the powers of the Bayesian description of probability is that it allows for a relation to be established between a conditional probability prob(X|Y) and its reverse prob(Y|X), where X and Y are propositions that could be outcomes of random processes (although that property is not required). This reversal is desirable in scenarios where, e.g., prob(X|Y) is the sought-after quantity but is difficult to write down directly, whereas prob(Y|X) is easier to interpret and more naturally expressed. The product rule for joint probabilities states that

prob(X, Y) = prob(X|Y) prob(Y),   (2.1)

which in words reads: the probability that X and Y occur equals the probability that X occurs given that Y has occurred, multiplied by the probability that Y occurs. Joint probabilities are symmetric, such that prob(X, Y) = prob(Y, X), which by the product rule in Equation (2.1) gives Bayes' theorem

prob(X|Y) = \frac{prob(Y|X)\, prob(X)}{prob(Y)},   (2.2)

which yields the relation between the conditional probabilities prob(X|Y) and prob(Y|X). The marginal probabilities prob(X) and prob(Y) can be expressed by marginalization of the joint probability prob(X, Y) = prob(Y|X) prob(X) (by Equation (2.1)) through the sum rule according to

prob(Y) = \begin{cases} \int prob(Y|X)\, prob(X)\, dX, & \text{for continuous } X \\ \sum_X prob(Y|X)\, prob(X), & \text{for discrete } X \end{cases}   (2.3)

and correspondingly for prob(X).
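To make Equations (2.1)–(2.3) concrete, the short Python sketch below applies the sum rule and Bayes' theorem to a discrete two-state proposition X given an observed outcome Y. The numerical values are made up purely for illustration and do not come from this thesis.

```python
import numpy as np

# Illustrative (made-up) numbers: prior prob(X) over two states of X and the
# likelihood prob(Y|X) of a single observed outcome Y for each state.
prior_X = np.array([0.3, 0.7])
likelihood_Y_given_X = np.array([0.9, 0.2])

# Sum rule (2.3), discrete case: prob(Y) = sum_X prob(Y|X) prob(X)
prob_Y = np.sum(likelihood_Y_given_X * prior_X)

# Bayes' theorem (2.2): prob(X|Y) = prob(Y|X) prob(X) / prob(Y)
posterior_X_given_Y = likelihood_Y_given_X * prior_X / prob_Y

print(prob_Y)               # 0.41
print(posterior_X_given_Y)  # approximately [0.66, 0.34], normalized to unity
```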
In the context of this work, X takes the form of an n-dimensional parameter vector θ = (θ0, ..., θn−1) used to make predictions in a given model M. Y is in this context a set of data D obtained from an experiment or simulation yielding measurements of some 3 2. Bayesian data analysis and numerical methods quantity sought to be predicted byM. In other words, the goal is to find the probability (distribution) of the parameters θ given data D. Bayes’ theorem then reads prob(θ|D, I) = prob(D|θ, I)prob(θ|I) prob(D|I) (2.4) where I is any other relevant information including e.g. the choice ofM. The probabilities entering Equation (2.4) are the posterior probability distribution of the parameters prob(θ|D, I) ≡ P(θ), (2.5a) the likelihood of the data prob(D|θ, I) ≡ L(θ), (2.5b) the prior probability distribution of the parameters prob(θ|I) ≡ π(θ), (2.5c) the Bayesian evidence (or marginal likelihood) prob(D|I) ≡ Z. (2.5d) 2.1.1 EFT toy example In order to put Bayesian inference into context, we here review a simple EFT-inspired example from Schindler et al. [25] and Wesolowski et al. [26]. In this example, a function g(x) is taken to represent the true behavior of some observable in the underlying theory and x is an EFT expansion parameter, e.g. a ratio of momenta representing a low-energy physics scale and a high-energy breakdown scale. The function is chosen such that the coefficients of its Taylor expansion at x = 0 within |x| < 1 ought to be of natural order: g(x) = a0 + a1x+ a2x 2 + a3x 3 + ... (2.6) In this way the observable coefficients ai are related to order-by-order predictions of the EFT. More specifically, including the first n terms would correspond to the prediction of the EFT truncated at order n. By physical arguments the observable coefficients are expected to be of natural size [2]. To simulate observations from an experiment, synthetic data points d are generated by adding a Gaussian relative noise to g(x) at points of measurement xj , j = 1, ..., Nd: dj = g(xj)(1 + cηj) with experimental error σj = cdj (2.7) where ηj ∼ N (0, 1)1 is a standard normal random sample and c is a relative error (5% in this example). To make accurate predictions, the observable coefficients need to be calibrated against this data which makes them ideal subjects for Bayesian parameter estimation (see Section 2.1.2). In this example, our model M is a polynomial of order n− 1, gM(x; θ) = n−1∑ i=0 θix i, (2.8) modelling the behavior of the true function g(x). This is where the parameter vector θ of the model enters, upon which the predictions of M depends. A non-Bayesian ap- proach usually aims to find a point estimate of the parameters, such as the least-squares minimization arg minθ χ 2(θ) where χ2(θ) = Nd∑ j=1 ( dj − gM(xj ; θ) σj )2 . (2.9) 1N (µ, σ2) denotes a normal distribution with mean µ and variance σ2. 4 2. Bayesian data analysis and numerical methods This approach might be sufficient in some applications but often fails to provide any infor- mation on the distribution of the parameters and thereby their uncertainties. Neither does it take into account any prior knowledge of the parameters (such as naturalness in the case of the EFT expansion). We will now turn to the general problem of parameter estimation and model comparison. However, we will revisit this toy example in Section 2.1.4. 2.1.2 Parameter estimation The model parameters θ are in the Bayesian formalism estimated using the posterior P(θ) determined by Bayes’ theorem Equation (2.4). 
In parameter estimation applications it is possible to ignore the evidence Z as it is independent of θ and merely amounts to a normalization factor; it is however a central quantity in the context of Bayesian model comparison (see Section 2.1.3). The posterior P(θ) ∝ L(θ)π(θ) (2.10) is consequently determined solely by the likelihood and prior and is obtained either by numerical methods, such as MCMC sampling (see Section 2.2), or as an analytical solution (if possible). Subsequently, it is possible to derive distributions for any subset θj , ..., θk of parameters by marginalization prob(θj , ..., θk|D, I) = ∫ P(θ) ∏ i 6=j,...,k dθi, (2.11) where integration is taken over the appropriate ranges in θ. Marginalization is useful for e.g. discarding of nuisance parameters but also convenient for visualization which (unfortu- nately) is bounded by two or three dimensions. Examples of such two- and one-dimensional marginal probability density functions (pdfs) are shown in Section 2.1.4. Point estimates of any function f(θ) with respect to the posterior are obtained with the expectation value E[f ] = ∫ f(θ)P(θ)dnθ (2.12) where the special cases f(θ) = θi and f(θ) = (θi − E[θi])(θj − E[θj ]) give the parameter means and covariances respectively. It is important to stress, however, that the posterior contains more information than mere point estimates such as mean or mode values. The posterior has in general a complex structure, such as multiple modes, not well described by a single number. Only in the special case when the posterior is (approximately) Gaussian can it be fully described by its mean vector and covariance matrix. 2.1.3 Model comparison Model comparison can be illustrated by the problem of determining which of two possible modelsM1 andM2 (e.g. polynomials of different orders in the example of Section 2.1.1) to prefer. In the Bayesian formalism this problem is addressed by comparing the model posteriors prob(M1|D, I) and prob(M2|D, I)2. These posteriors are conditioned on the same data D and do not include any parameter dependence as they are more general than to describe a certain fit for a given prediction. As opposed to parameter estimation the evidence Z plays a crucial role in model comparison. It is given by normalization of the posterior 2Note that as M1 and M2 are here made explicit they are not included in I. 5 2. Bayesian data analysis and numerical methods Z = ∫ prob(D|θ,M, I)prob(θ|M, I)dnθ = ∫ L(θ)π(θ)dnθ. (2.13) The model comparison is then performed by constructing the ratio of the model posterior pdfs using Bayes’ theorem according to prob(M1|D, I) prob(M2|D, I) = prob(D|M1, I)prob(M1|I) prob(D|M2, I)prob(M2|I) (2.14) where the common factor prob(D|I) has cancelled. If there is no initial reason to prefer one model over the other, the ratio of model priors will evaluate to prob(M1|I)/prob(M2|I) = 1. The remaining ratio is called the Bayes’ factor and should be determined in order to compare the models as prob(M1|D, I) prob(M2|D, I) −→ prob(D|M1, I) prob(D|M2, I) . (2.15) Using the sum (2.3) and product (2.1) rules the Bayes’ factor can be computed by inte- gration as prob(D|M1, I) prob(D|M2, I) = ∫ prob(D|θ,M1, I)prob(θ|M1, I)dnθ∫ prob(D|θ,M2, I)prob(θ|M2, I)dnθ (2.13)= Z1 Z2 (2.16) meaning that the evidences, Z1,2, are used to quantitatively assess the relative performance of different models. 
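In practice the evidences are computed and stored as logarithms, so the ratio in Equation (2.16) is most safely evaluated in log space. The following minimal sketch, with invented log-evidence values used purely for illustration, shows how posterior odds would be obtained under equal model priors:

```python
import numpy as np

# Invented log-evidences for two competing models (illustrative only).
lnZ1, lnZ2 = -215.3, -218.1

# Bayes' factor (2.16), evaluated in log space for numerical stability.
bayes_factor = np.exp(lnZ1 - lnZ2)

# With equal model priors, the posterior odds (2.14) equal the Bayes' factor;
# the posterior probability of M1 then follows by normalization.
prob_M1 = bayes_factor / (1.0 + bayes_factor)
print(f"Z1/Z2 = {bayes_factor:.1f}, prob(M1|D) = {prob_M1:.3f}")
```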
2.1.4 EFT toy example revisited
In this section the representative problem introduced in Section 2.1.1, estimating observable coefficients in an EFT, is approached using Bayesian inference. A function g(x) with the desired property of natural Taylor coefficients a_i, described in the example, is

g(x) = \left(\frac{1}{2} + \tan\left(\frac{\pi}{2}x\right)\right)^2 = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots = \frac{1}{4} + \frac{\pi}{2}x + \frac{\pi^2}{4}x^2 + \frac{\pi^3}{24}x^3 + \ldots \approx 0.25 + 1.57x + 2.47x^2 + 1.29x^3 + \ldots   (2.17)

which here will be used to generate synthetic data. The true function (2.17), along with data and associated errors generated according to Equation (2.7), can be seen in Figure 2.1. The data consists of 10 points evenly distributed in 0 < x ≤ 1/π and the relative error is c = 5%. As a reminder, the goal is to predict the coefficients a_i (and thereby also g(x)) given this data by studying the parameters θ of the model function g_M(x; θ) defined in Equation (2.8). This is done by constructing a posterior pdf for the parameters via a likelihood and a prior according to Bayes' theorem (2.4).

Figure 2.1: The true function g(x) from which synthetic data is generated in the EFT toy example. Data consists of 10 points evenly distributed in 0 < x ≤ 1/π with a relative error c = 5% according to Equation (2.7).

Using the principle of maximum entropy (MaxEnt) [27] it is possible to derive [4] a likelihood for the data. Assuming uncorrelated data, the resulting pdf is

prob(D|θ, I) = \left[\prod_{j=1}^{N_d} \frac{1}{\sqrt{2\pi\sigma_j^2}}\right] \exp\left(-\frac{\chi^2}{2}\right),   (2.18)

which is an N_d-dimensional Gaussian in the data and where χ² is defined in Equation (2.9). It is important to note that this is a pdf for the data D, conditioned on the parameters θ, not a pdf for the parameters themselves. If a uniform prior, i.e. prob(θ|I) = const., is chosen in combination with the likelihood (2.18), the maximum of the posterior would lie exactly at the least-squares point estimate discussed in Section 2.1.1 (given that the prior range includes this point). To account for the naturalness of the coefficients, however, Wesolowski et al. [26] introduce a naturalness prior given by the symmetric Gaussian N(θ; 0, θ̄² 1_n), i.e.

prob(θ|I) = \prod_{i=0}^{n-1} prob(θ_i|I) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi\bar{\theta}^2}} \exp\left(-\frac{\theta_i^2}{2\bar{\theta}^2}\right),   (2.19)

where θ̄ = 5 is the width of the distribution. In this way the prior favours parameter values in the interval ±5 (roughly), keeping them close to natural size. The posterior, obtained as the product of the MaxEnt likelihood (2.18) and the naturalness prior (2.19), is visualized in Figure 2.2 for a model with n = 3. This is done by presenting its marginal pdfs for different parameter subsets. The posterior is in this case a Gaussian, as can be derived analytically from the relatively simple forms of the likelihood and the prior. In general, analytical expressions are impossible to find and one has to resort to numerical sampling methods such as MCMC, which will be discussed in Section 2.2. Along with the distributions, Figure 2.2 contains point and error estimates for the parameters given by their means and standard deviations, respectively. The point estimates should be compared to the actual coefficients in Equation (2.17). The posterior illustrated in Figure 2.2 is obtained given a model with n = 3 parameters, i.e. a polynomial of order n − 1 according to Equation (2.8). This specific choice might however not be the model that has the most evidence given the data.
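For reference, the likelihood (2.18) and the naturalness prior (2.19) of this toy problem are straightforward to code. The sketch below is our own illustration (function and variable names are not taken from the thesis); it generates data according to Equation (2.7) and evaluates the log-likelihood and log-prior for a parameter vector θ.

```python
import numpy as np

def g_true(x):
    """True function (2.17) used to generate synthetic data."""
    return (0.5 + np.tan(0.5 * np.pi * x)) ** 2

def make_data(n_data=10, c=0.05, rng=np.random.default_rng(0)):
    """Synthetic data according to Equation (2.7)."""
    x = np.linspace((1 / np.pi) / n_data, 1 / np.pi, n_data)
    d = g_true(x) * (1.0 + c * rng.standard_normal(n_data))
    return x, d, c * d   # measurement points, data, experimental errors sigma_j

def log_likelihood(theta, x, d, sigma):
    """Log of the MaxEnt likelihood (2.18) for the polynomial model (2.8)."""
    g_model = np.polyval(theta[::-1], x)   # theta_0 + theta_1 x + theta_2 x^2 + ...
    chi2 = np.sum(((d - g_model) / sigma) ** 2)
    return -0.5 * chi2 - np.sum(np.log(np.sqrt(2.0 * np.pi) * sigma))

def log_prior(theta, theta_bar=5.0):
    """Log of the naturalness prior (2.19) with width theta_bar."""
    return np.sum(-0.5 * (theta / theta_bar) ** 2
                  - 0.5 * np.log(2.0 * np.pi * theta_bar ** 2))
```

The unnormalized log-posterior used by the samplers discussed in Section 2.2 would then simply be log_likelihood(theta, ...) + log_prior(theta).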
We can nevertheless try to inform ourselves by employing the principles of model comparison described in Section 2.1.3. It is in this example possible to compute the evidence prob(D|M_n, I) for a model M_n with n parameters analytically, and the result is shown in Figure 2.3. We see clearly that M_3 is in fact the model with the maximum evidence. The fact that the model evidence reaches a plateau for n ≥ 5 means that little information is either gained or lost by adding or removing a parameter in this region. The evidence integral (2.13) is, however, in general not possible to compute analytically and numerical methods are required. The work of this thesis concerns different versions of such a method, namely nested sampling, which will be introduced in Chapter 3. For this, we first have to acquaint ourselves with a few MCMC sampling methods.

Figure 2.2: Marginal distributions of the analytically derived posterior in the EFT example for a model with n = 3 parameters. Two-dimensional distributions for pairs of parameters are located below the diagonal, on which the one-dimensional distributions are displayed. Point estimates are given for each parameter in terms of their mean values, and error estimates in terms of standard deviations (θ_0 = 0.25 ± 0.02, θ_1 = 1.63 ± 0.39, θ_2 = 3.14 ± 1.27).

Figure 2.3: The model evidence computed for models M_n defined by their parameter space dimensionality n. It is clear that the evidence is maximized for n = 3, which means that M_3 is the model favoured by the data. The evidence for n = 1 is ∼ 0 and is therefore not included in the figure.

2.2 Markov chain Monte Carlo methods
MCMC methods [5] comprise a set of sampling algorithms widely used in multiple applications, not least in Bayesian inference. They are designed to approximate a target distribution p(θ) by the construction of Markov chains whose equilibrium distribution is that of the target. This is done by letting walkers explore parameter space looking for regions of significant probability. Markov chains are created by ensuring that the next position θ' of each walker only depends on its current position θ and a transition probability T(θ', θ). In contrast to other Monte Carlo methods, MCMC thus produces correlated samples, which is important to take into account when e.g. the sample mean and variance are computed. The Markov chain limit theorem [28] provides principles for handling correlated samples and justification for computations performed using MCMC samples. We say that a Markov chain is ergodic if its temporal average, i.e. over time steps t, in the long run approaches its ensemble average over all possible states, i.e. if all possible states are eventually explored. This is a key property for MCMC methods for generating samples from the target distribution. Furthermore, a Markov chain is reversible with respect to a distribution p if it satisfies detailed balance, p(θ) T(θ', θ) = p(θ') T(θ, θ'), meaning that the probability flow from θ to θ' balances the flow from θ' to θ. Detailed balance implies that the Markov chain is stationary, which means that the distribution of a sample θ(t) is independent of the time t. Stationarity gives the Markov chain the right properties concerning its equilibrium distribution.
We will now proceed by describing a few specific MCMC algorithms used to sample distributions.

2.2.1 Metropolis-Hastings
The Metropolis-Hastings algorithm [6, 7] is an MCMC method for producing samples from a target p(θ). This is done by exploring parameter space with random walks, producing Markov chains. At each step t in a chain, a new position θ' is proposed, drawn from a proposal distribution Q(θ'|θ^(t)) where θ^(t) is the current position. The proposal is accepted with probability

α = \min\left(1, \frac{p(θ')\, Q(θ^{(t)}|θ')}{p(θ^{(t)})\, Q(θ'|θ^{(t)})}\right)   (2.20)

and θ^(t+1) ← θ' is set; otherwise θ^(t+1) ← θ^(t). In the original Metropolis implementation the proposal distribution is taken to be symmetric, meaning Q(θ^(t)|θ') = Q(θ'|θ^(t)), and is regularly chosen to be a symmetric Gaussian N(θ'; θ^(t), σ_Q² 1_n) where the scale σ_Q needs to be set appropriately. The acceptance probability therefore simplifies to α = min(1, p(θ')/p(θ^(t))). In general, the proposal distribution does not have to be isotropic and can for instance be a Gaussian with an arbitrary n × n covariance matrix Σ_Q. In this case there are n(n + 1)/2 parameters that need to be set depending on the problem at hand, complicating the user experience. The Metropolis procedure is described in Algorithm 2.1. A caveat of the Metropolis sampling algorithm is that it, like MCMC methods in general, inherently produces correlated samples. This property needs to be properly considered in applications and will be revisited in Section 3.3.1.

Algorithm 2.1: Metropolis procedure.
  Input: Target p(θ) and proposal distribution Q(θ'|θ)
  Output: Target samples θ^(t)
  Initialize θ^(0)
  for t = 0, ..., N_s − 1 do
      θ' ← sample from Q(θ'|θ^(t))
      α ← min(1, p(θ')/p(θ^(t)))
      u ← sample from U(0, 1)
      if u ≤ α then
          θ^(t+1) ← θ'
      else
          θ^(t+1) ← θ^(t)

2.2.2 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) [3, 5], introduced by Duane et al. (1987) [29], is an MCMC method for generating high-quality samples from challenging target distributions p(x)³, particularly for larger numbers of dimensions n. The idea is taken from classical Hamiltonian mechanics by considering the negative logarithm of the target distribution as a potential energy V(x) in parameter space, i.e.

p(x) ∝ exp(−V(x)).   (2.21)

The concept of momentum p is included by considering the phase-space extended pdf prob(x, p) = prob(x) prob(p) = p(x) q(p), where we have introduced

q(p) ∝ \exp\left(-\frac{|p|^2}{2}\right),   (2.22)

i.e. the exponential of the negative kinetic energy. The Hamiltonian of a classical system is of the form H = ½|p|² + V(x), which implies that the full 2n-dimensional phase-space distribution can be written as

prob(x, p) ∝ exp(−H).   (2.23)

Any point (or sample) (x, p) can thus be evolved in time t for any given period, yielding a new point (x', p'), by solving Hamilton's equations⁴:

dx/dt = ∂H/∂p = p,   (2.24a)
dp/dt = −∂H/∂x = −∇V(x),   (2.24b)

serving the purpose of the transition probability T(x', p'; x, p). If the momentum p is randomly initiated at each new position x, these points will form a Markov chain satisfying ergodicity [29]. Detailed balance is satisfied by the time-reversal symmetry of Hamilton's equations (2.24).

³We shall here use x instead of θ for the parameters to emphasize the parallel to position in classical mechanics.
⁴These equations need in general to be solved numerically, which is a topic that we will not elaborate on here.
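As the footnote above indicates, Hamilton's equations (2.24) generally have to be integrated numerically. A common choice for this (not elaborated in the thesis) is the leapfrog scheme; the sketch below is a generic illustration in which the step size epsilon, the number of steps and the gradient routine grad_V are assumed inputs rather than quantities defined in this work.

```python
import numpy as np

def leapfrog(x, p, grad_V, epsilon=0.01, n_steps=50):
    """One leapfrog integration of Hamilton's equations (2.24); generic sketch.

    grad_V(x) should return the gradient of V(x) = -ln p(x) (up to a constant),
    and epsilon is the time step.
    """
    x, p = np.copy(x), np.copy(p)
    p -= 0.5 * epsilon * grad_V(x)      # initial half step in momentum
    for _ in range(n_steps - 1):
        x += epsilon * p                # full position step, dx/dt = p
        p -= epsilon * grad_V(x)        # full momentum step, dp/dt = -grad V
    x += epsilon * p
    p -= 0.5 * epsilon * grad_V(x)      # final half step in momentum
    return x, p
```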
The desired target p(x) is obtained by discarding the p-samples, only keeping the x-part of phase space. These samples should be of high quality as the trajec- tories described by the dynamics of Equation (2.24) are favoured to move towards regions of low potential energy, i.e. high probability. This is especially useful in large-dimensional spaces out of which these high-quality regions usually make up a small fraction, making them hard to locate. The HMC algorithm also covers distance in parameter space faster than random walks such as Metropolis-Hastings, effectively decreasing the correlation of samples. The most noticeable drawback of HMC is that it requires evaluation of the gradient −∇V (x) = ∇ ln p(x) at every step of a trajectory. This is in general a very computationally expensive and thus time consuming task. HMC furthermore contains a number of hyperparameters, such as the number of time steps and the size of the steps, that need to be appropriately set for your current problem. 2.2.3 Sampling with affine invariance: the stretch move Model parameters can, in general, be strongly correlated, making their joint probability distributions highly asymmetric. This means that the characteristic length scales and relevant directions can differ largely between parameter dimensions making efficient ex- ploration of parameter space more challenging. Building upon work by Christen [30], Goodman and Weare [31] proposed an MCMC sampling method that is independent of scale differences between dimensions. Formally, the algorithm is invariant under affine transformations, which in n dimensions are of the form yi = βi0θ0 + ...+ βin−1θn−1 = n−1∑ j=0 βijθj , i = 0, ..., n− 1, (2.25) where θj are the parameters of the problem, βij are scale coefficients and yi are the corresponding transformed parameters. Affine invariance of the algorithm implies that parameter dimensions are effectively scaled to the same size, simplifying the shape of the distribution making parameter space easier to explore. To achieve this, the algorithm utilizes a special kind of random walk, informally referred to as the stretch move. This method simultaneously evolves an ensemble of K walkers S = {Θk}Kk=1 where the proposal distribution for walker Θk depends on the complementary ensemble S[k] = {Θj , ∀j 6= k}. At step t, a new position Y is proposed for walker Θk by choosing another walker Θj randomly from S[k] and setting Θ(t) k → Y = Θj + ζ̃[Θ(t) k −Θj ] (2.26) where ζ̃ is a scale factor randomly sampled from a distribution g(ζ). For the proposal to be symmetric and satisfy detailed balance we must have g(ζ−1) = ζg(ζ) as well as add a Metropolis style rejection scheme where Θ(t+1) k ← Y is accepted with probability 11 2. Bayesian data analysis and numerical methods α = min ( 1, ζ̃n−1 p(Y ) p(Θ(t) k ) ) , (2.27) where p(θ) is the target distribution. This proposal procedure is repeated for every walker in S resulting in a collective evolution. As an explicit form for g(ζ), Goodman and Weare [31] suggests g(ζ) ∝ 1√ ζ , for ζ ∈ [1 a , a ] , (2.28) where a is a hyperparameter adjusting the scale of g(ζ) and which they set to 2. One main advantage of this method is that there are very few hyperparameters that need to be set by the user, there are essentially only two: step scale a and the number of steps. A state-of-the-art implementation of the stretch move procedure for MCMC sampling can be found in e.g. the Python module emcee [32]. 
The MCMC methods described in the previous sections provide a set of samples that can be used to make computations — such as the model evidence — with respect to the posterior. However, this procedure quickly becomes inefficient for large-scale problems and alternative approaches are necessary. In the next chapter we introduce such an approach: nested sampling. 12 3 Nested sampling Introduced by Skilling [8], nested sampling is a Monte Carlo sampling method that natu- rally provides an estimate of the evidence Z. It is however important to note that nested sampling is not an MCMC method as it does not produce Markov chains distributed ac- cording to the posterior. The acquisition of posterior samples is nevertheless also a product but not the main focus of the algorithm. Here follows a review of the principles of nested sampling, for a detailed description see e.g. Sivia and Skilling [4] or Skilling [9]. 3.1 Background Let us first repeat some Bayesian key concepts from Chapter 2. Given data D and other information I, such as the choice of model, the goal is to obtain the evidence (or marginal likelihood) Z = prob(D|I) but also the posterior P(θ) = prob(θ|D, I) from the prior π(θ) = prob(θ|I) and the likelihood L(θ) = prob(D|θ, I). The relation between these quantities is given through Bayes’ theorem which in this more compact notation takes the form L(θ)× π(θ) = Z × P(θ). (3.1) Assuming that the posterior distribution is properly normalized to unity, i.e. ∫ P(θ)dnθ = 1, the evidence can be written in terms of the integral Z = ∫ L(θ)π(θ)dnθ (3.2) where the integration is taken over the parameter ranges according to the prior. The com- plexity of computing this integral increases exponentially with the number of dimensions as a space with n dimensions resolved to one part in R has Rn volume elements. High dimensionality furthermore obstructs the use of analytical approximations as the degree of freedom in these spaces becomes too large [4]. 3.1.1 Prior mass fraction and sorted likelihood function What makes the numerical computation of integrals such as Equation (3.2) challenging is the brute-force approach of direct summation of the exponentially large number of elements in parameter space. However, the likelihood values associated to each of these elements can be sorted in a decreasing order, resulting in a sequence {Lk}. The sequence could be enumerated using a global variable encoding the information of the part of prior parameter space with likelihoods larger than Lk. This variable is the prior mass fraction, here denoted ξ. In the case of continuous parameters θ, the prior mass fraction as a function of likelihood limit is formally defined as 13 3. Nested sampling ξ(λ) = ∫ L(θ)>λ π(θ)dnθ (3.3) and is interpreted as the fraction of prior probability with likelihood greater than λ [4]. Assuming the prior π(θ) to be normalized to unity, the prior mass fraction has by def- inition properties: 0 ≤ ξ ≤ 1, ξ(0) = 1 and ξ(Lmax) = 0 where 0 < L ≤ Lmax. Also by definition, the mass element associated with likelihoods λ ≤ L ≤ λ + dλ is given by dξ = π(θ)dnθ. In this way the problem is transformed from n-dimensional parameter space, in terms of the parameter vector θ = (θ0, ..., θn−1), to a single variable ξ. There is no information lost in this reduction of dimensions if ξ is stored at a resolution of one part in Rn and each of the n parameters θi is stored at a resolution of one part in R as the information required in both cases amounts to n log2R bits [4]. 
For the sake of convenience, it is rather the inverse function L(ξ), defined by L(ξ(λ)) = λ, (3.4) that we shall consider in practice. It takes on values 0 ≤ L ≤ Lmax where the extremes are at L(0) = Lmax and L(1) = 0. L(ξ) is a sorted version of the original likelihood L(θ) and only has a single variable dependence ξ. The existence of the inverse is not trivial to prove. In fact, it has been proven by Schittenhelm and Wacker [33] that the statement (3.4) is violated for specific forms of L(θ). For the likelihood functions in this work, however, the definition of the inverse L(ξ) will always hold. A cartoon representing the relation between the likelihood versions L(θ) and L(ξ) can be seen in Figure 3.1a. The left panel shows an example of a likelihood in two dimensions where each of the iso-likelihood contours L(θ) = Lk, k = 1, ..., 4, corresponds to a prior mass fraction ξk shown in the right panel. In this example, ξk is simply the integral of the prior over the area enclosed by the contour Lk. The filled in areas illustrate what parts of parameter space correspond to what ξ-intervals. Note that the figure is for illustrative purposes and is not quantitatively correct. The key idea setting up the nested sampling method is to express the evidence Z in terms of the quantities L and ξ. The evidence (3.2) is a sum of likelihood values L(θ) with associated prior mass weights π(θ)dnθ over volume elements dnθ. The mass elements dξ originate from the same volume elements meaning the evidence is equivalently given by Z = ∫ 1 0 L(ξ)dξ (3.5) which notably is a one-dimensional integral. Geometrically, this means that Z is the area enclosed by the curve generated by L(ξ) as shown in Figure 3.1b. Furthermore, P (ξ) is the ξ-counterpart to the posterior P(θ) and is proportional to L(ξ) according to P (ξ) = 1 Z L(ξ) (3.6) from which a sample ξ̃ corresponds to a parameter sample θ̃ from P(θ). This is true from the previous argument that mass elements dξ and π(θ)dnθ come from the same volume element. The sorted likelihood function L(ξ) is thus the primary constituent in the nested sampling machinery and can be used to yield Z as well as samples from P(θ). 14 3. Nested sampling θ L4 L3 L2 L1 0 ξ4 ξ3 ξ2 ξ1 1 ξ L(ξ) L4 L3 L2 L1 (a) Cartoon of the relation between iso-likelihood contours L(θ) = Lk and prior mass fractions ξk, described by L(ξk) = Lk. Each interval in ξ corresponds to an area in parameter space. Note that the figure is not meant to be quantitatively correct. 0 1 ξ 0 Lmax L(ξ) Z (b) Representation of the evidence Z as the area under the curve of the like- lihood function L(ξ). Figure 3.1: Illustrations of the relationship between the likelihoods L(θ) and L(ξ), the prior mass fraction ξ and the model evidence Z. 15 3. Nested sampling 3.2 The algorithm The nested sampling algorithm makes use of active points exploring parameter space to reveal the structure of L(ξ). N points are sampled from the parameter space in proportion to the prior and are then set to evolve under successively stricter likelihood constraints L(θ) > L∗, where L∗ ≥ 0 is an evolving likelihood limit. These constraints iteratively define the iso-likelihood contours seen in Figure 3.1a. In terms of ξ this corresponds to N uniformly distributed samples evolving subject to the constraint ξ < ξ∗ = ξ(L∗). 
Each iteration hence starts with N points uniformly distributed in (0, ξ*)¹ (initially ξ* = 1) and the subsequent evolution, illustrated in Figure 3.2, proceeds as follows:

(a) The worst of the N active points, the one with the lowest likelihood, is identified, removed and stored for later use.
(b) The limit ξ* (and therefore also L*) is updated to the previous position of the removed point.
(c) A new point, nested within ξ*, is generated and added to the set of active points.

This procedure is set to repeat for a pre-defined number of iterations M. The way in which item (c) above is implemented is a subject of particular importance and will be further discussed in Section 3.3. Sampling from the prior under the likelihood constraint could be argued to be the most crucial subtopic of nested sampling and will be the main focus of this thesis. For now it is assumed to be possible to generate such new points.

Figure 3.2: Evolution of active points in a nested sampling iteration. In (a) the worst point is identified and removed. The limit ξ* is updated in (b) to the position the point was removed from. A new point is generated in (c), nested within the current limit.

3.2.1 Statistical properties
At this stage it is necessary to address the fact that the ξ's are not explicitly known. It is however possible to determine their statistical properties, from which an estimate can be extracted. If the worst point removed at iterate k is denoted ξ_k, the shrinkage ratio of the active volume between iterations is

t_k = ξ_k/ξ_{k−1}, where ξ_0 = 1 ⟹ t_1 = ξ_1.   (3.7)

Noting that ξ_k is the topmost of N uniformly distributed samples between 0 and ξ* = ξ_{k−1}, the ratio t_k can be considered to be the topmost of N samples uniformly distributed between 0 and 1. This means that the properties of order statistics [35] can be used to show that t_k ∼ Beta(N, 1)² with pdf

prob(t_k) = N t_k^{N−1}.   (3.8)

Given the distribution (3.8), the associated mean and variance of ln t_k are

E[ln t_k] = −1/N and Var[ln t_k] = 1/N²,   (3.9)

respectively. Using the (natural) logarithm rather than the plain value in (3.9) ensures better properties in the transition from the full prior space to the smaller regions where the bulk of the posterior mass resides [9]. At iterate k, ξ_k can be expressed in terms of a product of shrinkage ratios, i.e.

ξ_k = t_1 t_2 \cdots t_k = \prod_{j=1}^{k} t_j,   (3.10)

leading to a geometrical progression of the active ξ-range (but linear in ln ξ). The properties of the ξ's are thereby determined entirely by the properties of the t's and it is straightforward to derive the mean and variance

E[ln ξ_k] = \sum_{j=1}^{k} E[ln t_j] = −k/N and Var[ln ξ_k] = k/N²   (3.11)

of the logarithm ln ξ_k. Crudely adopting the mean in Equation (3.11) as an estimator, one obtains an exponentially decreasing sequence of discarded prior mass fractions

ξ_k = e^{−k/N},   (3.12)

illustrated in Figure 3.3. That is to say, with this scheme the active range shrinks exponentially as regions of higher likelihoods are approached. Along with each ξ_k there is an associated likelihood L_k; together they compose a list of pairs {(ξ_k, L_k)}_{k=1}^{M}, which is the main output of the algorithm and is used to compute the sought-after quantities.

¹A procedure for validating this assumption of uniformity has been suggested by Fowlie et al. [34].
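The statistical estimate (3.12) is easy to check numerically. The following sketch (our own illustration, not part of the thesis implementation) draws shrinkage ratios t_k ∼ Beta(N, 1) according to Equation (3.8) and compares the accumulated ln ξ_k with its expected value −k/N from Equation (3.11):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 1000, 5000                     # active points, nested sampling iterations

# Shrinkage ratios t_k ~ Beta(N, 1), Equation (3.8).
t = rng.beta(N, 1, size=M)

# ln xi_k = sum_j ln t_j, Equation (3.10); its expected value is -k/N, Eq. (3.11).
ln_xi = np.cumsum(np.log(t))
k = np.arange(1, M + 1)

print(ln_xi[-1])                      # close to -M/N = -5
print(np.max(np.abs(ln_xi + k / N)))  # deviation grows only like sqrt(k)/N
```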
3.2.2 Obtaining the evidence and posterior The evidence Z is estimated by approximating the integral expression of Equation (3.5) by a weighted sum using the list of samples {(ξk, Lk)}Mk=1 according to Z = ∫ 1 0 L(ξ)dξ ≈ M∑ k=1 Lk∆ξk. (3.13) Straight-forward estimation of the mass element by ∆ξk = ξk−1 − ξk implies a numerical integration error of O(N−1). Any other valid numerical integration method would also be an option, such as the trapezoidal rule ∆ξk = 1 2(ξk−1 − ξk+1) which lowers the error 2Sample from Beta distribution with parameters N and 1. 17 3. Nested sampling 0 1 ξ L(ξ) ξk = e−k/N Figure 3.3: Sequence of discarded points obtained by estimating the prior mass fractions ξk statistically. The data is artificially produced. to O(N−2) in most cases [9]. Improvement from using the trapezoidal rule will however be small as the principal source of error is assumed to originate from the crude estimate ξk = e−k/N that will be discussed further in Section 3.2.5. Along with each pair (ξk, Lk) the algorithm also produces a parameter sample θk with an associated probability weight wk = 1 Z Lk∆ξk. (3.14) Together θk and wk form a list of weighted samples that can be used to estimate any quantity Q(θ) with respect to the posterior via E[Q] = ∫ Q(θ)P(θ)dnθ ≈ M∑ k=1 wkQ(θk) (3.15) The shape of the posterior can be extracted by binning the obtained samples θk, creating a histogram that can be visualized by plotting its one- and/or two-dimensional projec- tions. As opposed to MCMC samples, the samples produced by nested sampling are not equally weighted implying that their contribution to the bin values (or “heights”) should be proportional to their associated weights wk. We also see this in the formula for the ex- pectation value (3.15), which takes into account that the samples are unequally weighted. It is however possible to extract an equally weighted subset of the nested samples making storage more convenient. One method for extracting samples with equal weights is stair- case sampling [4]. In summary, the nested sampling procedure described above provides natural means of both computing the evidence Z and revealing the structure of the posterior distribution P(θ). An additional quantity, relevant for post-analysis, is the information 18 3. Nested sampling H = P(θ) ln [P(θ) π(θ) ] dnθ = ∫ 1 0 P (ξ) lnP (ξ)dξ ≈ M∑ k=1 wk ln [ Lk Z ] (3.16) being a logarithmic measure of the prior-to-posterior shrinkage. A useful interpretation is H = “information gained going from the prior to the posterior” [36]. The information H (3.16) is in the nested sampling context used to formulate an uncertainty estimation which we shall discuss in Section 3.2.5. 3.2.3 Pseudocode Algorithm 3.1 portrays the basic principle of the nested sampling procedure in pseudocode format. Nested sampling is performed M times, pre-defined by the user. In practice it is preferable to automate the termination by imposing a condition for when to stop sampling, this will be further discussed next in Section 3.2.4. Algorithm 3.1: Nested sampling procedure. Input: Prior π(θ) and likelihood L(θ) Output: Evidence Z and posterior samples θk with weights wk Initialization Draw N active points θa 1, ...,θ a N from the prior π(θ) Set Z ← 0 and ξ0 ← 1 for k = 1, 2, ...,M do θk ← θa worst, point in the current set θa 1, ...,θ a N with lowest likelihood Lk ← L(θk), likelihood bound ξk ← exp(−k/N) ∆ξk ← ξk−1 − ξk or e.g. 
    ∆ξ_k ← ξ_{k−1} − ξ_k, or e.g. ½(ξ_{k−1} − ξ_{k+1})
    w_k ← L_k ∆ξ_k
    Z ← Z + w_k
    Replace θ^a_worst with an independent sample from π(θ) obeying L(θ) > L_k
  Normalize the weights, i.e. w_k ← w_k/Z

3.2.4 Termination

The main nested sampling loop can be set to terminate after a fixed number of steps M. It may however be desirable to implement a more well-motivated stopping criterion. One such approach is to stop when the remaining contribution to Z is smaller than some (small) fraction f. At iteration k, the remaining contribution is taken to be bounded from above by the maximum likelihood of the set of active points, $L^a_\mathrm{max} = \max(L(\boldsymbol\theta^a_1), ..., L(\boldsymbol\theta^a_N))$, multiplied by the current prior mass ξ_k. The condition then reads
\[
  L^a_\mathrm{max}\,\xi_k < f Z_k \implies \text{Termination}, \tag{3.17}
\]
where Z_k is the evidence accumulated at iteration k. If $L^a_\mathrm{max}$ is close to the true maximum likelihood, this stopping criterion implies that approximately all but a fraction f of the evidence has been accounted for. The accumulated Z usually increases rapidly as regions of higher likelihood are found and flattens out when the bulk of the posterior is reached and the decrease in range $\Delta\xi_k \propto e^{-k/N}$ starts to dominate the increases in L_k. The amount of prior mass associated with the region of the posterior bulk is roughly estimated by $\xi \approx e^{-H}$, which in general could be very small. Recalling that $\xi_k = e^{-k/N}$, the number of steps taken to reach the bulk according to Equation (3.11) is $NH \pm \sqrt{NH}$, resembling the mean and standard deviation of a Poisson distribution. In practice — where computations are performed in terms of logarithmic quantities for numerical stability — the condition is implemented in the form
\[
  \ln(Z_k + L^a_\mathrm{max}\,\xi_k) - \ln(Z_k) = \ln\!\left(1 + \frac{L^a_\mathrm{max}\,\xi_k}{Z_k}\right) < f_\mathrm{ln} \implies \text{Termination}. \tag{3.18}
\]
Equation (3.18) is equivalent to Equation (3.17) if $f_\mathrm{ln} = \ln(1 + f)$, which means that if $f \ll 1$ then $f_\mathrm{ln} \approx f$.

3.2.5 Uncertainty estimation

The major source of uncertainty is assumed to stem from the variability in the number of nested sampling steps taken to reach the bulk of the posterior mass, $NH \pm \sqrt{NH}$. This corresponds to an uncertainty in ln ξ of $\sqrt{NH}/N = \sqrt{H/N}$, which through Equation (3.13) also gives an uncertainty in ln Z of $\sqrt{H/N}$. The full expression for ln Z including the uncertainty estimate is thus
\[
  \ln Z \approx \ln\!\left(\sum_{k=1}^{M} L_k \Delta\xi_k\right) \pm \sqrt{\frac{H}{N}}. \tag{3.19}
\]
Expressing the uncertainty as an additive term to ln Z, rather than as a geometrical factor $e^{\pm\sqrt{H/N}}$ on Z, is common practice since the distribution of ln Z should have better properties in terms of symmetry and similarity to a Gaussian. It would also be disadvantageous to translate the uncertainty to an additive deviation in Z, as that would make it possible for negative Z values to lie within the quoted range. As an example, an assumed normally distributed ln Z = 100 ± 10 would naively translate to Z = e^150 ± e^200, which obviously is a useless statement.

3.3 Generating new points

Until now the details of how to generate new points as the nested sampling algorithm progresses have been left out. This task is at the very core of the method and is of utmost importance for obtaining useful results. Each iteration requires a new sample θ which needs to:

(a) be sampled in proportion to the prior π(θ),
(b) be (approximately) independent from other samples,
(c) obey the current likelihood constraint L(θ) > L∗.

These requirements make the task of generating new points non-trivial and worthy of special attention. The likelihood constraint (c) stands out as it exponentially reduces the range of active prior mass to explore as sampling progresses.
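To make the structure of Algorithm 3.1 and the stopping condition (3.18) concrete, the following is a minimal Python sketch of the outer loop with the likelihood-constrained prior sampling left as a black box. The function sample_constrained_prior is a hypothetical placeholder for the methods discussed below, and the sketch is an illustration of the bookkeeping in log space only, not the implementation used in this work.

import numpy as np

def nested_sampling(log_likelihood, sample_prior, sample_constrained_prior,
                    N=1000, f_ln=0.01, max_iter=100_000):
    """Sketch of Algorithm 3.1 with the termination condition of Eq. (3.18)."""
    active = sample_prior(N)                        # active points theta^a
    logL_active = np.array([log_likelihood(p) for p in active])
    logZ = -np.inf                                  # accumulated ln Z
    log_xi = 0.0                                    # ln xi_0 = 0
    dead_logL, dead_logw = [], []                   # stored (ln L_k, ln w_k)

    for k in range(1, max_iter + 1):
        worst = np.argmin(logL_active)              # point with lowest likelihood
        logL_k = logL_active[worst]
        log_xi_new = -k / N                         # ln xi_k = -k/N, Eq. (3.12)
        # ln(Delta xi_k) with Delta xi_k = xi_{k-1} - xi_k
        log_dxi = log_xi + np.log1p(-np.exp(log_xi_new - log_xi))
        log_wk = logL_k + log_dxi                   # ln(L_k * Delta xi_k)
        logZ = np.logaddexp(logZ, log_wk)           # Z <- Z + w_k
        dead_logL.append(logL_k)
        dead_logw.append(log_wk)

        # Termination, Eq. (3.18): remaining contribution below tolerance f_ln
        log_remain = np.max(logL_active) + log_xi_new
        if np.logaddexp(logZ, log_remain) - logZ < f_ln:
            break

        # Replace the worst point by a constrained prior sample with L > L_k
        active[worst], logL_active[worst] = sample_constrained_prior(
            active, logL_active, logL_star=logL_k)
        log_xi = log_xi_new

    weights = np.exp(np.array(dead_logw) - logZ)            # Eq. (3.14)
    H = np.sum(weights * (np.array(dead_logL) - logZ))      # Eq. (3.16)
    return logZ, np.sqrt(H / N), weights                    # Eq. (3.19)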
In many sampling scenarios one often resorts to MC methods in general and MCMC methods in particular (see Section 2.2). This approach is available also in this case, but with the addition of the likelihood constraint. For the remaining part of this thesis the main topic will be MCMC methods adjusted to generate samples from the constrained prior in the context of nested sampling. Modified versions of MCMC methods for constrained prior sampling are introduced in the following sections and are further described and implemented in Chapter 4.

3.3.1 Constrained Metropolis

The standard Metropolis-Hastings algorithm, described in Section 2.2.1, is likely the best known among MCMC methods. As suggested by Sivia and Skilling [4] and followed up by Feroz and Hobson [11], it can be applied in the context of nested sampling by simply adding the likelihood constraint L(θ) > L∗ to the acceptance conditions. At each nested sampling iteration, one of the active points θ is picked at random as the start of a random walk. A step to a new point θ′ is generated from a symmetric proposal distribution Q(θ′|θ) and accepted with probability
\[
  \alpha =
  \begin{cases}
    \min\!\left(1, \dfrac{\pi(\boldsymbol\theta')}{\pi(\boldsymbol\theta)}\right), & \text{if } L(\boldsymbol\theta') > L^*\\[1ex]
    0, & \text{otherwise}.
  \end{cases} \tag{3.20}
\]
The proposal distribution is usually an isotropic Gaussian $Q(\boldsymbol\theta'|\boldsymbol\theta) = \mathcal{N}(\boldsymbol\theta'; \boldsymbol\theta, \sigma_Q^2 \mathbb{1}_n)$, where the scale σ_Q is a free parameter that needs to be tuned in relation to the problem. However, it is fully possible to use a Gaussian with a general covariance matrix Σ_Q. As the Metropolis algorithm produces correlated samples, a sufficiently large number of steps N_s needs to be taken in order for the new sample to become independent of its starting point. This might be a cause of concern as the active prior volume shrinks at each iteration and acceptable Metropolis steps become rarer. However, Sivia and Skilling [4] suggest that one should always use N_s ≈ 20, a recommendation that we shall evaluate later. Since the allowed region shrinks in each nested sampling iteration, it would be unwise to keep the distribution scale σ_Q, or scales Σ_Q, constant, as it would make it increasingly harder to propose acceptable positions. This would cause the acceptance rate to drop drastically, making the exploration very inefficient. A scheme for adjusting the step scale will be suggested in Section 4.2.

3.3.2 Galilean Monte Carlo

A likelihood-constrained version of the HMC algorithm has been suggested by Betancourt [37] to produce high-quality samples from the prior in the context of nested sampling. In constrained HMC, the likelihood contour is interpreted as a hard wall, or infinite potential barrier in terms of Hamiltonian mechanics, with a potential of the form
\[
  V_\mathrm{cHMC}(\mathbf{x}) =
  \begin{cases}
    -\ln \pi(\mathbf{x}), & \text{if } L(\mathbf{x}) > L^*\\
    \infty, & \text{otherwise}.
  \end{cases} \tag{3.21}
\]
The time evolution of the trajectories is therefore adjusted such that they specularly reflect if they encounter the L(x) = L∗ barrier. As a starting point for the Hamiltonian evolution, one of the active points x of the current nested sampling iteration is chosen at random, ensuring that the constraint will be satisfied also for a new point x′ if the reflection is properly conducted. This procedure, however, requires the gradient of the prior to be evaluated in every time step. In fact, for a typical problem the prior is often at least approximately constant in regions where the likelihood varies significantly. This means that the prior gradient will be small and that it will not influence the trajectories enough to make it worthwhile to compute.
As a simpler, more straightforward successor to constrained HMC, Skilling suggests Galilean Monte Carlo (GMC) [19, 38, 39]. Time evolution in GMC is performed without the influence of any forces, and therefore without the need for prior gradient evaluation. For the sake of convenience and presentation we shall, without loss of generality, adopt coordinates in which the prior is flat, i.e. π(x) ≡ 1, over the unit hypercube. Such a coordinate transformation is always possible [38] but may be quite impractical, depending on the form of π(x). This transformation is discussed in more detail in Section 4.1. GMC trajectories proceed as straight lines (when the coordinates are transformed back to the original prior space, the trajectories follow geodesics determined by the geometry of the prior, which in general are not straight lines) as long as they are within the allowed region, according to the potential
\[
  V_\mathrm{GMC}(\mathbf{x}) =
  \begin{cases}
    \text{const.}, & \text{if } L(\mathbf{x}) > L^*\\
    \infty, & \text{otherwise}.
  \end{cases} \tag{3.22}
\]
Time evolution is particularly simple: a phase space point (x, v), where v is its velocity (for GMC we adopt the notion of velocity v to distinguish it from the HMC momentum p), will in one time step move to a new point
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}', \mathbf{v}) = (\mathbf{x} + \tau\mathbf{v},\ \mathbf{v}) \quad \text{(proceed)}, \tag{3.23}
\]
where τ is the size of the time step. This step is acceptable under the assumption that x′ is still within the available region. We consequently need a mechanism to reroute and return a point back into the active region if the step according to (3.23) were to cross the likelihood barrier. A natural way to do this is to let the point reflect specularly off the iso-likelihood surface, using the fact that the surface normal vector n is proportional to the gradient of the likelihood ∇L, or equivalently to the gradient of the log-likelihood ∇ ln L. A reflection of this sort, occurring at the point (x′, v), transforms (x, v) according to
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}'', \mathbf{v}') = (\mathbf{x} + \tau\mathbf{v} + \tau\mathbf{v}',\ \mathbf{v} - 2\mathbf{n}\mathbf{n}^T\mathbf{v}) \quad \text{(reflect)}, \tag{3.24}
\]
where $\mathbf{n} = \nabla \ln L(\mathbf{x}')/|\nabla \ln L(\mathbf{x}')|$. A schematic illustration of the GMC exploration, including reflections off the hard walls, is shown in Figure 3.4a. It is obviously unlikely for x′ to lie exactly on the iso-likelihood surface, but the gradient at x′ works as a substitute for the gradient at the ideal reflection point located somewhere between x and x′. The reflected point x′′ will nevertheless often be returned to the active region, making it possible for the trajectory to continue with a redirected velocity. This reflection-by-proxy scheme is illustrated in Figure 3.4b. In the presumably rare case that also L(x′′) > L∗ is violated, a possible option satisfying detailed balance is to reverse the trajectory according to
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}, -\mathbf{v}) \quad \text{(reverse)}. \tag{3.25}
\]
Detailed balance actually allows an additional option in the reflection's mirror point,
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x} + \tau\mathbf{v} - \tau\mathbf{v}',\ -\mathbf{v}') \quad \text{(mirrored reflection)}, \tag{3.26}
\]
which in most cases should lie outside of the allowed region but is nevertheless still possible. For detailed balance (time-reversal symmetry) to remain satisfied under the inclusion of the mirrored reflection, either the reflection or the mirrored reflection should be acceptable if a trajectory is to redirect, but not both. If both were acceptable, the time-reversed trajectory would break detailed balance.

Figure 3.4: Illustrations of the basic principles of Galilean Monte Carlo. (a) Bird's-eye view illustration of a typical GMC trajectory propagating in a constrained region of parameter space. (b) Detailed illustration of the GMC reflection off the proxy iso-likelihood surface.
As the active region shrinks during the nested sampling iterations it is necessary to adjust the effective size of the GMC steps accordingly. If no adjustment of the step size were carried out, fewer and fewer proposed steps would be accepted, making the exploration increasingly inefficient. A GMC implementation is presented in Section 4.3 and includes a suggested scheme for adjusting the effective step size.

3.3.3 Constrained stretch move

As previously stressed, one of the key difficulties of nested sampling is to produce samples from the prior for successively smaller regions of parameter space. Furthermore, this region is typically highly asymmetric, spanning significantly different distances in different dimensions. This problem of varying and separated scales is exactly what the stretch move algorithm, outlined in Section 2.2.3, is designed to manage due to its affine invariance. We therefore propose a constrained version of the stretch move, adapted for the context of nested sampling. The stretch move depends on an ensemble of walkers, a requirement which is naturally met by the active set of N points used in nested sampling. At each iteration one of the active points Θ_k is randomly selected to perform the stretch move random walk, where one of the remaining points Θ_j is used to propose a new position Y according to Equation (2.27). Each step is then accepted with probability
\[
  \alpha =
  \begin{cases}
    \min\!\left(1, \tilde\zeta^{\,n-1}\,\dfrac{\pi(\mathbf{Y})}{\pi(\boldsymbol\Theta_k)}\right), & \text{if } L(\mathbf{Y}) > L^*\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{3.27}
\]
which is of the exact same form as the corresponding constrained Metropolis acceptance probability defined in Equation (3.20). The constrained stretch move is implemented, evaluated and further discussed in Section 4.4.

3.3.4 Ellipsoidal nested sampling

A rather different (non-MCMC) approach to likelihood-constrained prior sampling is ellipsoidal sampling (Mukherjee et al. [12]). From the covariance matrix of the current set of active points it is possible to define an n-dimensional ellipsoid intended to approximate the iso-likelihood surface defined by L(θ) = L∗. Ellipsoidal sampling proceeds by enlarging the ellipsoid by some factor, compensating for the iso-likelihood surface not being perfectly ellipsoidal, and then drawing a sample from the prior volume within this bound. The more closely the ellipsoid approximates the iso-likelihood surface, the higher the probability that the sample fulfills the likelihood constraint. In a hypothetical situation where the iso-likelihood surface is perfectly ellipsoidal, the acceptance ratio is one. Shaw et al. [40] improve the method for multi-modal distributions and give a method for uniform sampling of an ellipsoid (the extension to non-uniform priors is straightforward). Ellipsoidal sampling has proven quite successful and is used in state-of-the-art implementations such as MultiNest by Feroz et al. [10]. The benefits of ellipsoidal sampling are that the samples produced are independent and that this can be achieved with typically very few likelihood evaluations. This is in contrast to MCMC methods, which need several steps for samples to become (approximately) independent, where each of those steps requires the likelihood to be evaluated. On the other hand, the iso-likelihood surface is not always well approximated by the ellipsoid, in which case the method will fail without any warning signal.
No implementation of ellipsoidal sampling has been carried out in this work, but it will act as a benchmark for the MCMC methods outlined above through the use of MultiNest via the Python code PyMultiNest [24].

4 Constrained prior sampling implementations

Three different MCMC schemes for generating constrained prior samples, as required by the nested sampling algorithm, were implemented in this work and are described in this chapter. The implementations are based on the methods introduced in Section 3.3 and will be referred to as constrained Metropolis, GMC and constrained stretch move. These three MCMC methods have all been integrated into a nested sampling framework that was developed in this work and that follows the principles presented in Chapter 3 in general and Algorithm 3.1 in particular. For our purposes we have mainly used N = 1000 active points and imposed a stopping criterion according to Equation (3.18) using f_ln = 0.01. Here follows a description of the specific implementation designs and usage, including e.g. method-specific hyperparameters. Each method has two main input parameters: the number of exploration steps per iteration and a step size scale. Demonstration and testing will be carried out by applying each method to the toy problem introduced in Section 2.1.1.

4.1 Choice of coordinates: the unit hypercube

A nested sampling convention is to work with parameter coordinates u = (u_0, ..., u_{n−1}) in which the prior is uniform over a cube of unit volume in n-dimensional space. This convention is also adopted in this work and has already been mentioned in the context of GMC in Section 3.3.2. If the joint prior probability is independent, meaning that it can be written as a product of individual parameter priors, $\pi(\boldsymbol\theta) = \pi_0(\theta_0)\cdots\pi_{n-1}(\theta_{n-1})$ (this is the case for the prior defined in Equation (2.19)), the transformation relating θ to u is
\[
  \theta_i(u_i) = \Pi_i^{-1}(u_i), \tag{4.1}
\]
where $\Pi_i(\theta_i) = \int_{-\infty}^{\theta_i} \pi_i(\theta_i')\,d\theta_i'$ is the cumulative distribution function (cdf) for θ_i. The MCMC implementations described below are formulated in terms of the coordinates u, but the obtained results are transformed back to θ using Equation (4.1). A consequence of working in coordinates where the prior is flat is that any Metropolis-style rejection step simplifies: the acceptance probability α (see e.g. Equations (3.20) and (3.27)) contains a prior ratio π(θ′)/π(θ), which cancels if the prior is constant.

4.2 Constrained Metropolis implementation

The key object in the Metropolis scheme is the proposal distribution Q(u′|u) from which new positions u′ are proposed in the random walk, depending on the current position u. The common approach of taking Q to be a symmetric Gaussian centered at u with a fixed covariance matrix $\Sigma_Q = \sigma_Q^2 \mathbb{1}_n$ (see Section 2.2.1) would be quite insufficient for the requirements of nested sampling. Fixing the scale σ_Q of the proposal distribution would not be compatible with the exponentially shrinking domain of nested sampling. We therefore propose that information from the active set of points $\mathbf{u}^a_1, ..., \mathbf{u}^a_N$ be utilized to estimate the scale of the active region. Firstly, we randomly select a starting point u_start from the active set. Secondly, a subset $\{\mathbf{u}_m\}_{m=1}^{M}$ of the remaining active points is created, also at random. The size of the subset, M, is in this work always taken to be $M = \max(1, \lfloor N/10 \rfloor)$.
Thirdly, the covariance matrix elements of the proposal distribution are set according to
\[
  (\Sigma_Q)_{ij} =
  \begin{cases}
    \dfrac{s^2}{M} \sum_{m=1}^{M} (\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2, & \text{if } i = j\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{4.2}
\]
where $(\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2$ denotes the squared distance along dimension i and s is an overall scale parameter. With this formula, Σ_Q is a diagonal matrix with variances on the diagonal proportional to estimates of the typical squared distances for each dimension in the active set of points and, by extension, in the active region. A typical value of the scale parameter is s ∼ 0.1 in order for proposed steps to be well within the allowed region. Effects of different choices of s are studied in Section 4.6.1. To introduce additional randomness, the number of steps taken in every walk is drawn from a discrete uniform distribution $U(\tfrac{1}{2}\langle N_s\rangle, \tfrac{3}{2}\langle N_s\rangle)$, where the average ⟨N_s⟩ is set by the user. The proposed constrained Metropolis algorithm is shown in Algorithm 4.1.

Algorithm 4.1: Constrained Metropolis procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start and Σ_Q from u^a_1, ..., u^a_N, M and s
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  t ← 0
  while t < N_s or acceptance is zero do
    u′ ← sample from Q(u′|u^(t)) = N(u′; u^(t), Σ_Q)
    if L(θ(u′)) ≥ L∗ then
      u^(t+1) ← u′
    else
      u^(t+1) ← u^(t)
    t ← t + 1
  u_new ← last position of the chain

An example of a constrained Metropolis walk can be seen in Figure 4.1 and is, for the sake of visualization, conducted in two dimensions. The walk starts at the green plus-shaped marker and ends at the position of the red cross, which is promoted to membership of the active set. The hyperparameter settings used in this example are ⟨N_s⟩ = 20 and s = 0.5. The number of unique steps in the walk in Figure 4.1 is however fewer than ½⟨N_s⟩ = 10, meaning that not all steps have been accepted. The acceptance rate r_a, defined as the fraction of accepted steps in an MCMC walk or trajectory, is an important measure for assessing the exploration performance and will be discussed later in this thesis (e.g. Section 4.6.1).

Figure 4.1: Example of a Metropolis random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the walk, respectively. The black dots indicate intermediate steps. The average number of steps for this walk is ⟨N_s⟩ = 20 and the scale parameter is s = 0.5.

4.3 Galilean Monte Carlo implementation

The GMC equivalent of the Metropolis proposal distribution Q discussed in Section 4.2 is the velocity distribution q(v), from which the initial velocity is sampled at the beginning of each iteration (analogous to the HMC case discussed in Section 2.2.2). By the same arguments as in the Metropolis case above, this distribution needs to adapt to the shrinking active volume. We therefore similarly suggest that q(v) be a Gaussian N(v; 0, Σ_vel), where the covariance matrix elements at each iteration are set according to
\[
  (\Sigma_\mathrm{vel})_{ij} =
  \begin{cases}
    \dfrac{1}{M} \sum_{m=1}^{M} \dfrac{1}{\tau_\mathrm{ref}^2} (\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2, & \text{if } i = j\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{4.3}
\]
where u_start and u_m are the same as before and τ_ref = 1 is a reference time scale which gives the covariances units of velocity squared.
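The diagonal scale estimates in Equations (4.2) and (4.3) are straightforward to compute from the active set. The following NumPy sketch illustrates one possible way of doing so; the function name and the subset_frac argument are illustrative only and not part of the implementation described in this chapter.

import numpy as np

def diagonal_proposal_scales(u_start, active, s=0.5, subset_frac=0.1, rng=None):
    """Per-dimension proposal variances as in Eq. (4.2).

    u_start : (n,) current starting point (one of the active points)
    active  : (N, n) array with the remaining active points
    s       : overall scale parameter
    Returns the diagonal of Sigma_Q. Dividing by tau_ref**2 instead of
    multiplying by s**2 gives the velocity variances of Eq. (4.3).
    """
    rng = np.random.default_rng() if rng is None else rng
    M = max(1, int(len(active) * subset_frac))      # roughly M = max(1, floor(N/10))
    subset = active[rng.choice(len(active), size=M, replace=False)]
    sq_dist = (u_start - subset) ** 2               # (u_start - u_m)_i^2 per dimension
    return s**2 * sq_dist.mean(axis=0)              # diagonal of Sigma_Q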
The lack of a variable overall length scale parameter in Equation (4.3), compared to the corresponding s in Equation (4.2), is due to the time step formulation "u + τv" of GMC, where τ is used to set the overall step size. We will in Section 4.6.1 see the impact of specific choices of τ. Furthermore, the number of steps taken in the trajectory is set randomly in the same way as in Section 4.2. Consequently, the average number of steps ⟨N_s⟩ together with the scale parameter τ constitutes the main user input for GMC. Following this scheme and the principles laid out in Section 3.3.2, the full implemented GMC procedure is shown in Algorithm 4.2. The four possible outcomes after a GMC step described in Section 3.3.2 are here denoted North (proceed), East (reflect), West (mirrored reflection) and South (reverse). It is important to stress that it is the gradient of the likelihood with respect to u that should be computed to obtain the normal to the reflection surface. This is related to the gradient with respect to θ by the chain rule via the Jacobian matrix J of the transformation in Equation (4.1), according to $\nabla_u \to J(\mathbf{u})\nabla_\theta$. The Jacobian matrix element definition used here is $J_{ij} = \partial\theta_j/\partial u_i$.

Algorithm 4.2: Galilean Monte Carlo procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start and Σ_vel from u^a_1, ..., u^a_N and M
  v ← sample from q(v) = N(v; 0, Σ_vel)
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  for t = 0, 1, ..., N_s − 1 do
    u′ ← u^(t) + τv
    N ← L(θ(u′)) ≥ L∗                 # Continue north?
    if N then
      u^(t+1) ← u′                    # Go north
    else
      n ← ∇_u ln L(θ(u′)) / |∇_u ln L(θ(u′))|
      v′ ← v − 2nn^T v
      # Check possible directions
      E ← L(θ(u′ + τv′)) ≥ L∗         # Reflection to east
      W ← L(θ(u′ − τv′)) ≥ L∗         # Mirrored reflection to west
      S ← L(θ(u′ − τv)) ≥ L∗          # Reversal to south
      if S and (E but not W) then
        u^(t+1) ← u′ + τv′            # Go east
        v ← v′
      else if S and (W but not E) then
        u^(t+1) ← u′ − τv′            # Go west
        v ← −v′
      else
        u^(t+1) ← u^(t)               # Aim south
        v ← −v
  u_new ← last point of the trajectory, u^(N_s)

Figure 4.2 shows an example of a GMC trajectory in n = 2 dimensions contained by an iso-likelihood contour. The trajectory is initiated at the green plus-shaped marker, reflects twice off the walls and ends at the red cross. The reflections in the figure do not appear physically correct given the directions of the iso-likelihood surface. This is, however, a combined effect of the different scales on the two axes and of the fact that the trajectory is presented in the original coordinates θ whereas the simulation is performed in the flat coordinates u. The parameters used for this example are ⟨N_s⟩ = 20 and τ = 0.1.

Figure 4.2: Example of a GMC trajectory in a likelihood-constrained region of a two-dimensional parameter space. The trajectory is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the trajectory, respectively. The black dots indicate intermediate time steps. The reflections do not occur exactly at the boundary but rather at proxy surfaces just outside the boundary (not in figure), as described in Section 3.3.2.
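A compact Python sketch of a single GMC trajectory, following the proceed/reflect/mirror/reverse logic of Algorithm 4.2 for a flat prior on the unit cube, is given below. Here log_L and grad_log_L are hypothetical problem-specific callables (the log-likelihood and its gradient with respect to u); handling of the unit-cube edges and the randomized step count are omitted for brevity, so this is an illustration of the step logic rather than the full implementation.

import numpy as np

def gmc_trajectory(u0, v, tau, n_steps, log_L, grad_log_L, logL_star):
    """One Galilean MC trajectory in unit-cube coordinates (sketch of Alg. 4.2)."""
    inside = lambda u: log_L(u) > logL_star        # likelihood constraint L > L*
    u = u0.copy()
    for _ in range(n_steps):
        u_prop = u + tau * v
        if inside(u_prop):                         # go north (proceed)
            u = u_prop
            continue
        g = grad_log_L(u_prop)
        n = g / np.linalg.norm(g)                  # proxy surface normal, Eq. (3.24)
        v_ref = v - 2 * n * (n @ v)                # specularly reflected velocity
        east = inside(u_prop + tau * v_ref)        # reflection viable?
        west = inside(u_prop - tau * v_ref)        # mirrored reflection viable?
        south = inside(u_prop - tau * v)           # reversal viable?
        if south and east and not west:
            u, v = u_prop + tau * v_ref, v_ref     # go east (reflect)
        elif south and west and not east:
            u, v = u_prop - tau * v_ref, -v_ref    # go west (mirrored reflection)
        else:
            v = -v                                 # aim south (reverse), stay at u
    return u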
4.4 Constrained stretch move implementation

One major advantage of the combination of the stretch move and nested sampling, proposed in this thesis, is that it requires no explicit adjustment as the nested sampling process progresses and the active volume shrinks. The stretch move proposal distribution naturally adapts to the decreasing scales of parameter space and there is no need to introduce an approach similar to the covariance matrix updates defined in Equations (4.2) and (4.3). The number of stretch move steps is randomly drawn as above and the average number of steps ⟨N_s⟩ is set by the user. Apart from the number of steps, the user additionally needs to specify the scale parameter a that sets the extent of the distribution g(ζ) defined in Equation (2.28). In Section 4.6.1 we will study the sampling performance based on the choice of a. The full stretch move procedure is shown in Algorithm 4.3. A constrained stretch move walk in n = 2 dimensions is shown in Figure 4.3, with hyperparameters ⟨N_s⟩ = 20 and a = 2. It is clear that the step sizes seem to fluctuate more compared to the Metropolis walk in Figure 4.1. This could potentially be an effect of the fact that the Metropolis procedure uses the same proposal distribution in every step, whereas the stretch move proposal distribution changes in every step depending on the randomly selected point u_j. This is, however, the result of only a single iteration from a single run, and one should be careful when drawing conclusions from it.

Algorithm 4.3: Constrained stretch move procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start from u^a_1, ..., u^a_N
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  t ← 0
  while t < N_s or acceptance is zero do
    Draw u_j randomly from the complementary set {u^a_j}, j = 1, ..., N, excluding u_start
    ζ̃ ← random sample from g(ζ), Equation (2.28)
    u′ ← u_j + ζ̃(u^(t) − u_j)
    α ← ζ̃^(n−1), Equation (3.27)
    r ← random sample from U(0, 1)
    if r ≤ α and L(θ(u′)) ≥ L∗ then
      u^(t+1) ← u′
    else
      u^(t+1) ← u^(t)
    t ← t + 1
  u_new ← last position of the chain

Figure 4.3: Example of a stretch move random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the walk, respectively. The black dots indicate intermediate steps.

4.5 Monitoring the sampling progress

The progress of the sampling over the course of a nested sampling run can be examined by tracking a few key quantities. Examples of such quantities are the likelihood L defined in Equation (3.4), the posterior weights w = L∆ξ/Z defined in Equation (3.14) and the acceptance rate
\[
  r_a = \frac{\#\text{ steps accepted}}{\#\text{ steps proposed}} \tag{4.4}
\]
in each iteration. The tracking of these quantities is shown in Figure 4.4 for the three methods applied to the problem of Sections 2.1.1 and 2.1.4 in n = 3 dimensions using ⟨N_s⟩ = 40. The natural choice of progress parameter is − ln ξ, which is directly proportional to the iteration number k according to Equation (3.12). We see clearly that it takes a considerable fraction of the total number of iterations before significant likelihood values are reached (top panel).
The peak of the weights (middle panel) indicates where the bulk of the posterior probability mass is located, and it is obvious that this does not coincide with the point of maximum likelihood. The reason for this is that as the likelihood increases, the prior mass elements, ∆ξ ∝ e^{−k/N}, decrease exponentially. This tug of war implies the existence of an optimal point where w is maximized. The vertical lines (middle panel) accompanied by bands (see the inset) show the average and uncertainty, NH ± √(NH), of the number of steps needed to reach the bulk, as discussed in Section 3.2.5, where the information H is computed according to Equation (3.16). These estimates seem to predict reasonably accurately where the weights have reached significant values, i.e. where the bulk of the posterior mass is. The most obvious difference between the methods is, however, the evolution of the acceptance rate r_a (bottom panel). The acceptance rate values typically vary rapidly between individual iterations, making it necessary to apply a moving median (the median equivalent of a moving, or rolling, average, which is better suited for constrained and skewed data [41], as is the case for r_a) to read out the overall behavior. This has been done to obtain the acceptance rate curves in Figure 4.4, and the associated band edges show the 16th and 84th percentiles, respectively. We see from these percentiles that the typical variation between iterations is reasonably large for all three methods, especially for GMC in the second half of the run where we also observe a prominent asymmetry. Furthermore, it is natural for the GMC acceptance rate to be larger compared to the other methods, as the GMC step should be smaller in order to create a trajectory rather than a random walk. The most drastic change is seen for the Metropolis acceptance rate, whose median drops to approximately a third of its initial value. The drop ends roughly in the region where the posterior mass starts to accumulate, which means this is where the algorithm struggles the most to find acceptable proposal steps. Remarkably, the stretch move median acceptance rate is fairly unchanged over the course of the run, which suggests that the proposal distribution adapts well to the successively stricter likelihood constraint. The results of the nested sampling runs described above can be seen in Table 4.1. The computed evidence values ln Z differ between the methods but agree fairly well with the true value ln Z_true = 8.09 obtained in Section 2.1.4 for the model with n = 3. The uncertainty estimates √(H/N) (see Section 3.2.5) are, to two significant figures, equal for the three methods. This is a general observation that has been made throughout this work and that we will return to in Section 5.2.1. The number of likelihood evaluations, N_L, is seen to be almost twice as large for GMC as for the other methods, even though the number of steps is the same. The first reason for this is that the GMC algorithm requires evaluations of the likelihood gradient in the event of a reflection, which is here counted as an extra likelihood evaluation and included in N_L. The fraction of gradient evaluations is given in parentheses in Table 4.1. The second reason is that more than one likelihood evaluation occurs in the event of a reflection, in order to assess the reflection conditions according to Algorithm 4.2.
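As an aside, the moving median and 16th/84th percentile bands used to smooth the acceptance-rate traces in Figure 4.4 can be computed with a few lines of NumPy. The sketch below is illustrative only; in particular, the window length is an arbitrary choice and not the one used to produce the figure.

import numpy as np

def rolling_stats(r_a, window=201):
    """Moving median and 16th/84th percentile bands of an acceptance-rate trace."""
    half = window // 2
    med, lo, hi = [], [], []
    for k in range(len(r_a)):
        chunk = r_a[max(0, k - half):k + half + 1]
        med.append(np.median(chunk))
        lo.append(np.percentile(chunk, 16))
        hi.append(np.percentile(chunk, 84))
    return np.array(med), np.array(lo), np.array(hi)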
Figure 4.4: Progress of three key quantities over the course of a nested sampling run for the three different methods (Metropolis with s = 0.5, GMC with τ = 0.1, stretch move with a = 2). The sampling is performed in n = 3 dimensions using ⟨N_s⟩ = 40 with sampling parameters N = 1000 and f_ln = 0.01. The top panel shows the likelihood L for the worst point of each iteration, the middle panel shows the posterior weights w, used to compute the evidence, and the bottom panel shows the moving median of the acceptance rate r_a for each iteration, all as functions of − ln ξ. The methods distinguish themselves clearly in the variation of the acceptance rate.

Table 4.1: Nested sampling results for the three methods applied to the same problem in n = 3 dimensions. Sampling parameters are N = 1000 active points and f_ln = 0.01 tolerance. The true evidence value for this model is ln Z_true = 8.09, as was analytically obtained in Figure 2.3. The number of likelihood evaluations, N_L, does, in the GMC case, include the number of evaluations of the gradient. The fraction of gradient evaluations is given in parentheses.

| Method       | ⟨N_s⟩ | Scale parameter | ln Z ± √(H/N) | N_L                 | # Iterations  |
|--------------|-------|-----------------|---------------|---------------------|---------------|
| Metropolis   | 40    | s = 0.5         | 8.13 ± 0.10   | ∼ 6.65 · 10^5       | ∼ 1.65 · 10^4 |
| GMC          | 40    | τ = 0.1         | 8.32 ± 0.10   | ∼ 1.12 · 10^6 (13%) | ∼ 1.63 · 10^4 |
| Stretch move | 40    | a = 2.0         | 7.99 ± 0.10   | ∼ 6.68 · 10^5       | ∼ 1.66 · 10^4 |

4.6 Hyperparameters

The three constrained MCMC implementations described in the previous sections have in common that they have two main hyperparameters:

• a scale parameter effectively setting the overall step size, denoted s, τ and a, respectively, in the three approaches
• the average number of steps ⟨N_s⟩.

These hyperparameters need to be set by the user and it is therefore important to have an idea of how they influence the sampling performance. We will therefore proceed by studying the effects of different specific hyperparameter choices.

4.6.1 Sensitivity to the scale parameter value

The scale parameters s, τ and a act as specifiers of the overall step size for the three MCMC exploration methods. They are, however, defined in different ways and will therefore not take on the same values. For instance, the stretch move parameter a defines an interval [a^{−1}, a] from which ζ̃ is drawn (see Equation (2.28)), meaning that we must have a > 1 for this formulation to be reasonable. The Metropolis parameter s, on the other hand, measures the step size relative to the current active volume and should therefore be ∼ 0.1–1.0: not so large that proposed steps frequently fall outside the allowed region, but not so small that the walk fails to cover enough distance. The time step τ of the GMC algorithm should typically be smaller than s such that the proxy surface approximation is acceptable. For the idea of a trajectory to be valid, at least several GMC steps should be taken in a straight line before a reflection occurs, and we should therefore have τ ∼ 0.1. To measure the impact of the scale parameters we apply the methods to the problem presented in Sections 2.1.1 and 2.1.4 using different scale parameter values. The top row of Figure 4.5 shows the error made when computing the (log) evidence, ln Z, for a model of the form (2.8) with n = 3 for different choices of s, τ and a.
It is important to stress that this type of error can only be obtained when the true value Z_true is analytically obtainable, as was done in Section 2.1.4 and shown in Figure 2.3. The bands are standard deviations between five identical runs using different random seeds. The sampling was performed using N = 1000 active points, f_ln = 0.01 tolerance and ⟨N_s⟩ = 40 exploration steps per iteration (on average). Note the logarithmic axes for s and τ. The Metropolis method (left column) slightly overestimates the evidence over the entire range in s, but there appears to be a minimum in the error around s ≈ 0.25–0.5. This is in agreement with the argument above that s should not be so large that too many proposed steps fall outside the allowed region, but not so small that not enough distance is covered. The GMC error (middle column) is seen to be roughly constant for τ ≈ 0.05–0.1 before it drastically diverges at τ ≈ 0.2. This is probably a consequence of too low an acceptance rate causing the trajectories to explore the available parts of parameter space insufficiently. The stretch move error (right column) is in this case seen to be the closest to zero among the three methods and is also quite stable to variations of a around the value a = 2.0 proposed by [31] for the unconstrained case. Note, however, that the scale for a is linear, as opposed to those for s and τ. To quantify how efficiently each method explores parameter space we need to compute the acceptance rate r_a for each nested sampling iteration, defined in Equation (4.4). The value of the acceptance rate will typically vary significantly between individual nested sampling iterations, but will on a larger scale show clear trends, as seen in the bottom panel of Figure 4.4. As the nested sampling compression approaches the bulk of the posterior mass, the active region will have shrunk enough to make r_a tend to lower values. As argued in Sections 3.2.4 and 3.2.5 and shown in the middle panel of Figure 4.4, the bulk is approximately reached when − ln ξ ≈ H, which roughly should be where the value of r_a is of most importance. This motivates us to introduce the bulk median of the acceptance rate, M_H[r_a], defined as the median of the recorded values r_{a,k} over all nested sampling iterations k for which − ln ξ_k ≥ H. Formally,
\[
  \mathcal{M}_H[r_a] = \mathrm{Median}\left[\{r_{a,k},\ \forall k : -\ln\xi_k \geq H\}\right], \tag{4.5}
\]
where the information H is computed according to Equation (3.16). The bulk median M_H[...] should be interpreted as a point estimate of a quantity (the acceptance rate in this case) in the region where it matters the most, i.e. where the bulk of the probability mass is located. Values of M_H[r_a] for the three methods are shown for different values of the scale parameters in the middle row of Figure 4.5. The bands (barely visible) are the same as for the evidence errors in the top row. There is a clear decrease in M_H[r_a] for each of the three methods as their scale parameters are increased. This is to be expected, as a larger step size implies a higher probability for a proposed step to fall outside the active volume. The most drastic change is for the Metropolis case (left column), which drops to near zero already at s ≈ 1.0. Moreover, the acceptance rate should not be too large, as indicated by the evidence error in the top row, which is minimized at s ≈ 0.25–0.5, corresponding to a bulk median acceptance rate of M_H[r_a] ≈ 0.2–0.4.
The stretch move implementation (right column) also shows a significant drop in acceptance rate as a increases. However, the drop is more moderate compared to Metropolis. Comparing this result to the top row of Figure 4.5, we see that M_H[r_a] ≈ 0.4–0.6 for a ≈ 1.5–3.0 seems to be a suitable range, as this is where the evidence computed by the stretch move implementation fluctuates the least. For the GMC implementation (middle column) it is quite noteworthy how the character of the dependence of M_H[r_a] on τ differs from the other cases. A plausible explanation for this fundamentally different behavior is that a single GMC trajectory, as opposed to a Metropolis or stretch move walk, is not a random walk and does not form a Markov chain. Once the GMC velocity is set, the step size does not change during the course of the trajectory, and the direction only changes in the event of a reflection. In contrast, the Metropolis and stretch move procedures draw both direction and size randomly for each step, an important distinction. The bottom row of Figure 4.5 shows how the total number of likelihood evaluations, N_L, varies with the scale parameters. Note that for the random walk methods, Metropolis and stretch move, N_L will only exceed N_s × (# iterations) if there are iterations where none of the first N_s proposed steps are accepted. This is a consequence of the while-loop conditions in Algorithms 4.1 and 4.3, which ensure that at least one step is accepted. This is the reason behind the increase in N_L for the Metropolis case (left column) when s is increased. Furthermore, there seems to be a sudden jump between s = 1.0 and s = 2.0, indicating the existence of a threshold in s above which almost no steps are accepted. This agrees well with the observation of the Metropolis acceptance rate (middle row, left column), which for s = 2.0 is close to zero. In contrast, the str