DF Constrained space MCMC methods for nested sampling Bayesian computations Master’s thesis in Physics and Astronomy JACOB OLANDER Department of Physics CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2020 Thesis for the degree of Master of Science Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER DF Department of Physics Division of Subatomic, High Energy and Plasma Physics Theoretical Subatomic Physics Chalmers University of Technology Gothenburg, Sweden 2020 Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER © JACOB OLANDER, 2020. Supervisor and examiner: Christian Forssén, Department of Physics Department of Physics Division of Subatomic, High Energy and Plasma Physics Theoretical Subatomic Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Visualizations of likelihood-constrained MCMC random walks for prior sampling in two dimensions. Three different methods are shown: constrained Metropolis (left), Galilean Monte Carlo (middle) and the constrained stretch move (right). The likelihood constraint is illustrated by the ellipsoid-shaped bound and the prior by the underlying density color map. Typeset in LATEX, template by David Frisk Printed by Chalmers Reproservice Gothenburg, Sweden 2020 iv Constrained space MCMC methods for nested sampling Bayesian computations JACOB OLANDER Department of Physics Chalmers University of Technology Abstract Natural phenomena can in general be described using several different scientific models, which creates a need for systematic model selection. Bayesian model comparison as- signs relative probabilities to a set of possible models using the model evidence (marginal likelihood), obtained by an integral that in general needs to be evaluated numerically. Nested sampling is a conceptual framework that efficiently estimates the model evidence and, additionally, provides samples from the model parameter posterior distribution used in Bayesian parameter estimation. A vital step of nested sampling is the likelihood- constrained sampling of the model parameter prior distribution, a task that has proven particularly difficult and that is subject to ongoing research. In this thesis we implement, evaluate and compare three methods for constrained sampling in conjunction with a nested sampling framework. The methods are variants of Markov chain Monte Carlo algorithms: Metropolis, Galilean Monte Carlo and the affine-invariant stretch move, respectively. The latter is applied in the context of nested sampling for the first time in this work. The performances of the methods are assessed by their application to a reference problem that has a known analytical solution. The problem is inspired by effective field theories in subatomic physics where the model parameters take the form of coefficients that are of natural size. We conclude that the efficiency and computational accuracy of nested sampling is strongly dependent on the choice of sampling method and the settings of its associated hyperparameters. In certain cases, especially for high-dimensional parameter spaces, the implementations of this work are seen to achieve better computational accu- racy than MultiNest, a state-of-the-art nested sampling implementation extensively used in astronomy and cosmology. Generally for nested sampling, we observe that it is possible to obtain an inaccurate result without receiving any clear warning signs indicating that this is the case. 
However, we demonstrate that the validity of the computational results can be assessed by monitoring the sampling process. Keywords: Bayesian inference, parameter estimation, model comparison, evidence, nested sampling, MCMC, Metropolis, Galilean Monte Carlo, affine-invariant sampling v Acknowledgements I wish to express my deepest gratitude to my supervisor Christian Forssén for invaluable guidance and for sharing his endless knowledge and wisdom. Furthermore, I would like to thank all members of the Theoretical Subatomic Physics group for interesting discussion and fruitful input during the course of my project. Lastly, I want to recognize the tireless support from my parents, without which this thesis could never have been completed. Jacob Olander, Gothenburg, June 2020 vii Contents List of Figures xi List of Tables xv 1 Introduction 1 1.1 Purpose and scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Bayesian data analysis and numerical methods 3 2.1 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 EFT toy example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.3 Model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.4 EFT toy example revisited . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Markov chain Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Sampling with affine invariance: the stretch move . . . . . . . . . . . 11 3 Nested sampling 13 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Prior mass fraction and sorted likelihood function . . . . . . . . . . 13 3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Statistical properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.2 Obtaining the evidence and posterior . . . . . . . . . . . . . . . . . . 17 3.2.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.4 Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.5 Uncertainty estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Generating new points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.1 Constrained Metropolis . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.2 Galilean Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.3 Constrained stretch move . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.4 Ellipsoidal nested sampling . . . . . . . . . . . . . . . . . . . . . . . 23 4 Constrained prior sampling implementations 25 4.1 Choice of coordinates: the unit hypercube . . . . . . . . . . . . . . . . . . . 25 4.2 Constrained Metropolis implementation . . . . . . . . . . . . . . . . . . . . 25 4.3 Galilean Monte Carlo implementation . . . . . . . . . . . . . . . . . . . . . 27 4.4 Constrained stretch move implementation . . . . . . . . . . . . . . . . . . . 29 ix Contents 4.5 Monitoring the sampling progress . . . . . . . . . . . . . . . . . . . . . . . . 30 4.6 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
33 4.6.1 Sensitivity to the scale parameter value . . . . . . . . . . . . . . . . 33 4.6.2 Choosing a sufficient number of steps . . . . . . . . . . . . . . . . . 36 5 Convergence 37 5.1 The curse of dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2 Divergence in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.1 Debunking the uncertainty estimate . . . . . . . . . . . . . . . . . . 42 6 Conclusion 45 References 49 A Posterior probability distributions I x List of Figures 2.1 The true function g(x) from which synthetic data is generated in the EFT toy example. Data consists of 10 points evenly distributed in 0 < x ≤ 1/π with a relative error c = 5% according to Equation (2.7). . . . . . . . . . . . 7 2.2 Marginal distributions of the analytically derived posterior in the EFT ex- ample for a model with n = 3 parameters. Two-dimensional distributions for pairs of parameters are located below the diagonal on which the one- dimensional distributions are displayed. Point estimates are given for each parameter in terms of their mean values. Error estimates are given in terms of standard deviations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 The model evidence computed for models Mn defined by their parameter space dimensionality n. It is clear that the evidence is maximized for n = 3 which means thatM3 is the model favoured by the data. The evidence for n = 1 is ∼ 0 and is therefore not included in the figure. . . . . . . . . . . . 9 3.1 Illustrations of the relationship between the likelihoods L(θ) and L(ξ), the prior mass fraction ξ and the model evidence Z. . . . . . . . . . . . . . . . 15 3.2 Evolution of active points in a nested sampling iteration. In (a) the worst point is identified and removed. The limit ξ∗ is updated in (b) to the position the point was removed from. A new point is generated in (c), nested within the current limit. . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Sequence of discarded points obtained by estimating the prior mass fractions ξk statistically. The data is artificially produced. . . . . . . . . . . . . . . . 18 3.4 Illustrations of the basic principles of Galilean Monte Carlo. . . . . . . . . . 23 4.1 Example of a Metropolis random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the walk and the end of the walk, respectively. The black dots indicate intermediate steps. The average number of steps for this walk is 〈Ns〉 = 20 and the scale parameter is s = 0.5. . . . . . . . . . . . . . . . . . 27 4.2 Example of a GMC trajectory in a likelihood-constrained region of a two- dimensional parameter space. The trajectory is captured at a nested sam- pling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the trajectory and the end of the walk, respectively. The black dots indicate intermediate time steps. The reflections do not occur exactly at the boundary but rather at proxy surfaces just outside the boundary (not in figure) as described in Section 3.3.2. . . . . . . . . . . . . . . . . . . . . . 29 xi List of Figures 4.3 Example of a stretch move random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e−H. The green plus and the red cross indicates the start of the walk and the end of the walk, respectively. 
The black dots indicate intermediate steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 Progress of three key quantities over the course of a nested sampling run for the three different methods. The sampling is performed in n = 3 dimensions using 〈Ns〉 = 40 with sampling parameters N = 1000 and fln = 0.01. The top panel shows the likelihood L for the worst point of each iteration, the middle panel shows the posterior weights w, used to compute the evidence, and the bottom panel shows the moving median of the acceptance rate ra for each iteration. The methods distinguish themselves clearly in the variation of the acceptance rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.5 The (log) evidence error (top row), the bulk median acceptance rate (mid- dle row) and the number of likelihood evaluations (bottom row) in n = 3 dimensions for the three methods applied in the nested sampling frame- work to the toy problem in Section 2.1.1 for different values of the scale parameters s, τ and a. Sampling parameters were N = 1000 active points, fln = 0.01 tolerance and 〈Ns〉 = 40 exploration steps per iteration. . . . . . 35 4.6 Error in the computed (log) evidence with different choices of the average number of steps in each nested sampling iteration. Sampling parameters are N = 1000 active points and fln = 0.01 and the scale parameters are as indicated by the legend. The bands are the standard deviations of five identical runs with different random seeds. The general trend is that the error approaches zero and fluctuates less for larger 〈Ns〉. . . . . . . . . . . . 36 5.1 Illustration of the exponential increase of adjacent points with the dimen- sionality n of a Euclidean space. In (a): one, (b): two and (c): three dimensions, a grid point (filled circle) has 2, 8 and 26 neighbouring points (unfilled circles), respectively. In arbitrary dimensions, n, the correspond- ing number is 3n − 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 The evidence error (top row), the bulk median acceptance rate (middle row) and the number of likelihood evaluations (bottom row) for varying parameter space dimensionality. Each point is the average of five identical runs and the bands are corresponding standard deviations. The left and right columns display two different choices of hyperparameters, respectively, as specified in the figure. Sampling parameters areN = 1000 and fln = 0.01. The overall trend is that performance worsens and computations become less accurate for higher dimensions, as expected. . . . . . . . . . . . . . . . 41 5.3 Tracked progress for the three methods, displayed in terms of the likelihood L, the posterior weights w and the acceptance rate ra in n = 24 dimensions using 〈Ns〉 = 40. In contrast to the n = 3 case in Figure 4.4, the methods are seen to disagree on where the bulk of the posterior is located in terms of ξ, resulting in significant differences in the computed evidences. The vertical lines indicate the computed information H. Sampling parameters are N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 xii List of Figures A.1 Marginal posterior distributions obtained using the constrained Metropolis method with s = 0.5 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. 
The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.2 Marginal posterior distributions obtained using the GMC method with τ = 0.1 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . III A.3 Marginal posterior distributions obtained using the constrained stretch move method with a = 2.0 and 〈Ns〉 = 40. The distributions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coef- ficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III A.4 Marginal posterior distributions obtained using PyMultiNest. The distri- butions are compared to the analytical solution (dash-dotted). Vertical dashed lines represent the sample median and 16th and 84th percentiles, respectively. The Taylor coefficients (Equation (2.17)) are indicated by the square markers and vertical solid lines. The histograms contain ∼ 5000 samples in each case. N = 1000 and fln = 0.01. . . . . . . . . . . . . . . . . IV xiii List of Figures xiv List of Tables 4.1 Nested sampling results for the three methods applied to the same problem in n = 3 dimensions. Sampling parameters are N = 1000 active points and fln = 0.01 tolerance. The true evidence value for this model is lnZtrue = 8.09 as was analytically obtained in Figure 2.3. The number of likelihood evaluations, NL, does in the GMC case, include the number of evaluations of the gradient. The fraction of gradient evaluations is given in parenthesis. 32 5.1 The computed (log) evidence lnZ along with the uncertainty estimate ± √ H/N and the number of likelihood evaluations NL required as functions of the parameter space dimensionality n. The analytical values lnZtrue are given for reference. Hyperparameter settings are the same as in the left col- umn of Figure 5.2. Every value is the average obtained from five identical runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 xv List of Tables xvi 1 Introduction Scientific models are typically associated with a set of model parameters used to make ex- plicit theoretical predictions. A theoretical framework may in fact contain one or a set of competing models {Mn}, each with their own collection of model parameters whose values need to be calibrated against experimental observation. A suitable example in subatomic physics regards effective field theories (EFT), e.g. chiral effective field theory (χEFT) [1], which are described by observable coefficients that parametrize the theory at the scale at which it is valid [2]. In this type of scenario we would like to make well-informed state- ments about which model and associated model parameter values to prefer, given a set of experimental data. To this end we employ Bayesian statistical methods [3, 4] in order to perform inductive inference, including parameter estimation and model comparison. 
For parameter estimation, the central object is the posterior probability distribution for the parameters, which in general needs to be represented by a set of random samples. The key quantity for model comparison is the Bayesian model evidence (or marginal likelihood), obtained by integration over the space of model parameters. By evaluating the evidence, competing models in the set can be assigned a relative probability, indicating which model is favoured by the data. Bayesian inference generally requires demanding numerical computations in terms of high- dimensional evidence integrals and sampling of complex posterior probability distributions. For this purpose one mainly resorts to Markov chain Monte Carlo (MCMC) methods [5] which explore the model parameter space in order to find regions of high probability. However, although MCMC methods, such as the Metropolis-Hastings algorithm [6, 7], theoretically could be used to estimate the evidence integral, in practice they fail to do so with acceptable efficiency. A less established Monte Carlo method compared to MCMC is nested sampling, intro- duced by Skilling [8, 9] and further described by Sivia and Skilling [4]. Nested sampling was specifically developed to efficiently provide an estimate of the model evidence. As a by-product it also generates a set of samples from the posterior. The algorithm has been successfully applied in parameter estimation and model comparison applications in a broad variety of fields, from astronomy, cosmology and particle physics [10–15] to biomath- ematics [16–18]. The basic idea of nested sampling is to transform the high-dimensional evidence integral to a one-dimensional integral over unit range by considering secluded and closed regions of parameter space, nested within each other. A most crucial step in the nested sampling procedure is to generate independent samples from within these constrained regions and a variety of methods for performing this step have been proposed and implemented [10, 19–23]. Producing high-quality constrained samples has proven to be a difficult task, to say the least, and is subject to ongoing research. This work aims at contributing to this research by implementing and evaluating three different constrained 1 1. Introduction space sampling methods which are integrated into a nested sampling framework. The implementations are in themselves MCMC based although their surrounding nested sam- pling environment is not. Two of the methods are inspired by previous work on the topic whereas the third method is introduced to the context of nested sampling for the first time in this work. The implemented methods are applied to an EFT-inspired example problem and evaluated based on their respective performance. 1.1 Purpose and scope The purpose of this thesis is to increase the understanding of the nested sampling algorithm in general and constrained space sampling in particular. This is achieved by proposing the specific designs, implementing and evaluating three different methods for constrained sampling in conjunction with nested sampling. Focus will mainly be on the performance based on the specifics of the three methods and not on the design of the nested sampling algorithm in general. In contrast to sampling of unconstrained probability distributions, the process of constrained sampling is poorly understood. A specific goal of this work is to present modified versions of ordinary MCMC methods, adjusted to suit the requirements introduced by nested sampling. 
1.2 Outline
In order to understand the context in which nested sampling is used it is necessary to grasp the concepts of the Bayesian statistical analysis framework. This background is provided in Chapter 2 and includes the basics of Bayesian inference as well as a practical example of parameter estimation and model comparison. Furthermore, the chapter ends with a brief description of a few MCMC sampling methods that are conventionally used in the context of Bayesian inference. The underlying ideas and a full description of the nested sampling algorithm are given in Chapter 3. The chapter additionally introduces the principles of the constrained sampling methods implemented in this work and describes their place in relation to nested sampling. The specifics of the design choices for the implementations are described in Chapter 4, which further presents diagnostic results used to optimize the methods. The first method presented is a constrained version of the Metropolis algorithm, the second is a version of Galilean Monte Carlo and the third is a constrained version of an affine-invariant algorithm referred to as the stretch move. The latter method is to our knowledge unique to this work. In Chapter 5 the methods are observed when pushed past their tipping points and broken down by applying them to models with increasingly large parameter space dimensionalities. Here we also discuss uncertainty estimates and make a comparison to the widely used, state-of-the-art nested sampling software MultiNest [10, 11, 24], which the methods of this work are seen to outperform in some cases. The thesis is concluded in Chapter 6 by summarizing the main findings and by proposing starting points for further related work.

2 Bayesian data analysis and numerical methods
Bayesian statistics and its applications form a vast subject that is relevant to most scientific research, and indeed to any field of work where making predictions from data and previous experience is important. Here follows a brief description of the aspects of Bayesian inference relevant to this work. For an extensive review of the subject, see e.g. MacKay [3] or Sivia and Skilling [4].

2.1 Bayesian inference
One of the powers of the Bayesian description of probability is that it allows for a relation to be established between a conditional probability prob(X|Y) and its reverse prob(Y|X), where X and Y are propositions that could be outcomes of random processes (although that property is not required). This reversal is desirable in scenarios where, e.g., prob(X|Y) is the sought-after quantity but is difficult to write down directly, whereas prob(Y|X) is easier to interpret and more naturally expressed. The product rule for joint probabilities states that

prob(X, Y) = prob(X|Y) prob(Y),   (2.1)

which in words reads: the probability that X and Y occur equals the probability that X occurs given that Y has occurred, multiplied by the probability that Y occurs. Joint probabilities are symmetric, such that prob(X, Y) = prob(Y, X), which by the product rule in Equation (2.1) gives Bayes' theorem

prob(X|Y) = \frac{prob(Y|X)\, prob(X)}{prob(Y)},   (2.2)

which yields the relation between the conditional probabilities prob(X|Y) and prob(Y|X). The marginal probabilities prob(X) and prob(Y) can be expressed by marginalization of the joint probability prob(X, Y) = prob(Y|X) prob(X) (by Equation (2.1)) through the sum rule according to

prob(Y) = \begin{cases} \int prob(Y|X)\, prob(X)\, dX, & \text{for continuous } X \\ \sum_X prob(Y|X)\, prob(X), & \text{for discrete } X \end{cases}   (2.3)

and correspondingly for prob(X).
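To make Equations (2.1)–(2.3) concrete, the short Python sketch below applies the sum rule and Bayes' theorem to a discrete two-state proposition X given an observed outcome Y. The numerical values are made up purely for illustration and do not come from this thesis.

```python
import numpy as np

# Illustrative (made-up) numbers: prior prob(X) over two states of X and the
# likelihood prob(Y|X) of a single observed outcome Y for each state.
prior_X = np.array([0.3, 0.7])
likelihood_Y_given_X = np.array([0.9, 0.2])

# Sum rule (2.3), discrete case: prob(Y) = sum_X prob(Y|X) prob(X)
prob_Y = np.sum(likelihood_Y_given_X * prior_X)

# Bayes' theorem (2.2): prob(X|Y) = prob(Y|X) prob(X) / prob(Y)
posterior_X_given_Y = likelihood_Y_given_X * prior_X / prob_Y

print(prob_Y)               # 0.41
print(posterior_X_given_Y)  # approximately [0.66, 0.34], normalized to unity
```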
In the context of this work, X takes the form of an n-dimensional parameter vector θ = (θ0, ..., θn−1) used to make predictions in a given model M. Y is in this context a set of data D obtained from an experiment or simulation yielding measurements of some 3 2. Bayesian data analysis and numerical methods quantity sought to be predicted byM. In other words, the goal is to find the probability (distribution) of the parameters θ given data D. Bayes’ theorem then reads prob(θ|D, I) = prob(D|θ, I)prob(θ|I) prob(D|I) (2.4) where I is any other relevant information including e.g. the choice ofM. The probabilities entering Equation (2.4) are the posterior probability distribution of the parameters prob(θ|D, I) ≡ P(θ), (2.5a) the likelihood of the data prob(D|θ, I) ≡ L(θ), (2.5b) the prior probability distribution of the parameters prob(θ|I) ≡ π(θ), (2.5c) the Bayesian evidence (or marginal likelihood) prob(D|I) ≡ Z. (2.5d) 2.1.1 EFT toy example In order to put Bayesian inference into context, we here review a simple EFT-inspired example from Schindler et al. [25] and Wesolowski et al. [26]. In this example, a function g(x) is taken to represent the true behavior of some observable in the underlying theory and x is an EFT expansion parameter, e.g. a ratio of momenta representing a low-energy physics scale and a high-energy breakdown scale. The function is chosen such that the coefficients of its Taylor expansion at x = 0 within |x| < 1 ought to be of natural order: g(x) = a0 + a1x+ a2x 2 + a3x 3 + ... (2.6) In this way the observable coefficients ai are related to order-by-order predictions of the EFT. More specifically, including the first n terms would correspond to the prediction of the EFT truncated at order n. By physical arguments the observable coefficients are expected to be of natural size [2]. To simulate observations from an experiment, synthetic data points d are generated by adding a Gaussian relative noise to g(x) at points of measurement xj , j = 1, ..., Nd: dj = g(xj)(1 + cηj) with experimental error σj = cdj (2.7) where ηj ∼ N (0, 1)1 is a standard normal random sample and c is a relative error (5% in this example). To make accurate predictions, the observable coefficients need to be calibrated against this data which makes them ideal subjects for Bayesian parameter estimation (see Section 2.1.2). In this example, our model M is a polynomial of order n− 1, gM(x; θ) = n−1∑ i=0 θix i, (2.8) modelling the behavior of the true function g(x). This is where the parameter vector θ of the model enters, upon which the predictions of M depends. A non-Bayesian ap- proach usually aims to find a point estimate of the parameters, such as the least-squares minimization arg minθ χ 2(θ) where χ2(θ) = Nd∑ j=1 ( dj − gM(xj ; θ) σj )2 . (2.9) 1N (µ, σ2) denotes a normal distribution with mean µ and variance σ2. 4 2. Bayesian data analysis and numerical methods This approach might be sufficient in some applications but often fails to provide any infor- mation on the distribution of the parameters and thereby their uncertainties. Neither does it take into account any prior knowledge of the parameters (such as naturalness in the case of the EFT expansion). We will now turn to the general problem of parameter estimation and model comparison. However, we will revisit this toy example in Section 2.1.4. 2.1.2 Parameter estimation The model parameters θ are in the Bayesian formalism estimated using the posterior P(θ) determined by Bayes’ theorem Equation (2.4). 
In parameter estimation applications it is possible to ignore the evidence Z as it is independent of θ and merely amounts to a normalization factor; it is however a central quantity in the context of Bayesian model comparison (see Section 2.1.3). The posterior P(θ) ∝ L(θ)π(θ) (2.10) is consequently determined solely by the likelihood and prior and is obtained either by numerical methods, such as MCMC sampling (see Section 2.2), or as an analytical solution (if possible). Subsequently, it is possible to derive distributions for any subset θj , ..., θk of parameters by marginalization prob(θj , ..., θk|D, I) = ∫ P(θ) ∏ i 6=j,...,k dθi, (2.11) where integration is taken over the appropriate ranges in θ. Marginalization is useful for e.g. discarding of nuisance parameters but also convenient for visualization which (unfortu- nately) is bounded by two or three dimensions. Examples of such two- and one-dimensional marginal probability density functions (pdfs) are shown in Section 2.1.4. Point estimates of any function f(θ) with respect to the posterior are obtained with the expectation value E[f ] = ∫ f(θ)P(θ)dnθ (2.12) where the special cases f(θ) = θi and f(θ) = (θi − E[θi])(θj − E[θj ]) give the parameter means and covariances respectively. It is important to stress, however, that the posterior contains more information than mere point estimates such as mean or mode values. The posterior has in general a complex structure, such as multiple modes, not well described by a single number. Only in the special case when the posterior is (approximately) Gaussian can it be fully described by its mean vector and covariance matrix. 2.1.3 Model comparison Model comparison can be illustrated by the problem of determining which of two possible modelsM1 andM2 (e.g. polynomials of different orders in the example of Section 2.1.1) to prefer. In the Bayesian formalism this problem is addressed by comparing the model posteriors prob(M1|D, I) and prob(M2|D, I)2. These posteriors are conditioned on the same data D and do not include any parameter dependence as they are more general than to describe a certain fit for a given prediction. As opposed to parameter estimation the evidence Z plays a crucial role in model comparison. It is given by normalization of the posterior 2Note that as M1 and M2 are here made explicit they are not included in I. 5 2. Bayesian data analysis and numerical methods Z = ∫ prob(D|θ,M, I)prob(θ|M, I)dnθ = ∫ L(θ)π(θ)dnθ. (2.13) The model comparison is then performed by constructing the ratio of the model posterior pdfs using Bayes’ theorem according to prob(M1|D, I) prob(M2|D, I) = prob(D|M1, I)prob(M1|I) prob(D|M2, I)prob(M2|I) (2.14) where the common factor prob(D|I) has cancelled. If there is no initial reason to prefer one model over the other, the ratio of model priors will evaluate to prob(M1|I)/prob(M2|I) = 1. The remaining ratio is called the Bayes’ factor and should be determined in order to compare the models as prob(M1|D, I) prob(M2|D, I) −→ prob(D|M1, I) prob(D|M2, I) . (2.15) Using the sum (2.3) and product (2.1) rules the Bayes’ factor can be computed by inte- gration as prob(D|M1, I) prob(D|M2, I) = ∫ prob(D|θ,M1, I)prob(θ|M1, I)dnθ∫ prob(D|θ,M2, I)prob(θ|M2, I)dnθ (2.13)= Z1 Z2 (2.16) meaning that the evidences, Z1,2, are used to quantitatively assess the relative performance of different models. 
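In practice the evidences are computed and stored as logarithms, so the ratio in Equation (2.16) is most safely evaluated in log space. The following minimal sketch, with invented log-evidence values used purely for illustration, shows how posterior odds would be obtained under equal model priors:

```python
import numpy as np

# Invented log-evidences for two competing models (illustrative only).
lnZ1, lnZ2 = -215.3, -218.1

# Bayes' factor (2.16), evaluated in log space for numerical stability.
bayes_factor = np.exp(lnZ1 - lnZ2)

# With equal model priors, the posterior odds (2.14) equal the Bayes' factor;
# the posterior probability of M1 then follows by normalization.
prob_M1 = bayes_factor / (1.0 + bayes_factor)
print(f"Z1/Z2 = {bayes_factor:.1f}, prob(M1|D) = {prob_M1:.3f}")
```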
2.1.4 EFT toy example revisited
In this section the representative problem introduced in Section 2.1.1, estimating observable coefficients in an EFT, is approached using Bayesian inference. A function g(x) with the desired property of natural Taylor coefficients a_i, described in the example, is

g(x) = \left(\frac{1}{2} + \tan\left(\frac{\pi}{2}x\right)\right)^2 = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots = \frac{1}{4} + \frac{\pi}{2}x + \frac{\pi^2}{4}x^2 + \frac{\pi^3}{24}x^3 + \ldots \approx 0.25 + 1.57x + 2.47x^2 + 1.29x^3 + \ldots   (2.17)

which here will be used to generate synthetic data. The true function (2.17), along with data and associated errors generated according to Equation (2.7), can be seen in Figure 2.1. The data consists of 10 points evenly distributed in 0 < x ≤ 1/π and the relative error is c = 5%. As a reminder, the goal is to predict the coefficients a_i (and thereby also g(x)) given this data by studying the parameters θ of the model function g_M(x; θ) defined in Equation (2.8). This is done by constructing a posterior pdf for the parameters via a likelihood and a prior according to Bayes' theorem (2.4).

Figure 2.1: The true function g(x) from which synthetic data is generated in the EFT toy example. Data consists of 10 points evenly distributed in 0 < x ≤ 1/π with a relative error c = 5% according to Equation (2.7).

Using the principle of maximum entropy (MaxEnt) [27] it is possible to derive [4] a likelihood for the data. Assuming uncorrelated data, the resulting pdf is

prob(D|θ, I) = \left[\prod_{j=1}^{N_d} \frac{1}{\sqrt{2\pi\sigma_j^2}}\right] \exp\left(-\frac{\chi^2}{2}\right),   (2.18)

which is an N_d-dimensional Gaussian in the data and where χ² is defined in Equation (2.9). It is important to note that this is a pdf for the data D, conditioned on the parameters θ, not a pdf for the parameters themselves. If a uniform prior, i.e. prob(θ|I) = const., is chosen in combination with the likelihood (2.18), the maximum of the posterior would lie exactly at the least-squares point estimate discussed in Section 2.1.1 (given that the prior range includes this point). To account for the naturalness of the coefficients, however, Wesolowski et al. [26] introduce a naturalness prior given by the symmetric Gaussian N(θ; 0, θ̄² 1_n), i.e.

prob(θ|I) = \prod_{i=0}^{n-1} prob(θ_i|I) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi\bar{\theta}^2}} \exp\left(-\frac{\theta_i^2}{2\bar{\theta}^2}\right),   (2.19)

where θ̄ = 5 is the width of the distribution. In this way the prior favours parameter values in the interval ±5 (roughly), keeping them close to natural size. The posterior, obtained as the product of the MaxEnt likelihood (2.18) and the naturalness prior (2.19), is visualized in Figure 2.2 for a model with n = 3. This is done by presenting its marginal pdfs for different parameter subsets. The posterior is in this case a Gaussian, as can be derived analytically from the relatively simple forms of the likelihood and the prior. In general, analytical expressions are impossible to find and one has to resort to numerical sampling methods such as MCMC, which will be discussed in Section 2.2. Along with the distributions, Figure 2.2 contains point and error estimates for the parameters given by their means and standard deviations, respectively. The point estimates should be compared to the actual coefficients in Equation (2.17). The posterior illustrated in Figure 2.2 is obtained given a model with n = 3 parameters, i.e. a polynomial of order n − 1 according to Equation (2.8). This specific choice might however not be the model that has the most evidence given the data.
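For reference, the likelihood (2.18) and the naturalness prior (2.19) of this toy problem are straightforward to code. The sketch below is our own illustration (function and variable names are not taken from the thesis); it generates data according to Equation (2.7) and evaluates the log-likelihood and log-prior for a parameter vector θ.

```python
import numpy as np

def g_true(x):
    """True function (2.17) used to generate synthetic data."""
    return (0.5 + np.tan(0.5 * np.pi * x)) ** 2

def make_data(n_data=10, c=0.05, rng=np.random.default_rng(0)):
    """Synthetic data according to Equation (2.7)."""
    x = np.linspace((1 / np.pi) / n_data, 1 / np.pi, n_data)
    d = g_true(x) * (1.0 + c * rng.standard_normal(n_data))
    return x, d, c * d   # measurement points, data, experimental errors sigma_j

def log_likelihood(theta, x, d, sigma):
    """Log of the MaxEnt likelihood (2.18) for the polynomial model (2.8)."""
    g_model = np.polyval(theta[::-1], x)   # theta_0 + theta_1 x + theta_2 x^2 + ...
    chi2 = np.sum(((d - g_model) / sigma) ** 2)
    return -0.5 * chi2 - np.sum(np.log(np.sqrt(2.0 * np.pi) * sigma))

def log_prior(theta, theta_bar=5.0):
    """Log of the naturalness prior (2.19) with width theta_bar."""
    return np.sum(-0.5 * (theta / theta_bar) ** 2
                  - 0.5 * np.log(2.0 * np.pi * theta_bar ** 2))
```

The unnormalized log-posterior used by the samplers discussed in Section 2.2 would then simply be log_likelihood(theta, ...) + log_prior(theta).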
We can nevertheless try to inform ourselves by employing the principles of model comparison described in Section 2.1.3. It is in this example possible to compute the evidence prob(D|M_n, I) for a model M_n with n parameters analytically, and the result is shown in Figure 2.3. We see clearly that M_3 is in fact the model with the maximum evidence. The fact that the model evidence reaches a plateau for n ≥ 5 means that little information is either gained or lost by adding or removing a parameter in this region. The evidence integral (2.13) is, however, in general not possible to compute analytically and numerical methods are required. The work of this thesis concerns different versions of such a method, namely nested sampling, which will be introduced in Chapter 3. For this, we first have to acquaint ourselves with a few MCMC sampling methods.

Figure 2.2: Marginal distributions of the analytically derived posterior in the EFT example for a model with n = 3 parameters. Two-dimensional distributions for pairs of parameters are located below the diagonal, on which the one-dimensional distributions are displayed. Point estimates are given for each parameter in terms of their mean values, and error estimates in terms of standard deviations (θ_0 = 0.25 ± 0.02, θ_1 = 1.63 ± 0.39, θ_2 = 3.14 ± 1.27).

Figure 2.3: The model evidence computed for models M_n defined by their parameter space dimensionality n. It is clear that the evidence is maximized for n = 3, which means that M_3 is the model favoured by the data. The evidence for n = 1 is ∼ 0 and is therefore not included in the figure.

2.2 Markov chain Monte Carlo methods
MCMC methods [5] comprise a set of sampling algorithms widely used in multiple applications, not least in Bayesian inference. They are designed to approximate a target distribution p(θ) by the construction of Markov chains whose equilibrium distribution is that of the target. This is done by letting walkers explore parameter space looking for regions of significant probability. Markov chains are created by ensuring that the next position θ' of each walker only depends on its current position θ and a transition probability T(θ', θ). In contrast to other Monte Carlo methods, MCMC thus produces correlated samples, which is important to take into account when e.g. the sample mean and variance are computed. The Markov chain limit theorem [28] provides principles for handling correlated samples and justification for computations performed using MCMC samples. We say that a Markov chain is ergodic if its temporal average, i.e. over time steps t, in the long run approaches its ensemble average over all possible states, i.e. if all possible states are eventually explored. This is a key property for MCMC methods for generating samples from the target distribution. Furthermore, a Markov chain is reversible with respect to a distribution p if it satisfies detailed balance, p(θ) T(θ', θ) = p(θ') T(θ, θ'), meaning that the probability flow from θ to θ' balances the flow from θ' to θ. Detailed balance implies that the Markov chain is stationary, which means that the distribution of a sample θ(t) is independent of the time t. Stationarity gives the Markov chain the right properties concerning its equilibrium distribution.
We will now proceed by describing a few specific MCMC algorithms used to sample distributions.

2.2.1 Metropolis-Hastings
The Metropolis-Hastings algorithm [6, 7] is an MCMC method for producing samples from a target p(θ). This is done by exploring parameter space with random walks, producing Markov chains. At each step t in a chain, a new position θ' is proposed, drawn from a proposal distribution Q(θ'|θ^(t)) where θ^(t) is the current position. The proposal is accepted with probability

α = \min\left(1, \frac{p(θ')\, Q(θ^{(t)}|θ')}{p(θ^{(t)})\, Q(θ'|θ^{(t)})}\right)   (2.20)

and θ^(t+1) ← θ' is set; otherwise θ^(t+1) ← θ^(t). In the original Metropolis implementation the proposal distribution is taken to be symmetric, meaning Q(θ^(t)|θ') = Q(θ'|θ^(t)), and is regularly chosen to be a symmetric Gaussian N(θ'; θ^(t), σ_Q² 1_n) where the scale σ_Q needs to be set appropriately. The acceptance probability therefore simplifies to α = min(1, p(θ')/p(θ^(t))). In general, the proposal distribution does not have to be isotropic and can for instance be a Gaussian with an arbitrary n × n covariance matrix Σ_Q. In this case there are n(n + 1)/2 parameters that need to be set depending on the problem at hand, complicating the user experience. The Metropolis procedure is described in Algorithm 2.1. A caveat of the Metropolis sampling algorithm is that it, like MCMC methods in general, inherently produces correlated samples. This property needs to be properly considered in applications and will be revisited in Section 3.3.1.

Algorithm 2.1: Metropolis procedure.
  Input: Target p(θ) and proposal distribution Q(θ'|θ)
  Output: Target samples θ^(t)
  Initialize θ^(0)
  for t = 0, ..., N_s − 1 do
      θ' ← sample from Q(θ'|θ^(t))
      α ← min(1, p(θ')/p(θ^(t)))
      u ← sample from U(0, 1)
      if u ≤ α then
          θ^(t+1) ← θ'
      else
          θ^(t+1) ← θ^(t)

2.2.2 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) [3, 5], introduced by Duane et al. (1987) [29], is an MCMC method for generating high-quality samples from challenging target distributions p(x)³, particularly for larger numbers of dimensions n. The idea is taken from classical Hamiltonian mechanics by considering the negative logarithm of the target distribution as a potential energy V(x) in parameter space, i.e.

p(x) ∝ exp(−V(x)).   (2.21)

The concept of momentum p is included by considering the phase-space extended pdf prob(x, p) = prob(x) prob(p) = p(x) q(p), where we have introduced

q(p) ∝ \exp\left(-\frac{|p|^2}{2}\right),   (2.22)

i.e. the exponential of the negative kinetic energy. The Hamiltonian of a classical system is of the form H = ½|p|² + V(x), which implies that the full 2n-dimensional phase-space distribution can be written as

prob(x, p) ∝ exp(−H).   (2.23)

Any point (or sample) (x, p) can thus be evolved in time t for any given period, yielding a new point (x', p'), by solving Hamilton's equations⁴:

dx/dt = ∂H/∂p = p,   (2.24a)
dp/dt = −∂H/∂x = −∇V(x),   (2.24b)

serving the purpose of the transition probability T(x', p'; x, p). If the momentum p is randomly initiated at each new position x, these points will form a Markov chain satisfying ergodicity [29]. Detailed balance is satisfied by the time-reversal symmetry of Hamilton's equations (2.24).

³We shall here use x instead of θ for the parameters to emphasize the parallel to position in classical mechanics.
⁴These equations need in general to be solved numerically, which is a topic that we will not elaborate on here.
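As the footnote above indicates, Hamilton's equations (2.24) generally have to be integrated numerically. A common choice for this (not elaborated in the thesis) is the leapfrog scheme; the sketch below is a generic illustration in which the step size epsilon, the number of steps and the gradient routine grad_V are assumed inputs rather than quantities defined in this work.

```python
import numpy as np

def leapfrog(x, p, grad_V, epsilon=0.01, n_steps=50):
    """One leapfrog integration of Hamilton's equations (2.24); generic sketch.

    grad_V(x) should return the gradient of V(x) = -ln p(x) (up to a constant),
    and epsilon is the time step.
    """
    x, p = np.copy(x), np.copy(p)
    p -= 0.5 * epsilon * grad_V(x)      # initial half step in momentum
    for _ in range(n_steps - 1):
        x += epsilon * p                # full position step, dx/dt = p
        p -= epsilon * grad_V(x)        # full momentum step, dp/dt = -grad V
    x += epsilon * p
    p -= 0.5 * epsilon * grad_V(x)      # final half step in momentum
    return x, p
```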
The desired target p(x) is obtained by discarding the p-samples, only keeping the x-part of phase space. These samples should be of high quality as the trajec- tories described by the dynamics of Equation (2.24) are favoured to move towards regions of low potential energy, i.e. high probability. This is especially useful in large-dimensional spaces out of which these high-quality regions usually make up a small fraction, making them hard to locate. The HMC algorithm also covers distance in parameter space faster than random walks such as Metropolis-Hastings, effectively decreasing the correlation of samples. The most noticeable drawback of HMC is that it requires evaluation of the gradient −∇V (x) = ∇ ln p(x) at every step of a trajectory. This is in general a very computationally expensive and thus time consuming task. HMC furthermore contains a number of hyperparameters, such as the number of time steps and the size of the steps, that need to be appropriately set for your current problem. 2.2.3 Sampling with affine invariance: the stretch move Model parameters can, in general, be strongly correlated, making their joint probability distributions highly asymmetric. This means that the characteristic length scales and relevant directions can differ largely between parameter dimensions making efficient ex- ploration of parameter space more challenging. Building upon work by Christen [30], Goodman and Weare [31] proposed an MCMC sampling method that is independent of scale differences between dimensions. Formally, the algorithm is invariant under affine transformations, which in n dimensions are of the form yi = βi0θ0 + ...+ βin−1θn−1 = n−1∑ j=0 βijθj , i = 0, ..., n− 1, (2.25) where θj are the parameters of the problem, βij are scale coefficients and yi are the corresponding transformed parameters. Affine invariance of the algorithm implies that parameter dimensions are effectively scaled to the same size, simplifying the shape of the distribution making parameter space easier to explore. To achieve this, the algorithm utilizes a special kind of random walk, informally referred to as the stretch move. This method simultaneously evolves an ensemble of K walkers S = {Θk}Kk=1 where the proposal distribution for walker Θk depends on the complementary ensemble S[k] = {Θj , ∀j 6= k}. At step t, a new position Y is proposed for walker Θk by choosing another walker Θj randomly from S[k] and setting Θ(t) k → Y = Θj + ζ̃[Θ(t) k −Θj ] (2.26) where ζ̃ is a scale factor randomly sampled from a distribution g(ζ). For the proposal to be symmetric and satisfy detailed balance we must have g(ζ−1) = ζg(ζ) as well as add a Metropolis style rejection scheme where Θ(t+1) k ← Y is accepted with probability 11 2. Bayesian data analysis and numerical methods α = min ( 1, ζ̃n−1 p(Y ) p(Θ(t) k ) ) , (2.27) where p(θ) is the target distribution. This proposal procedure is repeated for every walker in S resulting in a collective evolution. As an explicit form for g(ζ), Goodman and Weare [31] suggests g(ζ) ∝ 1√ ζ , for ζ ∈ [1 a , a ] , (2.28) where a is a hyperparameter adjusting the scale of g(ζ) and which they set to 2. One main advantage of this method is that there are very few hyperparameters that need to be set by the user, there are essentially only two: step scale a and the number of steps. A state-of-the-art implementation of the stretch move procedure for MCMC sampling can be found in e.g. the Python module emcee [32]. 
The MCMC methods described in the previous sections provide a set of samples that can be used to make computations — such as the model evidence — with respect to the posterior. However, this procedure quickly becomes inefficient for large-scale problems and alternative approaches are necessary. In the next chapter we introduce such an approach: nested sampling. 12 3 Nested sampling Introduced by Skilling [8], nested sampling is a Monte Carlo sampling method that natu- rally provides an estimate of the evidence Z. It is however important to note that nested sampling is not an MCMC method as it does not produce Markov chains distributed ac- cording to the posterior. The acquisition of posterior samples is nevertheless also a product but not the main focus of the algorithm. Here follows a review of the principles of nested sampling, for a detailed description see e.g. Sivia and Skilling [4] or Skilling [9]. 3.1 Background Let us first repeat some Bayesian key concepts from Chapter 2. Given data D and other information I, such as the choice of model, the goal is to obtain the evidence (or marginal likelihood) Z = prob(D|I) but also the posterior P(θ) = prob(θ|D, I) from the prior π(θ) = prob(θ|I) and the likelihood L(θ) = prob(D|θ, I). The relation between these quantities is given through Bayes’ theorem which in this more compact notation takes the form L(θ)× π(θ) = Z × P(θ). (3.1) Assuming that the posterior distribution is properly normalized to unity, i.e. ∫ P(θ)dnθ = 1, the evidence can be written in terms of the integral Z = ∫ L(θ)π(θ)dnθ (3.2) where the integration is taken over the parameter ranges according to the prior. The com- plexity of computing this integral increases exponentially with the number of dimensions as a space with n dimensions resolved to one part in R has Rn volume elements. High dimensionality furthermore obstructs the use of analytical approximations as the degree of freedom in these spaces becomes too large [4]. 3.1.1 Prior mass fraction and sorted likelihood function What makes the numerical computation of integrals such as Equation (3.2) challenging is the brute-force approach of direct summation of the exponentially large number of elements in parameter space. However, the likelihood values associated to each of these elements can be sorted in a decreasing order, resulting in a sequence {Lk}. The sequence could be enumerated using a global variable encoding the information of the part of prior parameter space with likelihoods larger than Lk. This variable is the prior mass fraction, here denoted ξ. In the case of continuous parameters θ, the prior mass fraction as a function of likelihood limit is formally defined as 13 3. Nested sampling ξ(λ) = ∫ L(θ)>λ π(θ)dnθ (3.3) and is interpreted as the fraction of prior probability with likelihood greater than λ [4]. Assuming the prior π(θ) to be normalized to unity, the prior mass fraction has by def- inition properties: 0 ≤ ξ ≤ 1, ξ(0) = 1 and ξ(Lmax) = 0 where 0 < L ≤ Lmax. Also by definition, the mass element associated with likelihoods λ ≤ L ≤ λ + dλ is given by dξ = π(θ)dnθ. In this way the problem is transformed from n-dimensional parameter space, in terms of the parameter vector θ = (θ0, ..., θn−1), to a single variable ξ. There is no information lost in this reduction of dimensions if ξ is stored at a resolution of one part in Rn and each of the n parameters θi is stored at a resolution of one part in R as the information required in both cases amounts to n log2R bits [4]. 
For the sake of convenience, it is rather the inverse function L(ξ), defined by L(ξ(λ)) = λ, (3.4) that we shall consider in practice. It takes on values 0 ≤ L ≤ Lmax where the extremes are at L(0) = Lmax and L(1) = 0. L(ξ) is a sorted version of the original likelihood L(θ) and only has a single variable dependence ξ. The existence of the inverse is not trivial to prove. In fact, it has been proven by Schittenhelm and Wacker [33] that the statement (3.4) is violated for specific forms of L(θ). For the likelihood functions in this work, however, the definition of the inverse L(ξ) will always hold. A cartoon representing the relation between the likelihood versions L(θ) and L(ξ) can be seen in Figure 3.1a. The left panel shows an example of a likelihood in two dimensions where each of the iso-likelihood contours L(θ) = Lk, k = 1, ..., 4, corresponds to a prior mass fraction ξk shown in the right panel. In this example, ξk is simply the integral of the prior over the area enclosed by the contour Lk. The filled in areas illustrate what parts of parameter space correspond to what ξ-intervals. Note that the figure is for illustrative purposes and is not quantitatively correct. The key idea setting up the nested sampling method is to express the evidence Z in terms of the quantities L and ξ. The evidence (3.2) is a sum of likelihood values L(θ) with associated prior mass weights π(θ)dnθ over volume elements dnθ. The mass elements dξ originate from the same volume elements meaning the evidence is equivalently given by Z = ∫ 1 0 L(ξ)dξ (3.5) which notably is a one-dimensional integral. Geometrically, this means that Z is the area enclosed by the curve generated by L(ξ) as shown in Figure 3.1b. Furthermore, P (ξ) is the ξ-counterpart to the posterior P(θ) and is proportional to L(ξ) according to P (ξ) = 1 Z L(ξ) (3.6) from which a sample ξ̃ corresponds to a parameter sample θ̃ from P(θ). This is true from the previous argument that mass elements dξ and π(θ)dnθ come from the same volume element. The sorted likelihood function L(ξ) is thus the primary constituent in the nested sampling machinery and can be used to yield Z as well as samples from P(θ). 14 3. Nested sampling θ L4 L3 L2 L1 0 ξ4 ξ3 ξ2 ξ1 1 ξ L(ξ) L4 L3 L2 L1 (a) Cartoon of the relation between iso-likelihood contours L(θ) = Lk and prior mass fractions ξk, described by L(ξk) = Lk. Each interval in ξ corresponds to an area in parameter space. Note that the figure is not meant to be quantitatively correct. 0 1 ξ 0 Lmax L(ξ) Z (b) Representation of the evidence Z as the area under the curve of the like- lihood function L(ξ). Figure 3.1: Illustrations of the relationship between the likelihoods L(θ) and L(ξ), the prior mass fraction ξ and the model evidence Z. 15 3. Nested sampling 3.2 The algorithm The nested sampling algorithm makes use of active points exploring parameter space to reveal the structure of L(ξ). N points are sampled from the parameter space in proportion to the prior and are then set to evolve under successively stricter likelihood constraints L(θ) > L∗, where L∗ ≥ 0 is an evolving likelihood limit. These constraints iteratively define the iso-likelihood contours seen in Figure 3.1a. In terms of ξ this corresponds to N uniformly distributed samples evolving subject to the constraint ξ < ξ∗ = ξ(L∗). 
Each iteration hence starts with N points uniformly distributed in (0, ξ*)¹ (initially ξ* = 1) and the subsequent evolution, illustrated in Figure 3.2, proceeds as follows:

(a) The worst of the N active points, the one with the lowest likelihood, is identified, removed and stored for later use.
(b) The limit ξ* (and therefore also L*) is updated to the previous position of the removed point.
(c) A new point, nested within ξ*, is generated and added to the set of active points.

This procedure is set to repeat for a pre-defined number of iterations M. The way in which item (c) above is implemented is a subject of particular importance and will be further discussed in Section 3.3. Sampling from the prior under the likelihood constraint could be argued to be the most crucial subtopic of nested sampling and will be the main focus of this thesis. For now it is assumed to be possible to generate such new points.

Figure 3.2: Evolution of active points in a nested sampling iteration. In (a) the worst point is identified and removed. The limit ξ* is updated in (b) to the position the point was removed from. A new point is generated in (c), nested within the current limit.

3.2.1 Statistical properties
At this stage it is necessary to address the fact that the ξ's are not explicitly known. It is however possible to determine their statistical properties, from which an estimate can be extracted. If the worst point removed at iterate k is denoted ξ_k, the shrinkage ratio of the active volume between iterations is

t_k = ξ_k/ξ_{k−1}, where ξ_0 = 1 ⟹ t_1 = ξ_1.   (3.7)

Noting that ξ_k is the topmost of N uniformly distributed samples between 0 and ξ* = ξ_{k−1}, the ratio t_k can be considered to be the topmost of N samples uniformly distributed between 0 and 1. This means that the properties of order statistics [35] can be used to show that t_k ∼ Beta(N, 1)² with pdf

prob(t_k) = N t_k^{N−1}.   (3.8)

Given the distribution (3.8), the associated mean and variance of ln t_k are

E[ln t_k] = −1/N and Var[ln t_k] = 1/N²,   (3.9)

respectively. Using the (natural) logarithm rather than the plain value in (3.9) ensures better properties in the transition from the full prior space to the smaller regions where the bulk of the posterior mass resides [9]. At iterate k, ξ_k can be expressed in terms of a product of shrinkage ratios, i.e.

ξ_k = t_1 t_2 \cdots t_k = \prod_{j=1}^{k} t_j,   (3.10)

leading to a geometrical progression of the active ξ-range (but linear in ln ξ). The properties of the ξ's are thereby determined entirely by the properties of the t's and it is straightforward to derive the mean and variance

E[ln ξ_k] = \sum_{j=1}^{k} E[ln t_j] = −k/N and Var[ln ξ_k] = k/N²   (3.11)

of the logarithm ln ξ_k. Crudely adopting the mean in Equation (3.11) as an estimator, one obtains an exponentially decreasing sequence of discarded prior mass fractions

ξ_k = e^{−k/N},   (3.12)

illustrated in Figure 3.3. That is to say, with this scheme the active range shrinks exponentially as regions of higher likelihoods are approached. Along with each ξ_k there is an associated likelihood L_k; together they compose a list of pairs {(ξ_k, L_k)}_{k=1}^{M}, which is the main output of the algorithm and is used to compute the sought-after quantities.

¹A procedure for validating this assumption of uniformity has been suggested by Fowlie et al. [34].
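The statistical estimate (3.12) is easy to check numerically. The following sketch (our own illustration, not part of the thesis implementation) draws shrinkage ratios t_k ∼ Beta(N, 1) according to Equation (3.8) and compares the accumulated ln ξ_k with its expected value −k/N from Equation (3.11):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 1000, 5000                     # active points, nested sampling iterations

# Shrinkage ratios t_k ~ Beta(N, 1), Equation (3.8).
t = rng.beta(N, 1, size=M)

# ln xi_k = sum_j ln t_j, Equation (3.10); its expected value is -k/N, Eq. (3.11).
ln_xi = np.cumsum(np.log(t))
k = np.arange(1, M + 1)

print(ln_xi[-1])                      # close to -M/N = -5
print(np.max(np.abs(ln_xi + k / N)))  # deviation grows only like sqrt(k)/N
```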
3.2.2 Obtaining the evidence and posterior The evidence Z is estimated by approximating the integral expression of Equation (3.5) by a weighted sum using the list of samples {(ξk, Lk)}Mk=1 according to Z = ∫ 1 0 L(ξ)dξ ≈ M∑ k=1 Lk∆ξk. (3.13) Straight-forward estimation of the mass element by ∆ξk = ξk−1 − ξk implies a numerical integration error of O(N−1). Any other valid numerical integration method would also be an option, such as the trapezoidal rule ∆ξk = 1 2(ξk−1 − ξk+1) which lowers the error 2Sample from Beta distribution with parameters N and 1. 17 3. Nested sampling 0 1 ξ L(ξ) ξk = e−k/N Figure 3.3: Sequence of discarded points obtained by estimating the prior mass fractions ξk statistically. The data is artificially produced. to O(N−2) in most cases [9]. Improvement from using the trapezoidal rule will however be small as the principal source of error is assumed to originate from the crude estimate ξk = e−k/N that will be discussed further in Section 3.2.5. Along with each pair (ξk, Lk) the algorithm also produces a parameter sample θk with an associated probability weight wk = 1 Z Lk∆ξk. (3.14) Together θk and wk form a list of weighted samples that can be used to estimate any quantity Q(θ) with respect to the posterior via E[Q] = ∫ Q(θ)P(θ)dnθ ≈ M∑ k=1 wkQ(θk) (3.15) The shape of the posterior can be extracted by binning the obtained samples θk, creating a histogram that can be visualized by plotting its one- and/or two-dimensional projec- tions. As opposed to MCMC samples, the samples produced by nested sampling are not equally weighted implying that their contribution to the bin values (or “heights”) should be proportional to their associated weights wk. We also see this in the formula for the ex- pectation value (3.15), which takes into account that the samples are unequally weighted. It is however possible to extract an equally weighted subset of the nested samples making storage more convenient. One method for extracting samples with equal weights is stair- case sampling [4]. In summary, the nested sampling procedure described above provides natural means of both computing the evidence Z and revealing the structure of the posterior distribution P(θ). An additional quantity, relevant for post-analysis, is the information 18 3. Nested sampling H = P(θ) ln [P(θ) π(θ) ] dnθ = ∫ 1 0 P (ξ) lnP (ξ)dξ ≈ M∑ k=1 wk ln [ Lk Z ] (3.16) being a logarithmic measure of the prior-to-posterior shrinkage. A useful interpretation is H = “information gained going from the prior to the posterior” [36]. The information H (3.16) is in the nested sampling context used to formulate an uncertainty estimation which we shall discuss in Section 3.2.5. 3.2.3 Pseudocode Algorithm 3.1 portrays the basic principle of the nested sampling procedure in pseudocode format. Nested sampling is performed M times, pre-defined by the user. In practice it is preferable to automate the termination by imposing a condition for when to stop sampling, this will be further discussed next in Section 3.2.4. Algorithm 3.1: Nested sampling procedure. Input: Prior π(θ) and likelihood L(θ) Output: Evidence Z and posterior samples θk with weights wk Initialization Draw N active points θa 1, ...,θ a N from the prior π(θ) Set Z ← 0 and ξ0 ← 1 for k = 1, 2, ...,M do θk ← θa worst, point in the current set θa 1, ...,θ a N with lowest likelihood Lk ← L(θk), likelihood bound ξk ← exp(−k/N) ∆ξk ← ξk−1 − ξk or e.g. 
    ∆ξ_k ← ξ_{k−1} − ξ_k, or e.g. ½(ξ_{k−1} − ξ_{k+1})
    w_k ← L_k ∆ξ_k
    Z ← Z + w_k
    Replace θ^a_worst with an independent sample from π(θ) obeying L(θ) > L_k
  Normalize the weights, i.e. w_k ← w_k/Z

3.2.4 Termination

The main nested sampling loop can be set to terminate after a fixed number of steps M. It may however be desirable to implement a more well-motivated stopping criterion. One such approach is to stop when the remaining contribution to Z is smaller than some (small) fraction f. At iteration k, the remaining contribution is taken to be bounded from above by the maximum likelihood of the set of active points, $L^a_\mathrm{max} = \max(L(\boldsymbol\theta^a_1), ..., L(\boldsymbol\theta^a_N))$, multiplied by the current prior mass ξ_k. The condition then reads
\[
  L^a_\mathrm{max}\,\xi_k < f Z_k \implies \text{Termination}, \tag{3.17}
\]
where Z_k is the evidence accumulated at iteration k. If $L^a_\mathrm{max}$ is close to the true maximum likelihood, this stopping criterion implies that approximately all but a fraction f of the evidence has been accounted for. The accumulated Z usually increases rapidly as regions of higher likelihood are found and flattens out when the bulk of the posterior is reached and the decrease in range $\Delta\xi_k \propto e^{-k/N}$ starts to dominate the increases in L_k. The amount of prior mass associated with the region of the posterior bulk is roughly estimated by $\xi \approx e^{-H}$, which in general could be very small. Recalling that $\xi_k = e^{-k/N}$, the number of steps taken to reach the bulk according to Equation (3.11) is $NH \pm \sqrt{NH}$, resembling the mean and standard deviation of a Poisson distribution. In practice — where computations are performed in terms of logarithmic quantities for numerical stability — the condition is implemented in the form
\[
  \ln(Z_k + L^a_\mathrm{max}\,\xi_k) - \ln(Z_k) = \ln\!\left(1 + \frac{L^a_\mathrm{max}\,\xi_k}{Z_k}\right) < f_\mathrm{ln} \implies \text{Termination}. \tag{3.18}
\]
Equation (3.18) is equivalent to Equation (3.17) if $f_\mathrm{ln} = \ln(1 + f)$, which means that if $f \ll 1$ then $f_\mathrm{ln} \approx f$.

3.2.5 Uncertainty estimation

The major source of uncertainty is assumed to stem from the variability in the number of nested sampling steps taken to reach the bulk of the posterior mass, $NH \pm \sqrt{NH}$. This corresponds to an uncertainty in ln ξ of $\sqrt{NH}/N = \sqrt{H/N}$, which through Equation (3.13) also gives an uncertainty in ln Z of $\sqrt{H/N}$. The full expression for ln Z including the uncertainty estimate is thus
\[
  \ln Z \approx \ln\!\left(\sum_{k=1}^{M} L_k \Delta\xi_k\right) \pm \sqrt{\frac{H}{N}}. \tag{3.19}
\]
Expressing the uncertainty as an additive term to ln Z, rather than as a geometrical factor $e^{\pm\sqrt{H/N}}$ on Z, is common practice since the distribution of ln Z should have better properties in terms of symmetry and similarity to a Gaussian. It would also be disadvantageous to translate the uncertainty to an additive deviation in Z, as that would make it possible for negative Z values to lie within the quoted range. As an example, an assumed normally distributed ln Z = 100 ± 10 would naively translate to Z = e^150 ± e^200, which obviously is a useless statement.

3.3 Generating new points

Until now the details of how to generate new points as the nested sampling algorithm progresses have been left out. This task is at the very core of the method and is of utmost importance for obtaining useful results. Each iteration requires a new sample θ which needs to:

(a) be sampled in proportion to the prior π(θ),
(b) be (approximately) independent from other samples,
(c) obey the current likelihood constraint L(θ) > L∗.

These requirements make the task of generating new points non-trivial and worthy of special attention. The likelihood constraint (c) stands out as it exponentially reduces the range of active prior mass to explore as sampling progresses.
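To make the structure of Algorithm 3.1 and the stopping condition (3.18) concrete, the following is a minimal Python sketch of the outer loop with the likelihood-constrained prior sampling left as a black box. The function sample_constrained_prior is a hypothetical placeholder for the methods discussed below, and the sketch is an illustration of the bookkeeping in log space only, not the implementation used in this work.

import numpy as np

def nested_sampling(log_likelihood, sample_prior, sample_constrained_prior,
                    N=1000, f_ln=0.01, max_iter=100_000):
    """Sketch of Algorithm 3.1 with the termination condition of Eq. (3.18)."""
    active = sample_prior(N)                        # active points theta^a
    logL_active = np.array([log_likelihood(p) for p in active])
    logZ = -np.inf                                  # accumulated ln Z
    log_xi = 0.0                                    # ln xi_0 = 0
    dead_logL, dead_logw = [], []                   # stored (ln L_k, ln w_k)

    for k in range(1, max_iter + 1):
        worst = np.argmin(logL_active)              # point with lowest likelihood
        logL_k = logL_active[worst]
        log_xi_new = -k / N                         # ln xi_k = -k/N, Eq. (3.12)
        # ln(Delta xi_k) with Delta xi_k = xi_{k-1} - xi_k
        log_dxi = log_xi + np.log1p(-np.exp(log_xi_new - log_xi))
        log_wk = logL_k + log_dxi                   # ln(L_k * Delta xi_k)
        logZ = np.logaddexp(logZ, log_wk)           # Z <- Z + w_k
        dead_logL.append(logL_k)
        dead_logw.append(log_wk)

        # Termination, Eq. (3.18): remaining contribution below tolerance f_ln
        log_remain = np.max(logL_active) + log_xi_new
        if np.logaddexp(logZ, log_remain) - logZ < f_ln:
            break

        # Replace the worst point by a constrained prior sample with L > L_k
        active[worst], logL_active[worst] = sample_constrained_prior(
            active, logL_active, logL_star=logL_k)
        log_xi = log_xi_new

    weights = np.exp(np.array(dead_logw) - logZ)            # Eq. (3.14)
    H = np.sum(weights * (np.array(dead_logL) - logZ))      # Eq. (3.16)
    return logZ, np.sqrt(H / N), weights                    # Eq. (3.19)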
In many sampling scenarios one often resorts to MC methods in general and MCMC methods in particular (see Section 2.2). This approach is available also in this case, but with the addition of the likelihood constraint. For the remaining part of this thesis the main topic will be MCMC methods adjusted to generate samples from the constrained prior in the context of nested sampling. Modified versions of MCMC methods for constrained prior sampling are introduced in the following sections and are further described and implemented in Chapter 4.

3.3.1 Constrained Metropolis

The standard Metropolis-Hastings algorithm, described in Section 2.2.1, is likely the best known among MCMC methods. As suggested by Sivia and Skilling [4] and followed up by Feroz and Hobson [11], it can be applied in the context of nested sampling by simply adding the likelihood constraint L(θ) > L∗ to the acceptance conditions. At each nested sampling iteration, one of the active points θ is picked at random as the start of a random walk. A step to a new point θ′ is generated from a symmetric proposal distribution Q(θ′|θ) and accepted with probability
\[
  \alpha =
  \begin{cases}
    \min\!\left(1, \dfrac{\pi(\boldsymbol\theta')}{\pi(\boldsymbol\theta)}\right), & \text{if } L(\boldsymbol\theta') > L^*\\[1ex]
    0, & \text{otherwise}.
  \end{cases} \tag{3.20}
\]
The proposal distribution is usually an isotropic Gaussian $Q(\boldsymbol\theta'|\boldsymbol\theta) = \mathcal{N}(\boldsymbol\theta'; \boldsymbol\theta, \sigma_Q^2 \mathbb{1}_n)$, where the scale σ_Q is a free parameter that needs to be tuned in relation to the problem. However, it is fully possible to use a Gaussian with a general covariance matrix Σ_Q. As the Metropolis algorithm produces correlated samples, a sufficiently large number of steps N_s needs to be taken in order for the new sample to become independent of its starting point. This might be a cause of concern as the active prior volume shrinks at each iteration and acceptable Metropolis steps become rarer. However, Sivia and Skilling [4] suggest that one should always use N_s ≈ 20, a recommendation that we shall evaluate later. Since the allowed region shrinks in each nested sampling iteration, it would be unwise to keep the distribution scale σ_Q, or scales Σ_Q, constant, as it would make it increasingly harder to propose acceptable positions. This would cause the acceptance rate to drop drastically, making the exploration very inefficient. A scheme for adjusting the step scale will be suggested in Section 4.2.

3.3.2 Galilean Monte Carlo

A likelihood-constrained version of the HMC algorithm has been suggested by Betancourt [37] to produce high-quality samples from the prior in the context of nested sampling. In constrained HMC, the likelihood contour is interpreted as a hard wall, or infinite potential barrier in terms of Hamiltonian mechanics, with a potential of the form
\[
  V_\mathrm{cHMC}(\mathbf{x}) =
  \begin{cases}
    -\ln \pi(\mathbf{x}), & \text{if } L(\mathbf{x}) > L^*\\
    \infty, & \text{otherwise}.
  \end{cases} \tag{3.21}
\]
The time evolution of the trajectories is therefore adjusted such that they specularly reflect if they encounter the L(x) = L∗ barrier. As a starting point for the Hamiltonian evolution, one of the active points x of the current nested sampling iteration is chosen at random, ensuring that the constraint will be satisfied also for a new point x′ if the reflection is properly conducted. This procedure, however, requires the gradient of the prior to be evaluated in every time step. In fact, for a typical problem the prior is often at least approximately constant in regions where the likelihood varies significantly. This means that the prior gradient will be small and that it will not influence the trajectories enough to make it worthwhile to compute.
As a simpler, more straightforward successor to constrained HMC, Skilling suggests Galilean Monte Carlo (GMC) [19, 38, 39]. Time evolution in GMC is performed without the influence of any forces, and therefore without the need for prior gradient evaluation. For the sake of convenience and presentation we shall, without loss of generality, adopt coordinates in which the prior is flat, i.e. π(x) ≡ 1, over the unit hypercube. Such a coordinate transformation is always possible [38] but may be quite impractical, depending on the form of π(x). This transformation is discussed in more detail in Section 4.1. GMC trajectories proceed as straight lines (when the coordinates are transformed back to the original prior space, the trajectories follow geodesics determined by the geometry of the prior, which in general are not straight lines) as long as they are within the allowed region, according to the potential
\[
  V_\mathrm{GMC}(\mathbf{x}) =
  \begin{cases}
    \text{const.}, & \text{if } L(\mathbf{x}) > L^*\\
    \infty, & \text{otherwise}.
  \end{cases} \tag{3.22}
\]
Time evolution is particularly simple: a phase space point (x, v), where v is its velocity (for GMC we adopt the notion of velocity v to distinguish it from the HMC momentum p), will in one time step move to a new point
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}', \mathbf{v}) = (\mathbf{x} + \tau\mathbf{v},\ \mathbf{v}) \quad \text{(proceed)}, \tag{3.23}
\]
where τ is the size of the time step. This step is acceptable under the assumption that x′ is still within the available region. We consequently need a mechanism to reroute and return a point back into the active region if the step according to (3.23) were to cross the likelihood barrier. A natural way to do this is to let the point reflect specularly off the iso-likelihood surface, using the fact that the surface normal vector n is proportional to the gradient of the likelihood ∇L, or equivalently to the gradient of the log-likelihood ∇ ln L. A reflection of this sort, occurring at the point (x′, v), transforms (x, v) according to
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}'', \mathbf{v}') = (\mathbf{x} + \tau\mathbf{v} + \tau\mathbf{v}',\ \mathbf{v} - 2\mathbf{n}\mathbf{n}^T\mathbf{v}) \quad \text{(reflect)}, \tag{3.24}
\]
where $\mathbf{n} = \nabla \ln L(\mathbf{x}')/|\nabla \ln L(\mathbf{x}')|$. A schematic illustration of the GMC exploration, including reflections off the hard walls, is shown in Figure 3.4a. It is obviously unlikely for x′ to lie exactly on the iso-likelihood surface, but the gradient at x′ works as a substitute for the gradient at the ideal reflection point located somewhere between x and x′. The reflected point x′′ will nevertheless often be returned to the active region, making it possible for the trajectory to continue with a redirected velocity. This reflection-by-proxy scheme is illustrated in Figure 3.4b. In the presumably rare case that also L(x′′) > L∗ is violated, a possible option satisfying detailed balance is to reverse the trajectory according to
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x}, -\mathbf{v}) \quad \text{(reverse)}. \tag{3.25}
\]
Detailed balance actually allows an additional option in the reflection's mirror point,
\[
  (\mathbf{x}, \mathbf{v}) \to (\mathbf{x} + \tau\mathbf{v} - \tau\mathbf{v}',\ -\mathbf{v}') \quad \text{(mirrored reflection)}, \tag{3.26}
\]
which in most cases should lie outside of the allowed region but is nevertheless still possible. For detailed balance (time-reversal symmetry) to remain satisfied under the inclusion of the mirrored reflection, either the reflection or the mirrored reflection should be acceptable if a trajectory is to redirect, but not both. If both were acceptable, the time-reversed trajectory would break detailed balance.

Figure 3.4: Illustrations of the basic principles of Galilean Monte Carlo. (a) Bird's-eye view illustration of a typical GMC trajectory propagating in a constrained region of parameter space. (b) Detailed illustration of the GMC reflection off the proxy iso-likelihood surface.
As the active region shrinks during the nested sampling iterations it is necessary to adjust the effective size of the GMC steps accordingly. If no adjustment of the step size were carried out, fewer and fewer proposed steps would be accepted, making the exploration increasingly inefficient. A GMC implementation is presented in Section 4.3 and includes a suggested scheme for adjusting the effective step size.

3.3.3 Constrained stretch move

As previously stressed, one of the key difficulties of nested sampling is to produce samples from the prior for successively smaller regions of parameter space. Furthermore, this region is typically highly asymmetric, spanning significantly different distances in different dimensions. This problem of varying and separated scales is exactly what the stretch move algorithm, outlined in Section 2.2.3, is designed to manage due to its affine invariance. We therefore propose a constrained version of the stretch move, adapted for the context of nested sampling. The stretch move depends on an ensemble of walkers, a requirement which is naturally met by the active set of N points used in nested sampling. At each iteration one of the active points Θ_k is randomly selected to perform the stretch move random walk, where one of the remaining points Θ_j is used to propose a new position Y according to Equation (2.27). Each step is then accepted with probability
\[
  \alpha =
  \begin{cases}
    \min\!\left(1, \tilde\zeta^{\,n-1}\,\dfrac{\pi(\mathbf{Y})}{\pi(\boldsymbol\Theta_k)}\right), & \text{if } L(\mathbf{Y}) > L^*\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{3.27}
\]
which is of the exact same form as the corresponding constrained Metropolis acceptance probability defined in Equation (3.20). The constrained stretch move is implemented, evaluated and further discussed in Section 4.4.

3.3.4 Ellipsoidal nested sampling

A rather different (non-MCMC) approach to likelihood-constrained prior sampling is ellipsoidal sampling (Mukherjee et al. [12]). From the covariance matrix of the current set of active points it is possible to define an n-dimensional ellipsoid intended to approximate the iso-likelihood surface defined by L(θ) = L∗. Ellipsoidal sampling proceeds by enlarging the ellipsoid by some factor, compensating for the iso-likelihood surface not being perfectly ellipsoidal, and then drawing a sample from the prior volume within this bound. The more closely the ellipsoid approximates the iso-likelihood surface, the higher the probability that the sample fulfills the likelihood constraint. In a hypothetical situation where the iso-likelihood surface is perfectly ellipsoidal, the acceptance ratio is one. Shaw et al. [40] improve the method for multi-modal distributions and give a method for uniform sampling of an ellipsoid (the extension to non-uniform priors is straightforward). Ellipsoidal sampling has proven quite successful and is used in state-of-the-art implementations such as MultiNest by Feroz et al. [10]. The benefits of ellipsoidal sampling are that the samples produced are independent and that this can be achieved with typically very few likelihood evaluations. This is in contrast to MCMC methods, which need several steps for samples to become (approximately) independent, where each of those steps requires the likelihood to be evaluated. On the other hand, the iso-likelihood surface is not always well approximated by the ellipsoid, in which case the method will fail without any warning signal.
No implementation of ellipsoidal sampling has been carried out in this work, but it will act as a benchmark for the MCMC methods outlined above through the use of MultiNest via the Python code PyMultiNest [24].

4 Constrained prior sampling implementations

Three different MCMC schemes for generating constrained prior samples, as required by the nested sampling algorithm, were implemented in this work and are described in this chapter. The implementations are based on the methods introduced in Section 3.3 and will be referred to as constrained Metropolis, GMC and constrained stretch move. These three MCMC methods have all been integrated into a nested sampling framework that was developed in this work and that follows the principles presented in Chapter 3 in general and Algorithm 3.1 in particular. For our purposes we have mainly used N = 1000 active points and imposed a stopping criterion according to Equation (3.18) using f_ln = 0.01. Here follows a description of the specific implementation designs and usage, including e.g. method-specific hyperparameters. Each method has two main input parameters: the number of exploration steps per iteration and a step size scale. Demonstration and testing will be carried out by applying each method to the toy problem introduced in Section 2.1.1.

4.1 Choice of coordinates: the unit hypercube

A nested sampling convention is to work with parameter coordinates u = (u_0, ..., u_{n−1}) in which the prior is uniform over a cube of unit volume in n-dimensional space. This convention is also adopted in this work and has already been mentioned in the context of GMC in Section 3.3.2. If the joint prior probability is independent, meaning that it can be written as a product of individual parameter priors, $\pi(\boldsymbol\theta) = \pi_0(\theta_0)\cdots\pi_{n-1}(\theta_{n-1})$ (this is the case for the prior defined in Equation (2.19)), the transformation relating θ to u is
\[
  \theta_i(u_i) = \Pi_i^{-1}(u_i), \tag{4.1}
\]
where $\Pi_i(\theta_i) = \int_{-\infty}^{\theta_i} \pi_i(\theta_i')\,d\theta_i'$ is the cumulative distribution function (cdf) for θ_i. The MCMC implementations described below are formulated in terms of the coordinates u, but the obtained results are transformed back to θ using Equation (4.1). A consequence of working in coordinates where the prior is flat is that any Metropolis-style rejection step simplifies: the acceptance probability α (see e.g. Equations (3.20) and (3.27)) contains a prior ratio π(θ′)/π(θ), which cancels if the prior is constant.

4.2 Constrained Metropolis implementation

The key object in the Metropolis scheme is the proposal distribution Q(u′|u) from which new positions u′ are proposed in the random walk, depending on the current position u. The common approach of taking Q to be a symmetric Gaussian centered at u with a fixed covariance matrix $\Sigma_Q = \sigma_Q^2 \mathbb{1}_n$ (see Section 2.2.1) would be quite insufficient for the requirements of nested sampling. Fixing the scale σ_Q of the proposal distribution would not be compatible with the exponentially shrinking domain of nested sampling. We therefore propose that information from the active set of points $\mathbf{u}^a_1, ..., \mathbf{u}^a_N$ be utilized to estimate the scale of the active region. Firstly, we randomly select a starting point u_start from the active set. Secondly, a subset $\{\mathbf{u}_m\}_{m=1}^{M}$ of the remaining active points is created, also at random. The size of the subset, M, is in this work always taken to be $M = \max(1, \lfloor N/10 \rfloor)$.
Thirdly, the covariance matrix elements of the proposal distribution are set according to
\[
  (\Sigma_Q)_{ij} =
  \begin{cases}
    \dfrac{s^2}{M} \sum_{m=1}^{M} (\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2, & \text{if } i = j\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{4.2}
\]
where $(\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2$ denotes the squared distance along dimension i and s is an overall scale parameter. With this formula, Σ_Q is a diagonal matrix with variances on the diagonal proportional to estimates of the typical squared distances for each dimension in the active set of points and, by extension, in the active region. A typical value of the scale parameter is s ∼ 0.1 in order for proposed steps to be well within the allowed region. Effects of different choices of s are studied in Section 4.6.1. To introduce additional randomness, the number of steps taken in every walk is drawn from a discrete uniform distribution $U(\tfrac{1}{2}\langle N_s\rangle, \tfrac{3}{2}\langle N_s\rangle)$, where the average ⟨N_s⟩ is set by the user. The proposed constrained Metropolis algorithm is shown in Algorithm 4.1.

Algorithm 4.1: Constrained Metropolis procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start and Σ_Q from u^a_1, ..., u^a_N, M and s
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  t ← 0
  while t < N_s or acceptance is zero do
    u′ ← sample from Q(u′|u^(t)) = N(u′; u^(t), Σ_Q)
    if L(θ(u′)) ≥ L∗ then
      u^(t+1) ← u′
    else
      u^(t+1) ← u^(t)
    t ← t + 1
  u_new ← last position of the chain

An example of a constrained Metropolis walk can be seen in Figure 4.1 and is, for the sake of visualization, conducted in two dimensions. The walk starts at the green plus-shaped marker and ends at the position of the red cross, which is promoted to membership of the active set. The hyperparameter settings used in this example are ⟨N_s⟩ = 20 and s = 0.5. The number of unique steps in the walk in Figure 4.1 is however fewer than ½⟨N_s⟩ = 10, meaning that not all steps have been accepted. The acceptance rate r_a, defined as the fraction of accepted steps in an MCMC walk or trajectory, is an important measure for assessing the exploration performance and will be discussed later in this thesis (e.g. Section 4.6.1).

Figure 4.1: Example of a Metropolis random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the walk, respectively. The black dots indicate intermediate steps. The average number of steps for this walk is ⟨N_s⟩ = 20 and the scale parameter is s = 0.5.

4.3 Galilean Monte Carlo implementation

The GMC equivalent of the Metropolis proposal distribution Q discussed in Section 4.2 is the velocity distribution q(v), from which the initial velocity is sampled at the beginning of each iteration (analogous to the HMC case discussed in Section 2.2.2). By the same arguments as in the Metropolis case above, this distribution needs to adapt to the shrinking active volume. We therefore similarly suggest that q(v) be a Gaussian N(v; 0, Σ_vel), where the covariance matrix elements at each iteration are set according to
\[
  (\Sigma_\mathrm{vel})_{ij} =
  \begin{cases}
    \dfrac{1}{M} \sum_{m=1}^{M} \dfrac{1}{\tau_\mathrm{ref}^2} (\mathbf{u}_\mathrm{start} - \mathbf{u}_m)_i^2, & \text{if } i = j\\[1ex]
    0, & \text{otherwise},
  \end{cases} \tag{4.3}
\]
where u_start and u_m are the same as before and τ_ref = 1 is a reference time scale which gives the covariances units of velocity squared.
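The diagonal scale estimates in Equations (4.2) and (4.3) are straightforward to compute from the active set. The following NumPy sketch illustrates one possible way of doing so; the function name and the subset_frac argument are illustrative only and not part of the implementation described in this chapter.

import numpy as np

def diagonal_proposal_scales(u_start, active, s=0.5, subset_frac=0.1, rng=None):
    """Per-dimension proposal variances as in Eq. (4.2).

    u_start : (n,) current starting point (one of the active points)
    active  : (N, n) array with the remaining active points
    s       : overall scale parameter
    Returns the diagonal of Sigma_Q. Dividing by tau_ref**2 instead of
    multiplying by s**2 gives the velocity variances of Eq. (4.3).
    """
    rng = np.random.default_rng() if rng is None else rng
    M = max(1, int(len(active) * subset_frac))      # roughly M = max(1, floor(N/10))
    subset = active[rng.choice(len(active), size=M, replace=False)]
    sq_dist = (u_start - subset) ** 2               # (u_start - u_m)_i^2 per dimension
    return s**2 * sq_dist.mean(axis=0)              # diagonal of Sigma_Q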
The lack of a variable overall length scale parameter in Equation (4.3), compared to the corresponding s in Equation (4.2), is due to the time step formulation "u + τv" of GMC, where τ is used to set the overall step size. We will in Section 4.6.1 see the impact of specific choices of τ. Furthermore, the number of steps taken in the trajectory is set randomly in the same way as in Section 4.2. Consequently, the average number of steps ⟨N_s⟩ together with the scale parameter τ constitutes the main user input for GMC. Following this scheme and the principles laid out in Section 3.3.2, the full implemented GMC procedure is shown in Algorithm 4.2. The four possible outcomes after a GMC step described in Section 3.3.2 are here denoted North (proceed), East (reflect), West (mirrored reflection) and South (reverse). It is important to stress that it is the gradient of the likelihood with respect to u that should be computed to obtain the normal to the reflection surface. This is related to the gradient with respect to θ by the chain rule via the Jacobian matrix J of the transformation in Equation (4.1), according to $\nabla_u \to J(\mathbf{u})\nabla_\theta$. The Jacobian matrix element definition used here is $J_{ij} = \partial\theta_j/\partial u_i$.

Algorithm 4.2: Galilean Monte Carlo procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start and Σ_vel from u^a_1, ..., u^a_N and M
  v ← sample from q(v) = N(v; 0, Σ_vel)
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  for t = 0, 1, ..., N_s − 1 do
    u′ ← u^(t) + τv
    N ← L(θ(u′)) ≥ L∗                 # Continue north?
    if N then
      u^(t+1) ← u′                    # Go north
    else
      n ← ∇_u ln L(θ(u′)) / |∇_u ln L(θ(u′))|
      v′ ← v − 2nn^T v
      # Check possible directions
      E ← L(θ(u′ + τv′)) ≥ L∗         # Reflection to east
      W ← L(θ(u′ − τv′)) ≥ L∗         # Mirrored reflection to west
      S ← L(θ(u′ − τv)) ≥ L∗          # Reversal to south
      if S and (E but not W) then
        u^(t+1) ← u′ + τv′            # Go east
        v ← v′
      else if S and (W but not E) then
        u^(t+1) ← u′ − τv′            # Go west
        v ← −v′
      else
        u^(t+1) ← u^(t)               # Aim south
        v ← −v
  u_new ← last point of the trajectory, u^(N_s)

Figure 4.2 shows an example of a GMC trajectory in n = 2 dimensions contained by an iso-likelihood contour. The trajectory is initiated at the green plus-shaped marker, reflects twice off the walls and ends at the red cross. The reflections in the figure do not appear physically correct given the directions of the iso-likelihood surface. This is, however, a combined effect of the different scales on the two axes and of the fact that the trajectory is presented in the original coordinates θ whereas the simulation is performed in the flat coordinates u. The parameters used for this example are ⟨N_s⟩ = 20 and τ = 0.1.

Figure 4.2: Example of a GMC trajectory in a likelihood-constrained region of a two-dimensional parameter space. The trajectory is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the trajectory, respectively. The black dots indicate intermediate time steps. The reflections do not occur exactly at the boundary but rather at proxy surfaces just outside the boundary (not in figure), as described in Section 3.3.2.
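A compact Python sketch of a single GMC trajectory, following the proceed/reflect/mirror/reverse logic of Algorithm 4.2 for a flat prior on the unit cube, is given below. Here log_L and grad_log_L are hypothetical problem-specific callables (the log-likelihood and its gradient with respect to u); handling of the unit-cube edges and the randomized step count are omitted for brevity, so this is an illustration of the step logic rather than the full implementation.

import numpy as np

def gmc_trajectory(u0, v, tau, n_steps, log_L, grad_log_L, logL_star):
    """One Galilean MC trajectory in unit-cube coordinates (sketch of Alg. 4.2)."""
    inside = lambda u: log_L(u) > logL_star        # likelihood constraint L > L*
    u = u0.copy()
    for _ in range(n_steps):
        u_prop = u + tau * v
        if inside(u_prop):                         # go north (proceed)
            u = u_prop
            continue
        g = grad_log_L(u_prop)
        n = g / np.linalg.norm(g)                  # proxy surface normal, Eq. (3.24)
        v_ref = v - 2 * n * (n @ v)                # specularly reflected velocity
        east = inside(u_prop + tau * v_ref)        # reflection viable?
        west = inside(u_prop - tau * v_ref)        # mirrored reflection viable?
        south = inside(u_prop - tau * v)           # reversal viable?
        if south and east and not west:
            u, v = u_prop + tau * v_ref, v_ref     # go east (reflect)
        elif south and west and not east:
            u, v = u_prop - tau * v_ref, -v_ref    # go west (mirrored reflection)
        else:
            v = -v                                 # aim south (reverse), stay at u
    return u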
4.4 Constrained stretch move implementation

One major advantage of the combination of the stretch move and nested sampling, proposed in this thesis, is that it requires no explicit adjustment as the nested sampling process progresses and the active volume shrinks. The stretch move proposal distribution naturally adapts to the decreasing scales of parameter space and there is no need to introduce an approach similar to the covariance matrix updates defined in Equations (4.2) and (4.3). The number of stretch move steps is randomly drawn as above and the average number of steps ⟨N_s⟩ is set by the user. Apart from the number of steps, the user additionally needs to specify the scale parameter a that sets the extent of the distribution g(ζ) defined in Equation (2.28). In Section 4.6.1 we will study the sampling performance based on the choice of a. The full stretch move procedure is shown in Algorithm 4.3. A constrained stretch move walk in n = 2 dimensions is shown in Figure 4.3, with hyperparameters ⟨N_s⟩ = 20 and a = 2. It is clear that the step sizes seem to fluctuate more compared to the Metropolis walk in Figure 4.1. This could potentially be an effect of the fact that the Metropolis procedure uses the same proposal distribution in every step, whereas the stretch move proposal distribution changes in every step depending on the randomly selected point u_j. This is, however, the result of only a single iteration from a single run, and one should be careful when drawing conclusions from it.

Algorithm 4.3: Constrained stretch move procedure.
  Input: Likelihood limit L∗ and active points u^a_1, ..., u^a_N
  Output: New active point u_new
  Set u^(0) ← u_start from u^a_1, ..., u^a_N
  N_s ← sample from U(½⟨N_s⟩, 3/2⟨N_s⟩)
  t ← 0
  while t < N_s or acceptance is zero do
    Draw u_j randomly from the complementary set {u^a_j}, j = 1, ..., N, excluding u_start
    ζ̃ ← random sample from g(ζ), Equation (2.28)
    u′ ← u_j + ζ̃(u^(t) − u_j)
    α ← ζ̃^(n−1), Equation (3.27)
    r ← random sample from U(0, 1)
    if r ≤ α and L(θ(u′)) ≥ L∗ then
      u^(t+1) ← u′
    else
      u^(t+1) ← u^(t)
    t ← t + 1
  u_new ← last position of the chain

Figure 4.3: Example of a stretch move random walk in a likelihood-constrained region of a two-dimensional parameter space. The walk is captured at a nested sampling iteration with ξ ≈ e^{−H}. The green plus and the red cross indicate the start and the end of the walk, respectively. The black dots indicate intermediate steps.

4.5 Monitoring the sampling progress

The progress of the sampling over the course of a nested sampling run can be examined by tracking a few key quantities. Examples of such quantities are the likelihood L defined in Equation (3.4), the posterior weights w = L∆ξ/Z defined in Equation (3.14) and the acceptance rate
\[
  r_a = \frac{\#\text{ steps accepted}}{\#\text{ steps proposed}} \tag{4.4}
\]
in each iteration. The tracking of these quantities is shown in Figure 4.4 for the three methods applied to the problem of Sections 2.1.1 and 2.1.4 in n = 3 dimensions using ⟨N_s⟩ = 40. The natural choice of progress parameter is − ln ξ, which is directly proportional to the iteration number k according to Equation (3.12). We see clearly that it takes a considerable fraction of the total number of iterations before significant likelihood values are reached (top panel).
The peak of the weights (middle panel) indicates where the bulk of the posterior probability mass is located, and it is obvious that this does not coincide with the point of maximum likelihood. The reason for this is that as the likelihood increases, the prior mass elements, ∆ξ ∝ e^{−k/N}, decrease exponentially. This tug of war implies the existence of an optimal point where w is maximized. The vertical lines (middle panel) accompanied by bands (see the inset) show the average and uncertainty, NH ± √(NH), of the number of steps needed to reach the bulk, as discussed in Section 3.2.5, where the information H is computed according to Equation (3.16). These estimates seem to predict reasonably accurately where the weights have reached significant values, i.e. where the bulk of the posterior mass is. The most obvious difference between the methods is, however, the evolution of the acceptance rate r_a (bottom panel). The acceptance rate values typically vary rapidly between individual iterations, making it necessary to apply a moving median (the median equivalent of a moving, or rolling, average, which is better suited for constrained and skewed data [41], as is the case for r_a) to read out the overall behavior. This has been done to obtain the acceptance rate curves in Figure 4.4, and the associated band edges show the 16th and 84th percentiles, respectively. We see from these percentiles that the typical variation between iterations is reasonably large for all three methods, especially for GMC in the second half of the run where we also observe a prominent asymmetry. Furthermore, it is natural for the GMC acceptance rate to be larger compared to the other methods, as the GMC step should be smaller in order to create a trajectory rather than a random walk. The most drastic change is seen for the Metropolis acceptance rate, whose median drops to approximately a third of its initial value. The drop ends roughly in the region where the posterior mass starts to accumulate, which means this is where the algorithm struggles the most to find acceptable proposal steps. Remarkably, the stretch move median acceptance rate is fairly unchanged over the course of the run, which suggests that the proposal distribution adapts well to the successively stricter likelihood constraint. The results of the nested sampling runs described above can be seen in Table 4.1. The computed evidence values ln Z differ between the methods but agree fairly well with the true value ln Z_true = 8.09 obtained in Section 2.1.4 for the model with n = 3. The uncertainty estimates √(H/N) (see Section 3.2.5) are, to two significant figures, equal for the three methods. This is a general observation that has been made throughout this work and that we will return to in Section 5.2.1. The number of likelihood evaluations, N_L, is seen to be almost twice as large for GMC as for the other methods, even though the number of steps is the same. The first reason for this is that the GMC algorithm requires evaluations of the likelihood gradient in the event of a reflection, which is here counted as an extra likelihood evaluation and included in N_L. The fraction of gradient evaluations is given in parentheses in Table 4.1. The second reason is that more than one likelihood evaluation occurs in the event of a reflection, in order to assess the reflection conditions according to Algorithm 4.2.
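As an aside, the moving median and 16th/84th percentile bands used to smooth the acceptance-rate traces in Figure 4.4 can be computed with a few lines of NumPy. The sketch below is illustrative only; in particular, the window length is an arbitrary choice and not the one used to produce the figure.

import numpy as np

def rolling_stats(r_a, window=201):
    """Moving median and 16th/84th percentile bands of an acceptance-rate trace."""
    half = window // 2
    med, lo, hi = [], [], []
    for k in range(len(r_a)):
        chunk = r_a[max(0, k - half):k + half + 1]
        med.append(np.median(chunk))
        lo.append(np.percentile(chunk, 16))
        hi.append(np.percentile(chunk, 84))
    return np.array(med), np.array(lo), np.array(hi)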
Figure 4.4: Progress of three key quantities over the course of a nested sampling run for the three different methods (Metropolis with s = 0.5, GMC with τ = 0.1, stretch move with a = 2). The sampling is performed in n = 3 dimensions using ⟨N_s⟩ = 40 with sampling parameters N = 1000 and f_ln = 0.01. The top panel shows the likelihood L for the worst point of each iteration, the middle panel shows the posterior weights w, used to compute the evidence, and the bottom panel shows the moving median of the acceptance rate r_a for each iteration, all as functions of − ln ξ. The methods distinguish themselves clearly in the variation of the acceptance rate.

Table 4.1: Nested sampling results for the three methods applied to the same problem in n = 3 dimensions. Sampling parameters are N = 1000 active points and f_ln = 0.01 tolerance. The true evidence value for this model is ln Z_true = 8.09, as was analytically obtained in Figure 2.3. The number of likelihood evaluations, N_L, does, in the GMC case, include the number of evaluations of the gradient. The fraction of gradient evaluations is given in parentheses.

| Method       | ⟨N_s⟩ | Scale parameter | ln Z ± √(H/N) | N_L                 | # Iterations  |
|--------------|-------|-----------------|---------------|---------------------|---------------|
| Metropolis   | 40    | s = 0.5         | 8.13 ± 0.10   | ∼ 6.65 · 10^5       | ∼ 1.65 · 10^4 |
| GMC          | 40    | τ = 0.1         | 8.32 ± 0.10   | ∼ 1.12 · 10^6 (13%) | ∼ 1.63 · 10^4 |
| Stretch move | 40    | a = 2.0         | 7.99 ± 0.10   | ∼ 6.68 · 10^5       | ∼ 1.66 · 10^4 |

4.6 Hyperparameters

The three constrained MCMC implementations described in the previous sections have in common that they have two main hyperparameters:

• a scale parameter effectively setting the overall step size, denoted s, τ and a, respectively, in the three approaches
• the average number of steps ⟨N_s⟩.

These hyperparameters need to be set by the user and it is therefore important to have an idea of how they influence the sampling performance. We will therefore proceed by studying the effects of different specific hyperparameter choices.

4.6.1 Sensitivity to the scale parameter value

The scale parameters s, τ and a act as specifiers of the overall step size for the three MCMC exploration methods. They are, however, defined in different ways and will therefore not take on the same values. For instance, the stretch move parameter a defines an interval [a^{−1}, a] from which ζ̃ is drawn (see Equation (2.28)), meaning that we must have a > 1 for this formulation to be reasonable. The Metropolis parameter s, on the other hand, measures the step size relative to the current active volume and should therefore be ∼ 0.1–1.0: not so large that proposed steps frequently fall outside the allowed region, but not so small that the walk fails to cover enough distance. The time step τ of the GMC algorithm should typically be smaller than s such that the proxy surface approximation is acceptable. For the idea of a trajectory to be valid, at least several GMC steps should be taken in a straight line before a reflection occurs, and we should therefore have τ ∼ 0.1. To measure the impact of the scale parameters we apply the methods to the problem presented in Sections 2.1.1 and 2.1.4 using different scale parameter values. The top row of Figure 4.5 shows the error made when computing the (log) evidence, ln Z, for a model of the form (2.8) with n = 3 for different choices of s, τ and a.
It is important to stress that this type of error can only be obtained when the true value Z_true is analytically obtainable, as was done in Section 2.1.4 and shown in Figure 2.3. The bands are standard deviations between five identical runs using different random seeds. The sampling was performed using N = 1000 active points, f_ln = 0.01 tolerance and ⟨N_s⟩ = 40 exploration steps per iteration (on average). Note the logarithmic axes for s and τ. The Metropolis method (left column) slightly overestimates the evidence over the entire range in s, but there appears to be a minimum in the error around s ≈ 0.25–0.5. This is in agreement with the argument above that s should not be so large that too many proposed steps fall outside the allowed region, but not so small that not enough distance is covered. The GMC error (middle column) is seen to be roughly constant for τ ≈ 0.05–0.1 before it drastically diverges at τ ≈ 0.2. This is probably a consequence of too low an acceptance rate causing the trajectories to explore the available parts of parameter space insufficiently. The stretch move error (right column) is in this case seen to be the closest to zero among the three methods and is also quite stable to variations of a around the value a = 2.0 proposed by [31] for the unconstrained case. Note, however, that the scale for a is linear, as opposed to those for s and τ. To quantify how efficiently each method explores parameter space we need to compute the acceptance rate r_a for each nested sampling iteration, defined in Equation (4.4). The value of the acceptance rate will typically vary significantly between individual nested sampling iterations, but will on a larger scale show clear trends, as seen in the bottom panel of Figure 4.4. As the nested sampling compression approaches the bulk of the posterior mass, the active region will have shrunk enough to make r_a tend to lower values. As argued in Sections 3.2.4 and 3.2.5 and shown in the middle panel of Figure 4.4, the bulk is approximately reached when − ln ξ ≈ H, which roughly should be where the value of r_a is of most importance. This motivates us to introduce the bulk median of the acceptance rate, M_H[r_a], defined as the median of the recorded values r_{a,k} over all nested sampling iterations k for which − ln ξ_k ≥ H. Formally,
\[
  \mathcal{M}_H[r_a] = \mathrm{Median}\left[\{r_{a,k},\ \forall k : -\ln\xi_k \geq H\}\right], \tag{4.5}
\]
where the information H is computed according to Equation (3.16). The bulk median M_H[...] should be interpreted as a point estimate of a quantity (the acceptance rate in this case) in the region where it matters the most, i.e. where the bulk of the probability mass is located. Values of M_H[r_a] for the three methods are shown for different values of the scale parameters in the middle row of Figure 4.5. The bands (barely visible) are the same as for the evidence errors in the top row. There is a clear decrease in M_H[r_a] for each of the three methods as their scale parameters are increased. This is to be expected, as a larger step size implies a higher probability for a proposed step to fall outside the active volume. The most drastic change is for the Metropolis case (left column), which drops to near zero already at s ≈ 1.0. Moreover, the acceptance rate should not be too large, as indicated by the evidence error in the top row, which is minimized at s ≈ 0.25–0.5, corresponding to a bulk median acceptance rate of M_H[r_a] ≈ 0.2–0.4.
The stretch move implementation (right column) also shows a significant drop in acceptance rate as a increases. However, the drop is more moderate compared to Metropolis. Comparing this result to the top row of Figure 4.5, we see that M_H[r_a] ≈ 0.4–0.6 for a ≈ 1.5–3.0 seems to be a suitable range, as this is where the evidence computed by the stretch move implementation fluctuates the least. For the GMC implementation (middle column) it is quite noteworthy how the character of the dependence of M_H[r_a] on τ differs from the other cases. A plausible explanation for this fundamentally different behavior is that a single GMC trajectory, as opposed to a Metropolis or stretch move walk, is not a random walk and does not form a Markov chain. Once the GMC velocity is set, the step size does not change during the course of the trajectory, and the direction only changes in the event of a reflection. In contrast, the Metropolis and stretch move procedures draw both direction and size randomly for each step, an important distinction. The bottom row of Figure 4.5 shows how the total number of likelihood evaluations, N_L, varies with the scale parameters. Note that for the random walk methods, Metropolis and stretch move, N_L will only exceed N_s × (# iterations) if there are iterations where none of the first N_s proposed steps are accepted. This is a consequence of the while-loop conditions in Algorithms 4.1 and 4.3, which ensure that at least one step is accepted. This is the reason behind the increase in N_L for the Metropolis case (left column) when s is increased. Furthermore, there seems to be a sudden jump between s = 1.0 and s = 2.0, indicating the existence of a threshold in s above which almost no steps are accepted. This agrees well with the observation of the Metropolis acceptance rate (middle row, left column), which for s = 2.0 is close to zero. In contrast, the str