Professor Darren Wilkinson has an excellent blog post on a common implementation issue that arises when the standard random-walk Metropolis Hastings algorithm, with a Gaussian proposal, is modified to sample from a target distribution $p(x)$ where

$$\text{supp}(p) = \{ x : p(x) > 0 \} = (0, \infty).$$

In this blog post, we are going to briefly review the setup, the problem, and then generalize the solution proposed by Wilkinson to the situation where $\text{supp}(p) = (a,b)$ (instead of the non-negative reals). We will also allow the Gaussian proposal to have a general variance $\sigma^2$ (rather than just $\sigma^2 = 1$).

The purpose of this blog post is

- to illustrate that small programmatic changes in a probabilistic algorithm can violate correctness,

- to hope someone implementing this sampler will find the general derivation informative and useful for their own purposes.

### Sampling From A Gamma

To reuse the example from Wilkinson, suppose we are interested in sampling from a gamma with shape $\alpha = 2$ and scale $\theta =1$,

$$p(x) = \frac{1}{Z} x\exp(-x), x > 0$$

by implementing random walk MH. The proposal $N(\cdot | x_i, \sigma^2)$ is a Gaussian with mean at the current sample $x_{i}$, and known variance $\sigma^2$ (*aside*: the gamma can be sampled much more efficiently). The standard MH algorithm proceeds as

** **1.** initialize** $x_0$ to some value in $(0, \infty)$

** **2.** for** $i = 0,1,2,\dots$

** **2.1** sample** from the proposal, $x’ \sim N(\cdot|x_i,\sigma^2)$

** **2.2** compute **the acceptance ratio $A(x’, x_i) = \frac{ p(x’)N(x_i| x’, \sigma^2) }{ p(x_{i})N(x’| x_i, \sigma^2) } = \frac{p(x’)}{p(x_{i})}$

** **2.3** if **$U[0,1] \le A(x’, x_{i})$

** ****then **$ x_{i+1} \leftarrow x’$ (accept)

** ****else** $ x_{i+1} \leftarrow x_{i}$ (reject)

In the **sample** step 2.1, it is possible that we obtain $x’ < 0$ (for instance, if $x_{i}$ is near 0 or $\sigma^2$ is large). One intuitive approach is to *automatically reject* samples that are outside $\text{supp}(p)$ and repeat until $x>0$. In the case of $Ga(2,1)$, this rejection would be enforced anyway since the acceptance ratio $A(x’,x_{i})$ will be negative when $x’ < 0$ (if we ignore the fact that $p$ is defined on the positive reals and evaluate $p(x’)$ anyway).

However, Wilkinson observes that if we were sampling from $Ga(3,1)$ for instance, then

$$p(x) = \frac{1}{Z} x^2\exp(-x), x > 0$$

which is positive even when $x < 0$ (again, when ignoring the positive constraint on $x$). Our rejection step would **not have** been guaranteed to occur in 2.3 had we decided to keep the bad sample $x’ < 0$ in step 2.1. So we were just lucky with the $Ga(2,1)$ example, and need a more robust sampling strategy for the general case.

### Rejection Sampling Changes The Proposal Distribution

By discarding samples outside the target support in step 2.1, we are *rejection sampling the Gaussian proposal *$N(\cdot|x_i,\sigma^2)$ for all values $x’ < 0$. This strategy is equivalent to sampling from a truncated Gaussian, which is **not symmetric** in its mean and argument. Therefore, we cannot cancel terms in the acceptance ratio (step 2.2) anymore.

**A Correct Sampler Needs A Correct Acceptance Ratio**

Generalizing the results from Wilkinson, suppose that the target $p(x)$ has $\text{supp}(p) = (a,b)$ and the Gaussian has variance $\sigma^2$ (typically, this variance is adapted during runtime to achieve a desired acceptance rate). We will still *reject* samples $x’ \notin (a,b)$ in 2.1, and keep sampling until $x’ \in (a,b)$. We need to account for all the previous rejections.

The proposal density is now

$$N_{(a,b)}(x’ | x_i, \sigma^2) = \frac{ \frac{1}{\sigma} \phi (\frac{x’-x_i}{\sigma}) } {\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})} \mathbf{I}_{(a,b)}(x’),$$

where $\phi$ is the standard Gaussian density, and $\Phi$ is the standard Gaussian cumulative distribution function. While the numerator is symmetric in $(x’,x_i)$, the denominator is not. It follows that with our new strategy, the correct acceptance ratio for step 2.2

$$A(x’,x_i) = \frac{p(x’) ( \Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) ) }{ p(x_{i}) ( \Phi(\frac{b-x’}{\sigma}) – \Phi(\frac{a-x’}{\sigma})) } $$

A good step is to check we agree with Wilkinson in the special case that $a=0,b=\infty,\sigma^2=1$. In this case, the term $\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) = 1-\Phi(-x_i) = \Phi(x_i)$, and similarly for the term in the numerator. This result cross-checks with Wilkinson.

Also remember to change the **initialize** step 1 to choose $x_0$ from the range of interest.

**The Rejection Rate and Numerical Stability**

The good news for this sampler is that the mean of the truncated normal $N_{(a,b)}(\cdot|x_i,\sigma^2)$ is guaranteed to remain in the truncation interval, which means the rejection rate is not going to be problematic (unless for some reason $\sigma^2$ is unusually large, which should be designed in tandem with $\text{supp}(p)$ to avoid this issue).

The rejection rate is a problem when the mean $x_i$ of the underlying, non-truncated normal is not in the interval $(a,b)$, and more elaborate algorithms exist (or use a different proposal). Another critical issue in this case is that $\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})$ becomes numerically unstable, tends to zero. We would be able to get away with $x/0$ or $0/x$ in $A$, but we might also have the numerically disastrous $0/0$. Thankfully, we do not have these problems.

**Conclusion**

Samplers are delicate probabilistic objects, treat them with care.