# Metropolis-Hastings with Gaussian drift proposal on bounded support

By | February 13, 2016

Professor Darren Wilkinson has an excellent blog post on a common implementation issue that arises when the standard random-walk Metropolis Hastings algorithm, with a Gaussian proposal, is modified to sample from a target distribution $p(x)$ where $$\text{supp}(p) = \lbrace x : p(x) > 0 \rbrace = (0, \infty).$$ In this blog post, we are going to briefly review the setup, the problem, and then generalize the solution proposed by Wilkinson to the situation where $\text{supp}(p) = (a,b)$ (instead of the non-negative reals). We will also allow the Gaussian proposal to have a general variance $\sigma^2$ (rather than just $\sigma^2 = 1$). The purpose of this blog post is to:

• Illustrate that small programmatic changes in a probabilistic algorithm can violate correctness.
• Hope someone implementing this sampler will find the general derivation informative and useful for their own purposes.

### Sampling from a Gamma using Metropolis Hastings

To reuse the example from Wilkinson, suppose we are interested in sampling from a gamma with shape $\alpha = 2$ and scale $\theta =1$:

$$p(x) = \frac{1}{Z} x\exp(-x), \quad\quad(x > 0)$$

by implementing random-walk MH. The proposal $N(\cdot ; x_i, \sigma^2)$ is a Gaussian with mean at the current sample $x_{i}$, and known variance $\sigma^2$ (aside: the gamma can be sampled much more efficiently). The standard MH algorithm proceeds as:

1. Initialize $x_0$ to some value in $(0, \infty)$.

2. For $i = 0,1,2,\dots$

2.1 Sample from the proposal, $x’ \sim N(\cdot; x_i,\sigma^2)$.

2.2 Compute the acceptance ratio $$A(x’, x_i) = \frac{ p(x’)N(x_i; x’, \sigma^2) }{ p(x_{i})N(x’; x_i, \sigma^2) } = \frac{p(x’)}{p(x_{i})}.$$

2.3 If $U[0,1] \le A(x’, x_{i})$ then $x_{i+1} \leftarrow x’$ (accept) else $x_{i+1} \leftarrow x_{i}$ (reject).

In the sample step 2.1, it is possible that we obtain $x’ < 0$ (for instance, if $x_{i}$ is near 0 or $\sigma^2$ is large). One intuitive approach is to automatically reject samples that are outside $\text{supp}(p)$ and repeat until $x>0$. In the case of $Ga(2,1)$, this rejection would be enforced anyway since the acceptance ratio $A(x’,x_{i})$ will be negative when $x’ < 0$ (if we ignore the fact that $p$ is defined on the positive reals and evaluate $p(x’)$ anyway). However, Wilkinson observes that if we were sampling from $Ga(3,1)$ for instance, then

$$p(x) = \frac{1}{Z} x^2\exp(-x), \quad\quad(x > 0)$$

is positive even when $x < 0$ (again, if we ignored the positivity constraint on $x$). Our rejection step would not have been guaranteed to occur in 2.3 had we decided to keep the bad sample $x’ < 0$ in step 2.1. So we were just lucky with the $Ga(2,1)$ example, and need a more robust sampling strategy for the general case.

### Rejection sampling changes the proposal distribution

By discarding samples outside the target support in step 2.1, we are rejection sampling the Gaussian proposal $N(\cdot; x_i,\sigma^2)$ for all values $x’ < 0$. This strategy is equivalent to sampling from a truncated Gaussian as the proposal itself, which is not symmetric in its mean and argument. Therefore, we cannot cancel terms in the acceptance ratio (step 2.2) anymore.

### A correct sampler needs a correct acceptance ratio

Generalizing the results from Wilkinson, suppose that the target $p(x)$ has $\text{supp}(p) = (a,b)$ and the Gaussian has variance $\sigma^2$ (typically, this variance is adapted during runtime to achieve a desired acceptance rate). We will still  reject samples $x’ \notin (a,b)$ in 2.1, and keep sampling until $x’ \in (a,b)$. We need to account for all the previous rejections. The proposal density is now

$$N_{(a,b)}(x’ ; x_i, \sigma^2) = \frac{ \frac{1}{\sigma} \phi (\frac{x’-x_i}{\sigma}) } {\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})} \mathbf{I}_{(a,b)}(x’),$$

where $\phi$ is the standard Gaussian density, and $\Phi$ is the standard Gaussian cumulative distribution function. While the numerator is symmetric in $(x’,x_i)$, the denominator is not. It follows that with our new strategy, the correct acceptance ratio for step 2.2 is

$$A(x’,x_i) = \frac{p(x’) ( \Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) ) }{ p(x_{i}) ( \Phi(\frac{b-x’}{\sigma}) – \Phi(\frac{a-x’}{\sigma})) }.$$

A good step is to check we agree with Wilkinson in the special case that $a=0,b=\infty,\sigma^2=1$. In this case, the term $$\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) = 1-\Phi(-x_i) = \Phi(x_i),$$ and similarly for the term in the numerator. This result cross-checks with Wilkinson. Also remember to change the initialize step 1 to choose $x_0$ from the range of interest.

### The rejection rate and numerical stability

The good news for this sampler is that the mean of the truncated normal $N_{(a,b)}(\cdot; x_i,\sigma^2)$ is guaranteed to remain in the truncation interval, which means the rejection rate is not going to be problematic (unless for some reason $\sigma^2$ is unusually large, which should be designed in tandem with $\text{supp}(p)$ to avoid this issue). The rejection rate is a problem when the mean $x_i$ of the underlying, non-truncated normal is not in the interval $(a,b)$, and more elaborate algorithms exist (or use a different proposal). Another critical issue in this case is that $\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})$ becomes numerically unstable, tends to zero. We would be able to get away with $x/0$ or $0/x$ in $A$, but we might also have the numerically disastrous $0/0$. Thankfully, we do not have these problems.

### Conclusion

Samplers are delicate probabilistic objects, treat them with care.