# A thought experiment with the Bayesian posterior predictive distribution

By | January 29, 2017

Let $\pi(\theta)$ be a prior for parameter $\Theta$, and $p(x|\theta)$ a likelihood which generates an exchangeable sequence of random variables $(X_1,X_2,X_3\dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the posterior predictive distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta,$$

where the second equality follows from assuming the data is exchangeable (or i.i.d. conditioned on latent parameter $\theta$). The posterior predictive density evaluated at ${X_{N}=s}$ is an expectation under the posterior distribution $\pi(\theta|D)$. Define the function $g(s,\theta) := p(X_{N}=s|\theta)$, (note that $g$ does not depend on the index $N$ of the random variable since $\theta$ is known), and then compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right].$$

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the probability (density) of the event $\lbrace X_{[N,1]}=s \mid X_{[N,2]}=r \rbrace$ under the posterior predictive. I am in two minds about what this quantity could mean:

Approach 1

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$p_{\text{A}1}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right].$$

Approach 2

Define the desired conditional probability density by application of the Bayes Rule. Namely, separately compute two quantities:

1. joint: $p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta$
2. marginal: $p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta$

and then return their ratio, $$p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}.$$

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[n,2]}=r \rbrace$ to the observation set $D$ so that $D’ := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D’)$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D’)$, $$p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D’)}\left[ g(s,r,\Theta) \right]$$

Exercise: Show why the two expressions for $p_{\text{A}2}$ are equivalent.

Thoughts

The question is thus, does the Bayesian reasoner update their beliefs about $\theta$ based on the condition ${X_{[N,2]}=r}$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$. Instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter) and then compute its expectation.

Approach 2 follows directly from standard laws of conditional probability, but is perhaps less interpretable from the Bayesian paradigm. Why? Because if ${\Theta = \theta}$ is known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ is just a real number — since the Bayesian does not know $\theta$, they marginalize over it. But it is a matter of interpretation as to whether probe function $g(s,r,\theta)$ should influence the distribution $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

Intuitively, we expect that the two approaches converge in some sense in the infinite limit of data $N$, where the marginal effect of a fresh observation (the probe) on the unknown parameter becomes exceedingly small. Such a result could then justify “approximate Bayesian updating” using Approach 1 in terms of its convergence to the “full Bayesian updating” of Approach 2, which is useful for repeatedly querying a learned Bayesian models without having to re-infer the global parameters on a per-query basis.