Let $p$ denote a probability density with support $x \in \mathcal{X}$, and let $\theta \in \mathbf{T}$ parametrize $p$. It is common in the literature to express this situation using the notation $p(x\mid \theta)$. However, this “conditional” notation using the vertical bar (`\mid` in LaTeX) is often used ambiguously in the machine learning, AI, and statistics literature.

The conditional symbol $p(x \mid \theta)$ should be used only when there exists a joint density $p(x,\theta)$ whose normalized marginals properly define a conditional density; that is,

- $p(x) = \int_{\theta\in \mathbf{T}}p(x,\theta)\mathrm{d}\theta$
- $p(\theta) = \int_{x \in \mathcal{X}}p(x,\theta)\mathrm{d}x$
- $p(x\mid\theta) := p(x,\theta)/p(\theta)$, for $p(\theta) \ne 0$.

If there is no underlying joint density $p(x,\theta)$ in the probabilistic setup, then the symbol $p(x\mid\theta)$ is not well-defined, which leads to ambiguity or confusion.

One place this notational issue arises is in variational inference. One has a generative model $p(x,z) = p(x \mid z)p(z)$ and an intractable posterior $p(z\mid x)$ that one wishes to approximate. The variational (or approximating) density is written as $q(z \mid \phi)$, and the KL divergence $D_{\mathrm{KL}}(q(z\mid \phi) \,\|\, p(z \mid x))$ is minimized over $\phi \in \mathcal{S}$. The reason that $q(z \mid \phi)$ does not make sense, however, is that the variational distribution $q$ has no prior over its parameter $\phi$, so $q(z,\phi)$ is not a well-defined joint density.
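To make the optimization over $\phi$ concrete, here is a minimal sketch (all values hypothetical): pretend the intractable posterior $p(z \mid x)$ is a one-dimensional Gaussian $N(2.0, 0.5^2)$, take a Gaussian variational family with $\phi = (\text{mean}, \text{std})$, and minimize the closed-form KL divergence between two Gaussians over a grid of $\phi$ values. Note that $\phi$ is just a plain function argument throughout; nothing here requires a density over $\phi$.

```python
import numpy as np

# Hypothetical "posterior" p(z | x), fixed for this illustration
post_mean, post_std = 2.0, 0.5

def kl_gaussians(m1, s1, m2, s2):
    # Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

means = np.linspace(-1.0, 4.0, 501)  # grid over the variational mean
stds = np.linspace(0.1, 2.0, 191)    # grid over the variational std
M, S = np.meshgrid(means, stds, indexing="ij")
kl = kl_gaussians(M, S, post_mean, post_std)

i, j = np.unravel_index(kl.argmin(), kl.shape)
best_mean, best_std = means[i], stds[j]
# Within the Gaussian family, the KL-optimal q matches the posterior exactly.
```

Grid search stands in for the gradient-based optimizers used in practice; the notational point is unaffected by how the minimization is carried out.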

Bayesians who use this notation (of which there are many) may argue that the conditioning sign simply means “parametrized by,” regardless of whether there is a joint probability model over all terms on both sides of the conditioning sign. However, this convention is problematic, because it overloads the bar: within a single expression, it can denote both genuine probabilistic conditioning and mere parametrization. Reusing the variational inference example from the previous paragraph, our variational distribution $q$ may additionally be parametrized by the observations $x$, so that one is minimizing $D_{\mathrm{KL}}(q(z\mid \phi, x) \,\|\, p(z \mid x))$ over $\phi \in \mathcal{S}$ with $x$ held fixed at the observed values.

The situation becomes even more complicated in settings such as variational auto-encoders, where $p$ also has some exogenous parameters $\theta$. The true posterior density is notated as $p_\theta(z \mid x)$ and the variational density is $q_\phi(z \mid x)$. There is an underlying type mismatch induced by the notation, since $x$ from the perspective of $p$ (a random variable with a generative prior) is of different “type” than the $x$ from the perspective of $q$ (an exogenous parameter). Using the convention that exogenous parameters appear as subscripts, the type-correct notation would be $p_\theta(z\mid x)$ for the Bayesian posterior density, and $q_{\phi,x}(z)$ for the variational density.
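The type distinction can be made explicit by writing the densities as plain functions. The sketch below uses a conjugate model chosen for illustration (not from the post): $z \sim N(0, 1)$ and $x \mid z \sim N(z, \theta^2)$, whose posterior $p_\theta(z \mid x)$ is $N\!\big(x/(1+\theta^2),\ \theta^2/(1+\theta^2)\big)$. On $p$'s side, $x$ is a realization of a random variable backed by a joint; on $q$'s side, $x$ and $\phi$ are both exogenous arguments:

```python
import math

def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def p_posterior(z, x, theta):
    # p_theta(z | x): a genuine conditional, backed by the joint p_theta(x, z)
    post_var = theta**2 / (1 + theta**2)
    return normal_pdf(z, x / (1 + theta**2), math.sqrt(post_var))

def q_variational(z, x, phi):
    # q_{phi, x}(z): x and phi enter as plain parameters; no joint density
    # over (z, phi) or (z, x) exists on q's side.
    mean_scale, std = phi  # hypothetical parametrization for illustration
    return normal_pdf(z, mean_scale * x, std)
```

Both functions return a density over $z$, but only `p_posterior` earns its bar; `q_variational` mirrors the subscript notation $q_{\phi,x}(z)$ directly in its signature.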

An alternative convention, which appears in frequentist statistics, is to use semicolons for exogenous parameters; this approach yields $p(z \mid x; \theta)$ and $q(z; x, \phi)$ for the VAE. A third convention, which is not widely used, is to remember that densities are nothing more than non-negative functions of their arguments, and to use different symbols for the different densities. More precisely, the joint density would be $p(x,z)$; the marginal density $p(x)$ becomes $m(x)$; and the prior $p(z)$ becomes $\pi(z)$. If there are exogenous parameters $\theta$ that do not represent realizations of random variables, they can simply be added as additional arguments, giving $p(x,z,\theta)$, $m(x,\theta)$, and $\pi(z,\theta)$. This third approach quickly grows unwieldy, however, and scales poorly as the number of random variables grows.

The purpose of notation in mathematical writing is to use symbols as an aid for conveying technical ideas. Imprecise or inconsistent (or asymmetric) notation actively works against this goal by obfuscating the key ideas. Clear notation is not only essential for understanding the piece of writing at hand, but also for making connections between related concepts across many writings (e.g., the relationships between EM, variational EM, variational Bayes, variational auto-encoders, …). Our brains are particularly good at “pattern matching” when we see familiar notation in unfamiliar contexts, which makes consistent, symmetric, and formally defined symbols a highly effective tool for extrapolating ideas to new settings.

The fine print: This post has dealt loosely with conditional densities for real-valued random variables. We have avoided measure-theoretic questions regarding the existence of densities when conditioning on events of probability zero, a topic for a future post.