# Variational Methods

### Variational Inference (pros)

From (Kucukelbir et al. 2015):

• For machine learning models, calculating the posterior is often difficult; we resort to approximation.

• Variational inference (VI) approximates the posterior with a simpler density.

• We search over a family of simple densities and find the member closest to the posterior.

• This turns approximate inference into optimization.

• VI has had a tremendous impact on machine learning; it is typically faster than Markov chain Monte Carlo (MCMC) sampling (as we show here too) and has recently scaled up to massive data.

### Variational Inference (cons)

• Which family of approximating densities to choose?

• How to solve the resulting optimisation problem?

### Automatic Differentiation Variational Inference

• Automatic Differentiation Variational Inference (ADVI)

• Given a (Stan) model,

• Automatically determine an appropriate variational family (family of candidate approximating densities), and

• Automatically work out how to solve the optimisation problem

• It uses Automatic Differentiation ;-)

### VI by minimising KL divergence

• Given some posterior density $$p(\theta | \mathbf{X})$$,

• and some family of approximating densities $$q(\theta ; \phi)$$

• find the $$\phi$$ which gives the smallest KL divergence

$\min_{\phi} \mathrm{KL}(q(\theta ; \phi) \; || \; p(\theta | \mathbf{X}))$
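To make the "search over a family" idea concrete, here is a minimal sketch (my own toy example, not from the paper): the "posterior" is a known univariate Gaussian, the approximating family is also Gaussian, and we grid-search $$\phi = (\mu, \sigma)$$ for the member with the smallest (closed-form) KL divergence.

```python
import numpy as np

# Closed-form KL divergence KL(q || p) between two univariate Gaussians,
# q = N(mu_q, sd_q^2) and p = N(mu_p, sd_p^2).
def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    return (np.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2)
            - 0.5)

# Toy "posterior" p(theta | X) = N(2.0, 1.5^2); search phi = (mu, sd)
# over a grid for the family member closest to p.
mus = np.linspace(-5, 5, 201)
sds = np.linspace(0.1, 5, 200)
grid = [(m, s, kl_gauss(m, s, 2.0, 1.5)) for m in mus for s in sds]
best = min(grid, key=lambda t: t[2])
print(best)  # grid point near (2.0, 1.5), with KL close to 0
```

Here the family contains the posterior, so the minimum KL is (essentially) zero; in real problems the posterior lies outside the family and the minimiser is only the closest approximation.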

### The evidence lower bound

The evidence lower bound (ELBO) is:

${\cal L}(\phi) = \mathbb{E}_{q(\theta ; \phi)}[ \log p(\theta,\mathbf{X}) ] - \mathbb{E}_{q(\theta ; \phi)}[ \log q(\theta ; \phi) ]$

Maximising the ELBO minimises the KL divergence, since $${\cal L}(\phi) = \log p(\mathbf{X}) - \mathrm{KL}(q(\theta ; \phi) \, || \, p(\theta | \mathbf{X}))$$ and the evidence $$\log p(\mathbf{X})$$ does not depend on $$\phi$$ (and so that's what we do).
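A toy illustration of this (a sketch of my own, not Stan's implementation): for the conjugate model $$\theta \sim {\cal N}(0,1)$$, $$x \mid \theta \sim {\cal N}(\theta, 1)$$ with one observation $$x = 1$$, the exact posterior is $${\cal N}(0.5, 0.5)$$, and a Monte Carlo estimate of the ELBO peaks there, where it equals the log evidence $$\log p(x)$$.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_norm(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

x = 1.0  # one observation; model: theta ~ N(0,1), x | theta ~ N(theta, 1)

# Monte Carlo estimate of the ELBO for q = N(mu, sd^2):
# E_q[log p(theta, x)] - E_q[log q(theta)]
def elbo(mu, sd, n=200_000):
    theta = rng.normal(mu, sd, size=n)          # samples from q
    log_joint = log_norm(theta, 0.0, 1.0) + log_norm(x, theta, 1.0)
    log_q = log_norm(theta, mu, sd ** 2)
    return np.mean(log_joint - log_q)

# At the exact posterior N(0.5, 0.5) the ELBO attains the log evidence
# log p(x) = log N(x; 0, 2) ~= -1.5155; elsewhere it is strictly smaller.
print(elbo(0.5, np.sqrt(0.5)))   # ~ -1.5155
print(elbo(0.0, 1.0))            # smaller
```

The gap between the two printed values is exactly the KL divergence from the second $$q$$ to the posterior.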

### A transformation-based approach

• The support of a density (e.g. $$p(\theta | \mathbf{X})$$) is where it is non-zero.

• We often have variables in $$\theta$$ which have to be positive (e.g. a variance), so have no negative numbers in the support.

• The Stan approach to VI is transformation-based: step one is to define a one-to-one differentiable function:

$T: \mathrm{supp}(p(\theta)) \rightarrow \mathbb{R}^{K}$ where $$K$$ is the dimension of $$\theta$$.

• Cue (Kucukelbir et al. 2015 Fig 3)
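As a sketch of step one (my own toy example, not the paper's): take $$\theta > 0$$ with an Exponential(1) density and $$T = \log$$, so $$\zeta = \log \theta$$ ranges over all of $$\mathbb{R}$$, and the transformed log density picks up a log-Jacobian term.

```python
import numpy as np

# Toy density on theta > 0: Exponential(1), so log p(theta) = -theta.
def log_p_theta(theta):
    return -theta

# T = log maps supp(p) = (0, inf) onto R. With T^{-1}(zeta) = exp(zeta),
# the change of variables adds log |d T^{-1} / d zeta| = zeta.
def log_p_zeta(zeta):
    theta = np.exp(zeta)            # T^{-1}
    return log_p_theta(theta) + zeta

# The transformed density still integrates to 1, now over all of R
# (checked here by a simple Riemann sum on [-10, 10]):
zs = np.linspace(-10, 10, 200_001)
dz = zs[1] - zs[0]
total = np.sum(np.exp(log_p_zeta(zs))) * dz
print(total)  # ~ 1.0
```

After this transformation every coordinate of $$\zeta$$ is unconstrained, which is what lets the next step use a Gaussian approximating family.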

### Mean field approximation

• Let $$\zeta$$ be the transformed variables ($$\zeta$$ lives in $$\mathbb{R}^{K}$$).

• In the mean field approach to VI we choose the family of approximating distributions to be products of independent Gaussians:

$q(\zeta ; \phi) = {\cal N}(\zeta ; \mu, \sigma^{2}) = \prod_{k=1}^{K} {\cal N}(\zeta_{k} ; \mu_{k}, \sigma_{k}^{2})$ where $$\phi = (\mu_{1}, \dots, \mu_{K}, \sigma^{2}_{1}, \dots, \sigma^{2}_{K})$$

• Note that this mean field approach would have been a weird choice had, say, some of the $$\zeta_k$$ been only allowed to take positive values.
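A quick numerical check of the factorisation (a sketch with made-up values of $$\phi$$): the mean-field log density, computed as a sum of $$K$$ univariate Gaussian terms, agrees with a generic multivariate Gaussian log density with diagonal covariance $$\mathrm{diag}(\sigma^{2})$$.

```python
import numpy as np

# Mean-field q(zeta; phi): sum of K univariate Gaussian log densities.
def log_q_mean_field(zeta, mu, sigma2):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (zeta - mu) ** 2 / (2 * sigma2))

# Generic multivariate Gaussian log density (log-determinant + quadratic form).
def log_mvn(zeta, mu, cov):
    diff = zeta - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

mu = np.array([0.0, 1.0, -2.0])        # made-up phi for illustration
sigma2 = np.array([1.0, 0.5, 2.0])
zeta = np.array([0.3, 0.9, -1.5])

a = log_q_mean_field(zeta, mu, sigma2)
b = log_mvn(zeta, mu, np.diag(sigma2))
print(np.isclose(a, b))  # True
```

The mean-field form is the special case of a Gaussian with diagonal covariance, which is why it cannot capture posterior correlations between coordinates.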

### Maximising ELBO in real co-ordinate space

• Let $${\cal L}(\mu,\sigma^{2})$$ be the ELBO in the real co-ordinate space (i.e. the transformed space).

“We now seek to maximize the ELBO in real coordinate space $\mu^{*}, \sigma^{2*} = \arg \max_{\mu, \sigma^{2}} {\cal L}(\mu,\sigma^{2}) \mbox{ such that } \sigma^{2} \succ 0$ We can use gradient ascent to reach a local maximum of the ELBO” (Kucukelbir et al. 2015).

• This is doable (after one further reparameterisation (Kucukelbir et al. 2015 Fig 3)) using AD.
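The whole pipeline on a toy conjugate model (a minimal sketch with hand-coded gradients, not Stan's ADVI; the model, step size and iteration count are my assumptions): $$\theta \sim {\cal N}(0,1)$$, $$x \mid \theta \sim {\cal N}(\theta, 1)$$, one observation $$x = 1$$, so the exact posterior is $${\cal N}(0.5, 0.5)$$. Gradient ascent on the ELBO, with the further reparameterisation $$\theta = \mu + \sigma\epsilon$$ (and $$\sigma = e^{\omega}$$ to keep $$\sigma^{2} \succ 0$$), recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 1.0   # model: theta ~ N(0,1), x | theta ~ N(theta, 1); posterior N(0.5, 0.5)

def grad_log_joint(theta):        # d/dtheta [log p(theta) + log p(x|theta)]
    return -theta + (x - theta)

mu, log_sd = 0.0, 0.0             # phi, with sigma parameterised as exp(log_sd)
lr, n = 0.05, 100
for _ in range(2000):
    eps = rng.normal(size=n)
    sd = np.exp(log_sd)
    theta = mu + sd * eps         # reparameterisation: theta as a function of phi
    g = grad_log_joint(theta)
    grad_mu = g.mean()                       # pathwise (reparameterised) gradient
    grad_log_sd = (g * sd * eps).mean() + 1  # + d(Gaussian entropy)/d(log_sd) = 1
    mu += lr * grad_mu
    log_sd += lr * grad_log_sd

print(mu, np.exp(log_sd) ** 2)    # both near 0.5, the exact posterior
```

Here the gradients are written by hand because the model is one line; ADVI's point is that AD computes the same pathwise gradients automatically for any Stan model.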

### So does it actually work?

• Let’s have a look …

Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David Blei. 2015. “Automatic Variational Inference in Stan.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 568–76. Curran Associates, Inc. http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan.pdf.