James Cussens, University of York

From (Kucukelbir et al. 2015):

For machine learning models, calculating the posterior is often difficult; we resort to approximation.

Variational inference (VI) approximates the posterior with a simpler density.

We search over a family of simple densities and find the member closest to the posterior.

This turns approximate inference into optimization.

VI has had a tremendous impact on machine learning; it is typically faster than Markov chain Monte Carlo (MCMC) sampling (as we show here too) and has recently scaled up to massive data.

Which family of approximating densities to choose?

How to solve the resulting optimisation problem?

Automatic Differentiation Variational Inference (ADVI)

Given a (Stan) model,

Automatically determine an appropriate *variational family* (family of candidate approximating densities), and

Automatically work out how to solve the optimisation problem.

It uses Automatic Differentiation ;-)
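To see what automatic differentiation means in miniature, here is a toy forward-mode AD sketch using dual numbers (my illustration only; Stan actually uses reverse-mode AD via its C++ math library):

```python
# Minimal forward-mode AD with dual numbers: carry (value, derivative)
# through arithmetic so the derivative is computed exactly, not numerically.
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x + 1   # f'(x) = 2x + 3

x = Dual(2.0, 1.0)             # seed the derivative of x as 1
y = f(x)
print(y.val, y.der)            # 11.0 7.0
```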

Given some posterior density \(p(\theta | \mathbf{X})\),

and some family of approximating densities \(q(\theta ; \phi)\)

find the \(\phi\) which gives the smallest KL divergence

\[\min_{\phi} \mathrm{KL}(q(\theta ; \phi) \; || \; p(\theta | \mathbf{X}))\]
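As a concrete check on what this objective measures, here is a small sketch (not from the paper) of KL(q || p) for two univariate Gaussians, one of the few cases where a closed form exists:

```python
import numpy as np

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL(q || p) for univariate Gaussians
    q = N(mu_q, sig_q^2), p = N(mu_p, sig_p^2)."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2)
            - 0.5)

# KL is zero iff q and p coincide, and grows as q drifts away from p.
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
print(kl_gauss(1.0, 1.0, 0.0, 1.0))  # 0.5
```

Note the asymmetry: KL(q || p) is not KL(p || q), and VI specifically minimises the former.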

The evidence lower bound (ELBO) is:

\[{\cal L}(\phi) = \mathbb{E}_{q(\theta ; \phi)}[ \log p(\theta,\mathbf{X}) ] - \mathbb{E}_{q(\theta ; \phi)}[ \log q(\theta ; \phi) ]\]

Maximising the ELBO minimises the KL divergence (and so that’s what we do).
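The ELBO is an expectation under \(q\), so it can be estimated by Monte Carlo: draw samples from \(q\) and average \(\log p(\theta, \mathbf{X}) - \log q(\theta ; \phi)\). A minimal numpy sketch for a toy model (Gaussian likelihood, Gaussian prior on the mean; the model and names are my illustration, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.0, size=50)          # toy data

def log_joint(theta):
    """log p(theta, X): N(0, 10^2) prior on theta, N(theta, 1) likelihood."""
    log_prior = -0.5 * theta**2 / 10**2 - np.log(10 * np.sqrt(2 * np.pi))
    log_lik = np.sum(-0.5 * (X - theta)**2 - 0.5 * np.log(2 * np.pi))
    return log_prior + log_lik

def elbo(mu, sigma, n_samples=2000):
    """Monte Carlo estimate of E_q[log p(theta, X)] - E_q[log q(theta)],
    with q = N(mu, sigma^2)."""
    thetas = rng.normal(mu, sigma, size=n_samples)
    log_q = (-0.5 * ((thetas - mu) / sigma)**2
             - np.log(sigma * np.sqrt(2 * np.pi)))
    return np.mean(np.array([log_joint(t) for t in thetas]) - log_q)

# The ELBO is larger when q sits near the posterior mass for theta:
print(elbo(np.mean(X), 0.15), elbo(-3.0, 0.15))
```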

The *support* of a density (e.g. \(p(\theta | \mathbf{X})\)) is where it is non-zero. We often have variables in \(\theta\) which have to be positive (e.g. a variance), so there are no negative numbers in the support.

The Stan approach to VI is transformation-based: step one is to define a one-to-one differentiable function

\[T: \mathrm{supp}(p(\theta)) \rightarrow \mathbb{R}^{K}\] where \(K\) is the dimension of \(\theta\).
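For a positive parameter such as a variance, a natural choice for \(T\) (my illustration of the idea, not the paper's notation) is the log, with inverse exp; the change of variables brings in a log-Jacobian term:

```python
import numpy as np

# T maps supp(p) = (0, inf) onto the whole real line.
def T(theta):
    return np.log(theta)

def T_inv(zeta):
    return np.exp(zeta)

# Density of the transformed variable:
#   p(zeta) = p(T_inv(zeta)) * |d T_inv / d zeta|
# For T_inv = exp, the log-Jacobian term is simply zeta.
def log_jacobian(zeta):
    return zeta

zeta = T(2.5)
print(T_inv(zeta))   # round trip recovers 2.5: T is one-to-one
```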

Cue (Kucukelbir et al. 2015 Fig 3)

Let \(\zeta\) be the transformed variables (\(\zeta\) lives in \(\mathbb{R}^{K}\)).

In the *mean field* approach to VI we choose the family of approximating distributions to be products of independent Gaussians:

\[q(\zeta ; \phi) = {\cal N}(\zeta ; \mu, \sigma^{2}) = \prod_{k=1}^{K} {\cal N}(\zeta_{k} ; \mu_{k}, \sigma_{k}^{2})\] where \(\phi = (\mu_{1}, \dots, \mu_{K}, \sigma^{2}_{1}, \dots, \sigma^{2}_{K})\)

Note that this mean field approach would have been a weird choice had, say, some of the \(\zeta_k\) been allowed to take only positive values, since each Gaussian factor puts mass on all of \(\mathbb{R}\).
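Because it factorises, a mean-field \(q\) is trivial to sample from and to evaluate; a small sketch (K = 3, the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# phi = (mu_1..mu_K, sigma^2_1..sigma^2_K) for K = 3
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])

# Sampling: each zeta_k is drawn independently.
zeta = rng.normal(mu, sigma, size=(1000, 3))

# log q(zeta; phi) is a sum of K independent Gaussian log-densities.
def log_q(z):
    return np.sum(-0.5 * ((z - mu) / sigma)**2
                  - np.log(sigma * np.sqrt(2 * np.pi)), axis=-1)

print(zeta.mean(axis=0))   # close to mu for a sample this large
```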

Let \({\cal L}(\mu,\sigma^{2})\) be the ELBO in the real co-ordinate space (i.e. the transformed space).

“We now seek to maximize the ELBO in real coordinate space \[\mu^{*}, \sigma^{2*} = \arg \max_{\mu, \sigma^{2}} {\cal L}(\mu,\sigma^{2}) \mbox{ such that $\sigma^{2} \succ 0$}\] We can use gradient ascent to reach a local maximum of the ELBO” (Kucukelbir et al. 2015):

This is doable using AD, after one further reparameterisation (Kucukelbir et al. 2015 Fig 3).
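To make the optimisation concrete, here is a toy numpy sketch of the reparameterised gradient-ascent loop (\(\zeta = \mu + \sigma \epsilon\) with \(\epsilon \sim {\cal N}(0,1)\), and \(\omega = \log \sigma\) kept unconstrained) for a one-dimensional conjugate problem. This is my minimal illustration of the idea, not Stan's implementation; by conjugacy the exact posterior is Gaussian, so the mean-field fit should recover it closely:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(3.0, 1.0, size=100)   # toy data, likelihood N(theta, 1)

def grad_log_joint(theta):
    """d/d theta of log p(theta, X), with a N(0, 10^2) prior on theta."""
    return -theta / 100.0 + np.sum(X - theta)

mu, omega = 0.0, 0.0                 # omega = log sigma, so sigma > 0 always
lr = 1e-3
for step in range(2000):
    sigma = np.exp(omega)
    eps = rng.normal(size=20)        # reparameterisation: theta = mu + sigma*eps
    theta = mu + sigma * eps
    g = np.array([grad_log_joint(t) for t in theta])
    grad_mu = g.mean()
    # chain rule through theta = mu + exp(omega)*eps, plus the entropy
    # term, whose omega-gradient d/d omega [log sigma + const] = 1
    grad_omega = (g * eps).mean() * sigma + 1.0
    mu += lr * grad_mu
    omega += lr * grad_omega

# Exact posterior N(post_mu, post_var) by conjugacy, for comparison:
post_var = 1.0 / (1.0 / 100.0 + len(X))
post_mu = post_var * np.sum(X)
print(mu, np.exp(omega), "vs exact", post_mu, np.sqrt(post_var))
```

Stan's ADVI additionally automates the gradient of the log joint itself (via AD) and uses an adaptive step size, neither of which this hand-derived sketch attempts.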

Let’s have a look …

Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David Blei. 2015. “Automatic Variational Inference in Stan.” In *Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 568–76. Curran Associates, Inc. http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan.pdf.