We continued the discussion on Markov Chain Monte Carlo (MCMC) methods with an extensive review and some discussion of the Metropolis algorithm. The review section was intended to be short (about 5 mins), but we ended up taking about half the lecture due to a lot of interesting discussion and clarifications (Priyank, thanks for all the good discussion). Sanmi led the meeting.

Inference in many ML problems involve calculating the expectation of some function – E[f]. For instance, predictive distributions are calculated by integrating out known variables from a joint distribution. This can be written as an expectation. This integration can easily be approximated with samples – X; if the samples are drawn from the true distribution p(X). MCMC methods attempt to achieve this. What does MCMC mean? The “Monte Carlo” refers to general methods of approximating integrations using samples. The “Markov Chain” refers to the method of drawing samples using Markov chains; one can sample from very complicated distributions and show that the Markov chain distribution will quickly converge to the true sampling distribution. This convergent distribution is called the invariant distribution.

Sampling using Markov chains only requires that the chain eventually converges to the invariant distribution. However, for the chain to be practically useful, the following properties are important:

- How much computation is required per transition?
- How many steps does the chain require to converge to the invariant distribution?
- Once the MC has converged, how many transitions are required to make sure subsequent draws are IID?

These questions can be answered in a principled way (using probabilistic bounds). However, these bounds can be quite difficult to compute. In practice, many practitioners use simple rules of thumb to decide if these properties hold. The workhorses of MCMC are the transitions – T(x). These global transitions are often built as sums or or product of local transitions B_k. The methods we will study will focus on how to pick as good B_k to make sure the chain converges and adequately explores the space. In Gibbs sampling, the local transitions are the marginal distributions i.e. B_k = p(x_k | X\k); where X\k refers to the set of variables with x_k removed. In other words:

- Fix all the variables to known values except variable x_k.
- Compute the partial distribution of x_k given that all other variables are known
- Sample x_k’ from this distribution.

Clearly, to use Gibbs sampling, one must be able to compute the required partials. This is usually tractable when the distribution is over a small set of variables or the variables have a relatively simple analytical structure. This is the case in hierarchical Bayesian models when one uses conjugate priors. What do you do when you can’t compute partials? One method that solves this problem is the Metropolis sampling algorithm. The high level idea is to sample using a simple proposal distribution, and compensate for the difference using an acceptance probability. This is similar to the idea behind rejection sampling. Samples are generated as follows:

- Generate a sample from some proposal distribution S_k
- Accept the sample with some probability A_k

A_k often measures how much the joint distribution changes after sampling the new variables. The idea is to discourage moves that drastically change the distribution. One can compare this to Gibbs sampling; where S_k = P(x_k | X\k) and the acceptance probability is 1.

For both Gibbs sampling and the Metropolis algorithm, once can chose the subsets x_k as a group of variables instead of a single variable. This means that the algorithms can also be used to update subsets of the variables at a time. This can improve convergence speed and reduce computational overhead especially when each of the subsets are relatively independent.

Next week, we will discuss algorithms that borrow heavily from statistical physics; where one uses dynamical models to generate samples.

Tags: MCMC

## Leave a Reply