## Friday 12/4/2009

This week, we discussed Simulated Annealing and methods for estimating free energy. Meghana and Sanmi led the meeting. This was the last meeting of the fall 2009 semester.

For complicated problems, many optimization algorithms can only guarantee locally optimal solutions. In particular, learning and inference from a complicated statistical model using methods such as MCMC can only guarantee a solution which is a local optimum. Simulated annealing is a principled technique that can be used to bias an optimization algorithm towards a global optimum.

The idea behind simulated annealing is to slowly reduce the “temperature” of the cost function. Recall that a probability distribution that is nowhere zero can be written in canonical form as $P(s) \propto e^{-E(s)/T} = (e^{-E(s)})^\frac{1}{T}$, where $E(s)$ is the energy function and $T$ is the temperature. Therefore, given a distribution in canonical form, one can control the temperature simply by raising the probability to the power $\frac{1}{T}$ and re-normalizing appropriately. This can be applied directly to the techniques we have studied so far (such as Gibbs sampling and the Metropolis algorithm). At high temperatures, the distribution is almost uniform, and as the temperature is lowered to 1, the distribution converges to $P(s)$. In this way, one can hope to escape some local minima in the MCMC optimization. In Bayesian inference, the annealing can be applied to the prior term alone, essentially enforcing a progressively stronger prior as the optimization progresses.
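As a small numerical illustration of the tempering idea (a sketch, not from the meeting; the double-well energy is an arbitrary choice), raising the unnormalized density to the power $\frac{1}{T}$ flattens the valley between two modes:

```python
import math

def energy(x):
    """Double-well energy: minima at x = +/-1, barrier of height 4 at x = 0."""
    return 4.0 * (x**2 - 1.0)**2

def unnorm_p(x, T):
    """Unnormalized tempered density exp(-E(x)/T) = (exp(-E(x)))**(1/T)."""
    return math.exp(-energy(x) / T)

# Ratio of the valley (x=0) density to the mode (x=1) density:
# the closer to 1, the flatter (more uniform) the distribution.
for T in (1.0, 4.0, 8.0):
    print(f"T={T}: p(0)/p(1) = {unnorm_p(0.0, T) / unnorm_p(1.0, T):.3f}")
# T=1.0: p(0)/p(1) = 0.018
# T=4.0: p(0)/p(1) = 0.368
# T=8.0: p(0)/p(1) = 0.607
```

At high temperature the barrier is easy for a sampler to cross; at $T = 1$ the tempered density is the original distribution.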

Much of the work now involves picking an appropriate annealing schedule, i.e., deciding how to lower the temperature. In many cases, the schedule is picked before the simulation begins. One example of a fixed schedule is the logarithmic schedule $T_t = T_1/\log(t)$. One can show that for an appropriately large value of $T_1$, the global optimum will be found. However, the proof is analogous to an exhaustive search, so the guarantee might not be useful in practice. There are several other examples of fixed schedules with similar properties.
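A minimal sketch of simulated annealing with the logarithmic schedule, using Metropolis acceptance on an asymmetric double-well energy (the energy function, proposal width, and constants are illustrative choices, not from the notes):

```python
import math
import random

def energy(x):
    """Asymmetric double well: local minimum near x = +1, global near x = -1."""
    return (x**2 - 1.0)**2 + 0.3 * x

def simulated_annealing(x0, steps=20000, T1=2.0, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    for t in range(1, steps + 1):
        T = T1 / math.log(t + 1)       # logarithmic schedule T_t = T_1/log(t)
        y = x + rng.gauss(0.0, 0.5)    # symmetric Gaussian proposal
        dE = energy(y) - energy(x)
        # Metropolis rule: always accept downhill, uphill with prob exp(-dE/T)
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            x = y
        if energy(x) < energy(best):
            best = x
    return best

best = simulated_annealing(x0=1.0)  # start in the shallower (local) well
print(best)  # should land near the global minimum, around x = -1
```

Starting in the local well, the early high-temperature phase lets the chain cross the barrier, and the slow cooling concentrates it in the deeper well.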

When using the Metropolis algorithm, the annealing schedule can be matched to the state transitions. The idea is to reduce the temperature in proportion to the width of the proposal distribution. Recall that in the Metropolis algorithm, proposals are likely to be rejected when they lead to a very large increase in the energy. Assuming a Gaussian distribution, one can then design the proposal distribution to have a fixed acceptance probability. For instance, if $E(x) = \frac{x^2}{2}$, the standard deviation of the distribution at temperature $T$ is $T^\frac{1}{2}$. Using this as the proposal width, proposals with a change in energy larger than a magnitude proportional to $T^\frac{1}{2}$ tend to be rejected. In this way, the change in temperature also changes the proposal distribution.

In most of the methods we have studied so far, we have side-stepped the problem of estimating the normalization constant $Z = \int \exp(-E(s)/T)\,ds$, or equivalently, the free energy $F = -T\log(Z)$. The free energy is useful for comparing different models (Bayesian model comparison) and for computing the probability of rare events. Computing the normalization constant $Z_1$ directly is usually difficult. Instead, we can compute the difference with the free energy of a known system with normalization constant $Z_0$, i.e., the quantity $\log(Z_1) - \log(Z_0)$ or, equivalently, the ratio $Z_1/Z_0$.

Given a known initial system with normalization constant $Z_0$, the ratio can be estimated using importance sampling and related techniques, effectively computing $E_g[f(X)/g(X)]$, where $E_g[\cdot]$ is the expectation with respect to the distribution $g(X)$. Here $g(X)$ is the known distribution and $f(X)$ is the (unnormalized) distribution whose normalizing constant we would like to compute. This method works well if the two distributions have a lot of overlap, i.e., the two distributions share many high-probability regions.
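A minimal sketch of the importance-sampling estimate (my own example): both unnormalized densities are unit-variance Gaussians with nearby means, chosen so that they overlap heavily and the true ratio $Z_1/Z_0$ is exactly 1:

```python
import math
import random

def f_unnorm(x):
    """Target: unnormalized exp(-E1(x)) with E1(x) = (x - 0.5)^2 / 2."""
    return math.exp(-(x - 0.5)**2 / 2.0)

def g_unnorm(x):
    """Reference: unnormalized standard Gaussian; Z0 = sqrt(2*pi) is known."""
    return math.exp(-x**2 / 2.0)

def ratio_estimate(n=100000, seed=0):
    """Estimate Z1/Z0 = E_g[f(X)/g(X)] with X drawn from the normalized g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # exact sample from the normalized reference
        total += f_unnorm(x) / g_unnorm(x)
    return total / n

print(ratio_estimate())  # true ratio is 1: both Gaussians have Z = sqrt(2*pi)
```

If the two means were far apart instead, the weights $f(X)/g(X)$ would be dominated by rare draws and the estimate would have huge variance, which motivates the chain of intermediate distributions below.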

When this is not the case, one can still estimate the normalization using a chain of approximations. The idea is to compute ratios of normalization constants for a chain of similar distributions such that their product is the quantity we seek. This can be computed as
$\frac{Z_{1}}{Z_{0}} = \frac{Z_{1}}{Z_{\alpha_{n-1}}} \frac{Z_{\alpha_{n-1}}}{Z_{\alpha_{n-2}}} \cdots \frac{Z_{\alpha_{1}}}{Z_{0}}$, where, by design, adjacent distributions in the chain are very similar. For instance, one can use a convex combination of the two energies: $E_\alpha(s) = (1-\alpha)E_0(s) + \alpha E_1(s)$. Each pair of adjacent models has a large overlap, and the individual ratios can be estimated independently and multiplied.
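A sketch of the chained estimate (an illustration under convenient assumptions): $E_0$ and $E_1$ are unit-variance Gaussian energies with means 0 and 3, too far apart for direct importance sampling, and each interpolated $E_\alpha$ happens to also be Gaussian, so we can sample the intermediates exactly here; in general each stage would need its own MCMC run. The endpoints share $Z = \sqrt{2\pi}$, so the true ratio is 1.

```python
import math
import random

def E0(x):  # reference energy: standard Gaussian, mean 0
    return x**2 / 2.0

def E1(x):  # target energy: Gaussian shifted to mean 3 (little overlap with E0)
    return (x - 3.0)**2 / 2.0

def E(x, a):  # interpolated energy E_alpha = (1 - alpha) E0 + alpha E1
    return (1.0 - a) * E0(x) + a * E1(x)

def chained_ratio(n_steps=4, n_samples=50000, seed=0):
    rng = random.Random(seed)
    alphas = [k / n_steps for k in range(n_steps + 1)]  # 0, 1/4, ..., 1
    ratio = 1.0
    for a, a_next in zip(alphas, alphas[1:]):
        # For this choice of E0, E1, exp(-E_alpha) is a unit-variance
        # Gaussian with mean 3*alpha, so we can sample it exactly.
        total = 0.0
        for _ in range(n_samples):
            x = rng.gauss(3.0 * a, 1.0)
            # one-step importance weight estimating Z_{a_next} / Z_a
            total += math.exp(E(x, a) - E(x, a_next))
        ratio *= total / n_samples  # multiply the independent ratio estimates
    return ratio

print(chained_ratio())  # true Z1/Z0 is 1: both ends have Z = sqrt(2*pi)
```

Each adjacent pair differs only by a small shift in mean, so every one-step weight is well behaved even though the two endpoint distributions barely overlap.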