Wednesday 03/03/2010, 03/10/2010

March 8, 2010

We began discussion of Gaussian Processes for regression.

I am going to try uploading the presenter’s notes directly. You can find the notes for this week here and here.

Next week (after spring break), we will move on to Gaussian Processes for classification. We will cover the end of chapter 2 in some detail in a couple of weeks.

We need a volunteer to cover MCMC for GPs. This paper on slice sampling was recommended.

Wednesday 02/10/2010, 02/17/2010, 02/24/2010

February 16, 2010

Priyank discussed some basic measure theory in preparation for our discussion on Gaussian processes. Next week, we will discuss how stochastic processes such as the Gaussian process arise from basic concepts.

There will be no weekly scribe notes during the series on basic theory. Instead, I will post a comprehensive set of scribe notes (with references) after we are done.

EDIT: We spent three meetings discussing general stochastic processes. I will post notes once they are available.

Wednesday 02/03/2010

February 9, 2010
Welcome back. This semester, the meetings will be held on Wednesday mornings at 11am in ACES 3.116. We will be discussing Gaussian processes, using the book “Gaussian Processes for Machine Learning” by Carl Edward Rasmussen and Chris Williams.
Goo led the first meeting. We covered a basic overview of Gaussian processes for regression. For details, see the NIPS 2006 tutorial “Advances in Gaussian processes” by Rasmussen.
The tutorial begins by discussing the prediction problem. Suppose one has some historical data, such as the carbon dioxide concentration in the air over time, and we would like to predict the concentration at some future time. One simple solution is to find a linear fit. However, the yearly cycles in the CO2 concentration might suggest using a sinusoidal term. There might also be other types of correlations observed in the data that we would like to model. How does one select the best model to use? How does one select the best parameters? Rasmussen states that "Gaussian processes solve some of the above, and provide a practical framework to address the remaining issues".
Definition (Rasmussen): A Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
Gaussian processes provide a principled framework for generalizing the multivariate Gaussian distribution to an infinite number of variables. Because of this property, a Gaussian process model is sometimes described as a method for defining a prior over functions. Given an index (or input) x, the Gaussian process is completely defined by some mean function \mu(x) and a covariance function k(\cdot, \cdot).  In most applications, we will assume \mu(x) = 0 and concentrate on the effect of the covariance. k(\cdot, \cdot) is also known as the kernel function. We will discuss the properties of this function in more detail over the coming weeks.
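As a small illustration of this "prior over functions" view, the following sketch draws a few functions from a zero-mean GP prior; the squared-exponential kernel and the input grid are illustrative assumptions, not prescribed by the book:

```python
import numpy as np

# Draw three functions from a zero-mean GP prior with a squared-exponential
# kernel, evaluated on a finite grid (a small jitter keeps the covariance PSD).
x = np.linspace(0, 5, 100)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(len(x))
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)   # (3, 100): three sampled functions on the grid
```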
Gaussian process regression exploits the marginalization properties of the multivariate Gaussian distribution, i.e. conditionals and marginals of a joint Gaussian are also Gaussian. Suppose the learner is given data \{ (x_i, y_i) \}_{i = 1}^N and would like to learn a regression function modeled as y_i = f(x_i) + \epsilon, where \epsilon is the noise term. One good model for this function is a Gaussian process with an appropriate kernel function. The means and variances of the prediction are easy to write out (though it is a good exercise to work out the equations yourself). The predictive distribution for a new data point y^* at input x^* is given by:
y^* \mid x^*, \mathbf{x}, \mathbf{y} \sim \mathcal{N} \left( \mathbf{k}_n^T \mathbf{K}^{-1} \mathbf{y},\; k_{n+1} - \mathbf{k}_n^T \mathbf{K}^{-1} \mathbf{k}_n \right),
where \mathbf{k}_n is the length-N vector of kernel covariances between the training data and the new point x^*, k_{n+1} is the prior variance at x^*, and \mathbf{K} is the N \times N kernel matrix capturing the covariance between the training data points.
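A minimal sketch of these equations, assuming a squared-exponential kernel with fixed hyperparameters and a small observation-noise term added to the diagonal of \mathbf{K} (both assumptions are for illustration only):

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_new, noise=1e-2):
    """Posterior mean and variance of f(x_new) given noisy observations."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))  # N x N
    k_n = rbf_kernel(x_train, x_new)                                 # N x 1
    k_star = rbf_kernel(x_new, x_new)                                # 1 x 1
    mean = k_n.T @ np.linalg.solve(K, y_train)
    var = k_star - k_n.T @ np.linalg.solve(K, k_n)
    return mean.item(), var.item()

# Toy usage: noisy samples from a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)
print(gp_predict(x, y, np.array([1.5])))
```

In practice one would factor \mathbf{K} once (e.g. with a Cholesky decomposition) rather than calling a solver twice; the version above simply mirrors the equations.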
Next week, Priyank will discuss the idea of a stochastic process in more detail, and complete the discussion on Gaussian process regression.

Friday 12/4/2009

December 9, 2009

This week, we discussed Simulated Annealing and methods for estimating free energy. Meghana and Sanmi led the meeting. This was the last meeting of the fall 2009 semester.

For complicated problems, many optimization algorithms can only guarantee locally optimal solutions. In particular, learning and inference from a complicated statistical model using methods such as MCMC can only guarantee a solution which is a local optimum. Simulated annealing is a principled technique that can be used to bias an optimization algorithm towards a global optimum.

The idea behind simulated annealing is to slowly reduce the "temperature" of the cost function. Recall that a probability distribution that is nowhere zero can be written in canonical form as P(s) \propto e^{-E(s)/T} = (e^{-E(s)})^\frac{1}{T}, where E(s) is the energy function and T is the temperature. Therefore, given a distribution in canonical form, one can control the temperature simply by raising the probability to the power \frac{1}{T} and re-normalizing appropriately. This can be applied directly to the techniques we have studied so far (such as Gibbs sampling and the Metropolis algorithm). At high temperatures, the distribution is almost uniform, and as the temperature is lowered to 1, the distribution converges to P(s). In this way, one can hope to escape some local minima in the MCMC optimization. In Bayesian inference, the annealing can be applied to only the prior term, essentially enforcing a stronger prior as the optimization progresses.

Much of the work now involves picking an appropriate annealing schedule, i.e. deciding how to lower the temperature. In many cases, the schedules are picked before the simulation begins. One example of a fixed schedule is T_t = T_1/\log(t). One can show that for an appropriate value of T_1, the global optimum will be found. However, the proof is analogous to an exhaustive search and might not be useful in practice. There are several other examples of fixed schedules with similar properties.

When using the Metropolis algorithm, the annealing schedule can be matched to the state transitions. The idea is to reduce the temperature in proportion to the width of the proposal distribution. Recall that in the Metropolis algorithm, proposals are rejected when they lead to a very large change in the energy. Assuming a Gaussian distribution, one can design the proposal distribution to have a fixed acceptance probability. For instance, if E(x) = \frac{x^2}{2}, the standard deviation is T^\frac{1}{2}. Using this proposal, moves with a change in energy larger than a magnitude proportional to T^\frac{1}{2} are rejected. In this way, the change in temperature also changes the proposal distribution.
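A rough sketch combining these pieces: Metropolis updates on the tempered distribution, a logarithmic schedule, and a proposal width scaling with T^\frac{1}{2}. The double-well energy function is an illustrative assumption:

```python
import numpy as np

def energy(x):
    """A double-well energy whose deeper minimum is near x = -2 (illustrative)."""
    return 0.1 * (x - 2) ** 2 * (x + 2) ** 2 + 0.5 * x

def simulated_annealing(steps=5000, T1=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal()                              # arbitrary starting state
    for t in range(1, steps + 1):
        T = T1 / np.log(t + 1)                    # fixed logarithmic schedule
        prop = x + np.sqrt(T) * rng.normal()      # proposal width shrinks with T
        dE = energy(prop) - energy(x)
        # Metropolis acceptance on the tempered distribution exp(-E(x)/T).
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            x = prop
    return x

print(simulated_annealing())   # typically ends near the deeper well around x = -2
```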

In most of the methods we have studied so far, we have side-stepped the problem of estimating the normalization constant Z = \int \exp(-E(s)/T) \, ds, or equivalently the free energy F = -T\log(Z). The free energy is useful for comparing different models (Bayesian model comparison) and for computing the probability of rare events. Computing the normalization constant Z_1 of a system of interest directly is usually difficult. Instead, we can compute the difference from the free energy of a known system with normalization Z_0, i.e. the quantity \log(Z_1) - \log(Z_0), or equivalently the ratio Z_1/Z_0.

Given a known initial system with normalization Z_0, the ratio can be estimated using importance sampling and related techniques, effectively computing E_g[f(X)/g(X)], where E_g[\cdot] is the expectation with respect to the distribution g(X). Here g(X) is the (unnormalized) known distribution and f(X) is the (unnormalized) distribution whose normalizing constant we would like to compute. This method works well if the two distributions have a lot of overlap, i.e. the two distributions share many high-probability regions.

When this is not the case, one can still estimate the normalization using a chain of approximations. The idea is to compute ratios of normalization constants for a chain of similar distributions such that the product is the quantity we seek. This can be computed as
\frac{Z_{1}}{Z_{0}} = \frac{Z_{1}}{Z_{\alpha_{n-1}}} \frac{Z_{\alpha_{n-1}}}{Z_{\alpha_{n-2}}} \cdots \frac{Z_{\alpha_{1}}}{Z_{0}},
where, by design, each distribution in the chain is very similar to its neighbors. For instance, one can use a convex combination of the two energies: E_\alpha(s) = (1-\alpha)E_0(s) + \alpha E_1(s). Each adjacent pair of models has a large overlap, so the ratios can be estimated independently and multiplied.
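A toy sketch of the single-ratio estimate, using two Gaussian systems so the true answer is known (the specific energy functions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Known system: a Gaussian with variance 4, E_0(x) = x^2/8, so Z_0 = sqrt(8*pi).
# Target system: a standard Gaussian, E_1(x) = x^2/2, with true Z_1 = sqrt(2*pi).
E0 = lambda x: x ** 2 / 8.0
E1 = lambda x: 0.5 * x ** 2

x = 2.0 * rng.standard_normal(100_000)          # samples from the known system
ratio = np.mean(np.exp(-(E1(x) - E0(x))))       # importance-sampling estimate of Z_1 / Z_0
print(ratio * np.sqrt(8 * np.pi), np.sqrt(2 * np.pi))   # estimated vs. exact Z_1
```

When the two systems overlap poorly, the same estimator can be applied to each adjacent pair E_{\alpha_k}, E_{\alpha_{k+1}} in the chain and the resulting ratios multiplied.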

Friday 11/20/2009

November 21, 2009

This week, we discussed dynamical methods for MCMC. We began with a toy example using Gibbs / Metropolis sampling to motivate the dynamical methods. Then we discussed two related approaches: stochastic dynamics and hybrid Monte-Carlo. Sanmi led the discussion.

The toy model is as follows. The goal is to draw samples from a bivariate Gaussian, P(X) = \mathcal{N}(m, S). We assume without loss of generality that m = 0. We let the variances of x_1 and x_2 both equal 1, with a correlation of 0.99. For Gibbs sampling, the conditional distribution P(x_1 | x_2) is Gaussian and can easily be sampled. Similarly, we can use a Gaussian distribution centered on the current state as the proposal distribution for Metropolis sampling. The variance of this Gaussian has to be kept small so that the acceptance probability is reasonable. In both cases, we showed that the more constrained direction is explored well, but sampling in the less constrained direction exhibits almost random-walk behavior. We also showed that this can be avoided for nearly independent variables. This observation illustrates why one might prefer the variable groups used for Gibbs / Metropolis sampling to be approximately independent.
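A minimal sketch of the Gibbs sampler for this toy example; the conditional updates follow from the standard bivariate Gaussian formulas, and the sample size and seed are arbitrary:

```python
import numpy as np

def gibbs_bivariate_gaussian(n_samples=2000, rho=0.99, seed=0):
    """Gibbs sampling for a zero-mean bivariate Gaussian with unit variances."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho ** 2)   # std of x_i given the other coordinate
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rho * x2 + cond_std * rng.standard_normal()  # sample x1 | x2
        x2 = rho * x1 + cond_std * rng.standard_normal()  # sample x2 | x1
        samples[t] = (x1, x2)
    return samples

s = gibbs_bivariate_gaussian()
print(np.corrcoef(s.T))   # slowly approaches the true correlation of 0.99
```

With a correlation of 0.99 the conditional standard deviation is roughly 0.14, so each update moves only a short distance along the long axis of the distribution, which is exactly the random-walk behavior described above.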

It is clear that we need a method that allows large jumps in state so we can explore the high-probability regions quickly (faster than with a random walk). Dynamical systems can help to overcome this random-walk behavior. We began this section with a review of dynamical systems. The canonical (Gibbs) distribution can be defined as P(s) = \frac{1}{Z} \exp(-E(s)/T), where s is the micro-state, E(s) is the energy of the state, T is the temperature, and Z is the normalization constant. Any parametric distribution over variables that is nowhere zero can also be written in this form. This is the form we used for Gibbs/Metropolis sampling. The idea behind dynamical methods is to add dynamics that allow large changes in state without changing the energy of the system (equivalently, without changing the distribution). We studied stochastic dynamics and hybrid Monte-Carlo, sampling methods that achieve this.

We defined a phase space made up of position variables and momentum variables. The position variables are analogous to the parameters of a model. The momentum variables have no clear interpretation, but they allow the large sampling jumps we desire. We use a very simple I.I.D. Gaussian distribution for the momentum variables, and the distribution of the position (parameter) variables remains the same. In this simple model, the position and momentum variables are independent. The combined energy is called the "Hamiltonian". We then defined Hamiltonian dynamics, which govern how the state changes. These dynamics have special properties: they do not change the energy (distribution) of the system, and they can be reversed in time.

In order to actually implement this system, we need to discretize the dynamics. We showed how the "leapfrog" discretization with a step-size of eps can be used. We showed that this discretization introduces a systematic error of order eps^2. This means that the energy of the system no longer remains exactly constant, though the change is relatively small. To create samples, we used stochastic dynamics:

  • Stochastic step: Sample the momentum variables. This can be done using Gibbs sampling as the momentum variables are I.I.D. Gaussians.
  • Dynamics step: Evolve the system dynamics using leapfrog iterations.

For sampling the model parameters, one simply ignores the momentum variables, since the positions and momenta are independent. In other words, the dynamics of the position and momentum variables are coupled, but their probabilities are independent.

How do we deal with the systematic error in the system energy (distribution)? Hybrid Monte-Carlo modifies the dynamics step to remove this bias. The Stochastic step is identical, but the dynamics step is modified as follows:

  • Randomly decide whether to run the dynamics forwards or backwards in time.
  • After the leapfrog iterations, decide whether to accept or reject the new state with some acceptance probability.

The acceptance probability is very similar to the Metropolis acceptance probability. It measures how much the distribution (energy) changes, in order to discourage large changes. We showed that this algorithm satisfies detailed balance, so the desired distribution is invariant under the resulting Markov chain.
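A compact sketch of the full procedure, combining the stochastic step, the leapfrog dynamics, and the accept/reject correction. The correlated Gaussian target, step size, and trajectory length are illustrative assumptions:

```python
import numpy as np

def hmc_sample(grad_E, E, x0, n_samples=1000, eps=0.1, n_leapfrog=20, seed=0):
    """Hybrid Monte-Carlo: Gaussian momenta, leapfrog dynamics, accept/reject."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)          # stochastic step: resample momenta
        x_new, p_new = x.copy(), p.copy()
        p_new -= 0.5 * eps * grad_E(x_new)        # leapfrog: half step in momentum
        for _ in range(n_leapfrog):
            x_new += eps * p_new                  # full step in position
            p_new -= eps * grad_E(x_new)          # full step in momentum
        p_new += 0.5 * eps * grad_E(x_new)        # undo the extra half step
        # Accept with probability min(1, exp(-change in Hamiltonian)).
        dH = (E(x_new) + 0.5 * p_new @ p_new) - (E(x) + 0.5 * p @ p)
        if rng.random() < np.exp(-dH):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: the correlated 2-D Gaussian toy, E(x) = 0.5 * x^T S^{-1} x.
S_inv = np.linalg.inv(np.array([[1.0, 0.99], [0.99, 1.0]]))
E = lambda x: 0.5 * x @ S_inv @ x
grad_E = lambda x: S_inv @ x
print(hmc_sample(grad_E, E, [0.0, 0.0]).mean(axis=0))   # roughly (0, 0)
```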

Next week, we will discuss simulated annealing and methods for estimating the free energy.

Friday 11/13/2009

November 15, 2009

We continued the discussion on Markov Chain Monte Carlo (MCMC) methods with an extensive review and some discussion of the Metropolis algorithm. The review section was intended to be short (about 5 mins), but we ended up taking about half the lecture due to a lot of interesting discussion and clarifications (Priyank, thanks for all the good discussion). Sanmi led the meeting.

Inference in many ML problems involves calculating the expectation of some function, E[f]. For instance, predictive distributions are calculated by integrating out unknown variables from a joint distribution; this can be written as an expectation. Such integrals can easily be approximated with samples X, provided the samples are drawn from the true distribution p(X). MCMC methods attempt to achieve this. What does MCMC mean? "Monte Carlo" refers to general methods for approximating integrals using samples. "Markov chain" refers to the method of drawing the samples using Markov chains; one can sample from very complicated distributions and show that the Markov chain's distribution converges to the true sampling distribution. This limiting distribution is called the invariant distribution.

Sampling using Markov chains only requires that the chain eventually converges to the invariant distribution. However, for the chain to be practically useful, the following properties are important:

  • How much computation is required per transition?
  • How many steps does the chain require to converge to the invariant distribution?
  • Once the MC has converged, how many transitions are required to make sure subsequent draws are approximately independent?

These questions can be answered in a principled way (using probabilistic bounds). However, these bounds can be quite difficult to compute. In practice, many practitioners use simple rules of thumb to decide whether these properties hold. The workhorses of MCMC are the transitions T(x). These global transitions are often built as sums or products of local transitions B_k. The methods we will study focus on how to pick good B_k to make sure the chain converges and adequately explores the space. In Gibbs sampling, the local transitions are the conditional distributions, i.e. B_k = p(x_k | X\k), where X\k refers to the set of variables with x_k removed. In other words:

  1. Fix all the variables to known values except variable x_k.
  2. Compute the conditional distribution of x_k given that all other variables are known.
  3. Sample x_k’ from this distribution.

Clearly, to use Gibbs sampling, one must be able to compute the required conditionals. This is usually tractable when the distribution is over a small set of variables or the variables have a relatively simple analytical structure. This is the case in hierarchical Bayesian models when one uses conjugate priors. What do you do when you can't compute these conditionals? One method that solves this problem is the Metropolis sampling algorithm. The high-level idea is to sample using a simple proposal distribution, and compensate for the difference using an acceptance probability. This is similar to the idea behind rejection sampling. Samples are generated as follows:

  1. Generate a sample from some proposal distribution S_k
  2. Accept the sample with some probability A_k

A_k often measures how much the joint distribution changes after sampling the new variables. The idea is to discourage moves that drastically change the distribution. One can compare this to Gibbs sampling, where S_k = P(x_k | X\k) and the acceptance probability is 1.
For both Gibbs sampling and the Metropolis algorithm, one can choose x_k to be a group of variables instead of a single variable. This means that the algorithms can be used to update subsets of the variables at a time. This can improve convergence speed and reduce computational overhead, especially when the subsets are approximately independent.
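A minimal sketch of the single-variable Metropolis update with a Gaussian proposal; the target density and step size are illustrative assumptions:

```python
import numpy as np

def metropolis(log_p, x0, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis: Gaussian proposal S_k, accept with min(1, p'/p)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_samples)
    for t in range(n_samples):
        prop = x + step * rng.standard_normal()            # proposal around the current state
        if np.log(rng.random()) < log_p(prop) - log_p(x):  # acceptance probability A_k
            x = prop
        samples[t] = x
    return samples

# Illustrative target: an unnormalized standard Gaussian log-density.
s = metropolis(lambda x: -0.5 * x ** 2, x0=0.0)
print(s.mean(), s.std())   # roughly 0 and 1
```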

Next week, we will discuss algorithms that borrow heavily from statistical physics; where one uses dynamical models to generate samples.


Friday 11/06/2009

November 15, 2009

Meghana continued the introduction to MCMC. She covered some Markov chain theory and Gibbs sampling.

I was out of town, so no detailed notes were taken. Please let me know if you need more information on these topics.

Friday 10/30/2009

October 30, 2009

This week, we began the discussion of MCMC using Probabilistic Inference Using Markov Chain Monte Carlo Methods by Neal. Sanmi led the discussion.

We began by motivating probabilistic models of data. We discussed some common types of models, including belief networks and latent variable models. We discussed how realistic data is often modeled by complicated models. However, these models might be difficult to optimize. For this reason, assumptions such as treating the dataset as independent and identically distributed (IID) can help reduce the complexity of the optimization. Next, we discussed how latent variable models can induce correlations in the data model, even under IID assumptions, by averaging over unknown variables.

We discussed the history of MCMC methods in data analysis, starting with the work of Metropolis ('57) and continuing to Alder and Wainwright ('59), and discussed how the MCMC model developed from statistical physics models. Statistical physics often attempts to model the distribution of unknown micro-states, such as position and velocity, using macroscopic properties such as temperature and volume. We defined the energy function and showed how it can be used to model simple distributions over the states. If the system is closed, the energy is constant, E(s) = E_0. The micro-canonical model assumes that the distribution is uniform over states with E(s) = E_0 and zero elsewhere. The "canonical" or "Gibbs" model assumes a more flexible system where the state is distributed as an exponential function of the energy. This model is useful for studying large-scale system properties such as phase transitions, e.g. from solid to liquid.
We then discussed a specific model, the 2-D Ising model for ferromagnetism, and saw how it can be used to describe distributions over the pixels of images. Statistical physics models are closely related to statistical models in data analysis through a simple identification between the energy function and the probability distribution.

We showed how many quantities of interest, such as conditional or predictive distributions, can be expressed as an expectation of some function with variables drawn from some complicated distribution. This motivated methods for approximating such expectations. We began with analytical methods using closed-form integration. We also discussed how numerical integration can be used. However, many distributions of interest in high-dimensional spaces have only a few peaks and large regions of low probability, and we discussed why numerical integration methods might fail in this scenario. We also discussed the second-order approximation, in which one approximates the distribution using a second-order Taylor series about some mode. The Taylor expansion is identified with some canonical distribution, such as a Gaussian, and one can then compute the integral using this substitute. Unfortunately, this method requires the Hessian, which might be difficult to compute in high dimensions, and it does not handle multi-modal distributions well. We discussed how this approach can be improved to deal with multi-modal data.

Next week, we will discuss other approximation methods such as importance sampling. We will also discuss relevant properties of Markov chains and use these to motivate MCMC methods.

Friday 10/23/2009

October 24, 2009

No meeting this week.

Friday 10/16/2009

October 21, 2009

Alex led the discussion of the ICML09 tutorial on Active Learning by Dasgupta and Langford. We began by discussing where conventional active learning algorithms, in particular uncertainty sampling, can fail. Uncertainty sampling begins with a few random samples. From these samples, a classifier is built. The algorithm then queries examples close to the decision boundary of the initial classifier as the most "uncertain" points.

Consider a simple scenario where the data is distributed on a line. The far right has 45% of the data with positive labels, the far left has 45% with negative labels, but the middle is a sandwich of positive-negative-positive samples. This is an example of non-separable data. An uncertainty-sampling-based algorithm is likely to spend a lot of time querying the sandwich area, where most supervised algorithms would simply incur the minimum classification loss. Even in the infinite-data limit, the uncertainty sampling classifier will have a higher-than-optimal test error.

One simple alternative to uncertainty sampling is to account for the distribution of the data using a simple clustering. The assumption is that data in clustered neighbourhoods tend to have the same label. The general idea is to:

  1. Cluster the data. Hierarchical clustering is suggested so the granularity can be adjusted easily.
  2. Sample a few random points.
  3. Assign each cluster to its majority label.
  4. Build a classifier based on labels
  5. Modify the clustering based on the results and new samples, and repeat.

Next we discussed the differences between aggressive and mellow algorithms. Aggressive algorithms tend to focus on the decision boundary and sample most points based on uncertainty. In contrast, mellow algorithms allow sampling far away from the boundary. This lets them deal with non-separable data, where (in theory) aggressive algorithms will fail. The authors discuss a generic mellow learner. Let eps be the misclassification rate, d the dimension of the hypothesis class, and theta the disagreement coefficient. The authors show that on separable data, a supervised algorithm requires d/eps samples to learn a separator with probability at least 0.9, while a simple mellow algorithm requires theta * d * log(1/eps) labels to achieve the same performance.

Next we discussed the A-squared algorithm. Here the goal is to find an optimal threshold function on [0,1] in a noisy domain. First, the data is randomly sampled over the [0,1] domain, and these samples are used to compute upper and lower bounds on the error rate of each potential hypothesis. These bounds can be used to reduce the size of the hypothesis space and the feature space. The output is the hypothesis with the minimum disagreement with the sampled labels, using the error-rate bounds. Unfortunately, the A-squared algorithm has several disadvantages. These include:

  • It only deals with 0/1 loss
  • To learn, the algorithm needs to enumerate and check hypotheses. This requires exponentially more computation and memory than conventional algorithms.
  • Examples from previous iterations are thrown away
  • Bounds are often too loose.

For this reason, the authors propose IWAL (importance-weighted active learning). Let S = {(data, label, 1/p)} be the set of queried examples with their importance weights. The steps are as follows:

  1. Receive unlabelled samples, x
  2. Compute p = Rejection-threshold(x,S).
  3. Use rejection sampling with probability p to decide whether the label y of x is required. If so, add (x, y, 1/p) to S.
  4. Learn a new hypothesis h = learn(S).

The authors show that if Rejection-threshold(x,S) is always greater than or equal to some p-min, then IWAL is at most a constant factor worse than a supervised algorithm.
Computing the rejection threshold requires a search through all currently valid hypotheses. In practice, this can be very expensive. One possible solution is to fix the set of valid hypotheses a priori to some small set for which this search is cheap. The authors show in an example that this learner can compete with a supervised learner while only requiring 2/3 of the samples.
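A skeleton of the IWAL loop described above. The constant rejection threshold and the simple importance-weighted threshold learner are placeholders for illustration only, not the Rejection-threshold routine the authors analyze:

```python
import numpy as np

def iwal(stream, rejection_threshold, learn, p_min=0.1, seed=0):
    """Skeleton of the IWAL loop: query a label with probability p, store weight 1/p."""
    rng = np.random.default_rng(seed)
    S = []                                      # triples (x, y, 1/p)
    h = None
    for x, y in stream:                         # the label y is only inspected when queried
        p = max(rejection_threshold(x, S, h), p_min)
        if rng.random() < p:                    # rejection sampling with probability p
            S.append((x, y, 1.0 / p))
            h = learn(S)
    return h

# Toy usage on 1-D data separated at 0, with placeholder components.
rng = np.random.default_rng(1)
stream = [(x, int(x > 0)) for x in rng.uniform(-1, 1, 200)]

def learn(S):
    """Pick the threshold that minimizes importance-weighted training error."""
    def weighted_error(t):
        return sum(w * (int(x > t) != y) for x, y, w in S)
    return min((x for x, _, _ in S), key=weighted_error)

h = iwal(stream, rejection_threshold=lambda x, S, h: 0.5, learn=learn)
print(h)   # learned threshold, roughly near 0
```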

Finally, we discussed how one might design an algorithm that smoothly interpolates between mellow and aggressive active learning based on some parameter.