Wednesday 02/03/2010

Welcome back. This semester, the meetings will be held on Wednesday mornings at 11am in ACES 3.116. We will be discussing Gaussian processes, using the book “Gaussian Processes for Machine Learning” by Carl Edward Rasmussen and Chris Williams.
Goo led the first meeting. We covered a basic overview of Gaussian processes for regression. For details, see the NIPS 2006 tutorial “Advances in Gaussian processes” by Rasmussen.
The tutorial begins by discussing the prediction problem. Suppose one has some historical data; such as the carbon dioxide concentration in the air over time. We would like to predict the concentration at some future time. One simple solution is to find a linear fit. However, the yearly cycles in the CO2 concentration might suggest using a sinusoidal term. There might also be other types of correlations observed in the data that we would like to model. How does one select the best model to use? How does one select the best parameters? Rasmussen states “Gaussian processes solve some of the above, and provide a practical framework to address the remaining issues”.
Definition (Rasmussen): A Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
Gaussian processes provide a principled framework for generalizing the multivariate Gaussian distribution to an infinite number of variables. Because of this property, a Gaussian process model is sometimes described as a method for defining a prior over functions. Given an index (or input) $x$, the Gaussian process is completely defined by some mean function $\mu(x)$ and a covariance function $k(\cdot, \cdot)$.  In most applications, we will assume $\mu(x) = 0$ and concentrate on the effect of the covariance. $k(\cdot, \cdot)$ is also known as the kernel function. We will discuss the properties of this function in more detail over the coming weeks.
Gaussian process regression exploits the marginalization properties of the multivariate Gaussian distribution i.e. conditionals and marginals of a joint Gaussian are also Gaussian. Suppose the learner is given data $\{ x_i, y_i \}_{i = 1}^N$ and would like to learn a regression function modeled as $y_i = f(x_i) + \epsilon$, where $\epsilon$ is the noise term. One good model for this function is a Gaussian process with an appropriate kernel function. The means and variances of the prediction are easy to write out (though it is a good exercise to work out the equations yourself). The predictive distribution for a new data point $y^*$ is given as:
$p ( y^* | x^*, \mathbf{x}, \mathbf{y}) \sim \mathcal{N} ( \mathbf{k}_n^T \mathbf{K}^{-1} \mathbf{y}, k_{n+1} - \mathbf{k}_n^T \mathbf{K}^{-1} \mathbf{k}_n )$,
where $\mathbf{k}_n$ is the N length kernel covariance between the training data and new point $x^*$, $k_{n+1}$ is the variance of $x*$, and $mathbf{K}$ is the $N \times N$ kernel matrix capturing the covariance between the training data points.
Next week, Priyank will discuss the idea of a stochastic process in more detail, and complete the discussion on Gaussian process regression.