This week, we continued the discussion on active learning (see last week’s meeting notes). Priyank led the discussion.

Expected model change selects the example that would alter the current model the most if its label were known. Since the model is typically trained by gradient descent, the gradient of the loss function at a candidate example serves as a proxy for that change. Because the true label is unknown, the learner takes an expectation over the possible labels of the resulting gradient norm; this quantity is called the expected gradient length. The learner then queries the example with the largest expected gradient length.
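As a concrete illustration, here is a minimal sketch of expected gradient length for binary logistic regression (the function name and setup are my own, not from the discussion): for each candidate, the gradient of the log-loss is `(p - y) x`, and we average its norm under the model's own label distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_gradient_length(w, X_unlabeled):
    """Return the index of the unlabeled example with the largest
    expected gradient norm under the current logistic model w."""
    scores = []
    for x in X_unlabeled:
        p1 = sigmoid(w @ x)                    # P(y=1 | x) under current model
        # gradient of the logistic log-loss at x is (p - y) * x
        g_if_y1 = np.linalg.norm((p1 - 1.0) * x)   # gradient norm if label were 1
        g_if_y0 = np.linalg.norm((p1 - 0.0) * x)   # gradient norm if label were 0
        scores.append(p1 * g_if_y1 + (1.0 - p1) * g_if_y0)
    return int(np.argmax(scores))
```

With an untrained model (`w = 0`), every example is maximally uncertain, so the score reduces to half the example's norm and the largest-norm example is selected.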

In statistical learning problems using squared loss, the expected generalization error decomposes as error = (intrinsic noise) + (model bias)² + (variance). Assuming the model class is fixed, the bias term is fixed as well, so the learner should seek to minimize the variance. The estimated variance of the learner after a candidate example is added can then be used to select the next example to be labeled. Closed-form estimates of this variance are available for several common models, such as Gaussian mixture models and neural networks. Unfortunately, squared loss is mostly applicable to regression tasks.
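Written out, this is the standard bias–variance decomposition for a predictor $\hat{y}(x;\mathcal{D})$ trained on a dataset $\mathcal{D}$; once the model class is fixed, only the last term depends on which examples are labeled:

```latex
\mathbb{E}_{\mathcal{D},y}\!\left[(\hat{y}(x;\mathcal{D}) - y)^2\right]
  = \underbrace{\mathbb{E}\!\left[(y - \mathbb{E}[y \mid x])^2\right]}_{\text{noise}}
  + \underbrace{\bigl(\mathbb{E}_{\mathcal{D}}[\hat{y}(x;\mathcal{D})] - \mathbb{E}[y \mid x]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\bigl(\hat{y}(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{y}(x;\mathcal{D})]\bigr)^2\right]}_{\text{variance}}
```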

One can also estimate the Fisher information matrix given a probabilistic model. The Fisher information matrix measures how sensitive the log-likelihood is to changes in the parameters, so one can estimate which examples would perturb a converged model the most. The Fisher information ratio between a candidate example and the entire unlabeled pool can then be used to select the next example; this method tends to select examples that are most representative of the unlabeled pool. Unfortunately, both the variance minimization and Fisher information methods can quickly become intractable for models with many parameters and a large unlabeled pool.
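For intuition, the Fisher information matrix has a simple closed form for binary logistic regression: `I(w) = Xᵀ diag(p(1-p)) X`, where `p = sigmoid(Xw)`. This sketch (my own illustration, not a method from the discussion) computes it directly; the selection criterion would then compare per-example information against the pool-wide matrix.

```python
import numpy as np

def fisher_information(X, w):
    """Fisher information matrix for binary logistic regression:
    I(w) = X^T diag(p * (1 - p)) X, with p = sigmoid(X @ w)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    weights = p * (1.0 - p)            # each example's contribution
    return (X * weights[:, None]).T @ X
```

Note that the cost of forming and inverting this d×d matrix is what makes Fisher-based selection expensive when the model has many parameters.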

We also covered density-weighted methods. Here the idea is to weight the uncertainty of each example by the density of the data around it, which leads to selecting “representative” examples rather than isolated outliers. We also discussed miscellaneous topics such as learning with costs. In this case, there is some intrinsic cost associated with querying each example, so the learner must balance obtaining the most informative examples against the cost of labeling them.
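A common instantiation of density weighting multiplies an uncertainty score (here, predictive entropy) by the example's average similarity to the pool. The sketch below is one such variant, assuming cosine similarity as the density proxy; the function name and `beta` trade-off parameter are illustrative.

```python
import numpy as np

def information_density(probs, X_pool, beta=1.0):
    """Score each pool example: predictive entropy (uncertainty)
    times its average cosine similarity to the pool (density), raised to beta."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)        # uncertainty term
    Xn = X_pool / (np.linalg.norm(X_pool, axis=1, keepdims=True) + eps)
    sim = Xn @ Xn.T                                               # pairwise cosine similarity
    density = sim.mean(axis=1)                                    # avg similarity to the pool
    return entropy * density ** beta
```

With `beta = 0` this reduces to plain uncertainty sampling; larger `beta` pushes the learner toward dense regions of the input space.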

There was some discussion about using generative models in active learning. Most generative models assume the examples are drawn independently, for tractability. However, the selection methods described above necessarily bias which examples are sampled, so the independence assumption may no longer hold. Even so, these methods still perform well in practice.

Next week, we will be switching topics, and discussing MCMC.
