Friday 10/16/2009

Alex led the discussion of the ICML09 tutorial on Active Learning by Dasgupta and Langford. We began by discussing where conventional active learning algorithms, in particular uncertainty sampling, can fail. Uncertainty sampling begins with a few random samples, from which an initial classifier is built. The algorithm then queries the examples closest to that classifier's decision boundary as the most "uncertain" points.

Consider a simple scenario where the data is distributed on a line. The rightmost 45% of the data is positive, the leftmost 45% is negative, and the remaining 10% in the middle is a sandwich of positive-negative-positive samples. This is an example of non-separable data. An uncertainty-sampling-based algorithm is likely to spend a lot of time querying the sandwich area, where most supervised algorithms would simply accept the minimum classification loss. Even in the infinite-data limit, the uncertainty sampling classifier will converge to a test error higher than the optimum.
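To make this concrete, here is a minimal sketch (not from the tutorial; the region boundaries and sample sizes are illustrative assumptions) that simulates the 1-D sandwich and runs uncertainty sampling with a logistic-regression classifier. The queries end up concentrated in the noisy middle region:

    # Minimal sketch: uncertainty sampling on the 1-D "sandwich" example.
    # The region boundaries (0.45, 0.48, 0.52, 0.55) are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=5000)
    # Left 45% negative, right 45% positive, middle 10% is a +/-/+ sandwich.
    y = np.where(x < 0.45, 0,
        np.where(x > 0.55, 1,
        np.where(x < 0.48, 1,
        np.where(x < 0.52, 0, 1))))

    labeled = list(rng.choice(len(x), size=10, replace=False))
    for _ in range(50):
        clf = LogisticRegression().fit(x[labeled].reshape(-1, 1), y[labeled])
        # Query the point whose predicted probability is closest to 0.5.
        margin = np.abs(clf.predict_proba(x.reshape(-1, 1))[:, 1] - 0.5)
        margin[labeled] = np.inf          # never re-query an already labelled point
        labeled.append(int(np.argmin(margin)))

    queries = x[labeled[10:]]
    print("fraction of queries in the middle region:",
          np.mean((queries > 0.45) & (queries < 0.55)))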

One simple alternative to uncertainty sampling is to account for the distribution of the data using a simple clustering. The assumption is that data in clustered neighbourhoods tends to have the same label. The general idea (sketched in code after this list) is to:

  1. Cluster the data. Hierarchical clustering is suggested so the granularity can be adjusted easily.
  2. Sample a few random points.
  3. Assign each cluster to its majority label.
  4. Build a classifier based on these labels.
  5. Modify the clustering based on the results and the new samples; repeat.
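Below is a minimal sketch of steps 1-4 (the clustering method, the per-cluster query budget, and the oracle interface are assumptions made for illustration; the refinement of step 5 is omitted):

    # Minimal sketch of cluster-based label propagation (steps 1-4 above).
    # `oracle(i)` is an assumed callback returning the 0/1 label of point i.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.linear_model import LogisticRegression

    def cluster_then_label(X, oracle, n_clusters=10, queries_per_cluster=3, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: hierarchical clustering; granularity is set by n_clusters.
        clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
        pseudo = np.empty(len(X), dtype=int)
        for c in range(n_clusters):
            members = np.flatnonzero(clusters == c)
            # Step 2: query a few random points inside the cluster.
            picks = rng.choice(members, size=min(queries_per_cluster, len(members)),
                               replace=False)
            labels = [oracle(i) for i in picks]
            # Step 3: give the whole cluster its majority label.
            pseudo[members] = np.bincount(labels).argmax()
        # Step 4: train a classifier on the propagated labels.
        return LogisticRegression().fit(X, pseudo)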

Next we discussed the differences between aggressive and mellow algorithms. Aggressive algorithms tend to focus on the decision boundary and sample most points based on uncertainty. In contrast, mellow algorithms allow sampling far away from the boundary, which lets them handle non-separable data where (in theory) aggressive algorithms will fail. The authors discussed a generic mellow learner. Let eps be the misclassification rate, d the dimension of the hypothesis class, and theta the disagreement coefficient. The authors show that on separable data, a supervised algorithm requires d/eps samples to learn a separator with probability at least 0.9, while a simple mellow algorithm requires theta * d * log(1/eps) labels to achieve the same performance.
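For a rough sense of the gap (with illustrative numbers, not figures from the tutorial): taking eps = 0.01, d = 10 and theta = 1, the supervised bound is d/eps = 1,000 labels, while the mellow bound is theta * d * log(1/eps) ≈ 10 * 4.6 ≈ 46 labels (using the natural log and ignoring constant factors).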

Next we discussed the A-squared (A^2) algorithm. Here the goal is to find an optimal threshold function on [0,1] in a noisy domain. First the data is randomly sampled over the [0,1] domain, and these samples are used to compute upper and lower bounds on the error rate of each candidate hypothesis. These bounds are used to shrink both the hypothesis space and the region of the domain where the surviving hypotheses still disagree (a rough code sketch follows the list below). The output is the hypothesis with the minimum disagreement with the sampled labels, chosen using the error-rate bounds. Unfortunately, the A-squared algorithm has several disadvantages. These include:

  • It only deals with 0/1 loss.
  • To learn, the algorithm needs to enumerate and check hypotheses, which requires exponentially more computation and memory than conventional algorithms.
  • Examples from previous iterations are thrown away.
  • The bounds are often too loose.
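Here is a rough sketch of that bound-based pruning on the threshold problem (the per-round sample size, the Hoeffding-style slack, and the number of rounds are illustrative assumptions, not the algorithm's actual bounds):

    # Rough sketch of A^2-style pruning for a noisy threshold on [0,1].
    import numpy as np

    rng = np.random.default_rng(0)
    true_t, noise = 0.6, 0.1
    def oracle(x):                                # noisy threshold labels
        return int((x > true_t) ^ (rng.random() < noise))

    thresholds = np.linspace(0.0, 1.0, 101)       # candidate hypotheses h_t(x) = [x > t]
    lo, hi = 0.0, 1.0                             # region where hypotheses still disagree
    for _ in range(5):
        # Fresh samples each round (earlier examples are discarded, as noted above).
        xs = rng.uniform(lo, hi, size=200)
        ys = np.array([oracle(v) for v in xs])
        err = np.array([np.mean((xs > t).astype(int) != ys) for t in thresholds])
        slack = np.sqrt(np.log(40.0) / (2 * len(xs)))   # crude Hoeffding-style slack
        keep = err - slack <= (err + slack).min()       # drop clearly worse thresholds
        thresholds = thresholds[keep]
        lo, hi = thresholds.min(), thresholds.max()     # shrink the disagreement region
    # Output: the surviving threshold with minimum empirical disagreement.
    print("output threshold:", thresholds[np.argmin(err[keep])])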

For these reasons, the authors propose IWAL (Importance Weighted Active Learning). Let S be the set of queried examples, stored as triples (x, y, 1/p), where p is the probability with which the label was requested. The steps are as follows (a minimal sketch in code follows the list):

  1. Receive an unlabelled sample x.
  2. Compute p = Rejection-threshold(x, S).
  3. Use rejection sampling with probability p to decide whether the label y of x is required. If so, add (x, y, 1/p) to S.
  4. Learn a new hypothesis h = learn(S).
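A minimal sketch of this loop (the toy data, the particular rejection-threshold rule, and the choice of an SGD-trained linear classifier are assumptions made here for illustration, not the authors' concrete choices):

    # Minimal sketch of the IWAL loop; the threshold rule below is an assumption.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    p_min = 0.1                                   # lower bound on the query probability

    # Toy stream: 2-D points labelled by a noisy linear rule (assumed data).
    X_pool = rng.normal(size=(2000, 2))
    y_pool = (X_pool @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=2000) > 0).astype(int)

    def rejection_threshold(x, clf):
        # Stand-in for Rejection-threshold(x, S): query with high probability when
        # the current hypothesis is uncertain about x, but never with less than p_min.
        if clf is None:
            return 1.0
        margin = abs(clf.decision_function(x.reshape(1, -1))[0])
        return float(np.clip(1.0 / (1.0 + margin), p_min, 1.0))

    S, clf = [], None                             # S holds (x, y, 1/p) triples
    for x, y in zip(X_pool, y_pool):              # y would only be revealed if queried
        p = rejection_threshold(x, clf)
        if rng.random() < p:                      # step 3: rejection sampling with prob. p
            S.append((x, y, 1.0 / p))
            X = np.array([s[0] for s in S])
            Y = np.array([s[1] for s in S])
            W = np.array([s[2] for s in S])
            if len(np.unique(Y)) > 1:             # step 4: learn h from the weighted set S
                clf = SGDClassifier(random_state=0).fit(X, Y, sample_weight=W)

    print("labels queried:", len(S), "out of", len(X_pool))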

The authors show that if Rejection-threshold(x,S) is always greater than or equal to some p_min, then IWAL is at most a constant factor worse than a supervised algorithm.
Computing the rejection threshold requires a search through all currently viable hypotheses, which in practice can be very expensive. One possible solution is to fix the set of valid hypotheses a priori to some small set over which this search is cheap. The authors show in an example that this learner can compete with a supervised learner while requiring only 2/3 of the samples.

Finally, we discussed how one might design an algorithm that smoothly interpolates between mellow and aggressive active learning based on some parameter.
