Friday 9/25/2009

This was the first meeting of the semester. Priyank led the discussion on active learning, based on the survey by Burr Settles available here. There was a healthy discussion of the motivations for active learning and the current lack of theoretical guarantees. For a perspective with more emphasis on learning theory, see the tutorial by Sanjoy Dasgupta and John Langford.

We began by discussing motivating examples such as part-of-speech tagging in language applications and music recommendation for services such as Pandora. Many of these datasets require a human labeler, and the labeling process is time consuming and expensive. In such cases, active learning can be used to select the "most informative" examples. This brought up two fundamental questions: how many data points does a learner need in order to learn a good classifier if the learner can request specific examples? And what is the "correct" way to ask for new examples?

We discussed membership query synthesis, where the learner synthesizes examples to be labeled. Unfortunately, the relationship between feature vectors and realistic examples is rarely easy to capture. Further, in many practical cases, the subset of "realistic" examples is not dense in the feature space, so the learner can synthesize many feature vectors that do not correspond to any reasonable example. We discussed a text labeling scenario where a neural network learner was allowed to synthesize letters to be labeled; many of the synthesized examples did not correspond to any realistic text. We discussed the possibility of artificially constraining the feature space using probabilistic priors or manifold techniques.

It appears that the synthesis method has recently been abandoned in favor of other models. Alternatives to the synthesis approach are stream-based selective sampling, where examples are drawn one at a time from some source and the learner decides whether to request a label or discard the example, and the pool-based approach, where the learner has access to a pool of unlabeled data and labeling decisions can be made in batches.
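The stream-based setting can be sketched in a few lines. This is a minimal illustration, not an implementation from the survey: the names `stream_selective_sampling`, `uncertainty`, and `label_oracle` are all hypothetical, and here the "uncertainty" of an incoming example is simply how close the model's posterior probability is to 0.5.

```python
def stream_selective_sampling(stream, uncertainty, threshold, label_oracle):
    """Stream-based selective sampling sketch: examples arrive one at a
    time; request a label only when the current model is sufficiently
    uncertain, otherwise discard the example."""
    labeled = []
    for x in stream:
        if uncertainty(x) > threshold:
            labeled.append((x, label_oracle(x)))
    return labeled

# Toy usage: each streamed item is the model's posterior P(y=1|x),
# and uncertainty peaks at 0.5 (both names/values are made up).
stream = [0.9, 0.5, 0.1, 0.45]
uncertainty = lambda p: 1.0 - abs(2 * p - 1)   # 1.0 at p=0.5, 0.0 at p in {0,1}
label_oracle = lambda p: int(p >= 0.5)          # stand-in for a human labeler
queried = stream_selective_sampling(stream, uncertainty, 0.8, label_oracle)
# Only the two examples near the decision boundary trigger a label request.
```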

Next we discussed how the learner should evaluate which examples to request labels for. There are several approaches in the literature. One approach is uncertainty sampling: if the learner has a probabilistic model, it can simply request the label for the example it is most uncertain about (in the binary case, the one whose posterior label probability is closest to 0.5). More general variants select the example whose label distribution has the largest entropy, or the example whose best labeling is the least confident.
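These uncertainty measures are easy to state concretely. The sketch below, under the assumption that some trained probabilistic model has already produced per-example class probabilities (the `probs` array is made-up data), shows the least-confident and entropy criteria; both pick the example nearest the 0.5 boundary here.

```python
import numpy as np

def least_confident(probs):
    """Score each example by 1 - (probability of its most likely label);
    higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def entropy(probs):
    """Score each example by the Shannon entropy of its predicted
    label distribution."""
    p = np.clip(probs, 1e-12, None)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def pick_query(probs, strategy=entropy):
    """Index of the single most informative unlabeled example."""
    return int(np.argmax(strategy(probs)))

# Hypothetical posteriors for three unlabeled examples (binary labels).
probs = np.array([
    [0.90, 0.10],   # model is confident
    [0.55, 0.45],   # near the 0.5 boundary -> most uncertain
    [0.70, 0.30],
])
```
For binary labels the two criteria agree, but with more than two classes entropy accounts for the full label distribution while least-confident looks only at the top label.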

Another alternative is the query-by-committee framework. In this scenario, the learner constructs a committee of possible classification models and measures the disagreement between them. One can then request a label for the example with the highest vote entropy or KL divergence. There seems to be no consensus on the best approach; in practice, it appears to be application dependent.
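The vote-entropy criterion can be sketched directly from its definition: treat the committee members' hard predictions as votes, turn the vote counts for each example into a distribution, and score the example by that distribution's entropy. The committee votes below are made-up data for illustration.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: (n_models, n_examples) array of hard label predictions.
    Returns, per example, the entropy of the committee's vote
    distribution; higher means more disagreement."""
    n_models, n_examples = votes.shape
    scores = np.zeros(n_examples)
    for j in range(n_examples):
        counts = np.bincount(votes[:, j], minlength=n_classes)
        p = counts / n_models
        p = p[p > 0]                      # 0 * log(0) treated as 0
        scores[j] = -(p * np.log(p)).sum()
    return scores

# A hypothetical 4-member committee voting on 3 unlabeled examples:
# example 0 is unanimous, example 1 splits 2-2, example 2 splits 3-1.
votes = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
])
query_idx = int(np.argmax(vote_entropy(votes, n_classes=2)))
```
The unanimous example scores zero, and the evenly split example 1 scores highest, so it is the one whose label gets requested.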

Next week, we will continue with other query strategies (starting from Section 3.3 of the survey).

