Online Learning
Introduction:
The concepts here are straightforward:
- You gots a classifier.
- The classifier classifies.
- The classifier waits from feedback from you.
- The classifier updates its parameters.
For this project I am working on, there isn't any decent sized dataset available. The result is that the classifier will just have to learn from the mistakes it makes (a fairly standard HCI concept).
Perceptron is a very simple online learning algorithm. Let us deal with the 2D perceptron situation. You ultimately want a line that will separate the 2 classes. In every feedback step, this line's orientation is changed to accomodate new feedback. After a certain number of training samples, the line's position is fixed and it no longer moves.
Of course, the perceptron is a little weak for our algorithm. After a lot of deliberation on IRC for a simple, easy to code up algorithm, I was recommended Coby Krammer and Shalev's Passive-Aggressive algorithms (which I decided to use)[1]. Let us discuss this paper.
Binary Classifier:
This guy's task is fairly simple. Pick 1 of 2 classes, wait for feedback and fix the decision boundary if needed.
The decision boundary is a hyperplane (basically think of this as some vector dotted with a generic vector in this space). This hyperplane needs to be tweaked everytime the decision boundary fails.
At this stage we need a metric to judge the current state of our hyperplane. Consider the following property:
- Consider all the points in the neighborhood of the hyperplane.
- Take the points in this cluster and pick the guy that is closest to the hyperplane (there is a simple formula to compute the distance)
- The distance between the closest point to the hyperplane and the hyperplane itself is called the margin.
- The margin is used to "judge" the hyperplane.
Let us explore what "judge" means. Now, if you are tightly constrained on either end by a point, you don't really have much of a wiggle room - your dataset might not be easily classifiable. Your hyperplane could end up misclassifying very easily because it doesn't have a lot of wiggle room to make sweeping adjustments on every mis-classification.
Now, what does an "update" or an "adjustment" look like. In the perceptron it was a straightforward fix where we computed a vector sum and rebuilt the hyperplane.
In the passive-aggressive setting, the main goals that an update aims to achieve is:
- After the update, we want to be as close as possible to our previous hyperplane
- At least unit margin must be achieved. (this leads to some solid problems with noisy data)
Now that we have a good hook of the process, let us get a bit more formal with the terms.
Classification Technique: The classifier must throw any point into 1 of 2 classes (duh!). Let us see how we perform a "classification" or a "prediction". Let us consider W to represent the normal to a homogenous hyperplane (our decision boundary).
Obviously, the equation of the hyperplane is given by <w, x> (this is my notation for a dot product).
We can use || <w, x> || as a good indicator of the confidence we have in our prediction. Simple vector algebra shows us that this value is 0 when x is on the plane. It is a large value if x is very far away from the plane. If we encounter a point on the decision boundary plane, then it obviously is a shitty prediction that we have with us.
Next, we need to label the obtained point as one of 2 classes. Let us go with +1 and -1. So, let us look at <w, x> again. This will end up returning different sign values for points that lie on different sides of the hyperplane.
So, our update will consist of tweaks to w. The w maintained during a round is called w_t (for round #t).
Now say after an update, we end up with a margin of < 1. We define this as an instantaneous loss. This loss function is 0 when we get a margin that meets our requirements. If we don't, it encodes how short we fell of meeting the margin value of 1.
The main equations are in the paper so I won't write them out here but the goal of the update procedure is :
1. In the entire vector space, pick that weight vector that is closest to the existing weight vector
2. This weight vector should produce a hinge loss of 0.
The above two conditions help us preserve previous info and still obtain a solid margin. It beats the perceptron since the decision boundary is not just any hyperplane but one that won't be prone to misclassifications.
In a later post I will describe how to tweak the algorithm to work with noise and we shall play with some Kinect data.
The C# source for the PA algorithm can be downloaded @ https://github.com/shriphani/ASLFingerSpell/tree/master/PassiveAggressive/ML
References:
