Support Vector Machines

An SVM, or large margin classifier, is similar to other classification algorithms such as logistic regression: given labeled data, the algorithm finds the optimal separating plane to categorize it. Compared with algorithms such as neural networks and logistic regression, the SVM also gives a clearer way of learning non-linear functions.

Logistic regression (the sigmoid function) says that hΘ(x) should be close to 1 when y = 1 and close to 0 when y = 0, which requires Θ^t.x to be much larger than zero or much smaller than zero respectively. The problem with logistic regression lies in its cost function.

When we plot, for y = 1, the cost -log(1/(1 + e^-z)) against z, and similarly, for y = 0, the cost -log(1 - 1/(1 + e^-z)) against z, we can see that the cost grows as z becomes zero or negative (for y = 1) and as z becomes zero or positive (for y = 0). A single example with a very small (or very large) z therefore contributes a large amount to the cost!
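
As a quick sketch of this (my own illustration, assuming numpy is available), both cost curves can be evaluated directly from the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
cost_y1 = -np.log(sigmoid(z))        # cost when y = 1: large for z << 0
cost_y0 = -np.log(1.0 - sigmoid(z))  # cost when y = 0: large for z >> 0

for zi, c1, c0 in zip(z, cost_y1, cost_y0):
    print(f"z={zi:+.1f}  cost(y=1)={c1:.3f}  cost(y=0)={c0:.3f}")
```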

The SVM cost function is slightly different from the logistic regression cost function: for y = 1 the cost is flattened to 0 once z >= 1, and similarly for y = 0 the cost is flattened to 0 once z <= -1; in between, the cost is a straight line rather than a curve. We can therefore replace the logistic regression cost terms with cost1(z) and cost0(z) respectively.
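
These two pieces can be sketched as follows (my own approximation of the curves, using a unit slope for the linear part):

```python
import numpy as np

def cost1(z):
    """SVM cost term for y = 1: zero once z >= 1, linear below that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost term for y = 0: zero once z <= -1, linear above that."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-2, 2, 9)
print("z     :", z)
print("cost1 :", cost1(z))
print("cost0 :", cost0(z))
```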

If you know the logistic regression objective, the SVM objective is obtained by replacing the h(Θ) cost terms with the cost1 and cost0 functions. Furthermore, removing the 1/m factor does not affect the optimum; it returns the same optimal Θ as before. Also note the "C" hyper-parameter here: in the SVM, the regularization is weighted differently. If you call the sum over the training set "A" and the regularization term "B", then the SVM objective can be written as CA + B, whereas in logistic regression it is written as A + λB.

C plays a similar role to λ, but they are not quite the same, since logistic regression and the SVM weight the training term and the regularization term differently. If C equals 1/λ, then minimizing A + λB gives the same optimal Θ as minimizing CA + B, because the two objectives differ only by an overall scaling factor.
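
As a quick sanity check (a toy example of my own, with made-up A(θ) and B(θ)), scaling the whole objective by 1/λ does not move the minimizer:

```python
import numpy as np

theta = np.linspace(-5, 5, 1001)

A = (theta - 2.0) ** 2   # hypothetical data-fit term A(theta)
B = theta ** 2           # hypothetical regularization term B(theta)

lam = 0.5
C = 1.0 / lam

logreg_style = A + lam * B   # A + lambda*B
svm_style = C * A + B        # C*A + B with C = 1/lambda

print(theta[np.argmin(logreg_style)])  # same minimizer...
print(theta[np.argmin(svm_style)])     # ...since C*A + B = (1/lambda)*(A + lambda*B)
```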

min over Θ:  C · Σ_{i=1..m} [ y(i)·cost1(Θ^t.x(i)) + (1 − y(i))·cost0(Θ^t.x(i)) ]  +  (1/2)·Σ_{j=1..n} θj^2

Here the SVM is a bit different from logistic regression: it does not just want to get the classification right, it also asks for an extra margin beyond zero. It does this by requiring Θ^t.x >= 1 (not just >= 0) when y = 1, and Θ^t.x <= -1 (not just < 0) when y = 0. The margin and the decision boundary are then affected by the "C" parameter.
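
To see why this produces a margin, here is a small sketch (repeating the cost1 helper from the earlier snippet so it stands alone) showing that a correctly classified point with 0 < Θ^t.x < 1 still pays a penalty:

```python
import numpy as np

def cost1(z):
    return np.maximum(0.0, 1.0 - z)

# Three positive examples (y = 1) with different values of z = theta^T x
for z in [2.0, 0.5, -1.0]:
    print(f"z = {z:+.1f}  ->  cost1(z) = {cost1(z):.1f}")
# z =  2.0 is outside the margin: zero cost
# z =  0.5 is correctly classified but inside the margin: still penalized
# z = -1.0 is misclassified: large penalty
```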

Let's consider what happens when the C parameter is very large. To keep the objective small, the optimization is pushed to drive the "A" term to zero, so effectively we are only optimizing the "B" term. The objective then becomes
C·0 + B = B, since the A term is forced to (nearly) zero.

If we visualize the result, we notice that the SVM has chosen a more robust separator: the decision boundary with the largest margin to the data. If we compare this with logistic regression's decision boundary, the latter will generally be different and is more likely to pass close to the data points.

[Figure: SVM large-margin decision boundary]

But what if the data contains an outlier, say among the star-class points? If the C parameter is very large, the SVM will change its decision boundary to try to separate that single outlier. If instead we choose a small (or at least not too large) value of C, the outlier can be ignored.
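
A minimal sketch of this effect, using scikit-learn's SVC with a linear kernel on toy data of my own (not from the original figures):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two clusters plus one outlier labeled with the "star" class
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
outlier = np.array([[1.8, 1.8]])           # a negative point sitting near the positives
X = np.vstack([X_pos, X_neg, outlier])
y = np.array([1] * 20 + [0] * 20 + [0])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    misses = int(np.sum(clf.predict(X) != y))
    print(f"C={C:7.1f}  w={w}  b={b:.3f}  misclassified={misses}")
# With a very large C the optimizer works much harder to classify the outlier
# (narrower margin, shifted boundary); with a small C it is essentially ignored.
```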

BUT!! Wait, what about non-linear functions? As mentioned at the start, "SVM gives a clear view of learning non-linear functions".

If we need a non-linear decision boundary, we often create a new set of features by adding polynomial terms, for example hθ(x) = θ0 + θ1.x1 + θ2.x2 + θ3.x1.x2. However, increasing the polynomial degree quickly becomes computationally expensive and does not necessarily give good predictions.
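
To illustrate how quickly this blows up (my own sketch, using scikit-learn's PolynomialFeatures):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0]])  # a single example with 3 raw features

for degree in (2, 3, 5, 8):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_features = poly.fit_transform(X).shape[1]
    print(f"degree {degree}: {n_features} polynomial features")
# The feature count grows combinatorially with the degree,
# which makes both training and prediction more expensive.
```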

The Gaussian kernel (Kernels I) solves this problem by creating landmarks. Landmarks are simply points selected on the plane. After selecting the landmarks, we create a new feature set from those landmarks using the following equations:

f1 = exp(− ||x − l(1)||^2 / (2σ^2))
f2 = exp(− ||x − l(2)||^2 / (2σ^2))
f3 = exp(− ||x − l(3)||^2 / (2σ^2))

and so on. f1, f2 and f3 are called similarity functions. If x is close to a landmark, the similarity function returns approximately 1; if x is far from the landmark, it returns approximately 0. Furthermore, the variance σ^2 controls how steeply the similarity falls off around each landmark.
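
A minimal sketch of one such similarity feature (my own helper, with a made-up landmark):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    """f = exp(-||x - l||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

l1 = np.array([3.0, 5.0])                                        # a hypothetical landmark
print(gaussian_similarity(np.array([3.1, 4.9]), l1))             # close to l1 -> ~1
print(gaussian_similarity(np.array([10.0, 0.0]), l1))            # far from l1 -> ~0
print(gaussian_similarity(np.array([3.1, 4.9]), l1, sigma=0.1))  # small sigma -> steeper fall-off
```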

If we used polynomial features for the decision boundary, it would look something like this.

[Figure: scatter plot of the data with a polynomial decision boundary]

However, the kernel maps the data points into a different feature space and thus creates a different boundary, which improves both prediction quality and computational cost. Plotting the data using the kernel features gives the following result.

[Figure: decision boundary obtained with the Gaussian kernel features]

To predict y for a given x with these features: a point close to a landmark gives, say, Θ^t.f = 0.5, so we predict y = 1, while a point far away (for example out near l3) gives, say, -0.5, so we predict y = 0. That is how the decision boundary above comes about.
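
As a toy sketch of this prediction rule (with made-up landmarks and weights in the spirit of the example, and the gaussian_similarity helper repeated so the snippet stands alone):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

# Hypothetical landmarks and weights (theta0, theta1, theta2, theta3)
landmarks = [np.array([3.0, 5.0]), np.array([5.0, 3.0]), np.array([1.0, 1.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])

def predict(x):
    f = np.array([1.0] + [gaussian_similarity(x, l) for l in landmarks])
    score = theta @ f                 # theta0 + theta1*f1 + theta2*f2 + theta3*f3
    return (1 if score >= 0 else 0), score

print(predict(np.array([3.1, 4.9])))  # near l1 -> score around +0.5, predict 1
print(predict(np.array([8.0, 8.0])))  # far from l1 and l2 -> score around -0.5, predict 0
```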

Now the questions: how many landmarks should there be, and where should we place them?

In Kernels II, we take the training data and, for each training example, place a landmark exactly at its location. Thus we end up with m landmarks. Given these, our hypothesis becomes θ0 + θ1.f1 + θ2.f2 + …. + θm.fm.
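
A sketch of building that feature matrix, one landmark per training example (my own toy data and helper names):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

# Toy training set: m examples with 2 raw features each
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 0.0]])
m = X_train.shape[0]

# Every training example doubles as a landmark, so F has shape (m, m):
# F[i, j] = similarity between example i and landmark j
F = np.array([[gaussian_similarity(x, l) for l in X_train] for x in X_train])
print(F.shape)  # (4, 4); the diagonal is all 1s, since each point matches its own landmark
```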

Computing with kernels can be expensive in general; however, SVM implementations use optimization tricks that make kernels far more efficient in practice.

Python implementation for SVM (Not done yet)
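
Until then, here is a minimal sketch of what such an implementation could look like with scikit-learn's SVC on toy data of my own (note that its RBF kernel takes gamma, which corresponds to 1/(2σ^2)):

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: class 1 inside a ring of class 0
rng = np.random.default_rng(1)
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) \
        + rng.normal(scale=0.3, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([1] * 50 + [0] * 50)

sigma = 1.0
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("prediction for a point near the origin:", clf.predict([[0.0, 0.0]]))
```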
