# Support Vector Machines

The SVM, or large-margin classifier, is similar to other machine learning classification algorithms such as logistic regression. Given labeled data, the algorithm finds the optimal hyperplane that separates the categories. Compared with other algorithms such as neural networks and logistic regression, the SVM gives a clearer view of learning non-linear functions.

Logistic regression (sigmoid function) says that hΘ(x) should be close to 1 when y = 1 and close to 0 when y = 0, so Θ^T x should be much larger than 0 or much smaller than 0, respectively. The problem with logistic regression is its cost function.

When we plot the cost against z = Θ^T x, the cost for y = 1 is

$$-\log\left(\frac{1}{1 + e^{-z}}\right)$$

and similarly, the cost for y = 0 is

$$-\log\left(1 - \frac{1}{1 + e^{-z}}\right)$$

The plots show that the cost is already non-zero at z = 0 and keeps growing as z becomes more negative (for y = 1) or more positive (for y = 0). Examples on the wrong side of zero therefore contribute a large cost.
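The two curves can be evaluated numerically to see this behavior. The following is a minimal sketch; the function names `sigmoid` and `logistic_cost` are chosen here for illustration and do not come from the original text.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(z, y):
    """Per-example logistic regression cost for a label y in {0, 1}."""
    h = sigmoid(z)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Print the cost for both labels at a few values of z.
for z in (-3, 0, 3):
    print(z, round(logistic_cost(z, 1), 3), round(logistic_cost(z, 0), 3))
```

For y = 1 the cost shrinks as z grows, and for y = 0 it shrinks as z falls, but in neither case does it reach exactly zero.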

The SVM cost function is slightly different from the logistic regression cost function. For y = 1, the cost is flattened to 0 once z ≥ 1, and similarly, for y = 0, the cost is flattened to 0 once z ≤ -1; elsewhere it grows linearly. We can now replace the two logistic regression cost terms with cost1 and cost0, respectively.
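A minimal sketch of cost1 and cost0 follows. It assumes the standard hinge shape with slope 1 and the margin thresholds 1 and -1; the exact slope of the linear part is an assumption and not something the text specifies.

```python
def cost1(z):
    """Cost for y = 1: zero once z >= 1, growing linearly as z drops below 1.

    The slope of 1 is an assumption (standard hinge loss)."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Cost for y = 0: zero once z <= -1, growing linearly as z rises above -1.

    The slope of 1 is an assumption (standard hinge loss)."""
    return max(0.0, 1.0 + z)
```

Unlike the logistic cost, these functions are exactly zero on the flat part, so confidently correct examples contribute nothing to the objective.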

If you know the logistic regression objective, the SVM objective will look familiar. Here we have replaced the log terms of hΘ(x) with the cost1 and cost0 functions. Furthermore, removing the 1/m factor does not affect the optimum: minimizing returns the same optimal Θ as before. Also note the hyperparameter “C”. In SVM, the regularization is parameterized differently. If you call the sum of the costs over the training set “A” and the regularization term “B”, then the SVM objective can be written as CA + B, whereas in logistic regression it is A + λB.
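The CA + B form can be sketched in code. This is an illustration under stated assumptions, not a definitive implementation: B is taken as half the sum of squared parameters, the intercept `theta[0]` is left unregularized, and cost1 and cost0 use a hinge shape with slope 1. The name `svm_objective` is hypothetical.

```python
def cost1(z):
    return max(0.0, 1.0 - z)  # assumed hinge shape for y = 1

def cost0(z):
    return max(0.0, 1.0 + z)  # assumed hinge shape for y = 0

def svm_objective(theta, X, y, C):
    """Return C*A + B, where theta[0] is the intercept."""
    A = 0.0
    for x_i, y_i in zip(X, y):
        z = theta[0] + sum(t * x for t, x in zip(theta[1:], x_i))
        A += y_i * cost1(z) + (1 - y_i) * cost0(z)
    # Assumption: B = 1/2 * sum of squared weights, intercept excluded.
    B = 0.5 * sum(t * t for t in theta[1:])
    return C * A + B

# Hypothetical one-feature data set: one positive and one negative example.
X = [[2.0], [-2.0]]
y = [1, 0]
print(svm_objective([0.0, 1.0], X, y, C=10.0))   # both margins met, only B remains
print(svm_objective([0.0, 0.25], X, y, C=10.0))  # margins violated, A is scaled by C
```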

C plays a role similar to λ, but inverted: logistic regression weights the regularization term with λ, while SVM weights the training-set term with C. If C equals 1/λ, then minimizing CA + B gives the same optimal Θ as minimizing A + λB. SVM also differs from logistic regression in that it not only gets the result right but also demands an extra margin: it asks for Θ^T x ≥ 1 (not just ≥ 0) when y = 1 and Θ^T x ≤ -1 (not just < 0) when y = 0. The margin and the decision boundary are affected by the “C” parameter.
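The relationship between C and λ can be checked in one line, assuming λ > 0 and using A and B as defined above:

$$C A + B = \frac{1}{\lambda} A + B = \frac{1}{\lambda}\left(A + \lambda B\right) \quad \text{when } C = \frac{1}{\lambda}$$

Multiplying an objective by the positive constant 1/λ does not move its minimizer, so both forms select the same Θ even though their values differ.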

Let's consider the case where the C parameter is very large. The minimization is then driven to make “A” zero, so that only the “B” term remains to be optimized. The objective becomes C · 0 + B = B, minimized subject to the margin conditions that make A zero.
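Written out, the large-C case becomes a constrained problem. The form below is a sketch that assumes B is half the sum of squared parameters and that the data can be separated with the margins of 1 and -1:

$$\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{subject to} \quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0$$

The constraints are exactly the conditions under which every cost1 and cost0 term, and therefore A, is zero.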

If we visualize it, we notice that the SVM has chosen a decision boundary that is a more robust separator, keeping a large margin from both classes. A logistic regression decision boundary for the same data may be different and is more likely to lie close to the data points. But what if there is an outlier, say among the star-class data points? If the C parameter is very large, the SVM will change its decision boundary to try to separate that outlier. If we choose a small or not too large value for “C”, the outlier can be ignored.

But wait, what about non-linear functions? As mentioned earlier, the SVM gives a clearer view of learning non-linear functions.

If we have a non-linear function, we often create a new set of features by adding polynomial terms. For example, hθ(x) = θ0 + θ1.x1 + θ2.x2 + θ3.x1.x2. However, increasing the degree of the polynomial makes the computation more expensive and can lead to poorer predictions.
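To get a feel for the computational cost, the sketch below counts how many polynomial terms a full expansion produces. It counts every monomial up to the given total degree, including squares and the constant term, which is more than the four-term example above; the function name is hypothetical.

```python
from math import comb

def num_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables,
    including the constant term."""
    return comb(n_features + degree, degree)

print(num_polynomial_terms(2, 2))    # 1, x1, x2, x1^2, x1*x2, x2^2
print(num_polynomial_terms(100, 2))
print(num_polynomial_terms(100, 3))
```

With 100 input features, moving from degree 2 to degree 3 raises the number of terms from a few thousand to well over a hundred thousand.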

The Gaussian kernel (Kernel I) solves this problem by creating landmarks. Landmarks are simply points selected in the input space. After selecting the landmarks, we create a new feature set from those landmarks using the following equations.

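$$
\begin{aligned}
f_1 &= \exp\left(-\frac{\lVert x - l^{(1)} \rVert^2}{2\sigma^2}\right) \\
f_2 &= \exp\left(-\frac{\lVert x - l^{(2)} \rVert^2}{2\sigma^2}\right) \\
f_3 &= \exp\left(-\frac{\lVert x - l^{(3)} \rVert^2}{2\sigma^2}\right)
\end{aligned}
$$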

and so on. f1, f2 and f3 are defined by the similarity function. If x is close to a landmark, the similarity function returns approximately 1; if x is far from the landmark, it returns approximately 0. Furthermore, the variance (σ^2) controls how steeply the similarity falls off around the landmark: a small σ^2 gives a sharp, narrow peak, and a large σ^2 gives a gentle, wide one.
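The similarity function is short enough to write out directly. This is a minimal sketch; the name `gaussian_similarity` and the sample points are hypothetical.

```python
import math

def gaussian_similarity(x, landmark, sigma):
    """exp(-||x - l||^2 / (2 * sigma^2)) for equal-length sequences x and l."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

landmark = (3.0, 5.0)  # hypothetical landmark
print(gaussian_similarity((3.0, 5.0), landmark, sigma=1.0))    # x on the landmark
print(gaussian_similarity((13.0, 15.0), landmark, sigma=1.0))  # x far away
print(gaussian_similarity((3.0, 6.0), landmark, sigma=1.0))    # distance 1, wide peak
print(gaussian_similarity((3.0, 6.0), landmark, sigma=0.5))    # distance 1, narrow peak
```

The last two calls use the same point with different σ, which shows how a smaller σ makes the similarity drop faster with distance.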

If we used polynomial features, the decision boundary would be shaped by the polynomial terms. With the kernel, each data point is instead described by its similarity to the landmarks, which creates different boundaries and can improve both prediction and computation performance. To find y for a given x, we evaluate the weighted sum of the similarity features: a point x close to a landmark gives, let's say, 0.5, so we predict y = 1, while a point far from the landmarks gives, let's say, -0.5, so we predict y = 0. The decision boundary therefore encloses the region around the landmarks that predict y = 1.

Now the questions: how many landmarks should there be, and where should we place those landmarks?

In Kernel II, we take the training data and place a landmark exactly at the location of each training example. Thus we end up with ‘m’ landmarks. Given these, our hypothesis becomes θ0 + θ1.f1 + θ2.f2 + … + θm.fm
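The landmark placement and the resulting hypothesis can be sketched as follows. The training points and the θ values are hypothetical, chosen only to illustrate the mechanics; in practice θ would come from minimizing the SVM objective.

```python
import math

def gaussian_similarity(x, landmark, sigma):
    """exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """Map x to [f1, ..., fm], one similarity per landmark."""
    return [gaussian_similarity(x, l, sigma) for l in landmarks]

def hypothesis(theta, features):
    """theta0 + theta1*f1 + ... + thetam*fm."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))

# Hypothetical training inputs; each one doubles as a landmark.
X_train = [(1.0, 1.0), (2.0, 1.5), (6.0, 5.0)]
landmarks = list(X_train)

# m x m feature matrix: row i holds the similarities of example i to every landmark.
F = [kernel_features(x, landmarks, sigma=1.0) for x in X_train]

# Hypothetical parameters, not learned from data.
theta = [-0.5, 1.0, 1.0, 0.0]
near = hypothesis(theta, kernel_features((1.0, 1.0), landmarks, sigma=1.0))
far = hypothesis(theta, kernel_features((20.0, 20.0), landmarks, sigma=1.0))
print(near >= 0, far >= 0)  # predict y = 1 when the hypothesis is >= 0
```

Each training example has similarity 1 to its own landmark, so the diagonal of `F` is all ones.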

Using kernels can be computationally expensive; however, SVMs handle this computation far more efficiently.

Python implementation for SVM (Not done yet)