Learning machine learning for three months, contact with a variety of algorithms, but many know it but do not know it, so I want to do a summary of the knowledge of the past, this series of articles will not have too much algorithm deduction.

We know that the earlier classification model, the perceptron (1957), is a linear classification model of two categories, and also the basis of later neural networks and support vector machines. Support vector machine (Support vector machines) is also a two class classification model.Evolution has now become a multi linear process.

And nonlinear problems can also deal with regression problems. Before deep learning is popular, it should be considered the best classification algorithm. However, there are still many applications of SVM, especially in small sample sets.

**1、Perceptron model**

The perceptron model is a linear classifier with two classifications, which can only deal with the linear separable problem. The model of the perceptron is to try to find a hyperplane to separate the data sets. In the two-dimensional space, the hyperplane is a straight line, and in the three-dimensional space it is a plane. The classification model of the perceptron is as follows:

signA function is an indicator function (when wx+b > 0, f (x) = +1; when wx+b < 0, f (x) = -1;The hyperplane of the perceptron is wx+b = 0）

The above piecewise function is integrated into y (wx+b) > 0, it satisfies the sample point of the model, which is the correct point of classification, and is not satisfied with the point of classification error. Our goal is to find such a set of parameters, W, which separates the positive class points from the negative class points in the training center.

Then we define our loss function (the loss function is a function of the degree of loss and error). We can use the number of samples that define the classification error as a loss function, but this loss function is not a continuous derivable function of the parameter W, B, so it is not easy to optimize. We know the mistakeThe point of classification

There are -y (wx+b) > 0, we make all misclassification points to the hyperplane distance and minimum.**Note: the loss function of perceptron is only for misclassification, not the whole training set.**）：

M is a set of misclassified samples. When the W, B multiplier increases, it does not change our hyperplane, and the value of ||w|| will increase accordingly, so ||w|| = 1 will not affect our results. The final perceptron loss function is as follows:

**2、Support vector machine**

In the perceptron, our goal is to separate the training sets, as long as the hyperplanes that can separate the samples meet our requirements, and this hyperplane has a lot. Support vector machines are essentially similar to perceptron, but the requirements are even more demanding. We know that in the process of classification,

The points that are far from the hyperplane are safe, and those that are easily misclassified are very close to the hyper plane, and the idea of support vector machines is to focus on these points close to the hyper plane. In the same way, the nearest point to the hyperplane is the most close to the hyperplane at the same time as the classification is correct.

Based on the above perceptron, we can express our goals as follows:

γIt is the geometric interval from the nearest point of the hyperplane to the hyperplane, and the geometric interval is replaced by the function interval.

γ(The hat) is a function interval, and the value of the function interval varies with the W, B multiplier, and does not affect the final result, so that the gamma (HAT) = 1, then our final problem can be expressed as:

Here’s the support vector machine**The first highlight: maximizing the interval and maximizing the interval can make the classification more accurate, and the maximum interval hyperplane is present and unique.**

** **

The 1/2||w|| in the above problem^{2 }It is a convex function, and the constraint inequality is an affine function, so this is a convex two order problem. According to the convex optimization theory, we can transform our constraint problems into unconstrained problems by using the Lagrange function. Our optimization function can be expressed as:

α_{i} It’s a Lagrange multiplier, alpha_{i} ≥ 0 i = 1, 2, 3, ….., n 。

According to the duality of Lagrange, the original problem can be transformed into a dual problem (as long as the dual problem exists, the optimal solution of the dual problem is the optimal solution of the original problem, the general dual problem is easier to solve than the original problem) and the minimax problem is:

The minimum value of W and B is obtained first, and the value of W and B can be obtained.

The solution is substituted into the Lagrange function, and the following optimization function can be obtained.

So only we need to get the value of our alpha, we can get our W, the value of B (the common algorithm for alpha is SMO algorithm can refer to the https://www.cnblogs.com/pinard/p/6111471.html), assuming the final request.The value of the obtained alpha is alpha *, then W, B can be expressed as:

KTT conditions are introduced:

As you can see from the KTT condition, when y_{i}(w*x_{i} + b*) – 1 > 0 When, alpha_{i}* = 0；When alpha_{i}* > 0 At the time, y_{i}(w*x_{i} + b*) – 1 = 0；

Combined with the above W, B expressions can lead to support vector machines.**Second highlights: W, B parameters only to meet y _{i}(w*x_{i} + b*) – 1 = 0 The sample points are related to the nearest point of the hyperplane from the maximum interval. We call these points the support vector.**Therefore, support vector is often used to classify small sample sets very well. That is why. (

**It is also important to note that the number of the alpha vectors is equal to the number of training sets, and the number of parameters required for the large training set increases, so the SVM will be slower than the other common machine learning algorithms when dealing with the large training set.**）

**3、Soft spaced maximization**

In general, there are some anomaly points in the training set, which will cause the training set to be linearly unseparable, but after removing these exceptions, the remaining samples are linearly separable, and the maximum of the hard interval above is that the linear unseparable problem can not be handled. The linear undistinction means some samples.The function interval of a point is not satisfied

A constraint above or equal to 1. So we have each sample (x_{i}, y_{i}）Introducing a relaxation variableξ_{i}， Our constraints are as follows:

The penalty function of relaxation variable is added to the objective function, and the penalty parameter is C > 0, the target optimization function becomes:

Because the original problem of the whole solution can be described as:

By using the same method as before, Lagrange transforms the constraint problem into an unconstrained problem, and transforms the original problem into a dual problem for the minimax problem, and we can get our final result.

The only difference between the results in the second part is alpha._{i} The range of value is more than an upper limit C value. When the soft interval is maximized, the support vector description is more complex, because the support vector can be at the interval boundary (such as the dotted line in the lower graph), or between the interval boundary and the hyperplane, or in the separation of the misdivision side of the hyperplane.

** 4、Hinge loss function**

The hinge loss function, also known as the hinge loss function, is expressed as follows:

Therefore, the above optimization problem can be described as:

The first item in the above loss function can be understood as when the sample is correctly classified and the interval is greater than 1, that is, y._{i}(wx_{i} + b) ≥ 1At the time, the loss is 0, and when y_{i}(wx_{i} + b) < 1 At the time, the loss is 1 – y_{i}(wx_{i} + b)，**Note that even if the sample classification is correct, the accounting loss is less than 1, which is the severity of support vector machines.**。

The following is a comparison between the hinge loss function and some other loss functions.

**5、Linear nonseparable**

The maximization of the soft spaced space above can only solve the linear inseparable problem caused by the anomalous point, but it is impossible for the data set to be nonlinear. According to the relevant theory, it is linearly separable after mapping it to the high dimensional space for the linear and unseparable problem in the low dimensional space.To use this theory

In support vector machines, a function x can be introduced to map the sample set from the current dimension to the higher dimension and look back on our previous optimization function:

We only need to optimize the X of the inner product in the optimization function_{i } **·** x_{j }Convert into x_{i}) **· **ϕ(x_{j}) We can solve our nonlinear problem, but at the same time, we also introduce a new problem, when our data dimension increases, the amount of inner product is also increased. When the dimension after mapping is very high, even after the infinite dimension, the calculation of the model will also increase significantly.

So how do you deal with this problem? This requires the introduction of our kernel function.

We know that even after mapping to higher dimensions, inner product (x)_{i}) **· **ϕ(x_{j}) The value is also a constant. Is there such a function?

K( x_{i }**·** x_{j }) = ϕ(x_{i}) **· **ϕ(x_{j}) ，It is theoretically proved that when such a function is present (Mercer theorem has been proved), we call it a kernel function.

This leads to the support vector machine**The third highlight is that we do not need to map the samples to high-dimensional space, and use kernel function to solve the nonlinear classification problem.**

We have solved the problem by using the kernel function. Of course, not any function can be used as a kernel function, and the proven kernel function is not much, and the common kernel function is a few. Then we introduce several common kernel functions.

1）Linear kernel function

Linear kernel functions are well understood and can only be used to deal with linear problems.

So we can put linear SVM and nonlinear SVM together, only by setting up kernel functions and processing them.

2）Polynomial kernel function

The values of a, C and D are all set up by adjusting parameters.

3）Radial basis kernel function

The RBF kernel function is also called the Gauss kernel function

The parameter is less, and you only need to set up the parameter.

4）Sigmoid kernel function

K(x, y) = tanh (ax **·** z + r)

We need to set the parameters a, R, tanh function hyperbolic tangent function, which is also commonly used as activation function in neural network.

According to the Taylor expansion, we know that the higher order derivative functions can be expressed by polynomial functions, where both the RBF kernel function and the Sigmoid kernel function are high order derivatives, so they can be expressed by polynomials. In other words, radial basis functions and Sigmoid kernel functions can be expressed.The higher order polynomials, therefore, when choosing a kernel function,

The RBF kernel function and Sigmoid kernel function are usually better than polynomial kernel functions, because they can automatically match the order, but do not need to specify the order of polynomial kernel function. In addition, the polynomial kernel needs to debug more parameters, so the RBF kernel function is usually selected.

**6、summary**

In general, SVM is basically the best classification algorithm before integration algorithms and neural networks are popular, but even today, it still holds a high position.

SVM The main advantages are:

1）The maximum interval is introduced and the accuracy of classification is high

2）It can also be categorization accurately when the sample size is small, and has good generalization ability.

3）The introduction of kernel function can solve the nonlinear problem easily.

4）It can solve the classification and regression problems of high-dimensional features, even if the feature dimension is larger than the sample data, it can also perform well.

SVM The main shortcomings are:

1）When the sample size is very large, the calculation of the inner product in the kernel function and the calculation of the calculation of the Lagrange multiplier alpha value are all related to the number of samples, which will lead to too much calculation in the solution of the model.

2）The selection of kernel functions usually has no explicit guidance, sometimes it is difficult to select a proper kernel function, and like a polynomial kernel, the parameters that need to be debugged are very much.

3）SVM Sensitive to missing data (as if many algorithms are sensitive to missing values, for missing values, either in Feature Engineering or in tree models).