This post is the second and last part of a double entry about how SVMs work (theoretical, in practice, and implemented). You can check the first part, SVM – Support Vector Machine explained with examples.
Index
- Introduction
- Optimization and Quadratic Programming
- Sequential Minimal Optimization (SMO) algorithm
- Algorithm
- Results
Introduction
Since the algorithm uses Lagrange to optimize a function subject to certain constrains, we need a computer-oriented algorithm to implement SVMs. This algorithm called Sequential Minimal Optimization (SMO from now on) was developed by John Platt in 1998. Since it is based on Quadratic Programming (QP from now on) I decided to learn and write about it. I think it is not strictly necessary to read it, but if you want to fully understand SMO, it is recommended to understand QP, which is very easy given the example I will briefly describe.
Optimization and Quadratic Programming
QP is a special type of mathematical optimization problem. It describes the problem of optimizing (min or max) a quadratic function of several variables subject to linear constraints on these variables. Convex functions are used for optimization since they only have one optima which is the global optima.
Optimality conditions:
If it is a positive definite, we have a unique global minimizer. If it is a positive indefinite (or negative), then we do not have a unique minimizer.
Example
Minimize
Initial point (where we start iterating)
Initial active set (we will only take into account these constraints).
Since a quadratic function looks like:
We have to configure the variables as:
Iteration 1
Solve EQP defined by , so we have to deal with the first two constraints. KKT method is used since it calculates the Lagrangian multipliers.
indicates the next point the algorithm will iterate.
Now it is necessary to check whether this solution is feasible:
It is clear that this point was going to be feasible because the point (0,0) is where we started.
Now, we remove a constraint with negative and move to the next value
.
Graphically, we started on (0,0) and we have to move to the same point (0,0). The removal of the constraint indicates to which axis and direction we are going to move: up or right. Since we took the first constraint out, we will move to the right. Now we need to minimize constrained to the second constraint.
Iteration 2
Solve EQP defined with
Check whether this is feasible: ?
Iteration 3
Solve EQP defined with
Check whether this is feasible:
It is not feasible. 3rd constrained is violated (that is why the third element is 0). As it is not feasible, we have to maximize
So,
Finally we add the constraint that was violated.
Iteration 4
Solve EQP defined with
Check whether this is feasible:
Iterations and steps are drawn on the plane below. It is graphically easy to see that the 4th step was illegal since it was out of the boundaries.
Sequential Minimal Optimization (SMO) algorithm
SMO is an algorithm for solving the QP problem that arises during the SVM training. The algorithm works as follows: it finds a Lagrange multiplier that violates the KKT conditions for the optimization problem. It picks a second multiplier
and it optimizes the pair
, and repeat this until convergence. The algorithm iterates over all
pairs twice to make it easier to understand, but it can be improved when choosing
and
.
Because we have that for a pair of
. This confines optimization to be as the following lines show:
Let (assuming that
)
We want to optimize:
If we pull out and
we have:
Let and:
It represents the original formula without the first and second
Let . This means that
is the output of
under old parameters.
Now we substitute in our original formula with and
pulled out using new variables:
.
Note that here we assume that
Now we substitute using
The first is a constant so it will be added to the
value.
Constant terms are grouped and is applied:
Let . Now, the formula is reduced to
. Let us focus on the second part (the coefficient “
“). Remember that
Since : (mathematical convenience)
Therefore, the objective function is:
First derivative:
Second derivative:
Note that . Proof: Let
. Then
This is important to understand because when we want to use other kernels this has to be true.
Therefore, this is the formula to get a maximum:
While performing the algorithm, after calculating (or
) it is necessary to clip its value since it is constrained by
. This constraint comes from the KKT algorithm.
So we have that:
If , then
If , then
If , then
If , then
If
Why do we need to clip our value?
We can understand it using a very simple example.
Remember that if , then
remains always constant. If
, then
so they can change as much as they want if they remain equal. In the first case, the numbers are balanced, so if
, as they sum 0.7, they can change maintaining that constraint true, so they may end up being
.
1) As explained before, and taking into account that ,
can only increase 0.3 because
can only decrease 0.3 keeping this true:
. Likewise,
can only decrease 0.4 because
can only increase that amount.
2) If they are different, can grow up to 1 because similarly,
will grow up to 1 (which is the value of C and hence, the upper boundary).
can only decrease till 0.1 because if it decreases till 0,
would be -0.1 and the constraint would be violated.
Remember that the reason why has to be
is because of the KKT algorithm.
|
||||||||||
|
If |
After updating and
we still need to update b.
(the increment of a variable is given by the increment of all variables which are part of it).
The change in the threshold can be computed by forcing 's without changing
: training data
Output:
: Lagrange for multipliers for solution
: threshold for solution
Algorithm:
Initialize
Initialize
Calculate using (2)
Calculate using (2)
Save old 's:
Compute and
by (10) and (11)
Continue to
Compute by (14)
Continue to
Compute and clip new value for using (12) and (15)
Determine value for using (16)
Compute and
using (17) and (18) respectively
Compute by (19)
(*B*)
Algorithm Legend
(*A*): If the difference between the new and
is negligible, it makes no sense to update the rest of variables.
(*B*): Useful data are those samples whose had a nonzero value during the previous algorithm iteration.
(2):
(10):
(11):
(12):
(14):
(15): and as
we get that
(*2*): This line has 4 parts: if ((a && b) || (c && d)). A and C parts control that you will not enter if the error is lower than the tolerance. And honestly, I don't get parts B and D. Actually I tried to run this code with several problems and the result is always the same with and without those parts. I know that alpha is constrained to be within that range, but I don't see the relationship between A and B parts, and C and D parts separately, because that double constraint (be greater than 0 and lower than C) should be always checked.
Results
Given a 3D space we have two samples from each class. The last column indicates the class:
Solution:
The code is provided in the Source code section.
References
1. The Simplified SMO Algorithm http://cs229.stanford.edu/materials/smo.pdf
2. Sequential Minimal Optimization for SVM http://www.cs.iastate.edu/~honavar/smo-svm.pdf
3. Inequality-constrained Quadratic Programming - Example https://www.youtube.com/watch?v=e6jDGxNZ-kk






I would like to modify the parallel sequential minimal optimization algorithm for my thesis and need some help. Based on recommendations for future work, the authors suggested that it is worthy to perform the multiclass classification problem in parallel…can you help me with some explanation? Tks