SVM algorithm improvements

In the previous post I presented an SVM implementation in Matlab. I consider that post and implementation valuable because it is not easy to find a simple SVM implementation: there are plenty of files around that may implement a very interesting algorithm but are insanely difficult to examine if you want to learn how the algorithm works. That is the main reason why I put so much effort into the implementation I developed, which follows the algorithm in [1].

First improvement

After I implemented that algorithm, everything seemed to work:

Example 1:

[Figure: Example 1]

Data point coordinates and targets:
 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 6 & 7 & 1 \end{bmatrix}

Distance to the border from each point:
 dist = \begin{bmatrix} 5.0598 \\ 5.0597 \\ 5.0596 \end{bmatrix}

Very acceptable results, right? However, when I added more points, I experienced odd behavior.
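For reference, the distances above are the geometric distances from each sample to the separating hyperplane. Given a learned weight vector w and bias b, the distance is |w·x + b| / ||w||. A Python sketch (the post's code is Matlab); the values of w and b below are made up for illustration, not the ones learned in this example:

```python
import numpy as np

# Hypothetical boundary w . x + b = 0; w and b are illustrative values,
# not the parameters learned in the example above.
w = np.array([1.0, 1.0])
b = -4.0

def distance_to_boundary(x, w, b):
    """Unsigned geometric distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

print(distance_to_boundary(np.array([6.0, 7.0]), w, b))  # |6 + 7 - 4| / sqrt(2)
```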

Example 2:

[Figure: Example 2]

 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 9 & 12 & 1 \\ 7 & 12 & 1 \\ 6 & 7 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 9.0079 \\ 7.3824 \\ 7.3824 \\ 5.6215 \\ 3.3705 \end{bmatrix}

It clearly fails at finding the optimal boundary. The interesting thing is that the distances between the boundary and the second and third points are the same, which means the rest of the samples were ignored and the algorithm focused only on those two. Indeed, \alpha_2 and \alpha_3 were the only nonzero values.

After debugging the code, everything seemed to be right according to the algorithm I followed [1], but after many trials I saw what finally led me to discover the error. In my trial the first two elements belong to class -1 whereas the rest belong to the other class. As you can see in the following examples, when I changed the order of the elements in the second class, the boundary changed depending only on the first element of the second class, i.e., the third sample.

Example 3:

[Figure: Example 3]

 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 7 & 12 & 1 \\ 9 & 12 & 1 \\ 6 & 7 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 8.8201 \\ 6.5192 \\ 6.5192 \\ 8.2065 \\ 2.9912 \end{bmatrix}

Example 4:

[Figure: Example 4]

 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 6 & 7 & 1 \\ 7 & 12 & 1 \\ 9 & 12 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 5.0598 \\ 5.0597 \\ 5.0596 \\ 7.5894 \\ 9.4868 \end{bmatrix}

In this last trial we get the best solution because the algorithm happens to focus on the third sample, which is the closest one to the other class. However, this is not always the case, so I needed to fix it. The fix is very simple, but it was not easy to find (at least quickly).

While debugging the code, I realized that the first loop (iterating over i) never reached the 4th and 5th samples. The reason was easy to understand: after the temporary boundary was calculated (temporary because it is not necessarily the best one), those samples produced no error since the algorithm classified them correctly, so the code never entered the branch guarded by the tolerance check. In other words, if there is no error, the algorithm does not try to fix anything because the sample is already classified correctly (which actually makes sense).

If those samples were skipped in the i loop on purpose, then they should be encountered in the other loop. Surprisingly, the algorithm did not encounter any of them in the inner loop either. After checking that I had written the code according to the algorithm [1], I concluded that there had to be a mistake in the algorithm itself. The mistake was the line "Continue to \text{next i}": because of it, the rest of the j's were ignored, so it should read "Continue to \text{next j}".

Thus, the fix in the Matlab code was pretty simple: changing "break" to "continue". break stops the inner loop entirely and resumes in the outer loop, whereas continue skips the rest of the current iteration and moves on to the next value of the same loop.
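The post's code is Matlab, but the difference is the same in any language. A minimal Python sketch of the effect:

```python
# Minimal demonstration of the break-vs-continue difference that caused
# the bug: with break, the inner loop abandons the remaining j's; with
# continue, it only skips the current j and keeps searching.
def visited_js(use_break):
    visited = []
    for j in range(5):
        if j == 2:  # e.g. L == H for this pair, so the pair must be skipped
            if use_break:
                break      # wrong: the rest of the j's are never tried
            continue       # right: move on to the next j
        visited.append(j)
    return visited

print(visited_js(use_break=True))   # [0, 1]
print(visited_js(use_break=False))  # [0, 1, 3, 4]
```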

Second improvement

After the first improvement was implemented, the algorithm seemed to work across many trials, but when I tried more complex examples it failed again.

[Figure: Example 5]

 data = \begin{bmatrix} -7 & -4 & -1 \\ -9 & -8 & -1 \\ 2 & 5 & -1 \\ -3 & -10 & -1 \\ 9 & 7 & 1 \\ 3 & 8 & 1 \\ 8 & 11 & 1 \\ 8 & 9 & 1 \end{bmatrix}

The original algorithm [1] uses the variable \text{num\_changed\_alphas} to track whether any alpha changed. If no alphas change for \text{max\_passes} consecutive iterations, the algorithm stops. I think the idea of iterating several times over the main algorithm is correct, but the algorithm must focus on those samples that will help build the boundary. After I implemented this modification, the algorithm needed fewer iterations than the original. Additionally, the original implementation seemed to fail in many cases whereas my implementation works.

When the algorithm iterates once, the alphas are updated such that the nonzero alphas correspond to the samples that will help build the boundary. In this example, after the first iteration, the alpha values are:

 \alpha = \begin{bmatrix} 0 \\ 0 \\ 0.2 \\ 0 \\ 0 \\ 0.2 \\ 0 \\ 0 \end{bmatrix}

Therefore, in the next iteration the algorithm will restrict the samples and focus only on sample #3 and sample #6. After this modification was done, all the trials I tried worked perfectly. This is the result on the same problem:
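A sketch of the pruning step (in Python; the post's implementation is Matlab), using the alphas and data from the example above:

```python
import numpy as np

# After one iteration, keep only the samples whose alpha is nonzero
# (the candidate support vectors). Alphas and data are taken from the
# example above; the last column of "data" is the target.
alpha = np.array([0, 0, 0.2, 0, 0, 0.2, 0, 0])
data = np.array([[-7, -4, -1], [-9, -8, -1], [2, 5, -1], [-3, -10, -1],
                 [9, 7, 1], [3, 8, 1], [8, 11, 1], [8, 9, 1]])

useful_data = data[alpha != 0]   # only samples #3 and #6 survive
print(useful_data)
```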

[Figure: Example 6]

Algorithm

This is the algorithm [1] after both improvements:

Initialize \alpha_i = 0, \forall i, b = 0
Initialize \text{counter} = 0
\text{while } ((\exists \alpha_i = 0) \text{ } \&\& \text{ } (\text{counter} < \text{max\_iter}))
  Initialize input and \alpha
  \text{for } i = 1, \dots, numSamples
    Calculate E_i = f(x^{(i)}) - y^{(i)} using (2)
    \text{if } ((y^{(i)} E_i < -tol \quad \&\& \quad \alpha_i < C) \| (y^{(i)} E_i > tol \quad \&\& \quad \alpha_i > 0))
      \text{for } j = 1, \dots, numSamples, \quad j \neq i
        Calculate E_j = f(x^{(j)}) - y^{(j)} using (2)
        Save old \alpha's: \alpha_i^{(old)} = \alpha_i, \alpha_j^{(old)} = \alpha_j
        Compute L and H by (10) or (11)
        \text{if } (L == H)
          Continue to \text{next } j
        Compute \eta by (14)
        \text{if } (\eta \geq 0)
          Continue to \text{next } j
        Compute and clip new value for \alpha_j using (12) and (15)
        \text{if } (|\alpha_j - \alpha_j^{(old)}| < 10^{-5}) (*A*)
          Continue to \text{next } j
        Determine value for \alpha_i using (16)
        Compute b_1 and b_2 using (17) and (18) respectively
        Compute b by (19)
      \text{end for}
    \text{end if}
  \text{end for}
  \text{counter} = \text{counter} + 1
  \text{data} = \text{useful\_data} (*B*)
\text{end while}

Algorithm Legend

(*A*): If the difference between the new \alpha_j and \alpha_j^{(old)} is negligible, it makes no sense to update the rest of the variables.
(*B*): Useful data are those samples whose \alpha had a nonzero value during the previous algorithm iteration.
(2): f(x) = \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)},x \rangle + b
(10): \text{If } y^{(i)} \neq y^{(j)}, \quad L = \max(0, \alpha_j - \alpha_i), \quad H = \min(C, C + \alpha_j - \alpha_i)
(11): \text{If } y^{(i)} = y^{(j)}, \quad L = \max(0, \alpha_i + \alpha_j - C), \quad H = \min(C, \alpha_i + \alpha_j)
(12): \alpha_j := \alpha_j - \frac{y^{(j)}(E_i - E_j)}{\eta}
(14): \eta = 2 \langle x^{(i)},x^{(j)} \rangle - \langle x^{(i)},x^{(i)} \rangle - \langle x^{(j)},x^{(j)} \rangle
(15): \alpha_j := \begin{cases} H & \text{if } \alpha_j > H \\ \alpha_j & \text{if } L \leq \alpha_j \leq H \\ L & \text{if } \alpha_j < L \end{cases}
(16): \alpha_i := \alpha_i + y^{(i)} y^{(j)} (\alpha_j^{(old)} - \alpha_j)
(17): b_1 = b - E_i - y^{(i)} (\alpha_i - \alpha_i^{(old)}) \langle x^{(i)},x^{(i)} \rangle - y^{(j)} (\alpha_j - \alpha_j^{(old)}) \langle x^{(i)},x^{(j)} \rangle
(18): b_2 = b - E_j - y^{(i)} (\alpha_i - \alpha_i^{(old)}) \langle x^{(i)},x^{(j)} \rangle - y^{(j)} (\alpha_j - \alpha_j^{(old)}) \langle x^{(j)},x^{(j)} \rangle
(19): b := \begin{cases} b_1 & \text{if } 0 < \alpha_i < C \\ b_2 & \text{if } 0 < \alpha_j < C \\ (b_1 + b_2)/2 & \text{otherwise} \end{cases}
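The post's implementation is Matlab; as a cross-check, here is a self-contained Python sketch of the algorithm above with both improvements applied: the "continue to next j" in the inner loop and the pruning of zero-alpha samples between passes. The stopping rule is one reading of the while condition: each pass re-initializes the alphas, and the loop stops when the set of nonzero-alpha samples no longer changes.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_iter=50):
    """Simplified SMO (linear kernel) with the two fixes from the post."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha = np.zeros(m)
    b = 0.0
    active = np.arange(m)                  # samples currently considered

    def f(x):                              # decision function, eq. (2)
        return np.sum(alpha * y * (X @ x)) + b

    for _ in range(max_iter):
        alpha[:] = 0.0                     # re-initialize alpha each pass
        b = 0.0
        for i in active:
            E_i = f(X[i]) - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or \
               (y[i] * E_i > tol and alpha[i] > 0):
                for j in active:
                    if j == i:
                        continue
                    E_j = f(X[j]) - y[j]
                    a_i_old, a_j_old = alpha[i], alpha[j]
                    if y[i] != y[j]:                              # eq. (10)
                        L = max(0.0, alpha[j] - alpha[i])
                        H = min(C, C + alpha[j] - alpha[i])
                    else:                                         # eq. (11)
                        L = max(0.0, alpha[i] + alpha[j] - C)
                        H = min(C, alpha[i] + alpha[j])
                    if L == H:
                        continue           # the fix: next j, not next i
                    eta = 2 * X[i] @ X[j] - X[i] @ X[i] - X[j] @ X[j]  # (14)
                    if eta >= 0:
                        continue
                    alpha[j] -= y[j] * (E_i - E_j) / eta          # eq. (12)
                    alpha[j] = min(max(alpha[j], L), H)           # eq. (15)
                    if abs(alpha[j] - a_j_old) < 1e-5:            # (*A*)
                        continue
                    alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])  # eq. (16)
                    b1 = (b - E_i                                   # eq. (17)
                          - y[i] * (alpha[i] - a_i_old) * (X[i] @ X[i])
                          - y[j] * (alpha[j] - a_j_old) * (X[i] @ X[j]))
                    b2 = (b - E_j                                   # eq. (18)
                          - y[i] * (alpha[i] - a_i_old) * (X[i] @ X[j])
                          - y[j] * (alpha[j] - a_j_old) * (X[j] @ X[j]))
                    if 0 < alpha[i] < C:                            # eq. (19)
                        b = b1
                    elif 0 < alpha[j] < C:
                        b = b2
                    else:
                        b = (b1 + b2) / 2
        new_active = np.flatnonzero(alpha)  # (*B*): keep useful data only
        if len(new_active) == 0 or np.array_equal(new_active, active):
            break                           # support set stable: done
        active = new_active
    return alpha, b
```

On the five-point data from Example 2, this sketch classifies all samples correctly, which the inner-loop break prevented in the unmodified version.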

The code is provided in the Source code section.

References

1. The Simplified SMO Algorithm http://cs229.stanford.edu/materials/smo.pdf


"The only way to prove that you understand something is by programming it"
