[Hough transform] Line detection (Cartesian, Polar and Space reduction)

Brief Introduction

Line detection is one of the most important and basic feature extraction methods. Many currently developing and promising fields such as self driving cars may use line detection to detect lanes. Thus, it is important to understand how it works (both mathematically and the implementation).

As we are using a 2D plane (an image) we can use Cartesian or Polar parameterization. Polar parameterization is useful not only because of its own advantages, but also because it allows the algorithm to reduce costs by space reduction.


Let us keep in mind the line equation:
y = mx + c
In homogeneous form:
Ay+Bx +1 = 0 \quad where A = -1/c, B = m/c

To determine the line we must find m, c \quad \text{(or A, B)}

The way HT works is by simply counting the potential solution in an accumulator, tracing all possible lines for each point within the main iteration. Hence, finding the maximum in the accumulator means finding the line with the highest probability.

When iterating, after checking that a black pixel (typically corresponding to an edge) has been detected, it iterates over two different “for” loops. The first loop corresponds to angles between -45 and 45 degrees (both inclusive) and the second loop between 45 and 135. It is necessary to separate them because for slopes whose degrees are larger than 45 or lower than -45, c (intersection with y-axis) may take large values. Thus, an additional accumulator for angles between 45 and 135 is needed, which will store a similar c variable whose value is the intersection with x-axis rather than y-axis.

As we can see in the image below for the case when angles are between -45 and 45, when c is out of bounds (a) (bounds are 0 and the height), the accumulator is not increased. Otherwise, the accumulator will increase for all those angles between the allowed boundaries (b) (green). Note that all those angles that are outside are later taken into account when examining x-axis.


In the following picture, we have 5 points that may compound a line. In the green zone of the left side you can see how the region in the middle is getting larger and larger. That represents the accumulator values for those angles and intersection with the y-axis, and it shows that there is a high probability to find a line, as it is. Likewise, we can also see that the behavior of this algorithm is strong against noise and occlusion. As a drawback, two large matrices are needed.


Cartesian Algorithm

Iterate over rows (y)
Iterate over columns(x)
If an edge is detected
For angles between -45 and 45 (m)
Calculate c (y-axis intersections)
If c is between the bounds, increase accumulatorA
For angles between 45 and 135 (m)
Calculate c (x-axis intersections)
If c is between the bounds, increase accumulatorB

Cartesian Examples

line1 line2


The previous algorithm taken from [1] was used to understand Hough Transform for Cartesian coordinate systems, however, this algorithm failed to classify some lines such as the following:

At first I thought that the problem could be my implementation of how to draw those lines given the accumulators, but after a deep study of the algorithm I realized what was the mistake. In addition to the author mistake, I added a small improvement that may help to understand HT in Cartesian coordinate systems.

First of all, in the author’s algorithm the accumulators are split depending on the degrees that are being examined. The first accumulator stores vertical intersections of the y-axis on angles between -45 and 45 whereas the second is responsible of the x-axis intersections on angles between 45 and 135.

If a nearly horizontal line is examined, the first accumulator (vertical intersections) will have a higher maxima than the second accumulator. This can be seen in the image above the algorithm: in the vertical accumulator a peak around the center will be created. In contrast, the horizontal accumulator will grow but very plain. Let us recall that in the accumulator there are represented the pixel where the line should start (the intersection with the axis) and the angle. Once this is completely understood, one can imagine many problematic scenarios such as the one depicted previously.

In the previous image, the line that should be generated must start in the x-axis. This means that the horizontal accumulator (the second) should have a higher peak than the vertical one. However, the angle corresponding to that line is between -45 and 45 degrees, so it is only examined by the first accumulator. Thus, the assumption that one accumulator should be in charge of a certain range of degrees while the second takes cares of the rest, is extremely naïve. The solution to this problem is by simply computing the range from -45 to 135 degrees in both accumulators. It is worth saying that the computation time is almost not affected at all, but we need a higher amount of memory.

The improvement to better understand the algorithm is more related with the later line drawing. As the “for” loops and pixel detection works, the image is examined from top to bottom and left to right. This makes our coordinates system move from the typical 0,0 starting in the bottom-left corner to the top-left corner. This shift arises problematic issues regarding the formulas, especially with the one used for the second accumulator (x-axis crossings detection). The formula used for detecting these crossings is:

b = \text{round(} x- \frac{y}{\tan{m* \pi / 180}} \text{)}

Where m represents the angle and x, y represent the coordinates. This formula may work when the 0,0 is in the bottom-left corner:


But if the coordinate epicenter is moved, it will not work anymore. For this reason, instead of computing y when b is calculated, I decided to compute yInv = rows - y. An alternative solution would be changing the formula.

Final algorithm:
Iterate over rows (y)
Iterate over columns(x)
If an edge is detected
For angles between -45 and 135 (m)
Calculate yInv (yInv = rows - y)
Calculate c (y-axis intersections)
If c is between the bounds, increase accumulatorA
Calculate c (x-axis intersections)
If c is between the bounds, increase accumulatorB

Both algorithms (the original and the improved one) are in the Source code section.


Polar coordinate system is an alternative to the Cartesian in which a radius and angle are needed to locate a single point, rather than X-Y coordinates. The maximum length of the radius can be obtain by the Pythagoras formula: \sqrt{2} N where N is the largest size (width or height).


In contrast with Cartesian, in the Polar algorithm we only need one accumulator. The first dimension of the matrix is the radius which is between 0 and \sqrt{2} N, and the second dimension is the angle (0-180). It will work in the same way as Cartesian: examining each point of the edge individually the accumulator will increase in those points which may generate the objective line. The final line will be drawn by finding the maximum value in the accumulator and using the radius and angle where it is located. It may seem a bit confusing how to calculate prospective points in the Polar coordinate system, so I tried to explain it using some drawings.

Imagine that we have the point 2,4. It is not difficult to calculate its angle and radius.


r = \sqrt{x^2 + y^2} = \sqrt{4^2 + 2^2} = 4.47 \\ \sin{\theta} = \frac{a}{c}; \theta = 26.5

However, for the same point, it looks a bit confusing when examining different angles. This is how it looks when we try to figure out the radius of the same point for a 45º angle.


And here is an example of a straight line: 4 purple points in a row. If we examine all the angles, we will realize that for the 90º angle they all reach the same point (3), so the accumulator will be maximum there.


Polar Algorithm

Initialize max value of radius \sqrt{2} N
Iterate over columns (x)
Iterate over rows (y)
If an edge is detected
For angles between 1 and 180 (m)
Calculate the radius (*)
If radius is between the bounds (0 and maximum), increase accumulator

Polar Examples

The same pictures as Cartesian Examples.


As in the Cartesian algorithm given in [1], in the Polar algorithm another mistake related to the one found in the Cartesian is found as well. This algorithm seems to work always except for one particular case: when one side of the line is bent pointing from top to bottom following a north-west to south-east direction.

pic2 pic1 pic3
Ok. Fail Ok.

Again, the error is to naïvely assume that the previous studied radius-angles were all the possibilities.

The error is illustrated in the picture below. Given that angle (around 150º) the corresponding radius is negative and therefore, it is discarded (radius must be between 0 and \sqrt{2} N). The reason why the radius is negative is very simple: the intersection is in the other side of the line to which it is suppose to intersect (the red line with the arrow). Given this scenario the solution could be a negative radius and 150º degrees or a positive radius but a negative degrees (150-180 = -30º). Both scenarios cannot be stored in the accumulator for obvious reasons. The real (and more expensive in terms of memory) solution would be to extent the accumulator from 1-180 to 1-360 degrees to cover all cases. In this case, this line would be detected in the angle -30+360 = 330º and the radius will be positive because of the orientation of the arrow.


Nonetheless, I tried to study how good or bad was assuming that we only needed to check 1-180 degrees. For this, I run the algorithm through a 150×150 black image to make it fire in every pixel and try each combination. After that, I checked which parts of the accumulator were 0, meaning this that the will never be modified. I did the same in the 1-360 case to see those cases that cannot be seen from the first 1-180º implementation. The black color represents those elements who are never increased whereas the white color represents a zone that can be modified. The width indicates each angle, so in the first case the picture has 180 columns and the second one has 360.

op180 op360

The conclusion drawn from these pictures is that it is possible to make a more efficient algorithm to iterate only over those cases that may be meaningful.

Space Reduction

The space reduction is a modification of the polar algorithm in which the accumulator matrix is reduced from \text{max},180 to two matrices: 180,1 \quad \text{and} \quad \text{max},1. The huge save is obvious, but again, the naïve approach to enclose the problem from 1 to 180 degrees will have the same consequences as in the previous section.

The algorithm presented in the book does not iterate over all angles. Instead, it checks whether a point is located in a certain neighborhood (a 5×5 window) and it calculates the angle for that point.

This approach is more statistical than the previous algorithms, so instead of the accuracy given by knowing the coordinates which correspond to the characteristics of the line, it decides the parameters of the line given the statistics from the accumulator.

The code given in [1], page 211, does not work for almost any line. To make it work, one needs to fix it as in the previous section: adding the 360 degrees.

Additional notes for further improvements

·The accuracy is extremely related with the counter value
·Instead of taking the max, you can take the 2 or 3 max values since more lines may be found.
·Instead of 2 or 3 max, you can establish a threshold
·It can also be possible to study only those lines in a certain region of the picture. For instance, if you want to make a lane recognizer you should focus on the half in the bottom.

The code is provided in the Source code section.


1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

[Image Processing] Thresholding and Subtracting

One of the simplest methods to extract features from a picture is by subtracting and thresholding. Imagine that from a static background we want to get any figure that appears in front. For instance, let us say that we want to measure the height of different people who will lean against a wall. At first, the background (the wall) is observed. Then, someone will appear, and later, by subtracting the new picture from the background we will be able to extract the person who appeared. After that, it will not be difficult to guess the height. The thresholding operation comes from the fact that we have to decide how strong is the change to consider that it indeed changed.

The most impressive advantage of this method is that it is really easy to implement and fast, similar to the temporal median which can actually be used to generate the background. The biggest drawback is that it is very sensitive to noise and luminosity changes as we will see.

I cropped a couple of pictures used in the temporal median to try this out.

Background + figure

And these are the results after applying thresholding and subtraction. The first picture has a lot of noise as you can see, but I think is very easy to remove in this case. I think this noise comes from that when these pictures were taken, there was a bit of wind and pixels are not exactly the same. In order to remove the noise, in the second attempt I firstly used a Gaussian filter. By the way, both pictures were converted first to gray scale to make it simple and faster since the background and foreground will be in a different color, although Gaussian took really long compared to the first one. In any case, in the original image my right arm is almost blended with the background. That is why it is not well detected.

Simple thresholding and subtracting
Using a Gaussian filter first

When subtracting you can be very imaginative and try different things depending on the background, foreground and the application:
a) Having single thresholds for each color channel
b) Having a global threshold made from the sum of the subtracting of each channel
c) Combining a) and b)
d) Using gray scale
e) Applying a template convolution (Gaussian or any other)
f) Check out the neighborhood
and so on.

I wanted to try out how to use the webcam to process and show the results “in real time”, although my computer does not compute very fast and I took a picture every 1 or 1.5 seconds, but it is still interesting. I also wrote a post about how to take pictures from the webcam with Matlab.

This is the algorithm of the code I made

Initialize webcam, max_times
Wait 2 seconds before taking a picture of the background (to let me hide)
Take a picture of the background
Wait 1 second
while max_times>=0
Take a picture
result = blackPicture (background color)
Iterate over each pixel (y,x)
If background(y,x,redChannel) – picture(y,x,redChannel) > threshold || background(y,x,greenChannel) – picture(y,x,greenChannel) > threshold || background(y,x,blueChannel) – picture(y,x,blueChannel) > threshold
result(x,y,channels) = newColor
End if

Show picture (or store it)
Decrease max_times


Some times we can see that the whole area is green. This is because the luminosity changes depending on where the camera is focusing, and the focus changes depending on where the camera detects something. As the luminosity of the whole picture changes, everything turns to green. Shadows also make the luminosity change.

An example of how luminosity changes depending on where the webcam is focused.

Background picture
Take a look to the luminosity of the walls

The code is provided in the Source code section.


1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

How to take pictures from the webcam with Matlab

In Image Processing, taking pictures as well as recording videos is an important task since it is the first step before processing the images. Thus, I consider interesting to write an entry about configuring the webcam in Matlab. I have to say that I was extremely astonished by how easy this was. I was expecting installing external drivers or add-ons in Matlab (actually that was my first and wrong step) but nothing like that is needed. You have to, at least, have installed the camera in your PC, which seems to be quite obvious if you want to use it. The only requirement is having the Image Acquisition Toolbox installed, which I had installed by default in my R2013a Matlab version.

At first, we will use imaqhwinfo to see the adaptors recognized by Matlab:

>> imaqhwinfo

ans =

    InstalledAdaptors: {'gentl'  'gige'  'matrox'  'winvideo'}
        MATLABVersion: '8.1 (R2013a)'
          ToolboxName: 'Image Acquisition Toolbox'
       ToolboxVersion: '4.5 (R2013a)'

“winvideo” seems to be the adaptor I want to use, so I will try to get more information about it:

>> devices = imaqhwinfo('winvideo')

devices =

       AdaptorDllName: [1x81 char]
    AdaptorDllVersion: '4.5 (R2013a)'
          AdaptorName: 'winvideo'
            DeviceIDs: {[1]  [2]}
           DeviceInfo: [1x2 struct]

We can see in “DeviceIDs” that I have two cameras connected: my laptop camera and a USB camera. If I want to get more information about each camera, I will print out DeviceInfo.

>> devices.DeviceInfo(1)

ans =

             DefaultFormat: 'YUY2_160x120'
       DeviceFileSupported: 0
                DeviceName: 'USB2.0 Camera'
                  DeviceID: 1
     VideoInputConstructor: 'videoinput('winvideo', 1)'
    VideoDeviceConstructor: 'imaq.VideoDevice('winvideo', 1)'
          SupportedFormats: {1x5 cell}

>> devices.DeviceInfo(2)

ans =

             DefaultFormat: 'MJPG_1280x1024'
       DeviceFileSupported: 0
                DeviceName: '1.3M HD WebCam'
                  DeviceID: 2
     VideoInputConstructor: 'videoinput('winvideo', 2)'
    VideoDeviceConstructor: 'imaq.VideoDevice('winvideo', 2)'
          SupportedFormats: {1x18 cell}

When we are using a camera in Matlab, we need to determine which format will be used. I decided that I want to use the in-built camera, so let’s check what formats does it support.

>> device2 = devices.DeviceInfo(2);
>> device2.SupportedFormats

ans =

  Columns 1 through 5

    'MJPG_1280x1024'    'MJPG_1280x720'    'MJPG_1280x800'    'MJPG_1280x960'    'MJPG_160x120'

  Columns 6 through 10

    'MJPG_176x144'    'MJPG_320x240'    'MJPG_352x288'    'MJPG_640x480'    'YUY2_1280x1024'

  Columns 11 through 15

    'YUY2_1280x720'    'YUY2_1280x800'    'YUY2_1280x960'    'YUY2_160x120'    'YUY2_176x144'

  Columns 16 through 18
    'YUY2_320x240'    'YUY2_352x288'    'YUY2_640x480'

My recommendation is to use MJPG rather than YUY2, and to use a small size like 640×480 or lower because the lower, the faster to process. In addition, if you want to test your camera in Matlab, you can also type imaqtool and the Simulator will pop up. Now we have to configure Matlab writing which camera and format will be used, frames per second captured, and the color space:

vid = videoinput('winvideo', 2, 'MJPG_640x480');
vid.FramesPerTrigger = 1;
vid.ReturnedColorspace = 'rgb';

When we want to capture a picture with the camera, we simple have to trigger:


To retrieve the capture picture(s):

picture = getdata(vid);

When using RGB, we would expect that the variable picture have 3 dimensions (height, width, color channel), but if FramesPerTrigger is more than 1, this variable will have 4 dimensions. The 4th dimension will be each frame, so if we want to save the first and last frame taken, we will simply do:

vid = videoinput('winvideo', 2, 'MJPG_640x480');
vid.FramesPerTrigger = 30;
vid.ReturnedColorspace = 'rgb';
frames = getdata(vid);

SVM algorithm improvements

In the previous post I talked about an SVM implementation in Matlab. I consider that post and implementation really interesting since it is not easy to find a simple SVM implementation. Instead, I found tons of files which may implement a very interesting algorithm but they are insanely difficult to examine in order to learn about how it works. This is the main reason why I put so much effort on this implementation I developed thanks to the algorithm [1].

First improvement

After I implemented that algorithm everything seemed to work:

Example 1:


Data points coordinates and target:
 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 6 & 7 & 1 \end{bmatrix}

Distance to the border from each point:
 dist = \begin{bmatrix} 5.0598 \\ 5.0597 \\ 5.0596 \end{bmatrix}

Very acceptable results, right? However, when I added more points I experienced an odd behavior.

Example 2:


 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 9 & 12 & 1 \\ 7 & 12 & 1 \\ 6 & 7 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 9.0079 \\ 7.3824 \\ 7.3824 \\ 5.6215 \\ 3.3705 \end{bmatrix}

It clearly fails at finding the optimum boundary, however, the funny thing here is that the distance between the second and third point and the boundary are the same. This means that the rest of the samples were ignored and the algorithm focused only on those two. Actually, \alpha_2 and \alpha_3 were the only nonzero values.

After debugging the code everything seemed to be right according to the algorithm I followed [1] but after many trials I saw what definitely brought me to discover the error. In my trial the first two elements belong to class -1 whereas the rest of them belong to the other one. As you can see in the following examples, when I changed the order of the elements in the second class I got that the boundary was different depending only on the first element of the second class, ergo, the third sample.

Example 3:


 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 7 & 12 & 1 \\ 9 & 12 & 1 \\ 6 & 7 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 8.8201 \\ 6.5192 \\ 6.5192 \\ 8.2065 \\ 2.9912 \end{bmatrix}

Example 4:


 data = \begin{bmatrix} -1 & -4 & -1 \\ -4 & 5 & -1 \\ 6 & 7 & 1 \\ 7 & 12 & 1 \\ 9 & 12 & 1 \end{bmatrix}

 dist = \begin{bmatrix} 5.0598 \\ 5.0597 \\ 5.0596 \\ 7.5894 \\ 9.4868 \end{bmatrix}

In this last trial we get the best solution because in this case the algorithm has to focus on the third sample which is the closest one to the other class. However, this is not always true, so I had the need of fixing it. The fix is very simple but was not easy to find (at least quickly).

When I was debugging the code, I realized that in the first loop (iterating over i) it never reached the samples 4th and 5th. The reason was easy to understand: after calculating the temporal boundary (even if it is not the best, that is why it is called “temporal”), there were no errors because the algorithm classified it correctly, so it never entered that loops which needed to pass the “if” which takes care of the tolerance. In other words, if there is no error, it does not try to fix it because it is able to classify it correctly (and this actually makes sense).

If the samples were not encountered on the i loop on purpose, then they should be encountered on the other loop. Surprisingly, the algorithm did not encountered any of them in the inner loop. After I checked that I wrote the code accordingly to the algorithm [1], I thought that there had to be a mistake in the algorithm itself. And the mistake was the “Continue to \text{next i}“. Because of that line, it ignored the rest of j‘s, so it should be “Continue to \text{next j}“.

Thus, the fix in the Matlab code was pretty simple: changing from “break” to “continue“. Break allows to stop iterating over the loop and therefore it continues in the outer loop whereas continue makes the current loop stop and start iterating over the next value in that loop.

Second improvement

After the first improvement was implemented, it seemed that it worked for many trials, but when I tried more complex examples, it failed again.


 data = \begin{bmatrix} -7 & -4 & -1 \\ -9 & -8 & -1 \\ 2 & 5 & -1 \\ -3 & -10 & -1 \\ 9 & 7 & 1 \\ 3 & 8 & 1 \\ 8 & 11 & 1 \\ 8 & 9 & 1 \end{bmatrix}

The original algorithm [1] uses the variable \text{num\_changed\_alphas} to see whether alpha changed. If no alphas are changed during the iterations for \text{max\_passes} times, the algorithm will stop. I think the idea of iterating various times over the main algorithm is correct, but the algorithm must focus on those samples that will help building the boundary. After I implemented the modification, the algorithm iterated less times than the original algorithm. Additionally, the original algorithm implementation seemed to fail in many cases whereas my implementation works.

When the algorithm iterates once, alphas are updated such that nonzero alphas correspond to the samples that will help building the boundary. In this example, after the first iteration, alpha values correspond to this:

 \alpha = \begin{bmatrix} 0 \\ 0 \\ 0.2 \\ 0 \\ 0 \\ 0.2 \\ 0 \\ 0 \end{bmatrix}

Therefore, in the next iteration it will update the samples to focus only in sample #3 and sample #6. After this implementation was done, all the trials I tried worked perfectly. This is the result of the same problem:



This is the algorithm [1] after both improvements:

Initialize \alpha_i = 0, \forall i, b = 0
Initialize \text{counter} = 0
\text{while} ((\exists x \in \alpha | x = 0 ) \text{ } \& \& \text{ } (\text{counter} < \text{max\_iter}))
Initialize input and \alpha
\text{for } i = 1, ... numSamples
Calculate E_i = f(x^{(i)}) - y^{(i)} using (2)
\text{if } ((y^{(i)} E_i < -tol \quad \& \& \quad \alpha_i < C) \| (y^{(i)} E_i > tol \quad \& \& \quad \alpha_i > 0))

\text{for } j = 1, ... numSamples \quad \& \quad j \neq i
Calculate E_j = f(x^{(j)}) - y^{(j)} using (2)
Save old \alpha‘s: \alpha_i^{(old)} = \alpha_i, \alpha_j^{(old)} = \alpha_j
Compute L and H by (10) and (11)
\text{if } (L == H)
Continue to \text{next j}
Compute \eta by (14)
\text{if } (\eta \geq 0)
Continue to \text{next j}
Compute and clip new value for \alpha_j using (12) and (15)
\text{if } (| \alpha_j - \alpha_j^{(old)} < 10^{-5}) (*A*)
Continue to \text{next j}
Determine value for \alpha_i using (16)
Compute b_1 and b_2 using (17) and (18) respectively
Compute b by (19)
\text{end for}
\text{end if}
\text{end for}
\text{counter } = \text{ counter}++
\text{data } = \text{ useful\_data} (*B*)
\text{end while}

Algorithm Legend

(*A*): If the difference between the new \alpha and \alpha^{(old)} is negligible, it makes no sense to update the rest of variables.
(*B*): Useful data are those samples whose \alpha had a nonzero value during the previous algorithm iteration.
(2): f(x) = \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)},x \rangle +b
(10): \text{If } y^{(i)} \neq y^{(j)}, \quad L = \text{max }(0, \alpha_j - \alpha_i), H = \text{min } (C,C+ \alpha_j - \alpha_i)
(11): \text{If } y^{(i)} = y^{(j)}, \quad L = \text{max }(0, \alpha_i + \alpha_j - C), H = \text{min } (C,C+ \alpha_i + \alpha_j)
(12):  \alpha_ := \alpha_j - \frac{y^{(j)}(E_i - E_j)}{ \eta }
(14): \eta = 2 \langle x^{(i)},x^{(j)} \rangle - \langle x^{(i)},x^{(i)} \rangle - \langle x^{(j)},x^{(j)} \rangle
(15): \alpha_j := \begin{cases} H \quad \text{if } \alpha_j > H \\ \alpha_j \quad \text{if } L \leq \alpha_j \leq H \\ L \quad \text{if } \alpha_j < L \end{cases}
(16): \alpha_i := \alpha_i + y^{(i)} y^{(j)} (\alpha_j^{(old)} - \alpha_j)
(17): b_1 = b - E_i - y^{(i)} (\alpha_i^{(old)} - \alpha_i) \langle x^{(i)},x^{(i)} \rangle - y^{(j)} (\alpha_j^{(old)} - \alpha_j) \langle x^{(i)},x^{(j)} \rangle
(18): b_2 = b - E_j - y^{(i)} (\alpha_i^{(old)} - \alpha_i) \langle x^{(i)},x^{(j)} \rangle - y^{(j)} (\alpha_j^{(old)} - \alpha_j) \langle x^{(j)},x^{(j)} \rangle
(19): \alpha_j := \begin{cases} b_1 \quad \quad \text{if } 0 < \alpha_i < C \\ b_2 \quad \quad \text{if } 0 < \alpha_j < C  \\ (b_1 + b_2)/2 \quad \text{otherwise} \end{cases}

The code is provided in the Source code section.


1. The Simplified SMO Algorithm http://cs229.stanford.edu/materials/smo.pdf

[SVM Matlab code implementation] SMO (Sequential Minimal Optimization) and Quadratic Programming explained

This post is the second and last part of a double entry about how SVMs work (theoretical, in practice, and implemented). You can check the first part, SVM – Support Vector Machine explained with examples.



Since the algorithm uses Lagrange to optimize a function subject to certain constrains, we need a computer-oriented algorithm to implement SVMs. This algorithm called Sequential Minimal Optimization (SMO from now on) was developed by John Platt in 1998. Since it is based on Quadratic Programming (QP from now on) I decided to learn and write about it. I think it is not strictly necessary to read it, but if you want to fully understand SMO, it is recommended to understand QP, which is very easy given the example I will briefly describe.

Optimization and Quadratic Programming

QP is a special type of mathematical optimization problem. It describes the problem of optimizing (min or max) a quadratic function of several variables subject to linear constraints on these variables. Convex functions are used for optimization since they only have one optima which is the global optima.
Optimality conditions:

\Delta f(x^*) = 0 \quad \text{(gradient is zero, derivative)} \\ \Delta^2 f(x^*) \geq 0 \quad \text{(positive semi definite)}

If it is a positive definite, we have a unique global minimizer. If it is a positive indefinite (or negative), then we do not have a unique minimizer.


Minimize x_1^2+x_2^2-8x_1-6x_2
\text{subject to } \quad -x_1 \leq 0 \\ \text{ } \quad \quad \quad \quad \quad -x_2 \leq 0 \\ \text{ } \quad \quad \quad \quad \quad x_1 + x_2 \leq 5

Initial point (where we start iterating) x_{(0)} = \begin{bmatrix}        0 \\        0      \end{bmatrix}
Initial active set s^{(0)} = \left\{ 1,2 \right\} (we will only take into account these constraints).

Since a quadratic function looks like: f(x) = \frac{1}{2}x^T Px + q^T x

We have to configure the variables as:
 P = \begin{bmatrix}        2 & 0 \\        0 & 2      \end{bmatrix}  q = \begin{bmatrix}        -8 \\        -6      \end{bmatrix} \text{(minimize)} \\  A_0 = \begin{bmatrix}        -1 & 0 \\        0 & -1 \\        1 & 1      \end{bmatrix}  b_0 = \begin{bmatrix}        0 \\        0 \\        5      \end{bmatrix} \text{(constraints)}


This picture represents how the problem looks like. The function we want to minimize is the red circumference whereas the green lines represent the constraints. Finally, the blue mark shows where the center and the minima are located.

Iteration 1
s^{(0)} = \left\{ 1,2 \right\} \quad \quad x_{(0)} = \begin{bmatrix}        0 \\        0      \end{bmatrix}

Solve EQP defined by s^{(0)}, so we have to deal with the first two constraints. KKT method is used since it calculates the Lagrangian multipliers.

KKT = \begin{bmatrix}        P & A^T \\        A & 0      \end{bmatrix}  \quad \quad  \begin{bmatrix}        P & A^T \\        A & 0      \end{bmatrix} \begin{bmatrix}        x^* \\        v^*      \end{bmatrix} = \begin{bmatrix}        -q \\        b      \end{bmatrix} \\  \begin{bmatrix}        2 & 0 & -1 & 0 \\        0 & 2 & 0 & -1 \\        -1 & 0 & 0 & 0 \\        0 & -1 & 0 & 0      \end{bmatrix} \cdot \begin{bmatrix}        x_1^* \\        x_2^* \\        v_1^* \\        v_2^*      \end{bmatrix} = \begin{bmatrix}        0 \\        0 \\        -8 \\        -6      \end{bmatrix} \to \text{Solution: } x_{EQP}^* = \begin{bmatrix}        0 \\        0      \end{bmatrix} v_{EQP}^* = \begin{bmatrix}        -8 \\        -6      \end{bmatrix}
x_{EQP}^* indicates the next point the algorithm will iterate.
Now it is necessary to check whether this solution is feasible: A_0 \cdot x_{EQP}^* \leq b_0

 \begin{bmatrix}        -1 & 0\\        0 & -1 \\        1 & 1      \end{bmatrix} \cdot  \begin{bmatrix}        0\\        0      \end{bmatrix} \leq \begin{bmatrix}        0\\        0 \\        5      \end{bmatrix} \to \begin{bmatrix}        1\\        1 \\        1      \end{bmatrix}

It is clear that this point was going to be feasible because the point (0,0) is where we started.
Now, we remove a constraint with negative v_{EQP}^* and move to the next value x_{EQP}^*.

S^{(1)} = S^{(0)} - \left\{\ 1 \right\} = \left\{ 2 \right\} \quad \text{constraint 1 is removed} \\ X^{(1)} = x_{EQP}^* = \begin{bmatrix}        0\\        0      \end{bmatrix}

Graphically, we started on (0,0) and we have to move to the same point (0,0). The removal of the constraint indicates to which axis and direction we are going to move: up or right. Since we took the first constraint out, we will move to the right. Now we need to minimize constrained to the second constraint.

Iteration 2
s^{(1)} = \left\{ 2 \right\} \quad \quad x_{(1)} = \begin{bmatrix}        0 \\        0      \end{bmatrix}

Solve EQP defined with A = \begin{bmatrix}        0 & -1      \end{bmatrix} \quad b = [0]

 K = \begin{bmatrix}        2 & 0 & 0 \\        0 & 2 & -1 \\        0 & -1 & 0      \end{bmatrix} \cdot \begin{bmatrix}        x_1^* \\        x_2^* \\        v_1^*      \end{bmatrix} = \begin{bmatrix}        8 \\        6 \\        0      \end{bmatrix} \to \text{Solution: }  x_{EQP}^* = \begin{bmatrix}        4 \\        0      \end{bmatrix} \quad v_{EQP}^* = \begin{bmatrix}        -6      \end{bmatrix}

Check whether this is feasible: A_0 \cdot x_{EQP}^* \leq b_0 ?

 \begin{bmatrix}        -1 & 0\\        0 & -1 \\        1 & 1      \end{bmatrix} \cdot  \begin{bmatrix}        4 \\        0      \end{bmatrix} = \begin{bmatrix}        -4 \\        0 \\        4      \end{bmatrix} \leq \begin{bmatrix}        0\\        0 \\        5      \end{bmatrix} \to \begin{bmatrix}        1\\        1 \\        1      \end{bmatrix} \quad \text{It is feasible}

S^{(2)} = S^{(1)} - \left\{\ 2 \right\} = \varnothing \quad \quad \text{constraint 2 is removed} \\ X^{(2)} = \begin{bmatrix}        4\\        0      \end{bmatrix}

Iteration 3
s^{(2)} = \varnothing \quad \quad x_{(2)} = \begin{bmatrix}        4 \\        0      \end{bmatrix}

Solve EQP defined with A = [] \quad b = []

 K = \begin{bmatrix}        2 & 0 \\        0 & 2      \end{bmatrix} \cdot \begin{bmatrix}        x_1^* \\        x_2^*      \end{bmatrix} = \begin{bmatrix}        8 \\        6      \end{bmatrix} \to \text{Solution: } x_{EQP}^* = \begin{bmatrix}        4 \\        3      \end{bmatrix}

Check whether this is feasible: A_0 * x_{EQP}^* \leq b_0 = \begin{bmatrix}        1\\        1 \\        0      \end{bmatrix}

It is not feasible. 3rd constrained is violated (that is why the third element is 0). As it is not feasible, we have to maximize \text{t} \quad \text{s.t.} \quad x^{(2)}+t(x_{EQP}^* - x^{(2)}) \in \text{Feasible}

A_0(\begin{bmatrix}        4\\        0      \end{bmatrix}+t(\begin{bmatrix}        4\\        3      \end{bmatrix}-\begin{bmatrix}        4\\        0      \end{bmatrix})) \leq b_0 \\ \vspace{2em}
 \begin{bmatrix}        -1 & 0 \\        0 & -1 \\        1 & 1      \end{bmatrix} \cdot \begin{bmatrix}        4 \\        3t      \end{bmatrix} \leq \begin{bmatrix}        0 \\        0 \\        5      \end{bmatrix}  \\ -4 \leq 0 \quad \text{This is correct} \\ -3t \leq 0 \quad \text{This is correct, since} \quad 0 \leq t \leq 1 \\ 4 -3t \leq 5 \quad \to \quad t = \frac{1}{3} \quad \text{(the largest)}

So, x^{(3)} = x^{(2)} + \frac{1}{3} (x_{EQP}^* - x^{(2)}) = \begin{bmatrix}        4 \\        1      \end{bmatrix}

Finally we add the constraint that was violated.

S^{(2)} = S^{(1)} + \left\{\ 3 \right\} = \left\{\ 3 \right\}

Iteration 4
s^{(3)} = \left\{ 3 \right\} \quad \quad x_{(3)} = \begin{bmatrix}        4 \\        1      \end{bmatrix}

Solve EQP defined with A = \begin{bmatrix}        1 & 1      \end{bmatrix} \quad b = [5]

 K = \begin{bmatrix}        2 & 0 & 1 \\        0 & 2 & 1 \\        1 & 1 & 0      \end{bmatrix} \cdot \begin{bmatrix}        x_1^* \\        x_2^* \\        v_1^*      \end{bmatrix} = \begin{bmatrix}        8 \\        6 \\        5      \end{bmatrix} \to \text{Solution: } x_{EQP}^* = \begin{bmatrix}        3 \\        2      \end{bmatrix} \quad v_{EQP}^* = \begin{bmatrix}        3      \end{bmatrix}

Check whether this is feasible: A_0 \cdot x_{EQP}^* \leq b_0 = \begin{bmatrix}        1\\        1 \\        1      \end{bmatrix} \quad \text{It is feasible}

v^* \geq 0 \quad \to \quad \text{We found the optimal}

Iterations and steps are drawn on the plane below. It is graphically easy to see that the 4th step was illegal since it was out of the boundaries.


Sequential Minimal Optimization (SMO) algorithm

SMO is an algorithm for solving the QP problem that arises during the SVM training. The algorithm works as follows: it finds a Lagrange multiplier \alpha_i that violates the KKT conditions for the optimization problem. It picks a second multiplier \alpha_j and it optimizes the pair (\alpha_i,\alpha_j), and repeat this until convergence. The algorithm iterates over all (\alpha_i,\alpha_j) pairs twice to make it easier to understand, but it can be improved when choosing \alpha_i and \alpha_j.

Because \sum_{i=0} y_i \alpha_i = 0 we have that for a pair of (\alpha_1,\alpha_2): y_1 \alpha_1 + y_2 \alpha_2 = y_1 \alpha_1^{old + y_2 \alpha_2^{old}}. This confines optimization to be as the following lines show:


Let s = y_1 y_2 (assuming that y_i \in \left\{ -1,1 \right\})
y_1 \alpha_1 + y_2 \alpha_2 = \text{constant}  = \alpha_1 + s \alpha_2 \quad \to \quad \alpha_1 = \gamma - s \alpha_2 \\ \gamma \equiv \alpha_1 + s \alpha_2 = \alpha_1^{old} + s \alpha_2^{old} = \text{constant}

We want to optimize:
L = \frac{1}{2} \| w \| ^2 - \sum \alpha_i [y_i (\overrightarrow{w} \cdot \overrightarrow{x_i} +b)-1] = \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \overrightarrow{x_i} \cdot \overrightarrow{x_j}

If we pull out \alpha_1 and \alpha_2 we have:
L = \alpha_1 + \alpha_2 + \text{const.} - \frac{1}{2}(y_1 y_1 \overrightarrow{x_1}^T \overrightarrow{x_1} \alpha_1^2 + y_2 y_2 \overrightarrow{x_2}^T \overrightarrow{x_2} \alpha_2^2 + 2 y_1 y_2 \overrightarrow{x_1}^T \overrightarrow{x_2} \alpha_1 \alpha_2 + 2 (\sum_{i=3}^N \alpha_i y_i \overrightarrow{x_i})(y_1 \overrightarrow{x_1} \alpha_1 + y_2 \overrightarrow{x_2} \alpha_2) + \text{const.} )

Let K_{11} = \overrightarrow{x_1}^T\overrightarrow{x_1}, K_{22} = \overrightarrow{x_2}^T\overrightarrow{x_2}, K_{12} = \overrightarrow{x_1}^T\overrightarrow{x_2} \quad \text{(kernel)} and:

v_j = \sum_{i=3}^N \alpha_i y_i \overrightarrow{x_i}^T\overrightarrow{x_j} = \overrightarrow{x_j}^T\overrightarrow{w}^{old} - \alpha_1^{old} y_1 \overrightarrow{x_1}^T \overrightarrow{x_j} - \alpha_2^{old} y_2 \overrightarrow{x_2}^T \overrightarrow{x_j}

It represents the original formula \overrightarrow{x_j}^T\overrightarrow{w}^{old} without the first and second \alpha

... = \overrightarrow{x_j}^T\overrightarrow{w}^{old} -b^{old} +b^{old} - \alpha_1^{old} y_1 \overrightarrow{x_1}^T \overrightarrow{x_j} - \alpha_2^{old} y_2 \overrightarrow{x_2}^T \overrightarrow{x_j}

Let u_j^{old} = \overrightarrow{x_j}^T\overrightarrow{w}^{old} -b^{old}. This means that u_j^{old} is the output of \overrightarrow{x_j} under old parameters.

u_j^{old} + b^{old} - \alpha_1^{old} y_1 \overrightarrow{x_1}^T \overrightarrow{x_j} - \alpha_2^{old} y_2 \overrightarrow{x_2}^T \overrightarrow{x_j}

Now we substitute in our original formula with \alpha_1 and \alpha_2 pulled out using new variables: s,\gamma,K_{11},K_{22},K_{12},v_j.

L = \alpha_1 + \alpha_2 - \frac{1}{2}(K_{11} \alpha_1^2 + K_{22} \alpha_2 + 2 s K_{12} \alpha_1 \alpha_2 + 2 y_1 v_1 \alpha_1 + 2 y_2 v_2 \alpha_2) + \text{const.}
Note that here we assume that y_i \in \left\{ -1,1 \right\} \quad \text{since} \quad y_1^2 = y_2^2 = 1

Now we substitute using \alpha_1 = \gamma - s \alpha_2

L = \gamma - s \alpha_2 + \alpha_2 - \frac{1}{2} (K_{11}(\gamma - s \alpha_2)^2 + K_{22} \alpha_2^2 + 2 s K_{12}(\gamma - s \alpha_2) \alpha_2 + 2 y_1 v_1 (\gamma - s \alpha_2) +2 y_2 v_2 \alpha_2 ) + \text{const.}

The first \gamma is a constant so it will be added to the \text{const.} value.

 L = (1 -s) \alpha_2 - \frac{1}{2} K_{11} (\gamma - s \alpha_2)^2 - \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} (\gamma -s \alpha_2) \alpha_2 - y_1 v_1 (\gamma - s \alpha_2) - y_2 v_2 \alpha_2 + \text{const.} \\  = (1 -s) \alpha_2 - \frac{1}{2} K_{11} \gamma^2 + s K_{11} \gamma \alpha_2 - \frac{1}{2} K_{11} s^2 \alpha_2^2 -  \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} \gamma \alpha_2 + s^2 K_{12} \alpha_2^2 - y_v v_1 \gamma + s y_1 v_1 \alpha_2 - y_2 v_2 \alpha_2 + \text{const.}

Constant terms are grouped and y_2 = \frac{s}{y_1} is applied:

(1 -s) \alpha_2 + s K_{11} \gamma \alpha_2 - \frac{1}{2} K_{11} \alpha_2^2 - \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} \gamma \alpha_2 + K_{12} \alpha_2^2 + y_2 v_1 \alpha_2 - y_2 v_2 \alpha_2 + \text{const.} \\ = (- \frac{1}{2} K_{11} - \frac{1}{2} K_{22} + K_{12}) \alpha_2^2 + (1-s+s K_{11} \gamma - s K_{12} + y_2 v_1 - y_2 v_2) \alpha_2 + \text{const.} \\ = \frac{1}{2}(2 K_{12} - K_{11} - K_{22}) \alpha_2^2 + (1 -s +s K_{11} \gamma - s K_{12} \gamma + y_2 v_1 - y_2 v_2) \alpha_2 + \text{const.}

Let \eta \equiv 2 K_{12} - K_{11} - K_{12}. Now, the formula is reduced to \frac{1}{2} \eta \alpha_2^2 + (\dots)\alpha_2 + \text{const.}. Let us focus on the second part (the coefficient “\dots“). Remember that \gamma = \alpha_1^{old} + s \alpha_2^{old}

1 -s +s K_{11} \gamma - s K_{12} \gamma + y_2 v_1 - y_2 v_2 \\ = 1 - s + s K_{11}(\alpha_1^{old}+s \alpha_2^{old}) - s K_{12} (\alpha_1^{old}+s \alpha_2^{old}) \\ \hspace{6em} \text{ } + y_2 (u_1^{old}+b^{old} - \alpha_1^{old} y_1 K_{11} - \alpha_2^{old} y_2 K_{12}) \\ \hspace{6em} \text{ } - y_2 (u_2^{old}+b^{old} - \alpha_1^{old} y_1 K_{12} - \alpha_2^{old} y_2 K_{22}) \\ = 1 - s + s K_{11}\alpha_1^{old}+ K_{11} \alpha_2^{old} - s K_{12} \alpha_1^{old} - K_{12} \alpha_2^{old} \\ \hspace{6em} \text{ } + y_2 u_1^{old} + y_2 b^{old} - s \alpha_1^{old} K_{11} - \alpha_2^{old} K_{12} \\ \hspace{6em} \text{ } - y_2 u_2^{old} - y_2 b^{old} + s \alpha_1^{old} K_{12} + \alpha_2^{old} K_{22} \\ = 1 -s +(s K_{11} - s K_{12} - s K_{11} + s K_{12}) \alpha_1^{old} \\ \hspace{6em} \text{ } + (K_{11} - 2 K_{12} + K_{22}) \alpha_2^{old} + y_2 (u_1^{old} - u_2^{old})

Since y_2^2 = 1: (mathematical convenience)

y_2^2 - y_1 y_2 + (K_{11} - 2 K_{12} + K_{11} + s K_{12}) \alpha_2^{old} + y_2(u_1^{old} - u_2^{old}) \\ = y_2 (y_2 - y_1 + u_1^{old} - u_2^{old}) - \eta \alpha_2^{old} \\ = y_2 ((u_1^{old} - y_1) - (u_2^{old} - y_2)) - \eta \alpha_2^{old} \\ = y_2 (E_1^{old} - E_2^{old}) - \eta \alpha_2^{old}

Therefore, the objective function is:
\frac{1}{2} \eta \alpha_2^2 + (y_2 (E_1^{old} - E_2^{old}) - \eta \alpha_2^{old}) + \text{const.}

First derivative: \frac{\partial L}{\partial \alpha_2} = \eta \alpha_2 + (y_2 (E_1^{old} - E_2^{old}) - \eta \alpha_2^{old})

Second derivative: \frac{\partial^2 L}{\partial \alpha_2} = \eta

Note that \eta = 2 K_{12} - K_{11} - K_{22} \leq 0. Proof: Let K_{11} = \overrightarrow{x_1}^T \overrightarrow{x_1}, K_{12} = \overrightarrow{x_1}^T \overrightarrow{x_2}, K_{22} = \overrightarrow{x_2}^T \overrightarrow{x_2}. Then \eta = - (\overrightarrow{x_2} - \overrightarrow{x_1})^T (\overrightarrow{x_2} - \overrightarrow{x_1}) = -\|\overrightarrow{x_2} - \overrightarrow{x_1} \| \leq 0
This is important to understand because when we want to use other kernels this has to be true.

\text{Let} \quad \frac{\partial L}{\partial \alpha_2} = 0, \text{then we have that} \\ \text{ } \quad \alpha_2^{new} = \frac{- y_2 (E_1^{old} - E_2^{old}) - \eta \alpha_2^{old}}{\eta}

Therefore, this is the formula to get a maximum:

\alpha_2^{new} = \alpha_2^{old} + \frac{y_2 (E_2^{old} - E_1^{old})}{\eta}

While performing the algorithm, after calculating \alpha_j (or \alpha_2) it is necessary to clip its value since it is constrained by 0 \leq \alpha_i \leq C. This constraint comes from the KKT algorithm.

So we have that: \text{ } \quad s=y_1 y_2 \quad \gamma = \alpha_1^{old} + s \alpha_2^{old}
If s=1, then \alpha_1 + \alpha_2 = \gamma
If \gamma > C, then \text{max} \quad \alpha_2 = C, \text{min} \quad \alpha_2 = \gamma - C
If \gamma < C, then \text{min} \quad \alpha_2 = 0, \text{max} \quad \alpha_2 = \gamma
If s=-1, then \alpha_1 - \alpha_2 = \gamma
If \gamma > 0, then \text{min} \quad \alpha_2 = 0, \text{max} \quad \alpha_2 = C - \gamma
If \gamma < 0, then \text{min} \quad \alpha_2 = - \gamma, \text{max} \quad \alpha_2 = C

In other words: (L = lower bound, H = higher bound)
 \text{If}  \quad y^{(i)} \neq y^{(j)}, L = max(0,\alpha_j - \alpha_i), H = min(C, C+ \alpha_j - \alpha_i) \\ \text{If}  \quad y^{(i)} = y^{(j)}, L = max(0,\alpha_i + \alpha_j - C), H = min(C, \alpha_i + \alpha_j)

Why do we need to clip our \alpha_2 value?
We can understand it using a very simple example.
Remember that if y_1 = y_2, then s = 1 \quad \text{and} \quad \gamma = \alpha_1^{old} + s \alpha_2^{old} remains always constant. If y_2 \neq y_2, then \alpha_1 = \alpha_2 so they can change as much as they want if they remain equal. In the first case, the numbers are balanced, so if \alpha_1 = 0.2 \quad \text{and} \quad \alpha_2 = 0.5, as they sum 0.7, they can change maintaining that constraint true, so they may end up being \alpha_1 = 0.4, \alpha_2 = 0.3.

If \alpha_1 = 0.3, \alpha_2 = 0.4, C = 1:

1) As explained before, and taking into account that \Delta \alpha_1 = -s \Delta \alpha_2, \alpha_2 can only increase 0.3 because \alpha_1 can only decrease 0.3 keeping this true: 0 \leq \alpha_1 \leq C. Likewise, \alpha_2 can only decrease 0.4 because \alpha_1 can only increase that amount.

2) If they are different, \alpha_2 can grow up to 1 because similarly, \alpha_1 will grow up to 1 (which is the value of C and hence, the upper boundary). \alpha_1 can only decrease till 0.1 because if it decreases till 0, \alpha_1 would be -0.1 and the constraint would be violated.

Remember that the reason why \alpha_1 has to be 0 \leq \alpha_1 \leq C is because of the KKT algorithm.


\alpha_i \alpha_j s L H
0 1 -1 1 1
If L == M, it means that we cannot balance or proportionally increase/decrease at all between \alpha_i and \alpha_j. In this case, we skip this \alpha_i - \alpha_j combination and try a new one.

After updating \alpha_i and \alpha_j we still need to update b.
\sum (x,y) = \sum_{i=1}^N \alpha_i y_i \overrightarrow{x_i}^T\overrightarrow{x} -b -y (the increment of a variable is given by the increment of all variables which are part of it).

\Delta \sum (x,y) = \Delta \alpha_1 y_1 \overrightarrow{x_1}^T\overrightarrow{x} + \Delta \alpha_2 y_2 \overrightarrow{x_2}^T\overrightarrow{x} - \Delta b

The change in the threshold can be computed by forcing E_1^{new} = 0 \text{ if } 0 < \alpha_1^{new} < C \text{(or } E_2^{new} = 0 \text{ if } 0 < \alpha_2^{new} < C \text{)}

E (x,y)^{new} = 0 \\ E(x,y)^{old} + \Delta E (x,y) = E (x,y)^{old} + \Delta \alpha_1 y_1 \overrightarrow{x_1}^T \overrightarrow{x} + \Delta \alpha_2 y_2 \overrightarrow{x_2}^T \overrightarrow{x} - \Delta b

So we have:

\Delta b = E(x,y)^{old} + \Delta \alpha_1 y_1 \overrightarrow{x_1}^T \overrightarrow{x} + \Delta \alpha_2 y_2 \overrightarrow{x_2}^T \overrightarrow{x}


The code I provide in the Source code section was developed by me, but I followed the algorithm shown in [1]. I will copy the algorithm here since it made my life really easier and avoided me many headaches for sure. All the credit definitely goes to the writer.

Algorithm: Simplified SMO
Note: if you check [1] you will see that this algorithm differs from the original one written on that paper. The reason is because the original one had mistakes that I wanted to fix and improve. I talk about that in this post.

C: regularization parameter
tol: numerical tolerance
max_passes: max # of times to iterate over \alpha‘s without changing
(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)}): training data

\alpha \in \mathbb{R}^m: Lagrange for multipliers for solution
b \in \mathbb{R}: threshold for solution

Initialize \alpha_i = 0, \forall i, b = 0
Initialize \text{counter} = 0
\text{while} ((\exists x \in \alpha | x = 0 ) \text{ } \& \& \text{ } (\text{counter} < \text{max\_iter}))
Initialize input and alpha
\text{for } i = 1, ... numSamples
Calculate E_i = f(x^{(i)}) - y^{(i)} using (2)
\text{if } ((y^{(i)} E_i < -tol \quad \& \& \quad \alpha_i < C) \| (y^{(i)} E_i > tol \quad \& \& \quad \alpha_i > 0))

\text{for } j = 1, ... numSamples \quad \& \quad j \neq i
Calculate E_j = f(x^{(j)}) - y^{(j)} using (2)
Save old \alpha‘s: \alpha_i^{(old)} = \alpha_i, \alpha_j^{(old)} = \alpha_j
Compute L and H by (10) and (11)
\text{if } (L == H)
Continue to \text{next j}
Compute \eta by (14)
\text{if } (\eta \geq 0)
Continue to \text{next j}
Compute and clip new value for \alpha_j using (12) and (15)
\text{if } (| \alpha_j - \alpha_j^{(old)} < 10^{-5}) (*A*)
Continue to \text{next j}
Determine value for \alpha_i using (16)
Compute b_1 and b_2 using (17) and (18) respectively
Compute b by (19)
\text{end for}
\text{end if}
\text{end for}
\text{counter } = \text{ counter}++
\text{data } = \text{ useful\_data} (*B*)
\text{end while}

Algorithm Legend

(*A*): If the difference between the new \alpha and \alpha^{(old)} is negligible, it makes no sense to update the rest of variables.
(*B*): Useful data are those samples whose \alpha had a nonzero value during the previous algorithm iteration.
(2): f(x) = \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)},x \rangle +b
(10): \text{If } y^{(i)} \neq y^{(j)}, \quad L = \text{max }(0, \alpha_j - \alpha_i), H = \text{min } (C,C+ \alpha_j - \alpha_i)
(11): \text{If } y^{(i)} = y^{(j)}, \quad L = \text{max }(0, \alpha_i + \alpha_j - C), H = \text{min } (C,C+ \alpha_i + \alpha_j)
(12):  \alpha_ := \alpha_j - \frac{y^{(j)}(E_i - E_j)}{ \eta }
(14): \eta = 2 \langle x^{(i)},x^{(j)} \rangle - \langle x^{(i)},x^{(i)} \rangle - \langle x^{(j)},x^{(j)} \rangle
(15): \alpha_j := \begin{cases} H \quad \text{if } \alpha_j > H \\ \alpha_j \quad \text{if } L \leq \alpha_j \leq H \\ L \quad \text{if } \alpha_j < L \end{cases}
(16): \alpha_i := \alpha_i + y^{(i)} y^{(j)} (\alpha_j^{(old)} - \alpha_j)
(17): b_1 = b - E_i - y^{(i)} (\alpha_i^{(old)} - \alpha_i) \langle x^{(i)},x^{(i)} \rangle - y^{(j)} (\alpha_j^{(old)} - \alpha_j) \langle x^{(i)},x^{(j)} \rangle
(18): b_2 = b - E_j - y^{(i)} (\alpha_i^{(old)} - \alpha_i) \langle x^{(i)},x^{(j)} \rangle - y^{(j)} (\alpha_j^{(old)} - \alpha_j) \langle x^{(j)},x^{(j)} \rangle
(19): \alpha_j := \begin{cases} b_1 \quad \quad \text{if } 0 < \alpha_i < C \\ b_2 \quad \quad \text{if } 0 < \alpha_j < C  \\ (b_1 + b_2)/2 \quad \text{otherwise} \end{cases}

Source Code Legend

(*1*): This error function arises when we try to see the difference between our output and the target: E_i = f(x_i) - y_i = \overrightarrow{w}^T \overrightarrow{x_i} and as \overrightarrow{w} = \sum_i^N \alpha_i y_i x_i we get that E_i = \sum_i^N \alpha_i y_i \overrightarrow{x} \cdot \overrightarrow{x_i} + b - y_i
(*2*): This line has 4 parts: if ((a && b) || (c && d)). A and C parts control that you will not enter if the error is lower than the tolerance. And honestly, I don’t get parts B and D. Actually I tried to run this code with several problems and the result is always the same with and without those parts. I know that alpha is constrained to be within that range, but I don’t see the relationship between A and B parts, and C and D parts separately, because that double constraint (be greater than 0 and lower than C) should be always checked.


Given a 3D space we have two samples from each class. The last column indicates the class:

data = \begin{bmatrix}        0 & 0 & 3 & -1 \\        0 & 3 & 3 & -1 \\        3 & 0 & 0 & 1 \\        3 & 3 & 0 & 1      \end{bmatrix}

W = [0.3333 0 -0.3333] \to f(x,y,z) = x-z


The code is provided in the Source code section.


1. The Simplified SMO Algorithm http://cs229.stanford.edu/materials/smo.pdf
2. Sequential Minimal Optimization for SVM http://www.cs.iastate.edu/~honavar/smo-svm.pdf
3. Inequality-constrained Quadratic Programming – Example https://www.youtube.com/watch?v=e6jDGxNZ-kk

[Supervised Learning] SVM – Support Vector Machine explained with examples

This post is the first part of a double post about SVMs. This part tries to help to clarify how SVMs work in theory (with 2 full developed examples). The second part (not published yet) will explain the algorithm to solve this problem using a computer: Quadratic Programming and SMO.


Basic definition

Support vector machines (SVMs from now on) are supervised learning models used for classification and regression. The basic idea of SVMs in binary classification is to find the farthest boundary from each class.


Therefore, solving a basic mathematical function given the coordinates (features) of each sample will tell whether the sample belongs to one region (class) or other. Input features determine the dimension of the problem. To keep it simple, the explanation will include examples in 2 dimensions.

Mathematical explanation


The vector \overrightarrow{w} is a perpendicular vector to the boundary, but since boundary’s coefficients are unknown, \overrightarrow{w} vector’s coefficients are unknown as well. What we want to do is to calculate the boundary’s coefficients with respect to \overrightarrow{u} because we have its coordinates (sample’s coordinates).


If both vectors are multiplied, the result will be the purple vector.
\overrightarrow{w} \cdot \overrightarrow{u} \geq c \quad \text{or generally} \quad \overrightarrow{w} \cdot \overrightarrow{u} + b \geq 0

Let us say that:
\overrightarrow{w} \cdot \overrightarrow{x_+} + b \geq 1 \quad \text{and} \quad \overrightarrow{w} \cdot \overrightarrow{x_-} + b \leq -1
Where \overrightarrow{x_+} is a positive sample (class A) and \overrightarrow{x_-} is a negative sample (class B).
A new variable is now introduced:
y_i \quad \text{such that } y_i = +1 \quad \text{for + samples} \\ \text{ } \hspace{70pt} y_i = -1 \quad \text{for - samples}

Multiply each of them by the previous equations:
y_i(\overrightarrow{x_i} \cdot \overrightarrow{w}+b) \geq 1 \\ +1 * (\overrightarrow{x_i} \cdot \overrightarrow{w}+b) \geq 1 \\ -1 * (\overrightarrow{x_i} \cdot \overrightarrow{w}+b) \geq 1

The result is the same equation. Therefore, we only need the previous formula:
y_i(\overrightarrow{x_i} \cdot \overrightarrow{w}+b) \geq 1 \quad \to \quad y_i(\overrightarrow{x_i} \cdot \overrightarrow{w}+b) -1 \geq 0


Finally, we add an additional constrain y_i(\overrightarrow{x_i} \cdot \overrightarrow{w}+b) -1 = 0 so that the values that fulfill this, fall in between the two regions as depicted (green zone).


The next step is to maximize the projection of \overrightarrow{x_+}-\overrightarrow{x_-} on \overrightarrow{w} (the black perpendicular vector to the boundary) to keep samples from each class as far as possible. I assume that you know about scalar projection, but if you don’t, you can check out the Appendix A.

The length of the projection is given by the following formula:
(x_+ - x_-)  \cdot  \frac{\overrightarrow{w}}{\| w \|}

From the previous formula y_i(\overrightarrow{x_i} \cdot \overrightarrow{w}+b) -1 = 0 now let us substitute both positive and negative samples x_+ and x_- so that:
y_i = +1 \quad \to \quad 1(\overrightarrow{x_+} \cdot \overrightarrow{w} + b) -1 = 0 \quad \to \quad \overrightarrow{x_+} \cdot \overrightarrow{w} = 1-b \\ y_i = -1 \quad \to \quad -1(\overrightarrow{x_-} \cdot \overrightarrow{w} + b) -1 = 0 \quad \to \quad \overrightarrow{x_-} \cdot \overrightarrow{w} = -1-b

(x_+ - x_-)  \cdot  \frac{\overrightarrow{w}}{\| w \|} = \frac{\overrightarrow{x_+} \cdot \overrightarrow{w}-\overrightarrow{x_-} \cdot \overrightarrow{w}}{\| w \|} = \frac{1-b+1+b}{\| w \|} = \frac{2}{\| w \|}

The goal is to maximize \frac{2}{\| w \|} which is the same as minimizing \| w \| or, to make it more mathematically convenient, \frac{1}{2}\| w \|^2

Thus, we have a function to minimize with a constraint (y_i(\overrightarrow{x_i} \cdot \overrightarrow{w} + b) -1 = 0 ), so Lagrange is applied. In case you want to know more about Lagrange multipliers, you can check Appendix B.

L = \frac{1}{2}\| w \|^2 - \sum \alpha_i [ y_i (\overrightarrow{w} \cdot \overrightarrow{x_i} +b)-1]
First we have the function we want to minimize, and later the constraints.

\frac{\partial L}{\partial w} = \overrightarrow{w} - \sum \alpha_i y_i x_i = 0 \to \overrightarrow{w} = \sum \alpha_i y_i x_i \\ \frac{\partial L}{\partial b} = - \sum \alpha_i y_i = 0 \to \sum \alpha_i y_i = 0

Plug these two functions to L.

L = \frac{1}{2}(\sum \alpha_i y_i \overrightarrow{x_i}) \cdot (\sum \alpha_j y_j \overrightarrow{x_j}) - (\sum \alpha_i y_i \overrightarrow{x_i}) \cdot (\sum \alpha_j y_j \overrightarrow{x_j}) - \sum \alpha_i y_i b + \sum \alpha_i = \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \overrightarrow{x_j} \cdot \overrightarrow{x_j}

Hence, we aim to minimize \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \overrightarrow{x_j} \cdot \overrightarrow{x_j}
The optimization depends on \overrightarrow{x_j} \cdot \overrightarrow{x_j}

Kernel trick

One of the most interesting properties of SVMs is that we can transform problems from a certain number of dimensions to another dimensional space. This flexibility, as known as kernel trick, allows SVMs to classify nonlinear problems.



\text{from } \quad f(x) \quad \text{ to } \quad f(x,x^2)

The following example shows how to graphically solve the XOR problem using 3 dimensions.


Now it is not difficult to imagine a plane that can separate between blue and red samples.

Example 1: 2 points in a plane

Points and class (coordinate x (x1), coordinate y (x2), class/output (y)):

Point 1:

Coordinates: x = [-1,1]
Class (output): y = 1

Point 2:

Coordinates: x = [1,-1]
Class (output): y = -1

We want to minimize: \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{i=1}^N \alpha_i \alpha_j y_i y_j x_i^T \cdot x_j
We know that: w = \sum_{i=1}^N \alpha_i y_i x_i
and \sum_{i=1}^N \alpha_i y_i = 0

Let us calculate the second part of the function we want to minimize first to keep it simple:
 L_1 = \begin{cases} \alpha_1 * \alpha_1 * 1 * 1 * (-1 * -1 + 1*1) = 2 \alpha_1^2\\  \alpha_1 * \alpha_2 * 1 * -1 * (-1 * 1 + 1*-1) = 2 \alpha_1 \alpha_2 \\  \alpha_2 * \alpha_1 * -1 * 1 * (1 * -1 + -1*1) = 2 \alpha_1 \alpha_2 \\  \alpha_2 * \alpha_2 * -1 * -1 * (1 * 1 + -1*-1) = 2 \alpha_2^2 \end{cases} = 2 \alpha_1^2 + 2 \alpha_1 \alpha_2 + 2 \alpha_1 \alpha_2 + 2 \alpha_2 \\

L = \alpha_1 + \alpha_2 - \frac{1}{2} (2 \alpha_1^2 + 2 \alpha_1 \alpha_2 + 2 \alpha_1 \alpha_2 + 2 \alpha_2) = \alpha_1 + \alpha_2 - 2 \alpha_1 \alpha_2 - \alpha_1^2 - \alpha_2^2

 \vspace{2 em} \frac{\partial L}{\partial \alpha_1} = 0 \quad \to \quad -2 \alpha_1 - 2 \alpha_2 +1 = 0 \quad \to \quad \alpha_1 + \alpha_2 = \frac{1}{2} \\ \frac{\partial L}{\partial \alpha_2} = 0 \quad \to \quad -2 \alpha_2 - 2 \alpha_1 +1 = 0 \quad \to \quad \alpha_1 + \alpha_2 = \frac{1}{2}

 \vspace{2 em} \sum_{i=1}^N \alpha_i y_i = 0 \quad \to \quad 1 \alpha_1 - 1 \alpha_2 = 0 \quad \to \quad \alpha_1 = \alpha_2


\alpha_1 = \alpha_2 = \frac{1}{4}

Now let us calculate \overrightarrow{w}:

\overrightarrow{w} = \sum_{i=1}^N \alpha_i y_i x_i = \frac{1}{4} * 1 * [-1,1] + \frac{1}{4} * -1 * [1,-1] = [\frac{-1}{4} , \frac{1}{4}] + [\frac{-1}{4} , \frac{1}{4}] = [\frac{-2}{4} , \frac{2}{4}] = [\frac{-1}{2} , \frac{1}{2}]

Now we have to figure out the bias
\alpha [y_i (\overrightarrow{w}^T\overrightarrow{x_i} +b) -1] = 0 \quad \to \quad \alpha_i y_i \overrightarrow{w}^T\overrightarrow{x_i} + \alpha_i b y_i - \alpha_i = 0 \\ b = \frac{1-y_i \overrightarrow{w}^T\overrightarrow{x_i}}{y_i} \quad \to \quad b = \frac{1}{y_i} - \overrightarrow{w}^T\overrightarrow{x_i}

 \text{for i=1} \\ \text{ } \hspace{3em} b = 1 -[(-1,1) \cdot (\frac{-1}{2},\frac{1}{2})] = 0 \\ \text{for i=2} \\ \text{ } \hspace{3em} b = - 1 -[(1,-1) \cdot (\frac{-1}{2},\frac{1}{2})] = 0 \vspace{3em} \text{ }

Solution = \frac{-1}{2}x+\frac{1}{2}y=0 \quad \to \quad x-y=0

Example 2: 3 points in a plane

Points and class (coordinate x (x1), coordinate y (x2), class/output (y)):

Point 1:

Coordinates: x = [-1,-1]
Class (output): y = -1

Point 2:

Coordinates: x = [2,0]
Class (output): y = 1

Point 3:

Coordinates: x = [3,1]
Class (output): y = 1

First let us calculate the second part of the function we want to minimize. You can see both alphas being multiplied by two numbers. The first number is the product of y_i and y_j. The second number is the dot product between the two coordinates \overrightarrow{x_i} \cdot \overrightarrow{x_j}.
 L_1 = \begin{cases} \alpha_1 * \alpha_1 * 1 * 2\\  \alpha_1 * \alpha_2 * -1 * -2 \\  \alpha_2 * \alpha_1 * -1 * -4 \\  \alpha_2 * \alpha_2 * -1 * -2 \\  \alpha_1 * \alpha_2 * 1 * 4 \\  \alpha_2 * \alpha_1 * 1 * 6 \\  \alpha_2 * \alpha_2 * -1 * -4 \\  \alpha_1 * \alpha_2 * 1 * 6 \\  \alpha_2 * \alpha_1 * 1 * 10 \end{cases} = 2 \alpha_1^2 + 4 \alpha_1 \alpha_2 + 8 \alpha_1 \alpha_3 + 4 \alpha_2^2 + 12 \alpha_2 \alpha_3 + 10 \alpha_3^2 = 0 \\ \text{Apply } \alpha_1 = \alpha_2 + \alpha_3 \text{ and } \frac{-1}{2} \\ 2 \alpha_2 + 2 \alpha_3 - 5 \alpha_2^2 - 10 \alpha_3^2 - 14 \alpha_2 \alpha_3 = 0 \vspace{1em} \text{  }

 \frac{\partial L}{\partial \alpha_2} = 2 - 10 \alpha_2 - 14 \alpha_3 = 0 \\ \frac{\partial L}{\partial \alpha_3} = 2 - 14 \alpha_2 - 20 \alpha_3 = 0 \\ \alpha_2 = 3, \alpha_3 = -2, \alpha_1 = 1

w = 1*-1*[-1,-1] + 3*1*[2,0] - 2*1*[3,1] = [1,-1] \\ b_1 = b_2 = b_3 = -1

Result: x - y -1 = 0

Appendix A: Scalar Projection of Vectors


We have two points and its vector:
R = [2,10] \quad \text{(red)} \\ B = [9,7] \quad \text{(blue)} \\ \overrightarrow{RB} = (7,-3)

And we want to calculate the length of the vector’s projection (purple) on the orange vector \overrightarrow{O} (which by the way, is the perpendicular of both green lines). The result is the blue vector which is over the orange one within the green region.

For this, we have to solve the formula:
\overrightarrow{RB} \cdot \frac{\overrightarrow{O}}{\| O \|} \\ (7,-3) \cdot \frac{(3,3)}{\sqrt{3^2 + 3^2}} \\ \frac{7*3-3*3}{\sqrt{18}} = 2\sqrt{2}

Now with the length and the angle, we can calculate the coordinates using sine and cosine functions.
x_{relative} = 2\sqrt{2} * \cos{(180 + 45)} = -2 \\ x_{relative} = 2\sqrt{2} * \sin{(180 +45)} = -2

B was in [9,7], so the point on the other side of the projection is [9-2,7-2] \to [7,5]

Appendix B: Lagrange Multipliers

Lagrange is a strategy to find local maxima and minima of a function subject to equality constraints, i.e. max of f(x,y) subject to g(x,y)=c. f(\cdot) and g(\cdot) need to have continuous first partial derivatives.

\text{ } \quad \wedge (x,y,\lambda) = f(x,y) + \lambda(g(x,y)-c) \quad \lambda \text{ may be positive or negative}

If f(x_0,y_0) is a maximum of f(x,y) for the original constrained problem, then there exists \lambda such that (x_0,y_0,\lambda_0) is a stationary point for Lagrange function (so \partial \wedge is 0).


In mathematics, a stationary point of a differentiable function of one variable is a point of the function domain where the derivative is zero. For a function of several variables, the stationary point is an input whose all partial derivatives are zero (gradient zero). They correspond to local maxima or minima.


To make it clear, let us say that we have a surface g(x,y,z) whose gradient is \Delta g and it is perpendicular to the whole surface. We try to find its maxima whose gradient should theoretically be perpendicular as well. Let us not forget the relationship between the first derivative and the gradient. Hence we can say that the gradient of g and the gradient of f are pointing in the same direction so: \overrightarrow{\Delta f} = \lambda \overrightarrow{\Delta g} (proportional). \lambda is called Lagrange multiplier.

Example 1: 1 constraint, 2 dimensions

Maximize f(x,y) = 3x+3y on unit circle x^2+y^2=1 \to g(x,y) = x^2+y^2-1

\frac{\partial f}{\partial x} = \lambda \frac{\partial g}{\partial x}  \to 3 = 2x \lambda \to x = \frac{3}{2 \lambda}\\ \frac{\partial f}{\partial y} = \lambda \frac{\partial g}{\partial y} \to 3 = 2y \lambda \to y = \frac{3}{2 \lambda}

Now we plug these results into the original equation.

x^2 + y^2 = 1 \to \frac{9+9}{4 \lambda^2} = \frac{18}{4 \lambda^2} = 1 \to \lambda = \pm \frac{\sqrt{18}}{2} = 2.12

Therefore we have two points:
(x,y) = (0.707,0.707) \quad \text{or} \quad (x,y) = (-0.707,-0.707) \quad \text{so} \quad f(x,y) = 2.12 \quad \text{or} \quad -2.12


Example 2: 1 constraint, 2 dimensions

Find the rectangle of maximal perimeter that can be inscribed in the ellipse x^2+4y^2=4. We want to maximize the perimeter which is f(x,y) = 4x+4y subject to g(x,y) = x^2 +4y^2 -4


\frac{\partial f}{\partial x} = \lambda \frac{\partial g}{\partial x}  \to 4 = 2x \lambda \\ \frac{\partial f}{\partial y} = \lambda \frac{\partial g}{\partial y} \to 4 = 8y \lambda \\ x = 4y

Now plug it into the original equation.

x^2+4y^2 = 4 \quad \to \quad 20y^2=4 \quad \to \quad y = \pm \sqrt{\frac{1}{5}}, \quad x=\frac{4}{\sqrt{5}} \quad (we take the positive because it is a maximum),

Then, P = (\frac{4}{\sqrt{5}},\frac{1}{\sqrt{5}})
So the maximum permieter is: f(x,y) = 4x+4y = 4* \frac{4}{\sqrt{5}}+4* \frac{1}{\sqrt{5}} = 4 \sqrt{5}

Example 3: 2 constraints, 3 dimensions

f(x,y,z) = x^2 + y^2 + z^2 \quad \text{subject to} \quad x+y+z=1 \to g_1(x,y,z) = x+y+z-1 \\ \text{ } \hspace{140pt} \text{and } x+2y+3z = 6 \to g_2(x,y,z) = x+2y+3z-6 \\ \vspace{15pt} \\
 \frac{\partial f}{\partial x} = \lambda_1 \frac{\partial g_1}{\partial x} + \lambda_2 \frac{\partial g_2}{\partial x} \quad \to \quad 2x = \lambda_1 + \lambda_2 \quad \to \quad x = \frac{\lambda_1 + \lambda_2}{2}\\ \frac{\partial f}{\partial y} = \lambda_1 \frac{\partial g_1}{\partial y} + \lambda_2 \frac{\partial g_2}{\partial y} \quad \to \quad 2y = \lambda_1 + 2 \lambda_2 \quad \to \quad y = \frac{\lambda_1 + 2 \lambda_2}{2}\\ \frac{\partial f}{\partial z} = \lambda_1 \frac{\partial g_1}{\partial z} + \lambda_2 \frac{\partial g_2}{\partial z} \quad \to \quad 2z = \lambda_1 + 3 \lambda_2 \quad \to \quad z = \frac{\lambda_1 + 3 \lambda_2}{2}\\

Now plug it into g_1(x,y,z) and g_2(x,y,z):

\frac{\lambda_1 + \lambda_2}{2} + \frac{\lambda_1 + 2 \lambda_2}{2} + \frac{\lambda_1 + 3 \lambda_2}{2} = 1 \quad \to \quad 3 \lambda_1 + 6 \lambda_2 = 2 \\ \frac{\lambda_1 + \lambda_2}{2} + \lambda_1 + 2 \lambda_2 + \frac{3}{2} (\lambda_1 + 3 \lambda_2) = 6 \quad \to \quad 6 \lambda_1 + 14 \lambda_2 = 12 \\ \lambda_2 = -4, \quad \lambda_1 = \frac{-22}{3} \\ x = \frac{-5}{3}, \quad y = \frac{1}{3}, \quad z = \frac{7}{3} \\ f(x,y,z) = x^2 + y^2 + z^2 \quad f(\frac{-5}{3}, \frac{1}{3}, \frac{7}{3}) = \frac{25}{3}

Since this is a parabola in 3 dimensions, this has no maximum, so it is a minimum.

[Image Processing] Temporal median


Despite the huge simplicity of temporal median (both theoretical and practical approaches), I considered it very interesting since it allows to remove undesired objects from a static background. This can be especially useful when someone wants to take a picture of a very crowded sightseeing or monument, although it consumes a very important amount of resources.

First we have to assume that the pictures were taken from the same spot, they have the same luminosity and the same size. These are quite strong constraints but can be achieved by using a tripod and a remote trigger. In my case, I used my smartphone which has a function that by making an intense noise (snap) it takes a picture.

The algorithm iterates over each coordinate and calculates the median of each pixel and each color channel from the different images. The value of each color channel is then mixed and located in that coordinates.

Let us say that we have 3 pictures with the same background and it has a single pixel located in different spots. When the iteration arrives to [5,8] it encounters a black dot in the first picture.

noise1 noise2 noise3

Channel_{red} = [0 \quad 237 \quad 237] \\ Channel_{green} = [0 \quad 28\quad 28] \\ Channel_{blue} = [0 \quad 36 \quad 36]

When median is calculated, the pixel will be:

Pixel_{5,8} = [237 \quad 28 \quad 36]

And therefore, the black pixel will disappear.

Other temporal operations use mode and average instead of median. In case of average, the result is a blended picture.


4 images were processed (4208×2368) by an under-average laptop and it took 35 minutes to generate the result.


The code is provided in the Source code section.


1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

[Second order edge detection operators] Laplacian and Marr-Hildreth operator


First order detection is based on the differentiation of intensity within the same region to detect changes whereas the second order gets zero when the rate of change is constant. Therefore, when a zero-crossing is found, the algorithm would have found an edge.

Laplacian operator is a template which implements this. By differentiating between two adjacent pixels (first order), the second order is approximated.
f''(x) \cong f'(x) - f'(x+1) \\ f''(x) \cong -f(x) + 2f(x+1) -f(x+2)

Which can be translated into an horizontal (or vertical if transposed) as:

-1 2 1


0 -1 0
-1 4 -1
0 -1 0
-1 -1 -1
-1 8 -1
-1 -1 -1

Marr-Hildreth operator

The Marr-Hildreth operator combines Gaussian filtering to smooth the image with Laplacian (LoG operator, Laplacian of Gauss). After the second differentiation of Gauss we have the corresponding formula to find the coefficients to the template to convolve the image with.

\Delta^2g(x,y) = \frac{1}{\sigma^2}(\frac{(x^2+y^2)}{\sigma^2}-2)e^{\frac{-(x^2+y^2)}{2 \sigma^2}}

This function, which can be approximated by a difference of Gaussians, is called Mexican hat wavelet because of its shape:


After the image is convolved with the generated template, we have do determine the zero-crossing points. There are many techniques to achieve this goal, but a much simpler solution is shown: first we divide a 3×3 window in 4 sections as in the following picture, all of them containing the pixel in the center. The average of each section is computed, and if the maximum value is positive and the minimum is negative, a zero-crossing is found and therefore the pixel in the center will be set as white (otherwise is marked as black). This has to be applied to the whole image.



See how a simple figure changes after Laplacian is applied.


Original input


After Laplacian


In this case we would have to detect the zero-crossings to obtain the final result.


2nd derivative Gaussian: {size: 3, sigma: 0.7}
2nd derivative Gaussian: {size: 5, sigma: 0.7}
2nd derivative Gaussian: {size: 7, sigma: 0.7}
2nd derivative Gaussian: {size: 9, sigma: 0.7}
2nd derivative Gaussian: {size: 5, sigma: 0.5}
2nd derivative Gaussian: {size: 5, sigma: 0.6}
2nd derivative Gaussian: {size: 5, sigma: 0.8}
2nd derivative Gaussian: {size: 5, sigma: 0.9}
2nd derivative Gaussian: {size: 5, sigma: 1}

The code is provided in the Source code section.


1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

[Neural Networks] Multilayer Perceptron


An MLP (Multilayer perceptron) is a feedforward neural network which consists of many perceptrons joined in many layers. The most basic MLP has 3 layers: one input layer, one output layer and one hidden layer, but it may have as many hidden layers as necessary. Each individual layer may contain a different number of neurons.


 W_1 = \begin{bmatrix}        a & b & c \\        d & e & f \\        g & h & i      \end{bmatrix}  \quad  W_2 = \begin{bmatrix}        .. & .. & .. & ..      \end{bmatrix}

Feedforward means that nodes have a direct connection such that nodes located in between are fed by the outputs of previous nodes and so on. This is very simple to program because it is basically as the perceptron: multiply weights (including bias nodes) by the input vector, give the result to the activation function, and the output will be the input of the next layer until the output layer is reached.

In the entry about the perceptron I used the step function as the activation function, but in MLP a variety of them can be used. Most of them are sigmoid (mathematical functions having an “S” shape) such as:


The difference between these two sigmoids is that the hyperbolic tangent goes from -1 to 1 whereas the other goes from 0 to 1.

Sometimes linear functions are also needed for cases when neural networks are not used for classification but for extrapolation and any kind of prediction.

If the MLP misclassifies an input, it is necessary to back-propagate the error rate through the neural network in order to update old weights. Gradient descent through derivatives is used to correct the weights as much as possible. Derivatives show how the error varies depending on the weights to later apply the proper correction.

In perceptrons, the error function was basically t_i - y_i (desired output minus system’s output) but in MLP the squared error function will be used instead J = \sum \frac{1}{2}(y-\widehat{y})^2:

  • It has a sum because each element has its own error and it is necessary to sum them up.
  • It is squared because convex functions’ local and global minima are the same.
  • It has a \frac{1}{2} to make it simpler when obtaining the derivative.

The following mathematical explanation regarding the formulas for the learning process is based on the 2x3x1 neural network depicted in the first picture of the entry, but before that, some terms need to be introduced to avoid misunderstandings:

X \to Input matrix.
W_1 \to Weights’ matrix that connects the input and hidden layers.
W_2 \to Weights’ matrix that connects the hidden and output layers.
y \to Desired output.
\eta \to Learning rate.
z_2 = x*W_1 \to Content of each node in the hidden layer (before activation).
a_2 = f(z_2) \to Content of each node in the hidden layer (after activation).
z_3 = a_2*W_2 \to Content of each node in the output layer (before activation).
\widehat{y} = f(z_3) \to MLP output.
J = \sum \frac{1}{2}(y-\widehat{y})^2 \to Squared error function.

First, let us calculate the update formula for W_2. For this, it is necessary to calculate the derivative of the error with respect to W_2 to see how much the error varies depending on W_2.

\frac{\partial J}{\partial W_2} = \frac{\partial \sum \frac{1}{2} (y - \widehat{y})^2}{\partial W_2} = \sum \frac{\partial \frac{1}{2} (y - \widehat{y})^2}{\partial W_2}

By the sum rule in differentiation we can take the sum off the fraction, and from now on it is no longer needed since at the end we can sum each sample individually, so let us derive it.

\frac{1}{2} \cdot 2 \cdot (y - \widehat{y}) \cdot [ \frac{\partial y}{\partial W_2} - \frac{\partial \widehat{y}}{\partial W_2} ]

The derivative of y with respect to W_2 is 0 since y does not depend on W_2.

-1 (y - \widehat{y}) \cdot \frac{\partial \widehat{y}}{\partial W_2}

By applying the chain rule:

-1 (y - \widehat{y}) \cdot \frac{\partial \widehat{y}}{\partial z_3} \cdot \frac{\partial z_3}{\partial W_2}

Which finally:

-1 (y - \widehat{y}) f'(z_3) \cdot a_2 \to \delta_3

In case of W_1 the procedure is very similar:

\frac{\partial J}{W_1} = [\text{same things...} ] = \delta_3 \cdot \frac{\partial z_3}{\partial W_1} = \delta_3 \cdot \frac{\partial z_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial W_1} \\ = \delta_3 \cdot W_2 \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_1} = \delta_3 \cdot W_2 \cdot f'(z_2) \cdot x

Learning process (update rules):
W_1 = W_1 + \eta \frac{\partial J}{\partial W_1} \\ W_2 = W_2 + \eta \frac{\partial J}{\partial W_2}


The MLP I developed was able to correctly classify samples from its training set from 3 different classes. I have not written a word about overfitting because I will probably do that in a different entry with a good solution to overcome it, but this algorithm probably overfits and cannot classify other samples with the same good result.

Finally, I recorded the evolution of the error with two different learning rate to show that if \eta is too big, it can escape from the minimum with an unknown result. In this case, it resulted absolutely good, but it is not always the case.



Left: \eta = -0.2. Right: \eta = -0.1

The code is provided in the Source code section.


I strongly recommend the set of videos to which this one belongs to.