Colorizing black and white images using Deep Learning (TensorFlow)

Recently I was reading an interesting paper that explores how to colorize black and white images using deep learning. The paper was easy to read and understand, and to fully enjoy it I decided to implement it at a smaller scale. I have a laptop with a humble graphics card and an i7, so it cannot compete with servers that cost several thousand euros; that is why I have not focused on the design of the network but rather on the algorithmic part.

Goals and dataset

The goals of this post are:

  1. To implement the algorithm and learn about a new colorizing approach.
  2. To have a look at the neural network they used.
  3. To figure out whether a simpler network can learn, even if the results leave plenty of room for improvement.
  4. To practice using TensorFlow.
  5. To have fun.

The dataset I will use can be downloaded here. It consists of 2687 256×256 images of beaches, forests, streets, etc.

Algorithm

One of the things I enjoyed about the paper [1] is how well structured it is, which made it much easier to understand. I will try to follow their structure in a similar way while being precise and concise.

The very first important thing we need to know is that the Lab color space is used. The reason is simple: in color spaces such as RGB we would need to learn three different channels, whereas in Lab we only need to learn two. The L channel is the lightness, with values between 0 (black) and 100 (white). The a and b channels encode the position along the red-green and blue-yellow axes respectively.

[Figure: Lab color space]

A typical approach would use the L channel as input and the a,b channels as output (for every pixel the prediction consists of two values). However, the authors argue that a loss function based on those two components does not properly capture the multimodality and ambiguity of the problem. We are surrounded by objects that can take multiple colors, such as apples, which can be red, green or yellow. In addition, the Euclidean distance (averaging) pushes predictions towards desaturated (grayish) values.

In order to solve this they divided the a,b color space into Q bins corresponding to probabilities, so the network outputs Q predictions per pixel (one per bin). For instance, in the case of the apple, the bins corresponding to red, yellow and green will contain higher probabilities than the bins close to colors like blue.

[Figure: quantized a,b color space]

I skipped some bins by first analyzing the colors present in my dataset, so that I only take into account the bins actually used. My a,b color space spans from -110 to 110, and depending on the size of the square window used to divide it, the resulting grid and the final set of bins will differ. When I use the whole dataset and a window size of 10 I get XX bins, and this rises to XXX bins when the window size is 5. In contrast, the authors use Q=313 bins.
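
As an illustration, here is a minimal NumPy sketch of that bookkeeping (the function name and window parameter are mine, not from the paper): it quantizes a,b values into grid cells and keeps only the bins that actually occur in the dataset.

import numpy as np

# Sketch: find which a,b bins actually occur in the dataset.
# `ab_pixels` is an (N, 2) array of a,b values in [-110, 110];
# `window` is the bin width (10 or 5 above).
def used_bins(ab_pixels, window=10):
    grid = 220 // window                        # bins per axis
    idx = ((ab_pixels + 110) // window).astype(int)
    idx = np.clip(idx, 0, grid - 1)             # keep the edge value 110 in range
    flat = idx[:, 0] * grid + idx[:, 1]         # a single id per (a, b) bin
    return np.unique(flat)                      # ids of the bins actually used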

The problem is now a simple classification problem: we have an input (lightness values) and the output is a set of probabilities indicating how likely the pixel is to belong to each bin. A classic way to solve this is cross-entropy, but in addition the authors noticed that the model is biased towards low a,b values, which typically correspond to backgrounds such as walls, sky, etc. To address this, they rebalanced the colors by adding weights that multiply the calculated cross-entropy.

[Formula: class-rebalanced cross-entropy loss]

This formula can be read as: perform cross-entropy on a pixel by comparing the original and predicted probability distributions, and then multiply it by a weight corresponding to that color, so that certain colors are emphasized to the network.

[Formula: color-rebalancing weights]

To calculate these weights they introduced a new hyper-parameter \lambda (they use \lambda=0.5). Q is the number of bins used and \tilde{p} is a distribution calculated from our dataset. I tried both the full dataset, consisting of different types of images, and another consisting only of forest images, so I had to recalculate it each time. You can count the number of pixels that fall within each bin and then normalize. They also apply a Gaussian kernel, but I skipped that part. The image below depicts which colors are used the most in the dataset.
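
A small sketch of how these weights can be computed, assuming `p_tilde` is the empirical distribution over the Q bins (the function name is mine):

import numpy as np

# Sketch: class-rebalancing weights from the empirical bin distribution.
# `p_tilde` sums to 1 over the Q bins; `lam` is the paper's lambda (0.5).
def rebalance_weights(p_tilde, lam=0.5):
    Q = p_tilde.size
    w = 1.0 / ((1.0 - lam) * p_tilde + lam / Q)  # rarer colors get larger weights
    w /= np.sum(p_tilde * w)                     # normalize: expected weight is 1
    return w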

[Figure: most used colors in the dataset (full dataset vs. forest-only)]

In the image above, the left side shows that the most used colors are located in the center (desaturated); the two cases also differ depending on the dataset: when only forest images were used, fewer pink-red colors appeared.

This is basically all with regard to the training part. The other important issue arises when predicting values. Imagine for a second that for a single pixel the network outputs a probability distribution in which all values are zero except for a single one (the so-called one-hot vector). It is clear that the corresponding bin represents the color of that pixel. Nonetheless, we will usually have several non-zero probabilities, and mapping them back to a,b values is the next task. They introduced a new variable T standing for temperature (a term coming from simulated annealing). To put it simply, when T tends to zero the colors are more intense, since the prediction emphasizes the color with the highest probability. On the other hand, when T tends to one the colors are more distributed but also more desaturated, because the color is the result of a more spread average over the obtained probabilities.

[Formula: annealed-mean prediction]

The formula above describes how they calculate the prediction given the temperature T and the distribution z.
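
A sketch of this annealed-mean decoding for one pixel, assuming `z` holds its Q predicted probabilities and `bin_centers` is a (Q, 2) array with the a,b value at the center of each bin (the paper reports T = 0.38 as a good trade-off):

import numpy as np

# Sketch: map a per-pixel distribution over bins back to a single a,b value.
def annealed_mean(z, bin_centers, T=0.38):
    logits = np.log(z + 1e-8) / T          # sharpen (T -> 0) or keep (T = 1)
    q = np.exp(logits - logits.max())
    q /= q.sum()                           # re-normalized, sharpened distribution
    return q @ bin_centers                 # expected a,b value under q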

Neural Network

As I mentioned before, due to my limited resources, using the network from the paper was immediately discarded. Thus I designed a very simple neural network consisting of 8 convolutional layers. The input is a window of 32×32 pixels, and after the convolutions the final prediction has a size of 16×16: with no padding, each 3×3 kernel removes one pixel on each side of the image, so every convolution reduces the size by 2. The number of filters depends on the number of bins Q and grows in steps of Q/8: the first convolutional layer has Q/8 filters, the second 2*Q/8, and so on until the last layer with Q filters. Therefore, the size of my final output is 16×16×Q.
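
A minimal TensorFlow 1.x sketch of such a network, under my reading of the description above (the choice of ReLU activations in the hidden layers and a linear output layer is my assumption):

import tensorflow as tf

# Sketch: 8 valid 3x3 convolutions take a 32x32 window down to 16x16,
# with the number of filters growing from Q/8 up to Q.
def build_net(x, Q):                          # x: (batch, 32, 32, 1) L channel
    out = x
    for i in range(1, 9):
        act = tf.nn.relu if i < 8 else None   # assumption: linear output layer
        out = tf.layers.conv2d(out, filters=i * Q // 8, kernel_size=3,
                               padding='valid', activation=act)
    return out                                # (batch, 16, 16, Q) logits per bin

A softmax over the Q dimension then turns these logits into the per-bin probabilities that the loss operates on.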

Results

At first I used only the forest dataset to speed up the training process. Regardless of some hyper-parameter tuning, the results were a bit random: sometimes they were quite okay and sometimes they were bad. In all cases, however, the network was learning and showing behavior that makes sense, so I can say that I am satisfied with the results.

[Figure: colorization results]

References

1. Zhang R., Isola P. & Efros A.A. 2016. Colorful Image Colorization.

An informal review of Randomized Leaky ReLU (RReLU) in TensorFlow

This very informal review of the RReLU activation function compares the performance of the same network (with and without batch normalization) using different activation functions: ReLU, LReLU, PReLU, ELU and the less famous RReLU. The difference between them lies in their behavior over (-\infty, 0]. The goal of this entry is not to explain these activation functions in detail, but to provide a short description and comparison.

When a negative value arises, ReLU deactivates the neuron by outputting 0, whereas LReLU, PReLU and RReLU allow a small negative value. In contrast, ELU has a smooth curve around zero to make it differentiable, resulting in a more natural gradient; instead of deactivating the neuron, negative inputs are mapped to a saturating negative value. The authors claim that this pushes the mean unit activation closer to zero, like batch normalization [1].

[Figure: ELU activation function]

LReLU, PReLU and RReLU all produce negative outputs in the negative part of their respective functions. LReLU uses a small fixed slope, whereas PReLU learns the steepness of this slope. RReLU, the function we will study here, sets this slope to a random value between an upper and a lower bound during training, and to the average of these bounds during testing. The authors of the original paper take their inspiration from a Kaggle competition and even use the same values [2]: random values between 3 and 8 during training and a fixed value of 5.5 during testing.

Notice that in [2], and consequently in the following tests, the variable \alpha_i used by LReLU is not applied as \alpha_i x_i but as \frac{x_i}{\alpha_i}. This detail is important, and for some reason [2] changes the notation of the original LReLU paper.

Results

As in the paper where RReLU is introduced, I used the same activation function configurations plus ELU (default configuration). I ran a very simple neural network on the MNIST dataset, with and without batch normalization. As we can see in the figures below, not only does RReLU perform among the worst, but the simple ReLU performs the best when normalization is used and almost the best when no normalization is added.

[Figure: results with batch normalization]

[Figure: results without batch normalization]

Notes on TensorFlow

This activation function requires fresh random values to be generated constantly while the network is training. As we can see in the corresponding tutorial video and the source code, the initializer needs to be called on each training iteration:

sess.run(r1.initializer)
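
For context, here is a minimal TensorFlow 1.x sketch of this idea; the variable names are mine, and it uses the x_i / \alpha_i form from [2]:

import numpy as np
import tensorflow as tf

# Sketch: RReLU with slopes re-sampled on every training step.
x = tf.placeholder(tf.float32, [None, 100])
alpha = tf.get_variable("alpha", shape=[100], trainable=False,
                        initializer=tf.random_uniform_initializer(3.0, 8.0))
rrelu_train = tf.where(x >= 0, x, x / alpha)   # random slope while training
rrelu_test = tf.where(x >= 0, x, x / 5.5)      # fixed mean slope for testing

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3):                      # stand-in for the training loop
        sess.run(alpha.initializer)            # draw fresh random slopes
        sess.run(rrelu_train, feed_dict={x: np.random.randn(2, 100)})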

The code is provided in the Source code section.

References

1. Clevert D.A., Unterthiner T. and Hochreiter S. 2016. Fast and accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR 2016.
2. Xu B., Wang N., Chen T. and Li M. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network.

Debugging a Keras Neural Network

Learning outcomes:

  • How to get the weights and bias values of the layers.
  • How to get the values between the hidden layers (before and after the activation function).

The goal of this post is to learn how to debug a neural network in Keras. This is extremely important for a variety of reasons:

  1. Knowing how to debug increases the understanding of the underlying structure of the network and its theoretical background.
  2. Learning what’s going on at each level of the network translates into a better understanding of the outcome.
  3. Knowing about each layer’s outcome can be valuable for research purposes.
  4. Meticulous analyses and splits of the network allow us to easily replace and experiment with some parts of it.

Obtaining general information

Obtaining general information gives us an overview of the model so we can check whether its components are the ones we initially planned to add. We can simply print the layers of the model or retrieve a more human-friendly summary. Note that the layers of the neural network (input, hidden, output) are not the same as the layers of the Keras model. Our model's layers are more abstract operations such as transformations, convolutions, activations, etc.

print(model.layers)

Output:

[<keras.layers.convolutional.Conv2D at 0x7faf0c4c9c90>,
 <keras.layers.convolutional.Conv2D at 0x7faf0c4de050>,
 <keras.layers.pooling.MaxPooling2D at 0x7faf0c46bc10>,
 <keras.layers.core.Flatten at 0x7faf0c4de450>,
 <keras.layers.core.Dense at 0x7faf0c46b690>,
 <keras.layers.core.Dense at 0x7faf0e3cf710>]

print(model.summary())

Output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #  
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 64)        18496    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 64)        0        
_________________________________________________________________
flatten_1 (Flatten)          (None, 9216)              0        
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1179776  
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________

We can also retrieve each layer’s input and output size.

for layer in model.layers:
    print("Input shape: "+str(layer.input_shape)+". Output shape: "+str(layer.output_shape))

Output:

Input shape: (None, 28, 28, 1). Output shape: (None, 26, 26, 32)
Input shape: (None, 26, 26, 32). Output shape: (None, 24, 24, 64)
Input shape: (None, 24, 24, 64). Output shape: (None, 12, 12, 64)
Input shape: (None, 12, 12, 64). Output shape: (None, 9216)
Input shape: (None, 9216). Output shape: (None, 128)
Input shape: (None, 128). Output shape: (None, 10)
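
The first learning outcome, retrieving the weights and biases, is a one-liner; a quick sketch for the first convolutional layer:

# get_weights() returns the parameters of a layer as NumPy arrays;
# for a Conv2D layer this is [kernel, bias].
weights, biases = model.layers[0].get_weights()
print(weights.shape)   # (3, 3, 1, 32): 3x3 kernels, 1 input channel, 32 filters
print(biases.shape)    # (32,)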

Obtaining the output of a specific layer after its activation function

This model is a modified example from the original Keras repository.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape,
                 kernel_initializer=keras.initializers.Ones()))

model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(Dense(num_classes, activation='softmax', kernel_initializer=keras.initializers.Ones()))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

This model consists of 6 layers which, as we can see in the code, are configured with explicit parameters (here, initializers that set all weights to one). It is important to note that the activation function is specified within each layer; alternatively, we could add a separate Activation layer after each convolution.

We can imagine our model as a tunnel in which each layer is a different part of the tunnel. In order to obtain the output of a specific layer we need to delimit a subtunnel. As we are interested in the output of the first convolutional layer after the activation function, our subtunnel will be bounded from the input of the first layer to the output of the first layer (which includes the activation function, because it was specified within the layer). We will use K.function to create this subtunnel, specifying its beginning and end.

from keras import backend as K
fun = K.function([model.layers[0].input],[model.layers[0].output])

After that, we simply have to reshape the input accordingly and pass it to that function.

x_inp = np.reshape(x,(1,28,28,1))
layer_output = fun([x_inp])[0]

In the Source code section, the script called debugging1.py shows how subtunnels can be created from the beginning of the network to each of its layers. In addition, it shows an alternative way to obtain the same results, providing a good understanding of what is going on in the network, and both outcomes are compared to check that they are the same.

Obtaining the output of a specific layer before its activation function

The only difference with regard to the previous section is that this time the model needs to be modified so that its activation functions are separated from the layers, as we can see below.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 input_shape=input_shape,
                 kernel_initializer=keras.initializers.Ones()))
model.add(Activation("sigmoid"))

model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, kernel_initializer=keras.initializers.Ones()))
model.add(Activation("sigmoid"))
model.add(Dense(num_classes, kernel_initializer=keras.initializers.Ones()))
model.add(Activation("softmax"))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

Obtaining the output values is done in a similar way to the previous section. Here we show that obtaining the values before or after the activation is just a matter of changing the output layer.

# With and without the activation
fun_without = K.function([model.layers[0].input],[model.layers[0].output])
fun_with = K.function([model.layers[0].input],[model.layers[1].output])
# Input
x_inp = np.reshape(x,(1,28,28,1))
# Output
layer_output_without = fun_without([x_inp])[0]
layer_output_with = fun_with([x_inp])[0]

In the Source code section, the script called debugging2.py shows this and, as in debugging1.py, it also recreates the solution in an alternative way and compares both results.

What if the behavior differs between training and testing?

Extracted from the Keras website:

Note that if your model has a different behavior in training and testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you will need to pass the learning phase flag to your function:

get_3rd_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                  [model.layers[3].output])

# output in test mode = 0
layer_output = get_3rd_layer_output([x, 0])[0]

# output in train mode = 1
layer_output = get_3rd_layer_output([x, 1])[0]

Note how the created function now receives both the input and a flag indicating whether we are in training or testing mode.

Homography estimation: explanation and Python implementation

Homographies are transformations of images from one planar surface to another (image registration). They are used for tasks such as camera calibration, 3D reconstruction and image rectification. There are multiple methods to calculate a homography, and this post explains one of the simplest.

Given a point in homogeneous coordinates x=(x_1,y_1,1) and a matrix H, the multiplication will return the new location of that point, x' = (x_2,y_2,1), such that:

x' = Hx

Due to the dimensions of x and x' we know that H will be a 3×3 matrix, but even though there are 9 elements in the matrix, we have just 8 degrees of freedom, since H is defined only up to scale. In this PowerPoint presentation [1] we can intuitively see where this comes from.

So we have that:

  \begin{bmatrix}   u \\   v \\   1  \end{bmatrix} =  \begin{bmatrix}   h_1 & h_2 & h_3 \\   h_4 & h_5 & h_6 \\   h_7 & h_8 & h_9  \end{bmatrix}  \begin{bmatrix}   x \\   y \\   1  \end{bmatrix}

Where u and v are the new coordinates. Therefore, we have:

u = x h_1 + y h_2 + h_3
v = x h_4 + y h_5 + h_6
1 = x h_7 + y h_8 + h_9

So for each point we have:

x h_1 + y h_2 + h_3 - u (x h_7 + y h_8 + h_9) = 0
x h_4 + y h_5 + h_6 - v (x h_7 + y h_8 + h_9) = 0

A_i = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -ux & -uy & -u \\ 0 & 0 & 0 & x & y & 1 & -vx & -vy & -v \end{bmatrix}

Since we have 8 degrees of freedom, we need at least 4 points to obtain H (each point correspondence contributes two equations). We just need to stack A_1, A_2, A_3, A_4 to obtain an 8×9 matrix that we will call A. We are interested in solving the following equation while avoiding the trivial solution h=0:

Ah = 0; \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -u_1 x_1 & -u_1 y_1 & -u_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -v_1 x_1 & -v_1 y_1 & -v_1 \\ & & & & \cdots \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ \vdots \end{bmatrix} = 0

We will solve this as a least squares problem using singular value decomposition (SVD).

Least squares and SVD

This method (explained very clearly in [2]) is used when we want to approximate a function given different observations. For instance, we have that:

c + d x_1 = y_1
c + d x_2 = y_2
c + d x_3 = y_3

If there were no errors those equations would hold exactly, but since our measurements may be noisy we want to minimize the error by minimizing:

 (c + d x_1 - y_1)^2 + (c + d x_2 - y_2)^2 + (c + d x_3 - y_3)^2

In general, for systems like Ax=b we want to minimize || Ax-b ||^2. In our case b = 0, so we minimize || Ah ||^2 subject to || h || = 1, which avoids the trivial solution.

We will use SVD in our matrix A:

 [U,S,V] = SVD(A)

The columns of V are the eigenvectors of A^T A. The solution is therefore the last column of V, because its singular value (in the diagonal matrix S) will be zero, or close to zero in the presence of noise. More intuitively: the largest singular vectors capture the largest variance across the data, and since we are minimizing, we want the vector with the smallest singular value.
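
Putting the pieces together, here is a minimal NumPy sketch of the estimation (function and argument names are mine): build A from four correspondences, take the SVD, and reshape the last right singular vector into H.

import numpy as np

# Sketch: direct linear transformation from 4 point correspondences.
# `src` and `dst` are (4, 2) arrays of (x, y) and (u, v) points.
def estimate_homography(src, dst):
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u*x, -u*y, -u])
        A.append([0, 0, 0, x, y, 1, -v*x, -v*y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)   # right singular vector of the smallest singular value
    return H / H[2, 2]         # fix the scale so that h_9 = 1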

When applying a homography, the resulting image will probably have different dimensions from the original one, since it might be stretched, rotated, and so on. This results in many "empty pixels" that must be filled by performing an interpolation.

One of the nicest properties of a homography is that H has an inverse, which means that we can map points back to the origin by multiplying them by the inverse of H. In order to fill an empty point we multiply its coordinates by H^{-1} to get the original coordinates, which will be floating point numbers. Those "original coordinates" must be interpolated (for instance, you can round them) to get the closest pixel (nearest neighbor), which is then copied into the empty pixel.
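
A sketch of this backward mapping with nearest-neighbor interpolation (a deliberately slow, loop-based version for clarity; the names are mine):

import numpy as np

# Sketch: fill the destination image by mapping each of its pixels
# back into the source image with the inverse homography.
def inverse_warp(src, H, out_shape):
    Hinv = np.linalg.inv(H)
    out = np.zeros(out_shape, dtype=src.dtype)
    for v in range(out_shape[0]):
        for u in range(out_shape[1]):
            p = Hinv @ np.array([u, v, 1.0])
            x, y = p[0] / p[2], p[1] / p[2]        # back to image coordinates
            xi, yi = int(round(x)), int(round(y))  # nearest neighbor
            if 0 <= yi < src.shape[0] and 0 <= xi < src.shape[1]:
                out[v, u] = src[yi, xi]
    return out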

Example

In the following example, the label of this small notebook will be placed horizontally. For this, the locations of the 4 pixels corresponding to the four corners are used, and a new location forming a rectangle is calculated as well. The red dots correspond to the points we want to transform and the green dots to their target locations.

[Figure: original image with source points (red) and target points (green)]

This first approximation is obtained by calculating the new location of each pixel. However, it leaves plenty of empty pixels, which can be interpolated once the inverse of the homography matrix is calculated.

[Figure: warped image before interpolation]

After the interpolation, the final result will not contain any empty pixel.

[Figure: final result after interpolation]

The code is provided in the Source code section.

References

1. https://courses.cs.washington.edu/courses/csep576/11sp/pdf/Transformations.pdf (Accessed on 8-8-2017)
2. http://www.sci.utah.edu/~gerig/CS6640-F2012/Materials/pseudoinverse-cis61009sl10.pdf (Accessed on 8-8-2017)

Interesting Papers

List of some interesting papers I’ve been reading (sorted alphabetically by title):

V. Vezhnevets, V. Sazonov and A. Andreeva. 2003. A Survey on Pixel-Based Skin Color Detection Techniques. In Proc. Graphicon-2003, 85-92.
Matthew A. Turk and Alex P. Pentland. 1991. Face recognition using Eigenfaces. CVPR 586-591.
J. Shi and J. Malik. 2000. Normalized cuts and image segmentation. T-PAMI, 22(8): 888-905.
R. Dahl, M. Norouzi and J. Shlens. 2017. Pixel Recursive Super Resolution.

[Explanation] Face Recognition using Eigenfaces

This post comes from an assignment for my “Cognitive Science I” course. Enjoy it.

Brief Introduction

This paper is one of the most relevant papers regarding face recognition. Nowadays it is difficult to find a real-life implementation of this old algorithm, but other research has been built upon it. In addition, the simplicity and effectiveness of this algorithm make it very beautiful.

Algorithm

To train the model:
1.- Flatten the black and white images of the training set (from matrices to vectors).
2.- Calculate the mean.
3.- Normalize the training set: subtract the mean from each image.
4.- Calculate the covariance: multiply the matrix of all images by itself.
5.- Extract the eigenvectors of the covariance.
6.- Calculate the eigenfaces: eigenvectors × normalized pictures.
7.- Choose the most significant eigenfaces.
8.- Calculate the weights: chosen eigenfaces × normalized pictures.

To detect a face (a code sketch of the training steps follows this list):
9.- Vectorize and normalize the picture: subtract the calculated mean from it.
10.- Calculate the weights: multiply the eigenfaces by the normalized picture.
11.- Interpret the distance of this weight vector in face space: if it is far, it is not a face (this requires establishing a threshold).
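
As promised, here is a compact NumPy sketch of the training steps (the original code is in Matlab; the names and details below are mine):

import numpy as np

# Sketch of steps 1-8: train an eigenfaces model from an (n, h, w)
# stack of grayscale training images, keeping k eigenfaces.
def train_eigenfaces(images, k):
    n = images.shape[0]
    X = images.reshape(n, -1).astype(float)       # 1. flatten
    mean = X.mean(axis=0)                         # 2. mean face
    A = X - mean                                  # 3. normalize
    C = A @ A.T                                   # 4. small n x n covariance trick
    vals, vecs = np.linalg.eigh(C)                # 5. eigenvectors
    order = np.argsort(vals)[::-1][:k]            # 7. keep the k most significant
    faces = vecs[:, order].T @ A                  # 6. eigenfaces in pixel space
    faces /= np.linalg.norm(faces, axis=1, keepdims=True)
    weights = A @ faces.T                         # 8. one weight vector per image
    return mean, faces, weights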

Explanation of the algorithm

Training set used:
[Figure: training set]

1.- Flattening the image
This algorithm works with vectors because we later have to calculate the covariance. Because of this, each image needs to be in vector form (here flattened column by column):

1 2 3
4 5 6
7 8 9
→ 1 4 7 2 5 8 3 6 9

2.- Calculate the mean
The mean is just the sum of all pictures divided by the number of pictures. As a result, we get an "average" face.
[Figure: mean face]

3.- Normalize the training set
To normalize the training set, we simply subtract from each picture the mean calculated in the previous step.

The reason this is necessary is that we want to create a system able to represent any face. Therefore, we calculated the elements all faces have in common (the mean). If we subtract this average from the pictures, the features that distinguish each picture from the rest of the set become visible.
[Figures: normalized training pictures]

4.- Calculate the covariance
The covariance represents how two variables change together. After the previous step, we have a set of images with distinct features, so now we want to see how these features change for each individual picture in relation to the rest of the pictures.

For this purpose, we put all the flattened normalized pictures together in a matrix. My training set consists of 16 pictures of 235×235 pixels each, so the resulting matrix is 55225×16. The covariance is the multiplication of this matrix by itself, and if we transpose it correctly, the resulting matrix will be 16×16:

16x55225 * 55225x16 = 16×16

5.- Extract eigenvectors
From the covariance we can extract the eigenvectors. Fortunately, there is a Matlab function that helps us with this step (you can see it in the code). There is plenty of information on the internet about eigenvectors (2)(3), but the general idea is that the eigenvectors of the covariance describe the directions of the data. The first eigenvector describes more information than the second, and so on. For this reason, we later pick only the first eigenvectors (avoiding noise).
[Figure: eigenvectors of the covariance]

6.- Calculate eigenfaces
Each eigenvector is multiplied by the whole normalized training set matrix (the 55225×16 matrix); as a result, we get as many eigenfaces as images in our training set.
[Figure: eigenfaces]

7.- Choose the most significant eigenfaces
The first eigenfaces represent more information than the last ones. In fact, the last eigenfaces only add noise to the model, so they should be avoided. Therefore, only the most significant eigenfaces are chosen. For this there are many heuristic algorithms, but it can also be done by looking at the pictures. In my code I only used 16 different pictures, and since the training set is tiny, all of the eigenfaces represent important features.

Some simple heuristic algorithms are shown in the code but not used. I preferred to manually select the number of eigenfaces to see the difference in the algorithm's performance.

Among these heuristics are:
1) Selecting those eigenvectors whose eigenvalues are above 1.
2) Choosing all eigenvectors until the cumulative sum of the eigenvalues is around 95%.

[Figure: eigenvalues]

8.- Calculate weights
Each normalized face in the training set is multiplied by each eigenface. Consequently, there will be N sets of weights with M elements each (N = number of pictures in the training set, M = number of eigenfaces).

After this procedure, we can theoretically represent each face as a linear combination of the chosen eigenfaces. This means that each picture in the training set can be reconstructed as the mean plus the sum of each eigenface multiplied by its corresponding weight.

[Figure: face as a linear combination of eigenfaces]

Recognition part: 9.- Vectorize and normalize the picture
Reshape the test picture into a vector and subtract the mean calculated in 2) from it.

Recognition part: 10.- Calculate the weights
The same as in 8), but with the test picture.

Recognition part: 11.- Interpret the distance
Now we have all the weights from our training set and the weights of the picture we want to classify. The final step is to determine whether the picture is a face or not, given the distances. This can be a bit confusing: the most obvious approach might be to calculate the mean of the distances and, if it is over a predetermined threshold, categorize the picture as a face. Nonetheless, this might lead to errors when using, for example, one of the faces from the training set.

Since the model was trained using that same image, it should be obvious for the system to categorize the picture as a face. Unfortunately, that is not the case. The distance of the picture with respect to the specific eigenface that describes its features might be 0 (because it is exactly the same picture), but the distances between this picture and the rest of the images in the training set might be greater, so if we take the mean of the distances, the overall result could be over the threshold we determined.

[Figure: distances in face space]

The code is provided in the Source code section.
You can also access the slides of the presentation.

References

1. Matthew A. Turk & Alex P. Pentland. 1991. “Face recognition using Eigenfaces”.
2. “Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction”. https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/ (Accessed 5-10-2015)
3. Eigenvectors and Eigenvalues Explained Visually. http://setosa.io/ev/eigenvectors-and-eigenvalues/ (Accessed 5-10-2015)

[Hough Transform] Ellipse detection and space reduction

This is the last entry regarding the Hough Transform. I previously wrote about Line Detection and Circle Detection, including some source code, but in this case I will just write about the method. The reason why I did not write any code is that it can be found in [1] and it is very similar to the Circle Detector.

Brief Introduction

Ellipse detection is another useful tool with various applications in the field of recognition. Let us not forget that sometimes, due to movement when a picture is taken, circles cannot be represented correctly in some images, so ellipses appear instead. Ellipses are also a very simple shape that is interesting to recognize, since we live surrounded by objects of that shape. As an example, self-driving cars are expected to recognize traffic signs; in particular, round traffic signs may look elliptical when seen from a specific point of view.

[Figure: traffic sign appearing elliptical from an angle]

Simple way (5 dimensions)

This approach is very similar to the Circle Detector. A circumference needs 3 parameters to be defined: the radius and the center coordinates. Likewise, an ellipse needs 5: the center (2), the size along both axes (2) and its rotation (1).

[Figure: the five parameters of an ellipse]

Implementing a 5-dimensional accumulator will make us fall into the curse of dimensionality. This approach can be reasonable when the algorithm faces a known environment and some parameters can therefore be drastically reduced. For instance, if a fixed camera aims to capture the movement of the moon, the size of its radius will not change much, and as the movement can also be predicted, one can simply adjust a few parameters to adapt the algorithm and focus on a certain region.

The implementation, as stated above, is similar to the circle detector: using the trigonometric properties of the ellipse, one can establish relationships between the parameters and iterate over each of them to try all combinations across the whole image. There is Matlab code for this in [1] (page 209).

Space Reduction

Space reduction is applied in a very similar way to the Circle Detector. Nonetheless, as ellipses are a bit more complex than circumferences, achieving the solution can be more problematic. Recall that for circumferences it is only necessary to take the center point of the chord between two prospective points that might belong to the circumference, and draw a line perpendicular to the chord passing through that center point. This works because of the orthogonality between these two lines, but it is not the case for ellipses. The equivalent of that perpendicular line must be found in another way, for instance using the tangent lines at the two chosen points, as we can see below.

[Figures: chord construction for circle detection vs. ellipse detection]

The algorithm shown in the book [1] is similar to the circle detector: it iterates over the whole picture trying to find a black pixel. When one is detected, it looks in the neighborhood for another one, in order to draw a chord and obtain the pixel at its center. The proposed way to find the coordinates of the pixel outside the ellipse is the intersection of the tangents at the two chosen points. The author proposes obtaining these tangents before the Non-Maximum Suppression step: the angles needed to generate them are available from the Sobel filter when the Canny edge detector is used.

Finally, we just need to obtain the maxima of the accumulator and draw the resulting ellipse.

References

1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

[Hough Transform] Circle detection and space reduction

Brief Introduction

The Hough Transform was already used for line detection and showed how powerful it can be. This time, the main goal is to detect circles. Detecting this basic shape may be interesting in the field of recognition, since many objects subject to classification have a circular shape, such as the iris of the eye, coins or even cells under a microscope.

[Figure: iris of an eye]

Simple way (3-dimensional matrix)

This first method is easier to understand but very inefficient compared to the second one (after space reduction is applied). A circle needs three parameters: the x,y values of the center and the radius of the circumference. Thus the accumulator matrix will have 3 dimensions, one per parameter, covering all possibilities.

Given a certain radius, when a pixel is detected, the algorithm increments the accumulator elements corresponding to the circumference that can be drawn using that pixel and radius.

The first two parameters of the 3-D accumulator are the x,y coordinates over the whole picture. The third parameter is the radius. In the following picture we use a fixed radius to make everything easier to understand, and the accumulator is incremented at the coordinates where a red pixel is located.

[Figure: accumulator for a fixed radius]

As you can see, the red pixels hardly coincide at any point. During the algorithm's execution, many circles will be drawn, and a peak will appear in the center of the real circle we want to detect, as depicted below: they all meet in the center of the circumference (green points).

[Figure: drawn circles meeting at the center]

When the algorithm finishes iterating over all pixels and radii, we only need to find where that maximum is located to get the parameters of the circumference. Finally, we can draw it.

[Figure: detected circle]

The most problematic characteristic of this algorithm is that it needs to iterate over all radii, so it loses a considerable amount of time on this task. For this reason, in the algorithm I developed I establish a minimum and a maximum radius, to avoid checking extremely short and extremely large radii. This is how it looks depending on the situation:

[Figures a), b), c): detections in the three situations described below]

a) When the circumference has a larger radius than expected.
b) When it finds more figures.
c) When it only finds an ellipse.

The algorithm is this:

Iterate over columns (x)
    Iterate over rows (y)
        If an edge is detected
            Iterate over all radii (r)
                For angles between 0 and 360 (m)
                    Calculate the pixel of the generated circumference (y,x,r)
                    If that pixel is not out of bounds, increase the accumulator
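
A direct NumPy translation of this pseudocode (a sketch with my own names; `edges` is a binary edge image and the radius range is bounded as described above):

import numpy as np

# Sketch: naive 3-D Hough accumulator for circles.
def hough_circles(edges, r_min, r_max):
    h, w = edges.shape
    acc = np.zeros((h, w, r_max - r_min), dtype=np.int32)
    angles = np.deg2rad(np.arange(360))
    for y, x in zip(*np.nonzero(edges)):             # every edge pixel
        for ri, r in enumerate(range(r_min, r_max)):
            cy = np.round(y - r * np.sin(angles)).astype(int)
            cx = np.round(x - r * np.cos(angles)).astype(int)
            ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
            np.add.at(acc, (cy[ok], cx[ok], ri), 1)  # vote for possible centers
    return acc   # the peak is at (center_y, center_x, radius - r_min)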

Space Reduction

Space reduction in this case consists in removing the problematic radius parameter.

When iterating over the picture and a pixel is detected, the algorithm looks for more pixels in an enclosed neighborhood. When a pixel in the neighborhood is detected, we have a chord of the circumference. Given these two pixels, we have a slope and a middle point (blue). We need to find the perpendicular line passing through that middle point (red), which represents the cells to increment in the accumulator.

[Figure: chord and its perpendicular bisector]

As the algorithm iterates, the accumulator keeps growing, and a peak is generated in the center of the real circumference.

[Figures: accumulator after some iterations, and the accumulator plotted]

There is another method to obtain the center, explained in [1], but the algorithm written in the book is an implementation of the method already described. However, the author of the book does not give a hint about obtaining the radius once we have the center.

I used a 1-dimensional array (or vector) as an accumulator of the distances from each pixel in the image to the center. A peak is generated when the distances of all the pixels belonging to the circumference are accumulated, and its location corresponds to the radius of that circumference. It could be improved by starting in the neighborhood of the center and stopping the count at some point.
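
A sketch of that radius accumulator (names are mine; `edges` is the binary edge image and (cy, cx) the detected center):

import numpy as np

# Sketch: 1-D accumulator of pixel-to-center distances; its peak is the radius.
def radius_from_center(edges, cy, cx):
    ys, xs = np.nonzero(edges)
    d = np.round(np.hypot(ys - cy, xs - cx)).astype(int)
    acc = np.bincount(d)       # votes per integer distance
    return acc.argmax()        # the distance with the most votes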

The execution times of the improved and non-improved algorithms differ considerably: for the same picture, the non-improved algorithm takes around 0.39 seconds, whereas the improved version takes around 0.019 seconds.

The code is provided in the Source code section.

References

1. M. Nixon and A. Aguado. 2008. “First order edge detection operators”, Feature Extraction & Image Processing.

[High and low pass filters] The Einstein-Monroe picture

There is an Einstein-Monroe picture wandering around the Internet that I saw recently. It is a nice example of how human vision works, as well as an excuse to build high and low pass filters from scratch in order to extract both images.

[Figure: Einstein-Monroe hybrid image]

This picture blends a low-frequency picture of Monroe with a high-frequency picture of Einstein. Human vision notices the details of the environment when objects are close (high frequency). That means that when we are close enough (a normal distance) to this picture, we should be able to see Einstein's face; otherwise you should check your sight. If you take 3 steps back and look at the same picture, you should see Monroe's face, because your vision can no longer pick up the small details. Instead, it gets a general (slightly blurry) idea of the picture.

Since this is about high- and low-frequency pictures, we can build high and low pass filters to extract the frequencies we want. Thus, each original image can theoretically be extracted.

First of all, we can see what the Fourier transform looks like. In order to obtain it, we perform the 2-D Fast Fourier Transform, shift it, and scale it.

[Figure: shifted 2-D FFT of the image]

The low-frequency content of the image is in the center of the previous picture. Thus, a simple low-pass filter can be built by keeping the center and removing the rest, as the picture below shows. If we want to extract the opposite, we can invert the mask.

[Figures: low-pass filter mask and high-pass filter mask]

The radius of the circle needs to be manually adjusted depending on the output. When we rebuild the image using each of those Fourier transforms, we get the original images.
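
A minimal NumPy sketch of both filters (the function name and the hand-tuned radius are mine):

import numpy as np

# Sketch: circular masks in the shifted Fourier domain.
def split_frequencies(img, radius):
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    y, x = np.ogrid[:h, :w]
    mask = (y - h // 2) ** 2 + (x - w // 2) ** 2 <= radius ** 2
    rebuild = lambda G: np.real(np.fft.ifft2(np.fft.ifftshift(G)))
    return rebuild(F * mask), rebuild(F * ~mask)   # low-pass, high-pass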

[Figures: recovered low-frequency image (Monroe) and high-frequency image (Einstein)]

Since on my secondary screen I can barely see Einstein's face, I increased the contrast to make it easier to see in case you have the same issue.

[Figure: contrast-enhanced Einstein image]

The code is provided in the Source code section.

Interesting Links

1. S. Lehar. “An Intuitive Explanation of Fourier Theory”, http://cns-alumni.bu.edu/~slehar/fourier/fourier.html.