Colorizing black and white images using Deep Learning (Tensorflow)

Recently I was reading an interesting paper that explores how to colorize black and white images using Deep Learning. The paper was easy to read and understand, and to fully enjoy it I decided to implement it on a smaller scale. I have a laptop with a humble graphics card and an i7, so it cannot compete with servers that cost several thousand euros; that is why I have not focused on the design of the network but rather on the algorithmic part.

Goals and dataset

The goals of this post are:

  1. To implement the algorithm and learn about a new colorizing approach.
  2. To have a look at the neural network they used.
  3. To figure out whether a simpler network can learn, even if the results leave room for improvement.
  4. To practice using Tensorflow.
  5. To have fun.

The dataset I will use can be downloaded here. It consists of 2687 256×256 images of beaches, forests, streets, etc.

Algorithm

One of the things I enjoyed about the paper [1] is how well structured it is, which made it much easier to understand. I will try to follow their structure in a similar way while being precise and concise.

The very first important thing we need to know is that the Lab color space is used. The reason is simple: in color spaces such as RGB we need to learn three different channels, whereas in Lab we only need to learn two. The L channel refers to the lightness, with values between 0 (dark) and 100. The a and b channels encode the position along the red-green and blue-yellow axes respectively.
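
To get an idea of what this conversion looks like in practice, here is a minimal sketch (assuming scikit-image is available; the file name is just a placeholder) that converts an RGB image into Lab and splits it into the input and the targets:

from skimage import io, color

rgb = io.imread("beach.jpg")    # hypothetical image from the dataset, shape (256, 256, 3)
lab = color.rgb2lab(rgb)        # convert to Lab, shape (256, 256, 3)

L = lab[:, :, 0]                # lightness, values roughly in [0, 100] (network input)
ab = lab[:, :, 1:]              # a and b channels, values roughly in [-110, 110] (targets)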

[Figure: the Lab color space]

A typical approach would use the L channel as input and the a,b channels as output (for every pixel the prediction consists of two values). However, they argue that a loss function based on these two components does not properly represent the multimodal and ambiguous nature of the problem. We are surrounded by objects that can have multiple colors, such as apples, which can be red, green or yellow. In addition, the Euclidean distance (averaging) favors desaturated (grayish) values.

In order to solve this they divided the a,b color space into Q bins, and consequently the number of predictions expected per pixel will be Q (one probability per bin). For instance, in the case of the apple, the bins corresponding to red, yellow and green will contain higher probabilities than those bins close to colors like blue.

[Figure: the quantized a,b color space]

I skipped representing some bins by first analyzing the colors present in my dataset, so that I only take into account those bins whose colors are actually used. My a,b color space lies between -110 and 110, and depending on the size of the square window used to divide it, the resulting grid and the final set of used bins will be different. When I use the whole dataset and a window size of 10 I use XX bins, and it rises to XXX bins when the window size is 5. In contrast, the authors use Q=313 bins.
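
As a rough sketch of how I map a,b values to bins (one of several possible ways to do it; in practice the resulting indices are remapped so that only the bins present in the dataset are kept):

A_MIN, A_MAX = -110, 110

def ab_to_bin(a, b, window_size=10):
    # Number of cells per axis in the grid covering [-110, 110]
    cells = (A_MAX - A_MIN) // window_size
    col = min(int((a - A_MIN) // window_size), cells - 1)
    row = min(int((b - A_MIN) // window_size), cells - 1)
    return row * cells + col    # single bin index in [0, cells*cells)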

The problem is now a simple classification problem in which we have an input (brightness values) and the output is a set of probabilities indicating how likely the values are to belong to each bin. A classic way to solve this is cross-entropy, but in addition the authors noticed that the model is biased towards low a,b values, which typically correspond to backgrounds such as walls, sky, etc. In order to solve this, they rebalanced the colors by adding weights that multiply the calculated cross-entropy.

[Formula 1: rebalanced cross-entropy loss]

This formula can be read as: perform cross-entropy on a pixel by comparing the original and predicted distributions of the probabilities, and then multiply it by a weight corresponding to that color, such that certain colors are emphasized for the network.

[Formula 2: color rebalancing weights]

To calculate these weights they introduced a new hyper-parameter \lambda (they use \lambda=0.5). Q is the number of bins used and \tilde{p} is a distribution calculated from our dataset. I tried to use both the full dataset, consisting of different types of images, and another consisting only of images of forests, so I had to recalculate it. You can count the number of pixels that fall within each bin and later normalize it. They also apply a Gaussian kernel, but I skipped that part. The image below depicts which colors are used the most in the dataset.
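
A sketch of how these weights can be computed from the bin counts of the dataset (following the formula above, skipping the Gaussian smoothing as I did):

import numpy as np

def rebalancing_weights(bin_counts, lam=0.5):
    # Empirical color distribution of the dataset (p tilde)
    p = bin_counts / bin_counts.sum()
    Q = len(bin_counts)
    # Mix with a uniform distribution and take the inverse
    w = 1.0 / ((1.0 - lam) * p + lam / Q)
    # Normalize so that the expected weight under p is 1
    return w / np.sum(p * w)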

[Figure: colors used in the full dataset and in the forest-only dataset]

In the image above, the left side shows that the most used colors are located in the center (desaturated), and both cases differ depending on the dataset: when only forest images were used, fewer pink-red colors appear.

This is basically all with regards to the training part. The other important issue arises when predicting values. Imagine for a second that for a single pixel the network spits out a probability distribution in which all values are zero except for one value that is one (the so-called one-hot vector). It is more than clear that the corresponding bin will represent the color of that pixel. Nonetheless, we will usually have several non-zero probabilities, and mapping them back to a,b values is the next task. They introduced a new variable called T, standing for temperature (a term coming from simulated annealing). To put it simply, when T tends to zero, the colors are more intense since the prediction emphasizes the color with the highest probability. On the other hand, when T tends to one, the colors are more distributed but also more desaturated, because the color is the result of averaging over a more spread-out distribution.

[Formula 3: annealed-mean prediction]

The formula above describes how they calculated the prediction given the temperature T and the distribution z.
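
A sketch of this prediction step for a single pixel (bin_centers would contain the a,b value at the center of each used bin; T is chosen between 0 and 1):

import numpy as np

def annealed_mean(z, bin_centers, T):
    # z: predicted probabilities over the Q bins for one pixel, shape (Q,)
    # bin_centers: (a, b) value of each bin, shape (Q, 2)
    z = np.clip(z, 1e-8, 1.0)          # avoid log(0)
    q = np.exp(np.log(z) / T)
    q /= q.sum()                       # re-adjusted distribution
    return q.dot(bin_centers)          # expected (a, b) value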

Neural Network

As I have mentioned before, due to my limited resources, using the network they used was immediately discarded. Thus I designed a very simple neural network that consists of 8 convolutional layers. The input consists of a window of 32×32 pixels, and after the convolutions the final prediction has a size of 16×16. Every convolution reduces the size by 2 because of the 3×3 kernel (one pixel on each side of the image). The number of filters depends on the number of bins Q, growing in steps of Q/8: the first convolutional layer has Q/8 filters, the second 2*Q/8, and so on, until the last layer, which has Q filters. Therefore, the size of my final output will be 16x16xQ.
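
As a reference, this is roughly what such an architecture looks like (a Keras-style sketch rather than my actual Tensorflow code; the value of Q and the use of ReLU in the hidden layers are assumptions):

from keras.models import Sequential
from keras.layers import Conv2D

Q = 104    # hypothetical number of bins; it depends on the dataset and the window size

model = Sequential()
model.add(Conv2D(Q // 8, (3, 3), activation='relu', input_shape=(32, 32, 1)))
for i in range(2, 8):
    # Filters grow in steps of Q/8; each 3x3 convolution shrinks the image by 2 pixels
    model.add(Conv2D(i * Q // 8, (3, 3), activation='relu'))
# Last (8th) layer: Q filters and a softmax over the bins of every pixel -> 16x16xQ
model.add(Conv2D(Q, (3, 3), activation='softmax'))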

Results

At first I used only the forest dataset to speed up the training process, and despite some hyper-parameter tuning the results were a bit random: sometimes they were quite okay and sometimes they were bad. But in all cases the network was learning and showing a behavior that makes sense, so I can say that I am satisfied with the results.

[Figure: colorization results]

References

1. Zhang R., Isola P. & Efros A.A. 2016. Colorful Image Colorization.

Informal review on Randomized Leaky ReLU (RReLU) in Tensorflow

This very informal review of the activation function RReLU compares the performance of the same network (with and without batch normalization) using different activation functions: ReLU, LReLU, PReLU, ELU and the less famous RReLU. The difference between them lies in their behavior on (-\infty, 0]. The goal of this entry is not to explain these activation functions in detail, but to provide a short description.

When a negative value arises, ReLU deactivates the neuron by outputting 0, whereas LReLU, PReLU and RReLU allow a small negative value. In contrast, ELU has a smooth curve around zero to make it differentiable, resulting in a more natural gradient, and instead of deactivating the neuron, negative inputs are mapped to small negative values. The authors claim that this pushes the mean unit activation closer to zero, like batch normalization [1].

[Figure: the ELU activation function]

LReLU, PReLU and RReLU produce negative values in the negative part of their respective functions. LReLU uses a small fixed slope, whereas PReLU learns the steepness of this slope. RReLU, the function we will study here, sets this slope to a random value between an upper and a lower bound during training, and to the average of these bounds during testing. The authors of the original paper got their inspiration from a Kaggle competition and even use the same values [2]: random values between 3 and 8 during training and a fixed value of 5.5 during testing.

Notice that in [2], and consequently in the following tests, the variable \alpha_i used by LReLU is not applied as \alpha_i x_i but as \frac{x_i}{\alpha_i}. This detail is important since, for some reason, [2] changes the notation from the original LReLU paper.
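
As a reference, this is roughly what RReLU computes (a NumPy sketch, not the Tensorflow code used in the tests):

import numpy as np

def rrelu(x, lower=3.0, upper=8.0, training=True):
    # Note that alpha divides the input (x / alpha), following the notation in [2]
    if training:
        alpha = np.random.uniform(lower, upper, size=x.shape)
    else:
        alpha = (lower + upper) / 2.0    # 5.5
    return np.where(x >= 0, x, x / alpha)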

Results

As in the paper where RReLU is introduced, I used the same activation function configurations plus ELU (default configuration). I ran a very simple neural network on the MNIST dataset with and without batch normalization, and as we can see in the figures below, not only does RReLU perform among the worst, but the simple ReLU performs the best when normalization is used and almost the best when no normalization is added.

[Figure: results with batch normalization]

[Figure: results without batch normalization]

Notes on Tensorflow

This activation function requires new random values that need to be re-initialized constantly while the network is training. As we can see in the corresponding tutorial video and the source code, the initializer needs to be called on each iteration during training:

sess.run(r1.initializer)

The code is provided in the Source code section.
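
A minimal sketch of this pattern (variable names are hypothetical, not the exact code from the Source code section):

import tensorflow as tf

# Random slope denominator; re-running its initializer draws a new value
r1 = tf.Variable(tf.random_uniform([], minval=3.0, maxval=8.0), trainable=False)

def rrelu(x):
    return tf.where(x >= 0.0, x, x / r1)

# Inside the training loop:
#   sess.run(r1.initializer)             # resample the random slope
#   sess.run(train_step, feed_dict=...)  # then run one training iteration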

References

1. Clevert D.A., Unterthiner T. and Hochreiter S. 2016. Fast and accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR 2016.
2. Xu B., Wang N., Chen T. and Li M. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network.

Debugging a Keras Neural Network

Learning outcomes:

  • How to get the weights and bias values of the layers.
  • How to get the values between the hidden layers (before and after the activation function).

The goal of this post is to learn how to debug a neural network in Keras. This is extremely important for a variety of reasons.

  1. Knowing how to debug increases the understanding of the underlying structure of the network and its theoretical background.
  2. Learning what’s going on at each level of the network translates into a better understanding of the outcome.
  3. Knowing about each layer’s outcome can be valuable for research purposes.
  4. Meticulous analyses and splits of the network allow us to easily replace and experiment with some parts of it.

Obtaining general information

Obtaining general information can give us an overview of the model to check whether its components are the ones we initially planned to add. We can simply print the layers of the model or retrieve a more human-friendly summary. Note that the layers of the neural network (input, hidden, output) are not the same as the layers of the Keras model. Our model's layers are more abstract operations such as transformations, convolutions, activations, etc.

print(model.layers)

Output:

[<keras.layers.convolutional.Conv2D at 0x7faf0c4c9c90>,
 <keras.layers.convolutional.Conv2D at 0x7faf0c4de050>,
 <keras.layers.pooling.MaxPooling2D at 0x7faf0c46bc10>,
 <keras.layers.core.Flatten at 0x7faf0c4de450>,
 <keras.layers.core.Dense at 0x7faf0c46b690>,
 <keras.layers.core.Dense at 0x7faf0e3cf710>]

print(model.summary())

Output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #  
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 64)        18496    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 64)        0        
_________________________________________________________________
flatten_1 (Flatten)          (None, 9216)              0        
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1179776  
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________

We can also retrieve each layer’s input and output size.

for layer in model.layers:
    print("Input shape: "+str(layer.input_shape)+". Output shape: "+str(layer.output_shape))

Output:

Input shape: (None, 28, 28, 1). Output shape: (None, 26, 26, 32)
Input shape: (None, 26, 26, 32). Output shape: (None, 24, 24, 64)
Input shape: (None, 24, 24, 64). Output shape: (None, 12, 12, 64)
Input shape: (None, 12, 12, 64). Output shape: (None, 9216)
Input shape: (None, 9216). Output shape: (None, 128)
Input shape: (None, 128). Output shape: (None, 10)

Obtaining the output of a specific layer after its activation function

This model is a modified example from the original Keras repository.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape,
                 kernel_initializer=keras.initializers.Ones()))

model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(Dense(num_classes, activation='softmax', kernel_initializer=keras.initializers.Ones()))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

This model consists of 6 layers which, as we can see in the code, include some special information in their parameters. It's important to note that the activation function is specified within each layer; alternatively, we could just add another layer after the convolution specifying the activation function.

We can imagine our model as a tunnel in which each layer is a different part of the tunnel. In order to obtain the output of a specific layer we need to carve out a subtunnel. As we are interested in the output of the first convolutional layer after the activation function, our subtunnel will be bounded from the input of the first layer to the output of the first layer (which includes the activation function because it was specified in the code). We will use the backend function K.function to create this subtunnel, specifying its beginning and end.

from keras import backend as K
fun = K.function([model.layers[0].input],[model.layers[0].output])

After that we simply have to accommodate the input and pass it to that function.

x_inp = np.reshape(x,(1,28,28,1))
layer_output = fun([x_inp])[0]

In the Source code section, the script called debugging1.py shows how subtunnels were created from the beginning to each layer of the network. In addition, it shows an alternative way to obtain the same results, providing a good understanding of what's going on in the network, and both outcomes are compared to check that they are the same.

Obtaining the output of a specific layer before its activation function

The only difference with regards to the previous section is that this time the model needs to be modified to have its activation functions separated from the layers, as we can see below.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 input_shape=input_shape,
                 kernel_initializer=keras.initializers.Ones()))
model.add(Activation("sigmoid"))

model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=keras.initializers.Ones()))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, kernel_initializer=keras.initializers.Ones()))
model.add(Activation("sigmoid"))
model.add(Dense(num_classes, kernel_initializer=keras.initializers.Ones()))
model.add(Activation("softmax"))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

Obtaining the output values is done in a similar way to the previous section. Here we show that obtaining the values before and after the activation is simply a matter of changing the output layer.

# With and without the activation
fun_without = K.function([model.layers[0].input],[model.layers[0].output])
fun_with = K.function([model.layers[0].input],[model.layers[1].output])
# Input
x_inp = np.reshape(x,(1,28,28,1))
# Output
layer_output_without = fun_without([x_inp])[0]
layer_output_with = fun_with([x_inp])[0]

In the Source code section, the script called debugging2.py shows this and, as in debugging1.py, it also recreates the solution in an alternative way and compares both results.

What if the behavior is different during training and testing?

Extracted from the Keras website:

Note that if your model has a different behavior in training and testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you will need to pass the learning phase flag to your function:

get_3rd_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                  [model.layers[3].output])

# output in test mode = 0
layer_output = get_3rd_layer_output([x, 0])[0]

# output in train mode = 1
layer_output = get_3rd_layer_output([x, 1])[0]

Note how the created function now receives both the input and a flag indicating whether we are in the training or testing phase.

Homography estimation explanation and python implementation

Homographies are transformations of images from one planar surface to another (image registration). They are used for tasks such as camera calibration, 3D reconstruction and image rectification. There are multiple methods to calculate a homography, and this post explains one of the simplest.

Given a point x=(x_1,y_1,1) in homogeneous coordinates and a matrix H, their multiplication will return the new location of that point, x' = (x_2,y_2,1), such that:

x' = Hx

Due to the dimensions of x and x' we know that H will be a 3×3 matrix, but even if there are 9 elements in the matrix we only have 8 degrees of freedom. In this Powerpoint presentation [1] we can get an intuition of where this comes from.

So we have that:

  \begin{bmatrix}   u \\   v \\   1  \end{bmatrix} =  \begin{bmatrix}   h_1 & h_2 & h_3 \\   h_4 & h_5 & h_6 \\   h_7 & h_8 & h_9  \end{bmatrix}  \begin{bmatrix}   x \\   y \\   1  \end{bmatrix}

Where u and v are the new coordinates. Therefore, we have:

 \\ u = x h_1 + y h_2 + h_3 \\ v = x h_4 + y h_5 + h_6 \\ 1 = x h_7 + y h_8 + h_9 \\

So for each point we have:

 \\ x h_1 + y h_2 + h_3 - u (x h_7 + y h_8 + h_9) = 0 \\ x h_4 + y h_5 + h_6 - v (x h_7 + y h_8 + h_9) = 0 \\   A_i =   \begin{bmatrix}   x & y & 1 & 0 & 0 & 0 & -ux & -uy & -u \\   0 & 0 & 0 & x & y & 1 & -vx & -vy & -v  \end{bmatrix}

Since we have 8 degrees of freedom, we need at least 4 points to obtain H (each point correspondence contributes two equations). We just need to stack A_1, A_2, A_3, A_4 to get an 8×9 matrix that we will call A. We are interested in solving the following equation while avoiding the trivial solution h=0:

 Ah=0;  \begin{bmatrix}   x_1 & y_1 & 1 & 0 & 0 & 0 & -u_1 x_1 & -u_1 y_1 & -u_1 \\   0 & 0 & 0 & x_1 & y_1 & 1 & -v_1 x_1 & -v_1 y_1 & -v_1 \\   & & & & \cdots  \end{bmatrix}  \begin{bmatrix}   h_1 \\ h_2 \\ h_3 \\ \vdots  \end{bmatrix} = 0

We will solve this as a least squares problem using singular value decomposition (SVD)

Least squares and SVD

This method (explained very clearly in [2]) is used when we want to approximate a function given different observations. For instance, we have that:

 \\ c + d x_1 = y_1 \\ c + d x_2 = y_2 \\ c + d x_3 = y_3

If there were no errors those equations would hold, but since our measurements might have noise we want to minimize those errors by minimizing:

 (c + d x_1 - y_1)^2 + (c + d x_2 - y_2)^2 + (c + d x_3 - y_3)^2

In general, for systems like Ax=b we want to minimize || Ax-b ||^2

We will use SVD in our matrix A:

 [U,S,V] = SVD(A)

The columns of V are the eigenvectors of A^T A. The solution is therefore the last eigenvector, because its eigenvalue (in the diagonal matrix S) will be zero, or close to zero in the presence of noise. More intuitively, the eigenvectors with the largest eigenvalues capture the largest variance across the data, and since we are interested in minimizing, we want the eigenvector with the smallest eigenvalue.
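
A NumPy sketch of the whole estimation, building A from the point correspondences as described above and taking the last right singular vector (not necessarily the exact code from the Source code section):

import numpy as np

def estimate_homography(src, dst):
    # src, dst: arrays of shape (N, 2) with N >= 4 corresponding points (x, y) -> (u, v)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(A)
    # h is the right singular vector associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]    # normalize so that h_9 = 1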

When performing a homography, the resulting image will probably have different dimensions from the original one since it might be stretched, rotated, and so on. This will result in many “empty pixels” that must be filled by performing an interpolation.

One of the nicest properties of the homography is that H has an inverse, which means that we can map all points back to the origin by multiplying them by the inverse of H. In order to fill an empty point we multiply its coordinates by H^{-1} to get the original coordinates, which will be floating point numbers. Those “original coordinates” must be interpolated (for instance, you can round them) to get the closest pixel (nearest neighbor) and place it in the empty pixel.
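
A rough sketch of this inverse mapping with nearest-neighbor interpolation (assuming H maps source coordinates to destination coordinates and out_shape is the size of the destination image):

import numpy as np

def warp_nearest(img, H, out_shape):
    H_inv = np.linalg.inv(H)
    out = np.zeros(out_shape, dtype=img.dtype)
    for v in range(out_shape[0]):
        for u in range(out_shape[1]):
            # Map the destination pixel (u, v) back to the source image
            p = H_inv.dot(np.array([u, v, 1.0]))
            x, y = p[0] / p[2], p[1] / p[2]
            xi, yi = int(round(x)), int(round(y))    # nearest neighbor
            if 0 <= yi < img.shape[0] and 0 <= xi < img.shape[1]:
                out[v, u] = img[yi, xi]
    return out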

Example

In the following example the label of this small notebook will be placed horizontally. For this, the locations of the 4 pixels corresponding to the four corners are used, and a new location forming a rectangle is calculated as well. The red dots correspond to the points we want to transform and the green dots to their target locations.

[Figure: original image with the source (red) and target (green) points]

This first approximation is obtained by calculating the new location of each pixel. However, it leaves plenty of empty pixels, which can be interpolated once the inverse of the homography matrix is calculated.

[Figure: first approximation, still containing empty pixels]

After the interpolation, the final result will not contain any empty pixel.

[Figure: final result after interpolation]

The code is provided in the Source code section.

References

1. https://courses.cs.washington.edu/courses/csep576/11sp/pdf/Transformations.pdf (Accessed on 8-8-2017)
2. http://www.sci.utah.edu/~gerig/CS6640-F2012/Materials/pseudoinverse-cis61009sl10.pdf (Accessed on 8-8-2017)