Gradients in Tensorflow

June 9, 2018 Juan Miguel Valverde

The chain rule in Tensorflow

Working with any kind of neural network involves the backpropagation algorithm, so it is key to understand concepts such as derivatives, the chain rule and gradients. It is important not only to understand them theoretically but also to be able to play around with them, and that is the goal of this post. Most of the information presented here has been collected from different posts on Stack Overflow.

There are three types of differentiation:

Numerical: it uses the definition of the derivative (a limit) with a small finite step to approximate the result.
Symbolic: manipulation of mathematical expressions (the one we learn in high school).
Automatic: it repeatedly applies the chain rule to break the expression down into elementary operations whose derivatives are known.
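To make the first two approaches concrete, here is a minimal sketch in plain Python (no Tensorflow). The helper name numerical_derivative and the step size h are my own choices for illustration; the symbolic result is simply the known derivative of sin.

```python
import math

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 3.0
approx = numerical_derivative(math.sin, x0)  # numerical differentiation
exact = math.cos(x0)                         # symbolic result: d/dx sin(x) = cos(x)
print(approx, exact)
```

The two values agree to many decimal places, but the numerical one depends on the choice of h, which is one reason frameworks prefer automatic differentiation.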

As the author of the answer on Stack Overflow (Salvador Dali) points out, symbolic and automatic differentiation look similar but they are different.

Tensorflow uses reverse mode automatic differentiation.

As mentioned above, automatic differentiation uses the chain rule, and there are two possible ways to apply it: from the inside out (forward mode) and from the outside in (reverse mode). The Wikipedia page on automatic differentiation has a couple of step-by-step examples of forward and reverse mode that are quite easy to follow. The reverse mode is a bit harder to see, probably because of the notation used on Wikipedia, but someone made a simple decomposition that is easier to understand.
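The difference between the two modes can be sketched by hand on a small function. The example below is an illustration only, not how Tensorflow is implemented: it assumes z = x*y + sin(x), and the function names forward_mode and reverse_mode are hypothetical.

```python
import math

def forward_mode(x, y, dx, dy):
    """Push a tangent (dx, dy) forward alongside the values."""
    w1, dw1 = x * y, dx * y + x * dy         # product rule
    w2, dw2 = math.sin(x), math.cos(x) * dx  # chain rule through sin
    return w1 + w2, dw1 + dw2

def reverse_mode(x, y):
    """Evaluate z first, then pull the adjoint dz/dz = 1 backwards."""
    w1, w2 = x * y, math.sin(x)
    z = w1 + w2
    z_bar = 1.0
    w1_bar, w2_bar = z_bar, z_bar              # z = w1 + w2
    x_bar = w1_bar * y + w2_bar * math.cos(x)  # x feeds both w1 and w2
    y_bar = w1_bar * x
    return z, (x_bar, y_bar)

x, y = 2.0, 3.0
_, dz_dx = forward_mode(x, y, 1.0, 0.0)  # one forward pass per input
_, dz_dy = forward_mode(x, y, 0.0, 1.0)
_, grads = reverse_mode(x, y)            # a single backward pass gives both
print((dz_dx, dz_dy), grads)
```

Forward mode needs one pass per input, whereas reverse mode obtains all partial derivatives of a scalar output in one backward pass, which is why it suits loss functions with many parameters.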

Gradients

Gradients of common mathematical operations are included in Tensorflow so they can be directly applied during the reverse mode automatic differentiation process. In fact, if you want to implement a new operation it has to inherit from Decop and its gradient has to be “registered” (RegisterGradient). For example, this is what the gradient of [latex s=2]f(x)=sin(x)[/latex] looks like (python/ops/math_grad.py):

@ops.RegisterGradient("Sin")
def _SinGrad(op, grad):
  """Returns grad * cos(x)."""
  x = op.inputs[0]
  with ops.control_dependencies([grad]):
    x = math_ops.conj(x)
    return grad * math_ops.cos(x)

The function tf.gradients is used by every optimizer because they all inherit from the class Optimizer (python/training/optimizer.py). tf.gradients is rarely called directly, but it is used implicitly when calling the function minimize:

train_step = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

More specifically, the minimize function has two tasks:

Calculate gradients
Apply gradients

We can thus call those two operations ourselves (compute_gradients and apply_gradients), or even replace compute_gradients with a direct call to tf.gradients. Here are three alternatives when minimizing a loss function:

with tf.variable_scope("optimization") as scope:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    # Option 1: let minimize do both steps.
    train_step = optimizer.minimize(loss)
    # Option 2: compute and apply the gradients separately.
    grads = optimizer.compute_gradients(loss, var_list=tf.trainable_variables())
    train_step = optimizer.apply_gradients(grads)
    # Option 3: use tf.gradients and pair each gradient with its variable.
    grads = tf.gradients(loss, tf.trainable_variables())
    grads_and_vars = list(zip(grads, tf.trainable_variables()))
    train_step = optimizer.apply_gradients(grads_and_vars)

This decomposition of the function minimize can be useful in some cases. For instance, you may want to process or keep track of the gradients to understand the graph computations, or you may want to calculate the gradients with respect to only some of the variables.
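The idea of splitting minimize into two steps can be sketched without Tensorflow. The code below is a toy illustration under my own assumptions: a linear model y = w*x + b with a mean squared error loss, and two hypothetical helpers named after the optimizer methods. Between the two steps the gradients are clipped, which is the kind of processing the decomposition makes possible.

```python
def compute_gradients(w, b, xs, ys):
    """Gradients of the mean squared error of y = w*x + b w.r.t. w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return [(grad_w, "w"), (grad_b, "b")]

def apply_gradients(params, grads_and_vars, learning_rate=0.01):
    """Plain gradient-descent update for the listed variables only."""
    for grad, name in grads_and_vars:
        params[name] -= learning_rate * grad
    return params

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # samples of y = 2x + 1
params = {"w": 0.0, "b": 0.0}
for _ in range(5000):
    grads = compute_gradients(params["w"], params["b"], xs, ys)
    # Between the two steps we can inspect or modify the gradients,
    # for example clip each one to the range [-1, 1].
    grads = [(max(-1.0, min(1.0, g)), name) for g, name in grads]
    params = apply_gradients(params, grads)
print(params)
```

Passing only some of the (gradient, variable) pairs to apply_gradients would update only those variables, mirroring the var_list use case.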

Simple tf.gradients example

We have the function [latex s=2]z = f(x,y) = 2x-y[/latex] and we want to calculate both of its partial derivatives [latex s=2]\frac{\partial z}{\partial x} = 2[/latex] and [latex s=2]\frac{\partial z}{\partial y} = -1[/latex].

import tensorflow as tf

# Float variables: gradients are defined for floating-point tensors.
x = tf.Variable(1.)
y = tf.Variable(2.)
z = tf.subtract(2*x, y)
grad = tf.gradients(z, [x, y])

sess = tf.Session()
sess.run(tf.global_variables_initializer())

res = sess.run(grad)
print(res) # [2.0, -1.0]

The previous example does not show that, after the derivatives are calculated, tf.gradients also substitutes each variable by its value and evaluates the result, because both derivatives were constants. Another example: [latex s=2]z = f(x,y) = sin(x)-y^3[/latex]. [latex s=2]\frac{\partial z}{\partial x} = cos(x)[/latex], [latex s=2]\frac{\partial z}{\partial y} = -3y^2[/latex]

import tensorflow as tf

x = tf.Variable(3.)
y = tf.Variable(5.)

z = tf.subtract(tf.sin(x), tf.pow(y,3))
grad = tf.gradients(z, [x, y])
#upd = inp - tf.multiply(grad,0.01)
sess = tf.Session()
sess.run(tf.global_variables_initializer())

res = sess.run(grad)
print(res) # [-0.9899925, -75.0] <-> [cos(3), -3*5*5]

Dropout explained and implementation in Tensorflow

June 3, 2018 Juan Miguel Valverde

Dropout

Dropout [1] is an incredibly popular method to combat overfitting in neural networks. The idea behind Dropout is to approximate an exponential number of models and combine them to predict the output. In machine learning, combining different models to tackle a problem (e.g. AdaBoost), or combining models trained on different parts of the dataset, has been shown to perform well. However, when it comes to deep learning this becomes too expensive, and Dropout is a technique to approximate it.

Dropout can be easily implemented by randomly disconnecting some neurons of the network, resulting in what is called a “thinned” network. Thus, if the model has [latex]n[/latex] neurons, there are [latex]2^n[/latex] potential models. Each of them might be trained once or a few times, or even not trained at all. A new random model is generated for every batch, so each batch gets a new dropout mask (to disconnect the corresponding weights). At train time, each neuron is kept with probability [latex]p[/latex] and disconnected with probability [latex]1-p[/latex]. For instance, if [latex]p=0.5[/latex] (the recommended configuration, except for the input layer, where [latex]p=0.8[/latex] is recommended) and we have 200 neurons, the first batch might encounter 90 activated neurons, the second batch might encounter 103 activated neurons, etc.

At test time, no neuron is disconnected; instead, the outgoing weights of every neuron are multiplied by [latex]p[/latex].

In the image above we can see a simple implementation of a standard feedforward pass: weights multiply inputs, the bias is added, and the result is passed to the activation function. The second set of formulas describes what it looks like when we add dropout:

Generate a dropout mask of Bernoulli random variables (e.g. 1.0*(np.random.random(size) < p)).
Apply the mask to the inputs, disconnecting some neurons.
Multiply this new layer by the weights and add the bias.
Finally, apply the activation function.
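The steps above can be sketched for a single neuron in plain Python. This is only an illustration under stated assumptions: p is taken to be the probability of keeping an input (the convention of the paper), ReLU is picked as the activation, and the function name dropout_layer and the sample values are hypothetical.

```python
import random

def dropout_layer(inputs, weights, bias, p, training, rng=random):
    """One neuron with dropout on its inputs; p is the keep probability."""
    if training:
        # 1) Bernoulli mask, 2) apply it to the inputs.
        mask = [1.0 if rng.random() < p else 0.0 for _ in inputs]
        used = [m * x for m, x in zip(mask, inputs)]
        scale = 1.0
    else:
        # Test time: keep every input and scale the weights by p instead.
        used = list(inputs)
        scale = p
    # 3) weights times inputs plus bias, 4) activation (ReLU here).
    z = scale * sum(w * x for w, x in zip(weights, used)) + bias
    return max(0.0, z)

random.seed(0)
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.25, 0.1, 0.2]
print(dropout_layer(x, w, 0.1, p=0.5, training=True))   # depends on the mask
print(dropout_layer(x, w, 0.1, p=0.5, training=False))  # deterministic
```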

All the weights are shared across the potential exponential number of networks, and during backpropagation, only the weights of the “thinned network” will be updated.

How is it implemented in Tensorflow?

In Tensorflow it is implemented in a different way that seems to be equivalent. Let’s have a look at the following example. According to the paper:

Let our neurons be: [latex][1,2,3,4,5,6,7,8][/latex] with [latex]p=0.5[/latex].
At train time, half of the neurons would be randomly disconnected, leading to [latex][1,0,0,4,5,0,7,0][/latex]
At test time, we would have multiplied the whole matrix by p, leading to [latex]0.5*[1,2,3,4,5,6,7,8][/latex]

In other words, we scale down the outputs at test time. In contrast, Tensorflow does it the other way around: it scales up the values at training time by [latex]1/p[/latex]. Following our example:

Let our neurons be: [latex][1,2,3,4,5,6,7,8][/latex] with [latex]p=0.5[/latex].
At train time, half of the neurons are randomly disconnected, leading to [latex]1/0.5*[1,0,0,4,5,0,7,0] = [2,0,0,8,10,0,14,0][/latex]
At test time, we would use [latex]p=1[/latex], leading to [latex]1/1*[1,2,3,4,5,6,7,8][/latex]

In other words, at testing time we treat it as a normal neural network without dropout, and at training time we upscale the values by [latex]1/p[/latex].

The reason why the values are upscaled is to preserve the total sum (approximately).

[latex]sum([1,2,3,4,5,6,7,8]) = 36[/latex]
[latex]sum(1/0.5 * [1,0,0,4,5,0,7,0]) = 34[/latex]

This makes sense because if our layer produces certain output, we want to keep it approximately the same regardless of any method we are using to combat overfitting.
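The scaling can be checked with a few lines of plain Python. This is a sketch of the idea rather than the Tensorflow implementation; the function name inverted_dropout is my own, and the mask is fixed to the one from the example instead of being drawn at random.

```python
def inverted_dropout(values, mask, keep_prob):
    """Zero out masked entries and scale the survivors by 1/keep_prob."""
    return [v * m / keep_prob for v, m in zip(values, mask)]

neurons = [1, 2, 3, 4, 5, 6, 7, 8]
mask = [1, 0, 0, 1, 1, 0, 1, 0]  # neurons 1, 4, 5 and 7 are kept

train_out = inverted_dropout(neurons, mask, keep_prob=0.5)
test_out = inverted_dropout(neurons, [1] * len(neurons), keep_prob=1.0)

print(train_out, sum(train_out))  # [2.0, 0.0, 0.0, 8.0, 10.0, 0.0, 14.0, 0.0] 34.0
print(sum(test_out))              # 36.0
```

With a random mask the training sum fluctuates from batch to batch, but on average it matches the sum without dropout.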

Dropout in Tensorflow

Adding a dropout layer in Tensorflow is really easy.

...
W = tf.get_variable("W",shape=[512,128],initializer=init)
b = tf.get_variable("b",initializer=tf.zeros([128]))

dropped = tf.nn.dropout(prev_layer,keep_prob=current_keep_prob)

dense = tf.matmul(dropped,W)+b
act = tf.nn.relu(dense)
...

Where current_keep_prob will be [latex]p[/latex] during training time and 1 during inference/testing time.

As I mentioned before, only the weights that survive the mask (that is, excluding the ones attached to the dropped-out neurons) will be updated. If I have 100 neurons and [latex]p=0.5[/latex], half of the weights are expected to be updated. In the gif below we can see the evolution of 3 different plots: the first shows how the weights are being updated, the second shows which weights are being updated, and the third is a cumulative sum of the second.

Optimizers comparison with and without Dropout

Due to the nature of AdamOptimizer, which keeps running averages of past gradients, it does not follow this rule of updating only the weights belonging to the “thinned” network, so I found it interesting to compare the performance of several optimizers in a simple neural network.
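Why Adam keeps moving a weight whose current gradient is zero can be seen in a small sketch. It assumes the textbook Adam update with its default hyperparameters; the function adam_step is hypothetical and the actual Tensorflow implementation differs in some details.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar weight."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=0.5, m=m, v=v, t=1)  # neuron active
w_before = w
w, m, v = adam_step(w, grad=0.0, m=m, v=v, t=2)  # neuron dropped: zero gradient
print(w_before, w)  # w still changes, driven by the running average m
```

Plain gradient descent would leave w untouched in the second step, since its update is just the learning rate times the current gradient.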

Test
Goal: To perform a first step to check the performance between Adam, Adadelta, Adagrad and Gradient Descent across different learning rates (1000 learning rates between 1e-6 and 1e-1).
Dataset: EMNIST (47 classes).
Batch size: 8.
Epochs: 1.
Network: conv2d (3,3,1,32), conv2d (3,3,32,64), max pooling (2,2), reshape (12*12*64), dense (12*12*64,128), dense (128,47).
Weights initialization: he_uniform. Bias initialization: zeros.
Activation layers: ReLU (and softmax at the end of the network).
Cost function: cross entropy.

These graphs do not show the actual performance of dropout under each optimizer, since I only tested a range of learning rates without properly examining and focusing on the regions where it seems to provide good results. In addition, it is not fair to compare dropout vs. non-dropout at the same learning rates. Instead, this shows what a first step looks like.

References

1. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. “Dropout: A simple way to prevent neural networks from overfitting”

Image style transfer using convolutional neural networks – Tensorflow implementation

April 24, 2018 Juan Miguel Valverde

Recently I recorded a video explaining in a very simple way how style transfer works in a convolutional neural network (VGG16), based on the incredibly well-written paper by Gatys et al. [1]. I also implemented it in an extremely concise and simple way (around 150 lines with comments).

Original Image

Combinations

The code is provided in the Source code section.

References

1. Gatys L.A., Ecker A.S. & Bethge M. 2016. “Image Style Transfer Using Convolutional Neural Networks”

Difference between L1 and L2 regularization, implementation and visualization in Tensorflow

January 19, 2018 Juan Miguel Valverde

Regularization is a technique used in Machine Learning to penalize complex models. Regularization is useful because simple models generalize better and are less prone to overfitting.

Examples of regularization:

K-means: limiting the splits to avoid redundant classes
Random forests: limiting the tree depth, limiting new features (branches)
Neural networks: limiting the model complexity (weights)

In Deep Learning there are two well-known regularization techniques: L1 and L2 regularization. Both add a penalty to the cost based on the model complexity, so instead of calculating the cost by simply using a loss function, there will be an additional element (called “regularization term”) that will be added in order to penalize complex models.

The theory

L1 regularization (LASSO regression) produces sparse matrices. Strictly speaking, a sparse matrix is one in which most of the elements are zero, but in this context a sparse matrix can also be one with many close-to-zero values and a few larger ones. From the data science point of view this is interesting because we can reduce the number of features. If we find a model with neurons whose weights are close to zero, we don’t need those neurons because the model deactivates them, and we might not need a specific feature/input, leading to a simpler model. For instance, if we have 50 coefficients but only 10 are non-zero, the other 40 are irrelevant for making our predictions. This is interesting not only from the efficiency point of view but also from the economic point of view: gathering data and extracting its features might be a very expensive task (in terms of time and money). Reducing this will benefit us.

Due to the absolute value, L1 regularization introduces a non-differentiable term, but despite that there are methods to minimize it. As we will see below, L1 regularization is also robust to outliers.

L2 regularization (Ridge regression) on the other hand leads to a balanced minimization of the weights. Since L2 uses squares, it emphasizes the errors, and it can be a problem when there are outliers in the data. Unlike L1, L2 has an analytical solution which makes it computationally efficient.

Both regularizations have a λ parameter that is directly proportional to the penalty: the larger λ is, the stronger the penalty on complex models and the more likely the model is to avoid them. Likewise, if λ is zero, regularization is deactivated.

The graphs above show what the functions used in L1 and L2 regularization look like. The penalty in both cases is zero in the center of the plot, but this also implies that the weights are zero and the model will not work. The values of the weights try to be as low as possible to minimize this function, but inevitably they will leave the center and head outwards. In the case of L2 regularization, going in any direction is okay because, as we can see in the plot, the function increases equally in all directions. Thus, L2 regularization mainly focuses on keeping the weights as low as possible.

In contrast, the shape of L1 regularization is diamond-like and the weights are lower in the corners of the diamond. These corners are where one of the axes/features is zero, thus leading to sparse matrices. Note how the shapes of the functions show their differentiability: L2 is smooth and differentiable, and L1 is sharp and non-differentiable.

In short, L2 will aim to find small weight values whereas L1 could put all the values in a single feature.

L1 and L2 regularization methods are also combined in what is called elastic net regularization.
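A small sketch of the three penalties in plain Python may help. The function names and the two example weight vectors are hypothetical, and the penalties are written without the factor of 1/2 that some libraries include in the L2 term.

```python
def l1_penalty(weights, lam):
    """lambda times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda times the sum of squares."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam1, lam2):
    """Elastic net: an L1 term plus an L2 term."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

spread = [0.5, 0.5, 0.5, 0.5]  # small weights on every feature
sparse = [2.0, 0.0, 0.0, 0.0]  # everything on a single feature

print(l1_penalty(spread, 0.3), l1_penalty(sparse, 0.3))
print(l2_penalty(spread, 0.3), l2_penalty(sparse, 0.3))
print(elastic_net_penalty(sparse, 0.3, 0.3))
```

L1 charges both vectors the same, so it has no objection to the sparse one, whereas L2 charges the sparse vector four times as much and therefore pushes towards small, evenly spread weights.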

The practice

One of my motivations to try this out was an “intuitive explanation” of L1 vs. L2 I found on Quora.

From the theoretical point of view it makes sense: L2 emphasizes errors due to the square, and it will try to minimize all of them, so the line will drift a bit from the main trend because a big error has more influence than small errors. On the other hand, for L1 all errors have the same importance (linearly speaking), so it will minimize a lot of errors and stay really close to the main trend even if there are outliers.

I created a small dataset of samples that describes a straight line and I later added noise and some outliers. I created a model with more neurons than needed to solve this problem in order to see whether it works and compare the weight evolution between the methods.

Model characteristics:
-Layers: 1 input, 3 hidden, 1 output
-Sizes: 1,10,10,10,1
-Batch size: 1 (noisier)
-Optimizer: SGD (lr=0.01)
-Lambda: 0.3 (for regularization)

I ran the model 5 times with each regularization method and these are the results.

When the random outliers are sufficiently far away neither method gives good results, but overall the results obtained with L2 were better than those obtained with L1. Then I had a look at the weights. Below, I show the weights and the results obtained with an additional run of the model.

As expected, L1 generates several 0-weighted neurons, so the model doesn’t use them. In other experiments, most of the neurons were disconnected and only a few of them had non-zero weights. On the other hand, L2 minimizes the values of the weights until most of them have a very low value.

Adjusting the network according to L1

As described before, L1 generates sparse matrices with disconnected neurons. If a neuron is disconnected we don’t need it, which leads to simpler models. I ran the script that uses L1 again in order to adjust the model, using fewer neurons according to the ones it disconnects. Using the same samples and running the model 5 times, I got these total errors: 22.68515524, 41.64545712, 4.77383674, 24.04390211, 7.25596004.

The weights in this first run look like this:

I adjusted the neurons of the model from [(1,10),(10,10),(10,10),(10,1)] to [(1,10),(10,10),(10,1),(1,1)], and these are the weights (note: the last big square is a single weight):

Performance on 5 runs: 7.61984439, 13.85177842, 11.95983347, 16.95491162, 25.17294774.

Implementation in Tensorflow

Although the code is provided in the Code page as usual, implementing L1 and L2 takes very few lines: 1) add a regularizer to the weight variables (remember the regularizer returns a value based on the weights), 2) collect all the regularization losses, and 3) add them to the loss function to make the cost larger.

with tf.variable_scope("dense1") as scope:
    W = tf.get_variable("W",shape=[1,10],initializer=tf.contrib.layers.xavier_initializer(),regularizer=tf.contrib.layers.l2_regularizer(lambdaReg))
...
reg_losses = tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
cost = tf.reduce_sum(tf.abs(tf.subtract(pred, y)))+reg_losses

Conclusion

The performance of the model depends heavily on other parameters, especially the learning rate and the number of epochs, and of course on the number of hidden layers. Using a not-so-good model, I compared L1 and L2 performance, and L2 scores were overall better than L1, although L1 has the interesting property of generating sparse matrices.

Hypothetical improvements: This post aimed to show in a very simple and graphic/animated way the effects of L1 and L2. Further research would involve trying more complex models with data that gives stable results. After tuning the parameters to get the best results, one could use cross-validation to better compare the performance.

The code is provided in the Source code section.

Activation Functions in Deep Learning (Sigmoid, ReLU, LReLU, PReLU, RReLU, ELU, Softmax)

January 6, 2018 (updated May 26, 2018) Juan Miguel Valverde

Sigmoid and its main problem

The sigmoid function has long been the activation function par excellence in neural networks; however, it has a serious disadvantage known as the vanishing gradient problem. The sigmoid's values lie in the range (0,1) and, due to its nature, very negative and very positive inputs are mapped to values close to zero and one respectively. In those regions its gradient is close to zero and learning is slow.

This can be easily seen in the backpropagation algorithm (for a simple explanation of backpropagation I recommend watching this video):

[latex]-(y-\hat{y}) f'(z) \frac{\partial z}{\partial W}[/latex]

where [latex]y[/latex] is the ground truth, [latex]\hat{y}[/latex] the prediction, [latex]f'()[/latex] the derivative of the sigmoid function, [latex]z[/latex] the activity of the synapses and [latex]W[/latex] the weights.

The first part, [latex]-(y-\hat{y}) f'(z)[/latex], is called the backpropagated error: it simply multiplies the difference between the ground truth and our prediction by the derivative of the sigmoid evaluated at the activity values. The second part describes the activity of each synapse. In other words, when this activity is comparatively larger in a synapse, that synapse is updated more strongly by the backpropagated error. When a neuron is saturated (one of the bounds of the activation function is reached because the input is very negative or very positive), the backpropagated error will be small because the gradient of the sigmoid is small, resulting in small updates and slow learning. Slow learning is one of the things we really want to avoid in Deep Learning, since training already involves expensive and tedious computations. The figure below shows how the derivative of the sigmoid function is very small for very negative and very positive inputs.

Conclusion: if after several layers we end up with a large value, the backpropagated error will be very small due to the close-to-zero derivative of the sigmoid function.
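The saturation is easy to verify numerically. The following sketch uses only the standard definition of the sigmoid and its derivative; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, round(sigmoid_prime(z), 6))
```

The derivative peaks at 0.25 for z = 0 and drops below 0.0001 at z = ±10, so any error multiplied by it nearly vanishes once the neuron saturates.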

ReLU activation function

The ReLU (Rectified Linear Unit) activation function became a popular choice in deep learning and still provides outstanding results. It was introduced to address the vanishing gradient problem mentioned before. The function is depicted in the figure below.

The function and its derivative:
[latex]f(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x < 0\\ x & \mbox{for} & x \ge 0\end{array} \right.[/latex]
[latex]f'(x) = \left \{ \begin{array}{rcl} 0 & \mbox{for} & x < 0\\ 1 & \mbox{for} & x \ge 0\end{array} \right.[/latex]

To understand why using ReLU, which can be reformulated as [latex]f(x) = max(0,x)[/latex], is a good idea, let's divide the explanation in two parts based on its domain: 1) (-∞,0] and 2) (0,∞).

1) When the synapse activity is zero it makes sense that the derivative of the activation function is zero because there is no need to update as the synapse was not used. Furthermore, if the value is lower than zero, the resulting derivative will also be zero, leading to a disconnection of the neuron (no update). This is a good idea since disconnecting some neurons may reduce overfitting (as co-dependence is reduced); however, it will prevent the neural network from learning in some cases and, in fact, the following activation functions change this part. This is also referred to as zero-sparsity: a sparse network has neurons with few connections.

2) As long as the values are above zero, regardless of how large they are, the gradient of the activation function will be 1, meaning that the neuron can still learn. This solves the vanishing gradient problem present in the sigmoid activation function (at least in this part of the function).
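Both parts of the domain can be seen in a minimal sketch of ReLU and its derivative in plain Python. The derivative at exactly zero is a matter of convention; this sketch takes it to be 1, as in the piecewise formula.

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # At exactly x = 0 the value is a convention; 1 follows the piecewise formula.
    return 1.0 if x >= 0 else 0.0

for x in (-3.0, -0.001, 0.5, 1e6):
    print(x, relu(x), relu_prime(x))
```

However large the positive input is, the derivative stays at 1, while every negative input gets a derivative of 0 and therefore no update.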

Some literature about ReLU [1].

LReLU activation function

Leaky ReLU is a modification of ReLU which replaces the zero part of the function, on (-∞,0), with a small slope, as we can see in the figure and formula below.

The function and its derivative:

[latex]f(x) = \left \{ \begin{array}{rcl} 0.01 x & \mbox{for} & x < 0\\ x & \mbox{for} & x \ge 0\end{array} \right.[/latex]
[latex]f'(x) = \left \{ \begin{array}{rcl} 0.01 & \mbox{for} & x < 0\\ 1 & \mbox{for} & x \ge 0\end{array} \right.[/latex]

The motivation for using LReLU instead of ReLU is that constant zero gradients can also result in slow learning, just as when a saturated neuron uses a sigmoid activation function. Furthermore, some neurons may never activate. According to the authors, sacrificing the zero-sparsity in this way can provide worse results than when the neurons are completely deactivated (ReLU) [2]. In fact, the authors report the same or insignificantly better results when using LReLU instead of ReLU.

PReLU activation function

Parametric ReLU [3] is inspired by LReLU which, as mentioned before, has negligible impact on accuracy compared to ReLU. Based on the same ideas as LReLU, PReLU has the same goal: to increase the learning speed by not deactivating some neurons. In contrast with LReLU, PReLU replaces the value 0.01 with a parameter [latex]a_i[/latex], where [latex]i[/latex] refers to the channel. One could also share the same value across all channels.

The function and its derivative:

[latex]f(x) = \left \{ \begin{array}{rcl} a_i x & \mbox{for} & x < 0\\ x & \mbox{for} & x \ge 0\end{array} \right.[/latex]
[latex]f'(x) = \left \{ \begin{array}{rcl} a_i & \mbox{for} & x < 0\\ 1 & \mbox{for} & x \ge 0\end{array} \right.[/latex]

The following equation shows how these parameters are iteratively updated using the chain rule, just like the weights of the neural network (backpropagation). [latex]\mu[/latex] is the momentum, [latex]\epsilon[/latex] is the learning rate and [latex]\varepsilon[/latex] is the objective function. In the original paper, the initial [latex]a_i[/latex] used is 0.25.

[latex]\Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial \varepsilon}{\partial a_i}[/latex]
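As a rough sketch of how the slope can be learned, the plain Python below implements PReLU for a single value and one momentum update of its parameter. The function names, the momentum and learning rate values, and the choice to subtract the accumulated step from a (a descent direction) are my assumptions for illustration.

```python
def prelu(x, a):
    return x if x >= 0 else a * x

def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to a."""
    return 0.0 if x >= 0 else x

def update_a(a, delta_a, dE_da, momentum=0.9, lr=0.01):
    """delta_a := momentum * delta_a + lr * dE/da, then step against it."""
    delta_a = momentum * delta_a + lr * dE_da
    return a - delta_a, delta_a

a, delta_a = 0.25, 0.0          # 0.25 is the initial value used in the paper
x, upstream = -2.0, 1.0         # a negative input and a made-up upstream gradient
dE_da = upstream * prelu_grad_a(x)  # chain rule
a, delta_a = update_a(a, delta_a, dE_da)
print(prelu(x, a), a)
```

For positive inputs prelu_grad_a returns 0, so only the negative activations contribute to the update of a.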

RReLU activation function

Randomized ReLU was published in a paper [4] that compares its performance with the previous rectified activations. According to the authors, RReLU outperforms the others, and LReLU performs better when [latex]\frac{1}{5.5}[/latex] substitutes 0.01.

The negative slope of RReLU is randomly calculated in each training iteration such that:

[latex]f(x_{ji}) = \left \{ \begin{array}{rcl} \frac{x_{ji}}{a_{ji}} & \mbox{for} & x_{ji} < 0\\ x_{ji} & \mbox{for} & x_{ji} \ge 0\end{array} \right.[/latex]

where
[latex]a_{ji} \sim U(l,u)[/latex]

The motivation to introduce a random negative slope is to reduce overfitting.

[latex]a_{ji}[/latex] is thus a random number from a uniform distribution bounded by [latex]l[/latex] and [latex]u[/latex], where [latex]i[/latex] refers to the channel and [latex]j[/latex] refers to the example. During the testing phase, [latex]a_{ji}[/latex] is fixed to the average of the distribution: [latex]a_{ji} = \frac{l+u}{2}[/latex]. In the paper they use [latex]U(3,8)[/latex], so at test time [latex]a_{ji} = \frac{11}{2}[/latex].
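A minimal sketch of RReLU for a single value in plain Python could look as follows. It assumes the division form (a negative input is divided by a) with the U(3,8) bounds mentioned above; the function name rrelu and its parameters are hypothetical.

```python
import random

def rrelu(x, lower=3.0, upper=8.0, training=True, rng=random):
    """Randomized ReLU: negative inputs are divided by a random a in [lower, upper]."""
    if x >= 0:
        return x
    # Training: sample a new divisor; testing: use the mean of the distribution.
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return x / a

random.seed(0)
print(rrelu(-8.0))                   # varies between calls during training
print(rrelu(-11.0, training=False))  # -11 / 5.5 = -2.0
print(rrelu(2.0))                    # positive inputs pass through unchanged
```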

ELU activation function

Exponential Linear Unit (ELU) is another type of activation function based on ReLU [5]. As other rectified units, it speeds up learning and alleviates the vanishing gradient problem.

Similarly to the previous activation functions, its positive part has a constant gradient of one, so it enables learning and does not saturate a neuron on that side of the function. LReLU, PReLU and RReLU do not ensure noise-robust deactivation since their negative part also consists of a slope, unlike the original ReLU or ELU, which saturate in the negative part of their domain. As explained before, saturation means that the small derivative of the function decreases the information propagated to the next layer.

The activations that are close to zero have a gradient similar to the natural gradient since the shape of the function is smooth, thus enabling faster learning than when the neuron is deactivated (ReLU) or has a non-smooth slope (LReLU).

The function and its derivative:

[latex]f(x) = \left \{ \begin{array}{rcl} \alpha (exp(x) - 1) & \mbox{for} & x \le 0\\ x & \mbox{for} & x > 0\end{array} \right.[/latex]
[latex]f'(x) = \left \{ \begin{array}{rcl} f(x) + \alpha & \mbox{for} & x \le 0\\ 1 & \mbox{for} & x > 0\end{array} \right.[/latex]

In a nutshell:

Gradient of 1 in its positive part.
Deactivation on most of its negative domain.
Close-to-natural gradient in values closer to zero.
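These three properties can be checked with a short sketch of ELU and its derivative in plain Python, taking α = 1 as a default (my choice for the example).

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_prime(x, alpha=1.0):
    # For x <= 0 the derivative is alpha * exp(x), which equals elu(x) + alpha.
    return 1.0 if x > 0 else elu(x, alpha) + alpha

for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_prime(x))
```

For very negative inputs the output saturates at -α and the derivative goes to zero, while near zero the derivative approaches α, so with α = 1 there is no jump in slope at the origin.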

Softmax activation function

For the sake of completeness, let’s talk about softmax, although it is a different type of activation function.

Softmax is commonly used as an activation function in the last layer of a neural network to transform the results into probabilities. Since there is a lot out there written about softmax, I want to give an intuitive and non-mathematical reasoning.

Case 1:
Imagine your task is to classify some input and there are 3 possible classes. Out of the neural network you get the following values (which are not probabilities): [3,0.7,0.5].

It seems that it’s very likely that the input will belong to the first class because the first number is clearly larger than the others. But how likely is it? We can use softmax for this, and we would get the following values: [0.846, 0.085, 0.069].

Case 2:
Now we have the values [1.2,1,1.5]. The last class has a larger value, but this time it is not that certain that the input belongs to that class, although we would probably bet on it. This is clearly represented by the output of the softmax function: [0.316, 0.258, 0.426].

Case 3:
Now we have 10 classes and the values for each class are 1.2 except for the first class which is 1.5: [1.5,1.2,1.2,1.2,1.2,1.2,1.2,1.2,1.2,1.2]. Common sense says that even if the first class has a larger value, this time the model is very uncertain about its prediction since there are a lot of values close to the largest one. Softmax transforms that vector into the following probabilities: [0.13, 0.097, 0.097, 0.097, 0.097, 0.097, 0.097, 0.097, 0.097, 0.097].

Softmax function:

[latex size="25"]\sigma (z)_j = \frac{e^{z_j}}{\sum^K_{k=1} e^{z_k}}[/latex]

In python:

import math

def softmax(z):
    z_exp = [math.exp(i) for i in z]
    sum_z_exp = sum(z_exp)
    return [round(i/sum_z_exp, 3) for i in z_exp]
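As a side note, the direct formula can overflow when the inputs are large, because exp grows very quickly. A common workaround, sketched below with a function name of my own, is to subtract the maximum input before exponentiating; this does not change the result because the factor cancels in the ratio.

```python
import math

def stable_softmax(z):
    """Softmax that subtracts max(z) first so exp() cannot overflow."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in stable_softmax([3, 0.7, 0.5])])     # [0.846, 0.085, 0.069]
print([round(p, 3) for p in stable_softmax([1000.0, 1000.0])])  # [0.5, 0.5]
```

The first call reproduces Case 1, and the second would raise an OverflowError with the direct formula.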

References

1. Nair V. & Hinton G.E. 2010. “Rectified Linear Units Improve Restricted Boltzmann Machines”
2. Maas A., Hannun A.Y & Ng A.Y. 2013. “Rectifier Nonlinearities Improve Neural Network Acoustic Models”
3. He K., Zhang X., Ren S. & Sun J. 2015. “Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”
4. Xu B., Wang N., Chen T. & Li M. 2015. “Empirical Evaluation of Rectified Activations in Convolutional Network”
5. Clevert D.A., Unterthiner T. & Hochreiter S. 2016. “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”