Gradients in Tensorflow

The chain rule in Tensorflow

Working with any type of neural network involves dealing with the backpropagation algorithm, so it is key to understand concepts such as derivatives, the chain rule and gradients. It is often important not only to understand them theoretically but also to be able to play around with them, and that is the goal of this post. Most of the information presented has been collected from different Stack Overflow posts.

There are three types of differentiation:

  • Numerical: it uses the definition of the derivative (lim) to approximate the result.
  • Symbolic: manipulation of mathematical expressions (the one we learn in high school).
  • Automatic: repeatedly using the chain rule to break down the expression and use simple rules to obtain the result.
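To make the distinction more concrete, here is a tiny sketch (plain Python, no Tensorflow) comparing a numerical approximation of the derivative of sin(x) with the exact derivative cos(x) that symbolic or automatic differentiation would give:

import math

def numerical_derivative(f, x, h=1e-6):
  # Central finite difference: approximates the limit in the definition.
  return (f(x + h) - f(x - h)) / (2 * h)

x = 3.0
print(numerical_derivative(math.sin, x))  # ~ -0.9899925 (approximation)
print(math.cos(x))                        # -0.9899925 (exact derivative)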

As Salvador Dali points out in his Stack Overflow answer, symbolic and automatic differentiation look similar but they are different.

Tensorflow uses reverse mode automatic differentiation.

As mentioned above, automatic differentiation uses the chain rule, so there are two possible ways to apply it: from inside to outside (forward mode) and vice versa (reverse mode). The Automatic differentiation Wikipedia page has a couple of step-by-step examples of forward and reverse mode that are quite easy to follow. The reverse mode is a bit harder to see, probably because of the notation Wikipedia introduces, but someone wrote a simpler decomposition that is easier to understand.
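To get an intuition for the reverse mode, here is a small hand-written sketch (plain Python, not Tensorflow) on the expression [latex s=2]z = x_1 x_2 + \sin(x_1)[/latex], the same kind of decomposition used in the Wikipedia examples: the forward pass stores every intermediate value and the reverse pass propagates the derivative of the output back to the inputs with the chain rule.

import math

x1, x2 = 3.0, 5.0

# Forward pass: evaluate and store every intermediate value.
v1 = x1 * x2        # v1 = x1 * x2
v2 = math.sin(x1)   # v2 = sin(x1)
z = v1 + v2         # output

# Reverse pass: start from dz/dz = 1 and apply the chain rule backwards.
dz_dz = 1.0
dz_dv1 = dz_dz * 1.0                          # z = v1 + v2
dz_dv2 = dz_dz * 1.0
dz_dx1 = dz_dv1 * x2 + dz_dv2 * math.cos(x1)  # x1 contributes through v1 and v2
dz_dx2 = dz_dv1 * x1

print(dz_dx1, dz_dx2)  # x2 + cos(x1) = 5 + cos(3), and x1 = 3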

Gradients

Gradients of common mathematical operations are included in Tensorflow so they can be directly applied during the reverse mode automatic differentiation process. In fact, if you want to implement a new operation, its kernel has to inherit from OpKernel and its gradient has to be "registered" (RegisterGradient). For example, this is what the derivative of [latex s=2]f(x)=\sin(x)[/latex] looks like (python/ops/math_grad.py):

@ops.RegisterGradient("Sin")
def _SinGrad(op, grad):
  """Returns grad * cos(x)."""
  x = op.inputs[0]
  with ops.control_dependencies([grad]):
    x = math_ops.conj(x)
    return grad * math_ops.cos(x)
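The same RegisterGradient mechanism is available from user code. As a minimal sketch (assuming Tensorflow 1.x graph mode), the following registers a hypothetical gradient function called "PrintGrad" and tells the graph to use it in place of the gradient of Identity, a common trick to inspect the gradients flowing through a node:

import tensorflow as tf

@tf.RegisterGradient("PrintGrad")
def _print_grad(op, grad):
  # Print the incoming gradient and pass it through unchanged.
  return tf.Print(grad, [grad], message="gradient: ")

x = tf.Variable(3.)
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "PrintGrad"}):
  y = tf.identity(x)
z = tf.square(y)
grad = tf.gradients(z, [x])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(grad))  # [6.0], i.e. dz/dx = 2*x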

The function tf.gradients is used by every optimizer, since they all inherit from the class Optimizer (python/training/optimizer.py). tf.gradients is not commonly called directly, but it is used implicitly when calling minimize:

train_step = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

More specifically, the minimize function performs two tasks:

  1. Calculate gradients
  2. Apply gradients

We can thus call those two steps ourselves, or even break down compute_gradients further and use tf.gradients. Here are three alternative ways to minimize a loss function:

with tf.variable_scope("optimization") as scope:
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  # Option 1: let minimize do both steps
  train_step = optimizer.minimize(loss)
  # Option 2: compute and apply the gradients explicitly
  grads = optimizer.compute_gradients(loss, var_list=tf.trainable_variables())
  train_step = optimizer.apply_gradients(grads)
  # Option 3: use tf.gradients and pair each gradient with its variable
  grads = tf.gradients(loss, tf.trainable_variables())
  grads_and_vars = list(zip(grads, tf.trainable_variables()))
  train_step = optimizer.apply_gradients(grads_and_vars)

This decomposition of minimize can be useful in some cases. For instance, you may want to process or keep track of the gradients to understand the computations in the graph, or you may want to calculate the gradients with respect to only some variables and not all of them.
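For example, a minimal sketch (reusing the optimizer and loss from the snippet above) of processing the gradients before applying them, in this case clipping them to avoid exploding gradients, could be:

# Compute the gradients, clip each one to [-1, 1], then apply them.
grads_and_vars = optimizer.compute_gradients(loss, var_list=tf.trainable_variables())
clipped = [(tf.clip_by_value(g, -1., 1.), v)
           for g, v in grads_and_vars if g is not None]
train_step = optimizer.apply_gradients(clipped)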

Simple tf.gradients example

We have the function [latex s=2]z = f(x,y) = 2x-y[/latex] and we want to calculate both of its partial derivatives [latex s=2]\frac{\partial z}{\partial x} = 2[/latex] and [latex s=2]\frac{\partial z}{\partial y} = -1[/latex].

import tensorflow as tf

x = tf.Variable(1)
y = tf.Variable(2)
z = tf.subtract(2*x, y)
grad = tf.gradients(z, [x, y])

sess = tf.Session()
sess.run(tf.global_variables_initializer())

res = sess.run(grad)
print(res) # [2, -1]

The previous example does not show that, after the derivative expressions are built, tf.gradients also evaluates them at the current values of the variables. Another example: [latex s=2]z = f(x,y) = \sin(x)-y^3[/latex], with [latex s=2]\frac{\partial z}{\partial x} = \cos(x)[/latex] and [latex s=2]\frac{\partial z}{\partial y} = -3y^2[/latex].

import tensorflow as tf

x = tf.Variable(3.)
y = tf.Variable(5.)

z = tf.subtract(tf.sin(x), tf.pow(y,3))
grad = tf.gradients(z, [x, y])
sess = tf.Session()
sess.run(tf.global_variables_initializer())

res = sess.run(grad)
print(res) # [-0.9899925, -75.0] <-> [cos(3), -3*5*5]
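Since the returned gradients are regular tensors, they can also be used to write a gradient descent step by hand, which is essentially what GradientDescentOptimizer does internally. A minimal sketch reusing x, y, grad and sess from above:

learning_rate = 0.01
# v <- v - learning_rate * dz/dv for each variable.
updates = [tf.assign(v, v - learning_rate * g) for v, g in zip([x, y], grad)]
sess.run(updates)
print(sess.run([x, y]))  # [~3.0099, ~5.75]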
