Considering the example code, I would like to know how to apply gradient clipping to this network, where there is a possibility of exploding gradients in the RNN.
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
This is an example that could be used, but where do I introduce it? In the definition of the RNN?
lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
_X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)
But this doesn't make sense, as the tensor _X is the input, not the gradient, which is what should be clipped.
Do I have to define my own optimizer for this, or is there a simpler option?
Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
In order to clip your gradients, you'll need to explicitly compute, clip, and apply them as described in this section of TensorFlow's API documentation. Specifically, you'll need to substitute the call to the minimize() method with something like the following:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
Clipping each gradient matrix individually changes their relative scale but is also possible:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:
optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
    loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
clip_by_global_norm() is also described as the correct way to perform gradient clipping in the TensorFlow docs: tensorflow.org/versions/r1.2/api_docs/python/tf/…
You can log tf.global_norm(gradients) to see its usual range, and then clip a bit above that to prevent outliers from messing up the training.
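To make that logging suggestion concrete: tf.global_norm computes the square root of the sum of squared L2 norms over all gradient tensors. A minimal pure-Python sketch of that quantity (the helper name global_norm is my own, and flat lists of floats stand in for tensors):

```python
import math

def global_norm(tensors):
    # Square root of the sum of squared L2 norms across all tensors,
    # mirroring what tf.global_norm computes.
    return math.sqrt(sum(x * x for t in tensors for x in t))

# Two "gradient tensors" represented as flat lists of floats.
grads = [[3.0, 4.0], [0.0, 12.0]]
# ||g1|| = 5, ||g2|| = 12, global norm = sqrt(25 + 144) = 13
print(global_norm(grads))  # → 13.0
```

Logging this value every few steps tells you what a "normal" gradient magnitude looks like for your model, so you can pick a clip threshold just above it.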
Would you then still call opt.minimize() after, or would you call something different like opt.run(), as is suggested in some of the comments on other answers?
optimizer.minimize(loss) is just a shorthand for computing and applying the gradients. You can run the example in my answer with sess.run(optimize).
If I return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op) within an experiment function, then your optimize would replace my train_op, correct? Right now my train_op = optimizer.minimize(loss, global_step=global_step), so I'm trying to make sure I adjust accordingly...
It's easy for tf.keras!
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
This optimizer will clip all gradients to values between [-1.0, 1.0]. See the docs.
If we use a custom training loop and call optimizer.apply_gradients ourselves, we need to clip the gradients before calling this method. In that case, we need gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients] followed by .apply_gradients.
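The clip-then-apply pattern above is just a list transformation over (gradient, variable) pairs. A minimal sketch with plain Python numbers standing in for tensors (clip_value is a hypothetical stand-in for tf.clip_by_value):

```python
def clip_value(g, lo, hi):
    # Stand-in for tf.clip_by_value on a scalar "gradient".
    return max(lo, min(hi, g))

# (gradient, variable-name) pairs, as returned by compute_gradients-style APIs.
gradients = [(2.5, "w"), (-0.3, "b"), (-7.0, "u")]

# Each pair keeps its variable; only the gradient is clipped.
capped = [(clip_value(grad, -1.0, 1.0), var) for grad, var in gradients]
print(capped)  # → [(1.0, 'w'), (-0.3, 'b'), (-1.0, 'u')]
```

The same shape of transformation is what you pass to apply_gradients in the TensorFlow case.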
There is also clipnorm, and apparently global_clipnorm: optimizer = tf.keras.optimizers.Adam(global_clipnorm=5.0)
This is actually properly explained in the documentation:
Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:
1. Compute the gradients with compute_gradients().
2. Process the gradients as you wish.
3. Apply the processed gradients with apply_gradients().
And in the example they provide they use these 3 steps:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
Would you then still call opt.minimize() after, or would you call something different like opt.run(), as is suggested in some of the comments on other answers?
Yes, you'd assign opt.apply_gradients(...) to a variable like train_step, for example (just like you would for opt.minimize()). And then in your main loop you call it as usual to train: sess.run([train_step, ...], feed_dict)
For those who would like to understand the idea of gradient clipping (by norm):
Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.
Let the gradient be g and the max_norm_threshold be j. Now, if ||g|| > j, we do:
g = (j * g) / ||g||
This is the implementation done in tf.clip_by_norm.
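That rule is a few lines of code. A minimal pure-Python sketch for a gradient represented as a list of floats (the helper name clip_by_norm is my own, mirroring the semantics described above rather than TensorFlow's actual implementation):

```python
import math

def clip_by_norm(g, j):
    # If ||g|| > j, rescale g to (j * g) / ||g||; otherwise return g unchanged.
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= j:
        return g
    return [x * j / norm for x in g]

g = [6.0, 8.0]                        # ||g|| = 10
print(clip_by_norm(g, 5.0))           # → [3.0, 4.0], norm is now exactly 5
print(clip_by_norm([1.0, 2.0], 5.0))  # unchanged: norm ≈ 2.24 ≤ 5
```

Note that the direction of the gradient is preserved; only its magnitude is scaled down to the threshold.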
IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:
original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)
This way you only have to define it once, and not run it after every gradient calculation.
Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm
Gradient clipping basically helps in case of exploding or vanishing gradients. Say your loss is too high, which will result in exponential gradients flowing through the network, which may result in NaN values. To overcome this, we clip the gradients within a specific range (-1 to 1, or any range as per the condition).
clipped_value = [(tf.clip_by_value(grad, -clip_range, clip_range), var) for grad, var in grads_and_vars]
where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients) and the variables they will be applied to.
After clipping, we simply apply the clipped gradients using the optimizer: optimizer.apply_gradients(clipped_value)
In the example code, optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost), and then an iteration of the optimizer is done as optimizer.run(), but using optimizer.run() does not seem to work in this case?
optimizer.apply_gradients(capped_gvs) needs to be assigned to something, x = optimizer.apply_gradients(capped_gvs) for example; then within your session you can train as x.run(...)
UserWarning: Converting sparse IndexedSlices to a dense Tensor with 148331760 elements. This may consume a large amount of memory.
So somehow my sparse gradients are converted to dense. Any idea how to overcome this problem?
(Using tf.clip_by_global_norm, as suggested by @danijar.)