
How to apply gradient clipping in TensorFlow?

Considering the example code, I would like to know how to apply gradient clipping to this network, i.e. to the RNN, where there is a possibility of exploding gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

This is an example that could be used, but where do I introduce it? In the definition of the RNN?

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
    tf.clip_by_value(_X, -1, 1, name=None)

But this doesn't make sense, as the tensor _X is the input, not the gradient that is to be clipped.

Do I have to define my own Optimizer for this or is there a simpler option?


Styrke

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.

In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section in TensorFlow's API documentation. Specifically you'll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

Styrke, thanks for the post. Do you know what the next steps are to actually run an iteration of the optimizer? Typically, an optimizer is instantiated as optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost), and then an iteration is run as optimizer.run(), but using optimizer.run() does not seem to work in this case?
OK, got it: optimizer.apply_gradients(capped_gvs) needs to be assigned to something, e.g. x = optimizer.apply_gradients(capped_gvs), and then within your session you can train as x.run(...).
Shout-out to @remi-cuingnet for the nice edit suggestion. (Which unfortunately was rejected by hasty reviewers)
This gives me UserWarning: Converting sparse IndexedSlices to a dense Tensor with 148331760 elements. This may consume a large amount of memory. So somehow my sparse gradients are converted to dense. Any idea how to overcome this problem?
Actually the right way to clip gradients (according to tensorflow docs, computer scientists, and logic) is with tf.clip_by_global_norm, as suggested by @danijar
danijar

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

Clipping each gradient matrix individually is also possible, but it changes their relative scale:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
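To see the difference between the two variants, here is a small pure-Python sketch of the arithmetic behind them (not the TF API, just the math): per-tensor clipping rescales only the tensors whose norm exceeds the threshold, while global-norm clipping scales every tensor by the same factor.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip_by_norm(v, clip):
    # Rescale a single vector only if its own L2 norm exceeds the threshold.
    n = l2_norm(v)
    return list(v) if n <= clip else [clip * x / n for x in v]

def clip_by_global_norm(tensors, clip):
    # Scale *all* tensors by one common factor so the joint norm <= clip.
    g = math.sqrt(sum(l2_norm(t) ** 2 for t in tensors))
    scale = min(1.0, clip / g)
    return [[scale * x for x in t] for t in tensors]

g1, g2 = [6.0, 8.0], [0.3, 0.4]   # norms 10.0 and 0.5, ratio 20:1

per_tensor = [clip_by_norm(g, 5.0) for g in (g1, g2)]
global_clip = clip_by_global_norm([g1, g2], 5.0)

# Per-tensor clipping shrinks only g1, so the ratio drops to 10:1;
# global clipping scales both by the same factor and keeps 20:1.
```

This is why global-norm clipping preserves the direction of the overall gradient vector, whereas per-tensor clipping can distort it.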

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

Good example with clip_by_global_norm()! This is also described as the correct way to perform gradient clipping in tensorflow docs: tensorflow.org/versions/r1.2/api_docs/python/tf/…
@Escachator It's empirical and will depend on your model and possibly the task. What I do is visualize the gradient norm tf.global_norm(gradients) to see its usual range, and then clip a bit above that to prevent outliers from messing up the training.
would you still call opt.minimize() after or would you call something different like opt.run() as is suggested in some of the comments on other answers?
@reese0106 No, optimizer.minimize(loss) is just a shorthand for computing and applying the gradients. You can run the example in my answer with sess.run(optimize).
So if I were using tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op) within an experiment function, then your optimize would replace my train_op correct? Right now my train_op = optimizer.minimize(loss, global_step=global_step)) so I'm trying to make sure I adjust accordingly...
Nicolas Gervais

It's easy for tf.keras!

optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)

This optimizer will clip all gradients to values between [-1.0, 1.0].

See the docs.
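As a side note, clipvalue clamps each gradient component element-wise, which (unlike norm-based clipping) can change the gradient's direction. A plain-Python sketch of the operation it performs on each tensor:

```python
def clip_by_value(g, lo, hi):
    # Element-wise clamp, as clipvalue applies to each gradient component.
    return [min(max(x, lo), hi) for x in g]

g = [0.5, 3.0]                        # points mostly along the second axis
print(clip_by_value(g, -1.0, 1.0))    # [0.5, 1.0] -- direction has changed
```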


Also, if we do custom training and use optimizer.apply_gradients, we need to clip the gradients before calling this method. In that case we need gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients] followed by .apply_gradients.
It also supports clipnorm and apparently global_clipnorm: optimizer = tf.keras.optimizers.Adam(global_clipnorm=5.0)
Vishnuvardhan Janapati

This is actually properly explained in the documentation:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps: Compute the gradients with compute_gradients(). Process the gradients as you wish. Apply the processed gradients with apply_gradients().

And in the example they provide they use these 3 steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.


would you still call opt.minimize() after or would you call something different like opt.run() as is suggested in some of the comments on other answers?
@reese0106 No, you need to assign opt.apply_gradients(...) to a variable, e.g. train_step (just like you would for opt.minimize()). Then in your main loop you call it as usual to train: sess.run([train_step, ...], feed_dict)
Keep in mind that the gradient is defined as the vector of derivatives of the loss with respect to all parameters in the model. TensorFlow represents it as a Python list that contains a tuple for each variable and its gradient. This means that to clip the gradient norm, you cannot clip each tensor individually; you need to consider the list at once (e.g. using tf.clip_by_global_norm(list_of_tensors)).
404 on the link
kmario23

For those who would like to understand the idea of gradient clipping (by norm):

Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.

Let the gradient be g and the max_norm_threshold be j.

Now, if ||g|| > j , we do:

g = ( j * g ) / ||g||

This is the implementation used in tf.clip_by_norm
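A minimal pure-Python sketch of this rule (taking the threshold j = 1.0 as an example), which you can check against the formula above:

```python
import math

def clip_grad_norm(g, j):
    # If ||g|| > j, rescale g to have norm exactly j; otherwise leave it.
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= j:
        return list(g)
    return [j * x / norm for x in g]

g = [3.0, 4.0]                   # ||g|| = 5
clipped = clip_grad_norm(g, 1.0)
print(clipped)                   # [0.6, 0.8]: same direction, norm 1
```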


If I need to select the threshold by hand, is there any common method for doing this?
This is sort of black magic suggested in some papers. Otherwise, you have to run a lot of experiments to find out which value works best.
LouYu

IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

This way you only have to define it once, and not rerun it after every gradient calculation.

Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm
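To illustrate what the wrapper does, here is a pure-Python sketch of the decorator idea using a toy optimizer rather than the real TF classes (all names below are made up for illustration): the decorator intercepts the gradients inside minimize(), clips their global norm, and then delegates, so callers never handle gradients themselves.

```python
import math

class ToyOptimizer:
    """Toy stand-in for a TF1-style optimizer (not the real API)."""
    def __init__(self, lr=1.0):
        self.lr = lr

    def compute_gradients(self, grads_and_vars):
        # Real TF would differentiate a loss; here the caller supplies
        # (gradient, variable) pairs directly.
        return grads_and_vars

    def apply_gradients(self, grads_and_vars):
        # Plain SGD step; returns the updated variable values.
        return [var - self.lr * g for g, var in grads_and_vars]


class ClipGradientsByNorm:
    """Decorator: clips the global norm before delegating to the inner
    optimizer, so minimize() needs no gradient handling by the caller."""
    def __init__(self, inner, clip_norm):
        self.inner, self.clip_norm = inner, clip_norm

    def minimize(self, grads_and_vars):
        gvs = self.inner.compute_gradients(grads_and_vars)
        gnorm = math.sqrt(sum(g * g for g, _ in gvs))
        scale = min(1.0, self.clip_norm / gnorm) if gnorm > 0 else 1.0
        return self.inner.apply_gradients([(scale * g, v) for g, v in gvs])


opt = ClipGradientsByNorm(ToyOptimizer(lr=1.0), clip_norm=5.0)
print(opt.minimize([(6.0, 0.0), (8.0, 0.0)]))   # -> [-3.0, -4.0]
```

The global norm of (6, 8) is 10, so both gradients are scaled by 0.5 before the SGD step, mirroring what the tf.contrib wrapper does around the real optimizer.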


Not supported with mixed precision
Only for tensorflow 1.x
Raj

Gradient clipping basically helps in the case of exploding or vanishing gradients. Say your loss is too high, which results in exponentially large gradients flowing through the network, which may produce NaN values. To overcome this, we clip the gradients to a specific range (-1 to 1, or any range that fits the situation).

clipped_value = [(tf.clip_by_value(grad, -clip_range, clip_range), var) for grad, var in grads_and_vars]

where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients()) and the variables they will be applied to.

After clipping, we simply apply the clipped values using the optimizer: optimizer.apply_gradients(clipped_value)