
How to do gradient clipping in PyTorch?

What is the correct way to perform gradient clipping in PyTorch?

I have an exploding gradients problem.

@pierrom Thanks. I found that thread myself. I thought that asking here would save everyone who comes after me and googles for a quick answer the hassle of reading through the whole discussion (which I haven't finished myself yet), and give them a quick answer, Stack Overflow style. Going to forums to find answers reminds me of 1990. If no one else posts the answer before me, I will once I find it.

Mateen Ulhaq

A more complete example from here:

optimizer.zero_grad()                          # reset gradients from the previous step
loss, hidden = model(data, hidden, targets)    # forward pass
loss.backward()                                # backward pass: compute gradients

# Clip the global norm of all gradients to args.clip before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()                               # apply the (clipped) gradients

Why is this more complete? I see it has more votes, but I don't really understand why it is better. Can you explain, please?
This simply follows a popular pattern, where one inserts torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) between loss.backward() and optimizer.step().
What is args.clip?
Does it matter whether you call opt.zero_grad() before the forward pass or not? My guess is that the sooner the gradients are zeroed out, the sooner their memory can be freed.
@FarhangAmaji It is the max_norm (clipping threshold) value taken from args (likely parsed with the argparse module).
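
For context, a minimal sketch of how such an args.clip value is typically wired up; the flag name, default, and help text below are illustrative assumptions, not taken from the linked example:

import argparse

# Hypothetical argument parsing; flag name and default are made up for illustration.
parser = argparse.ArgumentParser()
parser.add_argument('--clip', type=float, default=0.25, help='max norm for gradient clipping')
args = parser.parse_args()

# args.clip is then passed as max_norm:
# torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
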
Ivan

clip_grad_norm (which is deprecated in favor of clip_grad_norm_, following the more consistent convention of a trailing _ when the modification is performed in-place) clips the norm of the overall gradient by treating all parameters passed to the function as a single concatenated vector, as can be seen in the documentation:

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
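
As a small illustration (the model and threshold below are placeholders, not from the question), clip_grad_norm_ rescales all gradients together and returns the total norm measured before clipping:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder model
model(torch.randn(4, 10)).sum().backward()    # produce some gradients

# Rescales all gradients in-place so their combined 2-norm is at most max_norm,
# and returns the total norm as it was before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)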

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)

Another option is to register a backward hook. A hook takes the current gradient as input and may return a tensor that will be used in place of the previous gradient, i.e. modifying it. The hook is called every time a gradient has been computed, so there is no need to clip manually once it has been registered:

for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

It is worth mentioning here that these two approaches are NOT equivalent. The hook-based approach is definitely what most people want: it clips gradients DURING backpropagation, whereas the first approach clips them only AFTER the entire backward pass has finished.
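
To make the contrast concrete, here is a minimal sketch of both approaches; the model, batch, optimizer, and clip_value are placeholders and not part of the original answer:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer
x = torch.randn(4, 10)                                    # placeholder batch
clip_value = 0.5                                          # placeholder threshold

# Approach 1: clip AFTER the entire backward pass, on the accumulated .grad tensors.
optimizer.zero_grad()
model(x).sum().backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
optimizer.step()

# Approach 2: register hooks once (e.g. right after building the model);
# each parameter's gradient is clamped as soon as it is computed, DURING backward.
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

optimizer.zero_grad()
model(x).sum().backward()   # gradients arrive already clamped
optimizer.step()
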
Nikita

Reading through the forum discussion gave this:

clipping_value = 1 # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than only this code snippet.


hkchengrex

And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping


Charles Xu

Well, I ran into the same error. I tried gradient norm clipping, but it didn't work for me.

I didn't want to change the network or add regularizers, so I switched the optimizer to Adam, and that worked.

Then I used the Adam-pretrained model to initialize training and ran SGD + momentum for fine-tuning. It is now working.
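
A rough sketch of that two-stage workflow; the stand-in model, checkpoint path, learning rates, and momentum below are illustrative assumptions:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # stand-in for the actual network

# Stage 1: train with Adam, which avoided the NaNs in this case.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)        # lr is illustrative
# ... usual training loop ...
torch.save(model.state_dict(), 'pretrained_adam.pt')             # hypothetical checkpoint path

# Stage 2: reload the Adam-pretrained weights and fine-tune with SGD + momentum.
model.load_state_dict(torch.load('pretrained_adam.pt'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# ... fine-tuning loop ...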


what do you mean by "doesn't work"?
Still gives a 'nan'