Common causes of nans during training of neural networks

machine-learning neural-network deep-learning caffe gradient-descent

I've noticed that NaNs are frequently introduced during training.

Oftentimes they seem to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.

Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?

The overarching question here is simply: what is the most common reason for NaNs to occur during training? And secondly, what are some methods for combating this (and why do they work)?

desertnaut

I came across this phenomenon several times. Here are my observations:

Gradient blow up

Reason: large gradients throw the learning process off-track.

What you should expect: Look at the per-iteration loss values in the runtime log. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually it becomes too large to be represented by a floating-point variable and turns into nan.

What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
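To illustrate why a smaller step size helps, here is a minimal, framework-free Python sketch (not Caffe code). It runs plain gradient descent on a toy quadratic loss; the function `train` and its parameters `curvature` and `w0` are made up for this illustration. Under these assumptions, the larger rate makes the loss grow every iteration, overflow to inf and finally turn into nan, while a rate one order of magnitude smaller converges.

```python
import math

def train(lr, steps=1000, curvature=4.0, w0=1.0):
    """Gradient descent on the toy loss 0.5 * curvature * w**2; returns the loss per step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = curvature * w      # derivative of the loss with respect to w
        w = w - lr * grad         # plain SGD update
        losses.append(0.5 * curvature * w * w)
    return losses

diverged = train(lr=1.0)    # each update multiplies w by (1 - 4.0) = -3
converged = train(lr=0.1)   # each update multiplies w by roughly 0.6

# First iteration whose loss is no longer a finite number (inf or nan).
first_bad = next(i for i, loss in enumerate(diverged) if not math.isfinite(loss))
print("loss stops being finite at iteration", first_bad)
print("final loss, lr=1.0:", diverged[-1])
print("final loss, lr=0.1:", converged[-1])
```

The same pattern (steadily growing loss, then inf, then nan) is what to look for in the training log.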

Bad learning rate policy and params

Reason: Caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead. This invalid rate multiplies all updates and thus invalidates all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
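The "poly" case can be sketched in a few lines of Python. This assumes the policy has the form base_lr * (1 - iter/max_iter) ^ power and that an unset max_iter behaves as 0; please check both against your Caffe version. The helper `c_float_div` is hypothetical and only mimics C-style float division, because Python raises on division by zero where C returns inf or nan.

```python
import math

def c_float_div(a, b):
    """Divide like C/IEEE-754 floats do: x/0 gives inf or nan instead of raising."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0 or math.isnan(a):
            return math.nan                  # 0/0 is nan
        return math.copysign(math.inf, a)    # x/0 is +inf or -inf

def poly_lr(base_lr, iteration, max_iter, power):
    """Assumed form of the "poly" policy: base_lr * (1 - iteration/max_iter) ** power."""
    return base_lr * (1.0 - c_float_div(float(iteration), float(max_iter))) ** power

print(poly_lr(0.01, iteration=0, max_iter=0, power=1.0))       # max_iter forgotten
print(poly_lr(0.01, iteration=500, max_iter=1000, power=1.0))  # max_iter set
```

With max_iter missing, iteration 0 already divides 0 by 0, which would explain a nan learning rate on the very first log line.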

Faulty Loss function

Reason: Sometimes the computation of the loss in the loss layers causes NaNs to appear, for example when feeding an InfogainLoss layer with non-normalized values or using a custom loss layer with bugs.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add printout to the loss layer and debug the error.

For example: I once used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced NaNs. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
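A rough Python sketch of that failure mode follows. It is not the actual loss layer; `frequency_normalized_loss`, `per_label_mean` and the `skip_absent` switch are hypothetical names. The per-label penalty is divided by the label's count in the batch, so a label that never appears yields 0/0. Skipping absent labels is shown only as one possible alternative guard; the fix described above was simply to use larger batches.

```python
import math

def per_label_mean(total, count):
    # In C/C++ float arithmetic 0.0 / 0.0 silently yields nan. Python would
    # raise ZeroDivisionError instead, so the nan is returned explicitly here.
    return total / count if count else math.nan

def frequency_normalized_loss(penalties, labels, num_labels, skip_absent=False):
    """Average over labels of (summed penalty / label count in the batch)."""
    totals = [0.0] * num_labels
    counts = [0] * num_labels
    for penalty, label in zip(penalties, labels):
        totals[label] += penalty
        counts[label] += 1
    terms = []
    for total, count in zip(totals, counts):
        if skip_absent and count == 0:
            continue  # guard: ignore labels that do not occur in this batch
        terms.append(per_label_mean(total, count))
    return sum(terms) / len(terms)

penalties = [0.2, 0.4, 0.6]
labels = [0, 0, 1]  # label 2 never appears in this batch
print(frequency_normalized_loss(penalties, labels, num_labels=3))
print(frequency_normalized_loss(penalties, labels, num_labels=3, skip_absent=True))
```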

Faulty input

Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: Rebuild your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
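If the data can be loaded outside of Caffe, a similar check can be done offline. The sketch below is an alternative to the dummy net, not a replacement for it, and it assumes each sample is already available as a flat sequence of floats; `find_faulty_samples` is a hypothetical helper name.

```python
import math

def find_faulty_samples(samples):
    """Return the indices of samples that contain at least one nan or inf value."""
    return [
        index
        for index, sample in enumerate(samples)
        if not all(math.isfinite(value) for value in sample)
    ]

samples = [
    [0.1, 0.2],
    [float("nan"), 1.0],   # faulty: contains nan
    [3.0, float("inf")],   # faulty: contains inf
    [0.0],
]
print(find_faulty_samples(samples))
```

The reported indices tell you which records to inspect or drop before rebuilding the dataset.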

Stride larger than kernel size in "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in NaNs. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in NaNs in y.

Instabilities in "BatchNorm"

It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.

Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' makes Caffe print more debug information to the log (including gradient magnitudes and activation values) during training. This information can help in spotting gradient blow-ups and other problems in the training process.

Thanks, how does one interpret those numbers? What are these numbers? pastebin.com/DLYgXK5v Why is there only one number per layer output? What should those numbers look like so that one can tell whether there is a problem or not?

@Hossein this is exactly what this post is all about.

Thanks for this answer. I am getting a NaN loss for an image segmentation application trained with Dice loss (even after adding a small epsilon/smoothness constant). My dataset contains some images whose corresponding ground truth does not contain any foreground label, and when I removed these images from training, the loss stabilized. I am not sure why that is.

@samrairshad have you tried increasing the epsilon in the Dice loss?

Yes, I did. I opened a post on Stack Overflow and pasted the loss evolution for some epochs. Here's the reference: stackoverflow.com/questions/62259112/…

desertnaut

In my case, not setting the bias in the convolution/deconvolution layers was the cause.

Solution: add the following to the convolution layer parameters.

bias_filler {
  type: "constant"
  value: 0
}

How would that look in MatConvNet? I have something like 'biases'.init_bias*ones(1,4,single)

Przemek D

This answer is not about a cause for nans, but rather proposes a way to help debug it. You can have this python layer:

import caffe
import numpy as np

class checkFiniteLayer(caffe.Layer):
  """Raises as soon as a non-finite value (nan/inf) passes through the layer."""
  def setup(self, bottom, top):
    self.prefix = self.param_str  # label used in the error messages
  def reshape(self, bottom, top):
    pass  # in-place layer: nothing to reshape
  def forward(self, bottom, top):
    for i in range(len(bottom)):
      # number of nan/inf entries in the activations
      isbad = np.sum(1-np.isfinite(bottom[i].data[...]))
      if isbad>0:
        raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                        (self.prefix,i,100*float(isbad)/bottom[i].count))
  def backward(self, top, propagate_down, bottom):
    for i in range(len(top)):
      if not propagate_down[i]:
        continue
      # number of nan/inf entries in the gradients
      isf = np.sum(1-np.isfinite(top[i].diff[...]))
      if isf>0:
        raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                        (self.prefix,i,100*float(isf)/top[i].count))

Add this layer to your train_val.prototxt at the points you suspect may cause trouble:

layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2"  # "in-place" layer
  python_param {
    module: "check_finite_layer" # module name; check_finite_layer.py must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss" # string for printouts
  }
}

Mohammad Rasoul tanhatalab

The learning rate was too high and had to be decreased. The accuracy in my RNN code was nan; selecting a lower value for the learning rate fixed it.

LKB

I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered NaNs. On removing some of the layers (in my case, I actually had to remove just one), the NaNs disappeared. So I guess too much sparsity may lead to NaNs as well (some 0/0 computations may have been invoked!?)

Can you be a little more specific? Can you provide details on the configuration that had NaNs and the fixed configuration? What type of layers? What parameters?

@shai I had used several InnerProduct (lr_mult 1, decay_mult 1, lr_mult 2, decay_mult 0, xavier, std: 0.01) layers, each followed by ReLU (except the last one). I was working with MNIST, and if I remember correctly, the architecture was 784 -> 1000 -> 500 -> 250 -> 100 -> 30 (and a symmetric decoder phase); removing the 30 layer along with its ReLU made the NaNs disappear.
