
Batch Normalization in Convolutional Neural Network

I am a newbie to convolutional neural networks and only have an idea of feature maps and how convolution is done on images to extract features. I would be glad to know some details on applying batch normalisation in a CNN.

I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm when applied to data, but at the end they mention that a slight modification is required when it is applied to a CNN:

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

I am totally confused when they say "so that different elements of the same feature map, at different locations, are normalized in the same way".

I know what feature maps mean, and that the different elements are the weights of every feature map. But I could not understand what a location or spatial location means.

I could not understand the sentence below at all: "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations".

I would be glad if someone could elaborate and explain this to me in much simpler terms.


Maxim

Let's start with the terms. Remember that the output of a convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels. An index (x, y), where 0 <= x < H and 0 <= y < W, is a spatial location.
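
For concreteness, here's a tiny NumPy sketch of these terms (the shapes are made up):

import numpy as np

# output of a conv layer: batch of 2 images, 4x4 feature maps, 3 channels
t = np.random.randn(2, 4, 4, 3)   # [B, H, W, C]

# t[:, x, y, c] gathers the value at spatial location (x, y) of channel c
# across the whole batch
x, y, c = 1, 2, 0
print(t[:, x, y, c].shape)        # (2,) -- one value per batch element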

Usual batchnorm

Now, here's how batchnorm is applied in the usual way (in NumPy-style code):

import numpy as np

eps = 1e-5  # small constant for numerical stability

# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along axis 0 and have shape [H, W, C]
mean = np.mean(t, axis=0)
stddev = np.std(t, axis=0)
out = np.empty_like(t)
for i in range(B):
  out[i, :, :, :] = (t[i, :, :, :] - mean) / (stddev + eps)

Basically, it computes H*W*C means and H*W*C standard deviations, each across the B elements of the batch. You may notice that different elements at different spatial locations have their own mean and variance, each gathered from only B values.

Batchnorm in conv layer

This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read about this in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value takes its mean and variance over B*H*W values, gathered across different spatial locations.

Here's what the code looks like in this case (again NumPy-style):

# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are now computed over the (0, 1, 2) axes and have shape [C]
mean = np.mean(t, axis=(0, 1, 2))
stddev = np.std(t, axis=(0, 1, 2))
out = np.empty_like(t)
for i in range(B):
  for x in range(H):
    for y in range(W):
      out[i, x, y, :] = (t[i, x, y, :] - mean) / (stddev + eps)

In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").
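
To make the "effective mini-batch" concrete, here is a tiny NumPy sketch counting how many samples back each statistic (the numbers are made up):

import numpy as np

B, H, W, C = 20, 5, 5, 10
t = np.random.randn(B, H, W, C)

# usual BN: one mean per (x, y, c) position, each estimated from B samples
per_position = np.mean(t, axis=0)          # shape (5, 5, 10): H*W*C = 250 means
# conv BN: one mean per channel, each estimated from B*H*W samples
per_channel = np.mean(t, axis=(0, 1, 2))   # shape (10,): C means
print(t[..., 0].size)                      # 500 = m' = m * p * q = 20 * 5 * 5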


Great answer, but I think you mean that we should take the mean and variance of B*H*W values, not B*H*C values. Refer to the first paragraph after Batchnorm in conv layer. Either way, +1.
Could we not just write out = (t - mean) / (stddev + eps) without the loop? The mean and variance are computed over the whole batch and then applied to each element in the batch separately, rather than all at once? @maxim
Regarding BN for conv layers, one can get more information here - arxiv.org/pdf/1502.03167.pdf in subsection 3.2. The gist is that we want to preserve the convolutional properties (for example, spatial translation invariance of features), and hence the mean is calculated over the B, H, and W axes.
"Basically, it computes H*W*C means": or does it calculate just B means in the first case? As a small example: if we consider a 3x2x3 input, the mean over dim=(0) has shape 2x3. Likewise, for BxHxWxC the mean would have shape HxWxC, and it would be subtracted from each input of that batch. Please clarify.
Am I right that the usual batchnorm cannot be applied to a fully-convolutional network? Each batch could have different spatial shapes, which would require an arbitrary number of gammas and betas, which is impossible. Is that correct?
Maverick Meerkat

Some clarification on Maxim's answer.

I was puzzled to see in Keras that the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels: every channel in a conv net is considered a different "feature". I.e., normalizing over all channels is equivalent to normalizing the number of bedrooms together with size in square feet (the multivariate regression example from Andrew Ng's ML course). This is usually not what you want; what you do is normalize every feature by itself. I.e., you normalize the number of bedrooms across all examples to have mu=0 and std=1, and you normalize the square feet across all examples to have mu=0 and std=1.

This is why you want C means and stds, because you want a mean and std per channel/feature.
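
A tiny numeric sketch of that analogy (the house data is made up):

import numpy as np

# hypothetical tabular data: 5 houses, 2 features (bedrooms, square feet)
X = np.array([[3, 1500.], [2, 900.], [4, 2200.], [3, 1300.], [5, 2800.]])

# normalize each feature (column) by itself, never across features
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0))  # ~[0, 0]: each feature now has mu=0
print(X_norm.std(axis=0))   # ~[1, 1]: ... and std=1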

After checking and testing it myself, I realized the issue: there's a bit of confusion/misconception here. The axis you specify in Keras is actually the axis which is not in the calculations, i.e. you average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite of how NumPy works, where the specified axis is the one you do the operation on (e.g. np.mean, np.std, etc.).

I actually built a toy model with only BN, and then calculated the BN manually: I took the mean and std across the first 3 dimensions [m, n_W, n_H] to get n_C results, calculated (X-mu)/std (using broadcasting), and got results identical to the Keras results.
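
A minimal sketch of that check (assuming TensorFlow 2.x; training=True makes the layer use the current batch statistics rather than its moving averages, and gamma/beta are still at their 1/0 initial values):

import numpy as np
import tensorflow as tf

x = np.random.randn(8, 5, 5, 3).astype("float32")  # [m, n_H, n_W, n_C]

# a "model" with only BN; axis=-1 keeps the channel axis out of the reduction
bn = tf.keras.layers.BatchNormalization(axis=-1, epsilon=1e-3)
y_keras = bn(x, training=True).numpy()

# manual BN: mean/var over the first 3 dimensions, one statistic per channel
mu = x.mean(axis=(0, 1, 2))                # shape (3,): n_C means
var = x.var(axis=(0, 1, 2))                # shape (3,): n_C variances
y_manual = (x - mu) / np.sqrt(var + 1e-3)  # gamma=1, beta=0 at initialization

print(np.allclose(y_keras, y_manual, atol=1e-4))  # True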

Hope this helps anyone who was confused as I was.


This is great, thanks! So, correctly applying batch normalization after a convolution means specifying the axis corresponding to the output features of the convolution, not the channels? But then the same mean and stdev would be used for normalization across different channels of the same output feature. It's not 100% clear whether this is what was described in the original batchnorm paper (in the section about applying it to conv nets).
I just read the docs on Keras BatchNormalization. The batch norm paper recommends normalising using statistics (mean and stdev) computed over all locations of the same output feature within the output of the convolution. If we set axis to correspond to the output channels, it should do the right thing. (I think.)
Guillaume Chevalier

I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.

About location or spatial location: they mean the position of pixels in an image or feature map. A feature map is comparable to a sparse, modified version of the image in which concepts are represented.

About "so that different elements of the same feature map, at different locations, are normalized in the same way": some normalisation algorithms are local, so they depend on their close surroundings (location) and not on things far apart in the image. They probably mean that every pixel, regardless of its location, is treated just like an element of a set, independently of its direct spatial surroundings.

About "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations": they get a flat list of all the values of every training example in the mini-batch, and this list combines values regardless of their location on the feature map.


Just wanted to check my understanding with an example. So basically, if we have 10 feature maps of size 5x5 and a mini-batch size of 20, do we try to normalise every feature map individually? So the new mini-batch size is 20 * 25 = 500 (25 because the feature map is of size 5x5). I am confused whether each individual feature map is normalised with its own mean and variance, or whether the mean and variance are the same for all 10 feature maps. If the latter is the case, what will the new updated mini-batch size be?
After thinking about it for a while, I would like to just state that I think you are correct. And this is how the batchnorm paper says it should be done after conv layers. So all good.
Milo Sun

Firstly, we need to make it clear that the depth of a kernel is determined by the previous feature map's channel count, and that the number of kernels in this layer determines the channel count of the next feature map (the next layer). Secondly, we should make it clear that each kernel (usually three-dimensional) generates just one channel of the feature map in the next layer. Thirdly, we should try to accept the idea that every point in the generated feature map (regardless of its position) is generated by the same kernel sliding over the previous layer. So these points can be seen as a distribution generated by this kernel, and they can be seen as samples of one stochastic variable. They are then averaged to obtain the mean, and then the variance. (This is not rigorous; it only helps with understanding.) This is what they mean by "so that different elements of the same feature map, at different locations, are normalized in the same way".
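
A tiny NumPy sketch of that idea (the shapes are made up):

import numpy as np

# toy conv output: batch of 2, 3x3 feature maps, 4 channels (one per kernel)
t = np.random.randn(2, 3, 3, 4)

# all 2*3*3 = 18 values of channel k are treated as samples of the one
# "distribution generated by kernel k", so they share one mean and one std
k = 0
mean_k = t[..., k].mean()
std_k = t[..., k].std()
normed_k = (t[..., k] - mean_k) / std_k  # same transform at every location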

