BatchNorm
1. Preface
As the neural network grows deeper, a few problems arise:
- the loss is computed at the end, i.e. in the top layers
- so the top layers train (converge) faster
- the bottom layers train slowly
- the data enters at the bottom
- when the bottom layers update, all the layers above them must adapt to the new inputs
- so the top layers are effectively retrained many times
- this leads to slow convergence
2. BatchNorm
Fix the mean and variance within each mini-batch. First compute them:
$$\mu_B=\frac{1}{|B|}\sum_{i\in B}x_i$$
and
$$\sigma_{B}^{2}=\frac{1}{|B|}\sum_{i\in B}(x_i-\mu_B)^2+\epsilon$$
then apply a learnable scale and shift (trainable parameters $\gamma$ and $\beta$):
$$x_{i+1}=\gamma\frac{x_i-\mu_B}{\sigma_B}+\beta$$
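A minimal sketch of these two steps, assuming PyTorch (the function name `batch_norm` and the toy shapes are just for illustration):

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    # Mini-batch mean and variance, computed over the batch dimension
    # (X is assumed to have shape (batch, features)).
    mu = X.mean(dim=0)
    var = ((X - mu) ** 2).mean(dim=0)
    # Normalize, then apply the learnable scale (gamma) and shift (beta).
    X_hat = (X - mu) / torch.sqrt(var + eps)
    return gamma * X_hat + beta

X = torch.randn(8, 4)                        # toy mini-batch: 8 samples, 4 features
gamma, beta = torch.ones(4), torch.zeros(4)  # in a real layer these are trainable
Y = batch_norm(X, gamma, beta)
print(Y.mean(dim=0))                         # roughly 0 per feature
print(Y.var(dim=0, unbiased=False))          # roughly 1 per feature
```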
- it is applied after a Conv/Dense layer and before the activation function
- or on the input of a Conv/Dense layer (see the placement sketch after this list)
- for fully connected layers, it acts on the feature dimension
- for convolutional layers, it acts on the channel dimension
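As a concrete placement sketch, assuming PyTorch's `nn.BatchNorm1d`/`nn.BatchNorm2d` (the layer sizes here are arbitrary): BN sits after the Conv/Linear layer and before the activation, with `BatchNorm2d` taking the number of channels and `BatchNorm1d` the number of features.

```python
import torch
from torch import nn

net = nn.Sequential(
    # conv output: BN acts on the channel dimension (6 channels)
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    # dense output: BN acts on the feature dimension (120 features)
    nn.Linear(6 * 12 * 12, 120), nn.BatchNorm1d(120), nn.ReLU(),
    nn.Linear(120, 10),
)

X = torch.randn(4, 1, 28, 28)  # toy batch of four 28x28 single-channel images
print(net(X).shape)            # torch.Size([4, 10])
```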
3. How it works
- Each mini-batch has a different mean and variance, so normalizing with batch statistics injects noise into every batch, which acts as regularization and controls the complexity of the model.
- Because of this regularizing effect, it is usually unnecessary to combine it with Dropout.
- It learns an appropriate shift ($\beta$) and scale ($\gamma$) to fix the mean and variance of the mini-batch data.
- It speeds up convergence (a larger learning rate can be used), but generally does not improve final accuracy.
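Because the statistics change from batch to batch, BN behaves differently at training and inference time. A short sketch of this, assuming PyTorch: in `train()` mode the layer normalizes with per-batch statistics (the noise mentioned above) and updates its running estimates; in `eval()` mode it uses the accumulated `running_mean`/`running_var`, so the output no longer depends on the rest of the batch.

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(3)

bn.train()                          # training mode: per-batch statistics
for _ in range(10):
    bn(torch.randn(16, 3) * 2 + 1)  # batches with mean ~1 and variance ~4
print(bn.running_mean)              # drifting toward ~[1, 1, 1]
print(bn.running_var)               # drifting toward ~[4, 4, 4]

bn.eval()                           # inference mode: fixed running statistics
print(bn(torch.randn(5, 3)))
```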
4. Q&A
- Using BN in a shallow network may not bring much benefit.
- A Linear layer on its own is unlikely to learn what BN does, i.e. the normalization that stabilizes the intermediate data.