Normalizing Input in Neural Network
Python Automation and Machine Learning for ICs - An Online Book - http://www.globalsino.com/ICs/
=================================================================================

Normalization typically refers to the process of scaling input features so that they have a mean of 0 and a standard deviation of 1. This can help the neural network converge faster during training and can mitigate problems associated with large or small input values.

Normalizing the input in a neural network is beneficial for several reasons, including avoiding saturation issues when the input values (z) are too large or too small. When input values are too large, they can saturate activation functions such as the sigmoid or tanh functions, causing gradients to become very small during backpropagation. This is known as the vanishing gradient problem and can make training slow and difficult. Normalizing the input keeps the values within a reasonable range, reducing the likelihood of saturation. Similarly, when input values are too small, the network may have difficulty learning meaningful representations, and normalization addresses this by bringing the values to a scale more suitable for learning.

Batch normalization is a specific technique used in neural networks to normalize the input of each layer during training. It has been shown to improve training stability and speed by addressing issues related to input scale.

The normalization can be done with the formula below,

         x = (x − μ)/σ --------------------------------------------------- [3727a]

where,
         μ is the mean of the dataset.
         σ is the standard deviation of the dataset.

That is, each value is centered by subtracting the mean (μ) and scaled by dividing by the standard deviation (σ) calculated from the dataset. Normalization is a common preprocessing step in machine learning, and it is often done to bring all features to a similar scale. This can be important for machine learning algorithms that are sensitive to the scale of input features.

Figure 3717a shows the input features scaled by normalization so that they have a mean of 0 and a standard deviation of 1.
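As a minimal sketch of the normalization in equation [3727a], the NumPy snippet below (the array values are illustrative assumptions) subtracts each feature's mean and divides by its standard deviation, producing columns with mean 0 and standard deviation 1:

```python
import numpy as np

# Two input features on very different scales (illustrative values)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation
X_norm = (X - mu) / sigma # equation [3727a]

print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```

Applying the same per-feature `mu` and `sigma` computed on the training set to validation and test data keeps all splits on a consistent scale.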
Figure 3717b shows the normalization effect on mean squared error (MSE).
Figure 3717c shows the benefits of normalization in gradient descent (Code).

Figure 3717d shows the data distribution, KDE (kernel density estimation), and variance before and after normalization.
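To illustrate the gradient-descent benefit discussed around Figure 3717c, the sketch below (a minimal example; the data, learning rates, and step counts are arbitrary assumptions, not the book's code) runs plain batch gradient descent on a linear-regression MSE twice: once with raw features on very different scales, and once with normalized features. The ill-conditioned raw problem forces a tiny learning rate and makes far less progress in the same number of steps:

```python
import numpy as np

def gd_mse(X, y, lr, steps=200):
    """Batch gradient descent on mean squared error; returns the final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, 200)      # feature on scale ~1
x2 = rng.uniform(0.0, 1000.0, 200)   # feature on scale ~1000
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.003 * x2            # exact linear target, no noise

# Raw features: the large-scale feature dominates the loss curvature,
# so the learning rate must be tiny to avoid divergence.
mse_raw = gd_mse(X, y, lr=1e-7)

# Normalized features: well-conditioned, so a much larger rate works.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y - y.mean()                   # center the target (model has no intercept)
mse_norm = gd_mse(X_norm, y_c, lr=0.1)

print(mse_raw, mse_norm)  # normalized run reaches far lower error
```

This mirrors the usual picture: without normalization the loss surface is a long narrow valley and gradient descent zig-zags, while after normalization the contours are close to circular and a single moderate learning rate suits all directions.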
============================================