Weight Initialization - Python Automation and Machine Learning for ICs - An Online Book
Python Automation and Machine Learning for ICs (http://www.globalsino.com/ICs/)
=================================================================================

Figure 3713a shows a single neuron with its connections. The green ball represents the activation function, a = σ(z). Here, z is given by,

          z = w1x1 + w2x2 + ... + wnxn ---------------------------------- [3713a]
Figure 3713a. A single neuron with its connections (Code).

Equation 3713a is a linear combination, that is, a weighted sum of the input variables (x1, x2, ..., xn) with corresponding weights (w1, w2, ..., wn). As the number of variables (n) increases, the weights (wi) need to be smaller to prevent the output (z) from becoming too large, which in turn helps prevent vanishing and exploding gradients. The appropriate wi should therefore be,

          wi = 1/n ---------------------------------- [3713b]

Therefore, weight initialization (choosing appropriate initial values for the weights) is crucial in mitigating the issues of vanishing and exploding gradients. The idea is that if the weights are too large, gradients can explode; if they are too small, gradients can vanish.

A very common weight initialization in a neural network is given by (code),

          W^[l] = np.random.randn(shape) * np.sqrt(4/n^[l-1]) ---------------------------------- [3713c]

where,
  W^[l] represents the weights in the l-th layer.
  np.random.randn(shape) generates random values from a normal distribution with mean 0 and standard deviation 1. The shape of the array is determined by the 'shape' parameter.
  np.sqrt(4/n^[l-1]) scales the randomly initialized values. This scaling factor takes into account the number of input units in the previous layer (n^[l-1]) and is a modification of the original He initialization, which used np.sqrt(2/n^[l-1]). The factor of 2 in He initialization is tied to the rectified linear unit (ReLU) activation function, whereas the factor of 4 in Equation 3713c is intended for sigmoid activations.

He initialization is another scheme, designed for the rectified linear unit (ReLU) activation function, given by,

          W^[l] = np.random.randn(shape) * np.sqrt(2/n^[l-1]) ---------------------------------- [3713d]

Xavier/Glorot initialization, which can be used for tanh, is given by,

          W^[l] = np.random.randn(shape) * np.sqrt(1/n^[l-1]) ---------------------------------- [3713e]

Equation 3713c shows random initialization. Random initialization of weights in a neural network is important for several reasons, most notably symmetry breaking: if all weights started from the same value, every neuron in a layer would compute the same output and receive the same gradient update, so the neurons could never learn different features.
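As an illustrative sketch of Equations 3713c-3713e (not the book's linked code), the scaling factors can be applied to a fully connected layer with n_in inputs and n_out outputs as follows; the function name, layer sizes, and activation labels are assumptions made for this example:

import numpy as np

def initialize_weights(n_in, n_out, activation="relu"):
    # Returns a weight matrix of shape (n_out, n_in) and zero biases.
    if activation == "relu":          # He initialization, Equation 3713d
        scale = np.sqrt(2.0 / n_in)
    elif activation == "tanh":        # Xavier/Glorot initialization, Equation 3713e
        scale = np.sqrt(1.0 / n_in)
    elif activation == "sigmoid":     # factor-of-4 variant, Equation 3713c
        scale = np.sqrt(4.0 / n_in)
    else:
        raise ValueError("Unknown activation: " + activation)
    W = np.random.randn(n_out, n_in) * scale   # mean 0, standard deviation = scale
    b = np.zeros((n_out, 1))                   # biases can safely start at zero
    return W, b

# Example: a layer with 512 inputs and 128 ReLU units
W, b = initialize_weights(512, 128, activation="relu")
print(W.shape, W.std())   # standard deviation should be close to sqrt(2/512), about 0.0625

Tying the standard deviation to n^[l-1] in this way keeps the variance of z roughly constant from layer to layer, which is exactly what prevents the gradients from systematically shrinking or growing.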
In programming, assuming we have a defined gradient_descent function, we often need to store every value of w (or θ) in a list in order to plot the weights or parameters versus the cost function. For the ML modeling itself, on the other hand, we only need to return the last value of w. The sketch below illustrates this: it implements a gradient descent algorithm, appends each value of w/θ and its cost to lists for plotting, and overwrites the w variable itself on every iteration of the loop so that only the final value is preserved for use in the ML process.
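A minimal sketch of this pattern, assuming a one-parameter model y ≈ w*x with a squared-error cost (the data, learning rate, and the exact gradient_descent signature below are illustrative assumptions, not the book's linked code):

import numpy as np
import matplotlib.pyplot as plt

def gradient_descent(x, y, learning_rate=0.01, num_iterations=100):
    # Fit y ≈ w*x with a squared-error cost.
    # w is overwritten on every iteration, so only the final value survives;
    # w_history and cost_history keep every intermediate value for plotting.
    w = 0.0
    w_history, cost_history = [], []
    m = len(x)
    for _ in range(num_iterations):
        y_pred = w * x
        cost = (1.0 / (2 * m)) * np.sum((y_pred - y) ** 2)   # cost J(w)
        grad = (1.0 / m) * np.sum((y_pred - y) * x)          # dJ/dw
        w_history.append(w)                                  # record current w for the plot
        cost_history.append(cost)                            # and its cost
        w = w - learning_rate * grad                         # overwrite w: only the last value is kept
    return w, w_history, cost_history

# Illustrative data: y = 2x plus a little noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 0.5 * np.random.randn(50)

w_final, w_history, cost_history = gradient_descent(x, y)
print("Final w used for the ML model:", w_final)

plt.plot(w_history, cost_history)   # the stored list makes the w-versus-cost plot possible
plt.xlabel("w")
plt.ylabel("Cost J(w)")
plt.show()

The returned w_final is what the ML model actually uses, while w_history and cost_history exist only to make the plot possible.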
=================================================================================