Gradient Descent and its Algorithm (for Updating θ) - Python for Integrated Circuits - An Online Book
=================================================================================

Gradient descent is not an algorithm for learning a specific model but rather an optimization technique used to train a wide range of machine learning models, including neural networks, support vector machines, and linear regression. It is essential for updating model parameters to minimize a loss function, making it a crucial component of many learning algorithms.

When you have multiple training samples (that is, a dataset with multiple data points), the equations for the hypothesis and the cost function change to accommodate the entire dataset. This is often referred to as "batch" gradient descent, where you update the model parameters using the average of the gradients computed across all training samples.

Hypothesis (for multiple training samples): The hypothesis for linear regression with multiple training samples can be represented as a matrix multiplication. Let m be the number of training samples, n be the number of features, X be the m×(n+1) feature matrix (whose first column is all ones for the intercept term), and y be the vector of target values. The hypothesis can be expressed as:

    hθ(X) = Xθ ------------------------------ [3974a]

where θ is the (n+1)-dimensional parameter vector, so that the prediction for the ith sample is hθ(x⁽ⁱ⁾) = θᵀx⁽ⁱ⁾.
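The matrix form of Equation 3974a maps directly onto a one-line NumPy expression. Below is a minimal sketch; the data values are made up for illustration:

    import numpy as np

    # Feature matrix X: m = 4 training samples, each row starting with a 1
    # for the intercept term, followed by n = 1 feature value.
    X = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])

    # Parameter vector theta = [theta_0, theta_1].
    theta = np.array([0.5, 2.0])

    # Hypothesis for all m samples at once: h_theta(X) = X @ theta.
    predictions = X @ theta
    print(predictions)  # [2.5 4.5 6.5 8.5]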
Cost Function (for multiple training samples): The cost function in linear regression is typically the mean squared error (MSE) over all training samples. It is defined as:

    J(θ) = (1/(2m)) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² ------------------------------ [3974b]

where hθ(x⁽ⁱ⁾) is the prediction for the ith training example, y⁽ⁱ⁾ is its target value, and the factor of 1/2 is included to simplify the gradient.
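Equation 3974b can be written as a short function. The sketch below assumes NumPy arrays shaped as in the previous example, with made-up data:

    import numpy as np

    def compute_cost(X, y, theta):
        """Mean squared error cost J(theta) from Equation 3974b."""
        m = len(y)
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i
        return (1.0 / (2 * m)) * np.sum(errors ** 2)

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([2.0, 4.0, 6.0])       # generated by y = 2x, so theta = [0, 2] is optimal
    print(compute_cost(X, y, np.array([0.0, 2.0])))  # 0.0 at the optimum
    print(compute_cost(X, y, np.array([0.0, 0.0])))  # larger cost away from it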
Equation 3974b is evaluated over the entire training set of m examples, so this type of gradient descent has another name: batch gradient descent. The key difference between stochastic gradient descent (SGD) and batch gradient descent is that SGD updates the parameters more frequently and with noisier estimates of the gradient, which can lead to faster convergence and better generalization, especially when dealing with large datasets. However, it can also introduce more variance in the optimization process due to the random sampling of training examples. The sketch below contrasts a single update of each kind.
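A minimal sketch of the two update styles; the synthetic dataset, learning rate, and random seed here are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic dataset around the line y = 3 + 1.5x.
    X = np.c_[np.ones(100), rng.uniform(0, 10, 100)]
    y = 3.0 + 1.5 * X[:, 1] + rng.normal(0.0, 0.5, 100)

    theta = np.zeros(2)
    alpha = 0.01
    m = len(y)

    # Batch update: one step uses the gradient averaged over all m samples.
    grad_batch = (X.T @ (X @ theta - y)) / m
    theta_after_batch = theta - alpha * grad_batch

    # Stochastic update: one step uses a single randomly chosen sample,
    # giving a noisier but much cheaper gradient estimate.
    i = rng.integers(m)
    grad_sgd = (X[i] @ theta - y[i]) * X[i]
    theta_after_sgd = theta - alpha * grad_sgd

    print("one batch step:", theta_after_batch)
    print("one SGD step:  ", theta_after_sgd)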
Gradient Descent (for updating θ): To train the linear regression model, you typically use gradient descent to minimize the cost function. The update rule for each parameter θⱼ in every iteration of gradient descent is:

    θⱼ := θⱼ − α ∂J(θ)/∂θⱼ ------------------------------ [3974c]

where α is the learning rate and the update is applied simultaneously for all j = 0, 1, ..., n. It can also be given in vector form by

    θ := θ − α ∇θJ(θ) ------------------------------ [3974d]

where ∇θJ(θ) is the gradient of the cost function with respect to the whole parameter vector θ.
To evaluate the partial derivative in Equation 3974c, consider first a single training example (x, y), for which the cost is J(θ) = ½(hθ(x) − y)². Then, we have:

    ∂J(θ)/∂θⱼ = (hθ(x) − y) · ∂(hθ(x) − y)/∂θⱼ ------------------------------ [3974e]

    = (hθ(x) − y) · ∂(θ₀x₀ + θ₁x₁ + ... + θₙxₙ − y)/∂θⱼ ------------------------------ [3974f]

    = (hθ(x) − y) xⱼ ------------------------------ [3974g]

using the convention x₀ = 1 for the intercept term. In each iteration, each parameter θⱼ is updated simultaneously using the gradients calculated over the entire training dataset, and this process is repeated until the cost function converges to a minimum. This batch gradient descent process allows you to find the optimal parameters that minimize the cost function, making your linear regression model fit the training data as closely as possible. By combining Equations 3974d and 3974g, we then have, for a single training example,

    θⱼ := θⱼ − α (hθ(x) − y) xⱼ ------------------------------ [3974h]

With multiple training examples, we then have

    θⱼ := θⱼ − (α/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ ------------------------------ [3974i]

where i indexes the ith training example. The descent process described in Equation 3974i repeats until convergence. Similar to many other optimization algorithms, gradient descent iteratively updates the model parameters to minimize a cost function, and the process continues until a convergence criterion is met. Convergence here means that the algorithm has reached a point where further iterations do not significantly improve the cost function or the model performance; this is a common termination condition in many machine learning algorithms. A complete loop implementing Equation 3974i with such a stopping condition is sketched below.
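This is a minimal sketch, assuming a noise-free synthetic dataset and a tolerance-based convergence test (the data, learning rate, and tolerance are made up for illustration):

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.05, tol=1e-9, max_iters=10000):
        """Repeat the update of Equation 3974i until J(theta) stops improving."""
        m = len(y)
        theta = np.zeros(X.shape[1])
        prev_cost = np.inf
        for it in range(max_iters):
            errors = X @ theta - y
            cost = (errors ** 2).sum() / (2 * m)    # J(theta), Equation 3974b
            if prev_cost - cost < tol:              # convergence: negligible improvement
                break
            prev_cost = cost
            theta -= (alpha / m) * (X.T @ errors)   # simultaneous update, Equation 3974i
        return theta, it

    # Made-up noise-free data on the line y = 1 + 2x.
    X = np.c_[np.ones(50), np.linspace(0, 5, 50)]
    y = 1.0 + 2.0 * X[:, 1]
    theta, iters = batch_gradient_descent(X, y)
    print(theta, "after", iters, "iterations")      # approaches [1.0, 2.0]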
When we define the cost function J(θ) for linear regression as the mean squared error (MSE), it turns out to be a quadratic function of the parameters θ. This can be proved as shown below. In linear regression, the hypothesis function hθ(x) is defined as the linear combination of features and parameters:

    hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ ------------------------------ [3974j]

where θ₀, θ₁, ..., θₙ are the parameters and x₁, x₂, ..., xₙ are the input features. The mean squared error (MSE) cost function J(θ) is defined as:

    J(θ) = (1/(2m)) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² ------------------------------ [3974k]

where m is the number of training examples, hθ(x⁽ⁱ⁾) is the prediction for the ith training example, and y⁽ⁱ⁾ is the corresponding target value.
Now, let's expand the MSE cost function by substituting Equation 3974j into Equation 3974k:

    J(θ) = (1/(2m)) Σᵢ₌₁ᵐ (θ₀ + θ₁x₁⁽ⁱ⁾ + θ₂x₂⁽ⁱ⁾ + ... + θₙxₙ⁽ⁱ⁾ − y⁽ⁱ⁾)² ------------------------------ [3974l]

Each term inside the summation is a linear combination of the parameters θ₀, θ₁, θ₂, ..., θₙ and the input features x₁, x₂, ..., xₙ. When you expand and simplify the squared terms within the summation, you get a polynomial of degree 2 (quadratic) with respect to the parameters θ₀, θ₁, θ₂, ..., θₙ. The presence of squared terms makes it a quadratic function. The goal of linear regression is to find the values of θ₀, θ₁, θ₂, ..., θₙ that minimize this quadratic cost function J(θ), and that is typically done using optimization techniques like gradient descent. Therefore, for linear regression with the mean squared error cost function, J(θ) is indeed a quadratic function of the parameters θ₀, θ₁, θ₂, ..., θₙ.

Normally, the cost function J(θ) looks like a big bowl, similar to Figure 3906a, and the contours of this quadratic function are ellipses, as shown in Figure 3906b. The red dots and lines show what happens when you run the gradient descent algorithm: the iterates converge to the global minimum.

Figure 3906a. Typical bowl-shaped cost function J(θ).

Figure 3906b. Elliptical contours of the quadratic function J(θ).
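A plot in the spirit of Figure 3906b can be reproduced with a few lines of NumPy and Matplotlib; the one-feature dataset below is made up for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    # Simple one-feature dataset; J(theta0, theta1) is then a quadratic bowl.
    x = np.linspace(0, 5, 30)
    y = 1.0 + 2.0 * x
    m = len(x)

    # Evaluate J on a grid of (theta0, theta1) pairs.
    t0, t1 = np.meshgrid(np.linspace(-3, 5, 100), np.linspace(-1, 5, 100))
    J = np.zeros_like(t0)
    for xi, yi in zip(x, y):
        J += (t0 + t1 * xi - yi) ** 2
    J /= 2 * m

    plt.contour(t0, t1, J, levels=30)   # the contours come out as ellipses
    plt.plot(1.0, 2.0, "rx")            # global minimum at (theta0, theta1) = (1, 2)
    plt.xlabel("theta_0"); plt.ylabel("theta_1")
    plt.title("Elliptical contours of J(theta)")
    plt.show()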
The choice of the learning rate (α) in the gradient descent algorithm is crucial, and setting it too large or too small can lead to issues in the optimization process: if α is too large, the updates can overshoot the minimum, so the cost oscillates or even diverges; if α is too small, convergence becomes very slow and many iterations are wasted. Choosing an appropriate learning rate is therefore often an empirical process and depends on the specific problem and the characteristics of the cost function, as the comparison below illustrates.
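A minimal sketch of this empirical comparison, assuming the same made-up line-fitting data as above (expect overflow warnings for the diverging case):

    import numpy as np

    X = np.c_[np.ones(50), np.linspace(0, 5, 50)]
    y = 1.0 + 2.0 * X[:, 1]
    m = len(y)

    def final_cost(alpha, iters=200):
        """Cost J(theta) after a fixed number of batch updates with rate alpha."""
        theta = np.zeros(2)
        for _ in range(iters):
            theta -= (alpha / m) * (X.T @ (X @ theta - y))
        errors = X @ theta - y
        return (errors ** 2).sum() / (2 * m)

    for alpha in (0.001, 0.05, 0.5):
        print(f"alpha={alpha}: J after 200 steps = {final_cost(alpha):.4g}")
    # Too small (0.001): J decreases very slowly.
    # Moderate (0.05): J drops close to its minimum.
    # Too large (0.5): the updates overshoot and J blows up.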
The choice of learning rate is one of the hyperparameters that can significantly impact the success of the gradient descent algorithm, and finding the right balance is essential for efficient and stable optimization. In practice, you try a couple of values and then pick the α that reaches the global minimum most efficiently.

The script below walks through the steps of an example gradient descent run and plots the hypothesis line after each step for a linear regression model, for a given learning rate. It shows how the gradient descent algorithm iteratively adjusts the parameters of a linear regression model to find the best-fitting line for the given data. The steps of gradient descent within linear regression are:
1. Initialize the parameters (e.g., θ₀ = θ₁ = 0).
2. Compute the predictions hθ(x⁽ⁱ⁾) for all training samples.
3. Compute the gradient of the cost function J(θ).
4. Update the parameters using the rule in Equation 3974i.
5. Repeat steps 2-4 until the cost converges.
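A minimal version of such a script (the book's original code is on a linked page and not reproduced in the text; the data, learning rate, and number of steps below are made-up stand-ins):

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data around y = 1 + 2x.
    rng = np.random.default_rng(1)
    x = np.linspace(0, 5, 30)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 30)
    X = np.c_[np.ones_like(x), x]
    m = len(y)

    lr = 0.05                # learning rate; try 0.005 or 0.2 to see its effect
    theta = np.zeros(2)

    plt.scatter(x, y, label="training data")
    for step in range(10):
        theta -= (lr / m) * (X.T @ (X @ theta - y))   # one batch update
        plt.plot(x, X @ theta, color="red",
                 alpha=0.3 + 0.07 * step)             # darker line = later step
    plt.xlabel("x"); plt.ylabel("y")
    plt.title(f"Hypothesis line after each gradient descent step (lr={lr})")
    plt.legend(); plt.show()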
============================================

Gradient descent algorithm: in the script below, we choose the initial point (0.8, −0.8) as the starting θ₀ and θ₁ for finding the minimum of f(θ). However, starting from a different initial point, you can end up at a different local minimum.
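The book's f(θ) is on a linked code page and not reproduced in the text, so the sketch below substitutes a hypothetical non-convex double-well surface with four local minima at (±1, ±1) to demonstrate the same point:

    import numpy as np

    # Hypothetical stand-in for the book's f(theta): a double well.
    def f(t0, t1):
        return (t0**2 - 1) ** 2 + (t1**2 - 1) ** 2

    def grad(t0, t1):
        """Analytical gradient of f."""
        return np.array([4 * t0 * (t0**2 - 1), 4 * t1 * (t1**2 - 1)])

    def descend(start, lr=0.05, iters=500):
        theta = np.array(start, dtype=float)
        for _ in range(iters):
            theta -= lr * grad(*theta)
        return theta

    print(descend((0.8, -0.8)))    # lands in the local minimum near ( 1, -1)
    print(descend((-0.8, 0.8)))    # a different start lands near    (-1,  1)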
Figure 3906c shows the benefits of normalization in gradient descent: with normalized features, the contours of the cost function are much closer to circles, so gradient descent takes a more direct path to the minimum and converges in far fewer iterations.

Figure 3906c. Benefits of normalization in gradient descent (Code).
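A minimal sketch of this effect, assuming a made-up dataset whose raw feature spans a large range (the scales and learning rates are illustrative only):

    import numpy as np

    rng = np.random.default_rng(2)
    x_raw = rng.uniform(0, 1000, 200)             # feature on a large scale
    y = 3.0 + 0.05 * x_raw + rng.normal(0, 1, 200)

    def run_gd(x, alpha, iters=2000):
        """Final cost J(theta) after batch gradient descent on feature x."""
        X = np.c_[np.ones_like(x), x]
        m = len(x)
        theta = np.zeros(2)
        for _ in range(iters):
            theta -= (alpha / m) * (X.T @ (X @ theta - y))
        errors = X @ theta - y
        return (errors ** 2).sum() / (2 * m)

    x_norm = (x_raw - x_raw.mean()) / x_raw.std()  # z-score normalization

    # Without normalization, a tiny alpha is needed just to stay stable, and
    # convergence is very slow; after normalization, a moderate alpha
    # converges in far fewer iterations.
    print("raw feature:       J =", run_gd(x_raw, alpha=1e-6))
    print("normalized feature: J =", run_gd(x_norm, alpha=0.1))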
While gradient descent is a powerful optimization algorithm that can converge quickly, the guarantee of reaching the exact global optimum depends on various factors: the convexity of the cost function, the choice of learning rate, and the initialization of the parameters.

In traditional gradient descent, which computes the gradient of the cost function with respect to the parameters using the entire dataset, going through all the data points can be computationally expensive, especially for large datasets; this gradient computation is the most expensive step, because each iteration requires processing the entire dataset. The sketch below contrasts the per-iteration cost of batch and stochastic updates.
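A minimal sketch on a made-up large dataset (one million samples, assumed here purely to make the cost contrast visible):

    import numpy as np

    rng = np.random.default_rng(3)
    m = 1_000_000                                   # large synthetic dataset
    X = np.c_[np.ones(m), rng.uniform(0, 1, m)]
    y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.1, m)

    alpha = 0.1
    theta0 = np.zeros(2)

    # One batch iteration touches all m rows of X: O(m*n) work per update.
    grad_batch = (X.T @ (X @ theta0 - y)) / m
    theta_batch = theta0 - alpha * grad_batch

    # One stochastic iteration touches a single row: O(n) work per update,
    # so thousands of cheap SGD updates fit in the budget of one batch pass.
    theta_sgd = theta0.copy()
    for _ in range(1000):
        i = rng.integers(m)
        theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]

    print("after 1 batch step:  ", theta_batch)
    print("after 1000 SGD steps:", theta_sgd)       # much closer to [1.0, 2.0]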
============================================