=================================================================================
Stochastic Gradient Descent (SGD) is a widely used optimization algorithm in machine learning and deep learning for training models, especially in scenarios where you have large datasets. It is a variant of the gradient descent optimization algorithm. The key idea behind SGD is to update the model's parameters (weights and biases) incrementally and stochastically based on the gradient of the loss function with respect to a single randomly chosen training example at each iteration. Here's how it works:

Initialization: Start with an initial set of model parameters (weights (coefficients) and biases with some initial values).

Data Shuffling: Shuffle the training dataset randomly. This step is typically done to ensure that the updates are not biased by the order of the data.

Iteration: For each iteration, randomly select one training example from the shuffled dataset.
 For each training example in your dataset, compute the gradient of the loss with respect to the model parameters.
 Gradient Computation: Compute the gradient of the loss function with respect to the model parameters using the selected training example. This gradient represents the direction and magnitude of the steepest increase in the loss function for that particular example.
 Parameter Update: Update the model parameters in the opposite direction of the gradient to minimize the loss. The size of the step is controlled by a parameter called the learning rate. Update the model parameters using the computed gradient. The update is performed in the opposite direction of the gradient to minimize the loss. The update rule is as follows:
new_parameter = old_parameter  learning_rate * gradient
The learning rate (often denoted as η or alpha) is a hyperparameter that controls the step size in the parameter space. It determines how big of a step is taken in each iteration.

Repeat: Continue this process for a specified number of iterations or until convergence criteria are met (e.g., the loss no longer decreases significantly, that is the change in the loss becomes small enough).
The "stochastic" aspect of SGD comes from the fact that it uses a single random data point (or a small batch of data points in minibatch SGD) to update the parameters in each iteration. This randomness introduces noise into the optimization process, which can help the algorithm escape local minima and explore the parameter space more effectively, especially in highdimensional spaces.
SGD is computationally efficient and works well in practice, especially for large datasets and highdimensional models like neural networks. However, it can exhibit noisy convergence due to its randomness, so it often requires careful tuning of hyperparameters like the learning rate. Variants of SGD, such as Minibatch SGD and Momentum SGD, are also commonly used to balance the tradeoff between the stochastic nature of SGD and the stability of convergence.
Data from a tf.data.Dataset can be taken to refactor linear regression, and then implement stochastic gradient descent with it. In this case, the dataset will be synthetic and read by the tf.data API directly from memory, and then tf.data API is used to load a dataset when the dataset resides on disk. The steps for this application are:
iv.a)
Use tf.data to read data from memory.
iv.b)
Use tf.data in a training loop.
iv.c)
Use tf.data to read data from disk.
iv.d)
Write production input pipelines with features engineering (batching, shuffling, etc.)
Machine learning algorithms often involve stochastic processes, such as the optimization of model parameters using SGD. Concentration inequalities are used to analyze the convergence properties of these algorithms. They provide bounds on how quickly the algorithms approach the optimal solution.
Some key equations used in Stochastic Gradient Descent are:
 bjective Function: SGD aims to minimize an objective function, typically represented as a loss or cost function, denoted by J(θ), where θ represents the model's parameters.
 nitialization: Initialize the model's parameters θ with some initial values. Often, these are initialized randomly.
 terative Update Rule: In each iteration, SGD updates the model parameters based on the gradient of the objective function with respect to the current parameters. The update rule for the parameters θ is given as:
 [4088a]
 θ_{t} represents the model parameters at iteration t.
 η_{t} is the learning rate at iteration t, which controls the step size of the update.
 ∇J(θ_{t}) is the gradient of the objective function with respect to θ_{t}. In the stochastic version, this gradient is estimated using a single training example or a small batch of examples.
 Learning Rate: The learning rate (η_{t}) is a hyperparameter that determines the step size in each iteration. It is often chosen based on heuristics or tuned using techniques like learning rate schedules or adaptive learning rate methods.
In the SGD process,
the step of updating θ_{j} for every j is repeated for i = 1 to m:
 [4088b]
For SVM, we can transform (θ_{0}, θ_{1}, ..., θ_{n}) into (b, w_{1}, ..., w_{n}):
 [4088bb]
Here,θ_{0} to represent the intercept and θ_{1}, ..., θ_{n} to represent the coefficients of features.
 MiniBatch SGD: In practice, rather than using a single training example, SGD typically operates on a minibatch of training examples. The update rule becomes:
 [4088c]
 B is the minibatch of training examples.
 ∇J(θ_{t}, x_{i}, y_{i}) is the gradient of the objective function for a specific training example (x_{i}, y_{i}).
 Termination: SGD iterates over the training data for a fixed number of iterations (epochs) or until a convergence criterion is met, such as reaching a certain level of loss or gradient norm.
The key difference between SGD and batch Gradient Descent is that SGD updates the parameters more frequently and with noisier estimates of the gradient, which can lead to faster convergence and better generalization, especially when dealing with large datasets. However, it can also introduce more variance in the optimization process due to the random sampling of training examples.
Contour of SGD: Code:
Output:
The step line in the simulated image above does not appear noisy because it uses a fixed learning rate (learning_rate = 0.1) and a deterministic gradient for the simple quadratic objective function:

Fixed Learning Rate: The learning rate (η) in the SGD update rule is constant (learning_rate = 0.1). This means that in each iteration, the model parameters are updated by a fixed amount determined by the learning rate. Since the learning rate is constant, the step size remains the same throughout the optimization process.

Deterministic Gradient: In this script, the objective function (objective_function) is a simple quadratic function, and the gradient (gradient) is computed deterministically. In other words, for each point in the parameter space, the gradient is fixed and does not change between iterations. This results in a smooth trajectory for the optimization process.

Small Number of Iterations: The script runs a relatively small number of iterations (num_iterations = 20). With a fixed learning rate and a deterministic gradient, the optimization process converges quickly to the minimum of the objective function, and there is little opportunity for noise or randomness to affect the trajectory significantly.
In reality, the behavior of realworld stochastic gradient descent, this step line is noisy () because:
 The learning rate is reduced over time.
 The objective function is more complex and nonconvex so that exhibits more varied behavior during optimization.
Figure 4088. Contour plot of SGD in real word.
Table 4088. Applications and related concepts of SGD.
Applications 
Page 
Stochastic gradient descent (SGD): logistic regression can be addressed by applying SGD 
Introduction 
============================================
