Mean squared error (MSE) (L₂ loss function, Euclidean loss) and root mean squared error (RMSE)

- Python for Integrated Circuits -
- An Online Book -

Python for Integrated Circuits http://www.globalsino.com/ICs/


=================================================================================

The L₂ loss function, also known as the Euclidean loss or mean squared error (MSE), is a commonly used loss function in machine learning, particularly in regression problems. It measures the squared differences between predicted and actual values. Calculating the MSE requires two sets of values: the actual observed values and the corresponding predicted or estimated values. The formula squares the difference between each pair of corresponding values and averages these squared differences over all data points (from i=1 to n), giving the average squared error between the actual and predicted values. Minimizing the L₂ loss finds model parameters that bring the predicted values close to the actual values.

Mean squared error (MSE) is defined as the average of the squared differences between the estimated values and the true values, given by,
          $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$ ------------------------------------------- [4068aa]
where,
          $y_i$ -- An individual observed or actual value from one dataset.
          $\hat{y}_i$ -- An individual predicted or estimated value from another dataset.
          $n$ -- The number of data points.
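
As a quick check of the definition, the MSE can be computed directly from two equal-length lists. This is a minimal sketch; the function name `mean_squared_error` and the sample values are illustrative assumptions, not taken from the original page.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


# Illustrative values only (not from the text).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # 0.375
```

The squared errors here are 0.25, 0.25, 0 and 1, so their average over n = 4 points is 0.375.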

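The formula for the L₂ loss in Equation 4068aa can be rewritten, for a model with hypothesis $h_\theta$, as:

$J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)^2$ ------------------------------------------- [4068ab]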

where,

n is the number of data points.

y_i is the actual output for the ith data point.

h_θ(x_i) is the predicted output for the ith data point.

The 1/2 factor is included to simplify the derivative of the loss during optimization.
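
To see why the factor helps, here is a short derivation sketch. It assumes the loss has the form $J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}(h_\theta(x_i) - y_i)^2$; the name $J(\theta)$ and the $1/(2n)$ normalization are assumptions chosen to match the symbols defined above.

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{2n}\sum_{i=1}^{n} 2\,\bigl(h_\theta(x_i) - y_i\bigr)\,
    \frac{\partial h_\theta(x_i)}{\partial \theta_j}
  = \frac{1}{n}\sum_{i=1}^{n} \bigl(h_\theta(x_i) - y_i\bigr)\,
    \frac{\partial h_\theta(x_i)}{\partial \theta_j}
```

The 2 produced by differentiating the square cancels the 1/2, leaving a gradient with no stray constant. The minimizing parameters are unchanged, because scaling a loss by a positive constant does not move its minimum.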

In linear regression, one goal is to minimize the ordinary least squares (OLS) or mean squared error (MSE) term, which measures the error between the predicted values $h_\theta(x^{(i)})$ and the actual values $y^{(i)}$, below,

$\frac{1}{2}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ --------------------------------- [3910ia]

Term 3910ia with L2 regularization (Ridge Regression) becomes the term below,

$\frac{1}{2}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2}\lVert\theta\rVert_2^2$ --------------------------------- [3910ib]
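
The effect of the L2 penalty can be illustrated with a one-parameter model. The sketch below assumes a no-intercept hypothesis h(x) = θx and the objective 0.5·Σ(θx − y)² + 0.5·λ·θ², for which setting the derivative to zero gives θ = Σxy / (Σx² + λ). The function name `ridge_slope`, the penalty strength and the data are illustrative assumptions.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of 0.5*sum((theta*x - y)**2) + 0.5*lam*theta**2 for h(x) = theta*x."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Setting the derivative to zero gives theta = sum(x*y) / (sum(x*x) + lam).
    return sum_xy / (sum_xx + lam)


# Illustrative data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_ols = ridge_slope(xs, ys, lam=0.0)     # no regularization: ordinary least squares
theta_ridge = ridge_slope(xs, ys, lam=10.0)  # L2 penalty shrinks the slope toward zero
print(theta_ols, theta_ridge)
```

With λ = 0 the result is the ordinary least squares slope. A larger λ adds to the denominator and pulls the slope toward zero, which is the shrinkage that ridge regression trades for lower variance.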

Higher-order polynomial models, such as fifth-order polynomials, are capable of fitting training data very closely, which can result in a very low training set error as shown in Figure 4068a. However, they are prone to overfitting. Overfit models memorize the training data and may not generalize well to unseen data. The low training error may not reflect the model's performance on new, unseen data, and it could lead to poor generalization.

Figure 4068a. Polynomial regression with different orders: (a) Polynomial regressions, and (b) Mean squared error (Code).

Figure 4068b compares the variances obtained without and with regularization. Here, the variance is taken to be the mean squared error (MSE) between the model predictions and the actual data points. Both cases use randomly distributed data as the dataset. The variance without regularization (alpha=0) is 0.71, slightly higher than the 0.67 obtained with regularization (alpha=1). The difference is small in this case, and in general the effect of regularization depends on the dataset and the specific parameters.

Figure 4068b. Comparison between variances: (a) without, and (b) with regularization. (code)

Figure 4068c shows the normalization effect on Mean squared error (MSE).

Figure 4068c. Normalization effect on loss function (Code): (a) Data distribution, and (b) Mean squared error (MSE).

The test error generally decreases in proportion to the inverse square root of the training set size, $1/\sqrt{n}$, where $n$ is the training set size, as shown in Figure 4068d. It is commonly observed that the test error decreases as $1/\sqrt{n}$ until it reaches a floor, often referred to as the "Bayes error" or "irreducible error." Consequently, learning algorithms may not drive the test error to zero, even with an infinite amount of data.

Figure 4068d. General trend of the test error decreasing as the square root of the training set size. (Code).
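
One hypothetical way to picture this trend is a curve made of an irreducible floor plus a term that shrinks as 1/√n. The sketch below only tabulates such a curve; the function name `error_trend` and the values of the floor and the scale are assumptions for illustration, not measured results.

```python
import math


def error_trend(n, irreducible_error, scale):
    """Hypothetical test-error trend: a fixed floor plus a term shrinking as 1/sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return irreducible_error + scale / math.sqrt(n)


# Assumed floor of 0.1 and scale of 1.0, chosen only to show the shape of the curve.
sizes = [10, 100, 1000, 10000]
errors = [error_trend(n, irreducible_error=0.1, scale=1.0) for n in sizes]

for n, err in zip(sizes, errors):
    print(f"n={n:>6}  test error ~ {err:.4f}")
```

Each hundredfold increase in n cuts the reducible part by only a factor of ten, and the curve never drops below the assumed floor.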

On the other hand, Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$ ------------------------------------------- [4068ab]
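
A minimal sketch of the RMSE calculation follows. The function name `root_mean_squared_error` and the sample values are illustrative assumptions.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared errors."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    mean_sq = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Illustrative values only (not from the text).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

rmse = root_mean_squared_error(actual, predicted)
print(round(rmse, 4))  # 0.6124
```

Because of the square root, the RMSE is expressed in the same units as the target values, which often makes it easier to interpret than the MSE.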

============================================

Linear Regression: In linear regression, the true loss is typically defined as the mean squared error (MSE):

$J(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)^2$ -------------------------------------- [4068b]

Here, $\theta$ consists of the parameters $\theta_0$ and $\theta_1$, and $J(\theta)$ is the MSE.

Here's a Python program to plot the Mean Squared Error (MSE) as a function of a parameter for a simple linear regression example. Code:
          [code listing]
       Output:
          [output figure]

In this script:

We generate some example data for a simple linear regression problem.
We define a range of values for the parameter θ₁ (in this case, it represents the slope of the linear regression model).
We calculate the MSE for each θ₁ value by making predictions using the linear regression model and comparing them to the actual data.
Finally, we create a plot that shows how the MSE changes as θ₁ varies. This plot helps visualize how the choice of θ₁ affects the goodness of fit in the linear regression model.
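
The steps above can be sketched as follows. This is not the original listing: the data (y = 2x plus Gaussian noise), the no-intercept model, the slope grid and all names are assumptions. The sweep uses only the standard library, and the optional helper `plot_mse_curve` imports matplotlib lazily, so matplotlib is needed only if the plot is requested.

```python
import random


def mse_for_slope(theta1, xs, ys):
    """MSE of the no-intercept model y_hat = theta1 * x."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 2x plus Gaussian noise (true slope and noise level are assumptions).
rng = random.Random(0)
xs = [i / 10 for i in range(1, 51)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

# Candidate slopes from 0.0 to 4.0 in steps of 0.1.
slopes = [k / 10 for k in range(0, 41)]
mse_values = [mse_for_slope(t, xs, ys) for t in slopes]

best_index = min(range(len(slopes)), key=mse_values.__getitem__)
best_slope = slopes[best_index]
print(f"lowest MSE {mse_values[best_index]:.3f} at slope {best_slope:.1f}")


def plot_mse_curve(slopes, mse_values):
    """Optional plot of MSE versus slope; requires matplotlib."""
    import matplotlib.pyplot as plt

    plt.plot(slopes, mse_values)
    plt.xlabel("slope (theta1)")
    plt.ylabel("MSE")
    plt.title("MSE as a function of the slope")
    plt.show()
```

Calling `plot_mse_curve(slopes, mse_values)` draws the curve. Since the MSE is quadratic in the slope, the curve is a parabola whose minimum sits near the slope used to generate the data.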

============================================

To plot the loss (MSE) versus epoch for a machine learning training process, we typically need training data and a model to train. Here's a Python script using the popular deep learning library TensorFlow and its Keras API to demonstrate how to create such a plot for a simple linear regression model. Code:
          [code listing]
       Output:
          [output figure]

In this script:

We generate example data for a simple linear regression problem.
We define a simple linear regression model using TensorFlow/Keras with one input and one output unit (for linear regression).
We compile the model using Mean Squared Error (MSE) as the loss function.
We train the model on the data for a specified number of epochs (training iterations) and collect the loss values for each epoch.
Finally, we create a plot that shows how the loss (MSE) changes with each epoch during training.
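
The original TensorFlow/Keras listing is not reproduced here. As a dependency-free stand-in, the sketch below trains the same kind of one-input, one-output linear model with plain full-batch gradient descent on the MSE and records the loss at every epoch. The data, learning rate, epoch count and all names are assumptions, and the optimizer is deliberately simpler than what Keras would use.

```python
import random


def train_linear_model(xs, ys, epochs=100, learning_rate=0.05):
    """Fit y_hat = w*x + b by full-batch gradient descent; return (w, b, per-epoch MSE)."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)  # MSE before this epoch's update
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


# Synthetic data: y = 3x + 1 plus Gaussian noise (an assumption for illustration).
rng = random.Random(1)
xs = [rng.uniform(0.0, 2.0) for _ in range(100)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.2) for x in xs]

w, b, losses = train_linear_model(xs, ys, epochs=200, learning_rate=0.1)
print(f"w={w:.2f}, b={b:.2f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.3f}")
```

Plotting `losses` against the epoch index gives the loss-versus-epoch curve. With a sufficiently small learning rate the curve falls at every epoch and flattens near the noise level of the data.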

=====================

To visualize both training and validation losses during the training process, we can use TensorFlow/Keras's built-in support for validation data. Code:
          [code listing]
       Output:
          [output figure]

In this updated script:

We split the example data into training and validation sets using the specified split_ratio, so that both training and validation losses can be monitored.
During training, we use the validation_data parameter to specify the validation data, allowing TensorFlow/Keras to calculate and track both training and validation losses.
We extract the training and validation loss values for each epoch from the training history.
Finally, we create a plot that shows the training and validation loss curves for each epoch.

This plot helps us visualize how the model performs on both the training and validation datasets throughout the training process, which is crucial for monitoring model generalization and potential overfitting.
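
The same bookkeeping can be sketched without TensorFlow. The example below is a stand-in, not the original script: it splits synthetic data with a `split_ratio`, runs per-sample stochastic gradient descent on a linear model, and stores both losses per epoch in a dictionary whose keys `loss` and `val_loss` mirror the names used by the Keras training history. The data, learning rate and helper names are assumptions.

```python
import random


def mse(w, b, data):
    """MSE of y_hat = w*x + b over a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit_with_validation(train, val, epochs=50, learning_rate=0.005):
    """Per-sample SGD; records training and validation MSE after every epoch."""
    shuffler = random.Random(0)
    w, b = 0.0, 0.0
    history = {"loss": [], "val_loss": []}
    for _ in range(epochs):
        order = train[:]
        shuffler.shuffle(order)
        for x, y in order:
            error = w * x + b - y
            w -= learning_rate * 2.0 * error * x
            b -= learning_rate * 2.0 * error
        history["loss"].append(mse(w, b, train))
        history["val_loss"].append(mse(w, b, val))
    return history


# Synthetic data: y = 3x + 1 plus Gaussian noise (an assumption for illustration).
rng = random.Random(2)
points = []
for _ in range(200):
    x = rng.uniform(0.0, 2.0)
    points.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.2)))

split_ratio = 0.8
split_index = int(len(points) * split_ratio)
train, val = points[:split_index], points[split_index:]

history = fit_with_validation(train, val)
print(f"final training loss {history['loss'][-1]:.3f}, "
      f"final validation loss {history['val_loss'][-1]:.3f}")
```

Plotting `history["loss"]` and `history["val_loss"]` on the same axes reproduces the kind of figure described above. For a model this simple the two curves stay close together, so no overfitting is expected.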

In the context of monitoring training and validation loss curves, a better performance is typically indicated by the following curve shapes:

Decreasing Training Loss: The training loss curve should decrease steadily as the number of epochs increases. This indicates that the model is learning from the training data and improving its performance.
Converging Validation Loss: The validation loss curve should also decrease initially, indicating that what the model learns from the training data carries over to unseen data. The key point to look for is convergence: the validation loss should eventually stabilize, leveling off rather than climbing. A plateau indicates that the model has learned to generalize well to unseen data, whereas a sustained rise, which gives the curve a U-shape, marks the onset of overfitting.

Here's a breakdown of the curve shapes:

Underfitting (High Bias): If both the training and validation loss curves remain high and do not decrease significantly, it suggests that the model is too simple to capture the underlying patterns in the data. This is a sign of underfitting.
Good Fit: A good model fit is characterized by a decreasing training loss and a validation loss that decreases initially and then levels off, with at most a slight increase. This suggests that the model is learning well from the training data and generalizing to new data.
Overfitting (High Variance): Overfitting is indicated by a decreasing training loss but an increasing validation loss after some point. The model is fitting the training data too closely and failing to generalize, leading to poor performance on unseen data.

Therefore, in terms of curve shapes, a better performance is often associated with a training loss that decreases and a validation loss that converges and stabilizes. The epoch at which the validation loss reaches its minimum, just before it starts to increase, is often the sweet spot for model selection, as it marks the best generalization reached before the model begins to overfit the training data.
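
The model-selection rule described above can be sketched as two small helpers: one that finds the epoch with the lowest validation loss, and one that applies a simple patience-based early-stopping rule. The function names, the patience value and the validation curve are made up for illustration; practical stopping criteria, such as those discussed in [2], can be more elaborate.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def early_stopping_epoch(val_losses, patience=3):
    """Epoch at which training would stop.

    Stops at the first epoch where the validation loss has failed to improve on
    its best value for `patience` consecutive epochs; returns the last epoch
    index if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation curve: falls, bottoms out at epoch 4, then rises.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.38, 0.39, 0.41, 0.44, 0.48]

print(best_epoch(val_losses))            # 4
print(early_stopping_epoch(val_losses))  # 7
```

Training would stop at epoch 7, after three epochs without improvement, and the weights saved at epoch 4 would be the ones to keep.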

The dependence of training loss and validation loss on the number of epochs (Epoch) can be expressed using equations that describe how these losses change over the course of training. Typically, these equations are defined as follows:

Training Loss (L_train):

The training loss measures how well the model fits the training data. It generally decreases during training as the model learns from the data. A common form of the equation is:

L_train(Epoch) = f_train(Epoch) ----------------------------- [4068c]
L_train(Epoch) represents the training loss at a specific epoch.
f_train(Epoch) is a function that describes the training loss at each epoch. The specific form of f_train(Epoch) depends on the model and optimization algorithm used.

Validation Loss (L_val):

The validation loss measures how well the model generalizes to unseen data. It typically decreases at first and then stabilizes, or starts increasing if overfitting occurs. The equation for validation loss is similar to that of the training loss:

L_val(Epoch) = f_val(Epoch) ----------------------------- [4068d]
L_val(Epoch) represents the validation loss at a specific epoch.
f_val(Epoch) is a function that describes the validation loss at each epoch. It is influenced by the model's generalization capability.

The specific form of f_train(Epoch) and f_val(Epoch) depends on factors like the loss function, optimization algorithm, and the data being used. For example, in the case of Mean Squared Error (MSE) loss with stochastic gradient descent (SGD) optimization, f_train(Epoch) and f_val(Epoch) would involve the computation of the loss over the training and validation datasets at each epoch.

These equations provide a general framework for understanding how training and validation losses change with the number of training epochs. The goal during training is to minimize the training loss while ensuring that the validation loss remains low without increasing significantly, indicating good generalization to unseen data.

In a well-run training process, it's not necessary for the training and validation loss curves to overlap [1,2]. In fact, it's common for the training loss to be lower than the validation loss. Here's why:

Training Loss (L_train): The training loss measures how well the model fits the training data. During training, the model is optimized to minimize this loss. It is expected to decrease continuously as the model learns from the training data.
Validation Loss (L_val): The validation loss measures how well the model generalizes to unseen data (i.e., the validation dataset). It is a crucial metric for assessing a model's performance on data it has not seen during training.

In a well-performing model:

The training loss tends to decrease continuously as the model learns and fits the training data. This is because the model is specifically optimized to minimize this loss.
The validation loss may decrease initially, indicating that the model is learning and improving its generalization. It is normal for the validation loss to stabilize after a certain point; a plateau, possibly with a very slight increase, suggests that the model is generalizing well to new data rather than overfitting the training data.

So, it's common for the validation loss to be slightly higher than the training loss, especially as training progresses. The key is that the gap between the two losses should not widen significantly, and the validation loss should not exhibit a sharp increase. A widening gap or a sharp increase in the validation loss would indicate potential overfitting, which is not desirable.

Therefore, while it's not necessary for the training and validation losses to overlap, it's essential for the validation loss to remain reasonable and not show signs of deteriorating performance as training progresses. The primary goal is to have a well-generalizing model with a low validation loss.

============================================

Generalization Error: In this example, we'll generate synthetic data and fit a polynomial regression model to it. We'll then calculate and visualize the training error and test error (generalization error) as the polynomial degree increases, demonstrating the trade-off between underfitting and overfitting. Code:
          [code listing]
       Output:
          [output figure]

This script generates a plot showing how the training error and test error change as the polynomial degree increases. It illustrates the concept of generalization error by demonstrating the trade-off between underfitting (high training and test error) and overfitting (low training error but high test error).
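
A compact sketch of such an experiment is shown below. It is not the original script: the target function sin(2πx), the noise level, the sample sizes and the range of degrees are assumptions. It relies on NumPy's `polyfit` and `polyval` and prints the errors instead of plotting them.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1]; the target function is an assumption."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)


x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

degrees = range(1, 10)
train_errors, test_errors = [], []
for degree in degrees:
    # Least-squares polynomial fit on the training set only.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors.append(float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)))
    test_errors.append(float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)))

for degree, train_mse, test_mse in zip(degrees, train_errors, test_errors):
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The training error can only fall as the degree grows, because each higher-degree model contains the lower-degree ones. The test error typically falls at first and then stops improving or rises once the extra flexibility starts fitting noise.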

To find the right balance between underfitting and overfitting, we typically use techniques like cross-validation and validation datasets to assess model performance. These techniques help us select a model that generalizes well to unseen data and doesn't underfit or overfit.
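
As one illustration of this idea, the sketch below uses k-fold cross-validation to compare polynomial degrees by their average validation MSE. The quadratic data-generating function, the number of folds, the candidate degrees and the helper name `k_fold_mse` are assumptions.

```python
import numpy as np


def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a degree-`degree` polynomial fit over k folds."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))


# Synthetic data from a quadratic with Gaussian noise (an assumption for illustration).
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, 60)

cv_mse = {degree: k_fold_mse(x, y, degree) for degree in range(1, 7)}
best_degree = min(cv_mse, key=cv_mse.get)
print(f"degree with lowest cross-validated MSE: {best_degree}")
```

A straight line (degree 1) underfits this data and scores a clearly higher cross-validated MSE than degree 2. Higher degrees gain little or nothing, which is the signal used to avoid overfitting.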

In addition to this simplified mathematical description, we can also use more complex metrics like learning curves, bias-variance trade-off analysis, or measures like the mean squared error (MSE) to assess the level of underfitting or overfitting in the models.

============================================

Correlations/similarity/dissimilarity between csv data using mean squared error (MSE): The script reads data from multiple folders, calculates the mean squared error as a measure of dissimilarity between the data in "FolderOne" and each other folder, finds the best match for each file in the other folders, and computes overall correlations for each folder based on the MSE values. It then prints the results, providing insights into the similarity/dissimilarity between the data in "FolderOne" and the other folders. Note that the script below is not a machine learning script. Code:
          [code listing]
       Input:
          [input data]
       Output:
          [output]
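
The original listing is not reproduced here, so the following is only a sketch of the matching step under stated assumptions: each CSV file holds a single numeric column with no header, files are compared over their overlapping length, and the helper names (`read_column`, `best_matches`) and the demo folder "FolderTwo" with its files are hypothetical. The folder-level correlation score mentioned above is not attempted.

```python
import csv
import tempfile
from pathlib import Path


def read_column(path):
    """Read the first column of a CSV file as floats (assumes no header row)."""
    with open(path, newline="") as handle:
        return [float(row[0]) for row in csv.reader(handle) if row]


def mse(a, b):
    """MSE over the overlapping length of two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("cannot compare empty data")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / n


def best_matches(reference_dir, other_dir):
    """For each CSV in reference_dir, find the CSV in other_dir with the lowest MSE."""
    others = {p.name: read_column(p) for p in sorted(Path(other_dir).glob("*.csv"))}
    results = {}
    for ref_path in sorted(Path(reference_dir).glob("*.csv")):
        ref = read_column(ref_path)
        scores = {name: mse(ref, data) for name, data in others.items()}
        best = min(scores, key=scores.get)
        results[ref_path.name] = (best, scores[best])
    return results


# Demo on temporary folders with made-up data.
with tempfile.TemporaryDirectory() as root:
    folder_one = Path(root, "FolderOne")
    folder_two = Path(root, "FolderTwo")
    folder_one.mkdir()
    folder_two.mkdir()
    (folder_one / "a.csv").write_text("1\n2\n3\n")
    (folder_two / "close.csv").write_text("1.1\n2.0\n2.9\n")
    (folder_two / "far.csv").write_text("5\n5\n5\n")
    matches = best_matches(folder_one, folder_two)

print(matches)
```

A lower MSE means the two files are more similar, so in the demo `a.csv` is matched to `close.csv` rather than `far.csv`. Note that `best_matches` raises an error if the second folder contains no CSV files.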

============================================

[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Understanding Deep Learning Requires Rethinking Generalization (2017).
[2] Lutz Prechelt, Early Stopping - But When?, Neural Networks (1998).

=================================================================================