Bias and Variance, and Bias-Variance Trade-off in ML
- Python and Machine Learning for Integrated Circuits -
- An Online Book -
Python and Machine Learning for Integrated Circuits                                                           http://www.globalsino.com/ICs/


=================================================================================

Bias and variance are two fundamental concepts in machine learning that relate to a model's ability to make accurate predictions and generalize from the training data to unseen data. They are often discussed together as the bias-variance trade-off:

i) Bias: Bias refers to the error introduced by approximating a real-world problem with a simplified model. It represents the model's assumptions and how closely they align with the true relationship between the features and the target variable. High bias leads to underfitting: the model is too simplistic to capture the underlying patterns in the data. Mathematically, bias can be measured as the difference between the expected (average) prediction of your model and the true values you are trying to predict. In a regression, where you are trying to predict a continuous target variable y with a model h(x), the bias can be represented as:

Bias = E[h(x) - y] --------------------------------------------- [3806a]

Here, E[...] represents the expected value over all possible training sets. Essentially, it quantifies how far off, on average, your predictions are from the true values.

ii) Variance: Variance, on the other hand, represents the model's sensitivity to small fluctuations in the training data. A model with high variance is overly complex: it can fit the training data very closely, including its noise, but may not generalize well to new, unseen data, which leads to overfitting. In a regression, variance can be represented as:

Variance = E[(h(x) - E[h(x)])²] ------------------------------------------- [3806b]

It calculates how much the predictions for a given data point x differ from the expected prediction over all possible training sets. The trade-off implies that as you increase the complexity of your model (e.g., by using a more flexible algorithm or increasing the model capacity), you may reduce bias (training error) but increase variance (sensitivity to variations in the training data).
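To make Eqs. [3806a] and [3806b] concrete, the expectations over training sets can be approximated by simulation: repeatedly draw a training set, fit the same model h(x), and compare the averaged predictions with the true values. The sketch below is illustrative only (the sine target, noise level, sample sizes, and the deliberately simple degree-1 model are all arbitrary choices for the demonstration, not settings from this book's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)          # the "real-world" function we try to learn

x_test = np.linspace(0, np.pi, 20)
n_trials, n_train, noise = 200, 15, 0.3
degree = 1                    # a deliberately simple (high-bias) model

preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x_tr = rng.uniform(0, np.pi, n_train)
    y_tr = true_f(x_tr) + rng.normal(0, noise, n_train)
    coef = np.polyfit(x_tr, y_tr, degree)   # h(x): least-squares polynomial fit
    preds[t] = np.polyval(coef, x_test)

mean_pred = preds.mean(axis=0)                          # E[h(x)] over training sets
bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)    # squared form of Eq. [3806a]
variance = np.mean(preds.var(axis=0))                   # Eq. [3806b], averaged over x
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

Because a straight line cannot follow the sine curve, the estimated bias² dominates the variance in this particular setup; raising the polynomial degree shifts the balance the other way.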

Cross-validation and learning curves can be used to assess bias and variance indirectly. Few techniques measure bias and variance directly; for a direct evaluation, we can use bias-variance decomposition, which breaks the mean squared error of the model down into three components: bias, variance, and irreducible error. This decomposition provides a direct assessment of how much of the error is due to bias and how much is due to variance.

The challenge arises in finding the right balance between bias and variance:

1. Trade-off: You need to strike a balance between bias and variance. Reducing one often increases the other. Achieving the right trade-off depends on the specific problem and dataset, and there's no one-size-fits-all solution.

2. Data-dependent: The optimal bias-variance trade-off varies with the data. What works for one dataset may not work for another. This means you need to adapt your model and its complexity to each problem.

3. Real-world data is messy: Real-world data is often noisy and may contain outliers and unmodeled factors. This makes it challenging to build models that generalize well.

4. Dimensionality: High-dimensional data often introduces additional challenges in managing bias and variance, as complex models can overfit even more easily.

5. Cross-validation and hyperparameter tuning: Determining the right level of model complexity (e.g., selecting the right hyperparameters) often requires extensive experimentation and validation.

6. Experience: Mastering the bias-variance trade-off comes with experience and a deep understanding of the problem domain, data, and the specific machine learning algorithms being used.

The bias-variance trade-off is a crucial concept in machine learning. If you increase the complexity of a model (e.g., by adding more features or using a more complex algorithm), you reduce bias but increase variance, and vice versa. Finding the right balance is essential for creating models that generalize well to unseen data.

The goal in machine learning is to minimize the combined error due to bias and variance, given by,

Total Error = Bias² + Variance + Irreducible Error ------------------------------------- [3806c]

Irreducible error is the error that you cannot reduce because of the inherent noise in the data. The trade-off is that as you reduce bias, variance increases, and vice versa. The challenge is to find a model that strikes the right balance to achieve good generalization on unseen data. Figure 3806a shows the bias-variance trade-off, and underfitting and overfitting in machine learning.

Figure 3806a. (a) Bias-variance trade-off, and (b) underfitting and overfitting in machine learning (Code).

To find the right balance between underfitting and overfitting, you typically use techniques like cross-validation and validation datasets to assess model performance. These techniques help you select a model that generalizes well to unseen data and doesn't underfit or overfit.
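A minimal sketch of using cross-validation to choose model complexity follows; the data, the five-fold split, and the candidate polynomial degrees are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, np.pi, 60)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# One fixed 5-fold split, reused for every candidate degree so the comparison is fair
folds = np.array_split(rng.permutation(x.size), 5)

def cv_mse(degree):
    """Mean held-out MSE of a polynomial of the given degree under 5-fold CV."""
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(x.size), f)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[f]) - y[f]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_mse(d) for d in range(1, 7)}
best = min(scores, key=scores.get)
print("validation MSE by degree:", scores, "-> chosen degree:", best)
```

The straight line (degree 1) underfits the sine data, so cross-validation steers the choice toward an intermediate degree rather than the simplest or the most complex candidate.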

In addition to this simplified mathematical description, you can also use more complex metrics like learning curves, bias-variance trade-off analysis, or measures like the mean squared error (MSE) to assess the level of underfitting or overfitting in your models. Bias and variance are two key components of the MSE of an estimator. The MSE is a measure of how well an estimator performs.

MSE is defined as the average of the squared differences between the estimated values and the true values. Figure 3806b shows the parameter distributions obtained after running four algorithms. The red disk at the center is the input data (true data). The x-axis and y-axis are two features θ1 and θ2. Each dot in the images represents a sample of size M; that is, the dots are samples drawn from the sampling distribution, and the number of dots is the number of experiments. Figures 3806b (a), (c) and (d) have low bias, while Figure 3806b (b) has high bias. Figures 3806b (a), (b) and (d) have high variance, while Figure 3806b (c) has low variance. That is, if the distribution of (θ1, θ2) is centered around the true parameter (the red disk), the bias is low, while the variance measures how dispersed the sampling distribution is.

Figure 3806b. Variance and bias in ML: (a) Obtained from algorithm A, (b) Obtained from algorithm B (code), (c) Obtained from algorithm C (code), and (d) Obtained from algorithm D.

The output after the first training process in Figure 3806b indicates the statistical efficiency of the individual learning algorithm, which refers to how well a model utilizes the available data to make accurate predictions.

Figure 3806c shows the variance becoming smaller and smaller from (a) to (d) as the number of training steps increases in a machine learning process.

Figure 3806c. Changes of variance during the training process (a), (b), (c) and (d): E[θ^] = θ* for all M. (code)

In the training process, the goal is to make the expected value of the estimated parameters (denoted E[θ^]) equal to the true parameters (θ*) for every sample size M, namely,

E[θ^] = θ* for all M -------------------------------------------------------- [3806d]

The relationship between sample size and bias/variance is given by:

• Increasing the sample size often leads to a reduction in variance. When you have more data points, your estimate becomes more stable and less likely to be influenced by random fluctuations or outliers in the data. In the limit of infinitely many samples, the variance approaches zero.

• Increasing the sample size can also affect bias, but the relationship is not as straightforward. In some cases, a larger sample size may reduce bias by providing a more representative sample of the population. However, in complex modeling situations, especially with overfitting, increasing the sample size might not necessarily reduce bias and could even increase it if the model complexity is not adjusted appropriately.
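The first bullet can be verified with a small simulation (the linear model y = 2x + noise and the two sample sizes below are arbitrary illustrative choices): measure the variance of a fitted parameter across many redrawn training sets.

```python
import numpy as np

rng = np.random.default_rng(3)

def slope_variance(n_train, n_trials=300):
    """Variance of the fitted slope of y = 2x + noise across redrawn samples."""
    slopes = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = 2 * x + rng.normal(0, 0.5, n_train)
        slopes[t] = np.polyfit(x, y, 1)[0]   # fitted slope of a degree-1 fit
    return slopes.var()

v_small, v_large = slope_variance(10), slope_variance(200)
print(v_small, v_large)   # the variance shrinks as the sample size grows
```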

Variance is sometimes called the "wild horse" of machine learning. Ways of reducing variance include:

i) Increase your training data. More examples give your model a broader perspective and help it generalize better. Figure 3806cb shows the effect of training data size on error; the training and validation errors shown there serve as proxies for diagnosing bias and variance.

Figure 3806cb. Effect of training data size on error. (code)

ii) Regularization techniques. Figure 3806d shows a comparison between variances with and without regularization, where the variance represents the mean squared error (MSE) between the model predictions and the actual data points. Both cases used randomly distributed data as the dataset. The variances without and with regularization are 0.71 and 0.67, respectively: the variance without regularization (alpha=0) is slightly higher than the variance with regularization (alpha=1). In some cases the difference is quite small, so the effect of regularization varies with the dataset and the specific parameters.

Figure 3806d. Comparison between variances without and with regularization. (code)

Figure 3806e shows the comparison between bias without and with regularization.

Figure 3806e. Comparison between bias without and with regularization. (code)

Regularization tends to reduce overfitting, which means it helps in reducing variance rather than bias. While regularization might slightly increase the bias in some cases due to the penalty on complex models, the primary purpose of regularization is to control variance and improve the model's generalization to new data.
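The variance-reducing effect of regularization can be sketched with a closed-form ridge estimator on polynomial features. The degree-7 features, penalty strength alpha, and data below are illustrative assumptions, not the settings behind Figures 3806d and 3806e:

```python
import numpy as np

rng = np.random.default_rng(4)
x_test = np.linspace(0, 1, 20)
X_test = np.vander(x_test, 8)               # degree-7 polynomial features

def fit_ridge(X, y, alpha):
    """Closed-form ridge solution; alpha = 0 recovers ordinary least squares."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def prediction_variance(alpha, n_trials=300):
    """Average variance of the predictions across many redrawn training sets."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, 25)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
        preds[t] = X_test @ fit_ridge(np.vander(x, 8), y, alpha)
    return np.mean(preds.var(axis=0))

v_ols, v_ridge = prediction_variance(0.0), prediction_variance(1.0)
print(v_ols, v_ridge)   # the penalized model varies much less across training sets
```

The unpenalized degree-7 fit swings wildly from one training set to the next, while the ridge penalty pins the weights down, trading a little bias for a large reduction in variance.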

iii) Simplify the model, like pruning unnecessary features or reducing its complexity.

iv) Ensemble methods, like bagging and boosting, are like the Avengers of machine learning—they bring together different models to reduce variance.
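A toy sketch of bagging follows, using a deliberately high-variance 1-nearest-neighbour predictor as the base model (the data, ensemble size, and base model are illustrative assumptions): each ensemble member is fit on a bootstrap resample, and their predictions are averaged.

```python
import numpy as np

rng = np.random.default_rng(5)
x_te = np.linspace(0, 1, 30)

def one_nn_predict(x_tr, y_tr, x_te):
    """1-nearest-neighbour regression: a deliberately high-variance base model."""
    return y_tr[np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)]

def bagged_predict(x_tr, y_tr, x_te, n_models=25):
    """Average n_models base models, each fit on a bootstrap resample."""
    preds = np.empty((n_models, x_te.size))
    for m in range(n_models):
        b = rng.integers(0, x_tr.size, x_tr.size)   # sample indices with replacement
        preds[m] = one_nn_predict(x_tr[b], y_tr[b], x_te)
    return preds.mean(axis=0)

single, bagged = [], []
for _ in range(200):                                # redraw the training set
    x_tr = rng.uniform(0, 1, 40)
    y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, 40)
    single.append(one_nn_predict(x_tr, y_tr, x_te))
    bagged.append(bagged_predict(x_tr, y_tr, x_te))

v_single = np.mean(np.var(single, axis=0))
v_bagged = np.mean(np.var(bagged, axis=0))
print(v_single, v_bagged)   # the bagged ensemble varies less across training sets
```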

In Leave-One-Out Cross-Validation (LOOCV), the variance measures how much the individual estimations differ from the mean estimation. A higher variance indicates greater variability in the model's performance across the different data points. A lower variance suggests that the model's performance is more consistent when evaluated on different subsets of the data. Measuring variance can be helpful in assessing the robustness of your model and identifying whether it is sensitive to specific data points. If the variance is too high, it may suggest that the model's performance is unstable, and further investigation may be needed to understand the sources of variability and potential improvements.
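A minimal LOOCV sketch for a linear fit (the data-generating line y = 2x + 1 and the noise level are illustrative assumptions): fit on all points but one, score on the held-out point, and summarize the per-point errors by their mean and variance.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
y = 2 * x + 1 + rng.normal(0, 0.2, 30)

# Leave-one-out: fit on all points but one, score on the held-out point
errors = np.empty(x.size)
for i in range(x.size):
    mask = np.ones(x.size, dtype=bool)
    mask[i] = False
    coef = np.polyfit(x[mask], y[mask], 1)
    errors[i] = (np.polyval(coef, x[i]) - y[i]) ** 2

print(errors.mean(), errors.var())   # mean LOOCV error and its variability
```

Here `errors.var()` is the LOOCV variance discussed above: a large value would signal that the model's performance depends heavily on which individual point is held out.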

Table 3806 lists the factors which can affect bias and variance in machine learning.

Table 3806. Factors which can affect bias and variance.

Factor | Effect on bias | Effect on variance
Model Complexity | High bias occurs when a model is too simple and unable to capture the underlying patterns in the data, leading to systematic errors or oversimplification. | High variance happens when a model is too complex and captures noise in the training data, leading to poor generalization to new data.
Dataset Size | Small datasets may not provide enough information for the model to learn complex patterns, resulting in high bias. | With small datasets, models may fit the noise in the data, leading to high variance.
Feature Selection | If important features are excluded, the model may have high bias as it cannot capture the underlying patterns. | Including irrelevant features may increase the complexity of the model, leading to high variance.
Regularization | Regularization methods, such as L1 or L2 regularization, can be used to reduce model complexity and prevent overfitting, which helps control bias. | Proper regularization can also prevent the model from fitting the training data too closely, reducing variance.
Noise in the Data | Noisy data can introduce errors and mislead the learning algorithm, resulting in high bias. | Noise in the training data can be fitted by a complex model, leading to high variance.
Model Selection | Choosing a model that is too simple for the complexity of the data can result in high bias. | Selecting a model that is too complex for the given data can lead to high variance.
Training Duration | Insufficient training may result in the model not capturing the underlying patterns, leading to high bias. | Overtraining on the training data can result in the model fitting noise, increasing variance.
Cross-Validation | Cross-validation helps in assessing how well the model generalizes to new data and can help reduce bias. | Cross-validation can also provide insights into the model's stability and variance.
Ensemble Methods | Ensemble methods, such as bagging, can help reduce bias by combining predictions from multiple models. | Ensemble methods can also reduce variance by smoothing out individual model predictions.

Furthermore, the relationship between the degree of variance and the size of the hypothesis class is:
i) Small Hypothesis Class (Low Complexity): Models are simpler, with fewer parameters. This often leads to high bias and low variance. The model may not be able to capture the complexity of the underlying data.
ii) Large Hypothesis Class (High Complexity): Models are more complex, with more parameters. This can lead to low bias but high variance. The model may fit the training data very well but might not generalize well to new, unseen data.

Figure 3806f shows linear regression learning curves, which involve training the model on different subsets of the training data and show the training and test errors for each subset size.

Figure 3806f. Linear regression learning curves. (code)

When the training sample size is small, the training error is also small. For instance, if we only have one example for training, then any algorithm can fit it perfectly. When the sample size is large, it is harder to fit the training data perfectly. However, when the training sample size is small, the test error still decreases as the sample size increases, which suggests that a larger training set will help. The larger gap between training and test errors at small sample sizes is often attributed to high model variance: when the training set is relatively small, the model has a higher tendency to overfit it. On the other hand, a large deviation of the training error from the desired performance is a sign of large bias. Another sign of large bias is when the gap between the training error and test error is very small, yet the training error never comes back down to the desired performance line no matter how many samples we have. Note that a common "standard" of desired performance (allowed maximum error) is what humans can achieve. In general, the test error should always be higher than the training error no matter how many samples we have.
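The learning-curve behaviour described above can be reproduced with a small simulation (a sketch with an assumed cubic model and sine target, not the code behind Figure 3806f): at a small training size the train-test gap is large (high variance), and the gap closes as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(7)
x_te = rng.uniform(0, np.pi, 200)                  # fixed held-out test set
y_te = np.sin(x_te) + rng.normal(0, 0.3, 200)

def train_test_errors(n_train, n_trials=100):
    """Average train and test MSE of a cubic fit for a given training-set size."""
    tr = te = 0.0
    for _ in range(n_trials):
        x = rng.uniform(0, np.pi, n_train)
        y = np.sin(x) + rng.normal(0, 0.3, n_train)
        coef = np.polyfit(x, y, 3)
        tr += np.mean((np.polyval(coef, x) - y) ** 2)      # training error
        te += np.mean((np.polyval(coef, x_te) - y_te) ** 2)  # test error
    return tr / n_trials, te / n_trials

tr_small, te_small = train_test_errors(6)
tr_large, te_large = train_test_errors(100)
print(tr_small, te_small)    # small sample: tiny training error, large gap
print(tr_large, te_large)    # large sample: the two errors converge
```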

=================================================================================