Bias and Variance, and Bias-Variance Tradeoff in ML - Python and Machine Learning for Integrated Circuits - An Online Book

Python and Machine Learning for Integrated Circuits http://www.globalsino.com/ICs/  


=================================================================================

Bias and variance are two fundamental concepts in machine learning that relate to a model's ability to make accurate predictions and to generalize from the training data to unseen data. They are often discussed together as the bias-variance tradeoff:

i) Bias: Bias refers to the error introduced by approximating a real-world problem with a simplified model. It reflects the model's assumptions and how closely they align with the true relationship between the features and the target variable. High bias can lead to underfitting, where the model is too simplistic to capture the underlying patterns in the data. Mathematically, bias is the difference between the expected (average) prediction of the model and the true values it is trying to predict. In a regression problem, where a model h(x) predicts a continuous target variable y, the bias can be written as:

Bias = E[h(x) - y]          [3806a]

Here, E[...] denotes the expected value over all possible training sets. Essentially, it quantifies how far off, on average, the predictions are from the true values.

ii) Variance: Variance, on the other hand, represents the model's sensitivity to small fluctuations in the training data. A model with high variance is overly complex: it can fit the training data very closely, including its noise, but it may not generalize well to new, unseen data. High variance can lead to overfitting. In a regression, the variance can be written as:

Variance = E[(h(x) - E[h(x)])^2]          [3806b]

It measures how much the prediction for a given data point x varies around the expected prediction over all possible training sets.
The tradeoff implies that as you increase the complexity of your model (e.g., by using a more flexible algorithm or increasing the model capacity), you may reduce bias (training error) but increase variance (sensitivity to variations in the training data). Cross-validation and learning curves can be used to assess bias and variance indirectly. Few techniques measure them directly; for a direct evaluation, we can use bias-variance decomposition, which breaks the mean squared error of the model into three components: bias, variance, and irreducible error. This decomposition shows directly how much of the error is due to bias and how much is due to variance. The challenge arises in finding the right balance between bias and variance:
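As an example of the indirect assessment mentioned above, a plain k-fold cross-validation loop can compare model complexities. This is a hypothetical sketch (sine target, polynomial fits, and all settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 60)

def kfold_mse(degree, k=5):
    """Average held-out MSE of a degree-`degree` polynomial over k folds."""
    idx = rng.permutation(len(x))
    mses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        mses.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return float(np.mean(mses))

for d in (1, 3, 12):
    print(d, kfold_mse(d))   # degree 1 underfits; high degrees start to overfit
```

The held-out error is lowest at an intermediate complexity, which is exactly the balance point the tradeoff describes.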
The bias-variance tradeoff is a crucial concept in machine learning. If you increase the complexity of a model (e.g., by adding more features or using a more complex algorithm), you reduce bias but increase variance, and vice versa. Finding the right balance is essential for creating models that generalize well to unseen data. The goal in machine learning is to minimize the combined error due to bias and variance, given by:

Total Error = Bias^2 + Variance + Irreducible Error          [3806c]

Irreducible error is the error that cannot be reduced because of the inherent noise in the data. The challenge is to find a model that strikes the right balance to achieve good generalization on unseen data. Figure 3806a shows the bias-variance tradeoff, and underfitting and overfitting in machine learning.
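Equation [3806c] can be checked numerically at a single query point: the Monte-Carlo MSE of a model against fresh noisy targets should match Bias^2 + Variance + Irreducible Error. A sketch under assumed toy settings (linear true function, Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                        # noise std; irreducible error = sigma^2
f = lambda x: 2.0 * x              # assumed true relationship
x0 = 0.7                           # fixed query point

# Train many linear models on independent noisy training sets
preds = []
for _ in range(5000):
    x = rng.uniform(0.0, 1.0, 30)
    y = f(x) + rng.normal(0.0, sigma, 30)
    slope, intercept = np.polyfit(x, y, 1)
    preds.append(slope * x0 + intercept)
preds = np.asarray(preds)

bias_sq = (preds.mean() - f(x0)) ** 2      # Bias^2 term
variance = preds.var()                     # Variance term
irreducible = sigma ** 2                   # noise floor

# Direct Monte-Carlo MSE against fresh noisy targets at x0
y0 = f(x0) + rng.normal(0.0, sigma, preds.size)
mse = np.mean((preds - y0) ** 2)

print(mse, bias_sq + variance + irreducible)   # the two sides nearly agree
```

With enough simulated training sets the two printed numbers differ only by Monte-Carlo noise, confirming the decomposition.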
To find the right balance between underfitting and overfitting, you typically use techniques like cross-validation and held-out validation datasets to assess model performance. These techniques help you select a model that generalizes well to unseen data and neither underfits nor overfits. In addition to this simplified mathematical description, you can also use learning curves, bias-variance tradeoff analysis, or metrics like the mean squared error (MSE) to assess the level of underfitting or overfitting in your models. Bias and variance are the two key components of the MSE of an estimator, where the MSE, a measure of how well an estimator performs, is defined as the average of the squared differences between the estimated values and the true values. Figure 3806b shows the parameter distributions obtained after running four algorithms. The red disk at the center is the true parameter (true data). The x-axis and y-axis are the two parameters θ_1 and θ_2. Each dot represents an estimate from one sample of size M; in other words, the dots are samples from the sampling distribution, and the number of dots is the number of experiments. Figures 3806b (a), (c) and (d) have low bias, while Figure 3806b (b) has high bias. Figures 3806b (a), (b) and (d) have high variance, while Figure 3806b (c) has low variance. That is, if the distribution of (θ_1, θ_2) is centered around the true parameter (the red disk), the estimator has low bias, while variance measures how dispersed the sampling distribution is.
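The bias and variance of a sampling distribution like the one in Figure 3806b can be quantified by repeating the estimation experiment many times and measuring the offset of the cloud's center from the true parameter and the cloud's dispersion. This is a toy sketch; the true parameters, the sample-mean estimator, and the shrinkage estimator are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = np.array([1.0, -2.0])   # the "red disk": true (theta_1, theta_2)
M = 25                                # sample size per experiment
n_exp = 2000                          # number of dots (experiments)

# Each experiment: M noisy 2-D observations of theta, then an estimate
samples = theta_true + rng.normal(0.0, 1.0, (n_exp, M, 2))
est_mean = samples.mean(axis=1)       # sample-mean estimator (unbiased)
est_shrunk = 0.5 * est_mean           # shrinkage toward 0: biased, less spread

def bias_and_var(est):
    bias = np.linalg.norm(est.mean(axis=0) - theta_true)  # center offset
    var = est.var(axis=0).sum()                           # cloud dispersion
    return bias, var

b_mean, v_mean = bias_and_var(est_mean)
b_shr, v_shr = bias_and_var(est_shrunk)
print(b_mean, v_mean)   # near-zero bias, larger variance
print(b_shr, v_shr)     # larger bias, smaller variance
```

The sample-mean cloud is centered on the red disk (low bias) but more spread out, while the shrunk cloud is tighter but off-center, mirroring the panels in the figure.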
The output after the first training process in Figure 3806b reflects the statistical efficiency of each learning algorithm, that is, how well the model utilizes the available data to make accurate predictions. Figure 3806c shows the variance becoming smaller from (a) to (d) as the number of training steps increases in a machine learning process.
In the training process, the goal is to make the expected value of the estimated parameters (denoted E[θ̂]) equal to the true parameters (θ*) across all datasets of size M, namely,

E[θ̂] = θ* for all data M          [3806d]

The relationship between sample size and bias/variance is given by:
Variance is sometimes called the "wild horse" of machine learning. The ways of reducing variance are:

i) Increase your training data. More examples give your model a broader perspective and help it generalize better. Figure 3806cb shows the effect of training data size on error; the training error and validation error can be used as proxies for bias and variance.

Figure 3806cb. Effect of training data size on error. (code)

ii) Regularization techniques. Figure 3806d shows a comparison between the variances with and without regularization, where the variance represents the mean squared error (MSE) between the model predictions and the actual data points. Both cases used randomly distributed data as the dataset. The variances without and with regularization are 0.71 and 0.67, respectively: the variance without regularization (alpha=0) is slightly higher than with regularization (alpha=1). However, the difference can be quite small, and the effect of regularization varies with the dataset and the specific parameters.

Figure 3806d. Comparison between variances without and with regularization. (code)

Figure 3806e shows the comparison between bias without and with regularization.

Figure 3806e. Comparison between bias without and with regularization. (code)

Regularization tends to reduce overfitting, which means it helps in reducing variance rather than bias. While regularization might slightly increase the bias in some cases due to the penalty on complex models, its primary purpose is to control variance and improve the model's generalization to new data.

iii) Simplify the model, e.g., by pruning unnecessary features or reducing its complexity.

iv) Ensemble methods, like bagging and boosting, are like the Avengers of machine learning: they bring together different models to reduce variance.
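The regularization comparison behind Figure 3806d can be reproduced in spirit with closed-form ridge regression: fit the same model family on many training sets with alpha=0 and alpha=1 and compare the prediction variance. A sketch under assumed settings (polynomial features, sine target; the 0.71/0.67 values from the figure are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_fit(Phi, y, alpha):
    """Closed-form ridge: w = (Phi^T Phi + alpha I)^(-1) Phi^T y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(p), Phi.T @ y)

def prediction_variance(alpha, n_sets=300, n_train=15, degree=8):
    """Variance of polynomial-feature predictions across training sets."""
    x_test = np.linspace(-1.0, 1.0, 40)
    Phi_test = np.vander(x_test, degree + 1)
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, n_train)
        w = ridge_fit(np.vander(x, degree + 1), y, alpha)
        preds[i] = Phi_test @ w
    return float(np.mean(preds.var(axis=0)))

v_no_reg = prediction_variance(alpha=0.0)   # no regularization
v_reg = prediction_variance(alpha=1.0)      # with regularization
print(v_no_reg, v_reg)                      # regularization lowers variance
```

The penalty shrinks the weights, so the fitted curves vary much less from one training set to the next, which is the variance reduction the text describes.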
In Leave-One-Out Cross-Validation (LOOCV), the variance measures how much the individual estimations differ from the mean estimation. A higher variance indicates greater variability in the model's performance across the different data points. A lower variance suggests that the model's performance is more consistent when evaluated on different subsets of the data. Measuring variance can be helpful in assessing the robustness of your model and identifying whether it is sensitive to specific data points. If the variance is too high, it may suggest that the model's performance is unstable, and further investigation may be needed to understand the sources of variability and potential improvements. Table 3806 lists the factors which can affect bias and variance in machine learning.

Table 3806. Factors which can affect bias and variance.
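A minimal LOOCV loop illustrating the per-point error variability described above (the linear data with Gaussian noise is an assumed toy setup):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 20)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.2, 20)

# LOOCV: fit on all-but-one point, record the held-out squared error
errors = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    errors.append((slope * x[i] + intercept - y[i]) ** 2)
errors = np.asarray(errors)

loocv_mean = errors.mean()   # LOOCV estimate of the test error
loocv_var = errors.var()     # variability across held-out points
print(loocv_mean, loocv_var)
```

A large `loocv_var` relative to `loocv_mean` would flag exactly the instability the paragraph warns about: a few held-out points dominating the error.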
Furthermore, the relationship between the degree of variance and the size of the hypothesis class is: Figure 3806f shows linear regression learning curves, which involve training the model on different subsets of the training data and plotting the training and test errors for each subset size.

Figure 3806f. Linear regression learning curves. (code)

When the training sample size is small, the training error is also small; for instance, with only one training example, any algorithm can fit it perfectly. When the sample size is large, it is harder to fit the training data perfectly. However, when the training sample size is small, the test error still decreases as the sample size increases, which suggests that a larger sample set will help. The larger gap between training and test errors at small sample sizes is often attributed to high model variance; this is common when the training set is relatively small and the model has a higher tendency to overfit the training data. On the other hand, a large deviation of the training error from the desired performance is a sign of large bias. Another sign of large bias is when the gap between the training error and the test error is very small, yet the training error never comes back down to reach the desired performance line no matter how many samples we have. Note that a common "standard" of desired performance (allowed maximum error) is what humans can achieve. In general, the test error should always be higher than the training error no matter how many samples we have.

============================================
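Learning curves like those in Figure 3806f can be sketched by fitting the same model at several training-set sizes and averaging the train/test errors; everything here (linear target, noise level, sizes) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(6)

def errors_at_size(n_train, n_trials=200, noise=0.5):
    """Average train/test MSE of a linear fit at a given training-set size."""
    x_test = rng.uniform(0.0, 1.0, 200)
    y_test = 2.0 * x_test + rng.normal(0.0, noise, 200)
    tr, te = [], []
    for _ in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = 2.0 * x + rng.normal(0.0, noise, n_train)
        slope, intercept = np.polyfit(x, y, 1)
        tr.append(np.mean((slope * x + intercept - y) ** 2))
        te.append(np.mean((slope * x_test + intercept - y_test) ** 2))
    return float(np.mean(tr)), float(np.mean(te))

for n in (2, 5, 20, 100):
    print(n, errors_at_size(n))   # training error rises, test error falls
```

Two points are fit perfectly by a line (near-zero training error, large test error); as the sample grows, the two curves converge toward the noise floor, matching the behavior described in the text.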


=================================================================================  

