=================================================================================
Bagging, short for Bootstrap AGGregatING, is a machine learning ensemble technique designed to improve the stability and accuracy of machine learning algorithms. The name is a blend of the words "bootstrap" and "aggregating." The basic idea behind bagging is to train multiple instances of the same learning algorithm on different subsets of the training data and then combine their predictions.
In machine learning, particularly in ensemble techniques like bagging, "aggregating" refers to the process of combining the predictions from multiple models to form a single prediction. The idea is to take the individual outputs of several models—each trained on a different subset of the training data—and combine them in some way to achieve a more accurate or robust overall prediction than any single model could provide.
A few common methods of aggregation used in bagging are:
- Majority Voting (for classification): Each model makes a prediction (a class label), and the class label that gets the most votes is chosen as the final prediction. This method is simple and often very effective for categorical outcomes.
- Average (for regression): Each model predicts a numerical value, and the final prediction is the average of these values. This method is straightforward and helps to reduce variance in the predictions.
- Weighted Average: Similar to averaging, but each model's prediction is weighted by its accuracy or some other measure of its performance. This can help to emphasize the influence of more reliable models in the ensemble.
- Sum: In some cases, the predictions of all models are summed to get the final prediction. This is less common but can be appropriate depending on the specific application and how the individual model outputs are scaled.
The process of aggregating helps to mitigate errors that might arise from any single model, especially if the models are overfitted to their respective training subsets. By combining multiple perspectives, the ensemble method often achieves better generalization to new, unseen data.
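As a concrete illustration of these aggregation rules, the short Python sketch below (assuming NumPy; the function names are illustrative, not from any particular library) applies majority voting, plain averaging, and weighted averaging to toy predictions from three models:

import numpy as np

def aggregate_majority_vote(predictions):
    # predictions: array of shape (M, n_samples) holding class labels.
    predictions = np.asarray(predictions)
    # For each sample, pick the label that appears most often across the M models.
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def aggregate_average(predictions):
    # predictions: array of shape (M, n_samples) holding regression outputs.
    return np.mean(predictions, axis=0)

def aggregate_weighted_average(predictions, weights):
    # Weight each model's prediction by, e.g., its validation accuracy.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize weights to sum to 1
    return np.average(predictions, axis=0, weights=weights)

# Toy example: 3 models, 4 classification samples and 2 regression samples.
clf_preds = [[0, 1, 1, 2], [0, 1, 2, 2], [1, 1, 2, 2]]
reg_preds = [[2.0, 3.5], [2.4, 3.1], [1.9, 3.3]]
print(aggregate_majority_vote(clf_preds))                      # [0 1 2 2]
print(aggregate_average(reg_preds))                            # [2.1 3.3]
print(aggregate_weighted_average(reg_preds, [0.5, 0.3, 0.2]))  # [2.1 3.34]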
Here's how bagging works:
- Bootstrap Sampling: Given a training dataset with n samples, bagging creates multiple random subsets (called bootstrap samples) by sampling from the training data with replacement (Figure 3740a). This means that some instances may appear multiple times in a subset, while others may not appear at all.
- Model Training: A base learning algorithm (e.g., a decision tree or a neural network) is trained independently on each bootstrap sample. This results in multiple base models.
- Prediction Aggregation: When making predictions on new data, each base model predicts the outcome independently and the outputs of all individual models are combined. For regression tasks, the final prediction is often the average of the individual model predictions, while for classification tasks, it is typically a majority vote.
Let G_m(x) be the prediction of the m-th base model on input x. For regression tasks, the final prediction \hat{y} is often the average of the individual model predictions:
\hat{y} = \frac{1}{M} \sum_{m=1}^{M} G_m(x)          [3740a]
where,
M is the total number of base models.
For classification tasks, the final prediction \hat{y} can be a majority vote:
\hat{y} = \arg\max_{y_i} \sum_{m=1}^{M} \mathbb{I}\left(G_m(x) = y_i\right)          [3740b]
where,
\mathbb{I} is the indicator function (equal to 1 if the condition inside is true, and 0 otherwise).
y_i ranges over the possible class labels.
Equations 3740a and 3740b express the idea that, in regression, we aggregate predictions by averaging them, and in classification, we aggregate predictions by selecting the class with the majority of votes.
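A minimal from-scratch sketch of this whole procedure, assuming NumPy and scikit-learn and using decision trees as the base models (the dataset and settings are illustrative), could look like the following; the loop builds the bootstrap samples and the final lines apply the majority vote of Equation 3740b (for regression, the vote would simply be replaced by the average of Equation 3740a):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

M = 25                                          # number of base models
n = len(X_train)
models = []
for m in range(M):
    idx = rng.integers(0, n, size=n)            # bootstrap sample: n indices drawn with replacement
    model = DecisionTreeClassifier(random_state=m)
    model.fit(X_train[idx], y_train[idx])       # train G_m on its own bootstrap sample
    models.append(model)

# Majority vote (Equation 3740b): for each test point, count the votes for each
# class over the M trees and pick the class with the most votes.
all_preds = np.stack([model.predict(X_test) for model in models])   # shape (M, n_test)
y_hat = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("bagged accuracy:", accuracy_score(y_test, y_hat))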
Figure 3740a illustrates sampling with replacement in bagging. The dataset and histograms show how often each element is sampled: each bar in the figure represents the frequency of an element in the sampled dataset. Since the sampling is done with replacement, some elements may appear more than once while others do not appear at all.
Figure 3740a. Sampling with replacement in bagging (Code): (a) 20 samples in different colors, (b) Sample 1, and (c) Sample 2.
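The original plotting code behind Figure 3740a is not reproduced here, but the idea it visualizes can be sketched in a few lines of Python (assuming NumPy; the element count of 20 matches panel (a)): draw a bootstrap sample with replacement and count how often each original element appears.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(20)                                        # the 20 original samples, as in panel (a)
bootstrap = rng.choice(data, size=len(data), replace=True)  # one bootstrap sample ("Sample 1")

# Frequency of each original element in the bootstrap sample,
# analogous to the bar heights in panels (b) and (c).
counts = np.bincount(bootstrap, minlength=len(data))
for element, count in zip(data, counts):
    print(f"element {element:2d} sampled {count} time(s)")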
The main advantages of bagging include:
- Reduction of Overfitting and Variance: By training on different subsets of the data, each model in the ensemble focuses on different patterns, reducing the risk of overfitting. Since each model is trained on a slightly different subset of the data, the models capture different aspects of the underlying patterns in the dataset; when combined, they can provide a more robust and accurate prediction than a single model.
- Improved Stability: Bagging can make the model more robust by reducing the impact of outliers or noise in the training data.
- Increased Accuracy: The combination of predictions from multiple models often leads to a more accurate and robust final model compared to any individual model.
A well-known algorithm that uses bagging is the Random Forest algorithm, which builds an ensemble of decision trees using bagging. Random Forests introduce additional randomness by considering a random subset of features at each split, making them even more robust.
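The scikit-learn sketch below (the dataset and hyperparameters are illustrative) compares a single decision tree, plain bagging of decision trees (BaggingClassifier uses a decision tree as its default base estimator), and a Random Forest on the same synthetic data; the two ensembles typically score noticeably higher than the single tree.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single tree":   DecisionTreeClassifier(random_state=0),
    "bagged trees":  BaggingClassifier(n_estimators=100, random_state=0),       # default base estimator: decision tree
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),  # bagging + random feature subset per split
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:>13}: {scores.mean():.3f} +/- {scores.std():.3f}")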
While bagging (Bootstrap Aggregating) offers several advantages, it also has some potential disadvantages or limitations:
- Increased Complexity and Computational Cost:
- Bagging involves training multiple models, which can significantly increase the computational cost, especially if the base learner is computationally expensive.
- The need to train and maintain multiple models can make bagging less practical for large datasets and real-time applications.
- Loss of Interpretability:
- The ensemble of models created through bagging may be more challenging to interpret compared to a single, simpler model. Understanding the contributions of individual models to the ensemble prediction can be complex.
- No Improvement for Underlying Model Quality:
- Bagging cannot overcome the limitations of the underlying base learning algorithm. If the base model is inherently weak or poorly suited to the task, bagging may not yield significant improvements.
- Possible Overfitting with Noisy Data:
- While bagging can reduce overfitting in general, it may not perform as well if the dataset is very noisy or if there are outliers. The inclusion of noisy samples in the bootstrap samples can propagate errors.
- Limited Improvement for Stable Models:
- Bagging tends to benefit most when the base model is sensitive to the particular subset of data it is trained on (i.e., a high-variance learner). If the base model is already stable and performs well on the entire dataset, bagging might not provide substantial improvements.
- Dependency on Diversity:
- The effectiveness of bagging relies on the diversity of the base models. If the base models are too similar or highly correlated, the ensemble may not provide significant performance gains.
- Potential for Model Redundancy:
- In some cases, bagging may end up training similar models multiple times, leading to redundancy in the ensemble. This redundancy may not contribute much to the overall predictive power.
- Limited Improvement for Linear Models:
- Bagging is often more beneficial for models with high variance, such as decision trees. For linear models with lower variance, the gains from bagging may be less pronounced.
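To make the last point concrete, the sketch below (assuming scikit-learn; the synthetic data and settings are illustrative) bags a low-variance linear model and a high-variance decision tree on the same regression problem; bagging usually changes the linear model's cross-validated score very little while improving the tree's score substantially.

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

models = {
    "linear":        LinearRegression(),
    "bagged linear": BaggingRegressor(LinearRegression(), n_estimators=50, random_state=0),
    "tree":          DecisionTreeRegressor(random_state=0),
    "bagged trees":  BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>13}: R^2 = {r2:.3f}")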
In bagging, the goal is to create multiple bootstrap samples (Z) from the original training set (S) through the process of resampling with replacement. The assumption is that each bootstrap sample is obtained by drawing instances from the original training set, and since the sampling is done with replacement, some instances may appear more than once in a given bootstrap sample, while others may not appear at all.
Assuming that the true population (P) is equal to the training set (S), meaning that the training set is a representative sample of the entire population, and we create a bootstrap sample (Z) from S, then:
- Similarity to the Original Training Set:
- The bootstrap sample Z is likely to resemble the original training set S, given that they are drawn from the same population (P = S).
- Variability in Bootstrap Samples:
- Each bootstrap sample Z will be slightly different from the original training set due to the sampling with replacement. This variability is essential for bagging's purpose of creating diverse subsets for training different models.
- Bootstrap Sample Size:
- If we create a bootstrap sample Z from S, the size of Z is conventionally the same as the size of S: we draw instances with replacement until we have a sample of the same size.
- Repetition of Instances:
- Some instances from the original training set may be repeated in the bootstrap sample, while others may be omitted. The exact composition of Z will vary across different bootstrap samples.
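A quick way to see this repetition and omission at work (a sketch assuming NumPy; the dataset size is arbitrary) is to draw one bootstrap sample and measure how much of S it actually contains; with sampling with replacement, roughly 1 − 1/e ≈ 63.2% of the original instances appear at least once, and the rest are left out of that particular sample.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                  # size of the original training set S
bootstrap = rng.integers(0, n, size=n)      # indices making up one bootstrap sample Z
unique_fraction = np.unique(bootstrap).size / n
print(f"fraction of S present in Z: {unique_fraction:.3f}")   # approximately 0.632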
For the variance of the sample mean \bar{X} of n identically distributed variables that are pairwise correlated, we have (see i.i.d.),
\operatorname{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2          [3740c]
Equation 3740c gives the variance of the average \bar{X} of n estimates, each with variance σ2 and pairwise correlation ρ. In the context of bagging, this equation can be related to the bias and variance components in the following way:
- Let \hat{f}_m(x) represent the prediction of the m-th base model in bagging.
- The aggregated prediction \hat{f}_{bagged}(x) in bagging is often defined as the average of the individual predictions:
\hat{f}_{\text{bagged}}(x) = \frac{1}{M} \sum_{m=1}^{M} \hat{f}_m(x)          [3740d]
where,
M is the number of base models.
- Each base model receives the same weight, 1/M, in this average, and ρ in Equation 3740c plays the role of the correlation between the predictions of different base models.
The relation between Equation 3740c and bias and variance is:
- Variance (for a single prediction point):
- In bagging, the variance term in the equation can be associated with the variability between the predictions of different base models.
- ρσ2 represents the part of the variance that cannot be averaged away, because the predictions of the base models are correlated with one another (correlation ρ).
- (1 − ρ)σ2/n represents the uncorrelated part of the variance, which shrinks as n, the number of predictions being averaged (M base models in bagging), grows.
- Bias (for a single prediction point):
- Bias is often implicitly reduced because the base models are trained on different subsets of the data, capturing different aspects of the underlying relationship. The averaging process tends to smooth out individual model biases.
- When n is increased, the second term in Equation 3740c (the averaged, uncorrelated variance) becomes smaller, while the first term ρσ2 remains; the simulation sketch at the end of this section illustrates this.
- Bootstrapping itself does not directly cause overfitting. Bootstrapping is a resampling technique where random samples (with replacement) are drawn from the dataset to create multiple datasets of the same size. Each dataset is used to train a model, and the final model is an aggregation (such as averaging) of the models trained on each bootstrap sample. This process is commonly used in ensemble methods like bagging.
The idea behind bootstrapping is to improve the stability and reduce the variance of a model by creating diverse training datasets. It helps in making the model less sensitive to the specific quirks or outliers in the original dataset.
- Bootstrapping could lead to a slight increase in bias because of random subsampling:
- Underrepresentation of Rare Events:
- In the process of bootstrapping, some observations from the original dataset may be sampled multiple times, while others may not be sampled at all. This can lead to an underrepresentation of rare events or outliers in the bootstrapped samples.
- If these rare events have a significant impact on the estimation, the omission or underrepresentation of such events in some bootstrap samples can introduce bias.
- Distortion of the True Distribution:
- Bootstrapping assumes that the original dataset is a good representation of the underlying population or distribution. However, if the original dataset is small or not fully representative, bootstrapping may inadvertently perpetuate biases present in the original dataset.
- This can be particularly relevant if the original dataset is skewed or does not capture the full range of variability in the population.
- Impact on Estimators with High Sensitivity:
- Some estimators may be more sensitive to the specific composition of the training data. If the estimator is highly sensitive to the presence or absence of certain observations, variations in the bootstrapped samples can lead to fluctuations in the estimate, potentially introducing bias.
- Aggregation Method in Ensemble Models:
- In ensemble models like bagging, where multiple models are trained on different bootstrapped samples and then aggregated, the aggregation method can influence bias. If the aggregation method is sensitive to certain types of errors, it may introduce bias in the final ensemble prediction.
- Sample Size and Bootstrap Iterations:
- The impact of bootstrapping on bias can also be influenced by the size of the original dataset and the number of bootstrap iterations. Small datasets and a high number of bootstrap iterations may lead to a higher chance of introducing bias.
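The simulation sketch below (assuming NumPy; all numbers are illustrative) checks Equation 3740c empirically: it generates n identically distributed predictions with variance σ2 and pairwise correlation ρ, averages them many times, and compares the observed variance of the average with ρσ2 + (1 − ρ)σ2/n.

import numpy as np

rng = np.random.default_rng(1)
n, rho, sigma, trials = 25, 0.3, 2.0, 200_000

# Correlated predictions via a shared component plus independent noise:
# each column then has variance sigma^2 and pairwise correlation rho.
shared = rng.normal(0.0, sigma * np.sqrt(rho), size=(trials, 1))
indep = rng.normal(0.0, sigma * np.sqrt(1.0 - rho), size=(trials, n))
preds = shared + indep

empirical = preds.mean(axis=1).var()
theoretical = rho * sigma**2 + (1.0 - rho) * sigma**2 / n
print(f"empirical variance of the average: {empirical:.3f}")
print(f"Equation 3740c prediction:         {theoretical:.3f}")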
============================================