
Bagging:

Bagging and Boosting are the two main types of ensemble learning methods. Ensemble learning combines the insights obtained from multiple models to produce more accurate and reliable decisions. (In simple words, it combines multiple ML base models to achieve better accuracy.)

Now, what is Bagging?

Bagging is an ensemble technique, and it has three steps:

  • Bootstrapping
  • Parallel training(Base Model Training)
  • Aggregation

Bootstrapping:

Bootstrapping is a resampling method in which samples are drawn from a set with replacement. It generates different subsets of the training dataset by selecting data points at random: each time a data point is drawn, it remains eligible to be drawn again. As a result, the same value/instance may appear multiple times within one sample, and may also turn up in other samples.
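To make this concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy (the toy data values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A toy "dataset" of 8 instances (values are illustrative).
data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# One bootstrap sample: same size as the original,
# drawn uniformly at random WITH replacement.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)

print(bootstrap_sample)
# Some values typically repeat while others are left out entirely.
```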

Base Model Training:

For each bootstrap sample, a base model (usually a decision tree, but other models can be used as well) is trained independently on that sample. Since each subset is slightly different due to the randomness introduced by bootstrap sampling, each base model will learn different patterns from the data. Some base models may overfit, while others may underfit, but overall they capture different aspects of the underlying data distribution.
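A sketch of this step, assuming scikit-learn decision trees on a synthetic dataset (the number of models and the dataset parameters are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)
X, y = make_classification(n_samples=200, random_state=0)

n_models = 10
base_models = []
for _ in range(n_models):
    # Draw a bootstrap sample of row indices, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Each tree trains independently on its own resampled data.
    tree = DecisionTreeClassifier()
    tree.fit(X[idx], y[idx])
    base_models.append(tree)
```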

Aggregation:

After training all the base models on their respective bootstrap samples, bagging combines their predictions to make a final prediction. For regression problems, the predictions from each base model are typically averaged. For classification problems, the predictions are typically combined by majority vote (hard voting) or by averaging the predicted class probabilities (soft voting); both approaches work for binary and multi-class problems alike.
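A sketch of the voting step, using hard-coded stand-in predictions (in practice each row would come from calling a trained base model's predict on unseen data):

```python
import numpy as np

# Stand-in predictions from 5 base models on 6 samples,
# e.g. preds[i] = base_models[i].predict(X_new).
preds = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 0, 0, 2, 1],
    [1, 1, 1, 0, 2, 2],
    [0, 0, 1, 0, 1, 2],
    [0, 1, 1, 1, 2, 2],
])

# Classification: majority vote down each column (one column per sample).
final_cls = np.array([np.bincount(col).argmax() for col in preds.T])
print(final_cls)  # -> [0 1 1 0 2 2]

# Regression would average instead: preds.mean(axis=0)
```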

The key idea behind bagging is that combining many high-variance base models reduces the variance of the ensemble without substantially increasing its bias: the individual models' errors partially cancel out when their predictions are averaged or voted on.
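One way to see this effect, assuming scikit-learn's BaggingClassifier and an illustrative synthetic dataset, is to compare cross-validated accuracy for a single decision tree against a bagged ensemble of trees; on most runs the ensemble scores higher and more consistently across folds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, random_state=0
)

# Five-fold cross-validated accuracy for each model.
print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```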

Out-of-Bag (OOB) Evaluation:

One of the advantages of bagging is that it allows for an automatic estimate of generalization performance without the need for a separate validation set. Because each bootstrap sample is drawn with replacement, on average about one-third of the training instances (roughly a fraction 1/e ≈ 0.368) are left out of any given sample; these are its out-of-bag instances. Each training point can therefore be predicted by the base models that never saw it, and aggregating those predictions yields the out-of-bag estimate of the ensemble's performance.
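scikit-learn exposes this directly: setting oob_score=True on BaggingClassifier computes the OOB estimate during fitting (dataset and parameter values below are again illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# oob_score=True scores each training point using only the
# base models whose bootstrap sample left that point out.
bag = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, oob_score=True, random_state=0
)
bag.fit(X, y)

print("OOB accuracy estimate:", bag.oob_score_)
```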

The diagram below illustrates how bagging works.