From Trees to Forests: Exploring the Power of Random Forest in Machine Learning
Hello, machine learning enthusiasts! Today, I'm excited to share the second installment in my series on random forests. Before delving into this, I highly recommend reading my first article, which focuses on decision trees, particularly classification trees. This foundational knowledge is crucial before attempting to build your first random forest (RF) algorithm. For those well-versed in decision trees, let's explore RF today.
What is Random Forest?
RF is a robust machine learning algorithm designed to mitigate issues commonly associated with decision trees, such as overfitting and the imbalance between bias and variance. One of the reasons RF is among my favorite algorithms is its minimal preprocessing requirements. The scale of features is largely irrelevant because the algorithm evaluates each feature independently when seeking optimal split points. For instance, if one feature measures height in centimeters and another weight in pounds, RF can still efficiently determine the best split for each. This differs from algorithms like K-Nearest Neighbors (KNN), where preprocessing or standardizing feature scales is essential to avoid skewed groupings driven by disproportionate feature scales, which can lead to poor model performance or interpretational errors. Although RF generally requires more computational time than KNN, it excels with features of varying scales and high-dimensional datasets, avoiding the pitfalls of dimensionality that KNN faces. Please refer to my post on the curse of dimensionality for more details.
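To make this concrete, here is a minimal sketch (not from the article; the dataset and settings are illustrative) that rescales one feature and compares how RF and KNN react:

```python
# A minimal sketch (dataset and settings are illustrative) of why RF tolerates
# unscaled features while KNN does not.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# A small synthetic classification problem.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=42)

# Put one feature on a wildly different scale (think centimeters vs. pounds).
X_skewed = X.copy()
X_skewed[:, 0] *= 1000

for name, model in [("RF", RandomForestClassifier(random_state=42)),
                    ("KNN", KNeighborsClassifier())]:
    before = cross_val_score(model, X, y, cv=5).mean()
    after = cross_val_score(model, X_skewed, y, cv=5).mean()
    print(f"{name}: accuracy {before:.3f} -> {after:.3f} after rescaling one feature")

# RF's splits are per-feature thresholds, so rescaling a column leaves its trees
# effectively unchanged; KNN's distances become dominated by the rescaled column,
# which typically hurts accuracy unless features are standardized first.
```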
Key Components of RF
Before elaborating on the concept of RF, let's discuss the key components that make it work:
Bagging (Bootstrap Aggregation)
Fundamentally, RF combines decision trees with a bootstrapping method. Simply put, each tree in the forest is built from a new sample generated by bootstrapping the entire original dataset. This new sample maintains roughly the same probability distribution as the original dataset. For example, if you are analyzing a Netflix dataset with high variance in viewer preferences, your subsamples will mirror this distribution.
A common misconception about RF is that each tree utilizes a smaller dataset than the original. In reality, bootstrapping employs a sampling-with-replacement technique, so each new dataset has the same sample size as the original dataset. A single sample or case can appear multiple times in a sub-dataset, since each resampling draw can select the same case repeatedly.
Each subsample in the bootstrapping process is used to construct a tree. When these trees are combined, or ensembled, they produce a more accurate prediction than a single tree: the average prediction for regression tasks, or the most frequently predicted class for classification tasks. In other words, the ensemble process reduces errors better than a single decision tree does.
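Here is a minimal sketch of the resampling step itself, just to make the behavior concrete (the ten row indices are made up purely for illustration):

```python
# A minimal sketch of one bootstrap draw (sampling with replacement).
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # pretend these are the row indices of the full dataset

# Same size as the original dataset, drawn with replacement.
bootstrap = rng.choice(original, size=len(original), replace=True)

print("bootstrap sample:", bootstrap)
print("duplicate draws:", len(bootstrap) - len(set(bootstrap)))
print("rows never drawn (out-of-bag):", sorted(set(original) - set(bootstrap)))
# Each tree in the forest is trained on one such bootstrap sample.
```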
It's important to recognize that errors in ML algorithms can stem from three sources: variance (indicative of overfitting), bias (indicative of underfitting), and noise (the unpredictable variance of the target), as shown in the equation below.
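One standard way to write this decomposition for squared error, with hD(x) denoting the model trained on dataset D, h̄(x) the average of such models over training sets, and ȳ(x) the expected label at x, is:

```latex
\mathbb{E}_{x,y,D}\!\left[\big(h_D(x) - y\big)^2\right]
  = \underbrace{\mathbb{E}_{x,D}\!\left[\big(h_D(x) - \bar{h}(x)\big)^2\right]}_{\text{variance}}
  + \underbrace{\mathbb{E}_{x}\!\left[\big(\bar{h}(x) - \bar{y}(x)\big)^2\right]}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{x,y}\!\left[\big(y - \bar{y}(x)\big)^2\right]}_{\text{noise}}
```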
The goal of bagging is to reduce the variance term, making hD(X) (predicted values) as close as possible to h(X) (observed values). Aiming to reduce variance without increasing bias, bagging samples m datasets with replacement from the initial data pool D, generating datasets D1, D2, ..., Dm. A classifier or regressor hi() is then trained on each Di. The final classifier or regressor across all trees is calculated as:
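For regression, this final predictor is simply the average of the m individual predictors; for classification, a majority vote over h1, ..., hm plays the same role:

```latex
\hat{h}(x) = \frac{1}{m}\sum_{i=1}^{m} h_i(x)
```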
Note that a larger m generally results in a better ensemble and higher prediction accuracy, although the gains taper off; past a certain point, adding more bootstrapped datasets mostly just slows down the computation.
Feature Selection
In addition to the bootstrapping process, which randomly selects samples (i.e., rows in the dataset) to reduce variance, RF also randomly selects features at each node of each tree in the forest to further reduce variance without increasing bias. This raises a key question: how many features should be used for each split in an RF?
The number of features to consider at each split in an RF is a tunable parameter known as max_features. A common rule of thumb is to use the square root of the total number of features for classification tasks and about one-third of the total features for regression tasks. For example, if your dataset comprises 25 features, you might use 5 features (the square root of 25) for a classification task or about 8 features (one-third of 25) for a regression task.
During the construction of each tree, at each node, a subset of features is randomly selected according to the max_features parameter. The best split among these features is then determined by how well it separates the classes (classification) or reduces variance (regression). Once a tree begins to form, it branches out greedily based on the best split at each node and does not backtrack to alter previous decisions.
Each tree in the forest might end up using different features at its root, contributing to the diversity of the models within the forest. It's important to note that the random selection of features in RF distinguishes it from the approach in a single decision tree, where potentially all features could be used at any node if the tree depth is not limited.
So far, I have introduced parameters that you can fine-tune in RF, including max_features and the number of trees (n_estimators). Modern computational capabilities allow the trees to be built in parallel, which significantly reduces computation time without affecting model performance. RF is also particularly useful for feature selection, as the algorithm can identify which predictors are most impactful by evaluating the reduction in impurity each one achieves across the forest.
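Here is a minimal scikit-learn sketch of these ideas (the dataset and parameter values are illustrative, not recommendations): n_estimators and max_features as the two knobs discussed so far, n_jobs=-1 for parallel tree building, and feature_importances_ for impurity-based feature selection.

```python
# A minimal sketch combining the two knobs discussed so far with parallel tree
# building and impurity-based feature importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split (classification rule of thumb)
    n_jobs=-1,            # build trees in parallel across all available cores
    random_state=42,
)
rf.fit(X, y)

# Impurity-based importances: average impurity reduction each feature achieves across the forest.
top5 = sorted(zip(data.feature_names, rf.feature_importances_),
              key=lambda pair: pair[1], reverse=True)[:5]
for name, importance in top5:
    print(f"{name}: {importance:.3f}")
```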
Out-of-Bag Error Estimation
When building each tree in RF, the algorithm randomly selects a subset of the data (with replacement) for training, which means some data points may be selected multiple times, while others may not be selected at all. The data points not used to train a particular tree are known as the "out-of-bag" (OOB) data for that tree.
To further explain, with the bootstrapping method, each sub-dataset automatically excludes some data points. For example, the first few data points in bootstrap 1 could be participant IDs 1, 2, 3, and 4, while bootstrap 2 could draw IDs 1, 2, 2, and 4, since we sample with replacement. In this case, participant 3 was automatically excluded from bootstrap 2, while participant 2 was selected twice. On average, each bootstrap sample contains roughly 63% of the unique data points from the original dataset, with the remaining draws (roughly 37% of its rows) being repeats; the data points that are never drawn form the out-of-bag set for that tree. Note that these repeated and left-out cases differ from tree to tree.
The main advantage of OOB error estimation is that it provides a way to estimate the model's performance without needing a separate validation or test set. In other words, you may not have to run cross-validation (CV), which can take a long time with a big forest, to evaluate the model's performance.
The OOB error is very useful as it is automatically available as a by-product of the training process of the random forest, requiring no additional computational cost. This makes the OOB error a convenient and efficient tool for model evaluation and tuning, especially when dealing with large datasets where cross-validation can be computationally expensive.
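In scikit-learn this is a single flag; a minimal sketch (dataset and settings are illustrative):

```python
# A minimal sketch of OOB evaluation in scikit-learn: each row is scored only by
# the trees that never saw it during training, so no separate validation split is needed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, n_jobs=-1, random_state=42)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
```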
Hyperparameters
I previously mentioned the number of trees (n_estimators) and the maximum number of features (max_features) as parameters that can be fine-tuned. Other parameters, such as those controlling tree depth and minimum leaf size, can be tuned as well. For all parameters, the GridSearchCV function, which uses cross-validation, can help identify the best values, weighed against other factors such as computational complexity, time, and project objectives. For high-dimensional data, a full grid search may not be time- or compute-efficient; other strategies, such as random search or Bayesian optimization, are recommended.
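As a rough sketch, here is what such a search could look like in scikit-learn; the parameter grid below is illustrative only, not an exhaustive or recommended search space, and the commented-out RandomizedSearchCV line shows the cheaper random-search alternative.

```python
# A rough sketch of hyperparameter tuning with cross-validation; the grid is illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [200, 500],
    "max_features": ["sqrt", 0.33],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)

# For big grids or high-dimensional data, a random search over the same space is cheaper:
# search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid,
#                             n_iter=10, cv=5, n_jobs=-1, random_state=42)

search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)
```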
Want to learn more about applying RF to a real-world use case? Read my full article with a real-world example here.