From Trees to Forests: Exploring the Power of Random Forest in Machine Learning

Hello, machine learning enthusiasts! Today, I'm excited to share the second installment in my series on random forests. Before delving into this, I highly recommend reading my first article, which focuses on decision trees, particularly classification trees. This foundational knowledge is crucial before attempting to build your first random forest (RF) algorithm. For those well-versed in decision trees, let's explore RF today.

What is Random Forest?

RF is a robust machine learning algorithm designed to mitigate issues commonly associated with decision trees, such as overfitting and a poor balance between bias and variance. One of the reasons RF is among my favorite algorithms is its minimal preprocessing requirements. The scale of features is largely irrelevant because the algorithm evaluates each feature independently when searching for the optimal split point. For instance, if one feature measures height in centimeters and another weight in pounds, RF can still efficiently determine the best split for each. This differs from algorithms like K-Nearest Neighbors (KNN), where standardizing feature scales is essential: otherwise, features on larger scales dominate the distance calculation, which can lead to poor model performance or misleading interpretations. Although RF generally requires more training time than KNN, it handles features of varying scales and high-dimensional datasets well, avoiding the curse of dimensionality that KNN faces. Please refer to my post on the curse of dimensionality for more details.
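To see this in practice, here is a minimal sketch (my own toy example using scikit-learn's built-in breast cancer dataset, which is not a dataset from this article): the forest scores about the same with or without feature scaling, while KNN's performance shifts noticeably once the features are standardized.

```python
# Compare RF and KNN with and without feature scaling.
# The breast cancer dataset has features on very different scales.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    "RF, raw features": RandomForestClassifier(n_estimators=200, random_state=42),
    "RF, scaled features": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=200, random_state=42)),
    "KNN, raw features": KNeighborsClassifier(),
    "KNN, scaled features": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```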

Types of RF

Before elaborating on the concept of RF, let's discuss the types of RF available:

  1. Random Forest Classifier: In classification tasks, each decision tree aims to classify samples into categories and reduce node impurity using measures like the Gini index or entropy. These metrics help the algorithm determine how to best split the data at each node to increase the homogeneity of the resulting child nodes relative to the target class. By minimizing impurity or maximizing information gain at each split, the cost of misclassification within each tree is effectively reduced.
  2. Random Forest Regressor: In regression tasks, the objective is to predict a continuous value and minimize the variance within each node after the split. Here, impurity is represented by the variance in the node's values. A preferred split is one that results in child nodes with lower variance than the parent node, thus grouping together similar scores.
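To make the two types concrete, here is a minimal scikit-learn sketch on toy data of my own choosing (the criterion names follow recent scikit-learn versions): the classification forest splits to reduce Gini impurity or entropy, while the regression forest splits to reduce the squared error (variance) within child nodes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy datasets for illustration only.
X_clf, y_clf = make_classification(n_samples=300, n_features=8, random_state=0)
X_reg, y_reg = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Classification forest: splits chosen to minimize Gini impurity (or "entropy").
clf = RandomForestClassifier(criterion="gini", random_state=0).fit(X_clf, y_clf)

# Regression forest: splits chosen to minimize squared error within child nodes.
reg = RandomForestRegressor(criterion="squared_error", random_state=0).fit(X_reg, y_reg)

print("Classification accuracy (training set):", clf.score(X_clf, y_clf))
print("Regression R^2 (training set):", reg.score(X_reg, y_reg))
```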

Bagging (Bootstrap Aggregation)

Fundamentally, RF combines decision trees with a bootstrapping method. Simply put, each tree in the forest is built from a bootstrap sample drawn from the entire original dataset. This new sample maintains approximately the same probability distribution as the original dataset. For example, if you are analyzing a Netflix dataset with high variance in viewer preferences, your subsamples will mirror this distribution.

A common misconception about RF is that each tree utilizes a smaller dataset than the original. In reality, bootstrapping employs a sampling-with-replacement technique, so each new dataset has the same sample size as the original dataset. It is possible for a single sample or case to appear multiple times in a sub-dataset, since each draw can select the same case repeatedly.
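A tiny numpy sketch (my own illustration) makes this concrete: the bootstrap sample has the same size as the original, some IDs appear more than once, and others are left out.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
original_ids = np.arange(10)  # pretend these are 10 participant IDs

# Sample WITH replacement, keeping the original sample size.
bootstrap_ids = rng.choice(original_ids, size=len(original_ids), replace=True)

print("Bootstrap sample:       ", np.sort(bootstrap_ids))
print("Left out of this sample:", np.setdiff1d(original_ids, bootstrap_ids))
```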

Each subsample in the bootstrapping process is used to construct a tree. When these trees are combined, or ensembled, they produce a more accurate prediction than a single tree: the average of the trees' outputs for regression tasks, or the most frequently predicted class for classification tasks. In other words, the ensemble process reduces error better than a single decision tree.
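Here is a hand-rolled sketch of that idea for a classification task (my own illustration of the mechanism; in practice you would simply use RandomForestClassifier): train one decision tree per bootstrap sample, then take a majority vote across the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
votes = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement.
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# Majority vote across trees (labels are 0/1, so average and round).
ensemble_pred = (np.mean(votes, axis=0) >= 0.5).astype(int)
single_tree_pred = votes[0]

print("Single tree accuracy:", np.mean(single_tree_pred == y_test))
print("Ensemble accuracy:   ", np.mean(ensemble_pred == y_test))
```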

It's important to recognize that errors in ML algorithms can stem from three sources: variance (indicative of overfitting), bias (indicative of underfitting), and noise (the irreducible, unpredictable variance of the target), as shown in the equation below:

Expected test error = Variance + Bias² + Noise, or, written out for a squared-error loss:

E[(h_D(x) − y)²] = E[(h_D(x) − h̄(x))²] + E[(h̄(x) − ȳ(x))²] + E[(ȳ(x) − y)²]
                        (variance)            (bias²)              (noise)

where h_D is the model trained on a particular dataset D, h̄ is the expected model averaged over many training sets, and ȳ(x) is the expected label at x.

The goal of bagging is to reduce the variance term, so that h_D(x), the prediction of a model trained on one particular dataset, stays as close as possible to the expected prediction h̄(x), and therefore close to the observed values when bias is low. With an aim to reduce variance without increasing bias, bagging samples m datasets with replacement from the initial data pool D, generating datasets D1, D2, ..., Dm. A classifier or regressor h_i() is then trained on each D_i. The final classifier or regressor across all trees is calculated as:

ĥ(x) = (1/m) Σ h_i(x), summing over i = 1, ..., m

For regression, this average is the ensemble prediction; for classification, the analogous aggregation is a majority vote over the individual h_i(x).

Note that a larger m generally yields a more stable ensemble and better prediction accuracy. The gains diminish beyond a certain point, however, and a very large m mainly adds computation time.
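A quick way to see this trade-off (my own toy example): cross-validated accuracy usually climbs quickly with m and then flattens out, while training time keeps increasing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

for m in (1, 10, 50, 200):
    rf = RandomForestClassifier(n_estimators=m, random_state=1)
    print(f"m = {m:>3} trees: CV accuracy = {cross_val_score(rf, X, y, cv=5).mean():.3f}")
```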

Feature Selection

In addition to the bootstrapping process, which involves randomly selecting samples (i.e., rows in the dataset) to reduce variance, RF also randomly selects features at each node of each tree in the forest to further reduce variance without increasing bias. This raises a key question: how many features should be considered at each split in an RF?

The number of features to consider at each split in an RF is a tunable parameter known as max_features. A common rule of thumb is to use the square root of the total number of features for classification tasks, and about one-third of the total features for regression tasks. For example, if your dataset comprises 25 features, you might set approximately 5 features (the square root of 25) for a classification task or about 8 features (one-third of 25) for a regression task.
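In scikit-learn, this rule of thumb maps directly onto the max_features argument; here is a minimal sketch (note that recent scikit-learn versions already default the classifier to "sqrt"):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider ~sqrt(25) = 5 features at each split.
clf = RandomForestClassifier(max_features="sqrt", random_state=42)

# Regression: consider roughly one third of the features at each split
# (a float is interpreted as a fraction of the total feature count).
reg = RandomForestRegressor(max_features=1/3, random_state=42)
```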

During the construction of each tree, at each node, a subset of features is randomly selected based on the max_features parameter. The best split on these features is then determined based on how well it can separate the classes in classification or reduce variance in regression. Once a tree begins to form, its structure proceeds to branch out based on the best splits at each node and does not backtrack to alter previous decisions.

Each tree in the forest might end up using different features at its root, contributing to the diversity of the models within the forest. It's important to note that the random selection of features in RF distinguishes it from the approach in a single decision tree, where potentially all features could be used at any node if the tree depth is not limited.

So far, I have introduced parameters that you can fine-tune in RF, including max_features and the number of trees (n_estimators). Modern computational capabilities allow each tree to be built in parallel, which significantly saves on computational time without affecting the performance of the model. This aspect of RF is particularly beneficial for feature selection, as the algorithm can effectively identify which predictors are most impactful by evaluating the reduction of impurities across the forest.
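For example, the sketch below (my own, using scikit-learn's built-in breast cancer data) fits the forest on all CPU cores with n_jobs=-1 and then ranks features by their impurity-based importances.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# n_jobs=-1 builds the trees in parallel across all available cores.
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ reflects the mean decrease in impurity across the forest.
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```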

Out-of-Bag Error Estimation

When building each tree in RF, the algorithm randomly selects a subset of the data (with replacement) for training, which means some data points may be selected multiple times, while others may not be selected at all. The data points not used to train a particular tree are known as the "out-of-bag" (OOB) data for that tree.

To further explain, with the bootstrapping method, each sub-dataset automatically excludes some data points. For example, bootstrap 1 might contain participant IDs 1, 2, 3, and 4, while bootstrap 2 contains IDs 1, 2, 2, and 4, because we sample with replacement. In this case, participant 3 is automatically excluded from bootstrap 2, while participant 2 is selected repeatedly. On average, a bootstrap sample contains about 63% of the unique data points from the original dataset (the remaining draws are repeats), which leaves roughly 37% of the original data points out of any given tree; these form that tree's OOB set. Note that the OOB set differs from tree to tree.
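If you want to check the ~63% figure yourself, a short simulation (my own illustration) reproduces it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw a bootstrap sample of the same size as the original data.
bootstrap = rng.choice(n, size=n, replace=True)

unique_fraction = len(np.unique(bootstrap)) / n
print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")   # ~0.632
print(f"Out-of-bag fraction:                 {1 - unique_fraction:.3f}")  # ~0.368
```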

The main advantage of OOB error estimation is that it provides a way to estimate the model's performance without needing a separate validation or test set. In other words, you may not need to run cross-validation (CV), which can be time-consuming with a large forest, to evaluate the model's performance.
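In scikit-learn, you only need to switch on oob_score=True; here is a minimal sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=7)
rf.fit(X, y)

# Accuracy estimated only from trees that did NOT see each sample during training.
print("OOB accuracy estimate:", round(rf.oob_score_, 3))
```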

The OOB error is especially useful because it comes as a by-product of training the random forest, at little additional computational cost. This makes it a convenient and efficient tool for model evaluation and tuning, especially with large datasets where cross-validation can be computationally expensive.

Hyperparameters

I previously mentioned the number of trees (n_estimators) and the maximum number of features (max_features) as parameters that can be fine-tuned. Below is a list of other parameters. For all of them, the GridSearchCV function, which uses cross-validation, can help identify the best values, weighed against factors such as computational complexity, time, and project objectives. For high-dimensional data, a full grid search may not be time- or compute-efficient; other strategies, such as random search or Bayesian optimization, are recommended. (A minimal tuning sketch follows the list below.)

  • Maximum Depth of Trees (max_depth): There is no exact rule of thumb regarding how deep a tree should be, although limiting trees to not be too deep helps prevent overfitting. However, setting the parameter too low could lead to underfitting.
  • Minimum Samples Split (min_samples_split): The minimum number of samples a node must contain before it can be split. Higher values typically prevent the model from learning overly specific patterns, which lowers the risk of overfitting. However, a value that is too high could lead to underfitting.
  • Minimum Samples Leaf (min_samples_leaf): The minimum number of samples required to be at a leaf node. Setting this too low can lead to overfitting, while setting it too high can lead to underfitting. The right value also depends on your total sample size and on whether your design is balanced or imbalanced, especially for a classification forest. With an imbalanced design (i.e., one target class greatly outnumbers the other in a binary classification forest), basing the minimum on the smaller class helps prevent underfitting that occurs when trees cannot grow because that class falls below the predefined minimum.
  • Class Weight (class_weight): Relevant to the point above, setting class weight is useful if you are building a classification forest with imbalanced classes of the target output. The parameter associates classes with weights and prioritizes the minority class during training to compensate for an imbalanced dataset. There are several other methods to deal with the class imbalance issue, which I will discuss more in my upcoming post.
  • Bootstrap (bootstrap): By default, RF uses bootstrapping. If this parameter is set to False, the whole dataset is used to build each tree. I do not recommend this, as bootstrapping generates varied samples that still share the original data's distribution, which helps prevent the overfitting commonly found in decision trees built on the entire dataset.
  • Criterion (criterion): The function used to measure the quality of a split. For classification forests, 'gini' (Gini impurity) and 'entropy' (information gain) are commonly used. For regression forests, the squared error ('squared_error', called 'mse' in older scikit-learn versions) or absolute error ('absolute_error', formerly 'mae') measures the distance between the actual and predicted values at a node, and the split that minimizes this error is chosen.
  • Max Leaf Nodes (max_leaf_nodes): The maximum number of leaf nodes a tree can have. Note that if this parameter is set, trees might not reach the specified max_depth, because the algorithm grows each tree subject to its max_leaf_nodes limit first.
  • Random State (random_state): Finally, DO NOT forget to set your random state (or set.seed() if you are an R user). This is important for model reproducibility!
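As mentioned above, here is a minimal tuning sketch with GridSearchCV on toy data (the grid itself is only illustrative, not a recommendation; for larger search spaces, RandomizedSearchCV or a Bayesian optimization library is usually more time-efficient):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A mildly imbalanced toy target, to show class_weight in the grid.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)

param_grid = {
    "n_estimators": [200, 500],
    "max_features": ["sqrt", 0.5],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```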

Want to learn more about applying RF to a real-world use case? Read my full article with a real-world example here.

