The world of Ensemble models using Decision Trees as weak learners
The Decision Tree has been amongst the most commonly known models when it comes to Machine Learning or Data Science. But there are a few problems with decision trees (hope you know how a decision tree works!!) that are worth penning down -
- Which feature should be split on for branching?
- What is the optimum threshold for splitting?
- What is the ideal depth of the tree model?
and many others follow. Here is where ensemble modelling comes into the picture. Ensemble models are nothing but an aggregation of a number of Weak Learners, i.e. models performing just better than random guessing (most of the time, decision trees are used, hence the article title). In other words, we use 10s or 100s of different decision trees, collect their results and combine these different results to arrive at a final result. The various ensemble models I will be briefing (only intuitively) are as follows -
1) RandomForestClassifier -
Random Forest has been amongst the most used ensemble models and follows the concept of Bagging. Here, we consider a number of trees, say 1000s of Decision Trees, all independent of each other, each possibly using the entire or a part of the training dataset (the sampling is random) and producing different predictions. These results are then aggregated (by majority vote for classification, or by averaging for regression) and taken as the final prediction of the model. This helps ensure the model doesn't overfit.
Example - let us have 100 Decision Trees out of which 60 predict 1 and 40 predict 0 (considering binary classification). As more trees predict 1, the final result is 1.
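Here is a minimal sketch of that bagging-and-voting idea using scikit-learn's RandomForestClassifier; the synthetic dataset and parameter values are just illustrative assumptions, not from any real problem.

```python
# Bagging: 100 independent trees, each trained on a random bootstrap sample;
# the forest's final class is the one most trees vote for.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Purely synthetic data for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```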
2) Gradient Boosting Machine -
While RFC involves Bagging, GBM uses Boosting. Here also we take up 10s of Decision Trees, but they won't be independent. These trees work in sequential order: the output of one tree is used by the next tree to focus more on the errors, i.e. to fit on the residuals. The common problem faced is that it overfits very soon, so keep the number of trees comparatively lower than in RFC.
Example - Let us have 5 decision trees. The 1st one, say F1, takes the training data and produces output Y1. Now, the 2nd tree, say H1, takes X as input but Y - Y1 (the residual left by tree 1, F1) as target. The combined output of F1 & H1 is the final output. If there are more trees, the same chain continues.
Y2 = F1(X) + H1(X), where F1 is trained with target Y and H1 with target Y - Y1, and
X = input/training data
Y = target value
F1 = a weak learner
H1 = booster for F1, a new decision tree model
Y1 = output of F1(X)
Y2 = improved results
Now, for the next boosting round, we use
Y3 = Y2 + H2(X), where H2 is trained with target Y - Y2.
Here, all notations remain the same, except H2 is the new booster and Y3 is an improved version of Y2.
Now the same step can be repeated further for better results, up to the mentioned number of trees in the ensemble. The rest of the models described below also use the Boosting technique for ensembling.
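Before moving on, here is a rough sketch of the residual-fitting loop just described, assuming squared-error regression (so the residual is simply Y minus the current prediction) and made-up data and tree settings:

```python
# Each new booster H_i is fit on the residuals (Y - current prediction),
# mirroring the Y2 = F1(X) + H1(X) chain written above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, Y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

n_trees = 5
learning_rate = 0.5
prediction = np.zeros_like(Y)            # start from a zero prediction

for _ in range(n_trees):
    residual = Y - prediction            # what the ensemble still gets wrong
    booster = DecisionTreeRegressor(max_depth=3, random_state=0)
    booster.fit(X, residual)             # H_i targets Y - Y_i
    prediction += learning_rate * booster.predict(X)

print("Mean squared error after boosting:", np.mean((Y - prediction) ** 2))
```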
3) eXtreme Gradient Boosting Machine -
It is the most popular model when it comes to Kaggle competitions. It is an upgraded version of GBM, hence faster and lighter on space, as it doesn't go for all possible splits but only for some useful ones, i.e. if 1000 split points are possible, it may go for only the 100 best points (using a pre-sorted splitting algorithm), hence savings everywhere, whether space or time! It is often taken as a Regularized GBM, as a term lambda (let it be L for now) is multiplied with the function used for boosting in the above example (H1). Hence the equation uses L*H1() instead of H1().
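A minimal usage sketch with the xgboost package (assuming it is installed); the data is synthetic, the parameter values are only illustrative, and reg_lambda is xgboost's L2 regularisation term, roughly playing the role of the lambda mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# reg_lambda controls the regularisation strength discussed above.
xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                    reg_lambda=1.0)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))
```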
4) Light Gradient Boosting Machine -
LGBM has also been amongst the emerging models gaining popularity in the data science domain. Though the accuracy of both XGB & LGBM models is quite close, their implementation is slightly different. To find the best splits amongst all possible splits (the 100-out-of-1000 split points concept, to reduce the extra work), LGBM uses Gradient-based One-Side Sampling (GOSS) while XGB uses a pre-sorted algorithm for splitting.
For an explanation of GOSS and pre-sorted splitting, kindly check here
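In code, LGBM exposes an interface very similar to XGB; a minimal sketch with the lightgbm package (assuming it is installed, on the same kind of synthetic data) would look like:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The interface mirrors the scikit-learn style used above;
# split finding is handled internally by the library.
lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
lgbm.fit(X_train, y_train)
print("Test accuracy:", lgbm.score(X_test, y_test))
```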
5) CatBoost -
Not so popular, CatBoost is comparatively slower than LGBM & XGB but has an unbeatable advantage: it can take in categorical data in text form (you need to mention which columns are categorical) and train the model, hence the name Categorical Boosting. The point is, it understands categorical data, while the other models only accept it when presented as numeric. No preprocessing step for converting text to numeric using OneHotEncoder or LabelEncoder is required, because of which it often produces better results.
To know how categorical data intake is done, refer here.
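A minimal sketch with the catboost package (assuming it is installed); the tiny DataFrame is made up purely for illustration, and the point is only that the categorical column is passed as raw text via cat_features, with no encoding step:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Made-up toy data: one categorical (text) column, one numeric column.
df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Chennai"],
    "age":    [25, 32, 40, 28, 35, 50],
    "target": [1, 0, 1, 0, 0, 1],
})

# The categorical column is handed over as-is; no OneHotEncoder/LabelEncoder needed.
model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(df[["city", "age"]], df["target"], cat_features=["city"])
print(model.predict(df[["city", "age"]]))
```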
Apart from these, there are a lot of ensemble models coming up, showing better results than traditional models. Each model has its merits and demerits as well. The right model depends on the problem and the dataset available. As per the No Free Lunch Theorem, no perfect model exists, and hence
Keep Exploring, Keep Learning