About Decision Tree Algorithms...

What is a Decision Tree?


  • A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mainly preferred for solving classification problems.


  • A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.


  • Its structure is a tree-like flowchart, where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents an outcome or class label.


  • The whole idea is to build such a tree for the entire data and arrive at a single outcome at every leaf (or minimize the error at every leaf).



Why Decision Trees?


There are plenty of other algorithms out there, so why choose decision trees?

Well, there might be many reasons, but a few that I believe stand out are:


  • Decision trees often mimic human-level thinking, so it is easy to understand the data and draw good interpretations from it.


  • Decision trees let you see the logic used to interpret the data (unlike black-box algorithms such as SVM).


[Image: Bank loan application decision tree]


For example, if we are classifying a bank loan application for a customer, the decision tree may look like the one shown above.

Here we can see the logic behind how it makes the decision.

It's simple and clear.
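To make this concrete, here is a hand-written sketch of such a loan-decision tree expressed as plain if/else rules. The feature names and thresholds are hypothetical, chosen only to mirror the idea of the flowchart above, not taken from any real policy or dataset.

```python
# A hypothetical bank-loan decision tree written as plain if/else rules.
# Feature names and thresholds are illustrative only.
def approve_loan(age: int, salary: float, credit_score: int) -> str:
    if age < 21:
        return "Reject"           # applicant too young to qualify
    if salary < 30_000:
        return "Reject"           # income below the minimum threshold
    if credit_score >= 650:
        return "Approve"          # good credit history
    return "Manual review"        # borderline case goes to a loan officer

print(approve_loan(age=35, salary=55_000, credit_score=700))  # -> Approve
```

Each if corresponds to an internal node and each return to a leaf, which is exactly the transparency described above.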


Types of Decision Tree:

[Image: Types of decision tree]


Regression tree:


  • A decision tree with a continuous target variable (e.g., the price of a house).


  • To build a regression tree, we use recursive binary splitting, i.e., the data is repeatedly split into two parts, with every value in the data considered as a candidate split point.


  • The split point that gives the minimum residual sum of squares (or mean squared error) is chosen as the split for that node (see the sketch below).


Each node stops splitting when a stopping criterion is reached, for example when a further split would leave fewer than n observations in a node.
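As a rough sketch of this split search (one feature only, NumPy; the toy house-price numbers are made up), the function below scans every candidate threshold and keeps the one with the lowest residual sum of squares. Real implementations do this for every feature at every node.

```python
import numpy as np

def best_split_rss(x: np.ndarray, y: np.ndarray):
    """Return the threshold on x that minimizes the total RSS of the two halves."""
    best_thr, best_rss = None, np.inf
    for thr in np.unique(x):
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue  # skip splits that leave one side empty
        # RSS of each half: squared deviation from that half's mean prediction
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_thr, best_rss = thr, rss
    return best_thr, best_rss

# Toy data: house size (sq. ft.) vs. price (in thousands)
x = np.array([600, 800, 1000, 1200, 1500, 1800])
y = np.array([150, 180, 210, 260, 320, 390])
print(best_split_rss(x, y))
```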


Classification tree:


  • A decision tree with a categorical target variable (e.g., in the Titanic data, whether a passenger survived or not).


  • It is very similar to a regression tree; however, the residual sum of squares is not used to split the nodes. Instead, impurity measures such as entropy or the Gini index (discussed below) are used.


Decision Tree Terminologies:


  • Root Node: The root node is from where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.


  • Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split any further once a leaf node is reached.


  • Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.


  • Branch/Sub-Tree: A subtree formed by splitting a node of the main tree.


  • Pruning: Pruning is the process of removing the unwanted branches from the tree.


  • Parent/Child Node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its child nodes.


How Does a Decision Tree Work?



  • It works for both categorical and continuous input and output variables.


  • We split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator in input variables.


  • The decision of how the splits are made heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.


  • Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the nodes increases with respect to the target variable.


  • However, the algorithm is greedy: at each node, the decision tree evaluates splits on all available variables and selects the split that results in the most homogeneous sub-nodes, without looking ahead to how later splits will turn out.


Decision Tree Working Procedure:


  • In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree.
  • This algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.


For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree. The complete process can be better understood using the below steps:


Step-1: Begin the tree with the root node, say S, which contains the complete dataset.


Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).


Step-3: Divide S into subsets that contain the possible values of the best attribute.


Step-4: Generate the decision tree node, which contains the best attribute.


Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes.
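A minimal end-to-end sketch of this procedure using scikit-learn's DecisionTreeClassifier is shown below; the Iris dataset and the hyperparameter values are only convenient stand-ins, not part of the discussion above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion can be "gini" or "entropy" -- this is the Attribute Selection Measure
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)                 # Steps 1-5: build the tree recursively

print("Test accuracy:", tree.score(X_test, y_test))
# export_text prints the learned root-to-leaf decision rules
print(export_text(tree, feature_names=load_iris().feature_names))
```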



Attribute Selection Measure (ASM)/Attribute to split the data:

[Image: Decision tree splitting criteria]


Entropy:


  • Entropy is a concept used in decision tree algorithms, particularly in the context of information theory, to measure the impurity or disorder of a set of data. It is used as a criterion to determine the best attribute to split the data at each node of a decision tree.


  • In the context of decision trees, entropy refers to the amount of uncertainty or randomness in the target variable's distribution within a given set of instances. The target variable is the variable we want to predict or classify in a decision tree. For example, in a binary classification problem where the target variable can take on two values (e.g., "yes" or "no"), entropy measures the randomness of the distribution of "yes" and "no" instances.


  • The formula for calculating entropy is:

Entropy(S) = −P(+) · log₂ P(+) − P(−) · log₂ P(−)

where P(+) and P(−) are the proportions of positive-class and negative-class instances in S.


  • For a binary problem, entropy ranges from 0 to 1, where 0 represents a perfectly pure or homogeneous set (all instances belong to the same class) and 1 represents maximum impurity or randomness (an equal split between the two classes). With k classes, the maximum entropy is log₂ k.


  • In the context of decision tree algorithms, entropy is used to evaluate the quality of a split at a given node. When considering a potential split on a specific attribute, the algorithm calculates the entropy of the resulting subsets after the split. The goal is to find the attribute that maximally reduces the entropy and leads to more homogeneous subsets.


NOTE: 


  • Entropy will be 0 for a pure split.


  • Entropy will be high when impurity is high.
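As a small sketch of how this entropy could be computed from class labels (NumPy; the example labels are made up):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label array, in bits (log base 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class proportions
    return float(-(p * np.log2(p)).sum())

print(entropy(np.array(["yes"] * 5 + ["no"] * 5)))  # 1.0 -> maximum impurity (50/50 split)
print(entropy(np.array(["yes"] * 10)))              # 0.0 -> pure split
```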


A limitation of entropy: it is relatively expensive to compute because of the log function, which is one reason the Gini index (Gini impurity) is often used instead.


Gini Index/ Gini Impurity:


  • The Gini index, or Gini impurity, is another measure used in decision tree algorithms to assess the impurity or disorder of a set of instances. It is commonly used as a criterion to determine the best attribute to split the data at each node in a decision tree.


  • The Gini index quantifies the probability of misclassifying a randomly chosen instance if it were randomly labelled according to the distribution of classes in the set. In other words, it measures the degree of impurity in a set of instances by calculating the probability of two randomly selected instances having different class labels.


  • The formula for calculating the Gini index is as follows:


Gini(S) = 1 − Σᵢ pᵢ²

Where:

pᵢ is the proportion of instances in S that belong to class i.


  • The Gini index ranges from 0 (a perfectly pure or homogeneous set, where all instances belong to the same class) up to a maximum of 1 − 1/k for k classes (0.5 for a binary problem), which is reached when instances are distributed equally across the classes.


  • In the context of decision tree algorithms, the Gini index is used to evaluate the quality of a split at a given node. When considering a potential split on a specific attribute, the algorithm calculates the Gini index of the resulting subsets after the split. The attribute with the lowest Gini index (i.e., the highest reduction in impurity) is chosen as the splitting attribute.


  • It's important to note that both entropy and the Gini index are commonly used as splitting criteria in decision tree algorithms, and the choice between them often depends on the specific problem and the preferences of the practitioner. In practice, both measures tend to yield similar results, but the Gini index is slightly faster to compute since it does not involve logarithmic calculations.
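For comparison with the entropy helper sketched earlier, a Gini impurity version looks like this (same made-up labels):

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: chance that two randomly drawn instances have different classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class proportions
    return float(1.0 - (p ** 2).sum())        # 1 - sum of squared proportions

print(gini(np.array(["yes"] * 5 + ["no"] * 5)))  # 0.5 -> maximum impurity for two classes
print(gini(np.array(["yes"] * 10)))              # 0.0 -> pure split
```

Note that no logarithm is involved, which is why it is slightly cheaper to compute.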


Information Gain:


  • Information gain is a concept used in decision tree algorithms to quantify the amount of information provided by a particular attribute in reducing the uncertainty or disorder within a set of instances. It is used as a criterion to determine the best attribute to split the data at each node in a decision tree.
  • Information gain measures the reduction in entropy (or alternatively, Gini impurity) achieved by partitioning the data based on a specific attribute. It evaluates how much the attribute contributes to improving the homogeneity of the resulting subsets.


  • The formula for calculating information gain is as follows:

Information Gain(S, A) = Entropy(S) − Σv [ (|Sv| / |S|) · Entropy(Sv) ], where the sum runs over each value v of attribute A.

Where:

Information Gain(S, A) is the information gain of attribute A on set S.

Entropy(S) is the entropy of the set S before the split.

|Sv| is the number of instances in subset Sv after the split.

|S| is the total number of instances in set S.

Entropy(Sv) is the entropy of the subset Sv after the split.


  • The information gain is calculated by subtracting the weighted average of the entropies of the resulting subsets from the entropy of the original set. It represents the expected reduction in entropy achieved by considering the attribute for splitting.


  • The attribute with the highest information gain is selected as the splitting attribute at each node in the decision tree. This means that it provides the most significant reduction in uncertainty and leads to more homogeneous subsets.


  • Information gain is particularly useful when dealing with categorical or discrete attributes, as it directly relates to the concept of entropy and the reduction in randomness. However, it can also be used with continuous attributes by discretizing them into intervals or applying techniques such as binning.


  • By using information gain, decision tree algorithms can effectively identify the attributes that provide the most relevant and discriminative information for classification or regression tasks. It helps construct a decision tree that optimally splits the data and makes accurate predictions.
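Putting the pieces together, here is a sketch of information gain for one categorical attribute (the entropy helper is repeated so the snippet is self-contained; the toy Outlook/Play data is invented for illustration):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: np.ndarray, target: np.ndarray) -> float:
    """Entropy(S) minus the weighted entropy of the subsets induced by `feature`."""
    total = entropy(target)
    weighted = 0.0
    for value in np.unique(feature):
        subset = target[feature == value]
        weighted += (len(subset) / len(target)) * entropy(subset)
    return total - weighted

# Toy example: how much does "Outlook" tell us about "Play"?
outlook = np.array(["sunny", "sunny", "overcast", "rain", "rain", "overcast"])
play    = np.array(["no",    "no",    "yes",      "yes",  "no",   "yes"])
print(information_gain(outlook, play))   # ~0.667 bits
```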


Note: Feature scaling and outlier handling are not required for the decision tree algorithm.



Advantages of Decision Trees:


a. Interpretability: Decision trees provide a transparent and intuitive representation of the decision-making process, enabling easy interpretation and understanding.


b. Handling Nonlinear Relationships: Decision trees can capture complex nonlinear relationships between features, allowing them to handle datasets with intricate structures.


c. Handling Mixed Data Types: Decision trees can handle both categorical and numerical features, making them versatile for various types of datasets.


d. Feature Importance: Decision trees can assess the importance of features based on their contribution to the decision-making process, allowing for feature selection and interpretation.
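As an illustration of point (d), scikit-learn exposes impurity-based feature importances directly on a fitted tree; a small sketch (the Iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# feature_importances_ reflects how much each feature reduced impurity across the tree
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```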


Limitations of Decision Trees:


a. Overfitting: Decision trees have a tendency to overfit the training data, capturing noise and irrelevant patterns. Techniques like pruning and setting stopping criteria can mitigate this issue (see the sketch after this list).


b. Lack of Robustness: Small changes in the training data can lead to significantly different tree structures, making decision trees less robust compared to other algorithms.


c. Difficulty with Continuous Data: Decision trees perform better with categorical or binary features compared to continuous variables. Preprocessing techniques, such as discretization or binning, can be used to handle continuous data.
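For point (a), a common way to rein in overfitting is to limit tree growth or apply cost-complexity pruning; here is a hedged sketch with scikit-learn (the dataset and hyperparameter values are arbitrary examples, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A fully grown tree versus one constrained by depth, leaf size, and pruning
unpruned = DecisionTreeClassifier(random_state=0)
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                ccp_alpha=0.01, random_state=0)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```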


Practical Applications:


a. Medical Diagnosis: Decision trees can help doctors make accurate diagnoses based on patient symptoms, medical history, and test results.


b. Credit Scoring: Banks and financial institutions use decision trees to assess creditworthiness by analyzing factors such as income, credit history, and employment status.


c. Customer Segmentation: Decision trees can segment customers based on their demographic, behavioural, or purchase data, aiding targeted marketing campaigns.


d. Fraud Detection: Decision trees are valuable in identifying patterns of fraudulent activities by analyzing transactional data and user behaviour.



If you learned something from this blog, make sure you give it a 👏🏼

Will meet you in some other blog, till then Peace ✌🏼.



 

Thank You!
