Decision Trees

Introduction

In computer programming, you are probably familiar with conditional statements such as the if-else expression. An if-else statement divides the flow of a program into two parts, with only one part being executed depending on whether the specified condition holds. Graphically, an if-else statement can be represented as a decision node branching into two branches. A decision tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it does so recursively, in a manner called recursive partitioning.
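For instance, a single decision node can be written directly as an if-else statement (the feature name and threshold below are made up purely for illustration):

```python
def decision_node(income):
    # One decision rule at an internal node (hypothetical threshold)
    if income > 50000:
        return "will buy"       # leaf: outcome on the "yes" branch
    else:
        return "will not buy"   # leaf: outcome on the "no" branch
```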


How do we grow them?

Decision trees can be used for classification as well as regression problems. Consider a classification problem where a decision has to be made about which class to assign to an observation based on its attributes. For example, based on a person’s gender, income and age, one may want to decide whether that person will buy a certain product or not. Data collected from known customers can help to build such a classification tree. The basic idea behind any decision tree algorithm is as follows:
Select the best attribute using an Attribute Selection Measure (ASM) to split the records. Make that attribute a decision node and break the dataset into smaller subsets. Continue building the tree by repeating this process recursively for each child until one of the following conditions is met (a rough code sketch of this procedure is given after the list):

  • All the tuples belong to the same class (target value).
  • There are no more remaining attributes.
  • There are no more instances.
The decision tree is formed with a single root, branches representing decision rules, and leaf nodes representing the outcome.
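As a rough sketch of this recursive procedure (the row format, function names and signatures below are illustrative, not taken from any particular library; `select_attribute` stands for whatever attribute selection measure is used, such as the information gain helper sketched in the next section):

```python
from collections import Counter

def build_tree(rows, target, attributes, select_attribute):
    """Recursive partitioning sketch. `rows` is a list of dicts, `target` is the
    class column, and `select_attribute` is any attribute selection measure."""
    if not rows:                                    # no more instances
        return None
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:     # pure node / no attributes left
        return Counter(labels).most_common(1)[0][0] # leaf: (majority) class
    attr = select_attribute(rows, target, attributes)  # best attribute via ASM
    node = {attr: {}}
    for value in set(r[attr] for r in rows):        # one branch per attribute value
        subset = [r for r in rows if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node[attr][value] = build_tree(subset, target, remaining, select_attribute)
    return node
```

Each recursive call consumes one attribute, so the recursion stops once a node is pure, the attributes run out, or a branch receives no instances.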

Attribute selection measures (splitting criteria):

There are several attribute selection measures for choosing which attribute to split on when constructing a decision tree. Some of the most popular are:

  1. Information Gain: This measures the reduction in entropy that results from a feature split. It is used to determine which feature to split on at each node of the decision tree. A higher information gain indicates a more significant reduction in entropy.
  2. Gini Index: This is a measure of impurity in a dataset, and it is used to determine which feature to split on at each node of the decision tree. A lower Gini index indicates a purer subset of the data.
  3. Chi-Squared: This is a statistical test of the association between two categorical variables. It can be used to decide which feature to split on at each node of the decision tree; a higher chi-squared value indicates a stronger association between the feature and the target.
  4. Gain Ratio: This is an improvement over information gain; it corrects for the bias of information gain towards features with many distinct values.
  5. Reduction in Variance: This is used for continuous target variables; it measures the reduction in variance after a split.

These are some of the most common splitting criteria used in decision tree algorithms, but there may be others depending on the specific algorithm or library being used. Let us discuss the information gain criterion in detail; it is based on the concept of Shannon entropy.
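As a minimal sketch in Python (assuming each row of the dataset is a dict of attribute values and the class labels form a simple list; the helper names are illustrative), the first two measures and the resulting information gain could be computed like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum(p_i ** 2); lower means a purer node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, target, attribute):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    parent = entropy([r[target] for r in rows])
    total = len(rows)
    weighted = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return parent - weighted
```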


What is entropy?
Entropy is a measure of disorder or randomness in a system. It is often used in thermodynamics, information theory, and statistics to describe the distribution of energy or information in a system. 
The concept of entropy in statistics can be related to the concept of entropy in thermodynamics in terms of microstates and macrostates. A microstate in statistics refers to a specific combination of values for the random variable, while a macrostate refers to the overall distribution of values for the variable. The number of microstates that can give rise to a particular macrostate is related to the entropy of the system. The more microstates that are possible for a given macrostate, the higher the entropy of the system.
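In symbols, if a node contains items from k classes with proportions p_1, …, p_k, its Shannon entropy is usually written as

H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i

so a node containing a single class has entropy 0, while a two-class node split 50/50 has entropy 1, the maximum possible disorder for two classes.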

The entropy of each node can be estimated from the proportion of each type of item in the node. Look at the following figure: the dataset contains 14 observations of objects with two different shapes, triangle and square. The other three attributes of each object are its color, its outline (which can be rough or smooth), and whether or not it has a circle in it. The classification problem we frame here is whether it is possible to determine the shape of an object from its other three attributes. We can use these attributes to split the dataset and build a decision tree. In the example below we will use entropy for splitting the nodes, so let us first find the entropy of the full dataset.
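As a quick check, reusing the entropy() helper sketched above (the exact class counts live in the figure; a 9-to-5 split between squares and triangles is assumed here because it reproduces the 0.94 value quoted next):

```python
# Assumed split of the 14 objects: 9 squares and 5 triangles, consistent with
# the entropy of ~0.94 quoted for the full dataset.
labels = ["square"] * 9 + ["triangle"] * 5
print(round(entropy(labels), 2))   # -> 0.94
```

The figure below works through the same calculation.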



Now that we know the entropy of the complete dataset is 0.94, we can use any of the attributes to split this data and create sub-nodes, estimate the conditional entropy of each split, and find out how much information is gained. Look at the figure below.


We compute the information gain for all three attributes and select the one that gives the largest gain for splitting; in this case it is the attribute ‘color’. We can now move on and split the tree further.
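In code, reusing the information_gain helper sketched earlier, the comparison could look like the following (rows is assumed to hold the 14-object dataset from the figure, with the column names used in this example):

```python
# `rows` is assumed to be the 14 objects from the figure, one dict per object,
# with keys "color", "outline", "circle" and the target column "shape".
gains = {attr: information_gain(rows, "shape", attr)
         for attr in ("color", "outline", "circle")}
best_attribute = max(gains, key=gains.get)   # per the figure, this is "color"
```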



We can iterate this process until we reach the purest nodes and make them leaf nodes.


That’s how we get the final decision tree, which is ready to make predictions. For example, to predict the shape of an object using the above tree, we first ask about its color. If the color is red, we look for a circle inside it: if there is a circle, it is a triangle, otherwise it is a square. Similarly, if the color is green, we look at the outline: if the outline is rough, it is a triangle, otherwise it is a square. For a blue object we do not look any further and predict the shape as square.
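Written back in the if-else form from the introduction, the prediction rule of this final tree reads as follows (the function name and value spellings are illustrative; the rules themselves are the ones just described):

```python
def predict_shape(color, outline, has_circle):
    """Prediction rule of the final tree described above."""
    if color == "red":
        return "triangle" if has_circle else "square"           # circle inside -> triangle
    elif color == "green":
        return "triangle" if outline == "rough" else "square"   # rough outline -> triangle
    else:  # blue
        return "square"
```

For example, predict_shape("green", "rough", False) returns "triangle".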

Decision trees are prone to overfit:
Decision trees are prone to overfitting because they can create complex trees with many branches that perfectly fit the training data, but do not generalize well to new data. This is because decision trees are greedy algorithms and will continue to split a node as long as it improves the criterion for that split, regardless of whether or not the improvement is significant. Additionally, decision trees are sensitive to small fluctuations in the data and can create multiple branches for small variations in the training data, which can lead to overfitting. To avoid overfitting, techniques such as pruning, setting a maximum depth for the tree, and using ensemble methods like random forests can be used.

Avoiding overfitting
Overfitting in decision trees can be avoided in the following ways (a short code sketch follows this list):
  1. Pruning: This involves removing branches from the tree that do not add much value to the accuracy of the model. 
  2. Setting constraints: By limiting the depth of the tree, the number of leaf nodes, or the minimum number of samples required in a leaf node, you can prevent the tree from becoming too complex and fitting the noise in the data.
  3. Using ensemble approach: Ensemble methods like random forests and gradient boosting can help to reduce overfitting by combining the predictions of multiple decision trees, rather than using a single tree.
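As a minimal sketch of the first two ideas using scikit-learn (assumed here as the library; the specific values are arbitrary and would normally be tuned, for example by cross-validation), where cost-complexity pruning is controlled by ccp_alpha:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,          # constrain the depth of the tree
    min_samples_leaf=5,   # require at least 5 samples in each leaf
    max_leaf_nodes=20,    # cap the number of leaf nodes
    ccp_alpha=0.01,       # cost-complexity pruning strength
)
# clf.fit(X_train, y_train)   # X_train, y_train assumed to be available
```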
Advantages
  • Easy to understand: Decision tree output is very simple to understand, even for people from a non-analytical background; no statistical knowledge is required to read and interpret it.
  • White box: It is a white-box type of algorithm that exposes its internal decision-making logic, unlike black-box algorithms such as neural networks.
Disadvantages
  • Overfitting: Handling noise in the data is one of the most common practical difficulties for decision tree models. The problem is addressed by setting constraints on model parameters and by pruning (discussed in the section above).
  • Instability: Decision trees can be extremely sensitive to small perturbations in the data; a slight change in the training data can lead to a completely different tree.
Conclusion
The decision tree is one of the most popular classification algorithms, and among the easiest to understand and interpret. It is a distribution-free, non-parametric method that does not depend on probability distribution assumptions. One of the main advantages of decision trees is that they can handle high-dimensional data with good accuracy.
