Decision Trees
Introduction
In computer programming, you are probably familiar with conditional statements such as the if-else expression. An if-else statement divides the flow of a computer program into two parts, with only one part being executed depending on whether the specified condition holds. Graphically, an if-else statement can be represented as a decision node branching into two branches. A decision tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning.
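To make the analogy concrete, here is a minimal sketch of what a small, already-grown decision tree amounts to once it is written out as code. The features match the toy dataset used later in this post, but the particular splits and thresholds are made up purely for illustration.

```python
def predict_shape(color, outline, has_circle):
    """A tiny hand-written decision tree expressed as nested if-else statements.
    The splits shown here are hypothetical, chosen only to illustrate the structure."""
    # Root node: split on whether the object contains a circle.
    if has_circle:
        # Internal node: split on the outline attribute.
        if outline == "rough":
            return "triangle"   # leaf node
        return "square"         # leaf node
    # Internal node: split on color.
    if color == "green":
        return "triangle"       # leaf node
    return "square"             # leaf node

print(predict_shape(color="red", outline="rough", has_circle=True))  # -> triangle
```

Each internal if corresponds to a decision node, each condition is a decision rule, and each return is a leaf.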
How do we grow them?
A tree is grown by picking the attribute that best splits the data (using an attribute selection measure), partitioning the dataset by the values of that attribute, and then repeating the same procedure on each partition. The recursion stops when one of the following conditions is met (a sketch of this procedure follows the list):
- All the tuples in a partition share the same value of the target attribute (that is, they belong to the same class).
- There are no more remaining attributes.
- There are no more instances.
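Below is a minimal sketch of this recursive procedure in Python. It assumes a helper best_attribute() implementing one of the attribute selection measures discussed in the next section, and it represents the tree as nested dictionaries; both choices are illustrative, not any particular library's API.

```python
def grow_tree(rows, target, attributes, best_attribute):
    """Recursively grow a decision tree.

    rows           -- list of dicts mapping attribute/target names to values
    target         -- name of the target attribute (the class label)
    attributes     -- attributes still available for splitting
    best_attribute -- callable choosing the attribute to split on (hypothetical helper)
    """
    classes = [row[target] for row in rows]

    # Stopping condition 1: all tuples belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]

    # Stopping conditions 2 and 3: no attributes or no instances remain.
    if not attributes or not rows:
        return max(set(classes), key=classes.count) if classes else None

    # Otherwise: pick the best attribute, split, and recurse on each partition.
    attr = best_attribute(rows, target, attributes)
    tree = {attr: {}}
    for value in set(row[attr] for row in rows):
        subset = [row for row in rows if row[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree[attr][value] = grow_tree(subset, target, remaining, best_attribute)
    return tree
```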
There are several attribute selection measures for choosing which attribute to split on when constructing a decision tree. Some of the most popular are:
- Information Gain: This measures the reduction in entropy that results from a feature split. It is used to determine which feature to split on at each node of the decision tree. A higher information gain indicates a more significant reduction in entropy.
- Gini Index: This is a measure of impurity in a dataset, and it is used to determine which feature to split on at each node of the decision tree. A lower Gini index indicates a purer subset of the data.
- Chi-Squared: This is a statistical test used to measure the association between two categorical variables. It is used to determine which feature to split on at each node of the decision tree. A higher chi-squared value indicates a stronger association between the two variables.
- Gain Ratio: This is an improvement over information gain; it corrects for information gain's bias towards features with many distinct values.
- Reduction in Variance: This is used for a continuous target variable; it measures the reduction in variance after a split.
These are some of the most common splitting criteria used in decision tree algorithms, but other criteria may be available depending on the specific algorithm or library being used. Let us discuss the information gain criterion in detail, which is based on the concept of Shannon entropy.
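As a concrete example of how a library exposes this choice, scikit-learn's DecisionTreeClassifier takes a criterion argument. The snippet below uses a toy generated dataset (purely so the example runs end to end) and fits the same model with the Gini index and with entropy-based information gain.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, generated only so the snippet is self-contained.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The splitting criterion is a single constructor argument.
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print("Gini tree depth:   ", gini_tree.get_depth())
print("Entropy tree depth:", entropy_tree.get_depth())
```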
What is entropy?
Entropy is a measure of disorder or randomness in a system. It is often used in thermodynamics, information theory, and statistics to describe the distribution of energy or information in a system. The concept of entropy in statistics can be related to the concept of entropy in thermodynamics in terms of microstates and macrostates. A microstate in statistics refers to a specific combination of values for the random variable, while a macrostate refers to the overall distribution of values for the variable. The number of microstates that can give rise to a particular macrostate is related to the entropy of the system. The more microstates that are possible for a given macrostate, the higher the entropy of the system.
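In information theory this idea is made quantitative by Shannon entropy: if a fraction p_i of the items in a node belongs to class i (for k classes), the entropy of the node is

```latex
H = -\sum_{i=1}^{k} p_i \log_2 p_i
```

It is zero for a pure node (all items in one class) and largest when the classes are evenly mixed (1 bit for two equally frequent classes).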
The entropy of each node can be estimated by counting the proportion of each type of item in the node. Look at the following figure: the dataset contains 14 observations of objects with two different shapes, triangle and square. The other three attributes of each object are its color, its outline (which can be rough or smooth), and whether or not it has a circle in it. The classification problem we frame here is whether it is possible to determine the shape of an object from its other three attributes. We can use these attributes to split the dataset and build a decision tree. In the example below we will use entropy for splitting the nodes. Before we do that, let us first find the entropy of the full dataset. Look at the figure below:
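Since the figure itself is not reproduced in text here, the class counts below are hypothetical placeholders (a 9/5 split of the 14 objects is only an assumption for illustration); the point is how the calculation works, not the particular numbers.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Hypothetical split of the 14 objects into squares and triangles,
# used only to illustrate the calculation (the real counts are in the figure).
print(entropy([9, 5]))   # ~0.940 bits
print(entropy([7, 7]))   # 1.0 bit  -- an evenly mixed node
print(entropy([14, 0]))  # 0.0 bits -- a pure node
```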
How do we avoid overfitting?
A fully grown tree can keep splitting until it memorises the noise in the training data. Common ways to keep this under control include the following (a code sketch follows this list):
- Pruning: This involves removing branches from the tree that do not add much value to the accuracy of the model.
- Setting constraints: By limiting the depth of the tree, or number of leaf nodes, or number of samples in the leaf nodes, you can prevent the tree from becoming too complex and fitting the noise in the data.
- Using ensemble approach: Ensemble methods like random forests and gradient boosting can help to reduce overfitting by combining the predictions of multiple decision trees, rather than using a single tree.
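As an illustration of how these levers look in practice with scikit-learn (one library among several; the particular parameter values below are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Setting constraints: cap the depth and require a minimum number of samples per leaf.
constrained = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)

# Pruning: scikit-learn implements cost-complexity pruning via ccp_alpha.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

# Ensemble approach: a random forest averages the predictions of many trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("constrained", constrained), ("pruned", pruned), ("forest", forest)]:
    model.fit(X, y)
    print(name, "training accuracy:", model.score(X, y))
```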
Advantages
- Easy to Understand: Decision tree output is very simple to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret it.
- It is a white-box type of algorithm: it exposes its internal decision-making logic, which is not available in black-box algorithms such as neural networks (the snippet below shows one way to inspect that logic).
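For instance, scikit-learn can print a fitted tree's learned rules as plain text. The snippet again uses a toy generated dataset so it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned if-else rules, making the model a "white box".
print(export_text(clf, feature_names=["f0", "f1", "f2"]))
```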
Disadvantages
- Overfitting: Handling noise in the data is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on the model parameters and by pruning (discussed in the section above).
- They can be extremely sensitive to small perturbations in the data: a slight change in the data can lead to a completely different decision tree. This is called the instability of decision trees (a small demonstration follows this list).
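To see this instability concretely, here is a hedged sketch: fit two trees on training sets that differ by only a handful of rows and compare their printed rules. The dataset is synthetic, so the exact trees you obtain may differ from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=120, n_features=4, random_state=1)

# Two training sets that differ only in a few rows.
tree_a = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[:100], y[:100])
tree_b = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[5:105], y[5:105])

# Small perturbations of the data can change which features the trees split on.
print(export_text(tree_a))
print(export_text(tree_b))
```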