When learning about Decision tree, I learned about Entropy. In my opinion, entropy is important to learn for data science and in decision tree.
In simple words, entropy is the measurement of level of impurity in the data set or it is measurement of uncertainty.
It is defined as
Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
The Entropy function can be visualized as the probability of “a” and “b” becomes 50% and 50% the uncertainty is at its maximum and as one of those variables probability decreases then the uncertainty is low. For example if P(a) = 1 and P(b) = 0 then entropy function will return 0 which is minimum.
In general entropy function is defined as ∑p(xi).log(base2).p(xi) for N outcomes.
To illustrate above equation. Lets consider we have total 16 number of shapes with 9 orange color and 7 blue color. Notes in the figure that stars are in 4 blue and 6 orange color whereas Diamond are 3 orange and 3 blue color.
We want to predict that what color and shape is selected. Now calculate the entropy and separate data by colors that is orange and blue.
E(colors) = – (9/16) * log2(9/16) – (7/16) * log2(7/16) = 0.988699408
Next we split data set and calculate their child entropy by
E(shapes_right) = – (6/9) * log2(6/9) – (3/9) * log2(3/9) = 0.918295834
E(shapes_left) = – (4/7) * log2(4/7) – (3/7) * log2(3/7) = 0.985228136
After calculating the left and right of entropy, by summing we will get entropy after the root entropy.
E(after) = (9/16) * E(shapes_right) + (7/16) * E(shapes_left) = 0.947578716
In this example the information gain, is very small which is
E(colors) – E(after) = 0.041120692
Information gain is the decrease in entropy after the split of entropy.
It is the easiest to understand algorithm. Decision tree a tree structure, in which it splits the data set deep down to leave nodes from the decision nodes. Decision tree is a supervised learning, which can be used in classification(Yes/No types) and regression(numerical data types e.g 123).
As above in the example of entropy, the information gain was very small and entropy were large, since we pick the splitting randomly without know the best result therefore, in order to optimize our model, we can use ID3 (Iterative Dichotomiser 3)
ID3 works as
- Create root node for tree
- If all the example’s leave nodes are positive, return positives
- If all the example’s leave nodes are negative, return negatives
- Calculate the entropy of the current state H(S)
- Calculate each attribute entropy of x denoted by H(S, x)
- Select the attribute which as maximum information gain IG(S,x)
- Put away the attribute for maximum information gain from the set of attributes
- Repeat until all attribute are done
This is what’s known as a “greedy” algorithm because, at each step it choose the best option.
Implementing Decision tree through Sklearn Decision Tree