Building Decision Trees with ID3

Iterative Dichotomiser 3 (ID3) is an algorithm invented by Ross Quinlan for generating decision trees from a dataset.

Entropy

To understand how to build a decision tree, we need a way of quantifying the amount of information (or, equivalently, the uncertainty) in a set of outcomes. This measure is known as entropy.

In a decision tree, a question (node) can have many possible outcomes but, for simplicity, we will use a binary decision tree in this example. Imagine a machine that outputs the letters A, B, C, or D in the proportions shown in Figures 1 and 2. We wish to establish what the next letter is by repeatedly evaluating conditions. Although the first tree has a smaller total depth, every outcome sits at depth 2, so the average number of questions asked (conditions evaluated) is 2. By contrast, in the second tree the average number of conditions evaluated is 1.75 (= 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3). Thus, tree 2 will, on average, identify the next letter with fewer questions.

Figure 1: this tree's total depth is lower than that of Figure 2, but it still performs worse because of the probabilities of each outcome
Figure 2: this tree's total depth is higher than that of Figure 1, but it performs better, as the deeper levels have lower probabilities
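
To make the comparison concrete, here is a small numerical check (in Python) of the two averages quoted above. It assumes the proportions shown in the figures are A: 1/2, B: 1/4, C: 1/8 and D: 1/8 (the values implied by the 1.75 figure), and the depth dictionaries are likewise read off the assumed shapes of the two trees.

# Average number of questions = sum over outcomes of P(outcome) * depth of that outcome.
probabilities = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}   # assumed proportions
depths_tree_1 = {"A": 2, "B": 2, "C": 2, "D": 2}                # every outcome at depth 2 (Figure 1)
depths_tree_2 = {"A": 1, "B": 2, "C": 3, "D": 3}                # deeper levels have lower probabilities (Figure 2)

def average_questions(probabilities, depths):
    """Expected number of conditions evaluated to identify one letter."""
    return sum(p * depths[letter] for letter, p in probabilities.items())

print(average_questions(probabilities, depths_tree_1))   # 2.0
print(average_questions(probabilities, depths_tree_2))   # 1.75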

Using this method, we can calculate the entropy (a measure of uncertainty) of a system as the sum, over all outcomes, of the probability of each outcome multiplied by the number of questions required to reach it.

\[H = \sum\limits_{i=1}^{n} p_{i} * (\text{number of questions for outcome } i) \]

In a well-constructed tree, this is equal to (1):

\[H = \sum\limits_{i=1}^{n} p_{i} * \log_{2}\left(\frac{1}{p_{i}}\right) \]
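
As a quick sanity check, the formula can be evaluated for the same assumed letter proportions as above; the result, 1.75 bits, matches the average number of questions asked by the tree in Figure 2, which is what a well-constructed tree should achieve. (The snippet only illustrates the formula; it is not part of the ID3 algorithm itself.)

from math import log2

# H = sum of p_i * log2(1 / p_i), using the assumed proportions 1/2, 1/4, 1/8, 1/8.
probabilities = [0.5, 0.25, 0.125, 0.125]
entropy = sum(p * log2(1 / p) for p in probabilities)
print(entropy)   # 1.75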

Building A Tree

To build a tree, we need a particular attribute from the data set to be our target attribute (the player position in the basketball example). The algorithm for creating a tree is recursive: each time you split on a certain attribute, you create a new subtree beneath that node for each of the attribute's possible values.

Pseudocode (2)

ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    // base cases
    If all examples are positive, Return the leaf, with label = +.
    If all examples are negative, Return the leaf, with label = -.
    If the list of predicting attributes is empty, then Return the leaf,
        with label = most common value of the target attribute in the examples.
    // recursive case
    Otherwise Begin
        A = The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A.
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

Note that it is not always possible to provide a definitive decision based on the training data. When no predicting attributes remain, the training examples may still contain differing decisions but there are no attributes left to separate them by; in that case, the algorithm selects the most common decision.
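
The pseudocode translates fairly directly into Python. The sketch below is only an illustration under some assumptions: the examples are taken to be a list of dictionaries mapping attribute names to values, the tree is returned as nested dictionaries, and the "attribute that best classifies examples" is supplied by the caller as choose_attribute (a hypothetical helper; an entropy-based version is sketched in the next section).

from collections import Counter

def most_common_value(examples, target_attribute):
    # Most common value of the target attribute among the examples.
    return Counter(e[target_attribute] for e in examples).most_common(1)[0][0]

def id3(examples, target_attribute, attributes, choose_attribute):
    # Build a decision tree as nested dicts: {attribute: {value: subtree or label}}.
    target_values = {e[target_attribute] for e in examples}

    # Base cases: every example agrees on the target, or nothing is left to split on.
    if len(target_values) == 1:
        return target_values.pop()
    if not attributes:
        return most_common_value(examples, target_attribute)

    # Recursive case: split on the chosen attribute and build a subtree per value.
    # Only values that actually occur in the examples are iterated over, so the
    # empty-subset branch of the pseudocode never arises in this sketch.
    a = choose_attribute(examples, target_attribute, attributes)
    root = {a: {}}
    for value in {e[a] for e in examples}:
        subset = [e for e in examples if e[a] == value]
        remaining = [attr for attr in attributes if attr != a]
        root[a][value] = id3(subset, target_attribute, remaining, choose_attribute)
    return root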

Finding the attribute to split on using entropy

The attribute to split on is the one that provides the largest information gain. Information gain (IG) is the difference between the entropy of the data before splitting and the weighted average entropy of the subsets produced by splitting on a particular attribute. Informally, it is the amount of uncertainty removed by splitting the set on that attribute.
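
In terms of the entropy formula given earlier, the gain from splitting a set S on an attribute A is H(S) minus the weighted average of the entropies of the subsets produced by the split, each weighted by the fraction of examples it contains. A sketch, using the same list-of-dictionaries representation assumed above, might look like the following; choose_attribute is the helper that the id3 sketch above expects.

from collections import Counter
from math import log2

def entropy(examples, target_attribute):
    # H(S) = sum of p_i * log2(1 / p_i) over the values of the target attribute.
    counts = Counter(e[target_attribute] for e in examples)
    total = len(examples)
    return sum((c / total) * log2(total / c) for c in counts.values())

def information_gain(examples, target_attribute, attribute):
    # Entropy before splitting minus the weighted entropy after splitting on `attribute`.
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target_attribute)
    return entropy(examples, target_attribute) - remainder

def choose_attribute(examples, target_attribute, attributes):
    # Select the attribute with the largest information gain.
    return max(attributes, key=lambda a: information_gain(examples, target_attribute, a))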

Improvements to the algorithm

There are a few other tree-building algorithms that provide marginally better performance (such as C4.5 or C5.0), but these are refinements of ID3 and use the same underlying principles (3).

Problems with ID3

As with all algorithms, ID3 is not infallible: with a poor training data set you will suffer from GIGO (Garbage In, Garbage Out). With a perfect data set, a decision tree built using ID3 would give perfect results. However, real-world data sets are never perfect and will contain outliers, missing variables, or even too many variables. Also, if the data set is too small, the tree will suffer from overfitting.

References

1. Weisstein EW. Wolfram MathWorld. [Online]. [cited 2017 February 25]. Available from: http://mathworld.wolfram.com/Entropy.html.

2. Quinlan R. Induction of Decision Trees. Machine Learning. 1986 March: p. 81–106.

3. Peng W. University of New South Wales. [Online]. [cited 2017 February 14]. Available from: http://web.arch.usyd.edu.au/~wpeng/DecisionTree2.pdf.