## Building Decision Trees with ID3

**Iterative Dichotomiser 3** (ID3) is an algorithm invented by Ross Quinlan that is used to **generate decision trees** from a dataset.

### Entropy

To understand how to build a decision tree, we first need a way of quantifying an amount of information. This measure is known as **entropy**.

In a decision tree, each question can have many possible outcomes but, for simplicity, we will use a binary decision tree in this example. Imagine a machine that outputs the letters A, B, C, or D in the proportions shown in **Figures 1 and 2**: A half of the time, B a quarter of the time, and C and D an eighth of the time each. We wish to establish what the next letter is by repeatedly evaluating conditions. Although the first tree has a lesser total depth, every leaf is at depth 2, so the average number of questions asked (conditions evaluated) is 2. By contrast, in the second tree the average number of conditions evaluated is 1.75 (= 0.5\*1 + 0.25\*2 + 0.125\*3 + 0.125\*3). Thus, tree 2 will, on average, perform better at identifying the next letter.
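These averages can be reproduced with a short calculation. A minimal sketch, with the letter proportions and leaf depths taken from the description above:

```python
# Proportions of A, B, C, D emitted by the machine (from Figures 1 and 2).
probs = [0.5, 0.25, 0.125, 0.125]

# Depth of each letter's leaf, i.e. how many conditions must be evaluated
# before that letter is identified.
depths_tree1 = [2, 2, 2, 2]  # tree 1: every leaf at depth 2
depths_tree2 = [1, 2, 3, 3]  # tree 2: the likeliest letter is checked first

avg1 = sum(p * d for p, d in zip(probs, depths_tree1))
avg2 = sum(p * d for p, d in zip(probs, depths_tree2))

print(avg1)  # 2.0
print(avg2)  # 1.75
```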

Using this method, we can calculate the entropy (a measure of uncertainty) of a system as the sum, over all outcomes, of the probability of each outcome multiplied by the number of questions required to reach it.

\[H = \sum\limits_{i=1}^{n} p_{i} * (number\space of \space questions \space for \space outcome \space i) \]

In a well-constructed tree, this is equal to (1):

\[H = \sum\limits_{i=1}^{n} p_{i} * log_{2}(\frac{1}{p_{i}}) \]

### Building A Tree

In order to build a tree, we need a particular attribute from the data set to be our target attribute (the player position in the basketball example). The algorithm for creating a tree is recursive: each time you split on an attribute, you create a new subtree beneath it for each of the possible outcomes.

### Pseudocode (2)

```
ID3(Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    // base cases
    If all examples are positive, Return the leaf with label = +.
    If all examples are negative, Return the leaf with label = -.
    If the set of predicting attributes is empty, then Return the leaf
        with label = most common value of the target attribute in the examples.
    // recursive case
    Otherwise Begin
        A = the attribute that best classifies the examples.
        Decision tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A.
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples.
                Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A}).
    End
    Return Root
```

Note that it is not always possible to provide a definitive decision based on the training data. In the third base case, the training data still contains differing decisions but there are no attributes left to separate them by. As such, the algorithm selects the most common decision.
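The pseudocode above can be sketched in Python. This is a minimal illustration, not a production implementation: examples are plain dictionaries, the target and attributes are dictionary keys, and the toy data set at the bottom is hypothetical. Unlike the pseudocode, it only branches on attribute values actually seen in the data, so the empty-subset case cannot arise.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy (in bits) of the target attribute over a list of example dicts."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, target, attribute):
    """Entropy removed by splitting the examples on the given attribute."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Return a nested-dict decision tree; leaves are target values."""
    labels = [e[target] for e in examples]
    # Base case: every example has the same label.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: no attributes left -- fall back to the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Recursive case: split on the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

# Hypothetical toy data set: target attribute "play" and two predictors.
data = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "yes"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]
tree = id3(data, "play", ["outlook", "windy"])
print(tree)  # splits on "windy", which perfectly separates the labels
```

Here `id3` picks `windy` as the root because splitting on it yields two pure subsets (gain of one full bit), whereas splitting on `outlook` removes no uncertainty at all.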

### Finding the attribute to split on using entropy

The attribute to split on is the one that provides the largest information gain. Information gain, IG, is the difference in the entropy of the data before and after splitting on a particular attribute. Informally, it is the amount of uncertainty removed by splitting the set on that attribute.
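As a small worked example (with hypothetical labels), splitting four mixed examples into two pure subsets removes the full bit of uncertainty:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# Labels before splitting: two positive, two negative examples.
before = ["+", "+", "-", "-"]

# Suppose some attribute splits them into two pure subsets.
subsets = [["+", "+"], ["-", "-"]]

# Remaining entropy is the subset entropies weighted by subset size.
remainder = sum(len(s) / len(before) * entropy(s) for s in subsets)

gain = entropy(before) - remainder
print(gain)  # 1.0: the split removed all of the uncertainty
```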

### Improvements to the algorithm

There are a few other algorithms for building trees that provide marginally better performance (such as C4.5 or C5.0), but these are merely refinements of ID3 and use the same underlying principles (3).

### Problems with ID3

As with all algorithms, ID3 is not infallible and, with a poor training data set, you will suffer from GIGO (Garbage In, Garbage Out). With a perfect data set, a decision tree built using ID3 would give perfect results. However, data sets in the real world are never perfect: they contain outliers, missing values, or even too many variables. Also, if the data set is too small, the tree will overfit.