An Introduction to Overfitting (1)
Clearly, when we build a decision tree from a set of training data, the purpose of this is to build a tree that can accurately classify new data that’s given to the system. A common issue with decision trees, is that they tend to ‘overfit’ to their training data. This means that whilst they will perform very well on the training data they were built from, they frequently fail to classify new data, or to ‘generalise’. Figure 1 shows how a regression model from data points could be overfit due to learning noise in the data.
Consider that in this example, the data points are representing a linear relationship (so we would want the regression to reflect just that). The polynomial of degree 0 represents a regression model that underfits the training data, that is, it fails to accurately fit the training data at all, and would most likely be rejected as a model immediately. Usually this isn’t such a big issue because we know straight away if the model has fitted poorly to our training data and we can do something about it.
The polynomial of degree 9 is a more interesting case because clearly, it is picking up on small amounts of variation in the data points which is most likely due to marginal error. This is an issue because if we were to build a learning model like this, we might choose to test whether it’s a good model by seeing how well it has fit the training data. In this case, it has 100% fit on our training data, hooray job done! However, as soon as we consider a data point at about x = 10, whilst we might want the model to return me a prediction of y = ~21, our polynomial degree 9 fit will actually give me a prediction of many orders of magnitudes bigger. So despite what our test against the training set told us, the model is no good for general data.
Testing for Overfitting (2)
So, before we turn to fixing overfitting in our decision trees, we might like to consider how we can test for overfitting in the first place. This is a challenge because in almost all cases, the model we are developing isn’t simply trying to fit to a set of 2 dimensional points, and so it’s much harder to tell when we have overfit our data. And as mentioned before, if we try to test for overfitting by checking how often it correctly classifies our training data (the data it was built from), this offers no promises on how the model generalises to new data.
To do this we will split the data that we have been given to build on into 2 non overlapping sets. The first set of data will be the actual training data, and the rest will be our test data. So we will build the model from the training data, and then we test how well the model generalises by giving it the test data and monitoring how many of the data points it correctly classifies. This will then give us a good idea on how well our model generalises. We will use this method later to test our extensions on decision trees.
Bias and Variance
It’s helpful here to introduce the concepts of bias and variance. Both concepts are measures of what would happen if you were to retrain your model many times on different sets of training data. Consider our degree 0 model in the graph above, no matter how many times we retrain this model on different sets of data, it’s going to model the training data pretty poorly, this would mean it has high bias. However, with every iteration that we retrain the model, each model will come out looking very similar. So in essence the group of models are a closely packed one, but they all fail to hit the mark. A model of high bias but low variance typically correspond to underfitting (3).
Conversely, our degree 9 model fits the training set perfectly, so it has very low bias. However if we were to run the training simulation again on a different set of data, the model would likely come out looking dramatically different, and this corresponds to a high variance. These properties are that of an overfitted model.
Figure 2 illustrates this (4).
Sources of Overfitting in Decision Trees and Impact in Big Data (5)
Decision trees fall victim to overfitting in two different ways, one of which we have already mentioned; that of learning from noise in our data and making assumptions from anomalous or unusual data points. The second is learning to identify specific inputs rather than whatever factors are actually predictive for the desired output.
As we now know, if decision trees are often the victim of overfitting, it means that they suffer from high variance. A common way to fix high variance is by training our model with a larger set of data, this is to ‘drown’ out the noise. So this might sound like we’ll have better luck when it comes to training our model with relation to big data, because there certainly won’t be a shortage of it. But there’s a major issue with big data. It’s true that we might be able to get our hands on a hell of a lot of data, but in reality, this data tends to be messy. Unlike a small dataset which we might manually clean before using it to train a model (by looking through it and removing anomalies), if we are to train a model with big data, we’ll have far too much of it to do this with. The reason for this poor quality data could be down to how it has been collected, for instance we might be collecting it from numerous social networks where there’s been no human intervention in it. Whilst we might be able to perform some methods of pre-processing to better prepare our data for the use (see dimensionality reduction), we still need a robust learning model to cope with anomalies and noise.