## Random Forests

Random forests **reduce variance**, and thus **decrease overfitting**, when learning from **noisy data**.

Examine the top-right circle of **Figure 1**; this is where we currently stand with our decision tree: an overfitting model with high variance. Think of every dot as a different decision tree, each trained on a different set of training data. Each one misses the mark, but if we could ask every one of those models for a classification and let them vote on the result to return, we would end up with a classification that is accurate far more often.

Another way to think about this is ‘asking the audience’ in a game of Who Wants To Be A Millionaire. Each individual person has some knowledge, and they’ll be wrong about some things. However, by asking the whole audience, we tend to get a much more accurate answer, because the individuals who are wrong about a topic tend to be outweighed by the majority who are correct.

So consider in Figure 1 that rather than fitting a single degree-9 polynomial to all the data points, we fit many such polynomials to random subsets of the points, and then average all of those polynomials. The averaged curve gives us a much better general fit than the single degree-9 regression we started with, because any particular outlier only affects the small proportion of fits that saw it. This is the essence of random forests.
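As a hypothetical sketch of this idea (the dataset, subset size, and number of fits here are made up for illustration), we can compare a single degree-9 polynomial against the average of many degree-9 polynomials, each fit to a random subset drawn with replacement:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 5, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)  # noisy sine data

# A single degree-9 fit over all the points (high variance)
single_fit = np.polyval(np.polyfit(x, y, 9), x)

# Many degree-9 fits, each over a random subset drawn with replacement
fits = []
for _ in range(50):
    idx = rng.choice(len(x), size=60, replace=True)
    fits.append(np.polyval(np.polyfit(x[idx], y[idx], 9), x))
averaged_fit = np.mean(fits, axis=0)

# Compare each curve's mean squared error against the true sine curve;
# the averaged fit is typically closer
print(np.mean((single_fit - np.sin(x)) ** 2))
print(np.mean((averaged_fit - np.sin(x)) ** 2))
```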

### The Algorithm (1)

From a set of labelled data, we will split the data into two non-overlapping sets, one for training and one for testing; we’ll call the training set T. For the B trees that we want to build, we’ll take B random subsets of T (sampled with replacement, so the subsets may overlap), which we’ll call X_{1}, X_{2}… X_{B}. For each subset X_{b}, we will build a separate tree, ƒ_{b}, using the ID3 algorithm (partitioning data by highest information gain). We can then provide a classification, G, for an unlabelled piece of data x’ with the following formula (which calculates the average):
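The formula does not survive in the text here, but from the description above (an average of the B individual tree predictions, with the rounding step explained next), it would read:

```latex
G(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')
```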

It’s worth noting that this formula works because all of our targets will be mapped to integers, so once we get a numeric value for G, we can round it and return the corresponding target value. For instance, if we are trying to classify types of fruit and ‘orange’ maps to 1, then if G returns 1.37, we round this and return ‘orange’.
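A minimal sketch of that averaging-and-rounding step (the label encoding and vote values here are hypothetical):

```python
# Hypothetical label encoding, as in the fruit example
labels = {0: 'apple', 1: 'orange', 2: 'banana'}

def classify(tree_predictions):
    # Average the integer predictions from all B trees, then round
    # back to the nearest encoded label
    g = sum(tree_predictions) / len(tree_predictions)
    return labels[round(g)]

# Most trees vote 1 ('orange'), a few vote 2; the average rounds to 1
print(classify([1, 1, 2, 1, 2, 1, 1]))
```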

### Implementing Decision Tree Tools in Scikit and Random Forest Extensions

#### Introduction

All implementations will be developed in Python using scikit-learn, a library that contains many different machine learning tools, as well as extensions for those tools. Our implementations will then be tested and contrasted with a standard decision tree by comparing the accuracy of classifications on a test dataset (which, importantly, will contain no overlap with the training dataset).

It’s worth noting at this point that we will be dealing with two different types of decision trees: classification trees and regression trees (together, CART). The difference between the two is that classification trees return discrete targets, like a type of fruit or a shoe size, whereas regression trees deal with continuous target data. Both types of tree are built in very similar ways, however (2).

#### A Simple Regression Example (Code on Github) (3)

A regression model of 2D data points is a great place to start because we can visually confirm what is occurring. For our simple regression, we’ll generate a set of data points that resemble a sine wave (but, importantly, with some random error added in). From this set of points, the model will have to fit a regression line to predict points that we give it. We’ll split the initial data in half, using one half for the training data and the other half for the test data. Clearly, in reality, this model is not one with much practical significance.

We’ll start by generating a dataset of points, splitting them in half (one half for training and the other for testing), and then adding noise to the training half:

```python
# Create a random dataset resembling a noisy sine wave
import numpy as np

rng = np.random.RandomState(3)
X = np.sort(5 * rng.rand(300, 1), axis=0)
y = np.sin(X).ravel()

# Split alternating points into training and testing halves
trainingX = X[::2]
testingX = X[1::2]
trainingY = y[::2]
testingY = y[1::2]

# Add random noise to every fifth training point
trainingY[::5] += 3 * (0.5 - rng.rand(30))
```

By using numpy’s RandomState with a fixed seed, we can guarantee that the same random noise is added to the data points every time, so that when we run the model first with one decision tree and then with a random forest, both learn from the same data.
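This reproducibility is easy to check: two RandomState objects seeded identically produce identical streams of numbers.

```python
import numpy as np

# Same seed, same stream: both arrays are identical
a = np.random.RandomState(3).rand(5)
b = np.random.RandomState(3).rand(5)
print(np.array_equal(a, b))
```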

We can now use sklearn.tree to build a DecisionTreeRegressor (4) and fit it to our training data (code on Github). At this point, we might as well take advantage of the fact that we are dealing with two-dimensional data points by plotting them, along with the prediction the model gives us for every point, on a graph. This was done using pyplot from the matplotlib library.
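The fitting step can be sketched as follows, reusing the variable names from the snippet above (the matplotlib plotting call is omitted here, but predictions can be inspected directly):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Regenerate the noisy sine dataset from the earlier snippet
rng = np.random.RandomState(3)
X = np.sort(5 * rng.rand(300, 1), axis=0)
y = np.sin(X).ravel()
trainingX, testingX = X[::2], X[1::2]
trainingY, testingY = y[::2], y[1::2]
trainingY[::5] += 3 * (0.5 - rng.rand(30))

# Fit a single decision tree, restricting depth to 4 (discussed below)
dt = DecisionTreeRegressor(max_depth=4)
dt.fit(trainingX, trainingY)

print(dt.predict(testingX[:5]))
```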

It’s important to note here that we are restricting the depth of the trees we build to 4. Otherwise, we would be building a tree that fits entirely to the training data. This is a whole area of discussion in itself (pruning) and is covered in detail elsewhere on this site.

We will now use sklearn.ensemble’s RandomForestRegressor (5) to build an ensemble of 20 decision trees from random subsets of the data, as explained above. We’ll use matplotlib to produce a similar graph:
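Sketched in the same way as the single tree (again regenerating the dataset so the block is self-contained; the random_state value here is an assumption, chosen only for reproducibility):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Regenerate the noisy sine dataset from the earlier snippet
rng = np.random.RandomState(3)
X = np.sort(5 * rng.rand(300, 1), axis=0)
y = np.sin(X).ravel()
trainingX, testingX = X[::2], X[1::2]
trainingY, testingY = y[::2], y[1::2]
trainingY[::5] += 3 * (0.5 - rng.rand(30))

# Fit a forest of 20 depth-limited trees
rf = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=3)
rf.fit(trainingX, trainingY)

print(rf.predict(testingX[:5]))
```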

We can tell that the random forest regression fits the data better in two ways. First, it curves to the data better in areas like x ≈ 2.5. Second, it is less affected by individual outliers (consider the point at x ≈ 4.6).

We’ll now test both of these models against our test data with the following code:

```python
# Test the model against the held-out test data by computing
# the mean absolute error of its predictions
testResults = rf.predict(testingX)
acc = 0
for i in range(len(testResults)):
    acc += abs(testResults[i] - testingY[i])
averageError = acc / len(testResults)
print(averageError)
```

When we run this test against the results from the single decision tree, we get an average error of 0.14. When we run the same test against our random forest, the error is 0.10; thus, our random forest model better predicts our data in this example.

### Interactive Demo (6)

The interactive feature below demonstrates the use of random forests to predict the classification of two-dimensional data points by learning from a dataset. Clicking (or shift-clicking) allows you to plot data points of either classification on the graph, and pressing ‘r’ retrains the model. The red and green areas show which classification the model would predict for a data point in that region. There are a few things we can learn from this.

First, try plotting some points (perhaps in an interesting classification pattern, rather than just randomly) and setting the number of trees to one (so we are no longer using a random forest); now press ‘r’ a few times to retrain the model. Notice how different the model looks every time: this is an ideal illustration of variance. Now increase the number of trees to perhaps 50 and try again, and notice how much the variance has decreased. Each model now looks quite similar to the last.

Note also that if you plot a few strong outliers into the dataset it is learning from, and then decrease the number of trees back down again, the model becomes much more swayed by those outliers, clearly demonstrating a model that has overfit the data.