## Boosting

Boosting **combines** many **weak classifiers** to produce a single, **strong classifier**.

Boosting is an effective method of producing a very accurate prediction rule by combining relatively inaccurate rules of thumb (4). Obtaining these rules of thumb, also called ‘weak learners’, should be fairly easy: a program that, given a set of examples, searches for simple prediction rules, even rough and inaccurate ones, would be sufficient. Boosting then raises two main questions: how should one extract the most useful rules of thumb from a dataset, and once many have been collected, how can they be combined into a single, highly accurate rule?

For the second question, the obvious approach of combining the rules of thumb by a vote of their predictions turns out to be a good one. For the first, a good strategy is to focus the weak learning program on the examples that the current rules misclassify most often.

As an example of boosting, we will look at Adaptive Boosting, or AdaBoost for short. It’s a machine learning meta-algorithm: an algorithm that may provide a sufficiently good solution to an optimisation problem, especially with incomplete or imperfect information. It is most commonly combined with other learning algorithms, whose outputs (the ‘weak learners’) are combined into a weighted sum that represents the final output of the boosted classifier (3).

Most learning algorithms perform better on some problem types than on others, and therefore have many parameters and configurations to tune for the best performance on a given dataset. AdaBoost with decision trees is considered to be one of the best off-the-shelf classifiers. A disadvantage, however, is that AdaBoost is very susceptible to ‘noisy’ data and outliers, and is prone to overfitting (1).

It’s important to note that a boosting algorithm uses other learning algorithms (decision trees, in our case), and can therefore only improve results by manipulating the data it feeds them, rather than by being a better algorithm itself.

### The Algorithm

As we’ve already said, the idea behind a boosting algorithm is to choose training examples for the base learner in a way that forces the learner to infer something new every time it’s called. This is done by selecting training sets on which the rules we have so far perform very poorly, even worse than their already weak performance. We expect the new classifier to be different from the previous ones: while the learning algorithm outputs only weak learners, we still expect it to output classifiers whose predictions are nontrivial.

The weak learner tries to choose some weak hypothesis h with a low weighted error e. We don’t expect this error to be very small; we just expect it to be a bit better than random guessing, whose error is ½. That is, each e equals ½ − x for some small positive constant x.

Once the base classifier h has been returned, the algorithm chooses a parameter a, as shown in **Figure 1**. The parameter a measures the importance assigned to h: note that a > 0 whenever e < ½, and that a grows as e shrinks. The distribution over the training examples is then updated as shown. The update rule tends to concentrate the weight on the ‘harder’ examples; its effect is a new distribution on which the last base classifier h is sure to do poorly. In fact, it can be shown that the error of h with respect to the new distribution is exactly ½, which is no better than random guessing. This way, AdaBoost forces the base learner to learn something new about the data (2).
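One round of this update step can be sketched in plain Python. The symbols h, e and a follow the text; the toy training set and the stump h are made up for illustration. The weight update used here, multiplying by exp(−a·y·h(x)) and renormalising, is the standard AdaBoost rule, and the final check shows the reweighting property claimed above:

```python
import math

# Toy training set: (x, label) pairs with labels in {-1, +1}. Illustrative only.
examples = [(0, 1), (1, 1), (2, -1), (3, 1), (4, -1)]

# Current distribution D over the examples (starts uniform).
D = [1 / len(examples)] * len(examples)

# An assumed weak hypothesis h: the decision stump "x < 2 -> +1".
def h(x):
    return 1 if x < 2 else -1

# Weighted error e of h under D (here h misclassifies only x = 3).
e = sum(w for w, (x, y) in zip(D, examples) if h(x) != y)

# Importance a of h: positive when e < 1/2, and larger for smaller e.
a = 0.5 * math.log((1 - e) / e)

# Update rule: shrink the weight of correctly classified examples,
# grow the weight of mistakes, then renormalise to a distribution.
D = [w * math.exp(-a * y * h(x)) for w, (x, y) in zip(D, examples)]
Z = sum(D)
D = [w / Z for w in D]

# Under the new distribution, h's weighted error is exactly 1/2.
new_e = sum(w for w, (x, y) in zip(D, examples) if h(x) != y)
print(round(new_e, 6))  # 0.5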

In the end, AdaBoost combines all the base classifiers into a single final classifier. This is accomplished by taking a weighted sum of the base classifiers, where every hypothesis h is weighted by its parameter a.
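The final combination is simply the sign of that weighted sum. A minimal sketch, with made-up stumps and weights standing in for hypotheses and parameters that would normally be learned:

```python
# Assumed base classifiers (decision stumps) and their weights a;
# both are illustrative, not learned here.
hypotheses = [
    (0.9, lambda x: 1 if x < 3 else -1),
    (0.4, lambda x: 1 if x % 2 == 0 else -1),
    (0.3, lambda x: 1 if x > 1 else -1),
]

def final_classifier(x):
    # Weighted vote of the base classifiers, thresholded at zero.
    total = sum(a * h(x) for a, h in hypotheses)
    return 1 if total > 0 else -1

print(final_classifier(0))  # votes +1, +1, -1 with weights 0.9, 0.4, 0.3 -> +1
```

Note that the heavily weighted first stump can be outvoted only when the lighter stumps agree against it, which is the sense in which the weights encode each hypothesis’s importance.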

### An example of AdaBoost (3)

Let’s imagine the following situation: the admission committee of a university has six members, A, B, C, D, E and F, who have to decide who will succeed at their university. At the start they simply vote yes or no, but after a couple of years they have enough information to weigh each member’s judging ability.

Now, let’s say that A has a 60% success rate in her predictions. We’ll weigh her vote up, because her guess is better than random. B also has a high success rate, but B’s answers appear to be correlated with A’s.

On the other hand, C has a low success rate but appears to be looking at different criteria than A and B. While C has less predictive power on her own, combining C with A yields a stronger classifier than combining A with B would.

Maybe D is better at guessing who will NOT succeed at university. That’s completely fine: as long as D’s decisions are not random, they still provide information.

And this goes on until all the classifiers have been weighted. Instead of summing A + B + C + D + E + F equally, we weight these weak classifiers and so produce a stronger overall model.
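The committee story is exactly a weighted vote. In the toy sketch below, all votes and success rates are invented, and each weight is derived from a member’s success rate p as ½·ln(p / (1 − p)), the same form as AdaBoost’s parameter a. This makes D’s weight negative, so D’s consistently wrong vote is flipped into useful information, and gives F, who guesses at chance level, a weight of zero:

```python
import math

# Each member's vote on a candidate (+1 = admit, -1 = reject)
# and an invented past success rate for each member.
votes = {"A": 1, "B": 1, "C": -1, "D": -1, "E": 1, "F": -1}
success = {"A": 0.60, "B": 0.58, "C": 0.55, "D": 0.40, "E": 0.52, "F": 0.50}

def weight(p):
    # Negative for below-chance voters (their vote gets inverted),
    # zero for a coin-flip voter, growing with reliability.
    return 0.5 * math.log(p / (1 - p))

# The committee's decision is the sign of the weighted vote.
score = sum(weight(success[m]) * votes[m] for m in votes)
decision = "admit" if score > 0 else "reject"
print(round(score, 3), decision)
```

Here D votes to reject, but D’s negative weight turns that into evidence for admission, mirroring the remark above that a non-random ‘wrong’ judge still carries information.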