Ask Sawal

Discussion Forum
Notification Icon1
Write Answer Icon
Add Question Icon

Why xgboost is extreme?

1 Answer(s) Available
Answer # 1 #

Gradient Boosting is a machine learning algorithm that can be used for both classification and regression problems. It creates an ensemble model from numerous weak predictors usually “Regression Trees” which are added in a stage wise fashion with each new tree focusing on the errors of the previous tree. This additive approach with a focus on previous mistakes essentially converts these weak learners into a single strong predictor.

One concept that I found really helpful to get an intuition for gradient boosting is to think of the problem in terms of gradient descent. I am essentially summarizing a fantastic blog post written by Nicholas Hug which will be linked here.

Just to quickly recap how gradient descent is applied to a simple Ordinary Least Squares estimator. If we take the negative of the gradient of the loss function this will essentially guide us to the minimum of the function aka our best model. Most models such as OLS use gradient descent on the models hyperparamteters such as the slope/intercept to find the optimal solution.

This can be expressed mathematically as follows our loss function is defined as the sum of the squared residuals:

In order to minimize the loss function we take the derivative of the loss function with respect to the slope and the intercept and add the negative of this value to the slope weight scaled by some learning rate to achieve a more accurate model this is process is repeated until convergence.

So why do we care ?, and what has this got to do with gradient boosting ?. Well if you notice that the sum of the squared residuals is written in terms of the predictions themselves (y hat). What would happen if we took the derivative with respect to predictions themselves and iteratively updated the predictions themselves ?.

This is genius but there is one vital flaw we can’t make any predictions with this method as we need the true value Y in order to calculate the loss and update our predictions. Don’t worry this is where the sequential regression trees (weak learners) come into place. So instead of updating our predictions with the real value of the gradient, we will train a regression tree to predict these gradients, at each iteration. Thus, allowing us to make predictions on unseen data points.

Unfortunately we are not finished as the negative gradient only gives the direction of the step. Further effort is necessary to determine the step length (pm) that the will be taken in the direction. The most popular way to do this is a line search I will not go through all the math as someone already has here. Briefly we can write the loss function in terms of a tree do some math and we essentially end up with an equation to measure the quality of a tree structure with respect to the minimizing the loss function.

As it is impractical to create every tree structure we can re-arrange this function to minimize the loss at every layer of the tree. This is a much better method than just using the internal mechanism of the regression tree to fit the residuals.

This is the core idea behind gradient boosting and the pseudo code for such an algorithm is as follows:

Gradient Boosting Pseudo Code

1.The algorithm begins by making an initial prediction that minimizes this equation below. If the task is regression then this equation boils down to using the average value of our target column as our initial prediction. Whereas, if we are concerned with classification then our initial prediction will use the log odds of our target variable.

2. Next we need to compute the negative gradient of the loss function with respect to our predictions. A regression tree is then trained to estimate this gradient in order to predict unseen values.

3. Now that we have our regression tree structure we still need to find a single value for each leaf node that will minimizes this summation below. Again this looks complicated and the math is ! but when you take the derivative with respect to gamma and set this equation to zero to find optimal leaf value.It essentially tells us that the average value in a leaf node is the optimal value for regression.

4. This new regression tree is now added to the initial prediction, scaled by some shrinkage factor (learning rate). The negative gradient of this new model is then calculated and the entire process is repeated for a specific number of Boosting Rounds.

All of this may seem quite complicated but in practice it is actually surprisingly manageable. I will not do an example of gradient boosting by hand as there are many resources available online. Such as this video series by Statquest which I highly recommend watching to gain an intuition how this algorithm works in practice.

Although this series of videos eloquently describes the process of gradient boosting. If you were paying attention above you will notice that nearly all of the online resources use a naive approach to boosting. That is they use just a simple regression tree with its own internal split metric to fit the residuals.

So instead of using the defined loss function to construct the tree they simply grow the tree using some standard metric which has no direct link to problem at hand. However this is only a trivial fix as we simply replace the gain metric and sum up the leaf values in a different way but the process itself remains unchanged. On my GitHub repository I implement both a naive approach and on that follows Friedman’s model named “Gradient Boosting Machine”.

xbgyfabc Levi