class: center, middle, inverse, title-slide

# MLCA Week 8:
## Gradient Boosting Machines
### Mike Mahoney
### 2021-10-20

---
class: center, middle

# Gradient Boosting Machines

---
class: middle

We're going to move on this week to talk about **gradient boosting machines** (GBMs).

Like random forests, GBMs aggregate decision trees in order to reduce prediction error, resulting in more accurate predictions. However, unlike random forests, trees in a GBM are not grown all at once to predict the same outcome. Instead, trees are grown in sequence, and each tree predicts the residuals of the trees grown before it.

GBMs are computationally intensive and a downright bear to tune, but done right, GBMs are often the best predictive model for a given task, and they have a long history of winning competitive machine learning challenges.

---

But before we get to all that, let's talk about **boosting** -- the "B" in GBM.

Boosting is the process of sequentially adding models to a base model in order to reduce errors and refine the predictions of the initial model.

Let's walk through an example, using our Ames data. First off, we'll create our training and testing splits as usual:

```r
set.seed(123)
ames <- AmesHousing::make_ames()

row_idx <- sample(seq_len(nrow(ames)), nrow(ames))
training <- ames[row_idx < nrow(ames) * 0.8, ]
testing <- ames[row_idx >= nrow(ames) * 0.8, ]
```

---

The first step in creating a boosted model is to fit a "base" model. While we can use any model for this -- it's not uncommon to see boosted linear models, for instance -- most often we'll use decision trees.

To keep things simple, we'll go ahead and fit our default decision tree:

```r
library(rpart)
first_tree <- rpart(Sale_Price ~ ., training)
```

As we've seen before, this decision tree is a weak learner which doesn't do great at predicting sale prices:

```r
training$first_prediction <- predict(first_tree, training)
sqrt(mean((training$first_prediction - training$Sale_Price)^2))
```

```
## [1] 38055.45
```

(We'll look at test-set accuracy in a minute.)

---

So we now have our base model, and we want to improve its accuracy through boosting. In order to do that, we're going to fit a new model to predict the _errors_ of our base model.

First, we need to calculate the residuals from our model:

```r
training$residual <- training$Sale_Price - training$first_prediction
```

We'll then go ahead and create a new training data frame, without our original sale price or prediction columns (since we don't want to include those in the model):

```r
library(dplyr)
new_training <- select(training, -Sale_Price, -first_prediction)
```

And finally, we'll fit another decision tree to try and predict the _residuals_ from our first tree:

```r
second_tree <- rpart(residual ~ ., new_training)
```

---

We now have two separate models -- one that predicts sale prices, and one that predicts the error of that prediction. To get a single, combined predicted sale price from those models, we can just add the two predictions together:

```r
training$second_prediction <- predict(second_tree, training)
training$adjusted_prediction <- training$first_prediction + training$second_prediction
```

Now we can compare the RMSE of this combined model against our original decision tree:

```r
sqrt(mean((training$first_prediction - training$Sale_Price)^2))
```

```
## [1] 38055.45
```

```r
sqrt(mean((training$adjusted_prediction - training$Sale_Price)^2))
```

```
## [1] 28530.86
```

And it looks like the boosted model does dramatically better than the single tree!
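---

As a quick aside: we're going to be calculating RMSE by hand a lot in these slides. If you get tired of typing out `sqrt(mean(...))`, you can wrap the calculation in a small helper function. This `rmse()` helper is our own addition (it doesn't come from any package), and we'll keep writing the calculation out in full on the following slides so each step stays explicit:

```r
# A tiny helper for root-mean-square error, so we don't have to retype
# sqrt(mean((predicted - actual)^2)) for every model we evaluate
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# The same comparison as on the last slide, written with the helper:
rmse(training$first_prediction, training$Sale_Price)
rmse(training$adjusted_prediction, training$Sale_Price)
```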
---

Of course, this is the training set error. We already know that decision trees can be notorious overfitters, with much lower errors on the training set than the test set. With that in mind, let's look at how our boosted model does on the test set:

```r
testing$first_prediction <- predict(first_tree, testing)
testing$second_prediction <- predict(second_tree, testing)
testing$adjusted_prediction <- testing$first_prediction + testing$second_prediction

sqrt(mean((testing$first_prediction - testing$Sale_Price)^2))
```

```
## [1] 39131.38
```

```r
sqrt(mean((testing$adjusted_prediction - testing$Sale_Price)^2))
```

```
## [1] 35830.06
```

Looking at these RMSEs, it does seem that the boosted model improves on the single decision tree -- but by much less than when we looked at training RMSE.

This looks like classic overfitting -- our model has become far too confident when predicting the training set, so it makes extreme predictions which don't work when predicting new data. What can we do about that?

---

Well, what if we made our model less confident? For instance, what if instead of believing that our second model perfectly corrects the base prediction, we viewed it as being a weak indicator of how far off the base prediction was?

When we think about our second model in that way, it might make sense for us to not add the entire residual predicted by the model. Instead, we might try nudging our initial prediction by, say, half the correction factor:

```r
testing$adjusted_prediction <- testing$first_prediction + (testing$second_prediction * 0.5)
```

We can then evaluate this new adjustment method:

```r
sqrt(mean((testing$first_prediction - testing$Sale_Price)^2))
```

```
## [1] 39131.38
```

```r
sqrt(mean((testing$adjusted_prediction - testing$Sale_Price)^2))
```

```
## [1] 35635.45
```

Though it's still not as accurate as our predictions on the training data, this weaker nudge actually outperforms our first attempt.

---

We can think of that 0.5 multiplier as being a "learning rate" for our model. Conceptually, the learning rate represents how fast we want our model to incorporate corrections into its predictions.

For instance, we could go on to add a third tree to this model, using a learning rate of 0.5. To do so, we'd first want to calculate the residuals from our second tree:

```r
training$adjusted_prediction <- training$first_prediction + (training$second_prediction * 0.5)
training$residual <- training$Sale_Price - training$adjusted_prediction
```

Then we'll recreate a data frame without our predictions, and fit a new tree:

```r
new_training <- select(training, -Sale_Price, -first_prediction,
                       -second_prediction, -adjusted_prediction)
third_tree <- rpart(residual ~ ., new_training)
```

---

And then we can add that third tree to our model, just like we did with the second one:

```r
testing$third_prediction <- predict(third_tree, testing)
testing$final_prediction <- testing$adjusted_prediction + (testing$third_prediction * 0.5)

sqrt(mean((testing$first_prediction - testing$Sale_Price)^2))
```

```
## [1] 39131.38
```

```r
sqrt(mean((testing$adjusted_prediction - testing$Sale_Price)^2))
```

```
## [1] 35635.45
```

```r
sqrt(mean((testing$final_prediction - testing$Sale_Price)^2))
```

```
## [1] 34034.33
```

Our third tree improves our prediction accuracy even further!

This is how boosting works, at its core. Most of the time we'll be working with many more trees, but the essential process of building trees to predict the residuals of past predictions is the same.
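---

To make that concrete, here's a minimal sketch of the manual process generalized to many trees. Everything here is illustrative: `boost_training` is just a fresh copy of our training split, and the choice of 10 trees and a 0.5 learning rate are stand-ins meant to show the shape of the loop, not values we'd actually recommend:

```r
library(rpart)
library(dplyr)

n_trees <- 10
learning_rate <- 0.5

# Start from a clean copy of the training data, without the prediction
# columns we've been tacking onto `training` over the last few slides
boost_training <- ames[row_idx < nrow(ames) * 0.8, ]

# Fit the base model and store its predictions
base_tree <- rpart(Sale_Price ~ ., boost_training)
current_prediction <- predict(base_tree, boost_training)

for (i in seq_len(n_trees)) {
  # Each new tree predicts the residuals of the current combined prediction
  boost_training$residual <- boost_training$Sale_Price - current_prediction
  new_tree <- rpart(residual ~ ., select(boost_training, -Sale_Price))
  # Nudge the combined prediction by a fraction of the predicted residual
  current_prediction <- current_prediction +
    predict(new_tree, boost_training) * learning_rate
}

sqrt(mean((current_prediction - boost_training$Sale_Price)^2))
```

This is (more or less) the loop that gradient boosting software runs for us -- just with many more trees, much weaker trees, and a few extra refinements.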
---

The other big way that boosting algorithms reduce overfitting is by using weaker learners.

For instance, when boosting decision trees, you might wind up fitting a bunch of trees with only one split, sometimes referred to as **decision stumps**. These stumps will have _extremely_ low predictive accuracy on their own, but their accuracy isn't zero altogether. If you add enough of them together, you might be able to aggregate all of those slightly-better-than-nothing predictions into a solid final prediction.

We won't walk through the process of using decision stumps manually, because it takes _many_ trees to see real improvements over the base learner. But it's worth highlighting that usually when we talk about boosting algorithms, we're purposefully using extremely weak versions of our component models to avoid overfitting.

---

Of course, when we want to use more trees and smaller learning rates, we usually won't fit all the trees ourselves. Gradient boosting machines provide a way to fit all the trees automatically.

We'll use the `gbm` package to fit our GBMs today -- go ahead and install it if you need to:

```r
install.packages("gbm")
```

We've also made a mess of our training and testing sets, so let's use our original `ames` data and `row_idx` vector to recreate those:

```r
training <- ames[row_idx < nrow(ames) * 0.8, ]
testing <- ames[row_idx >= nrow(ames) * 0.8, ]
```

---

Fitting a simple GBM works just like every other type of model we've seen so far -- we'll load the `gbm` package and then fit a model using `gbm`:

```r
library(gbm)
```

```
## Loaded gbm 2.1.8
```

```r
first_gbm <- gbm(Sale_Price ~ ., data = training)
```

```
## Distribution not specified, assuming gaussian ...
```

Note that the second argument to `gbm` is `distribution`, not `data`, so we need to specify our data explicitly.

The `gbm` package will walk through the same steps as we did, fitting sequential trees to our data to try and reduce prediction error. However, `gbm` will fit a lot more trees than we did -- by default, 100, but we'll usually use more than that -- and will also fit each of those trees on a different random subsample of the training data.

The theory here is similar to random forests -- we want our trees to be more different from one another, so that the aggregated prediction is better. We'll talk more about this next week.

---
class: middle

With our model fit, we can then evaluate our test-set RMSE:

```r
sqrt(mean((predict(first_gbm, testing) - testing$Sale_Price)^2))
```

```
## Using 100 trees...
```

```
## [1] 27374.61
```

Out-of-the-box, GBM does an _alright_ job predicting sale price. It's better than linear regression, but worse than random forests.

---
class: middle

But, much like random forests, GBMs have a number of hyperparameters for us to tune. The main ones in `gbm` include:

+ `shrinkage`: The learning rate for the model.
+ `n.trees`: How many trees to grow. Unlike random forests, GBMs can wind up overfitting if they're grown with too many trees, so it's important to tune this value with the other hyperparameters.
+ `interaction.depth`: The maximum depth of each tree (how many splits). Tends to be between 3 and 8, though older papers often use 1 to make decision stumps.
+ `n.minobsinnode`: The minimum number of observations per leaf node in each tree (like `min.node.size` in ranger).

GBMs are incredibly sensitive to hyperparameter values, and tend to need more careful tuning than random forests to predict effectively.
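---

To see where each of those arguments plugs in, here's what a single `gbm()` call looks like with all of them spelled out. The specific values below are placeholders for illustration (not tuned recommendations), and `example_gbm` is just a name we've made up for this sketch:

```r
example_gbm <- gbm(
  formula = Sale_Price ~ .,
  distribution = "gaussian",  # optional -- silences the "assuming gaussian" message
  data = training,
  n.trees = 1000,             # how many trees to grow
  shrinkage = 0.05,           # the learning rate
  interaction.depth = 3,      # how many splits each tree gets
  n.minobsinnode = 10,        # minimum observations per leaf node
  cv.folds = 5                # built-in k-fold cross-validation
)
```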
---

We can tune these hyperparameters using grid searches, just like last week. However, because GBMs need to grow their trees sequentially (unlike random forests, which can grow all of their trees at the same time), grid searches for GBMs can take quite a bit of time. As such, we tend to tune GBMs in multiple stages.

We're going to start off by tuning the learning rate for our GBM. Smaller learning rates tend to produce better models, but require more trees to be effective and so take longer to run. Higher learning rates are faster, but might not be as effective. For this reason, we're going to try a range of learning rates and choose the value that balances accuracy and training time, which we'll measure using `system.time`.

One very nice thing about `gbm` is that we can implement k-fold CV just by using the `cv.folds` argument; we don't need to write our own function this time.

---

Putting that together, our grid search to choose a learning rate looks like this. Fair warning, this takes about 6 minutes to run:

```r
tuning_grid <- expand.grid(
  learning_rate = c(0.3, 0.1, 0.05, 0.01, 0.005),
  rmse = NA,
  trees = NA,
  time = NA
)

for (i in seq_len(nrow(tuning_grid))) {

  train_time <- system.time({
    m <- gbm(
      formula = Sale_Price ~ .,
      # Optional -- silences a warning:
      distribution = "gaussian",
      data = training,
      n.trees = 6000,
      shrinkage = tuning_grid$learning_rate[i],
      interaction.depth = 3,
      n.minobsinnode = 10,
      cv.folds = 5
    )
  })

  tuning_grid$rmse[i] <- sqrt(min(m$cv.error))
  tuning_grid$trees[i] <- which.min(m$cv.error)
  tuning_grid$time[i] <- train_time[["elapsed"]]

}
```

---

We can then see which of our learning rates best balances accuracy and time:

```r
arrange(tuning_grid, rmse)
```

```
##   learning_rate     rmse trees   time
## 1         0.010 22118.63  6000 66.468
## 2         0.100 22766.41  1566 66.063
## 3         0.050 22791.24  2527 68.975
## 4         0.005 25319.20  5074 69.443
## 5         0.300 28541.96    37 65.540
```

In this case, we'll go ahead and run with a learning rate of 0.01 for the rest of our tuning.

---

With a learning rate chosen, we'll go ahead and tune our tree-specific parameters (namely, interaction depth and observations per node) using a new grid. This one took about 15 minutes to complete:

```r
tuning_grid <- expand.grid(
  learning_rate = 0.01,
  interaction.depth = c(3, 5, 7),
  n.minobsinnode = c(5, 10, 15),
  trees = NA,
  rmse = NA
)

for (i in seq_len(nrow(tuning_grid))) {

  m <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = training,
    n.trees = 6000,
    shrinkage = tuning_grid$learning_rate[i],
    interaction.depth = tuning_grid$interaction.depth[i],
    n.minobsinnode = tuning_grid$n.minobsinnode[i],
    cv.folds = 5
  )

  tuning_grid$trees[i] <- which.min(m$cv.error)
  tuning_grid$rmse[i] <- sqrt(min(m$cv.error))

}
```

---

And just as before, we'll look to see which of our hyperparameter combinations produced the best models:

```r
head(arrange(tuning_grid, rmse))
```

```
##   learning_rate interaction.depth n.minobsinnode trees     rmse
## 1          0.01                 5              5  5999 21841.67
## 2          0.01                 3              5  6000 22090.23
## 3          0.01                 5             10  5999 22111.81
## 4          0.01                 3             10  6000 22217.39
## 5          0.01                 3             15  6000 22466.78
## 6          0.01                 5             15  5999 22749.58
```

As with random forests, we could continue tuning with narrower and narrower ranges until we were confident we'd arrived at the absolute best hyperparameters. We'd also experiment with smaller and smaller shrinkages, to see if they improved the fit at all.
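---

For instance, a hypothetical third round of tuning might narrow in around the best combination from the table above. We haven't actually run this search, so the ranges below are just an illustration of the idea -- but the grid would plug straight into the same loop we just used:

```r
# A narrower follow-up grid, centered on the best values so far
# (learning rate 0.01, interaction.depth 5, n.minobsinnode 5)
tuning_grid <- expand.grid(
  learning_rate = c(0.01, 0.0075, 0.005),
  interaction.depth = c(4, 5, 6),
  n.minobsinnode = c(3, 5, 7),
  trees = NA,
  rmse = NA
)
```

Since the first grid suggested a learning rate of 0.01 was still improving at 6,000 trees, we'd probably also want to raise `n.trees` before trying these smaller learning rates.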
---

However, even the values we've landed on from two rounds of tuning produce a pretty good model:

```r
final_gbm <- gbm(
  Sale_Price ~ .,
  data = training,
  n.trees = 6000,
  shrinkage = 0.01,
  interaction.depth = 5,
  n.minobsinnode = 5
)
```

```
## Distribution not specified, assuming gaussian ...
```

```r
sqrt(mean((predict(final_gbm, testing) - testing$Sale_Price)^2))
```

```
## Using 6000 trees...
```

```
## [1] 19867.59
```

This is dramatically better than our fully tuned random forest!

---
class: middle

GBMs are incredibly effective predictive models. With proper tuning, they're often one of the best tools for any predictive task.

However, GBMs aren't without their drawbacks. These models can be incredibly computationally intensive to train and tune, especially compared to random forests. Unlike random forests, they also rarely work well "out of the box" -- GBMs almost always require tuning before they're effective predictors. Lastly, while GBMs do a good job of reducing bias, they aren't as good as random forests at reducing the error that comes from variance.

Next week we'll talk about a variation of GBMs designed to fix most of these issues.

---
class: middle

# References

---

Titles are links:

+ [HOML Chapter 12](https://bradleyboehmke.github.io/HOML/gbm.html)

A few papers:

+ [Additive logistic regression: a statistical view of boosting](https://projecteuclid.org/journals/annals-of-statistics/volume-28/issue-2/Additive-logistic-regression--a-statistical-view-of-boosting-With/10.1214/aos/1016218223.full)
+ [Greedy function approximation: A gradient boosting machine](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boostingmachine/10.1214/aos/1013203451.full)
+ [Stochastic gradient boosting](https://www.sciencedirect.com/science/article/pii/S0167947301000652)