class: center, middle, inverse, title-slide # MLCA Week 9: ## Stochastic Gradient Boosting Machines and Stacked Ensembles ### Mike Mahoney ### 2021-10-27 --- class: center, middle # Stochastic GBMs --- class: middle We ended last week's discussion of GBMs by talking about how GBMs can be incredibly effective prediction algorithms, but at the cost of being computationally intensive to fit and less adept at reducing variance error than random forests. This week, we'll focus on extensions to the basic GBM which help address that. --- class: middle Last week, I briefly mentioned that the `gbm` package fits each of its trees on a different random subsample of the training data each time. This algorithm is more properly known as **stochastic gradient boosting machines**: because each tree is fit to a random subsample, the model you get will change every time you re-run it (assuming you haven't set a seed). This simple change improves GBM accuracy enough that I'm not actually aware of a GBM implementation that doesn't resample. But this left researchers wondering -- could other forms of randomness, like we use in random forests, improve GBMs even further? --- class: middle Remember that random forests use two main forms of randomness to increase the variance between their decision trees. First off, each tree is fit to a different bootstrap sample; rather than subsampling the training data, random forests fit decision trees to data with the same number of observations as the training data, sampled with replacement. Secondly, each split in each decision tree is selected from a different subsample of predictors. What if we introduced these forms of randomness to GBMs? --- class: middle There's currently a lot of competition between packages for fitting these sorts of stochastic GBMs. There are three main contenders worth mentioning briefly: + `xgboost`, released in 2014, was the first mainstream stochastic GBM implementation to really take off and is still very popular today. + `lightGBM`, released in 2016, runs dramatically faster than `xgboost` while still producing highly accurate models. + `catboost`, released in 2017, is slower and less accurate than `lightGBM` but makes it a bit easier to deal with categorical variables. We're going to use `lightGBM`, because I like it the most. Go ahead and install it now: ```r install.packages( "lightgbm", repos = "https://cran.microsoft.com/snapshot/2021-10-24/") ``` --- Before we actually fit a model, however, we need to prepare our data. We'll use the Ames housing data set today: ```r ames <- AmesHousing::make_ames() ``` Before we split this into testing and training sets, we first need to encode our categorical variables. `lightGBM` (as well as `xgboost`) can't handle categorical variables directly, so we'll need to one-hot encode them first.
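
Just to make concrete what one-hot encoding produces, here's a toy example (not the Ames data) using base R's `model.matrix()` -- each level of the factor becomes its own 0/1 indicator column:

```r
# A three-row toy factor, purely for illustration
toy <- data.frame(Foundation = factor(c("CBlock", "PConc", "Wood")))

# Removing the intercept (the "- 1") gives one indicator column per level
model.matrix(~ Foundation - 1, data = toy)
```
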
Way back in Week 2 we talked about how to do this with `dplyr` and `tidyr`: ```r library(tidyr) library(dplyr) ames %>% mutate(dummy_value = 1) %>% # Create a dummy value column pivot_wider( names_from = Foundation, names_prefix = "Foundation", values_from = dummy_value, values_fill = 0 ) %>% select(starts_with("Foundation"), Sale_Price, Year_Built, Gr_Liv_Area) |> head(1) ``` ``` ## # A tibble: 1 × 9 ## FoundationCBlock FoundationPConc FoundationWood FoundationBrkTil ## <dbl> <dbl> <dbl> <dbl> ## 1 1 0 0 0 ## # … with 5 more variables: FoundationSlab <dbl>, FoundationStone <dbl>, ## # Sale_Price <int>, Year_Built <int>, Gr_Liv_Area <int> ``` --- class: middle However, the Ames data has 46 categorical variables to encode, so doing them one-at-a-time is going to take a while. Instead, we're going to use a new package named `recipes` to dummy encode all of our variables at once. Let's install it now: ```r install.packages("recipes") ``` --- The `recipes` package is a fantastic tool that helps us pre-process our data before we fit our models. In order to use it to encode our variables, we need to walk through a few steps. First off, we'll load the package: ```r library(recipes) ``` Then we'll create a "recipe" object, which tells the other functions from the package what formula and data we're trying to work with: ```r ames_recipe <- recipe(Sale_Price ~ ., data = ames) ``` We're now able to specify what pre-processing steps we want to take using any number of functions from `recipes`, all using the `step_` prefix. For example, to say that we want to dummy encode all the factors in our data, we can use the `step_dummy` function: ```r ames_recipe <- step_dummy(ames_recipe, where(is.factor)) ``` --- If we wanted, we could keep adding pre-processing steps to our "recipe" object. Since we just want to encode our factors, we'll tell the `recipes` package we're done by using the `prep` function: ```r ames_recipe <- prep(ames_recipe) ``` And finally, we'll actually encode our variables by way of the `bake` function: ```r ames <- bake(ames_recipe, ames) ``` With that all done, we've finally got a data frame with encoded categorical variables: ```r head(ames[36:309]) ``` ``` ## # A tibble: 6 × 274 ## MS_SubClass_One_S… MS_SubClass_One_Sto… MS_SubClass_One_an… MS_SubClass_One_a… ## <dbl> <dbl> <dbl> <dbl> ## 1 0 0 0 0 ## 2 0 0 0 0 ## 3 0 0 0 0 ## 4 0 0 0 0 ## 5 0 0 0 0 ## 6 0 0 0 0 ## # … with 270 more variables: MS_SubClass_Two_Story_1946_and_Newer <dbl>, ## # MS_SubClass_Two_Story_1945_and_Older <dbl>, ## # MS_SubClass_Two_and_Half_Story_All_Ages <dbl>, ## # MS_SubClass_Split_or_Multilevel <dbl>, MS_SubClass_Split_Foyer <dbl>, ## # MS_SubClass_Duplex_All_Styles_and_Ages <dbl>, ## # MS_SubClass_One_Story_PUD_1946_and_Newer <dbl>, ## # MS_SubClass_One_and_Half_Story_PUD_All_Ages <dbl>, … ``` --- I should point out that the `recipes` package is really designed to work with the pipe operator. 
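
For example, if we also wanted to log-transform the sale price and normalize the numeric predictors before encoding, we could chain those `step_` functions together with the pipe -- a purely hypothetical sketch, since `lightgbm` doesn't need any of that:

```r
# Hypothetical extra steps, just to show how step_ functions chain together
# with the pipe -- we don't actually need these for lightgbm
fancier_recipe <- recipe(Sale_Price ~ ., data = AmesHousing::make_ames()) |>
  step_log(Sale_Price) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(where(is.factor))
```
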
While the process of dummy encoding our variables seems like a lot when broken into separate lines, like I just did for explanation, the code is a lot neater when piped together: ```r ames <- AmesHousing::make_ames() ames <- recipe(Sale_Price ~ ., data = ames) |> step_dummy(where(is.factor)) |> prep() |> bake(ames) head(ames[36:309]) ``` ``` ## # A tibble: 6 × 274 ## MS_SubClass_One_S… MS_SubClass_One_Sto… MS_SubClass_One_an… MS_SubClass_One_a… ## <dbl> <dbl> <dbl> <dbl> ## 1 0 0 0 0 ## 2 0 0 0 0 ## 3 0 0 0 0 ## 4 0 0 0 0 ## 5 0 0 0 0 ## 6 0 0 0 0 ## # … with 270 more variables: MS_SubClass_Two_Story_1946_and_Newer <dbl>, ## # MS_SubClass_Two_Story_1945_and_Older <dbl>, ## # MS_SubClass_Two_and_Half_Story_All_Ages <dbl>, ## # MS_SubClass_Split_or_Multilevel <dbl>, MS_SubClass_Split_Foyer <dbl>, ## # MS_SubClass_Duplex_All_Styles_and_Ages <dbl>, ## # MS_SubClass_One_Story_PUD_1946_and_Newer <dbl>, ## # MS_SubClass_One_and_Half_Story_PUD_All_Ages <dbl>, … ``` --- We'll talk a little bit more about using `recipes` for data processing next week. For now, once our variables are encoded, we'll go ahead and split our data into training and testing sets: ```r set.seed(123) row_idx <- sample(seq_len(nrow(ames)), nrow(ames)) training <- ames[row_idx < nrow(ames) * 0.8, ] testing <- ames[row_idx >= nrow(ames) * 0.8, ] ``` And load lightgbm: ```r library(lightgbm) ``` If you look at `?lightgbm`, you'll notice that it looks a little bit different than `?lm` or the other models we've worked with so far. Specifically, `lightgbm` doesn't have a `formula` argument for us to use to specify our model. --- Instead, we need to provide `lightgbm` with a _matrix_ of our predictors and a _vector_ of our outcome. We can create those objects like this: ```r xtrain <- as.matrix(training[setdiff(names(training), "Sale_Price")]) ytrain <- training[["Sale_Price"]] ``` With our objects created, the syntax to fit a `lightgbm` model is then: ```r first_lgb <- lightgbm( data = xtrain, label = ytrain, verbose = -1L, obj = "regression" ) ``` In addition to providing our predictors and outcome, we also need to specify that we want to do regression via `obj = "regression"`. I also like `lightgbm` to run in quiet mode, so I set the argument `verbose = -1L`, but this is optional. --- In order to predict with this model, we also need to convert our test set into a matrix: ```r xtest <- as.matrix(testing[setdiff(names(testing), "Sale_Price")]) ``` We can then use `predict` as normal to generate predictions: ```r lgb_predictions <- predict(first_lgb, xtest) ``` Then we calculate RMSE as normal: ```r sqrt(mean((lgb_predictions - testing$Sale_Price)^2)) ``` ``` ## [1] 36751.29 ``` And all in all, this seems like a lot of work for a really crappy model. --- But of course, GBMs are very sensitive to their hyperparameter values, and we haven't tuned this model at all yet. We'll tune this `lightgbm` model in a pretty similar way to the last two weeks. First, we'll tweak the `calc_rmse` function from week 7 so that it will create a matrix to predict on: ```r calc_rmse <- function(model, data) { xtest <- as.matrix(data[setdiff(names(data), "Sale_Price")]) lgb_predictions <- predict(model, xtest) sqrt(mean((lgb_predictions - data$Sale_Price)^2)) } ``` We'll also tweak our `k_fold_cv` function so that it works with `lightgbm`. This function is a bit longer than last time, so it's on the next slide all by itself. --- ```r k_fold_cv <- function(data, k, nrounds = 10L, ...)
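# Split `data` into k folds; for each fold, fit a lightgbm model to the other
# k - 1 folds (passing any hyperparameters supplied through `...` along to
# lightgbm), score the held-out fold with calc_rmse(), and return the mean
# RMSE across all k folds.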
{ per_fold <- floor(nrow(data) / k) fold_order <- sample(seq_len(nrow(data)), size = per_fold * k) fold_rows <- split( fold_order, rep(1:k, each = per_fold) ) vapply( fold_rows, \(fold_idx) { fold_test <- data[fold_idx, ] fold_train <- data[-fold_idx, ] xtrain <- as.matrix(fold_train[setdiff(names(fold_train), "Sale_Price")]) ytrain <- fold_train[["Sale_Price"]] fold_lgb <- lightgbm( data = xtrain, label = ytrain, verbose = -1L, obj = "regression", nrounds = nrounds, params = ... ) calc_rmse(fold_lgb, fold_test) }, numeric(1) ) |> mean() } ``` --- With the prep work out of the way, we can get started tuning hyperparameters. At its core, `lightgbm` is still a GBM, and so has the same basic hyperparameters as our `gbm` model did. For instance, we'll start off by using a learning rate of 0.1 and tuning how many trees our models should use. In `lightgbm`, this is called `nrounds`. ```r tuning_grid <- expand.grid( learning_rate = 0.1, nrounds = c(10, 50, 100, 500, 1000, 1500, 2000, 2500, 3000), rmse = NA ) for (i in seq_len(nrow(tuning_grid))) { tuning_grid$rmse[i] <- k_fold_cv( training, k = 5, learning_rate = tuning_grid$learning_rate[i], nrounds = tuning_grid$nrounds[i] ) } head(arrange(tuning_grid, rmse), 2) ``` ``` ## learning_rate nrounds rmse ## 1 0.1 1000 25680.25 ## 2 0.1 1500 26150.08 ``` --- We can keep going to tune the other parameters we used in `gbm`. The maximum depth of trees here is called `max_depth` instead of `interaction.depth`, and does the same thing it did back in `gbm`. The closest equivalent of `n.minobsinnode` is `min_data_in_leaf`, the minimum number of observations per leaf; here we'll tune the related `min_data_in_bin`, the minimum number of observations allowed in each of the histogram bins `lightgbm` uses when searching for splits, which limits how finely the trees can slice up the data in much the same spirit. Alongside the normal values we're going to try for `max_depth`, we'll also try the special value -1. When `max_depth` is set to -1, each decision tree is allowed to grow as deep as it wants, with no hard limit set by the algorithm. --- Our grid to tune these variables looks like this: ```r tuning_grid <- expand.grid( learning_rate = 0.1, nrounds = 1000, max_depth = c(-1, 2, 8, 32, 63), min_data_in_bin = c(3, 8, 13, 18), rmse = NA ) for (i in seq_len(nrow(tuning_grid))) { tuning_grid$rmse[i] <- k_fold_cv( training, k = 5, learning_rate = tuning_grid$learning_rate[i], nrounds = tuning_grid$nrounds[i], max_depth = tuning_grid$max_depth[i], min_data_in_bin = tuning_grid$min_data_in_bin[i] ) } head(arrange(tuning_grid, rmse), 2) ``` ``` ## learning_rate nrounds max_depth min_data_in_bin rmse ## 1 0.1 1000 2 18 24328.38 ## 2 0.1 1000 2 13 24338.81 ``` --- With the basic GBM hyperparameters sorted out, we can now start tuning the stochastic hyperparameters that make lightgbm so interesting. Specifically, there are three new hyperparameters worth playing with: + `bagging_fraction` represents what percent of observations should be sampled for each tree, just like `sample.fraction` back in `ranger`. + `bagging_freq` represents how often that sample should be redrawn. If you set it to 1, then each tree is fit to a different random sample; if you set it to 5, then the model will fit five trees to the same sample before resampling. If you set it to 0, the model won't resample at all. + `feature_fraction` represents the percentage of variables that should be available to each tree. For instance, if you set it to 0.8, each tree would be fit using a random 80% of predictors. We'll tune all three shortly; first, the next slide shows roughly where they go in a `lightgbm()` call.
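
---

The values here are placeholders purely for illustration -- nothing about them is tuned -- but they show that these stochastic parameters ride along in the same `params` list as everything else:

```r
# Illustrative values only -- the real values get chosen by cross-validation
demo_lgb <- lightgbm(
  data = xtrain,
  label = ytrain,
  verbose = -1L,
  obj = "regression",
  nrounds = 1000,
  params = list(
    learning_rate = 0.1,
    max_depth = 2,
    min_data_in_bin = 18,
    bagging_fraction = 0.8, # each tree sees a random 80% of rows...
    bagging_freq = 1,       # ...redrawn before every tree
    feature_fraction = 0.8  # and a random 80% of the predictors
  )
)
```
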
--- ```r tuning_grid <- expand.grid( learning_rate = 0.1, nrounds = 1000, max_depth = 2, min_data_in_bin = 18, bagging_freq = c(0, 1, 5, 10), bagging_fraction = seq(0.3, 1.0, 0.1), feature_fraction = seq(0.3, 1.0, 0.1), rmse = NA ) for (i in seq_len(nrow(tuning_grid))) { tuning_grid$rmse[i] <- k_fold_cv( training, k = 5, learning_rate = tuning_grid$learning_rate[i], nrounds = tuning_grid$nrounds[i], max_depth = tuning_grid$max_depth[i], min_data_in_bin = tuning_grid$min_data_in_bin[i], bagging_freq = tuning_grid$bagging_freq[i], bagging_fraction = tuning_grid$bagging_fraction[i], feature_fraction = tuning_grid$feature_fraction[i] ) } head(arrange(tuning_grid, rmse), 2) |> select(bagging_freq, bagging_fraction, feature_fraction, rmse) ``` ``` ## bagging_freq bagging_fraction feature_fraction rmse ## 1 5 0.9 0.6 23053.00 ## 2 0 0.4 0.8 23241.62 ``` --- class: middle At this point, we've tuned our model using the most important hyperparameters. These hyperparameters are the ones that are most likely to impact our model accuracy; 90%+ of the possible improvements to your model come from these seven hyperparameters. But they aren't the only hyperparameters available. In fact, there are 86 hyperparameters in total, many of which might give some slight improvement to your model. Most of the time you won't need to worry about the other hyperparameters available. But if you're using `lightgbm` or `xgboost` professionally and need the best performance at all costs, I highly recommend [this website (link) on available parameters](https://sites.google.com/view/lauraepp/parameters). The [`lightgbm` docs also have some pointers on hyperparameters which might help improve speed, accuracy, or overfitting (link).](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html) --- With all that said and our tuning complete, let's go ahead and evaluate our final test set accuracy: ```r xtrain <- as.matrix(training[setdiff(names(training), "Sale_Price")]) ytrain <- training[["Sale_Price"]] final_lgb <- lightgbm( data = xtrain, label = ytrain, verbose = -1L, obj = "regression", nrounds = 1000, params = list( learning_rate = 0.1, max_depth = 2, min_data_in_bin = 18, bagging_freq = 5, bagging_fraction = 0.9, feature_fraction = 0.6 ) ) calc_rmse(final_lgb, testing) ``` ``` ## [1] 19730.79 ``` This is roughly comparable to the base GBM, but ran a lot faster -- and with further tuning, we could likely beat GBM altogether. --- So we now have three pretty alright models -- from `ranger`, `gbm`, and `lightgbm` -- all doing a good job predicting sale price for the Ames data. What do we do with them now? A lot of times in research, we'd go on to rank our models (by RMSE, MAE, or god forbid, p-values) and then only talk about our "best" model, with all our results and conclusions drawn from the top-ranked model. This is weird. We went through all the trouble of fitting multiple models because we thought they were all plausible options for the best tool for the job; suddenly having 100% confidence in a single model being the best based on its ranking against the test set places a _lot_ of faith in your test set being exactly representative of real-world data. What if there was a way to benefit from the work we put into training _all_ our models? --- Turns out, there is. We can combine the predictions of our models to better represent our confidence in _each_ model's predictive accuracy, and as a side effect usually generate better predictions overall. 
This method is called **stacking** (or sometimes stacked generalization, or sometimes "super learners"). Just like GBMs and random forests, stacking aggregates multiple models in order to improve predictive accuracy. However, unlike those methods, stacking tries to combine _strong learners_ (models which are pretty accurate on their own) rather than weak learner decision trees. --- We'll walk through a few examples of stacking. First things first, we need to recreate our `ranger` and `gbm` models: ```r library(ranger) final_ranger <- ranger( Sale_Price ~ ., training, num.trees = 800, mtry = 20, min.node.size = 5, replace = FALSE, sample.fraction = 1 ) ``` ```r library(gbm) final_gbm <- gbm( Sale_Price ~ ., data = training, n.trees = 6000, shrinkage = 0.01, interaction.depth = 5, n.minobsinnode = 5 ) ``` ``` ## Distribution not specified, assuming gaussian ... ``` --- In order to combine the predictions from these models, we'll first need to go ahead and make some predictions. I'll create a new testing data frame, just to keep our original test set clean: ```r stacked_test <- testing ``` And now I'll add columns with predictions to the data frame: ```r xtest <- as.matrix(stacked_test[setdiff(names(stacked_test), "Sale_Price")]) stacked_test$lgb <- predict(final_lgb, xtest) stacked_test$rf <- predictions(predict(final_ranger, stacked_test)) stacked_test$gbm <- predict(final_gbm, stacked_test) ``` ``` ## Using 6000 trees... ``` --- So now that we have our predictions, how can we combine them into a single stacked ensemble? We have a few options. The easiest is to simply average the predictions together: ```r stacked_test$avg <- (stacked_test$lgb + stacked_test$rf + stacked_test$gbm) / 3 ``` Conceptually, doing this is a way of saying "all of these models are equally likely to be the best one". Pragmatically, even this simple ensemble beats all our models on the test data: ```r stacked_test |> summarise( across(c(lgb, rf, gbm, avg), \(x) sqrt(mean((x - Sale_Price)^2))) ) ``` ``` ## # A tibble: 1 × 4 ## lgb rf gbm avg ## <dbl> <dbl> <dbl> <dbl> ## 1 19731. 22583. 19277. 18990. ``` --- But realistically, we don't think that all our models are equally likely to be the best. We already know that, for instance, this random forest is a bit worse at prediction than the two GBMs. Instead of the simple average, it often makes sense to use a weighted average, weighting each model's contribution to the final prediction based on its accuracy against a validation set or from k-fold CV. Let's try that now. We'll first have to calculate RMSE for each of our models from 5-fold CV. This takes a lot of code, so the next few slides will walk through each of the pieces. 
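
Before diving into the code, here's the whole idea in one formula: each model's weight is its inverse cross-validated RMSE, rescaled so that the weights sum to one -- lower error, bigger weight.

$$w_m = \frac{1 / \mathrm{RMSE}_m}{\sum_{j} 1 / \mathrm{RMSE}_j}$$
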
First off, we'll write a function to fit our ranger models and return RMSEs: ```r fit_ranger <- function(training, testing) { cv_model <- ranger( Sale_Price ~ ., training, num.trees = 800, mtry = 20, min.node.size = 5, replace = FALSE, sample.fraction = 1 ) preds <- predictions(predict(cv_model, testing)) sqrt(mean((preds - testing$Sale_Price)^2)) } ``` --- We'll do the same for GBM: ```r fit_gbm <- function(training, testing) { cv_model <- gbm( Sale_Price ~ ., data = training, n.trees = 6000, shrinkage = 0.01, interaction.depth = 5, n.minobsinnode = 5 ) preds <- predict(cv_model, testing) sqrt(mean((preds - testing$Sale_Price)^2)) } ``` --- And lastly for `lightgbm`: ```r fit_lgb <- function(training, testing) { xtrain <- as.matrix(training[setdiff(names(training), "Sale_Price")]) ytrain <- training[["Sale_Price"]] xtest <- as.matrix(testing[setdiff(names(testing), "Sale_Price")]) cv_model <- lightgbm( data = xtrain, label = ytrain, verbose = -1L, obj = "regression", nrounds = 1000, params = list( learning_rate = 0.1, max_depth = 2, min_data_in_bin = 18, bagging_freq = 5, bagging_fraction = 0.9, feature_fraction = 0.6 ) ) preds <- predict(cv_model, xtest) sqrt(mean((preds - testing$Sale_Price)^2)) } ``` --- We can then use those functions to estimate model accuracy against each of the five folds: ```r per_fold <- floor(nrow(training) / 5) fold_order <- sample(seq_len(nrow(training)), size = per_fold * 5) fold_rows <- split( fold_order, rep(1:5, each = per_fold) ) model_rmse <- data.frame( rf = rep(NA, 5), gbm = rep(NA, 5), lgb = rep(NA, 5) ) for (i in seq_along(fold_rows)) { fold_test <- training[fold_rows[[i]], ] fold_train <- training[-fold_rows[[i]], ] model_rmse$rf[i] <- fit_ranger(fold_train, fold_test) model_rmse$gbm[i] <- suppressWarnings(fit_gbm(fold_train, fold_test)) model_rmse$lgb[i] <- fit_lgb(fold_train, fold_test) } ``` ``` ## Distribution not specified, assuming gaussian ... ## Distribution not specified, assuming gaussian ... ## Distribution not specified, assuming gaussian ... ## Distribution not specified, assuming gaussian ... ## Distribution not specified, assuming gaussian ... ``` --- We now have RMSE estimates from each fold for each of our models: ```r model_rmse ``` ``` ## rf gbm lgb ## 1 34385.77 28824.92 29900.48 ## 2 30172.44 28571.99 29284.42 ## 3 33759.57 30432.66 31634.38 ## 4 28883.56 26352.62 29794.65 ## 5 30152.59 27581.81 28876.70 ``` Let's get the mean RMSE for each of those: ```r (model_rmse <- apply(model_rmse, 2, mean)) ``` ``` ## rf gbm lgb ## 31470.78 28352.80 29898.13 ``` In order to convert these into weights, we first want to invert the RMSEs: ```r inverse_rmse <- (1 / model_rmse) ``` And then calculate weights based on the relative sizes of the RMSEs: ```r (rmse_weights <- inverse_rmse / sum(inverse_rmse)) ``` ``` ## rf gbm lgb ## 0.3161984 0.3509710 0.3328306 ``` --- The vector `rmse_weights` now represents the proportion that each model will contribute to our final prediction. Intuitively enough, GBM here has a somewhat larger weight and the random forest has had its weight reduced from the baseline 1/3. We can then use these weights to adjust our aggregated predictions: ```r stacked_test$weighted <- stacked_test$lgb * rmse_weights[["lgb"]] + stacked_test$rf * rmse_weights[["rf"]] + stacked_test$gbm * rmse_weights[["gbm"]] stacked_test |> summarise( across(c(lgb, rf, gbm, avg, weighted), \(x) sqrt(mean((x - Sale_Price)^2))) ) ``` ``` ## # A tibble: 1 × 5 ## lgb rf gbm avg weighted ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 19731. 22583. 19277.
18990. 18955. ``` And this method beats our simple average! --- It's possible to stack models in even more complex ways than that. For instance, what if rather than using the same weights to predict each observation, we used a model to aggregate our model predictions? To do that, we'd need to be able to observe our model predicting data it had never seen before. We can accomplish this by breaking off a piece of our training set to use as a validation set instead: ```r training <- ames[row_idx < nrow(ames) * 0.6, ] validation <- ames[row_idx >= nrow(ames) * 0.6 & row_idx < nrow(ames) * 0.8, ] ``` --- Once that's accomplished, we need to fit new models on the smaller training data, starting with the random forest and GBM: ```r validation_rf <- ranger( Sale_Price ~ ., training, num.trees = 800, mtry = 20, min.node.size = 5, replace = FALSE, sample.fraction = 1 ) validation_gbm <- gbm( Sale_Price ~ ., data = training, n.trees = 6000, shrinkage = 0.01, interaction.depth = 7, n.minobsinnode = 10 ) ``` ``` ## Distribution not specified, assuming gaussian ... ``` --- Then continuing with `lightgbm`: ```r xtrain <- as.matrix(training[setdiff(names(training), "Sale_Price")]) ytrain <- training[["Sale_Price"]] validation_lgb <- lightgbm( data = xtrain, label = ytrain, verbose = -1L, obj = "regression", nrounds = 1000, params = list( learning_rate = 0.1, max_depth = 2, min_data_in_bin = 18, bagging_freq = 5, bagging_fraction = 0.9, feature_fraction = 0.6 ) ) ``` --- With our models fit, we can now use them to predict the validation data: ```r xtest <- as.matrix(validation[setdiff(names(validation), "Sale_Price")]) validation$rf <- predictions(predict(validation_rf, validation)) validation$gbm <- predict(validation_gbm, validation) ``` ``` ## Using 6000 trees... ``` ```r validation$lgb <- predict(validation_lgb, xtest) ``` And then fit some model to predict sale price based on our predictions. Here I'm going to use a simple linear model as an ensemble, but we could use anything -- including random forests or GBMs -- to aggregate our predictions: ```r ensemble_model <- lm(Sale_Price ~ rf * gbm * lgb, data = validation) ``` --- All that's left is to generate predictions from this ensemble model: ```r stacked_test$model_weight <- predict(ensemble_model, stacked_test) stacked_test |> summarise( across(c(lgb, rf, gbm, avg, weighted, model_weight), \(x) sqrt(mean((x - Sale_Price)^2))) ) ``` ``` ## # A tibble: 1 × 6 ## lgb rf gbm avg weighted model_weight ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 19731. 22583. 19277. 18990. 18955. 19171. ``` While this model doesn't outperform our simple averages, we might try using more complex models to see if we could eke out any further improvements in RMSE. Or, of course, we could decide that our simple averages are good enough for the purposes of our model; it all depends on your application. Stacked ensembles are incredibly useful ways to reflect uncertainty about which of your models is the "best", and can often give you increased predictive accuracy as well. --- Even when they aren't your outright best model, stacked ensembles are a useful way to make your predictions more robust on unknown data. It might be the case that a given model is always the best option in the situations we'll use it in, or it might be the case that with different test data a different model would take the lead. We can't know for certain, and stacking is a good way to reflect and protect against that uncertainty. --- You can also extend stacking to some pretty silly places, if you want to. 
For instance, we create a whole bunch of models while doing grid searches, because we think each of them might turn out to be the best version of that model for our data. What if, rather than selecting the single best set of hyperparameters, we ensembled all of the models in the grid instead? Depending on your models, this might actually give you huge gains in accuracy. But at some point, the accuracy gains from stacking are outweighed by the time and computational power it requires. Of course, where exactly that point is will be different for every project. That's all for this week. Next week we move away from tree-based models and on to our last class of ML algorithms. --- class: middle # References --- Titles are links: + [HOML](https://bradleyboehmke.github.io/HOML/gbm.html), specifically chapters 12 (stochastic GBMs) and 15 (stacked ensembles) + [lightGBM parameters](https://sites.google.com/view/lauraepp/parameters) + [Model averaging in ecology: a review of Bayesian, information-theoretic, and tactical approaches for predictive inference](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecm.1309) + [The Use of Bayesian Model Averaging to Better Represent Uncertainty in Ecological Models](https://www.jstor.org/stable/3588906?seq=1#metadata_info_tab_contents)