class: center, middle, inverse, title-slide

# MLCA Week 10:
## k-Nearest Neighbors
### Mike Mahoney
### 2021-11-03

---

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[
The introduction of numbers as coordinates is an act of violence.

.tr[
— Hermann Weyl: Philosophy of Mathematics and Natural Science
]
]

---
class: center, middle

# k-Nearest Neighbors
# and Kernel-Based ML

---

Today we're going to switch gears and stop talking about tree-based models.

Our last two methods are examples of what are known as **kernel-based models** (or **kernel machines**). However, I don't want to spend a ton of time today getting into the math behind what a "kernel" is.

Instead, for this class we can think about these models as **distance-based methods**, which make predictions based on how _close_ an observation is to other data in the training set.

---

The method we'll walk through first is called **k-nearest neighbors** (KNN), which bases its predictions on some number ("k") of the closest points to your unknown observation.

KNN is almost never the best model for the job. First proposed in 1951, KNN is computationally inefficient, slow to predict on most data, and rarely (if ever) as accurate as a well-tuned random forest or GBM.

However, KNN models are simple to understand and use, are somewhat common in the literature given their age, and will help set up our final model next week. As a result, it makes sense for us to walk through them here.

---

Let's say that you've just gotten back from the field, where you measured the size and greenness of a bunch of plants in both interior and edge forests. You plot your data and it looks like this:

![](week_10_slides_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

Unfortunately, though, you forgot to label where one of your plants was sampled from. When you add this plant to the plot, it looks like this:

![](week_10_slides_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

Can you guess where the plant was found?

---

We can pretty safely guess that the plant was found on a forest edge, because it's right in the middle of the rest of our edge observations.

This is exactly how k-nearest neighbors works. For some value of k that you choose, KNN looks at the closest k neighbors and predicts whatever the majority class is.

For instance, if we set k = 1 our model would look at the closest point (connected to our mystery point on the graph by a black line) and predict that our unknown observation was an edge plant:

![](week_10_slides_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

We could also imagine, for instance, setting k to 2, in which case KNN would look at the two closest points and still predict edge:

![](week_10_slides_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

A k of 2 can sometimes cause problems, however. For instance, imagine if this plant instead was our mystery point:

![](week_10_slides_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

If we set k to 2 here, we'd have one nearest neighbor from each class:

![](week_10_slides_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

In this case, our model would think that both classes are equally likely! What that means for your predictions depends on the package -- some implementations will fail entirely, some will flip a coin, and others will default to one of your classes -- but no matter what, it's not an ideal situation.

---

So, at least for binary classification, we prefer to set k to odd values so that we don't wind up with ties in our predictions.
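
To make the mechanics concrete, here's a minimal sketch of the whole algorithm in a few lines of base R. The plant measurements below are made up purely for illustration -- they aren't taken from the plots above:

```r
# Hypothetical labeled plants (invented values, for illustration only)
plants <- data.frame(
  size      = c(2.1, 2.4, 2.2, 4.8, 5.1, 4.9),
  greenness = c(3.0, 3.2, 2.9, 6.1, 5.8, 6.0),
  habitat   = c("interior", "interior", "interior", "edge", "edge", "edge")
)

# Our mystery plant
mystery <- c(size = 4.7, greenness = 5.9)

# Euclidean distance from the mystery plant to every labeled plant
distances <- sqrt(
  (plants$size - mystery["size"])^2 +
    (plants$greenness - mystery["greenness"])^2
)

# Look at the k closest neighbors and take a majority vote
k <- 3
nearest_classes <- plants$habitat[order(distances)][1:k]
names(which.max(table(nearest_classes)))
# with these invented numbers, all three neighbors are "edge"
```

With an odd k, that majority vote can never split evenly between two classes.

---
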
For instance, with k = 3 we can easily categorize this plant as an edge plant:

![](week_10_slides_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

I know this seems extremely simple, but I think it's useful to think about what exactly "distance" is representing here -- that is, how we're calculating "closeness".

On our 2D graphs, we can see the "closest" point pretty intuitively. We put one variable on the X axis and another on the Y axis, and then the closest point is the one we can draw the shortest line to. It's the "nearest" in some combination of both the X and the Y values. And we could imagine doing the same thing in 3D: plotting each of our variables on an axis and then finding the shortest 1D line between two points.

This is effectively what KNN is doing, except in many more dimensions -- as many dimensions as your data frame has columns. KNN is attempting to find the nearest neighbors based on all of your variables at once; it's trying to draw that 1D line in extremely high-dimensional space.

But there's a problem with thinking of closeness in this way. To explain why, let's go ahead and load our attrition data into R and create our training/test splits:

```r
library(modeldata)
data(attrition)

set.seed(123)
row_idx <- sample(seq_len(nrow(attrition)), nrow(attrition))
training <- attrition[row_idx < nrow(attrition) * 0.8, ]
testing <- attrition[row_idx >= nrow(attrition) * 0.8, ]
```

---

Our attrition data has many numeric variables. For instance, it has `Age` and `DailyRate`, both of which were useful predictors back in our decision tree models. Plotted together, they look like this:

```r
library(ggplot2)

training |> 
  ggplot(aes(DailyRate, Age, color = Attrition)) + 
  geom_point() + 
  scale_color_brewer(palette = "Dark2")
```

![](week_10_slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

However, this isn't how these variables look to a KNN. KNNs don't understand the implicit units in our data, or the fact that you can have a lot more variation in a daily rate than in age. To a KNN, a one-unit change in one variable is the same size as a one-unit change in another. To a KNN, our scatterplot looks like this:

```r
training |> 
  ggplot(aes(DailyRate, Age, color = Attrition)) + 
  geom_point() + 
  scale_color_brewer(palette = "Dark2") + 
  scale_y_continuous(limits = c(0, max(training$DailyRate))) + 
  scale_x_continuous(limits = c(min(training$Age), max(training$DailyRate)))
```

![](week_10_slides_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

Since all of our data is extremely close in age, the model won't pay any attention to age when finding neighbors or making predictions -- which isn't great, since we know age is a useful predictor!

```r
training |> 
  ggplot(aes(DailyRate, Age, color = Attrition)) + 
  geom_point() + 
  scale_color_brewer(palette = "Dark2") + 
  scale_y_continuous(limits = c(0, max(training$DailyRate))) + 
  scale_x_continuous(limits = c(min(training$Age), max(training$DailyRate)))
```

![](week_10_slides_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

To fix this, we'll have to **rescale** our data so that a one-unit change in one variable is _actually_ as dramatic as a one-unit change in another.
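
To see how lopsided those raw distances are, here's a quick sketch comparing two hypothetical employees (the numbers are invented, not rows from our data):

```r
# Two hypothetical employees: a 35-year age gap, a 50-unit daily-rate gap
emp_a <- c(Age = 25, DailyRate = 1000)
emp_b <- c(Age = 60, DailyRate = 1050)

# Raw Euclidean distance: sqrt(35^2 + 50^2), about 61 --
# the small difference in DailyRate contributes more than
# the enormous difference in Age
sqrt(sum((emp_a - emp_b)^2))
```

A 50-unit difference in daily rate is tiny for that variable, yet it outweighs a 35-year age gap.

---
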
First, we'll center our data at 0 by subtracting the mean value of each variable from each observation:

```r
library(dplyr)

training |> 
  mutate(across(where(is.numeric), \(x) ((x - mean(x))))) |> 
  ggplot(aes(DailyRate, Age, color = Attrition)) + 
  geom_point() + 
  scale_color_brewer(palette = "Dark2")
```

![](week_10_slides_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

We'll then divide each observation by the original standard deviation of that variable. Our variables are now all rescaled so that one unit represents one standard deviation -- an Age value of 1 means that the employee was exactly one standard deviation above the mean age. This way, a change of 1 means the same thing across all of our variables.

```r
training |> 
  mutate(across(where(is.numeric), \(x) ((x - mean(x)) / sd(x)))) |> 
  ggplot(aes(DailyRate, Age, color = Attrition)) + 
  geom_point() + 
  scale_color_brewer(palette = "Dark2")
```

![](week_10_slides_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

So, because KNN uses all of our variables to calculate distance and doesn't understand the units in our data, we need to center and scale our variables for the model to work properly.

The other implication I want to highlight here is that KNNs use _all_ of our variables to calculate the distance between points. Unlike tree-based methods, KNNs don't select only the most effective variables to use while making predictions. This means that KNNs are susceptible to noisy variables, and might benefit from dimension reduction and feature engineering techniques (like PCA) which we don't cover in this course. This makes KNNs harder to use and often limits their accuracy compared to other methods. It's a consistent limitation of kernel-based methods; we'll return to this idea next week.

---

So we need to center and scale our variables before fitting a KNN. However, we don't want to do this directly to our `attrition` data frame before we split it into testing and training pieces!

Remember that the testing data is supposed to be entirely unknown to the model and the modeler; including testing data in your calculations of means and standard deviations violates that assumption. This type of mistake, where testing data influences your training data and your models themselves, is known as **data leakage** and can result in too-optimistic estimates of model accuracy, as your hold-out set is no longer truly being held out.

As a result, we need to calculate the mean and standard deviation of each variable on our training data, and scale both the training and test sets by those values. While we don't have time to cover much data pre-processing in this course, I want to emphasize that this rule applies to anything you do to your data before fitting a model -- whether you're filtering out variables, rescaling data, imputing missing values, creating principal components, or anything else, you _need_ to make sure testing data never influences the data you train your models with.

---

This is where the `recipes` package we mentioned last week really shines. You might remember that we can specify our pre-processing steps by creating a "recipe" object, using `step_` functions to declare what steps we want to perform, and then finishing up with `prep()`.
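
Under the hood, what we need the recipe to do amounts to something like this hand-rolled sketch (the `rescale_with` helper is made up for illustration -- we'll let `recipes` do the real work):

```r
# Leakage-free scaling by hand, for illustration only:
# every mean and standard deviation comes from the *training* data
numeric_cols <- names(training)[sapply(training, is.numeric)]
train_means  <- sapply(training[numeric_cols], mean)
train_sds    <- sapply(training[numeric_cols], sd)

rescale_with <- function(data, means, sds) {
  for (nm in names(means)) {
    data[[nm]] <- (data[[nm]] - means[[nm]]) / sds[[nm]]
  }
  data
}

# The same training-set statistics get applied to both splits
training_scaled <- rescale_with(training, train_means, train_sds)
testing_scaled  <- rescale_with(testing, train_means, train_sds)
```

---
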
Since we want to center and scale our data, we'll use the `step_center` and `step_scale` functions in our recipe:

```r
library(recipes)

scaling_recipe <- recipe(Attrition ~ ., data = training) |> 
  step_center(where(is.numeric)) |> 
  step_scale(where(is.numeric)) |> 
  prep()
```

The beauty of `recipes` is that, because we specified our training data in the `recipe()` call, our `scaling_recipe` will automatically use values from the training data to center and scale whatever data we give it. All we have to do is pass our training and testing data to the `bake` function to rescale both without any threat of leakage:

```r
training <- bake(scaling_recipe, training)
testing <- bake(scaling_recipe, testing)
```

---

And now we're finally able to get started fitting our first KNN! The best R package for fitting KNNs is actually one we've used before: `caret`. We'll load it now:

```r
library(caret)
```

Fitting a KNN is then as simple as using `knn3` just like `lm` or most of the other model functions we've seen:

```r
first_knn <- knn3(Attrition ~ ., training)
```

Once that finishes, we'll get a 5-nearest neighbors model object!

```r
first_knn
```

```
## 5-nearest neighbor model
## Training set outcome distribution:
## 
##  No Yes 
## 971 204
```

As usual, we can investigate this model using a confusion matrix (on the next slide).

---

```r
confusionMatrix(
  predict(first_knn, testing, type = "class"),
  testing$Attrition,
  positive = "Yes"
)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  258  30
##        Yes   4   3
## 
##                Accuracy : 0.8847
##                  95% CI : (0.8427, 0.9188)
##     No Information Rate : 0.8881
##     P-Value [Acc > NIR] : 0.6175
## 
##                   Kappa : 0.1154
## 
##  Mcnemar's Test P-Value : 1.807e-05
## 
##             Sensitivity : 0.09091
##             Specificity : 0.98473
##          Pos Pred Value : 0.42857
##          Neg Pred Value : 0.89583
##              Prevalence : 0.11186
##          Detection Rate : 0.01017
##    Detection Prevalence : 0.02373
##       Balanced Accuracy : 0.53782
## 
##        'Positive' Class : Yes
## 
```

---

We can also use our old friend the ROC curve to calculate the AUC of this model:

```r
library(pROC)

auc(
  roc(
    testing$Attrition,
    predict(first_knn, testing)[, 2],
    plot = TRUE
  )
)
```

![](week_10_slides_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

```
## Area under the curve: 0.58
```

It turns out this model does a pretty bad job of predicting our test set.

---

Looking at how low our sensitivity is, it might help to balance our classes before fitting. `knn3` doesn't take weights as an argument, so we'll need to resample our data in order to balance our training set:

```r
training <- attrition[row_idx < nrow(attrition) * 0.8, ]
testing <- attrition[row_idx >= nrow(attrition) * 0.8, ]

positive_training <- training[training$Attrition == "Yes", ]
negative_training <- training[training$Attrition == "No", ]

n_pos <- nrow(positive_training)
resampled_positives <- sample(1:n_pos, size = 5 * n_pos, replace = TRUE)
resampled_positives <- positive_training[resampled_positives, ]

training <- rbind(
  negative_training,
  resampled_positives
)
```

---

We also might get better results from choosing a different value of k. Like all other hyperparameters, we can tune this via a grid search.

However, cross-entropy is a bad loss function for KNNs. KNNs don't _really_ predict probabilities; they just predict whichever class the majority of neighbors belong to. If we force a KNN to report probabilities, what we get back instead are the percentages of neighbors belonging to each class.
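
To make that concrete, here's a tiny sketch of the only "probability" values a KNN can report for a given k (the `possible_probs` helper is just for illustration):

```r
# The only "probabilities" a KNN can return for a given k:
# 0 of k neighbors positive, 1 of k, 2 of k, ..., all k
possible_probs <- function(k) (0:k) / k

possible_probs(1)  # 0 and 1 only
possible_probs(3)  # 0, 1/3, 2/3, and 1
```
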
So a k of 1 can only predict 100% and 0% probabilities, a k of 3 only 0%, 33%, 67%, and 100%, and so on. Since cross-entropy specifically penalizes confident-but-wrong guesses, it will usually decrease as k increases and guesses become less confident. As a result, cross-entropy will often favor higher values of k whether or not they improve our actual predictions.

So instead, we'll use our old friend AUC as a loss function to tune KNNs.

---

When we predict with `knn3`, we get a matrix of the "probabilities" (really proportions) of each class:

```r
predict(first_knn, testing) |> head(1)
```

```
##             No       Yes
## [1,] 0.7777778 0.2222222
```

To get just our positive probability, we'll index into that matrix:

```r
predict(first_knn, testing)[, 2] |> head(1)
```

```
## [1] 0.2222222
```

And so our function to calculate the AUC looks like this:

```r
calc_auc <- function(model, data) {
  roc(data$Attrition, predict(model, data)[, 2]) |> 
    auc() |> 
    suppressMessages()
}
```

I've also added `suppressMessages()` here because `roc` is a pretty chatty function, which can get annoying when running it 100 times.

---

We can then repurpose our `k_fold_cv` function from last time to tune our KNN. We'll scale our data inside each iteration to avoid data leakage from the testing fold:

```r
k_fold_cv <- function(data, k, n) {
  per_fold <- floor(nrow(data) / k)
  fold_order <- sample(seq_len(nrow(data)), size = per_fold * k)
  fold_rows <- split(
    fold_order,
    rep(1:k, each = per_fold)
  )
  vapply(
    fold_rows,
    \(fold_idx) {
      fold_test <- data[fold_idx, ]
      fold_train <- data[-fold_idx, ]
      
      scaling_recipe <- recipe(Attrition ~ ., data = fold_train) |> 
        step_center(where(is.numeric)) |> 
        step_scale(where(is.numeric)) |> 
        prep()
      fold_train <- bake(scaling_recipe, fold_train)
      fold_test <- bake(scaling_recipe, fold_test)
      
      fold_knn <- knn3(Attrition ~ ., fold_train, k = n)
      calc_auc(fold_knn, fold_test)
    },
    numeric(1)
  ) |> 
    mean()
}
```

---

k can go as high as the total number of rows in your data (in which case you'll always predict the more common class) or as low as 1 (using only the closest neighbor), and there's really not a good rule of thumb for what k values to try. We'll go ahead and try 20 k values evenly spaced between 1 and 401:

```r
tuning_grid <- expand.grid(
  n = floor(seq(1, 401, length.out = 20)),
  auc = NA
)
tuning_grid$n
```

```
##  [1]   1  22  43  64  85 106 127 148 169 190 211 232 253 274 295 316 337 358 379
## [20] 401
```

We still don't want k to be even, however, in order to avoid ties in our classifications. We'll add 1 to all our even k values to make sure we've always got a tiebreaker neighbor:

```r
tuning_grid$n <- ifelse(
  tuning_grid$n %% 2 == 0,
  tuning_grid$n + 1,
  tuning_grid$n
)
tuning_grid$n
```

```
##  [1]   1  23  43  65  85 107 127 149 169 191 211 233 253 275 295 317 337 359 379
## [20] 401
```

---

And with all that done, the code to iterate over our grid is pretty simple:

```r
for (i in seq_len(nrow(tuning_grid))) {
  tuning_grid$auc[i] <- k_fold_cv(
    attrition, 
    5, 
    n = tuning_grid$n[i]
  )
}

head(arrange(tuning_grid, -auc))
```

```
##     n       auc
## 1  85 0.7572562
## 2 107 0.7544213
## 3 169 0.7537498
## 4 127 0.7526416
## 5 253 0.7514675
## 6  65 0.7509871
```

---

Plotting k makes it pretty obvious that the value of k which maximizes AUC is somewhere just under 100.
```r
ggplot(tuning_grid, aes(n, auc)) + 
  geom_line() + 
  geom_point() + 
  labs(x = "k")
```

![](week_10_slides_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

If we wanted, we could run a second grid search to investigate that area more thoroughly; right now we'll just run with k = 85.

---

We have to rescale our training and testing sets in order to evaluate our hold-out accuracy:

```r
scaling_recipe <- recipe(Attrition ~ ., data = training) |> 
  step_center(where(is.numeric)) |> 
  step_scale(where(is.numeric)) |> 
  prep()
training <- bake(scaling_recipe, training)
testing <- bake(scaling_recipe, testing)
```

And with that done, we can evaluate our accuracy using AUC:

```r
final_knn <- knn3(Attrition ~ ., training, k = 85)
calc_auc(final_knn, testing)
```

```
## Area under the curve: 0.7627
```

As well as via a confusion matrix, on the next slide.

---

```r
confusionMatrix(
  predict(final_knn, testing, type = "class"),
  testing$Attrition,
  positive = "Yes"
)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  192  12
##        Yes  70  21
## 
##                Accuracy : 0.722
##                  95% CI : (0.6672, 0.7724)
##     No Information Rate : 0.8881
##     P-Value [Acc > NIR] : 1
## 
##                   Kappa : 0.2088
## 
##  Mcnemar's Test P-Value : 3.082e-10
## 
##             Sensitivity : 0.63636
##             Specificity : 0.73282
##          Pos Pred Value : 0.23077
##          Neg Pred Value : 0.94118
##              Prevalence : 0.11186
##          Detection Rate : 0.07119
##    Detection Prevalence : 0.30847
##       Balanced Accuracy : 0.68459
## 
##        'Positive' Class : Yes
## 
```

---

Overall, this KNN makes pretty adequate predictions! Odds are, we could do better with a well-tuned random forest or GBM, but as far as easily explainable models go, the KNN did alright.

And that's KNN for classification! But the technique can be applied to regression tasks as well. Rather than predicting the most common value, KNNs built for regression tasks predict the _average_ value of the neighbors (sometimes weighted by distance, sometimes not, depending on the package you use -- we'll sketch that difference in a moment).

We'll walk through the process quickly, but the fundamental steps are the same as when using KNNs for classification. First, we'll load our data:

```r
ames <- AmesHousing::make_ames()

row_idx <- sample(seq_len(nrow(ames)), nrow(ames))
training <- ames[row_idx < nrow(ames) * 0.8, ]
testing <- ames[row_idx >= nrow(ames) * 0.8, ]
```

And create our function to calculate RMSE:

```r
calc_rmse <- function(model, data) {
  sqrt(mean((predict(model, data) - data$Sale_Price)^2))
}
```

---

And lastly, we'll repurpose our `k_fold_cv` function. We need to make a few changes:

+ Instead of `knn3`, we'll use `knnreg` for regression modeling;
+ We're going to add `-all_outcomes()` to our recipe steps, to make sure we don't center and scale our sale price;
+ And we'll change the formula in `recipe` and `knnreg` to `Sale_Price ~ .` and our loss function to `calc_rmse`.

The code with these changes is on the next slide.
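
Before we get to that code, here's the toy sketch of neighbor averaging promised above -- the prices and distances are invented, purely to show the difference between a plain and a distance-weighted average:

```r
# Invented sale prices for a house's three nearest neighbors,
# along with (also invented) distances to that house
neighbor_prices <- c(150000, 180000, 210000)
neighbor_dists  <- c(0.5, 1.0, 2.0)

# Unweighted average: every neighbor counts equally (180,000 here)
mean(neighbor_prices)

# Distance-weighted average: closer neighbors count for more (about 167,000)
weights <- 1 / neighbor_dists
sum(neighbor_prices * weights) / sum(weights)
```

Either way, the prediction is just a summary of the neighbors' observed sale prices; the model never fits a line or curve to the data.
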
---

```r
k_fold_cv <- function(data, k, n) {
  per_fold <- floor(nrow(data) / k)
  fold_order <- sample(seq_len(nrow(data)), size = per_fold * k)
  fold_rows <- split(
    fold_order,
    rep(1:k, each = per_fold)
  )
  vapply(
    fold_rows,
    \(fold_idx) {
      fold_test <- data[fold_idx, ]
      fold_train <- data[-fold_idx, ]
      
      scaling_recipe <- recipe(Sale_Price ~ ., data = fold_train) |> 
        step_center(where(is.numeric), -all_outcomes()) |> 
        step_scale(where(is.numeric), -all_outcomes()) |> 
        prep()
      fold_train <- bake(scaling_recipe, fold_train)
      fold_test <- bake(scaling_recipe, fold_test)
      
      fold_knn <- knnreg(Sale_Price ~ ., fold_train, k = n)
      calc_rmse(fold_knn, fold_test)
    },
    numeric(1)
  ) |> 
    mean()
}
```

---

Just as before, we can iterate through possible k values, find the best one, and plot the relationship between k and RMSE:

```r
tuning_grid <- expand.grid(
  n = floor(seq(1, 401, length.out = 20)),
  rmse = NA
)

for (i in seq_len(nrow(tuning_grid))) {
  tuning_grid$rmse[i] <- k_fold_cv(
    training, 
    5, 
    n = tuning_grid$n[i]
  )
}

head(arrange(tuning_grid, rmse))
```

```
##     n     rmse
## 1  22 34544.91
## 2  43 36479.43
## 3  64 37546.28
## 4  85 38904.05
## 5 106 39746.35
## 6 127 40429.31
```

---

```r
ggplot(tuning_grid, aes(n, rmse)) + 
  geom_line() + 
  geom_point() + 
  labs(x = "k")
```

![](week_10_slides_files/figure-html/unnamed-chunk-37-1.png)<!-- -->

It looks like the best k is somewhere under 40! We'll run with k = 22 for now.

---

Just as before, we need to rescale our training and testing data before fitting a KNN to the full training data:

```r
scaling_recipe <- recipe(Sale_Price ~ ., data = training) |> 
  step_center(where(is.numeric), -all_outcomes()) |> 
  step_scale(where(is.numeric), -all_outcomes()) |> 
  prep()
training <- bake(scaling_recipe, training)
testing <- bake(scaling_recipe, testing)
```

And we can now go ahead and evaluate our RMSE:

```r
final_knn <- knnreg(Sale_Price ~ ., training, k = 22)
calc_rmse(final_knn, testing)
```

```
## [1] 29419.37
```

We've done dramatically better with random forests and GBMs, but KNN at least beats linear regression.

---
class: middle

And that, in a nutshell, is the k-nearest neighbors algorithm. As I mentioned at the start of these slides, KNN is extremely computationally inefficient and tends not to create models as accurate as the random forests and GBMs we've talked about in the past. However, k-nearest neighbor models are still somewhat prevalent in the scientific literature, and are a useful way for us to shift from tree-based to kernel-based models.

Next week, we'll wrap up with the most famous kernel-based method of them all: the support vector machine.

---
class: middle

# References

---

Titles are links:

+ [HOML Chapter 8 (KNNs) and 3 (Feature Engineering)](https://bradleyboehmke.github.io/HOML/knn.html)