class: center, middle, inverse, title-slide

# MLCA Week 11:
## Support Vector Machines
### Mike Mahoney
### 2021-11-10

---

Before we get started, I've got a little bit of housekeeping to go over.

This is our last week of regular class. The next two classes (weeks 12 and 13) are dedicated time to work on your course project. I'll still be in Bray 300 during class time if you have questions or want help with your project, and I'll still be holding my regular office hours. That said, if you don't think coming to class will be a productive way for you to spend your time, that's totally fine.

Four weeks from now (week 14, 2021-12-08, our last day of class) everyone will share their final presentation on the project. You should hand in all your files (report, data, and code) by the start of that class.

---

Your final presentation should be about five minutes (or less). Ideally you'll talk about the problem you wanted to solve, the methods you used, and how well it went. You can use slides if you want, or you can feel free to just talk about your project for five minutes. The presentations are meant as an opportunity for us to find out what everyone else has been working on; they should not feel like a high-stakes event.

Your code should run immediately when I load it on my computer and return the same results as you report; this means no hard-coded file paths or `setwd` calls (use the "here" package instead) and `set.seed` called before any randomization.

As usual, feel free to reach out with any questions or concerns.

---

class: center, middle

# Support Vector Machines

---

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[

Rule 1. New methods always look better than old ones. Neural nets are better than logistic regression, support vector machines are better than neural nets, etc. In fact it is very difficult to run an honest simulation comparison, and easy to inadvertently cheat by choosing favorable examples, or by not putting as much effort into optimizing the dull old standard as the exciting new challenger.

Rule 2. Complicated methods are harder to criticize than simple ones. By now it is easy to check the efficiency of a logistic regression, but it is no small matter to analyze the limitations of a support vector machine.

.tr[
— [Brad Efron: Comment on Statistical Modeling: The Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full)
]
]

---

class: middle

We're wrapping up this course with one final type of model, the **support vector machine** (or SVM). SVMs are our second kernel-based method, and the most common kernel-based method in use today.

SVMs are highly flexible, robust to outliers, and resilient to overfitting. However, they can also be slow to train, particularly on data sets with many observations. As a result, SVMs are often viewed as more specialized models than random forests and GBMs. If nothing else is working right, it might be time to try an SVM.

---

But before we get into all that, let's go back to our fake data from last week:

<br />

![](week_11_slides_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

Last week, we focused on classifying our mystery point (the purple triangle) based on its nearest neighbors -- if the majority of its nearest neighbors were edge plants, we'd predict an edge plant.

But there might be an easier way to classify things. There's a big gap between our two classes -- what if we drew a line across it?
![](week_11_slides_files/figure-html/unnamed-chunk-2-1.png)<!-- --> --- Now instead of having to find the neighbors for each point we want to predict, we can just look at where the point falls compared to the line. If the point is above the line, it's an edge plant; if it's below, it's an interior plant. When our data can be separated into two classes by a straight line, we call it **linearly separable**. We can think of the line as a **decision boundary**, because our predictions depend entirely on what side of the boundary a point falls. For instance, our mystery point here falls on the "Edge" side of the boundary, so we'd classify this point as an edge plant. ![](week_11_slides_files/figure-html/unnamed-chunk-3-1.png)<!-- --> --- Another name for this line is a **separating hyperplane**. Remember from way back in week 2 that we can call model surfaces in any dimension a hyperplane; rather than directly modeling an outcome or probability, however, these hyperplanes are trying to split our data into two classes. ![](week_11_slides_files/figure-html/unnamed-chunk-4-1.png)<!-- --> --- But how do we choose where to draw the decision boundary? There are many possible different separating hyperplanes that perfectly classify our data: ![](week_11_slides_files/figure-html/unnamed-chunk-5-1.png)<!-- --> Which one would we want to use as a decision boundary? --- Just by looking at the graph, we can tell that some of these hyperplanes seem "better" than others. How can we choose the "best"? One approach is to choose the hyperplane that is the _furthest_ from both classes. That _should_ maximize how generalizable the decision boundary is, so that a point with X and Y values a bit further out from the class mean will still be correctly classified. In order to find that furthest line, we need to first find the points from each class which are closest to the _other_ class: ![](week_11_slides_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- Then we draw the line that maximizes the distance of _both_ of those points from the line. We call this decision boundary the **optimal separating hyperplane**, and the points we use to draw it **support vectors**. ![](week_11_slides_files/figure-html/unnamed-chunk-7-1.png)<!-- --> --- This method maximizes the distance between the two classes (the area between the dotted lines, also known as a **margin**). Because we're drawing our decision boundary so that no points are inside the margin and no points are misclassified, we call this a **hard margin classifier**. This is the simplest version of a **support vector machine**. ![](week_11_slides_files/figure-html/unnamed-chunk-8-1.png)<!-- --> --- This simple hard-margin SVM has some weaknesses. For instance, let's look at a different data set, showing how income and lot size are associated with riding lawnmower ownership: ![](week_11_slides_files/figure-html/unnamed-chunk-9-1.png)<!-- --> (This graph adapted from [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/svm.html#the-soft-margin-classifier).) --- This data is still linearly separable -- we could draw a straight line that would perfectly split our data set into two classes. 
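If you want to see what this looks like in code, here's a minimal sketch of fitting a linear-kernel SVM with `kernlab` (the package we'll meet properly later in these slides). The data here is made up purely for illustration -- it isn't the lawnmower data above -- and a very large `C` makes the fit behave almost like a hard margin classifier:

```r
library(kernlab)

# hypothetical, cleanly separable two-class data -- not the plots above
set.seed(123)
toy_plants <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  class = factor(rep(c("Interior", "Edge"), each = 20))
)

# a linear (vanilladot) kernel with a very large C is, in practice,
# close to the hard margin classifier described above
hard_margin_ish <- ksvm(class ~ x + y, toy_plants, kernel = "vanilladot", C = 1e4)
hard_margin_ish
```

That works nicely as long as every point really does land on the correct side of the line.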
However, this data set has a potential outlier here which will seriously impact how we split our data into classes: ![](week_11_slides_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- If we go through the same process as earlier to find our optimal separating hyperplane, we'll find that this decision boundary is much closer to the "Yes" point cloud than we'd expect. This probably won't generalize well -- as we test our model on new data, we're very likely to mischaracterize "Yes" points as "No"s! ![](week_11_slides_files/figure-html/unnamed-chunk-11-1.png)<!-- --> --- What if, rather than drawing our hyperplane so that there are no points inside the margin (like we do with the hard margin classifier), we allowed a few points to fall inside that space? That way, we could use other points as support vectors, making our SVM more resilient to outliers. This is known as the **soft-margin classifier** (because we've _softened_ the requirement that no points fall inside the margin). How many points we let inside the margin is an important hyperparameter of SVMs, usually referred to as `\(C\)`. --- If `\(C\)` is 0, you let 0 points inside the margin -- you use the points which are closest to each other as support vectors to produce the hard margin classifier. If `\(C\)` is infinite, you use the furthest removed points as support vectors and let all of your data into the margin. ![](https://bradleyboehmke.github.io/HOML/11b-svm_files/figure-html/smc-1.png) --- In practice, we almost always use soft margin classifiers when fitting SVMs. By deciding how many points are allowed inside the margin, we're able to find the linear decision boundary that best classifies training data while still generalizing well to the test set. But what if our data isn't linearly separable? This is pretty common in the real world -- if there was an obvious linear boundary between our two classes, we wouldn't need machine learning to tell them apart. Imagine a data set that looked like this, where our classes were mixed together: ![](week_11_slides_files/figure-html/unnamed-chunk-12-1.png)<!-- --> --- What if we could create a new variable so that our data was linearly separable in higher dimensions? For instance, imagine we created a new variable `\(Z\)` such that `\(Z = X^2 + Y^2\)`. If we use that variable to plot our data in 3D, it seems like we might be able to draw a linear decision boundary for our data: ![](week_11_slides_files/figure-html/unnamed-chunk-13-1.png)<!-- --> --- We can use that new Z variable to draw a linear decision boundary in higher-dimensional space than our original 2D X/Y graph. But when we come back to 2D (so plotting the hyperplane using only X and Y values), it's entirely nonlinear. By combining variables to cast your data into a higher-dimensional space to find linear decision boundaries, SVMs are able to approximate any non-linear relationships your predictors have with your outcome. ![](week_11_slides_files/figure-html/unnamed-chunk-14-1.png)<!-- --> --- In order to cast your data into higher dimensions, SVMs make use of what are known as _kernel functions_. Kernel functions are effectively a guess about how your variables are related to your outcome; they specify some transformation to your variables that will _hopefully_ make your data linearly separable. There are a number of different kernel functions available to choose from, each a different transformation which might work well for your data. 
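To make the higher-dimensional trick a bit more concrete, here's a small sketch of building that `\(Z = X^2 + Y^2\)` feature by hand on simulated "ring" data. The data and variable names here are made up for illustration; they aren't the plots above:

```r
# hypothetical ring-shaped data: one class surrounds the other, so no straight
# line in the original (x, y) space can separate them
set.seed(123)
ring <- data.frame(x = rnorm(200), y = rnorm(200))
ring$class <- factor(ifelse(ring$x^2 + ring$y^2 > 1.5, "Outer", "Inner"))

# the hand-built third dimension from a few slides back
ring$z <- ring$x^2 + ring$y^2

# in (x, y, z) space, a flat plane at roughly z = 1.5 now splits the classes
table(ring$class, ring$z > 1.5)
```

Kernel functions let the SVM do this kind of feature construction implicitly, so we don't have to guess the right transformation ourselves.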
In practice, we generally start with something called the "**radial basis function**" kernel (or RBF) and try out other kernels only if that one doesn't work. I'm intentionally avoiding getting too deep into the math of how kernel transformations work here; conceptually understanding that each kernel is a different equation to create new features from your existing variables should be enough. For details on the mathematical expression of each kernel, check out `?kernlab::dots`. --- One last thing before we move on to implementing SVMs in R. Just like KNNs, SVM are trying to calculate the distance between your points in order to find separating hyperplanes. This means that, just like with KNNs, you need to scale and center your data in order to make sure all your variables "count" for distance equally, regardless of units. This also means that, to an SVM, categorical variables (which are encoded to be either 0 or 1) have lower variance than continuous variables, and so tend to contribute less to the final prediction. If your data contains many categorical variables, SVM may make worse predictions than other methods. --- Alright, that's enough hyperdimensional thinking for today. Let's get down to brass tacks. We'll be using the `kernlab` package to fit SVMs today -- go ahead and install and load it now: ```r install.packages("kernlab") ``` ```r library(kernlab) ``` We'll also go ahead and set our attrition data set up for one last time: ```r library(modeldata) data(attrition) set.seed(123) row_idx <- sample(seq_len(nrow(attrition)), nrow(attrition)) training <- attrition[row_idx < nrow(attrition) * 0.8, ] testing <- attrition[row_idx >= nrow(attrition) * 0.8, ] ``` --- We're going to start off predicting attrition by fitting an SVM using an RBF kernel, which is the default method in the `ksvm` function we use to fit SVMs. That means that we can use `ksvm` to create our SVM using the exact same syntax as `lm`: ```r first_svm <- ksvm(Attrition ~ ., training) first_svm ``` ``` ## Support Vector Machine object of class "ksvm" ## ## SV type: C-svc (classification) ## parameter : cost C = 1 ## ## Gaussian Radial Basis kernel function. ## Hyperparameter : sigma = 0.0247042045168008 ## ## Number of Support Vectors : 520 ## ## Objective Function Value : -341.4969 ## Training error : 0.120851 ``` --- There's a few useful pieces of information in the output from `ksvm`: + We're told we're using a `Gaussian Radial Basis kernel function`, which is the formal name for the RBF kernel. + We can see the value of our `\(C\)` hyperparameter (which controls how "hard" the margin is), which by default is 1. + We're told that there's another hyperparameter `sigma` set to some extremely small value, which we haven't talked about at all. ```r first_svm ``` ``` ## Support Vector Machine object of class "ksvm" ## ## SV type: C-svc (classification) ## parameter : cost C = 1 ## ## Gaussian Radial Basis kernel function. ## Hyperparameter : sigma = 0.0247042045168008 ## ## Number of Support Vectors : 520 ## ## Objective Function Value : -341.4969 ## Training error : 0.120851 ``` --- Sigma (usually written `\(\sigma\)`) is what we call a **kernel hyperparameter**. While we'll have to tune our `\(C\)` hyperparameter no matter what kernel we use, if we chose to use a different kernel, we might have entirely different kernel hyperparameters. If you're using any non-RBF kernel, you can think of kernel hyperparameters just like any other hyperparameter -- a value you need to find through model tuning. 
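For instance, if we wanted to try a polynomial kernel instead of the RBF, its kernel hyperparameters would be passed through `ksvm`'s `kpar` argument. A quick sketch -- the `degree`, `scale`, and `offset` values here are arbitrary placeholders, not tuned choices:

```r
# same model formula and training data as before, but with a polynomial kernel;
# this kernel's hyperparameters all travel through the kpar list
poly_svm <- ksvm(
  Attrition ~ .,
  training,
  kernel = "polydot",
  kpar = list(degree = 2, scale = 1, offset = 1),
  C = 1
)
```

Each of those `kpar` values would normally need to be tuned alongside `\(C\)`.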
However, when we're using RBF kernels, we're actually able to find `\(\sigma\)` without fitting a model at all. This lets us cut down the number of models we need to test when tuning hyperparameters -- when using RBF kernels, we only need to tune C! You can see a full list of the kernels available in `kernlab`, and their hyperparameters, via `?kernlab::dots`. --- `\(C\)` can be any positive number, which means there are a lot of potential values to check. A common recommendation is to start off by testing powers of 2 -- we'll try from 2^-8 to 2^8: ```r tuning_grid <- data.frame( C = 2^(-8:8), loss = NA ) tuning_grid ``` ``` ## C loss ## 1 0.00390625 NA ## 2 0.00781250 NA ## 3 0.01562500 NA ## 4 0.03125000 NA ## 5 0.06250000 NA ## 6 0.12500000 NA ## 7 0.25000000 NA ## 8 0.50000000 NA ## 9 1.00000000 NA ## 10 2.00000000 NA ## 11 4.00000000 NA ## 12 8.00000000 NA ## 13 16.00000000 NA ## 14 32.00000000 NA ## 15 64.00000000 NA ## 16 128.00000000 NA ## 17 256.00000000 NA ``` --- The next thing we need to do to tune our parameters is to write our loss function. But there's a hitch here: what loss can we use? SVMs classify points based entirely on what side of the decision boundary they fall on. By default, this doesn't generate probabilities for our points -- a point is either one side or the other. While there _are_ ways to get SVMs to generate probabilities (such as the `prob.model` argument to `ksvm`), they aren't very commonly used, which means using cross-entropy or AUC loss like we've done in the past is rare. --- Instead, we tend to use overall model accuracy as a loss function for SVMs. Of course, we've already seen how that can be a problem when we have imbalanced classes. For that reason, we're going to go ahead and resample our training set so that our classes are balanced before we tune our SVMs: ```r positive_training <- training[training$Attrition == "Yes", ] negative_training <- training[training$Attrition == "No", ] n_pos <- nrow(positive_training) resampled_positives <- sample(1:n_pos, size = 5 * n_pos, replace = TRUE) resampled_positives <- positive_training[resampled_positives, ] resampled_training <- rbind( negative_training, resampled_positives ) ``` --- And now we can go ahead and write a function to calculate overall model accuracy. When we predict with our SVM, we'll get a set of class predictions -- so for our attrition data, we'll get either Yes or No predictions. We can use those in a function to calculate overall accuracy like this: ```r calc_accuracy <- function(model, data) { matching <- predict(model, data) == data$Attrition sum(matching) / length(matching) } ``` --- We can plug `calc_accuracy` right into our same old `k_fold_cv` function: ```r k_fold_cv <- function(data, k, ...) { per_fold <- floor(nrow(data) / k) fold_order <- sample(seq_len(nrow(data)), size = per_fold * k) fold_rows <- split( fold_order, rep(1:k, each = per_fold) ) vapply( fold_rows, \(fold_idx) { fold_test <- data[fold_idx, ] fold_train <- data[-fold_idx, ] fold_svm <- ksvm(Attrition ~ ., fold_train, ...) 
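      # `...` forwards any extra arguments (such as the C values from our
      # tuning grid) straight through to ksvm(); we then score that model
      # on the fold's held-out rows: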
      calc_accuracy(fold_svm, fold_test)
    },
    numeric(1)
  ) |>
    mean()
}
```

---

The code to loop through our tuning grid is then pretty simple:

```r
for (i in seq_len(nrow(tuning_grid))) {
  tuning_grid$loss[i] <- k_fold_cv(
    resampled_training,
    5,
    C = tuning_grid$C[i]
  )
}
```

We can look at how accuracy changes as we try different values of C:

```r
library(ggplot2)
ggplot(tuning_grid, aes(C, loss)) +
  geom_point() +
  geom_line()
```

![](week_11_slides_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

It looks like higher C values are more effective here! We'll run with C = 64 for now -- that's where we got the little bump at the top of our curve.

As usual, we can then go on and fit our final model on the entire resampled training set:

```r
final_svm <- ksvm(
  Attrition ~ .,
  resampled_training,
  C = 64
)
```

We could go ahead and use our `calc_accuracy` function to find the overall accuracy for this model:

```r
calc_accuracy(final_svm, testing)
```

```
## [1] 0.8440678
```

But as usual, looking at the overall accuracy obscures how well our model does at predicting each class. Instead, we'll look at the full confusion matrix on the next slide.

This model has higher sensitivity but lower specificity than our random forest from back in week 7, and lower sensitivity and higher specificity than our logistic models from way back in week 4. Which of these models is "best" for your data is, as usual, a question of what you're looking for.

---

```r
caret::confusionMatrix(
  predict(final_svm, testing),
  testing$Attrition,
  positive = "Yes"
)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  237  21
##        Yes  25  12
## 
##                Accuracy : 0.8441          
##                  95% CI : (0.7975, 0.8835)
##     No Information Rate : 0.8881          
##     P-Value [Acc > NIR] : 0.9916          
## 
##                   Kappa : 0.2547          
## 
##  Mcnemar's Test P-Value : 0.6583          
## 
##             Sensitivity : 0.36364         
##             Specificity : 0.90458         
##          Pos Pred Value : 0.32432         
##          Neg Pred Value : 0.91860         
##              Prevalence : 0.11186         
##          Detection Rate : 0.04068         
##    Detection Prevalence : 0.12542         
##       Balanced Accuracy : 0.63411         
## 
##        'Positive' Class : Yes             
## 
```

---

So far, we've focused entirely on SVMs as classification models. However, SVMs can be used for regression as well!

Sometimes referred to as **support vector regression** (or SVR), SVM-based regression models are excellent at modeling non-linear relationships between your predictors and outcome. Rather than casting your data into higher dimensions in an attempt to find an optimal separating hyperplane, SVR attempts to find a dimension where as many points lie along the hyperplane as possible -- that is, a dimension where your data can be captured perfectly by simple linear regression.

When using SVMs for regression, we're no longer trying to keep as many points outside of our margin as possible. Instead, we want all of our points to fall on the hyperplane, or at the very least within a certain range of it. To avoid overfitting, we can control how wide that range is via a new hyperparameter, `\(\epsilon\)` (or "epsilon").

---

In practice, using SVMs for regression is pretty similar to using them for classification.
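Before we rebuild the Ames pipeline, here's a minimal sketch of what a `kernlab` regression fit looks like on simulated one-predictor data. The data and parameter values are purely hypothetical; `ksvm` switches to its regression type (eps-svr) automatically when the outcome is numeric:

```r
# made-up data with a smooth nonlinear trend
set.seed(123)
toy_reg <- data.frame(x = runif(200, 0, 10))
toy_reg$y <- sin(toy_reg$x) + rnorm(200, sd = 0.2)

# with a numeric outcome, ksvm fits support vector regression; epsilon sets
# how far predictions may sit from the hyperplane without being penalized
toy_svr <- ksvm(y ~ x, toy_reg, C = 1, epsilon = 0.1)
head(predict(toy_svr, toy_reg))
```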
We'll first need to re-create our Ames housing data:

```r
ames <- AmesHousing::make_ames()

row_idx <- sample(seq_len(nrow(ames)), nrow(ames))
training <- ames[row_idx < nrow(ames) * 0.8, ]
testing <- ames[row_idx >= nrow(ames) * 0.8, ]
```

As well as our usual RMSE loss function:

```r
calc_rmse <- function(model, data) {
  predictions <- predict(model, data)
  sqrt(mean((predictions - data$Sale_Price)^2))
}
```

---

We'll repurpose our `k_fold_cv` function as well, so that it's modeling `Sale_Price` as an outcome and using `calc_rmse` for loss:

```r
k_fold_cv <- function(data, k, ...) {
  per_fold <- floor(nrow(data) / k)
  fold_order <- sample(seq_len(nrow(data)), size = per_fold * k)
  fold_rows <- split(
    fold_order,
    rep(1:k, each = per_fold)
  )
  vapply(
    fold_rows,
    \(fold_idx) {
      fold_test <- data[fold_idx, ]
      fold_train <- data[-fold_idx, ]
      fold_svm <- ksvm(Sale_Price ~ ., fold_train, ...)
      calc_rmse(fold_svm, fold_test)
    },
    numeric(1)
  ) |>
    mean()
}
```

---

And now we'll set up our tuning grid! We're going to test similar values for C as we did with classification. `\(\epsilon\)`, meanwhile, tends to be a small positive number, so we'll focus on testing values at or below 1:

```r
tuning_grid <- expand.grid(
  C = 2^(-5:5),
  epsilon = 2^(-8:0),
  rmse = NA
)
```

With that work out of the way, we can use an almost identical loop to run each stage of our grid search:

```r
for (i in seq_len(nrow(tuning_grid))) {
  tuning_grid$rmse[i] <- k_fold_cv(
    training,
    5,
    C = tuning_grid$C[i],
    epsilon = tuning_grid$epsilon[i]
  )
}
head(dplyr::arrange(tuning_grid, rmse), 2)
```

```
##    C epsilon     rmse
## 1  2  0.0625 25239.68
## 2 32  0.0625 25415.68
```

---

And then, as usual, we can use the optimal values of our hyperparameters to evaluate how an SVM fit on the full training data performs against the test set:

```r
final_svr <- ksvm(
  Sale_Price ~ .,
  training,
  C = 2,
  epsilon = 2^-4
)

calc_rmse(final_svr, testing)
```

```
## [1] 28110.08
```

The answer is "not great" -- our random forest and GBMs have both posted better accuracy numbers. In this situation, we might start exploring other kernels, try a larger tuning grid, or perhaps just accept that GBM is a better model for this data.

---

And that's the SVM: create new variables to find a linear surface that perfectly separates your classes or perfectly predicts your outcome, with the definition of "perfectly" controlled by `\(C\)` and `\(\epsilon\)`.

SVMs are not the best model for every situation. Like traditional regressions, SVMs use all predictors equally, and therefore perform worse with noisy data. SVMs can also be slow to fit, and they don't always produce models as accurate as GBMs.

So why do we talk about SVMs at all?

---

Well, on the one hand, sometimes SVMs actually are the best model for your data. They can approximate smooth nonlinear relationships better than tree-based methods, whose hard binary thresholds create a lot of right angles in the prediction surface. Additionally, the ability to change between different kernels makes SVMs incredibly flexible for many tasks in machine learning. They might not be the first model you try for a given problem, but if nothing else is working then you may as well give SVMs a go.

---

But also, I think it's worth stepping through the timeline of the models we've gone over in this course.

Linear regression, where we started, traces all the way back to 1805. Logistic regression first appears around 1838. Our next stop on the timeline was then KNN, first introduced in 1951, followed by decision trees in 1960.
SVMs would then appear in 1992, quickly becoming the dominant machine learning algorithm until bootstrap aggregated decision trees were introduced in 1994. With the introduction of random forests and GBMs in 2001, followed by stochastic GBMs in 2002, the world suddenly had a wide variety of equally powerful machine learning methods for predictive tasks. Each of these introductions was followed by a flurry of interest, with new extensions to each model being proposed and new improvements bolted on. Whole fields of research rose up following the initial introduction of each algorithm. It wasn't until 2012 that neural networks, which we didn't have time for in this course, became extremely popular due to their success as image classifiers. It wasn't until 2014 that xgboost brought GBMs into the spotlight by making them consistently highly accurate; it wasn't until 2016 that lightgbm made them fast to train as well. --- Richard Hamming, in his book The Art of Doing Science and Engineering, offers a theory of how most fields develop over time. He suggests most fields grow in an S-curve fashion after being sparked by some initial innovation. Once that initial momentum slows, a new innovation is needed to kick growth off once again: ![](hamming.png) (This graph is pulled straight from his book, typos and all.) --- class: middle Here's my point. Machine learning is moving incredibly quickly, and it's hard to guess exactly where the field will be heading next. It seemed for ages that kernel-based methods like SVM would be consistently the best method for prediction, then it seemed for ages that decision trees were the way of the future. Fields grow exponentially, then plateau. Right now, a lot of attention is being spent on neural networks. And all that attention is producing good (or at least effective) results; our algorithmic culture is entirely powered by artificial neurons guessing what news story you'll click on or which video you'll enjoy. But I don't think that means we should believe that this time we got it right and that neural nets will always dominate the ML space. Maybe we'll go back to kernel methods, maybe decision trees, maybe something new altogether. I have no idea, but I think it's worth covering older options like SVMs just in case they take over once again. And hey, sometimes they work. --- class: middle I wanted to end with Hamming because of two quotes from his book I've had rattling around in my head while writing this course. The first one goes like this: <br /> > Education is what, when, and why to do things. Training is how to do it. <br /> And I've attempted to design this course to reach both ends. --- class: middle The second quote is: <br /> > Teachers should prepare the student for the student’s future, not for the teacher’s past. <br /> And, well, I hope I did. --- class: middle # References --- Titles are links: + [Hands-on Machine Learning Chapter 14](https://bradleyboehmke.github.io/HOML/svm.html) + [Richard W Hamming: The Art of Doing Science and Engineering](http://worrydream.com/refs/Hamming-TheArtOfDoingScienceAndEngineering.pdf)