class: center, middle, inverse, title-slide # MLCA Week 2: ## Regression, Linear Regression, and Statistics ### Mike Mahoney ### 2021-09-08 --- class: center, middle # Regression --- # This class focuses on a set of algorithms known as **supervised learning**. <br /> These are models that have been trained on data with known outcomes -- so there's some idea of the "true" values our model "should" be predicting. <br /> This is in contrast to **unsupervised learning**, where we don't know the "true" outcome we're trying to model. --- # Supervised learning is what most people think of when they think of modeling. <br /> You might imagine building models to predict... * the likelihood a species occurs at sites across the landscape; <br /> * which species will go extinct within the next five, ten, fifteen years; <br /> * how much biomass is stored in forests across a region; <br /> * patterns in annual temperatures into the future. <br /> These are all examples of supervised learning problems. --- # When we use models to predict numeric outcomes, we are using a **regression model**. <br /> This is in contrast to a **classification model**, which we'll discuss more next week. <br /> Most of the algorithms we discuss can be used for both regression and classification, with minor tweaks. --- Let's walk through a simple regression example. For most of this class, we'll make use of the Ames Iowa Housing Data provided by the `AmesHousing` package to demonstrate regression problems. Let's install `AmesHousing` now: ```r install.packages("AmesHousing") ``` Then create a cleaned version of the data by using the `AmesHousing::make_ames()` function: ```r ames <- AmesHousing::make_ames() ``` If all goes well, your data frame should have dimensions like this: ```r nrow(ames) ``` ``` ## [1] 2930 ``` ```r ncol(ames) ``` ``` ## [1] 81 ``` --- This data frame contains information on a number of homes that were sold in Ames, Iowa from 2006 to 2010. Our regression models will mostly focus on predicting the sale price of homes (the `Sale_Price` column) as a function of the other variables in the data set. Luckily for us, a good number of the variables in this data frame are associated with the sale price. For instance, we could plot the relationship between the year a house was built and its sale price: ```r library(ggplot2) ggplot(ames, aes(Year_Built, Sale_Price)) + geom_point() ``` ![](week_2_slides_files/figure-html/unnamed-chunk-4-1.png)<!-- --> --- It looks like, generally speaking, house prices are higher for more recently constructed homes. Suppose we're interested in the average relationship between these variables -- how many dollars each additional year newer is worth. We might imagine drawing a line straight down the middle of all those points, which would look like this: ```r ggplot(ames, aes(Year_Built, Sale_Price)) + geom_point() + geom_smooth(method = "lm", color = "red") ``` ![](week_2_slides_files/figure-html/unnamed-chunk-5-1.png)<!-- --> --- This is a linear regression. To be specific, this is an **ordinary least squares** regression, which minimizes the squared error -- that is, the (squared) distance from the line to all of the points, at the same time. ```r ggplot(ames, aes(Year_Built, Sale_Price)) + geom_point() + geom_smooth(method = "lm", color = "red") ``` ![](week_2_slides_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- So we can imagine a single-variable linear model as just drawing a 1D line through the middle of our data. 
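As a quick sketch of that single-variable model in code (assuming the `ames` data frame we created above), we can fit the line with `lm()` and read the slope off directly -- roughly how many dollars each additional year of construction date is worth, on average:

```r
# Fit a one-variable ordinary least squares model:
year_model <- lm(Sale_Price ~ Year_Built, data = ames)

# The Year_Built coefficient is the slope of the red line: the average
# change in sale price for each one-year increase in Year_Built.
coef(year_model)
```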
But things get a little more complex if we want to include additional variables, too. For instance, let's say we wanted to include the above-ground living area (`Gr_Liv_Area`) in our model.

We can plot the relationship between these variables in 3D. This plot is a bit uglier than before, but we can still see that more living area is associated with higher sale prices:

![](week_2_slides_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---
class: middle

If we fit a linear model to both these variables, we're doing something known as **multiple linear regression**. Instead of fitting a 1D line to all of our points, we're going to fit a 2D plane -- again trying to minimize the (squared) distance between our plane and all of the points:

.pull-left[
![](week_2_slides_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

.pull-right[
![](week_2_slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

As we keep adding more variables, we can imagine that this pattern keeps happening -- our linear model will try to find the shape that minimizes errors in every dimension at once. Rather than come up with different names for this shape in each dimension (like line and plane), these shapes are generally referred to as **hyperplanes** or, more broadly, **surfaces**.

---
class: middle

Linear models require a few things from your data set in order to function properly. Among these assumptions, the most important are:

* Linearity: For linear regression to produce accurate predictions, your outcome (the variable being modeled, in our example sale price) must be linearly associated with each of your predictor variables.

* Normality: The residuals of your model -- the differences between observed and predicted outcomes -- must be normally distributed for the model's p-values and confidence intervals to be reliable.

* No multicollinearity: Your predictor variables must be independent of each other; they should not be strongly correlated with one another.

For the purposes of this course, it's enough to just broadly know what these assumptions are; we're only using linear models as a baseline to compare newer machine learning methods against. If you're interested in these assumptions and what to do if your data violates them, I'd recommend APM 630 - Regression Methods.

---
class: middle

To fit a linear model in R, we can use the `lm` function:

```r
ames_one <- lm(Sale_Price ~ Year_Built + Gr_Liv_Area, ames)
ames_one
```

```
## 
## Call:
## lm(formula = Sale_Price ~ Year_Built + Gr_Liv_Area, data = ames)
## 
## Coefficients:
## (Intercept)   Year_Built  Gr_Liv_Area  
##  -2.106e+06    1.087e+03    9.597e+01
```

---

If you've used linear models in R before, you're likely familiar with using `summary` to view how a model performed:

```r
summary(ames_one)
```

```
## 
## Call:
## lm(formula = Sale_Price ~ Year_Built + Gr_Liv_Area, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -458172  -26758   -2236   18514  306986 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.106e+06  5.734e+04  -36.74   <2e-16 ***
## Year_Built   1.087e+03  2.938e+01   37.01   <2e-16 ***
## Gr_Liv_Area  9.597e+01  1.758e+00   54.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46660 on 2927 degrees of freedom
## Multiple R-squared:  0.6591, Adjusted R-squared:  0.6588 
## F-statistic:  2829 on 2 and 2927 DF,  p-value: < 2.2e-16
```

This command gives us information on model `\(R^2\)` values, useful for estimation, and p-values for the model and individual variables, useful for attribution.
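If you ever want those numbers programmatically rather than printed to the console, they can be pulled straight out of the summary object (a small sketch using the `ames_one` model fit above):

```r
# Store the summary object rather than just printing it:
ames_one_summary <- summary(ames_one)

# R-squared values:
ames_one_summary$r.squared
ames_one_summary$adj.r.squared

# Coefficient table: estimates, standard errors, t values, and p-values:
coef(ames_one_summary)
```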
But as we discussed last week, we don't care about goodness-of-fit or significance for prediction models; we care about **accuracy**.

---

Measuring how accurate our model's predictions are requires us to first make predictions:

```r
# Making a new data frame, to not mess up our original:
ames_copy <- ames

# Make our predictions:
ames_copy$predicted <- predict(ames_one, ames)
```

We can then visualize how well our model predicts by plotting our predictions against actual values:

```r
ggplot(ames_copy, aes(predicted, Sale_Price)) + geom_point(alpha = 0.4)
```

![](week_2_slides_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

By looking at the plot, we can get a sense of how our model performs -- we under-predict across the board, and our errors get worse as the homes get more expensive. But if we want to be able to compare our accuracy to other models, we need to quantify our overall error.

<br />

One way to quantify error is to use RMSE, as we talked about last week:

`$$\operatorname{RMSE} = \sqrt{(\frac{1}{n})\sum_{i=1}^{n}(\hat{y}_{i} - y_{i})^{2}}$$`

<br />

In code, we can write this as:

```r
sqrt(mean((ames_copy$predicted - ames_copy$Sale_Price)^2))
```

```
## [1] 46636.73
```

<br />

RMSE measures error in the same units as the variable being predicted -- in this case, it tells us that our model has a root mean squared error of $46,636.73.

---

RMSE is an example of what's known in machine learning as a **loss function** (or sometimes a "cost function"). Loss functions measure the "cost" associated with an event -- in this case, how much a model should be penalized for each error. We (and our algorithms) are always trying to minimize some loss value.

RMSE is the most popular loss function for regression problems, but it's not the only option. Another common option is mean absolute error:

`$$\operatorname{MAE} = (\frac{1}{n})\sum_{i=1}^{n}\left | \hat{y}_{i} - y_{i} \right |$$`

Or in code:

```r
mean(abs((ames_copy$predicted - ames_copy$Sale_Price)))
```

```
## [1] 31615.67
```

This loss function is more immediately interpretable than RMSE: our model mispredicts sale prices by an average of $31,615.67.

As a sidenote: MAE is occasionally used as an acronym for _median_ absolute error. In this course, we'll always use MAE for _mean_ absolute error, which is by far the more common meaning, but it's always a good idea to double-check if someone uses the acronym without specifying.

---
class: middle

So if MAE is more easily interpreted than RMSE, why is RMSE a more popular loss function? It has to do with how individual errors are weighted.

Because RMSE takes the mean of the _squared_ errors, it penalizes larger errors much more than MAE. For instance, here's a graph of the "cost" (on the y axis) of a prediction error of various magnitudes (on the x):

![](week_2_slides_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

---
class: middle

Improving a prediction by a certain amount will reduce MAE evenly, no matter which prediction improved -- it's just as beneficial for an error of $10,000 to become $0 as for an error of $100,000 to become $90,000.

<br />

For RMSE, improvements to the most-wrong predictions are weighted more heavily, so that bringing the $100,000 error down to $90,000 is viewed as more of an improvement than bringing the $10,000 down to $0.

<br />

At the end of the day, we can only optimize for one of these values -- the model with the lowest RMSE may not have the lowest MAE as well.
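To make that weighting difference concrete, here's a small illustrative sketch -- the error values below are made up for the example, not taken from our Ames models:

```r
# Two sets of prediction errors with the same total absolute error:
errors_even  <- c(10000, 10000, 10000, 10000)  # four moderate misses
errors_spiky <- c(0, 0, 0, 40000)              # one large miss

# MAE treats the two situations identically:
mean(abs(errors_even))      # 10000
mean(abs(errors_spiky))     # 10000

# RMSE penalizes the single large miss much more heavily:
sqrt(mean(errors_even^2))   # 10000
sqrt(mean(errors_spiky^2))  # 20000
```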
<br /> RMSE tends to be the better loss function for how we actually build, compare, and select predictive models; we want to penalize our more extreme errors more heavily, so that hopefully each individual prediction is close enough to true to be useful. But in some situations, where a small error is truly just as undesirable as a large one, it makes sense to use MAE instead. --- So we'll keep using RMSE for our models today. We can use this loss function to easily compare models -- for instance, how well does our "year built and square footage" model compare to models built only on one of those variables? ```r library(dplyr) (model_performance <- ames %>% mutate( predictions_full = predict( lm(Sale_Price ~ Year_Built + Gr_Liv_Area, ames), .), predictions_year = predict( lm(Sale_Price ~ Year_Built, ames), .), predictions_area = predict( lm(Sale_Price ~ Gr_Liv_Area, ames), .) ) %>% summarise( # This calculates RMSE "across" columns # whose names start with "predictions" across( starts_with("predictions"), ~ sqrt(mean((. - Sale_Price)^2))) )) ``` ``` ## # A tibble: 1 × 3 ## predictions_full predictions_year predictions_area ## <dbl> <dbl> <dbl> ## 1 46637. 66259. 56505. ``` --- class: middle ```r model_performance ``` ``` ## # A tibble: 1 × 3 ## predictions_full predictions_year predictions_area ## <dbl> <dbl> <dbl> ## 1 46637. 66259. 56505. ``` <br /> RMSE collapses all the differences in how our models make predictions down to a single, easily compared value: our full model (RMSE $46,636) performs much better than either of its component models (RMSE $66,259 and $56,504). <br /> This method for assessing model performance is a pretty standard workflow for a lot of estimation and attribution problems: get some data, fit a model on it, report how well your model describes the data. --- class: middle With prediction models, however, we aren't particularly interested in how well our model predicts the data it's been fit on, since our model has already been able to minimize its error on all of the data points we used to create it. <br /> Instead, we want to know how well our model will do at predicting data when we don't know what the "right" answer will be. <br /> But it's hard to calculate RMSE (or any loss function) when we don't know the right answer! Instead, what we can do is remove some data from the set we "train" our model with, and use our model predictions on that "removed" data to compare our models. <br /> This is called the **hold-out** or **simple validation** method of assessing model performance. We'll call the data we use to train the model our "training" set, and the data we use to evaluate performance our "test" set -- though you might see "hold-out set" used in more formal settings. --- Commonly about 20% of the data will be reserved for a "test" set, with the model fit to the other 80%. With that said, the amounts used can vary wildly -- in industry, with models trained on trillions of records, it's not uncommon to see 10% or even 1% test sets, while some academics go so far as to use 50/50 splits. <br /> Generally speaking, your goal is to have enough of your data in training to have your model fully capture patterns in the data and enough data in testing to make sure your testing data looks like the real-world data you're predicting. <br /> If you don't have any reasons for concern about this, an 80/20 or 70/30 training/testing split is probably appropriate. 
<br /> Mechanically, we can split our data into testing and training sets like this: ```r # Always set a seed before doing anything with randomization! set.seed(123) row_idx <- sample(seq_len(nrow(ames)), nrow(ames)) training <- ames[row_idx < nrow(ames) * 0.8, ] testing <- ames[row_idx >= nrow(ames) * 0.8, ] ``` --- We can then use the same code as before to compare our model performance, using models fit on the training set and evaluated against the test set: ```r testing %>% mutate( predictions_full = predict( lm(Sale_Price ~ Year_Built + Gr_Liv_Area, training), .), predictions_year = predict( lm(Sale_Price ~ Year_Built, training), .), predictions_area = predict( lm(Sale_Price ~ Gr_Liv_Area, training), .) ) %>% summarise( across(starts_with("predictions"), ~ sqrt(mean((. - Sale_Price)^2))) ) ``` ``` ## # A tibble: 1 × 3 ## predictions_full predictions_year predictions_area ## <dbl> <dbl> <dbl> ## 1 47203. 62774. 55628. ``` Our full model performs a bit worse on the hold-out set than it did when we were evaluating using training data, which is to be expected -- it wasn't fit to these values, so wasn't able to minimize its error on specifically these points. Our component models, however, actually perform slightly _better_ on the new data! --- class: middle This method of model assessment makes two core assumptions it's good to mention up-front: 1. Your training data should resemble your test data (any data processing done to one set should be done to both), and 2. Your test data should resemble the real-world data you care about predicting accurately. The first assumption is nice and mechanical -- if you scale your variables so they're all between 0 and 1 in the training data, you need to do the same thing in the test data. The second is much harder and requires you to think about how your model will be used, then ensuring your data is representative. In other fields, breaking this assumption has resulted in facial recognition algorithms not recognizing women [(link)](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) and self-driving cars not identifying dark-skinned pedestrians [(link)](https://www.vox.com/future-perfect/2019/3/5/18251924/self-driving-car-racial-bias-study-autonomous-vehicle-dark-skin), due to the over-representation of white men in the most commonly used training image data sets. Depending on your area of study, it might be difficult to have such glaring blindspots -- but evaluating models based on data which doesn't properly represent the spatial extent of the area you'll be predicting, or the diversity of land cover types, or any other relevant variable may still leave you with far too positive a picture of your model's performance. --- class: middle There's one other assumption in using holdout sets to assess your model: the test data should be completely unknown to both the model and the modeler. We want our model to be predicting data that wasn't used to build it -- that means don't train models on test data, but it also means that you can't evaluate your models against test data, use that information to make changes to the models you're evaluating, and then evaluate the new models against the same test data. Given enough time, that's just an inefficient way of fitting your model to the combined data set, using your manual changes as a substitute for algorithmic fitting. One way around this is to make a third set of data, called a **validation set**, to evaluate models against while you're still in the process of making tweaks. 
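A minimal sketch of what carving out a validation set might look like, reusing the shuffled `row_idx` from our earlier split (the 60/20/20 proportions are just one common choice, and the `_3` suffixes are only there to avoid overwriting the sets we already made):

```r
n <- nrow(ames)

# Roughly 60% training, 20% validation, 20% testing:
training_3   <- ames[row_idx < n * 0.6, ]
validation_3 <- ames[row_idx >= n * 0.6 & row_idx < n * 0.8, ]
testing_3    <- ames[row_idx >= n * 0.8, ]
```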
Frequently this set is the same size as the test set, so that (for instance) the model is trained on 60% of the data, 20% of the data is used to compare models and make changes, and the last 20% of the data is used to calculate final accuracy metrics. For now, we're just going to keep using our test and train sets. We'll talk more about validation sets in the weeks to come. --- class: middle So our full model is still the best of the bunch. Let's go ahead and keep expanding on it. Can we get any improvement by including what type of foundation each home uses? ```r ames_two <- lm(Sale_Price ~ Year_Built + Gr_Liv_Area + Foundation, training) testing %>% mutate(predictions_full = predict(ames_two, .)) %>% summarise(predictions_full = sqrt(mean((predictions_full - Sale_Price)^2))) ``` ``` ## # A tibble: 1 × 1 ## predictions_full ## <dbl> ## 1 46227. ``` Looks like the answer is yes, slightly. It's worth noting that `Foundation` is the first non-numeric variable we've used. Let's spend a little time talking about how models use non-numeric variables. --- Behind the scenes, `lm` has turned our single column into a handful of temporary variables. We can see this in the output when we run `summary(ames_two)` -- the `Foundation` variables don't exist in our data frame: ```r summary(ames_two) ``` ``` ## ## Call: ## lm(formula = Sale_Price ~ Year_Built + Gr_Liv_Area + Foundation, ## data = training) ## ## Residuals: ## Min 1Q Median 3Q Max ## -387154 -27464 -1776 18620 301244 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.042e+06 9.298e+04 -21.965 < 2e-16 *** ## Year_Built 1.059e+03 4.841e+01 21.871 < 2e-16 *** ## Gr_Liv_Area 9.581e+01 2.015e+00 47.551 < 2e-16 *** ## FoundationCBlock -1.313e+04 3.729e+03 -3.523 0.000435 *** ## FoundationPConc -6.176e+02 4.700e+03 -0.131 0.895461 ## FoundationSlab -5.332e+04 8.155e+03 -6.538 7.62e-11 *** ## FoundationStone -4.446e+02 1.897e+04 -0.023 0.981305 ## FoundationWood -4.160e+04 2.338e+04 -1.779 0.075372 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 45870 on 2335 degrees of freedom ## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6821 ## F-statistic: 719 on 7 and 2335 DF, p-value: < 2.2e-16 ``` --- class: middle Those variables are **binary** variables (also referred to as **booleans**): columns with values of either 0 or 1, where 0 corresponds to `FALSE` and 1 to `TRUE`. <br /> So in this example, a house with a stone foundation would have a 1 in the `FoundationStone` column and 0s everywhere else. <br /> We need to do this because our models don't actually know how to process text data; we need to convert it into some numeric representation before we can use that information to help us make predictions. This is worth emphasizing: every model we talk about in this course is only capable of dealing with numeric data, so the data we provide the model needs to be numeric. <br /> Sometimes this conversion is done for us (like when we used `lm`), sometimes we need to do it for ourselves. --- When we need to convert categorical data (like foundation types) into binary variables, we call this transformation **one-hot encoding** a variable. 
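As a toy illustration of what this transformation produces -- the tiny `foundation` vector here is made up, and the real transformation on our training data is shown next:

```r
# Three hypothetical foundation types:
foundation <- c("CBlock", "PConc", "Slab")

# One new 0/1 column per category; each row has exactly one 1 ("hot") value:
data.frame(
  FoundationCBlock = as.integer(foundation == "CBlock"),
  FoundationPConc  = as.integer(foundation == "PConc"),
  FoundationSlab   = as.integer(foundation == "Slab")
)
```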
We can do this pretty easily using `pivot_wider` from the `tidyr` package: ```r library(tidyr) one_hot_training <- training %>% mutate(dummy_value = 1) %>% # Create a dummy value column pivot_wider( # "Turn each value of Foundation into a variable" names_from = Foundation, # Optional -- add the column name prefix lm uses names_prefix = "Foundation", # "Fill the new variables with our dummy value" values_from = dummy_value, # Will fill in all the other Foundation fields with 0s values_fill = 0 ) %>% # Optional -- only keep the columns we care about select(starts_with("Foundation"), Sale_Price, Year_Built, Gr_Liv_Area) head(one_hot_training, n = 2) ``` ``` ## # A tibble: 2 × 9 ## FoundationCBlock FoundationPConc FoundationWood FoundationBrkTil ## <dbl> <dbl> <dbl> <dbl> ## 1 1 0 0 0 ## 2 1 0 0 0 ## # … with 5 more variables: FoundationSlab <dbl>, FoundationStone <dbl>, ## # Sale_Price <int>, Year_Built <int>, Gr_Liv_Area <int> ``` --- Some models (including linear models) don't deal well with one-hot encoded data, however, since your binary variables will be perfectly collinear with one another. `lm` deals with this by setting one of the variables to `NA`: ```r lm(Sale_Price ~., data = one_hot_training) %>% summary() ``` ``` ## ## Call: ## lm(formula = Sale_Price ~ ., data = one_hot_training) ## ## Residuals: ## Min 1Q Median 3Q Max ## -387154 -27464 -1776 18620 301244 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.043e+06 9.434e+04 -21.653 < 2e-16 *** ## FoundationCBlock -1.269e+04 1.896e+04 -0.669 0.50330 ## FoundationPConc -1.730e+02 1.918e+04 -0.009 0.99280 ## FoundationWood -4.115e+04 2.988e+04 -1.377 0.16852 ## FoundationBrkTil 4.446e+02 1.897e+04 0.023 0.98130 ## FoundationSlab -5.287e+04 2.029e+04 -2.606 0.00922 ** ## FoundationStone NA NA NA NA ## Year_Built 1.059e+03 4.841e+01 21.871 < 2e-16 *** ## Gr_Liv_Area 9.581e+01 2.015e+00 47.551 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 45870 on 2335 degrees of freedom ## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6821 ## F-statistic: 719 on 7 and 2335 DF, p-value: < 2.2e-16 ``` --- If we want more control over this process, we can exclude one of the variables ourselves. This is known as creating a **dummy variable**: ```r one_hot_training %>% select(-FoundationBrkTil) %>% lm(Sale_Price ~., data = .) %>% summary() ``` ``` ## ## Call: ## lm(formula = Sale_Price ~ ., data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -387154 -27464 -1776 18620 301244 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.042e+06 9.298e+04 -21.965 < 2e-16 *** ## FoundationCBlock -1.313e+04 3.729e+03 -3.523 0.000435 *** ## FoundationPConc -6.176e+02 4.700e+03 -0.131 0.895461 ## FoundationWood -4.160e+04 2.338e+04 -1.779 0.075372 . ## FoundationSlab -5.332e+04 8.155e+03 -6.538 7.62e-11 *** ## FoundationStone -4.446e+02 1.897e+04 -0.023 0.981305 ## Year_Built 1.059e+03 4.841e+01 21.871 < 2e-16 *** ## Gr_Liv_Area 9.581e+01 2.015e+00 47.551 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 45870 on 2335 degrees of freedom ## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6821 ## F-statistic: 719 on 7 and 2335 DF, p-value: < 2.2e-16 ``` --- This is what `lm` actually does under the hood -- notice how our output now matches the original `ames_two` model exactly. 
```r one_hot_training %>% select(-FoundationBrkTil) %>% lm(Sale_Price ~., data = .) %>% summary() ``` ``` ## ## Call: ## lm(formula = Sale_Price ~ ., data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -387154 -27464 -1776 18620 301244 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.042e+06 9.298e+04 -21.965 < 2e-16 *** ## FoundationCBlock -1.313e+04 3.729e+03 -3.523 0.000435 *** ## FoundationPConc -6.176e+02 4.700e+03 -0.131 0.895461 ## FoundationWood -4.160e+04 2.338e+04 -1.779 0.075372 . ## FoundationSlab -5.332e+04 8.155e+03 -6.538 7.62e-11 *** ## FoundationStone -4.446e+02 1.897e+04 -0.023 0.981305 ## Year_Built 1.059e+03 4.841e+01 21.871 < 2e-16 *** ## Gr_Liv_Area 9.581e+01 2.015e+00 47.551 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 45870 on 2335 degrees of freedom ## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6821 ## F-statistic: 719 on 7 and 2335 DF, p-value: < 2.2e-16 ``` --- class: middle Two pieces of trivia that might be helpful in remembering these encoding methods: <br /> * One-hot encoding is so called because one and only one of the new variables is a 1, or "hot". The opposite method (which works just as well) is to have one and only one variable equal to 0, which is known as "one-cold encoding". <br /> * Dummy variables are "dummies" because the model is relying on some information that doesn't exist as a variable on its own, but is a combination of the other variables . For instance, by removing the BrkTil variable, we've effectively told our models that houses which are 0 for every other type of foundation are actually their own BrkTil group. --- class: middle That's it for this week. We'll come back to regression, hold-out set validation, and categorical variable encoding throughout the rest of the course. Next week, classification. --- class: middle center title # References --- Titles link to references: + [Hands-On Machine Learning with R](https://bradleyboehmke.github.io/HOML/). Specifically, we're using Chapter 1 (background and Ames housing data), 2 (hold-out sets, loss) and 4 (linear regression).