class: center, middle, inverse, title-slide

# MLCA Week 10:
## Followup
### Mike Mahoney
### 2021-11-02

---
class: middle

# Project FAQ

---

### Cross-entropy (from week 7 - tuning RF):

```r
library(dplyr)  # mutate()
library(ranger) # predictions()

calc_cross_entropy <- function(rf_model, data) {
  data <- predict(rf_model, data) |>
    predictions() |>
    cbind(data) |>
    mutate(
      # Keep the predicted probability of the class that actually occurred
      prediction = ifelse(Attrition == "Yes", Yes, No),
      # Force prediction to not be exactly 0 or 1
      # (pmax/pmin clamp row by row; max/min would return a single value)
      prediction = pmax(1e-15, pmin(1 - 1e-15, prediction)),
      loss = -log(prediction)
    )
  sum(data$loss)
}
```

### Cross-entropy: negative log of the probability of _the correct classification_

### `ranger`: provides predictions of both classes

### `lightgbm`: provides predictions of _the positive class_

---

### So an alteration is needed:

```r
calc_cross_entropy <- function(lgb_model, data) {
  data <- data |>
    mutate(
      # Name the column so the later steps can refer to it
      # (depending on the lightgbm version, `data` may first need to be
      # converted to a numeric matrix of predictors)
      prediction = predict(lgb_model, data),
      # If the correct answer is No, invert the prediction:
      prediction = ifelse(Attrition == "Yes", prediction, 1 - prediction),
      # Force prediction to not be exactly 0 or 1
      prediction = pmax(1e-15, pmin(1 - 1e-15, prediction)),
      loss = -log(prediction)
    )
  sum(data$loss)
}
```

---

### Just because it _worked_ doesn't mean it's _working_

<img src="working_not_working.png" width="75%" />

--

<img src="working_working.png" width="75%" />

---

## Standard citations:

```r
citation()
```

```
##
## To cite R in publications use:
##
##   R Core Team (2021). R: A language and environment for statistical
##   computing. R Foundation for Statistical Computing, Vienna, Austria.
##   URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2021},
##     url = {https://www.R-project.org/},
##   }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
```

---

## Other standard cites

```r
citation("ranger")   # RF
citation("lightgbm") # GBM
citation("kernlab")  # SVM
citation("caret")    # KNN
citation("rpart")    # Decision trees
```

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

Friedman, J. H., 2002. Stochastic Gradient Boosting. Computational Statistics & Data Analysis 38(4), 367–378. https://doi.org/10.1016/S0167-9473(01)00065-2

Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20, 273–297. https://doi.org/10.1007/BF00994018

---

# Other notes:

### Don't force-install packages (it's rude)

### Don't cite LM, GLM

### Pay attention to the rubric (if it isn't there, it isn't worth points)

---

# Project re-submit

### Optional!

### I make no promises about turnaround time (but it will be measured in days, not hours).

### No resubmission after 2021-12-08.

### First project has been finished!

---

# Status Update

### One more week of content (SVM)

### Three "work weeks" (bring questions)

### Presentations 2021-12-08
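
---

### Appendix: sanity-checking the cross-entropy clamp

The cross-entropy slides define the loss as the negative log of the probability of the correct class, clamped away from exactly 0 or 1. Below is a minimal base-R sketch of that calculation on made-up numbers. The vectors `truth` and `p_yes` are hypothetical, not course data, and stand in for the true labels and a model's predicted probability of "Yes". It also shows why the clamp should be element-wise: `max()` and `min()` reduce a whole vector to one value, while `pmax()` and `pmin()` work row by row.

```r
# Hypothetical toy data: true labels and predicted P(Yes) from some model
truth <- c("Yes", "No", "Yes", "No")
p_yes <- c(0.9, 0.2, 1.0, 0.0)

# Probability the model assigned to the class that actually occurred
p_correct <- ifelse(truth == "Yes", p_yes, 1 - p_yes)

# Element-wise clamp: each prediction is bounded separately
clamped <- pmax(1e-15, pmin(1 - 1e-15, p_correct))

# Scalar clamp: min() collapses the whole vector to its smallest value,
# so every row would end up with the same "prediction"
collapsed <- max(1e-15, min(1 - 1e-15, p_correct))

# Cross-entropy: sum of the negative log of each clamped probability
cross_entropy <- sum(-log(clamped))

length(clamped)   # one value per observation
length(collapsed) # a single value for the whole data set
```

Comparing `length(clamped)` with `length(collapsed)` is a quick way to confirm that a loss function is scoring every row rather than reusing one number.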