--- title: "Predicting Boston House Prices" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Predicting Boston House Prices} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} use_saved_results <- TRUE knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE, eval = !use_saved_results, message = FALSE, warning = FALSE ) if (use_saved_results) { results <- readRDS("vignette_pbhp.rds") eval_frame <- results$eval_frame } ``` ```{r, eval=TRUE} library(dplyr); library(tidyr); library(purrr) # Data wrangling library(ggplot2); library(stringr) # Plotting library(tidyfit) # Auto-ML modeling ``` The Boston house prices data set (`MASS::Boston`) presents a popular test case for regression algorithms. In this article, I assess the relative performance of 15 different linear regression techniques using `tidyfit::regress`. Let's begin by loading the data, and splitting it into equally sized training and test sets: ```{r, eval=TRUE} data <- MASS::Boston # For reproducibility set.seed(123) ix_tst <- sample(1:nrow(data), round(nrow(data)*0.5)) data_trn <- data[-ix_tst,] data_tst <- data[ix_tst,] as_tibble(data) ``` The aim is to predict the median value of owner-occupied homes (`'medv'` in the data set) using the remaining columns.^[Have a look at `?MASS::Boston` for definitions of the features.] We will assess performance using $R^2$, which allows intuitive comparisons of in-sample and out-of-sample fit. ## A simple regression To establish a baseline, we fit an OLS regression model and examine the test and training set R-squared. The `predict` function can be used to obtain predictions, and `yardstick` is used to assess performance. The regression is estimated using `tidyfit::regress`. `tidyfit::m` is a generic model wrapper that is capable of fitting a large number of different algorithms: ```{r, eval=TRUE} model_frame <- data_trn %>% regress(medv ~ ., OLS = m("lm")) ``` Let's create a helper function to assess performance. The helper function wraps `predict` and prepares in and out-of-sample performance stats: ```{r, eval=TRUE} assess <- function(model_frame, data_trn, data_tst) { oos <- model_frame %>% predict(data_tst) %>% group_by(model, .add = TRUE) %>% yardstick::rsq_trad(truth, prediction) %>% mutate(type = "Out-of-sample") %>% arrange(desc(.estimate)) is <- model_frame %>% predict(data_trn) %>% group_by(model, .add = TRUE) %>% yardstick::rsq_trad(truth, prediction) %>% mutate(type = "In-sample") return(bind_rows(oos, is)) } plotter <- function(df) { df %>% mutate(lab = round(.estimate, 2)) %>% mutate(model = str_wrap(model, 12)) %>% mutate(model = factor(model, levels = unique(.$model))) %>% ggplot(aes(model, .estimate)) + geom_point(aes(color = type), size = 2.5, shape = 4) + geom_label(aes(label = lab, color = type), size = 2, nudge_x = 0.35) + theme_bw() + scale_color_manual(values = c("firebrick", "darkblue")) + theme(legend.title = element_blank(), axis.title.x = element_blank()) + coord_cartesian(ylim = c(0.65, 0.95)) } ``` Now we plot the resulting metrics: ```{r, fig.width=3, fig.height=3, fig.align="center", eval=TRUE} model_frame %>% assess(data_trn, data_tst) %>% plotter ``` The in-sample fit of the regression is naturally better than out-of-sample, but the difference is not large. The results can be improved substantially by adding interaction terms between features. 
Adding the interactions leads to a vast improvement in in-sample fit (again, a logical consequence of including a large number of regressors), but to no improvement in out-of-sample fit, suggesting substantial overfitting:

```{r, fig.width=3, fig.height=3, fig.align="center", eval=TRUE}
model_frame <- data_trn %>%
  regress(medv ~ .*., OLS = m("lm"))

model_frame %>%
  assess(data_trn, data_tst) %>%
  plotter
```

## Regularized regression estimators

Regularization controls overfitting by introducing parameter bias in exchange for a reduction in estimator variance. The many existing approaches to regularization can be grouped into at least five categories:

1. Best subset selection
2. L1/L2 penalized regression
3. Latent factor regression
4. Other machine learning
5. Bayesian regression

`tidyfit` includes linear estimators from each of these categories, as summarized below:

```{r, echo = FALSE, eval=TRUE}
library(kableExtra)
tbl <- data.frame(
  `Best subset` = c("Exhaustive search (`leaps`)", "Forward selection (`leaps`)",
                    "Backward selection (`leaps`)", "Sequential Replacement (`leaps`)",
                    "Genetic Algorithm (`gaselect`)"),
  `L1/L2 penalized` = c("Lasso (`glmnet`)", "Ridge (`glmnet`)", "ElasticNet (`glmnet`)",
                        "Adaptive Lasso (`glmnet`)", ""),
  `Latent factors` = c("Principal components regression (`pls`)",
                       "Partial least squares (`pls`)", "", "", ""),
  `Machine Learning` = c("Gradient Boosting (`mboost`)",
                         "Hierarchical feature regression (`hfr`)",
                         "Support vector regression (`e1071`)", "", ""),
  `Bayesian` = c("Bayesian regression (`arm`)", "Bayesian model averaging (`BMS`)",
                 "Bayesian Lasso (`monomvn`)", "Bayesian Ridge (`monomvn`)",
                 "Spike and Slab Regression (`BoomSpikeSlab`)"),
  check.names = FALSE
)
kbl(tbl, align = "ccccc") %>%
  kable_styling("striped")
```

In the next code chunk, 18 of the available regression techniques are fitted to the data set, alongside the OLS benchmark. Hyperparameters are optimized using 10-fold cross validation, powered by `rsample` in the background.

As a practical aside, it is possible to set a `future` plan to ensure hyperparameter tuning is performed on a parallelized backend.^[Parallelization happens at the level of cross validation slices, so always consider whether the overhead of parallel computation is worth it. For instance, parallelizing will likely be very useful with `.cv = "loo_cv"` but can be less useful with `.cv = "vfold_cv"`.] This is done using `plan(multisession, workers = n_workers)`, where `n_workers` is the number of available cores. Furthermore, a progress bar can be activated using `progressr::handlers(global = TRUE)`. A minimal sketch follows the model-fitting chunk below.

```{r}
model_frame_all <- data_trn %>%
  regress(medv ~ .*.,
          OLS = m("lm"),
          BAYES = m("bayes"),
          BMA = m("bma", iter = 10000),
          SEQREP = m("subset", method = "seqrep", IC = "AIC"),
          `FORWARD SELECTION` = m("subset", method = "forward", IC = "AIC"),
          `BACKWARD SELECTION` = m("subset", method = "backward", IC = "AIC"),
          LASSO = m("lasso"),
          BLASSO = m("blasso"),
          SPIKESLAB = m("spikeslab", niter = 10000),
          RIDGE = m("ridge"),
          BRIDGE = m("bridge"),
          ELASTICNET = m("enet"),
          ADALASSO = m("adalasso", lambda_ridge = c(0.001, 0.01, 0.1)),
          PCR = m("pcr"),
          PLSR = m("plsr"),
          HFR = m("hfr"),
          `GRADIENT BOOSTING` = m("boost"),
          SVR = m("svm"),
          GENETIC = m("genetic", populationSize = 1000, numGenerations = 50,
                      statistic = "AIC", maxVariables = 20),
          .cv = "vfold_cv", .cv_args = list(v = 10))
```
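The following sketch (not evaluated here) shows how such a parallel backend and progress bar could be set up around the chunk above. The worker count `n_workers` is a placeholder rather than a value used in this vignette:

```{r, eval=FALSE}
# Minimal sketch (not evaluated): parallelize cross validation and show progress.
library(future)
library(progressr)

n_workers <- 4                           # placeholder: set to the number of available cores
plan(multisession, workers = n_workers)  # parallel backend for hyperparameter tuning
handlers(global = TRUE)                  # activate a global progress bar

# ... call regress() exactly as in the chunk above ...

plan(sequential)                         # revert to sequential processing afterwards
```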
Let's use `yardstick` again to calculate in-sample and out-of-sample performance statistics. The models are arranged in descending order of test set performance:

```{r}
eval_frame <- model_frame_all %>%
  assess(data_trn, data_tst)
```

```{r, fig.width=7.25, fig.height=4.25, fig.align="center", eval=TRUE}
eval_frame %>%
  plotter +
  theme(legend.position = "top", axis.text.x = element_text(angle = 90))
```

The Bayesian regression techniques overwhelmingly outperform the remaining methods, with the exception of the hierarchical feature regression, the genetic algorithm and the adaptive lasso, all of which also achieve relatively high out-of-sample performance. The subset selection algorithms achieve relatively good results, while the remaining penalized and latent factor regressions perform worst. All methods improve on the OLS results.

## A glimpse at the backend

`tidyfit` makes it exceedingly simple to fit different regularized linear regressions. The package ensures that the inputs and outputs of the modeling engine, `tidyfit::m`, are standardized and that model results are comparable. For instance, whenever necessary, features are standardized and the coefficients are transformed back to the original scale. Hyperparameter grids are set to reasonable starting values and can be passed manually to `tidyfit::m`.

The package further includes a `tidyfit::classify` method that assumes a binomial or multinomial response, and many of the implemented methods can handle both Gaussian and binomial responses. Furthermore, `tidyfit::regress` and `tidyfit::classify` seamlessly integrate with grouped tibbles, making them a powerful way to estimate a large number of regressions for different data groupings (a sketch follows below). Groups as well as response distributions are automatically respected by the `predict` method.
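To illustrate the grouped workflow, here is a short sketch (not evaluated, and purely illustrative). It fits separate cross-validated Lasso and Ridge regressions for each level of the `chas` dummy; the grouping variable and the manually supplied `lambda` grid (assumed to be the penalty grid for the `glmnet`-based methods) are choices made for this example, not part of the analysis above:

```{r, eval=FALSE}
# Illustrative sketch (not evaluated): one set of models per data group.
# Grouping by `chas` is arbitrary and the groups are small; it only serves
# to demonstrate the grouped workflow.
grouped_models <- data_trn %>%
  group_by(chas) %>%
  regress(medv ~ .,
          LASSO = m("lasso", lambda = c(0.01, 0.1, 1)),  # manually supplied grid
          RIDGE = m("ridge"),                            # default grid
          .cv = "vfold_cv", .cv_args = list(v = 5))

# Group memberships are respected automatically when predicting
grouped_models %>%
  predict(data_tst)
```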