--- title: "Bootstrapping Confidence Intervals" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Bootstrapping Confidence Intervals} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE, eval=TRUE} use_saved_results <- TRUE knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE, eval = !use_saved_results, message = FALSE, warning = FALSE ) if (use_saved_results) { library(pls) results <- readRDS("vignette_bci.rds") estimates <- results$estimates jackknife_estimates <- results$jackknife_estimates } ``` ```{r, eval=TRUE} library(dplyr); library(tidyr); library(purrr); library(ggplot2) # Data wrangling library(tidyfit) # Model fitting ``` The combination of `.cv = "bootstraps"` and `.return_slices = TRUE` in `tidyfit::regress` or `tidyfit::classify` makes it very easy to calculate bootstrap confidence intervals for estimated coefficients. As an additional convenience function, `coef.tidyfit.models` includes the option of adding percentile bootstrap intervals directly. In this short example, I will calculate and compare bootstrap confidence bands for a partial least squares regression and a principal components regression using Boston house price data: ```{r, eval=TRUE} data <- MASS::Boston %>% scale %>% as_tibble ``` `tidyfit` handles data scaling internally (i.e. PLSR and PCR are always fitted on scaled data), however, scaling the data manually here will give us standardized coefficients, which are easier to visualize and compare. ## Fit the model Instead of selecting an optimal number of latent components, I define a preset. This keeps things a little simpler. Note that dropping the `ncomp = 5` argument results the optimal number of components being selected using bootstrap resampling. ```{r} model_frame <- data %>% regress(medv ~ ., m("plsr", ncomp = 5), m("pcr", ncomp = 5), .cv = "bootstraps", .cv_args = list(times = 100), .return_slices = TRUE) ``` The coefficients are returned for each slice when `.add_bootstrap_intervals = FALSE` (default behavior --- see `coef(model_frame)`). To obtain bootstrap intervals, I pass `.add_bootstrap_interval = TRUE` to `coef`: ```{r} estimates <- coef(model_frame, .add_bootstrap_interval = TRUE, .bootstrap_alpha = 0.05) ``` ```{r, eval=TRUE} estimates ``` The intervals are nested in `model_info`: ```{r, eval=TRUE} estimates <- estimates %>% unnest(model_info) estimates ``` ## Plot the results And thus, in a concise workflow, we have 95% bootstrap confidence intervals for the coefficients of a PCR and PLS regression: ```{r, fig.width=7, fig.height=3.5, fig.align="center", eval=TRUE} estimates %>% filter(term != "(Intercept)") %>% ggplot(aes(term, estimate, color = model)) + geom_hline(yintercept = 0) + geom_errorbar(aes(ymin = .lower, ymax = .upper), position = position_dodge()) + theme_bw(8) ``` ## Comparing to built-in jackknife procedure The `pls`-package includes built-in functionality to jackknife confidence intervals for the coefficients. 
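As a sanity check, the percentile intervals can also be computed by hand from the slice-level coefficients returned by `coef(model_frame)` (i.e. without `.add_bootstrap_interval`). The sketch below is an illustration rather than part of the `tidyfit` API: it assumes the slice-level output contains one row per bootstrap slice with `model`, `term`, and `estimate` columns, and simply takes empirical quantiles across slices:

```{r}
# Manual percentile interval: 2.5% and 97.5% quantiles of the coefficient
# estimates across bootstrap slices (column names assumed as described above)
manual_intervals <- coef(model_frame) %>%
  group_by(model, term) %>%
  summarise(.lower = quantile(estimate, 0.025),
            .upper = quantile(estimate, 0.975),
            .groups = "drop")
manual_intervals
```

Up to the exact quantile definition used, these should match the intervals returned by `.add_bootstrap_interval = TRUE` above.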
## Comparing to built-in jackknife procedure

The `pls` package includes built-in functionality to compute jackknife confidence intervals for the coefficients. We can compare these results by passing `jackknife = TRUE` and `validation = "LOO"` to `m()`, and setting `.cv = "none"` (the default):

```{r}
model_frame_jackknife <- data %>%
  regress(medv ~ .,
          m("plsr", ncomp = 5, jackknife = TRUE, validation = "LOO"),
          m("pcr", ncomp = 5, jackknife = TRUE, validation = "LOO"))

jackknife_estimates <- coef(model_frame_jackknife)
```

Now the `coef()` generic method also provides standard errors and $p$-values for the coefficients using `pls::jack.test`:

```{r, eval = TRUE}
jackknife_estimates <- jackknife_estimates %>%
  unnest(model_info) %>%
  # Create approximate 95% intervals using 2 standard deviations
  mutate(.upper = estimate + 2 * std.error,
         .lower = estimate - 2 * std.error)

jackknife_estimates
```

The resulting plot is nearly identical to the bootstrap results above:

```{r, fig.width=7, fig.height=3.5, fig.align="center", eval=TRUE}
jackknife_estimates %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(term, estimate, color = model)) +
  geom_hline(yintercept = 0) +
  geom_errorbar(aes(ymin = .lower, ymax = .upper), position = position_dodge()) +
  theme_bw(8)
```
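To make the comparison explicit, the bootstrap and jackknife intervals can be stacked and plotted side by side. This is a minimal sketch that relies only on the columns created above (`model`, `term`, `estimate`, `.lower`, `.upper`); the `method` label is added here purely for faceting:

```{r, fig.width=7, fig.height=6, fig.align="center"}
# Stack both sets of intervals and facet by interval method
bind_rows(
  estimates %>% mutate(method = "Bootstrap (percentile)"),
  jackknife_estimates %>% mutate(method = "Jackknife (±2 SE)")
) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(term, estimate, color = model)) +
  geom_hline(yintercept = 0) +
  geom_errorbar(aes(ymin = .lower, ymax = .upper), position = position_dodge()) +
  facet_wrap(~method, ncol = 1) +
  theme_bw(8)
```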