Virtual Bootstraps

set.seed(123)
library(strapgod)
library(dplyr)
iris <- as_tibble(iris)

Introduction

The goal of strapgod is to make it easy to create virtual groups on top of tibbles for use with resampling. The tibble is grouped, but the groups are not “materialized” (the resampled rows are not copied out) until you need them. Because of this, some computations involving a large number of bootstraps or resamples can be made much more efficient.

Creating resampled data frames

There are two core functions that help you generate a resampled_df object.

bootstrapify() takes a data frame and bootstraps the rows of that data frame a set number of times to generate the virtual groups.

iris_boot <- bootstrapify(iris, times = 10)

nrow(iris)
#> [1] 150
nrow(iris_boot)
#> [1] 150

iris_boot
#> # A tibble: 150 x 5
#> # Groups:   .bootstrap [10]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

What you’ll immediately notice is that the invisible .bootstrap column is the virtual group. It hasn’t been materialized (there are still only 150 rows, not 150 * 10 = 1500 rows), but dplyr still knows about it, as the "Groups: .bootstrap [10]" line in the header shows.
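To make the idea concrete, here is a minimal base R sketch of what a virtual group amounts to: a list of row-index vectors kept alongside the untouched data. The names n, times, and virtual_rows are made up for illustration, and this imitates the concept rather than reproducing strapgod's actual implementation.

```r
# Illustrative sketch only: mimics the idea of virtual groups with base R.
# It is not how strapgod is implemented internally.
set.seed(123)

n     <- nrow(iris)  # 150 rows in the original data
times <- 10          # number of bootstrap resamples

# One integer vector of row indices per resample, drawn with replacement.
virtual_rows <- replicate(
  times,
  sample.int(n, size = n, replace = TRUE),
  simplify = FALSE
)

length(virtual_rows)   # number of virtual groups: 10
lengths(virtual_rows)  # each group indexes 150 rows
nrow(iris)             # the data itself is never copied: still 150
```

Storing only the indices is what keeps the object small: the data is held once, and each additional resample costs one integer vector.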

samplify() is the other function that can generate resampled tibbles. It is a slight generalization of bootstrapify() that also lets you specify the size of each resample and whether to resample with replacement.

iris_samp <- samplify(iris, times = 10, size = 20, replace = FALSE)

iris_samp
#> # A tibble: 150 x 5
#> # Groups:   .sample [10]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

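This result has 10 virtual groups, tracked in the .sample column, each a sample of 20 rows drawn without replacement. As before, nothing has been materialized: the tibble still shows the original 150 rows.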
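The sampling scheme requested here can be pictured with base R alone. The sketch below draws 10 index vectors of 20 rows each without replacement. sample_rows is a hypothetical name, and the code only imitates the idea rather than reproducing samplify() itself.

```r
# Illustrative sketch only: base R stand-in for the sampling that
# samplify(iris, times = 10, size = 20, replace = FALSE) asks for.
set.seed(123)

sample_rows <- replicate(
  10,
  sample.int(nrow(iris), size = 20, replace = FALSE),
  simplify = FALSE
)

lengths(sample_rows)  # 20 indices in each of the 10 samples

# Without replacement, no row can appear twice within a single sample.
any(vapply(sample_rows, anyDuplicated, integer(1)) > 0)  # FALSE
```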

Resampled summaries

What can you do with these neat resampled data frames? Great question! For one thing, you can summarise() the tibble to compute bootstrapped summaries quickly and efficiently.

# without the bootstrap
iris %>%
  summarise(
    mean_length = mean(Sepal.Length)
  )
#> # A tibble: 1 x 1
#>   mean_length
#>         <dbl>
#> 1        5.84

# with the bootstrap
iris %>%
  bootstrapify(10) %>%
  summarise(
    mean_length = mean(Sepal.Length)
  )
#> # A tibble: 10 x 2
#>    .bootstrap mean_length
#>         <int>       <dbl>
#>  1          1        5.90
#>  2          2        5.80
#>  3          3        5.80
#>  4          4        5.80
#>  5          5        5.81
#>  6          6        5.84
#>  7          7        5.79
#>  8          8        5.86
#>  9          9        5.86
#> 10         10        5.86

This makes it easy to compute bootstrapped estimates of individual statistics, along with the standard deviation of those estimates across resamples, which serves as a bootstrap estimate of the standard error.

iris %>%
  bootstrapify(10) %>%
  summarise(mean_length = mean(Sepal.Length)) %>%
  summarise(
    bootstrapped_mean = mean(mean_length),
    bootstrapped_sd   = sd(mean_length)
  )
#> # A tibble: 1 x 2
#>   bootstrapped_mean bootstrapped_sd
#>               <dbl>           <dbl>
#> 1              5.83          0.0592
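As a rough cross-check, the same two-step computation can be written in base R by resampling the column directly. This is a sketch, not strapgod's method: boot_means and analytic_se are hypothetical names, and because the random draws differ, the numbers will not match the output above exactly. The analytic standard error of the mean, sd(x) / sqrt(n), is included for comparison.

```r
# Illustrative sketch only: a materialized bootstrap of the mean in base R.
set.seed(123)

x <- iris$Sepal.Length

# Step 1: one mean per bootstrap resample (10 resamples, as in the example).
boot_means <- replicate(
  10,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Step 2: summarise the resampled means.
c(
  bootstrapped_mean = mean(boot_means),
  bootstrapped_sd   = sd(boot_means)
)

# Analytic standard error of the mean, for comparison (about 0.068).
analytic_se <- sd(x) / sqrt(length(x))
analytic_se
```

With only 10 resamples the bootstrapped standard deviation is itself quite noisy, so it should only roughly agree with the analytic value. Increasing the number of resamples tightens it.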

If you want, you can take an existing grouped data frame and bootstrapify that as well, which lets you compute bootstrapped statistics within each level of some other variable.

iris_group_strap <- iris %>%
  group_by(Species) %>%
  bootstrapify(100) 

iris_group_strap
#> # A tibble: 150 x 5
#> # Groups:   Species, .bootstrap [300]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

Reusing the code from above, we can now compute bootstrapped estimates for the mean Sepal.Length of each Species, along with standard deviations around those estimates.

iris_group_strap %>%
  summarise(mean_length = mean(Sepal.Length)) %>%
  summarise(
    bootstrapped_mean = mean(mean_length),
    bootstrapped_sd   = sd(mean_length)
  )
#> # A tibble: 3 x 3
#>   Species    bootstrapped_mean bootstrapped_sd
#>   <fct>                  <dbl>           <dbl>
#> 1 setosa                  5.00          0.0514
#> 2 versicolor              5.92          0.0685
#> 3 virginica               6.59          0.0969
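The grouped case amounts to resampling within each Species separately. A minimal base R sketch of that idea follows, with by_species, boot_means, and result as hypothetical names. It materializes every resample, so it illustrates the statistics rather than the virtual-group mechanism, and its numbers will differ slightly from the output above.

```r
# Illustrative sketch only: per-group bootstrap of the mean in base R.
set.seed(123)

# Split the column by species, so each resample stays within one species.
by_species <- split(iris$Sepal.Length, iris$Species)

# 100 bootstrapped means per species.
boot_means <- lapply(
  by_species,
  function(x) replicate(100, mean(sample(x, replace = TRUE)))
)

result <- data.frame(
  Species           = names(boot_means),
  bootstrapped_mean = vapply(boot_means, mean, numeric(1)),
  bootstrapped_sd   = vapply(boot_means, sd, numeric(1)),
  row.names         = NULL
)

result
```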

Understanding virtual groups

The virtual groups are stored in the group_data() metadata of the resampled_df object. Every grouped data frame carries this metadata, which dplyr uses internally to power the group_by() system.

group_data(iris_boot)
#> # A tibble: 10 x 2
#>    .bootstrap .rows      
#>         <int> <list>     
#>  1          1 <int [150]>
#>  2          2 <int [150]>
#>  3          3 <int [150]>
#>  4          4 <int [150]>
#>  5          5 <int [150]>
#>  6          6 <int [150]>
#>  7          7 <int [150]>
#>  8          8 <int [150]>
#>  9          9 <int [150]>
#> 10         10 <int [150]>

The .bootstrap column contains the unique values of the groups, and the .rows column is a list column in which each element is an integer vector of the row indices that belong to that group. So, for .bootstrap == 1, there is a vector of 150 integers identifying the rows of the original data that make up that resample. Because the sampling is done with replacement, the same index can appear more than once.

group_data(iris_boot)$.rows[[1]]
#>   [1]  44 119  62 133 142   7  80 134  83  69 144  69 102  86  16 135  37
#>  [18]   7  50 144 134 104  97 150  99 107  82  90  44  23 145 136 104 120
#>  [35]   4  72 114  33  48  35  22  63  63  56  23  21  35  70  40 129   7
#>  [52]  67 120  19  85  31  20 113 135  57 100  15  58  42 123  68 122 122
#>  [69] 120  66 114  95 107   1  72  34  57  92  53  17  37 101  63 119  16
#>  [86]  66 148 134 133  27  20  98  52  99  49  29 118  15  71  77  90  50
#> [103]  74 144  73 134 138  92  62  23 141  46  10 143 109  22  83 144  88
#> [120]  61  98  48  47  33  56 148  24  14  22 104  93 134 101 111  79  99
#> [137] 124 118 147  66  47  62   2  28 127  35  36  12  37 110
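For comparison, an ordinary grouped data frame exposes the same structure through dplyr's group_data(). The sketch below uses only dplyr, and species_groups is a hypothetical name. The difference from the bootstrap case is that here every row belongs to exactly one group, whereas the bootstrap index vectors can repeat and overlap.

```r
# Group metadata of a regular (non-virtual) grouped data frame.
library(dplyr)

species_groups <- group_data(group_by(iris, Species))

species_groups

# Each species owns 50 distinct rows of the original data.
lengths(species_groups$.rows)   # 50 50 50

# The first group (setosa) starts at the top of the data.
species_groups$.rows[[1]][1:5]  # 1 2 3 4 5
```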

When collect() is called, this row index information is used to construct the output. Starting from group_data(), the .rows indices are used to replicate the rows of the original data frame for each group, building up the complete resampled data frame. Notice how we now have the 150 * 10 = 1500 rows from the 10 bootstraps, and how the first rows of the output are rows 44, 119, and 62 of the original data, matching the first indices shown above.

collect(iris_boot)
#> # A tibble: 1,500 x 6
#> # Groups:   .bootstrap [10]
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          1          5           3.5          1.6         0.6 setosa    
#>  2          1          7.7         2.6          6.9         2.3 virginica 
#>  3          1          5.9         3            4.2         1.5 versicolor
#>  4          1          6.4         2.8          5.6         2.2 virginica 
#>  5          1          6.9         3.1          5.1         2.3 virginica 
#>  6          1          4.6         3.4          1.4         0.3 setosa    
#>  7          1          5.7         2.6          3.5         1   versicolor
#>  8          1          6.3         2.8          5.1         1.5 virginica 
#>  9          1          5.8         2.7          3.9         1.2 versicolor
#> 10          1          6.2         2.2          4.5         1.5 versicolor
#> # … with 1,490 more rows
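The materialization step can be sketched in base R as plain row indexing. This is an assumption-laden illustration, not the code behind collect(): virtual_rows, idx, and materialized are hypothetical names, the indices are drawn fresh rather than taken from iris_boot, and the .bootstrap column ends up last rather than first.

```r
# Illustrative sketch only: turning index vectors into a full resampled table.
set.seed(123)

# Stand-in for the .rows list column: 10 bootstrap index vectors.
virtual_rows <- replicate(
  10,
  sample.int(nrow(iris), replace = TRUE),
  simplify = FALSE
)

# Concatenate all indices and pull the matching rows in a single step.
idx          <- unlist(virtual_rows)
materialized <- iris[idx, ]

# Label each row with the resample it came from.
materialized$.bootstrap <- rep(
  seq_along(virtual_rows),
  times = lengths(virtual_rows)
)

nrow(materialized)  # 1500
```

The cost of materializing grows with the number of resamples, which is why deferring it until a function needs the full rows can save both time and memory.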

To learn more about collect() and the other dplyr functions supported by strapgod, see vignette("dplyr-support", "strapgod").