Support for dplyr

set.seed(123)
library(strapgod)
library(dplyr)

Introduction

As far as possible, strapgod lets you use any dplyr function on the resampled_df object returned by bootstrapify() and samplify(). A few functions, such as summarise(), have specialized behavior; most others simply call collect() to materialize the bootstrap rows and then pass the result on to the underlying dplyr function.

What follows is a list of the dplyr functions that have “special” properties when used on a resampled_df.

collect()

The most important dplyr function for strapgod is collect(). In dplyr, it is typically used to force the computation of a database query and return the results as a tibble, and it plays a similar role here: collect() forces the materialization of the virtual groups and returns the full grouped tibble.

x <- bootstrapify(iris, 10)

# Not materialized
x
#> # A tibble: 150 x 5
#> # Groups:   .bootstrap [10]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

# Materialized
collect(x)
#> # A tibble: 1,500 x 6
#> # Groups:   .bootstrap [10]
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          1          4.3         3            1.1         0.1 setosa    
#>  2          1          5           3.3          1.4         0.2 setosa    
#>  3          1          7.7         3.8          6.7         2.2 virginica 
#>  4          1          4.4         3.2          1.3         0.2 setosa    
#>  5          1          4.3         3            1.1         0.1 setosa    
#>  6          1          7.7         3.8          6.7         2.2 virginica 
#>  7          1          5.5         2.5          4           1.3 versicolor
#>  8          1          5.5         2.6          4.4         1.2 versicolor
#>  9          1          5.5         2.6          4.4         1.2 versicolor
#> 10          1          6.1         3            4.6         1.4 versicolor
#> # … with 1,490 more rows
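
To make "materialization" concrete, here is a minimal sketch in plain dplyr of what a materialized bootstrap looks like when built by hand. It is only an illustration of the idea, not strapgod's implementation, and it is not expected to draw the same rows as collect(x). The names times, resamples, and manual are made up for this example.

```r
library(dplyr)

set.seed(123)
times <- 10L
n <- nrow(iris)

# Draw one resample per iteration: n row indices sampled with replacement,
# tagged with the index of the bootstrap they belong to.
resamples <- lapply(seq_len(times), function(i) {
  idx <- sample.int(n, size = n, replace = TRUE)
  data.frame(.bootstrap = i, iris[idx, ], check.names = FALSE)
})

# Stack the resamples and group by the bootstrap index,
# mirroring the shape of the collected tibble shown above.
manual <- bind_rows(resamples) %>%
  as_tibble() %>%
  group_by(.bootstrap)

dim(manual)
n_groups(manual)
```

Building all of the rows up front like this is the cost that the virtual groups avoid until you ask for it.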

When calling collect() directly, there are two arguments available to extract extra information about the bootstraps.

id adds a sequence of integers from 1 to n within each bootstrap group, where n is the number of rows in that group. It is equivalent to adding row_number() by group after the collect(), but saves some typing.

collect(x, id = ".id")
#> # A tibble: 1,500 x 7
#> # Groups:   .bootstrap [10]
#>    .bootstrap   .id Sepal.Length Sepal.Width Petal.Length Petal.Width
#>         <int> <int>        <dbl>       <dbl>        <dbl>       <dbl>
#>  1          1     1          4.3         3            1.1         0.1
#>  2          1     2          5           3.3          1.4         0.2
#>  3          1     3          7.7         3.8          6.7         2.2
#>  4          1     4          4.4         3.2          1.3         0.2
#>  5          1     5          4.3         3            1.1         0.1
#>  6          1     6          7.7         3.8          6.7         2.2
#>  7          1     7          5.5         2.5          4           1.3
#>  8          1     8          5.5         2.6          4.4         1.2
#>  9          1     9          5.5         2.6          4.4         1.2
#> 10          1    10          6.1         3            4.6         1.4
#> # … with 1,490 more rows, and 1 more variable: Species <fct>
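
The equivalence with row_number() can be checked directly. The sketch below assumes the same setup as the rest of this article; with_arg and by_hand are names chosen only for illustration. The two results carry the same ids, although the .id column ends up in a different position.

```r
library(strapgod)
library(dplyr)

set.seed(123)
x <- bootstrapify(iris, 10)

# Let collect() add the within-group id ...
with_arg <- collect(x, id = ".id")

# ... or add it by hand on the materialized, still grouped, tibble.
by_hand <- collect(x) %>%
  mutate(.id = row_number())

# Compare the ids produced by the two approaches.
all(with_arg$.id == by_hand$.id)
```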

original_id adds the row number that each bootstrap observation had in the original data. It is generally more useful than id, because it lets you link the bootstrap rows back to the original data.

collect(x, original_id = ".original_id")
#> # A tibble: 1,500 x 7
#> # Groups:   .bootstrap [10]
#>    .bootstrap .original_id Sepal.Length Sepal.Width Petal.Length
#>         <int>        <int>        <dbl>       <dbl>        <dbl>
#>  1          1           14          4.3         3            1.1
#>  2          1           50          5           3.3          1.4
#>  3          1          118          7.7         3.8          6.7
#>  4          1           43          4.4         3.2          1.3
#>  5          1           14          4.3         3            1.1
#>  6          1          118          7.7         3.8          6.7
#>  7          1           90          5.5         2.5          4  
#>  8          1           91          5.5         2.6          4.4
#>  9          1           91          5.5         2.6          4.4
#> 10          1           92          6.1         3            4.6
#> # … with 1,490 more rows, and 2 more variables: Petal.Width <dbl>,
#> #   Species <fct>
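
As one possible use of original_id, the sketch below counts how often each original row was drawn within each bootstrap and then joins those counts back to the original data. The names drawn, draw_counts, iris_ids, and linked are hypothetical, and adding a row-number key to iris is an assumption made for this example.

```r
library(strapgod)
library(dplyr)

set.seed(123)
x <- bootstrapify(iris, 10)

drawn <- collect(x, original_id = ".original_id")

# `drawn` is grouped by .bootstrap, so this counts draws of each
# original row within each bootstrap.
draw_counts <- count(drawn, .original_id)

# Give the original data a matching key and link the counts back to it.
iris_ids <- mutate(as_tibble(iris), .original_id = row_number())
linked <- left_join(draw_counts, iris_ids, by = ".original_id")

linked
```

Rows of the original data that do not appear for a given bootstrap were simply never drawn in that resample.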

summarise()

The motivation for this package was summarise(). It efficiently computes the summary results, only materializing the bootstrap rows as they are needed at the C++ level.

summarise(x, mean_length = mean(Sepal.Length))
#> # A tibble: 10 x 2
#>    .bootstrap mean_length
#>         <int>       <dbl>
#>  1          1        5.77
#>  2          2        5.90
#>  3          3        5.92
#>  4          4        5.87
#>  5          5        5.74
#>  6          6        5.79
#>  7          7        5.79
#>  8          8        5.82
#>  9          9        5.82
#> 10         10        5.94
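
A typical next step is to summarise the bootstrap distribution itself. The sketch below is one way to do that, assuming a simple percentile interval is what you want; the number of resamples (2000) and the names x_many, boot_means, and boot_summary are choices made for this example only.

```r
library(strapgod)
library(dplyr)

set.seed(123)

# More resamples than the 10 used above, so the quantiles are less noisy.
x_many <- bootstrapify(iris, 2000)

# One mean per bootstrap.
boot_means <- summarise(x_many, mean_length = mean(Sepal.Length))

# Standard error and a 95% percentile interval of the bootstrapped means.
boot_summary <- summarise(
  boot_means,
  std_error = sd(mean_length),
  lower = unname(quantile(mean_length, 0.025)),
  upper = unname(quantile(mean_length, 0.975))
)

boot_summary
```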

You can group by other columns before creating the virtual groups, and the summarise() call will respect those extra groups alongside the bootstrap groups. Going from a non-bootstrapped version to a bootstrapped one takes just one extra line!

# Non-bootstrapped
iris %>%
  group_by(Species) %>%
  summarise(
    mean_length_across_species = mean(Sepal.Length)
  )
#> # A tibble: 3 x 2
#>   Species    mean_length_across_species
#>   <fct>                           <dbl>
#> 1 setosa                           5.01
#> 2 versicolor                       5.94
#> 3 virginica                        6.59

# Bootstrapped
iris %>%
  group_by(Species) %>%
  bootstrapify(5) %>%
  summarise(
    mean_length_across_species = mean(Sepal.Length)
  )
#> # A tibble: 15 x 3
#> # Groups:   Species [3]
#>    Species    .bootstrap mean_length_across_species
#>    <fct>           <int>                      <dbl>
#>  1 setosa              1                       5.05
#>  2 setosa              2                       5.02
#>  3 setosa              3                       5.08
#>  4 setosa              4                       5.00
#>  5 setosa              5                       4.94
#>  6 versicolor          1                       5.92
#>  7 versicolor          2                       5.96
#>  8 versicolor          3                       5.88
#>  9 versicolor          4                       5.95
#> 10 versicolor          5                       5.89
#> 11 virginica           1                       6.72
#> 12 virginica           2                       6.54
#> 13 virginica           3                       6.72
#> 14 virginica           4                       6.59
#> 15 virginica           5                       6.61

do()

Although dplyr::do() is effectively deprecated in favor of group_modify(), it is still occasionally useful. Like summarise(), do() materializes the groups only when they are required. Here we fit the same linear model to each bootstrapped set of data.

do(x, model = lm(Sepal.Length ~ Sepal.Width, data = .))
#> Source: local data frame [10 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 10 x 2
#>    .bootstrap model 
#>  *      <int> <list>
#>  1          1 <lm>  
#>  2          2 <lm>  
#>  3          3 <lm>  
#>  4          4 <lm>  
#>  5          5 <lm>  
#>  6          6 <lm>  
#>  7          7 <lm>  
#>  8          8 <lm>  
#>  9          9 <lm>  
#> 10         10 <lm>
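
The fitted models live in a list column, so you can pull quantities out of them with ordinary list tools. The sketch below extracts the Sepal.Width slope from each bootstrapped model using base R; models and slopes are names chosen for illustration.

```r
library(strapgod)
library(dplyr)

set.seed(123)
x <- bootstrapify(iris, 10)

models <- do(x, model = lm(Sepal.Length ~ Sepal.Width, data = .))

# One slope per bootstrap, taken from each fitted lm object.
slopes <- vapply(
  models$model,
  function(fit) coef(fit)[["Sepal.Width"]],
  numeric(1)
)

# Spread of the slope across bootstraps.
sd(slopes)
```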

group_nest()

group_nest() materializes the virtual groups and nests the data, so that the group columns (here .bootstrap) become columns in the outer tibble next to a list-column of nested tibbles.

group_nest(x)
#> # A tibble: 10 x 2
#>    .bootstrap data              
#>         <int> <list>            
#>  1          1 <tibble [150 × 5]>
#>  2          2 <tibble [150 × 5]>
#>  3          3 <tibble [150 × 5]>
#>  4          4 <tibble [150 × 5]>
#>  5          5 <tibble [150 × 5]>
#>  6          6 <tibble [150 × 5]>
#>  7          7 <tibble [150 × 5]>
#>  8          8 <tibble [150 × 5]>
#>  9          9 <tibble [150 × 5]>
#> 10         10 <tibble [150 × 5]>

You can set keep = TRUE to include the groups in the inner tibbles as well.

group_nest(x, keep = TRUE)$data[[1]]
#> # A tibble: 150 x 6
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          1          4.3         3            1.1         0.1 setosa    
#>  2          1          5           3.3          1.4         0.2 setosa    
#>  3          1          7.7         3.8          6.7         2.2 virginica 
#>  4          1          4.4         3.2          1.3         0.2 setosa    
#>  5          1          4.3         3            1.1         0.1 setosa    
#>  6          1          7.7         3.8          6.7         2.2 virginica 
#>  7          1          5.5         2.5          4           1.3 versicolor
#>  8          1          5.5         2.6          4.4         1.2 versicolor
#>  9          1          5.5         2.6          4.4         1.2 versicolor
#> 10          1          6.1         3            4.6         1.4 versicolor
#> # … with 140 more rows

group_split()

group_split() allows you to materialize all of the bootstrap tibbles into separate tibbles, all bundled together into a list.

group_split(x) %>% head(n = 3)
#> [[1]]
#> # A tibble: 150 x 6
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          1          4.3         3            1.1         0.1 setosa    
#>  2          1          5           3.3          1.4         0.2 setosa    
#>  3          1          7.7         3.8          6.7         2.2 virginica 
#>  4          1          4.4         3.2          1.3         0.2 setosa    
#>  5          1          4.3         3            1.1         0.1 setosa    
#>  6          1          7.7         3.8          6.7         2.2 virginica 
#>  7          1          5.5         2.5          4           1.3 versicolor
#>  8          1          5.5         2.6          4.4         1.2 versicolor
#>  9          1          5.5         2.6          4.4         1.2 versicolor
#> 10          1          6.1         3            4.6         1.4 versicolor
#> # … with 140 more rows
#> 
#> [[2]]
#> # A tibble: 150 x 6
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          2          5.8         2.8          5.1         2.4 virginica 
#>  2          2          4.9         3.1          1.5         0.1 setosa    
#>  3          2          6.9         3.1          4.9         1.5 versicolor
#>  4          2          5.5         2.3          4           1.3 versicolor
#>  5          2          4.9         2.5          4.5         1.7 virginica 
#>  6          2          5.4         3.7          1.5         0.2 setosa    
#>  7          2          4.8         3.4          1.9         0.2 setosa    
#>  8          2          6.4         3.2          4.5         1.5 versicolor
#>  9          2          5           3            1.6         0.2 setosa    
#> 10          2          6.1         2.6          5.6         1.4 virginica 
#> # … with 140 more rows
#> 
#> [[3]]
#> # A tibble: 150 x 6
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          3          6           3            4.8         1.8 virginica 
#>  2          3          7.3         2.9          6.3         1.8 virginica 
#>  3          3          5           3.4          1.5         0.2 setosa    
#>  4          3          5.7         2.5          5           2   virginica 
#>  5          3          5           3.6          1.4         0.2 setosa    
#>  6          3          5.2         3.4          1.4         0.2 setosa    
#>  7          3          5           3.3          1.4         0.2 setosa    
#>  8          3          5.6         2.5          3.9         1.1 versicolor
#>  9          3          6.1         2.8          4.7         1.2 versicolor
#> 10          3          5           3            1.6         0.2 setosa    
#> # … with 140 more rows

You can specify keep = FALSE to drop the bootstrap column from each tibble.

group_split(x, keep = FALSE) %>% head(n = 3)
#> [[1]]
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          4.3         3            1.1         0.1 setosa    
#>  2          5           3.3          1.4         0.2 setosa    
#>  3          7.7         3.8          6.7         2.2 virginica 
#>  4          4.4         3.2          1.3         0.2 setosa    
#>  5          4.3         3            1.1         0.1 setosa    
#>  6          7.7         3.8          6.7         2.2 virginica 
#>  7          5.5         2.5          4           1.3 versicolor
#>  8          5.5         2.6          4.4         1.2 versicolor
#>  9          5.5         2.6          4.4         1.2 versicolor
#> 10          6.1         3            4.6         1.4 versicolor
#> # … with 140 more rows
#> 
#> [[2]]
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          5.8         2.8          5.1         2.4 virginica 
#>  2          4.9         3.1          1.5         0.1 setosa    
#>  3          6.9         3.1          4.9         1.5 versicolor
#>  4          5.5         2.3          4           1.3 versicolor
#>  5          4.9         2.5          4.5         1.7 virginica 
#>  6          5.4         3.7          1.5         0.2 setosa    
#>  7          4.8         3.4          1.9         0.2 setosa    
#>  8          6.4         3.2          4.5         1.5 versicolor
#>  9          5           3            1.6         0.2 setosa    
#> 10          6.1         2.6          5.6         1.4 virginica 
#> # … with 140 more rows
#> 
#> [[3]]
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          6           3            4.8         1.8 virginica 
#>  2          7.3         2.9          6.3         1.8 virginica 
#>  3          5           3.4          1.5         0.2 setosa    
#>  4          5.7         2.5          5           2   virginica 
#>  5          5           3.6          1.4         0.2 setosa    
#>  6          5.2         3.4          1.4         0.2 setosa    
#>  7          5           3.3          1.4         0.2 setosa    
#>  8          5.6         2.5          3.9         1.1 versicolor
#>  9          6.1         2.8          4.7         1.2 versicolor
#> 10          5           3            1.6         0.2 setosa    
#> # … with 140 more rows

group_modify()

group_modify() is similar to do(), but (as of dplyr 0.8.0.1) always returns a data frame and gives you access to the non-group and group data separately.

# Just show the first 2 rows of each bootstrap
group_modify(x, ~head(.x, n = 2))
#> # A tibble: 20 x 6
#> # Groups:   .bootstrap [10]
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          1          4.3         3            1.1         0.1 setosa    
#>  2          1          5           3.3          1.4         0.2 setosa    
#>  3          2          5.8         2.8          5.1         2.4 virginica 
#>  4          2          4.9         3.1          1.5         0.1 setosa    
#>  5          3          6           3            4.8         1.8 virginica 
#>  6          3          7.3         2.9          6.3         1.8 virginica 
#>  7          4          5           3.4          1.5         0.2 setosa    
#>  8          4          5.9         3.2          4.8         1.8 versicolor
#>  9          5          5.4         3.4          1.7         0.2 setosa    
#> 10          5          5.2         3.4          1.4         0.2 setosa    
#> 11          6          5           2.3          3.3         1   versicolor
#> 12          6          7.7         2.6          6.9         2.3 virginica 
#> 13          7          5.9         3.2          4.8         1.8 versicolor
#> 14          7          6.5         3            5.5         1.8 virginica 
#> 15          8          6.8         3.2          5.9         2.3 virginica 
#> 16          8          4.9         2.5          4.5         1.7 virginica 
#> 17          9          5.8         2.8          5.1         2.4 virginica 
#> 18          9          6.3         2.5          4.9         1.5 versicolor
#> 19         10          4.6         3.4          1.4         0.3 setosa    
#> 20         10          4.8         3            1.4         0.1 setosa

# As you iterate through each group, you have access to that
# group's metadata through `.y` if you need it.
group_modify_group_data <- group_modify(x, ~tibble(.g = list(.y)))

group_modify_group_data
#> # A tibble: 10 x 2
#> # Groups:   .bootstrap [10]
#>    .bootstrap .g              
#>         <int> <list>          
#>  1          1 <tibble [1 × 1]>
#>  2          2 <tibble [1 × 1]>
#>  3          3 <tibble [1 × 1]>
#>  4          4 <tibble [1 × 1]>
#>  5          5 <tibble [1 × 1]>
#>  6          6 <tibble [1 × 1]>
#>  7          7 <tibble [1 × 1]>
#>  8          8 <tibble [1 × 1]>
#>  9          9 <tibble [1 × 1]>
#> 10         10 <tibble [1 × 1]>

group_modify_group_data$.g[[1]]
#> # A tibble: 1 x 1
#>   .bootstrap
#>        <int>
#> 1          1

Like do(), it can be a convenient way to run multiple models as long as you return a data frame from each one.

x %>%
  group_by(Species, add = TRUE) %>%
  group_modify(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))
#> # A tibble: 60 x 7
#> # Groups:   .bootstrap, Species [30]
#>    .bootstrap Species    term         estimate std.error statistic  p.value
#>         <int> <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1          1 setosa     (Intercept)    0.500     0.249      2.01  4.98e- 2
#>  2          1 setosa     Sepal.Length   0.190     0.0500     3.81  3.74e- 4
#>  3          1 versicolor (Intercept)    0.474     0.447      1.06  2.94e- 1
#>  4          1 versicolor Sepal.Length   0.641     0.0751     8.53  1.38e-11
#>  5          1 virginica  (Intercept)    1.16      0.451      2.57  1.42e- 2
#>  6          1 virginica  Sepal.Length   0.673     0.0680     9.90  3.39e-12
#>  7          2 setosa     (Intercept)    0.0527    0.496      0.106 9.16e- 1
#>  8          2 setosa     Sepal.Length   0.291     0.101      2.88  6.14e- 3
#>  9          2 versicolor (Intercept)    0.287     0.436      0.658 5.14e- 1
#> 10          2 versicolor Sepal.Length   0.663     0.0725     9.16  9.43e-12
#> # … with 50 more rows

ungroup()

ungroup() will return the original tibble back to you, without materializing the virtual groups.

ungroup(x)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

as_tibble()

As with ungroup(), you can get the original tibble back by converting explicitly with as_tibble().

as_tibble(x)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

Other dplyr functions

Most other dplyr functions work by first calling collect(), and then passing off to the underlying dplyr implementation. This means you can use mutate() like so:

mutate(x, mean = mean(Sepal.Length))
#> # A tibble: 1,500 x 7
#> # Groups:   .bootstrap [10]
#>    .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          1          4.3         3            1.1         0.1 setosa 
#>  2          1          5           3.3          1.4         0.2 setosa 
#>  3          1          7.7         3.8          6.7         2.2 virgin…
#>  4          1          4.4         3.2          1.3         0.2 setosa 
#>  5          1          4.3         3            1.1         0.1 setosa 
#>  6          1          7.7         3.8          6.7         2.2 virgin…
#>  7          1          5.5         2.5          4           1.3 versic…
#>  8          1          5.5         2.6          4.4         1.2 versic…
#>  9          1          5.5         2.6          4.4         1.2 versic…
#> 10          1          6.1         3            4.6         1.4 versic…
#> # … with 1,490 more rows, and 1 more variable: mean <dbl>

This does not gain you anything in terms of speed, but it is a convenient, automatic way to convert back to a tibble and continue with your workflow.