# Introduction

This vignette demonstrates how to write a Stan program that computes and stores the pointwise log-likelihood required for using the loo package. The other vignettes included with the package demonstrate additional functionality.

Some sections from this vignette are excerpted from our papers

• Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5), 1413–1432. :10.1007/s11222-016-9696-4. Links: published | arXiv preprint.

• Vehtari, A., Gelman, A., and Gabry, J. (2017). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.04544.

which provide important background for understanding the methods implemented in the package.

# Example: Well water in Bangladesh

This example comes from a survey of residents from a small area in Bangladesh that was affected by arsenic in drinking water. Respondents with elevated arsenic levels in their wells were asked if they were interested in getting water from a neighbor’s well, and a series of logistic regressions were fit to predict this binary response given various information about the households (Gelman and Hill, 2007). Here we fit a model for the well-switching response given two predictors: the arsenic level of the water in the resident’s home, and the distance of the house from the nearest safe well.

The sample size in this example is $$N=3020$$, which is not huge but is large enough that it is important to have a computational method for LOO that is fast for each data point. On the plus side, with such a large dataset, the influence of any given observation is small, and so the computations should be stable.

## Coding the Stan model

Here is the Stan code for fitting the logistic regression model, which we save in a file called logistic.stan:

data {
int<lower=0> N;             // number of data points
int<lower=0> P;             // number of predictors (including intercept)
matrix[N,P] X;              // predictors (including 1s for intercept)
int<lower=0,upper=1> y[N];  // binary outcome
}
parameters {
vector[P] beta;
}
model {
beta ~ normal(0, 1);
y ~ bernoulli_logit(X * beta);
}
generated quantities {
vector[N] log_lik;
for (n in 1:N) {
log_lik[n] = bernoulli_logit_lpmf(y[n] | X[n] * beta);
}
}

We have defined the log likelihood as a vector named log_lik in the generated quantities block so that the individual terms will be saved by Stan. After running Stan, log_lik can be extracted (using the extract_log_lik function provided in the loo package) as an $$S \times N$$ matrix, where $$S$$ is the number of simulations (posterior draws) and $$N$$ is the number of data points.

## Fitting the model with RStan

Next we fit the model in Stan using the rstan package:

library("rstan")

# Prepare data
url <- "http://stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat"
wells <- read.table(url)
wells$dist100 <- with(wells, dist / 100) X <- model.matrix(~ dist100 + arsenic, wells) standata <- list(y = wells$switch, X = X, N = nrow(X), P = ncol(X))

# Fit model
fit_1 <- stan("logistic.stan", data = standata)
print(fit_1, pars = "beta")
         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta[1]  0.00       0 0.08 -0.16 -0.05  0.00  0.05  0.15  1964    1
beta[2] -0.89       0 0.10 -1.09 -0.96 -0.89 -0.82 -0.68  2048    1
beta[3]  0.46       0 0.04  0.38  0.43  0.46  0.49  0.54  2198    1

## Computing approximate leave-one-out cross-validation using PSIS-LOO

We can then use the loo package to compute the efficient PSIS-LOO approximation to exact LOO-CV:

library("loo")

# Extract pointwise log-likelihood and compute LOO
log_lik_1 <- extract_log_lik(fit_1, merge_chains = FALSE)

# as of loo v2.0.0 we can optionally provide relative effective sample sizes
# when calling loo, which allows for better estimates of the PSIS effective
# sample sizes and Monte Carlo error
r_eff <- relative_eff(exp(log_lik_1))

loo_1 <- loo(log_lik_1, r_eff = r_eff, cores = 2)
print(loo_1)
Computed from 4000 by 3020 log-likelihood matrix

Estimate   SE
elpd_loo  -1968.5 15.6
p_loo         3.2  0.1
looic      3937.0 31.2
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The printed output from the loo function shows the estimates $$\widehat{\mbox{elpd}}_{\rm loo}$$ (expected log predictive density), $$\widehat{p}_{\rm loo}$$ (effective number of parameters), and $${\rm looic} =-2\, \widehat{\mbox{elpd}}_{\rm loo}$$ (the LOO information criterion).

The line at the bottom of the printed output provides information about the reliability of the LOO approximation (the interpretation of the $$k$$ parameter is explained in the PSIS-LOO section in help("loo-package") and in greater detail in Vehtari, Gelman, and Gabry (2017)). In this case the message tells us that all of the estimates for $$k$$ are fine.

## Comparing models

To compare this model to an alternative model for the same data we can use the compare function in the loo package. First we’ll fit a second model to the well-switching data, using log(arsenic) instead of arsenic as a predictor:

standata$X[, "arsenic"] <- log(standata$X[, "arsenic"])
fit_2 <- stan(fit = fit_1, data = standata)

log_lik_2 <- extract_log_lik(fit_2, merge_chains = FALSE)
r_eff_2 <- relative_eff(exp(log_lik_2))
loo_2 <- loo(log_lik_2, r_eff = r_eff_2, cores = 2)
print(loo_2)
Computed from 4000 by 3020 log-likelihood matrix

Estimate   SE
elpd_loo  -1952.3 16.2
p_loo         3.1  0.1
looic      3904.6 32.4
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

We can now compare the models on LOO using the compare function:

# Compare
comp <- compare(loo_1, loo_2)

This new object, comp, contains the estimated difference of expected leave-one-out prediction errors between the two models, along with the standard error:

print(comp)
elpd_diff        se
16.2       4.4 

The positive difference in elpd (and its scale relative to the standard error) indicates a preference for the second model.

# References

Gelman, A., and Hill, J. (2007). Data Analysis Using Regression and Multilevel Hierarchical Models. Cambridge University Press.

Stan Development Team (2017). The Stan C++ Library, Version 2.17.0. http://mc-stan.org

Stan Development Team (2018) RStan: the R interface to Stan, Version 2.17.3. http://mc-stan.org

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5), 1413–1432. :10.1007/s11222-016-9696-4. online, arXiv preprint arXiv:1507.04544.

Vehtari, A., Gelman, A., and Gabry, J. (2017). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646.