Logistic Regression with Missing Covariates

Wei Jiang

2019-01-15

Introduction of misaem

misaem is a package to apply statistical inference for logistic regression model with missing data. This methodology is based on likelihood, including:

  1. A stochastic approximation version of EM algorithm based on Metropolis-Hasting sampling, to estimate the parameters of logistic regression;
  2. Estimation of parameters’ variance based one Louis formula;
  3. Model selection procedure based on BIC.

Synthetic dataset

We first generate a design matrix of size \(N=500\) times \(p=5\) by drawing each observation from a multivariate normal distribution \(\mathcal{N}(\mu, \Sigma)\). Then, we generate the response according to the logistic regression model.

We consider as the true values for the parameters \[\begin{equation*} \begin{split} \beta &= (0, 1, -1, 1, 0, -1),\\ \mu &= (1,2,3,4,5),\\ \Sigma &= \text{diag}(\sigma)C \text{diag}(\sigma), \end{split} \end{equation*}\] where the \(\sigma\) is the vector of standard deviations \[\sigma=(1,2,3,4,5)\]
and \(C\) the correlation matrix \[C = \begin{bmatrix} 1 & 0.8 & 0 & 0 & 0\\ 0.8 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0.3 & 0.6\\ 0 & 0 & 0.3 & 1 & 0.7\\ 0 & 0 & 0.6 & 0.7 & 1\\ \end{bmatrix}.\]

# Generate dataset
set.seed(200)
N <- 500  # number of subjects
p <- 5     # number of explanatory variables
mu.star <- 1:p  #rep(0,p)  # mean of the explanatory variables
sd <- 1:p # rep(1,p) # standard deviations
C <- matrix(c(   # correlation matrix
1,   0.8, 0,   0,   0,
0.8, 1,   0,   0,   0,
0,   0,   1,   0.3, 0.6,
0,   0,   0.3, 1,   0.7,
0,   0,   0.6, 0.7, 1), nrow=p)
Sigma.star <- diag(sd)%*%C%*%diag(sd) # covariance matrix
beta.star <- c(1, -1, 1, 0, -1) # coefficients
beta0.star <- 0  # intercept
beta.true = c(beta0.star,beta.star)

# Design matrix
X.complete <- matrix(rnorm(N*p), nrow=N)%*%chol(Sigma.star)+
              matrix(rep(mu.star,N), nrow=N, byrow = TRUE)

# Reponse vector
p1 <- 1/(1+exp(-X.complete%*%beta.star-beta0.star))
y <- as.numeric(runif(N)<p1)

Then we randomly introduced 10% of missing values in the covariates according to the MCAR (Missing completely at random) mechanism.

# Generate missingness
set.seed(200)
p.miss <- 0.10
patterns <- runif(N*p)<p.miss # missing completely at random
X.obs <- X.complete
X.obs[patterns] <- NA

Have a look at our synthetic dataset:

head(X.obs)
##           [,1]        [,2]      [,3]         [,4]        [,5]
## [1,] 1.0847563  1.71119812 5.0779956  9.731254821 13.02285225
## [2,] 1.2264603  0.04664033 5.3758000  6.383093558  4.84730504
## [3,] 1.4325565  1.77934455        NA  8.421927692  7.26902254
## [4,] 1.5580652  5.69782193 5.5942869 -0.440749372 -0.96662931
## [5,] 1.0597553 -0.38470918 0.4462986  0.008402997  0.04745022
## [6,] 0.8853591  0.56839374 3.4641522  7.047389616          NA

Main usage: Estimation for logistic regression with missingness

The main function in our package is miss.saem function, which returns the estimation of parameters for logistic regression with missingness. Here we apply this function with its default options.

#Charge library
library(misaem)

# SAEM
list.saem = miss.saem(X.obs,y)
## iteration = 10 beta = 0.07193013 0.8621059 -0.9593602 1.188501 -0.00691982 -1.115174 
## Distance from last iteration = 0.002800532 
## iteration = 20 beta = -0.03136551 1.256794 -1.12295 1.180936 -0.009885152 -1.114326 
## Distance from last iteration = 0.0176632 
## iteration = 30 beta = 0.02075442 1.16449 -1.021039 1.162069 -0.01682317 -1.099308 
## Distance from last iteration = 0.1218524 
## iteration = 40 beta = 0.1477152 1.073935 -1.095803 1.173065 -0.02928672 -1.099446 
## Distance from last iteration = 0.1414668 
## iteration = 50 beta = 0.1205986 1.113298 -0.9688564 1.041548 -0.06052164 -0.9799622 
## Distance from last iteration = 0.1737565 
## iteration = 60 beta = 0.1129732 1.013455 -0.9389359 1.027017 -0.05479268 -0.9524152 
## Distance from last iteration = 3.076897e-05 
## iteration = 70 beta = 0.07847484 1.023609 -0.9626167 1.038256 -0.04132143 -0.9646895 
## Distance from last iteration = 0.0002820075 
## iteration = 80 beta = 0.07095625 1.070815 -0.9922385 1.053396 -0.04173298 -0.980045 
## Distance from last iteration = 5.67119e-05 
## iteration = 90 beta = 0.07070879 1.089326 -1.010678 1.072014 -0.03172497 -1.007943 
## Distance from last iteration = 7.494584e-05 
## iteration = 100 beta = 0.06932229 1.107081 -1.022656 1.081968 -0.03006122 -1.017227 
## Distance from last iteration = 1.828823e-05 
## iteration = 110 beta = 0.06329327 1.116114 -1.028553 1.089239 -0.0295718 -1.025201 
## Distance from last iteration = 1.094912e-06 
## iteration = 120 beta = 0.06704591 1.124846 -1.035163 1.090537 -0.02853558 -1.026653 
## Distance from last iteration = 1.048773e-05 
## iteration = 130 beta = 0.06694411 1.125318 -1.033447 1.086222 -0.02904691 -1.024076 
## Distance from last iteration = 1.398283e-05 
## iteration = 140 beta = 0.06392523 1.117632 -1.02875 1.085534 -0.02812343 -1.024316 
## Distance from last iteration = 2.058142e-05 
## iteration = 150 beta = 0.05796131 1.11061 -1.026311 1.085827 -0.02504412 -1.025536 
## Distance from last iteration = 1.077455e-05 
## iteration = 160 beta = 0.06594651 1.109383 -1.027915 1.085974 -0.0271385 -1.024832 
## Distance from last iteration = 2.593489e-06 
## iteration = 170 beta = 0.07014832 1.114645 -1.031966 1.086403 -0.02695946 -1.025621 
## Distance from last iteration = 5.779123e-06 
## iteration = 180 beta = 0.0696883 1.118459 -1.033234 1.085297 -0.02696528 -1.02516 
## Distance from last iteration = 4.383226e-06 
## iteration = 190 beta = 0.06977064 1.120923 -1.035369 1.086957 -0.02654733 -1.02698 
## Distance from last iteration = 2.213467e-06 
## iteration = 200 beta = 0.07263775 1.126912 -1.039287 1.087265 -0.02704987 -1.027656 
## Distance from last iteration = 7.281153e-07
print(list.saem$beta)
## [1]  0.07295149  1.12523513 -1.03819294  1.08639345 -0.02713150 -1.02667062

And if you need to obtain the variance of estimation:

# SAEM with variance estimation
list.saem = miss.saem(X.obs,y,var_cal = TRUE)
## iteration = 10 beta = 0.07193013 0.8621059 -0.9593602 1.188501 -0.00691982 -1.115174 
## Distance from last iteration = 0.002800532 
## iteration = 20 beta = -0.03136551 1.256794 -1.12295 1.180936 -0.009885152 -1.114326 
## Distance from last iteration = 0.0176632 
## iteration = 30 beta = 0.02075442 1.16449 -1.021039 1.162069 -0.01682317 -1.099308 
## Distance from last iteration = 0.1218524 
## iteration = 40 beta = 0.1477152 1.073935 -1.095803 1.173065 -0.02928672 -1.099446 
## Distance from last iteration = 0.1414668 
## iteration = 50 beta = 0.1205986 1.113298 -0.9688564 1.041548 -0.06052164 -0.9799622 
## Distance from last iteration = 0.1737565 
## iteration = 60 beta = 0.1129732 1.013455 -0.9389359 1.027017 -0.05479268 -0.9524152 
## Distance from last iteration = 3.076897e-05 
## iteration = 70 beta = 0.07847484 1.023609 -0.9626167 1.038256 -0.04132143 -0.9646895 
## Distance from last iteration = 0.0002820075 
## iteration = 80 beta = 0.07095625 1.070815 -0.9922385 1.053396 -0.04173298 -0.980045 
## Distance from last iteration = 5.67119e-05 
## iteration = 90 beta = 0.07070879 1.089326 -1.010678 1.072014 -0.03172497 -1.007943 
## Distance from last iteration = 7.494584e-05 
## iteration = 100 beta = 0.06932229 1.107081 -1.022656 1.081968 -0.03006122 -1.017227 
## Distance from last iteration = 1.828823e-05 
## iteration = 110 beta = 0.06329327 1.116114 -1.028553 1.089239 -0.0295718 -1.025201 
## Distance from last iteration = 1.094912e-06 
## iteration = 120 beta = 0.06704591 1.124846 -1.035163 1.090537 -0.02853558 -1.026653 
## Distance from last iteration = 1.048773e-05 
## iteration = 130 beta = 0.06694411 1.125318 -1.033447 1.086222 -0.02904691 -1.024076 
## Distance from last iteration = 1.398283e-05 
## iteration = 140 beta = 0.06392523 1.117632 -1.02875 1.085534 -0.02812343 -1.024316 
## Distance from last iteration = 2.058142e-05 
## iteration = 150 beta = 0.05796131 1.11061 -1.026311 1.085827 -0.02504412 -1.025536 
## Distance from last iteration = 1.077455e-05 
## iteration = 160 beta = 0.06594651 1.109383 -1.027915 1.085974 -0.0271385 -1.024832 
## Distance from last iteration = 2.593489e-06 
## iteration = 170 beta = 0.07014832 1.114645 -1.031966 1.086403 -0.02695946 -1.025621 
## Distance from last iteration = 5.779123e-06 
## iteration = 180 beta = 0.0696883 1.118459 -1.033234 1.085297 -0.02696528 -1.02516 
## Distance from last iteration = 4.383226e-06 
## iteration = 190 beta = 0.06977064 1.120923 -1.035369 1.086957 -0.02654733 -1.02698 
## Distance from last iteration = 2.213467e-06 
## iteration = 200 beta = 0.07263775 1.126912 -1.039287 1.087265 -0.02704987 -1.027656 
## Distance from last iteration = 7.281153e-07
print(list.saem$beta)
## [1]  0.07295149  1.12523513 -1.03819294  1.08639345 -0.02713150 -1.02667062
print(list.saem$var_obs)
##               [,1]        [,2]          [,3]         [,4]          [,5]
## [1,]  0.1045912507 -0.01640825 -8.015102e-03 -0.007844460 -6.446553e-03
## [2,] -0.0164082517  0.14023540 -6.540748e-02  0.011371833 -1.119750e-03
## [3,] -0.0080151020 -0.06540748  4.229590e-02 -0.012590862  8.142047e-05
## [4,] -0.0078444600  0.01137183 -1.259086e-02  0.020041181  2.209681e-03
## [5,] -0.0064465534 -0.00111975  8.142047e-05  0.002209681  4.480773e-03
## [6,]  0.0005674298 -0.01229699  1.228287e-02 -0.017117600 -4.048477e-03
##               [,6]
## [1,]  0.0005674298
## [2,] -0.0122969901
## [3,]  0.0122828659
## [4,] -0.0171176002
## [5,] -0.0040484767
## [6,]  0.0185079131

Model selection with missing values

To perform model selection with missing values, we adapt criterion BIC and step-wise method. The function model_selection will return the index of variables included in the best model selected, and also the estimates for the best model. Pay attention that here the dimension of dataset should be less than 20 to avoid blocking your computer.

# model selection for SAEM
list.saem.select = model_selection(X.obs,y)
## iteration = 10 beta = 0.03693459 0.9922979 -0.9512952 1.089339 0 -1.028696 
## Distance from last iteration = 0.0005214435 
## iteration = 20 beta = 0.02679083 1.016159 -0.9809538 1.108529 0 -1.053341 
## Distance from last iteration = 0.0001119514 
## iteration = 30 beta = 0.00634951 1.026026 -0.9844437 1.112693 0 -1.056097 
## Distance from last iteration = 8.295187e-05 
## iteration = 40 beta = 0.03240261 1.067195 -1.017125 1.128108 0 -1.073618 
## Distance from last iteration = 3.896732e-06 
## iteration = 50 beta = 0.0346749 1.056908 -1.010936 1.125084 0 -1.072799 
## Distance from last iteration = 1.076438e-05 
## iteration = 60 beta = 0.03256774 1.046976 -1.00047 1.115547 0 -1.062582 
## Distance from last iteration = 7.940021e-06 
## iteration = 70 beta = 0.02659178 1.046861 -1.000536 1.109275 0 -1.055039 
## Distance from last iteration = 1.453078e-05 
## iteration = 80 beta = 0.02337511 1.058919 -1.005586 1.108637 0 -1.054648 
## Distance from last iteration = 5.774576e-06 
## iteration = 90 beta = 0.02881635 1.068504 -1.012951 1.112225 0 -1.060291 
## Distance from last iteration = 1.514593e-05 
## iteration = 100 beta = 0.02948875 1.076905 -1.017599 1.114678 0 -1.062922 
## Distance from last iteration = 4.426403e-06 
## iteration = 110 beta = 0.02584919 1.08304 -1.020688 1.116973 0 -1.065361 
## Distance from last iteration = 4.719359e-07 
## iteration = 120 beta = 0.02857304 1.088741 -1.024188 1.116082 0 -1.064399 
## Distance from last iteration = 2.546118e-06 
## iteration = 130 beta = 0.02843074 1.090349 -1.023225 1.112404 0 -1.062091 
## Distance from last iteration = 4.238248e-06 
## iteration = 140 beta = 0.02703252 1.088101 -1.021314 1.111167 0 -1.061152 
## Distance from last iteration = 6.07387e-06 
## iteration = 150 beta = 0.02457677 1.085172 -1.020317 1.110807 0 -1.060302 
## Distance from last iteration = 5.800769e-06
print(list.saem.select$subset_choose)
## [1] 1 2 3 5
print(list.saem.select$beta)
## [1]  0.02480092  1.08537618 -1.02082832  1.11083772  0.00000000 -1.06022232

Prediction on test set

In order to evaluate the prediction performance, we generate a test set of size \(Nt=100\) times \(p=5\) follow the same distribution as previous design matrix, and also with 10% of missing values. Given the real value of response according to the logistic regression model, we can evaluate the prediction results by a confusion matrix.

# Generate test set with missingness
set.seed(200)
Nt = 100
X.test <- matrix(rnorm(Nt*p), nrow=Nt)%*%chol(Sigma.star)+
          matrix(rep(mu.star,Nt), nrow=Nt, byrow = TRUE)

# Real value for response of test set
p1 <- 1/(1+exp(-X.test%*%beta.star-beta0.star))
y.test <- as.numeric(runif(Nt)<p1)

# Generate missingness on test set
p.miss <- 0.10
X.test[runif(Nt*p)<p.miss] <- NA

# Prediction on test set
pr.saem <- pred_saem(X.test, list.saem.select$beta, 
                     list.saem.select$mu, list.saem.select$sig2)

# Confusion matrix
pred.saem = (pr.saem>0.5)*1
table(y.test,pred.saem )
##       pred.saem
## y.test  0  1
##      0 66  7
##      1  8 19

Reference

Logistic Regression with Missing Covariates – Parameter Estimation, Model Selection and Prediction (2018, Jiang W., Josse J., Lavielle M., Traumabase Group), arXiv:1805.04602.