Variable Types

2018-06-28

All variables are generated via an appropriate transformation of standard normal variables, as described below.

1) Continuous Variables: Continuous variables are simulated using either Fleishman (1978)’s third-order (method = “Fleishman”) or Headrick (2002)’s fifth-order (method = “Polynomial”) power method transformation. This is a computationally efficient algorithm that simulates continuous distributions through the method of moments. It works by matching standardized cumulants – the first four (mean, variance, skew, and standardized kurtosis) for Fleishman’s third-order method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick’s fifth-order method. The transformation is expressed as follows: $\Large Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5,\ Z \sim iid\ N(0,1),$ where $$\Large c_{4}$$ and $$\Large c_{5}$$ both equal $$\Large 0$$ for Fleishman’s method. The real constants are calculated by find_constants, which solves the system of equations given in fleish or poly. All variables are simulated with mean $$\Large 0$$ and variance $$\Large 1$$, and then transformed to the specified mean and variance at the end.

1. The required parameters for simulating continuous variables include: mean, variance, skewness, standardized kurtosis (kurtosis - 3), and standardized fifth and sixth cumulants (for the fifth-order method). If the goal is to simulate a theoretical distribution (i.e. Gamma, Beta, Logistic, etc.), these values can be obtained using calc_theory. If the goal is to mimic an empirical data set, these values can be found using calc_moments (using the method of moments) or calc_fisherk (using Fisher’s k-statistics). If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)). Due to the nature of the integration involved in calc_theory, the results are approximations. Greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process. For example, in order to ensure that skew is exactly 0 for symmetric distributions.

2. For some sets of cumulants, it is either not possible to find power method constants or the calculated constants do not generate valid power method pdfs. In these situations, adding a value to the sixth cumulant may provide solutions (see find_constants). When using Headrick’s fifth-order approximation, if simulation results indicate that a continuous variable does not generate a valid pdf, the user can try find_constants with various sixth cumulant correction vectors to determine if a valid pdf can be found.

3. Choice of Fleishman’s or Headrick’s Method: Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis values, given skew and standardized fifth ($$\Large \gamma_{3}$$) and sixth ($$\Large \gamma_{4}$$) cumulants, is larger than with Fleishman’s method. For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of $$\Large \gamma_{3}^2/\gamma_{4} > 9/14$$ (see Headrick and Kowalchuk (2007)). This eliminates the Chi-squared family of distributions, which has a constant ratio of $$\Large \gamma_{3}^2/\gamma_{4} = 2/3$$. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.

4. SimMultiCorrData::Headrick.dist is Headrick’s Table 1 (2002, p. 691-2). This data.frame contains selected symmetrical and asymmetrical theoretical densities with their associated values of skewness ($$\Large \gamma_{1}$$), standardized kurtosis ($$\Large \gamma_{2}$$), and standardized fifth ($$\Large \gamma_{3}$$) and sixth ($$\Large \gamma_{4}$$) cumulants. Constants were calculated by Headrick using his fifth-order polynomial transformation. SimMultiCorrData::H_params provides the parameter inputs needed to calculate the standardized cumulants for the distributions in SimMultiCorrData::Headrick.dist using the function calc_theory.

2) Ordinal Variables: Ordinal variables ($$\Large r \ge 2$$ categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal cdf at the cumulative probabilities defined by each variable’s marginal distribution, as follows: Suppose the goal is to simulate $$\Large m$$ ordinal random variables for sample size $$\Large n$$. For $$\Large 1 \leq i \leq m$$, let $$\Large {P}_{i}(y)$$ be the probability mass function and $$\Large {F}_{i}(y)$$ the cumulative distribution function of variable $$\Large {Y}_{i}$$, with support $$\Large {\mathbb{Y}}_{i} = \{{y}_{i(1)},\ {y}_{i(2)},\ ...,\ {y}_{i(r_{i})}\}$$. Here, $$\Large r_{i}$$ is the number of categories for variable $$\Large {Y}_{i}$$. Note that these are the ordered possible values for $$\Large {Y}_{i}$$ with associated probabilities: $\Large {\mathbb{P}}_{i} = \{{P}_{i}({y}_{i(1)}),\ {P}_{i}({y}_{i(2)}),\ ...,\ {P}_{i}({y}_{i(r_{i})})\},$ where $$\Large {P}_{i}({y}_{i(r_{i})}) = 1-\sum_{h=1}^{r_{i}-1}{P}_{i}({y}_{i(h)})$$, and cumulative probabilities: $\Large {\mathbb{F}}_{i} = \{{F}_{i}({y}_{i(l)}) = \sum_{j=1}^{l}{P}_{i}({y}_{i(j)}),\ l = 1,\ ...,\ r_{i}-1\}.$ The final cumulative probability is obviously one. The corresponding standard normal quantiles for these are given by: $\Large {\mathbb{q}}_{i} = \{{q}_{i(l)} = {\Phi}^{-1}({F}_{i}({y}_{i(l)})) = {\Phi}^{-1}(\sum_{j=1}^{l}{P}_{i}({y}_{i(j)})),\ l = 1,\ ...,\ r_{i}-1\}.$ If the support for variable $$\Large {Y}_{i}$$ is not provided, the default is to use $$\Large \{1,\ 2,\ ...,\ r_{i}\}$$. The $$\Large m$$ random normal variates $$\Large {Z}_{1},\ {Z}_{2},\ ...,\ {Z}_{m}$$ are discretized using the cumulative probabilities and corresponding normal quantiles found above: $\Large if\ {Z}_{i} < {q}_{i(1)} \rightarrow {Y}_{i} = {y}_{i(1)},$ $\Large if\ {q}_{i(1)} \leq {Z}_{i} < {q}_{i(2)} \rightarrow {Y}_{i} = {y}_{i(2)},\ ...,\$ $\Large if\ {q}_{i(r_{i})} \leq {Z}_{i} \rightarrow {Y}_{i} = {y}_{i(r_{i})}.$

Required parameters: For ordinal variables, these are the cumulative marginal probabilities and support values (if desired). The probabilities should be combined into a list of length equal to the number of ordinal variables. The $$\Large i^{th}$$ element is a vector of the cumulative probabilities defining the marginal distribution of the $$\Large i^{th}$$ variable. If the variable can take $$\Large r$$ values, the vector will contain $$\Large r - 1$$ probabilities (the $$\Large r^{th}$$ is assumed to be $$\Large 1$$). Note that for binary variables, the user-supplied probability should be the probability of the $$\Large 1^{st}$$ (lower) support value. This would ordinarily be considered the probability of failure ($$\Large q$$), while the probability of the $$\Large 2^{nd}$$ (upper) support value would be considered the probability of success ($$\Large p = 1 - q$$). The support values should be combined into a separate list. The $$\Large i^{th}$$ element is a vector containing the $$\Large r$$ ordered support values.

3) Poisson and Negative Binomial Variables: Count variables are generated using the inverse cdf method. The cumulative distribution function of a standard normal variable has a uniform distribution. The appropriate quantile function $$\Large F_{Y}^{-1}$$ is applied to this uniform variable with the designated parameters to generate the count variable: $$\Large Y = F_{y}^{-1}(\Phi(Z))$$.

Required parameters: For Poisson variables, the lambda (mean) value should be given (see stats::qpois). For Negative Binomial variables, the size (target number of successes) and either the success probability or the mean should be given (see see stats::qnbinom). The Negative Binomial variable represents the number of failures which occur in a sequence of Bernoulli trials before the target number of successes is achieved. For Correlation Method 2, a vector of total cumulative probability truncation values should be provided (one for Poisson and one for Negative Binomial). These values represent the amount of probability removed from the range of the cdf’s $$\Large F_{Y}$$ when creating finite supports. The value may vary by variable, but a good default value is $$\Large 0.0001$$ (suggested by Barbiero and Ferrari (2015)).