Overview
For a recent project, I needed to repeatedly estimate means and covariances for many items across many rather large datasets (well, large as far as Psychology/Health studies usually go). In addition, I needed something that would do a decent job of appropriately handling missing data and would finish rather quickly. And I needed the code in R, ASAP.
Aren’t there already R packages that could do this? I thought so and I tried a few, including lavaan, norm, and norm2. In the end, modifying code from a project with Joshua Starr led to the EMgaussian package, which as a byproduct included code for estimating means and covariances using the EM algorithm. Although the implemented approach carried with it assumptions regarding multivariate normality, it would probably get the job done.
This post documents some tests with EMgaussian. In brief, I found that for my particular project, it was much faster than lavaan, but this probably depends on how many patterns of missing data there are: lavaan seems fast if there are few missing data patterns, while EMgaussian may be faster if there are many. And norm and norm2 are not great, but for different reasons.
Load Packages
# packages used for handling missing data or visualizing it
library(EMgaussian)
library(lavaan)
library(norm)
library(norm2)
library(mvnmle)
#library(rrcovNA) # well, this one actually just calls em.norm
library(mice)
library(visdat)
# timings
library(tictoc)
library(rbenchmark)
# a little data manipulation
library(dplyr)
Quick Illustration
Here I grab a dataset that I know has some missing values, but not that many: the Big Five Inventory (IPIP) from the psych package, which has 2800 respondents and 25 Likert-type items with something like 6 response options each.
library(psych)
bfi.sub <- bfi[,1:25]
dim(bfi.sub)
## [1] 2800 25
There’s not much missing data; if we do listwise deletion (remove anyone with a missing value), we end up losing 364 participants. But, it should work as a quick illustration to start off.
vis_miss(bfi.sub)
dim(na.omit(bfi.sub))
## [1] 2436 25
Comparison of Methods
Here, we compare several approaches.
- lavaan can compute means and covariances using what structural equation modelers call “full-information maximum likelihood” (FIML).
  - Known as direct maximum likelihood by others.
  - I thought lavaan used to use the EM algorithm inside the function below, but I’m not sure that it still does.
- norm is the first package I learned that could compute the EM means and covariances.
  - But, apparently it is now discouraged in favor of norm2, and there are warnings about how it may not be able to reliably handle more than 30 variables or so.
- norm2 is supposed to be better than norm, but as of this writing was booted off of CRAN.
- mvnmle also apparently has an implementation.
- EMgaussian is my own implementation of the EM algorithm for computing means and covariances (a minimal sketch of the general algorithm follows this list).
  - The package is available on CRAN.
  - It was initially based on work from Städler, N., & Bühlmann, P. (2012).
- rrcovNA?
  - Well, it turns out that it actually just calls em.norm from the norm package, so I did not study it.
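Since several of these packages implement the same basic algorithm, here is a minimal, unoptimized sketch of EM for a multivariate normal mean and covariance. The function em_mvn and all of its details are my own illustration, not EMgaussian’s actual implementation (which is largely in C++ and handles starting values and convergence differently):

# Minimal EM sketch for a multivariate normal mean/covariance with
# missing data; illustration only, not the EMgaussian implementation
em_mvn <- function(Y, maxit = 200, tol = 1e-6) {
  Y <- as.matrix(Y)
  n <- nrow(Y); p <- ncol(Y)
  mu <- colMeans(Y, na.rm = TRUE)            # available-case starting values
  S <- diag(apply(Y, 2, var, na.rm = TRUE))
  for (it in seq_len(maxit)) {
    sumx <- numeric(p); sumxx <- matrix(0, p, p)
    for (i in seq_len(n)) {
      o <- !is.na(Y[i, ]); m <- !o
      xi <- Y[i, ]; Ci <- matrix(0, p, p)
      if (any(m)) {
        # E-step: conditional mean and covariance of missing given observed
        B <- S[m, o, drop = FALSE] %*% solve(S[o, o, drop = FALSE])
        xi[m] <- mu[m] + B %*% (Y[i, o] - mu[o])
        Ci[m, m] <- S[m, m, drop = FALSE] - B %*% S[o, m, drop = FALSE]
      }
      sumx <- sumx + xi
      sumxx <- sumxx + tcrossprod(xi) + Ci
    }
    # M-step: update mean and ML (divide-by-n) covariance
    mu.new <- sumx / n
    S.new <- sumxx / n - tcrossprod(mu.new)
    converged <- max(abs(S.new - S)) < tol && max(abs(mu.new - mu)) < tol
    mu <- mu.new; S <- S.new
    if (converged) break
  }
  list(mu = mu, S = S)
}

Looping over individual rows like this is also the inefficiency discussed under Estimation Speed below: solve() runs once per row rather than once per missing data pattern.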
Same Answer?
Obtain estimates for all 5 methods.
# EM gaussian
em.res <- em.cov(bfi.sub)
# lavaan
lav.res <- lavCor(bfi.sub, missing="fiml", output="sampstat")
# norm
s <- prelim.norm(as.matrix(bfi.sub))
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)
# norm2
norm2.res <- emNorm(bfi.sub)
# mvnmle
mvnmle.res <- mlest(bfi.sub)
What is the max absolute difference for any element of the covariance matrix? It seems all methods do yield about the same answer for the covariance matrix.
EMgaussian vs. lavaan:
max(abs(em.res$S - lav.res$cov))
## [1] 5.469403e-12
EMgaussian vs. norm:
max(abs(em.res$S - norm.res$sigma))
## [1] 1.350084e-06
EMgaussian vs. norm2:
max(abs(em.res$S - norm2.res$param$sigma))
## [1] 8.856597e-08
EMgaussian vs. mvnmle:
max(abs(em.res$S - mvnmle.res$sigmahat))
## [1] 1.298663e-05
How Long?
Well, this uses the default optimization criteria for each package. Differences in convergence tolerance could lead to different speeds of convergence, and those defaults differ across estimation approaches.
benchmark(
"EMgaussian" = {em.res <- em.cov(bfi.sub)},
"lavaan" = {lav.res <- lavCor(bfi.sub, missing="fiml", output="sampstat")},
"norm" = {s <- prelim.norm(as.matrix(bfi.sub))
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)},
"norm2" = {norm2.res <- emNorm(bfi.sub)},
"mvnmle" = {mvnmle.res <- mlest(bfi.sub)},
replications = 50
)
## test replications elapsed relative user.self sys.self user.child
## 1 EMgaussian 50 1.03 2.641 0.97 0.04 NA
## 2 lavaan 50 4.81 12.333 4.47 0.28 NA
## 5 mvnmle 50 572.74 1468.564 509.39 58.26 NA
## 3 norm 50 0.39 1.000 0.32 0.10 NA
## 4 norm2 50 49.86 127.846 49.53 0.05 NA
## sys.child
## 1 NA
## 2 NA
## 5 NA
## 3 NA
## 4 NA
Here, EMgaussian doesn’t do too badly. norm is fast, but we’ll see later that it breaks with other datasets. norm2 is quite slow. lavaan is just a bit slower than EMgaussian, but not too bad. mvnmle was slow enough that I decided not to study it any more!
When is lavaan Faster than EMgaussian?
… and vice versa … and when do norm and norm2 fail?
Here, I take a different dataset and purposely delete values from it. This allows a benchmark for performance: do any of the methods come close to the result based on complete data? It also helps illustrate how the number of missing data patterns apparently affects computational time.
The Data
This is not the actual dataset I was using for the project, but it provides something comparable. The spi dataset contains 4000 respondents and 135 Likert-type items with 6 response options per item. It apparently does not contain any missing values.
# load data
library(psychTools)
dat.sub <- spi %>% dplyr::select(starts_with("q_"))
nobs <- nrow(dat.sub)*ncol(dat.sub)
dim(dat.sub)
## [1] 4000 135
Here, I save the true means and covariances from this complete set of data. This serves as a benchmark for the missing data methods.
# True means and covariances - before deleting missing values
truemeans <- colMeans(dat.sub)
truecov <- cov(dat.sub)
Few Missing Data Patterns
In one set of tests, I take inspiration from this paper by Xijuan Zhang on how to create missing data for simulation studies. Here, there are three conditioning variables that predict missingness on three subsets of the items. Technically this results in data that are Missing at Random (MAR).
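In symbols, each pattern’s row-level probability of missingness is an inverse-logit function of a single conditioning item, \(p = 1/(1 + e^{-(\beta_0 + x)})\); the intercepts \(\beta_0\) differ across the three patterns below so as to hit a rough target missingness rate.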
# Create MAR data
# Probability of missingness for each row, based on each of three variables
# The functions here are essentially that of a logistic regression
p.pat1 <- 1/(1+exp(-(-7 + .75 + dat.sub$q_253)))
p.pat2 <- 1/(1+exp(-(-5 + 1.75 + dat.sub$q_952)))
p.pat3 <- 1/(1+exp(-(-8 + 2.75 + dat.sub$q_1904)))
# We end up with a ballpark of around .3 missing rate
#mean(p.pat1)
#mean(p.pat2)
#mean(p.pat3)
# Create a copy of the dataset, and create missing values for three subsets of items
set.seed(1234)
dat.miss <- dat.sub
dat.miss[runif(nrow(dat.sub)) < p.pat1, 4:50] <- NA
dat.miss[runif(nrow(dat.sub)) < p.pat2, 51:98] <- NA
dat.miss[runif(nrow(dat.sub)) < p.pat3, 99:135] <- NA
But, there are relatively few possible patterns of missing data (\(2^3 = 8\) combinations, actually). We end up with about this proportion of missing data in total:
# Runs, but the plot not really readable
#md.pattern(dat.miss)
# a little more readable
vis_miss(dat.miss)
# proportion of missing data
sum(is.na(dat.miss))/nobs
## [1] 0.2945833
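As a quick check, we can count the distinct row-wise missing data patterns directly (using pattern strings as keys; this one-liner is my own check, not part of any of the packages):

# count distinct row-wise missing data patterns (should be at most 2^3 = 8)
length(unique(apply(is.na(dat.miss), 1, paste, collapse = "")))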
Comparison
Here are timings as well as comparisons of each approach with the true means and covariances.
First, timing how long each takes:
EMgaussian
# EM algorithm
tic()
em.res <- em.cov(dat.miss)
toc()
## 29.62 sec elapsed
lavaan
# lavaan
tic()
lav.res <- lavCor(dat.miss, missing="fiml", output="sampstat")
toc()
## 7.86 sec elapsed
norm
# norm
tic()
s <- prelim.norm(as.matrix(dat.miss)) # generates a warning, that's probably bad
## Warning in prelim.norm(as.matrix(dat.miss)): NAs introduced by coercion to
## integer range
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)
toc()
## 34.42 sec elapsed
norm2
# norm2
tic()
norm2.res <- emNorm(dat.miss)
toc()
## 35.5 sec elapsed
So, here lavaan is actually the fastest, and by quite a bit. EMgaussian is next, and norm and norm2 are slow. But, speed is not the only story, and we’ll see that this pattern changes with many missing data patterns.
Next, compare to true values. For comparisons with the means and covariances, max absolute differences are printed, as is the Frobenius norm of the difference between the true and estimated covariance matrices (smaller numbers are better; aside from norm, the differences here are all tiny).
results <- data.frame(
methods = c("EMgaussian","lavaan","norm","norm2"),
maxabsmeans = c(max(abs(em.res$mu - truemeans)),
max(abs(lav.res$mean - truemeans)),
max(abs(norm.res$mu - truemeans)),
max(abs(norm2.res$param$beta - truemeans))),
Fnorm = c(norm(em.res$S - truecov, type="F"),
norm(lav.res$cov - truecov, type="F"),
norm(norm.res$sigma - truecov, type="F"),
norm(norm2.res$param$sigma - truecov, type="F")),
maxabscov = c(max(abs(em.res$S - truecov)),
max(abs(lav.res$cov - truecov)),
max(abs(norm.res$sigma - truecov)),
max(abs(norm2.res$param$sigma - truecov)))
)
knitr::kable(results, digits = 3)
methods | maxabsmeans | Fnorm | maxabscov |
---|---|---|---|
EMgaussian | 0.063 | 3.68 | 0.201 |
lavaan | 0.063 | 3.68 | 0.201 |
norm | 396.633 | 12745694.15 | 571219.957 |
norm2 | 0.063 | 3.68 | 0.201 |
EMgaussian and lavaan perform similarly, as does norm2, though lavaan is faster. norm is way off.
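Why is norm so far off? My speculation, consistent with the coercion warning from prelim.norm and the old advice about 30 or so variables: if missing data patterns are bit-packed into 32-bit integers, any dataset with more than about 30 columns overflows. I have not verified this in norm’s source, but the arithmetic is suggestive:

# 32-bit integers top out at 2^31 - 1, so a bit-packed pattern code
# for 135 columns would overflow (speculative explanation)
.Machine$integer.max
2^135 > .Machine$integer.max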
Many Missing Data Patterns
Here, many missing data patterns are created instead of just a few. Although I’m pretty sure the project I was working on did not have Missing Completely at Random (MCAR) data, MCAR was just easier for this illustration.
# Create MCAR data
nobs <- nrow(dat.sub)*ncol(dat.sub)
dat.miss <- as.matrix(dat.sub)
dat.miss[sample(1:nobs,.3*nobs)] <- NA
# Runs, but the plot not really readable
#md.pattern(dat.miss)
# a little more readable
vis_miss(as.data.frame(dat.miss))
# proportion of missing data
sum(is.na(dat.miss))/nobs
## [1] 0.3
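Counting distinct patterns the same way as in the previous section, nearly every one of the 4000 rows should now have its own missing data pattern:

# count distinct row-wise missing data patterns
length(unique(apply(is.na(dat.miss), 1, paste, collapse = "")))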
Comparison
Here are timings as well as comparisons of each approach with the true means and covariances.
EMgaussian
# EM algorithm
tic()
em.res <- em.cov(dat.miss)
toc()
## 9.37 sec elapsed
lavaan
# lavaan
tic()
lav.res <- lavCor(dat.miss, missing="fiml", output="sampstat")
toc()
## 163.05 sec elapsed
norm
# norm
tic()
s <- prelim.norm(as.matrix(dat.miss)) # generates a warning, that's probably bad
## Warning in prelim.norm(as.matrix(dat.miss)): NAs introduced by coercion to
## integer range
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)
toc()
## 0.12 sec elapsed
norm2
# norm2
tic()
norm2.res <- emNorm(dat.miss)
## Note: Eigen power method failed to converge
## OCCURRED IN: estimate_worst_frac in MOD norm_engine
toc()
## 1403.65 sec elapsed
Wow, so EMgaussian was much faster than lavaan. Aside from that, norm actually kind of failed and norm2 was abysmally slow.
Next, compare to true values. For comparisons with the means and covariances, max absolute differences are printed, as is the Frobenius norm of the difference between the true and estimated covariance matrices (smaller numbers are better).
results <- data.frame(
methods = c("EMgaussian","lavaan","norm","norm2"),
maxabsmeans = c(max(abs(em.res$mu - truemeans)),
max(abs(lav.res$mean - truemeans)),
max(abs(norm.res$mu - truemeans)),
max(abs(norm2.res$param$beta - truemeans))),
Fnorm = c(norm(em.res$S - truecov, type="F"),
norm(lav.res$cov - truecov, type="F"),
norm(norm.res$sigma - truecov, type="F"),
norm(norm2.res$param$sigma - truecov, type="F")),
maxabscov = c(max(abs(em.res$S - truecov)),
max(abs(lav.res$cov - truecov)),
max(abs(norm.res$sigma - truecov)),
max(abs(norm2.res$param$sigma - truecov)))
)
knitr::kable(results, digits=3)
methods | maxabsmeans | Fnorm | maxabscov |
---|---|---|---|
EMgaussian | 0.038 | 3.425 | 0.104 |
lavaan | 0.038 | 3.425 | 0.104 |
norm | 579.505 | 4304244.960 | 776252.564 |
norm2 | 0.038 | 3.425 | 0.104 |
EMgaussian and lavaan perform similarly, as does norm2. norm again failed.
Final Thoughts
- EMgaussian is faster than lavaan when there are many missing data patterns.
- lavaan can be faster than EMgaussian when there are few missing data patterns.
- norm often fails to give reasonable estimates.
- norm2 seems to give reasonable results, but is slower than lavaan and EMgaussian.
- mvnmle was too slow in the initial test, so I did not study it.
Estimation Speed
While some differences in computational time could be attributed to things like the criteria for convergence (e.g., an estimation method will “converge” sooner if we stop when differences across iterations are less than .01 instead of .00001) or starting values (e.g., poor starting values might require more iterations), I suspect that the differences between lavaan and EMgaussian come down to how the optimization problem is set up and the language used for some steps. Major portions of the EM algorithm are coded in Rcpp (i.e., C++) for EMgaussian, which sometimes gives it an advantage. But, lavaan groups respondents based on their missing data pattern whereas EMgaussian does not. With few missing data patterns, this may allow lavaan to take some computational shortcuts. With many missing data patterns, those shortcuts disappear and the compiled code in EMgaussian gains the advantage. This also means that EMgaussian could probably get a lot faster if the code were changed to group by missing data patterns; a sketch of that grouping idea follows.
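To make that concrete, here is a minimal sketch of bucketing rows by missing data pattern so that the expensive per-pattern algebra (e.g., inverting S[o, o] in the E-step) runs once per pattern rather than once per row. This illustrates the strategy only; it is not lavaan’s or EMgaussian’s actual code:

# group row indices by missing data pattern, using pattern strings as keys
pat <- apply(is.na(dat.miss), 1, paste, collapse = "")
rows.by.pattern <- split(seq_len(nrow(dat.miss)), pat)
length(rows.by.pattern) # number of distinct patterns
# an E-step could then loop over rows.by.pattern, computing
# solve(S[o, o]) once per pattern and reusing it for every row in the group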
Coda
Either EMgaussian or lavaan may be a good choice. However, I have also elsewhere noticed lavaan failing when EMgaussian succeeds: see this [bug report for qgraph](https://github.com/SachaEpskamp/qgraph/issues/88).
Other Questions
Simulations
More than a few replications should be done to test/compare methods. A similar set of simulations was done comparing EMgaussian and lavaan, but is not reported here.
What about multiple imputation?
Sure, you could use mice, but how fast would it be with a very large dataset (suppose 70+ items and 11k respondents)? I think Amelia might be a bit faster.
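For completeness, here is a rough sketch of what the multiple-imputation route could look like with mice, pooling covariance matrices by simple averaging across imputations (m = 5 and the averaging are my choices for illustration, and this is untimed here):

# impute m complete datasets, then average the covariance matrices
imp <- mice(bfi.sub, m = 5, printFlag = FALSE)
covs <- lapply(1:5, function(i) cov(mice::complete(imp, i)))
S.mi <- Reduce(`+`, covs) / 5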
Aren’t there other R packages that should do this?
I would think so, but for some reason most of my searching came up empty. If anyone knows of one, feel free to contact me.
Visualizing Missing Data Patterns
Seems tricky with large datasets. I thought visdat did ok at this. I did not have luck with md.pattern from mice.
Further Reading
If you would like more information on missing data handling, including the EM algorithm or “FIML”, here is some work I recommend.
- Statistical analysis with missing data by Little and Rubin (for the technically inclined)
- Applied missing data analysis by Enders (more readable)
- Flexible imputation of missing data by van Buuren (if you go with multiple imputation)
Related research articles: