EMgaussian for Missing Data: Comparison vs. lavaan, norm, and Friends

Overview

For a recent project, I needed to repeatedly estimate means and covariances for many items across many rather large datasets (well, large as far as Psychology/Health studies usually go). In addition, I needed something that would do a decent job of handling missing data appropriately and would finish rather quickly. And I needed the code in R, ASAP.

Aren’t there already R packages that could do this? I thought so and I tried a few, including lavaan, norm, and norm2. In the end, modifying code from a project with Joshua Starr led to the EMgaussian package, which as a byproduct included code for estimating means and covariances using the EM algorithm. Although the implemented approach carried with it assumptions regarding multivariate normality, it would probably get the job done.

This post documents some tests with EMgaussian. In brief, I found that for my particular project it was much faster than lavaan, but this probably depends on how many patterns of missing data there are: lavaan seems fast when there are few missing data patterns, whereas EMgaussian may be faster when there are many. And norm and norm2 are not great, but for different reasons.

Load Packages

# packages used for handling missing data or visualizing it
library(EMgaussian)
library(lavaan)
library(norm)
library(norm2)
library(mvnmle)
#library(rrcovNA) # well, this one actually just calls em.norm

library(mice)
library(visdat)

# timings
library(tictoc)
library(rbenchmark)

# a little data manipulation
library(dplyr)

Quick Illustration

Here I grab a dataset that I know has some missing values, but not that many: the Big Five Inventory (IPIP) data from the psych package, which has 2800 respondents and 25 Likert-type items with something like 6 response options each.

library(psych)
bfi.sub <- bfi[,1:25]
dim(bfi.sub)
## [1] 2800   25

There’s not much missing data; if we do listwise deletion (remove anyone with a missing value), we end up losing 364 participants. But it should work as a quick illustration to start off.

vis_miss(bfi.sub)

dim(na.omit(bfi.sub))
## [1] 2436   25
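For a quick numeric summary (my addition, not in the original output), the overall proportion of missing cells can be computed directly:

# overall proportion of missing cells in the data frame (small here)
mean(is.na(bfi.sub))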

Comparison of Methods

Here, we compare several approaches.

  • lavaan can compute means and covariances using what structural equation modelers call “full-information maximum likelihood” (FIML).
    • Known as Direct Maximum Likelihood by others (a sketch of the casewise log-likelihood appears after this list).
    • I thought lavaan used to use the EM algorithm inside the function below, but I’m not sure that it still does.
  • norm is the first package I learned that could compute the EM means and covariances.
    • But, apparently it is now discouraged in favor of norm2, and there are warnings that it may not reliably handle more than 30 variables or so.
  • norm2 is supposed to be better than norm, but as of this writing was booted off of CRAN.
  • mvnmle also apparently has an implementation.
  • EMgaussian is my own implementation of the EM algorithm for computing means and covariances.
  • rrcovNA?
    • Well, it turns out that it actually just calls em.norm from the norm package, so I did not study it.
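For context, here is the casewise log-likelihood that FIML/direct ML maximizes (my addition, stated in standard notation rather than anything package-specific). Each case contributes a term based only on the variables it actually has observed:

\[
\ell_i(\mu, \Sigma) = -\frac{1}{2}\left[ p_i \log(2\pi) + \log\lvert \Sigma_{o_i} \rvert + (y_{o_i} - \mu_{o_i})^\top \Sigma_{o_i}^{-1} (y_{o_i} - \mu_{o_i}) \right]
\]

where \(y_{o_i}\) holds case \(i\)'s observed values, \(\mu_{o_i}\) and \(\Sigma_{o_i}\) are the matching subvector and submatrix of \(\mu\) and \(\Sigma\), and \(p_i\) is the number of observed variables for that case. All cases sharing a missing data pattern share the same \(\Sigma_{o_i}\), which is why grouping by pattern (discussed near the end of this post) can be a computational shortcut.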

Same Answer?

Obtain estimates for all 5 methods.

# EM gaussian
em.res <- em.cov(bfi.sub)

# lavaan
lav.res <- lavCor(bfi.sub, missing="fiml", output="sampstat")

# norm
s <- prelim.norm(as.matrix(bfi.sub))
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)

# norm2
norm2.res <- emNorm(bfi.sub)

# mvnmle
mvnmle.res <- mlest(bfi.sub)

What is the max absolute difference for any element of the covariance matrix? It seems all methods do yield about the same answer for the covariance matrix.

EMgaussian vs. lavaan:

max(abs(em.res$S - lav.res$cov))
## [1] 5.469403e-12

EMgaussian vs. norm:

max(abs(em.res$S - norm.res$sigma))
## [1] 1.350084e-06

EMgaussian vs. norm2:

max(abs(em.res$S - norm2.res$param$sigma))
## [1] 8.856597e-08

EMgaussian vs. mvnmle:

max(abs(em.res$S - mvnmle.res$sigmahat))
## [1] 1.298663e-05

How Long?

These timings use each package's default optimization criteria. Differences in convergence tolerance can lead to different speeds of convergence (e.g., stopping when changes across iterations are less than .01 vs. less than .00001), and those defaults differ across estimation approaches.
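A quick way to peek at those defaults (my addition; args() is base R) is to print each function's formal arguments:

# print each estimator's arguments, including any convergence settings
args(EMgaussian::em.cov)
args(norm2::emNorm)
args(mvnmle::mlest)
# lavaan's estimation options are documented separately; see ?lavaan::lavOptions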

benchmark(
  "EMgaussian" = {em.res <- em.cov(bfi.sub)},
  "lavaan" = {lav.res <- lavCor(bfi.sub, missing="fiml", output="sampstat")},
  "norm" = {s <- prelim.norm(as.matrix(bfi.sub))
            est <- em.norm(s, showits = FALSE)
            norm.res <- getparam.norm(s,est)},
  "norm2" = {norm2.res <- emNorm(bfi.sub)},
  "mvnmle" = {mvnmle.res <- mlest(bfi.sub)},
  replications = 50
)
##         test replications elapsed relative user.self sys.self user.child
## 1 EMgaussian           50    1.03    2.641      0.97     0.04         NA
## 2     lavaan           50    4.81   12.333      4.47     0.28         NA
## 5     mvnmle           50  572.74 1468.564    509.39    58.26         NA
## 3       norm           50    0.39    1.000      0.32     0.10         NA
## 4      norm2           50   49.86  127.846     49.53     0.05         NA
##   sys.child
## 1        NA
## 2        NA
## 5        NA
## 3        NA
## 4        NA

Here, EMgaussian doesn’t do too badly. norm is fast, but we’ll see later that it breaks with other datasets. norm2 is quite slow. lavaan is just a bit slower than EMgaussian, but not too bad. mvnmle was slow enough that I decided not to study it any further!

When is lavaan Faster than EMgaussian?

… and vice versa … and when do norm and norm2 fail?

Here, I take a different dataset and purposely delete values from it. This allows a benchmark for performance: do any of the methods come close to the result based on complete data? It also helps illustrate how the number of missing data patterns apparently affects computational time.

The Data

This is not the actual dataset I was using for the project, but it provides something comparable. The spi dataset contains 4000 respondents and 135 Likert-type items with 6 response options per item. It apparently does not contain any missing values.

# load data
library(psychTools)
dat.sub <- spi %>% dplyr::select(starts_with("q_"))
nobs <- nrow(dat.sub)*ncol(dat.sub)
dim(dat.sub)
## [1] 4000  135

Here, I save the true means and covariances from this complete set of data. This serves as a benchmark for the missing data methods.

# True means and covariances - before deleting missing values
truemeans <- colMeans(dat.sub)
truecov <- cov(dat.sub)

Few Missing Data Patterns

In one set of tests, I take inspiration from this paper by Xijuan Zhang on how to create missing data for simulation studies. Here, there are three conditioning variables that predict missingness on three subsets of the items. Technically this results in data that are Missing at Random (MAR).

# Create MAR data

# Probability of missingness for each row, based on each of three variables
# The functions here are essentially those of a logistic regression (inverse logit)
p.pat1 <- 1/(1+exp(-(-7 + .75 + dat.sub$q_253)))
p.pat2 <- 1/(1+exp(-(-5 + 1.75 + dat.sub$q_952)))
p.pat3 <- 1/(1+exp(-(-8 + 2.75 + dat.sub$q_1904)))

# We end up with a ballpark of around .3 missing rate
#mean(p.pat1)
#mean(p.pat2)
#mean(p.pat3)

# Create a copy of the dataset and insert missing values for three blocks of items
set.seed(1234)
dat.miss <- dat.sub
dat.miss[runif(nrow(dat.sub)) < p.pat1, 4:50] <- NA
dat.miss[runif(nrow(dat.sub)) < p.pat2, 51:98] <- NA
dat.miss[runif(nrow(dat.sub)) < p.pat3, 99:135] <- NA

But, there are relatively few possible patterns of missing data (\(2^3 = 8\) combinations, in fact). We end up with about this proportion of missing data in total:

# Runs, but the plot not really readable
#md.pattern(dat.miss)

# a little more readable
vis_miss(dat.miss)

# proportion of missing data
sum(is.na(dat.miss))/nobs
## [1] 0.2945833
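As a sanity check (my addition), one can count the distinct missingness patterns directly; unique() on a logical matrix keeps unique rows:

# count distinct row-wise missingness patterns; should be at most 2^3 = 8 here
nrow(unique(is.na(dat.miss)))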

Comparison

Here are timings as well as comparisons of each approach with the true means and covariances.

First, timing how long each takes:

EMgaussian

# EM algorithm
tic()
em.res <- em.cov(dat.miss)
toc()
## 29.62 sec elapsed

lavaan

# lavaan
tic()
lav.res <- lavCor(dat.miss, missing="fiml", output="sampstat")
toc()
## 7.86 sec elapsed

norm

# norm
tic()
s <- prelim.norm(as.matrix(dat.miss)) # generates a warning, that's probably bad
## Warning in prelim.norm(as.matrix(dat.miss)): NAs introduced by coercion to
## integer range
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)
toc()
## 34.42 sec elapsed

norm2

# norm2
tic()
norm2.res <- emNorm(dat.miss)
toc()
## 35.5 sec elapsed

So, here lavaan is actually the fastest, and by quite a bit. EMgaussian is next, and norm and norm2 are slow. But speed is not the only story, and we’ll see that this pattern changes when there are many missing data patterns.

Next, compare to true values. For the means and covariances, max absolute differences are printed, as is the Frobenius norm of the difference between the true and estimated covariance matrices (smaller numbers are better; for most methods the differences here are tiny).

results <- data.frame(
  methods = c("EMgaussian","lavaan","norm","norm2"),
  maxabsmeans = c(max(abs(em.res$mu - truemeans)),
            max(abs(lav.res$mean - truemeans)),
            max(abs(norm.res$mu - truemeans)),
            max(abs(norm2.res$param$beta - truemeans))),
  Fnorm = c(norm(em.res$S - truecov, type="F"),
            norm(lav.res$cov - truecov, type="F"),
            norm(norm.res$sigma - truecov, type="F"),
            norm(norm2.res$param$sigma - truecov, type="F")),
  maxabscov = c(max(abs(em.res$S - truecov)),
                max(abs(lav.res$cov - truecov)),
                max(abs(norm.res$sigma - truecov)),
                max(abs(norm2.res$param$sigma - truecov)))
)

knitr::kable(results, digits = 3)
methods      maxabsmeans         Fnorm     maxabscov
EMgaussian         0.063          3.68         0.201
lavaan             0.063          3.68         0.201
norm             396.633   12745694.15    571219.957
norm2              0.063          3.68         0.201

EMgaussian and lavaan perform similarly, as does norm2, though lavaan is faster. norm is way off.

Many Missing Data Patterns

Here, many missing data patterns are created instead of just a few. Although I’m pretty sure the project I was working on did not have Missing Completely at Random (MCAR) data, MCAR was just easier to generate for this illustration.

# Create MCAR data
nobs <- nrow(dat.sub)*ncol(dat.sub)
dat.miss <- as.matrix(dat.sub)
dat.miss[sample(1:nobs,.3*nobs)] <- NA
# Runs, but the plot not really readable
#md.pattern(dat.miss)

# a little more readable
vis_miss(as.data.frame(dat.miss))

# proportion of missing data
sum(is.na(dat.miss))/nobs
## [1] 0.3
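Repeating the same pattern count as before (my addition): with 135 items and 30% of cells deleted completely at random, nearly every row ends up with its own missingness pattern, so the count should be close to the number of rows.

# count distinct row-wise missingness patterns; expect close to 4000 here
nrow(unique(is.na(dat.miss)))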

Comparison

Here are timings as well as comparisons of each approach with the true means and covariances.

EMgaussian

# EM algorithm
tic()
em.res <- em.cov(dat.miss)
toc()
## 9.37 sec elapsed

lavaan

# lavaan
tic()
lav.res <- lavCor(dat.miss, missing="fiml", output="sampstat")
toc()
## 163.05 sec elapsed

norm

# norm
tic()
s <- prelim.norm(as.matrix(dat.miss)) # generates a warning, that's probably bad
## Warning in prelim.norm(as.matrix(dat.miss)): NAs introduced by coercion to
## integer range
est <- em.norm(s, showits = FALSE)
norm.res <- getparam.norm(s,est)
toc()
## 0.12 sec elapsed

norm2

# norm2
tic()
norm2.res <- emNorm(dat.miss)
## Note: Eigen power method failed to converge
## OCCURRED IN: estimate_worst_frac in MOD norm_engine
toc()
## 1403.65 sec elapsed

Wow, so EMgaussian was much faster than lavaan. Aside from that, norm finished quickly but (as we will see below) produced nonsense estimates, and norm2 was abysmally slow.

Next, compare to true values. For the means and covariances, max absolute differences are printed, as is the Frobenius norm of the difference between the true and estimated covariance matrices (smaller numbers are better).

results <- data.frame(
  methods = c("EMgaussian","lavaan","norm","norm2"),
  maxabsmeans = c(max(abs(em.res$mu - truemeans)),
            max(abs(lav.res$mean - truemeans)),
            max(abs(norm.res$mu - truemeans)),
            max(abs(norm2.res$param$beta - truemeans))),
  Fnorm = c(norm(em.res$S - truecov, type="F"),
            norm(lav.res$cov - truecov, type="F"),
            norm(norm.res$sigma - truecov, type="F"),
            norm(norm2.res$param$sigma - truecov, type="F")),
  maxabscov = c(max(abs(em.res$S - truecov)),
                max(abs(lav.res$cov - truecov)),
                max(abs(norm.res$sigma - truecov)),
                max(abs(norm2.res$param$sigma - truecov)))
)

knitr::kable(results, digits=3)
methods      maxabsmeans         Fnorm     maxabscov
EMgaussian         0.038         3.425         0.104
lavaan             0.038         3.425         0.104
norm             579.505   4304244.960    776252.564
norm2              0.038         3.425         0.104

EMgaussian and lavaan perform similarly, as does norm2. norm again failed.

Final Thoughts

  • EMgaussian is faster than lavaan when there are many missing data patterns.
  • lavaan can be faster than EMgaussian when there are few missing data patterns.
  • norm often fails to give reasonable estimates.
  • norm2 seems to give reasonable results, but is slower than lavaan and EMgaussian.
  • mvnmle was too slow in the initial test so I did not study it.

Estimation Speed

While some differences in computational time could be attributed to things like the criteria for convergence (e.g., an estimation method will “converge” sooner if we stop when differences across iterations are less than .01 instead of less than .00001) or starting values (e.g., poor starting values might require more iterations), I suspect that the differences between lavaan and EMgaussian are due to how the optimization problem is set up and the language used to implement some steps. Major portions of the EM algorithm are coded in Rcpp (i.e., C++) for EMgaussian, which sometimes gives it an advantage.

But, lavaan groups respondents based on their missing data pattern whereas EMgaussian does not. With few missing data patterns, this may allow lavaan to take some computational shortcuts. With many missing data patterns, those shortcuts disappear and the compiled code in EMgaussian gains the advantage. This also means that EMgaussian could probably get a lot faster if the code were changed to group by missing data patterns.
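To make the pattern-grouping idea concrete, here is a minimal sketch (my own illustration, not code from lavaan or EMgaussian) of how rows could be grouped so that per-pattern work, such as inverting the observed-variable submatrix of \(\Sigma\), happens once per pattern rather than once per row:

# hypothetical sketch: group row indices by missing data pattern
pat <- apply(is.na(dat.miss), 1, paste, collapse = "")
groups <- split(seq_len(nrow(dat.miss)), pat)
length(groups)  # number of distinct patterns

# per-pattern work (e.g., one submatrix inverse) could then be shared:
for (g in groups) {
  obs <- !is.na(dat.miss[g[1], ])   # observed variables for this pattern
  # Sinv <- solve(S[obs, obs])      # computed once, reused for all rows in g
  # ... E-step quantities for rows in g ...
}

With 8 patterns this loop runs 8 times; with thousands of patterns, the savings vanish, which matches the timings above.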

Coda

Either EMgaussian or lavaan may be a good choice. However, I have also elsewhere noticed lavaan failing when EMgaussian succeeds: [bug report for qgraph](https://github.com/SachaEpskamp/qgraph/issues/88)

Other Questions

Simulations

More than a few replications should be done to test/compare methods. A similar set of simulations was done comparing EMgaussian and lavaan but is not reported here.

What about multiple imputation?

Sure, you could use mice, but how fast would it be with a very large dataset (suppose 70+ items and 11k respondents)? I think Amelia might be a bit faster.
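For what it’s worth, here is a minimal sketch of that route (my addition; mice defaults, using the MCAR data from above), imputing several times and then combining the covariance estimates:

# sketch: impute with mice defaults, then average covariances across imputations
imp <- mice(as.data.frame(dat.miss), m = 5, printFlag = FALSE)  # likely slow here
covs <- lapply(1:5, function(i) cov(complete(imp, i)))
mi.cov <- Reduce(`+`, covs) / length(covs)

Averaging the m estimates is the pooled point estimate under Rubin’s rules; proper standard errors would additionally need the between-imputation variance.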

Aren’t there other R packages that should do this?

I would think so, but for some reason most of my searching came up empty. If anyone knows of one, feel free to contact me.

Visualizing Missing Data Patterns

Seems tricky with large datasets. I thought visdat did ok at this. I did not have luck with md.pattern from mice.

Further Reading

If you would like more information on missing data handling, including the EM algorithm or “FIML”, here is some work I recommend.

Related research articles:

  • Falk and Starr
  • Savalei and Falk
