Parsimony in model selection: Tools for assessing fit propensity

CPA, 2024

Carl F. Falk\(^1\) and Michael Muthukrishna\(^2\)

\(^1\)McGill University

\(^2\)London School of Economics and Political Science

(Both authors contributed equally to this research)

Overview

Fit Propensity (FP)

  • FP is a tendency for a model to fit a wide range of data patterns
    • High FP \(\approx\) less parsimonious
  • If a model has low FP (doesn’t fit random data), but fits this data well
    • Good argument in favor of the model
  • \(df\) or # estimated parameters are coarse measures of parsimony
    • Models can have equal \(df\) but different FP
    • Model with fewer parameters (more \(df\)) can have higher FP

Previous software

  • Preacher (2003, 2006)
    • Introduced fit propensity for covariance structure models (aka SEM w/ continuous variables)

Software?

General procedure:

  1. Generate random correlation matrices
  1. Fit models to the random correlation matrices
  1. Record index of model fit
  1. Summarize relative model fit

FORTRAN, MCMC (kind of slow)

RAMONA 4.0 for DOS

Only SRMR studied/supported

ECDF of index

ockhamSEM (Falk & Muthukrishna, 2023)

  • Random correlation matrices
  • Model syntax uses lavaan (Rosseel, 2012)
  • Automated summaries
  • Parallel processing supported for some features

https://github.com/falkcarl/ockhamSEM.git

install.packages("devtools")
devtools::install_github("falkcarl/ockhamSEM")
  • MCMC algorithm (Preacher, 2003, 2006)
  • Everything supported by clusterGeneration (Qiu & Joe, 2020)
  • Relatively easy to use
  • All fit indices supported by fitMeasures
  • Text / descriptives
  • Empirical CDFs of indices (ggplot2; Wickham, 2016)
  • Euler plots (eulerr; Larsson, 2020)

An Example

Some models

These examples (from Preacher, 2006) have the same df

Define/fit models with lavaan

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications

Define/fit models with lavaan

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • Load package

Define/fit models with lavaan

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • Load package
  • Sham data just so we can have a fitted model

Define/fit models with lavaan

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • Load package
  • Sham data just so we can have a fitted model
  • Estimate models with lavaan

Random corr. matrices & fit models

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications

Random corr. matrices & fit models

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • 2 or more models can be specified

Random corr. matrices & fit models

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • 2 or more models can be specified
  • Anything in output from fitmeasures() is supported

Random corr. matrices & fit models

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
  • 2 or more models can be specified
  • Anything in output from fitmeasures() is supported
  • Onion method by default is uniform over space of correlation matrices (Joe, 2006; Lewandowski, Kurowicka, & Joe, 2009)

Random corr. matrices & fit models

library(ockhamSEM)
set.seed(1234)

mat <- diag(3) # identity matrix
colnames(mat) <- rownames(mat) <- paste0("V", seq(1, 3))

mod1 <- 'V3 ~ V1 + V2
         V1 ~~ 0*V2'
mod2 <- 'V3 ~ V1
         V2 ~ V3'

mod1.fit <- sem(mod1, sample.cov=mat, sample.nobs=500)
mod2.fit <- sem(mod2, sample.cov=mat, sample.nobs=500)

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
[1] "Generate matrices"
[1] "Fitting models"
  • 2 or more models can be specified
  • Anything in output from fitmeasures() is supported
  • Onion method by default is uniform over space of correlation matrices (Joe, 2006; Lewandowski, Kurowicka, & Joe, 2009)
  • Here, 1000 random matrices (more = better)

Summaries (text)

  • Quantiles, means, distribution comparisons of each fit measure for each model
res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
        
summary(res,
        lower.tail=c(TRUE,FALSE))
# lower.tail, lower=better for fit index?

 Quantiles for each model and fit measure:

 Model  1 
       srmr   cfi
0%    0.001 1.000
10%   0.041 0.994
20%   0.091 0.969
30%   0.132 0.923
40%   0.176 0.852
50%   0.219 0.778
60%   0.269 0.682
70%   0.315 0.574
80%   0.406 0.415
90%   0.635 0.227
100% 11.059 0.000

 Model  2 
      srmr   cfi
0%   0.000 1.000
10%  0.031 0.982
20%  0.051 0.936
30%  0.079 0.870
40%  0.105 0.763
50%  0.134 0.656
60%  0.162 0.521
70%  0.202 0.393
80%  0.239 0.254
90%  0.294 0.129
100% 0.400 0.000

 Information about replications for each model and fit measure:

 Model  1 

Mean across replications
 srmr   cfi 
0.366 0.693 

Median across replications
 srmr   cfi 
0.219 0.778 

Number of finite values
srmr  cfi 
1000 1000 

Number of NA values
srmr  cfi 
   0    0 

 Model  2 

Mean across replications
 srmr   cfi 
0.148 0.600 

Median across replications
 srmr   cfi 
0.134 0.656 

Number of finite values
srmr  cfi 
1000 1000 

Number of NA values
srmr  cfi 
   0    0 

 Effect Sizes for Differences in Model Fit:

  srmr 

 Model 1 vs. Model 2 
   Cohen's d:           0.411 
   Cliff's delta:       0.34 
   Komolgorov Smirnov:  0.273 

  cfi 

 Model 1 vs. Model 2 
   Cohen's d:           0.306 
   Cliff's delta:       0.17 
   Komolgorov Smirnov:  0.146 

Summaries (ECDFs)

  • Model 2 looks like it has higher FP (according to SRMR)
res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
        
plot(res,
     whichfit = "srmr",
     lower.tail=TRUE)
# lower.tail, lower=better for fit index?

Summaries (ECDFs)

  • Model 1 looks like it has higher FP (according to CFI)
res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
        
plot(res,
     whichfit = "cfi",
     lower.tail=FALSE)
# lower.tail, lower=better for fit index?     

Summaries (Euler)

Proportion of fitted models that had CFI = .95 or better

Intersection of when Model 1 and 2 both had \(CFI \ge .95\) is overlap b/w “1” and “2”

Model 1 does not always have better CFI

res <- run.fitprop(
  mod1.fit, mod2.fit, # models
  fit.measure=c("srmr","cfi"), # fit indices
  rmethod="onion", # method for R matrices
  reps=1000) # number of replications
        
plot(res,
     type="euler", whichfit = "cfi",
     cutoff = .95, lower.tail=FALSE)

Conclusion

Stuff we learned and limitations

  • Fit indices don’t always agree
  • Model fitting can still be slow
  • Restricting data space matters

Ongoing work

  • Re-write for modularity (Jorgensen)
  • New visualizations (Winter)
  • Normalized maximum likelihood (NML; Preacher/Jorgensen)
    • Do model selection based in part on FP

Conclusion

  • We hope you will find ockhamSEM useful for
    • Teaching about FP
    • Studying FP of certain models of interest
    • Generating ideas for the study of FP itself
    • Model selection

Thank you!

  • Thank you very much to the CPA quantitative methods section!

  • Thanks also to Victoria Savalei for her inspiration and early input on this project. Kris Preacher for pointing us to his FORTRAN code.

  • Reference:

Falk, C.F., & Muthukrishna, M. (2023). Parsimony in model selection: Tools for assessing fit propensity. Psychological Methods, 28, 123-136. https://doi.org/10.1037/met0000422

Other examples

Simplex vs factor model w/ equality constraint on loadings

Restriction of data space matters

  • All positive correlations (positive manifold) vs no restriction

These three models vs unidimensional model (Model 4)

  • Inspired by models for Rosenberg Self-Esteem Scale studied by Donnellan, Ackerman, & Brecheen (2016)

Results depend on the fit index