REMLA_tutorial

library(REMLA)
#> Package 'REMLA' version 1.0
#> For documentation, questions, and issues please see github link https://github.com/knieser/REM.
#> Type 'citation("REMLA")' for citing this R package in publications.
library(GPArotation)

Introduction

This vignette introduces the use of the robust expectation maximization (REM) algorithm; a statistical method for estimating the parameters of a latent variable model in the presence of population heterogeneity as suggested by Nieser & Cochran (2023). The REM algorithm is based on the expectation-maximization (EM) algorithm, but it allows for the case when not all the data are generated by the assumed data generating model.

Data Description

To illustrate the practical application of the REM (Robust Expectation Maximization) package for both Exploratory and Confirmatory Factor Analysis, we have chosen to employ it in the context of the Holzinger and Swineford dataset publicly available in the lavaan package.

The Holzinger and Swineford dataset consist of mental ability test scores of seventh and eighth grade children from two different schools. Originally, this dataset consisted of 26 test scores but this modified version in the lavaan package consist of 15 variables, 9 of which are test scores.

library(lavaan)
#> This is lavaan 0.6-17
#> lavaan is FREE software! Please report any bugs.

df <- HolzingerSwineford1939
head(df)
#>   id sex ageyr agemo  school grade       x1   x2    x3       x4   x5        x6
#> 1  1   1    13     1 Pasteur     7 3.333333 7.75 0.375 2.333333 5.75 1.2857143
#> 2  2   2    13     7 Pasteur     7 5.333333 5.25 2.125 1.666667 3.00 1.2857143
#> 3  3   2    13     1 Pasteur     7 4.500000 5.25 1.875 1.000000 1.75 0.4285714
#> 4  4   1    13     2 Pasteur     7 5.333333 7.75 3.000 2.666667 4.50 2.4285714
#> 5  5   2    12     2 Pasteur     7 4.833333 4.75 0.875 2.666667 4.00 2.5714286
#> 6  6   2    14     1 Pasteur     7 5.333333 5.00 2.250 1.000000 3.00 0.8571429
#>         x7   x8       x9
#> 1 3.391304 5.75 6.361111
#> 2 3.782609 6.25 7.916667
#> 3 3.260870 3.90 4.416667
#> 4 3.000000 5.30 4.861111
#> 5 3.695652 6.30 5.916667
#> 6 4.347826 6.65 7.500000

We limit the data to the columns that will be used in the analysis. Data can be either in a matrix or data frame and must be numeric.

data = df[,-c(1:6)] 

Modeling using REM for Exploratory Factor Analysis

The REM_EFA function, specifically designed for conducting exploratory factor analysis, requires specification of three parameters: X, k_range and delta. The X parameter corresponds to the dataframe or matrix to be analyzed. The k_range parameter comprises a vector of numbers of factors to be modeled. In the example provided, the algorithm conducts exploratory analysis with 1 to 3 factors. Moreover, the delta parameter serves as a hyper-parameter, ranging from 0 to 1, and it reflects the researcher’s tolerance level for potentially down-weighting data inaccurately within the model. The default for the delta parameter is 0.05.

The output contains a separate list of estimates for each number of factors under consideration.

model_EFA = REM_EFA( X = data, k_range = 1:3, delta = 0.05)
#> EFA with 1 factors
#>  getting EM estimates...
#> done
#>  searching for optimal epsilon
#> done
#>  getting REM estimates...
#> done
#>  calculating standard errors...
#> done
#> EFA with 2 factors
#>  getting EM estimates...
#> done
#>  searching for optimal epsilon
#> done
#>  getting REM estimates...
#> done
#>  calculating standard errors...
#> done
#> EFA with 3 factors
#>  getting EM estimates...
#> done
#>  searching for optimal epsilon
#> done
#>  getting REM estimates...
#> done
#>  calculating standard errors...
#> done

The summary function

The summary function displays the estimates of the optimal model which is selected using the lowest score of Bayesian Information Criterion (BIC) among the number of factors considered. In other words, it selects the optimal model after considering the BIC in the models with 1, 2, and 3 factors.

summary(model_EFA)
#> -----Exploratory factor analysis using REM----- 
#> Call: REM_EFA(X = data, k_range = 1:3, delta = 0.05) 
#> 
#> According to BIC evaluated at REM estimates,
#>  the optimal number of factors is: 3 
#> 
#> REM estimates w/ delta = 0.05 ( epsilon = 8.67525e-07 )
#>  -------------------------- 
#> Est. gamma = 0.85 
#> 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1     4.894    0.114    0.708   -0.043   0.373
#> x2     5.634    0.050    0.394   -0.142   0.821
#> x3     2.050   -0.148    0.565    0.044   0.683
#> x4     2.869    0.799    0.098    0.002   0.295
#> x5     3.522    0.925   -0.060   -0.006   0.210
#> x6     2.253    0.835    0.014    0.012   0.304
#> x7     3.893    0.043   -0.132    0.782   0.459
#> x8     5.841   -0.043    0.047    0.764   0.451
#> x9     5.720   -0.020    0.373    0.510   0.511
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    1.000    0.286    0.342
#> Factor 2    0.286    1.000    0.136
#> Factor 3    0.342    0.136    1.000
#> 
#> EM estimates 
#>  -------------------------- 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1     4.237    0.178    0.589    0.028   0.513
#> x2     5.180    0.037    0.495   -0.126   0.749
#> x3     1.993   -0.084    0.674    0.021   0.543
#> x4     2.641    0.843    0.020    0.003   0.282
#> x5     3.379    0.894   -0.067    0.004   0.244
#> x6     2.003    0.810    0.075   -0.016   0.307
#> x7     3.849    0.026   -0.150    0.762   0.497
#> x8     5.468   -0.054    0.103    0.731   0.473
#> x9     5.335    0.015    0.358    0.482   0.544
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    1.000    0.149    0.357
#> Factor 2    0.149    1.000    0.030
#> Factor 3    0.357    0.030    1.000
#> 
#> Difference (EM - REM) 
#>  -------------------------- 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1    -0.657    0.065   -0.119    0.071   0.140
#> x2    -0.454   -0.013    0.100    0.016  -0.072
#> x3    -0.057    0.064    0.109   -0.023  -0.140
#> x4    -0.228    0.045   -0.078    0.001  -0.014
#> x5    -0.143   -0.031   -0.006    0.010   0.033
#> x6    -0.250   -0.024    0.061   -0.027   0.003
#> x7    -0.044   -0.017   -0.018   -0.020   0.038
#> x8    -0.373   -0.011    0.056   -0.034   0.022
#> x9    -0.385    0.034   -0.015   -0.028   0.033
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    0.000   -0.137    0.015
#> Factor 2   -0.137    0.000   -0.107
#> Factor 3    0.015   -0.107    0.000

From the REM and EM output above, note that \(x_4\), \(x_5\), and \(x_6\) strongly load on Factor 1. Similarly, \(x_1\),\(x_2\), and \(x_3\) moderately load on Factor 2 and that \(x_7\), \(x_8\) and \(x_9\) moderately load on Factor 3. Additionally, the gamma or average REM weight was 0.849, indicating that about 15 percent of the sample differed in the factor model parameters from the majority of the sample.

Plot Distribution of the weights

In the following plot we show the distribution of the weights to know how much people are getting down-weighted.


hist(model_EFA[[2]]$REM_output$weights,
     main="REM Weight Distribution",
     xlab = "Weights", 
     ylab = "Frequency")

Confirmatory Factor Analysis

To conduct a confirmatory factor analysis,we need to specify a model which consists of structural equations relating the measured variables with the latent variables. We consider the following three factor model:

\[Visual = x_1 + x_2 + x_3 \] \[Textual = x_4 + x_5 + x_6 \] \[Speed = x_7 + x_8 + x_9\] To create the model parameter, first create a string variable that contains each equation in a new line and separate equalities using the \(=~\) symbol as shown below.

# Define your model as a string
model <- " Visual =~ x1  +  x2  +  x3 
          Textual =~ x4 + x5 + x6
          Speed =~ x7 + x8 + x9
"

This model is used to investigate the relationships between the latent variables “Visual”,“Textual”, and “Speed” and their respective sets of observed variables by examining how well the observed variables predict the underlying constructs. The model will estimate coefficients for these relationships, providing information about the strength and direction of these associations.

Finally, use the REM_CFA() function to perform the confirmatory factor analysis. For this include X (the numeric data frame under consideration), delta (a hyper-parameter), and the model parameter.

# CFA model with delta = 0.05
model_CFA = REM_CFA(X = data,delta = 0.05,model = model)
#> CFA with 3 factors
#>  getting EM estimates...
#> done
#>  searching for optimal epsilon
#> done
#>  getting REM estimates...
#> done
#>  calculating standard errors...
#> done
summary(model_CFA)
#> -----Confirmatory factor analysis using REM----- 
#> Call: REM_CFA(X = data, delta = 0.05, model = model) 
#> 
#> Model:     Visual =~ x1  +  x2  +  x3 
#>           Textual =~ x4 + x5 + x6
#>           Speed =~ x7 + x8 + x9
#>  
#> 
#> REM estimates w/ delta = 0.05 ( epsilon = 8.054465e-07 )
#>  -------------------------- 
#> Est. gamma = 0.829 
#> 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1     5.037    0.821    0.000    0.000   0.327
#> x2     5.757    0.359    0.000    0.000   0.871
#> x3     2.058    0.500    0.000    0.000   0.750
#> x4     2.897    0.000   -0.843    0.000   0.289
#> x5     3.558    0.000   -0.871    0.000   0.241
#> x6     2.270    0.000   -0.840    0.000   0.295
#> x7     3.895    0.000    0.000    0.634   0.598
#> x8     5.859    0.000    0.000    0.752   0.435
#> x9     5.758    0.000    0.000    0.657   0.568
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    1.000   -0.496    0.360
#> Factor 2   -0.497    1.000   -0.261
#> Factor 3    0.360   -0.261    1.000
#> 
#> EM estimates 
#>  -------------------------- 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1     4.236    0.764    0.000    0.000   0.416
#> x2     5.180    0.427    0.000    0.000   0.818
#> x3     1.993    0.585    0.000    0.000   0.657
#> x4     2.637    0.000   -0.851    0.000   0.276
#> x5     3.373    0.000   -0.855    0.000   0.270
#> x6     2.001    0.000   -0.838    0.000   0.298
#> x7     3.848    0.000    0.000    0.567   0.679
#> x8     5.467    0.000    0.000    0.719   0.483
#> x9     5.334    0.000    0.000    0.670   0.552
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    1.000   -0.459    0.478
#> Factor 2   -0.459    1.000   -0.285
#> Factor 3    0.478   -0.285    1.000
#> 
#> Difference (EM - REM) 
#>  -------------------------- 
#>    Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1    -0.801   -0.057    0.000    0.000   0.090
#> x2    -0.577    0.068    0.000    0.000  -0.053
#> x3    -0.064    0.085    0.000    0.000  -0.092
#> x4    -0.260    0.000   -0.008    0.000  -0.013
#> x5    -0.185    0.000    0.017    0.000   0.029
#> x6    -0.269    0.000    0.002    0.000   0.003
#> x7    -0.047    0.000    0.000   -0.067   0.081
#> x8    -0.392    0.000    0.000   -0.033   0.049
#> x9    -0.424    0.000    0.000    0.013  -0.017
#> 
#> Factor correlations 
#>          Factor 1 Factor 2 Factor 3
#> Factor 1    0.000    0.038    0.119
#> Factor 2    0.038    0.000   -0.024
#> Factor 3    0.119   -0.024    0.000

The output above depicts estimates for the confirmatory analysis. Note that again \(x_4\), \(x_5\), and \(x_6\) are strongly correlated with Factor 1. Similarly, \(x_1\),\(x_2\), and \(x_3\) are moderately correlated with factor 2 and that \(x_7\), \(x_8\) and \(x_9\) are moderately correlated with Factor 3. However, the gamma or average REM weight was 0.833. Note that the difference between the EM and REM intercepts are negative indicating that this down-weighted group has lower average item scores.