--- title: "mantar" output: rmarkdown::html_vignette: toc: true toc_depth: 3 bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{mantar} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `mantar` provides users with several methods for handling missing data in the context of network analysis. The vignette is organized as follows: it begins with details on installation, followed by a description of the main functions and data sets provided by the package. Next, it discusses the available functionality in more depth by outlining the key arguments and their effects. Finally, a complete example analysis based on a real-world data set is presented. ## Installation The current stable version (0.2.0) is [available on CRAN](https://cran.r-project.org/package=stuart) and can be installed using the usual approach: ```{r, craninstall, eval = FALSE} install.packages("mantar") ``` You can install the development version of `mantar` from [GitHub](https://github.com/kai-nehler/mantar). To do so, you need the `remotes` package. ```{r, remoteinstall, eval = FALSE} # install.packages("remotes") remotes::install_github("kai-nehler/mantar@develop") ``` The extension `@develop` ensures that you get the latest development version of the package, which may include new features and bug fixes not yet available in the stable release on CRAN. Exluding this extension will install the same version as the one on CRAN. ```{r, loadmantar} library(mantar) ``` After installation the easiest way to get an overview of functions and capabilities is to use `help(package = "mantar")` to open the package help-file. You could also read the rest of this vignette for an introduction and some examples. ## Features This section provides an overview of the main functions and data sets included in the `mantar` package. ### Functionalities As described above, the package offers approaches for estimating network structures: - **Neighborhood selection**: This approach is based on node-wise regressions with information criteria for model selection. Each node is used once as a dependent variable, while all other nodes serve as potential predictors. For each node, the best set of predictors is selected using an information criterion. The resulting regression weights are then combined to obtain partial correlations between nodes. - **Regularization**: This approach relies on penalized maximum likelihood estimation of the (inverse) covariance matrix. Through appropriate penalty parameters, some entries of the estimated inverse covariance matrix are shrunk to zero, which in turn yields zeros in the partial correlation matrix and therefore a sparse network structure. Multiple values of the penalty parameter are evaluated, and information criteria are used to select the optimal penalization. For data sets with missing values, two promising missing approaches are implemented: - **Two-step Expectation-Maximization (EM):** A fast method that estimates the correlation matrix via an EM algorithm using the `lavaan` package. It performs well when the sample size is very large relative to the amount of missingness and the complexity of the network. - **Stacked Multiple Imputation (MI):** A more robust approach across a wider range of sample sizes. Multiple imputation is performed using predictive mean matching (PMM) with the `mice` package. The imputed data sets are stacked into a single data set, and a correlation matrix is estimated from this combined data. Both methods produce a correlation matrix that is then used for network estimation. It is also possible to compute the correlation matrix using pairwise or listwise deletion. However, these methods are generally not recommended, except in specific cases (e.g., when data are missing completely at random and the proportion of missingness is very small). By default, correlations are computed using Pearson correlations. However, with complete data, listwise deletion, or the stacked MI approach, users may choose to treat variables as ordered categorical, in which case polychoric and polyserial correlations are computed where appropriate. This option is particularly advisable when variables have a low number of categories or exhibit noticeable non-normality. At the same time, estimating polychoric and polyserial correlations requires a sufficiently large number of observations relative to the number of variables to ensure stable and reliable estimates. In addition to network estimation, the package also supports **stepwise regression search** based on information criteria for a **single dependent variable**. This regression search is available for both complete and incomplete data and relies on the same two-step EM or stacked MI procedures to handle missing values as the network analysis. While both methods to handle missingness are expected to perform well in this context, no specific simulation study has been conducted to compare their effectiveness for single regression modeling, and thus their relative strengths remain an open question. ### Data Sets The package includes dummy data sets that resemble a typical psychological data set, where the number of observations is considerably larger than the number of variables. Although the variables have descriptive names, these are included solely to make the examples more engaging - the data themselves are **fully synthetic**. Three data sets without missing values are included: - `mantar_dummy_full_cont`: Fully observed data (no missing values) - `mantar_dummy_full_cat`: Fully observed data with ordered categorical variables - `mantar_dummy_full_mix`: Fully observed data with a mix of continuous and ordered categorical variables Additionally, three data sets with missing values are provided: - `mantar_dummy_mis_cont`: Data with approximately 30% missing values in each continuous variable - `mantar_dummy_mis_cat`: Data with approximately 25% missing values in each ordered categorical variable - `mantar_dummy_mis_mix`: Data with approximately 25% missing values in each variable, with a mix of continuous and ordered categorical variables These data sets are intended for examples and testing only. ```{r example} # Load some example data sets for the ReadMe data(mantar_dummy_full_cont) data(mantar_dummy_full_cat) data(mantar_dummy_mis_cont) # Preview the first few rows of these data sets head(mantar_dummy_full_cont) head(mantar_dummy_full_cat) head(mantar_dummy_mis_cont) ``` ## Function Usage The `mantar` package provides two primary functions for network estimation: `neighborhood_net()` and `regularization_net()`. This section introduces their key arguments and demonstrates their usage with practical examples. We begin by estimating a network using `neighborhood_net()` with a complete data set (i.e., without missing values). Next, we show how to estimate a network when the data set contains missing data. Finally, we provide a brief example of network estimation using regularization techniques via the `regularization_net()` function. ### Network Estimation via Neighborhood Selection The `neighborhood_net()` function estimates a network structure based on neighborhood selection using information criteria for model selection in node-wise regressions. The function can either be provided with raw data (data frame or matrix) or a correlation matrix along with sample sizes for each variable. The examples will use raw data, as this is the more complex case. The following arguments are particularly relevant for controlling the network estimation process (with fully observed data): #### Information Criteria The `ic_type` argument controls the penalty applied during model selection for node-wise regressions. It defines the penalty per parameter (i.e., the number of predictors plus the intercept), thereby influencing the sparsity of the resulting model. The available options are: - `ic_type = "bic"` (default): corresponds to the **Bayesian Information Criterion (BIC)** - `ic_type = "aic"`: corresponds to the **Akaike Information Criterion (AIC)** - `ic_type = "aicc"`: corresponds to the **corrected Akaike Information Criterion (AICc)** #### Estimation of Partial Correlation The `pcor_merge_rule` argument determines how partial correlations are estimated based on the regression results between two nodes: - `"and"` (default): a partial correlation is estimated **only if both** regression weights (from node A to B and from B to A) are non-zero. - `"or"`: a partial correlation is estimated if **at least one** of the two regression weights is non-zero. Although both options are available, current simulation evidence suggests that the `"and"` rule yields more accurate partial correlation estimates than the `"or"` rule. Therefore, changing this default is **not recommended** unless you have a specific reason. #### Type of Correlation The `ordered` argument specifies how variables are treated when estimating correlations from raw data. - **Global specification:** - `ordered = TRUE`: all variables are treated as ordered categorical - `ordered = FALSE`: all variables are treated as continuous - **Variable-specific specification:** - A logical vector of length equal to the number of variables can be supplied to indicate which variables are treated as ordered categorical (e.g., `ordered = c(TRUE, FALSE, FALSE, TRUE)`). - If a global specification is used, the function automatically creates a vector of this value with the same length as the number of variables. Based on these specifications, the function applies the appropriate correlation type for each pair of variables: - both `FALSE`: **Pearson correlation** - one `TRUE` and one `FALSE`: **polyserial correlation** - both `TRUE`: **polychoric correlation** #### Estimation with Continuous Complete Data After discussing the key arguments, we can now illustrate how to estimate a network structure using the `neighborhood_net()` function with a data set without missing values. ```{r example_network} # Estimate network from full data set using BIC, the and rule as well as treating the data as continuous result_full_cont <- neighborhood_net(data = mantar_dummy_full_cont, ic_type = "bic", pcor_merge_rule = "and", ordered = FALSE) # View estimated partial correlations result_full_cont ``` #### Estimation with Ordered Categorical Complete Data We can also estimate a network structure when some variables are ordered categorical. In the following example, we treat all variables as ordered categorical. ```{r example_network_ord} # Estimate network from full data set using BIC, the and rule as well as treating the # data as ordered categorical result_full_cat <- neighborhood_net(data = mantar_dummy_full_cat, ic_type = "bic", pcor_merge_rule = "and", ordered = TRUE) # View estimated partial correlations result_full_cat ``` In the case of missing data, the `neighborhood_net()` function offers several additional arguments that control how sample size and missingness are handled. #### Calculation of Sample Size The `n_calc` argument specifies how the sample size is calculated for each node-wise regression. This affects the penalty term used in model selection. The available options are: - `"individual"` *(default)*: Uses the number of non-missing observations for each individual variable. This is the recommended approach. - `"average"`: Uses the average number of non-missing observations across all variables. - `"max"`: Uses the maximum number of non-missing observations across all variables. - `"total"`: Uses the total number of observations in the data set (i.e., the number of rows). #### Handling Missing Data The `missing_handling` argument specifies how the correlation matrix is estimated when the input data contains missing values. Two approaches are supported: - `"two-step-em"`: Applies a standard **Expectation-Maximization (EM)** algorithm to estimate the covariance matrix. This method is the default as it is computationally efficient. However, it only performs well when the sample size is large relative to the amount of edges in the network and the proportion of missingness. - `"stacked-mi"`: Applies **multiple imputation** to create several completed data sets, which are then stacked into a single data set. A correlation matrix is computed from this stacked data. As described previously, deletion techniques (listwise and pairwise) are also available, but their use is not recommended. When `"two-step-em"` is selected, the correlation matrix is always based on Pearson correlations, regardless of the `ordered` argument. In contrast, when `"stacked-mi"` is used, the `ordered` argument determines how variables are treated (continuous vs. ordered categorical) during the correlation estimation. If `"stacked-mi"` is used, the `nimp` argument controls the number of imputations (default: `20`), while `imp_method` specifies the imputation method (default: `"pmm"` for predictive mean matching). #### Estimation with Missing Data We can now illustrate how to estimate a network structure using the `neighborhood_net()` function with a data set that contains missing values. All variables are continuous in this example. ```{r example_network_mis, results = "hide"} # Estimate network for data set with missing values result_mis_cont <- neighborhood_net(data = mantar_dummy_mis_cont, n_calc = "individual", missing_handling = "stacked-mi", nimp = 20, imp_method = "pmm", pcor_merge_rule = "and") ``` ```{r example_network_mis_view} # View estimated partial correlations result_mis_cont ``` Note: Network estimation with stacked multiple imputation may take some time. During the imputation process, messages from the `mice` package may be printed. ### Network Estimation via Regularization The `regularization_net()` function estimates a network structure based on regularization techniques using information criteria for model selection. Similar to `neighborhood_net()`, this function can either be provided with raw data (data frame or matrix) or a correlation matrix along with sample sizes for each variable. The examples will use raw data, as this is the more complex case. The following arguments are particularly relevant for controlling the network estimation process (with fully observed data): #### Type of Regularization Penalty and corresponding Parameters The `penalty` argument controls the type of regularization used in the network estimation. The recommended options are using the graphical lasso (`"glasso"`) as a convex penalty or `"atan"` as a non-convex penalty. For `glasso`, the `lambda_min_ratio` and `n_lambdas` arguments control the range and number of penalty parameters evaluated during model selection. The default values are generally appropriate. For all nonconvex penalties (e.g., `"atan"`), there is the option to specify an additional parameter via the `gamma` argument. However, just using one default value (default value is different for different penalty types) for `gamma`, by setting the argument `vary = "lambda"` is sufficient. The last argument controlling the regularization process is `pen_diag` which specifies whether the diagonal elements of the covariance matrix should be penalized (`TRUE`) or not (`FALSE`, default). #### Information Criteria The `ic_type` argument determines the information-criterion penalty used during model selection in the regularization process. It specifies the penalty applied per freely estimated parameter (i.e., each included edge or nonzero partial correlation), thereby controlling the sparsity of the resulting model. The available options are: - `ic_type = "bic"` (default): corresponds to the **Bayesian Information Criterion (BIC)** - `ic_type = "ebic"`: corresponds to the **Extended Bayesian Information Criterion (EBIC)** - `ic_type = "aic"`: corresponds to the **Akaike Information Criterion (AIC)** The default depends on the selected regularization approach. For non-convex penalties, the default is `"bic"`, whereas for the `"glasso"` penalty the default is `"ebic"`. In the latter case, an additional parameter `extended_gamma` must be specified (default: `0.5`). #### Type of Correlation The `ordered` argument specifies how variables are treated when estimating correlations from raw data. - **Global specification:** - `ordered = TRUE`: all variables are treated as ordered categorical - `ordered = FALSE`: all variables are treated as continuous - **Variable-specific specification:** - A logical vector of length equal to the number of variables can be supplied to indicate which variables are treated as ordered categorical (e.g., `ordered = c(TRUE, FALSE, FALSE, TRUE)`). - If a global specification is used, the function automatically creates a vector of this value with the same length as the number of variables. Based on these specifications, the function applies the appropriate correlation type for each pair of variables: - both `FALSE`: **Pearson correlation** - one `TRUE` and one `FALSE`: **polyserial correlation** - both `TRUE`: **polychoric correlation** #### Estimation with Continuous Complete Data After discussing the key arguments, we can now illustrate how to estimate a network structure using the `regularization_net()` function with a data set without missing values. ```{r example_reg_network} # Estimate network from full data set using BIC and the glasso penalty result_full_cont <- regularization_net(data = mantar_dummy_full_cont, penalty = "glasso", vary = "lambda", n_lambda = 100, lambda_min_ratio = 0.1, ic_type = "bic", pcor_merge_rule = "and", ordered = FALSE) # View estimated partial correlations result_full_cont ``` With missing data, the `regularization_net()` function offers several additional arguments that control how sample size, information criteria computation and missingness are handled. #### Calculation of Sample Size for Information Criteria The `n_calc` argument specifies how the sample size is calculated for the information criteria computation. Only one value is needed here, as the regularization approach does not rely on node-wise regressions. The default input is `"average"`, which uses the average number of non-missing observations for all estimated correlations - this includes the correlations of variables with themselves. Ignoring these correlations (i.e., using the average number of non-missing observations across different variables only) is also possible with setting `count_diagonal` to `FALSE`. Within the information criteria computation, the likelihood for the candidate models has to be computed. The `likelihood` argument controls how this is done: - `"mat_based"` (default): The likelihood is computed based on the sample correlation matrix. - `"obs_based"`: The likelihood is computed based on the observed data. This option is only available when the raw input data contains no ordered categorical variables. In these cases, the observed data log-likelihood is recommended as it is a better representation of the sample data than using the sample correlation matrix. These options to compute the likelihood are also available with full data. However, they return the exact same results in this case. #### Handling Missing Data The `missing_handling` argument specifies how the correlation matrix is estimated when the input data contains missing values. Two approaches are supported: - `"two-step-em"`: Applies a standard **Expectation-Maximization (EM)** algorithm to estimate the covariance matrix. This method is the default as it is computationally efficient. However, it only performs well when the sample size is large relative to the amount of edges in the network and the proportion of missingness. - `"stacked-mi"`: Applies **multiple imputation** to create several completed data sets, which are then stacked into a single data set. A correlation matrix is computed from this stacked data. As described previously, deletion techniques (listwise and pairwise) are also available, but their use is not recommended. When `"two-step-em"` is selected, the correlation matrix is always based on Pearson correlations, regardless of the `ordered` argument. In contrast, when `"stacked-mi"` is used, the `ordered` argument determines how variables are treated (continuous vs. ordered categorical) during the correlation estimation. If `"stacked-mi"` is used, the `nimp` argument controls the number of imputations (default: `20`), while `imp_method` specifies the imputation method (default: `"pmm"` for predictive mean matching). #### Estimation with Missing Data We can now illustrate how to estimate a network structure using the `regularization_net()` function with a data set that contains missing values. All variables are continuous in this example. ```{r example_reg_network_mis} # Estimate network for data set with missing values result_mis_cont <- regularization_net(data = mantar_dummy_mis_cont, likelihood = "obs_based", penalty = "glasso", vary = "lambda", n_lambda = 100, lambda_min_ratio = 0.1, ic_type = "ebic", extended_gamma = 0.5, n_calc = "average", missing_handling = "two-step-em", pcor_merge_rule = "and", ordered = FALSE) # View estimated partial correlations result_mis_cont ``` ## Complete Example Analysis Finally, we consider a real-world example that is also described as an application in @nehler.2024. This example is based on the data from the cross-sectional study reported in @vervaet.2021. The original data is available in the [OSF Project](https://osf.io/ks85g/) and can be temporarily downloaded and loaded into the `R` environment. ```{r example_data_prep, message=FALSE, warning=FALSE} url <- "https://osf.io/download/6s9p4/" zipfile <- file.path(tempdir(), "vervaet.zip") exdir <- file.path(tempdir(), "vervaet") dir.create(exdir, recursive = TRUE, showWarnings = FALSE) download.file(url, destfile = zipfile, mode = "wb") unzip(zipfile, exdir = exdir) load(file.path(exdir, "Supplementary materials", "Dataset.RData")) ``` In this example, we analyze data from `r nrow(Data)` individuals to examine the cross-sectional network structure of 32 scores related to eating disorders (ED) and associated factors (e.g., depressive symptoms, anxiety), with the goal of identifying transdiagnostic vulnerabilities. We now perform a check for missingness in the data set. ```{r example_data_miss} colMeans(is.na(Data)) ``` Missingness proportions range from as low as `r round(min(colMeans(is.na(Data))) * 100)`% to as high as `r round(max(colMeans(is.na(Data))) * 100)`%. Overall, the average missingness rate was `r round(mean(colMeans(is.na(Data))) * 100)`%. The first decision concerns the choice of a suitable network estimation method. Simulation studies indicate that glasso regularization combined with EBIC model selection performs well when the ratio of the number of observations to the number of variables is relatively small [@isvoranu.2021; @nehler.2025]. In our case, with $N$ = `r nrow(Data)` observations and $p$ = `r ncol(Data)` variables, this ratio is relatively large. Under such conditions, both nonconvex penalties and neighborhood selection tend to perform well. The literature further suggests that neighborhood selection may be advantageous when the amount of missingness differs substantially between variables [@nehler.2024], which is the case in this data set. This motivates our choice to proceed with the `neighborhood_net()` function. Next, we examine the measurement level of the variables, as this also influences the choice of an appropriate calculation method and determines which missing-data handling strategies are feasible. An initial overview can be obtained by inspecting the summary of the data set. ```{r ordered_cat} summary(Data) ``` The variables are ordered categorical, but the number of categories is sufficiently large to treat them as continuous, as demonstrated by @johal.2023. Treating the variables as continuous gives us full flexibility in choosing among the available missing-data handling methods. @nehler.2024 showed that when the amount of available information relative to the number of parameters to be estimated (i.e., edges) is low, the stacked multiple imputation approach tends to perform better, while in other situations it performs similarly to the two-step EM algorithm. In our case, although the number of nodes is relatively high, we also have a large number of observations and only a moderate to small amount of missingness overall. Therefore, both methods are feasible, but we opt for the two-step EM approach due to its substantially lower computational demand. Estimation thus proceeds as follows: ```{r example_final_network} final_result <- neighborhood_net(data = Data, n_calc = "individual", missing_handling = "two-step-em", pcor_merge_rule = "and", ordered = FALSE) ``` The estimated partial correlation matrix can be accessed. ```{r example_final_network_view} final_result$pcor ``` This partial correlation matrix can be used for reporting purposes, but it can also serve as input for further analyses in other packages (e.g., centrality analysis, community detection). Beyond that, two methods are available for inspecting the results of networks estimated with the `mantar` package; regardless of the estimation method used or whether missing values were present (although the output structure may differ slightly). The first option is to obtain a summary of the results. ```{r example_final_network_summary} summary(final_result) ``` This output mainly provides information about the estimation process,much of which reflects the arguments we specified earlier, but it also includes two particularly informative elements. First, it reports the density of the estimated network (i.e., the proportion of non-zero edges). Second, it provides the effective sample sizes used for each node-wise regression. Most of the time, the goal is not to present the partial correlation matrix itself, but rather to visualize the resulting network structure. This can be achieved by creating a network plot, which in `mantar` builds on the functionality of the `qgraph` [package](https://CRAN.R-project.org/package=qgraph). The plot can be customized using the various options provided by `qgraph`. A common customization is to color nodes according to predefined clusters and to display full variable names in a legend. The names and grouping structure used here follow the original [analysis code](https://osf.io/6s9p4) of the study reported in @vervaet.2021. ```{r example_final_network_plot, message=FALSE, warning=FALSE} Groups <- c(rep("EDI-II", 11), rep("BDI", 1), rep("STAI", 1), rep("RS-NL", 1), rep("TCI", 7), rep("YSQ", 5), rep("FMPS", 6)) # Create names for legend Names <- c("Drive for Thinness", "Bulimia","Body Dissatisfaction", "Ineffectiveness", "Perfectionism", "Interpersonal Distrust ", "Interoceptive Awareness ", "Maturity Fears", "Asceticism", "Impulse Regulation","Social Insecurity", "Depression", "Anxiety", "Resilience", "Novelty Seeking", "Harm Avoidance", "Reward Dependence", "Persistence", "Self-Directedness", "Cooperativeness", "Self- Transcendence", "Disconnection and Rejection", "Impaired Autonomy & Performance", "Impaired Limits", "Other-directness", "Overvigilance & Inhibition", "Concern over Mistakes", "Personal Standards", "Parental Expectations", "Parental Criticism", "Doubting of Actions", "Order and Organisation") Lab_Colors <- c(rep('white', 11), rep('white', 1), rep('black', 1), rep('white', 1), rep('black', 7), rep('black', 5), rep('white', 6)) plot(final_result, layout = 'spring', nodeNames = Names, groups = Groups, label.color = Lab_Colors, vsize = 5, legend.cex = 0.15, label.cex = 1.25, negCol = "#7A0403FF", posCol = "#00204DFF") ``` This example demonstrated how to estimate a psychological network structure using the `mantar` package while appropriately handling missing data, and outlined the key considerations involved in choosing between the available estimation options. It also showed how the resulting network can be further analyzed and visualized using the methods provided in the package. ## References