For illustration, we selected the Cleveland Clinic Heart Disease Data set from the University of California in Irvine (UCI) machine learning data repository (Dua and Graff 2017). Below, we are using eleven variables, five of which are continuous, four are dichotomous, and two categorical variables.
library(modgo)
data("Cleveland", package = "modgo")
# Specifying dichotomous and ordinal categorical variables
<- c("Sex","HighFastBloodSugar","CAD","ExInducedAngina")
binary_variables <- c("Chestpaintype","RestingECG")
categorical_variables <- 500
nrep <- c("Age", "STDepression", binary_variables[c(1,3)], categorical_variables) plot_variables
In this section, we run modgo with its default settings. For modgo to produce results that mimic the original data set efficiently, user needs to specify dichotomous and ordinal categorical variables. Variables will be considered as continuous, otherwise. All modgo runs in this and the following sections will produce 500 data sets with the specification nrep = 500; the default is 100.
Figure 1 shows the correlation plots for the default modgo run, and Figure 2 displays the distribution plots for the original data set and one simulated data set. The default displayed simulated data set is the first one. Moreover, for all the plots a set of variables are used.
<- modgo(data = Cleveland,
test bin_variables = binary_variables,
categ_variables = categorical_variables,
nrep = nrep)
modgo provides an option so that only subjects (instances) are simulated that fulfill a specific requirement. In the simplest case (Section 2.1), the user can specify an upper or a lower boundary, or an interval for a variable. The use may alternatively specify a combination of variables and thresholds.
Three steps are required when subjects need to fulfill a specific selection criterion for a continuous variable. First, the name of the variable needs to be specified, for which the threshold needs to be set. Second, the left and right boundaries need to be specified. Third, a data frame with three columns is defined with Column 1: variable name of threshold variable, Column 2: left boundary, i.e., lower bound, Column 3: right boundary, i.e., upper bound. Finally, the data frame is imported using the thresh_var argument. In the example, all subjects have to be at least 66 years old. The selection variable therefore is Age with left threshold 65 and right threshold infinity NA.
If the percentage of samples fulfilling the indicated threshold requirements are less than 10% of the simulated samples, modgo stops to avoid excessive computation time. However, users can force thresh_force = TRUE the requested simulation to be run.
Figure 3 shows the correlation plot for this illustration. Substantial differences between the original and the simulated correlation plots can be observed for the RestingECG and several other variables. Figure 4 displays the corresponding distribution plot. The age distribution is shifted as expected. Furthermore, the distribution of subjects with coronary artery disease (CAD = 1) is higher in the simulated than the original data set.
<- c("Age")
Variables <- c(65)
thresh_left <- c(NA)
thresh_right <- data.frame(Variables, thresh_left, thresh_right)
thresholds
print(as.matrix(thresholds))
## Variables thresh_left thresh_right
## [1,] "Age" "65" NA
<- modgo(data = Cleveland,
test_thresh bin_variables = binary_variables,
categ_variables = categorical_variables,
thresh_var = thresholds,
nrep = nrep,
thresh_force = TRUE)
For continuous variables, modgo provides the option to add a normally distributed noise with mean 0 and variance \(\sigma_{p}^2\). With this perturbation, the variance of the perturbed variable is identical to the variance of the original variable. This option permits the generation of values from continuous variables, which were not observed in the original data set.
To specify which variables are to be perturbed and to which degree, i.e., percentage, the user needs to provide modgo with a named vector of the percentages and with the corresponding variables names as the names of the vector.
Similar to the previous examples, Figure 5 shows the correlation plots for the expansion to perturbations, and Figure 6 displays the distribution plots. Figure 6 shows that the distribution of both resting blood pressure and cholesterol change substantially due to the perturbation.
#Create named vector
<- c(0.9,0.7)
perturb_vector names(perturb_vector) <- c("RestingBP","Cholsterol")
<- modgo(data = Cleveland,
test_pertru bin_variables = binary_variables,
categ_variables = categorical_variables,
pertr_vec = perturb_vector,
nrep = nrep)
Another feature provided by modgo is the ability to simulate a data set using the Generalized Lambdas Distribution (GLD) method. This method allows users to generate values that are not present in the original data set, as the default method only uses values from the original data. The GLD method is based on the GLDEX package (Su 2007). More information on Generalized Lambdas Distributions can be found in Fitting Statistical Distributions (Zaven A. Karian 2000).
<- modgo(data = Cleveland,
test_GLD bin_variables = binary_variables,
categ_variables = categorical_variables,
generalized_mode = TRUE,
nrep = nrep)
GLDEX provides three basic models for calculating the four Lambdas for each distribution. These models are called rmfmkl (default model), rprs, and star (Su 2007). They can also be combined for a bi-modal estimation. We give you the option to specify your desired model(or a combination of models) for each variable in the dataset. In the next step, we will show you how to specify the desired GLD models.
<- c("Age","STDepression")
Variables <- c("rprs", "star-rmfmkl")
Model <- cbind(Variables,
model_matrix Model)
<- modgo(data = Cleveland,
test_GLD_define_model bin_variables = binary_variables,
categ_variables = categorical_variables,
generalized_mode = TRUE,
generalized_mode_model = model_matrix,
nrep = nrep)
By examining the distribution plots shown above, we can observe that the GLD method has the capability to generate values for certain variables, such as STDepression, that do not exist in the original data set and can also be extremely high. Additionally, there is an alternative option to compute Generalized Lambdas independently of the modgo function and subsequently utilize them as an input for a subsequent modgo run.
<- generalizedMatrix(data = Cleveland,
gener_lambdas_matrix generalized_mode_model = model_matrix,
bin_variables = binary_variables)
<- modgo(data = Cleveland,
test_GLD_define_model_set_lambdas bin_variables = binary_variables,
categ_variables = categorical_variables,
generalized_mode = TRUE,
generalized_mode_lmbds = gener_lambdas_matrix,
nrep = nrep)
Lastly, an intriguing feature provided by modgo in
conjunction with the GLD method is the capability to simulate a data set
without requiring the original data. To execute modgo without a
data set, the user must provide the following:
1) Correlation matrix of the data set
2) Generalized matrix of the data set
3) Sample size of the simulated data set
4) Variable names
Below, we present an example of this case.
# Necessary arguments
<- generalizedMatrix(data = Cleveland,
gener_lambdas_matrix generalized_mode_model = model_matrix,
bin_variables = binary_variables)
<- cor(Cleveland)
sigma <- colnames(sigma)
variables_names <- 100
sample_size
<- modgo(data = NULL,
test_GLD_no_data_set variables = variables_names,
bin_variables = binary_variables,
categ_variables = categorical_variables,
sigma = sigma,
generalized_mode = TRUE,
generalized_mode_lmbds = gener_lambdas_matrix,
n_samples = sample_size,
nrep = nrep)
## Data set is not provided
To demonstrate the simulation of survival variables, we chose the cancer data set from the survival package (Therneau 2023). This data set contains 167 samples and 10 variables. In order to set up modgo_survival(), the user must specify a status variable and a time variable, in addition to the other arguments of modgo.
# cancer prepare
data("cancer", package = "survival")
<- na.omit(cancer)
cancer $sex <- cancer$sex - 1
cancer$status <- cancer$status - 1
cancer
<- "time"
time_var_cancer <- "status"
status_var_cancer <- c("status", "sex")
bin_var_cancer <- c("ph.ecog")
cat_var_list_cancer
<- colnames(cancer)[1:6] plot_variables_surv
The modgo_survival function divides the data set into two separate data sets based on the status variable, and then it simulates each data set individually using the Generalized Lambdas Distribution method. The user can specify which GLDEX model should be used for each data set, with the default being “rprs”.
# Survival run
<- modgo_survival(data = cancer,
test_surv surv_method = 1,
bin_variables = bin_var_cancer,
categ_variables = cat_var_list_cancer,
event_variable = status_var_cancer,
time_variable = time_var_cancer,
generalized_mode_model_no_event = "rmfmkl",
generalized_mode_model_event = "rprs")
In the following plot, the surv_fit() curves from the survival package for both original and simulated data are depicted.
<- c(rep("Original", dim(test_surv$original_data)[1]),
data_set_info rep("Simulated", dim(test_surv$simulated_data[[1]])[1]))
<- rbind(test_surv$original_data,
combine_data_set $simulated_data[[1]])
test_surv<- cbind(combine_data_set,
combine_data_set
data_set_info)<- survfit(Surv(time, status) ~ data_set_info,
fit data=combine_data_set)
plot(fit,
fun = "F",
col=1:2)
legend(700, 1,
c("Original data set", "Simulated data set"),
lty=c(1,1),
col=c(1,2),
bty='n',
lwd=2)