ordbetaregPlease note: Due to CRAN issues with the
kableExtra package, I am not able to run the
modelsummary code in the vignette, and for that reason it
is commented out. However, this code works fine on all normal machines
and you should feel free to uncomment it and use it with
ordbetareg models.
The ordered beta regression model is designed explicitly for data with upper and lower bounds, such as survey slider scales, dose/response relationships, and anything that can be considered a proportion or a percentage. This type of data cannot be fit with the standard beta regression model because the beta distribution does not allow for any observations at the bounds, such as a percentage/proportion of 0% or 100%. The ordered beta regression model solves this problem by combining the beta distribution with an ordinal distribution over continuous and discrete, or degenerate, observations at the bounds. It is an efficient model that produces intelligible estimates that also respect the bounds of the dependent variable. For more information, I refer you to my paper on the model for more details (also there is an ungated version available here).
This notebook contains instructions for running the ordered beta
regression model in the R package ordbetareg.
ordbetareg is a front-end to brms, a very
powerful regression modeling package based on the Stan Hamiltonian Markov Chain Monte
Carlo sampler. I only show in this vignette a small part of the
features which are available via brms, and I refer the user
to the copious
documentation describing the features of the brms package. Suffice
it to say that most kinds of regression models can be fit with the
software, including hierarchical, dynamic, nonlinear and multivariate
models (or all of the above in combination). The ordbetareg
package allows for all of these features to be used with the ordered
regression model distribution by adding this distribution to
brms.
In addition, I show in this vignette how to fit the package compares
to fitting the ordered beta regression model with the R package glmmTMB.
glmmTMB uses maximum likelihood and related methods, so it
is much faster than the Bayesian inference employed by
ordbetareg, although it has less flexibility and fewer
options in terms of model possibilities.
The reproducible Rmarkdown file this vignette is compiled from can be accessed from the package Github site.
If you use the model in a paper, please cite it as:
Kubinec, Robert. “Ordered Beta Regression: A Parsimonious, Well-Fitting Model for Continuous Data with Lower and Upper Bounds.” Political Analysis. 2022. Forthcoming.
# whether to run models from scratch
run_model <- FIf you prefer to run brms directly rather than use this
package, you can run the underlying code via the R script
define_ord_betareg.R in the paper Github repo.
Please not that I cannot offer support for this alternative as it is not
a complete integration with brms.
In this vignette I use an empirical application from data from a Pew
Forum survey built into the package to show how the package works. It is
also possible to generate simulated ordered beta data using the
rordbeta function, as I do later in this guide. The
following plot shows a histogram of respondents’ views towards college
professors based on a bounded 0 to 100 scale:
data("pew")
pew %>% 
  ggplot(aes(x=as.numeric(therm))) +
  geom_histogram(bins=100) +
  theme_minimal() + 
  theme(panel.grid=element_blank()) +
  scale_x_continuous(breaks=c(0,25,50,75,100),
                     labels=c("0","Colder","50","Warmer","100")) +
  ylab("") +
  xlab("") +
  labs(caption=paste0("Figure shows the distribution of ",sum(!is.na(pew$therm))," non-missing survey responses."))The distributions of feelings towards college professors contains
both degenerate (0 and 100) and continuous responses between 0 and 100.
To model it, the outcome needs to be re-scaled to lie strictly between 0
and 1. However, it is not necessary to do that ahead of time as the
ordbetareg package will do that re-normalization
internally. I also do some other data processing tasks:
model_data <- select(pew,therm,age="F_AGECAT_FINAL",
                        sex="F_SEX_FINAL",
                        income="F_INCOME_FINAL",
                        ideology="F_IDEO_FINAL",
                        race="F_RACETHN_RECRUITMENT",
                        education="F_EDUCCAT2_FINAL",
                     region="F_CREGION_FINAL",
                        approval="POL1DT_W28",
                       born_again="F_BORN_FINAL",
                       relig="F_RELIG_FINAL",
                        news="NEWS_PLATFORMA_W28") %>% 
    mutate_at(c("race","ideology","income","approval","sex","education","born_again","relig"), function(c) {
      factor(c, exclude=levels(c)[length(levels(c))])
    }) %>% 
    # need to make these ordered factors for BRMS
    mutate(education=ordered(education),
           income=ordered(income))The completed dataset has 2538 observations.
ordbetaregThe ordbetareg function will take care of normalizing
the outcome and adding additional information necessary to estimate the
distribution. Any additional arguments can be passed to the underlying
brm function to use specific brm features. For
example, in the code below I use the backend="cmdstanr"
argument to brm(), which allows me to use the R package
cmdstanr for estimating models. cmdstanr tends
to have the most up to date version of Stan, though you must install it yourself.
What you need to pass to the ordbetareg function are the
standard components of any R model: a formula and a dataset that has all
the variables mentioned in the formula. There are additional parameters
that allow you to modify the priors, such as for the dispersion
parameter phi. While in most cases these priors are
sensible defaults, if you have data with an unusual scale (such as very
large or very small values), you may want to change the priors to ensure
they are not having an outsize effect on your estimates. If you want to
assign a prior to specific regression coefficient or another part of the
model, you can use the extra_prior option and pass a
brms-compliant prior object.
If you want to use some of brms more powerful
techniques, such as multivariate modeling, you can also pass the result
of a bf function call to the formula argument.
I refer you to the brmsformula() function
help for more details.
To demonstrate some of the power of using brms as a
regression engine, I will model education and income as ordinal
predictors by using the mo() function in the formula
definition. By doing so, we can get a single effect for education and
income instead of having to use dummies for separate education/income
categories. As a result, I can include an interaction between the two
variables to see if wealthier more educated people have better views
towards college professors than poorer better educated people. Finally,
I include varying (random) census region intercepts.
if(run_model) {
  
  ord_fit_mean <- ordbetareg(formula=therm ~ mo(education)*mo(income) +
                               (1|region), 
                       data=model_data,
                       control=list(adapt_delta=0.95),
                cores=1,chains=1,iter=500,
                refresh=0)
                # NOTE: to do parallel processing within chains
                # add the options below
                #threads=5,
                #backend="cmdstanr"
                #where threads is the number of cores per chain
                # you must have cmdstanr set up to do so
                # see https://mc-stan.org/cmdstanr/
  
} else {
  
  data("ord_fit_mean")
  
}Because this model is somewhat complicated with multiple ordinal
factors and multilevel intercepts, I use the option
adapt_delta=0.95 to improve sampling and remove divergent
transitions.
The default priors in ordbetareg are weakly informative
for data that is scaled roughly between \([-10,10]\). Outside of those scales, you
may want to increase the SD of the Normal prior on regression
coefficients via the coef_prior_sd and/or
phi_coef_prior_sd options to ordbetareg
(depending on whether the prior is for the coefficients in the main
outcome model or the auxiliary dispersion phi regression).
If you need to set a specific prior on the intercept, pass values to the
intercept_prior_mean/phi_intercept_prior_mean
and intercept_prior_sd/phi_intercept_prior_sd
parameters (you need to have values for both for the prior to be set
correctly) to set a Normally-distributed prior on the intercept. You can
also zero out the intercept by setting a tight prior around 0, such as
intercept_prior_mean=0 and
intercept_prior_sd=0.001.
If you need to do something more involved with priors, you can use
the extra_prior option. This argument can take any
brms-specified prior, including any of the many
brms options for priors. These options are beyond the scope
of the vignette, but I refer the reader to the brms
documentation for more info.
We can visualize the model cutpoints by showing them relative to the
empirical distribution. To do so, we have to transform the cutpoints
using the inverse logit function in R (plogis) to get back
values in the scale of the response, and I have to exponentiate and add
the first cutpoint to get the correct value for the second cutpoint. The
plot shows essentially how spread-out the cutpoints are relative to the
data, though it should be noted that the analogy is inexact–it is not
the case that observations above or below the cutpoints are considered
to be discrete or continuous. Rather, as distance from the cutpoints
increases, the probability of a discrete or continuous response
increases. Cutpoints that are far apart indicate a data distribution
that shows substantial differences between continuous and discrete
observations.
all_draws <- prepare_predictions(ord_fit_mean)
cutzero <- plogis(all_draws$dpars$cutzero)
cutone <- plogis(all_draws$dpars$cutzero + exp(all_draws$dpars$cutone))
pew %>% 
  ggplot(aes(x=therm)) +
  geom_histogram(bins=100) +
  theme_minimal() + 
  theme(panel.grid=element_blank()) +
  scale_x_continuous(breaks=c(0,25,50,75,100),
                     labels=c("0","Colder","50","Warmer","100")) +
  geom_vline(xintercept = mean(cutzero)*100,linetype=2) +
  geom_vline(xintercept = mean(cutone)*100,linetype=2) +
  ylab("") +
  xlab("") +
  labs(caption=paste0("Figure shows the distribution of ",sum(!is.na(pew$therm))," non-missing survey responses."))The plot shows that the model sees significant heterogeneity between the discrete responses at the bounds and the continuous responses in the plot.
The best way to visualize model fit is to plot the full predictive
distribution relative to the original outcome. Because ordered beta
regression is a mixed discrete/continuous model, a separate plotting
function, pp_check_ordbetareg, is included that accurately
handles the unique features of this distribution. This function returns
a list with two plots, discrete and
continuous, which can either be printed and plotted or
further modified as ggplot2 objects.
The discrete plot, which is a bar graph, shows that the posterior distribution accurately captures the number of different types of responses (discrete or continuous) in the data. For the continuous plot, shown as a density plot with one line per posterior draw, the model can’t capture all of the modality in the distribution – there are effectively four separate modes – but it is reasonably accurate over the middle responses and the responses near the bounds.
# new theme option will add in new ggplot2 themes or themes
# from other packages
plots <- pp_check_ordbeta(ord_fit_mean,
                          ndraws=100,
                          outcome_label="Thermometer Rating",
                          new_theme=ggthemes::theme_economist())
plots$discreteplots$continuousThe plotting function also has the ability to animate the posterior draws for the continuous distribution:
plot2 <- pp_check_ordbeta(ord_fit_mean, animate=TRUE,
                          outcome_label="Thermometer Rating",
                          new_theme=ggthemes::theme_economist())
plot2$continuous
#> Warning: No renderer available. Please install the gifski, av, or magick
#> package to create animated output
#> NULLWe can see the coefficients from the model in table form using the
modelsummary package, which has support for
brms models. We’ll specify only confidence intervals as
other frequentist statistics have no Bayesian analogue (i.e. p-values).
We’ll also specify only the main effects of the ordinal predictors, and
give them more informative names.
Please note: modelsummary code is commented out
due to an issue with the CRAN server. Please uncomment and use it as the
package/code works fine.
#library(modelsummary)
# code fails on CRAN, but works everywhere else
# modelsummary(ord_fit_mean,statistic = "conf.int",
#              metrics = "RMSE",
#              coef_map=c("b_Intercept"="Intercept",
#                         "bsp_moeducation"="Education",
#                         "bsp_moincome"="Income",
#                         "bsp_moeducation:moincome"="EducationXIncome"))
knitr::include_graphics("model_sum.png")modelsummary tables have many more options, including
output to both html and latex formats. I refer the reader to the package
documentation for more info.
There is a related package, marginaleffects, that allows
us to convert these coefficients into more meaningful marginal effect
estimates, i.e., the effect of the predictors expresses as the actual
change in the outcome on the 0 - 1 scale. Because we have two variables
in our model that are both ordinal in nature,
marginaleffects will produce an estimate of the marginal
effect of each value of each ordinal predictor. I use the
avg_slopes function to convert the ordbetareg
model to a data frame of marginal/conditional effects in the scale of
the outcome that can be easily printed:
avg_slopes(ord_fit_mean, variables="education") %>%
  select(Variable="term",
         Level="contrast",
         `5% Quantile`="conf.low",
         `Posterior Mean`="estimate",
         `95% Quantile`="conf.high") %>% 
  knitr::kable(caption = "Marginal Effect of Education on Professor Thermometer",
               format.args=list(digits=2),
               align=c('llccc'))| Variable | Level | 5% Quantile | Posterior Mean | 95% Quantile | 
|---|---|---|---|---|
| education | Associate’s degree - Less than high school | 0.0183 | 0.067 | 0.130 | 
| education | College graduate/some postgrad - Less than high school | 0.0851 | 0.129 | 0.184 | 
| education | High school graduate - Less than high school | -0.0030 | 0.016 | 0.051 | 
| education | Postgraduate - Less than high school | 0.1565 | 0.198 | 0.250 | 
| education | Some college, no degree - Less than high school | 0.0097 | 0.044 | 0.095 | 
We can see that we have separate marginal effects for each level of
education due to modeling it as an ordinal predictor. At present the
avg_slopes() function cannot calculate a single effect,
though that is possible manually. As we can see with the raw coefficient
in the table above, the marginal effects are all positive, though the
magnitude varies across different levels of education.
Note as well that the slopes function can calculate
unit-level (row-wise) estimates of the predictors on the outcome (see
marginaleffects documentation for more info).
glmmTMBAs mentioned, glmmTMB added support in 2022 for the
ordered beta regression model. Unlike ordbetareg,
glmmTMB requires the outcome to be normalized to 0 and 1
prior to usage. Also, at present glmmTMB requires priors on
the cutpoints to be specified separately (future version of the package
should not require this). Because glmmTMB does not have as
many features as brms, it cannot fit all model options such
as the monotonic predictors for ordinal variables. However, it does have
strong support for multilevel models and random effects. The following
code normalizes the outcome and then fits and summarizes a
glmmTMB model:
# need to normalize first to 0 and 1
model_data$therm_norm <- (model_data$therm - min(model_data$therm))/(max(model_data$therm) - min(model_data$therm))
# fit requires specification of priors on cutpoints (labeled as psi)
# will not be required in future versions of glmmTMB
glmm_tmb_fit <- glmmTMB(therm_norm ~ approval + (1|region),
                        data=model_data,
                        family=ordbeta(),
                        start=list(psi = c(-1, 1)))
# model coefficients
# modelsummary(glmm_tmb_fit)
# marginal effects in scale of the outcome
avg_slopes(glmm_tmb_fit)
#> 
#>      Term             Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  approval Disapprove - Approve    0.249    0.00966 25.8   <0.001  0.23  0.268
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.highordbetareg supports brms functions for
using multiple imputed datasets and for multivariate (multiple response)
modeling, at least with two response variables. brms is
able to apply the model to a list of multiple-imputed datasets (created
with other packages such as mice) and then combine the
resulting models into a single posterior distribution, which makes it
very easy to do inference. I first demonstrate briefly how to do this
and also multivariate modeling, but I refer the user to the
brms documentation for more details.
In the code below I use the rordbeta function to
simulate some ordered beta responses given a covariate, then create
missing values by replacing some values at random with NA.
I use the mice package to impute this missing data,
creating two imputed datasets which I then feed into
ordbetareg and set the use_brm_multiple option
to TRUE:
# simplify things by using one covariate within the [0,1] interval
X <- runif(n = 100,0,1)
outcome <- rordbeta(n=100,mu = 0.3 * X, phi =3, cutpoints=c(-2,2))
# set 10% of values of X randomly to NA
X[runif(n=100)<0.1] <- NA
# create a list of two imputed datasets with package mice
mult_impute <- mice::mice(data=tibble(outcome=outcome,
                                      X=X),m=2,printFlag = FALSE) %>% 
  mice::complete(action="all")
# pass list to the data argument and set use_brm_multiple to TRUE
if(run_model) {
  
  fit_imputed <- ordbetareg(formula = outcome ~ X,
                            data=mult_impute,
                            use_brm_multiple = T,
                            cores=1,chains=1, iter=500)
  
} else {
  
  data('fit_imputed')
  
}
# all functions now work as though the model had only one dataset
# imputation uncertainty included in all results/analyses
# marginal effects, though, only incorporate one imputed dataset
#modelsummary(fit_imputed,statistic = 'conf.int')
avg_slopes(fit_imputed)
#> 
#>  Term Estimate 2.5 % 97.5 %
#>     X    0.395 0.274  0.545
#> 
#> Columns: term, estimate, conf.low, conf.highIn the following code, I show how to model a
Gaussian/Normally-distributed variable and an ordered beta regression
model together. This is useful when examining the role of a mediator,
which we will simulate here as a Normally-distributed variable
Z. In order to specify a different distribution than
ordered beta, the family parameter must be specified in the
bf formula function. Otherwise the function assumes that
the distribution is ordered beta.
# generate a new Gaussian/Normal outcome with same predictor X and mediator
# Z
X <- runif(n = 100,0,1)
Z <- rnorm(100, mean=3*X)
# use logit function to map unbounded continuous data to [0,1] interval
# X is mediated by Z
outcome <- rordbeta(n=100, mu = plogis(.4 * X + 1.5 * Z))
# use the bf function from brms to specify two formulas/responses
# set_rescor must be FALSE as one distribution is not Gaussian (ordered beta)
# OLS for mediator
mod1 <- bf(Z ~ X,family = gaussian)
# ordered beta
mod2 <- bf(outcome ~ X + Z)
if(run_model) {
  
  fit_multivariate <- ordbetareg(formula=mod1 + mod2 + set_rescor(FALSE),
                                 data=tibble(outcome=outcome,
                                             X=X,Z=Z),
                                 cores=1,chains=1, iter=500)
  
}
#modelsummary(fit_multivariate,statistic = "conf.int")
# need to calculate each sub-model's marginal effects separately
avg_slopes(fit_multivariate,resp="outcome")
#> 
#>  Term Estimate   2.5 % 97.5 %
#>     X   0.0518 -0.0832  0.185
#> 
#> Columns: term, estimate, conf.low, conf.high
avg_slopes(fit_multivariate, resp="Z")
#> 
#>  Term Estimate 2.5 % 97.5 %
#>     X     3.36  2.56   4.16
#> 
#> Columns: term, estimate, conf.low, conf.highTo estimate the indirect effect of X on
outcome that is mediated by Z, we can use the
package bayestestR:
bayestestR::mediation(fit_multivariate)
#> # Causal Mediation Analysis for Stan Model
#> 
#>   Treatment: X
#>   Mediator : Z
#>   Response : outcome
#> 
#> Effect                 | Estimate |         95% ETI
#> ---------------------------------------------------
#> Direct Effect (ADE)    |    0.701 | [-1.042, 2.336]
#> Indirect Effect (ACME) |    5.763 | [ 3.580, 8.217]
#> Mediator Effect        |    1.761 | [ 1.178, 2.272]
#> Total Effect           |    6.482 | [ 4.355, 9.313]
#> 
#> Proportion mediated: 88.91% [60.89%, 116.94%]Because the effect of Z on the outcome is positive, and
X has a positive direct (unmediated effect) on the outcome,
the total effect of X on the outcome is larger than the
direct effect of X without considering mediation.
As I explain in the paper, one of the main advantages of using a Beta
regression model is its ability to model the dispersion among
respondents not just in terms of variance (i.e. heteroskedasticity) but
also the shape of dispersion, whether it is U or inverted-U shaped.
Conceptually, a U shape would imply that respondents are bipolar, moving
towards the extremes. An inverted-U shape would imply that respondents
tend to cluster around a central value. We can predict these responses
conditionally in the sample by adding predictors for phi,
the scale/dispersion parameter in the Beta distribution. Higher values
of phi imply a uni-modal distribution clustered around a
central value, with increasing phi implying more
clustering. Lower values of phi imply a bi-modal
distribution with values at the extremes. Notably, these effects are
calculated independently of the expected value, or mean, of the
distribution, so values of phi will produce different
shapes depending on the average value.
The one change we need to make to fit this model is to add a formula
predicting phi in the code below. Because we now have two
formulas–one for the mean and one for dispersion–I use the
bf function to indicate these two sub-models. I also need
to specify phi_reg to be TRUE because some of the priors
will change.
Because there is no need to model the mean, I leave the first formula
as therm ~ 1 with the 1 representing only an intercept, not
a covariate. I then specify a separate model for phi with
an interaction between age and sex to see if
these covariates are associated with dispersion.
if(run_model) {
  
  ord_fit_phi <- ordbetareg(bf(therm ~ 1, 
                               phi ~ age + sex),
                            phi_reg = T,
                            data=model_data,
                            cores=2,chains=2,iter=500,
                            refresh=0)
  # NOTE: to do parallel processing within chains
  # add the options below
  #threads=threading(5),
  #backend="cmdstanr"
  #where threads is the number of cores per chain
  # you must have cmdstanr set up to do so
  # see https://mc-stan.org/cmdstanr/
  
} else {
  
  data("ord_fit_phi")
  
}We can quickly examine the raw coefficients:
summary(ord_fit_phi)
#>  Family: ord_beta_reg 
#>   Links: mu = identity; phi = log; cutzero = identity; cutone = identity 
#> Formula: therm ~ 1 
#>          phi ~ age + sex
#>    Data: data (Number of observations: 2535) 
#>   Draws: 2 chains, each with iter = 500; warmup = 250; thin = 1;
#>          total post-warmup draws = 500
#> 
#> Population-Level Effects: 
#>               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept         0.31      0.02     0.27     0.36 1.00      550      347
#> phi_Intercept     1.00      0.08     0.83     1.16 1.00      291      239
#> phi_age30M49      0.04      0.09    -0.15     0.21 1.01      316      350
#> phi_age50M64     -0.05      0.09    -0.22     0.13 1.01      297      254
#> phi_age65P       -0.14      0.10    -0.33     0.04 1.01      310      369
#> phi_sexFemale     0.13      0.05     0.02     0.22 1.00      414      370
#> 
#> Family Specific Parameters: 
#>         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> cutzero    -2.70      0.10    -2.89    -2.51 1.01      430      411
#> cutone      1.67      0.02     1.63     1.71 1.02      458      312
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).However, these are difficult to interpret as they relate to the Beta
distribution, which is highly nonlinear. Generally speaking, higher
values of phi mean the distribution is more concentrated
around a single point. Lower values imply the distribution is more
dispersed to the point that it actually becomes bi-modal, meaning that
responses could be close to either 0 or 1 but are unlikely in the
middle.
Because phi is a dispersion parameter, by definition the
covariates have no effect on the average value. As a result, we’ll need
to use the posterior_predict function in brms
if we want to get an idea what the covariates do. We don’t have any
fancy packages to do this for us, so we’ll have to pass in two data
frames, one with sex equal to female and one with sex equal to male.
We’ll want each data frame to have each unique value of age in the
data.
# we can use some dplyr functions to make this really easy
female_data <- distinct(model_data, age) %>% 
  mutate(sex="Female")
male_data <- distinct(model_data, age) %>% 
  mutate(sex="Male")
to_predict <- bind_rows(female_data,
                        male_data) %>% 
  filter(!is.na(age))
pred_post <- posterior_predict(ord_fit_phi,
                               newdata=to_predict)
# better with iterations as rows
pred_post <- t(pred_post)
colnames(pred_post) <- 1:ncol(pred_post)
# need to convert to a data frame
data_pred <- as_tibble(pred_post) %>% 
  mutate(sex=to_predict$sex,
         age=to_predict$age) %>% 
  gather(key="iter",value='estimate',-sex,-age)
data_pred %>% 
  ggplot(aes(x=estimate)) +
  geom_density(aes(fill=sex),alpha=0.5,colour=NA) +
  scale_fill_viridis_d() +
  theme(panel.background = element_blank(),
        panel.grid=element_blank())We can see that the female distribution is more clustered around a
central value – 0.75 – than are men, who are somewhat more likely to be
near the extremes of the data. However, the movement is modest, as the
value of the coefficient suggests. Regression on phi is
useful when examining polarizing versus clustering dynamics in the
data.
Finally, we can also simulate data from the ordered beta regression
model with the sim_ordbeta function. This is useful either
for examining how different parameters interact with each other, or more
generally for power calculation by iterating over different possible
sample sizes. I demonstrate the function here, though note that the
vignette loads saved simulation results unless run_model is
set to TRUE. Because each simulation draw has to estimate a model, it
can take some time to do calculations. Using multiple cores a la the
cores option is strongly encouraged to reduce processing
time.
To access the data simulated for each run, the
return_data=TRUE option can be set. To get a single
simulated dataset, simply use this option combined with a single
iteration and set of parameter values. The data are saved as a list in
the column data in the returned data frame. The chunk below
examines the first 10 rows of a single simulated dataset (note that the
rows are repeated k times for each iteration, while there
is one unique simulated dataset per iteration). Each predictor is listed
as a Var column from 1 to k, while the
simulated outcome is in the outcome column.
# NOT RUN IN THE VIGNETTE
single_data <- sim_ordbeta(N=100,iter=1,
                           return_data=T)
# examine the first dataset
knitr::kable(head(single_data$data[[1]]))By default, the function simulates continuous predictors. To simulate
binary variables, such as in a standard experimental design, use the
beta_type function to specify "binary"
predictors and treat_assign to determine the proportion
assigned to treatment for each predictor. For a standard design with
only one treatment variable, we’ll also specify that k=1
for a single covariate. Finally, to estimate a reasonable treatment
effect, we will specify that beta_coef=0.5, which equals an
increase of .5 on the logit scale. While it can be tricky to know a
priori what the marginal effect will be (i.e., the actual change in the
outcome), the function will calculate marginal effects and report them,
so you can play with other options to see what gets you marginal
effects/treatment effects of interest.
if(run_model) {
  
  sim_data <- sim_ordbeta(N=c(250,500,750),
                          k=1,
                          beta_coef = .5,
                          iter=100,cores=10,
                          beta_type="binary",
                          treat_assign=0.3)
  
} else {
  
  data("sim_data")
  
}For example, in the simulation above, the returned data frame stores
the true marginal effect in the marg_eff column, and lists
it as 0.284, which is quite a large effect for a \([0,1]\) outcome. The following plot shows
some of the summary statistics derived by aggregating over the
iterations of the simulation. Some of the notable statistics that are
included are power (for all covariates \(k\)), S errors (wrong sign of the estimated
effect) and M errors (magnitude of bias). As can be seen, issues of bias
decline markedly for this treatment effect size and a sample of 500 or
greater has more than enough power.
sim_data %>% 
    select(`Proportion S Errors`="s_err",N,Power="power",
         `M Errors`="m_err",Variance="var_marg") %>% 
  gather(key = "type",value="estimate",-N) %>%
  ggplot(aes(y=estimate,x=N)) +
  #geom_point(aes(colour=model),alpha=0.1) +
  stat_summary(fun.data="mean_cl_boot") + 
  ylab("") +
  xlab("N") +
  scale_x_continuous(breaks=c(250,500,750)) +
  scale_color_viridis_d() +
  facet_wrap(~type,scales="free_y",ncol = 2) +
  labs(caption=stringr::str_wrap("Summary statistics calculated as mean with bootstrapped confidence interval from simulation draws. M Errors  and S errors are magnitude of bias (+1 equals no bias) and incorrect sign of the estimated marginal effect respectively. Variance refers to estimated posterior variance (uncertainty) of the marginal effect(s).",width=50)) +
  guides(color=guide_legend(title=""),
         linetype=guide_legend(title="")) +
  theme_minimal() +
  theme(plot.caption = element_text(size=7),
        axis.text.x=element_text(size=8))While the sim_ordbeta function has the ability to
iterate over N, it is of course possible to do more complex
experiments by wrapping the function in a loop and passing different
values of other parameters, such as treat_assign.