## Introduction

In this vignette, we explain how replication of an ANOVA can be tested with the prior predictive $$p$$-value through the ANOVAreplication R-package (Zondervan-Zwijnenburg 2018). More detailed information on the prior predictive $$p$$-value (Box 1980) in the context of replication can be found in Zondervan-Zwijnenburg, Van de Schoot, and Hoijtink (2019).

We present the general workflow to use the prior predictive $$p$$-value and illustrate it with examples from the Reproducibility Project Psychology (Open Science Collaboration 2012), but let us first load the ANOVAreplication package.

library(ANOVAreplication)

## Workflow

The workflow to calculate and interpret the prior predictive $$p$$-value is depicted in the figure below. All steps in this workflow can be taken with the ANOVAreplication functions, or through the interactive environment that can be launched by running runShiny() in the R-console. The interactive application is also available online.

For the first three steps (1a-1c) of the prior predictive $$p$$-value workflow, the only thing that is needed is the original study.

• Step 1a is to derive the relevant feature to be evaluated in the test statistic from the findings of the original study. We will elaborate on this topic in the Relevant Features section of this vignette.
• Step 1b is defining the population for which replication should be rejected (i.e., $$H_a$$). What is the ordering of the means in this population and what is the effect size in that ordering? $$H_a$$ can be a population in which all means are equal (this is also the default in the runShiny function), but it does not have to be.
• Step 1c is obtaining the data of the original study, or reconstruct the data based on reported means, standard deviations and group sample sizes.

If the new study is not yet conducted, Step 2 is calculating the required sample size per group for the new study to reject replication with sufficient power in case $$H_a$$ is true, where power is denoted by $$\gamma$$ in the figure above. The sample sizes calculation can be conducted with the sample.size.calc function in the ANOVAreplication package. It can occur that the function indicates that the target power level was not reached for a set maximum sample size. For some combinations of original study effect sizes, original study sample sizes, $$H_\text{RF}$$ and $$H_a$$, the power is limited, and there is no new study sample size that will result in sufficient statistical power. This means that the Type II error probability (1-$$\gamma$$) is higher than requested. The researcher should reconsider of conducting the new study is justifiable. We advise to test replication mainly for original studies with at least a medium effect size and preferably group sample sizes $$\geq 50$$.

Step 3 is computing the prior predictive $$p$$-value (with prior.predictive.check) and the power associated to the sample size of the new study (with power.calc). Note that it is not a post-hoc power analysis, as the definition of $$H_a$$ is unrelated to the new study. Hence, the power to reject replication for $$H_a$$ can be insufficient (i.e., larger than 1- the preset Type II error rate $$\beta$$), while the prior predictive $$p$$-value is statistically significant, or vice versa.

The workflow figure above assists in interpreting the resulting $$p$$-value, considering the statistical power to reject replication for $$H_a$$, unless the prior predictive $$p$$-value is 1.00. If the new study perfectly meets the features described in $$H_\text{RF}$$, the prior predictive $$p$$-value will be 1.00. In such a case, we confirm replication of the relevant features in the original study as captured in $$H_\text{RF}$$, irrespective of power.

In case of a non-significant result in combination with low power, the researcher should emphasize the probability that not rejecting replication is a Type II error, and it is advised to conduct a replication study with larger $$n_{jr}$$. The required sample size per group can again be calculated with the sample.size.calc function. If the required $$n_{jr}$$ is excessive given $$H_a$$, it may be an inevitable conclusion that the original study is not suited for replication testing by means of the prior predictive $$p$$-value. If replication is rejected despite low power, it implies that the observed new dataset deviates more from $$H_R$$ than most datasets simulated under $$H_a$$. With sufficient statistical power, it is still informative to notify the reader of the achieved power and/or the probability of a Type II error.

## Relevant Features

To obtain a test-statistic $$\bar F$$ in each dataset, we evaluate a Relevant Feature Hypothesis $$H_\text{RF}$$ that captures the findings of the original study.

The constraints in $$H_\text{RF}$$ should be based on the findings of the original study, which implies that $$H_\text{RF}$$ is always in agreement with the results of the original study. The results alone, however, are usually not enough to determine which $$H_\text{RF}$$ is to be evaluated. For example, an original study shows that $$\bar{y}_{1o} < \bar{y}_{2o} < \bar{y}_{3o}$$. This finding may lead to $$H_\text{RF}$$: $$\mu_{1d}< \mu_{2d} < \mu_{3d}$$, but also to $$H_\text{RF}$$: $$(\mu_{1d},\mu_{2d})$$< $$\mu_{3d}$$ or $$H_\text{RF}$$: $$\mu_{1d}$$< $$(\mu_{2d} ,\mu_{3d})$$. Which exact features should be covered in $$H_\text{RF}$$, can be guided by the conclusions of the original study. For example, if in the original study it is concluded that a treatment condition leads to better outcomes than two control conditions, the most logical specification of the relevant features is $$H_\text{RF}$$: $$(\mu_{\text{controlA}d},\mu_{\text{controlB}d})$$< $$\mu_{\text{Treatment}d}$$. Alternatively, if in the original study it is concluded that treatment A is better than treatment B, which is better than the control condition, a logical relevant feature hypothesis would be: $$H_\text{RF}$$: $$\mu_{\text{TreatmentA}d}> \mu_{\text{TreatmentB}d} > \mu_{\text{Control}d}$$. It may also occur that the researcher conducting the replication test has an interest to evaluate a claim that is not in the original study, but could be made based on its results. In all cases, the researcher conducting the replication test should substantiate the choices made in the formulation of $$H_\text{RF}$$ with results from the original study. It is good practice to also pre-register $$H_\text{RF}$$.

We can use the function create_matrices to transform $$H_\text{RF}$$ into matrices that can be used by the R-package as follows:

#mu1<mu3 & mu2 <mu3
create_matrices(varnames=c("g1","g2","g3"),hyp="g1<g3 & g2<g3")
## $Amat ## g1 g2 g3 ## g1<g3 -1 0 1 ## g2<g3 0 -1 1 ## ##$difmin
## g1<g3 g2<g3
##     0     0
##
## $E ## [1] 0 Or equivalently: #also #mu1<mu3 & mu2 <mu3 create_matrices(varnames=c("g1","g2","g3"),hyp="(g1,g2)<g3") ##$Amat
##       g1 g2 g3
## g1<g3 -1  0  1
## g2<g3  0 -1  1
##
## $difmin ## g1<g3 g2<g3 ## 0 0 ## ##$E
## [1] 0

These matrices can be used to calculate $$\bar F$$ with Fbar.dif.

HR <- create_matrices(varnames=c("g1","g2","g3"),hyp="(g1,g2)<g3")
Fbar.dif(data.r,Amat=HR$Amat,difmin=HR$difmin,effectsize=FALSE)

Alternatively, it may be of interest to test if the effect sizes found in the original study can be replicated. Effect sizes can be quantified using Cohen’s $$d$$ for the difference between two groups. Given that we are testing more qualitative conclusions of the original study, it can be interesting to use the threshold for an effect size category as a minimum in $$H_\text{RF}$$. That is: .00 for negligable, .20 for small, .50 for medium, and .80 for large effect sizes. For example, the original study may demonstrate that $$\hat d_{12o} = .67$$, and $$\hat d_{23o} = .34$$, where $$\hat d_{jj'o}$$ denotes an estimate of Cohen’s $$d$$ based on the observed data. Researchers may consider this result replicated if $$H_{R}$$: $$d_{12r} \geq .5$$ & $$d_{23r} \geq .2$$ is supported by the new data. Better even, would be to use a minimally relevant effect size. Including a minimum value or difference in a hypothesis is in line with , who highlights the relevance of detectability of an effect with the example of levitation. If the original study documents 9 inch of levitation in an experimental group, researchers may be more interested in rejecting the qualitative claim that levitation is an existing and detectable phenomenon, that is testing, for example, $$H_{R}: d_{\text{control,experimental,}r} > 0.2$$, than in assessing whether a levitation of 7 inch as found in a new study is significantly different from the 9 inch in the original study, that is, testing $$H_{R}$$: $$\mu_{\text{experimental,}r}=9.00$$.

#d12>.5, d23>.2
create_matrices(varnames=c("g1","g2","g3"), hyp="g2-g1>.5&g3-g2>.2")
## $Amat ## g1 g2 g3 ## g2-g1>.5 -1 1 0 ## g3-g2>.2 0 -1 1 ## ##$difmin
## g2-g1>.5 g3-g2>.2
##      0.5      0.2
##
## $E ## [1] 0 #g4>(g1,g2,g3). g4-g1>0.8, g4-g2>0.5, g4-g3>0.2 create_matrices(varnames = c("g1","g2","g3","g4"), hyp = "g4-g1>0.8 & g4-g2>0.5 & g4-g3>0.2") ##$Amat
##           g1 g2 g3 g4
## g4-g1>0.8 -1  0  0  1
## g4-g2>0.5  0 -1  0  1
## g4-g3>0.2  0  0 -1  1
##
## $difmin ## g4-g1>0.8 g4-g2>0.5 g4-g3>0.2 ## 0.8 0.5 0.2 ## ##$E
## [1] 0

These matrices can be used to evaluate effect size differences with Fbar.dif as follows:

HR <- create_matrices(varnames=c("g1","g2","g3"),hyp="g2-g1>.5&g3-g2>.2")
Fbar.dif(data.r,Amat=HR$Amat,difmin=HR$difmin,effectsize=TRUE)

Finally, in some cases it may be of interest whether the means as found in the original study are replicated in the new study. This implies that $$\mu_{1r} = \bar{y}_{1o}, ..., \mu_{Jr} = \bar{y}_{Jo}$$. For example, if $$\bar{y}_{1o} = 1, \bar{y}_{2o} = 2, \bar{y}_{3o} =3$$ is observed, the corresponding hypothesis for the new study is $$H_{R}$$: $$\mu_{1r} = 1, \mu_{2r} = 2, \mu_{3r} =3$$.

To calculate $$\bar F$$ for such an exact hypothesis, the Fbar.exact function can be used as follows:

means <- aggregate(data$y,by=list(data$g),mean)[,2]
Fbar.dif(data.r,exact=means)

Note that the fact that the new study will not find these exact means, will not necessarily mean that replication will be rejected. On the contrary. Such as specific $$H_\text{RF}$$ is also never true for the predicted data. The prior predictive $$p$$-value will indicate if the deviance from $$H_\text{RF}$$ in the new study is extreme (i.e., significant) compared to deviance in the predicted data.

Again, which exact features should be covered in $$H_\text{RF}$$ is to be decided based on the results of the original study. The researcher conducting the replication test should substantiate the choices made in the formulation of $$H_\text{RF}$$ with results from the original study. It is good practice to also pre-register $$H_\text{RF}$$.

## Example 1: Simple Ordering of Means

The first study is Fischer, Greitemeyer, and Frey (2008), who studied the impact of self-regulation resources on confirmatory information processing.

According to the theory, people who have low self-regulation resources (i.e., depleted participants) will prefer information that matches their initial standpoint. An ego-threat condition was added, because the literature proposes that ego-threat affects decision relevant information processing, although the direction of this effect is not clear. To determine which relevant feature of the results should be tested for replication, we follow the original findings:

Planned contrasts revealed that the confirmatory information processing tendencies of participants with reduced self-regulation resources […] were stronger than those of nondepleted […] and ego threatened participants […].

This translates to: $$H_\text{RF}$$: $$\mu_{\text{low self-regulation},r}>$$ $$(\mu_{\text{high self-regulation},r},\mu_{\text{ego-threatened},r})$$ (Workflow Step 1a).

We want to reject replication when all means in the population are equal. That is: $$H_{a}$$: $$\mu_{\text{low self-regulation},r}=$$ $$(\mu_{\text{high self-regulation},r}=$$$$\mu_{\text{ego-threatened},r})$$ (Workflow Step 1b).

We simulate the original data based on the means, standard deviations and sample sizes reported in Fischer, Greitemeyer, and Frey (2008) (Workflow Step 1c).

means = c(0.36,-0.19,-0.18)
sds = c(1.08,0.53,0.81)
n = c(28,28,28) #N = 84
y <- list(NA)
for (i in 1:3){y[[i]] <- generate.data(n[i],means[i],sds[i])}
y <- unlist(y)
group <- c(rep(0,n[1]),rep(1,n[2]),rep(2,n[3]))
p <- length(unique(group))
data=data.frame(y=y,g=group)
aggregate(data$y,by=list(data$g),mean)
##   Group.1     x
## 1       0  0.36
## 2       1 -0.19
## 3       2 -0.18
dataFischer <- data

As the replication study is already conducted by Galliani (2015), we do not calculate the required sample size to test replication (Workflow Step 2). Instead, we load or generate the new data (here the data is generated, to simplify the vignette).

means = c(-0.07485080,-0.04550836,0.12737181)
sds = c(0.4498046,0.4655386,0.6437770)
n.r = c(48,47,45)
y <- list(NA)
for (i in 1:3){y[[i]] <- generate.data(n.r[i],means[i],sds[i])}
y <- unlist(y)
group <- c(rep(0,n.r[1]),rep(1,n.r[2]),rep(2,n.r[3]))
p <- length(unique(group))
data=data.frame(y=y,g=group)
aggregate(data$y,by=list(data$g),mean)
##   Group.1           x
## 1       0 -0.07485080
## 2       1 -0.04550836
## 3       2  0.12737181
dataGalliani <- data

Now, we can proceed to calculate the prior predictive $$p$$-value and the power of the replication test (Workflow Step 3). To obtain the prior predictive $$p$$-value and calculate its power, we first need to obtain the posterior of the original study with the Gibbs.ANOVA function.

post <- Gibbs.ANOVA(dataFischer,it=1000,burnin=1000,seed=2019)