% \VignetteEngine{knitr::knitr} % \VignetteEncoding{UTF-8} % \VignetteIndexEntry{Introduction to ReplicationSuccess} % \VignetteDepends{knitr} \documentclass[a4paper, 11pt]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage[english]{babel} \usepackage{graphics} \usepackage{amsmath, amssymb} \usepackage[round]{natbib} \bibliographystyle{apalikedoiurl} \usepackage{doi} \usepackage{color} \newcommand{\hl}{\textcolor{black}} \newcommand{\soutr}[1]{\textcolor{red}{\xout {\textcolor{black}{#1}}}} %\usepackage{todonotes} %% do not reload the xcolor package, otherwise the colors defined by knitr %% are purged % \usepackage[dvipsnames]{xcolor} %% could alternatively load xcolor and redefine knitr color (however, not robust %% to changes of knitr) % \definecolor{fgcolor}{rgb}{0.345, 0.345, 0.345} % own commands %% custom commands \newcommand{\eg}{\textit{e.\,g.\,}} % e.g. \newcommand{\ie}{\textit{i.\,e.\,}} % i.e. \newcommand{\cf}{\textit{cf\,}} % cf \DeclareMathOperator{\Nor}{N} % Normal \DeclareMathOperator{\Var}{Var} % Variance \DeclareMathOperator{\E}{\mathsf{E}} % Expectation \DeclareMathOperator{\Cov}{Cov} % Covariance \DeclareMathOperator{\Corr}{Corr} % Correlation \DeclareMathOperator{\se}{se} % Standard error \newcommand{\dmin}{d_{\tiny \mbox{min}}} % margins % \usepackage[a4paper, total={6.5in, 10in}]{geometry} % title, author, date, etc. \title{\textbf{Design and Analysis of Replication Studies with \texttt{ReplicationSuccess}}} \author{\textbf{Leonhard Held, Charlotte Micheloud, Samuel Pawel, Felix Hofmann, Florian Gerber} \\ Epidemiology, Biostatistics and Prevention Institute (EBPI) \\ Center for Reproducible Science (CRS) \\ University of Zurich, Switzerland} \date{Package version \Sexpr{packageVersion(pkg = "ReplicationSuccess")}} % hyperref options \definecolor{darkcoral}{rgb}{0.8, 0.36, 0.27} \definecolor{darkcyan}{rgb}{0.0, 0.55, 0.55} \usepackage{hyperref} \hypersetup{ bookmarksopen=true, breaklinks=true, pdftitle={ReplicationSuccess}, pdfauthor={Leonhard Held, Charlotte Micheloud, Samuel Pawel}, colorlinks=true, linkcolor=darkcoral, anchorcolor=black, citecolor=darkcyan, urlcolor=darkcoral, } \begin{document} << "knitr-options", echo = FALSE >>= library(knitr) opts_chunk$set(size = "small", fig.height = 4, fig.align = "center", cache = FALSE, message = FALSE, warning = FALSE) @ \maketitle \begin{center} \begin{minipage}{0.65\textwidth} \rule{\textwidth}{0.4pt} {\small \centering \textbf{Abstract} \\ This vignette provides an overview of the \textsf{R} package \texttt{ReplicationSuccess}. The package contains utilities for the design and analysis of replication studies. Traditional methods based on statistical significance and confidence intervals, % as well as more recently developed methods such as the sceptical $p$-value % \citep{Held2020} are included. as well as the sceptical $p$-value \citep{Held2020} are included. The functionality of the package is illustrated using data sets from four large-scale replication projects which come also with the package. } \rule{\textwidth}{0.4pt} \end{minipage} \end{center} \section{Introduction} Over the course of the last decade, the conduct of replication studies has increased substantially. These developments were mainly caused by the so-called ``replication crisis'' in the social and life sciences. However, there is no consensus which statistical approach should be used to assess whether a replication study successfully replicated an original discovery. 
Moreover, depending on the chosen approach for analysis, the statistical
considerations in the design of the replication study differ.
The \textsf{R} package \texttt{ReplicationSuccess} provides functionality to
analyse and plan replication studies in several ways.
Specifically, functions for analysis, power, and sample size calculations based
% on statistical significance, as well as based on more recent methods, such as
% the sceptical $p$-value \citep{Held2020}, are included.
on statistical significance and confidence intervals, as well as on more recent
methods, such as the sceptical $p$-value \citep{Held2020}, are included.
This vignette illustrates the use of the package with the data sets from four
large-scale replication projects, which are also included in the package.

\texttt{ReplicationSuccess} was first created to provide software for computing
the sceptical $p$-value and related power and sample size calculations
\citep{Held2020}. Methods to compute the $p$-value for intrinsic credibility
\citep{Held2019a}, the harmonic mean $\chi^2$-test \citep{Held2020b},
forecasting of replication studies \citep{Pawel2020}, and interim analyses of
replication studies \citep{MicheloudHeld2021} were added subsequently.
Recently, substantial changes to many of the existing functions were made due
to two recalibrations of the sceptical $p$-value approach
\citep{Held_etal2022, Held_etal2022-Fram}. Use
\texttt{news(package = "ReplicationSuccess")} to see a history of the changes.
The development version of \texttt{ReplicationSuccess} is available on GitHub
(\url{https://github.com/crsuzh/ReplicationSuccess/}); the stable version is
available on CRAN
(\url{https://cran.r-project.org/web/packages/ReplicationSuccess/}).

\subsection{Statistical framework}
\texttt{ReplicationSuccess} assumes a simple but general statistical framework:
The (suitably transformed) effect estimates $\hat{\theta}_o$ and
$\hat{\theta}_r$ from the original (subscript $o$) and replication (subscript
$r$) study are assumed to be normally distributed around the unknown effect
size $\theta$. Their variances are equal to their squared standard errors
$\sigma_o^2$ and $\sigma_r^2$, which are assumed to be known. The same
framework is also common in meta-analysis and can for example be applied to
mean differences, odds ratios (log transformation), or correlation coefficients
(Fisher $z$-transformation).

Many of the functions in the package take unitless quantities as input: the
$z$-values $z_o = \hat{\theta}_o/\sigma_o$ and $z_r = \hat{\theta}_r/\sigma_r$,
the relative effect size $d = \hat{\theta}_r/\hat{\theta}_o$ (or shrinkage
$s = 1 - d$), and the variance ratio $c = \sigma^2_o/\sigma^2_r$. The squared
standard errors are usually inversely proportional to the sample size of each
study, $\sigma^2_o = \kappa^2/n_o$ and $\sigma^2_r = \kappa^2/n_r$ for some
unit variance $\kappa^2$. The variance ratio can then be identified as the
relative sample size $c = n_r/n_o$. This may require some adaptation for
certain types of effect sizes. Computation of these quantities from real data
will be illustrated below.
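Before turning to real data, a small sketch with hypothetical numbers may help:
suppose an original study reports an odds ratio of 2.0 with 95\% confidence
interval from 1.2 to 3.3, and the replication an odds ratio of 1.5 with 95\%
confidence interval from 1.1 to 2.1. The unitless quantities are then obtained
on the log odds ratio scale:
<< >>=
## hypothetical example: original odds ratio 2.0 (95% CI from 1.2 to 3.3),
## replication odds ratio 1.5 (95% CI from 1.1 to 2.1)
thetao <- log(2.0); thetar <- log(1.5)        # log odds ratios
seo <- (log(3.3) - log(1.2))/(2*qnorm(0.975)) # standard error (original)
ser <- (log(2.1) - log(1.1))/(2*qnorm(0.975)) # standard error (replication)
thetao/seo     # z-value of the original study
thetar/ser     # z-value of the replication study
seo^2/ser^2    # variance ratio c (approximately the relative sample size)
thetar/thetao  # relative effect size d
@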
\section{Data sets}
\texttt{ReplicationSuccess} includes data from four replication projects with a
``one-to-one'' design (\ie one replication for one original study), and from
one replication project with multiple replications for one original study.
The four replication projects with a one-to-one design are:
\begin{itemize}

\item \textbf{Reproducibility Project: Psychology:} In the
  \emph{Reproducibility Project: Psychology}, 100 replications of studies from
  the field of psychology were conducted \citep{Opensc2015}. The original
  studies were published in three major psychology journals in 2008. Only the
  study pairs of the ``meta-analytic subset'' are included here, which consists
  of 73 studies for which the standard error of the Fisher $z$-transformed
  effect estimates can be computed \citep{Johnson2016}.

\item \textbf{Experimental Economics Replication Project:} This project
  attempted to replicate 18 experimental economics studies published between
  2011 and 2015 in two high impact economics journals \citep{Camerer2016}. For
  this project a \emph{prediction market} was also conducted in order to
  estimate the peer beliefs about whether a replication would result in a
  statistically significant result. Prediction markets are a tool to aggregate
  beliefs of market participants regarding the possibility of an investigated
  outcome and they have been used successfully in numerous domains, \eg sports
  and politics \citep{Dreber2015}. The estimated peer beliefs are also included
  for each study pair.

\item \textbf{Social Sciences Replication Project:} This project involved 21
  replications of studies in the social sciences published in the journals
  \emph{Nature} and \emph{Science} between 2010 and 2015 \citep{Camerer2018}.
  As in the experimental economics replication project, a prediction market to
  estimate peer beliefs about the replicability of the original studies was
  conducted and the resulting belief estimates are also provided in the
  package. In this project, the replications were conducted in two stages. In
  stage 1, the replication studies had 90\% power to detect 75\% of the
  original effect estimate. Data collection was stopped if a two-sided
  $p$-value $< 0.05$ and an effect in the same direction as the original were
  found. If not, data collection was continued in stage 2 to have 90\% power to
  detect 50\% of the original effect size for the first and second data
  collection pooled.

\item \textbf{Experimental Philosophy Replicability Project:} In this project,
  40 replications of experimental philosophy studies were carried out. The
  original studies had to be published between 2003 and 2015 in one of 35
  journals in which experimental philosophy research is usually published (a
  list defined by the coordinators of this project) and they had to be listed
  on the experimental philosophy page of Yale University \citep{Cova2018}. The
  data from the subset of 31 study pairs for which effect estimates on the
  correlation scale as well as effective sample sizes for both the original and
  replication study were available are included in the package.

\end{itemize}

In these data sets, effect estimates are provided as correlation coefficients
($r$), as well as Fisher $z$-transformed correlation coefficients
($\hat{\theta} = \text{tanh}^{-1}(r)$). In the descriptive analysis of data
from replication projects it has become common practice to transform effect
sizes to the correlation scale, because correlations are restricted to the
interval from minus one to one and are thus easy to compare and interpret.
Design and statistical analysis, on the other hand, are then usually carried
out on a scale where the distribution of the estimates is well approximated by
a normal distribution. For correlation coefficients this is the case after
applying the Fisher $z$-transformation, which leads to their variance
asymptotically being only a function of the study sample size $n$, \ie
$\Var(\hat{\theta}) = 1/(n - 3)$ \citep{Fisher1921}.
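For example, a correlation of $r = 0.3$ estimated from $n = 50$ participants
(hypothetical numbers) translates into the following effect estimate and
standard error on the Fisher $z$ scale:
<< >>=
r <- 0.3; n <- 50  # hypothetical correlation and sample size
atanh(r)           # Fisher z-transformed effect estimate
1/sqrt(n - 3)      # corresponding standard error
@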
The data can be loaded with the command \texttt{data("RProjects")}. For a
description of the variables see the documentation with \texttt{?RProjects}.
An extended version of the Social Sciences Replication Project including the
details of stages one and two can be loaded with \texttt{data("SSRP")}. It is a
good idea to first compute the unitless quantities $z_o$, $z_r$ and $c$, since
most functions of the package use them as input.
% We also use the function \texttt{z2p} to compute the one-sided
% $p$-values for original and replication study. As all original estimates are
% positive, we specify the argument \texttt{alternative} to \texttt{"greater"}.

<< "data-loading" >>=
library(ReplicationSuccess)
data("RProjects")
str(RProjects, width = 80, strict.width = "cut")

## computing zo, zr, c
RProjects$zo <- with(RProjects, fiso/se_fiso)
RProjects$zr <- with(RProjects, fisr/se_fisr)
RProjects$c <- with(RProjects, se_fiso^2/se_fisr^2)
@

Note that each variable ending with an \texttt{o} is associated with the
original study, while each variable ending with an \texttt{r} is associated
with the replication study. Plotting the original versus the replication effect
estimate on the correlation scale paints the following picture.

<< "plot-projects", fig.height = 6 >>=
## plots of effect estimates
par(mfrow = c(2, 2), las = 1, pty = "s", mar = c(4, 2.5, 2.5, 1))
for (p in unique(RProjects$project)) {
  data_project <- subset(RProjects, project == p)
  significant <- ifelse(data_project$pr1 < 0.025, "#DF536BCC", "#00000080")
  plot(rr ~ ro, data = data_project, ylim = c(-0.5, 1),
       col = significant, xlim = c(-0.5, 1), main = p,
       xlab = expression(italic(r)[o]), cex = 0.7, pch = 19,
       ylab = expression(italic(r)[r]))
  legend("topleft", legend = "replication significant", cex = 0.8,
         pch = 20, col = "#DF536BCC", bty = "n")
  abline(h = 0, lty = 2)
  abline(a = 0, b = 1, col = "grey")
}
@

In most cases the replication estimate is smaller than the corresponding
original estimate. Furthermore, a substantial number of the replication
estimates do not achieve statistical significance at the one-sided 2.5\% level,
while almost all original estimates did.

The fifth data set comes from the following project:
\begin{itemize}
\item \textbf{Protzko et al. (2020)}: This data set originates from a
  prospective replication project involving four laboratories
  \citep{Protzko2020}. Each of them conducted four original studies and for
  each original study a replication study was carried out within the same lab
  (self-replication) and by the other three labs (external-replication). Most
  studies used simple between-subject designs with two groups and a continuous
  outcome so that for each study, an estimate of the standardized mean
  difference (SMD) could be computed from the group means, group standard
  deviations, and group sample sizes. For studies with covariate adjustment
  and/or binary outcomes, effect size transformations as described in the
  supplementary material of \citet{Protzko2020} were used to obtain effect
  estimates and standard errors on the SMD scale.
\end{itemize}
% The dataset can be loaded with the command \texttt{data("protzko2020")}.
The forest plots of the effect estimates for the first four experiments look as
follows.
<< fig.height = 6 >>=
data("protzko2020")

## forestplots of effect estimates
parOld <- par(mar = c(5, 8, 4, 2), mfrow = c(2, 2))
experiments <- unique(protzko2020$experiment)[1:4]
for (ex in experiments) {
  ## compute CIs
  dat <- subset(protzko2020, experiment == ex)
  za <- qnorm(p = 0.975)
  plotDF <- data.frame(lower = dat$smd - za*dat$se,
                       est = dat$smd,
                       upper = dat$smd + za*dat$se)
  colpalette <- c("#000000", "#1B9E77", "#D95F02")
  cols <- colpalette[dat$type]
  yseq <- seq(1, nrow(dat))

  ## forestplot
  plot(x = plotDF$est, y = yseq, xlim = c(-0.15, 0.8),
       ylim = c(0.8*min(yseq), 1.05*max(yseq)), type = "n",
       yaxt = "n", xlab = "Effect estimate (SMD)", ylab = "")
  abline(v = 0, col = "#0000004D")
  arrows(x0 = plotDF$lower, x1 = plotDF$upper, y0 = yseq,
         angle = 90, code = 3, length = 0.05, col = cols)
  points(y = yseq, x = plotDF$est, pch = 20, lwd = 2, col = cols)
  axis(side = 2, at = yseq, las = 1, labels = dat$type, cex.axis = 0.85)
  title(main = ex)
}
@

\section{Design and analysis of replication studies}
Although a replication study needs to be planned and conducted before the
results can be analysed, we will first discuss the particular analysis
approaches. We do this because the chosen analysis strategy has a substantial
impact on the design of a replication study. In the design phase of a
replication study, we will then focus only on the determination of the sample
size.

\subsection{Statistical significance}
\paragraph{Analysis}
The most commonly used approach is to declare a replication successful if both
the original and replication study achieve statistical significance (in the
same direction). The significance level is conventionally chosen as 0.05 for
two-sided $p$-values and 0.025 for one-sided $p$-values. Working with one-sided
$p$-values is usually simpler since the effect direction is automatically taken
into account, while with two-sided $p$-values one also has to check whether the
original and replication estimates go in the same direction.
% There are some variations of this approach, for example,
% \citet{Camerer2016} only assessed whether the replication effect is significant in
% the same direction, but not whether the original effect shows the same significance
% status.
For the four one-to-one replication projects we can simply check whether the
one-sided $p$-values (in the positive direction) of the original and
replication studies are both below the conventional threshold of 0.025.

<< >>=
for (p in unique(RProjects$project)) {
  data_project <- subset(RProjects, project == p)
  significant_O <- data_project$po1 < 0.025
  significant_R <- data_project$pr1 < 0.025
  success <- significant_O & significant_R
  cat(paste0(p, ": \n"))
  cat(paste0(round(mean(significant_O)*100, 1),
             "% original studies significant (",
             sum(significant_O), "/", length(significant_O), ")\n"))
  cat(paste0(round(mean(significant_R)*100, 1),
             "% replications significant (",
             sum(significant_R), "/", length(significant_R), ")\n"))
  cat(paste0(round(mean(success)*100, 1),
             "% both studies significant in the same direction (",
             sum(success), "/", length(success), ")\n \n"))
}
@

Despite its appealing simplicity, assessing replication success with
statistical significance is often criticized. For example, non-significant
replication results are expected if the original finding was a false positive,
% (\eg with 95\% probability if the two-sided
% significance level is 5\%),
but they can also be caused by low power of the replication study
\citep{Goodman1992}.
On the other hand, statistical significance can still be achieved for a
%trivial
replication effect estimate which is much smaller than the one from the
original study, provided its standard error is small enough (\eg because of a
very large replication sample size).

\paragraph{Design}
Selecting the same sample size in the replication study as in the original
study may lead to a severely underpowered design and, as a result, true effects
may not be detected. To ensure that the replication study reliably detects true
effects, the studies should be well-powered. In classical sample size planning,
usually a clinically relevant effect is specified and the sample size is then
determined so that it can be detected with a certain power. Luckily, in the
replication setting the clinically relevant effect does not need to be
specified but can be replaced with the effect estimate from the original study.
However, the standard sample size calculation approach is not well suited,
because the uncertainty of the original effect estimate is ignored. One way of
tackling this issue is to use a Bayesian approach, incorporating the original
estimate and its precision into a design prior that is used for power
calculations. This corresponds to the concept of ``predictive power'' and
generally leads to larger sample sizes than the standard method. In practice,
however, more ad hoc approaches are often used. For instance, the original
estimate is simply shrunken by an (arbitrary) constant, \eg it was halved in
the social sciences replication project, and standard sample size calculations
are then carried out.

Using the function \texttt{sampleSizeSignificance}, it is straightforward to
plan the sample size of the replication study with the approaches just
mentioned. The argument \texttt{designPrior} allows sample size planning based
on classical power ignoring the uncertainty (\texttt{"conditional"}) or based
on predictive power (\texttt{"predictive"}). Moreover, ad hoc shrinkage can be
specified with the argument \texttt{shrinkage}. Note that the function
\texttt{sampleSizeSignificance}, as well as most of the functions from the
package, takes $z$-values (and not $p$-values) as arguments. The transformation
from $p$- to $z$-values and vice versa can easily be done using the functions
\texttt{p2z} and \texttt{z2p}. The following code shows a few examples. Note
that the function returns the required relative sample size $c = n_r/n_o$, \ie
the factor by which the sample size of the original study needs to be
multiplied for the replication study.

<< >>=
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05,
                       designPrior = "conditional")
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05,
                       designPrior = "predictive")
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05,
                       designPrior = "conditional", shrinkage = 0.25)
@

The power of the replication study for a fixed relative sample size $c$ can be
calculated with the function \texttt{powerSignificance}. Figure
\ref{fig:powerSignificance} shows the power to achieve significance in the
replication as a function of either the (one-sided) $p$-value or the $z$-value
of the original study. If the original estimate was just significant at the
0.025 level, the probability for significance in the replication is just about
0.5 for conditional and predictive power.
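This can be verified directly with \texttt{powerSignificance}; the calls below
use the same (default) settings as in Figure~\ref{fig:powerSignificance}, \ie
the one-sided 2.5\% level and the same sample size as in the original study:
<< >>=
## power for an original study just significant at the one-sided 2.5% level,
## using the default settings (equal sample sizes, one-sided 2.5% level)
powerSignificance(zo = p2z(p = 0.025, alternative = "one.sided"),
                  designPrior = "conditional")
powerSignificance(zo = p2z(p = 0.025, alternative = "one.sided"),
                  designPrior = "predictive")
@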
This result was already pointed out by \citet{Goodman1992}, yet many
practitioners of statistics still find it counterintuitive, because they
confuse type I error rates with replication probabilities. Thus, for the
replication to achieve significance with high probability, the sample size
needs to be increased compared to the original if the evidence for the original
discovery was only weak or moderate (Figure \ref{fig:sampleSizeSignificance}).

\begin{figure}[!h]
<< "plot-powerSignificance", echo = FALSE, fig.height = 4 >>=
po <- seq(0.0001, 0.05, 0.0001)/2

## plot power
plot(po, powerSignificance(zo = p2z(po, alternative = "one.sided"),
                           designPrior = "conditional")*100,
     type = "l", ylim = c(0, 100), lwd = 1.5, ylab = "Power (%)",
     xlab = expression(italic(p)[o]), las = 1, yaxt = "n")
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
abline(h = seq(0, 100, 25), lty = 1,
       col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 50, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
# abline(h = 50, col = "#333333B3", lty = 3)
lines(po, powerSignificance(zo = p2z(po, alternative = "one.sided"),
                            designPrior = "predictive")*100,
      lwd = 2, lty = 2)
legend("topright", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
@
\caption{Power to achieve significance of the replication study at the
  one-sided 2.5\% level as a function of the (one-sided) $p$-value or $z$-value
  of the original study, using the same sample size as in the original study.}
\label{fig:powerSignificance}
\end{figure}

\begin{figure}[!h]
<< "plot-sampleSizeSignificance", echo = FALSE, fig.height = 4 >>=
## plot sample size
plot(po, sampleSizeSignificance(zo = p2z(po, alternative = "one.sided"),
                                designPrior = "conditional", power = 0.8),
     type = "l", ylim = c(0.5, 8), log = "y", lwd = 1.5,
     ylab = expression(paste("Relative sample size ", n[r]/n[o])),
     xlab = expression(italic(p)[o]), las = 1, yaxt = "n")
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
axis(side = 2, las = 1, at = c(0.5, 1, 2, 4, 8, 16, 32),
     labels = c("1/2", "1", "2", "4", "8", "16", "32"))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
abline(h = c(0.5, 1, 2, 4, 8), lty = 1,
       col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 1, col = "#333333B3", lty = 3)
abline(h = 1, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
lines(po, sampleSizeSignificance(zo = p2z(po, alternative = "one.sided"),
                                 designPrior = "predictive", power = 0.8),
      lwd = 2, lty = 2)
legend("topleft", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
@
\caption{Relative sample size to achieve significance of the replication study
  at the one-sided 2.5\% level with 80\% power as a function of the (one-sided)
  $p$-value or $z$-value of the original study.}
\label{fig:sampleSizeSignificance}
\end{figure}

\subsection{Compatibility of effect size}
\paragraph{Analysis}
Another way of analysing a replication study is to examine the statistical
compatibility between the original and replication effect estimates.
A popular approach is to check whether the replication estimate is contained within a prediction interval based on the original estimate \citep{Patil2016, Pawel2020}. This approach based on a $(1 - \alpha)$ prediction interval is equivalent to conducting a meta-analytic $Q$-test with the two estimates, and rejecting compatibility when the corresponding $p$-value $p_Q < \alpha$ \citep{Hedges2019b}. The $p$-value from the $Q$-test is usually preferred, since it tells quantitatively how compatible the estimates are. In contrast, a prediction interval can give a better idea about the range of plausible replication effect estimates besides the observed one. Both approaches are available in \texttt{ReplicationSuccess}: The function \texttt{Qtest} returns the $p$-value from the meta-analytic $Q$-test, whereas the function \texttt{predictionInterval} returns a prediction interval for the replication effect based on the original counterpart (see the documentation of the functions for further details). % , a prediction interval of the replication effect % estimate can be computed under different predictive % distributions which depend on the design prior. The default design prior % \texttt{"predictive"} is likely the choice most people would want to use as it takes % into account the uncertainty of the original estimate without shrinking it. For the four data sets, we can easily compute $p$-values from $Q$-test, as well as 95\% prediction intervals. For easier visual assessment we transform the intervals and estimates back to the correlation scale. << "plot-predictionInterval", fig.height = 6 >>= ## compute prediction intervals for replication projects par(mfrow = c(2, 2), las = 1, mai = rep(0.65, 4)) for (p in unique(RProjects$project)) { data_project <- subset(RProjects, project == p) pQ <- Qtest(thetao = data_project$fiso, thetar = data_project$fisr, seo = data_project$se_fiso, ser = data_project$se_fisr) PI <- predictionInterval(thetao = data_project$fiso, seo = data_project$se_fiso, ser = data_project$se_fisr) ## transforming back to correlation scale PI <- tanh(PI) incompatible <- pQ < 0.05 ## incompatible at 5% level color <- ifelse(incompatible == FALSE, "#00000099", "#DF536BCC") study <- seq(1, nrow(data_project)) plot(data_project$rr, study, col = color, pch = 20, cex = 0.5, xlim = c(-0.5, 1), xlab = expression(italic(r)[r]), ylab = "Study", main = paste0(p, ": ", round(mean(incompatible)*100, 0), "% incompatible")) arrows(PI$lower, study, PI$upper, study, length = 0.02, angle = 90, code = 3, col = color) abline(v = 0, lty = 3) } @ While both approaches enable statements about compatibility of original and replication effect estimates, they carry a fundamental problem due to the structure of the underlying hypothesis test: If $p_Q < \alpha$ (or equivalently the $(1- \alpha)$ prediction interval does not contain the replication estimate) one has established incompatibility/non-replication. If, however, one fails to establish incompatibility, it remains unclear whether this is due to small power or true compatibility of both estimates \citep{Hedges2019b}. % The criticism that this approach receives is that for studies which are % underpowered, the prediction intervals will become very wide. % This in turn can lead to very different effect estimates being compatible, % \eg even ones that go in the opposite direction, ultimately providing % no information about the effect itself (which actually happens in some % cases in the economics and philosophy data sets). 
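The two functions can also be applied to a single pair of studies. The
following sketch uses hypothetical estimates and standard errors on the Fisher
$z$ scale (the numbers are invented for illustration); here the $Q$-test
indicates incompatibility at the 5\% level and, equivalently, the replication
estimate falls outside the 95\% prediction interval.
<< >>=
## hypothetical study pair on the Fisher z scale
thetao <- 0.4; seo <- 0.10  # original estimate and standard error
thetar <- 0.1; ser <- 0.09  # replication estimate and standard error
Qtest(thetao = thetao, thetar = thetar, seo = seo, ser = ser)
predictionInterval(thetao = thetao, seo = seo, ser = ser)
@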
\subsection{The sceptical $p\,$-value} <>= ## data from study morewedge <- subset(RProjects, study == "Morewedge et al. (2010), Science") ## Functions ssv <- function(zo, so, a) { tau2 <- so^2/(zo^2/qnorm(p = a/2, lower.tail = FALSE)^2 - 1) return(tau2) } ## Parameters and computations optionsOld <- options(scipen = 5) theta_o <- morewedge$fiso theta_r <- morewedge$fisr so <- morewedge$se_fiso sr <- morewedge$se_fisr c <- so^2/sr^2 zo <- theta_o/so zr <- theta_r/sr alpha <- 0.05 za <- qnorm(p = alpha/2, lower.tail = FALSE) ps <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "two.sided", type = "nominal"), 2) ps1 <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "nominal"), 2) ps1r <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "golden"), 2) tau <- sqrt(ssv(zo, so, alpha)) # sufficiently sceptical prior sd at alpha = 0.05 s2_p <- 1/(1/so^2 + 1/tau^2) # posterior variance mu_p <- s2_p*theta_o/so^2 # posterior mean @ \paragraph{Analysis} The \emph{sceptical $p$-value}, a new quantitative measure of replication success was first introduced in \citet{Held2020}. Conceptually, replication success is declared if the replication study is in conflict with a sceptical prior that would render the original study non-significant. The sceptical $p$-value arises from combining the analysis of credibility \citep{Matthews2001a} with the prior-predictive check \citep{Box1980}. Specifically, using Bayes theorem in reverse, the prior distribution of the effect size is determined such that based on the original study, the $(1 - \alpha)$ credible interval of the posterior distribution of the effect just includes zero. This prior corresponds to the objection of a sceptic who argues that the original finding is no longer significant if combined with a sufficiently sceptical prior. Replication success at level $\alpha$ is then achieved if the tail probability of the replication estimate under its prior predictive distribution is smaller than $\alpha$, rendering the objection of the sceptic unrealistic. The smallest level $\alpha$ at which replication success can be declared defines the sceptical $p$-value. The method has attractive properties: The sceptical $p$-value is never smaller than the ordinary $p$-values from both studies which ensures that both studies have to be sufficiently convincing on their own such that replication success is possible. It also takes into account the size of the effect estimates, \ie it becomes larger if the replication estimate is smaller than the original estimate, which guarantees that shrinkage of the replication effect estimate is penalized. One-sided sceptical $p$-values are preferred because they guarantee that replication success is only possible if the directions of original and replication effect estimates are the same. We will only report one-sided $p$-values in the following. \hl{The sceptical $p$-value in its original formulation (\hl{`nominal'}) can be easily computed with the function \texttt{pSceptical} and \texttt{type = "nominal"}.} Figure~\ref{fig:examplerevBayes} shows the original study from \citet{Morewedge2010} and its replication by the SSRP \citep{Camerer2018}. In this example, the one-sided sceptical $p$-value turns out to be $p_\text{S} = \Sexpr{ps1}$. % practical illustration of the procedure, % also see \citet{Held2020} and \citet{Held_etal2022} for technical details. 
\begin{figure}[!h] << "plot-pSceptical", echo = FALSE, fig.height = 4 >>= ## Plot df <- data.frame(estimate = factor(c("Original Study", # "Posterior", # "Sceptical Prior", "Replication Study"), levels = c("Original Study", # "Posterior", # "Sceptical Prior", "Replication Study")), x = c(1, 2), col = c(1, 4), theta = c(theta_o, theta_r), lower = c(theta_o - za*so, theta_r - za*sr), upper = c(theta_o + za*so, theta_r + za*sr), p = c(signif(2*pnorm(q = zo, lower.tail = FALSE), 1), signif(2*pnorm(q = zr, lower.tail = FALSE), 2))) ylims <- c(-0.1, 1) xlims <- c(0.5, 2.5) cex <- 1 plot(x = df$x, y = df$theta, pch = " ", xlim = xlims, ylim = ylims, xlab = "", ylab = "Effect Size", xaxt = "n", las = 1) axis(side = 1, at = df$x, labels = df$estimate, cex.axis = 0.9, mgp = c(3, 2, 0)) abline(h = seq(-2, 2, 0.25), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.4)) abline(h = 0, lty = 2) # ## one-sided CIs # arrows(x0 = df$x[-3], x1 = df$x[-3], y0 = df$theta[-3] + 0.05, y1 = df$upper[-3], # col = df$col[-3], code = 2, angle = 90, length = 0, lwd = 1.5) # points(x = df$x[-3], y = df$upper[-3], pch = 17, cex = 1.3*cex, col = df$col[-3]) ## two-sided CI arrows(x0 = df$x, x1 = df$x, y0 = df$theta + 0.05, y1 = df$upper, col = df$col, code = 2, angle = 90, length = 0.1, lwd = 1.5) arrows(x0 = df$x, x1 = df$x, y0 = df$theta - 0.05, y1 = df$lower, col = df$col, code = 2, angle = 90, length = 0.1, lwd = 1.5) ## arrows and text points(x = df$x, y = df$theta, pch = 20, cex = cex*1.5, col = df$col) text(x = df$x[1] - 0.2, y = df$theta[1], labels = bquote(paste(hat(theta)[o], " = ", .(round(df$theta[1], 2))))) text(x = df$x[2] + 0.2, y = df$theta[2], labels = bquote(paste(hat(theta)[r], " = ", .(round(df$theta[2], 2))))) text(x = df$x[1], y = df$upper[1] + 0.1, labels = bquote(paste(p[o], " = ", .(round(morewedge$po1, 4))))) text(x = df$x[2], df$upper[2] + 0.1, labels = bquote(paste(p[r], " = ", .(round(morewedge$pr1, 4))))) text(x = 1.5, y = 0.3, labels = bquote(paste(p[S], " = ", .(ps1)))) # arrows(x0 = 2.1, x1 = 2.9, y0 = 0.9, y1 = 0.9, length = 0.05, lwd = 1.2, col = "#000000B2") # text(x = 2.5, y = 0.98, labels = "reverse-Bayes", cex = 0.75, col = "#000000B2") # arrows(x0 = 3.15, x1 = 4.5, y0 = 0.9, y1 = 0.9, length = 0.05, lwd = 1.2, col = "#000000B2") # text(x = 3.8, y = 0.98, labels = "prior-data conflict assessment", cex = 0.75, col = "#000000B2") # points(x = 2, y = 0, pch = 1, cex = 2) # arrows(x0 = 1.85, x1 = 1.98, y0 = -0.2, y1 = -0.03, length = 0.05, col = "#000000B2") # text(x = 1.85, y = -0.25, labels = "fixed at zero", col = "#000000B2", cex = 0.75) box() @ \caption{Original study from \citet{Morewedge2010} and its replication by the SSRP \citep{Camerer2018}. Shown are the effect estimates $\hat\theta_o$, $\hat\theta_r$ together with the 95\% confidence interval and the one-sided $p$-values $p_o$ and $p_r$ and the nominal sceptical $p$-value $p_S$. % The one-sided sceptical $p$-value is $p_\text{S} = \Sexpr{ps1}$ for this % example. % if a % recalibration with the golden level is applied it becomes % $\tilde{p}_\text{S} = \Sexpr{ps1r}$. } \label{fig:examplerevBayes} \end{figure} The example study can be isolated with the following command: <>= morewedge <- subset(RProjects, study == "Morewedge et al. (2010), Science") @ % This method provides a sound approach to quantify % replication success and it has attractive properties. 
<>= ps_gold <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "golden"), 2) ps_contr <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "controlled"), 2) @ {\flushleft\textit{\hl{Interpretation and recalibration}}} \\ \hl{Interpretation of the sceptical $p$-value as a continuous measure of replication success is recommended. However, if an answer to ``did it replicate?'' is needed, \hl{a threshold is required}. The sceptical $p$-value can be thresholded at $\alpha$ just as an ordinary $p$-value. Our default choice is $\alpha = 0.025$ for one-sided $p$-values.} \hl{\citet{Held2020} calculated the nominal sceptical $p$-values of a number of case studies and a relatively low rate of replication success was found. In addition, \citet{Held_etal2022} further studied properties of the \hl{nominal sceptical $p$-value} and concluded that it may be too stringent for most realistic scenarios. Two recalibrations were subsequently proposed: the golden and the controlled sceptical $p$-values. They can be computed using the \texttt{pSceptical} function with \texttt{type = "golden"} or \texttt{type = "controlled"}, respectively, and should also be thresholded at $\alpha$.} \hl{The two recalibration types can be summarized as follows:} \begin{itemize} \item \hl{The \emph{golden} sceptical $p$-value \citep{Held_etal2022} provides an attractive balance between significance testing and effect size comparison. It ensures that for original studies, which are just significant at the specified significance level $\alpha$, replication success is only possible if the replication effect estimate is larger than the original one (\ie if there is no shrinkage). The golden sceptical $p$-value controls the overall Type-I error rate for $c \geq 1$, so does not provide exact overall Type-I error control for any $c$. This is our recommended default type of recalibration. In the \citet{Morewedge2010} replication example, we obtain $p_S = \Sexpr{ps_gold} < \Sexpr{alpha/2}$, so the replication is successful with the golden sceptical $p$-value.} % \item \hl{The \emph{controlled} sceptical $p$-value \citep{Held_etal2022-Fram} ensures exact overall Type-I error control for any value of the variance ratio $c$. The overall T1E rate is controlled at level $\alpha^2$ for \texttt{alternative = "two.sided"} (where $\alpha$ is two-sided) or \texttt{"one.sided"} (where $\alpha$ is one-sided) if the direction was pre-specified in advance. % For alternative is \texttt{"one.sided"} % and no pre-specified direction, the % Type-I error rate is controlled at level $2\alpha^2$ % (the conventional level for type I error control of two % independent experiments with % two-sided testing at level $2\alpha$ in the first and one-sided % testing at level $\alpha$ in the second). 
In the \citet{Morewedge2010} replication example, we obtain $p_S = \Sexpr{ps_contr} < \Sexpr{alpha/2}$, so the replication is also successful with the controlled sceptical $p$-value.} \end{itemize} \hl{The three types of sceptical $p$-values for the \citet{Morewedge2010} example are calculated as follows:} <>= print(pS_nominal <- pSceptical(zo = morewedge$zo, zr = morewedge$zr, c = morewedge$c, alternative = "one.sided", type = "nominal")) print(pS_golden <- pSceptical(zo = morewedge$zo, zr = morewedge$zr, c = morewedge$c, alternative = "one.sided", type = "golden")) print(pS_controlled <- pSceptical(zo = morewedge$zo, zr = morewedge$zr, c = morewedge$c, alternative = "one.sided", type = "controlled")) @ The one-sided \hl{golden and controlled} sceptical $p$-values are computed for the four one-to-one replication projects as follows: << "plot-pSceptical-projects", fig.height = 4 >>= ## computing one-sided golden and controlled sceptical p-value for replication projects RProjects$psG <- with(RProjects, pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "golden")) RProjects$psC <- with(RProjects, pSceptical(zo = zo, zr = zr, c = c, alternative = "one.sided", type = "controlled")) @ And the success rates in the four projects with the statistical significance criterion, and the golden and controlled sceptical $p$-value are <>= for (p in unique(RProjects$project)) { data_project <- subset(RProjects, project == p) cat(paste0(p, ": \n")) success_sceptG <- (data_project$psG < 0.025) cat(paste0(round(mean(success_sceptG)*100, 2), "% smaller than 0.025 (one-sided golden sceptical p-value) \n")) success_sceptC <- (data_project$psC < 0.025) cat(paste0(round(mean(success_sceptC)*100, 2), "% smaller than 0.025 (one-sided controlled sceptical p-value) \n")) success_tradit <- (data_project$po1 < 0.025) & (data_project$pr1 < 0.025) cat(paste0(round(mean(success_tradit)*100, 2), "% smaller than 0.025 (both one-sided traditional p-values) \n")) if(sum(success_sceptG != success_tradit) > 0){ discrep <- data_project[(success_sceptG != success_tradit), c("ro", "rr", "c", "po1", "pr1", "psG")] ## print effect estimates, 1sided p-values, and c of discrepant studies cat("Discrepant studies (golden vs significance): \n") print(signif(discrep, 2), row.names = FALSE) } if(sum(success_sceptC != success_tradit) > 0){ discrep <- data_project[(success_sceptC != success_tradit), c("ro", "rr", "c", "po1", "pr1", "psC")] ## print effect estimates, 1sided p-values, and c of discrepant studies cat("Discrepant studies (controlled vs significance): \n") print(signif(discrep, 2), row.names = FALSE) } if(sum(success_sceptC != success_sceptG) > 0){ discrep <- data_project[(success_sceptC != success_sceptG), c("ro", "rr", "c", "po1", "pr1", "psG", "psC")] ## print effect estimates, 1sided p-values, and c of discrepant studies cat("Discrepant studies (golden vs controlled): \n") print(signif(discrep, 2), row.names = FALSE) } cat("\n \n") } @ For most studies the three approaches agree, but there is disagreement for seven of them. The golden sceptical $p$-value may not indicate replication success when there is substantial shrinkage of the replication effect estimate relative to the original one, even if both estimates are significant (this is the case for one study in the psychology project, two studies in the social sciences project, one study in the philosophy project). 
In contrast, provided there is not much shrinkage, it may still indicate replication success for non-significant original or replication studies (this is the case for two studies in the psychology project). On the other hand, the controlled sceptical $p$-value can indicate success for non-significant original or replication study, even in the presence of considerable shrinkage (this is the case for one study in the psychology project and one in the experimental economics project). The following plot shows golden $p_S$ versus controlled $p_S$ for all study pairs, stratified by project. <>= par(mfrow = c(2, 2), las = 1, pty = "s", mar = c(4, 2.5, 2.5, 1)) myaxis <- c(10e-5, 0.001, 0.01, 0.1, 1) for (p in unique(RProjects$project)) { data_project <- subset(RProjects, project == p) plot(psG ~ psC, data = data_project, ylim = c(10e-5, 1), xlim = c(10e-5, 1), main = p, xlab = expression(paste("controlled ", p[S])), ylab = expression(paste("golden ", p[S])), pch = 19, col = "#00000099", cex = 1.3, axes = F, log = "xy") abline(h = 0, lty = 2) abline(a = 0, b = 1, col = "grey") abline(h = 0.025, lty = 2, col = "grey") abline(v = 0.025, lty = 2, col = "grey") axis(1, at = myaxis, labels = myaxis) axis(2,at = myaxis, labels = myaxis) box() } @ << "threshold-p-sceptical2", echo = FALSE >>= ## computing nominal, controlled, and golden replication success levels ## for one-sided uncalibrated sceptical p-value thresh_gol <- levelSceptical(level = 0.025, alternative = "one.sided", type = "golden") thresh_contr <- levelSceptical(level = 0.025, alternative = "one.sided", type = "controlled", c = c(1, 2)) thresh_nom <- levelSceptical(level = 0.025, alternative = "one.sided", type = "nominal") @ {\flushleft\textit{\hl{Replication success level}}} \\ % \hl{Instead of recalibrating the sceptical $p$-value and thresholding it at $\alpha$, it is equivalent to use the uncalibrated sceptical $p$-value and threshold it at the replication success level $\gamma$. The replication success level is computed using \texttt{levelSceptical} and the recalibration \texttt{type} (\texttt{"golden"} or \texttt{"controlled"}). The golden replication success level depends on $\alpha$ and is $\Sexpr{round(thresh_gol, 3)}$ for $\alpha = 0.025$. The controlled replication success level depends on $\alpha$ and the variance ratio $c$. It is for example $\Sexpr{round(thresh_contr[1], 3)}$ for $c = 1$ and $\Sexpr{round(thresh_contr[2], 3)}$ for $c = 2$.} \hl{Using \texttt{type = "nominal"} simply returns $\alpha$.} << "threshold-p-sceptical" >>= ## computing nominal, golden and controlled replication success levels ## for one-sided uncalibrated sceptical p-value print(rs_level_nom <- levelSceptical(level = 0.025, alternative = "one.sided", type = "nominal")) print(rs_level_gol <- levelSceptical(level = 0.025, alternative = "one.sided", type = "golden")) print(rs_level_contr <- levelSceptical(level = 0.025, alternative = "one.sided", type = "controlled", c = c(1, 2))) @ \hl{The replication success level $\gamma$ is also the bound on the partial Type-I error rate of the sceptical $p$-value \citep[Section 3.1]{Held_etal2022-Fram}. 
This means that the individual $p$-values $p_o$ and $p_r$ both need to be
smaller than $\gamma$ for replication success to be possible with the sceptical
$p$-value.}

\paragraph{Design}
Sample size calculations work similarly to those based on statistical
significance.
% Design works similarly as for the statistical significance analysis strategy;
Using the function \texttt{sampleSizeReplicationSuccess}, we need to choose a
design prior, \hl{a threshold $\alpha$ for the sceptical $p$-value,} a
recalibration type, and the desired power to obtain the required relative
sample size $c = n_r/n_o$. The following code shows two examples.

<< >>=
sampleSizeReplicationSuccess(zo = 2.5, power = 0.8, level = 0.025,
                             alternative = "one.sided",
                             designPrior = "conditional",
                             type = c("golden", "controlled"))
sampleSizeReplicationSuccess(zo = 2.5, power = 0.8, level = 0.025,
                             alternative = "one.sided",
                             designPrior = "predictive",
                             type = c("golden", "controlled"))
@

The function \texttt{powerReplicationSuccess} allows calculating the power for
a fixed relative sample size $c$. Figure \ref{fig:powerReplicationSuccess}
shows the power to achieve replication success \hl{with the golden and
controlled sceptical $p$-values} as a function of the one-sided $p$-value (or
$z$-value) of the original study, assuming equal sample sizes in the original
and replication studies ($c = 1$).
% The probability for replication success if the original
% study showed only weak evidence ($p_o = 0.025$) is now smaller than 0.5,
% which is reached for an original $p$-value of around 0.015.

\begin{figure}[!h]
<<"plot-powerReplicationSuccess", echo = FALSE, fig.height = 4>>=
## plot power
po <- seq(0.001, 0.05, length.out = 100)/2
plot(po, powerReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                 designPrior = "conditional", level = 0.025,
                                 alternative = "one.sided",
                                 type = "golden")*100,
     type = "l", ylim = c(0, 100), lwd = 1.5, ylab = "Power (%)", las = 1,
     xlab = expression(italic(p)[o]), yaxt = "n")
abline(h = seq(0, 100, 25), lty = 1,
       col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 50, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  designPrior = "predictive", level = 0.025,
                                  alternative = "one.sided",
                                  type = "golden")*100,
      lwd = 2, lty = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  designPrior = "conditional", level = 0.025,
                                  alternative = "one.sided",
                                  type = "controlled")*100,
      col = "red", lwd = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  designPrior = "predictive", level = 0.025,
                                  alternative = "one.sided",
                                  type = "controlled")*100,
      col = "red", lty = 2, lwd = 2)
legend("topright", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
# abline(h = 50, lty = 3)
legend("bottomleft", legend = c(expression(paste("golden ", p[S])),
                                expression(paste("controlled ", p[S]))),
       col = c("black", "red"), lty = 1, lwd = 2, bty = "n")
@
\caption{Power to achieve replication success (with the golden and controlled
  sceptical $p$-values and $\alpha = 0.025$) as a function of the one-sided
  $p$-value or $z$-value of the original study.}
\label{fig:powerReplicationSuccess}
\end{figure} % Figure \ref{fig:sampleSizeReplicationSuccess} shows the required sample size to achieve replication success \hl{with the golden and controlled sceptical $p$-value} with 80\% power. With the golden sceptical $p$-value, the required relative sample sizes gets very large for borderline significant original studies and is even impossible for non-significant original studies. We thus recommend to use the controlled sceptical $p$-value for sample size calculations, which also allows sample size calculation for non-significant original studies. \begin{figure}[!h] << "plot-sampleSizeReplicationSuccess", echo = FALSE, fig.height = 4>>= # po <- seq(0.0001, 0.05, 0.0001) po <- seq(0.001, 0.05, length.out = 100)/2 ## plot sample size plot(po, sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"), power = 0.8, level = 0.025, designPrior = "conditional", alternative = "one.sided", type = "golden"), type = "l", log = "y", lwd = 1.5, las = 1, ylim = c(0.5, 20), ylab = expression(paste("Relative sample size ", n[r]/n[o])), xlab = expression(italic(p)[o]), yaxt = "n") axis(side = 2, las = 1, at = c(0.5, 1, 2, 4, 8, 16, 32), labels = c("1/2", "1", "2", "4", "8", "16", "32")) # abline(h = 1, lty = 3) abline(h = c(0.5, 1, 2, 4, 8, 16, 32), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3)) abline(h = 1, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4)) axis(side = 3, at = seq(0.0, 0.025, by = 0.005), labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005), alternative = "one.sided"), 2))) mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2) suppressWarnings({ lines(po, sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"), power = 0.8, level = 0.025, designPrior = "predictive", alternative = "one.sided", type = "golden"), lwd = 2, lty = 2) lines(po, sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"), power = 0.8, level = 0.025, designPrior = "conditional", alternative = "one.sided", type = "controlled"), col = "red", lwd = 2) lines(po, sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"), power = 0.8, level = 0.025, designPrior = "predictive", alternative = "one.sided", type = "controlled"), col = "red", lwd = 2, lty = 2) }) legend("topleft", legend = c("conditional", "predictive"), title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n") legend("bottomright", legend = c(expression(paste("golden ", p[S])), expression(paste("controlled ", p[S]))), col = c("black", "red"), lty = 1, lwd = 2, bty = "n") @ \caption{Relative sample size to achieve replication success (with the golden and controlled sceptical $p$-values and $\alpha = 0.025$) with 80\% power as a function of the (one-sided) $p$-value or $z$-value of the original study.} \label{fig:sampleSizeReplicationSuccess} \end{figure} \subsection{Relative effect size} \paragraph{Analysis} The requirements on the replication $p$-value ($p_r < \alpha$) and on the sceptical $p$-value ($p_S < \alpha$) can both be transformed to requirements on the relative effect size $d = \hat{\theta}_r/\hat{\theta}_o$ \citep{Held_etal2022}. % (the % ratio of the replication effect estimate to the original effect estimate). In short, replication success is declared if the relative effect size is larger than a certain bound (the minimum relative effect size $\dmin$). The minimum relative effect size $\dmin$ is computed based on the result from the original study, the relative sample size, the significance level/threshold for the sceptical $p$-value. 
The type of recalibration is also required for the sceptical $p$-value. The functions are \texttt{effectSizeSignificance} and \texttt{effectSizeReplicationSuccess} for significance and the sceptical $p$-value, respectively. << echo = FALSE >>= d <- morewedge$fisr/morewedge$fiso dminrs <- effectSizeReplicationSuccess(zo = morewedge$zo, c = morewedge$c) dminsign <- effectSizeSignificance(zo = morewedge$zo, c = morewedge$c) @ For the \citet{Morewedge2010} example, the minimum relative effect size to achieve replication success is $\dmin = \Sexpr{round(dminsign, 2)}$ for significance and $\dmin = \Sexpr{round(dminrs, 2)}$ with the golden sceptical $p$-value. Since the observed relative effect size is $d = \Sexpr{round(d, 2)}$, the replication is successful with both methods. \section{Special topics} \subsection{Interim analysis} Adaptive designs are a type of designs where one or more interim analyses are planned during the course of a study. This topic has extensively been studied and used in clinical trials for example, where continuing a study that should be stopped may lead to serious consequences. However, this type of design has rarely been covered in the framework of replication studies. An adaptive design was adopted in the social sciences replication project, but without a power (re)calculation at interim. \texttt{ReplicationSuccess} allows to calculate the power of the replication study after an interim analysis has been performed, taking into account the results from the first part of the study. The power at interim is a useful tool to decide whether a replication study should be stopped prematurely for futility \citep{MicheloudHeld2021}. The function \texttt{powerSignificanceInterim} is an extension of \texttt{powerSignificance} and requires in addition the specification of \texttt{zi}, the $z$-value at the interim analysis and \texttt{f}, the fraction of the replication study already completed. Moreover, the argument \texttt{designPrior} can be set to \texttt{"conditional"}, \texttt{"informed predictive"} and \texttt{"predictive"}. Finally, the argument \texttt{analysisPrior} allows to also take the original result into account in the analysis of the replication study. % The function \texttt{sampleSizeSignificanceInterim} can % be used to re-estimate the sample size of the replication study when % results from an interim analysis are available. % Figure~\ref{fig:PowerSignificanceInterim} shows the conditional and the % predictive power of the replication studies that continued into stage 2. % While the conditional power at interim is larger than 80\% for all the studies, % the predictive power is close to 0\% for some studies and always smaller % than the conditional power. % \begin{figure} << echo = FALSE, fig.height = 3.5, eval = FALSE >>= data("SSRP") intpow_cond <- with(SSRP, powerSignificanceInterim(zo = fiso/se_fiso, zi = fisi/se_fisi, c = nr/no, f = ni/nr, designPrior = "conditional")) intpow_pred <- with(SSRP, powerSignificanceInterim(zo = fiso/se_fiso, zi = fisi/se_fisi, c = nr/no, f = ni/nr, designPrior = "predictive")) plot(intpow_cond*100, intpow_pred*100, xlab = "Conditional power (in %)", ylab = "Predictive power (in %)", pch = 20, cex = 1.2, xlim = c(80, 100), ylim = c(0, 100), yaxt = "n", las = 1, col = "#00000099") axis(side = 2, at = seq(0, 100, 25), las = 1) abline(a = 0, b = 1, col = "grey") @ % \caption{Conditional vs.\ predictive power at interim of the 10 studies from the % Social Science Replication Project that were not stopped after stage 1. 
% The grey line indicates the same value for conditional and predictive power.}
% \label{fig:PowerSignificanceInterim}
% \end{figure}

\subsection{Between-study heterogeneity}
It is often more realistic to assume that the unknown effect sizes from the
original and replication studies are not exactly the same, but that there is
between-study heterogeneity. This can be the case, for example, if the
replication study is conducted with a different population of participants
(\eg in a different country) or in another laboratory with different equipment.
For this reason, the functions for the design of replication studies
additionally allow incorporating uncertainty due to between-study
heterogeneity. Some functions in the package (\eg
\texttt{sampleSizeSignificance} or \texttt{predictionInterval}) allow
specifying the argument \texttt{h}, the relative between-study heterogeneity
variance $h = \tau^2/\sigma^2$, \ie the ratio of the heterogeneity variance to
the variance of the original effect estimate. By default, \texttt{h} is set to
zero; however, if between-study heterogeneity is expected, \eg because a
different population of study participants is used, this should be taken into
account in the design. See \citet{Pawel2020} for more details.

\subsection{Data-driven shrinkage with empirical Bayes}
The functions for the design of replication studies allow specifying the
argument \texttt{shrinkage} in order to shrink the original effect estimate
towards zero by a certain (arbitrary) amount. A more principled approach is to
use a design prior which induces shrinkage and then estimate the prior variance
by empirical Bayes. This leads to ``data-driven'' shrinkage that is larger when
there was only weak evidence for the effect, and smaller when there was strong
evidence for the effect (shown in Figure \ref{fig:ebshrinkage}). Furthermore,
under this prior, the specified between-study heterogeneity will also induce
shrinkage towards zero; see \citet{Pawel2020} for details. Empirical Bayes
shrinkage can be enabled by setting the argument \texttt{designPrior} to
\texttt{"EB"}.

\begin{figure}[!htbp]
<<"shrinkage", echo = FALSE, fig.height = 3.5 >>=
zo <- seq(0, 4, 0.01)
s <- pmax(1 - 1/zo^2, 0)
shrinkage <- (1 - s)*100
plot(zo, shrinkage, type = "l", ylim = c(0, 100), las = 1,
     xlab = expression(paste(italic(z)[o])), ylab = "Shrinkage (%)",
     yaxt = "n")
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0, 4, by = 1),
     labels = c(signif(z2p(seq(0, 3, by = 1), alternative = "one.sided"), 2),
                signif(z2p(4, alternative = "one.sided"), 1)))
abline(h = seq(0, 100, 25), lty = 1,
       col = adjustcolor("lightgrey", alpha.f = 0.3))
mtext(text = expression(italic(p)[o]), side = 3, line = 2)
@
\caption{Empirical Bayes shrinkage when there is no between-study
  heterogeneity.}
\label{fig:ebshrinkage}
\end{figure}

\section{Outlook}
Development of \texttt{ReplicationSuccess} will continue. We invite anyone with
ideas for new functionality, bug reports, or other contributions to the package
to get in touch with us via GitHub
(\url{https://github.com/crsuzh/ReplicationSuccess/}).

\bibliography{bibliography}

<< echo = FALSE >>=
options(optionsOld)
par(parOld)
@

\end{document}