The aggreCAT package, and the mathematical aggregators therein, were developed by the repliCATS (Collaborative Assessments for Trustworthy Science) project as part of the SCORE program (Systematizing Confidence in Open Research and Evidence), funded by DARPA (Defense Advanced Research Projects Agency) (Alipourfard et al. 2021). The SCORE program is the largest replication project in science to date, and aims to build automated tools that can rapidly and reliably assign “Confidence Scores” to research claims from empirical studies in the Social and Behavioural Sciences (SBS). Confidence Scores are quantitative measures of the likely reproducibility or replicability of a research claim or result, and may be used by consumers of scientific research as a proxy measure of a claim’s credibility in the absence of replication effort (Alipourfard et al. 2021).
Replications are time-consuming and costly (Isager et al. 2020), and studies have shown that replication outcomes can be reliably elicited from researchers (Gordon et al. 2020). Consequently, the DARPA SCORE program generated Confidence Scores for \(> 4000\) SBS claims using expert elicitation based on two very different strategies – prediction markets (Gordon et al. 2020) and the IDEA protocol (Hemming et al. 2017), the latter of which is used by the repliCATS project (Fraser et al. 2021). A proportion of these research claims were randomly selected for direct replication, against which the elicited and aggregated Confidence Scores are ‘ground-truthed’ or verified. The aim of the DARPA SCORE project is to aid the development of artificial intelligence tools that can automatically assign Confidence Scores.
The aggreCAT package includes the core
dataset data_ratings consisting of judgements elicited
during a pilot experiment exploring the performance of IDEA groups in assessing the replicability of a set of claims with “known outcomes.” “Known-outcome” claims are SBS research claims that have been subject to replication studies in previous large-scale replication projects¹. Data were collected using the repliCATS IDEA protocol at a two-day workshop² in the Netherlands in July 2019, at which 25 participants assessed the replicability of 25 unique SBS claims. In addition to the probabilistic
estimates provided for each research claim assessed, participants were
also asked to rate the claim’s plausibility and comprehensibility,
answer whether they were involved in any aspect of the original study,
and to provide their reasoning in support of their quantitative
estimates, which were used to form measures of reasoning breadth and
engagement (Fraser et al. 2021).
data_ratings is a tidy data.frame wherein each observation (or
row) corresponds to a single value in the set of values
constituting a participant’s complete assessment of a research claim.
Each research claim is assigned a unique paper_id, and each
participant has a unique (and anonymous) user_name. The
variable round denotes the round in which each
value was elicited (round_1 or
round_2). question denotes the type of
question the value pertains to;
direct_replication for probabilistic judgements about the
replicability of the claim, belief_binary for participants’
belief in the plausibility of the claim, comprehension for
participants’ comprehensibility ratings, and
involved_binary for involvement in the original study. An
additional column element maintains the tidy structure of
the data, while capturing the multiple values that comprise
a full assessment of the replicability (direct_replication)
of a claim; three_point_best,
three_point_lower and three_point_upper denote
the best estimate and lower and upper bounds respectively.
binary_question describes the element for both
the plausibility rating (belief_binary) and involvement
(involved_binary) questions, whereas
likert_binary is the element describing a
participant’s comprehension rating. Judgements are recorded in the column value: probabilistic judgements about replicability are expressed as percentages ranging from 0 to 100; the binary_question values corresponding to plausibility (belief_binary) and involvement (involved_binary) are binary (1 for the affirmative and -1 for the negative); and participants’ comprehension ratings (likert_binary) are recorded on a Likert scale from 1 through 7. Note that additional columns with participant attributes can be included in the ratings dataset if required by the user; we include the group column in data_ratings, which records the group that each participant was part of. Below we show some example data for a single user and a single claim to illustrate the structure of the core data_ratings dataset.
aggreCAT::data_ratings %>%
  dplyr::filter(paper_id == dplyr::first(paper_id),
                user_name == dplyr::first(user_name)) %>%
  head()
#> # A tibble: 6 × 7
#> round paper_id user_name question element value group
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 round_1 100 7l8m7dmjdb direct_replication three_point_lower 30 UOM1
#> 2 round_1 100 7l8m7dmjdb involved_binary binary_question -1 UOM1
#> 3 round_1 100 7l8m7dmjdb belief_binary binary_question -1 UOM1
#> 4 round_1 100 7l8m7dmjdb direct_replication three_point_best 40 UOM1
#> 5 round_1 100 7l8m7dmjdb direct_replication three_point_upper 45 UOM1
#> 6 round_1 100 7l8m7dmjdb comprehension likert_binary 5 UOM1
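As a quick check of this structure (a minimal dplyr sketch for illustration only; it is not required by any aggregation function), we can count how many rows each combination of question and element contributes:

library(dplyr)

# A quick structural check: one row for each element of each question,
# per participant, claim and round (output not shown).
aggreCAT::data_ratings %>%
  count(round, question, element)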
Not all of the data needed to construct performance weights are contained in data_ratings. Additional data collected as part of the repliCATS IDEA protocol are contained in datasets separate from data_ratings. Participants provided justifications for their judgements, and these are contained in data_justifications. On the repliCATS platform, users were given the option to comment on others’ justifications (data_comments), to vote on others’ comments (data_comment_ratings) and to vote on others’ justifications (data_justification_ratings). Finally, aggreCAT contains three ‘supplementary’ datasets containing data collected externally to the repliCATS IDEA protocol: data_supp_quiz, data_supp_priors, and data_supp_reasons.
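All of these additional datasets ship with the package and can be inspected in the same way as data_ratings, for example:

# The justification and comment datasets can be examined with glimpse()
# (output not shown).
dplyr::glimpse(aggreCAT::data_justifications)
dplyr::glimpse(aggreCAT::data_comments)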
Prior to the workshop, participants were asked to complete an optional quiz on statistical concepts and meta-research, knowledge we expected would aid in reliably evaluating the replicability of research claims. Quiz scores are contained in data_supp_quiz and are used to construct performance weights for the aggregation method QuizWAgg: each participant receives a quiz_score if they completed the quiz, and NA if they did not attempt it (see Hanea et al. 2021 for further details). Scores computed under alternative scoring schemes (quiz_score_even and quiz_score_stats) are also provided in data_supp_quiz.
aggreCAT::data_supp_quiz
#> # A tibble: 910 × 4
#> user_name quiz_score quiz_score_even quiz_score_stats
#> <chr> <dbl> <dbl> <dbl>
#> 1 9z4w4flvjn 8 12 9
#> 2 78k6istw7a 6.5 11 9
#> 3 qc9qj66h99 5.5 9 7
#> 4 mxs7lru9t4 11 16 11
#> 5 y7o584bn9z 7 11 8
#> 6 sri1ckvmbg 7.5 11 7
#> 7 l0ml45as80 8.5 13 8
#> 8 qc2crcnprb 6.5 9 6
#> 9 r7kn2x3mnj 8 13 11
#> 10 wt9lq995t9 9.5 15 11
#> # ℹ 900 more rows
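To illustrate the idea behind quiz-based weighting (a minimal dplyr sketch only, not the package’s own implementation of QuizWAgg), the quiz scores can be rescaled so that they sum to one across participants:

library(dplyr)

# Sketch: rescale quiz scores so they sum to one across participants who
# attempted the quiz; participants with NA quiz_score did not attempt it.
quiz_weights <- aggreCAT::data_supp_quiz %>%
  mutate(quiz_weight = quiz_score / sum(quiz_score, na.rm = TRUE))

head(quiz_weights)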
The ReasonWAgg aggregation method uses the number of unique reasons given by a participant in support of their Best Estimate \(B_{i,c}\) for a given claim to construct performance weights; the underlying reasoning data are contained in data_supp_reasons. Qualitative statements made by individuals during claim evaluation were recorded on the repliCATS platform (Pearson et al. 2021) and coded as falling into one of 25 unique reasoning categories by the repliCATS Reasoning team (Wintle et al. 2021). Reasoning categories include plausibility of the claim, effect size, sample size, presence of a power analysis, transparency of reporting, and journal reporting (Hanea et al. 2021). Within data_supp_reasons, each reasoning category that passed our inter-coder reliability threshold appears as a column whose name is prefixed with RW, and for each claim (paper_id), each participant (user_name) is assigned a value of 1 if they included that reasoning category in support of their Best Estimate for that claim, and 0 otherwise. See ReasoningWAgg() for details on the ReasonWAgg aggregation method.
aggreCAT::data_supp_reasons %>%
  dplyr::glimpse()
#> Rows: 593
#> Columns: 13
#> $ paper_id <chr> …
#> $ user_name <chr> …
#> $ `RW04 Date of publication` <dbl> …
#> $ `RW15 Effect size` <dbl> …
#> $ `RW16 Interaction effect` <dbl> …
#> $ `RW17 Interval or range measure for statistical uncertainty (CI SD etc)` <dbl> …
#> $ `RW18 Outside participants areas of expertise` <dbl> …
#> $ `RW20 Plausibility` <dbl> …
#> $ `RW21 Population or subject characteristics (sampling practices)` <dbl> …
#> $ `RW22 Power adequacy or sample size` <dbl> …
#> $ `RW32 Reputation` <dbl> …
#> $ `RW37 Revision statements` <dbl> …
#> $ `RW42 Significance statistical (p-value etc )` <dbl> …
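The quantity that ReasonWAgg weights on, the number of distinct reasoning categories a participant invoked for a claim, can be recovered from this dataset by summing across the RW columns (again, a sketch for illustration, not the package’s internal code):

library(dplyr)

# Sketch: count how many reasoning categories each participant invoked
# for each claim by summing the 0/1 indicator columns prefixed with "RW".
reason_counts <- aggreCAT::data_supp_reasons %>%
  mutate(n_reasons = rowSums(across(starts_with("RW")), na.rm = TRUE)) %>%
  select(paper_id, user_name, n_reasons)

head(reason_counts)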
The method BayPRIORsAgg (implemented in BayesianWAgg()) uses Bayesian updating: a prior probability of a claim replicating, estimated from a predictive model (Gould et al. 2021), is updated using an aggregate of the Best Estimates of all participants assessing a given claim \(c\) (Hanea et al. 2021). The prior data are contained in data_supp_priors, where each claim (paper_id) is assigned a prior probability of the claim replicating, on the logit scale, in the column prior_means.
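Because the priors are stored on the logit scale, they can be mapped back to the probability scale with the inverse-logit function for inspection (a small illustrative sketch):

library(dplyr)

# Sketch: convert the logit-scale prior means back to probabilities
# using the inverse-logit (plogis) for easier inspection.
aggreCAT::data_supp_priors %>%
  mutate(prior_prob = plogis(prior_means)) %>%
  head()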
Many Labs 1, 2 and 3 (Klein et al. 2014; Klein et al. 2018; Ebersole et al. 2016), the Social Sciences Replication Project (Camerer et al. 2018), and the Reproducibility Project: Psychology (“Estimating the Reproducibility of Psychological Science” 2015).↩︎
See Hanea et al. (2021) for details. The workshop was held at the annual meeting of the Society for the Improvement of Psychological Science (SIPS), <https://osf.io/ndzpt/>.↩︎