aggreCAT datasets

library(aggreCAT)
library(tidyverse)

DARPA SCORE program and the repliCATS project

The aggreCAT package, and the mathematical aggregators therein, were developed by the repliCATS (Collaborative Assessments for Trustworthy Science) project as part of the SCORE program (Systematizing Confidence in Open Research and Evidence), funded by DARPA (Defense Advanced Research Projects Agency) (Alipourfard et al. 2021). The SCORE program is the largest replication project in science to date, and aims to build automated tools that can rapidly and reliably assign “Confidence Scores” to research claims from empirical studies in the Social and Behavioural Sciences (SBS). Confidence Scores are quantitative measures of the likely reproducibility or replicability of a research claim or result, and may be used by consumers of scientific research as a proxy measure of a claim’s credibility in the absence of replication (Alipourfard et al. 2021).

Replications are time-consuming and costly (Isager et al. 2020), and studies have shown that reliable predictions of replication outcomes can be elicited from researchers (Gordon et al. 2020). Consequently, the DARPA SCORE program generated Confidence Scores for \(> 4000\) SBS claims using expert elicitation based on two very different strategies: prediction markets (Gordon et al. 2020) and the IDEA protocol (Hemming et al. 2017), the latter of which is used by the repliCATS project (Fraser et al. 2021). A proportion of these research claims were randomly selected for direct replication, against which the elicited and aggregated Confidence Scores are ‘ground-truthed’, or verified. These verified scores are in turn intended to aid the development of artificial intelligence tools that can automatically assign Confidence Scores.

Datasets

The aggreCAT package includes the core dataset data_ratings, consisting of judgements elicited during a pilot experiment exploring the performance of IDEA groups in assessing the replicability of a set of claims with “known outcomes.” “Known-outcome” claims are SBS research claims that have been subject to replication studies in previous large-scale replication projects1. Data were collected using the repliCATS IDEA protocol at a two-day workshop2 in the Netherlands in July 2019, at which 25 participants assessed the replicability of 25 unique SBS claims. In addition to the probabilistic estimates provided for each research claim, participants were asked to rate each claim’s plausibility and comprehensibility, to state whether they were involved in any aspect of the original study, and to provide the reasoning behind their quantitative estimates, which was used to form measures of reasoning breadth and engagement (Fraser et al. 2021).

Formatted Judgement Data

data_ratings is a tidy data.frame wherein each observation (or row) corresponds to a single value in the set of values constituting a participant’s complete assessment of a research claim. Each research claim is assigned a unique paper_id, and each participant has a unique (and anonymous) user_name. The variable round denotes the round in which each value was elicited (round_1 or round_2). question denotes the type of question the value pertains to: direct_replication for probabilistic judgements about the replicability of the claim, belief_binary for participants’ belief in the plausibility of the claim, comprehension for participants’ comprehensibility ratings, and involved_binary for involvement in the original study.

An additional column element maintains the tidy structure of the data while capturing the multiple values that comprise a full assessment of the replicability (direct_replication) of a claim: three_point_best, three_point_lower and three_point_upper denote the best estimate and the lower and upper bounds, respectively. binary_question is the element for both the plausibility (belief_binary) and involvement (involved_binary) questions, whereas likert_binary is the element for participants’ comprehension ratings.

Judgements are recorded in the column value. Replicability judgements are percentage probabilities ranging from 0 to 100, the binary_question values corresponding to plausibility and involvement are binary (1 for the affirmative, -1 for the negative), and comprehension ratings are on a Likert scale from 1 through 7. Note that additional columns with participant attributes can be included in the ratings dataset if required by the user; we include the group column in data_ratings, which records the group each participant was part of. Below we show example data for a single user on a single claim to illustrate the structure of the core data_ratings dataset.

aggreCAT::data_ratings %>%
  dplyr::filter(paper_id == dplyr::first(paper_id),
                user_name == dplyr::first(user_name)) %>%
  head()
#> # A tibble: 6 × 7
#>   round   paper_id user_name  question           element           value group
#>   <chr>   <chr>    <chr>      <chr>              <chr>             <dbl> <chr>
#> 1 round_1 100      7l8m7dmjdb direct_replication three_point_lower    30 UOM1 
#> 2 round_1 100      7l8m7dmjdb involved_binary    binary_question      -1 UOM1 
#> 3 round_1 100      7l8m7dmjdb belief_binary      binary_question      -1 UOM1 
#> 4 round_1 100      7l8m7dmjdb direct_replication three_point_best     40 UOM1 
#> 5 round_1 100      7l8m7dmjdb direct_replication three_point_upper    45 UOM1 
#> 6 round_1 100      7l8m7dmjdb comprehension      likert_binary         5 UOM1
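
Because a participant’s replicability assessment is spread across several rows, it can be handy to view the three-point estimates in wide form. Below is a minimal sketch using tidyr; this is for inspection only, not a step required by the aggregation functions, which expect the long format above.

# Sketch: one row per participant, claim and round, with the
# three-point estimates spread into columns
aggreCAT::data_ratings %>%
  dplyr::filter(question == "direct_replication") %>%
  tidyr::pivot_wider(
    id_cols = c(round, paper_id, user_name),
    names_from = element,
    values_from = value
  )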

Not all data necessary for constructing performance weights are contained in data_ratings. Additional data collected as part of the repliCATS IDEA protocol are stored in datasets separate from data_ratings. Participants provided justifications for their judgements, and these are contained in data_justifications. On the repliCATS platform, users could also comment on others’ justifications (data_comments), vote on others’ comments (data_comment_ratings), and vote on others’ justifications (data_justification_ratings). Finally, aggreCAT contains three ‘supplementary’ datasets containing data collected externally to the repliCATS IDEA protocol: data_supp_quiz, data_supp_priors, and data_supp_reasons.
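
Each of these datasets can be inspected in the usual way; a quick sketch (output omitted here):

# Peek at the structure of the discussion-phase datasets
aggreCAT::data_justifications %>% dplyr::glimpse()
aggreCAT::data_comments %>% dplyr::glimpse()
aggreCAT::data_comment_ratings %>% dplyr::glimpse()
aggreCAT::data_justification_ratings %>% dplyr::glimpse()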

Quiz Score Data

Prior to the workshop, participants were asked to complete an optional quiz on statistical concepts and meta-research, which we expected would aid in reliably evaluating the replicability of research claims. Quiz responses are contained in data_supp_quiz and are used to construct performance weights for the aggregation method QuizWAgg: each participant receives a quiz_score if they completed the quiz, and NA if they did not attempt it (see Hanea et al. 2021 for further details). Alternative scorings of the quiz responses (quiz_score_even and quiz_score_stats) are also provided in data_supp_quiz.

aggreCAT::data_supp_quiz
#> # A tibble: 910 × 4
#>    user_name  quiz_score quiz_score_even quiz_score_stats
#>    <chr>           <dbl>           <dbl>            <dbl>
#>  1 9z4w4flvjn        8                12                9
#>  2 78k6istw7a        6.5              11                9
#>  3 qc9qj66h99        5.5               9                7
#>  4 mxs7lru9t4       11                16               11
#>  5 y7o584bn9z        7                11                8
#>  6 sri1ckvmbg        7.5              11                7
#>  7 l0ml45as80        8.5              13                8
#>  8 qc2crcnprb        6.5               9                6
#>  9 r7kn2x3mnj        8                13               11
#> 10 wt9lq995t9        9.5              15               11
#> # ℹ 900 more rows
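
To illustrate how quiz scores can act as linear weights, here is a minimal sketch of the weighting idea only; it is not the package’s QuizWAgg implementation, and dropping non-completers is an assumption made purely for illustration.

# Sketch: quiz-score-weighted linear pool of round 2 Best Estimates
aggreCAT::data_ratings %>%
  dplyr::filter(round == "round_2", element == "three_point_best") %>%
  dplyr::left_join(aggreCAT::data_supp_quiz, by = "user_name") %>%
  dplyr::filter(!is.na(quiz_score)) %>% # drop non-completers (illustrative choice)
  dplyr::group_by(paper_id) %>%
  dplyr::summarise(
    quiz_weighted_cs = stats::weighted.mean(value / 100, w = quiz_score)
  )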

Reasoning Data

The ReasonWAgg aggregation method constructs performance weights from the number of unique reasons each participant gave in support of their Best Estimate for a given claim, \(B_{i,c}\); the underlying reasoning data are contained in data_supp_reasons. Qualitative statements made by individuals during claim evaluation were recorded on the repliCATS platform (Pearson et al. 2021) and coded into one of 25 unique reasoning categories by the repliCATS Reasoning team (Wintle et al. 2021). Reasoning categories include plausibility of the claim, effect size, sample size, presence of a power analysis, transparency of reporting, and reputation (Hanea et al. 2021). Within data_supp_reasons, the reasoning categories that passed our inter-coder reliability threshold appear as columns whose names are prefixed with RW; for each claim paper_id, each participant user_name is assigned 1 if they invoked that reasoning category in support of their Best Estimate for that claim, and 0 otherwise. See ReasoningWAgg() for details on the ReasonWAgg aggregation method.

aggreCAT::data_supp_reasons %>%
  glimpse()
#> Rows: 593
#> Columns: 13
#> $ paper_id                                                                 <chr> …
#> $ user_name                                                                <chr> …
#> $ `RW04 Date of publication`                                               <dbl> …
#> $ `RW15 Effect size`                                                       <dbl> …
#> $ `RW16 Interaction effect`                                                <dbl> …
#> $ `RW17 Interval or range measure for statistical uncertainty (CI SD etc)` <dbl> …
#> $ `RW18 Outside participants areas of expertise`                           <dbl> …
#> $ `RW20 Plausibility`                                                      <dbl> …
#> $ `RW21 Population or subject characteristics (sampling practices)`        <dbl> …
#> $ `RW22 Power adequacy or sample size`                                     <dbl> …
#> $ `RW32 Reputation`                                                        <dbl> …
#> $ `RW37 Revision statements`                                               <dbl> …
#> $ `RW42 Significance statistical (p-value etc )`                           <dbl> …
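
As a minimal sketch of the quantity ReasonWAgg weights on (ReasoningWAgg() computes its weights internally), the reasoning breadth of each participant on each claim can be tallied by summing across the RW columns:

# Tally reasoning breadth: number of RW categories invoked
# by each participant for each claim
aggreCAT::data_supp_reasons %>%
  dplyr::mutate(
    n_reasons = rowSums(dplyr::across(dplyr::starts_with("RW")))
  ) %>%
  dplyr::select(paper_id, user_name, n_reasons)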

Bayesian Prior Data

The method BayPRIORsAgg (implemented in BayesianWAgg()) uses Bayesian updating: a prior probability that a claim will replicate, estimated from a predictive model (Gould et al. 2021), is updated with an aggregate of the Best Estimates of all participants assessing a given claim \(c\) (Hanea et al. 2021). The prior data are contained in data_supp_priors, with each claim in column paper_id assigned a prior probability (on the logit scale) of the claim replicating in column prior_means.

aggreCAT::data_supp_priors
#> # A tibble: 25 × 2
#>    paper_id prior_means
#>    <chr>          <dbl>
#>  1 20           -0.285 
#>  2 21           -1.30  
#>  3 24           -1.78  
#>  4 26           -1.75  
#>  5 28           -1.98  
#>  6 38            0.239 
#>  7 79           -0.0827
#>  8 100          -1.03  
#>  9 102           0.504 
#> 10 103          -2.12  
#> # ℹ 15 more rows
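
Because prior_means is stored on the logit scale, it can be mapped back to the probability scale with the inverse-logit function; a one-line sketch:

# Convert logit-scale priors back to the probability scale
aggreCAT::data_supp_priors %>%
  dplyr::mutate(prior_prob = stats::plogis(prior_means))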

References

Alipourfard, Nazanin, Beatrix Arendt, Daniel M Benjamin, Noam Benkler, Michael M Bishop, Mark Burstein, Martin Bush, et al. 2021. “Systematizing Confidence in Open Research and Evidence (SCORE).” SocArXiv. https://doi.org/10.31235/osf.io/46mnb.
Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, et al. 2018. “Evaluating the Replicability of Social Science Experiments in Nature and Science Between 2010 and 2015.” Nature Human Behaviour 2 (9): 637–44. https://doi.org/10.1038/s41562-018-0399-z.
Ebersole, Charles R., Olivia E. Atherton, Aimee L. Belanger, Hayley M. Skulborstad, Jill M. Allen, Jonathan B. Banks, Erica Baranski, et al. 2016. “Many Labs 3: Evaluating Participant Pool Quality Across the Academic Semester via Replication.” Journal of Experimental Social Psychology 67: 68–82. https://doi.org/10.1016/j.jesp.2015.10.012.
Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.
Fraser, Hannah, Martin Bush, Bonnie Wintle, Fallon Mody, Eden T Smith, Anca Hanea, Elliot Gould, et al. 2021. “Predicting Reliability Through Structured Expert Elicitation with repliCATS (Collaborative Assessments for Trustworthy Science).” MetaArXiv. https://doi.org/10.31222/osf.io/2pczv.
Gordon, Michael, Domenico Viganola, Michael Bishop, Yiling Chen, Anna Dreber, Brandon Goldfedder, Felix Holzmeister, et al. 2020. “Are Replication Rates the Same Across Academic Fields? Community Forecasts from the DARPA SCORE Programme.” Royal Society Open Science 7 (7): 200566. https://doi.org/10.1098/rsos.200566.
Gould, Elliot, Aaron Willcox, Hannah Fraser, Felix Singleton Thorn, and David P Wilkinson. 2021. “Using Model-Based Predictions to Inform the Mathematical Aggregation of Human-Based Predictions of Replicability.” MetaArXiv. https://doi.org/10.31222/osf.io/f675q.
Hanea, Anca, David P Wilkinson, Marissa McBride, Aidan Lyon, Don van Ravenzwaaij, Felix Singleton Thorn, Charles T Gray, et al. 2021. “Mathematically Aggregating Experts’ Predictions of Possible Futures.” PLoS ONE 16 (9): e0256919. https://doi.org/10.1371/journal.pone.0256919.
Hemming, Victoria, Mark A. Burgman, Anca M. Hanea, Marissa F. McBride, and Bonnie C. Wintle. 2017. “A Practical Guide to Structured Expert Elicitation Using the IDEA Protocol.” Edited by Barbara Anderson. Methods in Ecology and Evolution 9 (1): 169–80. https://doi.org/10.1111/2041-210x.12857.
Isager, Peder Mortvedt, R van Aert, Š Bahník, MJ Brandt, KA DeSoto, R Giner-Sorolla, JI Krueger, et al. 2020. “Deciding What to Replicate: A Formal Definition of ‘Replication Value’ and a Decision Model for Replication Study Selection.” Psychological Methods. https://doi.org/10.1037/met0000438.
Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams Jr., Štěpán Bahník, Michael J. Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability.” Social Psychology 45 (3): 142–52.
Klein, Richard A., Michelangelo Vianello, Fred Hasselman, Byron G. Adams, Jr. Reginald B. Adams, Sinan Alper, Mark Aveyard, et al. 2018. “Many Labs 2: Investigating Variation in Replicability Across Samples and Settings.” Advances in Methods and Practices in Psychological Science 1 (4): 443–90. https://doi.org/10.1177/2515245918810225.
Pearson, Ross, Hannah Fraser, Martin Bush, Fallon Mody, Ivo Widjaja, Andy Head, David Peter Wilkinson, et al. 2021. “Eliciting Group Judgements about Replicability: A Technical Implementation of the IDEA Protocol.” http://hdl.handle.net/10125/70666.
Wintle, Bonnie, Fallon Mody, Eden T Smith, Anca Hanea, David P. Wilkinson, Victoria Hemming, Martin Bush, et al. 2021. “Predicting and Reasoning about Replicability Using Structured Groups.” MetaArXiv. https://doi.org/10.31222/osf.io/vtpmb.

  1. Many Labs 1, 2 and 3 (Klein et al. 2014, 2018; Ebersole et al. 2016), the Social Sciences Replication Project (Camerer et al. 2018), and the Reproducibility Project: Psychology (Open Science Collaboration 2015).↩︎

  2. See Hanea et al. (2021) for details. The workshop was held at the annual meeting of the Society for the Improvement of Psychological Science (SIPS), <https://osf.io/ndzpt/>.↩︎