---
title: "aggreCAT datasets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{aggreCAT datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: vignette_references.bib
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE, warning=FALSE}
library(aggreCAT)
library(tidyverse)
```

### DARPA SCORE program and the repliCATS project

The [aggreCAT]{.pkg} package, and the mathematical aggregators therein, were developed by [the repliCATS (Collaborative Assessment for Trustworthy Science) project](https://replicats.research.unimelb.edu.au/) as a part of the [SCORE program](https://www.darpa.mil/program/systematizing-confidence-in-open-research-and-evidence) (Systematizing Confidence in Open Research and Evidence), funded by DARPA (Defense Advanced Research Projects Agency) [@alipourfard2021]. The SCORE program is the largest replication project in science to date, and aims to build automated tools that can rapidly and reliably assign "Confidence Scores" to research claims from empirical studies in the Social and Behavioural Sciences (SBS).

Confidence Scores are quantitative measures of the likely reproducibility or replicability of a research claim or result, and may be used by consumers of scientific research as a proxy measure of a claim's credibility in the absence of replication effort [@alipourfard2021]. Replications are time-consuming and costly [@Isager2020]; however, studies have shown that replication outcomes can be reliably elicited from researchers [@Gordon2020]. Consequently, the DARPA SCORE program generated Confidence Scores for $> 4000$ SBS claims using expert elicitation based on two very different strategies -- prediction markets [@Gordon2020] and the IDEA protocol [@hemming2017], the latter of which is used by the repliCATS project [@Fraser:2021]. A proportion of these research claims were randomly selected for direct replication, against which the elicited and aggregated Confidence Scores are 'ground-truthed' or verified. Ultimately, the DARPA SCORE project aims to aid the development of artificial intelligence tools that can automatically assign Confidence Scores.

# Datasets

The [aggreCAT]{.pkg} package includes the core dataset `data_ratings`, consisting of judgements elicited during a pilot experiment exploring the performance of IDEA groups in assessing the replicability of a set of claims with "known outcomes". "Known-outcome" claims are SBS research claims that have been subject to replication studies in previous large-scale replication projects[^1]. Data were collected using the repliCATS IDEA protocol at a two-day workshop[^2] in the Netherlands in July 2019, at which 25 participants assessed the replicability of 25 unique SBS claims. In addition to the probabilistic estimates provided for each research claim assessed, participants were also asked to rate each claim's plausibility and comprehensibility, to state whether they were involved in any aspect of the original study, and to provide their reasoning in support of their quantitative estimates, which was used to form measures of reasoning breadth and engagement [@Fraser:2021].

[^1]: Many Labs 1, 2 and 3 [@Klein2014; @Klein2018ManyL2; @Ebersole2016], the Social Sciences Replication Project [@Camerer2018] and the Reproducibility Project: Psychology [@aac4716].

[^2]: See @Hanea2021 for details. The workshop was held at the annual meeting of the Society for the Improvement of Psychological Science (SIPS), <https://osf.io/ndzpt/>.
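All of the datasets described in this vignette ship with the package, so a quick way to see everything that is available is base R's `data()`. The chunk below is a small, non-evaluated illustration that lists the bundled datasets and their titles.

```{r list-datasets, eval=FALSE}
# Not run: list the datasets bundled with aggreCAT and their titles
data(package = "aggreCAT")$results[, c("Item", "Title")]
```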
## Formatted Judgement Data

`data_ratings` is a *tidy* [data.frame]{.class} wherein each *observation* (or row) corresponds to a single value in the set of `value`s constituting a participant's complete assessment of a research claim. Each research claim is assigned a unique `paper_id`, and each participant has a unique (and anonymous) `user_name`. The variable `round` denotes the round in which each `value` was elicited (`round_1` or `round_2`). `question` denotes the type of question the `value` pertains to: `direct_replication` for probabilistic judgements about the replicability of the claim, `belief_binary` for participants' belief in the plausibility of the claim, `comprehension` for participants' comprehensibility ratings, and `involved_binary` for involvement in the original study.

An additional column `element` maintains the tidy structure of the data while capturing the multiple `value`s that comprise a full assessment of the replicability (`direct_replication`) of a claim: `three_point_best`, `three_point_lower` and `three_point_upper` denote the best estimate, lower bound and upper bound, respectively. `binary_question` describes the `element` for both the plausibility rating (`belief_binary`) and involvement (`involved_binary`) questions, whereas `likert_binary` is the `element` describing a participant's `comprehension` rating.

Judgements are recorded in the column `value`. For the three-point elements these are percentage probabilities on the interval $(0, 100)$. The `binary_question`s corresponding to plausibility and involvement take binary values (`1` for the affirmative, and `-1` for the negative). Finally, values corresponding to participants' comprehension ratings are on a `likert_binary` scale from `1` through `7`. Note that additional columns with participant attributes can be included in the ratings dataset if required by the user; we include the `group` column in `data_ratings`, which records the group each participant belonged to.

Below we show some example data for a single user on a single claim to illustrate the structure of the core `data_ratings` dataset.

```{r data_ratings-sample, message=TRUE, results='hold'}
aggreCAT::data_ratings %>%
  dplyr::filter(paper_id == dplyr::first(paper_id),
                user_name == dplyr::first(user_name)) %>%
  head()
```
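Because `data_ratings` is tidy, the three-point estimates can easily be reshaped into one row per participant, claim and round when a wide layout is more convenient. The chunk below is a minimal sketch (not evaluated here) using `tidyr::pivot_wider()`; it assumes only the column names described above.

```{r data_ratings-wide, eval=FALSE}
# Not run: spread the three-point elements into columns, giving one row
# per participant x claim x round
aggreCAT::data_ratings %>%
  dplyr::filter(element %in% c("three_point_lower",
                               "three_point_best",
                               "three_point_upper")) %>%
  tidyr::pivot_wider(id_cols = c(paper_id, user_name, round),
                     names_from = element,
                     values_from = value) %>%
  head()
```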
Not all of the data necessary for constructing performance weights are contained in `data_ratings`; additional data collected as part of the repliCATS IDEA protocol are stored in separate datasets. Participants provided justifications for their judgements, and these are contained in `data_justifications`. On the repliCATS platform, users were also given the option to comment on others' justifications (`data_comments`), to vote on others' comments (`data_comment_ratings`), and to vote on others' justifications (`data_justification_ratings`). Finally, [aggreCAT]{.pkg} contains three 'supplementary' datasets containing data collected externally to the repliCATS IDEA protocol: `data_supp_quiz`, `data_supp_priors`, and `data_supp_reasons`.

## Quiz Score Data {#sec-quiz-supplementary-data}

Prior to the workshop, participants were asked to complete an optional quiz on statistical concepts and meta-research, knowledge of which we thought would aid in reliably evaluating the replicability of research claims.

Quiz responses are contained in `data_supp_quiz` and are used to construct performance weights for the aggregation method `QuizWAgg`: each participant receives a `quiz_score` if they completed the quiz, and `NA` if they did not attempt it [see @Hanea2021 for further details]. Additional methods of scoring the quiz responses are also provided in `data_supp_quiz`.

```{r data_supp_quiz-sample, message=TRUE, results='hold'}
aggreCAT::data_supp_quiz
```

## Reasoning Data {#sec-reasonwagg-supplementary-data}

The `ReasonWAgg` aggregation type constructs performance weights from the number of unique reasons each participant gives in support of their best estimate $B_{i,c}$ for a given claim $c$; these reasoning data are contained in `data_supp_reasons`. Qualitative statements made by individuals during claim evaluation were recorded on the repliCATS platform [@Pearson2021] and coded as falling into one of 25 unique reasoning categories by the repliCATS Reasoning team [@Wintle:2021]. Reasoning categories include plausibility of the claim, effect size, sample size, presence of a power analysis, transparency of reporting, and journal reporting [@Hanea2021]. Within `data_supp_reasons`, each reasoning category that passed our inter-coder reliability threshold appears as a column whose name is prefixed with `RW`, and for each claim (`paper_id`), each participant (`user_id`) is assigned a `1` or `0` indicating whether they included that reasoning category in support of their best estimate for that claim. See `ReasoningWAgg()` for details on the `ReasonWAgg` aggregation method.

```{r data_supp_reasons-sample}
aggreCAT::data_supp_reasons %>%
  glimpse()
```

## Bayesian Prior Data {#sec-bayesian-supplementary-data}

The method `BayPRIORsAgg` (implemented in `BayesianWAgg()`) uses Bayesian updating: a prior probability that a claim will replicate, estimated from a predictive model [@Gould2021a], is updated using an aggregate of the best estimates of all participants assessing that claim $c$ [@Hanea2021]. The prior data are contained in `data_supp_priors`, where each claim (`paper_id`) is assigned a prior probability of replicating, on the logit scale, in the column `prior_means`.

```{r data_supp_priors-sample}
aggreCAT::data_supp_priors
```

# TODO

- [ ] `data_comments`
- [ ] `data_confidence_scores`
- [ ] `data_justifications`
- [ ] `data_outcomes`

# References