---
title: "How rfair works: methodology and architecture"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How rfair works: methodology and architecture}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(rfair)
```

This vignette describes what rfair measures and how, in enough detail to
interpret and reproduce its scores. For a quick tour see `vignette("rfair")`;
for the reuse/sensitivity extensions see `vignette("beyond-fuji")`.

## 1. Background: FAIR, the FAIRsFAIR metrics, and F-UJI

The **FAIR principles** (Wilkinson et al. 2016) state that research data should
be **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable. They are
aspirational; to assess a real data object you need *measurable* indicators.

The **FAIRsFAIR** project turned the principles into a concrete, testable metric
set, and the **F-UJI** tool (Devaraju & Huber, PANGAEA) implemented an automated
assessment service for them. F-UJI is a Python web service: you send it a
persistent identifier (PID) and it returns per-metric scores.

`rfair` is a **native R reimplementation** of the F-UJI metrics (version 0.8).
It performs the whole assessment in R, with no external server, so assessments
are scriptable, reproducible, and embeddable in R pipelines. The original
`rfair` package (v1) was only an HTTP client for an F-UJI server; this version
(v2) is the engine itself.

## 2. The assessment pipeline

A single call to `assess_fair()` runs this pipeline:

```
identifier
   │  id_parse()             scheme detection + normalization + resolver URL
   ▼
resolution                  content-negotiated GET, follow redirects -> landing page
   │  resolve_landing_page()
   ▼
harvesting                  a sequence of collectors, in priority order:
   │  collect_html_meta()    embedded JSON-LD (schema.org), Dublin Core,
   │                        OpenGraph, Highwire meta tags
   │  collect_signposting()  HTTP Link header + typed links
   │  collect_datacite()     DataCite JSON via content negotiation
   │  collect_xml()          DataCite XML, Dublin Core, MODS, EML, ISO19139
   │  collect_rdf()          JSON-LD (native) and Turtle/RDF-XML (via rdflib)
   │  collect_github()       GitHub repository + codemeta.json + CITATION.cff
   │  harvest_data()         HEAD on data links for MIME type and size
   ▼
mapping + merging           each source is mapped to one reference schema and
   │  merge_metadata()       merged (first-non-empty for scalars; union for
   │                        lists; longer-but-similar replacement)
   ▼
evaluation                  one evaluator per metric inspects the merged metadata
   │  run_evaluators()       and the resolved identifier, scoring each test
   ▼
scoring                     per-test scores -> per-metric -> F/A/I/R -> overall
   │  get_assessment_summary()
   ▼
fair_assessment             tidy S3 object (print / summary / as.data.frame /
                            as_fuji_json / as_rdf)
```

### Identifier handling

`id_parse()` recognizes DOIs, Handles, ARKs, URNs, UUIDs, `identifiers.org`
PIDs, w3id, and plain URLs, normalizes them, and constructs a resolver URL.
Persistence is inferred from the scheme.

```{r}
id_parse("https://doi.org/10.5281/zenodo.8347772")[
  c("preferred_schema", "is_persistent", "identifier_url")]
```

### Harvesting and content negotiation

Different repositories expose metadata in different ways.
rfair asks for several representations of the same object via HTTP **content
negotiation** (the `Accept` header) and scrapes the landing page, then
**merges** everything into a single reference schema (~30 elements: `creator`,
`title`, `publisher`, `publication_date`, `license`, `access_level`,
`object_content_identifier`, `related_resources`, ...). When two sources
disagree, scalars keep the first non-empty value (replaced only by a longer,
sufficiently similar string), and list-valued elements are unioned.

### The metric model

Metrics are data-driven: their definitions, tests, scores, and maturity levels
come from the bundled FAIRsFAIR YAML, not from hard-coded R logic.

```{r}
rfair_metric_versions()  # bundled metric versions
# v0.8 has 17 metrics across F/A/I/R (one row each):
nrow(as.data.frame(assess_fair("https://doi.org/10.5281/zenodo.8347772",
                               resolve = FALSE)))
```

Each metric has one or more **tests**. A test contributes a *score* and a
*maturity* level (a CMMI level 0–3: incomplete, initial, moderate, advanced)
when it passes. Metrics use one of two scoring mechanisms:

* **cumulative**: the scores of passed tests add up;
* **alternative**: tests are alternative routes to the same points (the earned
  score is capped at the metric total).

The criterium engine (`criterium_engine.R`) builds each metric's result from the
YAML and lets evaluators mark tests as passed; `as_fuji_json()` then emits a
payload matching the upstream F-UJI `FAIRResults` schema.

## 3. What each FAIR category measures (v0.8)

| Category | Metric | What rfair checks |
|---|---|---|
| **F** | F1-01MD | identifier follows a unique scheme (URI/URN/UUID/HASH/PID) |
| | F1-02MD | identifier is persistent and registered (resolves) |
| | F2-01M | core descriptive metadata present (creator, title, id, date, publisher, type, summary, keywords) |
| | F3-01M | metadata links to the downloadable data content |
| | F4-01M | metadata offered in a search-engine-ingestible way (embedded JSON-LD / meta tags) |
| **A** | A1-01M | access level / rights are stated in metadata |
| | A1-02MD | metadata and data are retrievable via their identifiers |
| | A1.1-01MD | identifiers use a standardized communication protocol (http/https/ftp) |
| | A1.2-01MD | the protocol supports authentication where needed |
| **I** | I1-01M | metadata uses a formal, machine-readable representation (JSON-LD/RDF/XML) |
| | I2-01M | metadata uses terms from registered semantic vocabularies |
| | I3-01M | qualified references to related entities (with relation types) |
| **R** | R1-01M | metadata describes the data content (type, format/size) |
| | R1.1-01M | a machine-readable license is present and SPDX/CC-recognized |
| | R1.2-01M | provenance information (creators, dates, contributors) |
| | R1.3-01M | a community-/discipline-endorsed metadata standard is used |
| | R1.3-02D | data is in a recommended (scientific/open/long-term) file format |

The score for a category is the sum of earned points over the sum of total
points across its metrics; the overall FAIR score is the same ratio taken across
all 17 metrics, and the overall maturity is the (clamped) mean of the
per-category maturities.

```{r}
# the canonical principle definitions these metrics map to
fair_principles("I")[, c("id", "definition")]
```

## 4. Software FAIR (FRSM)

For software objects, rfair also bundles the FRSM (FAIR for Research Software)
metric set; select it with `metric_version = "0.7_software"`.
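
As an illustration, selecting the FRSM set might look like the sketch below. It assumes that `metric_version` is passed as an argument to `assess_fair()` (the text above names the value but not the call), and the repository URL is a placeholder. The chunk is not evaluated because it would query GitHub.

```{r, eval = FALSE}
# Assumption: metric_version is an argument of assess_fair().
# The URL is a placeholder, not a real repository.
sw <- assess_fair("https://github.com/<owner>/<repo>",
                  metric_version = "0.7_software")
as.data.frame(sw)  # per-metric results as a data frame
```
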
The GitHub harvester inspects the repository file tree for signals (a license
file, tests, CI workflows, dependency manifests, a registry DOI, a release
version, contributors), and the 17 FRSM evaluators score from them. FRSM scoring
is heuristic and not yet validated against an upstream software-FAIR reference.

## 5. Fidelity to F-UJI

Because rfair reimplements an existing scoring engine, it includes a non-CRAN
conformance harness. `tests/conformance/run.R` runs identifiers through both
rfair and a locally run, version-matched F-UJI server and compares per-metric
earned scores. A manual run on 2026-06-16 against F-UJI 4.0.0 (metrics v0.8)
measured **94.1% agreement on a Zenodo DOI (16/17 metrics exact)** and
**85.3%** across PANGAEA and Dryad; the consistent divergence was the data
file-format metric (F-UJI uses Tika content detection where rfair uses an HTTP
HEAD). This reference-server comparison is not yet reproduced in CI.

A separate harness (`tests/conformance/parity.R`) compares the R engine with the
browser TypeScript engine on registry-derivable metrics once the `webapp` branch
is checked out alongside the package.

## 6. Beyond F-UJI

rfair adds checks that automated FAIR tools usually miss, motivated by peer
review of a COVID-19 FAIR study: license *reusability* (not just presence) using
the (Re)usable Data Project taxonomy, controlled-access/sensitive-data flagging,
identifier hygiene, and the **FAIR-TLC** (Traceable, Licensed, Connected)
extension. See `vignette("beyond-fuji")`.

## 7. Limitations

* The browser app is registry-only because of CORS restrictions: it cannot
  harvest landing pages, so some metrics score lower than in the R engine.
* I2-01M (semantic vocabularies) scores 0 for objects whose metadata uses only
  default namespaces (dc/schema.org/DataCite); this matches F-UJI.
* RDF Turtle/RDF-XML harvesting and `as_rdf()` Turtle output need the optional
  `rdflib` package (system `librdf`); without it those paths are skipped.
* Live scores depend on the object's current metadata and on third-party
  services (DataCite, Crossref, GitHub) being reachable.

## References

* Wilkinson et al. (2016). The FAIR Guiding Principles. *Sci Data*. <https://doi.org/10.1038/sdata.2016.18>
* Devaraju & Huber. F-UJI.
* FAIRsFAIR metrics. <https://doi.org/10.5281/zenodo.15045911>
* Carbon et al. (2019). (Re)usable data licensing. *PLOS ONE*. <https://doi.org/10.1371/journal.pone.0213090>