---
title: "How rfair works: methodology and architecture"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How rfair works: methodology and architecture}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(rfair)
```
This vignette describes what rfair measures and how, in enough detail to
interpret and reproduce its scores. For a quick tour see
`vignette("rfair")`; for the reuse/sensitivity extensions see
`vignette("beyond-fuji")`.
## 1. Background: FAIR, the FAIRsFAIR metrics, and F-UJI
The **FAIR principles** (Wilkinson et al. 2016) state that research data should
be **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable. They are
aspirational; to assess a real data object you need *measurable* indicators.
The **FAIRsFAIR** project turned the principles into a concrete, testable metric
set, and the **F-UJI** tool (Devaraju & Huber, PANGAEA) implemented an automated
assessment service for them. F-UJI is a Python web service: you send it a
persistent identifier (PID) and it returns per-metric scores.
`rfair` is a **native R reimplementation** of the F-UJI metrics (version 0.8).
It performs the whole assessment in R, with no external server, so assessments
are scriptable, reproducible, and embeddable in R pipelines. The original
`rfair` package (v1) was only an HTTP client for an F-UJI server; this version
(v2) is the engine itself.
## 2. The assessment pipeline
A single call to `assess_fair()` runs this pipeline:
```
identifier
  │  id_parse()                 scheme detection + normalization + resolver URL
  ▼
resolution                      content-negotiated GET, follow redirects -> landing page
  │  resolve_landing_page()
  ▼
harvesting                      a sequence of collectors, in priority order:
  │  collect_html_meta()        embedded JSON-LD (schema.org), Dublin Core,
  │                             OpenGraph, Highwire meta tags
  │  collect_signposting()      HTTP Link header + typed links
  │  collect_datacite()         DataCite JSON via content negotiation
  │  collect_xml()              DataCite XML, Dublin Core, MODS, EML, ISO19139
  │  collect_rdf()              JSON-LD (native) and Turtle/RDF-XML (via rdflib)
  │  collect_github()           GitHub repository + codemeta.json + CITATION.cff
  │  harvest_data()             HEAD on data links for MIME type and size
  ▼
mapping + merging               each source is mapped to one reference schema and
  │  merge_metadata()           merged (first-non-empty for scalars; union for
  │                             lists; longer-but-similar replacement)
  ▼
evaluation                      one evaluator per metric inspects the merged metadata
  │  run_evaluators()           and the resolved identifier, scoring each test
  ▼
scoring                         per-test scores -> per-metric -> F/A/I/R -> overall
  │  get_assessment_summary()
  ▼
fair_assessment                 tidy S3 object (print / summary / as.data.frame /
                                as_fuji_json / as_rdf)
```
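
The harvesting stage can be pictured as a loop over collector functions run in a fixed order. The sketch below is illustrative only: `run_collectors_sketch()` and the toy collectors are hypothetical names, not rfair functions, and skipping a failing collector instead of aborting is an assumption about the design, not documented behaviour.

```r
# Hypothetical sketch: run collectors in priority order and keep what succeeds.
run_collectors_sketch <- function(collectors, landing) {
  harvested <- list()
  for (name in names(collectors)) {
    result <- tryCatch(
      collectors[[name]](landing),
      error = function(e) NULL  # assumption: a failing collector is skipped
    )
    if (length(result) > 0) harvested[[name]] <- result
  }
  harvested
}

# Toy collectors standing in for the real ones (names and values are made up).
toy_collectors <- list(
  html_meta = function(x) list(title = "Example dataset"),
  datacite  = function(x) stop("service unreachable"),
  xml       = function(x) list(publisher = "Example Repository")
)

harvested <- run_collectors_sketch(toy_collectors, landing = "https://example.org/record/1")
names(harvested)
```

Keeping each source's output under its own name preserves the priority order for the later merge step.
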
### Identifier handling
`id_parse()` recognizes DOIs, Handles, ARKs, URNs, UUIDs, `identifiers.org`
PIDs, w3id, and plain URLs, normalizes them, and constructs a resolver URL.
Persistence is inferred from the scheme.
```{r}
id_parse("https://doi.org/10.5281/zenodo.8347772")[c("preferred_schema", "is_persistent", "identifier_url")]
```
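
To make the idea of scheme detection concrete, here is a minimal base R sketch covering only DOIs and plain URLs. `detect_scheme_sketch()` is a hypothetical helper, its regular expressions are simplified assumptions, and its return fields are not the ones `id_parse()` returns.

```r
# Hypothetical sketch of scheme detection; patterns are simplified assumptions.
detect_scheme_sketch <- function(id) {
  id <- trimws(id)
  # Strip a DOI resolver prefix or a "doi:" label so only the bare DOI remains.
  bare <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", id, ignore.case = TRUE)
  if (grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", bare)) {
    # A DOI is a registered PID, so persistence follows from the scheme.
    return(list(scheme = "doi", is_persistent = TRUE,
                url = paste0("https://doi.org/", bare)))
  }
  if (grepl("^https?://", id, ignore.case = TRUE)) {
    # A plain URL resolves as-is but carries no persistence guarantee.
    return(list(scheme = "url", is_persistent = FALSE, url = id))
  }
  list(scheme = "unknown", is_persistent = FALSE, url = NA_character_)
}

detect_scheme_sketch("doi:10.5281/zenodo.8347772")$url
detect_scheme_sketch("https://example.org/dataset/42")$scheme
```

Handles, ARKs, URNs, UUIDs and the other schemes listed above would each need their own pattern and resolver rule; they are omitted here for brevity.
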
### Harvesting and content negotiation
Different repositories expose metadata in different ways. rfair asks for several
representations of the same object via HTTP **content negotiation** (the `Accept`
header) and scrapes the landing page, then **merges** everything into a single
reference schema (~30 elements: `creator`, `title`, `publisher`,
`publication_date`, `license`, `access_level`, `object_content_identifier`,
`related_resources`, ...). When two sources disagree, scalars keep the first
non-empty value (replaced only by a longer, sufficiently similar string), and
list-valued elements are unioned.
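
The merge rule can be sketched in base R as follows. The helper names are hypothetical, the similarity measure (normalised edit distance from `utils::adist()`) and the 0.8 threshold are assumptions chosen for illustration, and the set of list-valued fields is shortened to two.

```r
# Hypothetical sketch of the scalar rule: first non-empty value wins, unless a
# later value is longer and similar enough (threshold is an assumption).
merge_scalar_sketch <- function(current, candidate, min_similarity = 0.8) {
  if (is.null(candidate) || is.na(candidate) || !nzchar(candidate)) return(current)
  if (is.null(current) || is.na(current) || !nzchar(current)) return(candidate)
  # Normalised edit-distance similarity in [0, 1].
  distance <- utils::adist(current, candidate)[1, 1]
  similarity <- 1 - distance / max(nchar(current), nchar(candidate))
  if (nchar(candidate) > nchar(current) && similarity >= min_similarity) candidate else current
}

# Merge sources in priority order: union for list fields, scalar rule otherwise.
merge_metadata_sketch <- function(sources, list_fields = c("creator", "keywords")) {
  merged <- list()
  for (src in sources) {
    for (field in names(src)) {
      if (field %in% list_fields) {
        merged[[field]] <- union(merged[[field]], src[[field]])
      } else {
        merged[[field]] <- merge_scalar_sketch(merged[[field]], src[[field]])
      }
    }
  }
  merged
}

# Made-up example records from two sources.
sources <- list(
  html     = list(title = "Ocean temperature record", creator = "Doe, J.", license = ""),
  datacite = list(title = "Ocean temperature records",
                  creator = c("Doe, J.", "Roe, A."), license = "CC-BY-4.0")
)
merged <- merge_metadata_sketch(sources)
str(merged)
```

In this example the longer, near-identical DataCite title replaces the HTML one, the empty HTML license is filled from DataCite, and the creator lists are unioned.
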
### The metric model
Metrics are data-driven: their definitions, tests, scores, and maturity levels
come from the bundled FAIRsFAIR YAML, not from hard-coded R logic.
```{r}
rfair_metric_versions() # bundled metric versions
# v0.8 has 17 metrics across F/A/I/R (one row each):
nrow(as.data.frame(assess_fair("https://doi.org/10.5281/zenodo.8347772", resolve = FALSE)))
```
Each metric has one or more **tests**. A test contributes a *score* and a
*maturity* level (a CMMI level 0–3: incomplete, initial, moderate, advanced)
when it passes. Metrics use one of two scoring mechanisms:
* **cumulative**: passed tests' scores add up;
* **alternative**: tests are alternative routes to the same points (the earned
score is capped at the metric total).
The criterium engine (`criterium_engine.R`) builds each metric's result from the
YAML and lets evaluators mark tests passed; `as_fuji_json()` then emits a payload
matching the upstream F-UJI `FAIRResults` schema.
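
The two scoring mechanisms can be illustrated with a small function. This is a sketch under stated assumptions: `score_metric_sketch()` and its arguments are hypothetical, and the real engine reads test scores and totals from the bundled YAML and does not take them as arguments.

```r
# Hypothetical sketch of the two scoring mechanisms described above.
score_metric_sketch <- function(test_scores, passed, total,
                                mechanism = c("cumulative", "alternative")) {
  mechanism <- match.arg(mechanism)
  earned <- sum(test_scores[passed])
  switch(mechanism,
    cumulative  = earned,             # passed tests' scores add up
    alternative = min(earned, total)  # several routes, capped at the metric total
  )
}

# Cumulative metric worth 2 points: one of two tests passed.
score_metric_sketch(c(1, 1), passed = c(TRUE, FALSE), total = 2, mechanism = "cumulative")

# Alternative metric worth 1 point: two of three routes passed, still 1 point.
score_metric_sketch(c(1, 1, 1), passed = c(TRUE, TRUE, FALSE), total = 1, mechanism = "alternative")
```
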
## 3. What each FAIR category measures (v0.8)
| category | metric | what rfair checks |
|---|---|---|
| **F** | F1-01MD | identifier follows a unique scheme (URI/URN/UUID/HASH/PID) |
| | F1-02MD | identifier is persistent and registered (resolves) |
| | F2-01M | core descriptive metadata present (creator, title, id, date, publisher, type, summary, keywords) |
| | F3-01M | metadata links to the downloadable data content |
| | F4-01M | metadata offered in a search-engine-ingestible way (embedded JSON-LD / meta tags) |
| **A** | A1-01M | access level / rights are stated in metadata |
| | A1-02MD | metadata and data are retrievable via their identifiers |
| | A1.1-01MD | identifiers use a standardized communication protocol (http/https/ftp) |
| | A1.2-01MD | the protocol supports authentication where needed |
| **I** | I1-01M | metadata uses a formal, machine-readable representation (JSON-LD/RDF/XML) |
| | I2-01M | metadata uses terms from registered semantic vocabularies |
| | I3-01M | qualified references to related entities (with relation types) |
| **R** | R1-01M | metadata describes the data content (type, format/size) |
| | R1.1-01M | a machine-readable license is present and SPDX/CC-recognized |
| | R1.2-01M | provenance information (creators, dates, contributors) |
| | R1.3-01M | a community-/discipline-endorsed metadata standard is used |
| | R1.3-02D | data is in a recommended (scientific/open/long-term) file format |
A category score is the points earned divided by the points available, each
summed across that category's metrics; the overall FAIR score is the same ratio
taken across all 17 metrics, and the overall maturity is the (clamped) mean of
the per-category maturities.
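
As a worked illustration of this aggregation, the sketch below uses made-up per-metric results. The column names, the rounding of the mean, and the clamp to the 0 to 3 range are assumptions for the example; they are not taken from `get_assessment_summary()`.

```r
# Made-up per-metric results (not real assessment output).
results <- data.frame(
  category = c("F", "F", "A", "I", "R"),
  earned   = c(2, 1, 1, 0, 3),
  total    = c(2, 2, 1, 4, 4)
)

# Category score: earned points over available points, summed per category.
category_summary <- aggregate(cbind(earned, total) ~ category, data = results, FUN = sum)
category_summary$percent <- round(100 * category_summary$earned / category_summary$total, 1)

# Overall score: the same ratio across all metrics.
overall_percent <- round(100 * sum(results$earned) / sum(results$total), 1)

# Overall maturity: mean of per-category maturities, rounded and clamped to 0..3
# (both the rounding and the clamp range are assumptions).
category_maturity <- c(F = 3, A = 2, I = 0, R = 2)
overall_maturity <- min(3, max(0, round(mean(category_maturity))))

category_summary
overall_percent
overall_maturity
```
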
```{r}
# the canonical principle definitions these metrics map to
fair_principles("I")[, c("id", "definition")]
```
## 4. Software FAIR (FRSM)
For software objects, rfair also bundles the FRSM (FAIR for Research Software)
metric set; select it with `metric_version = "0.7_software"`. The GitHub
harvester inspects the repository file tree for signals (a license file, tests,
CI workflows, dependency manifests, a registry DOI, a release version,
contributors) and the 17 FRSM evaluators score from them. FRSM scoring is
heuristic and not yet validated against an upstream software-FAIR reference.
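
File-tree signals of this kind can be detected with simple pattern matching over repository paths. The sketch below is a hypothetical illustration: `detect_repo_signals_sketch()` and its patterns are assumptions, not the rules the GitHub harvester applies. Signals that come from repository metadata and not from file paths (registry DOI, release version, contributors) are left out.

```r
# Hypothetical sketch: derive boolean signals from a vector of repository paths.
detect_repo_signals_sketch <- function(paths) {
  has <- function(pattern) any(grepl(pattern, paths, ignore.case = TRUE))
  c(
    license      = has("^(LICENSE|LICENCE|COPYING)(\\.[a-z]+)?$"),
    tests        = has("^tests?/"),
    ci_workflow  = has("^\\.github/workflows/.+\\.ya?ml$"),
    dependencies = has("^(DESCRIPTION|requirements\\.txt|pyproject\\.toml|package\\.json)$"),
    citation     = has("^(CITATION\\.cff|codemeta\\.json)$")
  )
}

# Made-up file listing for an R package repository.
paths <- c("DESCRIPTION", "LICENSE.md", "R/assess.R",
           "tests/testthat/test-assess.R", ".github/workflows/check.yaml")
signals <- detect_repo_signals_sketch(paths)
signals
```
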
## 5. Fidelity to F-UJI
Because rfair reimplements an existing scoring engine, it includes a
non-CRAN conformance harness. `tests/conformance/run.R` runs identifiers through
both rfair and a locally run, version-matched F-UJI server and compares
per-metric earned scores. A manual run on 2026-06-16 against F-UJI 4.0.0
(metrics v0.8) measured **94.1% agreement on a Zenodo DOI (16/17 metrics
exact)** and **85.3%** across PANGAEA and Dryad; the consistent divergence was
the data file-format metric (F-UJI uses Tika content detection, whereas rfair
uses an HTTP HEAD request). This reference-server comparison is not yet
reproduced in CI. A separate harness (`tests/conformance/parity.R`) compares the
R engine with the browser TypeScript engine on registry-derivable metrics once
the `webapp` branch is checked out alongside the package.
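
The per-metric comparison amounts to joining two score tables and counting exact matches. The sketch below is not the code in `tests/conformance/run.R`: `compare_scores_sketch()` and the column names are hypothetical, and the scores are toy values built to show how 16 exact matches out of 17 yields 94.1%.

```r
# Hypothetical sketch: compare per-metric earned scores from two engines.
compare_scores_sketch <- function(rfair_scores, reference_scores) {
  both <- merge(rfair_scores, reference_scores, by = "metric",
                suffixes = c("_rfair", "_ref"))
  both$exact <- both$earned_rfair == both$earned_ref
  list(
    agreement_percent = round(100 * mean(both$exact), 1),
    divergent = as.character(both$metric[!both$exact])
  )
}

metrics <- c("F1-01MD", "F1-02MD", "F2-01M", "F3-01M", "F4-01M",
             "A1-01M", "A1-02MD", "A1.1-01MD", "A1.2-01MD",
             "I1-01M", "I2-01M", "I3-01M",
             "R1-01M", "R1.1-01M", "R1.2-01M", "R1.3-01M", "R1.3-02D")

# Toy scores: identical everywhere except the file-format metric.
rfair_scores     <- data.frame(metric = metrics, earned = 1)
reference_scores <- data.frame(metric = metrics, earned = c(rep(1, 16), 0))

comparison <- compare_scores_sketch(rfair_scores, reference_scores)
comparison$agreement_percent
comparison$divergent
```
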
## 6. Beyond F-UJI
rfair adds checks that automated FAIR tools usually miss, motivated by peer
review of a COVID-19 FAIR study: license *reusability* (not just presence) with
the (Re)usable Data Project taxonomy, controlled-access/sensitive-data flagging,
identifier hygiene, and the **FAIR-TLC** (Traceable, Licensed, Connected)
extension. See `vignette("beyond-fuji")`.
## 7. Limitations
* The browser app is registry-only because of browser CORS restrictions: it
cannot harvest landing pages, so some metrics score lower than in the R engine.
* I2-01M (semantic vocabularies) scores 0 for objects whose metadata uses only
default namespaces (dc/schema.org/DataCite); this matches F-UJI.
* RDF Turtle/RDF-XML harvesting and `as_rdf()` Turtle output need the optional
`rdflib` package (system `librdf`); without it those paths are skipped.
* Live scores depend on the object's current metadata and on third-party
services (DataCite, Crossref, GitHub) being reachable.
## References
* Wilkinson et al. (2016). The FAIR Guiding Principles. *Sci Data*. <https://doi.org/10.1038/sdata.2016.18>
* Devaraju & Huber. F-UJI.
* FAIRsFAIR metrics. <https://doi.org/10.5281/zenodo.15045911>
* Carbon et al. (2019). (Re)usable data licensing. *PLOS ONE*. <https://doi.org/10.1371/journal.pone.0213090>