--- title: "From R to RDF" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{From R to RDF} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setupvignette, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) if (!requireNamespace("rdflib", quietly = TRUE)) { stop("Please install 'rdflib' to run this vignette.") } if (!requireNamespace("jsonld", quietly = TRUE)) { stop("Please install 'jsonld' to run this vignette.") } ``` ```{r setup} library(dataset) library(rdflib) library(jsonld) ``` The **RDF** (Resource Description Framework) annotation significantly enhances the interoperability and exchangeability of datasets in data repositories by leveraging a standardised, machine-readable format for describing and linking data. This vignette shows how to leverage the capabilities of the _dataset_ package with [rdflib](https://docs.ropensci.org/rdflib/index.html), an R-user-friendly wrapper on rOpenSci to work with the _redland_ Python library for performing common tasks on rdf data, such as parsing and converting between formats including rdfxml, turtle, nquads, ntriples, and trig, creating rdf graphs, and performing SPARQL queries. ## Standardised Semantic Framework RDF provides a common framework to describe resources and their relationships using triples (subject-predicate-object). This standardisation ensures that data from different systems can be understood in a unified way, regardless of the original source or format. Notice that this format is a stricter version of the tidy dataset concept, where not only on every observation is in a row, but there are always strictly three columns. ```{r orangedf} data(orange_df) orange_df[1, ] ``` Instead of placing the relevant measurement of an observed flower into the intersection of columns and rows, in the triple format we put them next to each other: - the first tree's age is 118 days. - the first tree's circumference is 30 mm. ```{r triples} dataset_to_triples(orange_df[1:2, ])[1:3, ] ``` We describe the `dataset_df` datasets in such triplets, where each triplet is a semantic statement: it connects a single observation unit with a single measurement. ## Enhanced Interoperability RDF uses globally unique identifiers (URIs) for resources, ensuring that different datasets can reference the same entities unambiguously. This allows seamless data integration and querying across repositories, even if the datasets come from diverse domains. Our `defined` class supports this enhanced interoperability. In the example below, an application can look up that the numeric values in your table conform the statistical definition of GDP, and they are expressed in millions of dollars; meaning that you have to multiply them by 1000 if you want to join them with different data expressed in thousands of dollars. ```{r defined} gdp_vector <- defined( c(3897, 7365, 6753), label = "Gross Domestic Product", unit = "https://rdf.vegdata.no/V440/v440-doc/v440-brudata-owl-doc/unit_MillionUSD.html", concept = "http://data.europa.eu/83i/aa/GDP" ) ``` There are several ways to add permanent identifiers to observational units, variable definitions, and specific observed values. The simplest (but certainly not the easiest to read for a human eye) standard format for writing them into a plain text file that you can share online is the [RDF 1.1 N-Triples](https://www.w3.org/TR/n-triples/) format.The NTriple format creates URIs (similarly formatted as URLs) for the definitions that can be looked up in an online resource. This can be combined with literal strings that may also include information if they should be read back to a system as strings, doubles, integers, dates or date-time variables. ```{r ntriples} n_triple( s = "https://doi.org/10.5281/zenodo.10396807", # permanent, global ID of the dataset p = "http://purl.org/dc/terms/description", # library definition of 'description' o = "The famous (Fisher's or Anderson's) iris data set." ) # literal string ``` ## Richer metaadata RDF supports linking datasets through shared URIs, enabling the creation of interconnected knowledge graphs. Linked Data principles help relate datasets in meaningful ways, making it easier to discover, navigate, and integrate information. RDF annotations allow datasets to include detailed metadata about their structure, provenance, usage rights, and content. This metadata provides critical context, enabling automated tools to interpret and process the data effectively. Most scientific researchers are familiar with data *findability*, *accessibility*, *interoperability*, and *reuse*. Your dataset's properties will significantly improve if you add standard metadata used by libraries globally (according to the Dublin Core standards) or the DataCite data repository standards. Such standards use globally shared definitions on how a title or a subtitle should be added to your dataset or how you can add with IRIs keywords that any user interprets the same way in the world, even if they do not speak English or your language. RDF supports the use of ontologies and controlled vocabularies (e.g., DataCite, Dublin Core, Schema.org), allowing datasets to be described consistently within and across domains. The `as_dublincore` function allows the export of your dataset's data in the Dublin Core format, and `as_datacite` in the DataCite format. Some of the metadata are generated behind the scenes, for example, timestamps or size measurements. ```{r bibliography} as_dublincore(iris_dataset, type = "ntriples") ``` Interoperability and reusability can further increase if the next user can trust your dataset, and has to perform less checks on it; or the next user can reproduce what you did. Data provenance is the metadata that provides a comprehensive record of the origins, history, and transformations of data throughout its lifecycle. Our `provenance` functions records some of this data automatically, and allow you to add more information, for example, about your data sources, the R packages used, the persons involved in the creation and review process, or the statistical transformations carried out. ```{r prov} provenance(iris_dataset) ``` ## Adding your dataset into an RDF triplestore RDF data can be stored in triple stores and queried using SPARQL, a powerful query language. This makes it easier to retrieve specific subsets of data or infer new information based on existing annotations ```{r rdf} # initialise an rdf triplestore: dataset_describe <- rdf() # open a temporary file: temp_prov <- tempfile() # describe the dataset in temporary file: describe(iris_dataset, temp_prov) # parse temporary file into the RDF triplestore; rdf_parse(rdf = dataset_describe, doc = temp_prov, format = "ntriples") # show RDF triples: dataset_describe ``` By using RDF, datasets can be exchanged as interoperable graphs (e.g., in formats like RDF/XML, Turtle, or JSON-LD). ```{r jsonld} options(rdf_print_format = "jsonld") dataset_describe ``` ## Make the entire dataset interoperable Eventually you can make the entire dataset interoperable, with making every observation, every statement independent of R, your computer, your OS, and to a large extent the natural language that you use. _This will be further developed until we can express in a semantically correct way an entire dataset automatically_ ```{r} n_triples(dataset_to_triples(iris[1:4, ])) ```