\documentclass[a4paper,fleqn]{article}
\usepackage[round,longnamesfirst]{natbib}
\usepackage{graphicx,keyval,url}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{Sweave}
\usepackage{a4wide}

% \sloppy{}

% \newcommand{\XML}{\textsc{xml}}
% If using acronym markup, include HTTP (and OAI-PMH?).
\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\let\code=\texttt
\newcommand{\ePub}{ePub$^\textrm{\tiny WU}$}

% \RecustomVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl, fontsize=\small}
\RecustomVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl}
% \RecustomVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small}

\title{Metadata Harvesting with R and OAI-PMH}
\author{Kurt Hornik}
\date{2023-01-31}
%% \VignetteIndexEntry{Metadata Harvesting with R and OAI-PMH}

\begin{document}

\maketitle{}

The Open Archives Initiative (\url{https://www.openarchives.org/}) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.
One key project is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, \url{https://www.openarchives.org/pmh/}) which provides ``a low-barrier mechanism for repository interoperability'' for archives (institutional repositories) containing digital content (digital libraries).
OAI-PMH allows people (service providers, such as the ones registered with the OAI listed on \url{https://www.openarchives.org/service/listproviders.html}) to harvest metadata (from data providers, such as the ones registered with and validated by the OAI listed on \url{https://www.openarchives.org/Register/BrowseSites/}).
Data Providers administer systems that support the OAI-PMH as a means of exposing metadata.
Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added services.
OAI-PMH, currently in version 2.0, defines a mechanism for data providers to expose their metadata.
The protocol mandates that individual archives map their metadata to the Dublin Core (DC, \url{https://dublincore.org/}), a simple and common metadata set for cross-domain information resource description.
OAI-PMH is a set of six \emph{verbs} or services that are invoked within HTTP, returning the request results in XML format.

The OAI-PMH specification can be found at \url{https://www.openarchives.org/OAI/openarchivesprotocol.html}.
Here, we summarize the basic facts and terminology.
% https://en.wikipedia.org/wiki/Open_Archives_Initiative_Protocol_for_Metadata_Harvesting

A harvester is a client application that issues OAI-PMH requests.
A harvester is operated by a service provider as a means of collecting metadata from repositories.
Repositories are network accessible servers that can process the six OAI-PMH requests, and are managed by a data provider to expose metadata to harvesters.

OAI-PMH distinguishes between three distinct entities related to the metadata made accessible by the OAI-PMH:
\begin{description}
\item[\emph{resource}] the object or ``stuff'' that metadata is ``about''. Its nature is outside the scope of the OAI-PMH.
\item[\emph{item}] a constituent of a repository from which metadata about a resource can be disseminated.
\item[\emph{record}] metadata in a specific metadata format. A record is returned as an XML-encoded byte stream in response to a protocol request to disseminate a specific metadata format from a constituent item.
\end{description}
For each item there is an unambiguous unique identifier which is used in OAI-PMH requests for extracting metadata from the item.
Items may contain metadata in multiple formats; Dublin Core is mandatory.

Selective harvesting allows harvesters to limit harvest requests to portions of the metadata available from a repository.
The OAI-PMH supports selective harvesting with two types of harvesting criteria that may be combined in an OAI-PMH request: datestamps and membership in sets, an optional construct for grouping items. The XML encoding of records is organized into the following parts: \begin{description} \item[\texttt{header}] contains the unique identifier, a datestamp (the date of creation, modification or deletion of the record), zero or more setSpec elements indicating the set membership of the item, and an optional status attribute for indicating the withdrawal of availability of the specified metadata format for the item, dependent on the repository support for deletions. \item[\texttt{metadata}] a single manifestation (format) of the metadata from an item. \item[\texttt{about}] an optional and repeatable container to hold data about the metadata part of the record. Contents of the containers must conform to an XML Schema. Common uses of these containers include rights statements and provenance statements. \end{description} The OAI-PMH verbs (requests) are as follows. \begin{description} \item[\texttt{GetRecord}] retrieve an individual metadata record from a repository. \item[\texttt{Identify}] retrieve information about a repository. \item[\texttt{ListIdentifiers}] an abbreviated form of \texttt{ListRecords} which retrieves only headers rather than records. \item[\texttt{ListMetadataFormats}] retrieve the metadata formats available from a repository, or optionally the formats available for a specific item. \item[\texttt{ListRecords}] harvest records from a repository. \item[\texttt{ListSets}] retrieve the set structure of a repository. \end{description} Optional arguments to \texttt{ListRecords} and \texttt{ListIdentifiers} permit selective harvesting of headers based on set membership and/or datestamp. 
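
As a rough illustration of how the verbs and the selective harvesting arguments above map onto HTTP requests, the following sketch builds request URLs in base R.
The helper \verb|oai_request_url| and the base URL \verb|https://example.org/oai| are hypothetical and not part of any package; the argument names (\verb|verb|, \verb|metadataPrefix|, \verb|from|, \verb|until|, \verb|set|) are those defined by the OAI-PMH specification.
<<>>=
## Hypothetical helper (not part of OAIHarvester): build the URL of an
## OAI-PMH GET request from a base URL, a verb and optional arguments.
oai_request_url <- function(baseurl, verb, ...) {
    args <- c(list(verb = verb), list(...))
    ## Arguments given as NULL are treated as "not specified".
    args <- args[!vapply(args, is.null, NA)]
    vals <- vapply(args,
                   function(a)
                       utils::URLencode(as.character(a), reserved = TRUE),
                   "")
    paste0(baseurl, "?", paste0(names(args), "=", vals, collapse = "&"))
}
## Placeholder repository URL, for illustration only.
oai_request_url("https://example.org/oai", "Identify")
## Selective harvesting of headers by datestamp.
oai_request_url("https://example.org/oai", "ListIdentifiers",
                metadataPrefix = "oai_dc",
                from = "2005-01-01", until = "2005-12-31")
@
This only assembles the request; issuing it and parsing the XML response is what the package functions described below take care of.
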
Some of these requests return a \emph{list} of discrete entities: \texttt{ListRecords} returns a list of records, \texttt{ListIdentifiers} returns a list of headers, and \texttt{ListSets} returns a list of sets. These lists may be large, and it may be practical to partition them among a series of requests and responses. Repositories may reply with incomplete results and a resumption token, which the harvester can use to issue an additional request (and repeat until completion).

The R package \pkg{OAIHarvester} provides functions for performing each of the six OAI-PMH requests, using, respectively, packages \pkg{curl} \citep{oaih:Ooms:2023} and \pkg{xml2} \citep{oaih:Wickham+Hester+Ooms:2021} for HTTP and XML processing. List requests will automatically be re-issued until complete results are obtained. The names of these verb functions start with `\verb|oaih|' and follow a ``combine words with underscores'' scheme (e.g., \verb|oaih_list_records|, corresponding to the OAI-PMH \verb|ListRecords| verb, for harvesting records). The functions return the actual (aggregated) \emph{result} of the repository's response to the harvester's request. In addition to these functions for performing OAI-PMH requests, function \verb|oaih_harvest| is a high-level harvester which allows specifying several metadata formats or sets, and giving datestamps as Date or POSIXt date/time objects. Finally, function \verb|oaih_transform| provides functionality for transforming the XML results to ``useful'' R data structures for further processing or analysis. The results of the verb requests are transformed by default. The ideas underlying these transformations are best illustrated for harvesting records. In a list context, the result is a list of records, each containing the header (with identifier and datestamp and arbitrarily many set specs), metadata in a certain format, and arbitrarily many about entries.
Conceptually, we can think of identifier, datestamp, setSpec, metadata and about as \emph{variables} ``observed'' for the items in the repository as cases, suggesting the usual rectangular case by variables data organization. When obtaining a single record, it seems natural to transform to a list with these variables. If the rectangular data structure were a data frame, selecting one row (corresponding to a single record) would not straightforwardly yield the single record list transformation (because subscripting list variables in the data frame would give length one sublists rather than the elements). Thus, in the rectangular case we instead treat rows and columns symmetrically by arranging data in a ``list matrix'' (a list with a dim attribute, or equivalently, a matrix of list elements). As matrix subscripting drops dimensions when a single row or column is selected, one gets the expected simple list (without a dim attribute) in these cases. (Equivalently, the transformed \verb|oaih_list_records| result is the same as combining the transformed \verb|oaih_get_record| results by rows using \verb|rbind|.) When harvesting records, identifiers and datestamps naturally transform to character strings, and set specs (a header may contain arbitrarily many of these) to character vectors. On the other hand, metadata can be made available in different formats, with different ``variables''. We find it more convenient to use a \emph{constant} set of variables for a single transformation of a certain ``kind'' of OAI-PMH XML results. Thus, we do not immediately transform the metadata, but instead leave them as lists of XML nodes to be transformed in a second stage (with variables differing according to the metadata format; currently, metadata in the Dublin Core and RFC 1807 (\url{https://www.rfc-editor.org/rfc/rfc1807}) formats can be transformed).
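
The ``list matrix'' arrangement described above can be sketched in base R without any harvesting.
The two toy records below use made-up identifiers, datestamps and set specs (and omit the metadata and about variables); they only serve to contrast subscripting a list matrix with subscripting a data frame that has a list column.
<<>>=
## Toy records with made-up values; each is a list of "variables".
r1 <- list(identifier = "oai:example.org:1", datestamp = "2005-03-01",
           setSpec = c("a", "a:b"))
r2 <- list(identifier = "oai:example.org:2", datestamp = "2005-04-15",
           setSpec = character())
## Combining by rows gives a list matrix: a list with a dim attribute.
recs <- rbind(r1, r2, deparse.level = 0)
dim(recs)
## Selecting one row drops the dim attribute and gives back a plain
## list whose elements are the values themselves.
recs[1L, ]$setSpec
## In a data frame with a list column, the same selection yields a
## length one sublist instead.
d <- data.frame(identifier = c("oai:example.org:1", "oai:example.org:2"))
d$setSpec <- list(c("a", "a:b"), character())
d[1L, ]$setSpec
@
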
These principles (using lists of single observations on variables and possibly arranging them in a rectangular way, and transforming to constant sets of variables) apply to all transformations of OAI-PMH XML results.
Transformations can be added by assigning functions in the (currently internal) environment \verb|oaih_transform_methods_db|.

As an example consider WU Research, an electronic publication platform for research output provided by WU (Wirtschaftsuniversit\"at Wien), which provides an OAI repository at \url{https://research.wu.ac.at/ws/oai}.
<<>>=
library("OAIHarvester")
baseurl <- "https://research.wu.ac.at/ws/oai"
@
% If we print raw XML, ensure sanity.
<<echo=FALSE,results=hide>>=
if(inherits(tryCatch(oaih_identify(baseurl), error = identity), "error"))
    q()
require("xml2")
options(warnPartialMatchArgs = FALSE)
options(width = 80)
@

We can use \verb|oaih_identify| to retrieve information about the repository.
<<>>=
x <- oaih_identify(baseurl)
rbind(x, deparse.level = 0)
@
Here, \verb|rbind| achieves ``pretty-printing'': we can see that the repository provides no compression support, and \Sexpr{length(x$description)} % $
further description entries of kind
<<>>=
vapply(x$description, xml_name, "")
@ %
where entry 2 indicates that the repository complies with the OAI format for unique record identifiers:
<<>>=
oaih_transform(x$description[[2L]])
@

We can use \verb|oaih_list_metadata_formats| and \verb|oaih_list_sets| to find out about available metadata formats and sets:
<<>>=
oaih_list_metadata_formats(baseurl)
sets <- oaih_list_sets(baseurl)
rbind(head(sets, 3L), tail(sets, 3L))
@
The available formats include the mandatory Dublin Core format, and there is a fairly refined set hierarchy for selective harvesting.
To get all publications from year 2005, we can use
<<>>=
x <- oaih_list_records(baseurl, set = "publications:year2005")
@
This gives a ``list matrix'' with observations of \Sexpr{ncol(x)} variables on \Sexpr{nrow(x)} items:
<<>>=
dim(x)
colnames(x)
@ %
Transforming the Dublin Core metadata is achieved by calling \verb|oaih_transform| on the metadata column, after first removing empty metadata (from deleted records):
<<>>=
m <- x[, "metadata"]
m <- oaih_transform(m[lengths(m) > 0L])
dim(m)
@ %
giving observations on the 15 (simple) Dublin Core elements:
<<>>=
colnames(m)
@

The topics of the records are available in the `subject' DC variable, with comment ``Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary.'' (see \url{https://dublincore.org/documents/dcmi-terms/#terms-subject}), but without a more detailed syntactic or semantic specification.
Inspecting the output of \verb|m[, "subject"]|, e.g.,
<<>>=
m[head(which(lengths(m[, "subject"]) > 0), 3L), "subject"]
@
shows that the subject entries include classification codes following a common scheme of \verb|/dk/atira/pure/keywords| URNs; dropping these, we can obtain all keywords via
<<>>=
keywords <- unlist(m[, "subject"])
keywords <- keywords[!startsWith(keywords, "/dk/atira/pure")]
@
giving a total of \Sexpr{length(keywords)} keywords.
Many of these only occur once:
<<>>=
counts <- table(keywords)
table(counts)
@
The most frequently used keywords are
<<>>=
sort(counts[counts >= 10L], decreasing = TRUE)
@
showing quite a busy year for the old Research Report Series of the Statistics and Mathematics unit.

To find the records co-authored by myself, one can use
<<>>=
pos <- which(vapply(m[, "creator"],
                    function(e) any(startsWith(e, "Hornik")),
                    NA))
@
(note that each creator entry is a character vector of author names).
This finds \Sexpr{length(pos)} records:
<<>>=
unlist(m[pos, "title"])
@
of various types:
<<>>=
table(unlist(m[pos, "type"]))
@
some of which have keywords:
<<>>=
pos <- pos[lengths(m[pos, "subject"]) > 0L]
@
Only one of these is somewhat useful:
<<>>=
unique(m[pos, "subject"])
@

Note that OAI-PMH objects obtained by OAI-PMH requests and subsequent transformations are made up of both character vectors and XML nodes from package \pkg{xml2}, with the latter lists of external pointers.
Thus, some extra effort is necessary to save OAI-PMH objects to a file or to restore these from a file: see \verb|?oaih_save_RDS| for more information.

{\small
  \bibliographystyle{abbrvnat}
  \bibliography{oaih}
}

\end{document}