---
title: "Introduction to the scimetr package"
author: "UDC Ranking's Group"
date: '`r paste0("scimetr ", packageVersion("scimetr"),": ", Sys.Date())`'
output:
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to the scimetr package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.dim = c(8, 6), fig.align = "center", out.width = "80%")
old.opt <- options(digits = 5)
# rebuid <- FALSE # TRUE
# knitr::spin("scimetr.R", knit = FALSE)
# knitr::purl("scimetr.Rmd", documentation = 2)
```

```{r }
library(scimetr)
```

This vignette illustrates the use of the [`scimetr`](https://rubenfcasal.github.io/scimetr/) package for performing bibliometric analyses using datasets exported from Web of Science, highlighting the main workflows and functionalities.
The package provides tools for scientometric and bibliometric research, including routines to import bibliographic records from [*Clarivate Analytics Web of Science*](https://www.webofscience.com/wos/) (WoS) and conduct bibliometric analyses.
A list of other useful R packages for this type of analysis is available [here](https://rubenfcasal.github.io/scimetr/articles/docs/R_packages.html).

# Installation

Since the package is not yet available on CRAN, you need to install the development version from the GitHub repository [rubenfcasal/scimetr](https://github.com/rubenfcasal/scimetr):

```{r eval=FALSE}
# install.packages("remotes")
remotes::install_github("rubenfcasal/scimetr")
```

Alternatively, Windows users may install the corresponding *scimetr_X.Y.Z.zip* file from the [releases section](https://github.com/rubenfcasal/scimetr/releases/latest) of the GitHub repository.
It is recommended to first install its dependencies:

```{r eval=FALSE}
# Dependencies
install.packages(c("dplyr", "tidyr", "stringr", "ggplot2", "scales", "rlang", "openxlsx"))
# Last released version
install.packages("https://github.com/rubenfcasal/scimetr/releases/download/v1.2.0/scimetr_1.2.0.zip",
  repos = NULL
)
```

Once the package is installed, it can be loaded as usual.

# Bibliographic data

We will focus exclusively on importing publication data from [WoS](https://www.webofscience.com/wos/) in text format.
First, you need to download the corresponding files from the WoS website, for example, by following the steps described [here](https://rubenfcasal.github.io/scimetr/articles/WoS_export.html).

## Loading WoS data from a directory

WoS files (which by default are limited to 500 records each) can be automatically loaded from a subdirectory:

```{r eval=FALSE}
dir("UDC_2018-2023 (01-02-2024)", pattern = "*.txt")
```

```{r echo=FALSE}
# dput(dir("UDC_2014-2023 (01-02-2024)", pattern='*.txt'))
c(
  "savedrecs01.txt", "savedrecs02.txt", "savedrecs03.txt", "savedrecs04.txt",
  "savedrecs05.txt", "savedrecs06.txt", "savedrecs07.txt", "savedrecs08.txt",
  "savedrecs09.txt", "savedrecs10.txt"
)
```

To combine the files into a `data.frame`, the `import_wos()` function is used:

```{r eval=FALSE}
wos.data <- import_wos("UDC_2018-2023 (01-02-2024)")
```

Next, the database must be created using the `db_bib()` function, as shown later.

## Example data

The package includes the example dataset `wosdf` (obtained using the `import_wos()` function), corresponding to a WoS search by the Affiliation field of *Universidade da Coruña (UDC)* (Affiliation: OG = Universidade da Coruna) in the research area `"Mathematics"` during the years 2018–2023.

All data tables have an associated `variable.labels` attribute with the variable labels.
These will be displayed below the variable names when viewed in RStudio (e.g. `View(wosdf)`).
```{r }
wos.labels <- attr(wosdf, "variable.labels")
knitr::kable(head(data.frame(wos.labels)),
  col.names = c("Variable", "Label")
)
```

...

A full list of the variables used in the database tables is shown in the final section [*Variable list*](#variables) of this document.

# Bibliographic database

`scimetr` uses lists with `data.frame` components as relational databases.
To create the [bibliographic database](https://en.wikipedia.org/wiki/Bibliographic_database), use the `db_bib()` function (the result is a `wos.db`-class S3 object):

```{r wosdf}
db <- db_bib(wosdf, label = "Mathematics_UDC_2018-2023")
names(db)
```

## Summaries

You can generate either global summaries or yearly summaries of your database.

### Global summary

The `summary()` method of a bibliographic database (`summary.wos.db()`) provides an overview of the entire database, including total documents, authors, journals, citations, and other aggregated statistics.

```{r summary}
res1 <- summary(db)
res1
```

### Yearly summary

The `summary_year()` method breaks down the summary *by year*, showing trends over time in publications, citations, and other key metrics.

```{r }
res2 <- summary_year(db)
res2
```

## Visualizations

The [`ggplot2`](https://ggplot2.tidyverse.org) package is used to create a wide variety of visualizations from the database.
There are three main types of plots you can create:

- Database plots (`plot(db)`).
- Summary plots (`plot(summary(db))`).
- Yearly summary plots (`plot(summary_year(db))`).

Note: All `plot()` methods invisibly return a list with the generated `ggplot2` objects (use `plot = FALSE` to avoid plotting).

### Database plots

The `plot()` method of a bibliographic database (`plot.wos.db()`) provides a general visualization of its contents.
```{r plotdb, warning=FALSE, message=FALSE}
plot(db)
```

### Summary plots

The `plot()` method of a summary result (`plot.summary.wos.db()`) visualizes the results with different types of plots: standard bar and line plots, or pie charts (`pie = TRUE`).

```{r }
plot(res1)
plot(res1, pie = TRUE)
```

### Yearly summary plots

The `plot()` method of a yearly summary (`plot.summary.year()`) visualizes the results with different types of plots: standard bar and line plots for trends over time, or boxplots (`boxplot = TRUE`) to show variability within each year.

```{r }
plot(res2)
plot(res2, boxplot = TRUE)
```

## Filtering

To filter elements (entities) of the database, use the `get_id_*()` functions (`get_id_authors()`, `get_id_areas()`, `get_id_categories()`, `get_id_sources()`) to retrieve identification codes (IDs or entity key values).
Any variable from the corresponding table may be used (multiple conditions are combined with `&`; see e.g. `dplyr::filter()`).

Typically, these functions are used together with `get_id_docs()` to obtain document IDs, which are then passed to the `filter` argument of the summary functions to restrict the results to those documents.

### Retrieving element IDs (entity key values)

- **`get_id_authors()`**: Retrieve author IDs (codes)

Search for a specific author:

```{r}
ida <- get_id_authors(db, AF == "Cao, Ricardo")
ida
```

Search by partial name match:

```{r}
idas <- get_id_authors(db, grepl("Cao", AF))
idas
```

- **`get_id_areas()`**: Retrieve codes for research areas

```{r}
get_id_areas(db, SC == "Mathematics")
get_id_areas(db, SC == "Mathematics" | SC == "Computer Science")
```

- **`get_id_categories()`**: Retrieve category codes

```{r}
get_id_categories(db, grepl("Mathematics", WC))
```

- **`get_id_sources()`**: Retrieve source codes (journals, books, or collections)

```{r}
idtest <- get_id_sources(db, SO == "TEST")
idtest
knitr::kable(t(db$Sources[idtest, ]),
  caption = "Test journal",
  col.names = c("Variable", "Value")
)
# get_id_sources(db, JI == 'Test')
```

### Retrieving document IDs (by authors, journals, etc.)

The IDs retrieved above can be combined in `get_id_docs()`:

```{r}
idocs <- get_id_docs(db, id_authors = ida)
idocs
```

The document IDs can be used as filters, for example, in `summary.wos.db()`.

### Filtered Summaries

Get a summary for one or more authors:

```{r}
summary(db, idocs)
```

Get a year-by-year summary for one or more authors:

```{r }
summary_year(db, idocs)
```

## Author metrics

Retrieve metrics for multiple authors:

```{r }
author_metrics(db, idas)
```

# Bibliographic database with JCR metrics

We can extend the bibliographic database by adding JCR metrics to sources, by year and WoS category.
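As a rough picture of what attaching journal metrics "per year" involves, the following base R sketch joins a toy documents table to a toy metrics table on a source key and a publication year. All object and column names (`toy_docs`, `toy_metrics`, `idd`, `ids`, `PY`, `JIF`) and all values are hypothetical, chosen only for illustration; this is not how `scimetr` stores or merges the data internally.

```r
# Toy documents table: each document has a source (journal) key and a year.
toy_docs <- data.frame(
  idd = 1:4,
  ids = c(1, 1, 2, 3),
  PY  = c(2018, 2019, 2018, 2019)
)

# Toy journal metrics: one row per source and year (made-up values).
toy_metrics <- data.frame(
  ids = c(1, 1, 2),
  PY  = c(2018, 2019, 2018),
  JIF = c(1.2, 1.5, 0.8)
)

# Left join on source and year; documents without a matching metric get NA.
toy_docjcr <- merge(toy_docs, toy_metrics, by = c("ids", "PY"), all.x = TRUE)
toy_docjcr <- toy_docjcr[order(toy_docjcr$idd), ]
toy_docjcr
```

Keeping unmatched documents (`all.x = TRUE`) is a design choice of this sketch: documents from sources without metrics for a given year stay in the table with missing values instead of being dropped.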
## Import JCR data from WoS

Excel files with JCR data (available from WoS) can be automatically loaded from a subdirectory:

```{r eval=FALSE}
dir("JCR_download", pattern = "*.xlsx")
```

```{r echo=FALSE}
# dput(dir("../../JCR_download", pattern='*.xlsx'))
c(
  "JCR_SCIE_2018.xlsx", "JCR_SCIE_2019.xlsx", "JCR_SCIE_2020.xlsx",
  "JCR_SCIE_2021.xlsx", "JCR_SCIE_2022.xlsx", "JCR_SCIE_2023.xlsx",
  "JCR_SSCI_2018.xlsx", "JCR_SSCI_2019.xlsx", "JCR_SSCI_2020.xlsx",
  "JCR_SSCI_2021.xlsx", "JCR_SSCI_2022.xlsx", "JCR_SSCI_2023.xlsx"
)
```

To combine the files into a relational database, the `db_jcr()` function is used:

```{r db-jcr, eval=FALSE}
jcr <- db_jcr("JCR_download")
```

## Add JCR data to a bibliographic database

The JCR data can be combined with a bibliographic database by using the `add_jcr()` function:

```{r add-jcr, eval=FALSE}
dbjcr <- add_jcr(db, jcr)
```

Two additional tables, `JCRSour` and `JCRCatSour`, are added to the bibliographic database (the result is a `wos.jcr`-class S3 object).

The database resulting from this particular example is provided as a dataset of the `scimetr` package:

```{r dbjcr}
names(dbjcr)
class(dbjcr)
```

## Summaries

Auxiliary functions are available to perform database queries:

- **`get_jcr()`**: combines document indexes with their source JCR metrics per year.

```{r}
head(get_jcr(dbjcr))
```

- **`get_jcr_cat()`**: combines document indexes with their source JCR metrics per year and WoS category (if `best = TRUE`, only the results for the WoS category with the best ranking for each document are returned).

```{r}
head(get_jcr_cat(dbjcr, best = TRUE))
```

The summary methods combine the above queries and generate global summaries or yearly summaries of a bibliographic database with JCR metrics (if `all = TRUE`, the corresponding `wos.db`-class results are also generated).

```{r}
res1 <- summary(dbjcr)
res1
res2 <- summary_year(dbjcr)
res2
```

Note: `res1$docjcrcat` contains the combined queries.
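To illustrate what the `best = TRUE` option of `get_jcr_cat()` does conceptually, the sketch below keeps, for each document, the category row with the best ranking. It uses base R on a made-up table; the column names (`idd`, `WC`, `rank_pct`) and the convention that a lower value means a better ranking are assumptions for this example, not the actual output format of `get_jcr_cat()`.

```r
# Toy table: one row per document and WoS category (made-up rankings).
toy_cat <- data.frame(
  idd = c(1, 1, 2, 2, 3),
  WC = c(
    "Mathematics", "Statistics & Probability",
    "Mathematics", "Mathematics, Applied",
    "Statistics & Probability"
  ),
  rank_pct = c(40, 15, 60, 75, 30) # assumption: lower is better
)

# Sort by document and ranking, then keep the first (best) row per document.
toy_sorted <- toy_cat[order(toy_cat$idd, toy_cat$rank_pct), ]
toy_best <- toy_sorted[!duplicated(toy_sorted$idd), ]
toy_best
```

The result has one row per document, which is convenient when each document should be counted only once in a summary.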
## Visualizations

The `plot()` method (`plot.wos.jcr()`) generates histograms of the document JCR metrics (if argument `all = TRUE`, `plot.wos.db()` is also called):

```{r plotdbjcr, message=FALSE, warning=FALSE, out.width = '100%'}
plot(dbjcr)
```

Summary results also have a `plot()` method:

```{r plot.summary.jcr, fig.dim = c(10, 6), out.width = '100%'}
plot(res1)
```

```{r plot.summary.year.jcr, out.width = '100%'}
plot(res2)
```

# Variable list {#variables}

The following list shows all variables used in the database tables:

```{r echo=FALSE}
all.labels <- data.frame(scimetr:::.all.labels)
DT::datatable(all.labels,
  colnames = c("Variable", "Label"),
  filter = "top", options = list(pageLength = 10)
)
```

Note: The table above was generated using the `datatable()` function from the [DT](https://rstudio.github.io/DT/) package.
By default, the first 10 rows are displayed. Click on the page index (below) to change this.
You can sort by column by clicking on the arrows to the right of its name.
You can search for values (search box, top right) and filter values (click on the filter box below the name of each variable).

```{r echo=FALSE}
# Restore user's options
options(old.opt)
```