---
title: "Introduction to the scimetr package"
author: "UDC Ranking's Group"
date: '`r paste0("scimetr ", packageVersion("scimetr"),": ", Sys.Date())`'
output: 
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to the scimetr package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.dim = c(8, 6), fig.align = "center", 
                      out.width = "80%")
old.opt <- options(digits = 5)
# rebuid <- FALSE # TRUE
# knitr::spin("scimetr.R", knit = FALSE)
# knitr::purl("scimetr.Rmd", documentation = 2)
```


```{r }
library(scimetr)
```


This vignette illustrates the use of the [`scimetr`](https://rubenfcasal.github.io/scimetr/) package for performing bibliometric analyses using datasets exported from Web of Science, highlighting the main workflows and functionalities. The package provides tools for scientometric and bibliometric research, including routines to import bibliographic records from [*Clarivate Analytics Web of Science*](https://www.webofscience.com/wos/) (WoS) and conduct bibliometric analyses. 

<!-- 
For more information visit <https://rubenfcasal.github.io/scimetr>. 
-->

A list of other useful R packages for this type of analysis is available [here](https://rubenfcasal.github.io/scimetr/articles/docs/R_packages.html).


# Installation

Since the package is not yet available on CRAN, you need to install the development version from the GitHub repository [rubenfcasal/scimetr](https://github.com/rubenfcasal/scimetr):

```{r eval=FALSE}
# install.packages("remotes")
remotes::install_github("rubenfcasal/scimetr")
```

Alternatively, Windows users may install the corresponding *scimetr_X.Y.Z.zip* file from the [releases section](https://github.com/rubenfcasal/scimetr/releases/latest) of the GitHub repository.
It is recommended to install its dependencies first:

<!-- 
Pending:
  Latest version in releases section
  Update README.md
-->
```{r eval=FALSE}
# Dependencies
install.packages(c("dplyr", "tidyr", "stringr", "ggplot2", "scales", "rlang", "openxlsx"))
# Last released version
install.packages("https://github.com/rubenfcasal/scimetr/releases/download/v1.2.0/scimetr_1.2.0.zip", repos = NULL)
```
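
Before installing, it can help to check which of these dependencies are already present. The following is a minimal sketch using only base R; the object names `deps` and `missing.deps` are illustrative and not part of `scimetr`:

```{r eval=FALSE}
# Dependencies listed above
deps <- c("dplyr", "tidyr", "stringr", "ggplot2", "scales", "rlang", "openxlsx")
# requireNamespace() returns FALSE (instead of failing) when a package is absent
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing.deps <- deps[!installed]
missing.deps
# Install only the missing ones (uncomment to run)
# if (length(missing.deps) > 0) install.packages(missing.deps)
```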

Once the package is installed, it can be loaded as usual.

# Bibliographic data

We will focus exclusively on importing publication data from [WoS](https://www.webofscience.com/wos/) in text format.
First, you need to download the corresponding files from the WoS website, for example, by following the steps described [here](https://rubenfcasal.github.io/scimetr/articles/WoS_export.html).

## Loading WoS data from a directory

WoS files (which by default are limited to 500 records each) can be automatically loaded from a subdirectory:

```{r eval=FALSE}
dir("UDC_2018-2023 (01-02-2024)", pattern = "\\.txt$")
```
```{r echo=FALSE}
# dput(dir("UDC_2014-2023 (01-02-2024)", pattern='*.txt'))
c(
  "savedrecs01.txt", "savedrecs02.txt", "savedrecs03.txt", "savedrecs04.txt",
  "savedrecs05.txt", "savedrecs06.txt", "savedrecs07.txt", "savedrecs08.txt",
  "savedrecs09.txt", "savedrecs10.txt"
)
```
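
Note that the `pattern` argument of `dir()` is interpreted as a regular expression, not as a shell wildcard. The base R sketch below (with made-up file names) shows an anchored expression that keeps only names ending in `.txt`, and how `utils::glob2rx()` converts a wildcard into an equivalent regular expression:

```{r}
# Made-up file names, for illustration only
files <- c("savedrecs01.txt", "savedrecs02.txt", "savedrecs01.txt.bak", "notes.md")
# Anchored regular expression: a literal dot followed by "txt" at the end
txt.files <- files[grepl("\\.txt$", files)]
txt.files
# Regular expression equivalent to the wildcard "*.txt"
rx <- utils::glob2rx("*.txt")
rx
glob.files <- files[grepl(rx, files)]
glob.files
```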

To combine the files into a `data.frame`, the `import_wos()` function is used:

```{r eval=FALSE}
wos.data <- import_wos("UDC_2018-2023 (01-02-2024)")
```

Next, the database must be created using the `db_bib()` function, as shown later.


## Example data

The package includes the example dataset `wosdf` (obtained using the `import_wos()` function), corresponding to a WoS search by the Affiliation field of *Universidade da Coruña (UDC)* (Affiliation: OG = Universidade da Coruna) in the research area `"Mathematics"` during the years 2018–2023.

All data tables have an associated `variable.labels` attribute with the variable labels. 
These will be displayed below the variable names when viewed in RStudio (e.g. `View(wosdf)`).

```{r }
wos.labels <- attr(wosdf, "variable.labels")
knitr::kable(head(data.frame(wos.labels)),
  col.names = c("Variable", "Label")
)
```
...

A full list of the variables used in the database tables is shown in the final section [*Variable list*](#variables) of this document.
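
As a rough illustration of how such a label attribute works, the following base R sketch attaches labels to a toy data frame and looks them up by variable name. The data and the label texts are made up for this example and are not taken from `wosdf`:

```{r}
# Toy data frame; variable names and labels are illustrative only
toy <- data.frame(PY = c(2018L, 2019L), TC = c(10L, 3L))
attr(toy, "variable.labels") <- c(PY = "Publication year", TC = "Times cited")
# Retrieve the labels and look one up by variable name
toy.labels <- attr(toy, "variable.labels")
toy.labels[["TC"]]
# Variable/label pairs as a table
data.frame(Variable = names(toy.labels), Label = unname(toy.labels))
```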


# Bibliographic database

`scimetr` uses lists with `data.frame` components as relational databases.

To create the [bibliographic database](https://en.wikipedia.org/wiki/Bibliographic_database), use the `db_bib()` function (the result is a `wos.db`-class S3 object):

```{r wosdf}
db <- db_bib(wosdf, label = "Mathematics_UDC_2018-2023")
names(db)
```
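
To illustrate the idea of a list of `data.frame` components acting as a relational database, here is a toy sketch in base R. The tables, key names (`idd`, `ids`) and values are chosen for illustration only and should not be read as the exact structure of a `wos.db` object:

```{r}
# Toy "database": two tables linked by the key column `ids`
toy.db <- list(
  Docs = data.frame(idd = 1:3, ids = c(1L, 1L, 2L), PY = c(2018L, 2019L, 2019L)),
  Sources = data.frame(ids = 1:2, SO = c("JOURNAL A", "JOURNAL B"))
)
names(toy.db)
# Join each document with its source through the shared key
docs.sources <- merge(toy.db$Docs, toy.db$Sources, by = "ids")
docs.sources
```

Storing each entity once and linking tables by keys avoids repeating, for instance, the journal information in every document row.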


## Summaries

You can generate either global summaries or yearly summaries of your database.

### Global summary

The `summary()` method for a bibliographic database (`summary.wos.db()`) provides an overview of the entire database, including total documents, authors, journals, citations, and other aggregated statistics.

```{r summary}
res1 <- summary(db)
res1
```

### Yearly summary

The `summary_year()` method breaks down the summary *by year*, showing trends over time in publications, citations, and other key metrics. 

```{r }
res2 <- summary_year(db)
res2
```


## Visualizations

The [`ggplot2`](https://ggplot2.tidyverse.org) package is used to create a wide variety of visualizations from the database.  
There are three main types of plots you can create:

- Database plots (`plot(db)`).

- Summary plots (`plot(summary(db))`).

- Yearly summary plots (`plot(summary_year(db))`).

Note: All `plot()` methods invisibly return a list with the generated `ggplot2` objects (use `plot = FALSE` to avoid plotting).
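
For instance, the returned list can be stored and the plots drawn or customized later. This is only a sketch: it relies on the `plot = FALSE` argument mentioned in the note, recreates `db` from the example dataset, and assumes that the first element of the list is a `ggplot2` object that accepts additional layers:

```{r eval=FALSE}
library(scimetr)
db <- db_bib(wosdf, label = "Mathematics_UDC_2018-2023")
# Build the plots without drawing them
plots <- plot(db, plot = FALSE)
length(plots)
# Draw one of them later, here with a different ggplot2 theme
# (assumes the first element is a ggplot2 object)
plots[[1]] + ggplot2::theme_minimal()
```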

### Database plots 

The `plot()` method of a bibliographic database (`plot.wos.db()`) provides a general visualization of its contents.  

```{r plotdb, warning=FALSE, message=FALSE}
plot(db)
```

### Summary plots 

The `plot()` method of a summary result (`plot.summary.wos.db()`) visualizes the results with different types of plots: standard bar and line plots, or pie charts (`pie = TRUE`).

```{r }
plot(res1)
plot(res1, pie = TRUE)
```

### Yearly summary plots

The `plot()` method of a yearly summary (`plot.summary.year()`) visualizes the results generating different types of plots: standard bar and line plots for trends over time, or boxplots (`boxplot = TRUE`), to show variability within each year.

```{r }
plot(res2)
plot(res2, boxplot = TRUE)
```


## Filtering

To filter elements (entities) of the database, you can use the functions `get_id_<table>()` to retrieve identification codes (IDs or entity key values). Any variable from the corresponding `<table>` may be used (multiple conditions are combined with `&`; see e.g. `dplyr::filter()`).

Typically, these functions are combined with the `get_id_docs()` function to obtain document IDs, which are then passed to the `filter` argument of the summary functions to restrict the results to those documents.


### Retrieving element IDs (entity key values)

- **`get_id_authors()`**: Retrieve author IDs (codes)

    Search for a specific author:
      
    ```{r}
    ida <- get_id_authors(db, AF == "Cao, Ricardo")
    ida
    ```
      
    Search by partial name match:
      
    ```{r}
    idas <- get_id_authors(db, grepl("Cao", AF))
    idas
    ```

- **`get_id_areas()`**: Retrieve codes for research areas

    ```{r}
    get_id_areas(db, SC == "Mathematics")
    get_id_areas(db, SC == "Mathematics" | SC == "Computer Science")
    ```

- **`get_id_categories()`**: Retrieve category codes

    ```{r}
    get_id_categories(db, grepl("Mathematics", WC))
    ```

- **`get_id_sources()`**: Retrieve source codes (journals, books, or collections)

    ```{r}
    idtest <- get_id_sources(db, SO == "TEST")
    idtest
    knitr::kable(t(db$Sources[idtest, ]),
      caption = "Test journal",
      col.names = c("Variable", "Value")
    )
    # get_id_sources(db, JI == 'Test')
    ```


### Retrieving document IDs (by authors, journals, etc.)

The IDs retrieved above can be combined in `get_id_docs()`:

```{r}
idocs <- get_id_docs(db, id_authors = ida)
idocs
```

The document IDs can be used as filters, for example, in `summary.wos.db()`.
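
Conceptually, this two-step lookup resembles filtering an authors table by a condition and then following a link table to the documents. The base R sketch below uses made-up tables and column names (`authors`, `aut.doc`, `ida`, `idd`) and fictional author names; it is not how `scimetr` is implemented, only an illustration of the idea:

```{r}
# Toy tables: authors, and a link table relating authors to documents
authors <- data.frame(ida = 1:3, AF = c("Doe, Jane", "Doe, John", "Roe, Richard"))
aut.doc <- data.frame(ida = c(1L, 1L, 2L, 2L, 3L), idd = c(10L, 11L, 11L, 12L, 13L))
# Step 1: author IDs by exact match and by partial match on the name
id.exact <- authors$ida[authors$AF == "Doe, Jane"]
id.partial <- authors$ida[grepl("Doe", authors$AF)]
# Step 2: document IDs linked to the selected authors (duplicates removed)
docs.exact <- unique(aut.doc$idd[aut.doc$ida %in% id.exact])
docs.partial <- unique(aut.doc$idd[aut.doc$ida %in% id.partial])
docs.exact
docs.partial
```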


### Filtered summaries

Get a summary for one or more authors:

```{r}
summary(db, idocs)
```


Get a year-by-year summary for one or more authors:

```{r }
summary_year(db, idocs)
```


## Author metrics

Retrieve metrics for multiple authors:

```{r }
author_metrics(db, idas)
```


# Bibliographic database with JCR metrics

We can extend the bibliographic database by adding JCR metrics to sources, per year and WoS category.

## Import JCR data from WoS

Excel files with JCR data (available from WoS) can be automatically loaded from a subdirectory:

```{r eval=FALSE}
dir("JCR_download", pattern = "\\.xlsx$")
```
```{r echo=FALSE}
# dput(dir("../../JCR_download", pattern='*.xlsx'))
c(
  "JCR_SCIE_2018.xlsx", "JCR_SCIE_2019.xlsx", "JCR_SCIE_2020.xlsx",
  "JCR_SCIE_2021.xlsx", "JCR_SCIE_2022.xlsx", "JCR_SCIE_2023.xlsx",
  "JCR_SSCI_2018.xlsx", "JCR_SSCI_2019.xlsx", "JCR_SSCI_2020.xlsx",
  "JCR_SSCI_2021.xlsx", "JCR_SSCI_2022.xlsx", "JCR_SSCI_2023.xlsx"
)
```

To combine the files into a relational database, the `db_jcr()` function is used:

```{r db-jcr, eval=FALSE}
jcr <- db_jcr("JCR_download")
```

## Add JCR data to a bibliographic database

The JCR data can be combined with a bibliographic database by using the `add_jcr()` function:

```{r add-jcr, eval=FALSE}
dbjcr <- add_jcr(db, jcr)
```

Two additional tables, `JCRSour` and `JCRCatSour`, are added to the bibliographic database (the result is a `wos.jcr`-class S3 object).

The database resulting from this particular example is provided as a dataset of the `scimetr` package:

```{r dbjcr}
names(dbjcr)
class(dbjcr)
```


## Summaries

Auxiliary functions are available to perform database queries:

- **`get_jcr()`**: combines document indexes with their source JCR metrics per year.

    ```{r}
    head(get_jcr(dbjcr))
    ```

- **`get_jcr_cat()`**: combines document indexes with their source JCR metrics per year and WoS category (if `best = TRUE`, only the results for the WoS category with the best ranking for each document are returned).

    ```{r}
    head(get_jcr_cat(dbjcr, best = TRUE))
    ```
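
The effect of `best = TRUE` can be pictured as keeping, for each document, the single row of its best-ranked category. A toy base R sketch follows; the column names (`idd`, `WC`, `pos`), the values, and the rule "a lower position is better" are assumptions made for illustration, not the package's actual definition:

```{r}
# Toy table: one row per document and category, with a ranking position
doc.cat <- data.frame(
  idd = c(1L, 1L, 2L, 2L, 2L),
  WC = c("Cat A", "Cat B", "Cat A", "Cat B", "Cat C"),
  pos = c(40, 12, 75, 80, 30)
)
# Sort by document and position, then keep the first row of each document
ord <- doc.cat[order(doc.cat$idd, doc.cat$pos), ]
best <- ord[!duplicated(ord$idd), ]
best
```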

The summary methods combine the above queries and generate global summaries or yearly summaries of a bibliographic database with JCR metrics (if `all = TRUE`, the corresponding `wos.db`-class results are also generated).

```{r}
res1 <- summary(dbjcr)
res1
res2 <- summary_year(dbjcr)
res2
```

Note: `res1$docjcrcat` contains the combined queries.


## Visualizations

The `plot()` method (`plot.wos.jcr()`) generates histograms of the JCR metrics of the documents (if argument `all = TRUE`, `plot.wos.db()` is also called):

```{r plotdbjcr, message=FALSE, warning=FALSE, out.width = '100%'}
plot(dbjcr)
```

Summary results also have a `plot()` method:

```{r plot.summary.jcr, fig.dim = c(10, 6), out.width = '100%'}
plot(res1)
```
```{r plot.summary.year.jcr, out.width = '100%'}
plot(res2)
```


# Variable list {#variables}

The following list shows all variables used in the database tables:

```{r echo=FALSE}
all.labels <- data.frame(scimetr:::.all.labels)
DT::datatable(all.labels,
  colnames = c("Variable", "Label"), filter = "top",
  options = list(pageLength = 10)
)
```

Note: The table above was generated using the `datatable()` function from the [DT](https://rstudio.github.io/DT/) package.
By default, the first rows are displayed. 
Click on the page index (below) to change this. 
You can sort by column by clicking on the arrows to the right of its name. 
You can search for values (search box, top right) and filter values (click on the filter box below the name of each variable).

```{r echo=FALSE}
# Restore user's options
options(old.opt)
```


<!-- 
Pending:
  - Add summary_all(plot = TRUE)
-->

<!-- 
## Future work

- Download and analysis of data from [Scopus](https://www.scopus.com):
  using file downloads (PARTIALLY IMPLEMENTED)
  (in the style of the
  [CITAN](https://CRAN.R-project.org/package=CITAN) package)
  or the [Scopus API](https://dev.elsevier.com/sc_apis.html)
  (in the style of the
  [rscopus](https://CRAN.R-project.org/package=rscopus) package).
  
- Download and analysis of data from open sources (TFM Clever):
  in line with open source and open data to ensure transparency and reproducibility 

    - [openalexR](https://docs.ropensci.org/openalexR) (Aria et al. 2024):
    Getting Bibliographic Records from 'OpenAlex' Database Using 'DSL' API.
    
    - [rcrossref](https://docs.ropensci.org/rcrossref) (Chamberlain et al. 2014): R interface to various CrossRef APIs.
  

- Implement "advanced" statistical analyses.
-->