---
title: "`createGADS`: Creating a relational data base"
author: "Benjamin Becker"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{`createGADS`: Creating a relational data base}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In the context of educational large-scale assessments (but also in other contexts) we frequently encounter data sets which have an hierarchical structure. In educational large-scale assessments these can, for example, be pupils nested in schools. Additional, hidden nested structures occur, if missing data are treated with multiple imputations or person parameters are estimated using plausible values. In these cases it is inefficient to store all the data in one rectangular data set. In other data science applications the use of relational data bases is a widespread measure to tackle this challenge. 

`eatGADS` supports creating such relational data bases (based on the open source software `SQLite3` and the `R` package `eatDB`) while maintaining its meta data and providing a very user friendly interface for data users who are unfamiliar with relational data bases. In doing so, it allows the handling of large data sets even on limited hardware settings. Furthermore, this approach allows the extraction of data from different hierarchy levels, which means that data has to be reshaped very rarely. 

This vignette illustrates how a relational `eatGADS` data base can be created from a rectangular `SPSS` (`.sav`) data file. For illustrative purposes we use a small example data set from the campus files of the German PISA Plus assessment. The complete campus files and the original data set can be accessed [here](https://www.iqb.hu-berlin.de/fdz/Datenzugang/CF-Antrag/AntragsformularCF) and [here](https://www.iqb.hu-berlin.de/fdz/Datenzugang/SUF-Antrag). 

## Importing data

```{r setup}
library(eatGADS)
```

We can import an `.sav` (or an compressed `.zsav`) data set via the `import_spss()` function. Checks on variable names for `SQLite3` compliance are performed automatically. Changes to the variable names are reported to the console.

```{r import spss}
sav_path <- system.file("extdata", "pisa.zsav", package = "eatGADS")
dat <- import_spss(sav_path)
```

The next steps depend on the data structure: If the different hierarchy levels are saved in different source data sets (e.g. different `.sav` files) the next section can be skipped. However, sometimes data from different hierarchy levels is saved in one data file. Then, splitting and reshaping becomes necessary.

## Splitting and Reshaping

In this case, we want to split the imported `GADSdat` object into its hierarchy levels (in our example: background data on level 1 and imputed competence data on level 2). This can be achieved by the `splitGADS()` function. We specify the hierarchical structure as a `list`. After this, we can extract separate `GADSdat` objects by name via the `extractGADS()` function. These objects can then be used for reshaping. 

For reasons of simplicity, the example only contains two hierarchy levels. In practice, often more hierarchy levels are present. Splitting can be performed into as many hierarchy levels as desired. The reshaping has to be performed for each hierarchy level separately.

```{r split GADS}
pvs <- grep("pv", namesGADS(dat), value = T)
splitted_gads <- splitGADS(dat, nameList = list(noImp = namesGADS(dat)[!namesGADS(dat) %in% pvs],
                    PVs = c("idstud", pvs)))
# new Structure
namesGADS(splitted_gads)

# Extract GADSdat objects
noImp_gads <- extractGADSdat(splitted_gads, "noImp")
pvs_gads <- extractGADSdat(splitted_gads, "PVs")
```

For reshaping data we highly recommend the `R` package `tidyr`. Its performance might be less optimized than for example the `data.table` package, however it is very intuitive and user friendly. For our example data set we need to reshape the `PVs` from wide to long format and then separate the resulting column into two columns, containing the `dimension` and imputation number (`imp`) (Note: This results in a data set in which different dimensions for a single student are stored in separate rows, not columns). For this, we directly access the data in the `GADSdat` object via `pvs_gads$dat`. The reshaping is performed by `tidyr::pivot_longer()`. `tidyr::separate()` is used to separate our two additional identifier columns (`dimension` and `imp`).  Finally, we clean the `imp` column and transform it to numeric.

```{r reshape PVs}
# Extract raw data from pv gads
pvs_wide <- pvs_gads$dat

# Wide format
head(pvs_wide)

pvs_long1 <- tidyr::pivot_longer(pvs_wide, cols = all_of(pvs))
pvs_long2 <- tidyr::separate(pvs_long1, col = "name", sep = "_", into = c("dimension", "imp"))
pvs_long2$imp <- as.numeric(gsub("pv", "", pvs_long2$imp))

# Finale long format
head(as.data.frame(pvs_long2))
```

## Handling meta data

After reshaping we adapt the meta data in our initial `GADSdat` object via `updateMeta()`. This is necessary, as variables have been removed from the data set (e.g. `"ma_pv1"` etc.) and new variables have replaced them (`"value"`, `"dimension"`, `"imp"`). Now we have to add some variable labels, as most of the old variable labels got lost due to the reshaping. For an extensive tutorial see the vignette [Handling Meta Data](import_raw.html).

```{r updateMeta}
pvs_gads2 <- updateMeta(pvs_gads, newDat = as.data.frame(pvs_long2))
extractMeta(pvs_gads2)

# 
pvs_gads2 <- changeVarLabels(pvs_gads2, varName = c("dimension", "imp", "value"),
                varLabel = c("Achievement dimension (math, reading, science)",
                             "Number of imputation of plausible values",
                             "Plausible Value"))
extractMeta(pvs_gads2)
```

## Preparing and Creating the data base

For the creation of a relational data base we recreate the initial hierarchical structure via `mergeLabels()` (which performs the reverse action as `extractGADS()`). Furthermore, we create two lists, a primary key list (`pkList`) and a foreign key list (`fkList`). Primary keys are the variables that uniquely identify each row within every hierarchy level. Foreign keys are the variables that allow merging between different hierarchy levels. In the list of foreign keys we also have to specify another hierarchy level, to which each hierarchy level connects. An exception is the lowest hierarchy levels, which serves as a basis. 

By setting up the order and the foreign keys of the data base we specify how the data is merged together when we extract data from it. In contrast to conventional relational data bases, `eatGADS` data bases are less flexible: The package does not support modifying the data base after its creation or extracting data from it with different merges than specified when it was set up.

```{r prepare data base}
merged_gads <- mergeLabels(noImp = noImp_gads, PVs = pvs_gads2)

pkList <- list(noImp = "idstud",
               PVs = c("idstud", "imp", "dimension"))
fkList <- list(noImp = list(References = NULL, Keys = NULL),
               PVs = list(References = "noImp", Keys = "idstud"))
```

Finally, we create the relational data base on disc via the `createGADS()` function.

```{r create data base}
temp_path <- paste0(tempfile(), ".db")

createGADS(merged_gads, pkList = pkList, fkList = fkList,
           filePath = temp_path)
```

For a detailed tutorial on how to use a relational `eatGADS` data base see the vignette [`getGADS`: Using a relational eatGADS data base](getGADS.html).