Comprehensive Data Cleaning Guide

Benjamin Becker, Marlen Holtmann, Johanna Busse

2023-08-25

eatGADS allows importing data from SPSS files basically without any loss of data or meta data. eatGADS stores these data and meta data in so-called GADSdat objects. These objects are lists of length two, containing both the data ("dat") as well as the meta data ("labels").

class(pisa)
#> [1] "GADSdat" "list"
names(pisa)
#> [1] "dat"    "labels"

As GADSdat objects have this specific structure, conventional data modification tools are not suitable for GADSdat objects. Instead, eatGADS provides designated data cleaning and data wrangling functions for most common data cleaning and data wrangling tasks. This is especially relevant for data cleaning steps that require simultaneous modification of both the data and the meta data (e.g., recoding of values and value labels).

In this vignette

For illustrative purposes, a small example data set from the campus files of the German PISA Plus assessment (called pisa) is used. The complete campus files and the original data set can be accessed here and here.

library(eatGADS)
gads <- pisa

Data structure

The raw data in a GADSdat object are represented as raw, unlabeled values. They can be accessed via GADSdat$dat. For the extraction of data suitable for data analyses, see the extractData2() function.

pisa$dat[1:5, 1:5]
#>   idstud idschool idclass schtype sameteach
#> 1      1      127     392       2         2
#> 2      2       65     201       3         1
#> 3      3       10      34       1         1
#> 4      4      103     319       3         2
#> 5      5       57     179       2         2

Meta data structure

Meta data are stored in a GADSdat object with the following structure:

#>    varName varLabel format display_width labeled value valLabel missings
#> 17  gender   Gender   F8.0            NA     yes     1   Female    valid
#> 18  gender   Gender   F8.0            NA     yes     2     Male    valid

On value level, additional meta data can be stored, namely:

Via the function extractMeta() the existing metadata of one, several or all variables in a GADSdat object can be inspected. This function is used frequently throughout the vignette to check whether changes to meta data have been performed successfully.

extractMeta(gads, vars = c("hisei", "schtype"))
#>    varName                             varLabel format display_width labeled value
#> 5  schtype                         School track   F8.0            NA     yes     1
#> 6  schtype                         School track   F8.0            NA     yes     2
#> 7  schtype                         School track   F8.0            NA     yes     3
#> 39   hisei Highest parental occupational status   F8.2            NA      no    NA
#>                                     valLabel missings
#> 5                 Gymnasium (academic track)    valid
#> 6                                 Realschule    valid
#> 7  schools with several courses of education    valid
#> 39                                      <NA>     <NA>

Modifying meta data

This section discusses changes on meta data level, such as changes to variable names or labels.

Changing variable names

Changes to meta data on variable level are straightforward. Variable names can be changed with the changeVarNames() function. The old variable names are overwritten. Multiple variable names can be adjusted at once.

# inspect original meta data
extractMeta(gads, vars = "hisei")
#>    varName                             varLabel format display_width labeled value valLabel
#> 39   hisei Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

# Change variable name
gads_labeled <- changeVarNames(GADSdat = gads, oldNames = "hisei", newNames = "hisei_new")

# inspect modified meta data
extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                             varLabel format display_width labeled value valLabel
#> 39 hisei_new Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

Changing variable labels

Variable labels can be adjusted analogously via the changeVarLabels() function. Again, multiple variable labels can be adjusted at once.

extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                             varLabel format display_width labeled value valLabel
#> 39 hisei_new Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

# Change variable label 
gads_labeled <- changeVarLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                varLabel = "Parental occupational status (highest)")

extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

Changing SPSS format

The same applies for the SPSS format of a variable using the changeSPSSformat() function.

extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

# Change SPSS format
gads_labeled <- changeSPSSformat(GADSdat = gads_labeled, varName = "hisei_new", 
                                 format = "F10.2")

extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

Changing value labels

Changes to meta data on value level follow the same principle. With the changeValLabels() function, value labels can be added or modified. Note that value labels and missing codes should be given to numeric values, even if a variable is a character variable.

# Adding value labels
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                value = c(-94, -99), valLabel = c("miss1", "miss2"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99    miss2
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94    miss1
#>    missings
#> 38    valid
#> 39    valid

# Changing value labels
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                value = c(-94, -99), 
                                valLabel = c("missing: Question omitted",
                                             "missing: Not administered"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 38 missing: Not administered    valid
#> 39 missing: Question omitted    valid

Removing value labels

Value labels can be deleted using the removeValLabels() function.

# Removing value labels
extractMeta(gads_labeled, "schtype")
#>   varName     varLabel format display_width labeled value                                  valLabel
#> 4 schtype School track   F8.0            NA     yes     1                Gymnasium (academic track)
#> 5 schtype School track   F8.0            NA     yes     2                                Realschule
#> 6 schtype School track   F8.0            NA     yes     3 schools with several courses of education
#>   missings
#> 4    valid
#> 5    valid
#> 6    valid
gads_labeled <- removeValLabels(GADSdat = gads_labeled, varName = "schtype", 
                                value = 1:3)
extractMeta(gads_labeled, "schtype")
#>   varName     varLabel format display_width labeled value valLabel missings
#> 4 schtype School track   F8.0            NA      no    NA     <NA>     <NA>

Changing missing tags

Missing tags (sometimes also referred to as missing codes) can be modified using the changeMissings() function. Valid entries for missings are "miss" and "valid".

# Defining missings
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 38 missing: Not administered    valid
#> 39 missing: Question omitted    valid
gads_labeled <- changeMissings(GADSdat = gads_labeled, varName = "hisei_new", 
                               value = c(-94, -99), missings = c("miss", "miss"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 36 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 37 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 36 missing: Not administered     miss
#> 37 missing: Question omitted     miss

Checking and adjusting missing tags and value labels

Usually an alignment of value labels and missing codes is desirable. For example, in the variable hisei_new the value -94 has received a missing tag and the value label "missing: Questions omitted". To make these alignments easier, the functions checkMissings() and checkMissingsByValues() exist. checkMissings() allows searching for regular expressions in the value labels and comparing missing tags and vice versa. Per default, missing codes are automatically adjusted (addMissingCode = TRUE) and value label mismatches just reported (addMissingLabel = FALSE). checkMissingsByValues() allows searching for labeled values in a specific value range (e.g., -50:-99).

# Creating a new value label for a missing value but leaving the missing code as valid
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "gender", 
                                value = -94, valLabel = "missing: Question omitted")
# Creating a new missing code but leaving the value label empty
gads_labeled <- changeMissings(GADSdat = gads_labeled, varName = "gender", 
                                value = -99, missings = "miss")

# Checking value label and missing code alignment
gads_labeled2 <- checkMissings(gads_labeled, missingLabel = "missing") 
#> The following variables have value labels including the term 'missing' which are not coded as missing:
#> gender
#> 'miss' is inserted into column missings for 1 rows.
#> The following variables have values coded as missings but value labels do not include the term 'missing':
#> gender

# Checking missing tags for a certain value range
gads_labeled <- checkMissingsByValues(gads_labeled, missingValues = -50:-99) 
#> The following variables have values in the 'missingValues' range which are not coded as missing:
#> gender
#> 'miss' is inserted into column missings for 1 rows.

Reusing meta data

Sometimes one variable already contains the meta data which should be added to another variable. reuseMeta() can copy meta data from one variable (other_varName) to another variable (varName), even across different data sets. The function allows us to transfer the complete meta data, only value labels or a specific selection of value labels (only valid values or missing codes). In the example below we transfer only the missing codes from variable hisei_new to variable age.

extractMeta(gads_labeled, "age")
#>    varName             varLabel format display_width labeled value valLabel missings
#> 18     age Age of student at T1   F8.2            NA      no    NA     <NA>     <NA>
gads_labeled <- reuseMeta(GADSdat = gads_labeled, varName = "age",
                          other_GADSdat = gads_labeled, other_varName = "hisei_new",
                          missingLabels = "only", addValueLabels = TRUE)
extractMeta(gads_labeled, "age")
#>    varName             varLabel format display_width labeled value                  valLabel
#> 18     age Age of student at T1   F8.2            NA     yes   -99 missing: Not administered
#> 19     age Age of student at T1   F8.2            NA     yes   -94 missing: Question omitted
#>    missings
#> 18     miss
#> 19     miss

Adding and removing variables

In GADSdat objects, meta data is stored alongside with the actual data set. Therefore, changes to the actual data often imply changes to the meta data. If a variable is removed from the data set, its meta data is no longer needed. If a new variable is created, new meta data needs to be created. If a variable is recoded, the meta data entries need to be recoded accordingly.

Selecting or removing variables

If a certain subset of variables in the GADSdat is needed, individual variables can either be extracted via extractVars() or removed via removeVars().

# Selecting variables
gads_motint <- extractVars(gads_labeled, 
                           vars = c("int_a", "int_b", "int_c", "int_d", "instmot_a"))
#> Removing the following rows from meta data: idstud, idschool, idclass, schtype, sameteach, g8g9, ganztag, classsize, repeated, gender, age, language, migration, hisced, hisei_new, homepos, books, pared, computer_age, internet_age, int_use_a, int_use_b, truancy_a, truancy_b, truancy_c, instmot_b, instmot_c, instmot_d, norms_a, norms_b, norms_c, norms_d, norms_e, norms_f, anxiety_a, anxiety_b, anxiety_c, anxiety_d, anxiety_e, selfcon_a, selfcon_b, selfcon_c, selfcon_d, selfcon_e, worketh_a, worketh_b, worketh_c, worketh_d, worketh_e, worketh_f, worketh_g, worketh_h, worketh_i, intent_a, intent_b, intent_c, intent_d, intent_e, behav_a, behav_b, behav_c, behav_d, behav_e, behav_f, behav_g, behav_h, teach_a, teach_b, teach_c, teach_d, teach_e, cognact_a, cognact_b, cognact_c, cognact_d, cognact_e, cognact_f, cognact_g, cognact_h, cognact_i, discpline_a, discpline_b, discpline_c, discpline_d, discpline_e, relation_a, relation_b, relation_c, relation_d, relation_e, belong_a, belong_b, belong_c, belong_d, belong_e, belong_f, belong_g, belong_h, belong_i, attitud_a, attitud_b, attitud_c, attitud_d, attitud_e, attitud_f, attitud_g, attitud_h, grade_de, grade_ma, grade_bio, grade_che, grade_phy, grade_sci, ma_pv1, ma_pv2, ma_pv3, ma_pv4, ma_pv5, rea_pv1, rea_pv2, rea_pv3, rea_pv4, rea_pv5, sci_pv1, sci_pv2, sci_pv3, sci_pv4, sci_pv5
#> No rows added to meta data.
namesGADS(gads_motint)
#> [1] "int_a"     "int_b"     "int_c"     "int_d"     "instmot_a"

gads_int <- removeVars(gads_motint, vars = "instmot_a") 
#> Removing the following rows from meta data: instmot_a
#> No rows added to meta data.
namesGADS(gads_int)
#> [1] "int_a" "int_b" "int_c" "int_d"

Cloning a variable

A variable can be cloned using the cloneVariable() function. Both the data and meta data are cloned. This function can be helpful if a modified copy of a variable should be created. For this purpose, the variable can be cloned and later modified (e.g., via recodeGADS).

# Clone the variable "sameteach"
gads_labeled <- cloneVariable(gads_labeled, varName = "sameteach", new_varName = "sameteach2")

Adding variables

Adding variables to a GADSdat object is unfortunately not straight forward and requires utilizing the underlying object structure. For adding variables, the dat object needs to be abstracted and so that new variables ca be added to it. Afterwards, the meta data needs to be added using the updateMeta() function.

# Extract the data
newDat <- gads_labeled$dat
# Adding a variable
newDat$classsize_kat <- ifelse(newDat$classsize > 15, 
                                         yes = "big", no = "small") 
# Updating meta data
gads_labeled2 <- updateMeta(gads_labeled, newDat = newDat)
#> No rows removed from meta data.
#> Adding meta data for the following variables: classsize_kat
extractMeta(gads_labeled2, "classsize_kat")
#>                     varName varLabel format display_width labeled value valLabel missings
#> classsize_kat classsize_kat     <NA>   <NA>            NA      no    NA     <NA>     <NA>

Recoding

eatGADS provides functionality for the manual and semi-automatic recoding of variables.

Removing all values from a variable

For instance for the purpose of ensuring the anonymity of person in a data set, it is sometimes desirable to empty sensitive variables. This can be performed using the emptyTheseVariables() function.

# Empty a variable completely
gads_labeled <- emptyTheseVariables(gads_labeled, vars = "idschool")
# Resulting frequency table
table(gads_labeled$dat$idschool, useNA = "ifany")
#> 
#> <NA> 
#>  500

Manual recoding

The function recodeGADS() allows the manual recoding of a variable.

# Original data and meta data
gads_labeled$dat$gender[1:10]
#>  [1] 1 1 2 2 1 1 2 2 1 1
extractMeta(gads_labeled, "gender")
#>    varName varLabel format display_width labeled value                  valLabel missings
#> 14  gender   Gender   F8.0            NA     yes   -99                      <NA>     miss
#> 15  gender   Gender   F8.0            NA     yes   -94 missing: Question omitted     miss
#> 16  gender   Gender   F8.0            NA     yes     1                    Female    valid
#> 17  gender   Gender   F8.0            NA     yes     2                      Male    valid
# Recoding 
gads_labeled <- recodeGADS(GADSdat = gads_labeled, varName = "gender", 
                           oldValues = c(1, 2), newValues = c(10, 20))
# New data and meta data
gads_labeled$dat$gender[1:10]
#>  [1] 10 10 20 20 10 10 20 20 10 10
extractMeta(gads_labeled, "gender")
#>    varName varLabel format display_width labeled value                  valLabel missings
#> 14  gender   Gender   F8.0            NA     yes   -99                      <NA>     miss
#> 15  gender   Gender   F8.0            NA     yes   -94 missing: Question omitted     miss
#> 16  gender   Gender   F8.0            NA     yes    10                    Female    valid
#> 17  gender   Gender   F8.0            NA     yes    20                      Male    valid

Moreover, recodeGADS() allows recoding values without value labels or even NA values.

# Recoding of NA values 
gads_labeled$dat$int_a[1:10]
#>  [1]  2  2  3  2  1  2 NA NA NA NA
gads_labeled <- recodeGADS(GADSdat = gads_labeled, varName = "int_a", 
                           oldValues = NA, newValues = -94)
gads_labeled$dat$int_a[1:10]
#>  [1]   2   2   3   2   1   2 -94 -94 -94 -94

Setting values to NA

For recoding specific values into NA values, the function recode2NA() exists. It allows recoding a specific value across multiple variables (while recodeGADS() allows recoding multiple values for a single variable). Existing value labels for the specified values are deleted. For each variable it is reported how many cases have been recoded.

# Recoding of values as Missing/NA
gads_labeled$dat$schtype[1:10]
#>  [1] 2 3 1 3 2 3 1 3 2 1
gads_labeled <- recode2NA(gads_labeled, recodeVars = c("hisei_new", "schtype"), 
                          value = "3")
#> Recodes in variable hisei_new: 0
#> Recodes in variable schtype: 111
gads_labeled$dat$schtype[1:10]
#>  [1]  2 NA  1 NA  2 NA  1 NA  2  1

Automatically recoding a character variable to a labeled numeric variable

A character variable can be automatically recoded into a labeled numeric variable via multiChar2fac().

# Example data set
test_df <- data.frame(id = 1:5, varChar = c("german", "English", 
                                            "english", "POLISH", "polish"),
                        stringsAsFactors = FALSE)
test_gads <- import_DF(test_df)

# Recoding a character variable to numeric
test_gads2 <- multiChar2fac(test_gads, vars = "varChar", var_suffix = "_new")
extractMeta(test_gads2, "varChar_new")
#>       varName  varLabel format display_width labeled value valLabel missings
#> 3 varChar_new (recoded)  F10.0            NA     yes     1  English    valid
#> 4 varChar_new (recoded)  F10.0            NA     yes     2   POLISH    valid
#> 5 varChar_new (recoded)  F10.0            NA     yes     3  english    valid
#> 6 varChar_new (recoded)  F10.0            NA     yes     4   german    valid
#> 7 varChar_new (recoded)  F10.0            NA     yes     5   polish    valid

Via the argument convertCase upper and lower case can be automatically adjusted.

# Recoding a character variable to numeric while simplying case
test_gads2 <- multiChar2fac(test_gads, vars = "varChar", var_suffix = "_new",
                            convertCase = "upperFirst")
extractMeta(test_gads2, "varChar_new")
#>       varName  varLabel format display_width labeled value valLabel missings
#> 3 varChar_new (recoded)  F10.0            NA     yes     1  English    valid
#> 4 varChar_new (recoded)  F10.0            NA     yes     2   German    valid
#> 5 varChar_new (recoded)  F10.0            NA     yes     3   Polish    valid

Automatically recoding a variable with a template

A variable can be automatically recoded into a labeled numeric variable via autoRecode(). This can be desirable, for instance, for the recoding of identifier variables.

id_df <- data.frame(id = c(1101, 1102, 1103, 1104, 1105), 
                    varChar = c("german", "English", "english", "POLISH", "polish"),
                        stringsAsFactors = FALSE)
id_gads <- import_DF(id_df)

# Recoding a character variable to numeric
id_gads2 <- autoRecode(id_gads, var = "id", var_suffix = "_new")
id_gads2$dat[, c("id", "id_new")]
#>     id id_new
#> 1 1101      1
#> 2 1102      2
#> 3 1103      3
#> 4 1104      4
#> 5 1105      5

Variable sorting

The sorting of variables in a GADSdat can be adjusted for individual variables and for the complete set of variables.

Relocating a specific variable

The function relocateVariable() allows the relocation of a single variable within a GADSdat.

namesGADS(gads_labeled)[1:5]
#> [1] "idstud"    "idschool"  "idclass"   "schtype"   "sameteach"

# Relocate a single variable within a the data set
gads_labeled <- relocateVariable(GADSdat = gads_labeled, var = "idschool",
                           after = "schtype")
namesGADS(gads_labeled)[1:5]
#> [1] "idstud"    "idclass"   "schtype"   "idschool"  "sameteach"

# Relocate a single variable to the beginning of the data set
gads_labeled <- relocateVariable(GADSdat = gads_labeled, var = "idschool",
                           after = NULL)
namesGADS(gads_labeled)[1:5]
#> [1] "idschool"  "idstud"    "idclass"   "schtype"   "sameteach"

Ordering all variables

The function orderLike() allows reordering all variables within a GADSdat.

Changing meta data (and data) via an Excel sheet

So far, the introduced functions work well for modifying the meta data of small data sets or for individual variables. However, frequently we are interested in modifying the meta data of a larger number of variables simultaneously. For this purpose eatGADS provides a workflow that works well with Excel spreadsheets. Thereby changes to meta data are divided into two levels: the variable and the value level.

Variable level

We start by extracting this change table via the getChangeMeta() function

# variable level
meta_var <- getChangeMeta(GADSdat = pisa, level = "variable")

While in principle one could modify the change table directly in R, it more convenient to do this in Excel. The change table can be written to .xlsx via the eatAnalysis::write_xlsx() function. To perform changes, entries are made into the corresponding “_new”-columns.

# write to excel
eatAnalysis::write_xlsx(meta_var, row.names = FALSE, "variable_changes.xlsx")

 

 

The excel file can be read back into R via readxl::read_xlsx().

# write to excel
meta_var_changed <- readxl::read_excel("variable_changes.xlsx", col_types = rep("text", 8))

The applyChangeMeta() function applies the meta data changes to the GADSdat object.

gads2 <- applyChangeMeta(meta_var_changed, GADSdat = pisa)
extractMeta(gads2, vars = c("idstud", "idschool", "schoolType"))
#>      varName                    varLabel format display_width labeled value
#> 2     idstud Student Identifier Variable   F8.0            NA      no    NA
#> 3   idschool                   School-ID  F10.0            NA      no    NA
#> 5 schoolType                School track   F8.0            NA     yes     1
#> 6 schoolType                School track   F8.0            NA     yes     2
#> 7 schoolType                School track   F8.0            NA     yes     3
#>                                    valLabel missings
#> 2                                      <NA>     <NA>
#> 3                                      <NA>     <NA>
#> 5                Gymnasium (academic track)    valid
#> 6                                Realschule    valid
#> 7 schools with several courses of education    valid

Value level

At value level, information on value, value labels or missings can be changed. The general workflow is identical

# value level
meta_val <- getChangeMeta(GADSdat = pisa, level = "value")
# write to excel
eatAnalysis::write_xlsx(meta_val, row.names = FALSE, "value_changes.xlsx")

 

 

# write to excel
meta_val_changed <- readxl::read_excel("value_changes.xlsx", 
                                       col_types = c("text", rep(c("numeric", "text", "text"), 2)))
gads3 <- applyChangeMeta(meta_val_changed, GADSdat = pisa)
extractMeta(gads3, vars = c("schtype", "sameteach"))
#>     varName                               varLabel format display_width labeled value
#> 4   schtype                           School track   F8.0            NA     yes     1
#> 5   schtype                           School track   F8.0            NA     yes     2
#> 6   schtype                           School track   F8.0            NA     yes     3
#> 7 sameteach Same math teacher in both school years   F8.0            NA     yes    10
#> 8 sameteach Same math teacher in both school years   F8.0            NA     yes    20
#>                                    valLabel missings
#> 4                            Acamedic Track    valid
#> 5                                Realschule    valid
#> 6 schools with several courses of education    valid
#> 7                                        No    valid
#> 8                                       Yes    valid