Special Missing Values

Nicholas Tierney

2024-03-05

Data sometimes have special missing values to indicate specific reasons for missingness. For example, “9999” is sometimes used in weather data, for instance in the Global Historical Climatology Network (GHCN) data, to indicate specific types of missingness, such as instrument failure.

You might be interested in creating your own special missing values so that you can mark specific, known reasons for missingness. Examples include an individual dropping out of a study, known instrument failure in weather instruments, or values being censored in analysis. In these cases, the data are missing, but we have information about why they are missing. Coding these cases as NA would cause us to lose this valuable information. Other statistical software such as Stata, SAS, and SPSS has this capacity, but currently R does not. So, we need a way to create these special missing values.

We can use recode_shadow() to label a special missing value as something like NA_reason. naniar records these values in the shadow part of nabular data, which is a special data frame that contains missingness information.

This vignette describes how to add special missing values using the recode_shadow() function. First, we cover some terminology to explain these ideas, in case you are not familiar with the workflows in naniar.

Terminology

Missing data can be represented as a binary matrix of “missing” or “not missing”, which in naniar we call a “shadow matrix”, a term borrowed from Swayne and Buja, 1998.

library(naniar)
as_shadow(oceanbuoys)
#> # A tibble: 736 × 8
#>    year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
#>    <fct>   <fct>       <fct>        <fct>         <fct>         <fct>      
#>  1 !NA     !NA         !NA          !NA           !NA           !NA        
#>  2 !NA     !NA         !NA          !NA           !NA           !NA        
#>  3 !NA     !NA         !NA          !NA           !NA           !NA        
#>  4 !NA     !NA         !NA          !NA           !NA           !NA        
#>  5 !NA     !NA         !NA          !NA           !NA           !NA        
#>  6 !NA     !NA         !NA          !NA           !NA           !NA        
#>  7 !NA     !NA         !NA          !NA           !NA           !NA        
#>  8 !NA     !NA         !NA          !NA           !NA           !NA        
#>  9 !NA     !NA         !NA          !NA           !NA           !NA        
#> 10 !NA     !NA         !NA          !NA           !NA           !NA        
#> # ℹ 726 more rows
#> # ℹ 2 more variables: wind_ew_NA <fct>, wind_ns_NA <fct>

The shadow matrix has three key features to facilitate analysis:

  1. Coordinated names: Variables in the shadow matrix gain the same name as in the data, with the suffix “_NA”.

  2. Special missing values: Values in the shadow matrix can be “special” missing values, indicated as NA_suffix, where “suffix” is a very short description of the type of missingness.

  3. Cohesiveness: Binding the shadow matrix column-wise to the original data creates a cohesive “nabular” data form, useful for visualization and summaries.
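
The first two features can be checked directly on the shadow matrix created above. The following is a minimal sketch that uses only as_shadow() and the oceanbuoys data already shown; the object name shadow is ours, chosen for illustration.

```r
library(naniar)

# Shadow matrix for the oceanbuoys data used above
shadow <- as_shadow(oceanbuoys)

# Coordinated names: each shadow variable is the data name plus "_NA"
identical(names(shadow), paste0(names(oceanbuoys), "_NA"))

# Before any recoding, a shadow variable has two levels: "!NA" and "NA".
# Special missing values would appear here as extra "NA_suffix" levels.
levels(shadow$humidity_NA)
```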

We create nabular data by binding the shadow to the data:

bind_shadow(oceanbuoys)
#> # A tibble: 736 × 16
#>     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
#>    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
#>  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
#>  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
#>  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
#>  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
#>  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
#>  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
#>  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
#>  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
#>  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
#> 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
#> # ℹ 726 more rows
#> # ℹ 8 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
#> #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
#> #   wind_ew_NA <fct>, wind_ns_NA <fct>

This keeps the data values tied to their missingness, and has great benefits for exploring missing and imputed values in data. See the vignettes Getting Started with naniar and Exploring Imputations with naniar for more details.
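
As an illustration of why keeping values tied to their missingness is useful, the sketch below summarises one variable according to whether another is missing. It assumes the dplyr package is installed, which is not otherwise used in this section, and the names nabular and temp_by_humidity_missing are hypothetical.

```r
library(naniar)
library(dplyr)

# Nabular data: the original columns followed by their "_NA" shadow columns
nabular <- bind_shadow(oceanbuoys)

# Compare sea temperature between rows where humidity is
# missing ("NA") and rows where it is present ("!NA")
temp_by_humidity_missing <- nabular %>%
  group_by(humidity_NA) %>%
  summarise(
    mean_sea_temp = mean(sea_temp_c, na.rm = TRUE),
    n = n()
  )

temp_by_humidity_missing
```

Because the shadow columns are ordinary columns of the nabular data, they can be used for grouping, filtering, or plotting like any other variable.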

Recoding missing values

To demonstrate recoding of missing values, we use a toy dataset, df:

df <- tibble::tribble(
~wind, ~temp,
-99,    45,
68,    NA,
72,    25
)

df
#> # A tibble: 3 × 2
#>    wind  temp
#>   <dbl> <dbl>
#> 1   -99    45
#> 2    68    NA
#> 3    72    25

To recode the value -99 as a missing value “broken_machine”, we first create nabular data with bind_shadow:


dfs <- bind_shadow(df)

dfs
#> # A tibble: 3 × 4
#>    wind  temp wind_NA temp_NA
#>   <dbl> <dbl> <fct>   <fct>  
#> 1   -99    45 !NA     !NA    
#> 2    68    NA !NA     NA     
#> 3    72    25 !NA     !NA

Special types of missingness are encoded in the shadow part of nabular data. Using the recode_shadow function, we can recode the missing values like so:

dfs_recode <- dfs %>% 
  recode_shadow(wind = .where(wind == -99 ~ "broken_machine"))

This reads as “recode shadow for wind where wind is equal to -99, and give it the label ‘broken_machine’”. The .where function is used to help make our intent clearer, and reads very much like the dplyr::case_when() function, but it also takes care of encoding extra factor levels into the missing data.

The extra types of missingness are recoded in the shadow part of the nabular data as additional factor levels:

levels(dfs_recode$wind_NA)
#> [1] "!NA"               "NA"                "NA_broken_machine"
levels(dfs_recode$temp_NA)
#> [1] "!NA"               "NA"                "NA_broken_machine"

All additional types of missingness are recorded across all shadow variables, even if those variables don’t contain that special missing value. This ensures all flavours of missingness are known.
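
To see where the special missing value actually occurs, as opposed to where it is merely a known level, one option is to tabulate each shadow variable. This is a small sketch that rebuilds the toy data from above so it runs on its own; the step-by-step assignments replace the pipe purely for clarity.

```r
library(naniar)

# Rebuild the toy data and its recoded nabular form
df <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,     NA,
  72,     25
)
dfs <- bind_shadow(df)
dfs_recode <- recode_shadow(dfs, wind = .where(wind == -99 ~ "broken_machine"))

# Counts per level: wind_NA has one NA_broken_machine entry,
# while temp_NA knows the level but has no entries for it
table(dfs_recode$wind_NA)
table(dfs_recode$temp_NA)
```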

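To summarise, to use recode_shadow, the user provides the following information:

  1. The variable to recode, given as the argument name, as in recode_shadow(wind = ...).

  2. The condition that identifies the special values, given on the left-hand side of the formula inside .where(), as in wind == -99.

  3. The suffix that describes the reason for missingness, given on the right-hand side of the formula, as in "broken_machine", which is stored as NA_broken_machine.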

Under the hood, this special missing value is recoded as a new factor level in the shadow matrix, so that every column is aware of all possible new values of missingness.
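
Note that in the example above only the shadow column changes; the wind column itself still holds -99, which would distort numeric summaries. One possible follow-up step, offered here as our own sketch rather than as the workflow recommended by naniar, is to replace the recorded value with NA once its reason is stored in the shadow. The object name dfs_clean is hypothetical, and plain base R assignment is used instead of any particular naniar helper.

```r
library(naniar)

# Rebuild the toy data and record the reason for the -99 value
df <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,     NA,
  72,     25
)
dfs <- bind_shadow(df)
dfs_recode <- recode_shadow(dfs, wind = .where(wind == -99 ~ "broken_machine"))

# Assumption: once the reason is stored in wind_NA, the sentinel value
# itself can be set to NA so it no longer affects calculations
dfs_clean <- dfs_recode
dfs_clean$wind[dfs_clean$wind %in% -99] <- NA_real_

# The reason for missingness is still available in the shadow column
dfs_clean$wind_NA

# Summaries now ignore the broken-machine reading
mean(dfs_clean$wind, na.rm = TRUE)
```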

Some examples of using recode_shadow in a workflow will be discussed in more detail in the near future. For the moment, here is a recommended workflow: