In this short vignette, we’ll demonstrate how the svrep package can be used to implement weighting adjustments for nonresponse in survey samples. For illustration, we’ll use an example survey measuring Covid-19 vaccination status and a handful of demographic variables, based on a simple random sample of 1,000 residents of Louisville, Kentucky.

library(dplyr) # For data manipulation
library(survey) # For complex survey analysis
library(srvyr) # For complex survey analysis with dplyr syntax
library(svrep)

# Load and inspect the data
data("lou_vax_survey", package = 'svrep')
head(lou_vax_survey)
#> # A tibble: 6 × 6
#>   RESPONSE_STATUS RACE_ETHNICITY                SEX   EDUC_ATTAINMENT VAX_STATUS
#>   <chr>           <chr>                         <chr> <chr>           <chr>     
#> 1 Nonrespondent   White alone, not Hispanic or… Fema… Less than high… <NA>      
#> 2 Nonrespondent   Black or African American al… Fema… High school or… <NA>      
#> 3 Respondent      White alone, not Hispanic or… Fema… Less than high… Vaccinated
#> 4 Nonrespondent   White alone, not Hispanic or… Fema… Less than high… <NA>      
#> 5 Nonrespondent   White alone, not Hispanic or… Fema… High school or… <NA>      
#> 6 Respondent      White alone, not Hispanic or… Fema… High school or… Vaccinated
#> # ℹ 1 more variable: SAMPLING_WEIGHT <dbl>
colnames(lou_vax_survey)
#> [1] "RESPONSE_STATUS" "RACE_ETHNICITY"  "SEX"             "EDUC_ATTAINMENT"
#> [5] "VAX_STATUS"      "SAMPLING_WEIGHT"

This vaccination survey has an overall response rate of 50.2%, which means that estimated vaccination rates may be significantly biased by nonresponse. We’ll use nonresponse weighting adjustments to try and reduce potential nonresponse bias.

lou_vax_survey |> count(RESPONSE_STATUS) |> mutate(pct = n/sum(n))
#> # A tibble: 2 × 3
#>   RESPONSE_STATUS     n   pct
#>   <chr>           <int> <dbl>
#> 1 Nonrespondent     498 0.498
#> 2 Respondent        502 0.502

Creating initial replicate weights

To begin with, we’ll create bootstrap replicate weights. In most cases, we can do this simply by describing the survey design using the svydesign() function and then using a function to create appropriate replicate weights. The function as.svrepdesign() from the ‘survey’ package can be used to create several types of replicate weights, using the argument type (with options 'JK1', 'JKn', 'bootstrap', 'BRR', 'Fay', etc.) In addition, the function as_bootstrap_design() can be used to create bootstrap weights using additional methods not supported in the ‘survey’ package.

# Describe the survey design
lou_vax_survey <- svydesign(ids = ~ 1, weights = ~ SAMPLING_WEIGHT,
                            data = lou_vax_survey)

print(lou_vax_survey)
#> Independent Sampling design (with replacement)
#> svydesign(ids = ~1, weights = ~SAMPLING_WEIGHT, data = lou_vax_survey)

# Create appropriate replicate weights
lou_vax_survey <- lou_vax_survey |>
  as_bootstrap_design(replicates = 100, mse = TRUE,
                      type = "Rao-Wu-Yue-Beaumont")

print(lou_vax_survey)
#> Call: as_bootstrap_design(lou_vax_survey, replicates = 100, mse = TRUE, 
#>     type = "Rao-Wu-Yue-Beaumont")
#> Survey bootstrap with 100 replicates and MSE variances.

For convenience, we’ll convert this survey design object into an object with class tbl_svy, which allows us to use convenient tidyverse/dplyr syntax (group_by(), summarize(), etc.) as well as other helpful functions from the srvyr package.

lou_vax_survey <- lou_vax_survey |> as_survey()

print(lou_vax_survey)
#> Call: Called via srvyr
#> Survey bootstrap with 100 replicates and MSE variances.
#> Data variables: 
#>   - RESPONSE_STATUS (chr), RACE_ETHNICITY (chr), SEX (chr), EDUC_ATTAINMENT
#>     (chr), VAX_STATUS (chr), SAMPLING_WEIGHT (dbl)

Redistributing weight from nonrespondents to respondents

A common form of nonresponse adjustment is to simply ‘redistribute’ weight from the nonrespondents to the respondents. In other words, the weight for each nonrespondent is set to \(0\), and the weight for each respondent is increased by a factor greater than one so that the sum of adjusted weights in the sample of respondents equals the sum of unadjusted weights from the full sample. For example, if the sum of weights among respondents is \(299,544.4\) and the sum of weights among nonrespondents is \(297,157.6\), then a basic nonresponse adjustment would set the weights among nonrespondents to \(0\) and multiply the weight for each respondent by an adjustment factor equal to \(1 + (297,157.6/299,544.4)\). This type of adjustment is succinctly described in mathematical notation below.

\[\begin{aligned} w_i &= \textit{Original sampling weight for case }i \\ &= 1/\pi_i, \textit{ where } \pi_i \textit{ is the probability case i }\textit{had of being sampled}\\ f_{NR,i} &= \textit{Nonresponse adjustment factor for case i} \\ w_{NR, i} &= w_i \times f_{NR,i} = \textit{Weight for case }i \textit{ after nonresponse adjustment} \\ \\ \sum_{i \in s_{resp}} w_i &= \textit{Sum of sampling weights among respondents} \\ \sum_{i \in s_{nonresp}} w_i &= \textit{Sum of sampling weights among nonrespondents} \\ \\ f_{NR_i} &= \begin{cases} 0 & \textit{if case }i\textit{ is a nonrespondent} \\ 1 + \frac{\sum_{i \in s_{nonresp}} w_i}{\sum_{i \in s_{resp}} w_i} & \textit{if case }i\textit{ is a respondent} \end{cases} \end{aligned}\]

We’ll illustrate this type of adjustment with the Louisville vaccination survey. First, we’ll inspect the sum of the sampling weights for respondents, nonrespondents, and the overall sample.

# Weights before adjustment
lou_vax_survey |>
  group_by(RESPONSE_STATUS) |>
  cascade(
    `Sum of Weights` = sum(cur_svy_wts()),
    .fill = "TOTAL"
  )
#> # A tibble: 3 × 2
#>   RESPONSE_STATUS `Sum of Weights`
#>   <chr>                      <dbl>
#> 1 Nonrespondent            297158.
#> 2 Respondent               299544.
#> 3 TOTAL                    596702

Next, we’ll redistribute weight from nonrespondents to respondents using the redistribute_weights() function, which adjusts the full-sample weights as well as each set of replicate weights. To specify which subset of data should have its weights reduced, we supply a logical expression to the argument reduce_if. To specify which subset of data should have its weights increased, we supply a logical expression to the argument increase_if.

# Conduct a basic nonresponse adjustment
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent"
  )

After making the adjustment, we can check that all of the weight from nonrespondents has been redistributed to respondents.

# Check the sum of full-sample weights by response status
nr_adjusted_survey |>
  group_by(RESPONSE_STATUS) |>
  cascade(
    `Sum of Weights` = sum(cur_svy_wts()),
    .fill = "TOTAL"
  )
#> # A tibble: 3 × 2
#>   RESPONSE_STATUS `Sum of Weights`
#>   <chr>                      <dbl>
#> 1 Nonrespondent                  0
#> 2 Respondent                596702
#> 3 TOTAL                     596702

# Check sums of replicate weights by response status
nr_adjusted_survey |>
  summarize_rep_weights(
    type = "specific",
    by = "RESPONSE_STATUS"
  ) |> 
  arrange(Rep_Column, RESPONSE_STATUS) |>
  head(10)
#>    RESPONSE_STATUS Rep_Column   N N_NONZERO    SUM     MEAN        CV MIN
#> 1    Nonrespondent          1 498         0      0    0.000       NaN   0
#> 2       Respondent          1 502       318 596702 1188.649 0.9980021   0
#> 3    Nonrespondent          2 498         0      0    0.000       NaN   0
#> 4       Respondent          2 502       323 596702 1188.649 0.9832055   0
#> 5    Nonrespondent          3 498         0      0    0.000       NaN   0
#> 6       Respondent          3 502       312 596702 1188.649 0.9999221   0
#> 7    Nonrespondent          4 498         0      0    0.000       NaN   0
#> 8       Respondent          4 502       328 596702 1188.649 0.9418974   0
#> 9    Nonrespondent          5 498         0      0    0.000       NaN   0
#> 10      Respondent          5 502       306 596702 1188.649 1.0470362   0
#>         MAX
#> 1     0.000
#> 2  5907.941
#> 3     0.000
#> 4  5850.020
#> 5     0.000
#> 6  4707.708
#> 7     0.000
#> 8  5781.996
#> 9     0.000
#> 10 7247.393

Conducting weighting class adjustments

Nonresponse bias is liable to occur if different subpopulations systematically differ in terms of their response rates to the survey and also differ in terms of what the survey is trying to measure (in this case, vaccination status). In our example, we can see some fairly large differences in response rates across different race/ethnicity groups.

lou_vax_survey |>
  group_by(RACE_ETHNICITY) |>
  summarize(Response_Rate = mean(RESPONSE_STATUS == "Respondent"),
            Sample_Size = n(),
            n_Respondents = sum(RESPONSE_STATUS == "Respondent"))
#> # A tibble: 4 × 4
#>   RACE_ETHNICITY                         Response_Rate Sample_Size n_Respondents
#>   <chr>                                          <dbl>       <int>         <int>
#> 1 Black or African American alone, not …         0.452         188            85
#> 2 Hispanic or Latino                             0.378          45            17
#> 3 Other Race, not Hispanic or Latino             0.492          59            29
#> 4 White alone, not Hispanic or Latino            0.524         708           371

Weighting adjustments may be able to help reduce nonresponse bias caused by these differences in response rates. One standard form of adjustment known as weighting class adjustment is to redistribute weights from nonrespondents to respondents separately by different categories of auxiliary variables (such as race/ethnicity). The survey textbook Heeringa, West, and Berglund (2017) provides an excellent overview of weighting class adjustments. To implement a weighting class adjustment with the svrep package, we can simply use the by argument of redistribute_weights().

nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent",
    by = c("RACE_ETHNICITY")
  )

Multiple grouping variables may be supplied to the by argument. For example, one can specify by = c("STRATUM", "RACE_ETHNICITY") to redistribute weights separately by combinations of stratum and race/ethnicity category.

Propensity cell adjustment

The popular method of forming weighting classes based on estimated response propensities (known as propensity cell adjustment) can also be used, for example by adding a variable PROPENSITY_CELL to the data and using redistribute_weights(..., by = "PROPENSITY_CELL").

# Fit a response propensity model
response_propensity_model <- lou_vax_survey |>
  mutate(IS_RESPONDENT = ifelse(RESPONSE_STATUS == "Respondent", 1, 0)) |>
  svyglm(formula = IS_RESPONDENT ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
         family = quasibinomial(link = 'logit'))

# Predict response propensities for individual cases
lou_vax_survey <- lou_vax_survey |>
  mutate(
    RESPONSE_PROPENSITY = predict(response_propensity_model,
                                  newdata = cur_svy(),
                                  type = "response")
  )

# Divide sample into propensity classes
lou_vax_survey <- lou_vax_survey |>
  mutate(PROPENSITY_CELL = ntile(x = RESPONSE_PROPENSITY, n = 5))

lou_vax_survey |>
    group_by(PROPENSITY_CELL) |>
    summarize(n = n(),
              min = min(RESPONSE_PROPENSITY),
              mean = mean(RESPONSE_PROPENSITY),
              max = max(RESPONSE_PROPENSITY))
#> # A tibble: 5 × 5
#>   PROPENSITY_CELL     n   min  mean   max
#>             <int> <int> <dbl> <dbl> <dbl>
#> 1               1   200 0.357 0.424 0.459
#> 2               2   200 0.459 0.484 0.488
#> 3               3   200 0.488 0.488 0.512
#> 4               4   200 0.512 0.551 0.564
#> 5               5   200 0.564 0.564 0.564

# Redistribute weights by propensity class
nr_adjusted_survey <- lou_vax_survey |>
  redistribute_weights(
    reduce_if = RESPONSE_STATUS == "Nonrespondent",
    increase_if = RESPONSE_STATUS == "Respondent",
    by = "PROPENSITY_CELL"
  )

# Inspect weights before adjustment

lou_vax_survey |>
  summarize_rep_weights(type = "specific",
                        by = c("PROPENSITY_CELL")) |>
  arrange(Rep_Column, PROPENSITY_CELL) |>
  select(PROPENSITY_CELL, Rep_Column,
         N_NONZERO, SUM) |>
  head(10)
#>    PROPENSITY_CELL Rep_Column N_NONZERO      SUM
#> 1                1          1       117 112889.6
#> 2                2          1       134 117070.7
#> 3                3          1       117 111097.7
#> 4                4          1       136 129016.6
#> 5                5          1       130 126627.5
#> 6                1          2       122 117070.7
#> 7                2          2       125 109903.1
#> 8                3          2       134 124238.3
#> 9                4          2       131 127822.1
#> 10               5          2       129 117668.0

# Inspect weights after adjustment
nr_adjusted_survey |>
  summarize_rep_weights(type = "specific",
                        by = c("PROPENSITY_CELL", "RESPONSE_STATUS")) |>
  arrange(Rep_Column, PROPENSITY_CELL, RESPONSE_STATUS) |>
  select(PROPENSITY_CELL, RESPONSE_STATUS, Rep_Column,
         N_NONZERO, SUM) |>
  head(10)
#>    PROPENSITY_CELL RESPONSE_STATUS Rep_Column N_NONZERO      SUM
#> 1                1   Nonrespondent          1         0      0.0
#> 2                1      Respondent          1        50 112889.6
#> 3                2   Nonrespondent          1         0      0.0
#> 4                2      Respondent          1        63 117070.7
#> 5                3   Nonrespondent          1         0      0.0
#> 6                3      Respondent          1        61 111097.7
#> 7                4   Nonrespondent          1         0      0.0
#> 8                4      Respondent          1        68 129016.6
#> 9                5   Nonrespondent          1         0      0.0
#> 10               5      Respondent          1        76 126627.5

Saving the final weights to a data file

Once we’re satisfied with the weights, we can create a data frame with the analysis variables and columns of replicate weights. This format is easy to export to data files that can be loaded into R or other software later.

data_frame_with_nr_adjusted_weights <- nr_adjusted_survey |>
  as_data_frame_with_weights(
    full_wgt_name = "NR_ADJ_WGT",
    rep_wgt_prefix = "NR_ADJ_REP_WGT_"
  )

# Preview first few column names
colnames(data_frame_with_nr_adjusted_weights) |> head(12)
#>  [1] "RESPONSE_STATUS"     "RACE_ETHNICITY"      "SEX"                
#>  [4] "EDUC_ATTAINMENT"     "VAX_STATUS"          "SAMPLING_WEIGHT"    
#>  [7] "RESPONSE_PROPENSITY" "PROPENSITY_CELL"     "NR_ADJ_WGT"         
#> [10] "NR_ADJ_REP_WGT_1"    "NR_ADJ_REP_WGT_2"    "NR_ADJ_REP_WGT_3"

# Write the data to a CSV file
write.csv(
  x = data_frame_with_nr_adjusted_weights,
  file = "survey-data-with-nonresponse-adjusted-weights.csv"
)

Statistical background

The motivation for making this adjustment is that standard methods of statistical inference assume that every person in the population has a known, nonzero probability of participating in the survey (i.e. has a nonzero chance of being sampled and a nonzero chance of responding if they are sampled), denoted \(p_{i,overall}\). Basic results in survey sampling theory guarantee that if this assumption is true, we can produce unbiased estimates of population means and totals by weighting data from each respondent with the weight \(1/{p_{i,overall}}\). Crucially, the overall probability of participation \(p_{i,overall}\) is the product of two components: the probability that a person is sampled (denoted \(\pi_i\)), and the probability that a person would respond to the survey if they are sampled (denoted \(p_i\) and referred to as the “response propensity”). The sampling probability \(\pi_i\) is known since we can control the method of sampling, but the response propensity \(p_i\) is unknown and can only be estimated.

\[ \begin{aligned} w^{*}_i &= 1/p_{i,overall} \text{ (weights needed for unbiased estimation)} \\ p_{i,overall} &= \pi_i \times p_i \\ \pi_i &= \textbf{Sampling probability} \\ &\textit{i.e. the probability that case }i\textit{ is randomly sampled } \text{ (}\textit{Known}\text{)} \\ p_i &= \textbf{Response propensity} \\ &\textit{i.e. the probability that case }i\textit{ responds, if sampled } \text{ (}\textit{Unknown}\text{)} \\ \end{aligned} \]

The component \(p_i\) must be estimated using data (with estimate \(\hat{p}_i\)) and then nonresponse-adjusted weights for respondents can be formed as \(w_{NR,i} = 1/(\pi_i \times \hat{p}_i)\) and used to obtain approximately unbiased estimates of population means and totals. To use our earlier notation, the nonresponse adjustment factor for respondents \(f_{NR,i}\) is actually defined using \(1/\hat{p}_i\).

\[ \begin{aligned} w_i &= \textit{Original sampling weight for case }i \\ &= 1/\pi_i, \textit{ where } \pi_i \textit{ is the probability case i }\textit{had of being sampled}\\ w_{NR, i} &= w_i \times f_{NR,i} = \textit{Weight for case }i \textit{ after nonresponse adjustment} \\ \\ f_{NR,i} &= \begin{cases} 0 & \text{if case } i \text{ is a nonrespondent} \\ 1 / \hat{p}_i & \text{if case } i \text{ is a respondent} \\ \end{cases} \\ \hat{p}_i &= \textbf{Estimated response propensity} \end{aligned} \]

In essence, different methods of nonresponse weighting adjustments vary in terms of how they estimate \(\hat{p}_i\). The basic weight redistribution method in effect estimates \(p_i\) as constant across all \(i\), equal to the overall weighted response rate, and uses that to form the weights. In other words, the basic weight redistribution essentially is a way of forming an adjustment factor \(f_{NR,i}\) based on the estimated response propensity \(\hat{p}_i = \frac{\sum_{i \in s_{resp}}w_i}{\sum_{i \in s}w_i}\).

Weighting class adjustments and propensity cell adjustments are essentially more refined ways of forming \(f_{NR,i}\) by estimating \(p_i\) with a more realistic model, where \(p_i\) is not constant across the entire sample but instead varies among weighting classes or propensity cells.

The reason for conducting weighting adjustments not only in the full-sample weights but in the replicate weights is to account for the nonresponse adjustment process when estimating sampling variances and inferential statistics such as confidence intervals. Because of random sampling, the adjustment factors used in the nonresponse adjustment will vary from one sample to the next, and applying weighting adjustments separately for each replicate reflects this variability. As we’ve seen in this vignette, the redistribute_weights() function handles this for us: after nonresponse adjustment, all of the weight in each replicate is redistributed in the same manner that weight was redistributed for the full-sample weights.

Nonresponse Adjustments