Introduction

expss computes and displays tables with support for ‘SPSS’-style labels, multiple / nested banners, weights, multiple-response variables and significance testing. There are facilities for nice output of tables in ‘knitr’, R notebooks, ‘Shiny’ and ‘Jupyter’ notebooks. Proper methods for labelled variables add value labels support to base R functions and to some functions from other packages. Additionally, the package offers useful functions for data processing in marketing research / social surveys - popular data transformation functions from ‘SPSS’ Statistics and ‘Excel’ (‘RECODE’, ‘COUNT’, ‘COUNTIF’, ‘VLOOKUP’, etc.). Package is intended to help people to move data processing from ‘Excel’/‘SPSS’ to R. See examples below. You can get help about any function by typing ?function_name in the R console.

Installation

expss is on CRAN, so for installation you can print in the console install.packages("expss").

Cross-tablulation examples

We will use for demonstartion well-known mtcars dataset. Let’s start with adding labels to the dataset. Then we can continue with tables creation.

library(expss)
data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual"=1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)

For quick cross-tabulation there are fre and cross family of function. For simplicity we demonstrate here only cross_cpct which calculates column percent. Documentation for other functions, such as cross_cases for counts, cross_rpct for row percent, cross_tpct for table percent and cross_fun for custom summary functions can be seen by typing ?cross_cpct and ?cross_fun in the console.

# 'cross_*' examples
# just simple crosstabulation, similar to base R 'table' function
cross_cases(mtcars, am, vs)
 Engine 
 V-engine   Straight engine 
 Transmission 
   Automatic  12 7
   Manual  6 7
   #Total cases  18 14
# Table column % with multiple banners
cross_cpct(mtcars, cyl, list(total(), am, vs))
 #Total     Transmission     Engine 
   Automatic   Manual     V-engine   Straight engine 
 Number of cylinders 
   4  34.4   15.8 61.5   5.6 71.4
   6  21.9   21.1 23.1   16.7 28.6
   8  43.8   63.2 15.4   77.8
   #Total cases  32   19 13   18 14
# magrittr pipe usage and nested banners
mtcars %>% 
    cross_cpct(cyl, list(total(), am %nest% vs))      
 #Total     Transmission 
   Automatic     Manual 
   Engine     Engine 
   V-engine   Straight engine     V-engine   Straight engine 
 Number of cylinders 
   4  34.4   42.9   16.7 100
   6  21.9   57.1   50.0
   8  43.8   100   33.3
   #Total cases  32   12 7   6 7

We have more sophisticated interface for table construction with magrittr piping. Table construction consists of at least of three functions chained with pipe operator: %>%. At first we need to specify variables for which statistics will be computed with tab_cells. Secondary, we calculate statistics with one of the tab_stat_* functions. And last, we finalize table creation with tab_pivot, e. g.: dataset %>% tab_cells(variable) %>% tab_stat_cases() %>% tab_pivot(). After that we can optionally sort table with tab_sort_asc, drop empty rows/columns with drop_rc and transpose with tab_transpose. Resulting table is just a data.frame so we can use usual R operations on it. Detailed documentation for table creation can be seen via ?tables. For significance testing see ?significance. Generally, tables automatically translated to HTML for output in knitr or Jupyter notebooks. However, if we want HTML output in the R notebooks or in the RStudio viewer we need to set options for that: expss_output_rnotebook() or expss_output_viewer().

# simple example
mtcars %>% 
    tab_cells(cyl) %>% 
    tab_cols(total(), am) %>% 
    tab_stat_cpct() %>% 
    tab_pivot()
 #Total     Transmission 
   Automatic   Manual 
 Number of cylinders 
   4  34.4   15.8 61.5
   6  21.9   21.1 23.1
   8  43.8   63.2 15.4
   #Total cases  32   19 13
# table with caption
mtcars %>% 
    tab_cells(mpg, disp, hp, wt, qsec) %>%
    tab_cols(total(), am) %>% 
    tab_stat_mean_sd_n() %>%
    tab_last_sig_means(subtable_marks = "both") %>% 
    tab_pivot() %>% 
    set_caption("Table with summary statistics and significance marks.")
Table with summary statistics and significance marks.
 #Total     Transmission 
   Automatic     Manual 
   A     B 
 Miles/(US) gallon 
   Mean  20.1    17.1 < B   24.4 > A
   Std. dev.  6.0    3.8     6.2  
   Unw. valid N  32.0    19.0     13.0  
 Displacement (cu.in.) 
   Mean  230.7    290.4 > B   143.5 < A
   Std. dev.  123.9    110.2     87.2  
   Unw. valid N  32.0    19.0     13.0  
 Gross horsepower 
   Mean  146.7    160.3     126.8  
   Std. dev.  68.6    53.9     84.1  
   Unw. valid N  32.0    19.0     13.0  
 Weight (1000 lbs) 
   Mean  3.2    3.8 > B   2.4 < A
   Std. dev.  1.0    0.8     0.6  
   Unw. valid N  32.0    19.0     13.0  
 1/4 mile time 
   Mean  17.8    18.2     17.4  
   Std. dev.  1.8    1.8     1.8  
   Unw. valid N  32.0    19.0     13.0  
# Table with the same summary statistics. Statistics labels in columns.
mtcars %>% 
    tab_cells(mpg, disp, hp, wt, qsec) %>%
    tab_cols(total(label = "#Total| |"), am) %>% 
    tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n, method = list) %>%
    tab_pivot()
 #Total     Transmission 
       Automatic     Manual 
 Mean   Std. dev.   Valid N     Mean   Std. dev.   Valid N     Mean   Std. dev.   Valid N 
 Miles/(US) gallon  20.1 6.0 32   17.1 3.8 19   24.4 6.2 13
 Displacement (cu.in.)  230.7 123.9 32   290.4 110.2 19   143.5 87.2 13
 Gross horsepower  146.7 68.6 32   160.3 53.9 19   126.8 84.1 13
 Weight (1000 lbs)  3.2 1.0 32   3.8 0.8 19   2.4 0.6 13
 1/4 mile time  17.8 1.8 32   18.2 1.8 19   17.4 1.8 13
# Different statistics for different variables.
mtcars %>%
    tab_cols(total(), vs) %>%
    tab_cells(mpg) %>% 
    tab_stat_mean() %>% 
    tab_stat_valid_n() %>% 
    tab_cells(am) %>%
    tab_stat_cpct(total_row_position = "none", label = "col %") %>%
    tab_stat_rpct(total_row_position = "none", label = "row %") %>%
    tab_stat_tpct(total_row_position = "none", label = "table %") %>%
    tab_pivot(stat_position = "inside_rows") 
   #Total     Engine 
     V-engine   Straight engine 
 Miles/(US) gallon 
   Mean    20.1   16.6 24.6
   Valid N    32.0   18.0 14.0
 Transmission 
   Automatic   col %    59.4   66.7 50.0
    row %    100.0   63.2 36.8
    table %    59.4   37.5 21.9
   Manual   col %    40.6   33.3 50.0
    row %    100.0   46.2 53.8
    table %    40.6   18.8 21.9
# Table with split by rows and with custom totals.
mtcars %>% 
    tab_cells(cyl) %>% 
    tab_cols(total(), vs) %>% 
    tab_rows(am) %>% 
    tab_stat_cpct(total_row_position = "above",
                  total_label = c("number of cases", "row %"),
                  total_statistic = c("u_cases", "u_rpct")) %>% 
    tab_pivot()
   #Total     Engine 
     V-engine   Straight engine 
 Transmission 
   Automatic   Number of cylinders   #number of cases    19   12 7
    #row %    100   63.2 36.8
    4    15.8   42.9
    6    21.1   57.1
    8    63.2   100.0
   Manual   Number of cylinders   #number of cases    13   6 7
    #row %    100   46.2 53.8
    4    61.5   16.7 100.0
    6    23.1   50.0
    8    15.4   33.3
# Linear regression by groups.
mtcars %>% 
    tab_cells(sheet(mpg, disp, hp, wt, qsec)) %>% 
    tab_cols(total(label = "#Total| |"), am) %>% 
    tab_stat_fun_df(
        function(x){
            frm = reformulate(".", response = as.name(names(x)[1]))
            model = lm(frm, data = x)
            sheet('Coef.' = coef(model), 
                  confint(model)
            )
        }    
    ) %>% 
    tab_pivot() 
 #Total     Transmission 
       Automatic     Manual 
 Coef.   2.5 %   97.5 %     Coef.   2.5 %   97.5 %     Coef.   2.5 %   97.5 % 
 (Intercept)  27.3 9.6 45.1   21.8 -1.9 45.5   13.3 -21.9 48.4
 Displacement (cu.in.)  0.0 0.0 0.0   0.0 0.0 0.0   0.0 -0.1 0.1
 Gross horsepower  0.0 -0.1 0.0   0.0 -0.1 0.0   0.0 0.0 0.1
 Weight (1000 lbs)  -4.6 -7.2 -2.0   -2.3 -5.0 0.4   -7.7 -12.5 -2.9
 1/4 mile time  0.5 -0.4 1.5   0.4 -0.7 1.6   1.6 -0.2 3.4

Example of data processing with multiple-response variables

Here we use truncated dataset with data from product test of two samples of chocolate sweets. 150 respondents tested two kinds of sweets (codenames: VSX123 and SDF546). Sample was divided into two groups (cells) of 75 respondents in each group. In cell 1 product VSX123 was presented first and then SDF546. In cell 2 sweets were presented in reversed order. Questions about respondent impressions about first product are in the block A (and about second tested product in the block B). At the end of the questionnaire there was a question about the preferences between sweets.

List of variables:

data(product_test)

w = product_test # shorter name to save some keystrokes

# here we recode variables from first/second tested product to separate variables for each product according to their cells
# 'h' variables - VSX123 sample, 'p' variables - 'SDF456' sample
# also we recode preferences from first/second product to true names
# for first cell there are no changes, for second cell we should exchange 1 and 2.
w = w %>% 
    let_if(cell == 1, 
        h1_1 %to% h1_6 := recode(a1_1 %to% a1_6, other ~ copy),
        p1_1 %to% p1_6 := recode(b1_1 %to% b1_6, other ~ copy),
        h22 := recode(a22, other ~ copy), 
        p22 := recode(b22, other ~ copy),
        c1r = c1
    ) %>% 
    let_if(cell == 2, 
        p1_1 %to% p1_6 := recode(a1_1 %to% a1_6, other ~ copy), 
        h1_1 %to% h1_6 := recode(b1_1 %to% b1_6, other ~ copy),
        p22 := recode(a22, other ~ copy),
        h22 := recode(b22, other ~ copy), 
        c1r := recode(c1, 1 ~ 2, 2 ~ 1, other ~ copy) 
    ) %>% 
    let(
        # recode age by groups
        age_cat = recode(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2),
        # count number of likes
        # codes 2 and 99 are ignored.
        h_likes = count_row_if(1 | 3 %thru% 98, h1_1 %to% h1_6), 
        p_likes = count_row_if(1 | 3 %thru% 98, p1_1 %to% p1_6) 
    )

# here we prepare labels for future usage
codeframe_likes = num_lab("
    1 Liked everything
    2 Disliked everything
    3 Chocolate
    4 Appearance
    5 Taste
    6 Stuffing
    7 Nuts
    8 Consistency
    98 Other
    99 Hard to answer
")

overall_liking_scale = num_lab("
    1 Extremely poor 
    2 Very poor
    3 Quite poor
    4 Neither good, nor poor
    5 Quite good
    6 Very good
    7 Excellent
")

w = apply_labels(w, 
    c1r = "Preferences",
    c1r = num_lab("
        1 VSX123 
        2 SDF456
        3 Hard to say
    "),
    
    age_cat = "Age",
    age_cat = c("18 - 25" = 1, "26 - 35" = 2),
    
    h1_1 = "Likes. VSX123",
    p1_1 = "Likes. SDF456",
    h1_1 = codeframe_likes,
    p1_1 = codeframe_likes,
    
    h_likes = "Number of likes. VSX123",
    p_likes = "Number of likes. SDF456",
    
    h22 = "Overall quality. VSX123",
    p22 = "Overall quality. SDF456",
    h22 = overall_liking_scale,
    p22 = overall_liking_scale
)

Are there any significant differences between preferences? Yes, difference is significant.

# 'tab_mis_val(3)' remove 'hard to say' from vector 
w %>% tab_cols(total(), age_cat) %>% 
      tab_cells(c1r) %>% 
      tab_mis_val(3) %>% 
      tab_stat_cases() %>% 
      tab_last_sig_cases() %>% 
      tab_pivot()
 #Total     Age 
   18 - 25   26 - 35 
 Preferences 
   VSX123  94.0    46.0  48.0 
   SDF456  50.0    22.0  28.0 
   Hard to say   
   #Chi-squared p-value  <0.05    (warn.)
   #Total cases  144.0    68.0  76.0 

Further we calculate distribution of answers in the survey questions.

# lets specify repeated parts of table creation chains
banner = w %>% tab_cols(total(), age_cat, c1r) 
# column percent with significance
tab_cpct_sig = . %>% tab_stat_cpct() %>% 
                    tab_last_sig_cpct(sig_labels = paste0("<b>",LETTERS, "</b>"))

# means with siginifcance
tab_means_sig = . %>% tab_stat_mean_sd_n(labels = c("<b><u>Mean</u></b>", "sd", "N")) %>% 
                      tab_last_sig_means(
                          sig_labels = paste0("<b>",LETTERS, "</b>"),   
                          keep = "means")

# Preferences
banner %>% 
    tab_cells(c1r) %>% 
    tab_cpct_sig() %>% 
    tab_pivot() 
 #Total     Age     Preferences 
   18 - 25     26 - 35     VSX123     SDF456     Hard to say 
   A     B     A     B     C 
 Preferences 
   VSX123  62.7    65.7    60.0    100.0     
   SDF456  33.3    31.4    35.0      100.0   
   Hard to say  4.0    2.9    5.0        100.0 
   #Total cases  150    70    80    94    50   
# Overall liking
banner %>%  
    tab_cells(h22) %>% 
    tab_means_sig() %>% 
    tab_cpct_sig() %>%  
    tab_cells(p22) %>% 
    tab_means_sig() %>% 
    tab_cpct_sig() %>%
    tab_pivot() 
 #Total     Age     Preferences 
   18 - 25     26 - 35     VSX123     SDF456     Hard to say 
   A     B     A     B     C 
 Overall quality. VSX123 
   Mean  5.5    5.4    5.6    5.3    5.8 A   5.5 
   Extremely poor           
   Very poor           
   Quite poor  2.0    2.9    1.2     3.2      
   Neither good, nor poor  10.7    11.4    10.0     14.9 B   2.0    16.7 
   Quite good  39.3    45.7    33.8     40.4     38.0    33.3 
   Very good  33.3    24.3    41.2 A   30.9     38.0    33.3 
   Excellent  14.7    15.7    13.8     10.6     22.0    16.7 
   #Total cases  150    70    80     94     50   
 Overall quality. SDF456 
   Mean  5.4    5.3    5.4    5.4    5.3    5.7 
   Extremely poor           
   Very poor  0.7      1.2    1.1     
   Quite poor  2.7    4.3    1.2    2.1    4.0   
   Neither good, nor poor  16.7    20.0    13.8    18.1    14.0    16.7 
   Quite good  31.3    27.1    35.0    28.7    38.0    16.7 
   Very good  35.3    35.7    35.0    35.1    34.0    50.0 
   Excellent  13.3    12.9    13.8    14.9    10.0    16.7 
   #Total cases  150    70    80    94    50   
# Likes
banner %>% 
    tab_cells(h_likes) %>% 
    tab_means_sig() %>% 
    tab_cells(mrset(h1_1 %to% h1_6)) %>% 
    tab_cpct_sig() %>% 
    tab_cells(p_likes) %>% 
    tab_means_sig() %>% 
    tab_cells(mrset(p1_1 %to% p1_6)) %>% 
    tab_cpct_sig() %>%
    tab_pivot() 
 #Total     Age     Preferences 
   18 - 25     26 - 35     VSX123     SDF456     Hard to say 
   A     B     A     B     C 
 Number of likes. VSX123 
   Mean  2.0    2.0    2.1    1.9    2.2    2.3 
 Likes. VSX123 
   Liked everything           
   Disliked everything  3.3    1.4    5.0     4.3    2.0    
   Chocolate  34.0    38.6    30.0     35.1    32.0     33.3  
   Appearance  29.3    21.4    36.2 A   25.5    38.0     16.7  
   Taste  32.0    38.6    26.2     23.4    48.0 A   33.3  
   Stuffing  27.3    20.0    33.8     28.7    26.0     16.7  
   Nuts  66.7    72.9    61.3     69.1    60.0     83.3  
   Consistency  12.0    4.3    18.8 A   8.5    14.0     50.0 A B
   Other           
   Hard to answer           
   #Total cases  150    70    80     94    50     6  
 Number of likes. SDF456 
   Mean  2.0    2.0    2.1    2.0    2.0    2.0 
 Likes. SDF456 
   Liked everything           
   Disliked everything  1.3    1.4    1.2    2.1     
   Chocolate  32.0    27.1    36.2    29.8    34.0    50.0 
   Appearance  32.0    35.7    28.7    34.0    30.0    16.7 
   Taste  39.3    42.9    36.2    36.2    44.0    50.0 
   Stuffing  27.3    24.3    30.0    31.9    20.0    16.7 
   Nuts  61.3    60.0    62.5    58.5    68.0    50.0 
   Consistency  10.0    5.7    13.8    11.7    6.0    16.7 
   Other  0.7      1.2    1.1     
   Hard to answer           
   #Total cases  150    70    80    94    50   
# below more complicated table where we compare likes side by side
# Likes - side by side comparison
w %>% 
    tab_cols(total(label = "#Total| |"), c1r) %>% 
    tab_cells(list(unvr(mrset(h1_1 %to% h1_6)))) %>% 
    tab_stat_cpct(label = var_lab(h1_1)) %>% 
    tab_cells(list(unvr(mrset(p1_1 %to% p1_6)))) %>% 
    tab_stat_cpct(label = var_lab(p1_1)) %>% 
    tab_pivot(stat_position = "inside_columns") 
 #Total     Preferences 
       VSX123     SDF456     Hard to say 
 Likes. VSX123   Likes. SDF456     Likes. VSX123   Likes. SDF456     Likes. VSX123   Likes. SDF456     Likes. VSX123   Likes. SDF456 
 Liked everything       
 Disliked everything  3.3 1.3   4.3 2.1   2  
 Chocolate  34.0 32.0   35.1 29.8   32 34   33.3 50.0
 Appearance  29.3 32.0   25.5 34.0   38 30   16.7 16.7
 Taste  32.0 39.3   23.4 36.2   48 44   33.3 50.0
 Stuffing  27.3 27.3   28.7 31.9   26 20   16.7 16.7
 Nuts  66.7 61.3   69.1 58.5   60 68   83.3 50.0
 Consistency  12.0 10.0   8.5 11.7   14 6   50.0 16.7
 Other  0.7   1.1    
 Hard to answer       
 #Total cases  150 150   94 94   50 50   6 6

We can save labelled dataset as *.csv file with accompanying R code for labelling.

write_labelled_csv(w, file  filename = "product_test.csv")

Or, we can save dataset as *.csv file with SPSS syntax to read data and apply labels.

write_labelled_spss(w, file  filename = "product_test.csv")