pivot() vs Base R aggregate()Aggregation is the process forming groups of data based on the levels of categorical variables, then computing some statistical value, such as a mean, for each group on a numeric variable. For example, compute the mean Salary for each combination of the levels of Gender and Department. Refer to each group, the cross-classification of all the specified categorical variables, as a cell.
The lessR pivot() function performs this aggregation by relying upon the base R function aggregate(). By default, pivot() generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis functions as a standard data frame. The levels for all the grouping variables are listed in the rows, but if there are specified column grouping variables, pivot() relies upon base R reshape() to form a 2-d table for direct viewing instead of a data table to input into further analysis functions.
pivot() provides additional features than aggregate() provides.
value over which to aggregate, the sample size for each cell is provided with the number of missing values.value aggregated.Also, there is a different interface as the variables, for rows and columns, are each specified as a vector.
The following table lists many of the available statistical functions by which to aggregate over the cells.
| Statistic | Meaning | 
|---|---|
| sum | sum | 
| mean | arithmetic mean | 
| median | median | 
| min | minimum | 
| max | maximum | 
| sd | standard deviation | 
| var | variance | 
| IQR | inter-quartile range | 
| mad | mean absolute deviation | 
| tabulate | count of each cell only | 
Some statistical functions are available that return multiple values.
| Statistic | Meaning | 
|---|---|
| range | minimum, maximum | 
| quantile | range + quartiles | 
| summary | quantile + mean | 
These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame.
The following pivot() parameters specify the data, the statistic for the aggregation, the value(s) over which to aggregate, and the corresponding cells that contain the aggregated values. The first four parameter values listed below are required.
data: The data frame that includes the variables of interest.compute: The function for which to perform the aggregation.value: The variable(s) for which to summarize, i.e., aggregate.rows_by: The categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values.cols_by: The optional categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.Multiple values of parameters value, rows_by, and up to two cols_by variables may be specified. Express the multiple categorical variables over which to pivot as a vector, such as with the c() function. If cols_by is not specified, the result is a linear table as a long-form data frame that can be input into other data analysis procedures.
By default, missing values are eliminated from the aggregation. If there are no values for a cell for which to perform the aggregation, and it is desired to have the aggregated value listed as missing, then specify na_value as TRUE. If any of the levels of the by variables are missing, to report those missing cells, specify na_by as TRUE. Set na_remove to TRUE to have any aggregated value defined as missing if any individual data cell for value is missing.
To illustrate, use the Employee data set included in lessR, here read into the d data frame. Begin with no missing data.
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------Two categorical variables in the d data frame are Dept and Salary. A continuous variable is Salary. Create the long-form pivot table as a data frame that expresses the mean of Salary for all combinations of Dept and Salary.
##    Dept Gender  n miss   Salary
## 1  ACCT      F  3    0 63237.16
## 2  ADMN      F  4    0 81434.00
## 3  FINC      F  1    0 57139.90
## 4  MKTG      F  5    0 64496.02
## 5  SALE      F  5    0 64188.25
## 6  ACCT      M  2    0 59626.20
## 7  ADMN      M  2    0 80963.35
## 8  FINC      M  3    0 72967.60
## 9  MKTG      M  1    0 99062.66
## 10 SALE      M 10    0 86150.97The output of pivot() is a data frame of the aggregated variables. This can be saved for further analysis. Here, perform the same analysis, but list the required parameter values in order without the parameter names.
Because the output of pivot() with no cols_by variables is a standard R data frame, typical operations such as sorting can be applied, here using the lessR function Sort().
## 
## Sort Specification
##   1 -->  ascending##    Dept Gender  n miss   Salary
## 1  ACCT      F  3    0 63237.16
## 6  ACCT      M  2    0 59626.20
## 2  ADMN      F  4    0 81434.00
## 7  ADMN      M  2    0 80963.35
## 3  FINC      F  1    0 57139.90
## 8  FINC      M  3    0 72967.60
## 4  MKTG      F  5    0 64496.02
## 9  MKTG      M  1    0 99062.66
## 5  SALE      F  5    0 64188.25
## 10 SALE      M 10    0 86150.97Multiple value variables for which to aggregate over can also be specified. Round the numerical aggregated results to two decimal digits.
##    Dept Gender  n miss Years   Salary
## 1  ACCT      F  3    0  4.67 63237.16
## 2  ADMN      F  4    0  7.50 81434.00
## 3  FINC      F  1    0  7.00 57139.90
## 4  MKTG      F  5    0  8.20 64496.02
## 5  SALE      F  5    0  6.60 64188.25
## 6  ACCT      M  2    0  7.00 59626.20
## 7  ADMN      M  2    0 15.50 80963.34
## 8  FINC      M  3    0 11.33 72967.60
## 9  MKTG      M  1    0 18.00 99062.66
## 10 SALE      M 10    0 12.33 86150.97The following illustrates the pipe operator from the magrittr package with pivot(). The package is not included with lessR, so separately load with library(), either by itself or as part of the tidyverse package. As such, the following code is not run.
Perform the pivot computations on the d data frame. Then output the aggregated results to the a data frame.
Specify up to two cols_by categorical variables, to create a two-dimensional table with the specified columns. First, one cols_by variable, Gender. Specifying one or two categorical variables as cols_by variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a cross-classification table with categorical variables in the rows and columns.
Table: mean of Salary 
 Gender  F          M               
 Dept                        
-------  ---------  ---------
   ACCT   63237.16   59626.20
   ADMN   81434.00   80963.34
   FINC   57139.90   72967.60
   MKTG   64496.02   99062.66
   SALE   64188.25   86150.97Here two cols_by variables, specified as a vector.
Table: mean of Salary 
 Gender  F                                M                                      
 Plan    1          2          3          1          2           3             
 Dept                                                                     
-------  ---------  ---------  ---------  ---------  ----------  ---------
   MKTG   56772.95   66426.79         NA         NA          NA   99062.66
   SALE   60941.54   66352.73         NA   89393.40    82442.74   80131.91
   ACCT         NA   63237.16         NA   69547.60          NA   49704.79
   ADMN         NA   67724.21   122563.4   53788.26   108138.43         NA
   FINC         NA   57139.90         NA   61937.62          NA   95027.55Tabulation is counting. With tabulation there is no value variable per se. Instead, the number of data values in each cell are tabulated, that is, counted.
##    Dept Gender  n
## 1  ACCT      F  3
## 2  ADMN      F  4
## 3  FINC      F  1
## 4  MKTG      F  5
## 5  SALE      F  5
## 6  ACCT      M  2
## 7  ADMN      M  2
## 8  FINC      M  3
## 9  MKTG      M  1
## 10 SALE      M 10The pivot table follows with the missing data analysis when there is only one numerical value over which to aggregate. The variable Years has one missing value.
##     by  n miss     Years
## 1 ACCT  5    0  5.600000
## 2 ADMN  6    0 10.166667
## 3 FINC  4    0 10.250000
## 4 MKTG  6    0  9.833333
## 5 SALE 14    1 10.285714Set na_remove to FALSE to not remove any missing data in a cell with values to be aggregated. The resulting aggregated value will be missing if any of the constituent data values are missing. The corresponding level does not appear because cells with aggregated missing values are not shown.
##     by n miss     Years
## 1 ACCT 5    0  5.600000
## 2 ADMN 6    0 10.166667
## 3 FINC 4    0 10.250000
## 4 MKTG 6    0  9.833333The na_remove parameter specifies the value of the base R parameter na.rm for computations such as for the mean. See ?mean for its definition.
Here include all the cells, even those with a missing aggregated value.
##     by  n miss     Years
## 1 ACCT  5    0  5.600000
## 2 ADMN  6    0 10.166667
## 3 FINC  4    0 10.250000
## 4 MKTG  6    0  9.833333
## 5 SALE 14    1        NATo account for missing values of the categorical by variables, set na_by to TRUE.
##    Dept Gender n miss     Years
## 1  ACCT      F 3    0  4.666667
## 2  ADMN      F 4    0  7.500000
## 3  FINC      F 1    0  7.000000
## 4  MKTG      F 5    0  8.200000
## 5  SALE      F 5    0  6.600000
## 6  <NA>      F 1    0  5.000000
## 7  ACCT      M 2    0  7.000000
## 8  ADMN      M 2    0 15.500000
## 9  FINC      M 3    0 11.333333
## 10 MKTG      M 1    0 18.000000
## 11 SALE      M 9    1 12.333333Specify the count for levels that include missing cells. Invoke na_by for tabulation.
##    Dept Gender  n
## 1  ACCT      F  3
## 2  ADMN      F  4
## 3  FINC      F  1
## 4  MKTG      F  5
## 5  SALE      F  5
## 6  <NA>      F  1
## 7  ACCT      M  2
## 8  ADMN      M  2
## 9  FINC      M  3
## 10 MKTG      M  1
## 11 SALE      M 10
## 12 <NA>      M  0