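In some situations, series from a data.frame have a natural two-dimensional (tabular) representation because each observation can be uniquely characterized by a combination of two indexes. Two major cases of this situation in applied econometrics are panel data, where each individual is observed over several time periods, and discrete choice data, where each choice situation involves several alternatives.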
The idea of dfidx is to keep the data and the information about this structure in the same object. A dfidx is a data.frame with an idx column, which is itself a data.frame containing the series that define the indexes.
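The structure described above can be mimicked in base R, since a column of a data.frame may itself be a data.frame. The following is only a sketch of the idea on a made-up toy data set (the names `toy`, `id` and `alt` are illustrative assumptions); it does not reproduce the class or the methods that dfidx adds.

```r
# Toy data: 2 choice situations x 2 alternatives (hypothetical values)
toy <- data.frame(cost = c(59, 31, 58, 31))

# The two indexes are stored in their own data.frame ...
indexes <- data.frame(id  = c(1, 1, 2, 2),
                      alt = c("air", "train", "air", "train"))

# ... which is then attached as a single column named idx
toy$idx <- indexes

names(toy)       # the data frame has two columns: cost and idx
names(toy$idx)   # the idx column holds the two index series
toy$idx$alt      # an index series is reached through the idx column
```

Keeping the indexes in one nested column means they travel with the data as a single unit, which is the property the rest of this document relies on.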
A dfidx is created using the function of the same name, whose main and only mandatory argument is a data.frame.
The dfidx package is loaded using:
library("dfidx")
To illustrate the use of dfidx, we'll use the TravelMode data set from the AER package, which contains observations on 210 choice situations for 4 alternatives: air, train, bus and car (in this order). Rows are ordered by choice situation first and then by alternative.
data("TravelMode", package = "AER")
head(TravelMode)
##   individual  mode choice wait vcost travel gcost income size
## 1          1   air     no   69    59    100    70     35    1
## 2          1 train     no   34    31    372    71     35    1
## 3          1   bus     no   35    25    417    70     35    1
## 4          1   car    yes    0    10    180    30     35    1
## 5          2   air     no   64    58     68    68     30    2
## 6          2 train     no   44    31    354    84     30    2
As the first two columns contain the two indexes, the idx argument can be left unset:
TM1 <- dfidx(TravelMode, drop.index = FALSE)
The resulting object is of class dfidx and is a data.frame with an idx column that can be retrieved using the idx function:
idx(TM1) %>% print(n = 3)
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
dfidx provides a customized print method which prints the first lines of the data frame and of its index.
TM1 %>% print(n = 3)
## ~~~~~~~
##  first 3 observations out of 840 
## ~~~~~~~
##   individual  mode choice wait vcost travel gcost income size    idx
## 1          1   air     no   69    59    100    70     35    1  1:air
## 2          1 train     no   34    31    372    71     35    1 1:rain
## 3          1   bus     no   35    25    417    70     35    1  1:bus
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
Note the use of the drop.index argument, set to FALSE in order to keep the individual series which define the indexes as stand-alone series in the data.frame. With the default value, these series are removed from the data.frame:
TM1 <- dfidx(TravelMode)
TM1 %>% print(n = 3)
## ~~~~~~~
##  first 3 observations out of 840 
## ~~~~~~~
##   choice wait vcost travel gcost income size    idx
## 1     no   69    59    100    70     35    1  1:air
## 2     no   34    31    372    71     35    1 1:rain
## 3     no   35    25    417    70     35    1  1:bus
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
The idx argument may be a list or a character vector of length two in order to indicate which columns of the data frame contain the indexes:
TM2 <- dfidx(TravelMode, idx = c("individual", "mode"))
TM3 <- dfidx(TravelMode, idx = list("individual", "mode"))
c(identical(TM2, TM3), identical(TM1, TM2))
## [1] TRUE TRUE
The series contained in the idx data frame can be named using the idnames argument.
TM3b <- dfidx(TravelMode, idnames = c(NA, "trmode"))
Any NA in this vector will result in using the default name, which is the name of the original series.
In the case where the data set is balanced and observations are ordered by the first index first and then by the second, it is enough to provide one of the indexes, or even none. In this case, idx can be either the name of the first index:
TravelMode2 <- dplyr::select(TravelMode, - mode)
TM4 <- dfidx(TravelMode2, idx = "individual", idnames = c("individual", "mode"))
TM4 %>% print(n = 3)
## ~~~~~~~
##  first 3 observations out of 840 
## ~~~~~~~
##   choice wait vcost travel gcost income size idx
## 1     no   69    59    100    70     35    1 1:1
## 2     no   34    31    372    71     35    1 1:2
## 3     no   35    25    417    70     35    1 1:3
## 
## ~~~ indexes ~~~~
##   individual mode
## 1          1    1
## 2          1    2
## 3          1    3
## indexes:  1, 2
or an integer equal to the cardinality of the first index (the number of its distinct values):
TravelMode3 <- dplyr::select(TravelMode, - mode)
TM5 <- dfidx(TravelMode3, idx = 210, idnames = c("individual", "mode"))
TM5 %>% print(n = 3)
## ~~~~~~~
##  first 3 observations out of 840 
## ~~~~~~~
##   choice wait vcost travel gcost income size idx
## 1     no   69    59    100    70     35    1 1:1
## 2     no   34    31    372    71     35    1 1:2
## 3     no   35    25    417    70     35    1 1:3
## 
## ~~~ indexes ~~~~
##   individual mode
## 1          1    1
## 2          1    2
## 3          1    3
## indexes:  1, 2
Moreover, the levels of the second index can be indicated using the levels argument.
TM4b <- dfidx(TravelMode2, levels = c("air", "train", "bus", "car"),
                   idnames = c("individual", "mode"),
                   idx = "individual")
TM5b <- dfidx(TravelMode3, idx = 210, idnames = c("individual", "mode"),
                    levels = c("air", "train", "bus", "car"))
c(identical(TM4b, TM1), identical(TM5b, TM1))
## [1] TRUE TRUE
One or both of the indexes may be nested in another series. In this case, the idx argument must be a list of length two, each element being a character vector of length two (if there is a nesting structure) or one.
As a first example, consider the JapaneseFDI data set of the mlogit package, which deals with the location of Japanese production units in Europe. The first index, firm, refers to the production units; the second one, region, refers to the European region where the production unit is located. The country variable nests the second index.
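Nesting means that every value of the index maps to exactly one value of the nesting variable. As a rough illustration in base R, on a made-up toy data frame (the helper `is_nested` is hypothetical and is not part of dfidx), such a structure can be checked as follows:

```r
# Hypothetical helper: TRUE if each value of `inner` belongs to a
# single value of `outer`
is_nested <- function(inner, outer) {
  pairs <- unique(data.frame(inner = inner, outer = outer))
  !any(duplicated(pairs$inner))
}

# Toy data (made-up values): regions within countries
toy <- data.frame(region  = c("BE0", "BE1", "FR0", "FR1", "BE0"),
                  country = c("BE",  "BE",  "FR",  "FR",  "BE"))

is_nested(toy$region, toy$country)   # regions are nested in countries
is_nested(toy$country, toy$region)   # countries are not nested in regions
```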
data("JapaneseFDI", package = "mlogit")
JapaneseFDI <- dplyr::select(JapaneseFDI, 1:8)
head(JapaneseFDI, 3)
##   firm country region choice choice.c     wage unemp elig
## 1    3      BE    BE0      0       FR 14.17371 0.103    0
## 2    3      BE    BE1      0       FR 16.57562 0.095    0
## 3    3      BE    BE2      0       FR 14.94713 0.136    0
JP1 <- dfidx(JapaneseFDI, idx = list("firm", c("region", "country")))
head(JP1, 3)
## ~~~~~~~
##  first 3 observations out of 25764 
## ~~~~~~~
##   choice choice.c     wage unemp elig   idx
## 1      0       FR 14.17371 0.103    0 3:BE0
## 2      0       FR 16.57562 0.095    0 3:BE1
## 3      0       FR 14.94713 0.136    0 3:BE2
## 
## ~~~ indexes ~~~~
##   firm region country
## 1    3    BE0      BE
## 2    3    BE1      BE
## 3    3    BE2      BE
## indexes:  1, 2, 2
JP1b <- dfidx(JapaneseFDI, idx = list("firm", c("region", "country")),
              idnames = c("japf", "iso80"))
head(JP1b, 3)
## ~~~~~~~
##  first 3 observations out of 25764 
## ~~~~~~~
##   choice choice.c     wage unemp elig   idx
## 1      0       FR 14.17371 0.103    0 3:BE0
## 2      0       FR 16.57562 0.095    0 3:BE1
## 3      0       FR 14.94713 0.136    0 3:BE2
## 
## ~~~ indexes ~~~~
##   japf iso80 country
## 1    3   BE0      BE
## 2    3   BE1      BE
## 3    3   BE2      BE
## indexes:  1, 2, 2
The Produc data set from the plm package contains data for 48 American states for 17 years. The first index (state) is nested in the region variable:
data("Produc", package = "plm")
head(Produc, 3)
##     state year region     pcap     hwy   water    util       pc   gsp    emp
## 1 ALABAMA 1970      6 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5
## 2 ALABAMA 1971      6 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9
## 3 ALABAMA 1972      6 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3
##   unemp
## 1   4.7
## 2   5.2
## 3   4.7
Pr <- dfidx(Produc, idx = list(c("state", "region"), "year"))
head(Pr, 3)
## ~~~~~~~
##  first 3 observations out of 816 
## ~~~~~~~
##       pcap     hwy   water    util       pc   gsp    emp unemp       idx
## 1 15865.66 7237.14 2208.10 6420.42 24082.38 38880 1197.5   5.6 CONN:1970
## 2 16559.99 7312.24 2406.84 6840.91 25147.44 38515 1164.3   8.9 CONN:1971
## 3 17346.79 7407.46 2642.54 7296.80 26191.58 40037 1190.4   8.2 CONN:1972
## 
## ~~~ indexes ~~~~
##         state region year
## 1 CONNECTICUT      1 1970
## 2 CONNECTICUT      1 1971
## 3 CONNECTICUT      1 1972
## indexes:  1, 1, 2
dfidx can deal with data frames in wide format, i.e. for which each series for a given value of the second index is a column of the data frame. The two supplementary arguments in this case are shape, which should be set to wide, and varying, which indicates which columns should be merged in the resulting long-formatted data frame. Note that the shape argument is automatically set to wide if the varying argument is set.
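To make the wide-to-long conversion concrete, here is a sketch using base R's reshape function on a made-up two-row data frame. It only illustrates what merging the varying columns means and is not meant to describe how dfidx is implemented. The column names and values are assumptions chosen to mimic a `variable.alternative` naming scheme.

```r
# Toy wide data (made-up values): one row per choice situation
wide <- data.frame(id          = 1:2,
                   price.beach = c(10, 20),
                   price.boat  = c(30, 40),
                   catch.beach = c(0.1, 0.2),
                   catch.boat  = c(0.3, 0.4))

# Columns 2 to 5 vary with the alternative; their names are split at "."
long <- reshape(wide, direction = "long", varying = 2:5, sep = ".",
                idvar = "id", timevar = "alt")

# One row per (id, alt) pair, with one column per merged series
long[order(long$id, long$alt), ]
```

The four varying columns collapse into two series, price and catch, and the second index (here called alt) is built from the suffixes of the original column names.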
data("Fishing", package = "mlogit")
head(Fishing, 3)
##      mode price.beach price.pier price.boat price.charter catch.beach
## 1 charter     157.930    157.930    157.930       182.930      0.0678
## 2 charter      15.114     15.114     10.534        34.534      0.1049
## 3    boat     161.874    161.874     24.334        59.334      0.5333
##   catch.pier catch.boat catch.charter   income
## 1     0.0503     0.2601        0.5391 7083.332
## 2     0.0451     0.1574        0.4671 1250.000
## 3     0.4522     0.2413        1.0266 3750.000
Fi <- dfidx(Fishing, shape = "wide", varying = 2:9)
Fi2 <- dfidx(Fishing, varying = 2:9)
identical(Fi, Fi2)
## [1] TRUE
head(Fi, 3)
## ~~~~~~~
##  first 3 observations out of 4728 
## ~~~~~~~
##      mode   income  price  catch    idx
## 1 charter 7083.332 157.93 0.0678 1:each
## 2 charter 7083.332 157.93 0.2601 1:boat
## 3 charter 7083.332 182.93 0.5391 1:rter
## 
## ~~~ indexes ~~~~
##   id1     id2
## 1   1   beach
## 2   1    boat
## 3   1 charter
## indexes:  1, 2
In this case, the two indexes are auto-generated with default names id1 and id2. Customized names can be provided using the idnames argument.
data("Fishing", package = "mlogit")
Fi <- dfidx(Fishing, shape = "wide", varying = 2:9, idnames = c("chid", "alt"))
head(Fi, 3)
## ~~~~~~~
##  first 3 observations out of 4728 
## ~~~~~~~
##      mode   income  price  catch    idx
## 1 charter 7083.332 157.93 0.0678 1:each
## 2 charter 7083.332 157.93 0.2601 1:boat
## 3 charter 7083.332 182.93 0.5391 1:rter
## 
## ~~~ indexes ~~~~
##   chid     alt
## 1    1   beach
## 2    1    boat
## 3    1 charter
## indexes:  1, 2
A nesting structure can be indicated for the first index. As an example, in the Train data set of the mlogit package, each line describes the features of a choice situation (a choice between two artificial train tickets A and B). Each individual faces different choice situations, so that there is an id variable which nests the choice situation variable called choiceid. Note that the second index cannot be provided for data frames in wide format.
data("Train", package = "mlogit")
Train$choiceid <- 1:nrow(Train)
head(Train, 3)
##   id choiceid choice price_A time_A change_A comfort_A price_B time_B change_B
## 1  1        1      A    2400    150        0         1    4000    150        0
## 2  1        2      A    2400    150        0         1    3200    130        0
## 3  1        3      A    2400    115        0         1    4000    115        0
##   comfort_B
## 1         1
## 2         1
## 3         0
Tr <- dfidx(Train, shape = "wide", varying = 4:11, sep = "_",
                  idx = list(c("choiceid", "id")), idnames = c(NA, "alt"))
head(Tr, 3)
## ~~~~~~~
##  first 3 observations out of 5858 
## ~~~~~~~
##   choice price time change comfort idx
## 1      A  2400  150      0       1 1:A
## 2      A  4000  150      0       1 1:B
## 3      A  2400  150      0       1 2:A
## 
## ~~~ indexes ~~~~
##   choiceid id alt
## 1        1  1   A
## 2        1  1   B
## 3        2  1   A
## indexes:  1, 1, 2
The name (and position) of the idx column can be obtained as a named integer (the integer being the position of the column and the name its name) using the idx_name function:
idx_name(Tr)
## idx 
##   6
To get the name of one of the indexes, the second argument, n, is set to either 1 or 2 to get the first or the second index, ignoring the nesting variables:
idx_name(Tr, 1)
## [1] "choiceid"
idx_name(idx(Tr), 1)
## [1] "choiceid"
idx_name(Tr$change, 1)
## [1] "choiceid"
idx_name(Tr, 2)
## [1] "alt"
Note that idx_name can in this case be applied to a dfidx, an idx or an xseries object.
To get a nesting variable, set the m argument to 2 or more:
idx_name(Tr, 1, 2)
## [1] "id"
idx_name(Tr, 2, 2)
## NULL
To extract one or all of the indexes, the idx function is used. This function has already been encountered as a way to extract the idx column of a dfidx object. It also works for an idx and an xseries (in the first case, the function just returns its argument):
idx1 <- idx(Tr)
idx2 <- idx(idx(Tr))
idx3 <- idx(Tr$change)
c(identical(idx1, idx2), identical(idx1, idx3))
## [1] TRUE TRUE
Use the same n and m arguments as for the idx_name function in order to extract a specific series. For example, to extract the individual index, which nests the choice index, use:
id_index1 <- idx(Tr, n = 1, m = 2)
id_index2 <- idx(idx(Tr), n = 1, m = 2)
id_index3 <- idx(Tr$change, n = 1, m = 2)
c(identical(id_index1, id_index2), identical(id_index2, id_index3))
## [1] TRUE TRUE
Extractors for data.frame include:
- [, which can be used with one argument (defining columns to extract) or two arguments (defining lines and columns to extract),
- [[, which returns one column of the data.frame.
Consider first the use of [. If two arguments are provided, a data.frame is always returned except when a single series is selected, in which case the required series is returned (a data.frame is returned in this case if drop = FALSE).
A specific method ([.dfidx) is provided for one reason: the column that contains the indexes should be "sticky" (we borrow this idea from the sf package), which means that it should always be returned by the extractor operator, even if it is not explicitly selected.
TM <- dfidx(TravelMode)
TMsub <- TM[TM$size == 1, ]
TMsub %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 456 
## ~~~~~~~
##   choice wait vcost travel gcost income size    idx
## 1     no   69    59    100    70     35    1  1:air
## 2     no   34    31    372    71     35    1 1:rain
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
idx(TMsub) %>% print(n = 2)
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
TMsub2 <- TM[TM$size == 1, c("wait", "vcost")]
TMsub3 <- TM[TM$size == 1, "wait", drop = FALSE]
All the previous commands extract the lines for households of size 1: all the series in the first case, two of them in the second case and only one series in the third case. For this last case, we added drop = FALSE so that a data.frame and not a series is returned.
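The sticky behaviour can be sketched in base R with a toy class. The class name `sticky_df` and the column name `key` below are hypothetical, and the sketch only handles the two-argument form with columns selected by name; it illustrates the idea and is not the actual [.dfidx method.

```r
# Toy data frame with a column that must always be kept
toy <- data.frame(key = c("a", "b", "c"), x = 1:3, y = 4:6)
class(toy) <- c("sticky_df", "data.frame")

# Hypothetical extractor: `key` is added back when it is not selected
"[.sticky_df" <- function(x, i, j) {
  df <- x
  class(df) <- "data.frame"              # fall back to the plain method
  if (missing(i)) i <- seq_len(nrow(df)) # empty first argument: all rows
  if (missing(j)) j <- names(df)         # empty second argument: all columns
  j <- union(j, "key")                   # make the key column sticky
  out <- df[i, j, drop = FALSE]
  class(out) <- c("sticky_df", "data.frame")
  out
}

res <- toy[toy$x > 1, "y"]
names(res)   # the key column is returned although only y was requested
```

Appending the sticky column after the requested ones mirrors what the printed dfidx results in this section show, where idx appears last when it is not selected explicitly.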
When [ is used with only one argument or an empty first argument, no line subsetting is performed.
TM[, c("wait", "gcost")] %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   wait gcost    idx
## 1   69    70  1:air
## 2   34    71 1:rain
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
wait1 <- TM[, c("wait"), drop = FALSE]
wait2 <- TM["wait"]
identical(wait1, wait2)
## [1] TRUE
wait1 %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   wait    idx
## 1   69  1:air
## 2   34 1:rain
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
A series can be extracted using any of the following commands:
wait1 <- TM[, "wait"]
wait2 <- TM[["wait"]]
wait3 <- TM$wait
c(identical(wait1, wait2), identical(wait1, wait3))
## [1] TRUE TRUE
The result is an xseries which inherits, as an attribute, the idx column of the data.frame it has been extracted from:
wait1 %>% print(n = 3)
## [1] 69 34 35
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
class(wait1)
## [1] "xseries" "integer"
idx(wait1) %>% print(n = 3)
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
Note that, unless dfidx has been used with drop.index = FALSE, a series which defines the indexes is dropped from the data frame (but is one of the columns of the idx column of the data frame). It can therefore be retrieved using:
TM1$idx$mode %>% head
## [1] air   train bus   car   air   train
## Levels: air train bus car
or
idx(TM1)$mode %>% head
## [1] air   train bus   car   air   train
## Levels: air train bus car
or more simply by applying the $ operator as if the series were a stand-alone series in the data frame:
TM1$mode %>% print(n = 3)
## [1] air   train bus  
## Levels: air train bus car
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
In this last case, the resulting series is an xseries, i.e. it inherits the index data frame as an attribute.
While creating the dfidx, a pkg argument can be indicated, so that the resulting dfidx object and its series are respectively of class c(dfidx_pkg, dfidx) and c(xseries_pkg, xseries) which enables the definition of special methods for dfidx and xseries objects.
Pr2 <- dfidx(Produc, idx = list(c("state", "region"), "year"), pkg = "plm")
gsp1 <- Pr2[, "gsp"]
gsp2 <- Pr2[["gsp"]]
gsp3 <- Pr2$gsp
c(identical(gsp1, gsp2), identical(gsp1, gsp3))
## [1] TRUE TRUE
class(gsp1)
## [1] "xseries_plm" "xseries"     "integer"
For example, suppose we want to define a lag method for xseries_plm objects. When lagging, there should be an NA not only in the first position of the resulting vector, as for time series, but also each time a new individual is encountered. A minimal lag method could therefore be written as:
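lag.xseries_plm <- function(x, ...){
    .idx <- idx(x)
    cls <- class(x)
    x <- unclass(x)
    # first index (the individual), compared as character
    id <- as.character(.idx[[1]])
    lgt <- length(id)
    # individual of the previous observation
    lagid <- c("", id[- lgt])
    sameid <- lagid == id
    # shift the series by one position
    x <- c(NA, x[- lgt])
    # no lagged value for the first observation of each individual
    x[! sameid] <- NA
    structure(x, class = cls, idx = .idx)
}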
lgsp1 <- stats::lag(gsp1)
lgsp1 %>% print(n = 3)
## [1]    NA 38880 38515
## ~~~ indexes ~~~~
##         state region year
## 1 CONNECTICUT      1 1970
## 2 CONNECTICUT      1 1971
## 3 CONNECTICUT      1 1972
## indexes:  1, 1, 2
class(lgsp1)
## [1] "xseries_plm" "xseries"     "integer"
rbind(gsp1, lgsp1)[, 1:20]
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## gsp1  38880 38515 40037 42157 41827 39870 41326 42976 44844 46008 45949 47397
## lgsp1    NA 38880 38515 40037 42157 41827 39870 41326 42976 44844 46008 45949
##       [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## gsp1  47241 50594 55117 58263 61750  8844  8958  9449
## lgsp1 47397 47241 50594 55117 58263    NA  8844  8958
Note the use of stats::lag instead of lag, which ensures that the stats::lag function is used even if the dplyr (or tidyverse) package is attached.
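The same logic can be tried without any package on plain vectors. The helper `lag_by_group` below is hypothetical and only sketches the idea of the method above: shift the series by one position, then blank out the first observation of each group. It assumes that observations are sorted by group.

```r
# Hypothetical helper: lag x by one position within each group g
lag_by_group <- function(x, g) {
  g <- as.character(g)
  n <- length(x)
  if (n == 0L) return(x)
  # shift the series by one position
  prev <- c(NA, x[-n])
  # TRUE when an observation belongs to the same group as the previous one
  same <- c(FALSE, g[-1L] == g[-n])
  # the first observation of each group has no lagged value
  prev[!same] <- NA
  prev
}

x <- c(10, 20, 30, 5, 6)
g <- c("a", "a", "a", "b", "b")
lag_by_group(x, g)
```

With these toy values, the fourth element of the result is NA because it is the first observation of group "b", just as the 18th element of lgsp1 above is NA.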
dfidx supports tibbles. Let us first make our original data.frame a tibble:
TMtb <- as_tibble(TravelMode)
class(TMtb)
## [1] "tbl_df"     "tbl"        "data.frame"
TMtb %>% head(2)
## # A tibble: 2 x 9
##   individual mode  choice  wait vcost travel gcost income  size
##   <fct>      <fct> <fct>  <int> <int>  <int> <int>  <int> <int>
## 1 1          air   no        69    59    100    70     35     1
## 2 1          train no        34    31    372    71     35     1
A tibble adds the classes tbl_df and tbl to a data.frame object. If the first argument of dfidx is a tibble, the resulting object inherits the tibble classes:
TMtb <- dfidx(TMtb, clseries = "pseries")
class(TMtb)
## [1] "dfidx"      "tbl_df"     "tbl"        "data.frame"
Extracting from a dfidx-tibble returns a dfidx-tibble object:
ext1 <- TMtb[c("wait", "vcost")]
ext2 <- TMtb[, c("wait", "vcost")]
ext2 %>% head(2)
## # A tibble: 840 x 3
##    wait vcost idx$individual $mode
## * <int> <int>          <int> <fct>
## 1    69    59              1 air  
## 2    34    31              1 train
## # … with 838 more rows
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
idx(ext2) %>% head(2)
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
identical(ext1, ext2)
## [1] TRUE
un1 <- TMtb[, c("wait"), drop = FALSE]
un2 <- TMtb["wait"]
identical(un1, un2)
## [1] TRUE
sub1 <- TMtb[TMtb$size == 2, c("wait", "vcost")]
sub2 <- TMtb[TMtb$size == 2, "wait", drop = FALSE]
Extracted series are identical to those obtained from a c("dfidx", "data.frame").
wait1 <- TMtb[, "wait"]
wait2 <- TMtb[["wait"]]
wait3 <- TMtb$wait
c(identical(wait1, wait2), identical(wait1, wait3))
## [1] TRUE TRUE
class(idx(wait1))
## [1] "idx"        "data.frame"
dfidx supports some of the verbs of dplyr, namely, for the current version:
- select to select columns,
- filter to select some rows using logical conditions,
- arrange to sort the lines according to one or several variables,
- mutate and transmute for creating new series,
- slice to select some rows using their position.
dplyr verbs don't work with dfidx objects for two main reasons:
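- with most of the verbs (select is an exception), the returned object is a data.frame (or a tibble) and not a dfidx,
- the idx column should be kept even when it is not selected, for example with transmute.
Therefore, specific methods are provided for dplyr's verbs. The general strategy consists of: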
- [...] dfidx object),
- coercing it to a data.frame or a tibble using the as.data.frame method,
- applying dplyr's verb,
- adding back the idx column if necessary (with transmute or while selecting a subset of columns which don't contain the index column),
- [...] data.frame and returning the result.
select(TM, vcost, idx, size) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   vcost    idx size
## 1    59  1:air    1
## 2    31 1:rain    1
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
select(TM, vcost, size) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   vcost size    idx
## 1    59    1  1:air
## 2    31    1 1:rain
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
arrange(TM, income, desc(vcost)) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   choice wait vcost travel gcost income size     idx
## 1     no   69    59     86    68      2    1 208:air
## 2    yes   50    29    265    57      2    1 208:bus
## 
## ~~~ indexes ~~~~
##   individual mode
## 1        208  air
## 2        208  bus
## indexes:  1, 2
mutate(TM, linc = log(income), linc2 = linc ^ 2) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##   choice wait vcost travel gcost income size    idx     linc   linc2
## 1     no   69    59    100    70     35    1  1:air 3.555348 12.6405
## 2     no   34    31    372    71     35    1 1:rain 3.555348 12.6405
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
transmute(TM, linc = log(income), linc2 = linc ^ 2) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 840 
## ~~~~~~~
##       linc   linc2    idx
## 1 3.555348 12.6405  1:air
## 2 3.555348 12.6405 1:rain
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## indexes:  1, 2
filter(TM, wait <= 50, income == 35) %>% print(n = 2)
## ~~~~~~~
##  first 2 observations out of 38 
## ~~~~~~~
##   choice wait vcost travel gcost income size    idx
## 1     no   34    31    372    71     35    1 1:rain
## 2     no   35    25    417    70     35    1  1:bus
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1 train
## 2          1   bus
## indexes:  1, 2
slice(TM, 1:3)
##   choice wait vcost travel gcost income size    idx
## 1     no   69    59    100    70     35    1  1:air
## 2     no   34    31    372    71     35    1 1:rain
## 3     no   35    25    417    70     35    1  1:bus
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          1   air
## 2          1 train
## 3          1   bus
## indexes:  1, 2
To extract a series, the pull function can be used:
pull(TM, vcost)
##  [1]  59  31  25  10  58  31  25  11 115  98
## ~~~ indexes ~~~~
##    individual  mode
## 1           1   air
## 2           1 train
## 3           1   bus
## 4           1   car
## 5           2   air
## 6           2 train
## 7           2   bus
## 8           2   car
## 9           3   air
## 10          3 train
## indexes:  1, 2
The two main steps in R to estimate a model are to use the model.frame function to construct a data.frame from a formula and a data.frame, and then to extract from it the matrix of covariates using the model.matrix function.
The default method of model.frame has formula and data as its first two arguments. It returns a data.frame with a terms attribute. Some other methods exist in the stats package, for example for lm and glm objects, with a first and main argument called formula. This is quite unusual and misleading, as for most generic functions in R the first argument is called either x or object.
Another noteworthy method for model.frame is provided by the Formula package; in this case the first argument is a Formula object, which is an extended formula that can contain several parts on the left-hand and/or right-hand side.
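As a quick base R illustration of the default behaviour described above, using the built-in mtcars data set (chosen here only for convenience), the object returned by model.frame is a data.frame that contains the variables of the formula and carries a terms attribute:

```r
# Default method: formula first, then the data
mf <- model.frame(mpg ~ wt + log(hp), data = mtcars, subset = cyl == 4)

class(mf)                              # a plain data.frame
names(mf)                              # one column per term of the formula
inherits(attr(mf, "terms"), "terms")   # the terms object is stored as an attribute
nrow(mf)                               # only the rows kept by subset
```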
We provide a model.frame method for dfidx objects, mainly because the idx column should be returned in the resulting data.frame. This leads to an unusual order of the arguments: the data frame first and then the formula. The method first extracts the idx column (and subsets it if necessary), calls the formula/Formula method and then adds the idx column to the resulting data frame.
mfTM <- model.frame(TM, choice ~ vcost | income + size | travel, subset = income > 50)
mfTM %>% print(n = 3)
## ~~~~~~~
##  first 3 observations out of 156 
## ~~~~~~~
##   choice vcost income size travel    idx
## 1     no    49     70    3     68  4:air
## 2     no    26     70    3    354 4:rain
## 3     no    21     70    3    399  4:bus
## 
## ~~~ indexes ~~~~
##   individual  mode
## 1          4   air
## 2          4 train
## 3          4   bus
## indexes:  1, 2
attr(mfTM, "terms")
## choice ~ vcost + (income + size) + travel + (individual + mode)
## attr(,"variables")
## list(choice, vcost, income, size, travel, individual, mode)
## attr(,"factors")
##            vcost income size travel individual mode
## choice         0      0    0      0          0    0
## vcost          1      0    0      0          0    0
## income         0      1    0      0          0    0
## size           0      0    1      0          0    0
## travel         0      0    0      1          0    0
## individual     0      0    0      0          1    0
## mode           0      0    0      0          0    1
## attr(,"term.labels")
## [1] "vcost"      "income"     "size"       "travel"     "individual"
## [6] "mode"      
## attr(,"order")
## [1] 1 1 1 1 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(choice, vcost, income, size, travel, individual, mode)
## attr(,"dataClasses")
##     choice      vcost     income       size     travel individual       mode 
##   "factor"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"   "factor"
attr(mfTM, "formula")
## choice ~ vcost | income + size | travel
model.matrix is a generic function; for the default method, the first two arguments are a terms object and a data.frame. In lm, the terms object is extracted from the model frame constructed internally using the model.frame function. This means that, at least in this context, model.matrix doesn't need a formula/terms argument and a data.frame, but only a data.frame returned by model.frame, i.e. a data.frame with a terms attribute.
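This can be checked with base R alone. In the sketch below (again on the built-in mtcars data, an arbitrary choice), the model matrix is built from the model frame only, the terms being read from its attribute, and it matches the matrix obtained by passing the formula and the data directly.

```r
mf <- model.frame(mpg ~ wt + factor(cyl), data = mtcars)

# The terms are read from the model frame itself
X1 <- model.matrix(attr(mf, "terms"), mf)

# Usual call, with a formula and a data frame
X2 <- model.matrix(mpg ~ wt + factor(cyl), data = mtcars)

all.equal(X1, X2, check.attributes = FALSE)
colnames(X1)
```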
We use this idea for the model.matrix method for dfidx objects: the only required argument is a dfidx returned by the model.frame function. The formula is then extracted from the dfidx and the Formula or default method is then called.
model.matrix(mfTM, rhs = 1) %>% head(2)
##   (Intercept) vcost
## 1           1    49
## 2           1    26
model.matrix(mfTM, rhs = 2) %>% head(2)
##   (Intercept) income size
## 1           1     70    3
## 2           1     70    3
model.matrix(mfTM, rhs = 1:3) %>% head(2)
##   (Intercept) vcost income size travel
## 1           1    49     70    3     68
## 2           1    26     70    3    354