---
title: "Using missRanger"
date: "`r Sys.Date()`"
bibliography: "biblio.bib"
link-citations: true
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using missRanger}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## Overview

{missRanger} is a **multivariate imputation algorithm** based on random forests. It is a fast alternative to the beautiful 'MissForest' algorithm of @stekhoven, and uses the {ranger} package [@wright] to fit the random forests.

The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration.

- {missRanger} is **fast**.
- It allows for **out-of-sample applications**.
- It is **intuitive**: E.g., calling `missRanger(data, . ~ 1)` would impute all variables univariately, while `missRanger(data, Species ~ Sepal.Width)` would use `Sepal.Width` to impute `Species`.
- It works for a **variety of data types**.
- It combines random forest imputation with **predictive mean matching**. This avoids "new" values like 0.3334 in a 0-1 coded variable and helps to raise the variance of the imputations, which is especially important for **multiple imputation** (see additional vignettes).

## Installation

```r
# From CRAN
install.packages("missRanger")

# Development version
devtools::install_github("mayer79/missRanger")
```

## Usage

```{r}
library(missRanger)

set.seed(3)

iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)

imp <- missRanger(iris_NA, num.trees = 100)
head(imp)
```

### Predictive mean matching

It worked, but the new values appear overly exact. To avoid this, we can add predictive mean matching (PMM) to the OOB predictions:

```{r}
imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)
head(imp)
```

### Controlling the random forests

`missRanger()` offers many options.
How would we use one feature per split (`mtry = 1`) with 200 trees?

```{r}
imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 200, mtry = 1, verbose = 0)
```

### Extended output

Setting `data_only = FALSE` (or `keep_forests = TRUE`) returns a "missRanger" object. With `keep_forests = TRUE`, this allows for out-of-sample applications:

```{r}
imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)
imp
summary(imp)

# Out-of-sample application
# saveRDS(imp, file = "imputation_model.rds")
# imp <- readRDS("imputation_model.rds")
predict(imp, head(iris_NA))
```

### Formulas

By default, `missRanger()` uses all columns to impute all columns with missings. This can be modified by passing a formula: The left hand side specifies the variables to be imputed, while the right hand side lists the variables used for imputation.

```{r}
# Impute all variables with all (default)
m <- missRanger(iris_NA, formula = . ~ ., pmm.k = 5, num.trees = 100, verbose = 0)

# Don't use Species for imputation
m <- missRanger(iris_NA, . ~ . - Species, pmm.k = 5, num.trees = 100, verbose = 0)

# Impute Sepal.Length by Species (or not?)
m <- missRanger(iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Only univariate imputation was done! Why? Because Species contains missing values
# itself and needs to appear on the LHS as well:
m <- missRanger(iris_NA, Sepal.Length + Species ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Impute all variables univariately
m <- missRanger(iris_NA, . ~ 1, verbose = 0)
```

### Speeding things up

`missRanger()` fits a random forest per variable and iteration. Thus, imputation can take long. Some tweaks:

- Use fewer trees, e.g., `num.trees = 100`.
- Use a smaller tree depth, e.g., `max.depth = 6`.
- Use larger leaves, e.g., `min.node.size = 100`.
- Use smaller bootstrap samples, e.g., `sample.fraction = 0.2`.
- Use fewer iterations, e.g., `maxiter = 3`.

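
The tweaks listed above can be combined in a single call. The following is a minimal sketch, not a recommendation: the values are illustrative and deliberately mild because `iris` has only 150 rows, and the object name `imp_fast` is made up for this example. Note that the iteration cap is passed via the `maxiter` argument of `missRanger()`.

```r
library(missRanger)

# Illustrative values only; iris is tiny, so tune these for real data
set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

imp_fast <- missRanger(
  iris_NA,
  num.trees = 50,         # fewer trees per forest
  max.depth = 6,          # shallower trees
  min.node.size = 10,     # larger leaves
  sample.fraction = 0.5,  # smaller bootstrap samples
  maxiter = 3,            # cap on the number of iterations
  verbose = 0
)

# Check whether any missing values remain
anyNA(imp_fast)
```

On larger data, the speed-ups typically come at some cost in imputation quality, so it may be worth comparing results against a run with default settings on a subsample.
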

The first three tweaks listed above also help to greatly reduce the size of the models, which might become relevant in out-of-sample applications with `keep_forests = TRUE`.

### Trick: Use `case.weights` to reduce impact of rows with many missings

Using the `case.weights` argument, you can pass case weights to the imputation models. For instance, this allows you to reduce the contribution of rows with many missings:

```r
m <- missRanger(
  iris_NA,
  num.trees = 100,
  pmm.k = 5,
  case.weights = rowSums(!is.na(iris_NA))
)
```

## References