---
title: "Using 'splitTools'"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using 'splitTools'}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## Overview

{splitTools} is a fast, lightweight toolkit for data splitting. 

Its two main functions `partition()` and `create_folds()` support

- data partitioning (e.g. into training, validation and test),
- creating (in- or out-of-sample) folds for cross-validation (CV),
- creating *repeated* folds for CV,
- stratified splitting, 
- grouped splitting as well as
- blocked splitting (if the sequential order of the data should be retained).

The function `create_timefolds()` does time-series splitting in the sense that the out-of-sample data follows the in-sample data.

We will now illustrate how to use {splitTools} in a typical modeling workflow.

## Usage

### Simple validation

We will go through the following steps:

1. We split the `iris` data into 60% training, 20% validation, and 20% test data, stratified by the variable `Sepal.Length`. Since this variable is numeric, stratification uses quantile binning.
2. We will model the response `Sepal.Length` with a linear regression, once with and once without interaction between `Species` and `Sepal.Width`.
3. After selecting the better of the two models via validation RMSE, we evaluate the final model on the test data.

```{r}
library(splitTools)

# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)

train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]

rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)

rmse(valid$Sepal.Length, predict(fit1, valid))
rmse(valid$Sepal.Length, predict(fit2, valid))

# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
```

### CV

Since the `iris` data consists of only 150 rows, investing 20% of observations for validation seems like a waste of resources. Furthermore, the performance estimates might not be very robust. Let's replace simple validation by five-fold CV, again using stratification on the response variable.

1. Split `iris` into 80% training data and 20% test, stratified by the variable `Sepal.Length`. 
2. Use stratified five-fold CV to choose between the two models.
3. We evaluate the final model on the test data.

```{r}
# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)

train <- iris[inds$train, ]
test <- iris[inds$test, ]

# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)

# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)

for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)

# CV-RMSE of model 2
mean(cv_rmse2)

# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
```

### Repeated CV

If feasible, *repeated* CV is recommended in order to reduce uncertainty in decisions. Otherwise, the process remains the same.

```{r}
# Train/test split as before

# 15 folds instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)

# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

mean(cv_rmse1)
mean(cv_rmse2)

# Refit and test as before
```

### Stratification on multiple columns

The function `multi_strata()` creates a stratification factor from multiple columns that can then be passed to `create_folds(, type = "stratified")` or `partition(, type = "stratified")`. The resulting partitions will be (quite) balanced regarding these columns. 

Two grouping strategies are offered:

1. k-means clustering based on scaled input.
2. All combinations of columns, where numeric input is being binned.

Let's have a look at a simple example where we want to model "Sepal.Width" as a function of the other variables in the iris data set. We want to do a stratified train/valid/test split, aiming at being balanced regarding not only the response "Sepal.Width", but also regarding the important predictor "Species". In this case, we could use the following workflow:

```{r}
set.seed(3451)

ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y, p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
)

# Check
by(ir, inds, summary)
```