---
title: "Feature Engineering"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{feature-engineering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Automated feature engineering is a cornerstone of the package. Below are some of the techniques we use in multivariate machine learning models, and the outside packages that make them possible.

## Missing Data and Outliers

Missing data is filled in using the [pad_by_time](https://business-science.github.io/timetk/reference/pad_by_time.html) function from the timetk package. First, each time series is grouped and padded using its existing start and end dates, with missing values padded as NA. Then the same process is run again, this time padding data from the `hist_start_date` from `forecast_time_series()`, with missing values filled in with zero. This ensures that missing data before a time series starts is set to zero, while missing periods within the existing time series are flagged as NA so they can be imputed in the next step.

After missing data is padded, the [ts_impute_vec](https://business-science.github.io/timetk/reference/ts_impute_vec.html) function from the timetk package is called to impute any NA values. This only happens if the `clean_missing_values` input from `forecast_time_series()` is set to TRUE; otherwise NA values are replaced with zero.

Outliers are handled using the [ts_clean_vec](https://business-science.github.io/timetk/reference/ts_clean_vec.html) function from the timetk package. Outliers are replaced after the missing data process, and only if the `clean_outliers` input from `forecast_time_series()` is set to TRUE.

**Important Note:** Missing values and outliers are replaced for the target variable and any numeric external regressors.

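
As a rough illustration of the padding, imputation, and outlier steps described above, the sketch below applies the three timetk functions to a made-up monthly series. The `toy_tbl` data and the `period = 1` settings are assumptions for illustration, not the values finnts uses internally, and the second zero-padding pass from `hist_start_date` is omitted for brevity.

```{r, message = FALSE}
library(dplyr)
library(timetk)

# Hypothetical toy data: 24 monthly observations for a single series
toy_tbl <- tibble::tibble(
  id    = "A",
  date  = seq(as.Date("2020-01-01"), by = "month", length.out = 24),
  value = 100 + seq_len(24) + rep(c(2, -1, 0, 1), times = 6)
) %>%
  # drop one month to create a gap inside the series
  dplyr::filter(date != as.Date("2020-09-01")) %>%
  # inject one extreme value to act as an outlier
  dplyr::mutate(value = ifelse(date == as.Date("2021-06-01"), 1000, value))

# Step 1: restore the missing month as an explicit row with an NA value
padded_tbl <- toy_tbl %>%
  dplyr::group_by(id) %>%
  timetk::pad_by_time(date, .by = "month", .pad_value = NA) %>%
  dplyr::ungroup()

# Step 2: impute the NA value, then Step 3: clean outliers
# (period = 1 is an assumption that keeps the example non-seasonal)
cleaned_tbl <- padded_tbl %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(
    value = timetk::ts_impute_vec(value, period = 1),
    value = timetk::ts_clean_vec(value, period = 1)
  ) %>%
  dplyr::ungroup()

print(cleaned_tbl)
```

Running imputation before outlier cleaning mirrors the order described in this section, so the outlier step always sees a complete series.
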
## Box-Cox

A Box-Cox transformation stabilizes the variance in each time series, using the [box_cox_vec](https://business-science.github.io/timetk/reference/box_cox_vec.html) function from the timetk package. It is applied to both the target variable and any external regressors before other transformations like differencing. You can control this within `prep_models()`.

## Differencing

finnts uses the [feasts](https://feasts.tidyverts.org/reference/unitroot_ndiffs.html) package to check whether each time series is stationary, and applies the differencing required (up to two standard differences with lag one) to make the time series stationary. The differencing itself is done with the [diff_vec](https://business-science.github.io/timetk/reference/diff_vec.html) function from the timetk package. This is applied to the target variable and any external regressors before other features are created. Data is undifferenced before training for univariate models like ARIMA, but differenced data is used for all multivariate models. You can control the differencing done within `prep_models()`.

## Date Features

The [tk_augment_timeseries_signature](https://business-science.github.io/timetk/reference/tk_augment_timeseries.html) function from the timetk package extracts various date features from the time stamp. The function doesn't differentiate between date types, so features need to be removed depending on the date type. For example, features related to week and day are automatically removed for a monthly forecast. Fourier series are also added using the [tk_augment_fourier](https://business-science.github.io/timetk/reference/tk_augment_fourier.html) function from timetk.

```{r, message = FALSE}
library(dplyr)
library(timetk)

m4_monthly %>%
  timetk::tk_augment_timeseries_signature(date) %>%
  dplyr::group_by(id) %>%
  timetk::tk_augment_fourier(date, .periods = c(3, 6, 12), .K = 1) %>%
  dplyr::ungroup()
```

## Lags, Rolling Windows, and Polynomial Transformations

Lags of the target variable and external regressors are created using the [tk_augment_lags](https://business-science.github.io/timetk/reference/tk_augment_lags.html) function from timetk.

Rolling window calculations of the target variable are created using the [tk_augment_slidify](https://business-science.github.io/timetk/reference/tk_augment_slidify.html) function from timetk. The following calculations are created over various window values.

- sum
- mean
- standard deviation

Polynomial transformations are created for the target variable, and lags are then created on top of them. The following transformations are created.

- squared
- cubed
- log

## Custom Approaches

In addition to the standard approaches above, finnts offers two different recipes for preparing features for a multivariate machine learning model.

The first recipe, referred to as "R1" in default finnts models, takes a single step horizon approach by default. This means all of the engineered target and external regressor features are used, but the lags cannot be less than the forecast horizon. For example, on a monthly data set with a forecast horizon of 3, finnts will take engineered features like lags and rolling window features but only use those lags that are for periods equal to or greater than 3.

You can also run a multistep horizon approach by setting `multistep_horizon` to TRUE in `prep_data()`. The multistep approach creates features that can be used by specific multivariate models that optimize for each period in a forecast horizon. More on this in the "models used in finnts" vignette.

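
The single step horizon constraint can be sketched with timetk directly. The example below is an illustration under assumed settings (the `candidate_lags` vector and the rolling window sizes are hypothetical, not the finnts defaults): it drops any lag shorter than the forecast horizon and then builds rolling means on top of the shortest usable lag.

```{r, message = FALSE}
library(dplyr)
library(timetk)

forecast_horizon <- 3

# Hypothetical candidate lags; only those >= the forecast horizon are kept
candidate_lags <- c(1, 2, 3, 6, 12)
usable_lags <- candidate_lags[candidate_lags >= forecast_horizon]

lagged_tbl <- timetk::m4_monthly %>%
  dplyr::filter(id == "M2") %>%
  # creates value_lag3, value_lag6, and value_lag12
  timetk::tk_augment_lags(value, .lags = usable_lags) %>%
  # rolling means computed on the lagged target, so no information
  # from inside the forecast horizon is used
  timetk::tk_augment_slidify(
    value_lag3,
    .period  = c(3, 6),
    .f       = mean,
    .align   = "right",
    .partial = FALSE
  )

names(lagged_tbl)
```

Building the rolling windows on the lagged column rather than the raw target is one simple way to keep every feature available at prediction time.
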
Recursive forecasting is not supported in finnts multivariate machine learning models, since feeding forecast outputs as features to create another forecast adds complex layers of uncertainty that can easily spiral out of control and produce poor forecasts. NA values created by generating lag features are filled "up". This results in the first few periods of a time series having some data leakage, but the effect should be small if the time series is long enough.

```{r, message = FALSE}
library(finnts)

hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

run_info <- set_run_info(
  experiment_name = "finnts_fcst",
  run_name = "R1_run"
)

prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R1",
  multistep_horizon = FALSE
)

R1_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R1"
)

print(R1_prepped_data_tbl)
```

The second recipe is referred to as "R2" in default finnts models. It takes a very different approach from the "R1" recipe. For a 3 month forecast horizon on a monthly dataset, target and rolling window features are created depending on the horizon period. They are also constrained to be equal to or less than the forecast horizon. In the example below, "Origin" and "Horizon" features are created for each time period. This duplicates rows in the original data set to create new features that are specific to each horizon period, which helps the default finnts models find new relationships to model compared to the more formal approach in "R1". NA values created by generating lag features are filled "up".

```{r, message = FALSE}
library(finnts)

hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

run_info <- set_run_info(
  experiment_name = "finnts_fcst",
  run_name = "R2_run"
)

prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R2"
)

R2_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R2"
)

print(R2_prepped_data_tbl)
```

## Model Specific Preprocessing

In addition to everything called out above, some models have their own specific transformations that need to be applied before training. For example, the "glmnet" model needs categorical variables transformed into continuous variables and the data centered and scaled before training. Each default model in finnts has its own preprocessing steps that ensure the data fed into the model has the best chance of producing a high quality forecast. The [recipes](https://recipes.tidymodels.org/) package is used to apply the preprocessing transformations needed before training a model.
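
To make the glmnet example above more concrete, here is a minimal sketch of that kind of preprocessing with the recipes package. The `toy_tbl` data, its column names, and the choice of `step_dummy()` followed by `step_normalize()` are assumptions for illustration only; the actual steps finnts applies for each model may differ.

```{r, message = FALSE}
library(dplyr)
library(recipes)

# Hypothetical prepared data: one categorical and two numeric features
toy_tbl <- data.frame(
  target      = c(10, 12, 9, 15, 14, 11),
  month_label = factor(c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar")),
  lag_3       = c(8, 9, 10, 12, 9, 15),
  roll_mean   = c(9.0, 9.5, 10.3, 11.0, 11.2, 12.0)
)

glmnet_style_recipe <- recipes::recipe(target ~ ., data = toy_tbl) %>%
  # turn categorical predictors into numeric indicator columns
  recipes::step_dummy(recipes::all_nominal_predictors(), one_hot = TRUE) %>%
  # center and scale every numeric predictor, including the new indicators
  recipes::step_normalize(recipes::all_numeric_predictors())

# Estimate the steps on the training data and return the transformed data
baked_tbl <- glmnet_style_recipe %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)

print(baked_tbl)
```

Because the centering and scaling parameters are estimated in `prep()`, the same prepared recipe can later be applied to new data with `bake()` without re-estimating them.
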