When dividing your data for model training and validation, DataRobot typically assigns rows at random to the different cross-validation folds. This helps verify that the model has not overfit the training set and can perform well on new data. However, when your data has an intrinsic time-based component, you must be careful to always use data from the past to predict the future and never use the future to predict the past. The latter is known as lookahead bias and can be thought of as another form of data leakage. DataRobot now provides datetime partitioning, which guards against lookahead bias during model training and validation.
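To make the difference concrete, here is a minimal base R sketch that contrasts a random split with a time-ordered split. It uses made-up toy data rather than the LendingClub dataset, and the `has_lookahead` helper is a hypothetical name introduced only for this illustration; it is not part of the datarobot package.

```r
# Toy data: 100 hypothetical loans, one issued per day (not the LendingClub data)
set.seed(42)
loans <- data.frame(
  issued = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  is_bad = rbinom(100, 1, 0.2)
)

# Random split: validation rows end up scattered through time
random_idx   <- sample(nrow(loans), 80)
random_train <- loans[random_idx, ]
random_valid <- loans[-random_idx, ]

# Time-ordered split: train on the earliest 80 rows, validate on the latest 20
ordered    <- loans[order(loans$issued), ]
time_train <- ordered[1:80, ]
time_valid <- ordered[81:100, ]

# Hypothetical helper: a split has lookahead when any training row is dated
# on or after the earliest validation row
has_lookahead <- function(train, valid) max(train$issued) >= min(valid$issued)

random_leaks <- has_lookahead(random_train, random_valid)
time_leaks   <- has_lookahead(time_train, time_valid)
```

In the random split, some training rows are almost certainly dated later than some validation rows, so the model would be using the future to predict the past. The time-ordered split avoids this by construction.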
Let’s look at how we would frame a problem with a time component
within DataRobot. We will use a sample dataset from LendingClub, similar
to the Prediction Explanations Vignette. We want to train the model on
historical loans and validate it on recent loans, so we will use a
datetime partition. With datetime partitioning, cross-validation folds
are known as backtests, with each backtest corresponding to a sliding
window of historical training data and more recent validation data. By
default DataRobot will create a single backtesting window. We can
control the number of backtests to use (up to 10), so we will use 5, as
is common practice for cross-sectional problems.
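The sliding-window idea can be sketched in a few lines of base R. The `make_backtests` helper below is hypothetical and is not DataRobot's date-selection logic; the end date, window lengths, and numbering from the most recent window are all assumptions chosen for illustration.

```r
# Hypothetical helper: NOT DataRobot's date-selection logic, only an illustration
# of how sliding training/validation windows relate to each other.
make_backtests <- function(end_date, n_backtests, validation_days, training_days) {
  do.call(rbind, lapply(seq_len(n_backtests), function(i) {
    # Window 1 is the most recent; each later window slides back by one
    # validation period
    valid_end   <- end_date - (i - 1) * validation_days
    valid_start <- valid_end - validation_days
    data.frame(
      backtest    = i,
      train_start = valid_start - training_days,
      train_end   = valid_start,  # training stops where validation begins
      valid_start = valid_start,
      valid_end   = valid_end
    )
  }))
}

windows <- make_backtests(as.Date("2020-12-31"), n_backtests = 5,
                          validation_days = 100, training_days = 365)
print(windows)
```

Each row describes one backtest. Training data always ends where the validation window starts, so every window uses only the past to predict the future.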
Let’s load the datarobot package, read in the data, and start a project with a datetime partition:
library(datarobot)
lending <- read.csv("https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv")
partition <- CreateDatetimePartitionSpecification(datetimePartitionColumn = "earliest_cr_line",
                                                  numberOfBacktests = 5)
proj <- StartProject(dataSource = lending,
                     projectName = "Lending_Club_Time_Series",
                     target = "is_bad",
                     mode = "quick",
                     partition = partition)
Above, we specified only the number of backtests and let DataRobot select the partition dates automatically. DataRobot also allows finer control: for each backtest we can specify the validation start date as well as the validation duration. Let’s look at an example below.
backtest <- list()
# Dates are not project specific but rather example dates
backtest[[1]] <- CreateBacktestSpecification(0, ConstructDurationString(),
                                             "1989-12-01", ConstructDurationString(days = 100))
backtest[[2]] <- CreateBacktestSpecification(1, ConstructDurationString(), "1999-10-01",
                                             ConstructDurationString(days = 100))
# create desired partition specification
partition <- CreateDatetimePartitionSpecification("earliest_cr_line",
                                                  numberOfBacktests = 2,
                                                  backtests = backtest)
Let’s continue with our original project. Often when training time-based models, we would like to iterate within our workflow by running all the backtest folds for a model to verify its stability. Finally, we can retrain the best model on a larger or more recent time slice to prepare it for deployment. Let’s look at how we can accomplish these actions below:
# Request more granular information on the datetime partition specification
GetDatetimePartition(proj)
# View blueprints associated with a project
bps <- ListBlueprints(proj)
# View the models within the model leaderboard
models <- ListModels(proj)
# Retrieve a datetime model. There is now a new retrieval function specific to datetime partitioning
dt_model <- GetDatetimeModel(proj, models[[1]]$modelId)
# Score all Backtests
scoreJobId <- ScoreBacktests(dt_model)
WaitForJobToComplete(proj, scoreJobId) # To make synchronous
# now model information will also contain information about backtest scores
dtModelWithBt <- GetDatetimeModel(proj, dt_model$modelId)
# Retrain a model using a different start & end date.
# One has to request a `Frozen` model to keep the hyper-parameters static and avoid lookahead bias.
# Within the context of deployment, this can be used to retrain a resulting model on more recent data.
UpdateProject(proj, holdoutUnlocked = TRUE) # If retraining on 100% of the data, we need to unlock the holdout set.
modelJobId_frozen <- RequestFrozenDatetimeModel(dt_model,
                                                trainingStartDate = as.Date("1950/12/1"),
                                                trainingEndDate = as.Date("1998/3/1"))
new_dt_model_frozen <- GetDatetimeModelFromJobId(proj, modelJobId_frozen)
# Train & retrieve a new datetime model based on row count
modelJobId <- RequestNewDatetimeModel(proj, bps[[1]], trainingRowCount = 100)
new_dt_model <- GetDatetimeModelFromJobId(proj, modelJobId)
# Train & retrieve a new datetime model based on duration
modelJobId <- RequestNewDatetimeModel(proj, bps[[1]],
                                      trainingDuration = ConstructDurationString(months = 10))
new_dt_model <- GetDatetimeModelFromJobId(proj, modelJobId)