The V2.7 release of the DataRobot API added the following model insights: Lift Chart, ROC Curve, and Word Cloud.
The insights provided by the Lift Chart and ROC Curve help in checking the performance of machine learning models. Word clouds help in understanding useful words and phrases generated after applying different NLP techniques to unstructured data. We will explore each of these in detail.
To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, two pieces of information are required: an endpoint (the URL of the specific DataRobot server being used) and a token (a previously validated access token).
The token is unique for each DataRobot modeling engine account and can be accessed using the DataRobot webapp in the account profile section.
The endpoint depends on the DataRobot modeling engine installation (cloud-based vs. on-premise) you are using. Contact your DataRobot admin for information on which endpoint to use if you do not know. The endpoint for DataRobot cloud accounts is https://app.datarobot.com/api/v2.
The first access method uses a YAML configuration file with these two elements, labeled token and endpoint, located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is automatically established during library(datarobot). It is also possible to establish a connection using this YAML file via the ConnectToDataRobot() function, by specifying the configPath parameter.
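As a rough illustration of the file format, the sketch below writes a two-key YAML file to a temporary location and prints it back. The endpoint is the cloud URL quoted above; the token is a placeholder, and the temporary path is only a stand-in for $HOME/.config/datarobot/drconfig.yaml. The ConnectToDataRobot() call is left commented out because it needs a real token.

```r
# Sketch only: write a minimal drconfig.yaml-style file to a temporary path.
# The token below is a placeholder, not a working credential.
configPath <- file.path(tempdir(), "drconfig.yaml")
writeLines(c("endpoint: https://app.datarobot.com/api/v2",
             "token: <YOUR API TOKEN>"),
           configPath)
cat(readLines(configPath), sep = "\n")

# With a real token in the file, the connection could then be made with:
# library(datarobot)
# ConnectToDataRobot(configPath = configPath)
```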
The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the endpoint and token parameters.
library(datarobot)
ConnectToDataRobot(endpoint = "http://<YOUR DR SERVER>/api/v2", token = "<YOUR API TOKEN>")
We will be using the Lending Club dataset, a sample dataset related to credit scoring open-sourced by LendingClub. We can create a project with this dataset like this:
lendingClubURL <- "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv"
project <- StartProject(dataSource = lendingClubURL,
                        projectName = "AdvancedModelInsightsVignette",
                        mode = "auto",
                        target = "is_bad",
                        workerCount = "max",
                        wait = TRUE)
Once the modeling process has completed, the ListModels function returns an S3 object of class listOfModels that characterizes all of the models in a specified DataRobot project. It is important to use WaitForAutopilot before calling ListModels, as the function will return only a partial list (and a warning) if the autopilot is not yet complete.
results <- as.data.frame(ListModels(project))
saveRDS(results, "resultsModelInsights.rds")
library(knitr)
kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE)
| | modelType | expandedModel | modelId | blueprintId | featurelistName | featurelistId | samplePct | validationMetric |
---|---|---|---|---|---|---|---|---|
1 | Gradient Boosted Trees Classifier with Early Stopping | Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dcfe157256402b66684 | 76406c9c52dc3f6a3a0ba8442fa17601 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36472 |
2 | eXtreme Gradient Boosted Trees Classifier with Early Stopping | eXtreme Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dd0e157256402b66694 | 5964b39390e51b69a82d9a8dab7b2675 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36562 |
3 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0153 | 81092c05cb849904f6b737b767799660 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36564 |
4 | AVG Blender | AVG Blender::Average Blender | 5efa23be20433938c72a014f | c294bee1a436f6f034fd680aa752b9d5 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36566 |
5 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0155 | 83d1a0ca93741bd8ef06bfc47c75ac33 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36567 |
6 | Advanced AVG Blender | Advanced AVG Blender::Average Blender | 5efa23c020433938c72a0151 | c40db7cd1b9d3ee12d17c0369639cb3a | Multiple featurelists | Multiple featurelist ids | 64 | 0.36639 |
Lift chart data can be retrieved for a specific data partition (validation, cross-validation, or holdout) or for all the data partitions using GetLiftChart and ListLiftCharts. To retrieve the data for the holdout partition, the holdout must be unlocked first.
Let’s retrieve the validation partition data for the top model using GetLiftChart. The GetLiftChart function returns data for the validation partition by default. We can retrieve data for a specific data partition by passing a value to the source parameter of GetLiftChart.
project <- GetProject("5eed0d790ef80408ae212f09")
allModels <- ListModels(project)
saveRDS(allModels, "modelsModelInsights.rds")
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric
if (project$metric %in% c('AUC', 'Gini Norm')) {
  bestIndex <- which.max(metric)
} else {
  bestIndex <- which.min(metric)
}
bestModel <- allModels[[bestIndex]]
bestModel$modelType
[1] "Gradient Boosted Greedy Trees Classifier with Early Stopping"
This selects a Gradient Boosted Greedy Trees Classifier with Early Stopping model.
The lift chart data we retrieve from the server includes the mean of the model prediction and the mean of the actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.
lc <- GetLiftChart(bestModel)
saveRDS(lc, "liftChartModelInsights.rds")
head(lc)
      actual  predicted binWeight
1 0.00000000 0.01877918        27
2 0.03703704 0.02476968        27
3 0.00000000 0.02867826        26
4 0.00000000 0.03207965        27
5 0.07407407 0.03540244        27
6 0.03846154 0.03865136        26
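To make the binning concrete, here is a minimal sketch that builds lift-chart-style data from simulated predictions. It is not DataRobot's implementation: the simulated data, the helper name makeLiftBins, and the equal-size binning rule are all assumptions made for illustration.

```r
set.seed(42)
n <- 1600
predicted <- runif(n)                            # simulated predicted probabilities
actual <- rbinom(n, size = 1, prob = predicted)  # simulated 0/1 outcomes

# Hypothetical helper: sort by prediction, split into bins, average each bin.
makeLiftBins <- function(actual, predicted, nBins = 60) {
  ord <- order(predicted)                        # ascending by prediction
  # Assign the sorted rows to nBins bins of (nearly) equal size
  bin <- cut(seq_along(ord), breaks = nBins, labels = FALSE)
  data.frame(actual = as.numeric(tapply(actual[ord], bin, mean)),
             predicted = as.numeric(tapply(predicted[ord], bin, mean)),
             binWeight = as.numeric(table(bin)))
}

toyLift <- makeLiftBins(actual, predicted)
head(toyLift)
```

Each row of toyLift corresponds to one bin, with binWeight holding the number of rows that fell into it.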
ValidationLiftChart <- GetLiftChart(bestModel, source = "validation")
dr_dark_blue <- "#08233F"
dr_blue <- "#1F77B4"
dr_orange <- "#FF7F0E"

# Function to aggregate the 60 lift chart bins into fewer bins for plotting
library(data.table)
LiftChartPlot <- function(ValidationLiftChart, bins = 10) {
  if (60 %% bins == 0) {
    ValidationLiftChart$bins <- rep(seq(bins), each = 60 / bins)
    ValidationLiftChart <- data.table(ValidationLiftChart)
    ValidationLiftChart[, actual := mean(actual), by = bins]
    ValidationLiftChart[, predicted := mean(predicted), by = bins]
    unique(ValidationLiftChart[, -"binWeight"])
  } else {
    "Please provide bins less than 60 and divisor of 60"
  }
}
LiftChartData <- LiftChartPlot(ValidationLiftChart)
saveRDS(LiftChartData, "LiftChartDataVal.rds")
par(bg = dr_dark_blue)
# Column names are lowercase, matching the GetLiftChart output
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")
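The LiftChartPlot helper averages the original bins without weights. Since binWeight differs slightly between bins (26 or 27 rows in the head(lc) output), a weighted average is one possible refinement. The base-R sketch below runs on made-up data; rebinWeighted and toyChart are hypothetical names, and the function assumes the number of rows is a multiple of bins.

```r
# Made-up 60-bin lift chart with the same columns as the GetLiftChart() output.
set.seed(1)
toyChart <- data.frame(actual = sort(runif(60)),
                       predicted = sort(runif(60)),
                       binWeight = rep(c(27, 27, 26), 20))

# Hypothetical helper: merge adjacent bins using binWeight-weighted means.
rebinWeighted <- function(chart, bins = 10) {
  stopifnot(nrow(chart) %% bins == 0)
  group <- rep(seq_len(bins), each = nrow(chart) / bins)
  # Weighted mean of one column within each group of original bins
  wmean <- function(col) {
    as.numeric(tapply(seq_len(nrow(chart)), group, function(i) {
      weighted.mean(chart[[col]][i], chart$binWeight[i])
    }))
  }
  data.frame(bins = seq_len(bins),
             actual = wmean("actual"),
             predicted = wmean("predicted"),
             binWeight = as.numeric(tapply(chart$binWeight, group, sum)))
}

toyRebinned <- rebinWeighted(toyChart)
toyRebinned
```

With nearly equal bin weights, as here, the weighted and unweighted results differ only marginally; the distinction matters more when bin sizes vary widely.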
All the available lift chart data can be retrieved using ListLiftCharts. Here is an example retrieving data for all the available partitions, followed by plotting the cross-validation partition:
AllLiftChart <- ListLiftCharts(bestModel)
LiftChartData <- LiftChartPlot(AllLiftChart[["crossValidation"]])
saveRDS(LiftChartData, "LiftChartDataCV.rds")
par(bg = dr_dark_blue)
# Column names are lowercase, matching the lift chart data
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")
We can also plot the lift chart using ggplot2:
library(ggplot2)
lc$actual <- lc$actual / lc$binWeight
lc$predicted <- lc$predicted / lc$binWeight
lc <- lc[order(lc$predicted), ]
lc$binWeight <- NULL
lc <- data.frame(value = c(lc$actual, lc$predicted),
                 variable = c(rep("Actual", length(lc$actual)),
                              rep("Predicted", length(lc$predicted))),
                 id = rep(seq_along(lc$actual), 2))
ggplot(lc) + geom_line(aes(x = id, y = value, color = variable))
The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
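As an illustration of how such points arise, the following sketch computes TPR and FPR over a grid of thresholds on simulated scores. The data and the name rocPointsToy are made up; the column names mirror those used for the DataRobot ROC points later in this document, but the computation is a generic textbook one, not DataRobot's.

```r
set.seed(7)
score <- runif(500)                           # simulated model scores
label <- rbinom(500, size = 1, prob = score)  # simulated 0/1 labels

# One row per threshold, from 1 down to 0 in steps of 0.05
rocPointsToy <- do.call(rbind, lapply((20:0) / 20, function(t) {
  pred <- as.integer(score >= t)              # classify as positive at threshold t
  data.frame(threshold = t,
             truePositiveRate = sum(pred == 1 & label == 1) / sum(label == 1),
             falsePositiveRate = sum(pred == 1 & label == 0) / sum(label == 0))
}))

head(rocPointsToy)
```

Lowering the threshold classifies more rows as positive, so both rates grow from 0 (nothing flagged) to 1 (everything flagged).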
ROC curve data can be generated for a specific data partition (validation, cross-validation, or holdout) or for all the data partitions using GetRocCurve and ListRocCurves.
To retrieve ROC curve information, use GetRocCurve:
roc <- GetRocCurve(bestModel)
saveRDS(roc, "ROCCurveModelInsights.rds")
You can then plot the results:
dr_dark_blue <- "#08233F"
dr_roc_green <- "#03c75f"
ValidationRocCurve <- GetRocCurve(bestModel)
ValidationRocPoints <- ValidationRocCurve[["rocPoints"]]
saveRDS(ValidationRocPoints, "ValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")
All the available ROC curve data can be retrieved using ListRocCurves. Here again is an example that retrieves data for all the available partitions, followed by plotting the cross-validation partition:
AllRocCurve <- ListRocCurves(bestModel)
CrossValidationRocPoints <- AllRocCurve[['crossValidation']][['rocPoints']]
saveRDS(CrossValidationRocPoints, 'CrossValidationRocPoints.rds')
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")
You can also plot the ROC curve using ggplot2:
ggplot(ValidationRocPoints, aes(x = falsePositiveRate, y = truePositiveRate)) +
  geom_line()
You can get the recommended threshold value with maximal F1 score. That is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.
threshold <- ValidationRocPoints$threshold[which.max(ValidationRocPoints$f1Score)]
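For readers unfamiliar with the F1 score, it is the harmonic mean of precision and recall. The sketch below computes it from made-up confusion counts and then applies the same which.max selection; rocToy and its count columns are hypothetical and are not claimed to match the columns DataRobot returns.

```r
# Made-up confusion counts at five thresholds (illustration only).
rocToy <- data.frame(threshold = c(0.9, 0.7, 0.5, 0.3, 0.1),
                     truePositive = c(10, 30, 45, 55, 60),
                     falsePositive = c(1, 5, 15, 40, 90),
                     falseNegative = c(50, 30, 15, 5, 0))

precision <- rocToy$truePositive / (rocToy$truePositive + rocToy$falsePositive)
recall <- rocToy$truePositive / (rocToy$truePositive + rocToy$falseNegative)
rocToy$f1Score <- 2 * precision * recall / (precision + recall)

# Threshold with the highest F1 score, selected the same way as above
bestThreshold <- rocToy$threshold[which.max(rocToy$f1Score)]
bestThreshold
```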
You can also estimate metrics for different threshold values. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.
ValidationRocPoints[ValidationRocPoints$threshold == tail(Filter(function(x) x > threshold,
                                                                ValidationRocPoints$threshold),
                                                         1), ]
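The expression above is dense, so here is the same pattern unpacked on a toy table. It assumes the thresholds are sorted in descending order, in which case tail(..., 1) picks the smallest tabulated threshold that is still greater than the chosen value. The names toyPoints and customThreshold and all the values are made up.

```r
# Toy ROC points; thresholds are assumed to be sorted in descending order.
toyPoints <- data.frame(threshold = c(1.0, 0.8, 0.6, 0.4, 0.2, 0.0),
                        f1Score = c(0.00, 0.35, 0.60, 0.72, 0.55, 0.40))
customThreshold <- 0.5

# Keep only thresholds above the chosen value, then take the last one
above <- Filter(function(x) x > customThreshold, toyPoints$threshold)
selected <- toyPoints[toyPoints$threshold == tail(above, 1), ]
selected
```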
The word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.
This example will show you how to obtain word cloud data and visualize it, similar to how DataRobot visualizes the word cloud in the “Model Insights” tab interface.
The visualization example here uses the modelwordcloud package.
Now let’s find our word cloud:
# Find word-based models by looking for "Word" in modelType
wordModels <- allModels[grep("Word", lapply(allModels, `[[`, "modelType"))]
wordModel <- wordModels[[1]]
# Get word cloud
wordCloud <- GetWordCloud(project, wordModel$modelId)
saveRDS(wordCloud, "wordCloudModelInsights.rds")
Now we plot it!
library(modelwordcloud)

# Remove stop words
wordCloud <- wordCloud[!wordCloud$isStopword, ]

# Specify colors similar to what DataRobot produces for
# a wordcloud in Insights
colors <- readRDS("colors.rds")

# Make word cloud
suppressWarnings(
  wordcloud(words = wordCloud$ngram,
            freq = wordCloud$frequency,
            coefficients = wordCloud$coefficient,
            colors = colors,
            scale = c(3, 0.3))
)
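Besides plotting, it can be useful to inspect the word cloud data as a ranked table. The sketch below does this on a made-up data frame that reuses the column names seen above (ngram, coefficient, frequency, isStopword); the values are invented, and ranking by absolute coefficient is just one reasonable choice.

```r
# Made-up word cloud data with the column names used above.
toyCloud <- data.frame(
  ngram = c("debt", "the", "credit card", "consolidation", "wedding"),
  coefficient = c(0.42, 0.01, -0.15, 0.30, -0.55),
  frequency = c(0.8, 1.0, 0.6, 0.4, 0.1),
  isStopword = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Drop stop words, then order by absolute coefficient so the ngrams with
# the largest coefficients (in either direction) come first
toyCloud <- toyCloud[!toyCloud$isStopword, ]
ranked <- toyCloud[order(abs(toyCloud$coefficient), decreasing = TRUE), ]
ranked
```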