--- title: "Getting Started with tidylearn" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with tidylearn} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction `tidylearn` provides a **unified tidyverse-compatible interface** to R's machine learning ecosystem. It wraps proven packages like glmnet, randomForest, xgboost, e1071, cluster, and dbscan - you get the reliability of established implementations with the convenience of a consistent, tidy API. **What tidylearn does:** - Provides one consistent interface (`tl_model()`) to 20+ ML algorithms - Returns tidy tibbles instead of varied output formats - Offers unified ggplot2-based visualization across all methods - Enables pipe-friendly workflows **What tidylearn is NOT:** - A reimplementation of ML algorithms (uses established packages under the hood) - A replacement for the underlying packages (you can access the raw model via `model$fit`) ## Installation ```{r, eval = FALSE} # Install from CRAN install.packages("tidylearn") # Or install development version from GitHub # devtools::install_github("ces0491/tidylearn") ``` ```{r setup} library(tidylearn) library(dplyr) ``` ## The Unified Interface The core of tidylearn is the `tl_model()` function, which dispatches to the appropriate underlying package based on the method you specify. The wrapped packages include stats, glmnet, randomForest, xgboost, gbm, e1071, nnet, rpart, cluster, and dbscan. ### Supervised Learning #### Classification ```{r} # Classification with logistic regression model_logistic <- tl_model(iris, Species ~ ., method = "logistic") print(model_logistic) ``` ```{r} # Make predictions predictions <- predict(model_logistic) head(predictions) ``` #### Regression ```{r} # Regression with linear model model_linear <- tl_model(mtcars, mpg ~ wt + hp, method = "linear") print(model_linear) ``` ```{r} # Predictions predictions_reg <- predict(model_linear) head(predictions_reg) ``` ### Unsupervised Learning #### Dimensionality Reduction ```{r} # Principal Component Analysis model_pca <- tl_model(iris[, 1:4], method = "pca") print(model_pca) ``` ```{r} # Transform data transformed <- predict(model_pca) head(transformed) ``` #### Clustering ```{r} # K-means clustering model_kmeans <- tl_model(iris[, 1:4], method = "kmeans", k = 3) print(model_kmeans) ``` ```{r} # Get cluster assignments clusters <- model_kmeans$fit$clusters head(clusters) ``` ```{r} # Compare with actual species table(clusters$cluster, iris$Species) ``` ## Data Preprocessing tidylearn provides comprehensive preprocessing functions: ```{r} # Prepare data with multiple preprocessing steps processed <- tl_prepare_data( iris, Species ~ ., impute_method = "mean", scale_method = "standardize", encode_categorical = FALSE ) ``` ```{r} # Check preprocessing steps applied names(processed$preprocessing_steps) ``` ```{r} # Use processed data for modeling model_processed <- tl_model(processed$data, Species ~ ., method = "forest") ``` ## Train-Test Splitting ```{r} # Simple random split split <- tl_split(iris, prop = 0.7, seed = 123) # Train model model_train <- tl_model(split$train, Species ~ ., method = "logistic") # Test predictions predictions_test <- predict(model_train, new_data = split$test) head(predictions_test) ``` ```{r} # Stratified split (maintains class proportions) split_strat <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123) 
## Wrapped Packages

tidylearn provides a unified interface to these established R packages:

### Supervised Methods

| Method | Underlying Package | Function Called |
|--------|--------------------|-----------------|
| `"linear"` | stats | `lm()` |
| `"polynomial"` | stats | `lm()` with `poly()` |
| `"logistic"` | stats | `glm(..., family = binomial)` |
| `"ridge"`, `"lasso"`, `"elastic_net"` | glmnet | `glmnet()` |
| `"tree"` | rpart | `rpart()` |
| `"forest"` | randomForest | `randomForest()` |
| `"boost"` | gbm | `gbm()` |
| `"xgboost"` | xgboost | `xgb.train()` |
| `"svm"` | e1071 | `svm()` |
| `"nn"` | nnet | `nnet()` |
| `"deep"` | keras | `keras_model_sequential()` |

### Unsupervised Methods

| Method | Underlying Package | Function Called |
|--------|--------------------|-----------------|
| `"pca"` | stats | `prcomp()` |
| `"mds"` | stats, MASS, smacof | `cmdscale()`, `isoMDS()`, etc. |
| `"kmeans"` | stats | `kmeans()` |
| `"pam"` | cluster | `pam()` |
| `"clara"` | cluster | `clara()` |
| `"hclust"` | stats | `hclust()` |
| `"dbscan"` | dbscan | `dbscan()` |

### Accessing the Underlying Model

You always have access to the raw model from the underlying package via `$fit`:

```{r}
# Example: access the raw randomForest object
model_forest <- tl_model(iris, Species ~ ., method = "forest")
class(model_forest$fit)  # This is the randomForest object

# Use package-specific functions if needed
# randomForest::varImpPlot(model_forest$fit)
```

## Next Steps

Now that you understand the basics, explore:

1. **Supervised Learning** - Dive deeper into classification and regression
2. **Unsupervised Learning** - Explore clustering and dimensionality reduction
3. **Integration Workflows** - Combine supervised and unsupervised learning
4. **AutoML** - Automated machine learning with `tl_auto_ml()`

## Summary

tidylearn is a **wrapper package** that provides:

- **Unified Interface**: One function (`tl_model()`) that dispatches to proven packages like glmnet, randomForest, xgboost, e1071, and others
- **Transparency**: Access raw model objects via `model$fit` for package-specific functionality
- **Tidy Output**: All results are tibbles for easy manipulation with dplyr and ggplot2
- **Consistent Visualization**: Unified ggplot2-based plots regardless of model type

The underlying algorithms are unchanged; tidylearn simply makes them easier to use together.

```{r}
# Quick example combining everything
data_split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 42)
data_prep <- tl_prepare_data(data_split$train, Species ~ .,
                             scale_method = "standardize")
model_final <- tl_model(data_prep$data, Species ~ ., method = "forest")
test_preds <- predict(model_final, new_data = data_split$test)

print(model_final)
```
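
As a closing illustration of the transparency point above: because `model_final$fit` is the raw randomForest object, its standard fields are available directly. The sketch below uses randomForest's own out-of-bag confusion matrix, not a tidylearn function.

```{r, eval = FALSE}
# The raw classification randomForest object carries an out-of-bag
# confusion matrix in its $confusion field (randomForest functionality,
# accessed through tidylearn's $fit).
model_final$fit$confusion

# Note: model_final was trained on the standardized data_prep$data, so
# check whether your version of predict() re-applies that scaling to
# new data before relying on test-set metrics.
```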