--- title: "Data Ingestion with tidylearn" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data Ingestion with tidylearn} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE ) ``` ## Overview Every machine learning workflow starts with data. tidylearn's `tl_read()` family provides a single consistent interface for loading data from files, databases, cloud storage, and APIs into tidy tibbles — ready for `tl_prepare_data()` and `tl_model()`. All readers return a `tidylearn_data` object, a tibble subclass that carries metadata about the source, format, and read timestamp. ```{r setup} library(tidylearn) library(dplyr) ``` --- ## The `tl_read()` Dispatcher `tl_read()` auto-detects the data format and dispatches to the appropriate backend — just like `tl_model()` routes to the right algorithm: ```{r dispatcher-concept, eval = FALSE} # Format is auto-detected from the file extension data <- tl_read("sales.csv") data <- tl_read("results.xlsx", sheet = "Q1") data <- tl_read("experiment.parquet") data <- tl_read("config.json") data <- tl_read("model_data.rds") # Override format detection when the extension is ambiguous data <- tl_read("export.txt", format = "tsv") ``` The result always prints with a metadata header: ```{r print-demo} tmp <- tempfile(fileext = ".csv") write.csv(mtcars, tmp, row.names = FALSE) data <- tl_read(tmp, .quiet = TRUE) data ``` ```{r cleanup-1, include = FALSE} unlink(tmp) ``` --- ## File Formats ### CSV and TSV Uses [readr](https://readr.tidyverse.org/) when available for fast, column-type-aware parsing. Falls back to base R automatically if readr is not installed. 
```{r csv-demo} # Create example files tmp_csv <- tempfile(fileext = ".csv") tmp_tsv <- tempfile(fileext = ".tsv") write.csv(iris, tmp_csv, row.names = FALSE) write.table(iris, tmp_tsv, sep = "\t", row.names = FALSE) csv_data <- tl_read_csv(tmp_csv) tsv_data <- tl_read_tsv(tmp_tsv) nrow(csv_data) ``` ```{r cleanup-csv, include = FALSE} unlink(c(tmp_csv, tmp_tsv)) ``` ### Excel Reads `.xls`, `.xlsx`, and `.xlsm` files via [readxl](https://readxl.tidyverse.org/). Select sheets by name or position: ```{r excel-demo} library(readxl) path <- readxl_example("datasets.xlsx") excel_data <- tl_read_excel(path, sheet = "mtcars") head(excel_data, 3) ``` ### Parquet Lightweight, columnar storage for large datasets. Uses [nanoparquet](https://cran.r-project.org/package=nanoparquet) — a fast, dependency-free reader: ```{r parquet-demo} library(nanoparquet) tmp_pq <- tempfile(fileext = ".parquet") write_parquet(iris, tmp_pq) pq_data <- tl_read_parquet(tmp_pq) nrow(pq_data) ``` ```{r cleanup-pq, include = FALSE} unlink(tmp_pq) ``` ### JSON Reads tabular JSON (array of objects) via [jsonlite](https://cran.r-project.org/package=jsonlite). 
Nested structures are automatically flattened: ```{r json-demo} library(jsonlite) tmp_json <- tempfile(fileext = ".json") write_json(mtcars[1:5, ], tmp_json) json_data <- tl_read_json(tmp_json) json_data ``` ```{r cleanup-json, include = FALSE} unlink(tmp_json) ``` ### RDS and RData Native R serialisation formats — no extra packages needed: ```{r rds-demo} tmp_rds <- tempfile(fileext = ".rds") saveRDS(iris, tmp_rds) rds_data <- tl_read_rds(tmp_rds) nrow(rds_data) ``` ```{r rdata-demo} tmp_rdata <- tempfile(fileext = ".rdata") my_data <- mtcars save(my_data, file = tmp_rdata) # Name is auto-detected when there is a single data frame rdata_data <- tl_read_rdata(tmp_rdata) nrow(rdata_data) ``` ```{r cleanup-rds, include = FALSE} unlink(c(tmp_rds, tmp_rdata)) ``` --- ## Databases All database readers use [DBI](https://dbi.r-dbi.org/) as the interface layer. Each reader manages its own connection lifecycle — connect, query, disconnect — so you only need to provide the path or credentials and a SQL query. ### SQLite The simplest database backend — no server required: ```{r sqlite-demo} library(DBI) library(RSQLite) # Create an example database tmp_db <- tempfile(fileext = ".sqlite") conn <- dbConnect(SQLite(), tmp_db) dbWriteTable(conn, "iris_tbl", iris) dbDisconnect(conn) # Read with tl_read_sqlite db_data <- tl_read_sqlite( tmp_db, "SELECT * FROM iris_tbl WHERE Species = 'setosa'" ) nrow(db_data) ``` ```{r cleanup-sqlite, include = FALSE} unlink(tmp_db) ``` ### Using a Live Connection If you already have a DBI connection, use `tl_read_db()` directly — it will not close your connection: ```{r db-demo} conn <- dbConnect(SQLite(), ":memory:") dbWriteTable(conn, "mtcars_tbl", mtcars) sql <- "SELECT mpg, wt, hp FROM mtcars_tbl WHERE mpg > 20" db_result <- tl_read_db(conn, sql) db_result dbDisconnect(conn) ``` ### PostgreSQL, MySQL, and BigQuery These require a running database server or cloud service. 
The API is the same — provide connection details and a SQL query:

```{r remote-db, eval = FALSE}
# PostgreSQL
pg_data <- tl_read_postgres(
  dsn = "localhost",
  query = "SELECT * FROM sales WHERE year = 2025",
  dbname = "analytics",
  user = "myuser",
  password = "mypass"
)

# MySQL / MariaDB
mysql_data <- tl_read_mysql(
  dsn = "mysql://user:pass@host:3306/mydb",
  query = "SELECT * FROM customers LIMIT 1000"
)

# BigQuery
bq_data <- tl_read_bigquery(
  project = "my-gcp-project",
  query = "SELECT * FROM `dataset.table` LIMIT 1000"
)
```

---

## Cloud and API Sources

### Amazon S3

Downloads a file from S3 and auto-detects the format from the key's extension. Requires valid AWS credentials:

```{r s3, eval = FALSE}
data <- tl_read_s3("s3://my-bucket/data/sales_2025.csv")
data <- tl_read_s3("s3://my-bucket/data/results.parquet", region = "eu-west-1")
```

### GitHub

Downloads raw files directly from a repository. Accepts full URLs or `owner/repo` shorthand:

```{r github, eval = FALSE}
# Read a CSV from a public GitHub repository
data <- tl_read_github(
  "tidyverse/dplyr",
  path = "data-raw/starwars.csv",
  ref = "main"
)
```

### Kaggle

Downloads datasets via the [Kaggle API CLI](https://github.com/Kaggle/kaggle-api). Install with `pip install kaggle` and configure your API credentials:

```{r kaggle, eval = FALSE}
data <- tl_read_kaggle("zillow/zecon", file = "Zip_time_series.csv")
data <- tl_read_kaggle("titanic", file = "train.csv", type = "competition")
```

---

## Multi-File Reading

Real-world data is often split across multiple files. tidylearn handles three common patterns.
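All three reduce to the same idea: read each file, tag its rows with their origin, and row-bind the results. A base-R sketch of that pattern (illustrative; `read_many()` is a hypothetical helper, and the real readers also handle format detection and metadata):

```{r multi-file-sketch, eval = FALSE}
# Hypothetical sketch of the read-and-row-bind pattern behind multi-file input
read_many <- function(paths) {
  pieces <- lapply(paths, function(p) {
    df <- utils::read.csv(p, stringsAsFactors = FALSE)
    df$source_file <- basename(p)  # record which file each row came from
    df
  })
  do.call(rbind, pieces)
}
```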
### Multiple Paths Pass a character vector to `tl_read()` — each file is read and row-bound, with a `source_file` column tracking origin: ```{r multi-path} dir <- tempdir() write.csv(iris[1:50, ], file.path(dir, "batch1.csv"), row.names = FALSE) write.csv(iris[51:100, ], file.path(dir, "batch2.csv"), row.names = FALSE) paths <- file.path(dir, c("batch1.csv", "batch2.csv")) combined <- tl_read(paths, .quiet = TRUE) table(combined$source_file) ``` ```{r cleanup-multi, include = FALSE} unlink(file.path(dir, c("batch1.csv", "batch2.csv"))) ``` ### Directory Scanning Point `tl_read_dir()` at a directory. Filter by format, regex pattern, or scan recursively: ```{r dir-demo} dir <- tempfile(pattern = "tl_vignette_") dir.create(dir) write.csv(iris[1:50, ], file.path(dir, "jan.csv"), row.names = FALSE) write.csv(iris[51:100, ], file.path(dir, "feb.csv"), row.names = FALSE) write.csv(iris[101:150, ], file.path(dir, "mar.csv"), row.names = FALSE) # Read all CSVs from the directory all_data <- tl_read_dir(dir, format = "csv", .quiet = TRUE) nrow(all_data) table(all_data$source_file) ``` ```{r dir-pattern} # Filter with a regex pattern subset <- tl_read_dir(dir, pattern = "^(jan|feb)", .quiet = TRUE) nrow(subset) ``` ```{r cleanup-dir, include = FALSE} unlink(dir, recursive = TRUE) ``` Passing a directory path directly to `tl_read()` works too: ```{r dir-dispatch, eval = FALSE} data <- tl_read("data/monthly_exports/") ``` ### Zip Archives `tl_read_zip()` extracts the archive, auto-detects the file format, and reads the contents. 
Select a specific file or let it discover data files automatically:

```{r zip-demo}
# Create an example zip
dir <- tempfile(pattern = "tl_zip_src_")
dir.create(dir)
write.csv(iris, file.path(dir, "iris.csv"), row.names = FALSE)

zip_path <- tempfile(fileext = ".zip")
old_wd <- getwd()
setwd(dir)
utils::zip(zip_path, "iris.csv")
setwd(old_wd)

zip_data <- tl_read_zip(zip_path, .quiet = TRUE)
nrow(zip_data)
attr(zip_data, "tl_format")
```

```{r cleanup-zip, include = FALSE}
unlink(c(dir, zip_path), recursive = TRUE)
```

Zip files are also auto-detected by `tl_read()`:

```{r zip-dispatch, eval = FALSE}
data <- tl_read("download.zip")
data <- tl_read("download.zip", file = "train.csv")
```

---

## The `tidylearn_data` Class

Every reader returns a `tidylearn_data` object — a tibble subclass with three metadata attributes:

| Attribute | Description |
|-----------|-------------|
| `tl_source` | File path, URL, or description of the data source |
| `tl_format` | Detected or specified format (e.g., `"csv"`, `"zip+csv"`) |
| `tl_timestamp` | POSIXct timestamp of when the data was read |

Because `tidylearn_data` inherits from `tbl_df`, all dplyr verbs, ggplot2, and tidylearn functions work transparently:

```{r class-demo}
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)
data <- tl_read(tmp, .quiet = TRUE)

# Check metadata
attr(data, "tl_format")

# Works with dplyr
data %>%
  filter(mpg > 20) %>%
  select(mpg, wt, hp) %>%
  head(3)
```

```{r cleanup-class, include = FALSE}
unlink(tmp)
```

---

## Full Pipeline

Combining `tl_read()` with the rest of tidylearn gives you a complete workflow from raw file to evaluated model:

```{r pipeline}
# 1. Ingest
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
data <- tl_read(tmp, .quiet = TRUE)

# CSV files lose factor information, so convert character columns as needed
data <- data %>% mutate(Species = as.factor(Species))

# 2. Split
split <- tl_split(data, prop = 0.7, stratify = "Species", seed = 42)

# 3. Model
model <- tl_model(split$train, Species ~ ., method = "forest")

# 4. Evaluate
eval_result <- tl_evaluate(model, new_data = split$test)
eval_result
```

```{r cleanup-pipeline, include = FALSE}
unlink(tmp)
```

---

## Supported Formats Reference

| Format | Function | Backend | Dependency |
|--------|----------|---------|------------|
| CSV | `tl_read_csv()` | readr / base R | Suggests (readr) |
| TSV | `tl_read_tsv()` | readr / base R | Suggests (readr) |
| Excel | `tl_read_excel()` | readxl | Suggests |
| Parquet | `tl_read_parquet()` | nanoparquet | Suggests |
| JSON | `tl_read_json()` | jsonlite | Suggests |
| RDS | `tl_read_rds()` | base R | None |
| RData | `tl_read_rdata()` | base R | None |
| SQLite | `tl_read_sqlite()` | DBI + RSQLite | Suggests |
| PostgreSQL | `tl_read_postgres()` | DBI + RPostgres | Suggests |
| MySQL | `tl_read_mysql()` | DBI + RMariaDB | Suggests |
| BigQuery | `tl_read_bigquery()` | bigrquery | Suggests |
| S3 | `tl_read_s3()` | paws.storage | Suggests |
| GitHub | `tl_read_github()` | base R | None |
| Kaggle | `tl_read_kaggle()` | Kaggle CLI | None (system) |
| Directory | `tl_read_dir()` | (dispatches) | — |
| Zip | `tl_read_zip()` | base R + (dispatches) | — |
| Multi-path | `tl_read(c(...))` | (dispatches) | — |