---
title: "Introduction to kit"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to kit}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(kit)
```

## Overview

**kit** provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection.

Key features include:

* **Parallel statistical functions**: Row-wise operations (`psum`, `pmean`, `pfirst`) using OpenMP.
* **Vectorized conditionals**: Fast `if-else` logic (`iif`, `nif`, `vswitch`) that preserves attributes.
* **Efficient set operations**: Faster `unique`, `duplicated`, and `count` for vectors and data frames.
* **Partial sorting**: Retrieve top N elements without sorting the entire vector (`topn`).
* **Factor utilities**: Fast character-to-factor conversion (`charToFact`) and level manipulation (`setlevels`).

Most functions are implemented in C and support multi-threading where applicable, making them significantly faster than their base R equivalents on large datasets.

## Parallel Statistical Functions

Computing row-wise statistics across multiple vectors or data frame columns is a common task. While base R has `pmin()` and `pmax()`, it lacks efficient equivalents for sum, mean, or product. **kit** fills this gap.

### Row-wise Arithmetic

`psum()`, `pmean()`, and `pprod()` compute parallel sum, mean, and product respectively. They accept multiple vectors or a single list/data frame.

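For orientation, the following sketch spells out what a parallel sum and mean look like in base R, using `rowSums()` and `rowMeans()` on column-bound vectors. The names `u`, `v`, `base_sum`, and `base_mean` are hypothetical and used only here; the cross-check against **kit** is guarded with `requireNamespace()` and assumes `psum()` matches the base result up to floating-point tolerance.

```{r}
# Illustrative inputs (hypothetical names, not used elsewhere in this vignette)
u <- c(1, 3, NA, 5)
v <- c(2, NA, 4, 1)

# Base R: bind the vectors as columns, then reduce across each row
base_sum  <- rowSums(cbind(u, v), na.rm = TRUE)
base_mean <- rowMeans(cbind(u, v), na.rm = TRUE)

base_sum
base_mean

# Optional cross-check against kit; skipped if the package is not installed
if (requireNamespace("kit", quietly = TRUE)) {
  print(all.equal(kit::psum(u, v, na.rm = TRUE), base_sum))
}
```

The chunk that follows computes the same kind of row-wise result directly with `psum()` and `pmean()`.
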
```{r}
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel sum
psum(x, y, z, na.rm = TRUE)

# Parallel mean
pmean(x, y, z, na.rm = TRUE)
```

They are particularly useful for data frames:

```{r}
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
psum(df)
```

### Row-wise Min, Max, and Range

`fpmin()`, `fpmax()`, and `prange()` compute parallel minimum, maximum, and range (max - min) respectively. They complement base R's `pmin()` and `pmax()`, providing greater performance and the ability to work efficiently with data frames.

```{r}
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel minimum
fpmin(x, y, z, na.rm = TRUE)

# Parallel maximum
fpmax(x, y, z, na.rm = TRUE)

# Parallel range (max - min)
prange(x, y, z, na.rm = TRUE)
```

Like `psum()` and `pmean()`, these functions preserve the input type when all inputs have the same type, and automatically promote to the highest type when inputs are mixed (logical < integer < double). `prange()` always returns double to avoid integer overflow.

```{r}
# With data frames
fpmin(df)
fpmax(df)
prange(df)
```

### Coalescing Values

`pfirst()` and `plast()` return the first or last non-missing value across a set of vectors. This is equivalent to the SQL `COALESCE` function (for `pfirst`).

```{r}
primary <- c(NA, 2, NA, 4)
secondary <- c(1, NA, 3, NA)
fallback <- c(0, 0, 0, 0)

# Take first available value
pfirst(primary, secondary, fallback)
```

### Logical and Count Operations

You can check for conditions or count values row-wise with `pall`, `pany`, and `pcount`.

```{r}
a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)

# Any TRUE per row?
pany(a, b, c, na.rm = TRUE)

# Count NAs per row
pcountNA(a, b, c)

# Count specific value (e.g., TRUE) per row
pcount(a, b, c, value = TRUE)
```

## Vectorized Conditionals

### Fast If-Else (`iif`)

Base R's `ifelse()` is known to be slow and often strips attributes (like `Date` class or factor levels).
`iif()` is a faster, more robust alternative that preserves attributes from the `yes` argument.

```{r}
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))

# Base ifelse strips class
class(ifelse(dates > "2024-01-01", dates, dates - 1))

# iif preserves class
class(iif(dates > "2024-01-01", dates, dates - 1))
```

It also supports explicit `NA` handling:

```{r}
x <- c(-2, -1, NA, 1, 2)
iif(x > 0, "positive", "non-positive", na = "missing")
```

### Nested Conditionals (`nif`)

For multiple conditions, `nif()` offers a cleaner, more efficient syntax than nested `ifelse()` calls, similar to SQL's `CASE WHEN`.

```{r}
score <- c(95, 82, 67, 45, 78)
nif(
  score >= 90, "A",
  score >= 80, "B",
  score >= 70, "C",
  score >= 60, "D",
  default = "F"
)
```

### Vectorized Switch (`vswitch`, `nswitch`)

`vswitch()` maps input values to outputs efficiently.

```{r}
status_code <- c(1L, 2L, 3L, 1L, 4L)
vswitch(
  x = status_code,
  values = c(1L, 2L, 3L),
  outputs = c("pending", "approved", "rejected"),
  default = "unknown"
)
```

For pairwise syntax, `nswitch()` pairs values and outputs directly.

```{r}
nswitch(status_code,
  1L, "pending",
  2L, "approved",
  3L, "rejected",
  default = "unknown"
)
```

It can also replace with values from other vectors (columns), mixing scalars and vectors:

```{r}
df <- data.frame(
  code = c(1, 2, 1, 3, 2),
  val_a = c(10, 20, 30, 40, 50),
  val_b = c(100, 200, 300, 400, 500)
)

with(df, nswitch(code,
  1, val_a,
  2, val_b,
  3, 0,
  default = NA_real_
))
```

## Fast Unique and Duplicates

**kit** provides optimized versions of `unique()` and `duplicated()` that are significantly faster for vectors and data frames.

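As a reference point, this short sketch shows the base R semantics that the optimized versions stand in for: `duplicated()` flags repeats, and `unique()` keeps first occurrences in order of appearance. The names `keys` and `dup_flags` are hypothetical. The final comparison assumes `funique()` and `fduplicated()` return the same results as their base counterparts for a plain character vector; it prints the outcome rather than asserting it.

```{r}
keys <- c("a", "b", "a", "c", "b")

# duplicated() marks every element that already appeared earlier in the vector
dup_flags <- duplicated(keys)
dup_flags

# unique() is the same as dropping the flagged elements
identical(unique(keys), keys[!dup_flags])

# Optional cross-check against kit; skipped if the package is not installed
if (requireNamespace("kit", quietly = TRUE)) {
  print(identical(kit::funique(keys), unique(keys)))
  print(identical(kit::fduplicated(keys), dup_flags))
}
```
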
### Unique Values and Duplicates

```{r}
vec <- c("a", "b", "a", "c", "b")

# Get unique values
funique(vec)

# Check for duplicates
fduplicated(vec)
```

`uniqLen()` efficiently counts the number of unique elements without allocating the unique vector itself:

```{r}
df <- data.frame(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "b")
)
uniqLen(df)
funique(df)
```

### Counting Occurrences

`countOccur()` produces a frequency table (similar to `table()` or `dplyr::count()`) but returns a standard data frame.

```{r}
countOccur(c("apple", "banana", "apple", "cherry"))
```

## Sorting and Utilities

### Partial Sorting (`topn`)

Sorting a large vector just to get the top few elements is inefficient. `topn()` uses a partial sorting algorithm to retrieve the top (or bottom) $N$ indices or values.

```{r}
set.seed(42)
x <- rnorm(1000)

# Get indices of top 5 values
topn(x, n = 5)

# Get the actual values (decreasing = FALSE for bottom values)
topn(x, n = 5, decreasing = FALSE, index = FALSE)
```

### Factor Manipulation

`charToFact()` is a fast alternative to `as.factor()` for character vectors, with control over `NA` levels.

```{r}
charToFact(c("a", "b", NA, "a"))
```

`setlevels()` allows you to change factor levels by reference (in-place), avoiding object copying.

### Finding Positions (`fpos`)

`fpos()` finds the positions of a pattern (needle) within a vector (haystack). It can be used to find occurrences of one vector inside another.

```{r}
haystack <- c(1, 2, 3, 4, 1, 2, 5)
needle <- c(1, 2)
fpos(needle, haystack)
```

## Summary

| Task | kit function | Base R equivalent |
|:---|:---|:---|
| **Row-wise sum** | `psum()` | `rowSums(cbind(...))` |
| **Row-wise mean** | `pmean()` | `rowMeans(cbind(...))` |
| **Row-wise min** | `fpmin()` | `pmin(...)` |
| **Row-wise max** | `fpmax()` | `pmax(...)` |
| **Row-wise range** | `prange()` | `pmax(...) - pmin(...)` |
| **First non-NA** | `pfirst()` | `apply(..., 1, function(x) x[!is.na(x)][1])` |
| **Fast if-else** | `iif()` | `ifelse()` |
| **Nested if-else** | `nif()` | Nested `ifelse()` |
| **Switch** | `vswitch()` | `match()` + indexing |
| **Unique values** | `funique()` | `unique()` |
| **Top N indices** | `topn()` | `order(x, decreasing = TRUE)[1:n]` |
| **Char to Factor** | `charToFact()` | `as.factor()` |

For comprehensive details and performance benchmarks, please refer to the individual function documentation.
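
The performance claims can also be spot-checked locally. The sketch below is one rough way to do that with `system.time()`, comparing the base R row-wise sum from the table with `psum()`. The vector names and the size `n` are arbitrary choices for illustration, timings will vary by machine and thread count, and a single `system.time()` call is only a coarse measurement, not a rigorous benchmark.

```{r}
# Build a moderately large example; the size is an arbitrary illustration
set.seed(1)
n <- 1e6
v1 <- runif(n)
v2 <- runif(n)
v3 <- runif(n)

# Base R approach: materialise a matrix, then sum across each row
t_base <- system.time(res_base <- rowSums(cbind(v1, v2, v3)))

# kit approach, guarded so the chunk still runs if kit is not installed
if (requireNamespace("kit", quietly = TRUE)) {
  t_kit <- system.time(res_kit <- kit::psum(v1, v2, v3))
  # Report whether both approaches agree up to floating-point tolerance
  print(all.equal(res_base, res_kit))
  # Elapsed seconds for each approach
  print(c(base = t_base[["elapsed"]], kit = t_kit[["elapsed"]]))
}
```

For more stable numbers, repeating each call several times and comparing medians is a reasonable next step.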