---
title: "nuggets: Get Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{nuggets: Get Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include=FALSE}
library(nuggets)
library(dplyr)
library(ggplot2)
library(tidyr)

options(tibble.width = Inf)
```

# Introduction

Package `nuggets` searches for patterns that can be expressed as formulae in the form of elementary conjunctions, referred to in this text as *conditions*. Conditions are constructed from *predicates*, which correspond to data columns. The interpretation of conditions depends on the choice of underlying logic:

- *Crisp (Boolean) logic*: each predicate takes values `TRUE` (1) or `FALSE` (0). The truth value of a condition is computed according to the rules of classical Boolean algebra.

- *Fuzzy logic*: each predicate is assigned a *truth degree* from the interval $[0, 1]$. The truth degree of a conjunction is then computed using a chosen *triangular norm (t-norm)*. The package supports three common t-norms, which are defined for predicates' truth degrees $a, b \in [0, 1]$ as follows:
    - *Gödel* (minimum) t-norm: $\min(a, b)$;
    - *Goguen* (product) t-norm: $a \cdot b$;
    - *Łukasiewicz* t-norm: $\max(0, a + b - 1)$.

Before applying `nuggets`, data columns intended as predicates must be prepared either by *dichotomization* (conversion into *dummy variables*) or by transformation into *fuzzy sets*. The package provides functions for both transformations. See the [Data Preparation](data-preparation.html) vignette for a comprehensive guide, or the section [Data Preparation](#data-preparation) below for a quick overview.

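
As a quick illustration of the three t-norms listed above, the following base-R sketch evaluates each of them on one pair of truth degrees. The helper names (`tnorm_goedel()`, `tnorm_goguen()`, `tnorm_lukasiewicz()`) and the input values are made up for this example and are not part of the `nuggets` API.

```{r}
# Illustrative helpers only (hypothetical names, not exported by nuggets)
tnorm_goedel      <- function(a, b) pmin(a, b)          # minimum
tnorm_goguen      <- function(a, b) a * b               # product
tnorm_lukasiewicz <- function(a, b) pmax(0, a + b - 1)  # bounded difference

# Truth degrees of two predicates on a single row (made-up values)
deg_a <- 0.75
deg_b <- 0.5

# Truth degree of the conjunction under each t-norm
tnorm_demo <- c(goedel      = tnorm_goedel(deg_a, deg_b),
                goguen      = tnorm_goguen(deg_a, deg_b),
                lukasiewicz = tnorm_lukasiewicz(deg_a, deg_b))
tnorm_demo
```

For the same inputs, the Gödel t-norm never yields a smaller value than the Goguen t-norm, which in turn never yields a smaller value than the Łukasiewicz t-norm, so the choice of t-norm affects how strictly longer conditions are evaluated.
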
`nuggets` implements functions to search for pre-defined types of patterns, for example:

- `dig_associations()` for association rules,
- `dig_baseline_contrasts()`, `dig_complement_contrasts()`, and `dig_paired_baseline_contrasts()` for various contrast patterns on numeric variables,
- `dig_correlations()` for conditional correlations.

See [Pre-defined Patterns](#pre-defined-patterns) below for further details.

Discovered rules and patterns can be post-processed, visualized, and explored interactively. Section [Post-processing and Visualization](#post-processing-and-visualization) describes these features.

Finally, the package allows users to provide custom evaluation functions for conditions and to search for *user-defined* types of patterns:

- `dig()` is a general function for searching arbitrary pattern types.
- `dig_grid()` is a wrapper around `dig()` for patterns defined by conditions and a pair of columns evaluated by a user-defined function.

See [Custom Patterns](#custom-patterns) for more information.

# Data Preparation

Before applying `nuggets`, data columns intended as predicates must be prepared either by *dichotomization* (conversion into *dummy variables*) or by transformation into *fuzzy sets*. The package provides the `partition()` function for both transformations.

For a detailed guide to data preparation, including information about all available functions and advanced techniques, please see the [Data Preparation](data-preparation.html) vignette.

## Crisp (Boolean) Predicates Example

For crisp patterns, numeric columns are transformed to logical (`TRUE`/`FALSE`) columns.
Here's a quick example using the built-in `mtcars` dataset:

```{r}
# Transform the whole dataset to crisp predicates
# First, convert cyl to a factor for illustration
crisp_mtcars <- mtcars |>
    mutate(cyl = factor(cyl,
                        levels = c(4, 6, 8),
                        labels = c("four", "six", "eight"))) |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "crisp", .breaks = 3)

head(crisp_mtcars, n = 3)
```

Now all columns are logical and can be used as predicates in crisp conditions.

## Fuzzy Predicates Example

Fuzzy predicates express the degree to which a condition is satisfied, with values in the interval $[0,1]$. This allows modeling of smooth transitions between categories:

```{r, message=FALSE}
# Start with fresh mtcars and transform to fuzzy predicates
fuzzy_mtcars <- mtcars |>
    mutate(cyl = factor(cyl,
                        levels = c(4, 6, 8),
                        labels = c("four", "six", "eight"))) |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "triangle", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "triangle", .breaks = 3)

head(fuzzy_mtcars, n = 3)
```

Note that the `cyl`, `vs`, `am`, and `gear` columns are still represented by dummy logical columns, while the numeric columns are now represented by fuzzy sets. This combination allows both crisp and fuzzy predicates to be used together in pattern discovery.

## Advanced Data Preparation Capabilities

The `nuggets` package provides powerful and flexible data preparation tools. The [Data Preparation](data-preparation.html) vignette covers these capabilities in depth, including:

- **Crisp (Boolean) partitioning** with customizable interval strategies:
    - Equal-width intervals for uniform discretization
    - Data-driven methods (quantile, k-means, hierarchical clustering, etc.)
      for optimal breakpoints that respect the data structure
    - Custom breakpoints for domain-specific intervals

- **Fuzzy partitioning** for modeling gradual transitions and uncertainty:
    - Triangular membership functions for basic fuzzy sets
    - Raised-cosine membership functions for smoother transitions
    - Trapezoidal shapes using `.span` and `.inc` parameters for overlapping fuzzy sets

- **Quality control utilities** to improve pattern mining:
    - `is_almost_constant()` and `remove_almost_constant()` to identify and filter uninformative columns
    - `dig_tautologies()` to find always-true rules that can be used to prune search spaces

- **Custom labels** for predicates to make discovered patterns more interpretable

For example, you can use quantile-based partitioning to ensure balanced predicates, or use raised-cosine fuzzy sets with custom labels to create meaningful linguistic terms like "very_low", "low", "medium", "high", and "very_high". These preparation choices significantly impact the interpretability and usefulness of patterns discovered in subsequent analyses.

# Pre-defined Patterns

The package `nuggets` provides a set of functions for discovering some of the best-known pattern types. These functions can process Boolean data, fuzzy data, or both. Each function returns a tibble, where every row represents one detected pattern.

> **Note:** This section assumes that the data have already been **preprocessed**,
> i.e., transformed into a binarized or fuzzified form. See the previous
> section [Data Preparation](#data-preparation) for details on how to prepare
> your dataset (for example, `crisp_mtcars` and `fuzzy_mtcars`).

For more advanced workflows, such as defining custom pattern types or computing user-defined measures, see the section [Custom Patterns](#custom-patterns).

## Search for Association Rules

**Association rules** identify conditions (*antecedents*) under which a specific feature (*consequent*) is present very often.
\[
A \Rightarrow C
\]

If condition `A` is satisfied, then the feature `C` tends to be present.

For example,

`university_edu & middle_age & IT_industry => high_income`

can be read as:

*Middle-aged people with a university education working in the IT industry are very likely to have a high income.*

In practice, the antecedent `A` is a set of predicates, and the consequent `C` is usually a single predicate, denoted \(c\) below.

For a set of predicates \(I\), let \(\text{supp}(I)\) denote the *support*, that is, the relative frequency (for logical data) or the mean truth degree (for fuzzy data) of rows satisfying all predicates in \(I\). Using this notation:

- **Length**: number of predicates in the antecedent.
- **Coverage**: \(\text{supp}(A)\).
- **Consequent support**: \(\text{supp}(\{c\})\).
- **Support**: \(\text{supp}(A \cup \{c\})\).
- **Confidence**: \(\text{supp}(A \cup \{c\}) / \text{supp}(A)\).

Optional additional measures (`"lift"`, `"conviction"`, `"added_value"`) can be computed using the `measures` argument.

Before searching for rules, it is recommended to create a *vector of disjoints*, which specifies predicates that must not appear together in the same condition. This vector must have one element per dataset column. Predicates that share the same value in the vector are never combined: for example, columns representing `gear=3` and `gear=4` are mutually exclusive, so their shared group label in `disj` prevents meaningless conditions like `gear=3 & gear=4`.

You can conveniently generate this vector with `var_names()`:

```{r}
disj <- var_names(colnames(fuzzy_mtcars))
print(disj)
```

The `dig_associations()` function searches for association rules.
Its main arguments are:

- `x`: the data matrix or data frame (logical or numeric);
- `antecedent`, `consequent`: tidyselect expressions selecting columns for each side of the rule;
- `disjoint`: a vector defining mutually exclusive predicates;
- rule filtering thresholds such as `min_support`, `min_confidence`, `min_coverage`, and limits like `min_length`, `max_length`;
- optional parameters such as `measures`, `t_norm`, and `contingency_table`.

In the following example, we search for fuzzy association rules in the dataset `fuzzy_mtcars`, such that:

- any column except those starting with `"am"` may appear in the antecedent;
- columns starting with `"am"` may appear in the consequent;
- minimum support is `0.02`;
- minimum confidence is `0.8`;
- additional quality measures `"lift"` and `"conviction"` are computed.

```{r}
result <- dig_associations(fuzzy_mtcars,
                           antecedent = !starts_with("am"),
                           consequent = starts_with("am"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8,
                           measures = c("lift", "conviction"),
                           contingency_table = TRUE)
```

The result is a tibble containing the discovered rules and their quality metrics. You can arrange them, for example, by decreasing support:

```{r}
result <- arrange(result, desc(support))
print(result)
```

This example illustrates the typical workflow for mining association rules with `nuggets`. The same structure and arguments apply when analyzing either fuzzy or Boolean datasets.

## Conditional Correlations

TBD (`dig_correlations`)

## Contrast Patterns

TBD (`dig_contrasts`)

# Post-processing and Visualization

TBD

# Custom Patterns

The `nuggets` package can execute a user-defined callback function on each generated frequent condition, which makes it possible to search for custom types of patterns. The following example replicates the search for association rules with a custom callback function.
For that, the dataset has to be prepared as described in the **Data Preparation** section, and a disjoint vector has to be created as shown for association rules above:

```{r}
#head(fuzzyCO2)
#print(disj)
```

As we want to search for association rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules that have the obtained condition as their antecedent:

```{r}
min_support <- 0.02
min_confidence <- 0.8

f <- function(condition, support, foci_supports) {
    # Confidence of "condition => focus" for every focus
    conf <- foci_supports / support

    # Keep only foci that pass both the confidence and the support threshold
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]

    # Return one list per rule that passed the thresholds
    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = format_condition(names(conf)[[i]]),
             support = supp[[i]],
             confidence = conf[[i]])
    })
}
```

The callback function `f()` defines three arguments: `condition`, `support` and `foci_supports`. The argument names are not arbitrary: the search algorithm inspects the names of the callback's arguments to decide which information to pass to it. Here `condition` is a named vector of indices representing the conjunction of predicates in a condition, where a predicate is a column of the source dataset. The `support` argument receives the relative frequency of the condition in the dataset. `foci_supports` is a vector of supports of special predicates, which we call "foci" (plural of "focus"), within the rows satisfying the condition. For association rules, foci are potential rule consequents.

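
The arithmetic inside `f()` can be traced by hand on made-up numbers. The sketch below uses only base R; the condition support, the focus names, and the focus supports are hypothetical values chosen for illustration, not results computed from any dataset, and the thresholds are repeated literally so that the chunk is self-contained.

```{r}
# Hypothetical inputs, shaped like the values passed to the callback
toy_support <- 0.25                      # support of the condition
toy_foci_supports <- c(focus_x = 0.24,   # support of condition & focus_x
                       focus_y = 0.10,   # support of condition & focus_y
                       focus_z = 0.01)   # support of condition & focus_z

# Confidence of "condition => focus" is supp(condition & focus) / supp(condition)
toy_conf <- toy_foci_supports / toy_support

# Keep only foci that reach confidence 0.8 and support 0.02
toy_keep <- toy_conf >= 0.8 & toy_foci_supports >= 0.02
names(toy_conf)[toy_keep]
```

Only `focus_x` survives here: `focus_y` has sufficient support but too low a confidence, while `focus_z` fails both thresholds.
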
Now we can run the search for rules:

```{r}
#result <- dig(fuzzyCO2,
#              f = f,
#              condition = !starts_with("Treatment"),
#              focus = starts_with("Treatment"),
#              disjoint = disj,
#              min_length = 1,
#              min_support = min_support)
```

Because the callback function returns a list of lists, we have to flatten the first level of lists in the result and bind it into a data frame:

```{r}
#result <- result |>
#  unlist(recursive = FALSE) |>
#  lapply(as_tibble) |>
#  do.call(rbind, args = _) |>
#  arrange(desc(support))
#
#print(result)
```
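
The flattening step can be tried on a small hand-made list that mimics the shape described above: one element per condition, each holding zero or more rules. This is only a sketch under that assumption. The object names and rule values are invented, and base R's `as.data.frame()` stands in for `as_tibble()` so that the chunk does not depend on any package.

```{r}
# Hypothetical callback results: one inner list per condition
toy_result <- list(
    list(list(antecedent = "{a}", consequent = "{x}",
              support = 0.30, confidence = 0.90)),
    list(),   # a condition that produced no rule
    list(list(antecedent = "{a,b}", consequent = "{x}",
              support = 0.10, confidence = 0.85),
         list(antecedent = "{a,b}", consequent = "{y}",
              support = 0.12, confidence = 0.95))
)

# Drop the per-condition level so that each element is a single rule
toy_flat <- unlist(toy_result, recursive = FALSE)

# Turn every rule into a one-row data frame and stack the rows
toy_rules <- do.call(rbind, lapply(toy_flat, as.data.frame))

# Sort by decreasing support
toy_rules <- toy_rules[order(-toy_rules$support), ]
toy_rules
```

Conditions that produced no rule contribute an empty list, which disappears during flattening, so the final data frame contains exactly one row per rule.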