---
title: "Incorporating Knowledge"
vignette: >
  %\VignetteIndexEntry{Incorporating Knowledge}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
    fig.align: "center"
---

```{r}
#| label: setup
library(causalDisco)
```

This vignette demonstrates how to use the `knowledge()` function to incorporate prior knowledge into causal discovery algorithms. The different supported knowledge types are explained below, along with examples of how to create `Knowledge` objects and use them with causal discovery methods.

All knowledge types can be freely combined. Multiple calls or operators are additive: each one adds new edges to the `Knowledge` object. For example, if we require an edge from `A` to `B` and then require an edge from `A` to `C`, the resulting `Knowledge` object will require both edges.

At a conceptual level, all knowledge is represented as constraints on edges, specifying which edges are required and which are forbidden. Some knowledge types provide higher-level abstractions for expressing common modeling assumptions more conveniently.

# Required and forbidden knowledge

At the most basic level, prior knowledge is expressed as required or forbidden edges between variables. These constraints apply to directed edges in the causal graph.

- Required edges specify that a directed edge must exist between two variables.
- Forbidden edges specify that a directed edge is not allowed between two variables.

These constraints are specified using the `%-->%` (required) and `%!-->%` (forbidden) operators, with the exclamation mark (`!`) indicating negation of the edge, i.e. the absence of the edge. Conceptually, this could be written as `%!(-->)%`, but we find this syntax too verbose.


## Specifying required and forbidden edges

Suppose we want to require an edge from `A` to `B` and from `A` to `C`, and forbid an edge from `B` to `C`:

```{r}
#| label: required and forbidden knowledge
kn_1 <- knowledge(
  A %-->% c(B, C), # Require edges from A to B and A to C
  B %!-->% C # Forbid edge from B to C
)
```

This `Knowledge` object can be visualized:

```{r}
#| label: plot required and forbidden knowledge
plot(kn_1)
```

The blue edges represent the required edges from `A` to `B` and from `A` to `C`, while the red edge represents the forbidden edge from `B` to `C`.

To remove edge knowledge (either required or forbidden) from an existing `Knowledge` object, use the `remove_edge()` function. For example, to remove the required edge from `A` to `B`:

```{r}
#| label: remove required edge
kn_1_removed <- remove_edge(kn_1, from = A, to = B)
plot(kn_1_removed)
```

## Specifying required and forbidden edges in a dataset

We will use the `tpc_example` dataset available in the package for the following examples. The dataset contains 6 variables, as shown below:

```{r}
#| label: dataset required and forbidden knowledge
data(tpc_example)
head(tpc_example)
```

When specifying knowledge, it is often more convenient to supply the dataset so that the variables are checked for existence and selected correctly. To do this, pass the dataset as the first argument to `knowledge()`.


```{r}
#| label: required and forbidden knowledge with data
kn_2 <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3, # Require edge from child_x1 to youth_x3
  child_x2 %!-->% oldage_x5 # Forbid edge from child_x2 to oldage_x5
)
```

This `Knowledge` object can also be visualized (we manually adjust the layout for better appearance):

```{r}
#| label: plot required and forbidden knowledge with data
cg <- knowledge_to_caugi(kn_2)$caugi
layout <- caugi::caugi_layout_sugiyama(cg)
layout[6, 2] <- layout[4, 2]
plot(kn_2, layout = layout)
```

The plot shows all variables in the dataset, with the required edges in blue and the forbidden edges in red.

### Using tidyselect helpers

To make specifying variables easier, you can use tidyselect helpers such as `starts_with`:

```{r}
#| label: required and forbidden knowledge with tidyselect
kn_3 <- knowledge(
  tpc_example,
  starts_with("child") %-->% starts_with("youth"),
  starts_with("oldage") %!-->% starts_with("youth")
)
```

This means that all variables starting with "child" are required to have edges to all variables starting with "youth", and no variable starting with "oldage" can have an edge to any variable starting with "youth". We can visualize this:

```{r}
#| label: plot required and forbidden knowledge with tidyselect
plot(kn_3)
```

For a list of all available tidyselect helpers, see the [tidyselect reference documentation](https://tidyselect.r-lib.org/reference/index.html).

# Tiered knowledge

Tiered knowledge provides a higher-level abstraction for expressing systematic ordering assumptions, such as temporal or logical precedence. Internally, tiered knowledge is translated into a collection of forbidden edges, but it is exposed separately because it provides a concise and structured way to express common ordering assumptions.

For example, consider a dataset with three groups of variables: child, youth, and old.
We may wish to enforce that child variables precede youth variables, which in turn precede old variables. This can be expressed using tiered knowledge.

Tiered knowledge enforces that edges may only point from earlier tiers to later tiers. Edges within the same tier are unrestricted unless additional knowledge is supplied.

## Creating a tiered `Knowledge` object

Suppose we observe variables over time: first the `A`'s, then the `B`'s, and finally the `C`'s. This ordering implies that causal direction cannot go backward in time (e.g., `B`'s cannot cause `A`'s).

A tiered `Knowledge` object captures this temporal structure by specifying tiers and their associated variables. If numeric tiers are used, lower numbers indicate earlier tiers; otherwise, tiers are ordered by their appearance. The following specifications encode the same tier structure:

```{r}
#| label: tier knowledge
kn <- knowledge(
  tier(
    1 ~ c(A1, A2),
    2 ~ c(B1, B2),
    3 ~ c(C1, C2)
  )
)

# Same object, since tiers are ordered numerically
kn_same <- knowledge(
  tier(
    1 ~ c(A1, A2),
    3 ~ c(C1, C2),
    2 ~ c(B1, B2)
  )
)

# Functionally equivalent, though not identical
kn_almost <- knowledge(
  tier(
    10 ~ c(A1, A2),
    30 ~ c(C1, C2),
    20 ~ c(B1, B2)
  )
)

# Again functionally equivalent
kn_also_almost <- knowledge(
  tier(
    A ~ c(A1, A2),
    B ~ c(B1, B2),
    C ~ c(C1, C2)
  )
)

# Has a letter, so tiers are ordered by appearance, thus functionally equivalent
kn_mixed <- knowledge(
  tier(
    3 ~ c(A1, A2),
    B ~ c(B1, B2),
    1 ~ c(C1, C2)
  )
)
```

We can visualize the tiers:

```{r}
#| label: plot tier knowledge
plot(kn)
```

The plot then shows the tiers as layers, with the earliest tiers to the left and latest to the right.

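
To make the tier rule concrete, the following base R sketch enumerates by hand the ordered pairs that the tier structure above rules out. It does not call any causalDisco function; the names `tiers`, `edge_pairs`, and `forbidden` are hypothetical and introduced only for this illustration, and the package's internal representation may differ.

```{r}
#| label: manual tier forbidden edges sketch
# Hypothetical tier assignment mirroring the A/B/C example above
tiers <- c(A1 = 1, A2 = 1, B1 = 2, B2 = 2, C1 = 3, C2 = 3)

# All ordered pairs of distinct variables (candidate directed edges)
edge_pairs <- expand.grid(
  from = names(tiers),
  to = names(tiers),
  stringsAsFactors = FALSE
)
edge_pairs <- edge_pairs[edge_pairs$from != edge_pairs$to, ]

# Look up the tier of each endpoint
from_tier <- unname(tiers[edge_pairs$from])
to_tier <- unname(tiers[edge_pairs$to])

# An edge is ruled out when it points from a later tier to an earlier one;
# edges within the same tier are left unrestricted
forbidden <- edge_pairs[from_tier > to_tier, ]
nrow(forbidden)
```

With two variables in each of three tiers, this leaves 12 backward-pointing pairs: four from the `B`'s to the `A`'s, four from the `C`'s to the `A`'s, and four from the `C`'s to the `B`'s.
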

We can make the meaning of the tiered knowledge explicit by converting it into forbidden edges using `convert_tiers_to_forbidden()`:

```{r}
#| label: convert tiers to forbidden
kn_converted <- convert_tiers_to_forbidden(kn)
print(kn_converted)
plot(kn_converted)
```

Tidyselect helpers such as `starts_with` can also be used to define tiers concisely, just as with required and forbidden edges. Different tidyselect helpers can be freely combined within a tier definition using `+`.

For example, the following tiered `Knowledge` object defines two tiers, "young" and "old", by combining tidyselect helpers on the variables in the `tpc_example` dataset:

```{r}
#| label: tier knowledge with tidyselect
kn_tier_tidyselect <- knowledge(
  tpc_example,
  tier(
    young ~ starts_with("child") + ends_with(c("3", "4")),
    old ~ starts_with("old")
  )
)
plot(kn_tier_tidyselect)
```

# Exogenous variables knowledge

Exogenous variables are those that have no incoming edges in the causal graph. That is, they may act as causes of other variables but are not affected by any other variable. Exogenous variables can be specified using the `exogenous()` function within `knowledge()`.

## Specifying exogenous variables

The most natural usage is to supply the dataset so that the variables are checked for existence and selected correctly:

```{r}
#| label: exogenous knowledge
kn_exo_1 <- knowledge(
  tpc_example,
  exogenous("child_x1")
)
```

Instead of `exogenous()`, you can also use the shorthand function `exo()`.

This `Knowledge` object can be visualized:

```{r}
#| label: plot exogenous knowledge
plot(kn_exo_1)
```

Below we add both `child_x1` and `child_x2` as exogenous variables using tidyselect helpers:

```{r}
#| label: exogenous knowledge with tidyselect
kn_exo_2 <- knowledge(
  tpc_example,
  exo(starts_with("child"))
)
plot(kn_exo_2, layout = "bipartite", orientation = "columns")
```

# Combining different knowledge types

Different knowledge types can be freely combined in a single `Knowledge` object.
For example, we can combine tiered knowledge with required and forbidden edges:

```{r}
#| label: combined knowledge
kn_combined <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  ),
  child_x1 %-->% youth_x3,
  child_x1 %!-->% child_x2
)
plot(kn_combined)
```

# Using knowledge with causal discovery

Once prior knowledge has been specified, it can be supplied to causal discovery algorithms by passing the `Knowledge` object to the `disco()` function via the `knowledge` argument.

For example, we can use the Temporal GES algorithm `tges()` with engine "causalDisco" and temporal BIC ("tbic") score, while providing tiered knowledge:

```{r}
#| label: causal discovery with tier knowledge
kn <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  )
)
cd_tges <- tges(engine = "causalDisco", score = "tbic")
disco_cd_tges <- disco(data = tpc_example, method = cd_tges, knowledge = kn)
```

The causal discovery algorithm then learns the causal graph from the data, subject to the constraints specified in the `Knowledge` object.

```{r}
#| label: plot causal discovery with tier knowledge
plot(disco_cd_tges)
```

The black edges are those inferred from the data.

# Engine-specific information about knowledge

By engine we mean the underlying implementation of the causal discovery algorithm, i.e. the engine you specify to an algorithm such as `pc(engine = "bnlearn")` or `tges(engine = "causalDisco")`.

## bnlearn

All knowledge types are supported with the bnlearn engine.

When required knowledge is provided, bnlearn may emit a warning during structure learning. This occurs when the algorithm identifies a candidate v-structure (collider) from the data whose orientation conflicts with edges already oriented due to background knowledge.


```{r}
#| label: bnlearn
data(tpc_example)
kn <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3
)
bnlearn_pc <- pc(engine = "bnlearn", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = bnlearn_pc, knowledge = kn)
```

The resulting causal graph will still respect the provided knowledge.

```{r}
#| label: plot bnlearn
plot(output)
```

## causalDisco

The causalDisco engine supports only tiered and forbidden knowledge. If required knowledge is provided, the engine issues a warning and ignores it.

## pcalg

Only symmetric forbidden-edge constraints are supported by the pcalg engine. In practice, this means an edge must be forbidden in both directions. Such constraints can be specified using `%!-->%` in both directions, as illustrated below:

```{r}
#| label: pcalg
data(tpc_example)
kn <- knowledge(
  tpc_example,
  child_x1 %!-->% youth_x3,
  youth_x3 %!-->% child_x1
)
pc_pcalg <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = pc_pcalg, knowledge = kn)
```

## Tetrad

All knowledge types are supported with the Tetrad engine.