--- title: "Tree based correlation model implementation" author: "Elias T Krainski, Denis Rustand, Anna Freni-Sterrantino, Janet van Niekerk, and HÃ¥vard Rue" date: "Started in November 2023, updated in `r format(Sys.time(), '%B, %Y')`" output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Tree based correlation model implementation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, include = !FALSE, fig.width = 7, fig.height = 5, out.width = '0.39\\textwidth') chinclude <- TRUE library(graphpcor) ``` # Abstract In this vignette we show how to work a tree based correlation model proposed in @sterrantino2025graphical. We recommend to look at the 'tree_model' vignette for details. # The model definition A directed acyclic graph is defined from a set of nodes and directed edges linking them. Because the edges are directed, there is no closed loop on it. A tree is a graph with only one (directed) path between a pair of nodes. We use this fact to define correlation model. The nodes hierarchy will imposes a hierarchy on the correlation as well. The definition start by considering two kind of nodes: parent or children. The $m$ variables of interest, those for we want to model their correlation, will be classified as children variables. They are labeled as $c_i$, $i\in\{1,\ldots,m\}$. The nodes to represent children variables are leafs of the tree, having always an ancestor (parent) but with no children. Each children node has a directed edge from its parent. The path between each $c_i$ goes through a set of $k$ parent variables, labeled as $p_j$, $j\in\{1,\ldots,k\}$. The $m$ parent variables are nodes with children nodes. and some may have parent but will still be classified as parent because they have children. In the \textbf{\textsf{R}} environment we represent the parent variables with the letter ``p`` along with an integer number (``p1``, $\ldots$, ``pk``) and the children variables with the letter ``c`` along with an integer number (``c1``, $\ldots$, ``cm``). We adopt a simple way to specify the parent children representation. We consider the ``~`` (tilde) to represent the directed link and ``+`` (plus) or ``-`` (minus) to append the descendant to a parent. E.g.: ``p1 ~ p2 + c1 + c2, p2 ~ c3 - c4`` . ## Intial example Let us consider a correlation model for three variables, with one parameter, that is the same absolute correlation between each pair but the sign may differ. This can be the case when these variables share the same latent factor. The parent is represented as ``p1``, and the children variables as ``c1``, ``c2`` and ``c3``. We consider that ``c3`` will be negatively correlated with ``c1`` and ``c2``. For this case we define the tree as ```{r graph1, include = chinclude} tree1 <- treepcor(p1 ~ c1 + c2 - c3) tree1 summary(tree1) ``` where the summary shows their relationship. The number of children and parent variables are obtained with ```{r dim1} dim(tree1) ``` This tree can be visualized with ```{r visgraph1, include = chinclude} plot(tree1) ``` From this model definition we will use the methods to build the correlation matrix. 
First, we build the precision matrix structure (which is not yet a precision matrix):

```{r qs1, include = chinclude}
prec(tree1)
```

and we can supply $\log(\gamma_1)$, where $\gamma_1$ is the standard deviation of $p_1$, through the `theta` argument:

```{r q, include = chinclude}
q1 <- prec(tree1, theta = 0)
q1
```

We can obtain the correlation matrix, which is our primary interest, from the precision matrix. However, there is also a covariance method that can be applied directly:

```{r v1, include = chinclude}
vcov(tree1) ## assumes theta = 0 (\gamma_1 = 1)
vcov(tree1, theta = 0.5) # \gamma_1^2 = exp(2 * 0.5) = exp(1)
cov1a <- vcov(tree1, theta = 0)
cov1a
```

from which we obtain the desired matrix with

```{r c1, include = chinclude}
c1 <- cov2cor(cov1a)
round(c1, 3)
```

# Correlation matrix with two parameters

In this example, we model the correlation between four variables using two parameters. We consider ``c1`` and ``c2`` having the same parent, ``p1``, and ``c3`` and ``c4`` having the second parent, ``p2``, as parent. We want the correlation between ``c3`` and ``c4`` to be higher than the correlation between ``c1`` and ``c3``. This requires ``p2`` to be a child of ``p1``. The tree for this is set by

```{r graph2}
tree2 <- treepcor(
  p1 ~ p2 + c1 + c2,
  p2 ~ c3 - c4)
dim(tree2)
tree2
summary(tree2)
```

which can be visualized by

```{r visgraph2}
plot(tree2)
```

We can drop the last parent with

```{r drop1}
drop(tree2)
```

We now have two parameters: $\gamma_1^2$, the variance of $p_1$, and $\gamma_2^2$, the conditional variance of $p_2$. For $\gamma_1 = \gamma_2 = 1$, the precision matrix can be obtained with:

```{r q2}
q2 <- prec(tree2, theta = c(0, 0))
q2
```

The correlation matrix can be obtained with

```{r c2}
cov2 <- vcov(tree2, theta = c(0, 0))
cov2
c2 <- cov2cor(cov2)
round(c2, 3)
```

## Playing with signs

We can change the sign of any edge of the graph. A sign change on an edge from a parent to a child is simpler to interpret, as we can see in the covariance/correlation matrices from the two examples above. Let us consider the second example, but change the sign between the parents and swap the signs in both terms of the second equation:

```{r graph2b}
tree2b <- treepcor(
  p1 ~ -p2 + c1 + c2,
  p2 ~ -c3 + c4)
tree2b
summary(tree2b)
```

This gives the precision matrix as

```{r prec2}
q2b <- prec(tree2b, theta = c(0, 0))
q2b
```

The covariance (and hence the correlation) between the children, computed from the full precision matrix, is the same as before:

```{r cov2b}
all.equal(solve(q2)[1:4, 1:4], solve(q2b)[1:4, 1:4])
```

Therefore, allowing flexibility in an edge from one parent to another parent is not useful and only adds complexity, so we do not consider it in the `vcov` method.

NOTE: The `vcov` method for a `treepcor` does not take into account the sign between parent variables! Please use it with care.

# References