--- title: "Categorical Functional Data Analysis" author: "Cristian Preda, Quentin Grimonprez" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{cfda} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- See for more details about the mathematical background and an overview of the package. ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center", fig.width = 5, fig.height = 5) ``` ## Dataset Simulation ``` {r load, echo=TRUE, message=FALSE} library(cfda) set.seed(42) ``` We simulate the Jukes Cantor model of nucleotide replacement. Denote the set of nucleotides: $E = \{A, T, G, C\}$. A DNA sequence is seen as a string of nucleotides: `ATGCATTAC`. Each \textbf{site} in the sequence is subject to mutation over time. In the Jukes-Cantor model (1969), each site is a Markovian jump process with continuous time having as generator: $$ Q = \ \ \ \begin{array}{c|cccc} &A&T&G&C \\ \hline A&-3\alpha&\alpha&\alpha&\alpha\\ T&\alpha&-3\alpha&\alpha&\alpha\\ G&\alpha&\alpha&-3\alpha&\alpha\\ C&\alpha&\alpha&\alpha&-3\alpha\\ \hline \end{array}$$ for some $\alpha >0$. Let assume that the process is observed over the period $[0,T_{\max}]$ in $n$ sites with the nucleotide $A$ at time $t=0$. In the following, nucleotides are recoded: $A\leftrightarrow 1$, $T\leftrightarrow 2$, $G\leftrightarrow 3$, $C\leftrightarrow 4$. ```{r genData, echo=TRUE} K <- 4 PJK <- matrix(1 / 3, ncol = K, nrow = K) - diag(rep(1 / 3, K)) lambda_PJK <- c(1, 1, 1, 1) Tmax <- 10 n <- 100 d_JK <- generate_Markov(n = n, K = 4, P = PJK, lambda = lambda_PJK, Tmax = Tmax, labels = as.factor(c("A", "C", "G", "T"))) head(d_JK, 10) ``` The dataset is a data.frame with 3 columns: `id`, `time`, `state`. `id` contains the id of the different individuals (usually integers). `time` contains the time values at which a change occurs, note that for each individual, time values are ordered and start at 0. `state` contains the state that appears at the given time, state must be characters, factors, integers. This data format is used by the most part of cfda's functions. ```{r suumary, echo=TRUE} summary_cfd(d_JK) ``` We can compute the duration of each trajectory. ```{r, echo=TRUE} duration <- compute_duration(d_JK) head(duration) ``` All individuals has a different length. The computation of an optimal encoding requires that the end time of each individual is the same ($T_{\max}$), this can be done with the following function: ```{r cutT, echo=TRUE} d_JKT <- cut_data(d_JK, Tmax = Tmax) ``` Individuals can be plotted: ```{r plot, echo=TRUE, fig.height = 7} plotData(d_JKT) ``` ## Basic statistics Generally, in categorical functional data analysis, the following basic statistics are computed. ### Time spent in each state over a period of length $T_{\max}$: For each individual, the time spent in each of the $K$ states is computed. ```{r, echo=TRUE} timeSpent <- compute_time_spent(d_JKT) timeSpent[1:10, ] ``` The results can be plotted: ```{r, echo=TRUE} boxplot(timeSpent) ``` ### Number of jumps in $[0,T_{\max}]$: For each individual, the number of jumps occurring in $[0,T_{\max}]$ is computed. ```{r, echo=TRUE} nJump <- compute_number_jumps(d_JK) head(nJump) ``` The results can be plotted: ```{r, echo=TRUE} hist(nJump) ``` ### Probabilities to be in some state $x$ at time $t$, $p_{x}(t)$: An other interesting statistic is the evolution of the probability to be in each state. ```{r, echo=TRUE} pt_evol <- estimate_pt(d_JKT) pt_evol$pt[1:K, 1:10] head(pt_evol$t) ``` The output is a list with two elements: `t` the different time values and `pt` a matrix where each row contains the probability to be in a given state for all the time values. The result can be plotted: ```{r, echo=TRUE} plot(pt_evol, ribbon = TRUE) ``` ### Transitions in $[0,T_{\max}]$: The transitions between states can been studied by computing a frequency table counting the number of times each pair of states were observed in successive observation times. ```{r, echo=TRUE} statetable(d_JK) ``` Assuming that the data follows a Markov process, parameters $P$ (transition probability matrix) and $\lambda$ of the process can be estimated. ```{r, echo=TRUE} mark <- estimate_Markov(d_JK) mark ``` The estimated parameters are closed to the ones used. The results can be plotted through a transition graph where each node corresponds to a state with the mean time spent within (corresponding to $1/\lambda$) and arrows correspond to transition probabilities between states. ```{r, echo=TRUE} plot(mark) ``` ## Encoding ### Concept $X$ is a continuous stochastic process with jumps. We define categorical functional data as a set of sample paths of $X$. **cfda** is seen as an extension of the multiple correspondence analysis to a stochastic process. The idea is to find a scalar real random variable $z \in L_2(\Omega)$ that is the *most correlated* to ${X} = \{X_{t}\ : \ t\in [0,T_{\max}]\}$. ### Mathematical Background For a fixed $t\in [0,T_{\max}]$, let $E^{t}$ the projection operator, $$E^{t}(z) = \mathbb{E}(z|X_{t}) = \sum_{x\in S}{\mathbb{E}(z|X_t=x) \mathbf{1}_{X_t=x}}$$ Then, the correlation coefficient between $z$ and $X_{t}$ is: $$\eta^{2}(z, X_{t}) = \frac{var(\mathbb{E}(z|X_{t}))}{var(z)}$$ For $t, s \in [0,T_{\max}]$, the *(simple) correspondence analysis* looks for $z$ such that it maximizes $$\displaystyle\frac{1}{2}\left(\eta^{2}(z, X_{t})+\eta^{2}(z, X_{s})\right)$$ and the solution is: $$(E^{t}+E^{s})z = \lambda z$$ $z$ is called a *principal component*. As an extension, the *functional correspondence analysis* is defined as the solution of the optimization problem: $$ \arg\max_{z}\displaystyle\frac{1}{T_{max}}\int_{0}^{T_{max}}\eta^{2}(z, X_{t})dt$$ Solution: $$\int_{0}^{T_{max}}E^{s}z\ ds = \lambda z$$ Denote $\xi_t = E^{t}z$, $$\int_0^{T_{\max}}K(t,s)\xi_s\ ds = \lambda \xi_t, \ \ \ \ \mbox{with}\ K(s,t) = E^tE^s. $$ It follows that, $$\xi_{t} = \sum_{x\in S}a_{x}(t)\mathbf{1}_{X_t=x}$$ with $a_x(t) =\mathbb{E}(z|X_t= x), \ \ \forall x \in S$. For each $x\in S$, the functions $a_x: [0,T_{\max}]\rightarrow \mathbb{R}$ are called *optimal encoding* of the state $x$. The eigenvalue problem becomes in terms of encodings: $$\lambda a_x(t) = \sum_{y\in S}\displaystyle\int_{0}^{T_{\max}}\frac{p_{x,y}(t,s)}{p_x(t)}a_y(s)ds, \ \ \ \ \ \forall x \in S, \forall t\in [0,T_{\max}], $$ where $p_{x,y}(t,s) = \mathbb{P}(X_t=x, X_s=y)$ and $p_x(t)= \mathbb{P}(X_t=x)$. Under general conditions (continuity in probability of $X$), there exists $\{\lambda_i\}_{i\geq1}$ positive eigen-values and $\{a_x^i\}_{i\geq 1}$ eigen-functions. The following expansion formula holds (Mercer thm.): $$p_{x,y}(t,s) = p_x(t)p_y(s)\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)$$ For $x=y$, one obtains: $$p_x(t) =\displaystyle\frac{1}{\displaystyle\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)}$$ For the process $X = \{X_t, t\in[0, T]\}$ throughout the indicators $1_x = \{1_x(t), t\in [0, T]\}$: $$1_x(t)=\sum_{i\geq 1} z_i a_i^x(t)\frac{1}{p_x(t)}$$ We are interested in approximating the encoding functions, $a_x$, by $$a_{x} \approx \sum_{i=1}^m \alpha_{x,i}\phi_i, $$ where $\mathbf{\alpha_x} = (\alpha_{x,1}, \ldots, \alpha_{x,m}) \in \mathbb{R}^m$ are the expansion coefficients onto a basis of functions defined on $[0,T]$, $\{\phi_1, \ldots, \phi_m\}$, $m \geq 1$. The main result (Deville, 1982) is that $\alpha = (\alpha_{1}, \ldots, \alpha_{K}) \in \mathbb{R}^{Km}$ is the solution of the following eigen-value problem: $$F^{-1}G\alpha = \lambda\alpha, $$ where $G$ and $F$ the matrix defined by: $$G = cov(\{V_{(x,i)}\}_{x\in S; i=1:m }),$$ $$F = \mathbb{E}\left(\{F_{(x,i),(y,j)}\}_{x,y\in S ; i,j=1:m}\right)$$ with the random variables $V_{(x,i)}$ and $F_{(x,i),(x,j)}$ defined by: $$V_{(x,i)} = \int_{0}^{T_{\max}}\phi_i(t)\mathbf{1}_xdt \ \ \ \ \mbox{and}\ \ \ \ F_{(x,i), (y,j)} = \int_{0}^{T_{\max}}\phi_i(t)\phi_{j}(t)\mathbf{1}_x\mathbf{1}_y dt$$ ### Application Firstly, a basis of functions must be defined. We choose a B-splines basis of order 4. ```{r, echo=TRUE} m <- 20 basis <- create.bspline.basis(c(0, Tmax), nbasis = m, norder = 4) ``` The optimal encoding is computed using: ```{r, echo=TRUE} fmca <- compute_optimal_encoding(d_JKT, basis, verbose = FALSE) summary(fmca) ``` By default this function computes bootstrap estimates of the encoding functions in order to have a confidence interval. This is controlled by the `computeCI` arguments. The output is a list containing the different elements computed during the process: `eigenvalues`, `pc`, `alpha`, `F`, `G`, `V` and `basisobj`. Note that `alpha`contains the coefficients of the different encoding functions and `pc`the principal components. These components can be used with classical statistic methods (k-means, regression...). The eigenvalues can be computed using: ```{r, echo=TRUE} plotEigenvalues(fmca, cumulative = TRUE, normalize = TRUE) ``` The first encoding function coefficients $\alpha$'s: ```{r, echo=TRUE} print(fmca$alpha[[1]]) ``` The resulting encoding can be plotted: ```{r, echo=TRUE} plot(fmca) ``` or extracted as a *fd* object (or matrix): ```{r, echo=TRUE} encoding <- get_encoding(fmca, fdObject = TRUE) ``` Note that the first encoding mainly oppose the G state (in negative values) to other states. So, on the first component, individuals with a large negative values will tend to spend time in the G state after time 2.5. By default, it plots (and returns) the first encoding functions. Other encoding can be accessed with the `harm` argument. A confidence interval can be plotted using the `addCI` argument: ```{r, echo=TRUE} plot(fmca, addCI = TRUE, coeff = 2, states = "A") ``` Plot the two first components: ```{r, echo=TRUE} plotComponent(fmca, comp = c(1, 2), addNames = TRUE) ``` We can see that the individuals 3 and 14 have extreme ngative values for the first component and are opposed to 67 and 84. By plotting them, we can check the statement made on the first component meaning. ```{r, echo=TRUE} plotData(d_JKT[d_JKT$id %in% c(3, 14, 67, 84), ]) ``` The components can be used for some other tasks such as clustering, predicting,... #### Reconstruct indicators With the encodings, the indicators can be reconstructed as follows: ```{r, echo=TRUE} indicators <- reconstructIndicators(fmca) head(indicators) ``` ```{r, echo=TRUE} iInd <- 3 plotData(d_JKT[d_JKT$id == iInd, ]) plotIndicatorsReconstruction(indicators, id = iInd) ``` We see that the reconstructed indicators gives the same inforamtion as the true ones. Using more basis functions for computing encoding or other basis can improve the approximation.