---
title: "Comparison with known results"
output: 
  bookdown::pdf_document2: 
    toc: false
    number_sections: true
    citation_package: natbib
    highlight: tango
    fig_caption: true
vignette: >
  %\VignetteIndexEntry{Comparison with known results}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
header-includes: 
  - \usepackage{tikz}
  - \usepackage{pgfplots}
  - \pgfplotsset{compat=1.8}
  - \usepackage{caption}
  - \usepackage{amsmath}
  - \usepackage{mathtools}
  - \usepackage{dsfont}
  - \usepackage{prodint}
  - \usepackage{makecell}
  - \usepackage{booktabs}
  - \usepackage{prodint}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```








# Known results {#Known-results}

The `icmstate` package can be used to non-parametrically estimate the transition intensities in interval-censored Markov multi-state models without loops. Although estimation for general multi-state models is quite novel, some results already exist for specific interval-censored multi-state models. In this vignette, we compare the estimates obtained using the `icmstate` package with the known results.


# Frydman (1995) non-parametric estimator {#Frydman-NPMLE}

@Frydman1995 has derived the non-parametric estimator for a specific illness-death model (See Figure \@ref(fig:IDM)). The estimator can only be used if death times are observed exactly, and for each person it is known whether they have transitioned through illness on the way to death or not. 

\begin{figure}[!ht]
\centering
\begin{tikzpicture}
        \node (alive) [ draw, thick, rectangle, minimum width=2cm, minimum height = 1.5cm] at (0,0cm) {1. Alive};
        \node (illness) [ draw, thick, rectangle, minimum width=2cm, minimum height = 1.5cm] at (1.5,2cm) {2. Illness};
        \node (death) [ draw, thick, rectangle, minimum width=2cm, minimum height = 1.5cm] at (3,0cm)  {3. Death};
        \node (deathafterillness) [ draw, thick, dashed, rectangle, minimum width=2cm, minimum height = 1.5cm, text width=2cm] at (4.5,2cm) {4. Death after Illness};
        \draw[->, thick] (alive) -- (illness);
        \draw[->, thick, dotted] (illness) -- (death);
        \draw[->, thick, dashed] (illness) -- (deathafterillness);
        \draw[->, thick] (alive) -- (death);
\end{tikzpicture}
\caption{Graphical representation of the (extended) illness-death model. Extended: solid and dashed lines. Standard: solid and dotted lines.}
\label{fig:IDM}
\end{figure}

The non-parametric estimator described in @Frydman1995 is available through the `msm_frydman()` function. It allows for the non-parametric estimation of a mix of cumulative distribution functions and cumulative intensity functions. 


## Comparison with the icmstate package

Using the `npmsm()` function it is not possible to directly derive the NPMLE for the standard illness-death model described above. We can however derive an estimate by cleverly adjusting the fitted multi-state model. If we specify a transition matrix corresponding to the illness death model, a transition from healthy to death can indicate that the subject has either passed away after experiencing illness or directly. If we consider the extended illness-death model instead (See Figure \@ref(fig:IDM)), then we could fit the desired model. In this case, death after illness and death are separate states, allowing us to distinguish between these two pathways.

To compare the two estimators, we generate data that adheres to these requirements by simulating from an extended illness-death model:

```{r IEDgeneration, warning=FALSE}
library(icmstate)
library(mstate)
library(msm)
set.seed(1)
tmat_EID <- mstate::transMat(x = list( c(2, 3), c(4), c(), c() ))
qmatrix <- rbind(
    c(-0.15, 0.1, 0.05, 0),
    c(0, -0.1, 0, 0.1),
    c(0, 0, 0, 0),
    c(0, 0, 0, 0)
  )
n <- 30

#time = observation time, subject = subject identifier
simdat <- data.frame(time = c(replicate(n, c(0, seq(2, 12, by=2) + runif(6, 0, 2)))),
                    subject = rep(1:n, each = 7))
#Simulate interval-censored data. See help(simmulti.msm)
dat <- simmulti.msm(data = simdat, qmatrix = qmatrix, start = 1,
                    death = c(3,4))[, 1:3]
names(dat)[1] <- "id"

```

Let us visualise the generated data:

```{r visualiseIED}
visualise_msm(dat, tmat = tmat_EID)
```


We observe some direct $1 \to 4$ transitions. To use Frydman's result, we must have an interval for the $1 \to 2$ transition. To avoid copmlicated data processing, we simply remove subjects $5$ and $22$.

```{r remove14trans}
dat <- dat[!dat[, "id"] %in% c(5, 22),]
```

Unfortunately, the data format used by the `msm_frydman()` function differs quite a lot from the `npmsm()` function, so we will need to do some data processing anyways. To see what the data needs to look like, see `help(msm_frydman)`. In general, the data should be represented by the tuple $(\delta, \Delta)$ with $\delta$ indicating whether a transition to the illness (state 2) has occurred, and $\Delta$ indicating whether a transition to a death state (state 3 or 4) has occurred. We write a function which transforms to the correct data format:

```{r msmtoFrydman}
#Create Frydman data
msmtoFrydman <- function(gd){
  #Create Frydman data
  gd <- remove_redundant_observations(gd, tmat = tmat_EID)
  gd_frydman <- NULL
  for(j in unique(gd$id)){
    tempdat <- subset(gd, id == j)
    tempstates <- unique(tempdat$state)
    if(length(tempstates) == 1){ #If we only observe the subject in 1 state, right censored in 1
      gdi_frydman <- data.frame(delta = 0, Delta = 0,
                                L = NA,
                                R = NA,
                                time = tempdat$time[length(tempdat$time)])
    } else if(length(tempstates) == 2){ #If we only observe the subject in 2 states, either 1->2 or 1->3 has occured
      if(all(tempstates %in% c(1,2))){
        gdi_frydman <- data.frame(delta = 1, Delta = 0, 
                                  L = tempdat$time[which.min(tempdat$state == 1)-1],
                                  R = tempdat$time[which.min(tempdat$state == 1)],
                                  time = tempdat$time[length(tempdat$time)])
      } else if(all(tempstates %in% c(1,3))){
        gdi_frydman <- data.frame(delta = 0, Delta = 1, 
                                  L = NA,
                                  R = NA,
                                  time = tempdat$time[which.min(tempdat$state == 1)])
      }
    } else if(length(tempstates) == 3){ #If we observe 3 states, then 1->2->3 must have occured
      gdi_frydman <- data.frame(delta = 1, Delta = 1, 
                                L = tempdat$time[which.min(tempdat$state == 1)-1],
                                R = tempdat$time[which.min(tempdat$state == 1)],
                                time = tempdat$time[length(tempdat$time)])
    }
    gd_frydman <- rbind(gd_frydman, gdi_frydman)
  }
  return(gd_frydman)
}
```

We can then visualise the Frydman data as well using the `visualise_data()` function:

```{r visFrydmandat}
dat_frydman <- msmtoFrydman(dat)
visualise_data(dat_frydman)
```
And finally we can fit both models (we need to specify a tolerance for both, as they are both EM algorithms):

```{r fitmodels}
mod_npmsm <- npmsm(gd = dat, tmat = tmat_EID, maxit = 300, exact = c(3,4),
                   tol = 1e-6)
mod_frydman <- msm_frydman(data = dat_frydman, tol = 1e-6)
```

The first thing we can do is compare the support intervals (i.e. the intervals where the intensities can be non-zero) for the $1 \to 2$ transition. For this we can use the `support_npmsm()` function for the `npmsm` fit, which numerically determines on which intervals the transition intensities are estimated to be non-zero. The `msm_frydman` fit automatically returns the support intervals, as these can be determined from the theory in the article.

```{r printsupportmodels}
supp_npmsm <- support_npmsm(mod_npmsm, cutoff = 1e-9)
supp_frydman <- mod_frydman$supportMSM$Q_mat
print("Frydman Support")
supp_frydman
print("icmstate Support")
supp_npmsm$`State 1 -> State 2`$support
```


The support sets from @Frydman1995 do not necessarily all need to contain non-zero intensities, as some of them can still be zero. We therefore would like to compare the two estimators on their estimated survival functions. However, both models estimate different quantities. Where the `npmsm()` fit estimates transition intensities and can recover transition probabilities, the `msm_frydman()` fit estimates quite specific quantities. Let $S$ be the (unobserved) entry time into state $2$, and $T$ the entry time into state $3$ and finally $V = T - S$. Then the `msm_frydman()` function estimates the following quantities:

* $F_{12}(s) = \mathbb{P}(S \leq S, \delta = 1)$
* $F_{13}(s) = \mathbb{P}(S \leq s, \delta = 0)$
* $\Lambda_{23}(u) = \mathbb{P}(T = u | T \geq u, \delta = 1)$

From these quantities we can then also recover 

* $F(s) = \mathbb{P}(S \leq s) = F_{12}(s) + F_{13}(s)$ 
* $G_s(v) = \mathbb{P}(V > v | S = s, \delta = 1) = \Prodi_{x < u \leq s + v} (1 - \Lambda_{23}(\{u\}))$

It can be shown that the estimators from @Frydman1995 can be compared with the transition intensities/probabilities in the following way:
\begin{center}
\begin{tabular}{c|c|c}
    MSM & Frydman (95) & Support \\
    \hline
    $P_{11}(0, s)$ &  $1 - F(s) =  1- F_{12}(s) - F_{13}(s)$ & $\mathcal{S}_{12} \cup \mathcal{S}_{13}$\\
    $P_{13}(0, s)$ & $F_{13}(s)$ & $\mathcal{S}_{13}$ \\
    $P_{13}(0, s) + P_{11}(0,s)$ & $1 - F_{12}(s)$ & $\mathcal{S}_{12}$ \\
    $P_{24}(s,s+v) = \prod_{s < \tau_k \leq s + v} (1-\alpha_{24}^k)$ & $G_s(v) = \prod_{s < u \leq s+v} \left(  1 - \Lambda_{23}(\{u\}) \right)$& $\mathcal{S}_{23}$
\end{tabular}
\end{center} 


The support column indicates that we can only compare the quantities on the right endpoints of the corresponding support sets. The subscript indicates the transition of the support set. As an example, if the support set $\mathcal{S}_{12} = \{ (1, 2], (2.5, 3] \}$ then the comparison of the quantities $P_{13}(0, s) + P_{11}(0,s)$ and $1 - F_{12}(s)$ can only be made on the right endpoints $2$ and $3$. We determine the relevant right-endpoints:
```{r}
#Right-endpoints of the 1->2 transition
RE12 <- supp_frydman[, 2]
#Right-endpoints of the 1->3 transition
RE13 <- mod_frydman$data_idx$e_k_star
#Right-endpoints of the 2->3 transition
RE23 <- mod_frydman$data_idx$t_n_star
```

Let us perform the comparison.

### Survival in state 1

We compare $P_{11}(0, s)$ with $1 - F(s) =  1- F_{12}(s) - F_{13}(s)$. Transition probabilities can be recovered using the `transprob()` function in the package. The cdf of the $1 \to 2$ and $1 \to 3$ transition are contained in the `msm_frydman()` fit through the `$cdf` list. Note that the output of the cdf for the $1 \to 2$ transition contains both the upper and lower value of the cdf, as we do not know where exactly the cdf makes the jump in the support intervals. Additionally, we can only compare the values with each other on the right-endpoints of the support intervals.

```{r compsurvstate1}
#Transition probabilities from state 1 from time 0
P11 <- transprob(mod_npmsm, predt = 0)
#Extract times of interest
times1 <- P11[[1]][,1]
#1-F(s) for Frydman estimator:
#We take min(F_{12}(x)) as the cdf has only jumped at the right-endpoints
Frydman1minF <- sapply(times1, function(x) 1- (mod_frydman$cdf$F13(x) +
                                               min(mod_frydman$cdf$F12(x))))
#Comparison plot
plot(P11, main = "icmstate vs Frydman (dashed red line) 
     Comparison times (dotted blue lines)")
lines(times1, Frydman1minF, col = "red", type = "s", lwd = 2, lty = 2)
abline(v = unique(c(RE12, RE13)), col = "blue", lwd = 2, lty = 3)
```

We can see that the quantities correspond on the right-endpoints. 




### Transition to death

We perform a similar comparison between $P_{13}(0, 2)$ and $F_{13}(s)$.

```{r}
FrydmanF13 <- sapply(times1, function(x) mod_frydman$cdf$F13(x))
#Comparison plot
plot(P11, main = "icmstate vs Frydman (dashed red line) 
     Comparison times (dotted blue lines)", ord = c(3, 1, 2, 4))
lines(times1, FrydmanF13, col = "red", type = "s", lwd = 2, lty = 2)
abline(v = unique(RE13), col = "blue", lwd = 2, lty = 3)
```
Again, the two estimators align.

### Not catching illness

We compare $P_{13}(0, s) + P_{11}(0, s)$ with $1-F_{12}(s)$.

```{r}
FrydmanF12 <- sapply(times1, function(x) min(mod_frydman$cdf$F12(x)))
#Comparison plot
plot(P11, main = "icmstate vs Frydman (dashed red line) 
     Comparison times (dotted blue lines)", ord = c(3, 1, 2, 4))
lines(times1, 1-FrydmanF12, col = "red", type = "s", lwd = 2, lty = 2)
abline(v = unique(RE12), col = "blue", lwd = 2, lty = 3)
```

The estimate coincide!

### Dying after illness

We compare $P_{24}(0, s) = \prod_{0 < \tau_k \leq s} (1-\alpha_{24}^k)$ with $G_s(v) = \prod_{0 < u \leq s} \left(  1 - \Lambda_{23}(\{u\}) \right)$.

```{r}
#Calculate dA23
FrydmandA23 <- c(0, diff(sapply(times1, function(x) mod_frydman$cdf$Lambda23(x))))
#We calculate the product integral for the Frydman estimator
FrydmanG <- cumprod(1-FrydmandA23)
#Comparison plot
plot(P11, main = "icmstate vs Frydman (dashed red line) 
     Comparison times (dotted blue lines)", from = 2, ord = c(2, 4, 1, 3))
lines(times1, FrydmanG, col = "red", type = "s", lwd = 2, lty = 2)
abline(v = unique(RE23), col = "blue", lwd = 2, lty = 3)
```

And they coincide again!



<!-- # Hudgens (2001/2005) support determination {#Hudgens-support} -->

<!-- @Hudgens2005 demonstrated a result for determining the support intervals of interval censored data through the use of graph theory. He further extended the theory to determine when a NPMLE actually exists in the presence of truncation. The resulting theory can be used to determine existence of an NPMLE in the case of interval censored and truncated data using the `supportHudgens()` function.  -->

<!-- Additionally, an attempt was made to use the combination of @Hudgens2001 and @Hudgens2005 to determine the support of the general interval-censored multi-state problem. This attempt can be reproduced using the `estimate_support_msm()` function, although the resulting support intervals do not align with the empirical result of the `npmsm()` function.  -->

<!-- Will be covered at a later time, when we obtain a correct result. -->