---
title: "Censoring Types and Masked Causes"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Censoring Types and Masked Causes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

## Two Independent Sources of Information Loss

Masked-cause series system data involves two orthogonal forms of incomplete observation:

1. **Censoring of the failure time** — we may not observe $T_i$ exactly. The system may be observed only as "still working at $t$" (right-censored), "already failed by $t$" (left-censored), or "failed somewhere in $(t, t_{\text{upper}})$" (interval-censored).
2. **Masking of the failure cause** — even when $T_i$ is observed, the *component* that failed is often only known up to a candidate set $C_i$ that satisfies conditions C1, C2, and C3.

`maskedhaz` handles all four observation types uniformly, falling back to numerical integration when needed. This vignette walks through each observation type with a worked example, then cross-validates the numerical machinery against the closed-form exponential series likelihood in `maskedcauses`.

## The `omega` Column

Data frames passed to `loglik()`, `score()`, `hess_loglik()`, and `fit()` must include an `omega` column indicating the observation type per row:

| `omega` | Meaning | Required columns |
|---------|---------|------------------|
| `"exact"` | $T_i = t_i$; failure observed | `t`, `x1..xm` |
| `"right"` | $T_i > t_i$; still surviving at $t_i$ | `t` (candidate columns ignored) |
| `"left"` | $T_i < t_i$; failed before $t_i$ | `t`, `x1..xm` |
| `"interval"` | $t_i < T_i < t_{\text{upper},i}$; failed in interval | `t`, `t_upper`, `x1..xm` |

Candidate sets are Boolean columns `x1`, `x2`, ..., `xm`. Right-censored rows are allowed to have all-`FALSE` candidate columns because they carry no cause-of-failure information.
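To make the table's rules concrete, here is a quick base-R sanity check of the schema. Note that `check_omega_schema()` is a hypothetical helper sketched for this vignette only — it is not part of `maskedhaz` — and it encodes just two of the rules above: interval rows need `t_upper`, and every non-right row needs a non-empty candidate set.

```{r schema-check}
# Hypothetical helper (NOT part of maskedhaz): validate a data frame
# against the omega schema before passing it to loglik() or fit().
check_omega_schema <- function(df, m = 3) {
  xs <- as.matrix(df[paste0("x", seq_len(m))])
  # Interval rows must supply an upper endpoint
  ok_interval <- df$omega != "interval" | !is.na(df$t_upper)
  # Every non-right row must name at least one candidate component
  ok_candidates <- df$omega == "right" | rowSums(xs) > 0
  stopifnot(all(ok_interval), all(ok_candidates))
  invisible(df)
}

df_demo <- data.frame(
  t = c(3, 5, 1.5),
  t_upper = c(NA, NA, 3.0),
  omega = c("exact", "right", "interval"),
  x1 = c(TRUE, FALSE, TRUE),
  x2 = c(FALSE, FALSE, TRUE),
  x3 = c(TRUE, FALSE, FALSE)
)
check_omega_schema(df_demo)  # passes silently
```

The right-censored row has all-`FALSE` candidate columns and still passes, exactly as the schema allows.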
## Contribution Formulas

For a row of type $\omega$, the log-likelihood contribution is:

- **Exact**: $\log\left(\sum_{j \in C_i} h_j(t_i; \theta_j)\right) - H_{\text{sys}}(t_i; \theta)$
- **Right**: $-H_{\text{sys}}(t_i; \theta)$
- **Left**: $\log \int_0^{t_i} \left[\sum_{j \in C_i} h_j(u; \theta_j)\right] S_{\text{sys}}(u; \theta) \, du$
- **Interval**: $\log \int_{t_i}^{t_{\text{upper},i}} \left[\sum_{j \in C_i} h_j(u; \theta_j)\right] S_{\text{sys}}(u; \theta) \, du$

The exact and right contributions are closed-form regardless of the component family. Left and interval contributions require integration of the candidate-weighted instantaneous hazard over the observation interval, which `maskedhaz` does via `stats::integrate()`.

## Worked Examples

### Exact and right-censored

A reliability study runs $n = 400$ systems for $\tau = 5$ time units. Any failure observed during $[0, \tau]$ is `"exact"`; any system still running at $\tau$ is `"right"`-censored.

```{r setup-exact}
library(maskedhaz)

model <- dfr_series_md(components = list(
  dfr_weibull(shape = 2, scale = 6),
  dfr_exponential(0.08),
  dfr_exponential(0.12)
))

set.seed(100)
df <- rdata(model)(theta = c(2, 6, 0.08, 0.12), n = 400, tau = 5, p = 0.4)
table(df$omega)
```

Fit directly:

```{r fit-exact}
result <- fit(model)(df, par = c(1.5, 7, 0.05, 0.1))
rbind(truth = c(2, 6, 0.08, 0.12), estimate = round(coef(result), 3))
```

Recovery is reasonable: the Weibull scale (parameter 2) is almost exact, the two exponential rates are within 10%, and the Weibull shape is ~15% high. With only 317 exact failures and $p = 0.4$ (each non-failed component in the candidate set 40% of the time), this is normal sampling variation — not a problem with the method.

### Left-censored

Left censoring arises in inspection-based designs: we check a system at time $t_i$ and observe that it has *already* failed, but don't know when.
The candidate set $C_i$ reflects what the post-mortem diagnostic reveals about which component caused the failure.

```{r setup-left}
df_left <- data.frame(
  t = c(2, 3, 4, 3.5, 5),
  omega = "left",
  x1 = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  x2 = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  x3 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
df_left
```

Evaluating the log-likelihood exercises the integration path:

```{r loglik-left}
ll_fn <- loglik(model)
ll_fn(df_left, par = c(2, 6, 0.08, 0.12))
```

Each left-censored row contributes $\log P(T_i < t_i, K_i \in C_i \mid \theta)$ — the log of a probability, which must be negative. The five rows above average about $-1$ each, reflecting the fact that by $t \in [2, 5]$ the system has a non-trivial probability of having failed (so each contribution is not too far below zero).

### Interval-censored

Interval censoring arises when inspection times bracket the failure: at time $t_i$ the system was working, and by time $t_{\text{upper},i}$ it had failed. The candidate set is again the diagnostic summary.

```{r setup-interval}
df_interval <- data.frame(
  t = c(1.0, 2.0, 3.5),
  t_upper = c(2.5, 4.0, 5.0),
  omega = "interval",
  x1 = c(TRUE, FALSE, TRUE),
  x2 = c(TRUE, TRUE, FALSE),
  x3 = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
df_interval

ll_fn(df_interval, par = c(2, 6, 0.08, 0.12))
```

### Mixed observation types

In practice, a single dataset often contains all four types — different subjects entered the study at different times and were observed under different protocols.

```{r mixed}
df_mixed <- data.frame(
  t = c(3, 5, 2, 1.5, 4.5),
  t_upper = c(NA, NA, NA, 3.0, NA),
  omega = c("exact", "right", "left", "interval", "exact"),
  x1 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  x2 = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  x3 = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
ll_fn(df_mixed, par = c(2, 6, 0.08, 0.12))
```

`maskedhaz` dispatches each row to the appropriate contribution formula automatically — no special setup required.
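For the exact and right-censored rows of `df_mixed`, the contribution formulas are simple enough to evaluate by hand, which makes a useful mental model for what the package computes. The sketch below is base R only, and it assumes `dfr_weibull(shape, scale)` uses the standard parameterization $h(t) = (k/s)(t/s)^{k-1}$, $H(t) = (t/s)^k$ — an assumption about the package's convention, not something this vignette has shown.

```{r hand-contributions}
# Hand-rolled contributions for the Weibull(2, 6) + Exp(0.08) + Exp(0.12)
# model, assuming the standard hazard parameterizations:
#   Weibull:     h(t) = (k/s) * (t/s)^(k-1),  H(t) = (t/s)^k
#   Exponential: h(t) = lambda,               H(t) = lambda * t
h <- list(
  function(t) (2 / 6) * (t / 6)^(2 - 1),  # Weibull(shape = 2, scale = 6)
  function(t) rep(0.08, length(t)),
  function(t) rep(0.12, length(t))
)
H_sys <- function(t) (t / 6)^2 + 0.08 * t + 0.12 * t

# Exact row: t = 3, candidate set {1, 3}
ll_exact <- log(h[[1]](3) + h[[3]](3)) - H_sys(3)

# Right-censored row: t = 5 (candidate columns ignored)
ll_right <- -H_sys(5)

c(exact = ll_exact, right = ll_right)
```

These two numbers should match the corresponding per-row contributions inside `ll_fn(df_mixed, ...)`; the left and interval rows are the ones that additionally require numerical integration.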
## Cross-Validation Against a Reference Implementation

Both `maskedhaz` and `maskedcauses` implement the `series_md` protocol — the same likelihood, the same data schema, the same C1–C2–C3 conditions. They differ only in *how* the likelihood is computed. `maskedcauses` ships a closed-form reference implementation for exponential-series models via `exp_series_md_c1_c2_c3()`. `maskedhaz` computes the same likelihood numerically (even for exponential components, the left/interval paths go through `integrate()`). Comparing the two evaluates the numerical machinery against a trusted analytical baseline.

```{r cross-validate, eval = requireNamespace("maskedcauses", quietly = TRUE)}
library(maskedcauses)

# Same three-component exponential series in both packages
rates <- c(0.1, 0.2, 0.3)
exp_model_haz <- dfr_series_md(
  components = list(dfr_exponential(), dfr_exponential(), dfr_exponential()),
  n_par = c(1L, 1L, 1L)
)
exp_model_mc <- exp_series_md_c1_c2_c3()

# One row of each observation type
df_cv <- data.frame(
  t = c(3.0, 5.0, 2.0, 1.5),
  t_upper = c(NA, NA, NA, 3.0),
  omega = c("exact", "right", "left", "interval"),
  x1 = c(TRUE, FALSE, TRUE, TRUE),
  x2 = c(FALSE, FALSE, TRUE, FALSE),
  x3 = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Evaluate both log-likelihoods at the same parameters
ll_haz <- loglik(exp_model_haz)(df_cv, par = rates)
ll_mc <- likelihood.model::loglik(exp_model_mc)(df_cv, par = rates)
c(maskedhaz = ll_haz, maskedcauses = ll_mc, difference = abs(ll_haz - ll_mc))
```

The two log-likelihoods agree far more closely than `integrate()`'s default relative tolerance (`.Machine$double.eps^0.25`, roughly $1.2 \times 10^{-4}$) guarantees — on smooth integrands like these the quadrature typically delivers near machine precision. This is a direct test of the numerical machinery: `maskedcauses` uses closed-form expressions for each contribution (which exist for exponential components), while `maskedhaz` uses `integrate()` for the left and interval contributions. Matching end-to-end is strong evidence that the integration path is correct.
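The closed form that `maskedcauses` exploits is easy to derive for exponential components. For the left contribution with rates $\lambda_j$, total rate $\Lambda = \sum_j \lambda_j$, and candidate set $C$, the integral collapses to $\log\left[\frac{\sum_{j \in C} \lambda_j}{\Lambda}\left(1 - e^{-\Lambda t_i}\right)\right]$. A base-R check that `stats::integrate()` reproduces it (this mirrors what the package comparison above tests, without needing either package installed):

```{r left-closed-form}
# Left-censored contribution for an exponential series:
# closed form vs. stats::integrate() on the same integrand.
rates  <- c(0.1, 0.2, 0.3)
Lambda <- sum(rates)
cand   <- c(TRUE, TRUE, FALSE)  # candidate set {1, 2}
t_i    <- 2

# Candidate-weighted hazard times system survival
integrand  <- function(u) sum(rates[cand]) * exp(-Lambda * u)
numeric_ll <- log(integrate(integrand, lower = 0, upper = t_i)$value)
closed_ll  <- log(sum(rates[cand]) / Lambda * (1 - exp(-Lambda * t_i)))

c(numeric = numeric_ll, closed = closed_ll)
```

The interval contribution works the same way, with $e^{-\Lambda t_i} - e^{-\Lambda t_{\text{upper},i}}$ replacing $1 - e^{-\Lambda t_i}$.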
Because both packages return the *same* `fisher_mle` class from `fit()`, the cross-validation also works at the estimation level — you can fit the same data with both, compare `coef()`, `vcov()`, and `logLik()` side by side, and any agreement (or disagreement) is directly interpretable because the return types are identical.

## Identifiability Under Masking

Identifiability of the individual component rates in an exponential series system depends on the masking structure, and it is worth being precise about the limits.

```{r identifiability-exp, eval = requireNamespace("maskedcauses", quietly = TRUE)}
# A realistic-sized dataset with partial masking
set.seed(2024)
df_id <- likelihood.model::rdata(exp_model_mc)(
  theta = rates, n = 300, tau = 5, p = 0.5
)

result_id <- suppressWarnings(
  fit(exp_model_haz)(df_id, par = c(0.5, 0.5, 0.5))
)
cat("individual MLE:", round(coef(result_id), 3), " (truth:", rates, ")\n")
cat("sum of MLE:   ", round(sum(coef(result_id)), 3), " (truth:", sum(rates), ")\n")
```

With `p = 0.5` (each non-failed component in the candidate set half the time), the individual rates are recovered well — candidate sets of varying composition carry real information about which component is which. We can verify this is not an artifact of the seed by checking the log-likelihood at very different rate vectors with the same sum:

```{r identifiability-structure, eval = requireNamespace("maskedcauses", quietly = TRUE)}
ll_fn <- loglik(exp_model_haz)
c(truth    = ll_fn(df_id, par = c(0.10, 0.20, 0.30)),
  balanced = ll_fn(df_id, par = c(0.20, 0.20, 0.20)),
  skewed   = ll_fn(df_id, par = c(0.05, 0.05, 0.50)))
```

The three parameter vectors all have sum $0.6$, but the log-likelihoods differ substantially. The candidate-set structure is informative about the *ratios* of the rates, not just their sum. That said, individual identifiability degrades as masking grows.
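To see why, consider the fully masked extreme: when every candidate set contains all three components, an exact row's contribution collapses to $\log \Lambda - \Lambda t$, a function of the total rate $\Lambda$ alone. A quick base-R check with the same three rate vectors as above:

```{r full-masking}
# Under full masking, an exact row at time t contributes
# log(sum(rates)) - sum(rates) * t, regardless of how the total
# rate is split across components.
contrib_full_mask <- function(rates, t) log(sum(rates)) - sum(rates) * t

c(truth    = contrib_full_mask(c(0.10, 0.20, 0.30), t = 2),
  balanced = contrib_full_mask(c(0.20, 0.20, 0.20), t = 2),
  skewed   = contrib_full_mask(c(0.05, 0.05, 0.50), t = 2))
```

All three values are identical — unlike the partial-masking comparison above, the likelihood can no longer tell the rate vectors apart.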
In the limit `p = 1` (full masking, every candidate set is $\{1, 2, 3\}$), only the sum $\sum_j \lambda_j$ remains identifiable — the candidate sets tell you nothing about which component failed, and the likelihood reduces to a function of the total rate alone. And under pure right-censoring (no exact failures), no candidate-set information is ever carried, so only the sum is identified regardless of `p`. The `series_md` protocol doesn't preclude fitting under these conditions; the likelihood surface is simply flat along the non-identified directions, and the individual estimates are correspondingly noisy.

Mixing distribution families (as in `vignette("custom-components")`) sidesteps this fragility entirely — the qualitatively different hazard shapes pin down each component's parameters through temporal signal, even when masking is heavy.

## Summary

- The `omega` column is the single switch between contribution formulas
- Exact and right contributions are closed-form; left and interval use `integrate()`
- Candidate sets encode whatever diagnostic information narrows the cause
- The numerical likelihood agrees with the `maskedcauses` closed form to within the integrator's tolerance
- Exponential-only systems have identifiability limitations — mixed families don't