--- title: "maskedhaz: Masked-Cause Likelihood for General Series Systems" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{maskedhaz: Masked-Cause Likelihood for General Series Systems} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` ## The Problem A series system with $m$ components fails when *any* component fails. We observe the system failure time $T_i = \min(T_{i1}, \ldots, T_{im})$, but the component that caused the failure is often **masked** — diagnostics only narrow it to a set of candidates $C_i \subseteq \{1, \ldots, m\}$. Given observations $\{(t_i, C_i)\}$ from $n$ systems, we want to recover the component-level parameters $\theta = (\theta_1, \ldots, \theta_m)$ from system-level data. This is the **masked-cause likelihood problem**, and under masking conditions C1, C2, and C3 it has a clean likelihood form: $$ L(\theta) = \prod_{i=1}^{n} S_{\text{sys}}(t_i; \theta) \cdot \sum_{j \in C_i} h_j(t_i; \theta_j) $$ where $h_j$ is component $j$'s hazard and $S_{\text{sys}}(t) = \exp(-\sum_j H_j(t))$ is the system survival. ## The `series_md` Protocol and This Implementation The masked-cause likelihood problem has a protocol package: [`maskedcauses`](https://github.com/queelius/maskedcauses). It specialises the general [`likelihood.model`](https://github.com/queelius/likelihood.model) framework for the masked-series-system setting — adding the `series_md` S3 class (which inherits from `likelihood_model`), the C1–C2–C3 data schema, and domain-specific generics such as `conditional_cause_probability`, `cause_probability`, `component_hazard`, and `ncomponents`. It also ships reference implementations for exponential and Weibull component families as illustrations of the protocol in use; other implementations can be written against the same generics. 
`maskedhaz` is such an implementation, for **arbitrary** component hazard functions. The model class `dfr_series_md` inherits from `series_md` — and therefore also from `likelihood_model` — and wraps a `dfr_dist_series` from [`serieshaz`](https://github.com/queelius/serieshaz), which in turn is built from `dfr_dist` components in [`flexhaz`](https://github.com/queelius/flexhaz). Components can be anything: Gompertz aging, log-logistic mid-life peaks, user-defined bathtub curves, or a heterogeneous mixture of all of the above. Because `dfr_series_md` implements the `likelihood_model` / `series_md` generics, every workflow written against those protocols — loglik evaluation, MLE fitting, diagnostics, simulation, cross-package comparison — works without modification. The implementation choices that distinguish `maskedhaz` from the closed-form reference implementations: - **Log-likelihood**: closed-form for exact and right-censored rows; numerical integration (`stats::integrate`) for left and interval censoring - **Score and Hessian**: numerical differentiation via `numDeriv` - **No family restrictions**: any `dfr_dist` component plugs in — no closed-form assumption anywhere Fitting a model with `fit(model)(df, par)` returns a `fisher_mle` object from `likelihood.model`, which realises the `mle_fit` concept from [`algebraic.mle`](https://github.com/queelius/algebraic.mle). That's what lets the downstream diagnostics — standard errors, bias, FIM, delta-method confidence intervals, MSE — work uniformly across maskedhaz, maskedcauses, and any other likelihood implementation in this ecosystem. 
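Before the tour, the masked-cause likelihood introduced above is simple enough to sketch by hand for the closed-form case. The chunk below is a minimal base-R illustration for exact observations with exponential components (hazard $h_j(t) = \lambda_j$, so $H_j(t) = \lambda_j t$ and $S_{\text{sys}}(t) = e^{-t \sum_j \lambda_j}$). The helper name is hypothetical — this is a sketch of the formula, not the `maskedhaz` implementation, which supports arbitrary hazards and censoring:

```{r masked-loglik-sketch}
# Masked-cause log-likelihood for exact observations, exponential components.
# `t` is a vector of system failure times; `C` is an n x m logical matrix
# whose i-th row encodes the candidate set C_i. (Hypothetical helper.)
loglik_masked_exp <- function(t, C, lambda) {
  log_S_sys  <- -t * sum(lambda)          # log S_sys(t_i) = -sum_j H_j(t_i)
  hazard_sum <- as.vector(C %*% lambda)   # sum of h_j(t_i) over j in C_i
  sum(log_S_sys + log(hazard_sum))
}

# Two systems, three components
t      <- c(10, 25)
C      <- rbind(c(TRUE, FALSE, TRUE),    # cause narrowed to {1, 3}
                c(FALSE, TRUE, FALSE))   # cause fully known: component 2
lambda <- c(0.01, 0.02, 0.005)
loglik_masked_exp(t, C, lambda)
```

Everything `maskedhaz` adds is generalisation of this ten-line idea: arbitrary (non-constant) hazards in place of $\lambda_j$, and integration over $H_j$ when no closed form exists.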
## Quick Tour ```{r quick-tour} library(maskedhaz) # A three-component series system with a mixed-distribution setup model <- dfr_series_md(components = list( dfr_weibull(shape = 2, scale = 100), # wear-out dfr_gompertz(a = 0.001, b = 0.02), # aging dfr_exponential(0.005) # random failure )) model ``` The model knows the parameter layout, the column conventions, and the full series system distribution. Every generic in the likelihood ecosystem dispatches through this one object. ### Step 1: generate data ```{r data} set.seed(42) rdata_fn <- rdata(model) df <- rdata_fn( theta = c(2, 100, 0.001, 0.02, 0.005), # matches the parameter layout n = 300, tau = 50, # right-censoring time p = 0.4 # masking probability for non-failed components ) head(df, 3) table(df$omega) ``` A few things to notice in the output. Rows 1 and 2 are right-censored (`omega = "right"`) at exactly `t = tau = 50` — that's the convention for right-censored rows: `t` is the censoring time, and the candidate columns `x1, x2, x3` are all `FALSE` because a right-censored observation carries no information about which component would have caused the (unobserved) failure. Row 3 is an exact failure at `t = 40.45` with all three candidate columns `TRUE` — a legitimate outcome under `p = 0.4` (each non-failing component is included in the candidate set with probability 0.4, so two of them being included has probability $0.4^2 = 0.16$). This is the maximum-masking extreme: we know the system failed and when, but the diagnostic narrowed the cause to "any of the three." The opposite extreme is a singleton candidate set, in which case the cause is fully known. Candidate sets in general are encoded as Boolean columns (`x1`, `x2`, ..., `xm`). The C1 condition guarantees the true failed component is always `TRUE`; the additional masking depends on the diagnostic procedure (under C1–C2 with uniform masking, `p` is the per-non-failed inclusion probability). 
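The candidate-set convention just described is easy to reproduce by hand. Here is a base-R sketch of the uniform masking mechanism for a single system — conceptually what `rdata()` does for exact failures. The helper is hypothetical, censoring is ignored, and (since base R has no Gompertz sampler) component 2 uses an exponential stand-in:

```{r masking-mechanism-sketch}
# One system: draw component lifetimes, take the minimum as the system
# failure, then build the candidate set: the true cause is always included
# (C1), and each surviving component is masked in with probability p.
simulate_masked_obs <- function(p = 0.4) {
  lifetimes <- c(rweibull(1, shape = 2, scale = 100),  # component 1: wear-out
                 rexp(1, rate = 0.005),                # stand-in for Gompertz
                 rexp(1, rate = 0.005))                # component 3: random
  cause <- which.min(lifetimes)       # the component that actually failed
  candidates <- runif(3) < p          # survivors masked in w.p. p (uniform)
  candidates[cause] <- TRUE           # C1: true cause always in the set
  list(t = min(lifetimes), cause = cause, x = candidates)
}

set.seed(42)
simulate_masked_obs()
```

Unlike real masked data, the sketch returns `cause` so you can check the C1 invariant directly; the data frame produced by `rdata()` carries only `t`, `omega`, and the `x1, ..., xm` columns.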
### Step 2: evaluate the log-likelihood ```{r loglik} ll_fn <- loglik(model) true_par <- c(2, 100, 0.001, 0.02, 0.005) ll_fn(df, par = true_par) ``` `loglik()` is the core generic from the `likelihood.model` package — `maskedhaz` just provides the method for the `dfr_series_md` class. The generic returns a *closure*, not a value: `loglik(model)` gives you a function of `(df, par)` that computes the log-likelihood for that model. This two-step pattern is shared by `score(model)`, `hess_loglik(model)`, `fit(model)`, and `rdata(model)`; once you learn it for one, you have it for all. The value here (`-790.5` on 300 observations, or roughly $-2.6$ nats per system) is an unremarkable log-likelihood magnitude — nothing about the number itself tells you whether the fit is good. What matters is comparing log-likelihoods *at different parameter values*: the MLE should give a larger loglik than any nearby point, and the curvature around the MLE feeds the asymptotic variance. ### Step 3: fit via MLE ```{r fit} solver <- fit(model) result <- solver(df, par = c(1.5, 120, 0.0005, 0.015, 0.01)) ``` `fit()` is the generic from `generics` (specialised by `likelihood.model`'s `fit.likelihood_model` and by `maskedhaz`'s `fit.dfr_series_md`). It returns a `fisher_mle` object. Let's put the estimate next to the truth: ```{r fit-compare} rbind(truth = true_par, estimate = round(coef(result), 4)) ``` Each parameter is recovered to within a fraction of its standard error — see below. Standard coefficient / variance methods all dispatch through the usual `stats` generics: ```{r mle-methods-basic} # 95% Wald confidence intervals — note the Gompertz rows (3 and 4) # cross zero. That's a known limitation of Wald intervals near # a boundary: the asymptotic normal approximation is too symmetric # for parameters bounded below by zero when the SE is comparable # to the estimate. 
confint(result) ``` ### Step 4: the fit is a distribution A `fisher_mle` also realises the `mle_fit` concept from [`algebraic.mle`](https://github.com/queelius/algebraic.mle), which in turn is an `algebraic.dist` distribution (under the standard MLE asymptotic normal approximation). That means you get a second layer of generics "for free" — they weren't implemented by `maskedhaz`, but they work because the class inheritance wires them up. Attach those two packages to bring their generics onto the search path: ```{r ecosystem-load} library(algebraic.mle) # se(), bias(), observed_fim(), mse(), ... library(algebraic.dist) # as_dist(), sampler(), expectation(), cdf(), ... ``` MLE-algebra generics work directly on the fit: ```{r algebraic-mle} se(result) # standard errors (from algebraic.mle) bias(result) # first-order asymptotic bias — zero for Fisherian MLEs observed_fim(result) # observed Fisher information — the negative Hessian ``` Because the MLE *is* a distribution (asymptotically multivariate Normal with mean $\theta_0$ and covariance $I(\theta_0)^{-1}$), the `algebraic.dist` generics also apply: ```{r algebraic-dist} # Convert to an explicit algebraic.dist object mle_dist <- as_dist(result) # from algebraic.dist class(mle_dist) # The expected value of the sampling distribution is the MLE itself expectation(mle_dist) # from algebraic.dist # Draw 5 samples from the MLE's sampling distribution set.seed(7) samp <- sampler(mle_dist) # from algebraic.dist round(samp(5), 4) ``` Every row above is a plausible "alternative MLE" under the asymptotic approximation. Note that some samples have *negative* values for the Gompertz parameters (columns 3 and 4) — that's not a bug, it's the MVN approximation being symmetric around the MLE when the true parameter space is $(0, \infty)$. 
For uncertainty propagation over nonlinear functions of the MLE this is usually fine (the probability mass is still concentrated in the feasible region); for simulation-based CIs of boundary-near parameters you'd want a log-reparametrisation or a profile likelihood instead. The punchline: none of this `algebraic.mle` / `algebraic.dist` machinery knows or cares that `result` came from a masked-cause series-system model. It just sees an `mle_fit`. That's what the layered design buys — each layer provides capabilities that compose uniformly with everything below it. ### Step 5: diagnostics Once fit, you can ask domain-specific cause-of-failure questions. `conditional_cause_probability` is a `maskedcauses` generic (part of the `series_md` protocol) — `maskedhaz` provides the method for `dfr_series_md`: ```{r diagnostics} ccp_fn <- conditional_cause_probability(model) ccp_fn(t = c(25, 50), par = coef(result)) ``` At $t = 25$, component 3 (the exponential) has the highest conditional probability ($\approx 0.46$) because the Weibull hazard $h_1(t) = 2t/100^2$ is still small at $t=25$ (about $0.005$) and the Gompertz component has a mild rate. By $t = 50$, the Weibull hazard has grown enough ($\approx 0.01$) that component 1 overtakes — its conditional probability rises to $\approx 0.52$. The Gompertz (component 2) is a minor, steady contributor throughout because its parameters ($a=0.001, b=0.02$) keep its hazard well below the other two on this time scale. The marginal version averages over the system failure time distribution: ```{r diagnostics-marginal} cp_fn <- cause_probability(model) set.seed(1) cp_fn(par = coef(result), n_mc = 5000) ``` Roughly: over the population of failed systems, $\approx 48\%$ are caused by the Weibull component, $\approx 34\%$ by the exponential, and $\approx 17\%$ by the Gompertz. 
The Weibull comes out ahead marginally even though it was second at $t=25$, because the system failure-time distribution has enough mass at later times (where Weibull dominates) to tip the average. ## The Ecosystem `maskedhaz` sits at the intersection of three stacks — a **protocol stack** defining what a masked-cause likelihood model is, a **component stack** providing the hazard functions inside a series system, and an **MLE result stack** defining what the fit object looks like. ### Protocol stack (model classes) ``` likelihood.model::likelihood_model ▲ │ (specialises) maskedcauses::series_md ▲ │ (specialises) maskedhaz::dfr_series_md ``` - **[`likelihood.model`](https://github.com/queelius/likelihood.model)** defines the general Fisherian likelihood framework: the `likelihood_model` class and the core generics (`loglik`, `score`, `hess_loglik`, `fim`, `rdata`, `fit`, `assumptions`). - **[`maskedcauses`](https://github.com/queelius/maskedcauses)** specialises that framework for masked-series-system problems: the `series_md` subclass, C1–C2–C3 validation, and domain generics (`conditional_cause_probability`, `cause_probability`, `component_hazard`, `ncomponents`). Its exp/Weibull likelihoods are reference implementations; other implementations plug into the same generics. - **`maskedhaz`** provides `dfr_series_md`, the `series_md` implementation for systems whose components are arbitrary `dfr_dist` objects. ### Component stack (what goes inside a series system) ``` flexhaz::dfr_dist (hazard-first component distributions) │ ▼ (composed into) serieshaz::dfr_dist_series (system = sum of component hazards) │ ▼ (wrapped by) maskedhaz::dfr_series_md (masked-cause likelihood over that system) ``` - **[`flexhaz`](https://github.com/queelius/flexhaz)** provides the `dfr_dist` abstraction: define $h(t; \theta)$ and get $S$, $F$, $f$, quantiles, and sampling automatically. 
- **[`serieshaz`](https://github.com/queelius/serieshaz)** composes `dfr_dist` components into a series system where $h_{\text{sys}}(t) = \sum_j h_j(t)$. ### MLE result stack ``` algebraic.dist::dist (probability distribution algebra) ▲ │ (an MLE IS a distribution, via asymptotic normality) algebraic.mle::mle_fit (MLE concept: coefs, vcov, CIs, delta-method) ▲ │ (realised by) likelihood.model::fisher_mle (Fisherian MLE with observed FIM, score at MLE) ``` A call to `fit(model)(df, par)` produces a `fisher_mle`. Because it inherits `mle_fit`, standard MLE diagnostics — `coef()`, `vcov()`, `confint()`, `se()`, `bias()`, `observed_fim()`, `mse()` — dispatch through the `algebraic.mle` algebra. Because `mle_fit` in turn realises [`algebraic.dist`](https://github.com/queelius/algebraic.dist) (an MLE is a sampling distribution, approximately multivariate Normal under standard theory), you can also `density()`, `cdf()`, `expectation()`, and sample from a fitted model — composing it with other distributions, propagating uncertainty, and so on. The same `fisher_mle` returned by `maskedhaz`, `maskedcauses`, or any other implementation plugs into all of this uniformly. ## Where to Go Next - **`vignette("custom-components", package = "maskedhaz")`** — a worked example with components that have no closed-form likelihood: Gompertz aging, log-logistic early failure, and how the numerical-integration path handles them transparently. - **`vignette("censoring-and-masking", package = "maskedhaz")`** — all four observation types (`exact`, `right`, `left`, `interval`) with realistic examples, plus a cross-validation check against `maskedcauses` for the exponential case. - **`?dfr_series_md`** — constructor reference, including the data-column naming conventions. 
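As a concrete example of what the MLE result stack enables, here is the delta method spelled out in base R: any `mle_fit` supplies a coefficient vector and covariance matrix via `coef()` and `vcov()`, which is all the method needs to propagate uncertainty through a scalar function of the parameters. The helper and the numbers below are hypothetical stand-ins, not output from the fit above:

```{r delta-method-sketch}
# Delta method: Var(g(theta_hat)) ~ grad(g)' V grad(g), using a simple
# forward-difference gradient. (Hypothetical helper, for illustration.)
delta_method_ci <- function(g, theta_hat, V, level = 0.95, eps = 1e-6) {
  grad <- vapply(seq_along(theta_hat), function(k) {
    th <- theta_hat
    th[k] <- th[k] + eps
    (g(th) - g(theta_hat)) / eps
  }, numeric(1))
  est <- g(theta_hat)
  se  <- sqrt(drop(t(grad) %*% V %*% grad))
  z   <- qnorm(1 - (1 - level) / 2)
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}

# Hypothetical example: a CI for an exponential rate when the model is
# parametrised on the log scale (theta = log(lambda)), so the back-transformed
# interval stays positive -- one remedy for the boundary issue noted earlier.
theta_hat <- log(0.005)
V <- matrix(0.04, 1, 1)  # hypothetical variance of the log-rate
delta_method_ci(function(th) exp(th[1]), theta_hat, V)
```

In practice you would pass `coef(result)` and `vcov(result)` rather than hand-built values, and `algebraic.mle` provides this machinery through its own generics; the sketch only shows what those generics compute.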
## Assumptions

`maskedhaz` inherits the classical C1–C2–C3 assumptions from the foundational masked-cause literature: the true failed component is always in the candidate set (C1), the probability of observing a given candidate set does not depend on which of its members actually failed (C2), and the masking mechanism does not depend on the model parameters (C3):

```{r assumptions}
assumptions(model)
```

These are reasonable when candidate sets come from diagnostic procedures that are blind to the identity of the failed component. If your data-generating process violates them (e.g., mechanics preferentially blame the cheapest component), you need relaxed-masking extensions — see [`mdrelax`](https://github.com/queelius/mdrelax).