--- title: "Fitting Series Systems to Data" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Fitting Series Systems to Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` ## Introduction In practice, we observe system-level failure data: the time at which a system failed (or was censored), but often *not* which component caused the failure. The `serieshaz` package fits series system models to such data via maximum likelihood estimation (MLE). ## Data Format The fitting functions expect a data frame with two columns: | Column | Type | Meaning | |--------|------|---------| | `t` | numeric | Observed time | | `delta` | integer | Censoring indicator | The censoring indicator values are: - `1` — exact observation (the system failed at time `t`) - `0` — right-censored (the system was still running at time `t`) - `-1` — left-censored (the system had already failed by time `t`) ## Basic Fitting Workflow ### Step 1: Define the System Model ```{r define-model} library(serieshaz) # Hypothesize: system has two failure modes # - Wear-out (Weibull) # - Random shock (exponential) model <- dfr_dist_series(list( dfr_weibull(shape = 2, scale = 100), dfr_exponential(0.01) )) ``` ### Step 2: Simulate Training Data In practice you'd have real field data. Here we simulate from known parameters to demonstrate the workflow. 
```{r simulate-data} set.seed(42) true_par <- c(2, 100, 0.01) # shape, scale, rate samp <- sampler(model) n <- 300 times <- samp(n, par = true_par) train <- data.frame(t = times, delta = rep(1L, n)) summary(train$t) ``` ### Step 3: Fit the Model ```{r fit-model} solver <- fit(model) # Provide initial parameter guesses init_par <- c(1.5, 80, 0.02) result <- solver(train, par = init_par) ``` ### Step 4: Extract Results ```{r extract-results} # Fitted parameters cat("True parameters:", true_par, "\n") cat("Fitted parameters:", coef(result), "\n") # Log-likelihood at MLE cat("Log-likelihood:", logLik(result), "\n") # Variance-covariance matrix vcov(result) ``` ## Initial Parameter Selection The choice of initial parameters can affect convergence. Strategies include: 1. **Domain knowledge**: Use engineering estimates or handbook values 2. **Simpler model first**: Fit a single exponential to get a rough rate, then use that as a starting point 3. **Multiple starts**: Try several initial values and keep the best fit ```{r multi-start} # Strategy: try a few initial guesses, keep the best init_guesses <- list( c(1.5, 80, 0.02), c(2.5, 120, 0.005), c(1.0, 50, 0.05) ) best_ll <- -Inf best_result <- NULL for (init in init_guesses) { res <- tryCatch(solver(train, par = init), error = function(e) NULL) if (!is.null(res) && logLik(res) > best_ll) { best_ll <- logLik(res) best_result <- res } } cat("Best log-likelihood:", best_ll, "\n") cat("Best parameters:", coef(best_result), "\n") ``` ## Identifiability ### Exponential Series: Non-Identifiable When all components are exponential, only the **sum of rates** is identifiable from system-level data. The system hazard $h_{sys}(t) = \sum_j \lambda_j$ is constant, so any partition of rates that sums to the same total produces the same likelihood. 
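The non-identifiability claim is easy to check directly in base R before any fitting: two different rate vectors with the same sum induce exactly the same system survival curve (the rates below are illustrative).

```{r identifiability-check}
# Two different rate partitions with the same total hazard (0.6)
rates_a <- c(0.1, 0.2, 0.3)
rates_b <- c(0.2, 0.2, 0.2)

# Series system of exponentials:
# S(t) = prod_j exp(-lambda_j * t) = exp(-sum(lambda) * t)
surv_series <- function(rates, t) exp(-sum(rates) * t)

t_grid <- seq(0, 50, by = 5)
all.equal(surv_series(rates_a, t_grid), surv_series(rates_b, t_grid))
```

Because the two partitions produce identical survival (and hence density) functions, no likelihood built from system-level failure times can distinguish them.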
```{r identifiability-exp} # True rates: 0.1, 0.2, 0.3 (sum = 0.6) true_rates <- c(0.1, 0.2, 0.3) exp_sys <- dfr_dist_series(lapply(true_rates, dfr_exponential)) set.seed(123) exp_samp <- sampler(exp_sys) exp_data <- data.frame(t = exp_samp(500), delta = rep(1L, 500)) exp_solver <- fit(exp_sys) exp_result <- exp_solver(exp_data, par = c(0.15, 0.25, 0.25)) cat("True sum of rates:", sum(true_rates), "\n") cat("Fitted sum of rates:", sum(coef(exp_result)), "\n") cat("Individual rates (unreliable):", coef(exp_result), "\n") ``` The sum is accurately estimated, but individual rates are not meaningful. ### Mixed Types: Identifiable When components have different distributional forms, the system is generally identifiable because each component shapes the hazard differently. ```{r identifiability-mixed} # Weibull (increasing hazard) + Exponential (constant hazard) mixed <- dfr_dist_series(list( dfr_weibull(shape = 2, scale = 100), dfr_exponential(0.01) )) set.seed(42) mixed_samp <- sampler(mixed) mixed_data <- data.frame(t = mixed_samp(500), delta = rep(1L, 500)) mixed_solver <- fit(mixed) mixed_result <- mixed_solver(mixed_data, par = c(1.5, 80, 0.02)) cat("True: shape=2.00, scale=100.0, rate=0.010\n") cat(sprintf("Fitted: shape=%.2f, scale=%.1f, rate=%.3f\n", coef(mixed_result)[1], coef(mixed_result)[2], coef(mixed_result)[3])) ``` ## Handling Censored Data In reliability studies, some units are removed from the test before failing (right-censoring). The MLE correctly accounts for this. 
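Concretely, each observation's contribution to the log-likelihood depends on its censoring indicator. With system density $f$, survival function $S$, and CDF $F$, the standard censored-data log-likelihood is

$$
\ell(\theta) = \sum_{i:\,\delta_i = 1} \log f(t_i; \theta)
             + \sum_{i:\,\delta_i = 0} \log S(t_i; \theta)
             + \sum_{i:\,\delta_i = -1} \log F(t_i; \theta).
$$

Exact failures contribute density terms, right-censored units contribute survival terms, and left-censored units contribute CDF terms.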
```{r censored-data}
set.seed(42)
model <- dfr_dist_series(list(
  dfr_weibull(shape = 2, scale = 100),
  dfr_exponential(0.01)
))
true_par <- c(2, 100, 0.01)

# Generate true failure times
n <- 500
true_times <- sampler(model)(n, par = true_par)

# Right-censor at a fixed time horizon
censor_time <- 80
observed_t <- pmin(true_times, censor_time)
delta <- as.integer(true_times <= censor_time)
cens_data <- data.frame(t = observed_t, delta = delta)
cat("Proportion censored:", round(1 - mean(delta), 3), "\n")

# Fit with the censored data, and with the full (uncensored) sample
# from the same simulation for a fair comparison
solver <- fit(model)
cens_result <- solver(cens_data, par = c(1.5, 80, 0.02))
uncens_data <- data.frame(t = true_times, delta = rep(1L, n))
uncens_result <- solver(uncens_data, par = c(1.5, 80, 0.02))

cat("True parameters:    ", true_par, "\n")
cat("Fitted (censored):  ", round(coef(cens_result), 4), "\n")
cat("Fitted (uncensored):", round(coef(uncens_result), 4), "\n")
```

Both the censored and uncensored fits recover the parameters to within sampling error. With finite samples, the MLE is a random variable: different random seeds and censoring patterns can shift estimates in either direction. The censored fit uses less information (failure times beyond $t = 80$ are only known to exceed the censoring time), but the MLE accounts for this correctly through the censored likelihood contribution $S(t_i)$. In general, heavier censoring inflates standard errors but does not introduce systematic bias.
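Left-censored observations (`delta = -1`) use the same data format. A base-R sketch, in which units already failed by a first inspection are recorded as left-censored at that time (the inspection time and distribution here are illustrative, not from the package):

```{r left-censoring}
set.seed(7)
n <- 10
true_times <- rweibull(n, shape = 2, scale = 100)

# Units that failed before the first inspection are only known to have
# failed by then: left-censored at the inspection time. For simplicity,
# all other units are treated as exactly observed.
first_inspection <- 40
delta <- ifelse(true_times <= first_inspection, -1L, 1L)
t_obs <- ifelse(delta == -1L, first_inspection, true_times)

left_data <- data.frame(t = t_obs, delta = delta)
left_data
```

A data frame built this way can be passed to the solver in the same manner as the right-censored example above.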
## Model Diagnostics

### Log-Likelihood Comparison

Compare competing models using log-likelihood, AIC, or BIC:

```{r model-comparison}
# Model 1: Weibull + Exponential (3 parameters)
m1 <- dfr_dist_series(list(
  dfr_weibull(shape = 2, scale = 100),
  dfr_exponential(0.01)
))

# Model 2: Two Weibulls (4 parameters)
m2 <- dfr_dist_series(list(
  dfr_weibull(shape = 2, scale = 100),
  dfr_weibull(shape = 1.5, scale = 200)
))

# Generate data from Model 1
set.seed(42)
n <- 300
data_m1 <- data.frame(
  t = sampler(m1)(n, par = c(2, 100, 0.01)),
  delta = rep(1L, n)
)

r1 <- fit(m1)(data_m1, par = c(1.5, 80, 0.02))
r2 <- fit(m2)(data_m1, par = c(1.5, 80, 1.2, 150))

k1 <- length(coef(r1))
k2 <- length(coef(r2))
ll1 <- logLik(r1)
ll2 <- logLik(r2)
aic1 <- -2 * ll1 + 2 * k1
aic2 <- -2 * ll2 + 2 * k2
bic1 <- -2 * ll1 + k1 * log(n)
bic2 <- -2 * ll2 + k2 * log(n)

cat(sprintf("Model 1 (Weibull+Exp):     LL=%.2f, k=%d, AIC=%.2f, BIC=%.2f\n",
            ll1, k1, aic1, bic1))
cat(sprintf("Model 2 (Weibull+Weibull): LL=%.2f, k=%d, AIC=%.2f, BIC=%.2f\n",
            ll2, k2, aic2, bic2))
cat("Preferred model (lower AIC):",
    if (aic1 < aic2) "Model 1" else "Model 2", "\n")
```

### Model Assumptions

Review what assumptions your model makes:

```{r assumptions}
assumptions(m1)
```

## Confidence Intervals

### Wald Confidence Intervals

Use the variance-covariance matrix from `vcov()` to construct Wald-type confidence intervals:

```{r wald-ci}
result <- fit(m1)(data_m1, par = c(1.5, 80, 0.02))
est <- coef(result)
se <- sqrt(diag(vcov(result)))

alpha <- 0.05
z <- qnorm(1 - alpha / 2)
ci_lower <- est - z * se
ci_upper <- est + z * se

ci_table <- data.frame(
  Parameter = c("shape", "scale", "rate"),
  Estimate = est,
  SE = se,
  Lower = ci_lower,
  Upper = ci_upper
)
print(ci_table, digits = 4)
```

### Delta Method for MTBF

The mean time between failures (MTBF) is a function of the model parameters. The delta method provides approximate confidence intervals for such derived quantities.
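In symbols: if $g(\theta)$ is a smooth function of the parameters, $\hat\theta$ is the MLE, and $V$ is its estimated covariance matrix, then

$$
\widehat{\operatorname{Var}}\big(g(\hat\theta)\big) \approx \nabla g(\hat\theta)^{\top}\, V\, \nabla g(\hat\theta),
$$

where the gradient $\nabla g$ can be computed analytically or, as below, by finite differences.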
```{r delta-method} # For this system, MTBF is not available in closed form, but we can # approximate it numerically and use the delta method par_hat <- coef(result) V <- vcov(result) # Numerical MTBF estimate via integration of survival function S_fn <- surv(m1) # Wrap S_fn for integrate() which may pass vectorized t values compute_mtbf <- function(p) { integrate(function(t) vapply(t, function(ti) S_fn(ti, par = p), numeric(1)), 0, Inf, subdivisions = 1000L)$value } mtbf <- compute_mtbf(par_hat) cat("Estimated MTBF:", round(mtbf, 2), "\n") # Delta method: gradient of MTBF w.r.t. parameters eps <- 1e-5 grad_mtbf <- numeric(length(par_hat)) for (k in seq_along(par_hat)) { par_plus <- par_minus <- par_hat par_plus[k] <- par_hat[k] + eps par_minus[k] <- par_hat[k] - eps grad_mtbf[k] <- (compute_mtbf(par_plus) - compute_mtbf(par_minus)) / (2 * eps) } se_mtbf <- sqrt(t(grad_mtbf) %*% V %*% grad_mtbf) cat(sprintf("MTBF = %.2f +/- %.2f (95%% CI: [%.2f, %.2f])\n", mtbf, 1.96 * se_mtbf, mtbf - 1.96 * se_mtbf, mtbf + 1.96 * se_mtbf)) ```
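As a sanity check on the numerical delta method, consider a case with a closed form: a single exponential component, where MTBF $= 1/\lambda$ and everything can be done by hand in base R (this worked example is independent of the package API).

```{r delta-method-check}
# For X ~ Exp(lambda), the MLE is lambda_hat = 1/mean(x), with asymptotic
# variance lambda^2 / n. Since |d(1/lambda)/d(lambda)| = 1/lambda^2, the
# delta method gives se(MTBF_hat) = se(lambda_hat) / lambda_hat^2,
# which should agree with the direct standard error of the sample mean.
set.seed(1)
n <- 2000
t_obs <- rexp(n, rate = 0.01)

lambda_hat <- 1 / mean(t_obs)
se_lambda  <- lambda_hat / sqrt(n)            # from Fisher information
se_mtbf_delta  <- se_lambda / lambda_hat^2    # delta method
se_mtbf_direct <- sd(t_obs) / sqrt(n)         # direct SE of the mean

c(delta = se_mtbf_delta, direct = se_mtbf_direct)
```

The two standard errors agree closely, which is what the delta-method approximation predicts for large samples.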