---
title: "Migrating from SAS HAZARD to TemporalHazard"
format: html
vignette: >
  %\VignetteIndexEntry{Migrating from SAS HAZARD to TemporalHazard}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: false
has_pkg <- requireNamespace("TemporalHazard", quietly = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = has_pkg
)
```

```{r}
#| label: setup
library(TemporalHazard)
```

## Overview

If you've been running multiphase hazard analyses in SAS HAZARD —
the macro suite that wraps the C HAZARD binary written by Blackstone,
Naftel, and Turner at UAB — this vignette is the bridge to running
the same analyses in R. Every SAS `HAZARD` statement, every common
macro option, and every output field maps to something in
`TemporalHazard`; this document spells out the correspondences and
flags the small handful of features the R port doesn't yet cover.

The mapping is faithful, not literal. SAS HAZARD is a step-driven
DSL: you write a `PROC HAZARD` block with separate `TIME`, `EVENT`,
`PARMS`, `EARLY`, `CONSTANT`, and `SELECTION` statements, and the
macro assembles them into a binary call. R is function-driven: you
write a single `hazard()` call with named arguments. The conceptual
pieces are identical; the syntactic ergonomics differ. Use this
vignette to translate an existing SAS analysis, or as a reference
when comparing the package's output against a SAS HAZARD reference
fit for parity testing.

The full formal argument mapping table is available programmatically
and ships with the package — handy when you want to grep for a
specific SAS parameter name and find its R equivalent without
scrolling through the prose below:

```{r}
#| label: mapping-table
knitr::kable(
  TemporalHazard::hzr_argument_mapping(),
  caption = "Formal argument map: SAS HAZARD/C → hazard()",
  col.names = c(
    "SAS Statement", "Legacy Input", "C Concept",
    "R Parameter", "Required", "Expected Type",
    "Transform Rule", "Status", "Notes"
  )
)
```

---

## Statement-by-statement mapping

A SAS HAZARD analysis is a stack of statements inside a `PROC HAZARD`
block. We walk through them in roughly the order they appear in a
typical analysis script, showing the SAS form and the corresponding
R call.

### `PROC HAZARD DATA=`

The procedure-level statement that names the dataset and sets
global options. The `NOCOV` / `NOCOR` flags suppress covariance and
correlation output; `CONDITION=` is a tolerance switch for the
convergence check.

```sas
PROC HAZARD DATA=AVCS NOCOV NOCOR CONDITION=14;
```

Maps to `hazard()` `control` list:

```{r}
#| eval: false
fit <- hazard(
  ...,
  control = list(
    nocov      = TRUE,   # suppress covariance output
    nocor      = TRUE,   # suppress correlation output
    condition  = 14      # CONDITION= switch
  )
)
```

Additional `PROC HAZARD` options with no direct R equivalent yet are passed
through `...` as named arguments and stored in `fit$legacy_args`.

---

### `TIME`

Names the follow-up time variable. In SAS HAZARD this is a separate
statement; in R it's the first argument to `Surv()` inside the
formula.

```sas
TIME INT_DEAD;
```

Maps directly to `time`:

```{r}
#| eval: false
fit <- hazard(
  time   = avcs$INT_DEAD,
  ...
)
```

- Must be a non-negative numeric vector in the same time unit as the original
  analysis (months, in the AVC example).
- Missing values are not permitted.

---

### `EVENT`

Names the event-indicator variable. Like `TIME`, this is a separate
statement in SAS HAZARD but enters the R formula through `Surv()`.

```sas
EVENT DEAD;
```

Maps to `status`:

```{r}
#| eval: false
fit <- hazard(
  status = avcs$DEAD,   # 1 = event, 0 = censored
  ...
)
```

- `TemporalHazard` coerces `status` to `numeric`; logical vectors are also accepted.
- The HAZARD convention uses `1` = occurred, `0` = censored/alive at last contact.

---

### `PARMS`

Supplies starting values for the optimizer and flags which
parameters to hold fixed (`FIXM`, `FIXMU`, etc.). The starting
values matter — for the multiphase optimizer in particular, a poor
starting point can park the fit at a local minimum well away from
the global MLE.

```sas
PARMS MUE=0.3504743 THALF=0.1905077 NU=1.437416 M=1 FIXM
      MUC=4.391673E-07;
```

Maps to `theta` (coefficient/parameter vector) and `control` (fix flags):

```{r}
#| eval: false
fit <- hazard(
  theta   = c(MUE = 0.3504743, THALF = 0.1905077, NU = 1.437416,
              M = 1,           MUC   = 4.391673e-07),
  control = list(
    fix = c("M")   # FIXM → freeze M during optimization
  ),
  ...
)
```

> **Note:** Full SAS `PARMS` syntax is not mirrored one-for-one in the public API.
> In the current package, supply the parameter vector directly via `theta` and
> record fixed-parameter intent in `control$fix`. The legacy parity helpers do
> generate SAS-style control text when comparing against the historical binaries.

---

### `EARLY` and `CONSTANT` covariate blocks

These statements assign covariates to specific phases. In SAS HAZARD
each block lists the covariates that affect that phase and their
starting coefficients. Covariates can appear in multiple blocks with
different starting values — same column, different phase-specific
effect.

```sas
EARLY  AGE=-0.03205774, COM_IV=1.336675, MAL=0.6872028,
       OPMOS=-0.01963377, OP_AGE=0.0002086689, STATUS=0.5169533;
CONSTANT INC_SURG=1.375285, ORIFICE=3.11765, STATUS=1.054988;
```

In HAZARD, `EARLY` and `CONSTANT` define phase-specific covariate coefficients.
In `TemporalHazard` these are unified into a single design matrix `x` and
coefficient vector `theta` during M1. Phase assignment will be formalised in
M2 when the multi-phase likelihood is implemented.

**Current convention (M1):** combine all covariates into `x` and supply the
corresponding starting coefficients in `theta`.

```{r}
#| eval: false
# Build design matrix from the AVC data set
X <- data.matrix(avcs[, c("AGE", "COM_IV", "MAL", "OPMOS", "OP_AGE",
                           "STATUS", "INC_SURG", "ORIFICE")])

# Starting values from SAS EARLY + CONSTANT blocks combined
theta_start <- c(
  AGE      = -0.03205774,
  COM_IV   =  1.336675,
  MAL      =  0.6872028,
  OPMOS    = -0.01963377,
  OP_AGE   =  0.0002086689,
  STATUS   =  0.5169533,   # EARLY phase coefficient
  INC_SURG =  1.375285,
  ORIFICE  =  3.11765
)

fit <- hazard(
  time   = avcs$INT_DEAD,
  status = avcs$DEAD,
  x      = X,
  theta  = theta_start,
  dist   = "weibull"
)
```

---

### `SELECTION`

Triggers stepwise covariate selection inside the hazard fit. `SLE`
is the significance level for entry, `SLS` for staying. SAS HAZARD
embedded this in the same `PROC HAZARD` call; in R the stepwise
loop is a separate `hzr_stepwise()` function that wraps a fitted
`hazard()` object.

```sas
SELECTION SLE=0.2 SLS=0.1;
```

`TemporalHazard` now ships `hzr_stepwise()`, which implements forward,
backward, and two-way stepwise selection with SAS-style SLENTRY /
SLSTAY thresholds, phase-specific entry for multiphase models, and a
MOVE oscillation guard.  FAST-screening (Lawless-Singhal approximate
Wald updates) is not yet implemented; every candidate currently gets a
full refit.  See `?hzr_stepwise` for the full option list.

---

## SAS macro equivalents

The legacy SAS distribution ships a suite of macros for nonparametric
baselines, goodness-of-fit diagnostics, bootstrap inference, and
variable calibration that sit alongside `PROC HAZARD`.  Each has an R
equivalent in `TemporalHazard`:

| SAS macro | R function | Purpose |
|:---|:---|:---|
| `kaplan.sas` | `hzr_kaplan()` | KM survival with logit-transformed exact CL |
| `nelsonl.sas` / `nelsont.sas` | `hzr_nelson()` | Nelson cumulative hazard with lognormal CL |
| `hazplot.sas` | `hzr_gof()` | Parametric vs. KM overlay + CoE goodness-of-fit |
| `deciles.hazard.sas` | `hzr_deciles()` | Decile-of-risk calibration + chi-square GOF |
| `logit.sas` / `logitgr.sas` | `hzr_calibrate()` | Quantile-group calibration (logit / Gompertz / Cox link) |
| `bootstrap.hazard.sas` | `hzr_bootstrap()` | Bootstrap resampling + summary |
| `markov.sas` | `hzr_competing_risks()` | Aalen-Johansen competing-risks incidence |

Minimal illustrations on the AVC dataset follow.  The
`vignette("inference-diagnostics")` walkthrough builds these into a
full analysis workflow.

```{r}
#| label: macro-demos-setup
library(survival)
data(avc)
avc <- na.omit(avc)

# Base parametric fit for downstream diagnostics
fit <- hazard(
  Surv(int_dead, dead) ~ age + status + mal + com_iv,
  data = avc, dist = "weibull",
  theta = c(mu = 0.2, nu = 1, rep(0, 4)),
  fit = TRUE, control = list(maxit = 500)
)
```

```{r}
#| label: kaplan-demo
km <- hzr_kaplan(time = avc$int_dead, status = avc$dead)
head(km, 4)
```

```{r}
#| label: nelson-demo
nel <- hzr_nelson(time = avc$int_dead, event = avc$dead)
head(nel, 4)
```

```{r}
#| label: gof-demo
head(hzr_gof(fit), 4)
```

```{r}
#| label: deciles-demo
hzr_deciles(fit, time = 120, groups = 10)
```

```{r}
#| label: calibrate-demo
hzr_calibrate(avc$age, avc$dead, groups = 10, link = "logit")
```

```{r}
#| label: bootstrap-demo
set.seed(1)
hzr_bootstrap(fit, n_boot = 20)  # small for vignette build
```

```{r}
#| label: competing-demo
data(valves)
# Synthesize a competing-risks indicator for illustration.
valves$ev <- with(valves, ifelse(dead == 1, 1L,
                                   ifelse(pve == 1,  2L,
                                   ifelse(reop == 1, 3L, 0L))))
head(hzr_competing_risks(valves$int_dead, valves$ev), 4)
```

---

## Full worked example: AVC death after repair

Statement-by-statement mapping is fine for reference, but the full
gestalt only lands when you see a complete SAS HAZARD analysis and
its R translation side by side. The example below is the final
multivariable model from `examples/hm.death.AVC.sas` in the
reference C repository — death after atrioventricular canal repair,
two-phase model with covariates assigned to specific phases. It's
the canonical "this is what a real SAS HAZARD analysis looks like"
specimen.

### SAS (original)

The original SAS code, exactly as it appears in the reference
distribution. Notice how the statements stack inside a single
`%HAZARD()` macro call.

```sas
%HAZARD(
PROC HAZARD DATA=AVCS P CONSERVE OUTHAZ=OUTEST CONDITION=14 QUASI;
     TIME INT_DEAD;
     EVENT DEAD;
     PARMS MUE=0.3504743 THALF=0.1905077 NU=1.437416 M=1 FIXM
           MUC=4.391673E-07;
     EARLY   AGE=-0.03205774, COM_IV=1.336675,  MAL=0.6872028,
             OPMOS=-0.01963377, OP_AGE=0.0002086689, STATUS=0.5169533;
     CONSTANT INC_SURG=1.375285, ORIFICE=3.11765, STATUS=1.054988;
);
```

### R equivalent (current runnable translation pattern)

The same model in `TemporalHazard`. Every SAS statement above has a
corresponding R argument here: `TIME` and `EVENT` collapse into
`Surv(int_dead, dead)` on the left of the formula; the global
covariate list goes on the right; the `PARMS` starting values
become the `theta` vector; phase-specific assignments use the
`formula` argument inside each `hzr_phase()`; `FIXM` becomes
`fixed = "shapes"` (or a subset of shape names). The line-by-line
correspondence is the point — once you've translated one analysis
this way, the pattern carries to every other.

```{r}
#| label: avc-example
#| eval: false
# Assumed: avcs is a data.frame read from the AVC flat file
# (same variables as the SAS DATA step)

avcs <- avcs |>
  transform(
    LN_AGE   = log(AGE),
    LN_OPMOS = log(OPMOS),
    LN_INC   = ifelse(is.na(INC_SURG), NA, log(INC_SURG + 1)),
    LN_NYHA  = log(STATUS)
  )

# Replace missing INC_SURG with column mean (mirrors PROC STANDARD REPLACE)
avcs$INC_SURG[is.na(avcs$INC_SURG)] <- mean(avcs$INC_SURG, na.rm = TRUE)

X <- data.matrix(avcs[, c("AGE", "COM_IV", "MAL", "OPMOS", "OP_AGE",
                           "STATUS", "INC_SURG", "ORIFICE")])

fit <- hazard(
  time    = avcs$INT_DEAD,
  status  = avcs$DEAD,
  x       = X,
  theta   = c(
    # Hazard shape parameters (from PARMS)
    MUE   = 0.3504743,
    THALF = 0.1905077,
    NU    = 1.437416,
    M     = 1,
    MUC   = 4.391673e-07,
    # Covariate coefficients (from EARLY + CONSTANT blocks)
    AGE      = -0.03205774,
    COM_IV   =  1.336675,
    MAL      =  0.6872028,
    OPMOS    = -0.01963377,
    OP_AGE   =  0.0002086689,
    STATUS   =  0.5169533,
    INC_SURG =  1.375285,
    ORIFICE  =  3.11765
  ),
  dist    = "weibull",
  control = list(
    condition = 14,
    quasi     = TRUE,
    conserve  = TRUE,
    fix       = c("M")   # FIXM
  )
)

fit
```

---

## Data preparation differences

| SAS HAZARD                             | TemporalHazard R                                     |
|----------------------------------------|---------------------------------------------------|
| `PROC STANDARD REPLACE` for missing    | Replace with `mean(..., na.rm = TRUE)` manually  |
| Log transforms in `DATA` step          | `transform()` or `dplyr::mutate()`                |
| Variable labels via `LABEL`            | Column names of `x` matrix carry through          |
| `FILENAME` / `INFILE` / `INPUT`       | `read.table()`, `read.csv()`, or `haven::read_sas()` |
| SAS formats (date, currency, etc.)     | Standard R numeric; apply `as.Date()` if needed  |

---

## Output object

`hazard()` returns a list of class `hazard`:

| Slot                  | Contents                                                     |
|-----------------------|--------------------------------------------------------------|
| `$call`               | Unevaluated `match.call()` for reproducibility              |
| `$spec$dist`          | Baseline distribution label                                 |
| `$spec$control`       | Control options passed at construction                      |
| `$data$time`          | Follow-up time vector                                       |
| `$data$status`        | Event indicator (numeric)                                   |
| `$data$x`             | Design matrix                                               |
| `$fit$theta`          | Coefficient vector (starting values at M1; fitted at M2+)  |
| `$fit$converged`      | `NA` at M1; `TRUE`/`FALSE` from M2 optimizer               |
| `$fit$objective`      | Log-likelihood at convergence (M2+)                        |
| `$legacy_args`        | Named pass-through arguments for parity                     |

---

## Prediction

In SAS HAZARD the `P` (predict / print) option on the `PROC HAZARD`
line writes predicted survival, hazard, and cumulative-hazard tables
to the output dataset. In R the same predictions come from
`predict.hazard()` on the fitted object, with the requested quantity
chosen via the `type=` argument. `predict.hazard()` currently
supports four output types — `"linear_predictor"`, `"hazard"`,
`"survival"`, and `"cumulative_hazard"` — covering every quantity
the SAS `P` option produces:

```{r}
#| eval: false
#| label: predict-example
# Linear predictor (X %*% theta)
eta <- predict(fit, type = "linear_predictor")

# Hazard scale (exp(eta))
hz  <- predict(fit, type = "hazard")
```

The code chunk above shows only `"linear_predictor"` and `"hazard"`;
`"survival"` and `"cumulative_hazard"` follow the same pattern. For
multiphase fits pass `decompose = TRUE` to get per-phase cumulative
hazard contributions instead of just the total.

---

## Known limitations vs. SAS HAZARD

A SAS HAZARD veteran migrating to `TemporalHazard` should be aware of
the following scope limits as of v0.9.8.  Detailed status per feature is
tracked in `inst/dev/SAS-PARITY-GAP-ANALYSIS.md` and
`inst/dev/DEVELOPMENT-PLAN.md`.

### Stepwise variable selection (`SELECTION` statement)

- **Supported:** `hzr_stepwise()` implements forward / backward /
  two-way stepwise with SAS-style SLENTRY / SLSTAY thresholds,
  per-phase entry for multiphase, and a MOVE oscillation guard.
- **Not yet supported:** FAST-screening (Lawless-Singhal approximate
  Wald updates).  Every candidate currently gets a full refit, which
  is slower but always correct.

### Per-phase time-varying windows

- **Supported:** `time_windows` applies one piecewise-constant cut-point
  set across all covariates.
- **Not yet supported:** distinct `EARLY`/`LATE` window sets per phase.
  Workaround: expand the design matrix manually before calling
  `hazard()`.

### Output datasets (`OUTEST=`, `OUTVCOV=`)

- The R equivalents are `coef(fit)` and `vcov(fit)` in memory; there
  is no automatic dataset-export mode.  `saveRDS()` them yourself if
  you need an on-disk artefact.

### Density / quantile prediction types

- `predict.hazard()` covers `hazard`, `cumulative_hazard`, `survival`,
  and `linear_predictor`.  Density and quantile (median survival)
  types are not wired up; derive them from `cumulative_hazard` and
  `survival` predictions if needed.

### Previously listed gaps that have since been closed

For users migrating from older TemporalHazard versions or reading
older SAS parity notes:

- **`weights` on all distributions** — shipped v0.9.6.  Weibull,
  exponential, log-logistic, log-normal, and multiphase all honour
  observation weights end-to-end (LL + analytic gradient + multiphase
  Conservation of Events).
- **`Surv(start, stop, event)` with `start > 0`** — shipped v0.9.7.
  Counting-process rows contribute `H(stop) - H(start)` for Weibull
  and multiphase.  The previous `hazard()` guard against non-zero
  starts is gone.
- **Delta-method prediction confidence limits** — shipped v0.9.8.
  Use `predict(..., se.fit = TRUE, level = 0.95)` to get a data frame
  with `fit`, `se.fit`, `lower`, and `upper` per row.  Closed-form
  Jacobian for Weibull and multiphase, `numDeriv::jacobian` fallback
  for exponential / log-logistic / log-normal.

---

## Rcpp acceleration note

`TemporalHazard` is currently pure R. If profiling against large real datasets
turns up bottlenecks in the likelihood kernel or the optimizer inner loop, those
specific functions will be re-implemented with Rcpp. The public interface
(`hazard()`, `predict.hazard()`, etc.) will not change.


## References {.unnumbered}

Blackstone EH, Naftel DC, Turner ME Jr. The decomposition of time-varying
hazard into phases, each incorporating a separate stream of concomitant
information. *J Am Stat Assoc.* 1986;81(395):615–624.
doi: [10.1080/01621459.1986.10478314](https://doi.org/10.1080/01621459.1986.10478314)