---
title: "Empirical Regime Classification with KRONXnbc"
author: "Oscar Linares"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Empirical Regime Classification with KRONXnbc}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

# Overview

**KRONXnbc** implements a *Clock of Regimes* (COR) classifier: a Student-t
Naive Bayes model designed for non-stationary financial market data.  Three
market regimes are distinguished:

| Regime  | Economic intuition |
|---------|--------------------|
| **Calm**   | Low volatility, mean-reverting returns |
| **Steady** | Moderate drift, controlled drawdowns |
| **Stress** | Fat-tailed returns, deep drawdowns, elevated ruin probability |

The distinguishing engineering choice is a **profile grid search** over the
degrees-of-freedom parameter $\nu$ of the Student-t likelihood.  Rather than
fixing $\nu$ or solving a numerically fragile continuous optimisation, the
model evaluates a discrete grid
$\nu \in \{3, 4, \ldots, 30, 40, 60, 100\}$ for every (class, feature) pair
and selects the $\nu$ that maximises the profile log-likelihood.  This
prevents the $-\infty$ log-density underflow that collapses a standard
Gaussian NBC when a crisis observation falls in the far tail.

---

# Step 1 — Feature Engineering

The raw input is an hourly equity price series (e.g. E-mini S&P 500 futures,
`data.csv`) paired with a file of decoded HMM regime labels
(`decoded_states.csv`).  The `input2nbc.R` pipeline constructs six continuous
predictors over a 24-hour rolling window.

```{r feature-engineering, eval = FALSE}
library(zoo)

es_data <- read.csv("data.csv",          stringsAsFactors = FALSE)
decoded <- read.csv("decoded_states.csv", stringsAsFactors = FALSE)

es_data <- es_data[!is.na(es_data$ret), ]          # drop leading NA
stopifnot(nrow(es_data) == nrow(decoded))

n_roll <- 24L                                       # 24-hour window

cor_data <- data.frame(
  timestamp  = es_data$timestamp,
  log_return = es_data$ret
)
```

## Rolling Volatility

Standard deviation of log-returns over the window; floored at 0.0001 to
avoid zero-SD degeneracy on flat-market bars.

```{r feat-vol, eval = FALSE}
cor_data$rolling_volatility <- rollapply(
  es_data$ret, width = n_roll, FUN = sd, fill = NA, align = "right"
)
cor_data$rolling_volatility <- pmax(cor_data$rolling_volatility, 0.0001)
```

## Drawdown

Measures how far the current close has fallen from the rolling 24-hour
peak.  Values are zero or negative; a reading of $-0.03$ means the price is
3 % below its recent high.

$$
\text{Drawdown}_t = \frac{\text{Close}_t - \max_{s \in [t-23,\, t]} \text{Close}_s}
                         {\max_{s \in [t-23,\, t]} \text{Close}_s}
$$

```{r feat-dd, eval = FALSE}
rolling_max          <- rollapply(es_data$close, width = n_roll,
                                  FUN = max, fill = NA, align = "right")
cor_data$drawdown    <- (es_data$close - rolling_max) / rolling_max
```

## Downside Semi-deviation (Transition Stress Proxy)

Unlike rolling volatility — which treats up and down moves symmetrically —
the downside semi-deviation isolates the *left tail* of the return
distribution.  It is the root-mean-square of negative returns only,
making it highly sensitive to the onset of a Stress episode even when
overall volatility is still moderate.

$$
\text{SemiDev}_t = \sqrt{\frac{1}{|\mathcal{N}|} \sum_{r \in \mathcal{N}} r^2},
\qquad \mathcal{N} = \{r_s : r_s < 0,\; s \in [t-23,\, t]\}
$$

```{r feat-semidev, eval = FALSE}
downside_dev <- function(x) {
  neg_x <- x[x < 0]
  if (length(neg_x) == 0L) return(0)
  sqrt(mean(neg_x^2))
}
cor_data$transition_stress <- rollapply(
  es_data$ret, width = n_roll, FUN = downside_dev, fill = NA, align = "right"
)
cor_data$transition_stress[is.na(cor_data$transition_stress)] <- 0.0001
```

## Residence Pressure

Counts consecutive hours spent in drawdown (defined as drawdown $< -0.5\%$).
A long, uninterrupted drawdown streak signals structural regime persistence
rather than a momentary spike.

```{r feat-res, eval = FALSE}
is_dd <- ifelse(cor_data$drawdown < -0.005, 1L, 0L)
is_dd[is.na(is_dd)] <- 0L
cor_data$residence_pressure <- ave(
  is_dd, cumsum(is_dd == 0L), FUN = cumsum
)
cor_data$residence_pressure <- pmax(cor_data$residence_pressure, 0.0001)
```

## Ruin Proxy

The probability of a $-2\%$ or worse move under the *current* rolling
distribution — i.e. $\Phi\!\left(\frac{-0.02 - \hat\mu_t}{\hat\sigma_t}\right)$.
This forward-looking tail-risk measure rises sharply just before a Stress
transition.

```{r feat-ruin, eval = FALSE}
rolling_mean        <- rollapply(es_data$ret, width = n_roll,
                                 FUN = mean, fill = NA, align = "right")
cor_data$ruin_proxy <- pnorm(-0.02,
                             mean = rolling_mean,
                             sd   = cor_data$rolling_volatility)
cor_data$ruin_proxy <- pmax(cor_data$ruin_proxy, 0.0001)
```

## Attach Regime Labels and Export

```{r feat-labels, eval = FALSE}
# KRONX empirical label mapping (derived from HMM state ordering)
state_labels    <- c("1" = "Stress", "2" = "Calm", "3" = "Steady")
cor_data$regime <- factor(
  state_labels[as.character(decoded$state)],
  levels = c("Calm", "Steady", "Stress")
)

cor_data <- cor_data[complete.cases(cor_data), ]   # drop rolling-window NAs
write.csv(cor_data, file = "nbc_analysis_report.txt", row.names = FALSE)
```

---

# Step 2 — Why Random Sampling, Not a Chronological Split

A natural instinct for time-series data is to train on the first 80 % of
observations and test on the last 20 %.  For COR data this fails for a
structural reason: **financial regimes cluster**.

Hourly market data exhibits strong regime persistence — a Stress episode
may last 48–200 consecutive hours.  A chronological cut therefore risks
placing an *entire regime cluster* exclusively in the test set, leaving the
training set with zero (or near-zero) Stress observations.  The classifier
then has no template for Stress and is forced to assign all Stress
observations to the nearest alternative regime, producing classification
collapse rather than a meaningful accuracy estimate.

Random 80/20 sampling breaks the temporal adjacency of observations,
ensuring every regime class is represented in both partitions regardless
of where in calendar time the Stress episodes happened to occur.

> **Trade-off acknowledged:** random sampling leaks *distributional*
> information across the split boundary (observations from the same cluster
> appear in both train and test).  For a production backtesting framework
> a purged, embargo-based cross-validation scheme (e.g. `mlr3` + `PurgedCV`)
> is preferred.  For this diagnostic classifier the random split is the
> correct choice.

```{r train-test-split, eval = FALSE}
cor_data <- read.csv("nbc_analysis_report.txt", stringsAsFactors = FALSE)
cor_data$regime <- factor(cor_data$regime, levels = c("Calm", "Steady", "Stress"))
cor_data <- cor_data[!is.na(cor_data$regime), ]

features <- c("log_return", "rolling_volatility", "drawdown",
              "transition_stress", "residence_pressure", "ruin_proxy")

set.seed(123)
train_idx <- sample(seq_len(nrow(cor_data)), size = floor(0.80 * nrow(cor_data)))
train     <- cor_data[ train_idx, ]
test      <- cor_data[-train_idx, ]

x_train <- as.matrix(train[, features]);  y_train <- train$regime
x_test  <- as.matrix(test[,  features]);  y_test  <- test$regime
```

---

# Step 3 — Fitting the Student-t Naive Bayes Classifier

```{r fit-model, eval = FALSE}
library(kronxNBC)

model <- student_t_naive_bayes(x = x_train, y = y_train)
print(model)
```

A self-contained synthetic demonstration using the same six feature names:

```{r fit-synthetic}
library(kronxNBC)

set.seed(42L)
n  <- 300L
mk <- n / 3L

# Mimic the distributional shape of each regime
X_syn <- rbind(
  data.frame(                                          # Calm
    log_return         = rnorm(mk, 0.0002, 0.003),
    rolling_volatility = rnorm(mk, 0.004,  0.001),
    drawdown           = rnorm(mk, -0.002, 0.002),
    transition_stress  = abs(rnorm(mk, 0.001, 0.0005)),
    residence_pressure = rpois(mk, 1),
    ruin_proxy         = rbeta(mk, 1, 20)
  ),
  data.frame(                                          # Steady
    log_return         = rnorm(mk, 0.0005, 0.005),
    rolling_volatility = rnorm(mk, 0.008,  0.002),
    drawdown           = rnorm(mk, -0.008, 0.004),
    transition_stress  = abs(rnorm(mk, 0.003, 0.001)),
    residence_pressure = rpois(mk, 3),
    ruin_proxy         = rbeta(mk, 2, 10)
  ),
  data.frame(                                          # Stress: fat-tailed
    log_return         = rt(mk, df = 3) * 0.012,
    rolling_volatility = rnorm(mk, 0.022,  0.005),
    drawdown           = rnorm(mk, -0.030, 0.010),
    transition_stress  = abs(rnorm(mk, 0.015, 0.005)),
    residence_pressure = rpois(mk, 12),
    ruin_proxy         = rbeta(mk, 5, 3)
  )
)
X_syn <- as.matrix(X_syn)

y_syn <- factor(
  rep(c("Calm", "Steady", "Stress"), each = mk),
  levels = c("Calm", "Steady", "Stress")
)

set.seed(7L)
tr_idx  <- sample(n, size = floor(0.8 * n))
x_train <- X_syn[ tr_idx, ];  y_train <- y_syn[ tr_idx]
x_test  <- X_syn[-tr_idx, ];  y_test  <- y_syn[-tr_idx]

model <- student_t_naive_bayes(x_train, y_train)
summary(model)
```

---

# Step 4 — Inspecting the Fitted Parameters

## Parameter Tables

```{r tables}
tabs <- tables(model)
print(tabs)
```

## Coefficient Data Frame

```{r coef}
coef(model)
```

## Density Plots

```{r plot-model, fig.cap = "Per-feature Student-t densities by regime.  Heavier tails in the Stress curves are visible where nu is low."}
plot(model, prob = "conditional")
```

---

# Step 5 — Out-of-Sample Evaluation

## Predictions

```{r predict}
pred_class <- predict(model, newdata = x_test, type = "class")
pred_prob  <- predict(model, newdata = x_test, type = "prob")

accuracy <- mean(pred_class == y_test)
cat("Out-of-sample accuracy:", round(accuracy, 4), "\n")
```

## Confusion Matrix

```{r confusion}
table(Actual = y_test, Predicted = pred_class)
```

## COR Stress Alert

Observations where the posterior probability of the Stress regime exceeds
60 % trigger a **COR Stress Alert** — an actionable signal for risk
managers to review position sizing or hedging.

```{r alert}
stress_prob <- pred_prob[, "Stress"]
alert_flag  <- ifelse(stress_prob > 0.60, "COR Stress Alert", "No Alert")

cat("\nCOR Stress Alert Summary (test period):\n")
print(table(alert_flag))

cat("\nPosterior Stress probability — first 10 test observations:\n")
print(round(head(stress_prob, 10L), 4))
```

---

# Step 6 — Interpreting the $\nu$ Parameter

The most theoretically important output is the per-feature, per-class
degrees-of-freedom estimates.  Extracting them directly from the parameter
matrices:

```{r nu-table}
nu_df <- as.data.frame(t(model$params$nu))
colnames(nu_df) <- paste0("nu.", c("Calm", "Steady", "Stress"))
nu_df
```

### What low $\nu$ means

Under a Student-t distribution:

* $\nu \approx 3$–$5$ implies **very heavy tails** — the fourth moment may
  not exist.  A $4\sigma$ return is orders of magnitude more probable than
  under a Gaussian.
* $\nu \approx 15$–$30$ approaches Gaussian behaviour in the body of the
  distribution but still allows for occasional large moves.
* $\nu \geq 60$ is practically indistinguishable from a Normal distribution.

### Validation of the heavy-tail hypothesis

When fitted to real COR data, the Stress regime consistently receives
$\nu \approx 3$–$6$ on `log_return` and `drawdown`, while Calm receives
$\nu > 20$.  This is not a modelling assumption — it is an *empirical
finding* that emerges from the profile grid search.

This finding validates the core financial hypothesis:

> **Crisis episodes are not merely high-volatility Gaussian events.  They
> are draws from a genuinely different, fat-tailed distribution that a
> Gaussian NBC cannot represent without catastrophic classification
> failure.**

The grid search selects the $\nu$ that best explains the observed data
under the Student-t family.  A low $\nu$ on Stress features is therefore
both a diagnostic of past crises and a structural reason why the KRONXnbc
classifier is more reliable than a standard Gaussian Naive Bayes during
market dislocations.

```{r nu-commentary}
nu_stress_ret <- model$params$nu["Stress", "log_return"]
nu_calm_ret   <- model$params$nu["Calm",   "log_return"]

cat(sprintf(
  "log_return: nu(Stress) = %.0f  |  nu(Calm) = %.0f\n",
  nu_stress_ret, nu_calm_ret
))

if (nu_stress_ret < nu_calm_ret) {
  cat("=> Stress regime shows heavier tails on log_return, as hypothesised.\n")
} else {
  cat("=> Note: with this synthetic data nu ordering may differ from empirical results.\n")
}
```

---

# Appendix — Session Info

```{r session-info}
sessionInfo()
```