--- title: "Empirical Regime Classification with KRONXnbc" author: "Oscar Linares" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: true vignette: > %\VignetteIndexEntry{Empirical Regime Classification with KRONXnbc} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) ``` # Overview **KRONXnbc** implements a *Clock of Regimes* (COR) classifier: a Student-t Naive Bayes model designed for non-stationary financial market data. Three market regimes are distinguished: | Regime | Economic intuition | |---------|--------------------| | **Calm** | Low volatility, mean-reverting returns | | **Steady** | Moderate drift, controlled drawdowns | | **Stress** | Fat-tailed returns, deep drawdowns, elevated ruin probability | The distinguishing engineering choice is a **profile grid search** over the degrees-of-freedom parameter $\nu$ of the Student-t likelihood. Rather than fixing $\nu$ or solving a numerically fragile continuous optimisation, the model evaluates a discrete grid $\nu \in \{3, 4, \ldots, 30, 40, 60, 100\}$ for every (class, feature) pair and selects the $\nu$ that maximises the profile log-likelihood. This prevents the $-\infty$ log-density underflow that collapses a standard Gaussian NBC when a crisis observation falls in the far tail. --- # Step 1 — Feature Engineering The raw input is an hourly equity price series (e.g. E-mini S&P 500 futures, `data.csv`) paired with a file of decoded HMM regime labels (`decoded_states.csv`). The `input2nbc.R` pipeline constructs six continuous predictors over a 24-hour rolling window. ```{r feature-engineering, eval = FALSE} library(zoo) es_data <- read.csv("data.csv", stringsAsFactors = FALSE) decoded <- read.csv("decoded_states.csv", stringsAsFactors = FALSE) es_data <- es_data[!is.na(es_data$ret), ] # drop leading NA stopifnot(nrow(es_data) == nrow(decoded)) n_roll <- 24L # 24-hour window cor_data <- data.frame( timestamp = es_data$timestamp, log_return = es_data$ret ) ``` ## Rolling Volatility Standard deviation of log-returns over the window; floored at 0.0001 to avoid zero-SD degeneracy on flat-market bars. ```{r feat-vol, eval = FALSE} cor_data$rolling_volatility <- rollapply( es_data$ret, width = n_roll, FUN = sd, fill = NA, align = "right" ) cor_data$rolling_volatility <- pmax(cor_data$rolling_volatility, 0.0001) ``` ## Drawdown Measures how far the current close has fallen from the rolling 24-hour peak. Values are zero or negative; a reading of $-0.03$ means the price is 3 % below its recent high. $$ \text{Drawdown}_t = \frac{\text{Close}_t - \max_{s \in [t-23,\, t]} \text{Close}_s} {\max_{s \in [t-23,\, t]} \text{Close}_s} $$ ```{r feat-dd, eval = FALSE} rolling_max <- rollapply(es_data$close, width = n_roll, FUN = max, fill = NA, align = "right") cor_data$drawdown <- (es_data$close - rolling_max) / rolling_max ``` ## Downside Semi-deviation (Transition Stress Proxy) Unlike rolling volatility — which treats up and down moves symmetrically — the downside semi-deviation isolates the *left tail* of the return distribution. It is the root-mean-square of negative returns only, making it highly sensitive to the onset of a Stress episode even when overall volatility is still moderate. $$ \text{SemiDev}_t = \sqrt{\frac{1}{|\mathcal{N}|} \sum_{r \in \mathcal{N}} r^2}, \qquad \mathcal{N} = \{r_s : r_s < 0,\; s \in [t-23,\, t]\} $$ ```{r feat-semidev, eval = FALSE} downside_dev <- function(x) { neg_x <- x[x < 0] if (length(neg_x) == 0L) return(0) sqrt(mean(neg_x^2)) } cor_data$transition_stress <- rollapply( es_data$ret, width = n_roll, FUN = downside_dev, fill = NA, align = "right" ) cor_data$transition_stress[is.na(cor_data$transition_stress)] <- 0.0001 ``` ## Residence Pressure Counts consecutive hours spent in drawdown (defined as drawdown $< -0.5\%$). A long, uninterrupted drawdown streak signals structural regime persistence rather than a momentary spike. ```{r feat-res, eval = FALSE} is_dd <- ifelse(cor_data$drawdown < -0.005, 1L, 0L) is_dd[is.na(is_dd)] <- 0L cor_data$residence_pressure <- ave( is_dd, cumsum(is_dd == 0L), FUN = cumsum ) cor_data$residence_pressure <- pmax(cor_data$residence_pressure, 0.0001) ``` ## Ruin Proxy The probability of a $-2\%$ or worse move under the *current* rolling distribution — i.e. $\Phi\!\left(\frac{-0.02 - \hat\mu_t}{\hat\sigma_t}\right)$. This forward-looking tail-risk measure rises sharply just before a Stress transition. ```{r feat-ruin, eval = FALSE} rolling_mean <- rollapply(es_data$ret, width = n_roll, FUN = mean, fill = NA, align = "right") cor_data$ruin_proxy <- pnorm(-0.02, mean = rolling_mean, sd = cor_data$rolling_volatility) cor_data$ruin_proxy <- pmax(cor_data$ruin_proxy, 0.0001) ``` ## Attach Regime Labels and Export ```{r feat-labels, eval = FALSE} # KRONX empirical label mapping (derived from HMM state ordering) state_labels <- c("1" = "Stress", "2" = "Calm", "3" = "Steady") cor_data$regime <- factor( state_labels[as.character(decoded$state)], levels = c("Calm", "Steady", "Stress") ) cor_data <- cor_data[complete.cases(cor_data), ] # drop rolling-window NAs write.csv(cor_data, file = "nbc_analysis_report.txt", row.names = FALSE) ``` --- # Step 2 — Why Random Sampling, Not a Chronological Split A natural instinct for time-series data is to train on the first 80 % of observations and test on the last 20 %. For COR data this fails for a structural reason: **financial regimes cluster**. Hourly market data exhibits strong regime persistence — a Stress episode may last 48–200 consecutive hours. A chronological cut therefore risks placing an *entire regime cluster* exclusively in the test set, leaving the training set with zero (or near-zero) Stress observations. The classifier then has no template for Stress and is forced to assign all Stress observations to the nearest alternative regime, producing classification collapse rather than a meaningful accuracy estimate. Random 80/20 sampling breaks the temporal adjacency of observations, ensuring every regime class is represented in both partitions regardless of where in calendar time the Stress episodes happened to occur. > **Trade-off acknowledged:** random sampling leaks *distributional* > information across the split boundary (observations from the same cluster > appear in both train and test). For a production backtesting framework > a purged, embargo-based cross-validation scheme (e.g. `mlr3` + `PurgedCV`) > is preferred. For this diagnostic classifier the random split is the > correct choice. ```{r train-test-split, eval = FALSE} cor_data <- read.csv("nbc_analysis_report.txt", stringsAsFactors = FALSE) cor_data$regime <- factor(cor_data$regime, levels = c("Calm", "Steady", "Stress")) cor_data <- cor_data[!is.na(cor_data$regime), ] features <- c("log_return", "rolling_volatility", "drawdown", "transition_stress", "residence_pressure", "ruin_proxy") set.seed(123) train_idx <- sample(seq_len(nrow(cor_data)), size = floor(0.80 * nrow(cor_data))) train <- cor_data[ train_idx, ] test <- cor_data[-train_idx, ] x_train <- as.matrix(train[, features]); y_train <- train$regime x_test <- as.matrix(test[, features]); y_test <- test$regime ``` --- # Step 3 — Fitting the Student-t Naive Bayes Classifier ```{r fit-model, eval = FALSE} library(kronxNBC) model <- student_t_naive_bayes(x = x_train, y = y_train) print(model) ``` A self-contained synthetic demonstration using the same six feature names: ```{r fit-synthetic} library(kronxNBC) set.seed(42L) n <- 300L mk <- n / 3L # Mimic the distributional shape of each regime X_syn <- rbind( data.frame( # Calm log_return = rnorm(mk, 0.0002, 0.003), rolling_volatility = rnorm(mk, 0.004, 0.001), drawdown = rnorm(mk, -0.002, 0.002), transition_stress = abs(rnorm(mk, 0.001, 0.0005)), residence_pressure = rpois(mk, 1), ruin_proxy = rbeta(mk, 1, 20) ), data.frame( # Steady log_return = rnorm(mk, 0.0005, 0.005), rolling_volatility = rnorm(mk, 0.008, 0.002), drawdown = rnorm(mk, -0.008, 0.004), transition_stress = abs(rnorm(mk, 0.003, 0.001)), residence_pressure = rpois(mk, 3), ruin_proxy = rbeta(mk, 2, 10) ), data.frame( # Stress: fat-tailed log_return = rt(mk, df = 3) * 0.012, rolling_volatility = rnorm(mk, 0.022, 0.005), drawdown = rnorm(mk, -0.030, 0.010), transition_stress = abs(rnorm(mk, 0.015, 0.005)), residence_pressure = rpois(mk, 12), ruin_proxy = rbeta(mk, 5, 3) ) ) X_syn <- as.matrix(X_syn) y_syn <- factor( rep(c("Calm", "Steady", "Stress"), each = mk), levels = c("Calm", "Steady", "Stress") ) set.seed(7L) tr_idx <- sample(n, size = floor(0.8 * n)) x_train <- X_syn[ tr_idx, ]; y_train <- y_syn[ tr_idx] x_test <- X_syn[-tr_idx, ]; y_test <- y_syn[-tr_idx] model <- student_t_naive_bayes(x_train, y_train) summary(model) ``` --- # Step 4 — Inspecting the Fitted Parameters ## Parameter Tables ```{r tables} tabs <- tables(model) print(tabs) ``` ## Coefficient Data Frame ```{r coef} coef(model) ``` ## Density Plots ```{r plot-model, fig.cap = "Per-feature Student-t densities by regime. Heavier tails in the Stress curves are visible where nu is low."} plot(model, prob = "conditional") ``` --- # Step 5 — Out-of-Sample Evaluation ## Predictions ```{r predict} pred_class <- predict(model, newdata = x_test, type = "class") pred_prob <- predict(model, newdata = x_test, type = "prob") accuracy <- mean(pred_class == y_test) cat("Out-of-sample accuracy:", round(accuracy, 4), "\n") ``` ## Confusion Matrix ```{r confusion} table(Actual = y_test, Predicted = pred_class) ``` ## COR Stress Alert Observations where the posterior probability of the Stress regime exceeds 60 % trigger a **COR Stress Alert** — an actionable signal for risk managers to review position sizing or hedging. ```{r alert} stress_prob <- pred_prob[, "Stress"] alert_flag <- ifelse(stress_prob > 0.60, "COR Stress Alert", "No Alert") cat("\nCOR Stress Alert Summary (test period):\n") print(table(alert_flag)) cat("\nPosterior Stress probability — first 10 test observations:\n") print(round(head(stress_prob, 10L), 4)) ``` --- # Step 6 — Interpreting the $\nu$ Parameter The most theoretically important output is the per-feature, per-class degrees-of-freedom estimates. Extracting them directly from the parameter matrices: ```{r nu-table} nu_df <- as.data.frame(t(model$params$nu)) colnames(nu_df) <- paste0("nu.", c("Calm", "Steady", "Stress")) nu_df ``` ### What low $\nu$ means Under a Student-t distribution: * $\nu \approx 3$–$5$ implies **very heavy tails** — the fourth moment may not exist. A $4\sigma$ return is orders of magnitude more probable than under a Gaussian. * $\nu \approx 15$–$30$ approaches Gaussian behaviour in the body of the distribution but still allows for occasional large moves. * $\nu \geq 60$ is practically indistinguishable from a Normal distribution. ### Validation of the heavy-tail hypothesis When fitted to real COR data, the Stress regime consistently receives $\nu \approx 3$–$6$ on `log_return` and `drawdown`, while Calm receives $\nu > 20$. This is not a modelling assumption — it is an *empirical finding* that emerges from the profile grid search. This finding validates the core financial hypothesis: > **Crisis episodes are not merely high-volatility Gaussian events. They > are draws from a genuinely different, fat-tailed distribution that a > Gaussian NBC cannot represent without catastrophic classification > failure.** The grid search selects the $\nu$ that best explains the observed data under the Student-t family. A low $\nu$ on Stress features is therefore both a diagnostic of past crises and a structural reason why the KRONXnbc classifier is more reliable than a standard Gaussian Naive Bayes during market dislocations. ```{r nu-commentary} nu_stress_ret <- model$params$nu["Stress", "log_return"] nu_calm_ret <- model$params$nu["Calm", "log_return"] cat(sprintf( "log_return: nu(Stress) = %.0f | nu(Calm) = %.0f\n", nu_stress_ret, nu_calm_ret )) if (nu_stress_ret < nu_calm_ret) { cat("=> Stress regime shows heavier tails on log_return, as hypothesised.\n") } else { cat("=> Note: with this synthetic data nu ordering may differ from empirical results.\n") } ``` --- # Appendix — Session Info ```{r session-info} sessionInfo() ```