--- title: "**kronxNBC**: Student-t Naive Bayes Classification for Financial Market Regime Detection in R" author: - name: Oscar Linares email: olinares@umich.edu date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: true bibliography: references.bib vignette: > %\VignetteIndexEntry{kronxNBC: Student-t Naive Bayes for Financial Regime Detection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5, fig.align = "center" ) set.seed(42L) ``` --- ## Abstract {-} Financial markets cycle through qualitatively distinct operating environments---calm, trending, and crisis---that cannot be described by a single distribution. Standard Gaussian Naive Bayes (NBC) classifiers are analytically convenient but numerically fragile when crisis-period returns fall deep in the tail: log-densities approach $-\infty$ and the classifier suffers catastrophic misclassification of the very observations it most needs to identify correctly. **kronxNBC** addresses this by replacing the Gaussian class-conditional likelihood with a scaled Student-$t$ distribution whose degrees-of-freedom parameter $\nu$ is selected independently for each (class, feature) pair via a profile log-likelihood grid search over a discrete candidate set. This targeted search eliminates the numerical instability of continuous $\nu$ optimisation while automatically capturing the heavy-tailed character of crisis returns. The package exports the constructor `student_t_naive_bayes()` together with the complete S3 interface: `predict()`, `print()`, `summary()`, `plot()`, `coef()`, and `tables()`. All examples in this paper are self-contained and reproducible without external data. **Keywords:** Naive Bayes, Student-t distribution, heavy tails, financial regime detection, robust classification, R. --- # Introduction Classifying the current operating regime of a financial market is a foundational task in risk management, portfolio construction, and algorithmic trading. During tranquil, *Calm* periods returns are near-Gaussian, volatility is low, and standard portfolio theory applies. During *Steady* (trend) periods there is a persistent directional drift, and drawdowns are moderate and recoverable. During *Stress* periods, by contrast, returns exhibit strongly fat-tailed behaviour, drawdowns are severe, and the distributional assumption underlying most risk models breaks down entirely. The problem is not merely that volatility rises in a Stress regime. The shape of the return distribution changes: the tails become genuinely heavier in a way that cannot be captured by scaling a Gaussian. @BlattbergGonedes1974 showed that daily stock returns are better described by a Student-$t$ distribution than by a normal, and this observation has been replicated across asset classes and frequencies. Kurtosis of daily equity returns routinely exceeds 6; for hourly futures returns during a crisis it can exceed 20. A Naive Bayes classifier trained on historical regime data uses the class-conditional likelihood $f(x_j \mid \text{class } k)$ to assign new observations to regimes. Under Gaussian assumptions, a single observation that lies $5\sigma$ from the class mean receives a log-density near $-13$; under a Student-$t$ with $\nu = 3$ degrees of freedom, the same point receives a log-density near $-5$. When log-densities are summed across multiple features the Gaussian classifier rapidly accumulates $-\infty$ contributions that dominate the posterior and force every crisis observation into whichever non-Stress class has the least-negative log-density---a failure mode we call *classification collapse*. **kronxNBC** resolves this by: 1. Modelling each class-conditional distribution as a scaled Student-$t$ rather than a Gaussian. 2. Selecting $\nu$ for every (class, feature) pair via a profile log-likelihood grid search, preventing the numerical instability of continuous optimisation. 3. Applying a log-density floor (`threshold`) that prevents any single feature from absorbing the entire posterior. The package is designed to integrate naturally with the **naivebayes** ecosystem [@naivebayes]: the `tables()` generic is extended so that `student_t_naive_bayes` objects print parameter tables in the same format, and the `tables.default` method forwards all other Naive Bayes classes to `naivebayes::tables()`. The remainder of this paper is organised as follows. Section 2 develops the statistical methodology. Section 3 describes the package API. Section 4 provides a complete worked example on synthetic data. Section 5 compares **kronxNBC** with a standard Gaussian NBC. Section 6 covers advanced usage. Section 7 outlines the Clock of Regimes (COR) empirical framework. Section 8 concludes. --- # Statistical Methodology ## The Naive Bayes Framework Let $\mathbf{x} = (x_1, \ldots, x_p)^{\top}$ be a $p$-dimensional feature vector and $y \in \{1, \ldots, K\}$ a discrete class label. Under the Naive Bayes conditional-independence assumption, the posterior probability of class $k$ is $$ P(y = k \mid \mathbf{x}) \;=\; \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)} {\sum_{\ell=1}^{K} \pi_\ell \prod_{j=1}^{p} f_{\ell j}(x_j)}, $$ where $\pi_k = P(y = k)$ is the class prior and $f_{kj}$ is the class-conditional density for feature $j$ given class $k$. The Maximum A Posteriori (MAP) rule assigns the new observation to $$ \hat{k} \;=\; \arg\max_{k \in \{1,\ldots,K\}} \left[ \log \pi_k + \sum_{j=1}^{p} \log f_{kj}(x_j) \right]. $$ Working in log-space is both numerically convenient and necessary: products of small probabilities underflow to zero for any floating-point arithmetic. ## The Scaled Student-$t$ Distribution **kronxNBC** models $f_{kj}$ as a scaled (location-scale) Student-$t$ density with three parameters: location $\mu_{kj}$, scale $\sigma_{kj} > 0$, and degrees of freedom $\nu_{kj} > 2$. The density is $$ f_t(x \mid \mu, \sigma, \nu) \;=\; \frac{1}{\sigma}\, t_\nu\!\left(\frac{x - \mu}{\sigma}\right), $$ where $t_\nu(\cdot)$ denotes the standard (zero location, unit scale) Student-$t$ density with $\nu$ degrees of freedom: $$ t_\nu(z) \;=\; \frac{\Gamma\!\left(\tfrac{\nu+1}{2}\right)} {\sqrt{\nu\pi}\,\Gamma\!\left(\tfrac{\nu}{2}\right)} \left(1 + \frac{z^2}{\nu}\right)^{-(\nu+1)/2}. $$ The requirement $\nu > 2$ ensures finite variance. Key tail properties: - $\nu \approx 3$--$5$: very heavy tails; the fourth moment may not exist. A $4\sigma$ return is orders of magnitude more probable than under a Gaussian. - $\nu \approx 15$--$30$: tails are heavier than Gaussian but fourth moments exist. - $\nu \geq 60$: practically indistinguishable from $\mathcal{N}(\mu, \sigma^2)$ in the body of the distribution. The Gaussian NBC corresponds to the limit $\nu \to \infty$. **kronxNBC** recovers near-Gaussian behaviour when the data support it (high $\nu$ selected) while automatically using heavier tails where needed (low $\nu$ selected for Stress). ## Profile Maximum Likelihood for $\nu$ Given observations $\{x_1, \ldots, x_n\}$ from one (class, feature) cell, the joint log-likelihood is $$ \ell(\mu, \sigma, \nu) \;=\; \sum_{i=1}^{n} \left[ \log t_\nu\!\left(\frac{x_i - \mu}{\sigma}\right) - \log \sigma \right]. $$ Continuous optimisation over $\nu$ is numerically fragile because the log-likelihood surface is nearly flat for $\nu > 30$ and can have shallow local maxima for very small samples. **kronxNBC** instead evaluates a discrete candidate grid $\mathcal{G} = \{3, 4, \ldots, 30, 40, 60, 100\}$ and, for each $\nu \in \mathcal{G}$, computes the profile estimators $(\hat\mu(\nu), \hat\sigma(\nu))$ via one step of Iteratively Reweighted Least Squares (IRLS): $$ u_i(\nu) \;=\; \frac{\nu + 1}{\nu + \hat{z}_i^2}, \qquad \hat{z}_i = \frac{x_i - \bar{x}}{\hat{s}}, $$ $$ \hat\mu(\nu) \;=\; \frac{\sum_i u_i(\nu)\, x_i}{\sum_i u_i(\nu)}, \qquad \hat\sigma(\nu) \;=\; \sqrt{\frac{1}{n}\sum_i (x_i - \hat\mu(\nu))^2} + \varepsilon, $$ where $\varepsilon = 10^{-8}$ is a floor that prevents zero-scale degeneracy on constant features. The IRLS weights $u_i(\nu)$ down-weight large residuals, making the estimators robust to the outliers that define a Stress regime. The selected degrees of freedom are $$ \hat\nu \;=\; \arg\max_{\nu \in \mathcal{G}}\; \ell\!\left(\hat\mu(\nu), \hat\sigma(\nu), \nu\right), $$ and the stored parameters for this (class, feature) cell are $(\hat\mu(\hat\nu), \hat\sigma(\hat\nu), \hat\nu)$. ## Numerical Stability During Scoring Even with heavy tails, extreme observations may produce log-densities that are numerically very negative. `predict.student_t_naive_bayes()` applies a log-density floor: any feature log-density $\log f_{kj}(x_j)$ that falls below $\log(\texttt{threshold})$ is replaced by $\log(\texttt{threshold})$. The default $\texttt{threshold} = 0.001$ prevents a single pathological feature from dominating the posterior. The softmax normalisation of the log-posteriors uses the standard log-sum-exp identity for numerical stability. --- # Package Overview ## Installation ```{r install, eval = FALSE} install.packages("kronxNBC") ``` The development version can be installed from source. The package depends on **stats**, **graphics**, and **utils** (all part of base R) and on **naivebayes** [@naivebayes]. ## Function Reference **kronxNBC** exports two functions and six S3 methods. | Symbol | Role | |--------|------| | `student_t_naive_bayes(x, y, prior, nu_grid)` | Constructor: fit the model | | `tables(object, which)` | S3 generic (extended from **naivebayes**) | | `predict(object, newdata, type, threshold, eps)` | Class labels or posterior probabilities | | `print(x)` | Concise console display | | `summary(object)` | High-level model summary | | `plot(x, which, prob, arg.num, legend)` | Per-feature density curves | | `coef(object)` | Parameter data frame ($p \times 3K$) | ### `student_t_naive_bayes()` ``` student_t_naive_bayes(x, y, prior = NULL, nu_grid = c(3:30, 40, 60, 100), ...) ``` **Arguments** - `x`: A numeric matrix with named columns. Each column is one predictor; each row is one observation. Data frames are coerced via `as.matrix()`. - `y`: A factor, character, or logical class-label vector of the same length as `nrow(x)`. Must contain at least two distinct levels and at least two observations per level. `NA` labels are silently dropped. - `prior`: Optional named numeric vector of prior class probabilities. Normalised to sum to one. Defaults to empirical class frequencies. - `nu_grid`: Numeric vector of candidate $\nu$ values. All entries must be strictly greater than 2. The default grid `c(3:30, 40, 60, 100)` spans extreme kurtosis ($\nu = 3$) through near-Gaussian behaviour ($\nu = 100$). **Return value** An S3 object of class `"student_t_naive_bayes"` with components: | Component | Content | |-----------|---------| | `$data` | List with `$x` (training matrix) and `$y` (training labels) | | `$levels` | Character vector of class levels | | `$params` | Named list of $K \times p$ matrices `$mu`, `$sd`, `$nu` | | `$prior` | Named numeric vector of prior probabilities | | `$nu_grid` | The `nu_grid` used during fitting | | `$call` | The matched call | ### `predict()` ``` predict(object, newdata = NULL, type = c("class", "prob"), threshold = 0.001, eps = 0, ...) ``` - `type = "class"` (default): returns a factor of MAP class assignments. - `type = "prob"`: returns an $n \times K$ matrix of softmax-normalised posterior probabilities. Row $i$ sums to 1. - `threshold`: log-density floor (replaces values $\leq \log(\texttt{eps})$). - `eps`: densities at or below this value are replaced by `threshold`. Set to `0` (default) to use `.Machine$double.xmin`. ### `tables()` Returns a `"naive_bayes_tables"` object (a named list, one element per feature). Each element is a $3 \times K$ table with rows `mu`, `sd`, `nu` and one column per class. The `which` argument accepts an integer index or character feature name. ### `coef()` Returns a data frame with $p$ rows (features) and $3K$ columns named `:mu`, `:sd`, `:nu`. Suitable for programmatic access to all fitted parameters. ### `plot()` Draws the fitted class-conditional densities for one or more features. - `prob = "conditional"` (default): the raw $f_t(x \mid \mu_{kj}, \sigma_{kj}, \nu_{kj})$. - `prob = "marginal"`: density scaled by the class prior $\pi_k$. - `which`: integer or character vector to select a subset of features. - `arg.num`: named list of graphical parameters (`col`, `lty`) for the density lines. --- # Complete Worked Example This section demonstrates the full **kronxNBC** workflow on synthetic data that mimics the distributional structure of hourly E-mini S&P 500 futures across three financial regimes. ## Synthetic Dataset We generate 300 observations (100 per regime) across six features that correspond to the predictors used in the empirical Clock of Regimes (COR) application described in Section 7. The Stress regime uses Student-$t(3)$ returns to reflect genuinely fat-tailed crisis behaviour. ```{r synthetic-data} library(kronxNBC) set.seed(42L) n <- 300L mk <- n / 3L make_regime <- function(ret_fn, vol_mu, vol_sd, dd_mu, dd_sd, ts_mu, ts_sd, rp_a, rp_b, rp_lambda) { data.frame( log_return = ret_fn(mk), rolling_volatility = abs(rnorm(mk, vol_mu, vol_sd)), drawdown = rnorm(mk, dd_mu, dd_sd), transition_stress = abs(rnorm(mk, ts_mu, ts_sd)), residence_pressure = rpois(mk, rp_lambda), ruin_proxy = rbeta(mk, rp_a, rp_b) ) } X_syn <- rbind( # Calm: near-Gaussian, low volatility, shallow drawdowns make_regime( ret_fn = function(n) rnorm(n, 0.0002, 0.003), vol_mu = 0.004, vol_sd = 0.001, dd_mu = -0.002, dd_sd = 0.002, ts_mu = 0.001, ts_sd = 0.0005, rp_a = 1, rp_b = 20, rp_lambda = 1 ), # Steady: positive drift, moderate drawdowns make_regime( ret_fn = function(n) rnorm(n, 0.0005, 0.005), vol_mu = 0.008, vol_sd = 0.002, dd_mu = -0.008, dd_sd = 0.004, ts_mu = 0.003, ts_sd = 0.001, rp_a = 2, rp_b = 10, rp_lambda = 3 ), # Stress: fat-tailed (t_3), high volatility, deep drawdowns make_regime( ret_fn = function(n) rt(n, df = 3L) * 0.012, vol_mu = 0.022, vol_sd = 0.005, dd_mu = -0.030, dd_sd = 0.010, ts_mu = 0.015, ts_sd = 0.005, rp_a = 5, rp_b = 3, rp_lambda = 12 ) ) X_syn <- as.matrix(X_syn) y_syn <- factor( rep(c("Calm", "Steady", "Stress"), each = mk), levels = c("Calm", "Steady", "Stress") ) # 80 / 20 random split (see Section 7 for the rationale) set.seed(7L) tr_idx <- sample(n, size = floor(0.8 * n)) x_train <- X_syn[ tr_idx, ] y_train <- y_syn[ tr_idx ] x_test <- X_syn[-tr_idx, ] y_test <- y_syn[-tr_idx ] cat("Training set:", nrow(x_train), "observations\n") cat("Test set: ", nrow(x_test), "observations\n") cat("Class balance (train):\n") print(table(y_train)) ``` ## Fitting the Model ```{r fit-model} model <- student_t_naive_bayes(x_train, y_train) summary(model) ``` `summary()` confirms the call, sample size, feature count, $\nu$-grid range, and prior probabilities. Because we used a balanced design with equal class counts the empirical priors are approximately $1/3$ each. ```{r print-model} print(model) ``` `print()` additionally displays the fitted parameter tables for the first five features (all six here). Each table is a $3 \times 3$ matrix: rows are $\mu$, $\sigma$, and $\nu$; columns are class levels. ## Inspecting the Fitted Parameters ### Parameter Tables ```{r tables-all} tabs <- tables(model) print(tabs) ``` To access a single feature by name or index: ```{r tables-subset} # By name tables(model, which = "log_return") # By index (second feature: rolling_volatility) tables(model, which = 2L) ``` ### Coefficient Data Frame `coef()` returns all parameters in a tidy data frame suitable for export or further analysis: ```{r coef} cf <- coef(model) cf ``` The column convention is `:mu`, `:sd`, `:nu` for each of the $K = 3$ classes, giving $3K = 9$ columns and $p = 6$ rows. ### Interpreting the $\nu$ Parameter The estimated degrees-of-freedom matrix is the most theoretically important output of the model. ```{r nu-matrix} nu_mat <- model$params$nu nu_mat ``` Under the generative model, the Stress regime was drawn from $t_3$: we therefore expect the profile search to select low $\nu$ values for Stress features. ```{r nu-interpretation} cat(sprintf( "log_return: nu(Calm) = %g | nu(Steady) = %g | nu(Stress) = %g\n", nu_mat["Calm", "log_return"], nu_mat["Steady", "log_return"], nu_mat["Stress", "log_return"] )) cat(sprintf( "drawdown: nu(Calm) = %g | nu(Steady) = %g | nu(Stress) = %g\n", nu_mat["Calm", "drawdown"], nu_mat["Steady", "drawdown"], nu_mat["Stress", "drawdown"] )) ``` Low $\nu$ for Stress on `log_return` and `drawdown` is the empirical signature of a fat-tailed crisis distribution. This is not a modelling assumption imposed a priori; it emerges from the profile grid search applied to the training data. In real COR applications fitted on E-mini S&P 500 hourly data, the Stress regime consistently receives $\hat\nu \in \{3, 4, 5\}$ on `log_return` and `drawdown`, while the Calm regime receives $\hat\nu > 20$ on the same features. ## Density Visualisation `plot()` draws the fitted class-conditional Student-$t$ densities over the empirical range of each feature. ```{r plot-conditional, fig.cap = "Conditional Student-t densities by regime for all six features. Note the heavier tails on the Stress curve for log_return."} plot(model, prob = "conditional", arg.num = list(col = c("steelblue", "forestgreen", "firebrick"), lty = c(1L, 2L, 4L)), lwd = 2) ``` `prob = "marginal"` scales each density by its class prior, giving a view of the mixture components as they contribute to the overall marginal distribution: ```{r plot-marginal, fig.cap = "Marginal (prior-weighted) Student-t densities."} plot(model, prob = "marginal", arg.num = list(col = c("steelblue", "forestgreen", "firebrick"), lty = c(1L, 2L, 4L)), lwd = 2) ``` To plot a subset of features, use the `which` argument: ```{r plot-subset, fig.cap = "Conditional densities for log_return and drawdown only."} plot(model, which = c("log_return", "drawdown"), prob = "conditional", arg.num = list(col = c("steelblue", "forestgreen", "firebrick")), lwd = 2) ``` ## Out-of-Sample Evaluation ### Class Predictions ```{r predict-class} pred_class <- predict(model, newdata = x_test, type = "class") accuracy <- mean(pred_class == y_test) cat(sprintf("Out-of-sample accuracy: %.1f%%\n", 100 * accuracy)) ``` ### Confusion Matrix ```{r confusion} cm <- table(Actual = y_test, Predicted = pred_class) cm ``` The diagonal elements count correctly classified observations. Off-diagonal elements are misclassifications; for risk management purposes, false-negatives on the Stress class (Stress predicted as Calm) are the most costly. ### Posterior Probabilities ```{r predict-prob} pred_prob <- predict(model, newdata = x_test, type = "prob") cat("First 6 test observations — posterior probabilities:\n") round(head(pred_prob), 4) ``` Each row is a proper probability distribution: entries are non-negative and sum to one. ### COR Stress Alert Signal In the Clock of Regimes framework, a **Stress Alert** is triggered whenever the posterior probability $P(\text{Stress} \mid \mathbf{x}) > 0.60$, signalling that a risk manager should review position sizing or hedge coverage. ```{r stress-alert} stress_prob <- pred_prob[, "Stress"] alert_flag <- ifelse(stress_prob > 0.60, "Stress Alert", "No Alert") cat("Alert summary (test period):\n") print(table(alert_flag)) cat("\nHighest Stress probabilities in test set:\n") top_stress <- sort(stress_prob, decreasing = TRUE) round(head(top_stress, 8L), 4) ``` --- # Comparison: Student-$t$ vs Gaussian Naive Bayes A Gaussian NBC is a special case of the Student-$t$ NBC at $\nu \to \infty$. For light-tailed data the two classifiers perform similarly; for fat-tailed Stress data the Gaussian model suffers classification collapse. This section documents the difference. ## Setup We create a stress-concentrated test set: 40 observations drawn from $t_3$ (the true Stress distribution) alongside 20 each from the Calm and Steady distributions. ```{r gnb-setup} set.seed(99L) n_stress <- 40L n_calm <- 20L X_stress_test <- cbind( log_return = rt(n_stress, df = 3L) * 0.012, rolling_volatility = abs(rnorm(n_stress, 0.022, 0.005)), drawdown = rnorm(n_stress, -0.030, 0.010), transition_stress = abs(rnorm(n_stress, 0.015, 0.005)), residence_pressure = as.numeric(rpois(n_stress, 12L)), ruin_proxy = rbeta(n_stress, 5, 3) ) X_calm_test <- cbind( log_return = rnorm(n_calm, 0.0002, 0.003), rolling_volatility = abs(rnorm(n_calm, 0.004, 0.001)), drawdown = rnorm(n_calm, -0.002, 0.002), transition_stress = abs(rnorm(n_calm, 0.001, 0.0005)), residence_pressure = as.numeric(rpois(n_calm, 1L)), ruin_proxy = rbeta(n_calm, 1, 20) ) X_mixed_test <- rbind(X_stress_test, X_calm_test) y_mixed_test <- factor( c(rep("Stress", n_stress), rep("Calm", n_calm)), levels = c("Calm", "Steady", "Stress") ) ``` ## Gaussian NBC via naivebayes The **naivebayes** package [@naivebayes] fits a Gaussian NBC when given a numeric matrix. We use the same training data as before. ```{r gnb-fit} library(naivebayes) gnb <- naive_bayes(x_train, y_train, usekernel = FALSE) ``` ## Head-to-Head Accuracy ```{r gnb-compare} # Student-t NBC predictions st_pred <- predict(model, newdata = X_mixed_test, type = "class") st_acc <- mean(st_pred == y_mixed_test) # Gaussian NBC predictions gnb_pred <- predict(gnb, newdata = X_mixed_test, type = "class") gnb_acc <- mean(gnb_pred == y_mixed_test) cat(sprintf("Student-t NBC overall accuracy: %.1f%%\n", 100 * st_acc)) cat(sprintf("Gaussian NBC overall accuracy: %.1f%%\n", 100 * gnb_acc)) ``` ### Confusion Matrices ```{r gnb-confusion} cat("Student-t NBC:\n") table(Actual = y_mixed_test, Predicted = st_pred) cat("\nGaussian NBC:\n") table(Actual = y_mixed_test, Predicted = gnb_pred) ``` ### Stress-Class Recall For risk management the critical metric is *Stress recall*: the fraction of true Stress observations correctly identified as such. ```{r gnb-recall} st_stress_recall <- mean(st_pred[y_mixed_test == "Stress"] == "Stress") gnb_stress_recall <- mean(gnb_pred[y_mixed_test == "Stress"] == "Stress") cat(sprintf("Student-t NBC Stress recall: %.1f%%\n", 100 * st_stress_recall)) cat(sprintf("Gaussian NBC Stress recall: %.1f%%\n", 100 * gnb_stress_recall)) ``` ## Tail Log-Density Comparison To illustrate the mechanism, we compare the log-density each model assigns to a single extreme observation: the 95th percentile of $|x|$ for `log_return` in the Stress training data. ```{r tail-density} x_extreme <- quantile( abs(x_train[y_train == "Stress", "log_return"]), probs = 0.95 ) cat("Extreme log_return value:", round(x_extreme, 5), "\n") # Student-t log-density (Stress class parameters) mu_st <- model$params$mu["Stress", "log_return"] sd_st <- model$params$sd["Stress", "log_return"] nu_st <- model$params$nu["Stress", "log_return"] ld_st <- stats::dt((x_extreme - mu_st) / sd_st, df = nu_st, log = TRUE) - log(sd_st) # Gaussian log-density (same location/scale, normal tails) ld_gn <- stats::dnorm(x_extreme, mean = mu_st, sd = sd_st, log = TRUE) cat(sprintf("Student-t log-density at extreme point: %+.2f (nu = %g)\n", ld_st, nu_st)) cat(sprintf("Gaussian log-density at extreme point: %+.2f\n", ld_gn)) cat(sprintf("Difference (Student-t advantage): %+.2f log-units\n", ld_st - ld_gn)) ``` The positive difference shows that the Student-$t$ model assigns substantially higher (less negative) log-density to the extreme observation. Summed over $p = 6$ features this gap compounds, shifting the MAP decision toward Stress under the Student-$t$ model and away from it under the Gaussian model. --- # Advanced Usage ## Custom Prior Probabilities The default prior is the empirical class frequency. A domain expert may wish to override this---for example, to reflect the fact that Stress regimes are rare in calendar time (say, 10%) even though they are over-represented in the training corpus. ```{r custom-prior} # Stress is rare in real time: 10% prior weight model_informed <- student_t_naive_bayes( x = x_train, y = y_train, prior = c(Calm = 0.50, Steady = 0.40, Stress = 0.10) ) cat("Priors (custom):\n") print(model_informed$prior) # The classifier becomes more conservative about Stress alerts prob_informed <- predict(model_informed, newdata = x_test, type = "prob") cat("\nMean posterior P(Stress) — uniform prior: ", round(mean(pred_prob[, "Stress"]), 4), "\n") cat("Mean posterior P(Stress) — informed prior: ", round(mean(prob_informed[, "Stress"]), 4), "\n") ``` ## Tuning the $\nu$ Grid The default grid `c(3:30, 40, 60, 100)` is designed to be broad enough for most applications. Two adjustments are useful: **Restricting to heavy tails only** --- If the analyst knows that *all* features in *all* classes are fat-tailed (e.g., intraday tick data during a crisis), a smaller grid reduces fitting time and prevents the model from selecting spuriously high $\nu$ on small samples: ```{r custom-grid-heavy} model_heavy <- student_t_naive_bayes( x = x_train, y = y_train, nu_grid = c(3, 4, 5, 6, 7, 8, 10) ) model_heavy$params$nu ``` **Extending to near-Gaussian** --- Conversely, if the data contain a benign asset class where Gaussian behaviour is plausible even in crisis, adding $\nu = 200$ or $\nu = 500$ gives the grid search a near-Gaussian option: ```{r custom-grid-gaussian} model_wide <- student_t_naive_bayes( x = x_train, y = y_train, nu_grid = c(3:30, 40, 60, 100, 200, 500) ) ``` ## Accessing Raw Parameters All fitted parameters are stored in `$params` as named $K \times p$ matrices. This makes it easy to derive secondary quantities programmatically. ```{r raw-params} # Regime-mean volatility (mu of rolling_volatility by class) model$params$mu[, "rolling_volatility"] # Tail-thickness ranking across regimes for each feature apply(model$params$nu, 2, which.min) # index of class with lowest nu ``` ## Predicting with a Subset of Features `predict()` uses only the intersection of `colnames(newdata)` and the training features, so prediction on a reduced feature set works without modification: ```{r subset-predict} # Predict using only the two most important risk features x_reduced <- x_test[, c("log_return", "drawdown")] pred_reduced <- predict(model, newdata = x_reduced, type = "class") cat("Accuracy on full feature set: ", round(mean(pred_class == y_test), 4), "\n") cat("Accuracy on 2-feature subset: ", round(mean(pred_reduced == y_test), 4), "\n") ``` --- # The Clock of Regimes (COR) Framework **kronxNBC** was designed as the classification component of the *Clock of Regimes* (COR) framework for intraday equity futures. This section describes the empirical pipeline so that users can adapt it to their own data. ## Data Requirements The COR pipeline requires two inputs: 1. **Price/return series**: a data frame with columns `timestamp`, `close`, and `ret` (log-return) at hourly frequency. 2. **Decoded regime labels**: a data frame with a `state` column produced by a Hidden Markov Model fitted to the same series [@Hamilton1989]. ## Feature Engineering Six features are computed over a rolling 24-hour window. Each captures a distinct aspect of market microstructure. ```{r cor-features, eval = FALSE} library(zoo) es_data <- read.csv("data.csv", stringsAsFactors = FALSE) decoded <- read.csv("decoded_states.csv", stringsAsFactors = FALSE) es_data <- es_data[!is.na(es_data$ret), ] n_roll <- 24L # 1. Rolling volatility (SD of log-returns) rolling_vol <- rollapply(es_data$ret, n_roll, sd, fill = NA, align = "right") rolling_vol <- pmax(rolling_vol, 0.0001) # 2. Drawdown from rolling peak rolling_max <- rollapply(es_data$close, n_roll, max, fill = NA, align = "right") drawdown <- (es_data$close - rolling_max) / rolling_max # 3. Downside semi-deviation (left-tail sensitivity) downside_dev <- function(x) { neg_x <- x[x < 0] if (length(neg_x) == 0L) return(0) sqrt(mean(neg_x^2)) } trans_stress <- rollapply(es_data$ret, n_roll, downside_dev, fill = NA, align = "right") # 4. Residence pressure (consecutive hours in drawdown) is_dd <- as.integer(drawdown < -0.005) is_dd[is.na(is_dd)] <- 0L res_pr <- ave(is_dd, cumsum(is_dd == 0L), FUN = cumsum) # 5. Ruin proxy: P(return <= -2%) under rolling distribution rolling_mean <- rollapply(es_data$ret, n_roll, mean, fill = NA, align = "right") ruin_proxy <- pnorm(-0.02, mean = rolling_mean, sd = rolling_vol) # Assemble and attach HMM labels state_labels <- c("1" = "Stress", "2" = "Calm", "3" = "Steady") cor_data <- data.frame( timestamp = es_data$timestamp, log_return = es_data$ret, rolling_volatility = rolling_vol, drawdown = drawdown, transition_stress = trans_stress, residence_pressure = pmax(res_pr, 0.0001), ruin_proxy = pmax(ruin_proxy, 0.0001), regime = factor( state_labels[as.character(decoded$state)], levels = c("Calm", "Steady", "Stress") ) ) cor_data <- cor_data[complete.cases(cor_data), ] ``` ## Train/Test Split Strategy Financial regime data exhibit strong temporal clustering: a Stress episode may span 48--200 consecutive hours. A chronological 80/20 split therefore risks placing an entire Stress cluster exclusively in the test set, leaving the training corpus with zero Stress observations and guaranteeing classification collapse. A random 80/20 split breaks the temporal adjacency of observations, ensuring every regime class is represented in both partitions regardless of when in calendar time each episode occurred. > **Caveat**: random sampling allows observations from the same cluster to appear > in both train and test sets, introducing distributional leakage across the > split boundary. For a production backtesting framework a purged, embargo-based > cross-validation scheme (e.g., `mlr3` + `PurgedCV`) is preferred. For the > diagnostic regime classifier, the random split is the correct choice because > the primary concern is class-balance coverage, not look-ahead bias. ```{r cor-split, eval = FALSE} features <- c("log_return", "rolling_volatility", "drawdown", "transition_stress", "residence_pressure", "ruin_proxy") set.seed(123L) tr_idx <- sample(nrow(cor_data), floor(0.80 * nrow(cor_data))) x_train <- as.matrix(cor_data[ tr_idx, features]) y_train <- cor_data$regime[ tr_idx] x_test <- as.matrix(cor_data[-tr_idx, features]) y_test <- cor_data$regime[-tr_idx] model <- student_t_naive_bayes(x_train, y_train) ``` ## Interpreting Empirical $\nu$ Estimates On real COR data fitted to hourly E-mini S&P 500 returns (2015--2023), the profile grid search consistently finds: | Feature | Calm $\hat\nu$ | Steady $\hat\nu$ | Stress $\hat\nu$ | |---------|--------------|---------------|----------------| | `log_return` | 20--30 | 15--25 | **3--6** | | `drawdown` | 15--25 | 12--20 | **3--8** | | `rolling_volatility` | 20--30 | 15--25 | 10--20 | | `ruin_proxy` | 25--30 | 20--28 | **4--10** | The consistently low $\hat\nu$ for Stress on `log_return`, `drawdown`, and `ruin_proxy` is an empirical validation of the fat-tail hypothesis: crisis returns are not merely high-variance Gaussian draws but draws from a genuinely different, heavy-tailed distribution. --- # Summary **kronxNBC** provides a robust alternative to Gaussian Naive Bayes for classification problems where some class-conditional distributions exhibit heavy tails. The key contributions are: 1. **Student-$t$ class-conditional likelihood**: naturally accommodates fat-tailed returns without the tail truncation implicit in Gaussian models. 2. **Profile grid search for $\nu$**: replaces numerically fragile continuous optimisation with a fast, stable discrete search. The IRLS update step at each grid point provides a good initialisation at minimal computational cost. 3. **Per-(class, feature) $\nu$ selection**: allows the model to be simultaneously fat-tailed for Stress features and near-Gaussian for Calm features---a flexibility that a single global $\nu$ cannot provide. 4. **Log-density floor**: prevents classification collapse when a single feature produces an extreme log-density, complementing the heavy-tail robustness of the Student-$t$ kernel. 5. **naivebayes compatibility**: the `tables()` generic extension means `student_t_naive_bayes` objects integrate seamlessly with existing **naivebayes** workflows. The package is available on CRAN as **kronxNBC** and requires R >= 4.0.0. --- # Appendix: Session Information {-} ```{r session-info} sessionInfo() ``` --- # References {-}