Chapter A04: Directional Tail Diagnostics for Prior-Posterior Disagreement

Kjell Nygren

2026-04-30

Directional tail diagnostic (current implementation)

This appendix documents the current directional_tail() diagnostic, how to call it directly, and how it is used inside summary.glmb().

The diagnostic measures prior–posterior disagreement in a multivariate direction (Evans and Moshonov 2006) rather than coefficient-by-coefficient only. It reports the standardized (Mahalanobis) shift of the posterior mean away from a reference vector, together with a directional tail probability that quantifies how much posterior mass lies against that shift.

Theoretical interpretation and relation to t and F

Let \(\mu_0\) denote the reference vector (prior mean by default, or a null/reference mode), and let \(\beta^{(m)}\) be posterior draws. directional_tail() computes a whitening map from posterior precision so that

\[ Z^{(m)} = W\left(\beta^{(m)} - \mu_0\right), \]

with \(W^\top W = \text{Prec}_{post}\). In this scale, Euclidean distance corresponds to Mahalanobis distance in the original coefficient space.
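As a rough illustration of the whitening idea (a base-R sketch, not necessarily the package's internal computation), \(W\) can be taken as a Cholesky factor of the posterior precision, so that the squared Euclidean norm of \(Z\) reproduces the Mahalanobis distance:

Sigma_post <- matrix(c(0.50, 0.15, 0.15, 0.30), 2, 2)  # hypothetical posterior covariance
Prec_post  <- solve(Sigma_post)
W          <- chol(Prec_post)                          # upper-triangular W with W'W = Prec_post
mu0        <- c(0, 0)                                  # reference vector
beta       <- c(0.9, -0.4)                             # one posterior draw

Z <- W %*% (beta - mu0)                                # whitened draw
c(euclidean_whitened = sum(Z^2),
  mahalanobis_raw    = mahalanobis(beta, mu0, Sigma_post))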

Define

\[ \delta = E[Z \mid y], \qquad d = \|\delta\|_2. \]

The function returns \(d\) as mahalanobis_shift, and the directional-tail probability

\[ p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y\right), \]

estimated by posterior draws.

Intuition: \(\delta\) is the posterior mean shift away from the reference in whitened space. The event \(\delta^\top Z \le 0\) is the half-space opposite the direction of disagreement, so \(p_{\text{dir}}\) quantifies how much posterior mass lies “against” that shift.

Under a Gaussian approximation in whitened space, \[ Z \mid y \approx N(\delta, I_p), \] so with \(u = \delta/\|\delta\|\), \[ u^\top Z \sim N(d, 1), \qquad p_{\text{dir}} = P(u^\top Z \le 0) = \Phi(-d). \] The directional tail is therefore a one-dimensional tail probability driven entirely by the standardized distance \(d\).
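A minimal Monte Carlo check of this approximation in base R (the dimension and shift vector below are arbitrary illustrations):

# Under Z | y ~ N(delta, I_p), the directional tail P(delta' Z <= 0 | y)
# should match pnorm(-d) with d = ||delta||.
set.seed(1)
p     <- 3
delta <- c(0.8, -0.5, 0.3)                   # hypothetical whitened shift
d     <- sqrt(sum(delta^2))

Z <- matrix(rnorm(10000 * p), ncol = p)      # draws from N(0, I_p)
Z <- sweep(Z, 2, delta, "+")                 # shift to N(delta, I_p)

c(monte_carlo = mean(Z %*% delta <= 0),      # draw-based directional tail
  gaussian    = pnorm(-d))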

Relation to univariate t statistics

For \(p=1\), whitening reduces to scalar standardization and \[ d = \frac{\mu_{post} - \mu_0}{\sigma_{post}}, \] so \(d\) is the posterior z/t-type contrast against \(\mu_0\) (in the scalar case \(d\) carries the sign of the shift; the multivariate \(d = \|\delta\|\) corresponds to \(|d|\)). In this case:

  • two-sided t/z-style tail: \(2\Phi(-|d|)\) (or Student-t analogue with df)
  • directional tail used here: \(\Phi(-|d|)\) when the direction is set by \(\delta\)

Thus directional_tail() plays the same role as a one-sided standardized test in the univariate case, while still extending naturally to multivariate models.
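A scalar sketch in base R, using made-up posterior draws, illustrating this reduction:

set.seed(2)
mu0  <- 0
beta <- rnorm(10000, mean = 1.2, sd = 0.6)   # stand-in posterior draws
d    <- (mean(beta) - mu0) / sd(beta)        # standardized posterior contrast

# Opposite-side posterior mass along the direction of the shift
p_dir <- mean(sign(mean(beta) - mu0) * (beta - mu0) <= 0)
c(draw_based = p_dir, gaussian = pnorm(-abs(d)), two_sided = 2 * pnorm(-abs(d)))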

Relation to multivariate Wald/F-style tests

The squared shift \[ d^2 = \delta^\top \delta \] is the whitened quadratic distance from the reference. This is the same geometric core as multivariate Wald statistics (and, after scaling, Hotelling/F forms): larger \(d^2\) means stronger global departure from the reference.

Difference in emphasis:

  • Wald/Hotelling/F use radial evidence (any direction).
  • directional_tail() uses directional half-space evidence along \(\delta\).

Under the Gaussian approximation, both are monotone in \(d\): as \(d\) increases, quadratic-test evidence strengthens and \(p_{\text{dir}}=\Phi(-d)\) decreases.

In this sense, directional_tail() is a directional companion to multivariate Wald/F-style testing: it retains the same Mahalanobis geometry but reports evidence on an interpretable one-sided directional probability scale.
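A small base-R illustration, under the same Gaussian approximation, of how the radial (chi-squared) tail and the directional tail both decrease as \(d\) grows; the dimension is an arbitrary example:

p <- 3
d <- seq(0.5, 4, by = 0.5)
data.frame(
  d             = d,
  wald_tail     = pchisq(d^2, df = p, lower.tail = FALSE),  # radial evidence
  p_directional = pnorm(-d)                                 # directional evidence
)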

Interpretation: what probability is being assessed?

This is the key conceptual difference from classical p-values (Bernardo and Smith 1994).

In a classical t-test, the p-value is a sampling probability under the null: \[ P_{H_0}\!\left(T(Y^{rep}) \ge T(y_{obs})\right), \] that is, how extreme the observed statistic would be in repeated samples if \(H_0\) were true.

By contrast, the directional-tail quantity here is a posterior probability conditional on the observed data: \[ p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y,\ \text{model},\ \text{prior},\ \mu_0\right). \]

So the probability being assessed is not a prior probability and not a frequentist repeated-sampling probability. It is posterior mass, after updating by the data.

Practical interpretation: a small \(p_{\text{dir}}\) means that, given the observed data, model, and prior, little posterior mass lies on the opposite side of the reference along the direction of the posterior shift; it does not control a repeated-sampling error rate.

For summary.glmb(), this means the reported directional tails (against the prior mean and against the null reference) are posterior statements about the fitted model, not classical p-values, even though the two are often numerically comparable.

Univariate connection to known Bayesian tail/sign summaries

In the one-parameter case, directional tail reduces to a one-sided posterior tail area relative to \(\mu_0\):

  • if posterior shift is positive: \(p_{\text{dir}} = P(\beta \le \mu_0 \mid y)\)
  • if posterior shift is negative: \(p_{\text{dir}} = P(\beta \ge \mu_0 \mid y)\)

This is closely related to well-known Bayesian sign/tail summaries (including posterior sign probability / “probability of direction” style summaries in applied Bayesian reporting (Gelman et al. 2013)). In that language, for a scalar parameter, p_directional is essentially the opposite-side posterior mass, and the sign probability is approximately \(1 - p_{\text{dir}}\).
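For a scalar parameter this relationship can be checked directly from draws; the snippet below uses simulated stand-in draws rather than a glmbayes fit:

set.seed(3)
beta  <- rnorm(5000, mean = 0.7, sd = 0.5)   # hypothetical posterior draws
mu0   <- 0
p_dir <- if (mean(beta) > mu0) mean(beta <= mu0) else mean(beta >= mu0)
c(p_directional = p_dir, sign_probability = 1 - p_dir)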

The main contribution of directional_tail() is the multivariate extension: it uses whitening plus projection onto the posterior shift direction to provide a single coherent directional-tail probability even when coefficients are correlated and inference is genuinely multivariate.

How magnitude compares to classical tests

Because the targets differ (posterior mass vs repeated-sampling extremeness), p_directional is not numerically identical to a classical p-value. Still, the two are often comparable through the standardized shift magnitude \(d\).

In the scalar Gaussian approximation: \[ p_{\text{dir}} \approx \Phi(-|d|), \qquad p_{\text{post,2s}} \approx 2\Phi(-|d|), \] while a classical two-sided p-value is typically \[ p_{\text{classical,2s}} \approx 2\{1-F_t(|t|;\nu)\} \] (or \(2\Phi(-|z|)\) asymptotically).
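A quick base-R comparison of the three quantities at an illustrative shift and degrees of freedom (numbers chosen arbitrarily):

d  <- 2.1    # standardized shift
nu <- 25     # residual degrees of freedom for the classical t
c(p_directional    = pnorm(-abs(d)),
  posterior_2sided = 2 * pnorm(-abs(d)),
  classical_2sided = 2 * pt(-abs(d), df = nu))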

Strong-prior regime

When the prior is strong relative to the likelihood, two effects dominate:

  1. Centering effect: posterior mean is pulled toward the prior/reference.
  2. Precision effect: posterior covariance contracts along prior-informed directions.

Consequences for p_directional vs classical p-values (a small conjugate sketch follows this list):

  • If prior and null/reference are aligned with each other (and with little data contradiction), the posterior shift from \(\mu_0\) is usually smaller, so \(d\) is smaller and p_directional tends to be larger (weaker apparent evidence against the reference) than classical p-values.
  • If the prior is strong but centered away from the classical null in the same direction as the data, \(d\) can increase and p_directional can be smaller (stronger evidence in that direction) than classical tails.
  • In multivariate settings, this divergence is often anisotropic: prior strength can increase/decrease evidence differently across directions, and directional_tail() summarizes the net directional result along \(\delta\).
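The conjugate normal-mean model (known variance) gives a compact sketch of both effects; all numbers are illustrative and the helper post_d() is a throwaway function, not part of the package:

# Standardized posterior shift d for a normal mean with known sigma and a
# normal prior; prec_post is the posterior precision, mu_post the posterior mean.
post_d <- function(ybar, n, sigma, mu_prior, tau_prior, mu0) {
  prec_post <- n / sigma^2 + 1 / tau_prior^2
  mu_post   <- (n * ybar / sigma^2 + mu_prior / tau_prior^2) / prec_post
  (mu_post - mu0) * sqrt(prec_post)
}
ybar <- 0.5; n <- 30; sigma <- 1; mu0 <- 0
c(classical_z        = (ybar - mu0) / (sigma / sqrt(n)),
  prior_at_reference = post_d(ybar, n, sigma, mu_prior = 0,   tau_prior = 0.1, mu0 = mu0),
  prior_beyond_data  = post_d(ybar, n, sigma, mu_prior = 1.5, tau_prior = 0.1, mu0 = mu0))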

Weak-prior (data-dominant) limit

As prior precision becomes small relative to information in the data:

  • posterior mean approaches the MLE (or quasi-MLE),
  • posterior covariance approaches the inverse observed information,
  • \(d\) approaches the usual standardized likelihood-based contrast.

Hence in the univariate case, p_directional approaches a one-sided likelihood-based tail measure, and \(2\,p_{\text{dir}}\) approaches the familiar two-sided z/t-style tail (asymptotically).
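A minimal base-R check of the scalar correspondence, using an ordinary lm() fit as the data-dominant benchmark (this only illustrates the arithmetic relationship between one-sided t tails and the classical two-sided p-value; it does not call glmbayes):

fit_ols <- lm(dist ~ speed, data = cars)
tval    <- summary(fit_ols)$coefficients["speed", "t value"]
df      <- fit_ols$df.residual
p_dir   <- pt(-abs(tval), df = df)            # one-sided, posterior-style tail
c(two_times_p_dir  = 2 * p_dir,
  classical_2sided = summary(fit_ols)$coefficients["speed", "Pr(>|t|)"])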

In the multivariate case, \(d^2\) approaches the corresponding Wald quadratic form (and its Hotelling/F scaling in finite samples), so Bayesian directional and classical quadratic tests tend to agree more closely in large-sample/weak-prior regimes.

Scalar decomposition (normal-gamma heuristic)

For one coefficient, the posterior-vs-classical comparison can be decomposed into a center effect and a scale effect. Write \[ t_{\text{post}} = \frac{\hat\beta_{\text{post}}-\mu_0}{\text{SE}_{\text{post}}}, \qquad t_{\text{class}} = \frac{\hat\beta_{\text{MLE}}-\mu_0}{\text{SE}_{\text{class}}}. \] Then \[ t_{\text{post}} = t_{\text{class}} \times \underbrace{\frac{\hat\beta_{\text{post}}-\mu_0}{\hat\beta_{\text{MLE}}-\mu_0}}_{\text{center/shrinkage factor } \kappa} \times \underbrace{\frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}}}_{\text{scale factor}}. \]

In conjugate normal-gamma style parameterizations, a common heuristic is \[ \text{SE}_{\text{post}} \approx \text{SE}_{\text{class}} \sqrt{\frac{n-p}{n+n_{\text{prior}}}}, \] so \[ \frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}} \approx \sqrt{\frac{n+n_{\text{prior}}}{n-p}}. \] If one also uses a simple shrinkage approximation \(\kappa \approx \frac{n}{n+n_{\text{prior}}}\), then \[ t_{\text{post}} \approx t_{\text{class}} \cdot \frac{n}{n+n_{\text{prior}}} \cdot \sqrt{\frac{n+n_{\text{prior}}}{n-p}}. \]

Using \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1-\text{pwt}}\) (so that pwt is the prior's share of the total effective sample size, \(n_{\text{prior}}/(n+n_{\text{prior}})\)), this can be rewritten as \[ t_{\text{post}} \approx t_{\text{class}} \cdot \sqrt{1-\text{pwt}} \cdot \sqrt{\frac{n}{n-p}}. \]
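A small helper evaluating this heuristic (t_post_heuristic and its arguments are illustrative names, not package functions, and the simple shrinkage factor is the approximation stated above):

t_post_heuristic <- function(t_class, n, p, pwt) {
  n_prior <- n * pwt / (1 - pwt)             # prior pseudo-sample size
  kappa   <- n / (n + n_prior)               # center/shrinkage factor
  scale   <- sqrt((n + n_prior) / (n - p))   # SE_class / SE_post
  t_class * kappa * scale                    # equals t_class * sqrt(1 - pwt) * sqrt(n / (n - p))
}
t_post_heuristic(t_class = 2.5, n = 100, p = 3, pwt = 0.3)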

This expression shows why posterior directional tails can be either larger or smaller than classical tails:

  • centering/shrinkage (\(\kappa\)) tends to pull \(t_{\text{post}}\) toward 0,
  • scale correction can push \(|t_{\text{post}}|\) upward,
  • the net result depends on their product.

A useful threshold from this heuristic: the net factor \(\kappa \cdot \sqrt{\frac{n+n_{\text{prior}}}{n-p}}\) exceeds 1 exactly when \[ n_{\text{prior}} < p\,\frac{n}{n-p}, \] which is approximately \(n_{\text{prior}} < p\) when \(n \gg p\). Once the prior pseudo-sample size is large relative to the model dimension, center shrinkage outweighs the standard-error correction and the posterior contrast is attenuated.

This is the scalar analogue of the multivariate directional-tail story: center and scale are both active, and the final probability is driven by the standardized shift magnitude after whitening.

Incorporating the package example (Ex_directional_tail.R)

The package example already demonstrates the intended workflow and interpretation. In particular, it shows how to call directional_tail() on a fitted object, how to read mahalanobis_shift and p_directional, and how the same quantities are surfaced in the summary output.

A compact usage sketch from that example is:

example("directional_tail", package = "glmbayes")

# Core objects used in interpretation:
dt <- directional_tail(fit)
dt$mahalanobis_shift
dt$p_directional

# Same quantities are also surfaced in summary output:
s <- summary(fit)
s$dir_tail
s$dir_tail_null

Illustrative plots from Ex_directional_tail.R

The example script also includes two useful diagnostic visualizations:

  1. Whitened space plot: tail vs non-tail draws, directional boundary, and shift radius.
  2. Raw coefficient space plot: tail vs non-tail draws with prior and posterior centers.

dt <- directional_tail(fit)

Z     <- dt$draws$Z
flag  <- dt$draws$is_tail
delta <- dt$delta
w     <- delta

plot(
  Z,
  col = ifelse(flag, "red", "blue"),
  pch = 19,
  xlab = "Z1",
  ylab = "Z2",
  main = "Directional Tail Diagnostic (Whitened Space)"
)

# Decision boundary orthogonal to direction vector
abline(a = 0, b = -w[1] / w[2], col = "darkgreen", lty = 2)

# Radius corresponding to Mahalanobis shift
r <- sqrt(sum(delta^2))
symbols(
  delta[1], delta[2],
  circles = r,
  inches  = FALSE,
  add     = TRUE,
  lwd     = 2,
  fg      = "gray"
)

points(0, 0, pch = 4, col = "black", lwd = 2)          # reference center in whitened space
points(delta[1], delta[2], pch = 3, col = "purple", lwd = 2) # posterior shift

B       <- dt$draws$B
flag    <- dt$draws$is_tail
mu0     <- as.numeric(fit$Prior$mean)
mu_post <- colMeans(B)

plot(
  B,
  col  = ifelse(flag, "red", "blue"),
  pch  = 19,
  xlab = "Coefficient 1",
  ylab = "Coefficient 2",
  main = "Directional Tail Diagnostic (Raw Coefficient Space)"
)

points(mu0[1], mu0[2], pch = 4, col = "black", cex = 1.5)       # reference center
points(mu_post[1], mu_post[2], pch = 3, col = "darkgreen", cex = 1.5)  # posterior center

legend(
  "topright",
  legend = c("Tail draws", "Non-tail draws", "Reference", "Posterior"),
  col    = c("red", "blue", "black", "darkgreen"),
  pch    = c(19, 19, 4, 3),
  bty    = "n"
)

How directional_tail() is called

The function is exported and can be called directly on a fitted object (glmb or lmb):

# counts, outcome, and treatment are data vectors, and mu and V a prior mean
# vector and covariance matrix, assumed to be defined in the calling environment.
fit <- glmb(
  counts ~ outcome + treatment,
  family = poisson(),
  pfamily = dNormal(mu = mu, Sigma = V)
)

dt_prior <- directional_tail(fit)      # reference = prior mean
dt_prior
print(dt_prior)

To compare against a user-specified reference (for example a null/intercept-only mode), pass mu0 explicitly:

mu0 <- rep(0, length(fit$Prior$mean))
names(mu0) <- names(fit$Prior$mean)
mu0["(Intercept)"] <- coef(glm(counts ~ 1, family = poisson(), data = fit$data))[1]

dt_null <- directional_tail(fit, mu0 = mu0)
dt_null

How summary.glmb() uses directional tail

summary.glmb() computes two directional diagnostics internally:

  1. directional_tail(object) : posterior vs prior mean
  2. directional_tail(object, mu0 = null_est_full) : posterior vs null reference

These are returned in the summary object:

s <- summary(fit)

s$dir_tail        # vs prior
s$dir_tail_null   # vs null

The print method displays a compact “Directional Tail Summaries” table with the Mahalanobis shift and directional tail probability for each reference (the prior mean and the null reference).

Minimal reproducible workflow

The package example inst/examples/Ex_directional_tail.R provides a full reproducible script. A minimal call pattern is:

# dat2: a data frame with columns weight and group
# (see inst/examples/Ex_directional_tail.R for the full reproducible script)
ps  <- Prior_Setup(weight ~ group, family = gaussian(), data = dat2)
fit <- lmb(
  weight ~ group,
  dNormal(ps$mu, ps$Sigma, dispersion = ps$dispersion),
  data = dat2,
  n = 10000
)

dt <- directional_tail(fit)
print(dt)

Supplementary notes: why the Bayesian tail can be smaller

There are three distinct, concrete reasons why the Bayesian posterior tail probability for a coefficient can be smaller than the classical t tail.

1.1 Different effective degrees of freedom (\(\nu\))

In the conjugate parameterization above, the posterior degrees of freedom are \(\nu = 2\alpha_{0} + n\). If the prior has a positive \(\alpha_{0}\) (an informative prior on the variance), the posterior \(\nu\) will be larger than the classical \(n - p\). Increasing \(\nu\) makes the Student-t closer to normal (lighter tails), so the tail probability at the same \(t\) value is smaller.

  • Even for vaguely informative priors, the prior contributes “pseudo-observations” to the posterior degrees of freedom; this is often the main cause of smaller Bayesian tails (see the sketch below).
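A quick illustration of the degrees-of-freedom effect with made-up values: the same \(|t|\) produces a smaller two-sided tail as the degrees of freedom grow toward the normal limit.

t_val <- 2.1
sapply(c(classical_df = 10, posterior_df = 25, normal_limit = Inf),
       function(df) 2 * pt(-abs(t_val), df = df))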

1.2 Different standard error (scale) used in the denominator

The Bayesian marginal variance for \(\beta_{j}\) is \(\frac{\beta_{n}}{\alpha_{n}} V_{n}[j,j]\).

Two effects happen here:
(a) \(V_{n}\) generally shrinks relative to \((X^{T}X)^{-1}\) because of the prior precision \(V_{0}^{-1}\), and
(b) the posterior scale factor \(\frac{\beta_{n}}{\alpha_{n}}\) can be smaller than the classical estimate \(\hat{s}^{2}\), depending on prior hyperparameters and the data (e.g., when the prior favors a smaller variance).

Either effect reduces the denominator and can increase the raw \(t\) value, but combined with a larger \(\nu\) the net tail probability can still be smaller.

So one must compare both the numerator (posterior mean \(m_{n,j}\) vs. the OLS estimate \(\hat{\beta}_{j}\)) and the denominator (posterior scale vs. \(\hat{s}\sqrt{(X^{T}X)^{-1}_{jj}}\)). Prior shrinkage often makes \(m_{n,j}\) closer to zero and \(V_{n}[j,j]\) smaller, both of which reduce evidence against \(H_{0}\).

1.3 Different centering (posterior mean vs. OLS) and prior shrinkage

For Bayesian linear models with normal-gamma priors, the standard errors used in the t-statistic calculations appear to need adjustment relative to the classical estimates. In particular, the adjusted standard error is:

\[ \text{Std\_err}_t = \text{Std\_err} \cdot \sqrt{ \frac{n - p}{n + n_{\text{prior}}} } \]

which leads to:

\[ t_{\text{adj},i} = \frac{n}{n + n_{\text{prior}}} \, t_{\text{unadjusted},i} \, \sqrt{\frac{n + n_{\text{prior}}}{n - p}}. \]

where the first ratio captures the shrinkage effect (which lowers the t-value) and the second factor captures the impact of the adjustment to the standard error (which boosts the t-value).

Combining the two factors gives

\[ t_{\text{adj},i} = \sqrt{\frac{n}{n + n_{\text{prior}}}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i}. \]

Writing \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1-\text{pwt}}\), so that

\[ n + n_{\text{prior}} = n + n\left(\frac{\text{pwt}}{1 - \text{pwt}}\right) = \frac{n(1 - \text{pwt}) + n\,\text{pwt}}{1 - \text{pwt}} = \frac{n}{1 - \text{pwt}}, \]

the adjusted statistic simplifies to

\[ t_{\text{adj},i} = \sqrt{1-\text{pwt}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i}. \]

It can be seen from the above that the t-statistic in the Bayesian context is larger than the unadjusted t-statistic (i.e., inflated) whenever

\[ n_{\text{prior}} < p \cdot \frac{n}{n - p}, \]

which implies tail probabilities smaller than their classical counterparts.
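A small numeric check of this threshold (illustrative values; adjustment() is a throwaway helper, not a package function):

adjustment <- function(n, p, n_prior) sqrt(n / (n + n_prior)) * sqrt(n / (n - p))
n <- 50; p <- 4
threshold <- p * n / (n - p)                  # about 4.35 for these values
c(below_threshold = adjustment(n, p, n_prior = 3),          # > 1: inflated t
  at_threshold    = adjustment(n, p, n_prior = threshold),  # exactly 1
  above_threshold = adjustment(n, p, n_prior = 10))         # < 1: attenuated t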

If one instead looks at the standard-error adjustment in isolation (ignoring the centering/shrinkage factor), the scale effect alone corresponds to

\[ t_{\text{adj},i} = t_{\text{unadjusted},i} \cdot \sqrt{\frac{n + n_{\text{prior}}}{n - p}}. \]


References

Bernardo, José M., and Adrian F. M. Smith. 1994. Bayesian Theory. Wiley.
Evans, Michael, and Hadas Moshonov. 2006. “Checking for Prior-Data Conflict.” Bayesian Analysis 1 (4): 893–914. https://doi.org/10.1214/06-BA129.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.