Chapter A04: Directional Tail Diagnostics for Prior-Posterior Disagreement

Kjell Nygren

2026-04-30

Directional tail diagnostic (current implementation)

This appendix documents the current directional_tail() diagnostic, how to call it directly, and how it is used inside summary.glmb().

The diagnostic measures prior–posterior disagreement in a multivariate direction (Evans and Moshonov 2006) rather than coefficient-by-coefficient only. It reports the standardized (Mahalanobis) shift of the posterior mean away from a reference vector, together with a directional tail probability that quantifies how much posterior mass lies against that shift.

Theoretical interpretation and relation to t and F

Let \(\mu_0\) denote the reference vector (prior mean by default, or a null/reference mode), and let \(\beta^{(m)}\) be posterior draws. directional_tail() computes a whitening map from posterior precision so that

\[ Z^{(m)} = W\left(\beta^{(m)} - \mu_0\right), \]

with \(W^\top W = \text{Prec}_{post}\). In this scale, Euclidean distance corresponds to Mahalanobis distance in the original coefficient space.
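As a rough illustration of the whitening idea (a base-R sketch, not necessarily the package's internal computation), \(W\) can be taken as a Cholesky factor of the posterior precision, so that the squared Euclidean norm of \(Z\) reproduces the Mahalanobis distance:

Sigma_post <- matrix(c(0.50, 0.15, 0.15, 0.30), 2, 2)  # hypothetical posterior covariance
Prec_post  <- solve(Sigma_post)
W          <- chol(Prec_post)                          # upper-triangular W with W'W = Prec_post
mu0        <- c(0, 0)                                  # reference vector
beta       <- c(0.9, -0.4)                             # one posterior draw

Z <- W %*% (beta - mu0)                                # whitened draw
c(euclidean_whitened = sum(Z^2),
  mahalanobis_raw    = mahalanobis(beta, mu0, Sigma_post))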

Define

\[ \delta = E[Z \mid y], \qquad d = \|\delta\|_2. \]

The function returns \(d\) as mahalanobis_shift, and the directional-tail probability

\[ p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y\right), \]

estimated by posterior draws.

Intuition: \(\delta\) is the posterior mean shift away from the reference in whitened space. The event \(\delta^\top Z \le 0\) is the half-space opposite the direction of disagreement, so \(p_{\text{dir}}\) quantifies how much posterior mass lies “against” that shift.

Under a Gaussian approximation in whitened space, \[ Z \mid y \approx N(\delta, I_p), \] so with \(u = \delta/\|\delta\|\), \[ u^\top Z \sim N(d, 1), \qquad p_{\text{dir}} = P(u^\top Z \le 0) = \Phi(-d). \] The directional tail is therefore a one-dimensional tail probability driven entirely by the standardized distance \(d\).
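A minimal Monte Carlo check of this approximation in base R (the dimension and shift vector below are arbitrary illustrations):

# Under Z | y ~ N(delta, I_p), the directional tail P(delta' Z <= 0 | y)
# should match pnorm(-d) with d = ||delta||.
set.seed(1)
p     <- 3
delta <- c(0.8, -0.5, 0.3)                   # hypothetical whitened shift
d     <- sqrt(sum(delta^2))

Z <- matrix(rnorm(10000 * p), ncol = p)      # draws from N(0, I_p)
Z <- sweep(Z, 2, delta, "+")                 # shift to N(delta, I_p)

c(monte_carlo = mean(Z %*% delta <= 0),      # draw-based directional tail
  gaussian    = pnorm(-d))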

Relation to univariate t statistics

For \(p=1\), whitening reduces to scalar standardization and \[ d = \frac{\mu_{post} - \mu_0}{\sigma_{post}}, \] so \(d\) is the posterior z/t-type contrast against \(\mu_0\) (in the scalar case \(d\) carries the sign of the shift; the multivariate \(d = \|\delta\|\) corresponds to \(|d|\)). In this case:

  • two-sided t/z-style tail: \(2\Phi(-|d|)\) (or Student-t analogue with df)
  • directional tail used here: \(\Phi(-|d|)\) when the direction is set by \(\delta\)

Thus directional_tail() plays the same role as a one-sided standardized test in the univariate case, while still extending naturally to multivariate models.
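A scalar sketch in base R, using made-up posterior draws, illustrating this reduction:

set.seed(2)
mu0  <- 0
beta <- rnorm(10000, mean = 1.2, sd = 0.6)   # stand-in posterior draws
d    <- (mean(beta) - mu0) / sd(beta)        # standardized posterior contrast

# Opposite-side posterior mass along the direction of the shift
p_dir <- mean(sign(mean(beta) - mu0) * (beta - mu0) <= 0)
c(draw_based = p_dir, gaussian = pnorm(-abs(d)), two_sided = 2 * pnorm(-abs(d)))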

Relation to multivariate Wald/F-style tests

The squared shift \[ d^2 = \delta^\top \delta \] is the whitened quadratic distance from the reference. This is the same geometric core as multivariate Wald statistics (and, after scaling, Hotelling/F forms): larger \(d^2\) means stronger global departure from the reference.

Difference in emphasis:

  • Wald/Hotelling/F use radial evidence (any direction).
  • directional_tail() uses directional half-space evidence along \(\delta\).

Under the Gaussian approximation, both are monotone in \(d\): as \(d\) increases, quadratic-test evidence strengthens and \(p_{\text{dir}}=\Phi(-d)\) decreases.

In this sense, directional_tail() is a directional companion to multivariate Wald/F-style testing: it retains the same Mahalanobis geometry but reports evidence on an interpretable one-sided directional probability scale.
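A small base-R illustration, under the same Gaussian approximation, of how the radial (chi-squared) tail and the directional tail both decrease as \(d\) grows; the dimension is an arbitrary example:

p <- 3
d <- seq(0.5, 4, by = 0.5)
data.frame(
  d             = d,
  wald_tail     = pchisq(d^2, df = p, lower.tail = FALSE),  # radial evidence
  p_directional = pnorm(-d)                                 # directional evidence
)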

Interpretation: what probability is being assessed?

This is the key conceptual difference from classical p-values (Bernardo and Smith 1994).

In a classical t-test, the p-value is a sampling probability under the null: \[ P_{H_0}\!\left(T(Y^{rep}) \ge T(y_{obs})\right), \] that is, how extreme the observed statistic would be in repeated samples if \(H_0\) were true.

By contrast, the directional-tail quantity here is a posterior probability conditional on the observed data: \[ p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y,\ \text{model},\ \text{prior},\ \mu_0\right). \]

So the probability being assessed is not a prior probability and not a frequentist repeated-sampling probability. It is posterior mass, after updating by the data.

Practical interpretation: a small \(p_{\text{dir}}\) means that, given the observed data, model, and prior, little posterior mass lies on the opposite side of the reference along the direction of the posterior shift; it does not control a repeated-sampling error rate.

For summary.glmb(), this means the reported directional tails (against the prior mean and against the null reference) are posterior statements about the fitted model, not classical p-values, even though the two are often numerically comparable.

Univariate connection to known Bayesian tail/sign summaries

In the one-parameter case, directional tail reduces to a one-sided posterior tail area relative to \(\mu_0\):

  • if posterior shift is positive: \(p_{\text{dir}} = P(\beta \le \mu_0 \mid y)\)
  • if posterior shift is negative: \(p_{\text{dir}} = P(\beta \ge \mu_0 \mid y)\)

This is closely related to well-known Bayesian sign/tail summaries (including posterior sign probability / “probability of direction” style summaries in applied Bayesian reporting (Gelman et al. 2013)). In that language, for a scalar parameter, p_directional is essentially the opposite-side posterior mass, and the sign probability is approximately \(1 - p_{\text{dir}}\).
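For a scalar parameter this relationship can be checked directly from draws; the snippet below uses simulated stand-in draws rather than a glmbayes fit:

set.seed(3)
beta  <- rnorm(5000, mean = 0.7, sd = 0.5)   # hypothetical posterior draws
mu0   <- 0
p_dir <- if (mean(beta) > mu0) mean(beta <= mu0) else mean(beta >= mu0)
c(p_directional = p_dir, sign_probability = 1 - p_dir)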

The main contribution of directional_tail() is the multivariate extension: it uses whitening plus projection onto the posterior shift direction to provide a single coherent directional-tail probability even when coefficients are correlated and inference is genuinely multivariate.

How magnitude compares to classical tests

Because the targets differ (posterior mass vs repeated-sampling extremeness), p_directional is not numerically identical to a classical p-value. Still, the two are often comparable through the standardized shift magnitude \(d\).

In the scalar Gaussian approximation: \[ p_{\text{dir}} \approx \Phi(-|d|), \qquad p_{\text{post,2s}} \approx 2\Phi(-|d|), \] while a classical two-sided p-value is typically \[ p_{\text{classical,2s}} \approx 2\{1-F_t(|t|;\nu)\} \] (or \(2\Phi(-|z|)\) asymptotically).
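A quick base-R comparison of the three quantities at an illustrative shift and degrees of freedom (numbers chosen arbitrarily):

d  <- 2.1    # standardized shift
nu <- 25     # residual degrees of freedom for the classical t
c(p_directional    = pnorm(-abs(d)),
  posterior_2sided = 2 * pnorm(-abs(d)),
  classical_2sided = 2 * pt(-abs(d), df = nu))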

Strong-prior regime

When the prior is strong relative to the likelihood, two effects dominate:

  1. Centering effect: posterior mean is pulled toward the prior/reference.
  2. Precision effect: posterior covariance contracts along prior-informed directions.

Consequences for p_directional vs classical p-values (a small conjugate sketch follows this list):

  • If prior and null/reference are aligned with each other (and with little data contradiction), the posterior shift from \(\mu_0\) is usually smaller, so \(d\) is smaller and p_directional tends to be larger (weaker apparent evidence against the reference) than classical p-values.
  • If the prior is strong but centered away from the classical null in the same direction as the data, \(d\) can increase and p_directional can be smaller (stronger evidence in that direction) than classical tails.
  • In multivariate settings, this divergence is often anisotropic: prior strength can increase/decrease evidence differently across directions, and directional_tail() summarizes the net directional result along \(\delta\).
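The conjugate normal-mean model (known variance) gives a compact sketch of both effects; all numbers are illustrative and the helper post_d() is a throwaway function, not part of the package:

# Standardized posterior shift d for a normal mean with known sigma and a
# normal prior; prec_post is the posterior precision, mu_post the posterior mean.
post_d <- function(ybar, n, sigma, mu_prior, tau_prior, mu0) {
  prec_post <- n / sigma^2 + 1 / tau_prior^2
  mu_post   <- (n * ybar / sigma^2 + mu_prior / tau_prior^2) / prec_post
  (mu_post - mu0) * sqrt(prec_post)
}
ybar <- 0.5; n <- 30; sigma <- 1; mu0 <- 0
c(classical_z        = (ybar - mu0) / (sigma / sqrt(n)),
  prior_at_reference = post_d(ybar, n, sigma, mu_prior = 0,   tau_prior = 0.1, mu0 = mu0),
  prior_beyond_data  = post_d(ybar, n, sigma, mu_prior = 1.5, tau_prior = 0.1, mu0 = mu0))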

Weak-prior (data-dominant) limit

As prior precision becomes small relative to information in the data:

  • posterior mean approaches the MLE (or quasi-MLE),
  • posterior covariance approaches the inverse observed information,
  • \(d\) approaches the usual standardized likelihood-based contrast.

Hence in the univariate case, p_directional approaches a one-sided likelihood-based tail measure, and \(2\,p_{\text{dir}}\) approaches the familiar two-sided z/t-style tail (asymptotically).
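A minimal base-R check of the scalar correspondence, using an ordinary lm() fit as the data-dominant benchmark (this only illustrates the arithmetic relationship between one-sided t tails and the classical two-sided p-value; it does not call glmbayes):

fit_ols <- lm(dist ~ speed, data = cars)
tval    <- summary(fit_ols)$coefficients["speed", "t value"]
df      <- fit_ols$df.residual
p_dir   <- pt(-abs(tval), df = df)            # one-sided, posterior-style tail
c(two_times_p_dir  = 2 * p_dir,
  classical_2sided = summary(fit_ols)$coefficients["speed", "Pr(>|t|)"])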

In the multivariate case, \(d^2\) approaches the corresponding Wald quadratic form (and its Hotelling/F scaling in finite samples), so Bayesian directional and classical quadratic tests tend to agree more closely in large-sample/weak-prior regimes.

Scalar decomposition (normal-gamma heuristic)

For one coefficient, the posterior-vs-classical comparison can be decomposed into a center effect and a scale effect. Write \[ t_{\text{post}} = \frac{\hat\beta_{\text{post}}-\mu_0}{\text{SE}_{\text{post}}}, \qquad t_{\text{class}} = \frac{\hat\beta_{\text{MLE}}-\mu_0}{\text{SE}_{\text{class}}}. \] Then \[ t_{\text{post}} = t_{\text{class}} \times \underbrace{\frac{\hat\beta_{\text{post}}-\mu_0}{\hat\beta_{\text{MLE}}-\mu_0}}_{\text{center/shrinkage factor } \kappa} \times \underbrace{\frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}}}_{\text{scale factor}}. \]

In conjugate normal-gamma style parameterizations, a common heuristic is \[ \text{SE}_{\text{post}} \approx \text{SE}_{\text{class}} \sqrt{\frac{n-p}{n+n_{\text{prior}}}}, \] so \[ \frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}} \approx \sqrt{\frac{n+n_{\text{prior}}}{n-p}}. \] If one also uses a simple shrinkage approximation \(\kappa \approx \frac{n}{n+n_{\text{prior}}}\), then \[ t_{\text{post}} \approx t_{\text{class}} \cdot \frac{n}{n+n_{\text{prior}}} \cdot \sqrt{\frac{n+n_{\text{prior}}}{n-p}}. \]

Using \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1-\text{pwt}}\) (so that pwt is the prior's share of the total effective sample size, \(n_{\text{prior}}/(n+n_{\text{prior}})\)), this can be rewritten as \[ t_{\text{post}} \approx t_{\text{class}} \cdot \sqrt{1-\text{pwt}} \cdot \sqrt{\frac{n}{n-p}}. \]
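A small helper evaluating this heuristic (t_post_heuristic and its arguments are illustrative names, not package functions, and the simple shrinkage factor is the approximation stated above):

t_post_heuristic <- function(t_class, n, p, pwt) {
  n_prior <- n * pwt / (1 - pwt)             # prior pseudo-sample size
  kappa   <- n / (n + n_prior)               # center/shrinkage factor
  scale   <- sqrt((n + n_prior) / (n - p))   # SE_class / SE_post
  t_class * kappa * scale                    # equals t_class * sqrt(1 - pwt) * sqrt(n / (n - p))
}
t_post_heuristic(t_class = 2.5, n = 100, p = 3, pwt = 0.3)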

This expression shows why posterior directional tails can be either larger or smaller than classical tails:

  • centering/shrinkage (\(\kappa\)) tends to pull \(t_{\text{post}}\) toward 0,
  • scale correction can push \(|t_{\text{post}}|\) upward,
  • the net result depends on their product.

A useful threshold from this heuristic: the net factor \(\kappa \cdot \sqrt{\frac{n+n_{\text{prior}}}{n-p}}\) exceeds 1 exactly when \[ n_{\text{prior}} < p\,\frac{n}{n-p}, \] which is approximately \(n_{\text{prior}} < p\) when \(n \gg p\). Once the prior pseudo-sample size is large relative to the model dimension, center shrinkage outweighs the standard-error correction and the posterior contrast is attenuated.

This is the scalar analogue of the multivariate directional-tail story: center and scale are both active, and the final probability is driven by the standardized shift magnitude after whitening.

Incorporating the package example (Ex_directional_tail.R)

The package example already demonstrates the intended workflow and interpretation. In particular, it shows how to call directional_tail() on a fitted object, how to read mahalanobis_shift and p_directional, and how the same quantities are surfaced in the summary output.

A compact usage sketch from that example is:

example("directional_tail", package = "glmbayes")

# Core objects used in interpretation:
dt <- directional_tail(fit)
dt$mahalanobis_shift
dt$p_directional

# Same quantities are also surfaced in summary output:
s <- summary(fit)
s$dir_tail
s$dir_tail_null

Illustrative plots from Ex_directional_tail.R

The example script also includes two useful diagnostic visualizations:

  1. Whitened space plot: tail vs non-tail draws, directional boundary, and shift radius.
  2. Raw coefficient space plot: tail vs non-tail draws with prior and posterior centers.

dt <- directional_tail(fit)

Z     <- dt$draws$Z
flag  <- dt$draws$is_tail
delta <- dt$delta
w     <- delta

plot(
  Z,
  col = ifelse(flag, "red", "blue"),
  pch = 19,
  xlab = "Z1",
  ylab = "Z2",
  main = "Directional Tail Diagnostic (Whitened Space)"
)

# Decision boundary orthogonal to direction vector
abline(a = 0, b = -w[1] / w[2], col = "darkgreen", lty = 2)

# Radius corresponding to Mahalanobis shift
r <- sqrt(sum(delta^2))
symbols(
  delta[1], delta[2],
  circles = r,
  inches  = FALSE,
  add     = TRUE,
  lwd     = 2,
  fg      = "gray"
)

points(0, 0, pch = 4, col = "black", lwd = 2)          # reference center in whitened space
points(delta[1], delta[2], pch = 3, col = "purple", lwd = 2) # posterior shift

B       <- dt$draws$B
flag    <- dt$draws$is_tail
mu0     <- as.numeric(fit$Prior$mean)
mu_post <- colMeans(B)

plot(
  B,
  col  = ifelse(flag, "red", "blue"),
  pch  = 19,
  xlab = "Coefficient 1",
  ylab = "Coefficient 2",
  main = "Directional Tail Diagnostic (Raw Coefficient Space)"
)

points(mu0[1], mu0[2], pch = 4, col = "black", cex = 1.5)       # reference center
points(mu_post[1], mu_post[2], pch = 3, col = "darkgreen", cex = 1.5)  # posterior center

legend(
  "topright",
  legend = c("Tail draws", "Non-tail draws", "Reference", "Posterior"),
  col    = c("red", "blue", "black", "darkgreen"),
  pch    = c(19, 19, 4, 3),
  bty    = "n"
)

How directional_tail() is called

The function is exported and can be called directly on a fitted object (glmb or lmb):

# counts, outcome, and treatment are data vectors, and mu and V a prior mean
# vector and covariance matrix, assumed to be defined in the calling environment.
fit <- glmb(
  counts ~ outcome + treatment,
  family = poisson(),
  pfamily = dNormal(mu = mu, Sigma = V)
)

dt_prior <- directional_tail(fit)      # reference = prior mean
dt_prior
print(dt_prior)

To compare against a user-specified reference (for example a null/intercept-only mode), pass mu0 explicitly:

mu0 <- rep(0, length(fit$Prior$mean))
names(mu0) <- names(fit$Prior$mean)
mu0["(Intercept)"] <- coef(glm(counts ~ 1, family = poisson(), data = fit$data))[1]

dt_null <- directional_tail(fit, mu0 = mu0)
dt_null

How summary.glmb() uses directional tail

summary.glmb() computes two directional diagnostics internally:

  1. directional_tail(object) : posterior vs prior mean
  2. directional_tail(object, mu0 = null_est_full) : posterior vs null reference

These are returned in the summary object:

s <- summary(fit)

s$dir_tail        # vs prior
s$dir_tail_null   # vs null

The print method displays a compact “Directional Tail Summaries” table with the Mahalanobis shift and directional tail probability for each reference (the prior mean and the null reference).

Minimal reproducible workflow

The package example inst/examples/Ex_directional_tail.R provides a full reproducible script. A minimal call pattern is:

# dat2: a data frame with columns weight and group
# (see inst/examples/Ex_directional_tail.R for the full reproducible script)
ps  <- Prior_Setup(weight ~ group, family = gaussian(), data = dat2)
fit <- lmb(
  weight ~ group,
  dNormal(ps$mu, ps$Sigma, dispersion = ps$dispersion),
  data = dat2,
  n = 10000
)

dt <- directional_tail(fit)
print(dt)

Supplementary notes: why the Bayesian tail can be smaller

There are three distinct, concrete reasons why the Bayesian posterior tail probability for a coefficient can be smaller than the classical t tail.

1.1 Different effective degrees of freedom (\(\nu\))

In the conjugate parameterization above, the posterior degrees of freedom are \(\nu = 2\alpha_{0} + n\). If the prior has a positive \(\alpha_{0}\) (an informative prior on the variance), the posterior \(\nu\) will be larger than the classical \(n - p\). Increasing \(\nu\) makes the Student-t closer to normal (lighter tails), so the tail probability at the same \(t\) value is smaller.

  • Even for vaguely informative priors, the prior contributes “pseudo-observations” to the posterior degrees of freedom; this is often the main cause of smaller Bayesian tails (see the sketch below).
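A quick illustration of the degrees-of-freedom effect with made-up values: the same \(|t|\) produces a smaller two-sided tail as the degrees of freedom grow toward the normal limit.

t_val <- 2.1
sapply(c(classical_df = 10, posterior_df = 25, normal_limit = Inf),
       function(df) 2 * pt(-abs(t_val), df = df))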

1.2 Different standard error (scale) used in the denominator

The Bayesian marginal variance for \(\beta_{j}\) is \(\frac{\beta_{n}}{\alpha_{n}} V_{n}[j,j]\).

Two effects happen here:
(a) \(V_{n}\) generally shrinks relative to \((X^{T}X)^{-1}\) because of the prior precision \(V_{0}^{-1}\), and
(b) the posterior scale factor \(\frac{\beta_{n}}{\alpha_{n}}\) can be smaller than the classical estimate \(\hat{s}^{2}\), depending on prior hyperparameters and the data (e.g., when the prior favors a smaller variance).

Either effect reduces the denominator and can increase the raw \(t\) value, but combined with a larger \(\nu\) the net tail probability can still be smaller.

So one must compare both the numerator (posterior mean \(m_{n,j}\) vs. the OLS estimate \(\hat{\beta}_{j}\)) and the denominator (posterior scale vs. \(\hat{s}\sqrt{(X^{T}X)^{-1}_{jj}}\)). Prior shrinkage often makes \(m_{n,j}\) closer to zero and \(V_{n}[j,j]\) smaller, both of which reduce evidence against \(H_{0}\).

1.3 Different centering (posterior mean vs. OLS) and prior shrinkage

For Bayesian linear models with normal-gamma priors, the standard errors used in the t-statistic calculations appear to need adjustment relative to the classical estimates. In particular, the adjusted standard error is:

\[ \text{Std\_err}_t = \text{Std\_err} \cdot \sqrt{ \frac{n - p}{n + n_{\text{prior}}} } \]

which leads to:

\[ t_{\text{adj},i} = \frac{n}{n + n_{\text{prior}}} \, t_{\text{unadjusted},i} \, \sqrt{\frac{n + n_{\text{prior}}}{n - p}}. \]

where the first ratio captures the shrinkage effect (which lowers the t-value) and the second factor captures the impact of the adjustment to the standard error (which boosts the t-value).

Combining the two factors gives

\[ t_{\text{adj},i} = \sqrt{\frac{n}{n + n_{\text{prior}}}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i}. \]

Writing \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1-\text{pwt}}\), so that

\[ n + n_{\text{prior}} = n + n\left(\frac{\text{pwt}}{1 - \text{pwt}}\right) = \frac{n(1 - \text{pwt}) + n\,\text{pwt}}{1 - \text{pwt}} = \frac{n}{1 - \text{pwt}}, \]

the adjusted statistic simplifies to

\[ t_{\text{adj},i} = \sqrt{1-\text{pwt}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i}. \]

It can be seen from the above that the t-statistic in the Bayesian context is larger than the unadjusted t-statistic (i.e., inflated) whenever

\[ n_{\text{prior}} < p \cdot \frac{n}{n - p}, \]

which implies tail probabilities smaller than their classical counterparts.
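A small numeric check of this threshold (illustrative values; adjustment() is a throwaway helper, not a package function):

adjustment <- function(n, p, n_prior) sqrt(n / (n + n_prior)) * sqrt(n / (n - p))
n <- 50; p <- 4
threshold <- p * n / (n - p)                  # about 4.35 for these values
c(below_threshold = adjustment(n, p, n_prior = 3),          # > 1: inflated t
  at_threshold    = adjustment(n, p, n_prior = threshold),  # exactly 1
  above_threshold = adjustment(n, p, n_prior = 10))         # < 1: attenuated t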

If one instead looks at the standard-error adjustment in isolation (ignoring the centering/shrinkage factor), the scale effect alone corresponds to

\[ t_{\text{adj},i} = t_{\text{unadjusted},i} \cdot \sqrt{\frac{n + n_{\text{prior}}}{n - p}}. \]


References

Bernardo, José M., and Adrian F. M. Smith. 1994. Bayesian Theory. Wiley.
Evans, Michael, and Hadas Moshonov. 2006. “Checking for Prior-Data Conflict.” Bayesian Analysis 1 (4): 893–914. https://doi.org/10.1214/06-BA129.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.