---
title: "Chapter 11: Estimating Models with unknown dispersion parameters"
output: rmarkdown::html_vignette
bibliography: REFERENCES.bib
reference-section-title: References
vignette: >
  %\VignetteIndexEntry{Chapter 11: Estimating Models with unknown dispersion parameters}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning= FALSE,
  message=FALSE
)
```

```{r setup}
library(glmbayes)
```
## 1. Introduction

Earlier chapters developed Bayesian linear and generalized linear models under the assumption
that the dispersion parameter, denoted by \( \phi \) (or equivalently the residual variance
\( \sigma^{2} \)), was known.
For Gaussian models this corresponds to fixing the variance of the error term,
and for quasi‑families it corresponds to fixing the scale parameter.

This chapter extends the framework to unknown dispersion parameters---a standard topic in Bayesian linear and generalized linear models [@OHaganForster2004; @Gelman2013]---showing how glmbayes
supports:

(1) Priors on the dispersion only  
(2) Joint conjugate Normal–Gamma priors  
(3) Independent Normal–Gamma priors  
      • iid sampling via truncated Gamma priors  
      • two‑block Gibbs sampling  

Full derivations for Gaussian dispersion, shape, and rate calibration from `Prior_Setup()` / `compute_gaussian_prior()`—and how those objects connect to conjugate and independent Normal--Gamma priors—are given in **Chapter A12** (<https://knygren.r-universe.dev/articles/glmbayes/Chapter-A12.html>).

All Gaussian examples use the **Dobson** plant‑weight data [@Dobson1990], first introduced in Chapter 02.


```{r dobson}
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

```


## 2. Exponential-Family Background and Log-Concavity

Many likelihoods used in generalized linear models belong to the exponential family.  
With observation weights \(w_1,\dots,w_n \ge 0\), the weighted exponential-family density is

\[
f(y \mid \theta,\phi,w)
= \exp\left\{
  \sum_{i=1}^{n}
    w_i \left[
      \frac{y_i \theta_i - b(\theta_i)}{a(\phi)}
      + c(y_i,\phi)
    \right]
\right\}.
\]

Here:

- \( \theta_i \) is the canonical parameter  
- \( b(\theta_i) \) is the cumulant function  
- \( a(\phi) \) is the dispersion function  
- \( c(y_i,\phi) \) collects terms not involving \( \theta_i \)  
- \( w_i \) is the observation weight  

The **weighted log-likelihood** is therefore

\[
\ell(\beta,\phi)
=
\sum_{i=1}^{n}
w_i \left[
\frac{y_i\theta_i - b(\theta_i)}{a(\phi)}
+ c(y_i,\phi)
\right].
\]

This is the form used by both `glm()` and the Bayesian functions  
`glmb()`, `lmb()`, `rglmb()`, and `rlmb()`.


---

## 2.2 Special Case: Weighted Gaussian Linear Regression

Consider the weighted Gaussian model

\[
y_i \mid \beta,\phi \sim N(x_i^\top\beta,\;\phi/w_i),
\qquad i=1,\dots,n.
\]

Let \(W = \mathrm{diag}(w_1,\dots,w_n)\) and define the precision

\[
\tau = \frac{1}{\phi}.
\]

### **Weighted log-likelihood**

The log-likelihood (up to constants) is

\[
\ell(\beta,\phi)
=
-\frac{1}{2\phi}
\sum_{i=1}^{n}
w_i (y_i - x_i^\top\beta)^2
\;-\;
\frac{1}{2}
\sum_{i=1}^{n}
w_i \log(2\pi\phi).
\]

Equivalently, in precision form:

\[
\ell(\beta,\tau)
=
\frac{\tau}{2}
\sum_{i=1}^{n}
w_i (y_i - x_i^\top\beta)^2
\;-\;
\frac{1}{2}
\sum_{i=1}^{n}
w_i \log\left(\frac{2\pi}{\tau}\right).
\]

### **Exponential-family identification**

This matches the weighted exponential-family form with

\[
\theta_i = \mu_i = x_i^\top\beta,
\qquad
b(\theta_i) = \tfrac{1}{2}\theta_i^2,
\qquad
a(\phi) = \phi,
\]

and

\[
c(y_i,\phi)
=
-\frac{1}{2\phi}y_i^2
-\frac{1}{2}\log(2\pi\phi).
\]

### **Weighted least squares representation**

Base R implements weighted least squares via

\[
y^\ast = W^{1/2} y,
\qquad
X^\ast = W^{1/2} X.
\]

The glmbayes functions use the same weighted likelihood, so all theoretical
expressions above apply directly.


## Log-Concavity Properties in the Weighted Gaussian Case

1. **Concavity in \( \beta \)**  
   For fixed \( \phi \), the negative weighted log-likelihood is  
   \[
   -\ell(\beta \mid \phi)
   =
   \frac{1}{2\phi}
   (y - X\beta)^{T} W (y - X\beta)
   + \text{const},
   \]
   which is strictly convex whenever \( W \) has positive diagonal entries.  
   Therefore the weighted log-likelihood is strictly concave in \( \beta \).

2. **Concavity in precision \( \tau = 1/\phi \)**  
   Rewriting the weighted likelihood in terms of \( \tau \):
   \[
   \ell(\beta,\tau)
   =
   -\frac{\tau}{2}
   (y - X\beta)^{T} W (y - X\beta)
   - \frac{1}{2}
   \left( \sum_{i=1}^{n} w_i \right)
   \log \tau
   + \text{const},
   \]
   which is concave in \( \tau \) because the first term is linear in \( \tau \)
   and the second is concave.

These properties guarantee:

- unique posterior modes for both \( \beta \) and \( \tau \)  
- efficient iid sampling under Normal, Normal-Gamma, and Independent Normal-Gamma priors  
- stable two-block Gibbs sampling when the dispersion is updated separately  

The weighted Gaussian model thus provides the cleanest illustration of how
weights enter the exponential-family structure and how they preserve
log-concavity.

### 2.3 Special Case: Gamma Regression With a Log Link

Consider the weighted Gamma regression model with log link:

\[
y_i \mid \beta,\phi \;\sim\; \text{Gamma}\!\left(
\text{shape}=\frac{1}{\phi},\;
\text{scale}=\phi\,\mu_i
\right),
\qquad 
\mu_i = \exp(x_i^\top\beta),
\]

with observation weights \(w_i \ge 0\).  
Let the **precision** be

\[
\tau = \frac{1}{\phi}.
\]

---

#### **Weighted log-likelihood**

Up to an additive constant, the weighted log-likelihood is

\[
\ell(\beta,\tau)
=
\sum_{i=1}^n 
w_i\left[
\tau\log\tau
- \tau\log \mu_i
+ (\tau-1)\log y_i
- \frac{\tau y_i}{\mu_i}
\right].
\]

Using \(\mu_i = \exp(x_i^\top\beta)\), this becomes

\[
\ell(\beta,\tau)
=
\sum_{i=1}^n 
w_i\left[
\tau\log\tau
- \tau x_i^\top\beta
+ (\tau-1)\log y_i
- \tau y_i e^{-x_i^\top\beta}
\right].
\]

---

#### **Log‑concavity**

For fixed \(\tau\):

- the log-likelihood is **concave in \(\beta\)** because  
  \(\mu_i = \exp(x_i^\top\beta)\) enters through a convex function and the Gamma
  log-likelihood is concave in the canonical parameter.

For fixed \(\beta\):

- the log-likelihood is **concave in \(\tau\)** because  
  \(\tau\log\tau\), \((\tau-1)\log y_i\), and \(-\tau y_i/\mu_i\) combine into a
  concave function.

These properties ensure:

- unique posterior modes  
- stable envelope construction to support accept-reject sampling for \(\beta\)  
- valid accept–reject sampling for \(\tau\)  


### Implications for Bayesian GLMs

The Gamma log–link model thus shares the same structural advantages as the
Gaussian and Poisson cases:

* concavity in both \( \beta \) and \( v \),
* unique posterior modes,
* stable envelope construction,
* efficient iid sampling for the dispersion parameter,
* compatibility with two–block Gibbs sampling when desired.

These properties make the Gamma log–link model an especially clean example of
how exponential–family structure supports fast, reliable posterior simulation in
glmbayes.

## 3. Gaussian Model With Unknown Dispersion

For the Gaussian model

\[
y = X\beta + \varepsilon,\qquad
\varepsilon \sim N(0,\phi I_{n}),
\]

the likelihood is

\[
p(y \mid \beta,\phi)
\propto
\phi^{-n/2}\exp\!\left(-\frac{S(\beta)}{2\phi}\right),
\]

where

\[
S(\beta) = (y - X\beta)^{T}(y - X\beta)
\]

is the residual sum of squares.

To support the different prior specification below, we use the Prior_Setup
function to set and extract default settings for the needed parameters.


```{r Prior_Setup,results = "hide"}
ps <- Prior_Setup(weight ~ group)
x  <- ps$x
mu <- ps$mu
V  <- ps$Sigma
y <- ps$y
shape    <- ps$shape
rate     <- ps$rate
```


## 3. Priors for Gaussian Models With Unknown Dispersion

We now specify priors for \((\beta,\tau)\) and derive the corresponding
log-priors and log-posteriors.

Throughout:

- dispersion: \( \phi \)  
- precision: \( \tau = 1/\phi \)  
- weighted residual sum of squares:

\[
\mathrm{RSS}(\beta)
=
\sum_{i=1}^{n}
w_i (y_i - x_i^\top\beta)^2.
\]

---

## 3.1 Prior on the Dispersion Only (Gamma Prior on Precision)

We place a Gamma prior on the precision:

\[
\tau \sim \mathrm{Gamma}(a_0,b_0),
\]

with log-prior

\[
\log p(\tau)
=
(a_0 - 1)\log\tau - b_0\tau + \text{const}.
\]

Given fixed \(\beta\), the conditional posterior is

\[
\tau \mid \beta,y
\sim
\mathrm{Gamma}\left(
a_0 + \frac{n_{w}}{2},
\;
b_0 + \frac{\mathrm{RSS}(\beta)}{2}
\right),
\]

where \(n_{w} = \sum_i w_i\).

---


To illustrate this, we fix \(\beta\) at the full-model coefficient vector from
`Prior_Setup()` (`ps$coefficients`, the internal GLM maximum-likelihood estimate).
We can then generate conditional Bayesian draws for the dispersion as below.
This is essentially the equivalent of the classical function
**gamma.dispersion()** function provided by the **MASS** package.


```{r glm_call_gamma_prior}
out_rlmb <- rlmb(
    n = 1000,
    y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = ps$coefficients)
  )


```


The posterior mean of the dispersion is then:

```{r glm_call_gamma_mean}
  mean(out_rlmb$dispersion)
```

## 3.2 Joint Conjugate Normal–Gamma Prior

The conjugate Normal--Gamma prior couples \(\beta\) and \(\tau\) [@LindleySmith1972; @OHaganForster2004]:

\[
\beta \mid \tau \sim N(\mu_0,\;(\tau\Sigma_0)^{-1}),
\qquad
\tau \sim \mathrm{Gamma}(a_0,b_0).
\]

### **Log-prior**

\[
\log p(\beta,\tau)
=
(a_0 - 1)\log\tau - b_0\tau
-\frac{\tau}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+\text{const}.
\]

### **Log-posterior**

Combining with the Gaussian log-likelihood:

\[
\log p(\beta,\tau \mid y)
=
\frac{\tau}{2}\mathrm{RSS}(\beta)
-\frac{n_{w}}{2}\log\left(\frac{2\pi}{\tau}\right)
+(a_0 - 1)\log\tau - b_0\tau
-\frac{\tau}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+\text{const}.
\]

This yields closed-form conditional posteriors:

- \( \tau \mid y \) is Gamma  
- \( \beta \mid \tau,y \) is Normal  

allowing pure i.i.d. Monte Carlo sampling.

---

The defining property of conjugacy is that the **posterior has the same
posterior functional form**.
In glmbayes, this model can be estimated as below.


```{r Conjugate_prior}
## Conjugate Normal_Gamma Prior

lmb.D9_v2 <- lmb(
  weight ~ group,
  pfamily = dNormal_Gamma(
    ps$mu,
    Sigma_0 = ps$Sigma_0,
    shape = ps$shape,
    rate  = ps$rate
  )
)

summary(lmb.D9_v2)$dispersion

```

## 3.3 Independent Normal–Gamma Prior

Here the prior factorizes:

\[
\beta \sim N(\mu_0,\Sigma_0),
\qquad
\tau \sim \mathrm{Gamma}(a_0,b_0),
\]

so \(\beta\) and \(\tau\) are *a priori independent*.

### **Log-prior**

\[
\log p(\beta,\tau)
=
-\frac{1}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+
(a_0 - 1)\log\tau - b_0\tau
+\text{const}.
\]

### **Log-posterior**

\[
\log p(\beta,\tau \mid y)
=
\frac{\tau}{2}\mathrm{RSS}(\beta)
-\frac{n_{w}}{2}\log\left(\frac{2\pi}{\tau}\right)
-\frac{1}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+
(a_0 - 1)\log\tau - b_0\tau
+\text{const}.
\]

A variation on this prior specification—using a **truncated Gamma prior** for the
dispersion—is the model developed in **Chapter A07**, where:

- the conditional posterior in \( \beta \) is *not conjugate*,  
- the conditional posterior in \( \tau = 1/\phi \) is *not conjugate*, and  
- the joint posterior remains **log‑concave**, enabling  
  the **accept–reject envelope sampler**.

This truncated‑Gamma, non‑conjugate Normal–Gamma model is therefore a direct
extension of the independent Normal–Gamma prior described above, but requires
the more advanced envelope‑based sampling strategy documented in Chapter A07.

### 3.3.1 Independent Normal Gamma Prior in glmbayes

This prior is implemented in glmbayes using the below specification (but where a
truncated gamma replaces the full gamma).

The chunk below is **not evaluated** when the vignette is built.

Precomputed results are read from **`inst/extdata/`** file
**`Chapter11_Dobson_two_sampler.rds`**.

To regenerate that file, run
**`make_Chapter11_Dobson_two_sampler_output.R`** under **`data-raw/`**,
from the package source root.

```{r chapter11-load-two-sampler}
ch11_path <- system.file(
  "extdata", "Chapter11_Dobson_two_sampler.rds",
  package = "glmbayes"
)
stopifnot(nzchar(ch11_path), file.exists(ch11_path))
ch11_saved <- readRDS(ch11_path)
stopifnot(
  ncol(ch11_saved$gibbs_two_block$beta_out) == ncol(ps$x),
  nrow(ch11_saved$indep_norm_gamma$coefficients) == ch11_saved$indep_norm_gamma$n_draws
)
```

```{r Independent_normal_gamma_prior, eval = FALSE}
lmb.D9_v3 <- lmb(n = 10000,
  weight ~ group,
  dIndependent_Normal_Gamma(
    ps$mu,
    ps$Sigma,
    shape = ps$shape_ING,
    rate  = ps$rate,
    max_disp_perc = 0.99,
    disp_lower = NULL,
    disp_upper = NULL
  )
)
summary(lmb.D9_v3)$coefficients
summary(lmb.D9_v3)$dispersion
sd(lmb.D9_v3$dispersion)
```

```{r Independent_normal_gamma_loaded}
inc <- ch11_saved$indep_norm_gamma
coef_summ_iid <- data.frame(
  posterior_mean = colMeans(inc$coefficients),
  posterior_SD = apply(inc$coefficients, 2, sd),
  row.names = inc$coef_colnames
)
coef_summ_iid
cat("Dispersion: posterior mean =", mean(inc$dispersion),
    "  SD =", sd(inc$dispersion), "\n")
```


### 3.3.2 Two-Block Gibbs Sampling using glmbayes

If iid sampling is not desired, or if the user wants to inspect mixing behavior,
a simple two-block Gibbs sampler [@RobertCasella2004] may be used:

1. Sample \( \beta \mid \tau, y \) from a Normal distribution  
2. Sample \( \tau \mid \beta, y \) from a Gamma distribution

The following chunk is **not evaluated** during the vignette build (`eval = FALSE`).
Stored draws are loaded from the same RDS as above.

```{r two_block_Gibbs_sampler, eval = FALSE}
set.seed(180)
dispersion2 <- ps$dispersion

## Burn-in
for (i in 1:1000) {
  out1 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2)
  )
  out2 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1, ])
  )
  dispersion2 <- out2$dispersion
}

## Sampling
beta_out <- matrix(0, nrow = 10000, ncol = 2)
disp_out <- rep(0, 10000)

for (i in 1:10000) {
  out1 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2)
  )
  out2 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1, ])
  )
  beta_out[i, ] <- out1$coefficients[1, 1:2]
  disp_out[i]   <- out2$dispersion
}

colMeans(beta_out)
sd(beta_out[, 1])
sd(beta_out[, 2])
mean(disp_out)
sd(disp_out)
```

```{r two_block_Gibbs_loaded}
gb <- ch11_saved$gibbs_two_block
beta_out <- gb$beta_out
disp_out <- gb$disp_out
colMeans(beta_out)
sd(beta_out[, 1])
sd(beta_out[, 2])
mean(disp_out)
sd(disp_out)
```

### 3.3.3 Comparison of the Two Samplers

Under default settings, both the truncated‑Gamma iid sampler and the two‑block
Gibbs sampler produce nearly identical posterior summaries for the regression
coefficients and the dispersion parameter.

To make this explicit, the table below reports posterior means and standard
deviations from:

* the iid sampler using the independent Normal–Gamma prior (same specification
  as `lmb(..., dIndependent_Normal_Gamma(...), n = 10000)` above), and  
* the two‑block Gibbs sampler constructed from alternating Normal and Gamma updates.

These values illustrate that both samplers target essentially the same posterior
distribution (although the former has a truncated prior), with only minor Monte
Carlo variation.

```{r chapter11-sampler-comparison-table, echo = FALSE}
inc <- ch11_saved$indep_norm_gamma
gb <- ch11_saved$gibbs_two_block
cn <- inc$coef_colnames
iid_m <- colMeans(inc$coefficients)
iid_s <- apply(inc$coefficients, 2, sd)
gib_m <- colMeans(gb$beta_out)
gib_s <- apply(gb$beta_out, 2, sd)
cmp <- data.frame(
  Parameter = c(cn, "Dispersion"),
  iid_Mean = c(iid_m, mean(inc$dispersion)),
  iid_SD = c(iid_s, sd(inc$dispersion)),
  Gibbs_Mean = c(gib_m, mean(gb$disp_out)),
  Gibbs_SD = c(gib_s, sd(gb$disp_out))
)
knitr::kable(
  cmp,
  digits = 4,
  caption = paste(
    "Dobson plant weight: independent Normal-Gamma (iid) vs",
    "two-block Gibbs"
  )
)
```

The difference in the iid and Gibbs SD for the dispersion is largely due to the
difference between a truncated and an unrestricted gamma prior.

However, the iid sampler has two clear advantages:

* **Independent draws**: no autocorrelation and no loss of effective sample size  
* **No tuning or burn‑in**: each draw is a direct sample from the posterior

The Gibbs sampler mixes well for this model class, but it inevitably shows mild
autocorrelation and convergence diagnostics are beyond the scope of this
vignette.

For most users, the iid sampler is easier to work with and safer as no
convergence diagnostics are needed.

For readers who want the full mathematical details of the iid sampler, see
**Chapter A07**, which documents the complete envelope construction and
accept‑reject algorithm for the independent Normal–Gamma prior.



## 4. Gamma Regression Models With Unknown Dispersion

We now specify priors for \((\beta,\tau)\) and derive the corresponding
log‑priors and log‑posteriors, using the **carinsca** data [@BaileySimon1960; @statsciCarinsca; @McCullagh1989].

Throughout:

- dispersion: \(\phi\)  
- precision: \(\tau = 1/\phi\)  
- mean: \(\mu_i = \exp(x_i^\top\beta)\)  
- weighted log-likelihood: \(\ell(\beta,\tau)\) from Section 2.3  




```{r Carinsca}
data(carinsca)
carinsca$Merit <- ordered(carinsca$Merit)
carinsca$Class <- factor(carinsca$Class)
oldopt <- options(contrasts = c("contr.treatment", "contr.treatment"))
Claims=carinsca$Claims
Insured=carinsca$Insured
Merit=carinsca$Merit
Class=carinsca$Class
Cost=carinsca$Cost
Claims_Adj<-Claims/1000

glm.carinsca <- glm(Cost / Claims ~ Merit + Class,
                    family = Gamma(link = "log"),
                    weights = Claims_Adj, x = TRUE)

```



```{r Prior_Setup_gamma,results = "hide"}
ps <- Prior_Setup(
  Cost / Claims ~ Merit + Class,
  family = Gamma(link = "log"),
  weights = Claims_Adj
)
mu=ps$mu
V=ps$Sigma
shape    <- ps$shape
rate     <- ps$rate
x  <- ps$x
y  <- ps$y

m<-nrow(x)
p<-ncol(x)

## Starting dispersion for beta | tau in the two-block Gibbs sampler: same quasi-likelihood
## estimate as used elsewhere for this Carinsca Gamma GLM (not the Dobson Gaussian ps).
dispersion2 <- gamma.dispersion(glm.carinsca)

options(oldopt)

```



---

### 4.1 Prior on the Dispersion Only (Gamma Prior on Precision)

We place a Gamma prior on the precision:

\[
\tau \sim \text{Gamma}(a_0,b_0),
\]

with log‑prior

\[
\log p(\tau)
=
(a_0 - 1)\log\tau - b_0\tau + \text{const}.
\]

Given fixed \(\beta\), the conditional posterior is

\[
\log p(\tau \mid \beta,y)
=
\ell(\beta,\tau)
+ (a_0 - 1)\log\tau
- b_0\tau
+ \text{const}.
\]

This posterior is **log‑concave in \(\tau\)**, enabling envelope‑based
accept–reject sampling in the Gamma regression case.


```{r glm_call_gamma_prior2}
## Carinsca Gamma GLM (already fitted in the Carinsca chunk for gamma.dispersion etc.)
gamma.dispersion(glm.carinsca)

out2 <- rglmb(
    n = 1000, y = y, x = x,
    family  =Gamma(link="log"),
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = ps$coefficients),
weights = Claims_Adj
  )

mean(out2$dispersion)

```




### 4.2 Independent Normal–Gamma Prior

Here the priors factorize:

\[
\beta \sim N(\mu_0,\Sigma_0), 
\qquad
\tau \sim \text{Gamma}(a_0,b_0),
\]

so \(\beta\) and \(\tau\) are **a priori independent**.

---

#### **Log‑prior**

\[
\log p(\beta,\tau)
=
-\tfrac12(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+ (a_0 - 1)\log\tau
- b_0\tau
+ \text{const}.
\]

---

#### **Log‑posterior**

\[
\log p(\beta,\tau \mid y)
=
\ell(\beta,\tau)
-\tfrac12(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0)
+ (a_0 - 1)\log\tau
- b_0\tau
+ \text{const}.
\]

In Gamma regression:

- the conditional posterior in \(\beta\) is **not conjugate**  
- the conditional posterior in \(\tau\) is **not conjugate**  
- the joint posterior remains **log‑concave**

This in theory enables two sampling strategies for the Gamma regression model:

1. **Two‑block Gibbs sampler**  
   - \( \beta \mid \tau, y \): sampled using a Normal approximation (via `rglmb`)  
   - \( \tau \mid \beta, y \): sampled using a Gamma‑based quasi‑likelihood update  
   This is the **only fully implemented method** for Gamma regression in the
   current version of *glmbayes*.

2. **Accept–reject envelope sampler (Chapter A07)**
   Chapter A07 develops an accept–reject sampler using a **truncated Gamma prior**
   for \( \tau \) and a joint envelope over \((\beta,\tau)\).
   **However, this method is implemented only for the Gaussian family.**  
   Extending the envelope‑based sampler to the Gamma regression model would require
   additional mathematical development and corresponding modifications to the code.

The truncated‑Gamma independent Normal–Gamma model in **Chapter A07** therefore
applies exclusively to the Gaussian case, while Gamma regression currently relies
on the two‑block Gibbs sampler described above.

Implementation details: alternate `rglmb` draws for \(\beta \mid \tau, y\) using
`dNormal(..., dispersion = dispersion2)` with draws for \(\tau \mid \beta, y\)
using `dGamma(...)` on the Gamma regression likelihood, updating `dispersion2`
after each \(\tau\) step.
A **1000‑iteration burn‑in** is followed by **1000 stored iterations**
(coefficient matrix `beta_out`, dispersion vector `disp_out`, and accept–reject
counts `iters_out`).
The same logic is written to **`inst/extdata/Chapter11_Carinsca_gamma_gibbs.rds`**
by **`data-raw/make_Chapter11_Carinsca_gamma_gibbs_output.R`**
(**`set.seed(190)`**).

The chunk **`Block_Gibbs_gamma_Regression`** below is **not evaluated** when the
vignette is built (`eval = FALSE`).
Stored draws are loaded from **`inst/extdata/Chapter11_Carinsca_gamma_gibbs.rds`**
in **`chapter11-load-carinsca-gamma`**, and summaries are produced in
**`Block_Gibbs_gamma_loaded`**.

```{r chapter11-load-carinsca-gamma}
ch11_cg_path <- system.file(
  "extdata", "Chapter11_Carinsca_gamma_gibbs.rds",
  package = "glmbayes"
)
stopifnot(nzchar(ch11_cg_path), file.exists(ch11_cg_path))
ch11_cg <- readRDS(ch11_cg_path)
stopifnot(
  ncol(ch11_cg$gibbs_gamma$beta_out) == ncol(ps$x),
  length(ch11_cg$gibbs_gamma$disp_out) == nrow(ch11_cg$gibbs_gamma$beta_out)
)
```

```{r Block_Gibbs_gamma_Regression, eval = FALSE}
set.seed(190)
dispersion2 <- gamma.dispersion(glm.carinsca)

suppressWarnings(
  suppressMessages(
for(i in 1:1000){

  ## --- Block 1: Regression coefficients ---
  out1 <- rglmb(
    n = 1, y = y, x = x,
    family  = Gamma(link="log"),
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2),
    weights = Claims_Adj
  )

  ## --- Block 2: Dispersion (quasi-likelihood sampler) ---
  out2 <- rglmb(
    n = 1, y = y, x = x,
    family  = Gamma(link="log"),
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1,]),
    weights = Claims_Adj
  )

  ## --- SCALE dispersion for the next beta update ---
  ## Convert quasi  (from out2) to MLE for consistency
  dispersion2 <- out2$dispersion ##* ((m - p)/m)

}

)
)

## Run 1000 additional iterations and store output
beta_out <- matrix(0, nrow = 1000, ncol = ncol(x))
disp_out <- rep(0, 1000)
iters_out <- rep(0, 1000)


suppressWarnings(
  suppressMessages(
    for (i in 1:1000) {
      out1 <- rglmb(
        n = 1, y = y, x = x,
        family = Gamma(link="log"),
        pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2),
        weights = Claims_Adj
      )
      out2 <- rglmb(
        n = 1, y = y, x = x,
        family = Gamma(link="log"),
        pfamily = dGamma(shape = shape, rate = rate,
                         beta = out1$coefficients[1,]),
        weights = Claims_Adj
      )
      dispersion2 <- out2$dispersion ##* ((m - p) / m)
      beta_out[i, ] <- out1$coefficients[1, seq_len(ncol(x))]
      disp_out[i] <- out2$dispersion ##* ((m - p) / m)
      iters_out[i]<-out2$iters
    }
  )
)

## Review output


beta_mean  <- colMeans(beta_out)
beta_sd    <- apply(beta_out, 2, sd)
beta_tlike <- beta_mean / beta_sd   # analogous to GLM t-values

bayes_coef_table <- data.frame(
  Estimate = beta_mean,
  Std.Error = beta_sd,
  t.like = beta_tlike
)

bayes_coef_table

mean(disp_out)
mean(iters_out)

```

```{r Block_Gibbs_gamma_loaded}
cg <- ch11_cg$gibbs_gamma
beta_out <- cg$beta_out
disp_out <- cg$disp_out
iters_out <- cg$iters_out

beta_mean  <- colMeans(beta_out)
beta_sd    <- apply(beta_out, 2, sd)
beta_tlike <- beta_mean / beta_sd

bayes_coef_table <- data.frame(
  Estimate = beta_mean,
  Std.Error = beta_sd,
  t.like = beta_tlike
)

bayes_coef_table

mean(disp_out)
mean(iters_out)

```




### 4.3 Conditional Sampling for the Dispersion Parameter

Chapter A06 provides a detailed derivation of the conditional posterior for the
precision parameter \( v = 1/\phi \) in Gamma regression.
The posterior can be written in the form

\[
\ell(v)
= ( \text{shape2} - 1 ) \log v
  - \text{rate1} \, v
  + \text{testfunc}(v)
  + \text{const},
\]

where \( \text{testfunc}(v) \) is a concave remainder term capturing the curvature
contributed by the Gamma likelihood.
This decomposition enables a **tangent–envelope accept–reject sampler**, in which:

1. A Gamma proposal distribution is constructed using the linear part  
   \[
   ( \text{shape2} - 1 ) \log v - \text{rate2} \, v,
   \]
   where \( \text{rate2} = \text{rate1} - g^{*} \) and \( g^{*} \) is the
   derivative of \( \text{testfunc}(v) \) at the anchor point \( v^{*} \).

2. A proposed value \( v_{\text{prop}} \) is accepted with probability  
   \[
   \exp\!\left[
     \text{testfunc}(v_{\text{prop}})
     - \text{testfunc}(v^{*})
     - g^{*}(v_{\text{prop}} - v^{*})
   \right].
   \]

Because \( \text{testfunc}(v) \) is concave, the tangent line at \( v^{*} \)
always lies above it, ensuring a valid rejection sampler.




## 8. Summary

This chapter introduced Bayesian estimation of Gaussian models with unknown dispersion
[@Gelman2013; @OHaganForster2004],
including:

• Gamma priors on the precision \( \tau \)  
• Joint Normal–Gamma priors  
• Independent Normal–Gamma priors  
• iid sampling via envelope methods  
• two‑block Gibbs sampling  

The log‑concavity of the Gaussian likelihood in both \( \beta \) and \( \tau \)
ensures that all samplers are stable, efficient, and theoretically well‑behaved.

## A1: Posterior Distribution Details for Conjugate Normal-Gamma prior

### A1.1 Marginal Posterior for \( \tau \)

Integrating out \( \beta \) yields a Gamma posterior **independent of \( \beta \)**:

\[
\tau \mid y
\sim
\mathrm{Gamma}\!\left(
  a_{0} + \frac{n}{2},\;
  b_{0} + \frac{1}{2} S_{\text{marg}}
\right),
\]

where

\[
S_{\text{marg}}
=
(y - X\mu_{0})^{T}
\left(I + X\Sigma_{0}X^{T}\right)^{-1}
(y - X\mu_{0}).
\]

Here \( \Sigma_{0} \) is the same matrix as in Section 3.2 and in A1.2 (so that
\( \Sigma_{0}^{-1} \) is the prior precision factor in
\( -\frac{\tau}{2}(\beta-\mu_{0})^{\mathsf T}\Sigma_{0}^{-1}(\beta-\mu_{0}) \)).
Let \( \hat\beta = (X^{\mathsf T}X)^{-1} X^{\mathsf T} y \) be the ordinary least
squares estimator and
\( \mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}(y - X\hat\beta) \).
Write \( G = X^{\mathsf T}X \) (invertible in the usual full-rank setup).
The **same** scalar \( S_{\text{marg}} \) can be written as

\[
S_{\text{marg}}
=
\mathrm{RSS}
+
(\hat\beta - \mu_{0})^{\mathsf T}
\left(\Sigma_{0} + G^{-1}\right)^{-1}
(\hat\beta - \mu_{0}).
\]

Equivalently, expanding \( (I + X\Sigma_{0}X^{\mathsf T})^{-1} \) with the
Sherman--Morrison--Woodbury formula gives the same middle matrix in the
precision form
\[
G - G\left(\Sigma_{0}^{-1} + G\right)^{-1}G
=
\left(\Sigma_{0} + G^{-1}\right)^{-1}.
\]
In particular, if \( \mu_{0} = \hat\beta \) then the second term vanishes and
\( S_{\text{marg}} = \mathrm{RSS} \).

**Step-by-step (Woodbury + least-squares split).** Set
\(e = y - X\hat\beta\), \(\delta = \hat\beta - \mu_{0}\), so
\(y - X\mu_{0} = e + X\delta\). Normal equations give \(X^{\mathsf T}e = 0\), hence
\(\|e + X\delta\|^{2} = \mathrm{RSS} + \delta^{\mathsf T}G\delta\) and
\(X^{\mathsf T}(e + X\delta) = G\delta\). Woodbury says
\((I + X\Sigma_{0}X^{\mathsf T})^{-1}
= I - X(\Sigma_{0}^{-1}+G)^{-1}X^{\mathsf T}\).
Substitute \(y-X\mu_{0}=e+X\delta\) into the quadratic form for
\(S_{\text{marg}}\); cross terms vanish because \(e^{\mathsf T}X=0\), and the
algebra reduces to
\(\mathrm{RSS} + \delta^{\mathsf T}\bigl(G - G(\Sigma_{0}^{-1}+G)^{-1}G\bigr)\delta\),
which equals the display above.

For observation weights \( w_{i} \) (Section 2.2), define
\(G = X^{\mathsf T}WX\),
\(\hat\beta = G^{-1} X^{\mathsf T}Wy\), and
\(\mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}W(y - X\hat\beta)\).
The **RSS + \((\hat\beta-\mu_{0})^{\mathsf T}(\Sigma_{0}+G^{-1})^{-1}(\hat\beta-\mu_{0})\)**
identity is unchanged (same proof in whitened coordinates
\(\tilde y = W^{1/2}y\), \(\tilde X = W^{1/2}X\)).
The **first** display becomes
\((y-X\mu_{0})^{\mathsf T}W^{1/2}(I + W^{1/2}X\Sigma_{0}X^{\mathsf T}W^{1/2})^{-1}W^{1/2}(y-X\mu_{0})\)
for any factor \(W^{1/2}\) with \(W^{1/2}(W^{1/2})^{\mathsf T}=W\)
(e.g.\ diagonal \(\sqrt{w_i}\)).

This closed‑form marginal is what makes the Normal–Gamma prior so computationally
attractive:
sampling \( \tau \) requires **only a single Gamma draw**.

### A1.2 Conditional Posterior for \( \beta \)

Given \( \tau \), the posterior for \( \beta \) is Gaussian:

\[
\beta \mid \tau,y \sim N(m_{\text{post}}, V_{\text{post}}),
\]

with

\[
V_{\text{post}}
=
\left(\Sigma_{0}^{-1} + \tau X^{T}X\right)^{-1},
\qquad
m_{\text{post}}
=
V_{\text{post}}
\left(\Sigma_{0}^{-1}\mu_{0} + \tau X^{T}y\right).
\]

### A1.3 Sampling the Posterior

Because the Normal--Gamma prior is *conjugate* to the Gaussian likelihood [@OHaganForster2004], the
**joint posterior factorizes into two standard distributions**:

\[
p(\tau,\beta \mid y)
= p(\tau \mid y)\, p(\beta \mid \tau, y).
\]

This structure means **no MCMC is required**.  
Posterior draws are obtained by **independent Monte Carlo sampling**, not by a Gibbs chain.

---

#### Step 1: Draw the marginal posterior for \( \tau \)

The dispersion (precision) parameter has a closed‑form marginal posterior:

\[
\tau \mid y \sim 
\mathrm{Gamma}\!\left(
  a_{0} + \frac{n}{2},\;
  b_{0} + \frac{1}{2} S(\hat\beta_{\text{GLS}})
\right),
\]

where

\[
S(\hat\beta_{\text{GLS}})
= (y - X\hat\beta_{\text{GLS}})^{T}(y - X\hat\beta_{\text{GLS}})
\]

and \( \hat\beta_{\text{GLS}} \) is the generalized least‑squares estimator under
the prior precision.

Each draw of \( \tau \) is **i.i.d.** from this Gamma distribution.

---

#### Step 2: Draw \( \beta \) conditionally on each sampled \( \tau \)

Given a sampled value \( \tau^{(s)} \), the regression coefficients have a Normal posterior:

\[
\beta \mid \tau^{(s)}, y
\sim
N\!\left(m_{\text{post}}^{(s)},\, V_{\text{post}}^{(s)}\right),
\]

with

\[
V_{\text{post}}^{(s)}
=
\left(\Sigma_{0}^{-1} + \tau^{(s)} X^{T}X\right)^{-1},
\]

\[
m_{\text{post}}^{(s)}
=
V_{\text{post}}^{(s)}
\left(\Sigma_{0}^{-1}\mu_{0} + \tau^{(s)} X^{T}y\right).
\]

Each draw of \( \beta \) is **conditionally independent** given its
corresponding \( \tau^{(s)} \).

---

### Result: Pure i.i.d. Monte Carlo Sampling

The conjugate posterior allows:

- **direct sampling** of \( \tau \) from its marginal Gamma distribution  
- **direct sampling** of \( \beta \) from its conditional Normal distribution  

No Markov chain is constructed, and **no Gibbs or Metropolis steps** are involved.  
All posterior samples are **independent**, making the conjugate Normal–Gamma
model one of the simplest Bayesian regression frameworks for simulation.

## A2. Detailed sampling procedured for the Independent Normal-Gamma prior

### A2.1 iid Sampling Under Truncated Gamma Priors

When iid sampling is requested, glmbayes uses a truncated Gamma prior for the
precision parameter \( \tau \).
The truncation is introduced for numerical stability and does not change the
basic Normal-Gamma structure.

The iid sampler for this prior family is based on the envelope accept-reject
method described in:

**Chapter A07 — Accept-Reject Sampling for Gaussian Regression Models with
Independent Normal-Gamma Priors**

That vignette explains the full simulation procedure, including:

* how the dispersion parameter is truncated,
* how the envelope is constructed across dispersion values,
* how the mixture of truncated normal proposals for \( \beta \) is formed,
* how the Gamma proposal for \( \tau \) is calibrated.

The key point for users is that the posterior remains log-concave, so the
accept-reject sampler produces independent draws for \( (\beta, \tau) \).
There is no Markov chain, no burn-in, and no autocorrelation.

## A3. Posterior mean of \( \beta \) and marginal covariance (Zellner-type prior)

This subsection records closed-form **posterior moments** for \( \beta \) in the
**conjugate Normal-Gamma** Gaussian regression model, in the same weighted
likelihood notation as Section 2.2 and in the **prior weight / effective prior
sample size** notation used by `Prior_Setup()`.

### A3.1 Weighted likelihood and prior precision

Assume

\[
y \mid \beta,\phi \sim N\!\left(X\beta,\;\phi W^{-1}\right),
\qquad
\tau = 1/\phi,
\]

with \(W = \mathrm{diag}(w_{1},\dots,w_{n})\) and \(w_i \ge 0\).
Define the **effective sample size**

\[
n_{w} = \sum_{i=1}^{n} w_i
\]

(the quantity `sum(wt)` in `rNormalGammaReg()`).
Write the **weighted Gram matrix** as

\[
G = X^{\mathsf T} W X.
\]

Let \(\hat\beta\) denote the **weighted least squares** estimator

\[
\hat\beta = G^{-1} X^{\mathsf T} W y
\]

and the **weighted residual sum of squares**

\[
\mathrm{RSS}
=
(y - X\hat\beta)^{\mathsf T} W (y - X\hat\beta).
\]

Take a **Zellner-type** conditional Normal prior on \(\beta\) given \(\tau\)
that is proportional to the data precision:

\[
\beta \mid \tau \sim N\!\left(\mu_{0},\; \tau^{-1}\left(\frac{pwt}{1-pwt}G\right)^{-1}\right),
\qquad
0 < pwt < 1,
\]

equivalently prior **precision** \(\tau\, (pwt/(1-pwt))\, G\).
This is the structure implied by `Prior_Setup()` together with
`dNormal_Gamma(..., Sigma_0, ...)` when
\(\Sigma_0 = ((1-pwt)/pwt)\, G^{-1}\) (Zellner \(g\)-prior
relative to \(G^{-1}\); see Section 3.2).

Define the **effective prior sample size** \(n_{\mathrm{prior}} > 0\) by

\[
pwt = \frac{n_{\mathrm{prior}}}{n_{\mathrm{prior}} + n_{w}},
\qquad
1 - pwt = \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}.
\]

This is the relationship implemented when `Prior_Setup()` is called with
scalar `n_prior` (`n_prior` and `n_effective` in the function).

### A3.2 Posterior mean of \( \beta \)

Combining likelihood precision \(\tau G\) with prior precision
\(\tau (pwt/(1-pwt)) G\) gives posterior precision **given \(\tau\)**:

\[
\tau G + \tau \frac{pwt}{1-pwt} G
=
\frac{\tau}{1-pwt}\, G.
\]

Hence

\[
\mathrm{Cov}(\beta \mid \tau, y)
=
\frac{1-pwt}{\tau}\, G^{-1}
=
\frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,\frac{1}{\tau}\, G^{-1}.
\]

The **posterior mean** \(m_{\mathrm{post}} = E(\beta \mid \tau,y)\) solves the
same normal equations as the **precision-weighted average** of \(\hat\beta\) and
\(\mu_{0}\), and simplifies to

\[
\boxed{
E(\beta \mid y)
=
(1-pwt)\,\hat\beta + pwt\,\mu_{0}
=
\frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,\hat\beta
+
\frac{n_{\mathrm{prior}}}{n_{\mathrm{prior}} + n_{w}}\,\mu_{0}.
}
\]

The right-hand side **does not depend on \(\tau\)** (for this prior), so
\(E(\beta \mid y) = E(\beta \mid \tau,y)\).

### A3.3 Marginal covariance of \( \beta \) and the Gamma parameters

Marginalize \(\tau\) with

\[
\tau \mid y \sim \mathrm{Gamma}(a_{n}, b_{n})
\]

in the **shape-rate** parameterization
\(p(\tau) \propto \tau^{a_{n}-1} e^{-b_{n}\tau}\) used in `rNormalGammaReg()`
(`R::rgamma(shape, 1/rate)`), so
\(E(\tau^{-1} \mid y) = b_{n}/(a_{n}-1)\) for \(a_{n} > 1\).

Let

\[
V_{n} = (1-pwt)\, G^{-1}
=
\frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1},
\]

so \(\mathrm{Cov}(\beta \mid \tau,y) = \tau^{-1} V_{n}\).
Because \(E(\beta \mid \tau,y)\) is constant in \(\tau\), the **law of total
covariance** gives

\[
\boxed{
\mathrm{Cov}(\beta \mid y)
=
E(\tau^{-1} \mid y)\, V_{n}
=
\frac{b_{n}}{a_{n}-1}\,
\frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,
G^{-1},
\qquad a_{n} > 1.
}
\]

In the sampler,

\[
a_{n} = a_{0} + \frac{n_{w}}{2},
\qquad
b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}},
\]

with \(S_{\mathrm{marg}}\) the marginal sum of squares from Section A1.1. In the
notation of this section (\(G = X^{\mathsf T}WX\), weighted least squares
\(\hat\beta = G^{-1} X^{\mathsf T}Wy\), weighted
\(\mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}W(y - X\hat\beta)\), and
\(\Sigma_{0}\) as in Section 3.2),

\[
S_{\mathrm{marg}}
=
\mathrm{RSS}
+
(\hat\beta - \mu_{0})^{\mathsf T}
\left(\Sigma_{0} + G^{-1}\right)^{-1}
(\hat\beta - \mu_{0}).
\]

This is the same scalar as the augmented residual sum of squares `S` produced by
`rNormal_reg.wfit()` (Section A1).
Under the Zellner \(g\)-prior of A3.1, the prior **covariance** is
\(\Sigma_{0} = ((1-pwt)/pwt)\,G^{-1}\), so
\(\Sigma_{0} + G^{-1} = G^{-1}/pwt\) and
\((\Sigma_{0} + G^{-1})^{-1} = pwt\, G\). Equivalently,
\[
S_{\mathrm{marg}}
=
\mathrm{RSS}
+
pwt\,
(\hat\beta - \mu_{0})^{\mathsf T} G (\hat\beta - \mu_{0}).
\]
The **structural** point is that the prior--data mismatch enters only through the
scalar factor **\(pwt\)** multiplying a nonnegative quadratic in
\(\hat\beta-\mu_{0}\). Hence, holding the data and \(\mu_{0}\) fixed so that
\((\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\) stays finite,
\[
pwt \to 0^{+}
\quad\Longrightarrow\quad
pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) \to 0,
\]
and \(S_{\mathrm{marg}} \to \mathrm{RSS}\): weak prior weight **washes out** the
gap even when \(\mu_{0} \neq \hat\beta\).
Equivalently, \(pwt\to 0\) whenever
\(n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w})\to 0\)---for example
\(n_{\mathrm{prior}}\to 0\) with \(n_{w}\) held fixed, or \(n_{w}\to\infty\) with
\(n_{\mathrm{prior}}\) held fixed.
The **exact** cancellation \(S_{\mathrm{marg}} = \mathrm{RSS}\) also holds for
any \(pwt\) when \(\mu_{0} = \hat\beta\) (for example
`Prior_Setup(..., intercept_source = "full_model", effects_source = "full_model")`).
For fixed \(pwt>0\) and \(\mu_{0} \neq \hat\beta\), the second term is strictly
positive unless \(\hat\beta-\mu_{0}\) lies in the null space of \(G\).
Under standard regularity, with fixed \(n_{\mathrm{prior}}\) and
\(n_{w}\to\infty\), one has \(pwt = O(1/n_{w})\) while
\((\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\) is typically
\(O_{p}(n_{w})\), so the mismatch term is often \(O_{p}(1)\): the **additive**
difference \(S_{\mathrm{marg}}-\mathrm{RSS}\) need not vanish at finite
\(n_{w}\), but \(S_{\mathrm{marg}}/n_{w}\) and \(\mathrm{RSS}/n_{w}\) share the
same leading order.

**Limit \(pwt\to 0\) and coefficient covariance.**
Under the Zellner decomposition above, **\(pwt\to 0^{+}\)** forces
**\(S_{\mathrm{marg}}\to \mathrm{RSS}\)** (for fixed data and \(\mu_{0}\), with a
finite mismatch quadratic), not only when \(\mu_{0}=\hat\beta\).
Since \(b_{n} = b_{0} + \tfrac{1}{2} S_{\mathrm{marg}}\), the marginal Gamma rate
then limits to **\(b_{0} + \tfrac{1}{2}\mathrm{RSS}\)** whenever \(b_{0}\) is held
fixed along the path; \(E(\tau^{-1}\mid y)=b_{n}/(a_{n}-1)\) therefore ties to
**RSS** through \(b_{n}\) in the same limit.
At the same time, **\((1-pwt)\,G^{-1}\to G^{-1}\)** from A3.2--A3.3.
Thus \(\mathrm{Cov}(\beta \mid \tau,y) = \tau^{-1}(1-pwt)\,G^{-1}\) tends to
\(\tau^{-1}G^{-1}\) (likelihood-only WLS covariance given \(\tau\)), and
\[
\mathrm{Cov}(\beta \mid y)
=
\frac{b_{n}}{a_{n}-1}\,(1-pwt)\,G^{-1},
\qquad a_{n}>1,
\]
has both **\(b_{n}\)** driven by **\(S_{\mathrm{marg}}\to \mathrm{RSS}\)** and the
matrix prefactor \((1-pwt)\,G^{-1}\to G^{-1}\): weak \(pwt\) removes prior
shrinkage of precision **and** drives the residual sum of squares in the
\(\tau\) update back to the **data** \(\mathrm{RSS}\).
The usual \(\sigma^{2}(X^{\mathsf T}WX)^{-1}\) limit is recovered when, in
addition, \(a_{n}\) and \(b_{0}\) follow the weak-prior path discussed below
(e.g.\ formal \(n_{\mathrm{prior}}\to 0\) with \(n_{w}\) fixed, holding
\(a_{n}>1\)).
If \(pwt\) is driven to zero by \(n_{w}\to\infty\) at fixed \(n_{\mathrm{prior}}\),
then \((1-pwt)\to 1\) **and** \(G^{-1}\) typically scales as \(O(1/n_{w})\), so
\(\mathrm{Cov}(\beta\mid y)\) shrinks with more weighted data even though the
prior contributes negligibly to precision.

With `Prior_Setup()` defaults for the Gamma on \(\tau\),
\(a_{0} = n_{\mathrm{prior}}/2\) in the pre-calibration Step 9 rate scaling, so
\(a_{n} = (n_{\mathrm{prior}} + n_{w})/2\); the prior rate is
\(b_{0} = (n_{\mathrm{prior}}/2)\, d\) with `dispersion` \(d\) returned by
`Prior_Setup()` (see Section 3.2 and `?Prior_Setup`).

The **marginal** posterior of \(\beta\) is **multivariate \(t\)**; the matrix
above is its **covariance** (not the scale matrix in every textbook
parameterization of the \(t\) density).

#### `Prior_Setup()` dispersion and the Gamma rates \(b_{0}\), \(b_{n}\)

For Gaussian models with scalar `n_prior`, `Prior_Setup()` sets the prior Gamma
**rate** to \(\mathrm{rate} = (n_{\mathrm{prior}}/2)\, d\), where \(d\) is the
scalar `dispersion` returned by the function. In the notation above,
\(b_{0} = \mathrm{rate}\).

The **posterior** Gamma rate is
\(b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}}\). Thus a larger returned
\(d\) increases \(b_{0}\) and, holding \(S_{\mathrm{marg}}\) fixed, would
increase \(b_{n}\). In practice, for **`dNormal_Gamma`** the matrix passed to
the sampler is **`Sigma_0`** (dispersion-free prior covariance from `Prior_Setup()`),
while the returned **`dispersion`** calibrates the Gamma rate and the coefficient-scale
**`Sigma`**; changing \(d\) rescales the Normal--Gamma coupling of \(\beta\) given \(\tau\);
the augmented least-squares step then produces a **different** \(S_{\mathrm{marg}}\) and posterior mean
\(m_{\mathrm{post}}\) as well.

#### Special case: \(b_{0} = \frac{1}{2}(n_{\mathrm{prior}}/n_{w})\,S_{\mathrm{marg}}\)

\[
\frac{S_{\mathrm{marg}}}{n_{w}}
\]

This choice is **not** the `Prior_Setup()` default; it is a **hypothetical**
reparameterization that sets the prior Gamma rate proportional to the **same**
marginal sum of squares that enters the likelihood fragment, with weight
\(n_{\mathrm{prior}}/n_{w}\) on the prior piece relative to the data piece (see
the discussion in the main text). Because \(S_{\mathrm{marg}}\) depends on
\(y\) (and on \(\mu_{0}\) versus \(\hat\beta\)), such a \(b_{0}\) is
**data-dependent** unless used only in stylized derivations.

Assume \(a_{0} = n_{\mathrm{prior}}/2\) (nominal rate scaling). Then
\(a_{n} = (n_{\mathrm{prior}} + n_{w})/2\) and, provided \(a_{n} > 1\)
(equivalently \(n_{\mathrm{prior}} + n_{w} > 2\)),

\[
a_{n} - 1 = \frac{n_{\mathrm{prior}} + n_{w} - 2}{2}.
\]

Take

\[
b_{0} = \frac{1}{2}\,\frac{n_{\mathrm{prior}}}{n_{w}}\, S_{\mathrm{marg}},
\qquad
b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}}
= \frac{1}{2}\, S_{\mathrm{marg}}\left(1 + \frac{n_{\mathrm{prior}}}{n_{w}}\right)
= \frac{S_{\mathrm{marg}}}{2}\,\frac{n_{w} + n_{\mathrm{prior}}}{n_{w}}.
\]

Then the posterior expectation of \(\tau^{-1} = \phi = \sigma^{2}\) is

\[
\frac{b_{n}}{a_{n}-1}
=
\frac{S_{\mathrm{marg}}(n_{w} + n_{\mathrm{prior}})/n_{w}}
{(n_{\mathrm{prior}} + n_{w} - 2)}.
\]

Under the Zellner prior of A3.1,
\(V_{n} = (1-pwt)\, G^{-1} = \dfrac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1}\).
One may write \(\mathrm{Cov}(\beta \mid y)\) as the product of
\(b_{n}/(a_{n}-1)\) and \(V_{n}\) **without** combining the
\((n_{\mathrm{prior}} + n_{w})\) terms:

\[
\boxed{
\mathrm{Cov}(\beta \mid y)
=
\frac{b_{n}}{a_{n}-1}\, V_{n}
=
\dfrac{S_{\mathrm{marg}}\,(n_{\mathrm{prior}} + n_{w})/n_{w}}
{n_{\mathrm{prior}} + n_{w} - 2}
\cdot
\dfrac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,
G^{-1}.
}
\]

The two fractions cancel to
\(S_{\mathrm{marg}} / (n_{\mathrm{prior}} + n_{w} - 2)\) times \(G^{-1}\).
So with this \(b_{0}\), the marginal coefficient covariance matches that
heuristic pattern---analogous in structure to \(\mathrm{RSS}/(n-p)\) times
\((X^{\mathsf T}WX)^{-1}\) when \(S_{\mathrm{marg}} = \mathrm{RSS}\)
(e.g.\ \(\mu_{0} = \hat\beta\)) and one reads
\(n_{\mathrm{prior}} + n_{w} - 2\) as an effective residual degrees of freedom
for the **joint** Normal--Gamma fragment. The factor \(a_{n}-1\) in the
denominator of \(b_{n}/(a_{n}-1)\) is exactly half that count, and the factor
\((1-pwt) = n_{w}/(n_{\mathrm{prior}}+n_{w})\) from \(V_{n}\) supplies the
remaining alignment.

**Limit \(n_{\mathrm{prior}} \to 0\) (formal comparison to classical).**
Holding \(n_{w} > 2\) and the data fixed, \(pwt = n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w}) \to 0\),
\(b_{0} = \frac{1}{2}(n_{\mathrm{prior}}/n_{w})S_{\mathrm{marg}} \to 0\),
\(a_{n} = (n_{\mathrm{prior}}+n_{w})/2 \to n_{w}/2\), and
\(a_{n}-1 \to (n_{w}-2)/2\).
Along the same path, \(S_{\mathrm{marg}} = \mathrm{RSS} + pwt\,(\hat\beta -
\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) \to \mathrm{RSS}\) because
\(pwt\to 0\) (Section A3.3): the prior--data mismatch vanishes **even when**
\(\mu_{0} \neq \hat\beta\).
Also \(b_{n} = b_{0} + \frac{1}{2}S_{\mathrm{marg}} \to \frac{1}{2}\mathrm{RSS}\),
so

\[
\frac{b_{n}}{a_{n}-1} \to \frac{\mathrm{RSS}}{n_{w}-2},
\qquad
V_{n} = (1-pwt)\,G^{-1} \to G^{-1}.
\]

Hence
\[
\mathrm{Cov}(\beta \mid y) \to \frac{\mathrm{RSS}}{n_{w}-2}\,(X^{\mathsf T}WX)^{-1},
\]
the usual weighted least squares covariance under the Gaussian `glm.fit`
dispersion convention of `Prior_Setup()` (see `?Prior_Setup`, Details).
**Before** taking \(n_{\mathrm{prior}}\to 0\), a fixed \(n_{\mathrm{prior}}>0\)
with \(\mu_{0} \neq \hat\beta\) typically has \(S_{\mathrm{marg}} > \mathrm{RSS}\)
and hence a **larger** \(b_{n}\) and posterior spread than this classical
limit suggests; only in the limit does \(S_{\mathrm{marg}}\) coincide with
\(\mathrm{RSS}\) for every \(\mu_{0}\).
At \(n_{\mathrm{prior}} = 0\) the Zellner \(g\)-prior weight and the Gamma
shape \(a_{0}\) are degenerate, so this limit is best read as **asymptotic
intuition** for weak prior weight, not as a literal prior specification.

**Growth of \(n_{w}\) at fixed \(n_{\mathrm{prior}}\).**
The posterior mean of \(\tau^{-1}=\sigma^{2}\) in the special case can be written
with the uncancelled factors as
\[
\frac{b_{n}}{a_{n}-1}
=
\frac{S_{\mathrm{marg}}\,(n_{\mathrm{prior}} + n_{w})/n_{w}}
{n_{\mathrm{prior}} + n_{w} - 2}
=
\frac{S_{\mathrm{marg}}}{n_{w}}
\cdot
\frac{n_{\mathrm{prior}} + n_{w}}{n_{\mathrm{prior}} + n_{w} - 2}.
\]
As \(n_{w} \to \infty\) with \(n_{\mathrm{prior}}\) fixed,
\((n_{\mathrm{prior}} + n_{w})/(n_{\mathrm{prior}} + n_{w} - 2) \to 1\) and
\((n_{\mathrm{prior}} + n_{w})/n_{w} \to 1\), so asymptotically
\((b_{n}/(a_{n}-1)) \sim S_{\mathrm{marg}}/n_{w}\): the leading behavior is
**mean marginal residual energy per effective observation**.
Moreover \(pwt = n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w}) \to 0\), so in the
Zellner decomposition
\(S_{\mathrm{marg}} = \mathrm{RSS} + pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\)
the mismatch term is multiplied by a coefficient tending to **zero**; under
typical regularity \(pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) = O_{p}(1)\)
while \(\mathrm{RSS}\) grows with \(n_{w}\), so
\(S_{\mathrm{marg}}/n_{w}\) and \(\mathrm{RSS}/n_{w}\) have the same leading
order. The factor \((n_{\mathrm{prior}}+n_{w})/n_{w}\)
is the only explicit **\(O(1/n_{w})\)** correction linking total sample size to
effective data count before cancellation with \(V_{n}\).

#### `dGamma` posterior with fixed \(\beta\): general form and special choices

For a precision-only prior
\[
\tau \sim \mathrm{Gamma}(a_0,b_0),
\]
and Gaussian weighted likelihood with \(\beta\) treated as fixed, define
\[
\mathrm{RSS}_w(\beta)=\sum_i w_i(y_i-x_i^\top\beta)^2.
\]
Then
\[
\tau\mid y,\beta \sim \mathrm{Gamma}\!\left(a_0+\frac{n_w}{2},\; b_0+\frac12\,\mathrm{RSS}_w(\beta)\right).
\]

Using the Chapter 11 calibration
\[
a_0=\frac{n_{\mathrm{prior}}}{2},
\qquad
b_0=\frac12\frac{n_{\mathrm{prior}}}{n_w}\,\mathrm{RSS}_w(\beta),
\]
we get
\[
\tau\mid y,\beta \sim
\mathrm{Gamma}\!\left(
\frac{n_{\mathrm{prior}}+n_w}{2},
\frac12\left(1+\frac{n_{\mathrm{prior}}}{n_w}\right)\mathrm{RSS}_w(\beta)
\right).
\]

Hence, for \(n_{\mathrm{prior}}+n_w>2\),
\[
E(\tau^{-1}\mid y,\beta)
=
\frac{b_n}{a_n-1}
=
\frac{\left(\frac{n_{\mathrm{prior}}+n_w}{n_w}\right)\mathrm{RSS}_w(\beta)}
{n_{\mathrm{prior}}+n_w-2}.
\]

This expression holds for any fixed \(\beta\).

#### Choice \(\beta=\beta_\star\) (posterior mode from `dNormal_Gamma`) and weak-prior limit

If \(\beta=\beta_\star\), where \(\beta_\star\) is the posterior mode from the
`dNormal_Gamma` fit, the same formula applies with \(\mathrm{RSS}_w(\beta_\star)\).

Using the weighted least-squares decomposition:
\[
\mathrm{RSS}_w(\beta)
=
\mathrm{RSS}_w(\hat\beta)
+
(\beta-\hat\beta)^\top G(\beta-\hat\beta),\qquad
G=X^\top W X,
\]
so
\[
\mathrm{RSS}_w(\beta_\star)
=
\mathrm{RSS}_w(\hat\beta)
+
(\beta_\star-\hat\beta)^\top G(\beta_\star-\hat\beta).
\]

Under scalar Zellner weighting, \(\beta_\star=(1-pwt)\hat\beta+pwt\,\mu_0\), giving
\[
\mathrm{RSS}_w(\beta_\star)
=
\mathrm{RSS}_w(\hat\beta)
+
pwt^2(\hat\beta-\mu_0)^\top G(\hat\beta-\mu_0).
\]

As \(n_{\mathrm{prior}}\to 0\) with \(n_w\) fixed, \(pwt\to 0\), so
\[
\mathrm{RSS}_w(\beta_\star)\to \mathrm{RSS}_w(\hat\beta),
\]
and therefore
\[
E(\tau^{-1}\mid y,\beta_\star)
\to
\frac{\mathrm{RSS}_w(\hat\beta)}{n_w-2},
\]
the classical weighted residual-variance limit.