---
title: "Chapter 02: Estimating Bayesian Linear Models"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: REFERENCES.bib
reference-section-title: References
vignette: >
  %\VignetteIndexEntry{Chapter 02: Estimating Bayesian Linear Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(glmbayes)
```


# 1. Introductory discussion

The lmb function brings the familiar interface of R’s lm into a Bayesian framework. It preserves the core outputs and methods you know from lm—such as coefficient tables, fitted values, and residual summaries—while adding Bayesian priors and posterior summaries. In this chapter, we fit the classical linear model side-by-side with its Bayesian counterpart, demonstrating how their results align when using common default settings. Standard GLM references include [@McCullagh1989] and [@VenablesRipley2002]; Bayesian linear-model material aligns with [@Gelman2013] and [@Agresti2015].

For consistency with our Binomial and Count vignettes, we use the default priors generated by Prior_Setup() and fix the dispersion at its maximum-likelihood estimate. Holding dispersion constant ensures that the Bayesian point estimates closely match the classical ones, isolating the effect of the prior. We defer a deeper dive into alternative prior choices—including priors on dispersion parameters—to a later chapter.

## 2. Linear Model Structure and Gaussian Likelihood (Fixed Dispersion)

In a standard linear regression model, the response is modeled as

\[
Y_i = x_i^\top \beta + \varepsilon_i,
\qquad
\varepsilon_i \sim N(0,\ \sigma^2),
\]

where:
- \(x_i\) is the vector of predictors for observation \(i\),
- \(\beta\) is the vector of regression coefficients,
- \(\sigma^2\) is the error variance (dispersion), treated as **fixed** in this chapter.

The fitted values are

\[
\hat{\mu}_i = x_i^\top \beta.
\]


### 2.1 Gaussian Log‑Likelihood (Fixed Dispersion)

Up to an additive constant, the log‑likelihood for \(\beta\) is

\[
\ell(\beta)
=
-\frac{1}{2\sigma^2}
\sum_{i=1}^n (y_i - x_i^\top \beta)^2.
\]

In matrix notation:

\[
\ell(\beta)
=
-\frac{1}{2\sigma^2}
(y - X\beta)^\top (y - X\beta).
\]

This is the same criterion minimized by ordinary least squares, so the classical estimate of \(\beta\) is the familiar OLS solution.


### 2.2 Normal Prior on \(\beta\)

To form a Bayesian linear model, we place a multivariate Normal prior on the coefficients:

\[
\beta \sim N(\mu_0,\ \Sigma_0).
\]

The default prior from `Prior_Setup()` is a Zellner‑type \(g\)-prior, which scales \(\Sigma_0\) relative to the information in the design matrix.  
With the default setting \(\mathrm{pwt} = 0.01\), the prior is weakly informative and exerts only mild shrinkage toward \(\mu_0\).


### 2.3 Log‑Posterior and Conjugate Updating

Combining the likelihood and the Normal prior yields a Normal posterior distribution:

\[
\beta \mid y \sim N(\mu_{\text{post}},\ \Sigma_{\text{post}}),
\]

where

\[
\Sigma_{\text{post}}
=
\left(
\frac{1}{\sigma^2} X^\top X
+
\Sigma_0^{-1}
\right)^{-1},
\]

\[
\mu_{\text{post}}
=
\Sigma_{\text{post}}
\left(
\frac{1}{\sigma^2} X^\top y
+
\Sigma_0^{-1} \mu_0
\right).
\]

Because both the likelihood and prior are quadratic in \(\beta\), the posterior is available in closed form.  
This conjugate structure explains why the Bayesian estimates closely track the classical OLS estimates when the prior is weak and \(\sigma^2\) is fixed at the classical estimate.


### 2.4 Why Fixing Dispersion Matters

In this chapter, we treat \(\sigma^2\) as known.  
This ensures:

- the Bayesian and classical fits are directly comparable,  
- the prior affects only the regression coefficients,  
- the posterior remains exactly Normal.

Models that **estimate** \(\sigma^2\) or place a prior on it are covered in Chapter 11.


# 3. Sample Data - Dobson's Plant weight data

We illustrate our models using the Plant Weight dataset from Dobson (1990), which measures the dry weight of plants grown under control (“Ctl”) and treatment (“Trt”) conditions. We combine the two groups into a single response vector and create a factor variable for group membership.


```{r Plant_Data}
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
```



# 4. Calling the two functions

In this section, we fit both the classical and Bayesian linear models to the plant data. First, we run lm() to establish a baseline. Then, we use Prior_Setup() to define our Bayesian priors and call lmb() with those settings.

## 4.1 Using the classical lm function

This simple ordinary least squares regression of plant weight on group membership will serve as our point of comparison.

## 4.1.1 Calling the classical lm function

```{r Plant_w_intercept}
lm.D9 <- lm(weight ~ group)
```


## 4.2 Setting the prior and calling the lmb function

Here we generate default Gaussian priors and plug them into the Bayesian model.

### 4.2.1 Setting the prior

Prior_Setup() returns the prior mean vector (mu), covariance matrix (V), and the maximum-likelihood estimate of dispersion (disp_ML) for a fixed-dispersion Gaussian model.

```{r Plant_Prior}
ps=Prior_Setup(weight ~ group,family=gaussian())
```

We now utilize the print function for **Prior_Setup** to understand the default prior specification.

```{r Plant_Prior_Spec}
ps
```
The printed output begins by echoing the original function call, confirming the model formula and the specified likelihood family. Next, the setup of a Zellner g-type prior is displayed, where pwt = 0.01 indicates that the prior contributes just 1% of the total information, while the likelihood contributes the remaining 99%. This reflects a weakly informative prior, allowing the data to dominate the posterior. The derived value of g = 99 quantifies the relative weight of the likelihood to the prior (or more precisely how much the Prior Variance-Covariance matrix is scaled up relative to the Likelihood Variance-Covariance matrix). 

The note on n_prior shows how this pseudo-sample size was computed from pwt and the effective sample size of the likelihood (n_likelihood = 20). The summary section includes prior means, standard deviations, and confidence intervals for each coefficient. The prior correlation matrix reveals the structure of dependencies induced by the design matrix—here, a strong negative correlation between the intercept and treatment effect. In addition, the conditional dispersion (0.485) reflects the residual variance under the Gaussian model, consistent with the classical GLM fit.

Finally, the printout also contains information related to shape and rate parameters that are used to specify priors for the dispersion. We discuss these extensively in a separate vignette and will not cover here.

In our next vignette, we discuss the default prior in more details. We also discuss how users can utilize the options in the Prior_Setup function to get different **g-prior** specifications that rescale the Prior-Variance covariance matrix as well as options for using alternative prior mean specifications. In a later vignette, we dig deeper into the use of differential **pwt** specifications for different dimensions and the use of more informative priors.


### 4.2.2 Calling the lmb function

We show how to call the lmb function under three different prior specification. The first (a dNormal Prior) treats the dispersion as a fixed and known constant (and that specification is the focus of most of this vignette). The other two are alternative priors that also provide a prior specification for the dispersion. Those priors will be covered in greater details in later chapters but are introduced so that users are aware of the potential need to estimate the dispersion. Many glm models (like the binomial and poisson models) don't have dispersion parameters. For that reason, we focus most of our discussion in this chapter on models with fixed dispersion parameter.

```{r Plant_lmb_calls}
lmb.D9=lmb(weight ~ group,dNormal(mu=ps$mu,Sigma  =ps$Sigma,dispersion=ps$dispersion))
##lmb.D9_v2=lmb(weight ~ group,dNormal_Gamma(mu=ps$mu,Sigma_0=ps$Sigma_0,shape=ps$shape,rate=ps$rate))
## lmb.D9_v3=lmb(weight ~ group,dIndependent_Normal_Gamma(mu=ps$mu,Sigma = ps$Sigma,shape=ps$shape_ING,rate=ps$rate))
summary(lmb.D9)
```
With both models fitted, we’re ready to explore their summaries and compare classical versus Bayesian outputs in the next section.

# 5. Summarizing the output

Before we dive into each summary, it’s helpful to see the big picture.  In the classical lm summary, you get three blocks—residual diagnostics, coefficient tests, and global fit measures—that tell you how well an OLS model is behaving.  In the Bayesian lmb summary, you see additional blocks that compare your specified priors to the data (via ML estimates), then deliver full posterior summaries, effective‐parameter counts, and information criteria.  In Sections 4.1 and 4.2 we walk through each of these blocks side by side, highlighting where the Bayesian output enriches or mirrors the classical view.


## 5.1 Summarizing the lm output


The built-in summary() method for an lm object organizes its output into three clear pieces:
- A quick look at the residuals
- A coefficient table with t-tests
- Overall fit statistics (RSE, R^2, F-test)
We’ll step through each of these in turn, starting with the residual summary.


```{r Plant_summary}
summary(lm.D9)
```

### 5.1.1 Residual summary

`summary(lm.D9)` begins with a five‐number summary of the residuals (Min, 1Q, Median, 3Q, Max). This display highlights skew, spread, and potential outliers in the model’s error distribution.

### 5.1.2 Coefficient Estimates and Hypothesis Tests

Coefficients:

|               | Estimate | Std. Error | t value | Pr(>|t|)  |
|---------------|---------:|-----------:|--------:|----------:|
| (Intercept)   |   5.0320 |     0.2202 |  22.850 | 9.55e-15*** |
| groupTrt      |  –0.3710 |     0.3114 |  –1.191 |   0.249    |

The columns report:

- Estimate: the ordinary least squares estimate of each regression coefficient.  
- Std. Error: the estimated standard deviation of the sampling distribution, derived from the residual mean square.  
- t value: \(\frac{\text{Estimate} - 0}{\text{Std. Error}}\); this is the test statistic for \(H_{0}:\beta=0\).  
- Pr(>|t|): the two-sided p-value, i.e. the probability of observing a \(|t|\) at least as large as the one computed, assuming \(H_{0}\) is true. Significance codes (***) denote \(p<0.001\).  

The t-test assesses whether each predictor’s effect differs from zero after accounting for sampling variability. A small p-value indicates that the null hypothesis \( \beta = 0 \) is unlikely given the data.  

In this example, the intercept is highly significant (\(p<10^{-14}\)), whereas the treatment effect (groupTrt) yields \(p=0.249\), so there is insufficient evidence to reject \(H_{0}:\beta_{\text{groupTrt}}=0\) at conventional levels.


### 5.1.3 Model fit statistics

Below the coefficient table, `summary(lm.D9)` prints four key measures of fit:

    Residual standard error: 0.7047 on 18 DF
    Multiple R-squared:    0.00285
    Adjusted R-squared:   -0.04863
    F-statistic: 0.3094 on 1 and 18 DF,  p-value: 0.5839

These metrics gauge how well the model captures variation in the data and whether the added predictor meaningfully improves over an intercept-only model.


---

**Residual standard error (RSE)**

This is the estimated standard deviation of the residuals.

---

**Multiple R-squared (\(R^2\))**

This measures the fraction of total variability in \(y\) explained by the model.

---

**Adjusted R-squared (\(\text{adj}R^2\))**

Adjusts for the number of predictors, penalizing model complexity.

---

**F-statistic**

Tests the null hypothesis that all slopes are zero.

## 5.2 Summarizing the lmb output

When you call summary() on an lmb object, the first thing you see is how your prior specification lines up against a classical fit under fixed dispersion.  This side-by-side display makes it easy to gauge how much your prior is pulling the estimates.  Only after that does the function report full posterior summaries (moments, quantiles, Monte Carlo errors) and Bayesian diagnostics (pD, DIC, etc.).  We begin by examining the prior vs. ML block.


```{r Plant_lmb_summary}
summary(lmb.D9)
```


### 5.2.1 Prior and maximum‐likelihood estimates

Before diving into the posterior, `summary(lmb.D9)` reports the prior means (and variances) you supplied alongside the maximum-likelihood estimates under a fixed‐dispersion Gaussian model. Seeing these side by side helps you gauge how much influence your prior exerts relative to the data.

```{r Plant_lmb_coefficients1}
sumlmb<-summary(lmb.D9)
sumlmb$coefficients1
```



The default prior specification from `Prior_Setup()` is designed to align closely with the classical fit while still imposing mild regularization:

- The intercept prior mean is set equal to the maximum‐likelihood estimate from a "Null" intercept only model.
- All other “effects” (here, the groupTrt slope) default to zero, expressing no a priori directional pull.
- The prior variance–covariance matrix is inflated by a factor  
  \[
    g \;=\;\frac{1 - \mathrm{pwt}}{\mathrm{pwt}}
    \quad\text{with}\quad \mathrm{pwt}=0.01,
  \]
  giving \(g=99\). This is exactly Zellner’s g-prior:  
  \[
    V_{\text{prior}} \;=\; g \times V_{\text{likelihood}},
  \]
  so each prior standard deviation (2.191 and 3.099) is \(\sqrt{g}\) times the corresponding classical standard error (0.2200 and 0.3110). This choice delivers minimal shrinkage towards the prior.  

### 5.2.2 Posterior estimates

The main table lists posterior mode and mean for each coefficient, plus the posterior standard deviation, Monte Carlo standard error, and the tail probability for a zero‐effect test. These values integrate your prior and the likelihood to give a full distributional summary.



```{r Plant_lmb_coefficients}
sumlmb<-summary(lmb.D9)
sumlmb$coefficients
```

When a g prior is used, the posterior means simplify and becomes a weighted average of the prior mean and the Maximum likelihood estimates.  

\[
  \text{Post.Mean}
    = \mathrm{pwt}\times\text{Prior Mean}
    + (1-\mathrm{pwt})\times\text{MLE}.
\]

In other words, `pwt` directly tunes how much “pull” your prior has versus the data in each dimension.

The `Pr(tail)` column gives the posterior “reverse‐tail” probability relative to the prior mean: if the posterior mean is below the prior mean, it’s  
\[
  P(\beta > \text{prior mean}\mid\text{data})
\]  
and if the posterior mean is above the prior mean, it’s  
\[
  P(\beta < \text{prior mean}\mid\text{data}).
\]  
For models with a zero prior mean and roughly symmetric posteriors, `Pr(tail)` will be close to one-half of the classical two-sided p-value—and thus roughly equal to the one-sided p-value—offering a Bayesian analogue to hypothesis testing. In essence, a small `Pr(tail)` value signifies strong evidence against the prior mean, functioning as a form of “rejection” of the prior hypothesis.

### 5.2.3 Directional Prior-Posterior Summaries

The Standardized Mahalanobis Distance quantifies how far the prior point estimate lies from the posterior mean, accounting for posterior covariance. Building on this, the Directional Tail Probability assesses how likely it is—under the posterior distribution—to observe coefficient values that fall opposite the prior, relative to the posterior mean. This directional measure generalizes the concept of reverse-tail probabilities from univariate to multivariate settings. Small values suggest that the posterior density places little mass in the direction of the prior, and can thus be interpreted as a form of prior rejection.

### 5.2.4 Distribution percentiles 

Below the moments, you get a grid of key quantiles (1%, 2.5%, 5%, median, 95%, 97.5%, 99%) for each parameter. Percentiles let you quickly read off credible‐interval bounds and assess skew or heavy tails in the posterior.


```{r Plant_lmb_Percentiles}
sumlmb<-summary(lmb.D9)
sumlmb$Percentiles
```

**Interpretation for `groupTrt`:**  
- The 95% credible interval (2.5%–97.5%) is approximately **[–0.9315, 0.2334]**, which spans zero and indicates uncertainty about the direction of the treatment effect.  
- The 90% credible interval (5%–95%) is approxinately **[–0.8722, 0.1557]**, also including zero and reinforcing that a null or small positive effect remains plausible.  
- The posterior distribution is slightly negatively skewed: the lower tail (1.0% = –1.0553) extends farther below the median (–0.3337) than the upper tail (99.0% = 0.3380) extends above it.  
- Although the median effect is –0.3337 (suggesting a small negative treatment effect), the credible intervals remind us that the data do not provide strong evidence to rule out no effect or even a slight positive effect.
```


### 5.2.5 DIC and Related Measures

-The effective number of parameters (pD) quantifies how much your posterior has “moved” from the prior. A pD near the number of coefficients signals a relatively data‐driven fit; a smaller pD indicates stronger shrinkage toward the prior.


- The expected residual deviance (\(\bar{D}\)) is the posterior mean of the deviance statistic. It is analogous to the classical deviance but averaged over your posterior draws, giving a Bayesian measure of overall fit.

- The Deviance Information Criterion (DIC) combines \(\bar{D}\) and \(p_D\) into a single metric for model comparison:
\[
\text{DIC} = \bar{D} + p_D
\]
Lower values signal better trade-offs between fit and complexity under your chosen prior.

### 5.2.6 Expected mean dispersion

For Gaussian models, `summary(lmb.D9)` also prints the posterior mean of the dispersion parameter. This aligns with the classical residual variance estimate but incorporates prior uncertainty.

### 5.2.7 Square‐root of expected mean dispersion

Finally, taking the square root of the expected mean dispersion yields a Bayesian estimate of the residual standard deviation. This makes it directly comparable to the classical residual standard error.

# 6. Textbook Content Mappings

The tables below map vignette sections to chapters in @Agresti2015.

## 6.1 Agresti - Foundations of Linear and Generalized Linear Models

| Vignette Section               | Textbook Chapter & Section    | Notes (provisional)                                                                                                                          |
|--------------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| 3.1.1 Calling the classical lm function | 3.4 Example Normal Linear Model Inference | Formal definition of \(y = X\beta + \varepsilon\); shows OLS as the MLE under Normal errors (see derivation in Example 3.1). |
| 3.2.1 Setting the prior        | 10.2 Bayesian Linear Models   | Construction of conjugate Normal priors and Zellner's g-prior \(V = g (X^{T} X)^{-1}\), with \(g = (1 - \mathrm{pwt})/\mathrm{pwt}\); see Example 10.2. |
| 3.2.2 Calling the lmb function | 10.2 Bayesian Linear Models   | Shows closed-form posterior updates under conjugate Normal priors: derivation of the posterior mean and variance for \(\beta\) in the linear model. |


| Vignette Section                              | Textbook Chapter & Section                              | Notes                                                                                              |
|-----------------------------------------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| 4.1.1 Residual summary                        | 3.4 Example Normal Linear Model Inference               | Discussion of raw, standardized, and studentized residuals; graphical diagnostics (residual vs fitted, Q-Q plots); measures of leverage and influence. |
| 4.1.2 Coefficient Estimates and Hypothesis Tests | 3.2 Significance Tests for Normal Linear Models        | Formal derivation of t-tests and confidence intervals for slopes; interpretation of estimates, standard errors, t-statistics, two-sided p-values, and significance codes. |
| 4.1.3 Model fit statistics                    | 3.2 Significance Tests for Normal Linear Models        | Definition and interpretation of residual standard error; \(R^2\) and adjusted \(R^2\); ANOVA F-test for overall model significance. |
| 4.2.1 Prior and maximum-likelihood estimates  | 10.1 The Bayesian Approach to Statistical Inference    | Presentation of conjugate Normal priors and Zellner's g-prior; overlay of prior and sampling distributions (MLEs); role of \(g\) in shrinkage. |
| 4.2.2 Posterior estimates                     | 10.2 Bayesian Linear Models                            | Closed-form posterior mode, mean, and variance under conjugacy; Monte Carlo approximation for non-conjugate settings; interpretation of "Pr(tail)" statements. |
| 4.2.3 Distribution percentiles                | 10.2 Bayesian Linear Models                            | Definition of equal-tailed and HPD intervals; extraction of key posterior quantiles (2.5%, 97.5%, etc.); assessing skew and tail behavior. |
| 4.2.4 Effective number of parameters (pD)     | 10.3 Bayesian Generalized Linear Models                | Introduction of deviance-based effective number of parameters (\(p_D\)); interpretation of \(p_D\) as a penalty/shrinkage measure in model comparison. |
| 4.2.5 Expected residual deviance (\(\bar{D}\)) | 10.3 Bayesian Generalized Linear Models               | Definition of deviance for GLMs; calculation of posterior mean deviance; use of \(\bar{D}\) as a Bayesian fit statistic. |
| 4.2.6 Deviance Information Criterion (DIC)     | 10.3 Bayesian Generalized Linear Models               | Derivation of \(\text{DIC} = \bar{D} + p_D\); guidelines for using DIC to compare Bayesian models under a common prior. |
| 4.2.7 Expected mean dispersion                | 10.2 Bayesian Linear Models                            | Bayesian estimation of the Gaussian error variance \((\sigma^2)\) via conjugate priors; posterior mean of \(\sigma^2\). |
| 4.2.8 Square-root of expected mean dispersion | 10.2 Bayesian Linear Models                            | Transformation of posterior \(\sigma^2\) into \(\sigma\) (residual standard-error analogue); direct comparison to classical RSE. |