--- title: Basic unit-level models output: rmarkdown::html_vignette bibliography: mcmcsae.bib vignette: > %\VignetteIndexEntry{Basic unit-level models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} options(width=100) # width of output ``` The basic unit-level model [@Battese1988; @rao2015small] is given by $$ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) $$ where $j$ runs from 1 to $n$, the number of unit-level observations, $\beta$ is a vector of regression coefficients for given covariates $x_j$, and $v_i$ are random area intercepts. We use the `api` dataset included in packages survey. ```{r, message=FALSE} library(survey) data(api) apipop$cname <- as.factor(apipop$cname) apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname)) ``` The `apipop` data frame contains the complete population whereas `apisrs` is a simple random sample from it. The variable `cname` is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample's `cname` variable match those of its population counterpart. The basic unit-level model with county random area effects is fit as follows ```{r, message=FALSE} library(mcmcsae) mod <- api00 ~ reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") + gen(factor = ~ iid(cname), name="v") sampler <- create_sampler(mod, data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) (summary(sim)) ``` We wish to estimate the area population means $$ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, $$ where $U_i$ is the set of units in area $i$ of size $N_i$. The MCMC output in variable `sim` can be used to obtain draws from the posterior distribution for $\theta_i$. The $r$th draw can be expressed as $$ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, $$ where $\bar{y}_i$ is the sample mean of $y$ in area $i$ and $t_{x;i}$ is a vector of population totals for area $i$. ```{r, fig.width=7, fig.height=7, fig.align="center"} N <- table(apipop$cname) samplesums <- tapply(apisrs$api00, apisrs$cname, sum) samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas m <- match(apisrs$cds, apipop$cds) # population units in the sample res <- predict(sim, newdata=apipop, labels=names(N), fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N, show.progress=FALSE) (summ <- summary(res)) theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30) ``` ## Binomial Unit-Level Model A model with binomial likelihood can also be fit. We now model the target variable `sch.wide`, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, $$ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) $$ ```{r} apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes") sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) summary(sim) ``` To predict the population fractions of schools that meet the growth target by county, ```{r, fig.width=7, fig.height=7, fig.align="center"} samplesums <- tapply(apisrs$target.met, apisrs$cname, sum) samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas res <- predict(sim, newdata=apipop, labels=names(N), fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N, show.progress=FALSE) (summ <- summary(res)) theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean)) # true population quantities plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30) ``` ## References