%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Chapter 07, Horton et al. using mosaic}
%\VignettePackage{Sleuth2}
\documentclass[11pt]{article}

\usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry}
\usepackage{hyperref}
\usepackage{language}
\usepackage{alltt}
\usepackage{mathtools}
\usepackage{amsmath}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}

%% Now begin customising things. See the fancyhdr docs for more info.

\chead{}
\lhead[\sf \thepage]{\sf \leftmark}
\rhead[\sf \leftmark]{\sf \thepage}
\lfoot{}
\cfoot{Statistical Sleuth in R: Chapter 7}
\rfoot{}

\newcounter{myenumi}
\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}}
\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}}

\pagestyle{fancy}

\def\R{{\sf R}}
\def\Rstudio{{\sf RStudio}}
\def\RStudio{{\sf RStudio}}
\def\term#1{\textbf{#1}}
\def\tab#1{{\sf #1}}


\usepackage{relsize}

\newlength{\tempfmlength}
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
     {
   \medskip
   \setlength{\tempfmlength}{#1}
   \begin{lrbox}{\fmbox}
     \begin{minipage}{#1}
         \vspace*{.02\tempfmlength}
                 \hfill
           \begin{minipage}{.95 \tempfmlength}}
                 {\end{minipage}\hfill
                 \vspace*{.015\tempfmlength}
                 \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}
         \medskip
         }


\newenvironment{boxedText}[1][.98\textwidth]%
{%
\begin{center}
\begin{fmpage}{#1}
}%
{%
\end{fmpage}
\end{center}
}

\newenvironment{boxedTable}[2][tbp]%
{%
\begin{table}[#1]
  \refstepcounter{table}
  \begin{center}
\begin{fmpage}{.98\textwidth}
  \begin{center}
        \sf \large Box~\expandafter\thetable. #2
\end{center}
\medskip
}%
{%
\end{fmpage}
\end{center}
\end{table}             % need to do something about exercises that follow boxedTable
}


\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}}

\title{The Statistical Sleuth in R: \\
Chapter 7}

\author{
Ruobing Zhang \and Kate Aloisio \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu}
} 

\date{\today}

\begin{document}


\maketitle
\tableofcontents

%\parindent=0pt


<<setup0, include=FALSE, cache=FALSE>>=
require(knitr)
opts_chunk$set(
  dev="pdf",
  fig.path="figures/",
        fig.height=6,
        fig.width=8,
        out.width=".67\\textwidth",
        fig.keep="high",
        fig.show="hold",
        fig.align="center",
        prompt=TRUE,  # show the prompts; but perhaps we should not do this 
        comment=NA    # turn off commenting of ouput (but perhaps we should not do this either
  )
@
<<setup, include=FALSE, cache=FALSE, echo=FALSE, message=FALSE>>=
opts_chunk$set(
  dev="pdf",
  fig.path="figures/",
  fig.height=3,
        fig.width=4,
        out.width=".47\\textwidth",
        fig.keep="high",
        fig.show="hold",
        fig.align="center",
        prompt=TRUE,  # show the prompts; but perhaps we should not do this 
        comment=NA    # turn off commenting of ouput (but perhaps we should not do this either
  )
require(Sleuth2)
require(mosaic)
trellis.par.set(theme=col.mosaic())  # get a better color scheme for lattice
set.seed(123)
# this allows for code formatting inline.  Use \Sexpr{'function(x,y)'}, for exmaple.
knit_hooks$set(inline = function(x) {
if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
x = as.character(x)
h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize'))
h = gsub("([_#$%&])", "\\\\\\1", h)
h = gsub('(["\'])', '\\1{}', h)
gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)
})
showOriginal=FALSE
showNew=TRUE
@

<<pvalues, echo=FALSE, message=FALSE>>=
print.pval = function(pval) {
  threshold = 0.0001
    return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""),
                ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""),
                       paste("p=", round(pval, 3), sep=""))))
}
@

\section{Introduction}

This document is intended to help describe how to undertake analyses introduced as examples in the Second Edition of the \emph{Statistical Sleuth} (2002) by Fred Ramsey and Dan Schafer.
More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}.  This
file as well as the associated \pkg{knitr} reproducible analysis source file can be found at
\url{http://www.amherst.edu/~nhorton/sleuth}.

This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the 
\pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). 

To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command:
<<install_mosaic,eval=FALSE>>=
install.packages('mosaic')               # note the quotation marks
@
Once this is installed, it can be loaded by running the command:
<<load_mosaic,eval=FALSE>>=
require(mosaic)
@
This needs to be done once per session.

In addition the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth2} package.
<<install_Sleuth2,eval=FALSE>>=
install.packages('Sleuth2')               # note the quotation marks
@
<<load_Sleuth2,eval=FALSE>>=
require(Sleuth2)
@

We also set some options to improve legibility of graphs and output.
<<eval=TRUE>>=
trellis.par.set(theme=col.mosaic())  # get a better color scheme for lattice
options(digits=4)
@

The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 7: Simple Linear Regression: A Model for the Mean using R.

\section{The Big Bang}

Is  there relation between distance and radial velocity among extra-galactic nebulae?  This is the question addressed in case study 7.1 in the \emph{Sleuth}.

\subsection{Summary statistics and graphical display}

We begin by reading the data and summarizing the variables.

<<>>=
summary(case0701)
@

A total of \Sexpr{nrow(case0701)} nebulae are included in this data.

<<>>=
histogram(~ Velocity, type='density', density=TRUE, nint=10, data=case0701)
histogram(~ Distance, type='density', density=TRUE, nint=10, data=case0701)
@

The density plots show that the distributions for the two variables are fairly symmetric, but more uniform than normally
distributed.

<<fig.height=6,fig.width=10>>=
xyplot(Distance ~ Velocity, type=c("p", "r"), data=case0701)
@

The scatterplot is displayed on page 175 of the \emph{Sleuth}. It indicates that there is a linear statistical relationship between distance and velocity.


\subsection{The simple linear regression model}

The following code presents the results interpreted on page 184 of the \emph{Sleuth}.

<<>>=
lm1 = lm(Distance ~ Velocity, data=case0701)
summary(lm1)
@

The estimated parameter for the intercept is \Sexpr{round(coef(lm1)["(Intercept)"], 4)} megaparsecs and the estimated parameter for velocity is \Sexpr{round(coef(lm1)["Velocity"], 4)} megaparsecs/(km/sec). The estimated mean function is $\hat{\mu}\left(\mathrm{distance}|\mathrm{velocity}\right)$ = \Sexpr{round(coef(lm1)["(Intercept)"], 4)} + \Sexpr{round(coef(lm1)["Velocity"], 4)} * velocity. The estimate of residual standard error is \Sexpr{summary(lm1)$sigma} megaparsecs with 22 degrees of freedom. These results are also presented by Display 7.9 (page 185).

<<>>=
fitted(lm1)
resid(lm1)^2
sum(resid(lm1)^2)
sum(resid(lm1)^2)/sum((fitted(lm1)-mean(~Distance, data=case0701))^2)
@

Display 7.8 (page 184) shows the list of fitted values and residuals for this model. The sum of all the squared residuals is \Sexpr{round(sum((resid(lm1)^2)), 3)} and R-squared is \Sexpr{sum(resid(lm1)^2)/sum((fitted(lm1)-mean(~Distance, data=case0701))^2)}.

We can also display 95\% confidence bands for the model line and the predicted values, the following graph is akin to Display 7.11 (page 189).

<<fig.width=10, fig.height=6>>=
xyplot(Distance ~ Velocity, panel=panel.lmbands, data=case0701)
@

\subsection{Inferential Tools}

First, we test $\beta_{0}$ (the intercept). From the previous summary, we know that the two-sided $p$-value for the intercept is \Sexpr{summary(lm1)$coefficients["(Intercept)", "Pr(>|t|)"]}. This $p$-value is small enough for us to reject the null hypothesis that the estimated parameter for the intercept equals 0 (page 186).

Next we want to examine $\beta_{1}$. The current $\beta_{1}$ for $\hat{\mu}\left(\mathrm{Y}|\mathrm{X}\right)$ = $\beta_{0}$ + $\beta_{1}$ * X is \Sexpr{round(coef(lm1)["Velocity"], 4)}, and we want to get the $\beta_{1}$ for $\hat{\mu}\left(\mathrm{Y}|\mathrm{X}\right)$ = $\beta_{1}$ * X, a model with no intercept (page 186).

<<>>=
# linear regression with no intercept
lm2 = lm(Distance ~ Velocity-1, data=case0701)
summary(lm2)
confint(lm2)
@

Without the intercept, the new estimate for $\beta_{1}$ is \Sexpr{round(coefficients(lm2)["Velocity"], 4)} megaparsec-second/km. The standard error is $\Sexpr{round(summary(lm2)$coefficients["Velocity","Std. Error"], 6)}$ megaparsecs with 23 degrees of freedom. The 95\% confidence interval is (\Sexpr{confint(lm2)[1]}, \Sexpr{confint(lm2)[2]}). Because 1 megaparsec-second/km = 979.8 billion years, the confidence interval could be written as \Sexpr{round(confint(lm2)[1]*979.8, 2)} to \Sexpr{round(confint(lm2)[2]*979.8, 2)} billion years, and the best estimate is \Sexpr{round(coef(lm2)["Velocity"]*979.8, 2)} billion years (page 186).

\section{Meat Processing and pH}

Is there a relationship between postmortem muscle pH and time after slaughter?  This is the question addressed in case study 7.2 in the \emph{Sleuth}.

\subsection{Summary statistics and graphical display}

We begin by reading the data and summarizing the variables.

<<>>=
summary(case0702)
@

A total of \Sexpr{nrow(case0702)} steer carcasses are included in this data as 
shown in Display 7.3, page 177.

<<>>=
logtime = log(case0702$Time)
xyplot(pH ~ logtime, data=case0702)
@
The above scatterplot indicates a negative linear relationship between pH and log(Time).

\subsection{The simple linear regression model}

We fit a simple linear regression model of pH on log(time) after slaughter. The estimated mean function will be $\hat{\mu}\left(\mathrm{pH}|\mathrm{logtime}\right)$ = $\beta_{0}$ + $\beta_{1}$ * log(Time).

<<>>=
lm3 = lm(pH ~ logtime, data=case0702)
summary(lm3)
beta0 = coef(lm3)["(Intercept)"]; beta0
beta1 = coef(lm3)["logtime"]; beta1
sigma = summary(lm3)$sigma; sigma
@

The $\hat{\beta_{0}}$ is \Sexpr{beta0} and the $\hat{\beta_{1}}$ is \Sexpr{beta1}. The $\hat{\sigma}$ is \Sexpr{sigma} (page 187).

\subsection{Inferential Tools}

With the previous information, we can calculate the 95\% confidence interval for the estimated mean pH of steers 4 hours after slaughter (Display 7.10, page 187):

<<>>=
muhat = beta0+beta1*log(4); muhat
n = nrow(case0702)
mean = mean(~logtime, data=case0702)
sd = sd(~logtime, data=case0702)
se = sigma*sqrt(1/n+(log(4)-mean)^2/((n-1)*sd)); se
upper = muhat + qt(0.975, df=8)*se; upper
lower = muhat - qt(0.975, df=8)*se; lower
@
Or we can use the following code to get the same result:
<<>>=
predict(lm3, interval="confidence")[5,]
@

So the 95\% confidence interval for estimated mean is (\Sexpr{round(lower, 2)}, \Sexpr{round(upper, 2)}).

Next, we can calculate the 95\% prediction interval for a steer carcass 4 hours after slaughter (Display 7.12, page 191):

<<>>=
pred = beta0+beta1*log(4); pred
predse = sigma*sqrt(1+1/n+(log(4)-mean)^2/((n-1)*sd)); predse
predupper = pred+qt(0.975, df=8)*predse; predupper
predlower = pred-qt(0.975, df=8)*predse; predlower
@
Or we can use the following code to get the 95\% prediction interval for a steer carcass 4 hours after slaughter:
<<message=FALSE>>=
predict(lm3, interval="prediction")[5,] 
@

So the 95\% prediction interval is (\Sexpr{round(predlower, 2)}, \Sexpr{round(predupper, 2)}).

<<fig.width=10, fig.height=8, message=FALSE>>=
xyplot(pH ~ logtime, abline=(h=6), data=case0702, panel=panel.lmbands)
@

The 95\% prediction band is presented as Display 7.4 (page 178).

\end{document}