---
title: "The haplo.stats package"
author: "Jason Sinnwell and Daniel Schaid"
output:
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 2
vignette: |
  %\VignetteIndexEntry{The haplo.stats package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, tidy.opts=list(width.cutoff=100), tidy=TRUE, comment=NA)
options(width=120, max.print=1000)
```

Introduction
==========================

Haplo Stats is a suite of R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers), while also allowing for missing alleles. The primary user-level functions in Haplo Stats are documented by name in the table of contents. This manual explains the basic and advanced usage of these routines, with guidelines for running the analyses and interpreting results. Many of these details also appear in the function help pages, which are accessed within an R session using *help(haplo.em)*, for example. The help files also contain brief examples, which can be run in the R session with *example(haplo.em)*.

## Operating System and Installation

The haplo.stats package has been uploaded to the Comprehensive R Archive Network (CRAN), and is made available on various operating systems through CRAN. Installation is simplest from within R using *install.packages('haplo.stats')*; other procedures for installing R packages can be found at the [R project website](https://cran.r-project.org). The routines are computationally intensive and return a large amount of information. Therefore, we assign classes to the returned objects and provide various methods (e.g., print, summary, plot) for each of them.

## Data Setup

We first show some typical steps when loading a package and looking for details on a function of interest.
In the sample code below, we load the package, check which functions are available in the package, view a help file, and run the example that is within the help file.

```{r, getstarted, echo=TRUE, eval=FALSE}
# load the package, list its functions, then view a help file and run its example
library(haplo.stats)
ls(name="package:haplo.stats")
help(haplo.em)
example(haplo.em)
```

```{r, rsettings, echo=FALSE, eval=TRUE}
options(width=100)
rm(list=ls())
require(haplo.stats)
```

## Example Data

The package contains three example data sets. The primary data set used in this manual is named *hla.demo*, which contains 11 loci from the HLA region on chromosome 6, with covariates, qualitative, and quantitative responses. Since the data is provided in the package, we load it in R using *data()* and view the names of the columns. Then, to make the columns of *hla.demo* accessible without typing the data frame name each time, we could attach it to the current session.

```{r, hladat, echo=TRUE}
# load and preview demo dataset stored in ~/haplo.stats/data/hla.demo.tab
data(hla.demo)
names(hla.demo)
# attach hla.demo to make columns available in the session
#attach(hla.demo)
```

The column names of *hla.demo* are shown above. They are defined as follows:

+ resp: quantitative antibody response to measles vaccination
+ resp.cat: a factor with levels "low", "normal", "high", for categorical antibody response
+ male: gender code with 1="male", 0="female"
+ age: age (in months) at immunization

The remaining columns are genotypes for 11 HLA loci, with a prefix name (e.g., "DQB") and a suffix for each of two alleles (".a1" and ".a2").

## Creating a Genotype Matrix

Many of the functions require a matrix of genotypes, denoted below as *geno*. This matrix is arranged such that each locus has a pair of adjacent columns of alleles, and the order of columns corresponds to the order of loci on a chromosome. If there are K loci, then the number of columns of *geno* is $2K$. Rows represent the alleles for each subject.
For example, if there are three loci, in the order A-B-C, then the 6 columns of *geno* would be arranged as A.a1, A.a2, B.a1, B.a2, C.a1, C.a2. For illustration, three of the loci in *hla.demo* will be used to demonstrate some of the functions. Create a separate data frame for 3 of the loci, and call this *geno*. Then create a vector of labels for the loci.

```{r, label=makegeno, echo=TRUE}
geno <- hla.demo[,c(17,18,21:24)]
genolabel <- c("DQB","DRB","B")
```

The *hla.demo* data already had alleles in two columns for each locus. For many SNP datasets, the data is in a one-column format, giving the count of the minor allele. To assist in converting this format to two columns, a function named *geno1to2* has been added to the package. See its help file for more details.

## Preview Missing Data: *summaryGeno*

Before performing a haplotype analysis, the user will want to assess missing genotype data to determine the completeness of the data. If many genotypes are missing, the functions may take a long time to compute results, or even run out of memory. For these reasons, the user may want to remove some of the subjects with a lot of missing data. This step can be guided by using the *summaryGeno* function, which checks for missing allele information and counts the number of potential haplotype pairs that are consistent with the observed data (see the Appendix for a description of this counting scheme). The codes for missing alleles are defined by the parameter *miss.val*, a vector defining all possible missing-value codes. Below, the result is saved in *geno.desc*, which is a data frame, so individual rows may be printed. Here we show the results for subjects 1-10, 80-85, and 135-140, some of which have missing alleles.
```{r, summarygeno, echo=TRUE}
geno.desc <- summaryGeno(geno, miss.val=c(0,NA))
print(geno.desc[c(1:10,80:85,135:140),])
```

The columns labeled *'loc miss-'* give the number of loci missing either 0, 1, or 2 alleles, and the last column, *num\_enum\_rows*, gives the number of haplotype pairs that are consistent with the observed data. In the example above, subjects indexed by rows 81 and 137 have missing alleles. Subject \#81 has one locus missing two alleles, while subject \#137 has two loci missing two alleles. As indicated by *num\_enum\_rows*, subject \#81 has 1,800 potential haplotype pairs, while subject \#137 has nearly 130,000. The 130,000 haplotype pairs are a large number, but *haplo.em*, *haplo.score*, and *haplo.glm* still complete in roughly 3-6 minutes (depending on system limits or control parameter settings). If person \#137 were removed, the methods would take less than half that time. It is preferable to keep people if they provide information to the analysis, provided that run time and memory usage are not too much of a burden. When a person has no genotype information, they do not provide information to any of the methods in haplo.stats; furthermore, they cause a much longer run time. Below, using the *table* function on the third column of *geno.desc*, we tabulate how many people are missing two alleles at any of the three loci. If there were people missing two alleles at all three loci, they should be removed. The second command below shows how to make an index of which people to remove from *hla.demo* because they are missing all their alleles.
```{r, tabledesc, echo=TRUE}
# how many samples are missing 2 alleles at any of the 3 markers
table(geno.desc[,3])
```

```{r, rmmiss, echo=TRUE, eval=FALSE}
## create an index of people missing all alleles
miss.all <- which(geno.desc[,3]==3)
# use index to subset hla.demo
hla.demo.updated <- hla.demo[-miss.all,]
```

## Random Numbers and Setting Seed

Random numbers are used in several of the functions (e.g., to determine random starting frequencies within *haplo.em*, and to compute permutation p-values in *haplo.score*). To reproduce calculations involving random numbers, we use *set.seed()* before any function that uses random numbers. We illustrate setting the seed below.

```{r, setseed, echo=TRUE}
# this is how to set the seed for reproducing results where haplo.em is
# involved, and also if simulations are run. In practice, don't reset seed.
seed <- c(17, 53, 1, 40, 37, 0, 62, 56, 5, 52, 12, 1)
set.seed(seed)
```

Haplotype Frequency Estimation: *haplo.em*
==================================================

## Algorithm

For genetic markers measured on unrelated subjects, with linkage phase unknown, *haplo.em* computes maximum likelihood estimates of haplotype probabilities. Because there may be more than one pair of haplotypes that is consistent with the observed marker phenotypes, posterior probabilities of haplotype pairs for each subject are also computed. Unlike the usual EM algorithm, which attempts to enumerate all possible haplotype pairs before iterating over the EM steps, our "progressive insertion" algorithm progressively inserts batches of loci into haplotypes of growing length, runs the EM steps, trims off pairs of haplotypes per subject when the posterior probability of the pair is below a specified threshold, and then continues these insertion, EM, and trimming steps until all loci are inserted. The user can choose the batch size.
If the batch size is chosen to be all loci, and the threshold for trimming is set to 0, then this reduces to the usual EM algorithm. The basis of this progressive insertion algorithm is the ["snphap" software by David Clayton](https://github.com/chr1swallace/snphap). Although some of the features and control parameters of *haplo.em* are modeled after *snphap*, there are substantial differences, such as the extension to allow more than two alleles per locus, and some other nuances in how the algorithm is implemented.

## Example Usage: *haplo.em()*

We use *haplo.em* on *geno* for the 3 loci defined above and save the result in an object named *save.em*, which has the *haplo.em* class. The print method would normally print all 178 haplotypes from *save.em*, but to keep the results short for this manual, we give a quick glance of the output by using the option *nlines=10*, which prints only the first 10 haplotypes of the full results. The *nlines* parameter is available in several of Haplo Stats' print methods for when there are many haplotypes. In practice, it is best to omit this parameter so that the default will print all results.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, label=runEM, echo=TRUE}
save.em <- haplo.em(geno=geno, locus.label=genolabel, miss.val=c(0,NA))
names(save.em)
print(save.em, nlines=10)
```

The print method shows the haplotypes and their estimated frequencies, followed by the final log-likelihood statistic and the *lr stat for no LD*, which is the likelihood ratio test statistic contrasting the *lnlike* for the estimated haplotype frequencies versus the *lnlike* under the null hypothesis that alleles from all loci are in linkage equilibrium.
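The components listed by *names(save.em)* can be used directly; for instance, the estimated haplotypes and their frequencies are stored in the *haplotype* and *hap.prob* components. A small sketch (assuming *save.em* from above; the chunk label is ours):

```{r, emfreqs, echo=TRUE, eval=FALSE}
# view the most frequent estimated haplotypes (sketch)
ord <- order(save.em$hap.prob, decreasing=TRUE)
head(cbind(save.em$haplotype[ord,], hap.prob=round(save.em$hap.prob[ord], 4)))
```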
We note that the trimming by the progressive insertion algorithm can invalidate the *lr stat* and the degrees of freedom (*df*).

## Summary Method

The *summary* method applied to the *haplo.em* object *save.em* shows the list of haplotypes per subject, and their posterior probabilities:

```{r, summaryEM, echo=TRUE}
summary(save.em, nlines=7)
```

The first part of the *summary* output lists the subject id (row number of the input *geno* matrix), the codes for the haplotypes of each pair, and the posterior probabilities of the haplotype pairs. The second part gives a table of the maximum number of pairs of haplotypes per subject versus the number of pairs used in the final posterior probabilities. The haplotype codes remove the clutter of displaying all the alleles of the haplotypes, but may not be as informative as the actual haplotypes themselves. To see the actual haplotypes, use the *show.haplo=TRUE* option, as in the following example.

```{r, summaryEMshow, echo=TRUE}
# show full haplotypes, instead of codes
summary(save.em, show.haplo=TRUE, nlines=7)
```

### Control Parameters for *haplo.em*

A set of control parameters can be passed together to *haplo.em* as the *control* argument. This is a list of parameters that control the EM algorithm based on progressive insertion of loci. The default values are set by a function called *haplo.em.control* (see *help(haplo.em.control)* for a complete description). Although the user can accept the default values, there are times when control parameters may need to be adjusted. These parameters are defined as:

+ insert.batch.size: Number of loci to be inserted in a single batch.
+ min.posterior: Minimum posterior probability of a haplotype pair, conditional on observed marker genotypes. Pairs with posteriors below this minimum value are "trimmed" off the list of possible pairs.
+ max.iter: Maximum number of iterations allowed for the EM algorithm before it stops and prints an error.
+ n.try: Number of times to try to maximize the *lnlike* by the EM algorithm. The first try will use, as initial starting values for the posteriors, either equal values or uniform random variables, as determined by *random.start*. All subsequent tries will use uniform random values as initial starting values for the posterior probabilities.
+ max.haps.limit: Maximum number of haplotypes for the input genotypes. Within *haplo.em*, the first step is to try to allocate the sum of the result of *geno.count.pairs()*; if that exceeds *max.haps.limit*, it starts by allocating *max.haps.limit* instead. If that is exceeded in the progressive-insertion steps, the C function doubles the memory until it can no longer request more.

One reason to adjust control parameters is to help find the global maximum of the log-likelihood, which can be difficult, particularly for small sample sizes and many possible haplotypes. Different maximizations of the log-likelihood may lead to different results from *haplo.em*, *haplo.score*, or *haplo.glm* when rerunning the analyses. The algorithm uses multiple attempts to maximize the log-likelihood, starting each attempt with random starting values. To increase the chance of finding the global maximum, the user can increase the number of attempts (*n.try*), increase the batch size (*insert.batch.size*), or decrease the trimming threshold for posterior probabilities (*min.posterior*).

Another reason to adjust control parameters is when the algorithm runs out of memory because there are too many haplotypes. If *max.haps.limit* is exceeded when a batch of markers is added, the algorithm requests twice as much memory until it runs out. One option is to set *max.haps.limit* to a different value, either to make *haplo.em* request more memory initially, or to request more memory in smaller chunks.
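For instance, to raise the initial allocation, *max.haps.limit* can be set through *haplo.em.control*; the call below is a syntax sketch only, and the value shown is an assumption rather than a recommendation:

```{r, embigmem, echo=TRUE, eval=FALSE}
# request a larger initial haplotype allocation (sketch; value is illustrative)
save.em.mem <- haplo.em(geno=geno, locus.label=genolabel, miss.val=c(0,NA),
                        control=haplo.em.control(max.haps.limit=4e6))
```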
Another solution is to make the algorithm trim the number of haplotypes more aggressively by decreasing *insert.batch.size* or increasing *min.posterior*. Any changes to these parameters should be made with caution, and not be drastically different from the default values. For instance, the default for *min.posterior* used to be $1e-7$, and in some rare circumstances with many markers in only moderate linkage disequilibrium, some subjects had all their possible haplotype pairs trimmed. The default is now set at $1e-9$, and we recommend not increasing *min.posterior* much beyond $1e-7$. The example below gives the command for increasing the number of tries to $20$ and decreasing the batch size to $2$, since not much more can be done for three markers.

```{r, emcontrol, echo=TRUE, eval=FALSE}
# demonstrate only the syntax of control parameters
save.em.try20 <- haplo.em(geno=geno, locus.label=genolabel, miss.val=c(0,NA),
                          control=haplo.em.control(n.try=20, insert.batch.size=2))
```

## Haplotype Frequencies by Group Subsets

To compute the haplotype frequencies for each level of a grouping variable, use the function *haplo.group*. The following example illustrates its use with a binary response, *y.bin*, based on *resp.cat*, which splits the subjects into two groups.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, groupBin, echo=TRUE}
## run haplo.em on sub-groups
## create ordinal and binary variables
y.bin <- 1*(hla.demo$resp.cat=="low")
group.bin <- haplo.group(y.bin, geno, locus.label=genolabel, miss.val=0)
print(group.bin, nlines=15)
```

The *group.bin* object can be very large, depending on the number of possible haplotypes, so only a portion of the output is shown above (limited again by *nlines*). The first section gives a short summary of how many subjects appear in each of the groups.
The second section is a table with the following columns: row numbers, the alleles of the haplotypes, and *Total*, the estimated haplotype frequencies for the entire data set. The last columns are the estimated haplotype frequencies for the subjects in the levels of the group variable (*y.bin=0* and *y.bin=1* in this example). Note that some haplotype frequencies have an *NA*, which appears when the haplotypes do not occur in the subgroups.

Power and Sample Size for Haplotype Association Studies
=====================================================================

It has been established that using haplotypes has greater power than single markers to detect genetic association in some circumstances. There is little guidance, however, in determining sample size and power under different circumstances, some of which include: marker type, dominance, and effect size. The haplo.stats package now includes functions to calculate sample size and power for haplotype association studies, flexible enough to handle these varied circumstances. Based on work in [Schaid 2005](https://doi.org/10.1111/j.1529-8817.2005.00215.x), we can take a set of haplotypes with their population frequencies, assign a risk to a subset of the haplotypes, then determine either the sample size to achieve a stated power, or the power for a stated sample size. Sample size and power can be calculated for either quantitative traits or case-control studies.

## Quantitative Traits: *haplo.power.qt*

We assume that quantitative traits will be modeled by a linear regression. Some well-known tests for association between haplotypes and the trait include [score statistics](https://pubmed.ncbi.nlm.nih.gov/11791212/) and an [F-test by Zaykin](https://pubmed.ncbi.nlm.nih.gov/12037407/). For both types of tests, power depends on the amount of variance in the trait that is explained by haplotypes, or a multiple correlation coefficient, $R^{2}$.
Rather than specifying the haplotype coefficients directly, we calculate the vector of coefficients based on an $R^{2}$ value. In the example below, we load an example set of haplotypes that contain 5 markers, and specify the indices of the at-risk haplotypes; in this case, whichever haplotypes have allele $1$ at the 2nd and 3rd markers. We set the first haplotype (the most common) as the baseline. With these values we calculate the vector of coefficients for haplotype effects from *find.haplo.beta.qt* using an $R^{2}=0.01$. Next, we use *haplo.power.qt* to calculate the sample size for the set of haplotypes and their coefficients, type-I error (alpha) set to $0.05$, power at 80%, and the same mean and variance used to get the haplotype coefficients. Then we use the sample size needed for 80% power for un-phased haplotypes ($2,826$) to get the power for both phased and un-phased haplotypes.

```{r, hapPowerQT, echo=TRUE}
# load a set of haplotypes
data(hapPower.demo)

#### an example using save.em hla markers may go like this.
# keep <- which(save.em$hap.prob > .004)  # get an index of non-rare haps
# hfreq <- save.em$hap.prob[keep]
# hmat <- save.em$haplotype[keep,]
# hrisk <- which(hmat[,1]==31 & hmat[,2]==11) # contains 3 haps with freq=.01
# hbase <- 4  # 4th hap has max freq of .103
####

## separate the haplotype matrix and the frequencies
hmat <- hapPower.demo[,-6]
hfreq <- hapPower.demo[,6]

## Define risk haplotypes as those with "1" allele at loc2 and loc3
hrisk <- which(hmat$loc.2==1 & hmat$loc.3==1)

# define index for baseline haplotype
hbase <- 1

hbeta.list <- find.haplo.beta.qt(haplo=hmat, haplo.freq=hfreq, base.index=hbase,
                                 haplo.risk=hrisk, r2=.01, y.mu=0, y.var=1)
hbeta.list

ss.qt <- haplo.power.qt(hmat, hfreq, hbase, hbeta.list$beta, y.mu=0, y.var=1,
                        alpha=.05, power=.80)
ss.qt

power.qt <- haplo.power.qt(hmat, hfreq, hbase, hbeta.list$beta, y.mu=0, y.var=1,
                           alpha=.05, sample.size=2826)
power.qt
```

## Case-Control Studies: *haplo.power.cc*

The steps to compute sample size and power for case-control studies are similar to the steps for quantitative traits. If we assume a log-additive model for haplotype effects, the haplotype coefficients can be specified first as odds ratios (OR), and then converted to logistic regression coefficients according to $log(OR)$. In the example below, we assume the same baseline and risk haplotypes defined above, give the risk haplotypes an odds ratio of 1.50, and specify a population disease prevalence of 10%. We also assume cases make up 50% (*case.frac*) of the study subjects. We first compute the sample size for this scenario for Type-I error (alpha) at $0.05$ and 80% power, and then compute power for the sample size required for un-phased haplotypes ($4,566$).
```{r, hapPowerCC, echo=TRUE}
## power and sample size for a case-control study
## get beta vector based on odds ratios
cc.OR <- 1.5

# determine beta regression coefficients for risk haplotypes
hbeta.cc <- numeric(length(hfreq))
hbeta.cc[hrisk] <- log(cc.OR)

# Compute sample size for stated power
ss.cc <- haplo.power.cc(hmat, hfreq, hbase, hbeta.cc, case.frac=.5,
                        prevalence=.1, alpha=.05, power=.8)
ss.cc

# Compute power for given sample size
power.cc <- haplo.power.cc(hmat, hfreq, hbase, hbeta.cc, case.frac=.5,
                           prevalence=.1, alpha=.05, sample.size=4566)
power.cc
```

Haplotype Score Tests: *haplo.score*
=============================================================

The *haplo.score* routine is used to compute score statistics to test association between haplotypes and a wide variety of traits, including binary, ordinal, quantitative, and Poisson. This function provides several different global and haplotype-specific tests for association and allows for adjustment for non-genetic covariates. Haplotype effects can be specified as additive, dominant, or recessive. This method also has an option to compute permutation p-values, which may be needed for sparse data when distribution assumptions may not be met. Details on the background and theory of the score statistics can be found in [Schaid et al. 2002](https://pubmed.ncbi.nlm.nih.gov/11791212/).

## Quantitative Trait Analysis

First, we assess a haplotype association with a quantitative trait in *hla.demo* called *resp*. To tell *haplo.score* the trait is quantitative, specify the parameter *trait.type="gaussian"* (a reminder that a gaussian distribution is assumed for the error terms). The other arguments, all set to default values, are explained in the help file. Note that rare haplotypes can result in unstable variance estimates, and hence unreliable test statistics for rare haplotypes. We restrict the analysis to get scores for haplotypes with a minimum sample count using *min.count=5*.
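Since each subject contributes two haplotypes, *min.count=5* corresponds to a frequency cut-off of $5/(2N)$. A quick sketch of the arithmetic for the 220 subjects in *hla.demo*:

```{r, mincountfreq, echo=TRUE, eval=FALSE}
N <- 220           # subjects in hla.demo
f.min <- 5/(2*N)   # haplotype frequency implied by min.count=5
round(f.min, 4)    # approximately 0.0114
```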
Below is an example of running *haplo.score* with a quantitative trait, then viewing the results using the *print* method for the *haplo.score* class (again, output shortened by *nlines*).

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scoregauss, echo=TRUE}
# score statistics w/ Gaussian trait
score.gaus.add <- haplo.score(hla.demo$resp, geno, trait.type="gaussian",
                              min.count=5, locus.label=genolabel, simulate=FALSE)
print(score.gaus.add, nlines=10)
```

First, the model effect chosen by *haplo.effect* is printed across the top. The section *Global Score Statistics* shows results for testing an overall association between haplotypes and the response. The *global-stat* has an asymptotic $\chi^{2}$ distribution, with degrees of freedom (*df*) and *p-value* as indicated. Next, *Haplotype-specific scores* are given in a table format. The column descriptions are as follows:

+ The first column gives row numbers.
+ The next columns (3 in this example) give the alleles of the haplotypes.
+ *Hap-Freq* is the estimated frequency of the haplotype in the pool of all subjects.
+ *Hap-Score* is the score for the haplotype; the results are sorted by this value. Note, the score statistic should not be interpreted as a measure of the haplotype effect.
+ *p-val* is the asymptotic $\chi^{2}_{1}$ p-value, calculated from the square of the score statistic.

## Binary Trait Analysis

Let us assume that "low" responders are of primary interest, so we create a binary trait that has values of 1 when *resp.cat* is "low", and 0 otherwise. Then in *haplo.score* specify the parameter *trait.type="binomial"*.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scorebinom, echo=TRUE}
# scores, binary trait
y.bin <- 1*(hla.demo$resp.cat=="low")
score.bin <- haplo.score(y.bin, geno, trait.type="binomial", x.adj=NA,
                         min.count=5, haplo.effect="additive",
                         locus.label=genolabel, miss.val=0, simulate=FALSE)
print(score.bin, nlines=10)
```

## Ordinal Trait Analysis

To create an ordinal trait, here we convert *resp.cat* (described above) to numeric values, *y.ord* (with levels 1, 2, 3). For *haplo.score*, use *y.ord* as the response variable, and set the parameter *trait.type="ordinal"*.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scoreOrd, echo=TRUE}
# scores w/ ordinal trait
y.ord <- as.numeric(hla.demo$resp.cat)
score.ord <- haplo.score(y.ord, geno, trait.type="ordinal", x.adj=NA,
                         min.count=5, locus.label=genolabel, miss.val=0,
                         simulate=FALSE)
print(score.ord, nlines=7)
```

### Warning for Ordinal Traits

When analyzing an ordinal trait with adjustment for covariates (using the *x.adj* option), the software requires [the *rms* package, distributed by Frank Harrell](https://link.springer.com/book/10.1007/978-3-319-19425-7). If the user does not have this package installed, then it will not be possible to use the *x.adj* option. However, the unadjusted scores for an ordinal trait (using the default option *x.adj=NA*) do not require this package. Check the list of your locally installed packages shown by entering *library()* at the R prompt.

## Haplotype Scores, Adjusted for Covariates

To adjust for covariates in *haplo.score*, first set up a matrix of covariates from the example data. For example, use a column for male (1 if male; 0 if female), and a second column for age. Then pass the matrix to *haplo.score* using the parameter *x.adj*. The results change slightly in this example.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scoreAdj, echo=TRUE}
# score w/gaussian, adjusted by covariates
x.ma <- with(hla.demo, cbind(male, age))
score.gaus.adj <- haplo.score(hla.demo$resp, geno, trait.type="gaussian",
                              x.adj=x.ma, min.count=5, locus.label=genolabel,
                              simulate=FALSE)
print(score.gaus.adj, nlines=10)
```

## Plots and Haplotype Labels

A convenient way to view results from *haplo.score* is a plot of the haplotype frequencies (*Hap-Freq*) versus the haplotype score statistics (*Hap-Score*). This plot, and the syntax for creating it, are shown in the figure below. Some points on the plot may be of interest. To identify individual points on the plot, use *locator.haplo(score.gaus.add)*, which is similar to *locator()*. Use the mouse to select points on the plot. After points are chosen, click on the middle mouse button, and the points are labeled with their haplotype labels. Note, in constructing the figure, we had to define which points to label, and then assign labels in the same way as done within the *locator.haplo()* function.

```{r, scorePlot, fig=TRUE, echo=TRUE}
## plot score vs. frequency, gaussian response
plot(score.gaus.add, pch="o")

## locate and label pts with their haplotypes
## works similar to locator() function
#> pts.haplo <- locator.haplo(score.gaus.add)
pts.haplo <- list(x.coord=c(0.05098, 0.03018, .100),
                  y.coord=c(2.1582, 0.45725, -2.1566),
                  hap.txt=c("62:2:7", "51:1:35", "21:3:8"))
text(x=pts.haplo$x.coord, y=pts.haplo$y.coord, labels=pts.haplo$hap.txt)
```

## Skipping Rare Haplotypes

For *haplo.score*, the *skip.haplo* and *min.count* parameters control which rare haplotypes are pooled into a common group. The *min.count* parameter is a more recent addition to *haplo.score*, yet it does the same task as *skip.haplo* and is the same idea as *haplo.min.count* used in *haplo.glm.control* for *haplo.glm*.
As a guideline, you may wish to set *min.count* to calculate scores for haplotypes with expected haplotype counts of $5$ or greater in the sample. We concentrate on this expected count because it adjusts to the size of the input data. If $N$ is the number of subjects and $f$ the haplotype frequency, then the expected haplotype count is $count = 2 \times N \times f$. Alternatively, you can choose $skip.haplo = \frac{count}{2 \times N}$. In the following example we try a different cut-off than before, *min.count=10*, which corresponds to a *skip.haplo* of $10 \div (2 \times 220) \approx .023$. In the output, see that the global statistic, degrees of freedom, and p-value change because of the fewer haplotypes, while the haplotype-specific scores do not change.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scoremin10, echo=TRUE}
# increase min.count, so expected hap counts = 10
score.gaus.min10 <- haplo.score(hla.demo$resp, geno, trait.type="gaussian",
                                x.adj=NA, min.count=10, locus.label=genolabel,
                                simulate=FALSE)
print(score.gaus.min10)
```

## Score Statistic Dependencies: the *eps.svd* parameter

The global score test is calculated using the vector of scores and the generalized inverse of their variance-covariance matrix, performed by the *Ginv* function. This function determines the rank of the variance matrix by its singular value decomposition, and an epsilon value is used as the cut-off for small singular values. If all of the haplotypes in the sample are scored, then there is dependence between them and the variance matrix is not of full rank. However, it is more often the case that one or more rare haplotypes are not scored because of low frequency. It is not clear how strong the dependencies are between the remaining score statistics, and likewise, calculating the rank of the variance matrix can be ambiguous. For these instances we give the user control over the epsilon parameter for *haplo.score* with *eps.svd*.
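The parameter is passed directly in the *haplo.score* call; the example below is a syntax sketch only, and the epsilon value shown is an arbitrary assumption:

```{r, scoreeps, echo=TRUE, eval=FALSE}
# loosen the rank cut-off used by Ginv (sketch; epsilon value is illustrative)
score.gaus.eps <- haplo.score(hla.demo$resp, geno, trait.type="gaussian",
                              min.count=5, locus.label=genolabel, eps.svd=1e-5)
```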
We have seen instances where the global score test had a very significant p-value, but none of the haplotype-specific scores showed strong association. In such instances, we found the default epsilon value in *Ginv* was incorrectly treating the variance matrix as having full rank, and the misleading global score test was corrected with a larger epsilon for *Ginv*.

## Haplotype Model Effect

The *haplo.score()* function allows non-additive effects for scoring haplotypes. The possible effects for haplotypes are additive, dominant, and recessive. Under recessive effects, fewer haplotypes may be scored, because subjects are required to be homozygous for haplotypes. Furthermore, there would have to be *min.count* such persons in the sample for the recessive effect to be scored. Therefore, a recessive model should only be used on samples with common haplotypes. In the example below with the gaussian response, set the haplotype effect to dominant using the parameter *haplo.effect="dominant"*. Notice the results change slightly compared to the *score.gaus.add* results above.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scoreDom, echo=TRUE}
# score w/gaussian, dominant effect
score.gaus.dom <- haplo.score(hla.demo$resp, geno, trait.type="gaussian",
                              x.adj=NA, min.count=5, haplo.effect="dominant",
                              locus.label=genolabel, simulate=FALSE)
print(score.gaus.dom, nlines=10)
```

## Simulation p-values

When *simulate=TRUE*, *haplo.score* gives simulated p-values. Simulated haplotype score statistics are the re-calculated score statistics from a permuted re-ordering of the trait and covariates and the original ordering of the genotype matrix. The simulated p-value for the global score statistic (*Global sim. p-val*) is the number of times the simulated global score statistic exceeds the observed, divided by the total number of simulations. Likewise, the simulated p-value for the maximum score statistic (*Max-stat sim.
p-val*) is the number of times the simulated maximum haplotype score statistic exceeds the observed maximum score statistic, divided by the total number of simulations. The maximum score statistic is the maximum of the squares of the haplotype-specific score statistics; it has an unknown distribution, so its significance can only be assessed by the simulated p-value. Intuitively, if only one or two haplotypes are associated with the trait, the maximum score statistic should have greater power to detect association than the global statistic.

The *score.sim.control* function manages control parameters for simulations. The *haplo.score* function employs the simulation p-value precision criteria of [Besag and Clifford](https://doi.org/10.1093/biomet/78.2.301). These criteria ensure that the simulated p-values for both the global and the maximum score statistics are precise for small p-values. The algorithm performs a user-defined minimum number of permutations (*min.sim*) to guarantee sufficient precision for the simulated p-values for score statistics of individual haplotypes. Permutations beyond this minimum are then conducted until the sample standard errors of the simulated p-values for both the *global-stat* and *max-stat* score statistics are less than a threshold (*p.threshold* $\times$ p-value). The default value *p.threshold* $= \frac{1}{4}$ provides a two-sided 95% confidence interval for the p-value with a width that is approximately as wide as the p-value itself. Effectively, simulations are more precise for smaller p-values. The following example illustrates computation of simulation p-values with *min.sim=1000*.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, scorePerm, echo=TRUE}
# simulations when binary response
score.bin.sim <- haplo.score(y.bin, geno, trait.type="binomial",
                             x.adj = NA, locus.label=genolabel, min.count=5,
                             simulate=TRUE, sim.control = score.sim.control())
print(score.bin.sim)
```

Regression Models: *haplo.glm*
===================================================

The *haplo.glm* function computes the regression of a trait on haplotypes, and possibly other covariates and their interactions with haplotypes. We currently support the gaussian, binomial, and Poisson families of traits with their canonical link functions. The effects of haplotypes on the link function can be modeled as additive, dominant (heterozygotes and homozygotes for a particular haplotype assumed to have equivalent effects), or recessive (homozygotes of a particular haplotype considered to have an alternative effect on the trait). The basis of the algorithm is a two-step iteration process: the posterior probabilities of pairs of haplotypes per subject are used as weights to update the regression coefficients, and the regression coefficients are used to update the haplotype posterior probabilities. See [Lake et al.](https://pubmed.ncbi.nlm.nih.gov/12890927/) for details.

## New and Updated Methods for haplo.glm

We initially wrote *haplo.glm* with a focus on creating a basic print method for results. We have since refined the *haplo.glm* class to look and act as much like a glm object as possible, with methods defined specifically for the *haplo.glm* class. We provide print and summary methods that make use of the corresponding methods for glm and then add extra information for the haplotypes and their frequencies. Furthermore, we have defined for the *haplo.glm* class some of the standard methods for regression fits, including residuals, fitted.values, vcov, and anova. We later describe the challenges that haplotype regression presents for these methods.
## Preparing the *data.frame* for *haplo.glm*

A critical distinction between *haplo.glm* and all other functions in Haplo Stats is that the definition of the regression model follows the S/R formula standard (see *lm* or *glm*). So, a *data.frame* must be defined, and this object must contain the trait and other optional covariates, plus a special kind of genotype matrix (*geno.glm* for this example) that contains the genotypes of the marker loci. We require the genotype matrix to be prepared using *setupGeno()*, which handles character, numeric, or factor alleles, and keeps the columns of the genotype matrix together as a single unit when inserting into (and extracting from) a *data.frame*. The *setupGeno* function recodes all missing genotype value codes given by *miss.val* to NA, and also recodes alleles to integer values. The original allele codes are preserved within an attribute of *geno.glm*, and are utilized within *haplo.glm*. The returned object has class *model.matrix*, and it can be included in a *data.frame* to be used in *haplo.glm*.

In the example below we prepare a genotype matrix, *geno.glm*, and create a *data.frame* object, *glm.data*, for use in *haplo.glm*.

```{r, glmGeno, echo=TRUE}
# set up data for haplo.glm, include geno.glm,
# covariates age and male, and responses resp and y.bin
geno <- hla.demo[,c(17,18,21:24)]
geno.glm <- setupGeno(geno, miss.val=c(0,NA), locus.label=genolabel)
attributes(geno.glm)
y.bin <- 1*(hla.demo$resp.cat=="low")
glm.data <- data.frame(geno.glm, age=hla.demo$age, male=hla.demo$male,
                       y=hla.demo$resp, y.bin=y.bin)
```

## Rare Haplotypes

The decision of which haplotypes to use for association is critical in *haplo.glm*. By default it models a rare haplotype effect, so that the effects of the other haplotypes are in reference to the baseline effect of the most common haplotype.
The rules for choosing haplotypes to be modeled in *haplo.glm* are similar to the rules in *haplo.score*: by a minimum frequency or a minimum expected count in the sample. Two parameters in *haplo.glm.control* govern this setting: *haplo.freq.min* may be set to a chosen minimum haplotype frequency, and *haplo.min.count* may be set to the cut-off for minimum expected haplotype count in the sample. The default minimum frequency cut-off in *haplo.glm* is $0.01$. More discussion on rare haplotypes follows later.

## Regression for a Quantitative Trait

The following illustrates how to fit a regression of a quantitative trait *y* on the haplotypes estimated from the *geno.glm* matrix, and the covariate *male*. For *na.action*, we use *na.geno.keep*, which keeps a subject with missing values in the genotype matrix as long as they are not missing all alleles, but removes subjects with missing values (*NA*) in either the response or covariates.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmGauss, echo=TRUE}
# glm fit with haplotypes, additive gender covariate on gaussian response
fit.gaus <- haplo.glm(y ~ male + geno.glm, family=gaussian, data=glm.data,
                      na.action="na.geno.keep", locus.label = genolabel,
                      x=TRUE, control=haplo.glm.control(haplo.freq.min=.02))
summary(fit.gaus)
```

The summary function for *haplo.glm* shows much the same information as summary for glm objects, with an extra table for the haplotype frequencies. The *Coefficients* table above lists the estimated regression coefficients (*coef*), standard errors (*se*), the corresponding t-statistics (*t.stat*), and p-values (*pval*). The labels for haplotype coefficients are a concatenation of the name of the genotype matrix (*geno.glm*) and unique haplotype codes assigned within *haplo.glm*. The haplotypes corresponding to these haplotype codes are listed in the *Haplotypes* table, along with the estimates of the haplotype frequencies (*hap.freq*).
The rare haplotypes, those with expected counts less than *haplo.min.count=5* (equivalent to having frequencies less than *haplo.freq.min* $= 5/(2 \times 220) = .01136$ in the above example), are pooled into a single category labeled *geno.glm.rare*. The haplotype chosen as the baseline category for the design matrix (the most frequent haplotype, by default) is labeled *haplo.base*.

## Fitting Haplotype x Covariate Interactions

Interactions are fit with the standard S-language model syntax, using a '$*$' in the model formula to indicate main effects and interactions. Some other formula constructs are not supported, so use the formula parameter with caution. Below is an example of modeling the interaction of *male* and the haplotypes. Because more terms will be estimated in this case, we limit how many haplotypes are included by increasing *haplo.min.count* to 10.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmInter, echo=TRUE}
# glm fit haplotypes with covariate interaction
fit.inter <- haplo.glm(formula = y ~ male * geno.glm, family = gaussian,
                       data=glm.data, na.action="na.geno.keep",
                       locus.label = genolabel,
                       control = haplo.glm.control(haplo.min.count = 10))
summary(fit.inter)
```

The listed results are similar to those explained above. The main difference is that the interaction coefficients are labeled as a concatenation of the covariate name (*male* in this example) and the name of the haplotype, as described above. In addition, estimates may differ because the model has changed.

## Regression for a Binomial Trait

Next we illustrate the fitting of a binomial trait with the same genotype matrix and covariate.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmBinom, echo=TRUE}
# gender and haplotypes fit on binary response,
# return model matrix
fit.bin <- haplo.glm(y.bin ~ male + geno.glm, family = binomial,
                     data=glm.data, na.action = "na.geno.keep",
                     locus.label=genolabel,
                     control = haplo.glm.control(haplo.min.count=10))
summary(fit.bin)
```

The underlying methods for *haplo.glm* are based on a prospective likelihood. Normally, this type of likelihood works well for case-control studies with standard covariates. For ambiguous haplotypes, however, one needs to be careful when interpreting the results of fitting *haplo.glm* to case-control data. Because cases are over-sampled relative to the population prevalence (or incidence, for incident cases), haplotypes associated with disease will be over-represented in the case sample, and so estimates of haplotype frequencies will be biased. Positively associated haplotypes will have haplotype frequency estimates that are higher than the population haplotype frequency. To avoid this problem, one can weight each subject. The weights for the cases should be the population prevalence, and the weights for controls should be 1 (assuming the disease is rare in the population, and controls are representative of the general population). See [Stram et al. 2003](https://pubmed.ncbi.nlm.nih.gov/14566096/) for background on using weights, and see the help file for *haplo.glm* for how to implement weights. The estimated regression coefficients for case-control studies can be biased by either a large amount of haplotype ambiguity with mis-specified weights, or by departures from Hardy-Weinberg Equilibrium of the haplotypes in the pool of cases and controls. Generally, the bias is small, but tends to be towards the null of no association. See [Stram et al. 2003](https://pubmed.ncbi.nlm.nih.gov/14566096/) and [Epstein and Satten, 2003](https://pubmed.ncbi.nlm.nih.gov/14631556/) for further details.
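The weighting scheme described above can be sketched in a few lines of base R. The prevalence value here is hypothetical, and the case/control indicator is toy data; see the *haplo.glm* help file for how to supply the weights in an actual fit:

```r
# assumed population prevalence of disease (hypothetical value)
prev <- 0.10

# toy case/control indicator: 1 = case, 0 = control
y.cc <- c(1, 0, 0, 1, 0)

# cases are weighted by the prevalence; controls get weight 1
wts <- ifelse(y.cc == 1, prev, 1)
wts    # 0.1 1.0 1.0 0.1 1.0
```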
### Caution on Rare Haplotypes with Binomial Response

If a rare haplotype occurs only in cases or only in controls, the fitted values go to 0 or 1, and R will issue a warning. Also, the coefficient estimate for that haplotype goes to positive or negative infinity. If the default *haplo.min.count=5* were used above, this warning would appear. To keep this from occurring in other model fits, increase the minimum count or minimum frequency.

## Control Parameters

Additional parameters are handled using *control*, a list of parameters providing additional functionality in *haplo.glm*. This list is set up by the function *haplo.glm.control*. See the help file (*help(haplo.glm.control)*) for a full list of control parameters, with details of their usage. Some of the options are described here.

## Controlling Genetic Models: *haplo.effect*

The *haplo.effect* control parameter for *haplo.glm* instructs whether the haplotype effects are fit as additive, dominant, or recessive. That is, *haplo.effect* determines whether the covariate ($x$) coding of haplotypes follows the values in Table 1 for each effect type. Heterozygous means a subject has one copy of a particular haplotype, and homozygous means a subject has two copies of a particular haplotype.

Table 1: Covariate coding of a haplotype's effect

| Hap-Pair     | additive | dominant | recessive |
|--------------|:--------:|:--------:|:---------:|
| Heterozygous |    1     |    1     |     0     |
| Homozygous   |    2     |    1     |     1     |

Note that in a recessive model, the haplotype effects are estimated only from subjects who are homozygous for a haplotype. Some of the haplotypes that meet the *haplo.freq.min* and *haplo.min.count* cut-offs may occur as homozygous in only a few subjects. As stated above, recessive models should only be used when the region has multiple common haplotypes.
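The codings in Table 1 follow directly from the number of copies of a haplotype a subject carries. A minimal base-R sketch (the vector of copy counts is hypothetical toy data):

```r
# copies of a given haplotype carried by each subject (0, 1, or 2)
copies <- c(0, 1, 2, 1, 0, 2)

additive  <- copies              # count the copies directly
dominant  <- 1 * (copies >= 1)   # any copy scores 1
recessive <- 1 * (copies == 2)   # only homozygotes score 1

cbind(copies, additive, dominant, recessive)
```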
The default *haplo.effect* is *additive*; the example below illustrates the fit of a *dominant* effect of haplotypes for the gaussian trait with the gender covariate.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmDom, echo=TRUE}
# control dominant effect of haplotypes (haplo.effect)
# by using haplo.glm.control
fit.dom <- haplo.glm(y ~ male + geno.glm, family = gaussian, data = glm.data,
                     na.action = "na.geno.keep", locus.label = genolabel,
                     control = haplo.glm.control(haplo.effect='dominant',
                                                 haplo.min.count=8))
summary(fit.dom)
```

### Selecting the Baseline Haplotype

The haplotype chosen as the baseline in the model is the one with the highest frequency. Sometimes the most frequent haplotype may itself be an at-risk haplotype, so a measure of its effect is desired. To specify a more appropriate haplotype as the baseline in the binomial example, choose from the list of other common haplotypes, *fit.bin$haplo.common*. To specify an alternative baseline, such as haplotype 77, use the control parameter *haplo.base* with the haplotype code, as in the example below.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmBaseline, echo=TRUE}
# control baseline selection, perform the same exact run as fit.bin,
# but different baseline by using haplo.base chosen from haplo.common
fit.bin$haplo.common
fit.bin$haplo.freq.init[fit.bin$haplo.common]
fit.bin.base77 <- haplo.glm(y.bin ~ male + geno.glm, family = binomial,
                            data = glm.data, na.action = "na.geno.keep",
                            locus.label = genolabel,
                            control = haplo.glm.control(haplo.base=77,
                                                        haplo.min.count=8))
summary(fit.bin.base77)
```

The above model has the same haplotypes as *fit.bin*, except haplotype $4$, the old baseline, now has an effect estimate, while haplotype $77$ is the new baseline.
Due to randomness in the starting values of the haplotype frequency estimation, different runs of *haplo.glm* may result in a different set of haplotypes meeting the minimum count requirement for being modeled. Therefore, once you have arrived at a suitable model and wish to modify it by changing the baseline and/or effects, you can make results consistent by controlling the randomness using *set.seed*. In this document, we use the same seed before making *fit.bin* and *fit.bin.base77*.

### Rank of Information Matrix and eps.svd

As with the *eps.svd* parameter for *haplo.score* described above, we give the user control over the epsilon parameter that determines the number of singular values used when determining the rank of the information matrix in *haplo.glm*. Finding the generalized inverse of this matrix can be problematic when either the response variable or a covariate has a large variance and is not scaled before being passed to *haplo.glm*. The rank of the information matrix is determined by the number of singular values greater than a small cutoff, epsilon. When the singular values for the coefficients are on a larger numeric scale than those for the haplotype frequencies, the generalized inverse may incorrectly determine that the information matrix is not of full rank. Therefore, we allow the user to specify the epsilon as *eps.svd* in the control parameters for *haplo.glm*. A simpler fix, which we strongly suggest, is to pre-scale any continuous responses or covariates with a large variance.

Here we demonstrate what happens when we multiply a gaussian response by $50$, increasing its variance by a factor of $2500$. We see that the coefficients are all highly significant and the rank of the information matrix is much smaller than in the scaled gaussian fit.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmEPS1, eval=TRUE}
glm.data$ybig <- glm.data$y*50
fit.gausbig <- haplo.glm(formula = ybig ~ male + geno.glm, family = gaussian,
                         data = glm.data, na.action = "na.geno.keep",
                         locus.label = genolabel,
                         control = haplo.glm.control(haplo.freq.min = 0.02),
                         x = TRUE)
summary(fit.gausbig)
fit.gausbig$rank.info
fit.gaus$rank.info
```

Now we set a smaller value for the *eps.svd* control parameter and find the fit matches the original gaussian fit.

```{r, echo=FALSE, eval=TRUE}
set.seed(seed)
```

```{r, glmEPS2}
fit.gausbig.eps <- haplo.glm(formula = ybig ~ male + geno.glm,
                             family = gaussian, data = glm.data,
                             na.action = "na.geno.keep",
                             locus.label = genolabel,
                             control = haplo.glm.control(eps.svd=1e-10,
                                                         haplo.freq.min = 0.02),
                             x = TRUE)
summary(fit.gausbig.eps)
fit.gausbig.eps$rank.info
```

### Rare Haplotypes and *haplo.min.info*

Another notable control parameter is *haplo.min.info*, the minimum frequency for a rare haplotype to be included in the calculation of the standard errors (se) of the coefficients. The default value is $0.001$, which means that haplotypes with frequency less than that will contribute to the rare haplotype coefficient estimate, but will not be used in the standard error calculation. The following example demonstrates a possible result when dealing with the rare haplotype effect. We use the HLA genotype data to show one consequence of this setting; however, we force it to happen by setting *haplo.freq.min* equal to *haplo.min.info*, a practice we strongly advise against in real analyses.
```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, glmRare2, echo=TRUE}
## set haplo.freq.min and haplo.min.info to same value to show how the
## rare coefficient may be modeled but standard error estimate is not
## calculated because all haps are below haplo.min.info
## warning expected
fit.bin.rare02 <- haplo.glm(y.bin ~ geno.glm, family = binomial,
                            data = glm.data, na.action = "na.geno.keep",
                            locus.label = genolabel,
                            control = haplo.glm.control(haplo.freq.min=.02,
                                                        haplo.min.info=.02))
summary(fit.bin.rare02)
```

The above results show the standard error for the rare haplotype coefficient is "NaN" ("Not a Number" in R), a consequence of having most, or all, of the rare haplotypes discarded for the standard error estimate. In other datasets there may be only a few haplotypes with frequencies between *haplo.min.info* and *haplo.freq.min*, which may yield misleading results for the rare haplotype coefficient. For this reason, we recommend that any inference made on the rare haplotypes be made with caution, if at all.

Methods for *haplo.glm*
================================

After the introduction of *haplo.glm*, the package was updated with methods that allow it to act like a glm object, with methods to compare and assess model fits. In this section we describe the challenges and caveats of defining these methods for a *haplo.glm* object, and show how to use them.

## fitted.values

A challenge in defining methods for *haplo.glm* is that we must account for the ambiguity in a subject's haplotype pairs. To handle this in the glm framework, the response and non-haplotype covariates are expanded over each person's possible haplotype pairs, with the posterior probability of each haplotype pair given the genotype serving as a weight. The returned object from *haplo.glm* looks somewhat like a regular glm object, but the model matrix, the response, and thus the fitted values, are all expanded. Users who want to work with the expanded versions of those items are welcome to access them from the returned object.
We now provide a method to get the fitted values for each person, *fitted.haplo.glm*. These collapsed fitted values are calculated as a weighted sum of the expanded fitted values for each person, where the weights are the posterior probabilities of the expanded haplotype pairs.

## residuals

The residuals within the *haplo.glm* object are also expanded over the haplotype pairs for each subject. We provide *residuals.haplo.glm* to get the collapsed deviance, pearson, working, and response residuals for each person. Because we have not implemented a predict method for *haplo.glm*, the method does not calculate partial residuals.

## vcov

We provide *vcov.haplo.glm* as a method to get the variance-covariance matrix of the model parameters in the *haplo.glm* object. Unlike the standard glm object, this matrix is computed and retained in the returned object. We do this because the model parameters are the model coefficients together with the haplotype frequencies, and the matrix is computationally intensive to compute. We show how to get the variance matrix for all the parameters, and for only the model coefficients.

```{r, vcovGLM}
varmat <- vcov(fit.gaus)
dim(varmat)
varmat <- vcov(fit.gaus, freq=FALSE)
dim(varmat)
print(varmat, digits=2)
```

## anova and Model Comparison

We use the *anova.glm* method as a framework for *anova.haplo.glm*, which allows comparisons of model fits. We limit the model comparisons to multiple nested model fits, which requires that each model to be compared be either a *haplo.glm* or *glm* fitted object. We eliminate the functionality of testing sub-models of a single fit, because removal of a single covariate would require re-fitting the reduced model to get updated coefficient and haplotype frequency estimates with a maximized log-likelihood. We decided to simplify the usage and require that all models to be compared are fully fitted. As with the *anova.glm* method, it is difficult to check for truly nested models, so we pass that responsibility on to the user.
We discuss some of the requirements. One type of two-model comparison is between a model with haplotypes (expanded subjects) and a reduced model without haplotypes. We check for the same sample size in these models by comparing the collapsed sample size from the *haplo.glm* fit to the sample size from the *glm* fit, which we remind users is only a loose check of model comparability. The other comparison in *anova.haplo.glm* is between two models that contain the same genotypes, and inherently the same haplotypes. This is trickier, because a subject may not have the same expanded set of possible haplotype pairs across two fits of *haplo.glm* unless the same seed is set before both fits. Even if the seed is the same, the other effects that differ between the two models will affect the haplotype frequency estimates, and may still result in a different expansion of haplotype pairs per subject. Our check of the collapsed sample size for the two models still applies, with the same pitfalls, but a better assurance of model comparability is to use the same seed.

In the *haplo.glm* fit we provide the likelihood ratio test of the null model against the full model, which is the most appropriate test available for *haplo.glm* objects, but it is difficult to compare the log-likelihood across two *haplo.glm* fits. Therefore, we remain consistent with the glm model comparison proposed by McCullagh and Nelder (1989), and use the difference in deviance to compare models. Furthermore, we restrict the asymptotic test for model comparison to the $\chi^{2}$ test for goodness of fit. Below we show how to get the LRT from the *fit.gaus* result, then show how to compare some of the nested models fit above, including a regular glm fit of $y \sim male$. The anova method requires the nested model to be given first, and any anova call with a *haplo.glm* object should explicitly call *anova.haplo.glm*.
```{r, anovaGLM}
fit.gaus$lrt
glmfit.gaus <- glm(y~male, family="gaussian", data=glm.data)
anova.haplo.glm(glmfit.gaus, fit.gaus)
anova.haplo.glm(fit.gaus, fit.inter)
anova.haplo.glm(glmfit.gaus, fit.gaus, fit.inter)
```

Extended Applications
=================================

The following functions wrap the functionality of the major functions in Haplo Stats into other useful applications.

## Combine Score and Group Results: *haplo.score.merge*

When analyzing a qualitative trait, such as a binary trait, it can be helpful to align the results from *haplo.score* with those from *haplo.group*. To do so, use the function *haplo.score.merge*, as illustrated in the following example:

```{r, scoreMerge, echo=TRUE}
# merge haplo.score and haplo.group results
merge.bin <- haplo.score.merge(score.bin, group.bin)
print(merge.bin, nlines=10)
```

The first column is a row index, the next columns (3 in this example) give the haplotype alleles, the *Hap.Score* column is the score statistic, and *p.val* is the corresponding $\chi^{2}$ p-value. *Hap.Freq* is the haplotype frequency for the total sample, and the remaining columns are the estimated haplotype frequencies for each of the group levels (*y.bin* in this example). The default print method only prints results for haplotypes appearing in the *haplo.score* output. To view all haplotypes, use the print option *all.haps=TRUE*, which prints all haplotypes from the *haplo.group* output. The output is ordered by the score statistic, but the *order.by* parameter can specify ordering by haplotypes or by haplotype frequencies.

## Case-Control Haplotype Analysis: *haplo.cc*

We provide *haplo.cc* to run and combine the results of *haplo.score*, *haplo.group*, and *haplo.glm* for case-control data. The function performs a score test and a glm on the same haplotypes. The parameters that determine which haplotypes are used are *haplo.min.count* and *haplo.freq.min*, which are set in the *control* parameter, as done for *haplo.glm*.
Below we run *haplo.cc*, setting the minimum haplotype frequency to $0.02$. The print results are shown, in addition to the names of the objects stored in the *cc.hla* result.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, hapCC, echo=TRUE}
# demo haplo.cc where haplo.min.count is specified
# use geno, and this function prepares it for haplo.glm
y.bin <- 1*(hla.demo$resp.cat=="low")
cc.hla <- haplo.cc(y=y.bin, geno=geno, locus.label = genolabel,
                   control=haplo.glm.control(haplo.freq.min=.02))
print(cc.hla, nlines=25, digits=2)
names(cc.hla)
```

First, from the names function we see that *cc.hla* contains *score.lst* and *fit.lst*, which are the *haplo.score* and *haplo.glm* objects, respectively. In the printed results of *haplo.cc*, first are the global statistics from *haplo.score*, followed by cell counts for cases and controls. The last portion of the output is a data frame containing combined results for individual haplotypes:

+ **Hap-Score**: haplotype score statistic
+ **p-val**: haplotype score statistic p-value
+ **sim p-val**: (if simulations performed) simulated p-value for the haplotype score statistic
+ **pool.hf**: haplotype frequency for the pooled sample
+ **control.hf**: haplotype frequency for the control sample only
+ **case.hf**: haplotype frequency for the case sample only
+ **glm.eff**: one of three ways the haplotype appeared in the glm model: *Eff*, modeled as an effect; *Base*, part of the baseline; *R*, a rare haplotype, included in the effect of pooled rare haplotypes
+ **OR.lower**: odds ratio confidence interval lower limit
+ **OR**: odds ratio for each effect in the model
+ **OR.upper**: odds ratio confidence interval upper limit

Significance levels are indicated by the p-values for the score statistics, and by the odds ratio (OR) confidence intervals for the haplotype effects.
Note that the odds ratios are effect sizes of haplotypes, assuming haplotype effects are multiplicative. Since this last table has many columns, lines are wrapped in the output in this manual. You can align wrapped lines by the haplotype code, which appears on the far left. Alternatively, instruct the print function to print only *digits* significant digits, and set the width for output in your session using the *options()* function.

## Score Tests on Sub-Haplotypes: *haplo.score.slide*

To evaluate the association of sub-haplotypes (subsets of alleles from the full haplotype) with a trait, the user can evaluate a "window" of alleles with *haplo.score*, and slide this window across the entire haplotype. This procedure is implemented by the function *haplo.score.slide*. To illustrate this method, we use all 11 loci in the demo data, *hla.demo*. First, make the geno matrix and the locus labels for the 11 loci. Then use *haplo.score.slide* with a window of 3 loci (*n.slide=3*), which will slide along the haplotype for all 9 contiguous subsets of size 3, using the previously defined gaussian trait *resp*.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, label=scoreSlide, echo=TRUE, eval=TRUE}
# haplo.score on 11 loci, slide on 3 consecutive loci at a time
geno.11 <- hla.demo[,-c(1:4)]
label.11 <- c("DPB","DPA","DMA","DMB","TAP1","TAP2","DQB","DQA","DRB","B","A")
score.slide.gaus <- haplo.score.slide(hla.demo$resp, geno.11,
                                      trait.type = "gaussian", n.slide=3,
                                      min.count=5, locus.label=label.11)
print(score.slide.gaus)
```

The first column is the row index of the nine calls to *haplo.score*, the second column is the number of the starting locus of the sub-haplotype, and the third column is the global score statistic p-value for each call. The last two columns are the simulated p-values for the global and maximum score statistics, respectively; the simulated p-values are present only if you specify *simulate=TRUE* in the function call.
### Plot Results from *haplo.score.slide*

The results from *haplo.score.slide* can be easily viewed in the plot shown below. The x-axis has tick marks for each locus, and the y-axis is $-\log_{10}(pval)$. To select which p-value to plot, use the parameter *pval*, with choices "*global*", "*global.sim*", and "*max.sim*" corresponding to the p-values described above. If the simulated p-values were not computed, the default is to plot the global p-values. For each p-value, a horizontal line is drawn at a height of $-\log_{10}(pval)$ across the loci over which it was calculated. For example, the p-value *score.global.p = 0.009963* for loci 8-10 is plotted as a horizontal line at $y=2.002$ spanning the $8^{th}$, $9^{th}$, and $10^{th}$ x-axis tick marks.

```{r, label=plotSlide, fig=TRUE, echo=TRUE}
# plot global p-values for sub-haplotypes from haplo.score.slide
plot.haplo.score.slide(score.slide.gaus)
```

## Scanning Haplotypes Within a Fixed-Width Window: *haplo.scan*

Another method to search for a candidate locus within a genome region is *haplo.scan*, an implementation of the method proposed by [Cheng et al. (2005)](https://doi.org/10.1046/j.1529-8817.2004.00140.x). This method searches for the region whose haplotypes have the strongest association with a binary trait by sliding a window of fixed width over each marker locus, and then scanning over all haplotype lengths within each window. This latter step, scanning over all possible haplotype lengths within a window, distinguishes *haplo.scan* from *haplo.score.slide* (which considers only the maximum haplotype length within a window). To account for unknown linkage phase, the function *haplo.em* is called prior to scanning, to create a list of haplotype pairs and posterior probabilities. To illustrate the scanning window, consider a 10-locus dataset.
When placing a window of width 3 over locus 5, the possible haplotype lengths that contain locus 5 are three (loci 3-4-5, 4-5-6, and 5-6-7), two (loci 4-5 and 5-6), and one (locus 5). For each of these locus subsets a score statistic is computed, based on the difference between the mean vector of haplotype counts for cases and that for controls. The maximum of these score statistics, over all possible haplotype lengths within a window, is the locus-specific test statistic, or the locus scan statistic. The global test statistic is the maximum over all computed score statistics. To compute p-values, the case/control status is randomly permuted.

Below we run *haplo.scan* on the 11-locus HLA dataset with a binary response and a window width of $3$. Because all $11$ polymorphic loci have many missing alleles, we first use the results of *summaryGeno* to select subjects with fewer than $50,000$ possible haplotype pairs, to speed the calculations.

```{r, eval=TRUE, echo=FALSE}
set.seed(seed)
```

```{r, hapScan, echo=TRUE, eval=TRUE}
geno.11 <- hla.demo[,-c(1:4)]
y.bin <- 1*(hla.demo$resp.cat=="low")
hla.summary <- summaryGeno(geno.11, miss.val=c(0,NA))

# track those subjects with too many possible haplotype pairs ( > 50,000)
many.haps <- (1:length(y.bin))[hla.summary[,4] > 50000]

# For speed, or even just so it will finish, make y.bin and geno.scan
# for genotypes that don't have too many ambiguous haplotypes
geno.scan <- geno.11[-many.haps,]
y.scan <- y.bin[-many.haps]

# scan haplotypes for regions within width of 3 for each locus.
# test statistic measures difference in haplotype counts in cases and controls
# p-values are simulated for each locus and the maximum statistic.
# we do 100 simulations here; should use default settings for analysis
scan.hla <- haplo.scan(y.scan, geno.scan, width=3,
                       sim.control=score.sim.control(min.sim=100, max.sim=100),
                       em.control=haplo.em.control())
print(scan.hla)
```

In the output we report the simulated p-values for each locus test statistic.
Additionally, we report the loci (or locus) that provided the maximum observed test statistic, and the *Max-Stat Simulated Global p-value* is the simulated p-value for that maximum statistic. We print the number of simulations, because simulations are performed until p-value precision criteria are met; we would typically let them run under the default control parameters rather than limiting them to $100$ as above.

## Creating Haplotype Effect Columns: *haplo.design*

In some instances, the desired model for haplotype effects is not possible with the methods given in *haplo.glm*. Examples include modeling just one haplotype effect, modeling an interaction of haplotypes from different chromosomes, or analyzing censored data. To circumvent these limitations, we provide a function called *haplo.design*, which sets up an expected haplotype design matrix from a *haplo.em* object, creating columns that can be used to model haplotype effects in other modeling functions. The function *haplo.design* first creates a design matrix for all pairs of haplotypes over all subjects, and then uses the posterior probabilities to create a weighted average contribution for each subject, so that the number of rows of the final design matrix equals the number of subjects. This is sometimes called the expectation-substitution method, as proposed by [Zaykin et al. 2002](https://pubmed.ncbi.nlm.nih.gov/12037407/); using this haplotype design matrix in a regression model is asymptotically equivalent to the score statistics from *haplo.score* ([Xie and Stram 2005](https://pubmed.ncbi.nlm.nih.gov/16025443/)). Although this provides much flexibility, because the design matrix can be used in any type of regression model, the estimated regression parameters can be biased toward zero (see [Lin and Zeng 2006](https://doi.org/10.1198/016214505000000808) for concerns about the expectation-substitution method).
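To make the expectation-substitution idea concrete, below is a minimal sketch in base R. The toy table *em.toy* and its column names are invented for illustration; a real *haplo.em* object stores the analogous information (subject index, codes for the two haplotypes of each enumerated pair, and the posterior probability of the pair) in its components.

```r
# Toy posterior table: each row is one possible haplotype pair for a subject,
# with its posterior probability (columns are hypothetical, for illustration)
em.toy <- data.frame(
  subj = c(1, 1, 2),
  hap1 = c("A", "A", "B"),
  hap2 = c("B", "C", "B"),
  post = c(0.7, 0.3, 1.0)
)
haps <- sort(unique(c(em.toy$hap1, em.toy$hap2)))

# Expectation-substitution with additive coding: for each subject, count each
# haplotype in every possible pair, then average the counts over the posteriors
dose <- t(sapply(split(em.toy, em.toy$subj), function(d) {
  counts <- sapply(haps, function(h) (d$hap1 == h) + (d$hap2 == h))
  counts <- rbind(counts)   # keep matrix form when a subject has only one pair
  colSums(d$post * counts)  # posterior-weighted haplotype dosage
}))
dose
# subject 1: A = 1.0, B = 0.7, C = 0.3; subject 2: B = 2
```

Feeding columns like these to *lm* or *glm* is essentially what *haplo.design* automates, with the additional choices of dominant/recessive coding, a baseline haplotype, and a minimum haplotype count.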
In the first example below, using default parameters, the returned data.frame contains a column for each haplotype that meets a minimum count in the sample (*min.count*). The columns are named by the codes assigned in *haplo.em*.

```{r, label=hapDesign1, echo=TRUE}
# create a matrix of haplotype effect columns from a haplo.em result
hap.effect.frame <- haplo.design(save.em)
names(hap.effect.frame)
hap.effect.frame[1:10,1:8]
```

Additionally, *haplo.design* gives the user flexibility to make a more specific design matrix with the following parameters:

+ *hapcodes*: codes assigned in the *haplo.em* object; only these haplotypes are made into effect columns
+ *haplo.effect*: the coding of haplotypes as additive, dominant, or recessive
+ *haplo.base*: code for the baseline haplotype
+ *min.count*: minimum haplotype count

The second example below creates columns for the specific haplotype codes that were most interesting in *score.gaus.add*: haplotypes with alleles 21-3-8 and 62-2-7, corresponding to codes 4 and 138 in *haplo.em*, respectively. Assume we want to test their individual effects when they are coded with *haplo.effect="dominant"*.

```{r, label=hapDesign2, echo=TRUE}
# create haplotype effect columns for haplotypes 4 and 138
hap4.hap138.frame <- haplo.design(save.em, hapcodes=c(4,138),
                                  haplo.effect="dominant")
hap4.hap138.frame[1:10,]
dat.glm <- data.frame(resp=hla.demo$resp, male=hla.demo$male,
                      age=hla.demo$age,
                      hap.4=hap4.hap138.frame$hap.4,
                      hap.138=hap4.hap138.frame$hap.138)
glm.hap4.hap138 <- glm(resp ~ male + age + hap.4 + hap.138,
                       family="gaussian", data=dat.glm)
summary(glm.hap4.hap138)
```

License and Warranty
==============

License:

Copyright 2003 Mayo Foundation for Medical Education and Research.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

| Free Software Foundation, Inc.
| 59 Temple Place, Suite 330
| Boston, MA 02111-1307 USA

For other licensing arrangements, please contact Daniel J. Schaid.

| Daniel J. Schaid, Ph.D.
| Division of Biostatistics
| Harwick Building - Room 775
| Mayo Clinic
| 200 First St., SW
| Rochester, MN 55905
| phone: 507-284-0639
| fax: 507-284-9542
| email: schaid@mayo.edu

Acknowledgements
==============

This research was supported by United States Public Health Services, National Institutes of Health; contract grant numbers R01 DE13276, R01 GM 65450, N01 AI45240, and R01 2AI33144. The *hla.demo* data is kindly provided by Gregory A. Poland, M.D. and the Mayo Vaccine Research Group for illustration only, and may not be used for publication.

Appendix
==============

## Counting Haplotype Pairs When Marker Phenotypes Have Missing Alleles

The following describes the process for counting the number of haplotype pairs that are consistent with a subject's observed marker phenotypes, allowing some loci to have missing data. Note that we refer to marker phenotypes, but our algorithm is oriented towards typical markers that have a one-to-one correspondence with their genotypes. We first describe how to count when none of the loci have missing alleles, and then generalize to allow loci to have either one or two missing alleles.
When there are no missing alleles, note that homozygous loci are not ambiguous with respect to the underlying haplotypes, because at these loci the underlying haplotypes will not differ if we interchange alleles between haplotypes. In contrast, heterozygous loci are ambiguous, because we do not know the haplotype origin of the distinguishable alleles (i.e., unknown linkage phase). However, if there is only one heterozygous locus, then it does not matter if we interchange alleles, because the pair of haplotypes will be the same. In this situation, if the parental origin of alleles were known, then interchanging alleles would switch the parental origin of the haplotypes, but not the composition of the haplotypes. Hence, ambiguity arises only when there are at least two heterozygous loci. For each heterozygous locus beyond the first, the number of possible haplotype pairs increases by a factor of 2, because we interchange the two alleles at each heterozygous locus to create all possible pairs of haplotypes. Hence, the number of possible haplotype pairs can be expressed as $2^{x}$, where $x = H-1$ if $H$ (the number of heterozygous loci) is at least 2; otherwise $x = 0$.

Now consider a locus with missing alleles. The possible alleles at a given locus are considered to be those that are actually observed in the data. Let $a_{i}$ denote the number of distinguishable alleles at the $i^{th}$ locus. To count the number of underlying haplotypes that are consistent with the observed and missing marker data, we need to enumerate all possible genotypes for the loci with missing data, and consider whether the imputed genotypes are heterozygous or homozygous. To develop our method, first consider how to count the number of genotypes at a locus, say the $i^{th}$ locus, when either one or two alleles are missing. This locus could have either a homozygous or a heterozygous genotype, and both possibilities must be considered by our counting method.
If the locus is considered homozygous and one allele is missing, there is only one possible genotype; if two alleles are missing, there are $a_{i}$ possible genotypes. A function to perform this counting for homozygous loci is denoted $f(a_{i})$. If the locus is considered heterozygous and one allele is missing, there are $a_{i}-1$ possible genotypes; if two alleles are missing, there are $\frac{a_{i}(a_{i}-1)}{2}$ possible genotypes. A function to perform this counting for heterozygous loci is denoted $g(a_{i})$. These functions and counts are summarized in Table A.1.

**Table A.1:** Factors applied when a locus with missing allele(s) is counted as homozygous ($f$) or heterozygous ($g$)

| Number of missing alleles | Homozygous function $f(a_{i})$ | Heterozygous function $g(a_{i})$ |
|---------------------------|--------------------------------|----------------------------------|
| 1                         | 1                              | $a_{i} - 1$                      |
| 2                         | $a_{i}$                        | $\frac{a_{i}(a_{i}-1)}{2}$       |

Now, to use these genotype counting functions to determine the number of possible haplotype pairs, first consider a simple case where only one locus, say the $i^{th}$ locus, has two missing alleles. Suppose that the phenotype has $H$ heterozygous loci ($H$ is the count of heterozygous loci among those without missing data). Considering whether the locus with missing data is either homozygous or heterozygous gives the count of possible haplotype pairs as

\begin{equation}
a_{i}2^{x} + \left[{\frac{a_{i}(a_{i}-1)}{2}}\right]2^{x+1}
\end{equation}

where again $x = H-1$ if $H$ is at least 2; otherwise $x = 0$. This special case can be represented by our more general genotype counting functions as

\begin{equation}
f(a_{i})\,2^{x} + g(a_{i})\,2^{x+1}
\end{equation}

When multiple loci have missing data, we need to sum over all possible combinations of heterozygous and homozygous genotypes for the incomplete loci.
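As a quick numerical check of equation (2), under illustrative values not tied to *hla.demo*: suppose the locus with two missing alleles has $a_{i} = 4$ observed alleles, and $H = 2$ of the fully typed loci are heterozygous, so $x = 1$. The count of possible haplotype pairs is then

$$
f(4)\,2^{1} + g(4)\,2^{2} = 4 \cdot 2 + \frac{4 \cdot 3}{2} \cdot 4 = 8 + 24 = 32
$$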
The rows of Table A.2 below present these combinations for up to $m=3$ loci with missing data. Note that as the number of heterozygous loci increases (across the columns of Table A.2), so does the exponent of 2. To calculate the total number of haplotype pairs, given observed and possibly missing genotypes, we sum the terms in Table A.2 across the appropriate row. For example, with $m=3$, there are eight terms to sum. The general formulation of this counting method can be expressed as

\begin{equation}
\mbox{Total Pairs} = \sum_{j=0}^{m} \sum_{combo} C(combo, j)
\end{equation}

where $combo$ is a particular pattern of heterozygous and homozygous loci among the loci with missing values (e.g., for $m = 3$, one combination is the first locus heterozygous and the $2^{nd}$ and $3^{rd}$ homozygous), and $C(combo, j)$ is the corresponding count for this pattern when $j$ of the incomplete loci are heterozygous (e.g., for $m = 3$ and $j = 1$, as illustrated in Table A.2).

**Table A.2:** Genotype counting terms when $m$ loci have missing alleles, grouped by the number $j$ of heterozygous loci (out of $m$)

| $m$ | $j=0$ of $m$ | $j=1$ of $m$ | $j=2$ of $m$ | $j=3$ of $m$ |
|-----|--------------|--------------|--------------|--------------|
| 0 | $2^{x}$ | | | |
| 1 | $f(a_{1})2^{x}$ | $g(a_{1})2^{x+1}$ | | |
| 2 | $f(a_{1})f(a_{2})2^{x}$ | $g(a_{1})f(a_{2})2^{x+1}$, $f(a_{1})g(a_{2})2^{x+1}$ | $g(a_{1})g(a_{2})2^{x+2}$ | |
| 3 | $f(a_{1})f(a_{2})f(a_{3})2^{x}$ | $g(a_{1})f(a_{2})f(a_{3})2^{x+1}$, $f(a_{1})g(a_{2})f(a_{3})2^{x+1}$, $f(a_{1})f(a_{2})g(a_{3})2^{x+1}$ | $g(a_{1})g(a_{2})f(a_{3})2^{x+2}$, $g(a_{1})f(a_{2})g(a_{3})2^{x+2}$, $f(a_{1})g(a_{2})g(a_{3})2^{x+2}$ | $g(a_{1})g(a_{2})g(a_{3})2^{x+3}$ |
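This counting scheme can be sketched in a few lines of R. The helper *count.hap.pairs* below is illustrative only (it is not a haplo.stats function; the package performs this count internally, e.g. for the *summaryGeno* output used earlier). Following the text, the exponent of 2 grows by one for each incomplete locus counted as heterozygous, which matches equation (2) and the pattern of Table A.2 when $H \geq 2$.

```r
# Count haplotype pairs consistent with one subject's genotypes, allowing
# some loci to have missing alleles (Table A.1 factors summed as in Table A.2).
#   H:      number of heterozygous loci among the fully typed loci
#   a:      distinguishable allele count at each locus with missing data
#   n.miss: number of missing alleles (1 or 2) at each such locus
count.hap.pairs <- function(H, a = integer(0), n.miss = integer(0)) {
  f <- function(ai, nm) if (nm == 1) 1 else ai                      # homozygous
  g <- function(ai, nm) if (nm == 1) ai - 1 else ai * (ai - 1) / 2  # heterozygous
  x <- if (H >= 2) H - 1 else 0
  m <- length(a)
  total <- 0
  # enumerate all 2^m homozygous/heterozygous assignments for incomplete loci
  for (k in 0:(2^m - 1)) {
    het <- bitwAnd(k, bitwShiftL(1, seq_len(m) - 1)) > 0
    j <- sum(het)
    term <- 2^(x + j)   # one extra factor of 2 per heterozygous incomplete locus
    for (i in seq_len(m)) {
      term <- term * if (het[i]) g(a[i], n.miss[i]) else f(a[i], n.miss[i])
    }
    total <- total + term
  }
  total
}

count.hap.pairs(H = 3)                     # no missing data: 2^(3-1) = 4
count.hap.pairs(H = 2, a = 4, n.miss = 2)  # equation (2): 4*2 + 6*4 = 32
```

The two calls above reproduce the no-missing-data count $2^{x}$ and the single-incomplete-locus case of equation (2).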