--- title: 'SNPassoc: an R package to perform whole genome association studies' author: - name: Juan R. Gonzalez affiliation: - Bioinformatics Research Group in Epidemiolgy (BRGE), Barcelona Institute for Global Health (ISGlobal) - Department of Mathematics, Autonomous University of Barcelona (UAB) - name: Victor Moreno affiliation: Catalan Institute of Oncology date: "`r Sys.Date()`" package: "`r paste('SNPassoc', packageVersion('SNPassoc'))`" output: rmarkdown::html_vignette: number_sections: yes toc: yes fig_caption: yes bibliography: multiomic.bib vignette: > %\VignetteIndexEntry{SNPassoc: an R package to perform whole genome association studies} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(comment="", warning=FALSE, message=FALSE, cache=TRUE) ``` ```{r resetUserConf, echo=FALSE, eval=TRUE} oldpar <- par(no.readonly = TRUE) on.exit(par(oldpar)) ``` # Introduction The `SNPassoc` package contains facilities for data manipulation, tools for exploratory data analysis, convenient graphical facilities, and tools for assessing genetic association for both quantitative and categorial (case-control) traits in whole genome approaches. Genome-based studies are normally analyzed using a multistage approach. In the first step researchers are interested in assessing association between the outcome and thousands of SNPs. In a second and possibly third step, medium/large scale studies are performed in which only a few hundred of SNPs, those with a putative association found in the first step, are genotyped. `SNPassoc` is specially designed for analyzing this kind of designs. In addition, a convenience-based approach (select variants on the basis of logistical considerations such as the ease and cost of genotyping) can also be analyzed using `SNPassoc`. Different genetic models are also implemented in the package. Analysis of multiple SNPs can be analyzed using either haplotype or gene-gene interaction approaches. This document is an updated version of the initial vignette that was published with the SNPassoc paper @gonzalez2007snpassoc. It contains a more realistic example belonging to a real dataset. The original vignette is still available [here](https://github.com/isglobal-brge/TeachingMaterials/blob/master/Master_Bioinformatics/Recommended_lectures/SupplementaryMaterial-BIOINF-2006-1602.R1.pdf). # Data loading SNP data are typically available in text format or Excel spreadsheets which are easily uploaded in `R` as a data frame. Here, as an illustrative example, we are analyzing a dataset containing epidemiological information and 51 SNPs from a case-control study on asthma. The data is available within `SNPassoc` and can be loaded by Then, the data is loaded into the R session by ```{r load_asthma} data(asthma, package = "SNPassoc") str(asthma, list.len=9) asthma[1:5, 1:8] ``` We observe that we have case-control status (0: control, 1: asthma) and another 4 variables encoding the country of origin, gender, age, body mass index (bmi) and smoking status (0: no smoker, 1: ex-smoker, 2: current smoker). There are 51 SNPs whose genotypes are given by the alleles names. # Descriptive analysis To start the analysis, we must indicate which columns of the dataset `asthma` contain the SNP data, using the `setupSNP` function. In our example, SNPs start from column 7 onwards, which we specify in argument *colSNPs* ```{r prepare_data} library(SNPassoc) asthma.s <- setupSNP(data=asthma, colSNPs=7:ncol(asthma), sep="") ``` This is an alternative way of determining the columns containing the SNPs ```{r prepare_data2} idx <- grep("^rs", colnames(asthma)) asthma.s <- setupSNP(data=asthma, colSNPs=idx, sep="") ``` The argument *sep* indicates the character separating the alleles. The default value is ''/´´. In our case, there is no separating character, so that, we set *sep=""*. The argument *name.genotypes* can be used when genotypes are available in other formats, such as 0, 1, 2 or ''norm´´, ''het´´, ''mut´´. The purpose of the `setupSNP` function is to assign the class *snp* to the SNPs variables, to which `SNPassoc` methods will be applied. The function labels the most common genotype across subjects as the reference one. When numerous SNPs are available, the function can be parallelized through the argument *mc.cores* that indicates the number of processors to be used. We can verify that the SNP variables are given the new class *snp* ```{r classSNP} head(asthma.s$rs1422993) class(asthma.s$rs1422993) ``` and summarize their content with `summary` ```{r summarySNP} summary(asthma.s$rs1422993) ``` which shows the genotype and allele frequencies for a given SNP, testing for Hardy-Weinberg equilibrium (HWE). We can also visualize the results in a plot by ```{r plotsummarySNP, fig.cap="SNP summary. Bar chart showing the basic information of a given SNP"} plot(asthma.s$rs1422993) ``` The argument *type* helps to get a pie chart ```{r plotsummarySNP2, fig.cap="SNP summary. Pie chart showing the basic information of a given SNP"} plot(asthma.s$rs1422993, type=pie) ``` The *summary* function can also be applied to the whole dataset ```{r descriptive} summary(asthma.s, print=FALSE) ``` showing the SNP labels with minor/major allele format, the major allele frequency the HWE test and the percentage of missing genotypes. Missing values can be further explored plotting with ```{r plotMissing, fig.cap="Missing genotypes. Black squares shows missing genotuype information of asthma data example."} plotMissing(asthma.s, print.labels.SNPs = FALSE) ``` This plot can be used to inspect if missing values appear randomly across individuals and SNPs. In our case, we can see that the missing pattern may be considered random, except for three clusters in consecutive SNPs (large black squares). These individuals should be further checked for possible problems with genotyping. # Hardy-Weinberg equilibrium Genotyping of SNPs needs to pass quality control measures. Aside from technical details that need to be considered for filtering SNPs with low quality, genotype calling error can be detected by a HWE test. The test compares the observed genotype frequencies with those expected under random mating, which follows when the SNPs are in the absence of selection, mutation, genetic drift, or other forces. Therefore, HWE must be checked only in controls. There are several tests described in the literature to verify HWE. In `SNPassoc` HWE is tested for all the bi-allelic SNP markers using a fast exact test @wigginton2005note implemented in the *tableHWE* function. ```{r HWE} hwe <- tableHWE(asthma.s) head(hwe) ``` We observe that the first SNPs in the dataset are under HWE since their P-values rejecting the HWE hypothesis (null hypothesis) are larger than 0.05. However, when tested in control samples only, by stratifying by cases and controls ```{r HWE_controls} hwe2 <- tableHWE(asthma.s, casecontrol) #SNPs is HWE in the whole sample but not controls snpNHWE <- hwe2[,1]>0.05 & hwe2[,2]<0.05 rownames(hwe2)[snpNHWE] hwe2[snpNHWE,] ``` We see that rs1345267 is not in HWE within controls because its P-value is <0.05. Notice that one is interested in keeping those SNPsthat do not reject the null hypothesis. As several SNPs are tested, multiple comparisons must be considered. In this particular setting, a threshold of 0.001 is normally considered. As a quality control measure, it is not necessary to be as conservative as in those situations where false discovery rates need to be controlled. SNPs that do not pass the HWE test must be removed form further analyses. We can recall *setupSNP* and indicate the columns of the SNPs to be kept ```{r remove_SNPs} snps.ok <- rownames(hwe2)[hwe2[,2]>=0.001] pos <- which(colnames(asthma)%in%snps.ok, useNames = FALSE) asthma.s <- setupSNP(asthma, pos, sep="") ``` Note that in the variable *pos* we redefine the SNP variables to be considered as class *snp*. # SNP association analysis We are interested in finding those SNPs associated with asthma status that is encoded in the variable *casecontrol*. We first illustrate the association between case-control status and the SNP rs1422993. The association analysis for all genetic models is performed by the function *association* that regresses *casecontrol* on the variable rs1422993 from the dataset *asthma.s* that already contains the SNP variables of class *snp*. ```{r association} association(casecontrol ~ rs1422993, data = asthma.s) ``` The function *association* follows the usual syntax of R modelling functions with the difference that the variables in the model that are of class *snp* are tested using different genetic models. In our example, we observe that all genetic models but the recessive one are statistically significant. *association* also fits the overdominant model, which compares the two homozygous genotypes versus the heterozygous one. This genetic model of inheritance is biologically rare although it has been linked to sickle cell anemia in humans. The result table describes the number of individuals in each genotype across cases and controls. The ORs and CI-95\% are also computed. The last column describes the AIC (Akaike information criteria) that can be used to decide which is the best model of inheritance; the lower the better the model is. In the example, one may conclude that rs1422993 is associated with asthma and that, for instance, the risk of being asthmatic is 39\% higher in people having at least one alternative allele (T) with respect to individuals having none (dominant model). This risk is statistically significant since the CI-95\% does not contain 1, the P-value is 0.0078<0.022, or the P-value of the max-statistics is 0.01 ```{r maxrs1422993} maxstat(asthma.s$casecontrol, asthma.s$rs1422993) ``` If an expected model of inheritance is hypothesized, the association analysis for the model can be specified in the argument *model*, which by default test all models, ```{r mode_inheritance} association(casecontrol ~ rs1422993, asthma.s, model="dominant") ``` Association tests are typically adjusted by covariates, which are incorporated in the model in the usual form ```{r adjusted} association(casecontrol ~ rs1422993 + country + smoke, asthma.s) ``` ORs for stratified analysis on given categorical covariates are used to verify whether the risk is constant across groups ```{r stratified} association(casecontrol ~ rs1422993 + survival::strata(gender), asthma.s) ``` We can see, for instance, that the dominant model is significant only in males. The subset argument allows fitting the model in a subgroup of individuals ```{r subset} association(casecontrol ~ rs1422993, asthma.s, subset=country=="Spain") ``` These analyses can be also be performed in quantitative traits, such as body mass index, since *association* function automatically selects the error distribution of the regression analysis (either Gaussian or binomial). ```{r quantitative} association(bmi ~ rs1422993, asthma.s) ``` For BMI, *association* tests whether the difference between means is statistically significant, rather than computing an OR. For multiple SNP data, our objective is to identify the variants that are significantly associated with the trait. The most basic strategy is, therefore, to fit an association test like the one described above for each of the SNPs in the dataset and determine which of those associations are significant. The massive univariate testing is the most widely used analysis method for *omic* data because of its simplicity. In *SNPassoc*, this type of analysis is done with the function *WGassociation* ```{r morethan1SNP} ans <- WGassociation(casecontrol, data=asthma.s) head(ans) ``` Here, only the outcome is required in the formula argument (first argument) since the function successively calls *association* on each of the variables of class *snp* within *data*. The function returns the P-values of association of each SNP under each genetic model. Covariates can also be introduced in the model ```{r morethan1SNPadjusted, eval=FALSE} ans.adj <- WGassociation(casecontrol ~ country + smoke, asthma.s) head(ans.adj) ``` `SNPassoc` is computationally limited on large genomic data. The computing time can be reduced by parallelization, specifying in the argument *mc.cores* the number of computing cores to be used. Alternatively, the function *scanWGassociation*, a C compiled function, can be used to compute a predetermined genetic model across all SNPs, passed in the argument *model*, which by default is the additive model ```{r morethan1SNPscan, eval=FALSE} ans.fast <- scanWGassociation(casecontrol, asthma.s) ``` **NOTE**: This function is not available on the `SNPassoc` version available on CRAN. The user can install the development version available on GitHub to get access to this function just executing ```{r installGitHubVersion, eval=FALSE} devtools::install_github("isglobal-brge/SNPassoc") ``` The P-values obtained from massive univariate analyses are visualized with the generic *plot* function ```{r plotGWAS, fig.cap="Manhattan-type plots for different genetic models. P-values in -log10 scale to assess the association between case-control status and SNPs in the asthma example.", fig.height=8} plot(ans) ``` This produces a Manhattan plot of the -log10(P-values) for all the SNPs over all models. It shows the nominal level of significance and the Bonferroni level, which is the level corrected by the multiple testing across all SNPs. The overall hypothesis of massive univariate association tests is whether there is any SNP that is significantly associated with the phenotype. As multiple SNPs are tested, the probability of finding at least one significant finding increases if we do not lower the significance threshold. The Bonferroni correction lowers the threshold by the number of SNPs tested (0.0001=0.05/51). In the Manhattan plotof our analysis, we see that no SNP is significant at the Bonferroni level, and therefore there is no SNP that is significantly associated with asthma. Maximum-statistic (see @gonzalez2008maximizing) can also be used to test association between asthma status and SNPs ```{r max-statistic} ans.max <- maxstat(asthma.s, casecontrol) ans.max ``` We note that even under the max-statistics none of the SNPs tested is significant under the Bonferroni correction (<0.0001) for multiple SNP testing ```{r maxBonferroni} #minimum P-value across SNPs min(ans.max["Pr(>z)",]) ``` Information for specific association models for given SNPs can also be retrieved with *WGstats* ```{r OR_several_SNPs, results='hide'} infoTable <- WGstats(ans) ``` Therefore, we can have access to the results for a given SNP by ```{r get_info_rs1422993} infoTable$rs1422993 ``` recovering our previous results given by *association* function. **NOTE**: The R output of specific association analyses can be exported into LaTeX by using *getNiceTable* function and *xtable* R package. The following code creates a table for the SNPs rs1422993 and rs184448 ```{r getNiceTable, eval=FALSE} library(xtable) out <- getNiceTable(ans[c("rs1422993", "rs184448")]) nlines <- attr(out, "nlines") hlines <- c(-1, -1, 0, cumsum(nlines+1), nrow(out), nrow(out)) print(xtable(out, caption='Genetic association using different genetic models from asthma data example of rs1422993 and rs184448 SNPs obtained with SNPassoc.', label = 'tab-2SNPs'), tabular.enviroment="longtable", file="tableSNPs", floating=FALSE, include.rownames = FALSE, hline.after= hlines, sanitize.text.function=identity) ``` # Gene-environment and gene-gene interactions Gene-enviroment (GxE) analyses can be performed within `SNPassoc` using *association* function. Assume that we are interested in testing whether the risk of rs1422993 for asthma under the dominant model is different among smokers (variable smoke; 0=never, 1=ever). This code fits a model with an interaction term where the environmental variable is required to be a factor factor variable. ```{r snpxsmoke} association(casecontrol ~ dominant(rs1422993)*factor(smoke), data=asthma.s) ``` The result is an interaction table showing that the risk of individuals carrying the T allele increases the risk of asthma in never smokers (OR=1.35; CI: 1.02-1.79) while it is not significant in ever smokers (OR=0.95; CI: 0.63-1.42). However, the interaction is not statistically significant ($P$-interaction=0.8513). The output also shows the stratified ORs that can help in interpreting the results. In a similar way, gene-gene interaction (GxG) of a given SNP epistasis model can also be fitted using the same function. In that case, the genetic model of the interacting SNP must be indicated in the *model.inteaction* argument. ```{r snpxsnp} association(casecontrol ~ rs1422993*factor(rs184448), data=asthma.s, model.interaction = "dominant" ) ``` We observe that the interaction between these two SNPs is not statistically significant (P-value=0.24). However, the OR of GG genotype of rs184448 differs across individuals between the GG and GT-TT genotypes of rs1422993 (see ORs for GG in the second table of the output). The user also can perform GxG for a set of SNPs using this code. Let us assume we are interested in assessing interaction between the SNPs that are significant at 10\% level ```{r gxg_subset} ans <- WGassociation(casecontrol, data=asthma.s) mask <- apply(ans, 1, function(x) min(x, na.rm=TRUE)<0.1) sig.snps <- names(mask[mask]) sig.snps idx <- which(colnames(asthma)%in%sig.snps) asthma.s2 <- setupSNP(asthma, colSNPs = idx, sep="") ans.int <- interactionPval(casecontrol ~ 1, data=asthma.s2) ans.int ``` we can visualize the results by ```{r plotInf, fig.cap="Interaction plot. Interaction plot of SNPs significant al 10\\% significant level (see help of 'interactionPval' function to see what is represented in the plot). ", fig.height=7, fig.width=7} plot(ans.int) ``` # Haplotype analysis Genetic association studies can be extended from single SNP associations to haplotype associations. Several examples including different complex diseases can be found in \cite{zill2004single, mill2004haplotype, nair2006sequence, gonzalez2012fto, asherson2007confirmation}. While alleles naturally occur in haplotypes, as they belong to chromosomes, the phase, or the knowledge of the chromosome an allele belongs to is lost in the genotyping process. As each genotype is measured with a different probe the phase between the alleles in a chromosome is broken. Consider for instance an individual for which one chromosome has alleles A and T at two different loci and alleles G or C at the second chromosome. The individual's genotypes for the individual at the two loci are A/G and T/C, which are the same genotypes of another individual that has alleles A and C in one chromosome and G and T in the second chromosome. Clearly, if the pair A-T confers a risk to a disease, subject one is at risk while subject two is not, despite both of them having the same genotypes. The only unequivocal method of resolving phase ambiguity is sequencing the chromosomes of individuals. However, given the correlational structure of SNPs, it is possible to estimate the probability of a particular haplotype in a subject. This can be done in a genetic association study where a number of cases and controls are genotyped. There are numerous methods to infer unobserved haplotypes, two of the most popular are maximum likelihood, implemented via the expectation-maximization (EM) algorithm \cite{excoffier1995maximum, long1995algorithm}, and a parsimony method \cite{clark1990inference}. Recent methods based on Bayesian models have also been proposed \cite{stephens2001new, stephens2003comparison}. In addition, haplotypes inferences carry uncertainty, which should be considered in association analyses. ## Haplotype estimation We now illustrate how to perform haplotype estimation from genotype data using the EM algorithm and how to integrate haplotype uncertainty when evaluating the association between traits and haplotypes. Haplotype inference is performed with `haplo.stats`, for which genotypes are encoded in a different format. *make.geno*, from `SNPassoc`, formats data for `haplo.stats`. The function *haplo.em* computes the haplotype frequency in the data for the SNPs of interest. Here we illustrate how to estimate haplotypes built from the SNPs rs714588, rs1023555 and rs898070: ```{r haplo_em} library(haplo.stats) snpsH <- c("rs714588", "rs1023555", "rs898070") genoH <- make.geno(asthma.s, snpsH) em <- haplo.em(genoH, locus.label = snpsH, miss.val = c(0, NA)) em ``` Coding the common and variant alleles as 1 and 2, we can see there are 8 possible haplotypes across the subjects and are listed with an estimated haplotype frequency. Clearly, the haplotypes are not equally probable, as expected from the high LD between the SNPs. In particular, we observe that haplotypes 4 and 7 are the most probable, accumulating 65\% of the haplotype sample. *haplo.em* estimates for each subject the probability of a given haplotype in each of the subject's chromosomes. ## Haplotype association We then want to assess if any of these haplotypes significantly associates with asthma. The *haplo.glm* fits a regression model between the phenotype and the haplotypes, incorporating the uncertainty for the probable haplotypes of individuals. The function *intervals* of `SNPassoc` provides a nice summary of the results ```{r hap_assoc} trait <- asthma.s$casecontrol mod <- haplo.glm(trait ~ genoH, family="binomial", locus.label=snpsH, allele.lev=attributes(genoH)$unique.alleles, control = haplo.glm.control(haplo.freq.min=0.05)) intervals(mod) ``` *haplo.glm* fits a logistic regression model for the asthma status (*trait* - NOTE: this must be a 0/1 variable) on the inferred haplotypes (*genoH*), the names of SNPs and their allele names are passed in the *locus.label* and *allele.lev* arguments, while only haplotypes with at least 5\% frequency are considered. As a result, we obtain the OR for each haplotype with their significance P-value, with respect to the most common haplotype (ATG with 44\% frequency). In particular, from this analysis, we cannot see any haplotype significantly associated with asthma. ## Sliding window approach The inference of haplotypes depends on a predefined region or sets of SNPs. For instance, in the last section, we selected three SNPs that were in high LD. However, when no previous knowledge is available about the region or SNPs for which haplotypes should be inferred, we can apply a sliding window for haplotype inference \cite{gabriel2002structure, zhang2002dynamic}. To illustrate this type of analysis, we now consider a second block of 10 SNPs in our asthma example (from 6th to 15th SNP). Considering large haplotypes, however, increases the number of possible haplotypes in the sample, decreasing the power of finding real associations. In addition, in predefined blocks, it is possible to miss the most efficient length of the susceptible haplotype, incrementing the loss of power. This is overcome using a sliding window. We thus ask which is the haplotype combination from any of 4, 5, 6 or 7 consecutive SNPs that gives the highest association with asthma status. We then reformat SNP genotypes in the region with the *make.geno* function and perform an association analysis for multiple haplotypes of *i* SNPs sliding from the 6th to the 15th SNP in the data. We perform an analysis for each window length *i* varying from 4 to 7 SNPs. ```{r sliding_window_hap} snpsH2 <- labels(asthma.s)[6:15] genoH2 <- make.geno(asthma.s, snpsH2) haplo.score <- list() for (i in 4:7) { trait <- asthma.s$casecontrol haplo.score[[i-3]] <- haplo.score.slide(trait, genoH2, trait.type="binomial", n.slide=i, simulate=TRUE, sim.control=score.sim.control(min.sim=100, max.sim=200)) } ``` The results can be visualized with the following plot ```{r plotSliding, fig.cap="Sliding window approach. Results obtained of varying haplotype size from 4 up to 7 of 6th to the 15th SNP from asthma data example", fig.height=7, fig.width=7} par(mfrow=c(2,2)) for (i in 4:7) { plot(haplo.score[[i-3]]) title(paste("Sliding Window=", i, sep="")) } ``` We observe that the highest -log10(P-value) is obtained for a haplotype of 4 SNP length starting at the 4th SNP of the selected SNPs. After deciding the best combination of SNPs, the haplotype association with asthma can be estimated by ```{r hap_assoc_bestH} snpsH3 <- snpsH2[4:7] genoH3 <- make.geno(asthma.s, snpsH3) mod <- haplo.glm(trait~genoH3, family="binomial", locus.label=snpsH3, allele.lev=attributes(genoH3)$unique.alleles, control = haplo.glm.control(haplo.freq.min=0.05)) intervals(mod) ``` Here, we observe that individuals carrying the haplotype GCAC have a 53\% increased risk of asthma relative to those having the reference haplotype TCGT (OR=1.53, p=0.0021). The haplotype GTCA is also significantly associated with the disease (p=0.0379). A likelihood ratio test for haplotype status can be extracted from the results of the *haplo.glme* function: ```{r lrt_hap} lrt <- mod$lrt pchisq(lrt$lrt, lrt$df, lower=FALSE) ``` We can also test the association between asthma and the haplotype adjusted for smoking status ```{r lrt_hap_Adj} smoke <- asthma.s$smoke mod.adj.ref <- glm(trait ~ smoke, family="binomial") mod.adj <- haplo.glm(trait ~ genoH3 + smoke , family="binomial", locus.label=snpsH3, allele.lev=attributes(genoH3)$unique.alleles, control = haplo.glm.control(haplo.freq.min=0.05)) lrt.adj <- mod.adj.ref$deviance - mod.adj$deviance pchisq(lrt.adj, mod.adj$lrt$df, lower=FALSE) ``` # Session info ```{r sessionInfo} sessionInfo() ``` # References