--- title: "Data import and preparation" author: - Roeland E. Voorrips - Alejandro Therese Navarro date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data import and preparation} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r set-options, echo=FALSE, cache=FALSE} options(width = 100) ``` FitPoly contains several functions to import from SNP array software, perform selections based on signal intensity and prepare input data for genotyping. The package offers more functions than shown here, and the functions we use have additional parameters; check the help for extra information. Load the package and set the working directory to the data directory: ```{r} library(fitPoly) ``` ## Exercise 1: extract data from SNP array We cover here the cases for Illumina Infinium and GoldenGate arrays and for Affymetrix Axiom arrays. The aim of this step is to get the data into a uniform format, such that the further steps are not dependent on the source of the data (array platform). ### Illumina The data from Illumina arrays are processed with Illumina's GenomeStudio software. This software can export a `Full Data Table.txt` file, which is a tab-separated text file (i.e. columns in this file are separated by Tab stops). In this file format every row corresponds to a marker. The first columns contain general data for the marker including its name; next for each sample there are a few columns. For our purposes we need the columns X and Y for each sample. These are not exported by default; in order to export them, within GenomeStudio we use the Column Chooser button to include them in the output. All other sample columns besides X and Y are not needed but are tolerated; however since they increase the file size and memory load it is better to omit them when exporting from GenomeStudio. ```{r} # show part of a GenomeStudio FullDataTable file (only some relevant columns): fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly") dat <- readDatfile(fname, check.names=FALSE) dat[1:3, c(2, 15,16, 21,22, 27,28)] ``` In case you want to see the `FullDataTable.txt` file itself, you can find it in the extdata subdirectory of the main package directory. ### Affymetrix The data from Affymetrix Axiom arrays are processed with the Affymetrix Power Tools (APT) or the Genotyping Console software. After initial quality control steps all passing samples and all SNPs are processed by the program "apt probeset genotype". If the option --summaries is included (as is normally the case) a file `AxiomGT1.summary.txt` file is produced. Like the GenomeStudio file, this is a tab-separated text file with the markers in rows and the samples in columns. There are also differences; most importantly, instead of having columns X and Y for each sample, we now have one column per sample but rows A and B for each marker. Also the files start with several hundred lines of text, and the last few thousand rows contain quality control (DQC) markers. Also the sample names all have ".CEL" appended to them. ```{r} # show part of an Affymetrix AxiomGT1.summary.txt file: fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly") dat <- readDatfile(fname, comment.char="#", check.names=FALSE) dat[1:6, 1:4] ``` Again, the `AxiomGT1.summary.txt` file is located in the extdata subdirectory of the main package directory. ### Read and convert the array data We provide demo files, each with only 10 or 15 samples and 10 SNP markers. We will import these files, clean the sample names and convert them from **wide** format (all samples side by side) to **long** format (all samples under each other, with columns `MarkerName`, `SampleName`, `X`, `Y`, `R`, `ratio`). ```{r} # import a FullDataTable.txt file from GenomeStudio: fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly") datGS <- readFullDataTable(filename=fname, out="") head(datGS) unique(datGS$MarkerName) unique(datGS$SampleName) # import an AxiomGT1.summary.txt file from Affymetrix Power Tools: fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly") datAX <- readAxiomSummary(AXdata=fname, out="") head(datAX) unique(datAX$MarkerName) unique(datAX$SampleName) ``` These functions return the reformatted data as a data.frame. Normally you would use the parameter `out` to specify the name of an RData file (or optionally a tab-separated text file, see the help) to save the converted data. ## Exercise 2: Rename the samples and separate by ploidy level Normally the marker names and/or the sample names as used in the array software are different from the original names. In these cases there should be a file specifying the translation between the two sets of names. Such a file can also contain additional information on the samples, such as the categories of the samples (e.g. F1 parent 1, F1 parent 2, F1 individual, association panel, wild, control) and ploidy (e.g. 2x, 4x). In this exercise we will convert the marker and sample names of the `datAX` data.frame and split it into two parts: one for the tetraploid and one for the diploid samples. First we rename the markers. The translation is specified by file `CsvAnnotationFile.v1.txt`. Since the markers on the array are a fixed set these annotation files don't change and can usually be assumed to be correct (i.e. all marker names in the array can be assumed also to occur in the annotation file and v.v., and all `customer_id` codes can be assumed to be unique). Therefore we don't need to check this and can perform the translation directly: ```{r} fname <- system.file("extdata", "CsvAnnotationFile.v1.txt", package="fitPoly") mrktable <- read.csv(fname, comment.char="#") levels(datAX$MarkerName) <- mrktable$customer_id[match(levels(datAX$MarkerName), mrktable$Probe.Set.ID)] ``` Next we rename the samples. The samples are different between projects using the same array, and sample annotations are usually made by hand, in different formats, and cannot be assumed to be error-free. Also it may be that there are samples of various ploidy levels, which we may want to split into separate sets. For this we use function `splitNrenameSamples`: ```{r} fname <- system.file("extdata", "AX_sampletable.csv", package="fitPoly") smptable <- read.csv(fname) head(smptable) # no split, return the original data.frame with substituted SampleNames: datAX_unsplit <- splitNrenameSamples(dat=datAX, sampletable=smptable, SampleID="BestArray", CustomerID="SampleName", out=NA) head(datAX_unsplit) # split samples based on ploidy, return a list of data.frames: datAX_split <- splitNrenameSamples(dat=datAX, sampletable=smptable, SampleID="BestArray", CustomerID="SampleName", Ploidy="Ploidy", out=NA) head(datAX_split[[1]]) head(datAX_split[[2]]) ``` Although we use `out=NA` in this demo, normally you would provide a filename to store the results. ## Exercise 3: Filtering the data based on total signal intensity The total signal intensity R = X + Y can be used as a measure whether sample DNA is ok, the marker is designed well, and individual datapoints are acceptable. Selection of data based on signal intensity has to be done before dosage calling by fitPoly as fitPoly only works with the signal ratio, not with absolute signal intensities. For this exercise we will use an example data.frame `XYdat` . ```{r} data(XYdat) head(XYdat) ``` ### 3a: Selection of samples based on total signal intensity Samples with bad DNA quality may be recognized by a low R (total signal intensity), averaged over all markers. This information is saved in a file produced by `readFullDataTable` or `readAxiomSummary` if parameter out is specified. However it is also quite easy to calculate directly if all data are in the same data.frame: ```{r} sampRmean <- tapply(XYdat$R, XYdat$SampleName, mean) hist(sampRmean, breaks = 20) ``` It is clear that the mean R values for all samples are quite similar so there is no reason to exclude any samples at this point. ### 3b: Selection of markers based on total signal intensity Next we will check if any markers should be excluded based on their overall signal intensity R. We calculate a number of statistics of the R distribution per marker and study the 95% quantiles (q95) of the markers: ```{r} Rstats <- calcRstats(XYdat, out=NA) quantile(Rstats$q95) ``` In contrast to the samples, the markers vary widely in signal intensity. Probably there is a level below which no usable markers occur. We define a number of levels for q95 and at each level we select 3 markers. We then draw XY-scatterplots for these markers and decide from which level clear clustering pattern occur: ```{r eval=FALSE} Rlevels <- seq(200, 6000, by=400) sel.Rstats <- selMarkers_byR(Rstats, Rlevels, mrkperlevel=3) drawXYplots(dat=XYdat, markers=sel.Rstats, out="your-path-and filename", drawRthresholds=TRUE) ``` This generates a series of pages, each with 6 plots; here is part of one such page: ![Part of a page generated by drawXYplots](check_R_levels_DO_NOT_DELETE.png) These two markers show more or less clear clusters and have a q95 around 1600, so clearly the threshold below which markers should be excluded must be lower than 1600. In this case we will exclude all markers with q95 \< 1400. This selection will be combined with a within-marker selection of samples in function `makeFitPolyFiles` (see next exercise), but if no within-marker selection is needed it can also be done as shown here: ```{r} keep <- Rstats$MarkerName[Rstats$q95 >= 1400] dat <- XYdat[XYdat$MarkerName %in% keep,] ``` ### 3c: Selection based on R values of samples within markers In the plots as shown above the diagonal line represents R = 0.5 \* q95. Different, and even multiple, R threshold lines can be drawn by using the `Rthreshold.param` paprameter of `drawXYplots` (see the help). By looking at a series of XY-plots we can decide if we want to exclude samples with R \< 0.5 \* q95 (or any other threshold). Note that the within-marker threshold should be based on the q95 (or another R statistic) of each marker, as the signal intensity differs a lot between markers. The following selects all markers where q95 (the 95% quantile of R) is \>= 1400, and within these markers it assigns a missing value (NA) to all markers where R \< 0.5 \* q95: ```{r} dat.sel <- makeFitPolyFiles(XYdat, out=NA, Rquantile=0.95, marker.threshold=1400, Rthreshold.param=c(0.95, 0.5, 0)) #dat.sel is in this case a list with only one useful element (the other element #is NA); simplify: dat.sel <- dat.sel[[1]] head(dat.sel) ``` For more information on `makeFitPoly` files see the help. ## Excersise 4: run fitPoly This vignette is about preparing and selecting data for dosage scoring by using the function `fitMarkers`. We assume that data.frame `dat.sel` from the previous exercise is still in memory. ```{r eval=FALSE} fitMarkers(ploidy=4, markers=1:3, data=dat.sel, filePrefix="A", rdaFiles=TRUE) ``` The main result of this function call is a file `A_scores.RData` which contains a data.frame with for each marker (in this case only the first 3 markers) and sample the assigned dosage, as well as some further data. For further information on the input and output of fitPoly functions see the vignette [Segregation analysis of dosage scores](Segregation_analysis_of_dosage_scores.html) in this package.