Data import and preparation

Roeland E. Voorrips

Alejandro Therese Navarro

2025-02-11

FitPoly contains several functions to import from SNP array software, perform selections based on signal intensity and prepare input data for genotyping. The package offers more functions than shown here, and the functions we use have additional parameters; check the help for extra information.

Load the package and set the working directory to the data directory:

library(fitPoly)

Exercise 1: extract data from SNP array

We cover here the cases for Illumina Infinium and GoldenGate arrays and for Affymetrix Axiom arrays. The aim of this step is to get the data into a uniform format, such that the further steps are not dependent on the source of the data (array platform).

Illumina

The data from Illumina arrays are processed with Illumina’s GenomeStudio software. This software can export a Full Data Table.txt file, which is a tab-separated text file (i.e. columns in this file are separated by Tab stops). In this file format every row corresponds to a marker. The first columns contain general data for the marker including its name; next for each sample there are a few columns. For our purposes we need the columns X and Y for each sample. These are not exported by default; in order to export them, within GenomeStudio we use the Column Chooser button to include them in the output. All other sample columns besides X and Y are not needed but are tolerated; however since they increase the file size and memory load it is better to omit them when exporting from GenomeStudio.

# show part of a GenomeStudio FullDataTable file (only some relevant columns):
fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly")
dat <- readDatfile(fname, check.names=FALSE)
dat[1:3, c(2, 15,16, 21,22, 27,28)]
##      Name 101765-0001.X 101765-0001.Y 101765-0002.X 101765-0002.Y 101765-0003.X 101765-0003.Y
## 1 SNP0001    0.10170589     0.7760399    0.10211141     0.7885740    0.35890719     0.7804398
## 2 SNP0002    0.04271171     1.3020643    0.04893748     1.2564085    0.02508135     1.1183847
## 3 SNP0003    0.01710911     0.9452760    0.02105922     0.8730388    0.02211778     0.8874161

In case you want to see the FullDataTable.txt file itself, you can find it in the extdata subdirectory of the main package directory.

Affymetrix

The data from Affymetrix Axiom arrays are processed with the Affymetrix Power Tools (APT) or the Genotyping Console software. After initial quality control steps all passing samples and all SNPs are processed by the program “apt probeset genotype”. If the option –summaries is included (as is normally the case) a file AxiomGT1.summary.txt file is produced. Like the GenomeStudio file, this is a tab-separated text file with the markers in rows and the samples in columns. There are also differences; most importantly, instead of having columns X and Y for each sample, we now have one column per sample but rows A and B for each marker. Also the files start with several hundred lines of text, and the last few thousand rows contain quality control (DQC) markers. Also the sample names all have “.CEL” appended to them.

# show part of an Affymetrix AxiomGT1.summary.txt file:
fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly")
dat <- readDatfile(fname, comment.char="#", check.names=FALSE)
dat[1:6, 1:4]
##     probeset_id 814-379_A03.CEL 814-379_A04.CEL 814-379_A05.CEL
## 1 AX-86752740-A        532.8074       1089.7257        856.1627
## 2 AX-86752740-B        606.8601        726.3137        647.0679
## 3 AX-86752741-A       1596.1018       2228.1368       1820.3040
## 4 AX-86752741-B        871.0347        486.4657        877.3148
## 5 AX-86752742-A       2365.1000       2560.8275       2511.4948
## 6 AX-86752742-B        803.7539        981.7766        856.3433

Again, the AxiomGT1.summary.txt file is located in the extdata subdirectory of the main package directory.

Read and convert the array data

We provide demo files, each with only 10 or 15 samples and 10 SNP markers. We will import these files, clean the sample names and convert them from wide format (all samples side by side) to long format (all samples under each other, with columns MarkerName, SampleName, X, Y, R, ratio).

# import a FullDataTable.txt file from GenomeStudio:
fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly")
datGS <- readFullDataTable(filename=fname, out="")
head(datGS)
##   MarkerName  SampleName           X          Y         R      ratio
## 1    SNP0001 101765-0001 0.101705887 0.77603991 0.8777458 0.88412831
## 2    SNP0002 101765-0001 0.042711711 1.30206429 1.3447760 0.96823879
## 3    SNP0003 101765-0001 0.017109110 0.94527599 0.9623851 0.98222218
## 4    SNP0004 101765-0001 0.008343576 0.86564662 0.8739902 0.99045347
## 5    SNP0005 101765-0001 1.158060329 0.09794167 1.2560020 0.07797891
## 6    SNP0006 101765-0001 0.033692960 1.10556404 1.1392570 0.97042550
unique(datGS$MarkerName)
##  [1] "SNP0001" "SNP0002" "SNP0003" "SNP0004" "SNP0005" "SNP0006" "SNP0007" "SNP0008" "SNP0009"
## [10] "SNP0010"
unique(datGS$SampleName)
##  [1] "101765-0001" "101765-0002" "101765-0003" "101765-0004" "101765-0005" "101765-0006"
##  [7] "101765-0007" "101765-0008" "101765-0009" "101765-0010"
# import an AxiomGT1.summary.txt file from Affymetrix Power Tools:
fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly")
datAX <- readAxiomSummary(AXdata=fname, out="")
head(datAX)
##    MarkerName  SampleName         X         Y        R     ratio
## 1 AX-86752740 814-379_A03  532.8074  606.8601 1139.668 0.5324887
## 2 AX-86752740 814-379_A04 1089.7257  726.3137 1816.039 0.3999438
## 3 AX-86752740 814-379_A05  856.1627  647.0679 1503.231 0.4304515
## 4 AX-86752740 814-379_A06  721.9925  960.5359 1682.528 0.5708884
## 5 AX-86752740 814-379_A07  453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740 814-379_A08  525.5640 1253.0268 1778.591 0.7045054
unique(datAX$MarkerName)
##  [1] "AX-86752740" "AX-86752741" "AX-86752742" "AX-86752743" "AX-86752744" "AX-86752745"
##  [7] "AX-86752746" "AX-86752747" "AX-86752748" "AX-86752749"
unique(datAX$SampleName)
##  [1] "814-379_A03" "814-379_A04" "814-379_A05" "814-379_A06" "814-379_A07" "814-379_A08"
##  [7] "814-379_A09" "814-379_A10" "814-379_A11" "814-379_A12" "814-380_G12" "814-380_H01"
## [13] "814-380_H02" "814-380_H03" "814-380_H04"

These functions return the reformatted data as a data.frame. Normally you would use the parameter out to specify the name of an RData file (or optionally a tab-separated text file, see the help) to save the converted data.

Exercise 2: Rename the samples and separate by ploidy level

Normally the marker names and/or the sample names as used in the array software are different from the original names. In these cases there should be a file specifying the translation between the two sets of names. Such a file can also contain additional information on the samples, such as the categories of the samples (e.g. F1 parent 1, F1 parent 2, F1 individual, association panel, wild, control) and ploidy (e.g. 2x, 4x). In this exercise we will convert the marker and sample names of the datAX data.frame and split it into two parts: one for the tetraploid and one for the diploid samples.

First we rename the markers. The translation is specified by file CsvAnnotationFile.v1.txt. Since the markers on the array are a fixed set these annotation files don’t change and can usually be assumed to be correct (i.e. all marker names in the array can be assumed also to occur in the annotation file and v.v., and all customer_id codes can be assumed to be unique). Therefore we don’t need to check this and can perform the translation directly:

fname <- system.file("extdata", "CsvAnnotationFile.v1.txt", package="fitPoly")
mrktable <- read.csv(fname, comment.char="#")
levels(datAX$MarkerName) <- 
  mrktable$customer_id[match(levels(datAX$MarkerName), mrktable$Probe.Set.ID)]

Next we rename the samples. The samples are different between projects using the same array, and sample annotations are usually made by hand, in different formats, and cannot be assumed to be error-free. Also it may be that there are samples of various ploidy levels, which we may want to split into separate sets. For this we use function splitNrenameSamples:

fname <- system.file("extdata", "AX_sampletable.csv", package="fitPoly")
smptable <- read.csv(fname)
head(smptable)
##     X SubmittedPlateName SubmittedWell    PlateName Well SampleName Ploidy SampleSource
## 1  99       SMP4_0008814           A03 SMP4_0008814  A03       K001     4x     Customer
## 2 100       SMP4_0008814           A04 SMP4_0008814  A04       K002     4x     Customer
## 3 101       SMP4_0008814           A05 SMP4_0008814  A05       K003     4x     Customer
## 4 102       SMP4_0008814           A06 SMP4_0008814  A06       K004     4x     Customer
## 5 103       SMP4_0008814           A07 SMP4_0008814  A07       K007     4x     Customer
## 6 104       SMP4_0008814           A08 SMP4_0008814  A08       K008     4x     Customer
##         SampleType InternalPico CVPercent   BestArray FailureMode
## 1 Rose genomic DNA        16.46      6.90 814-379_A03          NA
## 2 Rose genomic DNA        14.87      3.63 814-379_A04          NA
## 3 Rose genomic DNA         8.32      6.97 814-379_A05          NA
## 4 Rose genomic DNA        23.68      1.52 814-379_A06          NA
## 5 Rose genomic DNA        20.28      3.61 814-379_A07          NA
## 6 Rose genomic DNA         6.78      2.29 814-379_A08          NA
# no split, return the original data.frame with substituted SampleNames:
datAX_unsplit <- splitNrenameSamples(dat=datAX, sampletable=smptable,
                                     SampleID="BestArray", CustomerID="SampleName",
                                     out=NA)
head(datAX_unsplit) 
##    MarkerName SampleName         X         Y        R     ratio
## 1 AX-86752740       K001  532.8074  606.8601 1139.668 0.5324887
## 2 AX-86752740       K002 1089.7257  726.3137 1816.039 0.3999438
## 3 AX-86752740       K003  856.1627  647.0679 1503.231 0.4304515
## 4 AX-86752740       K004  721.9925  960.5359 1682.528 0.5708884
## 5 AX-86752740       K007  453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740       K008  525.5640 1253.0268 1778.591 0.7045054
# split samples based on ploidy, return a list of data.frames:
datAX_split <- splitNrenameSamples(dat=datAX, sampletable=smptable,
                                   SampleID="BestArray", CustomerID="SampleName",
                                   Ploidy="Ploidy", out=NA)
## sampletable contains ploidy levels 2x 4x
## the samples will be split per ploidy value
head(datAX_split[[1]])
##     MarkerName SampleName        X        Y        R     ratio
## 11 AX-86752740        V01 720.7784 2086.717 2807.496 0.7432664
## 12 AX-86752740        V02 427.6401 1864.825 2292.465 0.8134584
## 13 AX-86752740        V03 756.3445 1333.276 2089.621 0.6380470
## 14 AX-86752740        V04 415.7363 1850.588 2266.325 0.8165593
## 15 AX-86752740        V05 722.2508 1729.903 2452.154 0.7054627
## 26 AX-86752741        V01 864.9218 1615.226 2480.148 0.6512620
head(datAX_split[[2]])
##    MarkerName SampleName         X         Y        R     ratio
## 1 AX-86752740       K001  532.8074  606.8601 1139.668 0.5324887
## 2 AX-86752740       K002 1089.7257  726.3137 1816.039 0.3999438
## 3 AX-86752740       K003  856.1627  647.0679 1503.231 0.4304515
## 4 AX-86752740       K004  721.9925  960.5359 1682.528 0.5708884
## 5 AX-86752740       K007  453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740       K008  525.5640 1253.0268 1778.591 0.7045054

Although we use out=NA in this demo, normally you would provide a filename to store the results.

Exercise 3: Filtering the data based on total signal intensity

The total signal intensity R = X + Y can be used as a measure whether sample DNA is ok, the marker is designed well, and individual datapoints are acceptable. Selection of data based on signal intensity has to be done before dosage calling by fitPoly as fitPoly only works with the signal ratio, not with absolute signal intensities. For this exercise we will use an example data.frame XYdat .

data(XYdat)
head(XYdat)
##          MarkerName SampleName        X        Y         R     ratio
## 2013 RhK5_1000_451P       K001 291.7016 853.7687 1145.4703 0.7453433
## 2014 RhK5_1000_451P       K002 295.5672 392.0710  687.6382 0.5701705
## 2015 RhK5_1000_451P       K003 260.1488 364.9967  625.1455 0.5838588
## 2016 RhK5_1000_451P       K004 248.3574 314.0045  562.3619 0.5583673
## 2017 RhK5_1000_451P       K007 265.1335 286.6872  551.8208 0.5195296
## 2018 RhK5_1000_451P       K008 257.5974 337.2776  594.8750 0.5669723

3a: Selection of samples based on total signal intensity

Samples with bad DNA quality may be recognized by a low R (total signal intensity), averaged over all markers. This information is saved in a file produced by readFullDataTable or readAxiomSummary if parameter out is specified. However it is also quite easy to calculate directly if all data are in the same data.frame:

sampRmean <- tapply(XYdat$R, XYdat$SampleName, mean)
hist(sampRmean, breaks = 20)

It is clear that the mean R values for all samples are quite similar so there is no reason to exclude any samples at this point.

3b: Selection of markers based on total signal intensity

Next we will check if any markers should be excluded based on their overall signal intensity R. We calculate a number of statistics of the R distribution per marker and study the 95% quantiles (q95) of the markers:

Rstats <- calcRstats(XYdat, out=NA)
quantile(Rstats$q95)
##        0%       25%       50%       75%      100% 
##  252.7965 2677.6312 3295.4407 3698.0998 5871.7522

In contrast to the samples, the markers vary widely in signal intensity. Probably there is a level below which no usable markers occur. We define a number of levels for q95 and at each level we select 3 markers. We then draw XY-scatterplots for these markers and decide from which level clear clustering pattern occur:

Rlevels <- seq(200, 6000, by=400)
sel.Rstats <- selMarkers_byR(Rstats, Rlevels, mrkperlevel=3)
drawXYplots(dat=XYdat, markers=sel.Rstats,
            out="your-path-and filename", drawRthresholds=TRUE)

This generates a series of pages, each with 6 plots; here is part of one such page: Part of a page generated by drawXYplots

These two markers show more or less clear clusters and have a q95 around 1600, so clearly the threshold below which markers should be excluded must be lower than 1600. In this case we will exclude all markers with q95 < 1400. This selection will be combined with a within-marker selection of samples in function makeFitPolyFiles (see next exercise), but if no within-marker selection is needed it can also be done as shown here:

keep <- Rstats$MarkerName[Rstats$q95 >= 1400]
dat <- XYdat[XYdat$MarkerName %in% keep,]

3c: Selection based on R values of samples within markers

In the plots as shown above the diagonal line represents R = 0.5 * q95. Different, and even multiple, R threshold lines can be drawn by using the Rthreshold.param paprameter of drawXYplots (see the help). By looking at a series of XY-plots we can decide if we want to exclude samples with R < 0.5 * q95 (or any other threshold). Note that the within-marker threshold should be based on the q95 (or another R statistic) of each marker, as the signal intensity differs a lot between markers. The following selects all markers where q95 (the 95% quantile of R) is >= 1400, and within these markers it assigns a missing value (NA) to all markers where R < 0.5 * q95:

dat.sel <- makeFitPolyFiles(XYdat, out=NA, Rquantile=0.95, marker.threshold=1400,
                            Rthreshold.param=c(0.95, 0.5, 0))
#dat.sel is in this case a list with only one useful element (the other element
#is NA); simplify:
dat.sel <- dat.sel[[1]]
head(dat.sel)
##           MarkerName SampleName        X        Y        R     ratio
## 21173 RhK5_100_3510P       K001 2704.714 369.3005 3074.015 0.1201362
## 21174 RhK5_100_3510P       K002 2472.556 756.2809 3228.837 0.2342270
## 21175 RhK5_100_3510P       K003 2281.848 759.0215 3040.869 0.2496068
## 21176 RhK5_100_3510P       K004 2563.892 857.1673 3421.059 0.2505561
## 21177 RhK5_100_3510P       K007 2796.043 459.6424 3255.686 0.1411814
## 21178 RhK5_100_3510P       K008 2459.193 789.6996 3248.893 0.2430673

For more information on makeFitPoly files see the help.

Excersise 4: run fitPoly

This vignette is about preparing and selecting data for dosage scoring by using the function fitMarkers. We assume that data.frame dat.sel from the previous exercise is still in memory.

fitMarkers(ploidy=4, markers=1:3, data=dat.sel, filePrefix="A", rdaFiles=TRUE)

The main result of this function call is a file A_scores.RData which contains a data.frame with for each marker (in this case only the first 3 markers) and sample the assigned dosage, as well as some further data. For further information on the input and output of fitPoly functions see the vignette Segregation analysis of dosage scores in this package.