Analysis of F1 scores

Roeland E. Voorrips

Alejandro Therese Navarro

2025-02-11

The package fitPoly includes several helper functions to input data from different SNP array providers as well as utilities to analyse the quality of genotype scores obtained by the main model of fitPoly. This vignette shows how to use the second set of functions and explains their output; another vignette deals with the data import and preparation. The package offers more functions than shown here, and the functions we use have additional parameters; check the help for extra information.

In this vignette we will take the output of fitPoly for one Full-Sib family and its parents, and analyze it:

Load the package and the scores data.frame produced by fitPoly:

library(fitPoly)
data(scores)

Exercise 1: Segregation types and quality

Function fitMarkers in package fitPoly returns (a.o.) a file with dosage scores for all markers where a model could be fitted. These data are in data.frame scores. , of which we show a part:

##     marker     MarkerName SampleName ratio P0 P1    P2    P3 P4 maxgeno  maxP geno
## 637      5 RhK5_1000_451Q       K001 0.343  0  1 0.000 0.000  0       1 1.000    1
## 638      5 RhK5_1000_451Q       K002 0.520  0  0 1.000 0.000  0       2 1.000    2
## 639      5 RhK5_1000_451Q       K003 0.341  0  1 0.000 0.000  0       1 1.000    1
## 640      5 RhK5_1000_451Q       K004 0.361  0  1 0.000 0.000  0       1 1.000    1
## 641      5 RhK5_1000_451Q       K007 0.609  0  0 0.999 0.001  0       2 0.999    2
## 642      5 RhK5_1000_451Q       K008 0.345  0  1 0.000 0.000  0       1 1.000    1

Columns P0 … P4 show the probabilities that the dosage is nulliplex … quadruplex. The last column, geno, is the dosage assigned; it is NA if the probability of the most probable dosage is less than 0.99 (the threshold used in the fitPoly analysis in this case). In the first part of this vignette we will only use the geno values, but at some point the P-values will also be considered.

The initial check for segregation type is carried out by function checkF1:

# specify parental and F1 samples:
par1 <- "P540a1"
par2 <- c("P867a1","P867a2","P867b")
F1 <- levels(scores$SampleName)[substr(levels(scores$SampleName),1,1)=="K"]
chk1 <- checkF1(scores, par1, par2, F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile=NA)
#some parts of the result:
knitr::kable(chk1[1:4,1:14])
m MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_NA P540a1 P867a1 P867a2 P867b
5 RhK5_1000_451Q 3 0 2 57 94 1 0 1 3 0 0 0
6 RhK5_1005_2503P 2 3 0 21 62 52 12 8 2 3 3 3
7 RhK5_1005_2503Q 2 NA 0 21 59 51 12 12 2 3 3 2
8 RhK5_1007_811P 3 2 0 12 53 53 13 24 3 2 2 NA
knitr::kable(chk1[1:4,c(1:2, 19:22, 26)])
m MarkerName bestParentfit frqInvalid_bestParentfit Pvalue_bestParentfit matchParent_bestParentfit qall_mult
5 RhK5_1000_451Q 11_1 0.0195 0.0026 Yes 0.1428
6 RhK5_1005_2503P 1331_1 0.0000 0.3054 Yes 0.8710
7 RhK5_1005_2503Q 1331_1 0.0000 0.3699 OneOK 0.4032
8 RhK5_1007_811P 1551_1 0.0000 0.8971 Yes 0.5108

This is a rose (Rosa x hybrida) dataset and we expect that both polysomic and disomic inheritance may occur. By setting both corresponding parameters to TRUE we consider both sets of segregation types. The result is a data.frame with one row for each marker. The first columns list the marker name, the inferred parental dosages, the observed F1 segregation and the observed parental sample dosages. The bestParentfit column lists the selected segregation type and the next 3 columns list 3 parameters describing how well this segregation fits with the observations. The qall_mult column is an overall quality assessment, from poor (0) to perfect (1). There are more columns in chk1, and with other parameter settings even more columns will be produced; for a full explanation refer to the help. For this demo we set outfile to NA, but we could also have specified a filename. In that case a tab-separated text file would be saved which can be imported into Excel for easy viewing.

The segregation codes need some explanation. They consist of 2 parts: the first part has the segregation ratios (e.g. 11 means 1:1, 141 means 1:4:1 and so on; a single 1 means no segregation, all F1 have the same dosage). An underscore separates the two parts and the second part is a single number specifying the lowest dosage occurring in the segregation. For example: 121_0 means 1 nulliplex : 2 simplex : 2 duplex; 11_3 means 1 triplex : 1 quadruplex. There is one tetrasomic segregation type 1:8:18:8:1 where one of the ratios is larger than 9 and cannot be represented by a digit; in this and similar cases we use letters A-Z to represent ratios 10-35, so this segregation is represented as 18I81_0, with the I meaning 18. For higher ploidy levels this system is extended, see the help for a full explanation.

Exercise 2: checking for “shifted” markers

When not all dosages are present among the samples it is possible that fitPoly assigns the peaks in the signal ratio histogram to the wrong dosages. For example, if parents in reality have dosages 0 and 1, also the F1 progeny will (almost) only have dosages 0 or 1. If no other samples are analyzed, only two of the five dosages are present, and these might be mis-interpreted as beings dosages 1 and 2 instead of 0 and 1. These mis-assignments are called “shifts”. fitPoly checks against such possible shifts whether or not parents and F1 populations are defined but it doesn’t catch them all.

The function correctDosages is provided that based on the result of checkF1 checks if shifting the dosages up or down by 1 would be worth to try. The result is a set of suggested shifts, that should then again (using checkF1) be checked for their actual fit.

cordos <- correctDosages(chk1, scores, par1, par2, ploidy=4,
                        polysomic=TRUE, disomic=TRUE, mixed=FALSE)
# show the nonzero shifts:
cordos[cordos$shift!=0,]
##          MarkerName segtype parent1 parent2 shift   fracNotOk ParNA
## 10  RhK5_10102_270P    11_2       2      NA     1 0.000000000     1
## 15  RhK5_1011_3044Q    11_1       1       2    -1 0.000000000     0
## 22   RhK5_101_3700P    11_1       1      NA    -1 0.012820513     1
## 25   RhK5_1020_824Q    11_1       1       2    -1 0.000000000     0
## 27   RhK5_1021_984Q    11_1       1       2    -1 0.000000000     0
## 35  RhK5_10334_506Q    11_2      NA       2     1 0.000000000     1
## 45  RhK5_10484_448Q    11_1       1      NA    -1 0.007936508     1
## 48   RhK5_1054_473P    11_1       2       1    -1 0.000000000     0
## 55  RhK5_10619_199Q    11_1       1       2    -1 0.000000000     0
## 63  RhK5_10728_511P    11_1       2       1    -1 0.012658228     0
## 92  RhK5_11250_511P    11_1       2       1    -1 0.013071895     0
## 99  RhK5_11334_499Q    11_1       1       2    -1 0.000000000     0
## 101 RhK5_1145_1123Q    11_1       1       2    -1 0.000000000     0

The column shift contains the suggested shifts to try: a 0 means no shift suggested, a 1 means: shift all dosages assigned by fitPoly up (0 becomes 1, 1 becomes 2, 2 becomes 3, 3 and 4 become 4 (3 and 4 are merged)). A shift of -1 means that all dosages should be decreased by 1 (merging previous dosages 0 and 1 into dosage 0). Note that this is a rather simplistic way to check for possibly shifted markers. If there are more samples available than only the F1 population and its parents, those might also be informative about possible shifts. Also, sometimes both the unshifted and the shifted version of the marker might be acceptable. In such cases it may be useful to keep both versions and decide later on (after mapping or haplotyping steps) which version fits best.

The next step is to run checkF1 again with as additional parameter these shifts; it will then perform the indicated dosage shift on all samples (parental and F1) before selecting the most likely segregation type and quality assessment:

#select the markers where a shift should be tried:
cordos <- cordos[cordos$shift != 0,] 
subscores <- scores[scores$MarkerName %in% as.character(cordos$MarkerName),]
chk2 <- checkF1(subscores, par1, par2, F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile=NA, shiftmarkers=cordos)

Data.frame chk2 has the same format as ck1 but has an additional column “shift” at the end. For the next steps it is useful to merge chk1 and chk2. In order to do that, chk1 needs also to get this extra column:

chk1$shift <- 0
chk <- rbind(chk1, chk2)
chk <- chk[order(chk$MarkerName),]

Exercise 3: select acceptably scored markers

The data.frame chk now has all unshifted and shifted versions of the markers. Each of the (versions of the) markers has been checked for its quality, which is (a.o.) expressed in its value of qall_mult. We would like to select only those (versions of) markers that are of sufficient quality. As a rule of thumb, all scores of markers with qall_mult==0 are bad and all others might be acceptable, but perhaps we can raise the threshold for qall_mult a bit for a more stringent selection, without losing valuable markers. The approach is to study XY-scatterplots for several markers at different levels of qall_mult and see from which level good (well-scored) markers appear. This approach is similar to that used to exclude markers based on total signal intensity, as discussed in the vignette on data import and preparation. In order to produce scatterplots with dots colored according to assigned dosages we first combine the X and Y signals with the geno (dosage) values.

data("XYdat") 
XYgeno <- combineFiles(XYdata=XYdat, scores=scores)
# define qall levels of 0, 0.05, 0.10 up to 1 where we want to inspect some SNPs:
qall.levels <- seq(0, 1, by=0.05)
# select six SNPs with qall values near each of these values and draw their plots:
chkx <- selMarkers_qall(chk,  qall.levels, mrkperlevel=6)
drawXYplots(dat=XYgeno, markers=chkx,
            out="your-path-and filename",
            genocol=get.genocol(ploidy=4), sample.groups=list(par1, par2),
            groups.col=c("red", "blue"), ploidy=4)

This generates a series of pages, each with 6 plots; here is part of one such page: Part of a page generated by drawXYplots

The drawXYplots command here is quite complex. A full explanation of all parameters can be found in the help. Some things to note here: genocol sets the (pastel) colors assigned to each of the dosages, from red (nulliplex) through blue (duplex) to green (quadruplex), and grey for samples with missing dosage score. The parents are highlighted in red and blue via parameters sample.groups and groups.col. The two plots show a duplex x duplex (1:8:18:8:1) and a simplex x simplex (1:2:1) marker. A conclusion from this series of plots is that already at qall_mult==0.05 most of the markers appear to be well-scored. A next step could be to try to find a qall_mult threshold between 0 and 0.05 to separate the well-scored and badly scored markers; we leave that for the reader and for now we will accept all marker (versions) with qall > 0, and save them to a dosage file:

#select only the markers with qall_mult > 0:
chk <- chk[!is.na(chk$qall_mult) & chk$qall_mult > 0,]
#write the dosage file, applying the shifts as listed in chk:
dosages <- writeDosagefile(chk, scores, par1, par2, F1,
                           polysomic=TRUE, disomic=TRUE, mixed=FALSE,
                           ploidy=4, scorefile=NA)
## batch 1 of 1
knitr::kable(dosages[10:14, 1:12])
MarkerName segtype P540a1 P867a1 P867a2 P867b parent1 parent2 K001 K002 K003 K004
10 RhK5_10102_270Pn 11_2 2 3 NA 3 2 3 2 2 2 2
11 RhK5_10102_270Ps 11_3 3 4 NA 4 3 4 3 3 3 3
12 RhK5_10107_247Pn 121_0 1 1 1 1 1 1 0 0 0 0
13 RhK5_10107_247Qn 121_0 1 1 1 1 1 1 0 0 0 0
14 RhK5_1011_3044Pn 11_0 0 1 1 1 0 1 1 1 0 0

Normally we would specify a filename for scorefile, to get the results also as a tab-separated text file. The dosages data.frame (and file) list the marker, segregation type, the scores of the parental samples, the consensus/inferred parental dosages and the dosages of the F1 individuals. If chk contains a column “shift”, as here, the shift is applied to the dosages as listed in the score data.frame. In that case also an n or s is appended to the MarkerNames to indicate they were not shifted (shift==0) or shifted (shift!=0); in this way we avoid having duplicated markers if both the nonshifted and the shifted version passed the quality filter.

Here we see that the two probes (P and Q) of marker RhK5_101_3700 were initially scored with a difference of 1 in the dosages (versions Pn vs Qn, where n means non-shifted). Function correctDosages found that the P probe should possibly be shifted and this resulted in version Ps (s for shifted). Versions Ps and Qn generally have the same scores, and both fit segregation type 11_0 (1:1 nulliplex:simplex). Similar for the non-shifted and shifted versions of probe P of marker RhK5_10102_270; for that marker the Q probe was rejected.

Exercise 4: combine probes

The Affymetrix Axiom array needs only one probe to detect a SNP. This means that the same SNP can be detected by two different probes, flanking the SNP on either side. If both probes are present on the array they are initially treated as two separate markers by the array software and fitPoly. If all is well the two probes should produce the same results (since they detect the same SNP) but due to various factors a probe may fail entirely or for a subset of the samples. Results that are identical between the probes validate each other, and missing values from one probe might be filled in by the other. Also, fitPoly might produce shifted dosages for one or both probes and a comparison might help to decide the correct version.

In order to compare the results of both probes we use function compareProbes. This function assumes that the MarkerNames of the two probes for the same SNP are identical except for a short suffix; by default (and in the example) these suffixes are P and Q.

cpp <- compareProbes(chk, scores, parent1=par1, parent2=par2, F1=F1,
                     polysomic=TRUE, disomic=TRUE, mixed=FALSE,
                     ploidy=4, compfile=NA, combscorefile=NA)
## batch 1 of 1
knitr::kable(head(cpp$compstat))
SNPname segtypePn qallPn segtypePs qallPs segtypeQn qallQn segtypeQs qallQs segtypeRnn segtypeRns segtypeRsn segtypeRss countP countQ countR
RhK5_1000_451 NA NA NA NA 11_1 0.1428 NA NA NA NA NA NA 0 1 0
RhK5_1005_2503 1331_1 0.8710 NA NA 1331_1 0.4032 NA NA 1331_1 NA NA NA 1 1 1
RhK5_1007_811 1551_1 0.5108 NA NA 1551_0 0.5000 NA NA NA NA NA NA 1 1 0
RhK5_100_3510 11_0 1.0000 NA NA 11_0 0.8333 NA NA 11_0 NA NA NA 1 1 1
RhK5_100_531 141_2 0.6452 NA NA 141_2 0.7742 NA NA 141_2 NA NA NA 1 1 1
RhK5_10102_270 11_2 0.6845 11_3 0.6845 NA NA NA NA NA NA NA NA 2 0 0
knitr::kable(cpp$combscores[1:4, 1:12])
MarkerName segtype P540a1 P867a1 P867a2 P867b parent1 parent2 K001 K002 K003 K004
RhK5_1000_451Qn 11_1 3 0 0 0 3 0 1 2 1 1
RhK5_1005_2503Pn 1331_1 2 3 3 3 2 3 2 1 3 1
RhK5_1005_2503Qn 1331_1 2 3 3 2 2 3 2 1 3 1
RhK5_1005_2503Rnn 1331_1 2 3 3 3 2 3 2 1 3 1

compareProbes returns a list with two components compstat and combscores, each a data.frame. These are also saved as tab-separated text files if compfile and combscorefile are specified. Data.frame compstat has an overview of the different versions of each SNP (two probes, each unshifted and/or shifted). For each of these (suffixed with P or Q for the probe and n or s for (non-)shifted) the segregation type and qall_mult quality score is listed. If (a version of) the P and the Q probe are sufficiently similar, a combined version of the SNP is produced which gets the suffix R, and a further suffix nn, ns, sn or ss depending on whether the nonshifted or the shifted version of the P and Q probe was used for merging. Also for such a merged (R) marker the segregation type is given, and finally the number of versions of the P, Q and R markers. Two probes are considered sufficiently similar to merge if they have the same segregation type and if there are (by default) not more than 4% conflicting sample scores. For more details and additional parameters see the help.

The other component of the compareProbes result is data.frame combscores. This lists the segregation type and dosages for all markers (the original P and Q markers and the combined R markers) in the same format as produced by writeDosagefile (exercise 3). In the second table we can see that there is redundancy: for the first SNP the P, Q and R markers are all present, but they are almost completely identical (as expected, since a combined marker is produced). If we are happy to accept the merging we need to remove the P and Q marker data if an R marker is present. This is done by function removeRedundant:

rr <- removeRedundant(compstat=cpp$compstat, combscores=cpp$combscores,
                      compfile=NA, combscorefile=NA)
knitr::kable(rr$combscores[1:4, 1:12])
MarkerName segtype P540a1 P867a1 P867a2 P867b parent1 parent2 K001 K002 K003 K004
1 RhK5_1000_451Qn 11_1 3 0 0 0 3 0 1 2 1 1
4 RhK5_1005_2503Rnn 1331_1 2 3 3 3 2 3 2 1 3 1
5 RhK5_1007_811Pn 1551_1 3 2 2 NA 3 2 3 2 3 2
6 RhK5_1007_811Qn 1551_0 2 1 0 1 2 1 2 1 2 1

removeRedundant produces the same type of output as compareProbes. The compstat data.frame is not very interesting but the combined scores in data.frame rr$combscores are the result that we will want to use in mapping studies and further genetic analyses.