\documentclass[a4paper]{article} \usepackage{rotating} \usepackage{graphicx} %% \VignetteIndexEntry{Using PolyPatEx} \setlength{\parskip}{1ex plus 0.5ex minus 0.2ex} \title{Using the PolyPatEx package} \author{Alexander Zwart} \begin{document} \maketitle %% checks: %% PolyPatEx as \textbf{PolyPatEx} %% \texttt for references to functions, arguments, etc %% () notation for functions %% = notation for arguments \section{What is \textbf{PolyPatEx}?} \label{sec:Intro} \begin{sloppypar} \textbf{PolyPatEx} is an R package for performing paternity exclusion in autopolyploid, monoecious or diocecious species having ploidy $p = 4n$, $6n$ or $8n$, using allele matching on codominant marker data such as microsatellite data. \textbf{PolyPatEx} can also perform exclusion on diploid data, provided the data is \emph{genotypic} (see below for an explanation of \emph{genotypic} vs \emph{phenotypic} data). Note that \textbf{PolyPatEx} is suited to low density, high information markers, but is not suited to high density markers such as SNPs. \end{sloppypar} The analysis is based on direct comparison of the alleles at a given locus (referred to as an `allele set') to determine whether there is a possible match between a candidate father's allele set and the allele set of the progeny, given the allele set observed in the progeny's known mother. Results of such per-locus matches are then collated across all loci to provide multi-locus progeny-father matches for each progeny in the dataset. `Candidate fathers' may be referred to simply as `candidates' in this document and the \textbf{PolyPatEx} reference documentation. \textbf{PolyPatEx} makes several key assumptions. In an autopolyploid species of ploidy $p$, at a given locus, \begin{itemize} \item $p/2$ of the $p$ alleles in the progeny are assumed to have been selected, without replacement, from the mother's $p$ alleles (the `maternal gamete'); \item the remaining $p/2$ of the progeny's alleles are assumed to have been selected, without replacement, from the father's $p$ alleles (the `paternal gamete'); \item The relationship between mother and progeny is known---the aim of the analysis is to determine which of one or more candidate fathers is capable of producing a paternal gamete compatible with the alleles in the progeny-mother pair; \item No account is taken of mutation or other events (such as double reduction) which can violate the assumptions of `selection, without replacement, of $p/2$ alleles from $p$ alleles'. \end{itemize} \textbf{PolyPatEx} does not provide a probabilistic paternity exclusion analysis which would yield information on likelihood of paternity and more elegantly handle issues such as missing data. Consequently no assumptions are made concerning possible correlations between loci (linkage), nor is \emph{random} selection of gametic alleles a necessary assumption. Appropriate caution in interpreting results from this software is advised, in light of the assumptions made by the package. In collating per-locus comparisons results across multiple loci, the relevant \textbf{PolyPatEx} functions \emph{do} provide one simple option to address potential effects of mutation or missing data. In assessing whether a candidate should be flagged as a potential father, the option exists to allow one or more mismatching loci in a progeny-candidate pair that is flagged as a `match'. \textbf{PolyPatEx} contains routines to perform paternity exclusion on datasets where marker dosages are known (genotype data) and where marker dosages are unknown (`phenotype' data). In the latter case, the software allows for the fact that comparisons made between allele sets must take into account all of the possible genotypic allele sets that can arise from a given `phenotypic' allele set. A typical \textbf{PolyPatEx} workflow: \begin{enumerate} \item Prepare an allele data set file in Comma Separated Value (CSV) format. The required data layout is described in Section~\ref{sec:inputDataFormat}, \textbf{Input data format} on page \pageref{sec:inputDataFormat}. \item Start up R, load the \textbf{PolyPatEx} package and set the R working directory to the location of the data file. \item Load the data file into R and perform a number of checks and necessary preprocessing steps on the data using functions provided by the \textbf{PolyPatEx} package. \item Perform the exclusion analysis. \item Query, and/or output to CSV file, the results of the exclusion analysis. \end{enumerate} The workflow is described in more detail in Section~\ref{sec:workflow}, \textbf{PolyPatEx example workflow} on page~\pageref{sec:workflow}. % \section{A note on terminology} % \label{sec:terminology} % In the PolyPatEx documentation we use the terms genotype/genotypic to % refer to datasets in which marker dosages are known, while the terms % `phenotype'/`phenotypic' are used to refer to datasets in which the % marker dosages are unknown. % The use of genotype/genotypic for data where marker dosages are known % is consistent with the general use of these terms---in a species with % ploidy p, and at a given locus, exactly p alleles should be observed, % some of which may be repeated. If fewer than p alleles are observed % at a locus, then there is missing data (and \textbf{PolyPatEx} will treat such % loci-subject combinations as containing no data at all---i.e., the % affected locus is ignored). % %% Still in the process of deciding what to say here... % % However, our use of `phenotype'/`phenotypic' is certainly % % non-standard, even given the multiple meanings associated with the % % word. Our use here is not necessarily intended to introduce a new % % meaning, but % % The terms `phenotype'/'phenotypic' are defined as the phenotypic % % expression of alleles at a locus as viewed by the marker system % % (i.e. electrophoresis), and while it is a somewhat non-standard use of % % these terms, there are certainly multiple meanings associated with % % them. % % Our use of the words `phenotype'/'phenotypic' is influenced by the % % idea of the expression or `effect' of an allele set depending only on % % the presence of distinct allele states, and not on their dosages % % (i.e. multiplicities). Of course, we are only considering markers % % here, not genes, and when even when considering gene alleles, the % % relationship between an alleles set and their `effect' depends upon % % the nature of the dominance relationships between the alleles, which % % can vary. \section{Typesetting conventions and notation} In this document, displayed R code and references to R functions, function arguments or data structures are displayed in \texttt{typewriter-style} font. References to columns and content of the input allele data file are similarly typeset. References to R function names are followed by an empty pair of braces, as in \texttt{inputData()}. The `\texttt{()}' suffix is an informal notation useful for clarifying that the name refers specifically to an R \emph{function}. Similarly, references to function \emph{argument} names are immediately followed by an `equals' sign, as in \texttt{file=}. I hope these conventions will aid (rather than confuse) those users not familiar with R programming. As they are not formal R conventions, I do not use these notations in the R reference documentation. \section{Key \textbf{PolyPatEx} functions} \label{sec:keyFunctions} The functions in \textbf{PolyPatEx} that are of interest to the user are (in approximate order of possible use): \begin{itemize} \item \texttt{fixCSV()}---`Tidies' a CSV file prior to input into R. \item \texttt{inputData()}---Loads an allele dataset (from a CSV file) into R. \item \texttt{preprocessData()}---Performs a number of essential checks and preprocessing steps on the dataset, prior to further analysis by other \textbf{PolyPatEx} functions. This function is \emph{automatically called for you} by \texttt{inputData()}, but can also be called directly should you load your dataset into R by other means. \item \texttt{phenotPPE()}, \texttt{genotPPE()}---perform the paternity exclusion analyses on `phenotypic' or genotypic allele data sets respectively. \item \texttt{potentialFatherCounts()}---tabulates the number of potential fathers identified by \texttt{phenotPPE()} or \texttt{genotPPE()} for each offspring. \item \texttt{potentialFatherIDs()}---displays the potential fathers that were identified by \texttt{phenotPPE()} or \texttt{genotPPE()} for each offspring. \end{itemize} Three other functions of possible interest are: \begin{itemize} \item \texttt{multilocusTypes()}---summarise the unique `phenotypes' or genotypes in an allele data set. \item \texttt{foreignAlleles()}---identify the alleles present in the progeny that are not present in any of the adults in a dataset. \item \texttt{convertToPhenot()}---convert a genotype dataset to a `phenotype' dataset by removing all repeated alleles in each allele set. \end{itemize} Reference documentation for these functions can be found via the R HTML package documentation, or by typing, e.g.: \begin{verbatim} ?fixCSV \end{verbatim} at the R command prompt. \section{Input data format} \label{sec:inputDataFormat} \subsection{Data table schematic} \textbf{PolyPatEx} requires the allele data to be presented as a Comma Separated Value (CSV) file in a specific format, which we now describe. The first three columns of the data set are named \texttt{id}, \texttt{popn} and \texttt{mother}, in that order. The \textbf{PolyPatEx} scripts are case-sensitive, so ensure the column names are in lower case as shown. If dealing with data from a dioecious species, the fourth column of the data set is named \texttt{gender}. These first three (or four) columns are immediately followed by $p\times k$ further columns, where $p$ is the ploidy ($4$, $6$ or $8$) of the data set, and $k$ is the number of loci being analysed. These columns contain the marker allele information for the first locus in the first $p$ columns, information for the next locus in the next $p$ columns, and so on. Each cell in this part of the dataset contains either an allele label, or is left empty or contains an appropriate missing data symbol (see below). The $p\times k$ marker allele columns should all be named. Choose whatever names you like, provided they are unique and contain no commas (to avoid confusion with the CSV format). It is recommended to keep column names brief, and avoid using spaces or symbols other than periods or underscores in column names - otherwise, R may change column names when loading the data, to suit its own restrictions on valid variable names. Note that in \textbf{PolyPatEx} output, the $k$ loci will be referred to simply as \texttt{Locus~1} up to \texttt{Locus~k}, in the order that the groups of $p$ columns occur in the data file. Figure~\ref{fig:dataSchematic} is a schematic of the required data layout, for a dioecious data set with ploidy $p = 4$, and $k = 2$ loci. A dataset for a monoecious species would neglect the gender column. \begin{figure}[h] \includegraphics[width=\linewidth]{PolyPatExDataSchematic.pdf} \caption{Schematic of the input data layout required by \textbf{PolyPatEx}, for a dioecious species - for a monoecious species, neglect the \texttt{gender} column.} \label{fig:dataSchematic} \end{figure} %% Superceded by the above, but left in case any problems arise in %% loading an external figure into the vignette. %% % \begin{figure}[h] % \setlength{\unitlength}{1mm} % \begin{picture}(120,36) % \linethickness{1mm} % % Box % \put(0,0) {\line(1,0){120}} % \put(0,0) {\line(0,1){36}} % \put(0,36) {\line(1,0){120}} % \put(120,0) {\line(0,1){36}} % % Header line % \put(0,30){\line(1,0){120}} % % Interior column separators % %% Thinner lines to help identify loci data blocks % %% May be able to use \thinklines and \thinklines instead % %% of \linethickness... % \linethickness{0.5mm} % \put(10,0) {\line(0,1){36}} % \put(20,0) {\line(0,1){36}} % \put(30,0) {\line(0,1){36}} % \linethickness{1mm} % \put(40,0) {\line(0,1){36}} % \linethickness{0.5mm} % \put(50,0) {\line(0,1){36}} % thin line % \put(60,0) {\line(0,1){36}} % thin line % \put(70,0) {\line(0,1){36}} % thin line % \linethickness{1mm} % \put(80,0) {\line(0,1){36}} % \linethickness{0.5mm} % \put(90,0) {\line(0,1){36}} % thin line % \put(100,0) {\line(0,1){36}} % thin line % \put(110,0) {\line(0,1){36}} % thin line % %% Column headers % \footnotesize % \put(0.5,33) {\texttt{id}} % \put(10.5,33) {\texttt{popn}} % \put(20.5,33) {\texttt{mother}} % \put(30.5,33) {\texttt{gender}} % \put(40.5,33) {\texttt{L1a}} % \put(50.5,33) {\texttt{L1b}} % \put(60.5,33) {\texttt{L1c}} % \put(70.5,33) {\texttt{L1d}} % \put(80.5,33) {\texttt{L2a}} % \put(90.5,33) {\texttt{L2b}} % \put(100.5,33) {\texttt{L2c}} % \put(110.5,33) {\texttt{L2d}} % \end{picture} % \caption[loftitle]{text} % \label{fig:dataSchematic2} % \end{figure} \subsection{Data column contents} \textbf{PolyPatEx} assumes a set of marker data from a number of subjects (individual plants or animals), which include progeny, their mothers, and other adults---the dataset should contain a row for each subject. All included progeny should have their (known) mother present in the data set. Since the data is to be loaded from a Comma Separated Variable (CSV) format, do not use commas in any id or allele labels or column names, to avoid conflict with the use of the comma as a separator by the CSV format. Certain cells of the data set should be left empty in the CSV file as described below. When the dataset is loaded into R using \mbox{\texttt{inputData()}}, these blank cells are automatically converted to contain \texttt{NA}, which is R's symbol for a missing datum.\footnote{\texttt{NA} is a special symbol recognised by R, and is not to be confused with \texttt{"NA"} or \texttt{`NA'}, which are each character strings containing the letters `N' and `A'. Also, R is case sensitive, so use \texttt{NA}, not \texttt{na} or \texttt{Na}.} In the CSV file, you may also use \texttt{*} or \texttt{NA} (note: uppercase) for these `blank' cells if you prefer. R users who create or load their data by means other than \texttt{inputData()} should ensure that all `blank' cells contain \texttt{NA} prior to calling \texttt{preprocessData()}). \begin{itemize} \item Column \texttt{id} contains a unique identifying label for each individual in the dataset. \item Column \texttt{popn} contains a label for the population each subject comes from. Note that although the \texttt{popn} must be present in the dataset, it is not currently used by PolyPatEx---all analyses are applied ignoring population distinctions within a given allele dataset. \item Column \texttt{mother} contains, for each offspring, the \texttt{id} label of the mother of that offspring. For all non-progeny, entries in this column should be left blank. \item For dioecious data, column \texttt{gender} contains \texttt{M} for male adults, or \texttt{F} for female adults. Entries in this column for progeny should be left blank. \item The remaining $p\times k$ columns contain the allele labels, one label per cell. Where fewer than $p$ alleles were observed at a locus, the remaining cells for that locus-subject combination should be left blank. \item \emph{Do not use spaces in allele labels}. \textbf{PolyPatEx} functions use spaces as delimiters between allele labels in their processing of the data, so labels that already contain spaces will cause errors to occur. \end{itemize} %% TODO - implement code to check for spaces in allele labels... Table~\ref{tbl:exPhenot} is an example of a (very small) `phenotype' data set for a monoecious species with ploidy $6n$ having just two observed loci. %% Phenotype table %% \begin{table}[ht] \begin{sidewaystable} \centering \texttt{ \begin{tabular}{ccccccccccccccc} \hline id&popn&mother&L1a&L1b&L1c&L1d&L1e&L1f&L2a&L2b&L2c&L2d&L2e&L2f \\ [0.5ex] \hline GF1&GF&&252&249&255&267&268&279&367&&&&& \\ GF13&GF&&249&250&255&264&272&277&367&369&374&&& \\ GF14&GF&&249&252&255&&&&367&&&&& \\ GF15&GF&&246&249&250&277&264&278&367&369&&&& \\ GF16&GF&&249&250&255&264&252&282&367&368&374&&& \\ GF17&GF&&249&252&255&257&282&&367&369&375&&& \\ GF1-2310&GF&GF1&249&255&267&272&282&&367&&&&& \\ GF1-2311&GF&GF1&244&249&255&268&279&&&&&&& \\ GF1-2315&GF&GF1&249&252&255&264&266&&367&374&&&& \\ GF1-2316&GF&GF1&252&&268&279&267&&367&&&&& \\ GF13-2480&GF&GF13&249&252&255&264&277&&367&369&374&&& \\ GF13-2481&GF&GF13&249&250&264&277&252&&368&369&&&& \\ GF13-2482&GF&GF13&249&250&255&252&&&369&374&&&& \\ GF13-2485&GF&GF13&250&264&272&277&&&367&369&&&& \\ GF13-2487&GF&GF13&249&264&277&252&&&367&369&368&&& \\ GF13-2491&GF&GF13&250&253&264&272&277&252&368&374&&&& \\ GF13-2492&GF&GF13&249&250&264&277&&&367&369&374&&& \\ GF13-2493&GF&GF13&250&255&264&268&277&&369&374&368&&& \\ GF13-2495&GF&GF13&249&252&264&267&277&&367&374&&&& \\ GF13-2496&GF&GF13&249&250&252&255&266&&368&374&&&& \\ [1ex] \hline \end{tabular} } % end texttt \caption{Small example `phenotype' data set for two observed loci in a monoecious species having ploidy $6n$.} \label{tbl:exPhenot} \end{sidewaystable} %% \end{table} In Table~\ref{tbl:exPhenot}, note that: \begin{itemize} \item \texttt{GF1} and \texttt{GF13} are mothers of progeny in the dataset. Their progeny are indicated by the presence of \texttt{GF1} or \texttt{GF13} in the \texttt{mother} column. \item \texttt{GF14} up to \texttt{GF17} are other adults. \item Blank cells in each block of six columns indicate where fewer than six alleles were observed. Locus \texttt{L2} appears to have fewer unique alleles overall (with heavier multiplicity) than Locus \texttt{L1}. \item The order of the rows is unimportant, but the order of the columns \emph{is} important! \end{itemize} Table~\ref{tbl:exGenot} is an example of a genotype data set for a dioecious species having a ploidy $4n$ and three observed loci. Subjects \texttt{1913} and \texttt{1914} are mothers; their progeny are again indicated by the mother id's in the mother column. Two of the subject-locus combinations failed to record any alleles and have therefore been left blank. %% Genotype table (see Genotype small example dataset.xlsx) %% \begin{table}[ht] \begin{sidewaystable} \centering \texttt{ \begin{tabular}{cccccccccccccccc} \hline id&popn&mother&gender&SB85a&SB85b&SB85c&SB85d&SB101a&SB101b&SB101c&SB101d&SB243a&SB243b&SB243c&SB243d \\ [0.5ex] \hline 2089&FR&&M&103&103&103&103&179&179&179&179&126&120&120&120 \\ 2090&FR&&M&103&103&103&106&179&179&&179&120&120&120&123 \\ 2091&FR&&M&&&&&179&179&179&179&117&120&120&126 \\ 2092&FR&&M&100&103&103&106&179&179&179&187&117&120&120&126 \\ 2208&FR&&M&100&100&103&103&179&179&179&179&117&120&120&126 \\ 1913&FR&&F&103&103&103&103&179&179&179&179&117&120&120&123 \\ 2800&FR&1913&&103&103&103&103&179&179&179&179&117&120&120&123 \\ 2809&FR&1913&&103&103&103&103&179&179&179&179&120&120&123&123 \\ 2810&FR&1913&&103&103&103&103&179&179&179&179&117&120&123&126 \\ 1914&FR&&F&103&103&103&103&179&179&179&179&117&117&117&123 \\ 2820&FR&1914&&103&103&103&103&179&179&179&179&117&120&120&123 \\ 2829&FR&1914&&103&103&103&103&179&179&186&186&117&117&117&123 \\ 2824&FR&1914&&103&103&103&106&&&&&117&117&117&126 \\ 2825&FR&1914&&103&103&103&103&179&179&179&186&120&120&120&123 \\[1ex] \hline \end{tabular} } % end texttt \caption{Small example genotype data set for three observed loci in a dioecious species having ploidy $4n$. The gender column contains codes \texttt{M} for Male, and \texttt{F} for Female.} \label{tbl:exGenot} \end{sidewaystable} %% \end{table} \subsection{Spreadsheets, CSV format, and the \texttt{fixCSV} function} \label{sec:CSVIssues} Most users will prepare their data in a commercial spreadsheet application, then export or `Save As...' the data to a CSV formatted file. In some spreadsheets applications, when `Save As...' is used, the process can be a little confusing, so we strongly recommend you save your spreadsheet file and \emph{make a backup of it before you attempt to create the CSV file}. A further complication that can arise relates to the cells of the data table that are represented in the CSV file when it is created. The CSV format is a plain text format, and when viewed as a plain text file, the CSV file for a rectangular table should ideally contain one fewer commas (column separators) per row as there are columns in the table. The data set of Table~\ref{tbl:exPhenot} contains 15 columns, so ideally there should be exactly 14 commas in each row of the resulting CSV file, as in Table~\ref{tbl:exPhenotCSV} below. \begin{table}[ht] \begin{verbatim} id,popn,mother,L1a,L1b,L1c,L1d,L1e,L1f,L2a,L2b,L2c,L2d,L2e,L2f GF1,GF,,252,249,255,267,268,279,367,,,,, GF13,GF,,249,250,255,264,272,277,367,369,374,,, GF14,GF,,249,252,255,,,,367,,,,, GF15,GF,,246,249,250,277,264,278,367,369,,,, GF16,GF,,249,250,255,264,252,282,367,368,374,,, GF17,GF,,249,252,255,257,282,,367,369,375,,, GF1-2310,GF,GF1,249,255,267,272,282,,367,,,,, GF1-2311,GF,GF1,244,249,255,268,279,,,,,,, GF1-2315,GF,GF1,249,252,255,264,266,,367,374,,,, GF1-2316,GF,GF1,252,,268,279,267,,367,,,,, GF13-2480,GF,GF13,249,252,255,264,277,,367,369,374,,, GF13-2481,GF,GF13,249,250,264,277,252,,368,369,,,, GF13-2482,GF,GF13,249,250,255,252,,,369,374,,,, GF13-2485,GF,GF13,250,264,272,277,,,367,369,,,, GF13-2487,GF,GF13,249,264,277,252,,,367,369,368,,, GF13-2491,GF,GF13,250,253,264,272,277,252,368,374,,,, GF13-2492,GF,GF13,249,250,264,277,,,367,369,374,,, GF13-2493,GF,GF13,250,255,264,268,277,,369,374,368,,, GF13-2495,GF,GF13,249,252,264,267,277,,367,374,,,, GF13-2496,GF,GF13,249,250,252,255,266,,368,374,,,, \end{verbatim} \caption{The data of Table~\ref{tbl:exPhenot} in CSV format, viewed as plain (ascii) text.} \label{tbl:exPhenotCSV} \end{table} Note in Table~\ref{tbl:exPhenotCSV} that the empty cells at the right hand side of the data table (due to the sparsity of distinct alleles in the phenotypic data at locus \texttt{L2}) are still represented in the CSV file, hence each row contains exactly 14 commas (even though the trailing commas may be thought of as somewhat redundant). The \textbf{PolyPatEx} function \texttt{inputData()} is pedantic on this point---it will complain if a column contains too many or too few commas in any of the rows. Also, any empty rows appearing below the data table in the CSV file will cause errors to occur when the data is checked by \texttt{preprocessData()}. When a CSV file is created by exporting/saving from a spreadsheet program: \begin{itemize} \item The spreadsheet may drop trailing commas that it sees as redundant in a row, particularly if there would be a large number of such commas (this can happen if there are lots of empty cells at the right hand side of a data table, as in the example of Table~\ref{tbl:exPhenot}); \item The spreadsheet may think that the data area is larger than it is meant to be, perhaps due to edits or formatting changes made in cells beyond the rows and columns of the data table (even if the content that was previously there has since been deleted). This can result in excess commas in a row, or commas appearing below the data table. \end{itemize} Both possibilities arise because spreadsheet applications typically provide a working space that is larger (in terms of number of rows and columns) than the final data table to be saved---so the spreadsheet has to make decisions as to which of the `blank' cells in the spreadsheet should be represented in the CSV file. There are two steps to ensure these problems do not occur: \begin{enumerate} \item To avoid problems due to previous edits outside the data table, it is a good idea to copy the completed data table from its original spreadsheet into a newly created spreadsheet or spreadsheet tab, before exporting to CSV format. If you copy \emph{only the rectangle containing the data and column headers}, you should leave behind any edits outside the data region which could otherwise cause problems when exporting to CSV format. \item In addition, \textbf{PolyPatEx} provides the function \texttt{fixCSV()}, which can be used to `tidy up' a CSV file before you load it into R with \texttt{inputData()}. Function \texttt{fixCSV()} will check the header row of the dataset in the CSV file and will add or remove trailing commas in each subsequent row as needed, to obtain the correct number of commas in each row. \texttt{fixCSV()} can also recognise and remove redundant empty rows below the main body of data should they occur (if you have carried out step~1 above, this should not occur). By default \texttt{fixCSV()} will write the result to a new CSV file having the word `FIXED' appended to the original filename. An option exists to overwrite the original file, but if you wish to use this option \emph{make sure you have backed up your data file first}! \end{enumerate} \section{\textbf{PolyPatEx} example workflow} \label{sec:workflow} Prior to using \textbf{PolyPatEx} for the first time, you should ensure that you have installed package \textbf{gtools} from the \textbf{Comprehensive R Archive Network} (CRAN). If not already installed, and assuming your computer is connected to the internet, the following R command can be used to download and install \textbf{gtools}: \begin{verbatim} install.packages('gtools') \end{verbatim} The \textbf{PolyPatEx} R package contains two example microsatellite data sets in the required input file format. Once \textbf{PolyPatEx} is installed, the following R command will print out the location of these files: \begin{verbatim} system.file("extdata", package = "PolyPatEx") \end{verbatim} These data sets may also be found on the \textbf{PolyPatEx} website: \begin{verbatim} http://www.anbg.gov.au/cpbr/tools/polypatex/ \end{verbatim} \begin{sloppypar} The file \texttt{GF\_Phenotype.csv} contains data from seven loci in a hexaploid, monoecious species, \emph{Eremophila glabra ssp.~glabra} (Elliott 2010). The file \texttt{FR\_Genotype.csv} contains data from seven loci in a tetraploid, dioecious species, \emph{Salix cinerea} (Hopley 2011). \end{sloppypar} Let us assume that the file \texttt{GF\_Phenotype.csv} has been copied to a suitable directory, and the R working directory has been set to this location (via the R file menu, or see the R function \texttt{setwd()}). Table~\ref{tbl:codeEx} shows example R code to load the data set into R, perform the exclusion analysis, and output results from the summary functions \texttt{potentialFatherCounts()} and \texttt{potentialFatherIDs()} to CSV files for scrutiny in a spreadsheet application. Note that R code is case sensitive. In Table~\ref{tbl:codeEx}, the \texttt{require()} function is used to load the \textbf{PolyPatEx} package.\footnote{For newcomers to R: When an R package is \emph{installed}, its content is placed where R can access it on your computer's hard drive---but in an R session, R does not try to access that content until the package is \emph{loaded} via the \texttt{require()} or \texttt{library()} functions. Loading the \textbf{PolyPatEx} package is required for each new R session, before you can access \textbf{PolyPatEx} functions. Provided you have installed package \textbf{gtools}, you need not \texttt{require()} it before using \textbf{PolyPatEx}---\textbf{PolyPatEx} functions know how to access \textbf{gtools} content directly.} The data file is then checked for possible CSV format issues discussed in Section~\ref{sec:CSVIssues}, using \texttt{fixCSV()}---the option \texttt{overwrite=TRUE} has been specified to overwrite the original file with the `corrected' version should any issues be found - always ensure you have packed up the original versions of the your datasets before using this option! Function \texttt{inputData()} is then used to input the allele data from the CSV file. The data set is immediately passed by \texttt{inputData()} to \texttt{preprocessData()} which performs a number of checks and preprocessing steps on the dataset before the result is returned in the form of an R data frame - this is stored as \texttt{adata} in the example of Table~\ref{tbl:codeEx}. Users should not make changes to this data frame before analysing it with other \textbf{PolyPatEx} functions, unless they re-run \texttt{preprocessData()} on the data frame again. Users who wish to load or create the allele data set in R without using \texttt{inputData()} must run \texttt{preprocessData()} on the data frame prior to using further \textbf{PolyPatEx} analysis functions. \begin{sloppypar} Further information on the key checks and preprocessing performed by \texttt{preprocessData()} is provided in Section~\ref{sec:preprocessData}. \end{sloppypar} \begin{table}[ht] \begin{verbatim} require(PolyPatEx) fixCSV("GF_Phenotype.csv",overwrite=TRUE) adata <- inputData("GF_Phenotype.csv", numLoci=7, ploidy=6, dataType="phenotype", dioecious=FALSE, selfCompatible=TRUE) ppe1 <- phenotPPE(adata) write.csv(potentialFatherCounts(ppe1),file="pFatherCounts.csv") write.csv(potentialFatherIDs(ppe1),file="pFatherIDs.csv") \end{verbatim} \caption{Example code to perform a paternity exclusion analysis on a phenotype dataset.} \label{tbl:codeEx} \end{table} Arguments to \texttt{inputData()} (or when called directly, \texttt{preprocessData()}) should specify the details of the data to be loaded: \begin{itemize} \item \texttt{numLoci=} The number of loci $k$, in the dataset. \item \texttt{ploidy=} The ploidy $p$ ($4$, $6$, or $8$) of the autopolyploid species being analysed. The ploidy can also be $2$, provided \texttt{dataType="genotype"}. \item \texttt{dataType=} Either \texttt{"genotype"} (allele copy numbers known) or \texttt{"phenotype"} (allele copy numbers unknown). \item \texttt{dioecious=} Is the species dioecious (\texttt{dioecious=TRUE}) or monoecious (\texttt{dioecious=FALSE})? \item \texttt{selfCompatible=} If the species is monoecious, should an offspring's mother be regarded also as a candidate father (\texttt{selfCompatible=TRUE}) or not (\texttt{selfCompatible=FALSE})? If \texttt{dioecious=TRUE}, you need not specify this argument. \item \texttt{mothersOnly=} If the species is dioecious, should female adults that are not mothers be removed from the dataset (\texttt{mothersOnly=TRUE}) or not (\texttt{mothersOnly=FALSE})? Female adults that are not mothers do not play a part in the exclusion analysis, but this option does affect results from the functions \texttt{multilocusTypes()} and \texttt{foreignAlleles()}. \item \texttt{lociMin=} The minimum required number of loci that must have alleles present for the individual to be retained in the dataset. This argument has a default value of \texttt{lociMin=1}. For more on this argument, see Section~\ref{sec:preprocessData}. \item \texttt{matMismatches} The minimum allowed number of loci in a given offspring that are allowed to mismatch between mother and offspring before the offspring is removed from the dataset. For more on this argument, see Section~\ref{sec:preprocessData}. \end{itemize} % A further argument to \texttt{inputData()} (and \texttt{fixCSV()}) is % \texttt{skip=}, which can be used to specify a number of leading lines % in the data file that are to be \emph{ignored} by \texttt{inputData()} % (and \texttt{fixCSV()}) - the default value for this argument is zero, % implying that the header row (the row of column names) in the dataset % is the first row of the CSV file. The \texttt{skip=} argument allows % you to place some lines of comments \emph{before} the data table in % the CSV file. If \texttt{skip=} is non-zero, \texttt{fixCSV()}) will % simply copy the skipped lines of the input file into its output file. Once the data has been loaded into R, checked by \texttt{preprocessData()} and stored as an R data frame object (here called \texttt{adata}), the data frame is passed to one of the exclusion routines, in this case \texttt{phenotPPE}, which performs the paternity exclusion analysis. The result of the exclusion analysis is an R `list' structure, which is stored here as \texttt{ppe1}. This list will generally be very large, and its contents are explained in more detail in the reference information for functions \texttt{phenotPPE} and \texttt{genotPPE}. \begin{sloppypar} However, the contents of the list \texttt{ppe1} will usually not be the most useful output for interpretation. Functions \texttt{potentialFatherCounts()} and \texttt{potentialFatherIDs()} are provided to summarise the results of the analysis in a more useable form - these functions return R data frames summarising the number of non-excluded candidates (i.e., potential fathers) for each offspring in the dataset, and the ID's of potential fathers of each offspring. \end{sloppypar} Two optional arguments to these functions (not shown in the example) should be noted. Missing allele sets in progeny, mother or candidate father and issues such as mismatching allele sets between progeny and mother, can result in loci at which exclusion comparisons with a given candidate cannot be made. Argument \texttt{VLTMin=} can be used to set the minimum number of `valid' loci (at which valid exclusion comparisons could be made) that are required for a candidate to be assessed as a potential father to a given offspring. Candidates with fewer than this number of 'valid' loci are ignored. The default value is \texttt{VLTMin=1}. Argument \texttt{mismatches=} can be used to specify a maximum number of \emph{mismatching} loci that are allowed in a candidate that is still flagged as a potential father. This option can provide some allowance for the possibility of mismatches occurring due to mutations or genotyping errors. The default value is \texttt{mismatches=0}, hence all valid loci must provide an offspring-candidate match before the candidate is flagged as a potential father. \begin{sloppypar} In the example code, the results of calls to \texttt{potentialFatherCounts()} and \texttt{potentialFatherIDs()} have been saved directly to CSV files via R's \texttt{write.csv()}, rather than being saved as R objects. \end{sloppypar} %% TODO The first 12 lines of the table produced by potentialFatherIDs %% are shown in Table 2. \section{More on function \texttt{preprocessData()}} \label{sec:preprocessData} As mentioned, function \texttt{preprocessData()} performs a number of checks and preprocessing steps on the input allele dataset, and is a prerequisite to using any of \textbf{PolyPatEx}'s analysis functions. \texttt{preprocessData()} may stop with an error message if certain errors in the data layout or content are found, requiring the user to correct the dataset before calling \texttt{fixCSV()}, \texttt{inputData()} and/or \texttt{preprocessData()} again as appropriate. Certain other issues result in affected allele sets being reset to have no alleles. In genotype datasets, allele sets that do not contain $p$ alleles are reset to contain no alleles (a `missing' allele set), and subsequent analyses will ignore this locus in this individual - hence, for example, reducing the number of `valid' loci remaining from the exclusion analysis. (In allelic phenotype datasets, there is no requirement to observe exactly $p$ unique alleles, so this adjustment is only relevant to genotypic data.) Such changes to the allele dataset only affects the data frame in R - \texttt{preprocessData} does not change the input CSV file. As an example, offspring-mother pairs are checked by \texttt{preprocessData()} for loci in which the allele sets are incompatible (a mother-offspring `mismatch'). In genotype datasets, the condition for compatibility between mother and offspring is simply that the mother's allele set must be able to provide at least $p/2$ of the alleles present in the offspring's allele set. In phenotype datasets, where allele copy numbers are not known, the comparison is more complex---the software must search through the possible genotypes arising from the observed phenotypes before it can confirm whether a mother's observed alleles compatible with the offsprings observed alleles. An argument, \texttt{matMismatches} is provided that allows the user to specify the number of mismatching loci between mother and offspring that are allowed before the offspring is removed from the dataset. \texttt{matMismatches} may be between \texttt{0} and \texttt{numLoci - 1}. If a given offspring has \texttt{matMismatches} or fewer loci mismatching with its mother, the affected loci in the \emph{offspring} are set to be missing (i.e. to contain no non-missing alleles). If an offspring has greater than \texttt{matMismatches} missing loci, its data will be removed from the dataset returned by \texttt{preprocessData}. The default value for \texttt{matMismatches} is \texttt{0}. Details of the affected individuals and loci involved are reported to the user, so that they may, if they wish, check for and correct possible data errors in the input CSV file. \section{The exclusion comparisons} At each locus, the exclusion routines \texttt{genotPPE} and \texttt{phenotPPE} look for candidate fathers whose allele set is compatible with the alleles in the offspring, given the alleles present in the mother's allele set. Comparisons are made neglecting the possibility of mutation or other mechanisms that may violate the assumptions described in Section~\ref{sec:Intro}. In genotype datasets, this means that the candidate must be able to account for any alleles in the offpsring that cannot be provided by that offspring's mother, plus as many more of the offspring's alleles (also found in the mother) as are needed to make up a full gamete's component of $p/2$ alleles. In phenotype datasets, where allele copy numbers are unknown, the situation is more complex, since a proper comparison between offspring and candidate father (given the offspring's mother) must take into account all of the possible genotypes (totalling $p$ alleles per allele set) that could have arisen from the $1$ to $p$ alleles observed at the locus in each of the paired individuals. In \texttt{phenotPPE}, the search through all possible comparisons is implemented via previously calculated lookup tables included in the package. Currently, lookup tables are provided for ploidies $4n$, $6n$ and $8n$. \newpage \section{PolyPatEx - License} \label{sec:License} {\small \textbf{CSIRO Open Source Software License Agreement (GPLv3)} Copyright (c) 2014, Commonwealth Scientific and Industrial Research Organisation (CSIRO) ABN 41 687 119 230. All rights reserved. CSIRO is willing to grant you a license to the PolyPatEx R package on the terms of the GNU General Public License version 3 as published by the Free Software Foundation (http://www.gnu.org/licenses/gpl.html), except where otherwise indicated for third party material. --------------------------------------------------------------------- The following additional terms apply under clause 7 of that license: EXCEPT AS EXPRESSLY STATED IN THIS AGREEMENT AND TO THE FULL EXTENT PERMITTED BY APPLICABLE LAW, THE SOFTWARE IS PROVIDED "AS-IS". CSIRO MAKES NO REPRESENTATIONS, WARRANTIES OR CONDITIONS OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY REPRESENTATIONS, WARRANTIES OR CONDITIONS REGARDING THE CONTENTS OR ACCURACY OF THE SOFTWARE, OR OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, THE ABSENCE OF LATENT OR OTHER DEFECTS, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. TO THE FULL EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL CSIRO BE LIABLE ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, IN AN ACTION FOR BREACH OF CONTRACT, NEGLIGENCE OR OTHERWISE) FOR ANY CLAIM, LOSS, DAMAGES OR OTHER LIABILITY HOWSOEVER INCURRED. WITHOUT LIMITING THE SCOPE OF THE PREVIOUS SENTENCE THE EXCLUSION OF LIABILITY SHALL INCLUDE: LOSS OF PRODUCTION OR OPERATION TIME, LOSS, DAMAGE OR CORRUPTION OF DATA OR RECORDS; OR LOSS OF ANTICIPATED SAVINGS, OPPORTUNITY, REVENUE, PROFIT OR GOODWILL, OR OTHER ECONOMIC LOSS; OR ANY SPECIAL, INCIDENTAL, INDIRECT, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES, ARISING OUT OF OR IN CONNECTION WITH THIS AGREEMENT, ACCESS OF THE SOFTWARE OR ANY OTHER DEALINGS WITH THE SOFTWARE, EVEN IF CSIRO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH CLAIM, LOSS, DAMAGES OR OTHER LIABILITY. APPLICABLE LEGISLATION SUCH AS THE AUSTRALIAN CONSUMER LAW MAY APPLY REPRESENTATIONS, WARRANTIES, OR CONDITIONS, OR IMPOSES OBLIGATIONS OR LIABILITY ON CSIRO THAT CANNOT BE EXCLUDED, RESTRICTED OR MODIFIED TO THE FULL EXTENT SET OUT IN THE EXPRESS TERMS OF THIS CLAUSE ABOVE "CONSUMER GUARANTEES". TO THE EXTENT THAT SUCH CONSUMER GUARANTEES CONTINUE TO APPLY, THEN TO THE FULL EXTENT PERMITTED BY THE APPLICABLE LEGISLATION, THE LIABILITY OF CSIRO UNDER THE RELEVANT CONSUMER GUARANTEE IS LIMITED (WHERE PERMITTED AT CSIRO'S OPTION) TO ONE OF FOLLOWING REMEDIES OR SUBSTANTIALLY EQUIVALENT REMEDIES: (a) THE REPLACEMENT OF THE SOFTWARE, THE SUPPLY OF EQUIVALENT SOFTWARE, OR SUPPLYING RELEVANT SERVICES AGAIN; (b) THE REPAIR OF THE SOFTWARE; (c) THE PAYMENT OF THE COST OF REPLACING THE SOFTWARE, OF ACQUIRING EQUIVALENT SOFTWARE, HAVING THE RELEVANT SERVICES SUPPLIED AGAIN, OR HAVING THE SOFTWARE REPAIRED. IN THIS CLAUSE, CSIRO INCLUDES ANY THIRD PARTY AUTHOR OR OWNER OF ANY PART OF THE SOFTWARE OR MATERIAL DISTRIBUTED WITH IT. CSIRO MAY ENFORCE ANY RIGHTS ON BEHALF OF THE RELEVANT THIRD PARTY. ------------------------------------------------------------------------ Note: The GNU General Public License version 3 can also be viewed at \texttt{http://www.r-project.org/licenses/} or in the file \texttt{share/licenses/GPL-3} in the R (source or binary) distribution of the PolyPatEx package } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \section{} % \label{sec:} % \textbf{toCSV} % \texttt{toCSV} % section~\ref{sec:ImplicitLists}, % page~\pageref{sec:ImplicitLists} % \section{About the toCSV package} % \label{sec:About} % \emph{not} % \begin{itemize} % \item % \item % \item % \end{itemize} % <<>>= % require(PolyPatEx) % @ % <>= % R code goes here % @ % \noindent to prevent paragraph indentation % \begin{Sinput} % ?toCSV % \end{Sinput} % I forget what the diff is between <<>>= and \begin{Sinput}\end{Sinput} % - check % \begin{quotation} % \end{quotation} \end{document}