Pedigree_constructor_details

Terry Therneau, Elizabeth Atkinson

24 March, 2024

Introduction

The pedigree routines came out of a simple need: to quickly draw a pedigree structure on the screen, within R, that was "good enough" to help with debugging the actual routines of interest, which were those for fitting mixed effects Cox models to large family data. As such the routine had compactness and automation as primary goals; complete annotation (monozygous twins, multiple types of affected status) and most certainly elegance were not on the list. Other software could do that much better.

It therefore came as a major surprise when these routines proved useful to others. Through their constant feedback, application to more complex pedigrees, and ongoing requests for one more feature, the routine has become what it is today. This routine is still not suitable for really large pedigrees, nor for heavily inbred ones such as in animal studies, and will likely not evolve in that way. The authors' fondest hope is that others will pick up the project.

Pedigree Constructor

The pedigree function is the first step, creating an object of class pedigree.
It accepts the following input

The [[famid]] variable is placed last as it was a later addition to the code; thus prior invocations of the function that use positional arguments will not be affected.
If present, this allows a set of pedigrees to be generated, one per family. The resultant structure will be an object of class [[pedigreeList]].

Note that a factor variable is not listed as one of the choices for the subject identifier. This is on purpose. Factors were designed to accommodate character strings whose values come from a limited class, things like race or gender, and are not appropriate for a subject identifier. All of their special properties as compared to a character variable turn out to be backwards for this case, in particular the memory of the original level set that is retained when subscripting is done.

However, due to the awful decision early on in S to automatically turn every character into a factor (unless you stood at the door with a club to head the package off), most users have become accustomed to using them for every character variable. (I encourage you to set the global option stringsAsFactors=FALSE to turn off autoconversion; it will measurably improve your R experience.) Therefore, to avoid unnecessary hassle for our users, the code will accept a factor as input for the id variables, but the final structure does not retain it.
Gender and relation do become factors. Status follows the pattern of the survival routines and remains an integer.
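The "memory of the original level set" is easy to see in base R. The short sketch below uses made-up identifiers and is only an illustration of the behaviour, not code from the package.

```r
# Hypothetical subject identifiers stored as a factor
id <- factor(c("joe", "charlie", "fred"))

# Subsetting keeps the full level set, even for levels no longer present
sub <- id[1]
levels(sub)

# Converting to character first avoids carrying the unused levels along
id_chr  <- as.character(id)
sub_chr <- id_chr[1]
```

This is why a factor supplied for an id variable is best converted with as.character before any further processing.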

We will describe the code in a set of blocks.

Data Checks and Errors

Errors1

The code starts out with some checks on the input data.
Are all the inputs the same length, are the codes legal, and so on. The checks confirm that ids are non-missing and that sex uses one of the expected codes 1-4 for male/female/unknown/terminated.
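A minimal sketch of such checks is given below. The helper name check_inputs and the error messages are hypothetical, and the coding 1 = male, 2 = female, 3 = unknown, 4 = terminated is assumed; the package's own code may differ in detail.

```r
# Hypothetical helper, not the package's code: basic sanity checks on the inputs
check_inputs <- function(id, dadid, momid, sex) {
  n <- length(id)
  if (length(dadid) != n) stop("Mismatched lengths, id and dadid")
  if (length(momid) != n) stop("Mismatched lengths, id and momid")
  if (length(sex)   != n) stop("Mismatched lengths, id and sex")
  if (any(is.na(id))) stop("Missing value for the id variable")

  # Assumed coding: 1 = male, 2 = female, 3 = unknown, 4 = terminated
  codes <- c("male", "female", "unknown", "terminated")
  if (is.numeric(sex)) {
    if (!all(sex %in% 1:4)) stop("Invalid numeric sex code")
    sex <- codes[sex]
  } else {
    sex <- as.character(sex)
    if (!all(sex %in% codes)) stop("Invalid character sex code")
  }
  # Return sex as a factor with a fixed level set
  factor(sex, levels = codes)
}

sex <- check_inputs(id = 1:4, dadid = c(0, 0, 1, 1), momid = c(0, 0, 2, 2),
                    sex = c(1, 2, 1, 2))
```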

Errors2

Create the variables describing a missing father and/or mother, which is what we expect both for people at the top of the pedigree and for marry-ins. It is easier to do this first, before the family id information is added. If there are multiple families in the pedigree, make a working set of identifiers of the form 'family/subject'. Family identifiers can be factor, character, or numeric.
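The idea can be sketched in a few lines of base R. The data are invented, and treating 0 or NA as "no parent recorded" is an assumption of this example rather than a statement about the package's missing-value rules.

```r
# Hypothetical data: two families that reuse the same subject numbers
famid <- c(1, 1, 1, 2, 2, 2)
id    <- c(1, 2, 3, 1, 2, 3)
dadid <- c(0, 0, 1, 0, 0, 1)   # 0 is assumed to mean "no father recorded"
momid <- c(0, 0, 2, 0, 0, 2)

# Flag missing parents first, while the raw codes are still recognizable
nofather <- is.na(dadid) | dadid == 0
nomother <- is.na(momid) | momid == 0

# Working identifiers of the form family/subject are unique across families
work_id    <- paste(famid, id,    sep = "/")
work_dadid <- paste(famid, dadid, sep = "/")
work_momid <- paste(famid, momid, sep = "/")
```

Flagging the missing parents before pasting matters because afterwards a missing father would look like "1/0", which is no longer an obvious missing-value code.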

Errors3-Parents

Next check that any mother or father identifiers are found in the identifier list, and are of the right sex. Subjects who don’t have a mother or father are founders. For those people both of the parents should be missing.
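These parent checks might look roughly as follows. The toy family and all variable names are illustrative, and missing parents are coded as NA here purely for convenience.

```r
# Hypothetical five-person family; NA marks an unrecorded parent
id    <- c("gp", "gm", "dad", "mom", "kid")
sex   <- c("male", "female", "male", "female", "female")
dadid <- c(NA, NA, "gp", NA, "dad")
momid <- c(NA, NA, "gm", NA, "mom")

has_dad <- !is.na(dadid)
has_mom <- !is.na(momid)

# Every named parent must appear in the identifier list
unknown_dad <- has_dad & !(dadid %in% id)
unknown_mom <- has_mom & !(momid %in% id)

# Fathers must be male and mothers female
dad_sex   <- sex[match(dadid[has_dad], id)]
mom_sex   <- sex[match(momid[has_mom], id)]
wrong_sex <- any(dad_sex != "male", na.rm = TRUE) ||
             any(mom_sex != "female", na.rm = TRUE)

# Founders have neither parent; having exactly one is inconsistent
founder    <- !has_dad & !has_mom
one_parent <- xor(has_dad, has_mom)
```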

Creation of Pedigrees

Now, paste the parts together into a basic pedigree. The fields for father and mother are not the identifiers of the parents, but their row number in the structure.
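As a rough illustration of the row-number representation, match() converts parent identifiers into positions. The element names findex and mindex follow the text; using 0 for "no parent" and a plain list instead of a classed object are assumptions of this sketch.

```r
# Hypothetical nuclear family with numeric identifiers
id    <- c(101, 102, 103, 104)
dadid <- c(NA, NA, 101, 101)
momid <- c(NA, NA, 102, 102)
sex   <- factor(c("male", "female", "female", "male"),
                levels = c("male", "female", "unknown", "terminated"))

# findex/mindex hold the row of each parent, 0 when there is none
ped <- list(
  id     = id,
  findex = match(dadid, id, nomatch = 0),
  mindex = match(momid, id, nomatch = 0),
  sex    = sex
)
```

Storing row numbers makes lookups such as ped$sex[ped$findex] direct, at the price of having to rebuild the indices whenever rows are removed or reordered.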

Finish Object

The final structure will be in the order of the original data, and all the components except [[relation]] will have the same number of rows as the original data.

Subscripting

Subscripting of a pedigree list extracts one or more families from the list. We treat character subscripts in the same way that dimnames on a matrix are used. Factors are a problem though: assume that we have a vector x with names "joe", "charlie", "fred"; then [[x['joe']]] is the first element of the vector, but [[temp <- factor(c('joe', 'charlie', 'fred')); z <- temp[1]; x[z]]] will be the third element! The factor levels sort alphabetically, so 'joe' carries the integer code 3, and R implicitly uses as.numeric on factors when they are a subscript. This caught an early version of the code when an element of a data frame was used to index the pedigree: characters are turned into factors when bundled into a data frame.
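The factor pitfall can be reproduced in base R. In the sketch below the three names are the ones from the text and the values 1, 2, 3 are arbitrary; with these names the levels sort as charlie, fred, joe, so the factor value "joe" has internal code 3.

```r
x <- c(joe = 1, charlie = 2, fred = 3)
x["joe"]                 # character subscript: matched by name

temp <- factor(c("joe", "charlie", "fred"))
z <- temp[1]             # the factor value "joe"
as.integer(z)            # its internal code, not its position in x
x[z]                     # indexed by that code, so this is not the "joe" element
x[as.character(z)]       # converting first restores lookup by name
```

Wrapping a possibly-factor subscript in as.character before using it is a simple guard against this behaviour.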

Note:

1. What should we do if the family id is numeric: when the user says [4], do they mean the fourth family in the list or family '4'? The user is responsible for saying ['4'] in this case.
2. In a normal vector invalid subscripts give an NA, e.g. (1:3)[6], but since there is no such object as an "NA pedigree", we emit an error for this.
3. The [[drop]] argument has no meaning for pedigrees, but must be a defined argument of any subscript method; we simply ignore it.
4. Updating the father/mother is a minor nuisance; since they are integer indices to rows they must be recreated after selection. Ditto for the relationship matrix.
For a pedigree, the subscript operator extracts a subset of individuals. We disallow selections that retain only 1 of a subject's parents, since they cause plotting trouble later. Relations are worth keeping only if both parties in the relation were selected.
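A minimal sketch of subsetting individuals is shown below, assuming a plain list with id, findex and mindex elements in which 0 means no parent. The helper subset_ped and its error message are hypothetical and ignore the relation matrix entirely.

```r
# Toy pedigree in row-index form (0 = no parent recorded)
ped <- list(
  id     = c("gp", "gm", "dad", "mom", "kid"),
  findex = c(0L, 0L, 1L, 0L, 3L),
  mindex = c(0L, 0L, 2L, 0L, 4L)
)

# Hypothetical helper: keep the rows in `keep` and rebuild the parent indices
subset_ped <- function(ped, keep) {
  # Recover parent ids before rows are renumbered; index 0 maps to NA
  dad <- c(NA, ped$id)[ped$findex + 1]
  mom <- c(NA, ped$id)[ped$mindex + 1]

  new_id <- ped$id[keep]
  findex <- match(dad[keep], new_id, nomatch = 0)
  mindex <- match(mom[keep], new_id, nomatch = 0)

  # Refuse selections that retain exactly one parent of a subject
  if (any(xor(findex > 0, mindex > 0)))
    stop("A subpedigree cannot retain only one parent of a subject")
  list(id = new_id, findex = findex, mindex = mindex)
}

sub <- subset_ped(ped, c(3, 4, 5))   # dad, mom and kid
```

In this example dropping both grandparents turns "dad" into a founder, which is allowed; a call such as subset_ped(ped, c(3, 5)) would keep the father of "kid" without the mother and therefore stops with an error.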

As Data.Frame for Pedigree

Convert the pedigree to a data.frame so that it is easy to view when removing or trimming individuals with their various indicators. The relation and hints elements of the pedigree object do not fit a data.frame with one entry per subject, so they are left out. The following items do have one entry per subject and are put in the returned data.frame: id, findex, mindex, sex, affected, status. The findex and mindex values are converted to the actual ids of the parents rather than the row indices.
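The index-to-id conversion can be sketched as follows for a plain list standing in for the pedigree object. The column names dadid and momid, and the use of NA for an absent parent, are choices made for this example and may not match what the package returns.

```r
# Toy stand-in for a pedigree object (0 = no parent recorded)
ped <- list(
  id     = c(101, 102, 103),
  findex = c(0L, 0L, 1L),
  mindex = c(0L, 0L, 2L),
  sex    = factor(c("male", "female", "female"))
)

# Replace row indices by the parents' ids; zero indices are dropped on the
# right-hand side, so the lengths on both sides of the assignment agree
n <- length(ped$id)
dadid <- momid <- rep(NA, n)
dadid[ped$findex > 0] <- ped$id[ped$findex]
momid[ped$mindex > 0] <- ped$id[ped$mindex]

ped_df <- data.frame(id = ped$id, dadid = dadid, momid = momid, sex = ped$sex)
```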
            
The method can be called as as.data.frame(ped) or data.frame(ped). It must be declared as an S3 method in the NAMESPACE file.
            
            
A variant that keeps the findex and mindex vectors as row indices, rather than replacing them with the ids of the parents, is useful for checking the pedigree object. It is not currently included in the package.

Printing Pedigree

It usually doesn't make sense to print a pedigree, since the id is just a repeat of the input data and the family connections are pointers. Thus we create a simple summary.
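A summary-style print method could be as small as the sketch below. The class name toyped and the wording of the message are invented for illustration, so that the example does not interfere with any real pedigree class.

```r
# Hypothetical S3 print method that reports only the number of subjects
print.toyped <- function(x, ...) {
  cat("Pedigree object with", length(x$id), "subjects\n")
  invisible(x)
}

ped <- structure(list(id = 1:5), class = "toyped")
out <- capture.output(print(ped))
```

Returning the object invisibly follows the usual convention for print methods, so that the printed object can still be passed along in a larger expression.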