Help for package SNPfiltR

Title:

Interactively Filter SNP Datasets

Version:

1.0.7

Description:

Designed to interactively and reproducibly visualize and filter SNP (single-nucleotide polymorphism) datasets. This R-based implementation of SNP and genotype filters facilitates an interactive and iterative SNP filtering pipeline, which can be documented reproducibly via 'rmarkdown'. 'SNPfiltR' contains functions for visualizing various quality and missing data metrics for a SNP dataset, and then filtering the dataset based on user-specified cutoffs. All functions take 'vcfR' objects as input, which can easily be generated by reading standard vcf (variant call format) files into R using the R package 'vcfR' authored by Knaus and Grünwald (2017) <doi:10.1111/1755-0998.12549>. Each 'SNPfiltR' function can return a newly filtered 'vcfR' object, which can then be written to a local directory in standard vcf format using the 'vcfR' package, for downstream population genetic and phylogenetic analyses.

License:

MIT + file LICENSE

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.2.3

Imports:

vcfR, ggplot2, Rtsne, cluster, adegenet, gridExtra, ggridges, stats, graphics, utils

Depends:

R (≥ 2.10)

Suggests:

rmarkdown, knitr

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2025-09-03 17:55:21 UTC; devder

Author:

Devon DeRaad [aut, cre]

Maintainer:

Devon DeRaad <devonderaad@gmail.com>

Repository:

CRAN

Date/Publication:

2025-09-04 06:40:39 UTC

SNPfiltR: A package for interactively visualizing and filtering SNP datasets

Description

The SNPfiltR package allows users to interactively visualize the effects of relevant filters on their datasets in order to optimize filtering parameters rather than simply following historical precedent. Each function takes a variant call format (vcf) file, stored in local memory as a vcfR object, as input. Most functions can be run without specified cutoffs, in order to visualize the distribution of the parameter of interest in your given dataset. Then the same function can be run with a specified cutoff, and a filtered vcfR object will be returned. For detailed documentation and vignettes showing fully implemented SNP filtering pipelines, please go to: devonderaad.github.io/SNPfiltR
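The visualize-then-filter workflow described above can be sketched as follows. This is a minimal, hedged example that chains functions documented later in this manual on the bundled example data. The cutoff values and the choice and order of filters are illustrative assumptions rather than recommendations, and the output path is a temporary file chosen for this sketch. It assumes the 'SNPfiltR' and 'vcfR' packages are installed.

```r
# Sketch only: cutoffs and filter order are illustrative assumptions.
if (requireNamespace("SNPfiltR", quietly = TRUE) &&
    requireNamespace("vcfR", quietly = TRUE)) {
  vcf <- SNPfiltR::vcfR.example

  # Step 1: run without a cutoff to inspect the distribution.
  SNPfiltR::max_depth(vcfR = vcf)

  # Step 2: rerun with a cutoff to obtain a filtered vcfR object.
  filtered <- SNPfiltR::max_depth(vcfR = vcf, maxdepth = 100)

  # Further filters can be chained, each taking the previous output.
  filtered <- SNPfiltR::hard_filter(vcfR = filtered, depth = 5, gq = 30)
  filtered <- SNPfiltR::filter_biallelic(vcfR = filtered)
  filtered <- SNPfiltR::min_mac(vcfR = filtered, min.mac = 1)

  # Write the result in standard vcf format using the vcfR package.
  out_file <- tempfile(fileext = ".vcf.gz")
  vcfR::write.vcf(filtered, file = out_file)
}
```

Keeping each intermediate object under its own name, instead of overwriting 'filtered', makes it easier to compare the dataset before and after a given step.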

Visualize how missing data thresholds affect sample clustering

Description

This function can be run in two ways: 1) Without 'thresholds' specified. This will run a PCA for the input vcf without filtering, and visualize the clustering of samples in two-dimensional space, coloring each sample according to a priori population assignment given in the popmap. 2) With 'thresholds' specified. This will filter your input vcf file to the specified missing data thresholds, and run a PCA for each filtering iteration. For each iteration, a 2D plot will be output showing clustering according to the specified popmap. This option is ideal for assessing the effects of missing data on clustering patterns.

Usage

assess_missing_data_pca(
  vcfR,
  popmap = NULL,
  thresholds = NULL,
  clustering = TRUE
)

Arguments

vcfR

a vcfR object

popmap

set of population assignments that will be used to color code the plots

thresholds

optionally specify a vector of missing data filtering thresholds to explore

clustering

use partitioning around medoids (PAM) to do unsupervised clustering on the output? (default = TRUE, max clusters = # of levels in popmap + 2)

Value

a series of plots showing the clustering of all samples in two-dimensional space

Examples

assess_missing_data_pca(vcfR = SNPfiltR::vcfR.example,
popmap = SNPfiltR::popmap,
thresholds = c(.6,.8))

Visualize how missing data thresholds affect sample clustering

Description

This function can be run in two ways: 1) Without 'thresholds' specified. This will run t-SNE for the input vcf without filtering, and visualize the clustering of samples in two-dimensional space, coloring each sample according to a priori population assignment given in the popmap. 2) With 'thresholds' specified. This will filter your input vcf file to the specified missing data thresholds, and run a t-SNE clustering analysis for each filtering iteration. For each iteration, a 2D plot will be output showing clustering according to the specified popmap. This option is ideal for assessing the effects of missing data on clustering patterns.

Usage

assess_missing_data_tsne(
  vcfR,
  popmap = NULL,
  thresholds = NULL,
  perplexity = NULL,
  iterations = NULL,
  initial_dims = NULL,
  clustering = TRUE
)

Arguments

vcfR

a vcfR object

popmap

set of population assignments that will be used to color code the plots

thresholds

a vector specifying the missing data filtering thresholds to explore

perplexity

a numerical value specifying the perplexity parameter for t-SNE (default: 5)

iterations

a numerical value specifying the number of iterations for t-SNE (default: 1000)

initial_dims

a numerical value specifying the number of initial dimensions for t-SNE (default: 5)

clustering

use partitioning around medoids (PAM) to do unsupervised clustering on the output? (default = TRUE, max clusters = # of levels in popmap + 2)

Value

a series of plots showing the clustering of all samples in two-dimensional space

Examples

assess_missing_data_tsne(vcfR = SNPfiltR::vcfR.example,
popmap = SNPfiltR::popmap,
thresholds = .8)

Filter a vcf file based on distance between SNPs on a given scaffold

Description

This function requires a vcfR object as input, and returns a vcfR object filtered to retain only SNPs greater than a specified distance apart on each scaffold. The function starts by automatically retaining the first SNP on a given scaffold, and then keeps the next SNP that is greater than the specified distance away from the last retained SNP, repeating until it reaches the end of the scaffold/chromosome. This function scales well with an increasing number of SNPs, but poorly with an increasing number of scaffolds/chromosomes. For this reason, there is a built-in progress bar to monitor potentially long-running executions with many scaffolds. This type of filtering is often employed to reduce linkage among input SNPs, especially for downstream input to programs like structure, which require unlinked SNPs.

Usage

distance_thin(vcfR, min.distance = NULL)

Arguments

vcfR

a vcfR object

min.distance

a numeric value representing the smallest distance (in base-pairs) allowed between SNPs after distance thinning

Value

An identical vcfR object, except that SNPs separated by less than the specified distance have been removed from the file

Examples

distance_thin(vcfR = SNPfiltR::vcfR.example, min.distance = 1000)
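The greedy thinning rule described above can be illustrated in base R on a toy set of positions. This is a sketch of the idea only, not the package's implementation: the helper name 'thin_positions' and the toy data are hypothetical, and treating a SNP exactly 'min.distance' away as too close is an assumption of this sketch.

```r
# Hypothetical helper: returns TRUE for SNPs kept after distance thinning.
thin_positions <- function(chrom, pos, min.distance) {
  keep <- logical(length(pos))
  for (scaffold in unique(chrom)) {
    idx <- which(chrom == scaffold)
    idx <- idx[order(pos[idx])]   # walk along the scaffold in order
    last_kept <- -Inf             # guarantees the first SNP is retained
    for (i in idx) {
      if (pos[i] - last_kept > min.distance) {
        keep[i] <- TRUE
        last_kept <- pos[i]
      }
    }
  }
  keep
}

# Toy data: four SNPs on scaffold s1, two on scaffold s2.
chrom <- c("s1", "s1", "s1", "s1", "s2", "s2")
pos   <- c(100, 600, 1200, 2500, 50, 900)
keep  <- thin_positions(chrom, pos, min.distance = 1000)
data.frame(chrom, pos, keep)
```

Because each scaffold is processed separately, the distance is measured from the last retained SNP on the same scaffold, never across scaffolds.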

Filter out heterozygous genotypes failing an allele balance check

Description

This function requires a vcfR object as input, and returns a vcfR object in which heterozygous genotypes with an allele balance falling outside of the specified range have been converted to 'NA'. If no ratio is specified, a default range of .25-.75 is imposed. From the dDocent filtering page: "Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5".

Usage

filter_allele_balance(vcfR, min.ratio = NULL, max.ratio = NULL)

Arguments

vcfR

a vcfR object

min.ratio

minimum allele ratio for a called heterozygous genotype

max.ratio

maximum allele ratio for a called heterozygous genotype

Value

An identical vcfR object, except that genotypes failing the allele balance filter have been converted to 'NA'.

Examples

filter_allele_balance(vcfR = SNPfiltR::vcfR.example)
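To make the allele balance check concrete, the following base R sketch computes the ratio of reference reads to all reads for a few toy genotypes and masks heterozygotes outside the default .25-.75 range. The genotype and allele depth strings are made-up values in the usual vcf "ref,alt" layout, and whether the boundary values themselves pass is an assumption of this sketch, not a statement about the package's internals.

```r
# Toy genotype calls and allele depths ("ref,alt"); values are made up.
gt <- c("0/1", "0/1", "0/0", "0/1")
ad <- c("10,10", "2,18", "20,0", "5,15")

# Split the allele depths into reference and alternate read counts.
counts <- strsplit(ad, ",")
ref <- as.numeric(sapply(counts, `[`, 1))
alt <- as.numeric(sapply(counts, `[`, 2))

# Allele balance: reference reads over all reads.
ab <- ref / (ref + alt)

# Only heterozygous calls are tested; the bounds are treated as passing here.
is_het <- gt %in% c("0/1", "1/0", "0|1", "1|0")
fail <- is_het & (ab < 0.25 | ab > 0.75)

gt_filtered <- gt
gt_filtered[fail] <- NA
data.frame(gt, ad, ab, gt_filtered)
```

Only the second genotype (2 of 20 reads supporting the reference allele) is converted to 'NA'; the homozygous call is left alone regardless of its ratio.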

Remove SNPs with more than two alleles

Description

This function simply removes any SNPs that contain more than two alleles from the vcf file. Many downstream applications require SNPs to be biallelic, so this filter is generally a good idea during processing.

Usage

filter_biallelic(vcfR)

Arguments

vcfR

a vcfR object

Value

a vcfR object with SNPs containing more than two alleles removed

Examples

filter_biallelic(vcfR = SNPfiltR::vcfR.example)

Hard filter a vcf file by depth and genotype quality (gq)

Description

This function requires a vcfR object as input. The user can then specify the minimum value for depth of coverage required to retain a called genotype (must be numeric). Additionally, the user can specify a minimum genotype quality required to retain a called genotype (again, must be numeric).

Usage

hard_filter(vcfR, depth = NULL, gq = NULL)

Arguments

vcfR

a vcfR object

depth

an integer representing the minimum depth for genotype calls that you wish to retain (e.g. 'depth = 5' would remove all genotypes with a sequencing depth of 4 reads or less)

gq

an integer representing the minimum genotype quality for genotype calls that you wish to retain (e.g. 'gq = 30' would remove all genotypes with a quality score of 29 or lower)

Value

The vcfR object input, with the sites failing specified filters converted to 'NA'

Examples

hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5)
hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5, gq = 30)
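The per-genotype logic of a hard filter can be sketched in base R with toy matrices (rows are SNPs, columns are samples). This illustrates the rule stated in the argument descriptions above, using made-up values; it is not the package's code.

```r
# Toy matrices: 2 SNPs (rows) x 2 samples (columns); values are made up.
gt <- matrix(c("0/0", "0/1", "1/1", "0/1"), nrow = 2)
dp <- matrix(c(12, 4, 5, 30), nrow = 2)   # depth per genotype
gq <- matrix(c(40, 35, 29, 30), nrow = 2) # genotype quality per genotype

min_depth <- 5
min_gq <- 30

# A genotype is set to NA if it falls below either minimum.
gt_filtered <- gt
gt_filtered[dp < min_depth | gq < min_gq] <- NA
gt_filtered
```

Here the genotype with depth 4 and the genotype with quality 29 are masked, while the genotypes sitting exactly on a threshold (depth 5, quality 30) are retained.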

Visualize and filter based on mean depth across all called SNPs

Description

This function can be run in two ways: 1) specify vcfR object only. This will visualize the distribution of mean depth per sample across all SNPs in your vcf file, and will not alter your vcf file. 2) specify vcfR object, and set 'maxdepth' = 'integer value'. This option will show you where your specified cutoff falls in the distribution of SNP depth, and remove all SNPs with a mean depth above the specified threshold from the vcf. Super high depth loci are likely multiple loci stuck together into a single paralogous locus. Note: This function filters on a 'per SNP' basis rather than a 'per genotype' basis, otherwise it would disproportionately remove genotypes from our deepest sequenced samples (because sequencing depth is so variable between samples).

Usage

max_depth(vcfR, maxdepth = NULL)

Arguments

vcfR

a vcfR object

maxdepth

an integer specifying the maximum mean depth for a SNP to be retained

Value

The vcfR object input, with SNPs above the 'maxdepth' cutoff removed

Examples

max_depth(vcfR = SNPfiltR::vcfR.example)
max_depth(vcfR = SNPfiltR::vcfR.example, maxdepth = 100)
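The 'per SNP' nature of this filter can be shown with a small base R sketch: depth is averaged across samples for each SNP, and whole SNPs are dropped when that mean exceeds the cutoff. The depth matrix is made up, and ignoring missing depths when averaging is an assumption of this sketch.

```r
# Toy depth matrix: 3 SNPs (rows) x 3 samples (columns); values are made up.
dp <- rbind(snp1 = c(10, 20, 30),
            snp2 = c(150, 200, 250),
            snp3 = c(NA, 90, 110))

maxdepth <- 100

# Mean depth per SNP, skipping uncalled genotypes (assumption of this sketch).
mean_depth <- rowMeans(dp, na.rm = TRUE)

# Whole SNPs are removed when their mean depth is above the cutoff.
keep <- mean_depth <= maxdepth
dp_filtered <- dp[keep, , drop = FALSE]
mean_depth
rownames(dp_filtered)
```

Note that snp2 is removed for every sample at once, which is what distinguishes this from a per-genotype depth filter.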

Visualize and filter based on minor allele count (MAC)

Description

This function can be run in two ways: 1) Without 'min.mac' specified. This will return a folded site frequency spectrum (SFS), without performing any filtering on the vcf file. 2) With 'min.mac' specified. This will also print the folded SFS and show you where your specified minimum MAC falls. It will then return your vcfR object with SNPs falling below your minimum MAC threshold removed. Note: previous filtering steps (especially removing samples) may have resulted in invariant SNPs (MAC = 0). For this reason it's a good idea to run min_mac(vcfR, min.mac = 1) before using a SNP dataset in downstream analyses.

Usage

min_mac(vcfR, min.mac = NULL)

Arguments

vcfR

a vcfR object

min.mac

an integer specifying the minimum minor allele count for a SNP to be retained (e.g. 'min.mac=3' would remove all SNPs with a MAC of 2 or less)

Value

if 'min.mac' is not specified, the allele frequency spectrum is returned. If 'min.mac' is specified, SNPs falling below the MAC cutoff will be removed, and the filtered vcfR object will be returned.

Examples

min_mac(vcfR=SNPfiltR::vcfR.example)
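For readers unfamiliar with minor allele counts, the following base R sketch shows one way to compute MAC for biallelic diploid genotypes and apply a 'min.mac = 1' style filter. The genotype matrix and the 'dosage' lookup are made up for illustration, and this is not the package's implementation.

```r
# Toy genotype matrix: 4 SNPs (rows) x 4 samples (columns); values are made up.
gt <- matrix(c("0/0", "0/1", "1/1", NA,
               "0/0", "0/0", "0/0", "0/0",
               "0/1", "0/0", "0/0", "0/0",
               "1/1", "1/1", "0/1", "1/1"),
             nrow = 4, byrow = TRUE)

# Number of alternate alleles carried by each genotype (NA stays NA).
dosage <- c("0/0" = 0, "0/1" = 1, "1/0" = 1, "1/1" = 2)
d <- matrix(dosage[as.vector(gt)], nrow = nrow(gt))

alt_count <- rowSums(d, na.rm = TRUE)   # alternate alleles observed per SNP
total     <- 2 * rowSums(!is.na(d))     # called alleles per SNP
mac       <- pmin(alt_count, total - alt_count)

# Keep SNPs whose minor allele count meets the minimum.
min.mac <- 1
keep <- mac >= min.mac
data.frame(alt_count, total, mac, keep)
```

The second SNP is invariant (MAC = 0) and is the only one removed; the fourth SNP has a rare reference allele, so its minor allele count is 1 rather than 7.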

Visualize missing data per sample, remove samples above a missing data cutoff

Description

This function can be run in two ways: 1) Without 'cutoff' specified. This will visualize the amount of missing data in each sample across a variety of potential missing data cutoffs. Additionally, it will show you a dotplot ordering the amount of overall missing data in each sample. Based on these visualizations, you can make an informed decision about the optimal cutoff for removing samples that are missing too much data to be retained for downstream analyses. 2) With 'cutoff' specified. This option will show you the dotplot with your cutoff marked, remove samples above that missing data cutoff, and return the filtered vcfR object.

Usage

missing_by_sample(vcfR, popmap = NULL, cutoff = NULL)

Arguments

vcfR

a vcfR object

popmap

if specified, it must be a two-column dataframe with column names 'id' and 'pop'. IDs must match the IDs in the vcfR object

cutoff

a numeric value between 0-1 specifying the maximum proportion of missing data allowed in a sample to be retained for downstream analyses

Details

Note: This decision is highly project specific, but these visualizations should help you get a feel for how very low data samples cannot be rescued simply by a missing data SNP filter. If you want to remove specific samples from your vcf that cannot be specified with a simple cutoff refer to this great tutorial which is the basis for the code underlying this function.

Value

if 'cutoff' is not specified, will return a dataframe containing the average depth and proportion missing data in each sample. If 'cutoff' is specified, the samples falling above the missing data cutoff will be removed, and the filtered vcfR object will be returned.

Examples

missing_by_sample(vcfR = SNPfiltR::vcfR.example)
missing_by_sample(vcfR = SNPfiltR::vcfR.example, cutoff = .7)
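The per-sample calculation can be sketched in base R as the proportion of uncalled genotypes in each column of a genotype matrix. The toy matrix and sample names are made up, and this is an illustration of the documented rule rather than the package's code.

```r
# Toy genotype matrix: 4 SNPs (rows) x 4 samples (columns); values are made up.
gt <- matrix(c("0/0", NA,    "0/1", NA,
               "0/1", NA,    "1/1", "0/0",
               "1/1", NA,    NA,    "0/0",
               "0/0", "0/1", "0/0", NA),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("snp", 1:4), c("A", "B", "C", "D")))

# Proportion of missing genotypes in each sample.
prop_missing <- colMeans(is.na(gt))

# Samples above the cutoff are removed; the rest are retained.
cutoff <- 0.7
gt_filtered <- gt[, prop_missing <= cutoff, drop = FALSE]
prop_missing
colnames(gt_filtered)
```

Sample B is missing 75% of its genotypes and is dropped at a cutoff of .7, while sample D (50% missing) is retained.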

Visualize missing data per SNP, remove SNPs above a missing data cutoff

Description

This function can be run in two ways: 1) Without 'cutoff' specified. This will visualize the amount of missing data in each sample across a variety of potential missing data cutoffs. Additionally, it will show you dotplots visualizing the number of total SNPs retained across a variety of filtering cutoffs, and the total proportion of missing data. Based on these visualizations, you can make an informed decision about the optimal cutoff for minimizing the overall missingness of your dataset while still retaining an appropriate number of SNPs for the downstream inferences you hope to make. 2) With 'cutoff' specified. This option will show you the dotplots with your cutoff marked, and then remove SNPs above the missing data cutoff.

Usage

missing_by_snp(vcfR, cutoff = NULL)

Arguments

vcfR

a vcfR object

cutoff

a numeric value between 0-1 specifying the maximum proportion of missing data allowed in a SNP to be retained for downstream analyses

Value

if 'cutoff' is not specified, will return a dataframe containing the proportion missing data and the total SNPs retained across each filtering level. If 'cutoff' is specified, SNPs falling above the missing data cutoff will be removed, and the filtered vcfR object will be returned.

Examples

missing_by_snp(vcfR = SNPfiltR::vcfR.example)
missing_by_snp(vcfR = SNPfiltR::vcfR.example, cutoff = .6)
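The trade-off between missingness and the number of SNPs retained can be sketched in base R by tallying how many SNPs survive at each of several cutoffs. This sketch takes the 'cutoff' description above at face value (maximum proportion of missing data allowed per SNP); the toy matrix and the cutoff values are made up, and it is worth checking the function's printed messages to confirm how the cutoff is applied in your installed version.

```r
# Toy genotype matrix: 4 SNPs (rows) x 4 samples (columns); values are made up.
gt <- matrix(c("0/0", NA,    "0/1", NA,
               "0/1", NA,    "1/1", "0/0",
               "1/1", NA,    NA,    "0/0",
               "0/0", "0/1", "0/0", NA),
             nrow = 4, byrow = TRUE)

# Proportion of missing genotypes at each SNP.
snp_missing <- rowMeans(is.na(gt))

# Count SNPs retained under each candidate cutoff (assumed: max missingness).
cutoffs <- c(0.2, 0.3, 0.5)
retained <- sapply(cutoffs, function(ct) sum(snp_missing <= ct))
data.frame(cutoff = cutoffs, snps_retained = retained)
```

Stricter cutoffs lower overall missingness but leave fewer SNPs, which is the balance the dotplots are meant to help you judge.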

Popmap for example scrub-jay vcfR file

Description

A dataset containing the sample name and population assignment for the 20 scrub-jay samples in SNPfiltR::vcfR.example. The variables are as follows:

Usage

data(popmap)

Format

A data frame with 20 rows and 2 variables

Details

id. unique sample identifier
pop. population assignment for each individual
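A popmap for your own data follows the same two-column layout. The sketch below builds one in base R and checks it against a vector of sample IDs. The sample and population names are hypothetical, as is the 'vcf_ids' vector, which stands in for the sample names of your own vcfR object.

```r
# Hypothetical popmap: one row per sample, columns named 'id' and 'pop'.
my_popmap <- data.frame(
  id  = c("sample_01", "sample_02", "sample_03"),
  pop = c("north", "north", "south"),
  stringsAsFactors = FALSE
)

# Stand-in for the sample names stored in your vcfR object
# (for a vcfR object 'vcf', typically colnames(vcf@gt)[-1]).
vcf_ids <- c("sample_01", "sample_02", "sample_03")

# Basic sanity checks before passing the popmap to a function.
has_columns <- all(c("id", "pop") %in% names(my_popmap))
ids_match   <- setequal(my_popmap$id, vcf_ids)
has_columns && ids_match
```

Running checks like these first makes a mismatch between popmap IDs and vcf sample names easier to diagnose than an error raised inside a plotting function.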

Example scrub-jay vcfR file

Description

A vcfR object containing 500 SNPs for 20 individuals. Species assignments for each individual can be accessed via SNPfiltR::popmap

Usage

data(vcfR.example)

Format

A vcfR object containing 500 SNPs for 20 individuals