{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It extends package {bigstatsr} with tools specific to genotype data.
To get you started:

- List of functions from {bigsnpr} and from {bigstatsr}
- Extended documentation with more examples + course recording
In R, run

```r
# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")
```

or, for the CRAN version,

```r
install.packages("bigsnpr")
```
This package reads bed/bim/fam files (PLINK's preferred format) using functions `snp_readBed()` and `snp_readBed2()`. Before reading into this package's special format, quality control and conversion can be done using PLINK, which can be called directly from R using `snp_plinkQC()` and `snp_plinkKINGQC()`.
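As a sketch of this workflow (the file path below is a placeholder), the bed file is converted once into the package's backing format, which can then be attached instantly in any later R session:

```r
library(bigsnpr)

# Convert the bed/bim/fam trio once; this creates .rds/.bk backing files
# next to the bed file ("mydata.bed" is a placeholder path).
rds <- snp_readBed("mydata.bed")

# In this and later sessions, attach the converted (memory-mapped) data
obj.bigSNP <- snp_attach(rds)
```

The conversion cost is paid only once; subsequent `snp_attach()` calls are nearly instantaneous because the data stays on disk and is memory-mapped.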
This package can also read UK Biobank BGEN files using function `snp_readBGEN()`. This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.
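A hedged sketch of a call to `snp_readBGEN()` (file paths and variant IDs are placeholders, and the `"chr_pos_a1_a2"` ID format shown is an assumption based on typical usage):

```r
library(bigsnpr)

# Read dosages for a chosen list of variants from a BGEN file.
# list_snp_id takes one character vector of variant IDs per BGEN file.
rds <- snp_readBGEN(
  bgenfiles   = "data_chr1.bgen",          # placeholder path
  backingfile = "dosages_chr1",            # placeholder backing-file prefix
  list_snp_id = list(c("1_752721_A_G",     # placeholder variant IDs
                       "1_754182_A_G")),
  ncores      = nb_cores()
)
```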
This package uses a class called `bigSNP` for representing SNP data. A `bigSNP` object is a list with these elements:

- `$genotypes`: an `FBM.code256` whose rows are samples and columns are variants. This stores genotype calls or dosages (rounded to 2 decimal places).
- `$fam`: a `data.frame` with some information on the individuals.
- `$map`: a `data.frame` with some information on the variants.

Note that most of the algorithms of this package don't handle missing values. You can use `snp_fastImpute()` (taking a few hours for a chip of 15K x 300K) and `snp_fastImputeSimple()` (taking a few minutes only) to impute missing values of genotyped variants.
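A minimal sketch of working with a `bigSNP` object and its elements (the .rds path is a placeholder, and `method = "mode"` is just one of the simple imputation strategies offered by `snp_fastImputeSimple()`):

```r
library(bigsnpr)

# Attach a previously converted bigSNP object ("mydata.rds" is a placeholder)
obj.bigSNP <- snp_attach("mydata.rds")

G <- obj.bigSNP$genotypes   # FBM.code256: rows = samples, columns = variants
dim(G)                      # number of samples and variants
str(obj.bigSNP$fam)         # individual-level information
str(obj.bigSNP$map)         # variant-level information

# Quickly impute missing genotype calls before running downstream algorithms
G2 <- snp_fastImputeSimple(G, method = "mode", ncores = nb_cores())
```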
Package {bigsnpr} also provides functions that directly work on bed files with a few missing values (the `bed_*()` functions). See the paper “Efficient toolkit implementing best practices for principal component analysis of population genetic data”.
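A hedged sketch of this approach (the bed path is a placeholder): the bed file is memory-mapped directly, without conversion, and a `bed_*()` function is applied to it:

```r
library(bigsnpr)

# Memory-map the bed file directly ("mydata.bed" is a placeholder path);
# a few missing values are tolerated by the bed_*() functions.
obj.bed <- bed("mydata.bed")

# PCA following the best practices from the paper cited above
obj.svd <- bed_autoSVD(obj.bed, k = 10, ncores = nb_cores())
plot(obj.svd)
```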
Polygenic scores are one of the main focuses of this package. There are currently three main methods available:
- Penalized regressions with individual-level data (see paper and tutorial)
- Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual-level data (see paper and tutorial)
- LDpred2 with summary statistics (see paper and tutorial), and lassosum2
- Multiple imputation for GWAS (https://doi.org/10.1371/journal.pgen.1006091)
- More interactive (visual) QC

You can request a feature by opening an issue.
Please open an issue if you find a bug; see how to make a great R reproducible example.
If you want help using {bigstatsr} (the `big_*()` functions), please open an issue on {bigstatsr}'s repo, or post on Stack Overflow with the tag bigstatsr.
I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.
Privé, Florian, et al. “Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.” Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, et al. “Efficient implementation of penalized regression for genetic risk prediction.” Genetics 212.1 (2019): 65-74.
Privé, Florian, et al. “Making the most of Clumping and Thresholding for polygenic scores.” The American Journal of Human Genetics 105.6 (2019): 1213-1221.
Privé, Florian, et al. “Efficient toolkit implementing best practices for principal component analysis of population genetic data.” Bioinformatics 36.16 (2020): 4449-4457.
Privé, Florian, et al. “LDpred2: better, faster, stronger.” Bioinformatics 36.22-23 (2020): 5424-5431.
Privé, Florian. “Optimal linkage disequilibrium splitting.” Bioinformatics 38.1 (2022): 255–256.
Privé, Florian. “Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics.” Bioinformatics 38.13 (2022): 3477-3480.
Privé, Florian, et al. “Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores.” Human Genetics and Genomics Advances 3.4 (2022).
Privé, Florian, et al. “Inferring disease architecture and predictive ability with LDpred2-auto.” The American Journal of Human Genetics 110.12 (2023): 2042-2055.