All 4 functions streamline the following data processing:
Normalization of proteomic intensity values
Filtering based on the percentage of missing values.
Imputation of missing data to ensure robust downstream analysis.
Description fetching for DIA-NN input files: .pg_matrix.tsv, and ProteomeDiscoverer input files.
Mass Spectrometry quantitative data produced by software tools such as Proteome Discoverer and MaxQuant may be required to be processed with Normalization methods to reduce their systemic bias. The ProtE package offers different methods of normalization that are provided via the normalization argument. These options include a simple log2 transformation of the data, Cyclic Loess normalization that aims to reduce their dissimilarities , and Quantile normalization which is implemented to make the distribution of each feature identical. Both Quantile and Cyclic loess Normalization are applied to the log2 transformed data and are implemented using functions of the limma package. The data can also be transformed with median normalization, in which all intensity values are divided by each sample’s median intensity, resulting in a median equal to 1 across the protein data set . Other normalization methods share an initial step of dividing each intensity value by the total sum of intensities for its respective sample. Total Ion Current normalization then rescales the values by multiplying them by the average Total Ion Current while Parts Per Million (PPM) normalization scales the values by multiplying them by one million. Lastly, Variable Stabilizing Normalization is also available implementing the corresponding function of the vsn package . Because the data output from DIA-NN has already been normalized with the MaxLFQ quantification, extra normalizing methods are suggested to be selected cautiously.
When working with mass spectrometry-based proteomics data, a common
issue encountered is the presence of missing values in the intensity
measurements for each protein. These missing values can occur for
several reasons. Some proteins may not be identified in the sample due
to technical limitations, their abundances may fall below the detection
limit of the analyzing instrument, or the proteins may be completely
absent from the examined sample. ProtE offers the option of filtering
Proteins based on the percentage of missing values they contain.
Specifically, functions include the argument filtering_value, which
refers to the percentage of missing values per protein allowed to remain
in the filtered dataset. Thus, if the user sets it to 100, no filtering
will occur, and the proteins will not be altered.
The parameter global_filtering determines if filtering for missing
values will be performed across all groups or separately inside each
group. When the intensity values of a protein contain only missing
values, they will be always omitted from the analysis. Also, the reverse
positive proteins (REV) will be excluded, when the input is the
ProteinGroups.txt from MaxQuant.
The ProtE package also offers a few options for estimating the missing values, via the argument imputation of each function. The available imputation methods include simply assigning the limit of detection of the experiment (lowest abundance value in the dataset), or its half to the missing values or values derived from the Gaussian distribution of it. Other options include treating missing values as zeros or assigning the mean abundance of each protein to its missing values. Additionally, k-nearest neighbors (kNN) imputation is available from the package VIM and missRanger, a quicker multivariate imputation algorithm alternative to missForest (based on random forests), from the package of the same name.
The input file .pg_matrix.tsv
from DIA-NN does not
provide descriptive information for each Protein ID. To address this
issue, the function dianno
includes an argument called
description
, which allows for the retrieval of descriptive
information such as gene names, organisms, and more. This option is also
available for Proteome Discoverer inputs that lack this information.
Detailed data is obtained from the UniProt database, using the
UniProt.ws package. The Description information will be shown in each
Excel file created that contains the dataset.
Statistical analysis will then be performed, including pairwise
comparisons between the groups. The parameter independent
allows users to specify whether the group variables should be analyzed
as independent (e.g. control samples vs patient samples) or as matched
pairs (e.g. same patients before and after treatment) By default, it is
set to independent = TRUE
, and to run a paired/matched
test independent = FALSE
, the user must provide groups with
the same number of samples, with the order of the samples across them
remaining the same.
The output includes an Excel file,
Statistical_analysis.xlsx
, which provides detailed
information for each protein, including the average abundance, standard
deviation, ratio, and log2 fold change of values between groups. It also
includes the p-values and Benjamini–Hochberg adjusted p-values from
pairwise Mann-Whitney comparisons, as well as Kruskal-Wallis test
results when the number of groups is greater than 2. Additionally, the
file contains in the columns “Bartlett_p” and “Levene_p” the p_values
for the reported statistical tests, that examine the homoscedasticity of
each proteins’ abundances. Last but not least, the pValue and the
pseudoF value from the multivariate PERMANOVA statistical test,
utilizing the function adonis2 from the package vegan.
Another Excel file, limma_statistics.xlsx
, contains
results from the parametric limma
statistical test. For the
limma statistics the dataset has been fit to a linear model and then
moderated by empirical Bayes method. This includes the ANOVA F-value,
p-value, and adjusted p-value, as well as the B-statistic, unadjusted
p-value, and Benjamini–Hochberg adjusted p-value for pairwise t-test
comparisons between groups, as well as the ANOVA F-value, p-value, and
adjusted p-value when the number of groups is greater than 2.
To visualize the distribution of data for each sample, a boxplot and violin plot are generated using the ggplot2 package. All protein abundance values are transformed via log2 transformation to ensure normality and comparability across samples. Sample names are displayed on the plots; however, if the names are too lengthy, only their last 25 characters will appear. To improve readability, it is recommended to use shorter sample names when possible. Additionally, the samples are colored according to their respective groups.
Principal Component Analysis (PCA) is also performed on the log2-transformed protein abundance data for each sample. The data are scaled and centered prior to the analysis. A PCA plot is created, displaying the samples in a two-dimensional space where the axes represent the first two principal components.
An additional PCA plot is created using only the significant proteins identified during the statistical analysis. For pairwise comparisons between two groups, significant proteins are selected based on their statistical results from the comparison. When analyzing more than two groups, significant proteins are identified using analysis of variance (ANOVA) across all groups.
Users can select the statistical method for determining significance
by setting the parametric
parameter. Setting
parametric = TRUE
uses results from the limma t-test or
ANOVA, while parametric = FALSE
uses results from the
Mann-Whitney U test or Kruskal-Wallis test. The user can further specify
the significance threshold with the significance
parameter:
setting significance = "p"
uses a raw p-value threshold of
0.05, and significance = "BH"
uses the Benjamini-Hochberg
adjusted p-value threshold.
A PCA plot is then generated based on the significant proteins.
Additionally, a heatmap of the significant proteins is generated using the ComplexHeatmap package. Proteins are clustered based on euclidean distance in abundance patterns across groups, and the heatmap provides an overview of group-specific differences.
Lastly, an excel file names Quality_check.xlsx provides information about the percentage of missing values before and after filtering, along with the scores of the first 2 principal components for each sample.