ProtE Workflow

All four functions streamline the following data-processing steps:

  1. Normalization of proteomic intensity values.

  2. Filtering based on the percentage of missing values.

  3. Imputation of missing data to ensure robust downstream analysis.

  4. Description fetching for DIA-NN input files (.pg_matrix.tsv) and Proteome Discoverer input files.

Normalization

Quantitative mass spectrometry data produced by software tools such as Proteome Discoverer and MaxQuant may need to be normalized to reduce systematic bias. The ProtE package offers several normalization methods, selected via the normalization argument. The options include a simple log2 transformation of the data, Cyclic Loess normalization, which aims to reduce dissimilarities between samples, and Quantile normalization, which makes the intensity distribution of every sample identical. Both Quantile and Cyclic Loess normalization are applied to the log2-transformed data and are implemented using functions of the limma package. The data can also be transformed with median normalization, in which all intensity values are divided by their sample's median intensity, so that every sample has a median equal to 1. Two further methods share an initial step of dividing each intensity value by the total sum of intensities of its sample: Total Ion Current (TIC) normalization then multiplies the values by the average Total Ion Current across samples, while Parts Per Million (PPM) normalization multiplies them by one million. Lastly, Variance Stabilizing Normalization (VSN) is available through the corresponding function of the vsn package. Because the output of DIA-NN has already been normalized during MaxLFQ quantification, additional normalization methods should be selected with caution.
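
The arithmetic behind the simpler methods can be illustrated with a minimal sketch. It is written in plain Python purely for illustration and is not the ProtE implementation; the function names and the toy data are hypothetical, and missing values are assumed to be absent.

```python
import math
from statistics import median

# Hypothetical toy data: one list of raw intensities per sample, no missing values.
samples = {
    "S1": [100.0, 200.0, 400.0],
    "S2": [50.0, 150.0, 300.0],
}

def log2_transform(values):
    """Plain log2 transformation of one sample's intensities."""
    return [math.log2(v) for v in values]

def median_normalize(values):
    """Divide every intensity by the sample median, so the new median is 1."""
    m = median(values)
    return [v / m for v in values]

def total_sums(data):
    """Total sum of intensities (total ion current) per sample."""
    return {name: sum(vals) for name, vals in data.items()}

def tic_normalize(data):
    """Divide by the sample total, then multiply by the average total across samples."""
    sums = total_sums(data)
    mean_tic = sum(sums.values()) / len(sums)
    return {name: [v / sums[name] * mean_tic for v in vals]
            for name, vals in data.items()}

def ppm_normalize(data):
    """Divide by the sample total, then multiply by one million."""
    sums = total_sums(data)
    return {name: [v / sums[name] * 1_000_000 for v in vals]
            for name, vals in data.items()}
```

After TIC normalization every sample has the same total intensity (the average of the original totals), and after PPM normalization every sample sums to one million.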

Filtering of Missing Values

When working with mass spectrometry-based proteomics data, a common issue is the presence of missing values in the intensity measurements of each protein. These missing values can occur for several reasons: some proteins may not be identified in the sample due to technical limitations, their abundances may fall below the detection limit of the instrument, or the proteins may be completely absent from the examined sample. ProtE offers the option of filtering proteins based on the percentage of missing values they contain. Specifically, each function includes the argument filtering_value, the maximum percentage of missing values a protein may have to remain in the filtered dataset. Thus, if the user sets it to 100, no filtering occurs and the set of proteins is not altered.
The parameter global_filtering determines whether filtering for missing values is performed across all groups together or separately inside each group. Proteins whose intensity values are all missing are always omitted from the analysis. In addition, proteins marked as reverse hits (REV) are excluded when the input is the ProteinGroups.txt file from MaxQuant.
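
The filtering rule can be sketched as follows. This is a plain Python illustration, not the package code. Two details are assumptions that the text above does not specify: the threshold is treated as inclusive, and with per-group filtering a protein is kept when at least one group satisfies the threshold. The function names are hypothetical.

```python
def missing_percentage(values):
    """Percentage of missing (None) entries in a list of intensities."""
    return 100.0 * sum(v is None for v in values) / len(values)

def keep_protein(intensities_by_group, filtering_value, global_filtering=True):
    """Decide whether one protein survives the missing-value filter.

    intensities_by_group maps a group name to that protein's intensities,
    with None marking a missing value.
    """
    all_values = [v for vals in intensities_by_group.values() for v in vals]
    # Proteins with only missing values are always omitted.
    if all(v is None for v in all_values):
        return False
    if global_filtering:
        # One threshold applied to all samples together.
        return missing_percentage(all_values) <= filtering_value
    # ASSUMPTION: per-group filtering keeps the protein if any group passes.
    return any(missing_percentage(vals) <= filtering_value
               for vals in intensities_by_group.values())

# Hypothetical protein: 75% missing in "control", 25% missing in "patient".
protein = {
    "control": [1.0, None, None, None],
    "patient": [2.0, 3.0, 4.0, None],
}
```

With this toy protein, 50% of all values are missing, so a global threshold of 40 removes it, while the per-group rule assumed here keeps it because the "patient" group is within the threshold.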

Imputation

The ProtE package also offers several options for estimating missing values, via the imputation argument of each function. The available methods include assigning to the missing values the limit of detection of the experiment (the lowest abundance value in the dataset) or half of it, or values drawn from a Gaussian distribution derived from it. Other options treat missing values as zeros or assign the mean abundance of each protein to its missing values. Additionally, k-nearest neighbors (kNN) imputation is available through the package VIM, and missRanger, a faster multivariate imputation algorithm based on random forests and an alternative to missForest, is available through the package of the same name.
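
The deterministic options can be illustrated with a short plain Python sketch. It is not the package code: the method labels used here ("LOD", "LOD/2", "zeros", "mean") are hypothetical names chosen for the example and may differ from the values accepted by the imputation argument. The Gaussian, kNN and missRanger methods are not covered.

```python
from statistics import mean

def impute(matrix, method="LOD"):
    """Fill missing values (None) in a protein -> intensities mapping.

    Assumes proteins with only missing values were already filtered out.
    """
    observed = [v for row in matrix.values() for v in row if v is not None]
    lod = min(observed)  # limit of detection: lowest abundance in the dataset
    result = {}
    for protein, row in matrix.items():
        if method == "LOD":
            fill = lod
        elif method == "LOD/2":
            fill = lod / 2
        elif method == "zeros":
            fill = 0.0
        elif method == "mean":
            # Mean of the observed abundances of this protein only.
            fill = mean([v for v in row if v is not None])
        else:
            raise ValueError(f"unknown method: {method}")
        result[protein] = [fill if v is None else v for v in row]
    return result

# Hypothetical toy dataset with one missing value per protein.
matrix = {
    "P1": [10.0, None, 30.0],
    "P2": [4.0, 8.0, None],
}
```

Note that the limit of detection is taken over the whole dataset, whereas the mean is computed separately for each protein.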

Description Fetching

The input file .pg_matrix.tsv from DIA-NN does not provide descriptive information for each protein ID. To address this, the function dianno includes an argument called description, which allows the retrieval of descriptive information such as gene names and organisms. This option is also available for Proteome Discoverer inputs that lack this information. The data are obtained from the UniProt database using the UniProt.ws package. The description information is shown in each Excel file created that contains the dataset.

Statistical Analysis

Statistical analysis is then performed, including pairwise comparisons between the groups. The parameter independent allows users to specify whether the groups should be analyzed as independent (e.g. control samples vs. patient samples) or as matched pairs (e.g. the same patients before and after treatment). By default, independent = TRUE. To run a paired (matched) test, set independent = FALSE; in that case the groups must contain the same number of samples, and the samples must appear in the same order in every group.

The output includes an Excel file, Statistical_analysis.xlsx, which provides detailed information for each protein, including the average abundance, standard deviation, ratio, and log2 fold change of values between groups. It also includes the p-values and Benjamini-Hochberg adjusted p-values from pairwise Mann-Whitney comparisons, as well as Kruskal-Wallis test results when the number of groups is greater than 2. Additionally, the columns “Bartlett_p” and “Levene_p” contain the p-values of the Bartlett and Levene tests, which examine the homoscedasticity of each protein's abundances. Finally, the file reports the p-value and the pseudo-F value from the multivariate PERMANOVA test, computed with the function adonis2 from the package vegan.
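
Two of the reported quantities are simple enough to sketch in plain Python: the ratio with its log2 fold change, and the Benjamini-Hochberg adjustment. This is an illustration only, not the package code. The function names are hypothetical, and taking the ratio as the mean of the second group over the mean of the first is an assumption about its direction.

```python
import math
from statistics import mean

def fold_change(group_a, group_b):
    """Ratio of mean abundances and its log2.

    ASSUMPTION: the ratio is mean(group_b) / mean(group_a).
    """
    ratio = mean(group_b) / mean(group_a)
    return ratio, math.log2(ratio)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing with the raw p-values.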

Another Excel file, limma_statistics.xlsx, contains results from the parametric limma statistical test. For the limma statistics, the dataset is fitted to a linear model and the resulting statistics are moderated with the empirical Bayes method. The file reports the B-statistic, unadjusted p-value, and Benjamini-Hochberg adjusted p-value for the pairwise t-test comparisons between groups, as well as the ANOVA F-value, p-value, and adjusted p-value when the number of groups is greater than 2.

Data Visualization

To visualize the distribution of data for each sample, a boxplot and a violin plot are generated using the ggplot2 package. All protein abundance values are log2-transformed to bring their distribution closer to normal and to make samples comparable. Sample names are displayed on the plots; however, if a name is too long, only its last 25 characters appear. To improve readability, it is recommended to use shorter sample names when possible. Additionally, the samples are colored according to their respective groups.
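
The effect of the 25-character limit on a sample name can be previewed with a one-line sketch in plain Python. The helper name and the example sample name are hypothetical and are not part of ProtE.

```python
def plot_label(sample_name, max_chars=25):
    """Keep only the last max_chars characters of a sample name."""
    return sample_name[-max_chars:]

# Hypothetical long sample name (36 characters).
long_name = "experiment_2024_control_replicate_01"
short_label = plot_label(long_name)
```

Because the beginning of the name is dropped, names that differ only in their prefix become indistinguishable on the plots, which is another reason to keep sample names short.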

Principal Component Analysis (PCA) is also performed on the log2-transformed protein abundance data for each sample. The data are scaled and centered prior to the analysis. A PCA plot is created, displaying the samples in a two-dimensional space where the axes represent the first two principal components.

An additional PCA plot is created using only the significant proteins identified during the statistical analysis. For pairwise comparisons between two groups, significant proteins are selected based on their statistical results from the comparison. When analyzing more than two groups, significant proteins are identified using analysis of variance (ANOVA) across all groups.

Users can select the statistical method for determining significance with the parametric argument. Setting parametric = TRUE uses results from the limma t-test or ANOVA, while parametric = FALSE uses results from the Mann-Whitney U test or Kruskal-Wallis test. The significance argument further specifies which p-value is thresholded: significance = "p" uses a raw p-value threshold of 0.05, and significance = "BH" uses the Benjamini-Hochberg adjusted p-value instead.
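
How the two arguments combine can be sketched in plain Python. This illustrates the selection logic only and is not the package code: the layout of the results dictionary and the function name are hypothetical, and applying the same 0.05 cutoff to the adjusted p-values is an assumption.

```python
def select_significant(results, parametric=True, significance="p", threshold=0.05):
    """Return the proteins whose chosen p-value is below the threshold.

    results maps a protein to {"parametric": {...}, "nonparametric": {...}},
    each holding a raw p-value ("p") and a BH adjusted p-value ("BH").
    ASSUMPTION: the same threshold is used for raw and adjusted p-values.
    """
    if significance not in ("p", "BH"):
        raise ValueError("significance must be 'p' or 'BH'")
    test = "parametric" if parametric else "nonparametric"
    return [protein for protein, stats in results.items()
            if stats[test][significance] < threshold]

# Hypothetical test results for two proteins.
results = {
    "P1": {"parametric": {"p": 0.01, "BH": 0.08},
           "nonparametric": {"p": 0.20, "BH": 0.30}},
    "P2": {"parametric": {"p": 0.04, "BH": 0.045},
           "nonparametric": {"p": 0.03, "BH": 0.06}},
}
```

The same protein can be selected under one combination of settings and dropped under another, so the choice of parametric and significance directly changes which proteins enter the PCA plot and the heatmap.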

The PCA plot described above is then generated from the proteins that pass this threshold.

Additionally, a heatmap of the significant proteins is generated using the ComplexHeatmap package. Proteins are clustered based on the Euclidean distance between their abundance patterns across groups, and the heatmap provides an overview of group-specific differences.

Lastly, an Excel file named Quality_check.xlsx provides information about the percentage of missing values before and after filtering, along with the scores of the first two principal components for each sample.