One function to analyze them all! The Proteomics Eye (ProtE) establishes an intuitive framework for the univariate analysis of label-free proteomics data. By compiling all necessary data wrangling and processing steps into the same function, ProtE automates all pairwise statistical comparisons for a given categorical variable, returning to the user performance quality metrics, measures to control for Type-I or Type-II errors, and publication-ready visualizations.
ProtE is currently compatible with data generated by MaxQuant, DIA-NN and Proteome Discoverer.
ProtE features 4 functions, each tailored to a specific use case:
• maximum_quantum() accepts as input the MaxQuant-generated file ProteinGroups.txt.
• dianno() accepts as input either of the two DIA-NN (or FragPipe-DIANN) output files, pg_matrix.tsv or unique_genes_matrix.tsv.
• pd_single() accepts as input the Proteome Discoverer output file that contains all sample protein intensities/abundances in one table.
• pd_multi() accepts as input separate Proteome Discoverer protein intensity files.
maximum_quantum(), dianno(), pd_single()
All 3 functions expect the input file to be passed in the parameter file. To enable statistical analysis, samples (columns) belonging to the same group must be placed next to each other in the input file. For example, samples from an experiment with a 3-group categorical variable (control, treatment, compound) could be arranged such that: first columns = Control samples, middle columns = Treatment samples, last columns = Compound samples.
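The column arrangement described above can be sketched in base R as follows. This is only an illustration: the table, the sample names and the sample-to-group mapping are made up, and a real input file contains many more columns.

```r
# Hypothetical example table: one annotation column followed by sample columns
# that are NOT yet grouped (names and values are made up for illustration).
intensities <- data.frame(
  Protein = c("P1", "P2"),
  Ctrl_1  = c(10, 20),
  Treat_1 = c(11, 21),
  Ctrl_2  = c(12, 22),
  Comp_1  = c(13, 23),
  Treat_2 = c(14, 24)
)

# Hypothetical sample-to-group mapping, in the current column order.
sample_groups <- c(Ctrl_1 = "Control", Treat_1 = "Treatment", Ctrl_2 = "Control",
                   Comp_1 = "Compound", Treat_2 = "Treatment")

# Desired left-to-right order of the groups.
group_order <- c("Control", "Treatment", "Compound")

# order() on a factor sorts by level; tied samples keep their original order.
sorted_samples <- names(sample_groups)[order(factor(sample_groups, levels = group_order))]

# Keep the annotation column first, then the samples group by group.
intensities_sorted <- intensities[, c("Protein", sorted_samples)]
print(colnames(intensities_sorted))
```

After this step the Control columns come first, followed by the Treatment and Compound columns.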
Assuming a MaxQuant quantification has been performed, the file ProteinGroups.txt can be fed to ProtE with the function maximum_quantum().
Insert the file path of ProteinGroups.txt in the file parameter. To copy the file path in Windows, first locate the file inside your folders. Hold Shift and right-click the file, then select “Copy as Path” from the context menu. In RStudio, press Ctrl+V or right-click to paste the path.
Because Windows separates directories with a single backslash (\), which R treats as an escape character inside strings, either replace each backslash with a forward slash (/) or double it (\\), e.g.:
maximum_quantum(file = "C:\\Bioprojects\\BreastCancer\\Proteomics\\MaxQuant\\ProteinGroups.txt")
or
maximum_quantum(file = "C:/Bioprojects/BreastCancer/Proteomics/MaxQuant/ProteinGroups.txt")
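As an alternative to editing the path by hand, the conversion can be sketched in base R. The example assumes R 4.0.0 or later, which supports raw strings; the path itself is the hypothetical one used above.

```r
# A Windows-style path with single backslashes (hypothetical location).
# The raw string syntax r"(...)" (R >= 4.0.0) keeps the backslashes literal.
pasted_path <- r"(C:\Bioprojects\BreastCancer\Proteomics\MaxQuant\ProteinGroups.txt)"

# Replace every backslash with a forward slash.
clean_path <- gsub("\\", "/", pasted_path, fixed = TRUE)
print(clean_path)
```

The resulting string can then be passed to the file parameter.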
group_names and number of samples_per_group
Group names are defined as a vector in the parameter group_names. The order of the group names inside the vector must follow the order in which the groups of samples (columns) are arranged in the input proteomics file, from left to right. The same applies to the number of samples in each group, which is defined as a vector in the parameter samples_per_group. In the following example there are 3 groups (Control, Treatment, Compound): the Control group consists of 10 samples, the Treatment group of 12 samples and the Compound group of 9:
maximum_quantum(
  file = "C:\\Bioprojects\\BreastCancer\\Proteomics\\MaxQuant\\ProteinGroups.txt",
  group_names = c("Control", "Treatment", "Compound"),
  samples_per_group = c(10, 12, 9),
  imputation = FALSE,
  global_filtering = TRUE,
  independent = TRUE,
  filtering_value = 50,
  normalization = FALSE,
  parametric = FALSE,
  significance = "p")
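To check that the two vectors describe the intended layout before running the analysis, they can be expanded into one label per sample column. The sketch below uses only base R and is not part of ProtE; the variable names sample_labels and n_samples are illustrative.

```r
group_names       <- c("Control", "Treatment", "Compound")
samples_per_group <- c(10, 12, 9)

# The two vectors must have one entry per group.
stopifnot(length(group_names) == length(samples_per_group))

# One label per sample column, in left-to-right order.
sample_labels <- rep(group_names, times = samples_per_group)

# The total should equal the number of sample columns in the input file.
n_samples <- length(sample_labels)
print(n_samples)

# Number of samples per group, in the declared order.
print(table(factor(sample_labels, levels = group_names)))
```

If the total does not match the number of sample columns in the input file, the group sizes or the column order should be checked.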
In the pairwise comparisons, the numerators and denominators of the FoldChange (and consequently the sign of the Log2FoldChange) are defined by the order of the group names declared in the parameter group_names. The general rule for the FoldChange is NextGroup/PreviousGroup. In our example, the FoldChange for the pairwise comparisons is set as Treatment/Control, Compound/Control and Compound/Treatment.
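The NextGroup/PreviousGroup rule can be illustrated with a short base R sketch. It is not ProtE's internal code: the group means are made-up numbers, and computing the fold change from group means is an assumption made for the illustration.

```r
group_names <- c("Control", "Treatment", "Compound")

# All pairs of groups, enumerated in the order of group_names.
pairs <- combn(group_names, 2)

# Later group as numerator, earlier group as denominator.
comparisons <- paste0(pairs[2, ], "/", pairs[1, ])
print(comparisons)

# Log2 fold changes for hypothetical group means of a single protein.
group_means <- c(Control = 100, Treatment = 400, Compound = 50)
log2_fc <- log2(group_means[pairs[2, ]] / group_means[pairs[1, ]])
names(log2_fc) <- comparisons
print(log2_fc)
```

A positive Log2FoldChange therefore means higher abundance in the group declared later in group_names.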
pd_multi()
pd_multi() is tailored for the analysis of multiple Proteome Discoverer (PD) exports, each one corresponding to a single sample. To use it, the user must save the PD exports in separate folders, one for each group of the variable to be analyzed. The paths to these folders are specified in the parameter …:
pd_multi(
  excel_file = "C:\\Bioprojects\\BreastCancer\\Proteomics\\PD\\Control",
  "C:\\Bioprojects\\BreastCancer\\Proteomics\\PD\\Treatment",
  "C:\\Bioprojects\\BreastCancer\\Proteomics\\PD\\Compound",
  imputation = FALSE,
  global_filtering = TRUE,
  independent = TRUE,
  filtering_value = 50,
  normalization = FALSE,
  parametric = FALSE,
  significance = "p")
In the pairwise comparisons, the numerators and denominators of the FoldChange (and consequently the sign of the Log2FoldChange) are defined by the order in which the group folders are declared in the pd_multi() call. Again, the general rule for the FoldChange is NextGroup/PreviousGroup. In our imaginary example, the FoldChange for the pairwise comparisons is set as Treatment/Control, Compound/Control and Compound/Treatment.
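Before calling pd_multi(), it can help to confirm that each group folder holds the expected number of exports. The sketch below builds a throwaway folder layout in a temporary directory. The file names and the .xlsx extension are assumptions made for illustration, not something stated in this README.

```r
# Build a temporary folder layout (hypothetical file names) to illustrate the idea.
root <- file.path(tempdir(), "PD")
group_dirs <- file.path(root, c("Control", "Treatment", "Compound"))
files_per_group <- c(2, 3, 2)
for (i in seq_along(group_dirs)) {
  dir.create(group_dirs[i], recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(group_dirs[i],
                        sprintf("sample_%02d.xlsx", seq_len(files_per_group[i]))))
}

# Count the export files found in each group folder (assumed .xlsx extension).
found <- vapply(group_dirs,
                function(d) length(list.files(d, pattern = "\\.xlsx$")),
                integer(1))
names(found) <- basename(group_dirs)
print(found)
```

With real data, group_dirs would simply hold the folder paths that are passed to pd_multi().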
All 4 functions streamline the following process, which is described in more detail in the ProtE Guide vignette:
• Normalization of proteomic intensity values.
• Filtering based on the percentage of missing values of each protein.
• Imputation of missing data to ensure robust downstream analysis.
• Description fetching for DIA-NN’s .pg_matrix.tsv and Proteome Discoverer files.
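The missing-value filtering step listed above can be sketched in base R. The rule shown here (drop a protein when at least 50% of its values across all samples are missing) is an assumption for illustration. The exact rule ProtE applies, and how it depends on global_filtering, is described in the vignette.

```r
# Hypothetical intensity matrix: rows are proteins, columns are samples,
# and NA marks a missing value.
intensities <- matrix(
  c(10, NA, 12, 11,
    NA, NA, NA,  9,
     8,  7, NA,  6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("C1", "C2", "T1", "T2"))
)

# Percentage of missing values per protein across all samples.
missing_pct <- rowMeans(is.na(intensities)) * 100
print(missing_pct)

# Assumed rule: keep proteins with fewer than filtering_value percent missing.
filtering_value <- 50
filtered <- intensities[missing_pct < filtering_value, , drop = FALSE]
print(rownames(filtered))
```

Here P2 is removed because 3 of its 4 values (75%) are missing.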
Once the data processing is complete, the package performs statistical analysis for every pairwise comparison to identify significant protein abundance differences between experimental groups. The results are automatically exported as Excel files, and a range of visualizations is generated to facilitate quality checks and interpretation. These include:
• Principal Component Analysis (PCA) plots for dimensionality reduction and group comparison.
• Heatmap highlighting significant proteins.
• Protein rank-abundance and meanrank-sd scatterplots.
• Boxplots and violin plots to display data distribution and variability across groups.
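For orientation, a single pairwise comparison of the kind described above might look like the following base R sketch. It assumes a Wilcoxon rank-sum test for the non-parametric case with independent samples and a Benjamini-Hochberg adjustment. Both choices are assumptions rather than a description of ProtE's internals, and the intensities are made up.

```r
# Hypothetical log2 intensities for three proteins in two groups of four samples.
control   <- rbind(P1 = c(20.1, 20.3, 19.9, 20.0),
                   P2 = c(15.0, 15.2, 14.9, 15.1),
                   P3 = c(18.0, 18.4, 17.9, 18.2))
treatment <- rbind(P1 = c(22.0, 22.3, 21.8, 22.1),
                   P2 = c(15.05, 14.8, 15.3, 14.95),
                   P3 = c(17.0, 16.8, 17.2, 16.9))

# One two-sided Wilcoxon rank-sum test per protein (independent samples).
p_values <- vapply(rownames(control), function(p) {
  wilcox.test(treatment[p, ], control[p, ])$p.value
}, numeric(1))

# Benjamini-Hochberg adjustment to control the false discovery rate.
adj_p <- p.adjust(p_values, method = "BH")
print(data.frame(p_value = p_values, adj_p_value = adj_p))
```

The raw p-values bound the per-test Type-I error rate, whereas the adjusted values account for testing many proteins at once.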
The results from each function are saved in a folder named ProtE_Analysis, which is created inside the last directory of the provided file(s).
ProtE creates 3 sub-folders:
• Data_processing, with the files resulting from data processing.
• Statistical_Analysis, with the results of the statistical tests.
• Plots, with all the plots saved in .bmp format.
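The expected output locations can be derived from the input path with base R. This sketch assumes that "the last directory of the provided file" means the folder that contains the input file; the path is the hypothetical one used earlier.

```r
# Hypothetical input file path.
input_file <- "C:/Bioprojects/BreastCancer/Proteomics/MaxQuant/ProteinGroups.txt"

# Assumed location of the results folder: next to the input file.
results_dir <- file.path(dirname(input_file), "ProtE_Analysis")
sub_dirs <- file.path(results_dir,
                      c("Data_processing", "Statistical_Analysis", "Plots"))
print(sub_dirs)

# After a run, list the saved plots (empty if the folder does not exist yet).
plot_files <- list.files(file.path(results_dir, "Plots"), pattern = "\\.bmp$")
print(plot_files)
```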