Two issues in software development for proteomic analyses

Software tools for proteomic data analysis are being developed rapidly. However, several issues remain and await improvement. Numerous reviews have elaborated on this topic. Here I enumerate two issues that are rarely discussed.

User-unfriendliness
Built by lab scientists, several software tools under development are command line-based 1-3. They are unfriendly to biologists, especially when they involve multiple computation steps and tedious parameter settings, and this dampens the interest of experimentalists and clinicians 4. Also, the newest methods usually target specific steps rather than the complete process, so inexperienced enthusiasts cannot put them to immediate use. For example, CANDIA is an advanced data deconvolution algorithm that takes advantage of GPU computation to dissect peptide signals into individual analyte spectra 5. Although 33 times more total ion current could be discovered for the downstream database search, large-scale application of CANDIA is unrealistic without efforts on software encapsulation. What is needed is integrated software with graphical interfaces that enables plug-and-play analyses with new tools.

Unannotated proteins
More than 90% of the human proteome has been covered in databases 6. However, in practice, a large fraction of MS signals is not assigned to peptides. Apart from technical variation, these signals could arise from unannotated proteins, sometimes termed the “dark proteome”. Comparing MS data (especially DIA-MS data) from multiple sources might prioritize plausible footprints of such proteins. Mapping them to the unreviewed TrEMBL database could be a novel way to identify and validate proteins. I am also optimistic about deep learning (DL) technology. Current MS data repositories could provide sufficient training material for DL to reach a consensus on unannotated proteins and, furthermore, their proteoforms 7.
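
The cross-source comparison suggested above can be illustrated with a minimal Python sketch. It is not a published method: the function name, the input format (lists of m/z and retention-time pairs for signals with no peptide assignment), and the tolerance values are all hypothetical, and retention times are assumed to be already aligned between sources. Signals that recur within tolerance in several independent sources are ranked first.

```python
def recurrent_unassigned_features(datasets, mz_tol=0.01, rt_tol=0.5, min_sources=2):
    """Rank unassigned MS signals by how many sources they recur in.

    datasets: dict mapping a source name to a list of (mz, rt) tuples,
              one tuple per signal without a peptide assignment.
    mz_tol, rt_tol: hypothetical matching tolerances (m/z units, minutes).
    """
    # Pool all signals and sort by m/z so that similar signals are adjacent.
    features = sorted(
        (mz, rt, source)
        for source, signals in datasets.items()
        for mz, rt in signals
    )

    clusters = []  # each cluster keeps a reference m/z, a reference RT, and its sources
    for mz, rt, source in features:
        for cluster in clusters:
            # Greedy matching against the first member of each cluster.
            if abs(cluster["mz"] - mz) <= mz_tol and abs(cluster["rt"] - rt) <= rt_tol:
                cluster["sources"].add(source)
                break
        else:
            clusters.append({"mz": mz, "rt": rt, "sources": {source}})

    # Keep signals seen in enough independent sources, most recurrent first.
    recurrent = [c for c in clusters if len(c["sources"]) >= min_sources]
    return sorted(recurrent, key=lambda c: len(c["sources"]), reverse=True)


# Toy input with made-up values, for illustration only.
example = {
    "lab_a": [(500.25, 30.1), (612.30, 45.0)],
    "lab_b": [(500.255, 30.3), (700.40, 12.0)],
    "lab_c": [(500.248, 29.9), (612.305, 45.2)],
}
ranked = recurrent_unassigned_features(example)
for cluster in ranked:
    print(cluster["mz"], cluster["rt"], sorted(cluster["sources"]))
```

In this toy input, the signal near m/z 500.25 recurs in all three sources and is ranked first, while the signal seen only in one source is dropped. A real analysis would need retention-time alignment, charge-state and isotope handling, and a more careful clustering scheme than this greedy pass.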

  1. Röst, H.L., et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat Biotechnol 32, 219-223 (2014).
  2. Tsou, C.C., et al. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat Methods 12, 258-264 (2015).
  3. Tran, N.H., et al. Deep learning enables de novo peptide sequencing from data-independent-acquisition mass spectrometry. Nat Methods 16, 63-66 (2019).
  4. Zhang, F., Ge, W., Ruan, G., Cai, X. & Guo, T. Data-Independent Acquisition Mass Spectrometry-Based Proteomics and Software Tools: A Glimpse in 2020. Proteomics, e1900276 (2020).
  5. Buric, F., Zrimec, J. & Zelezniak, A. Parallel Factor Analysis Enables Quantification and Identification of Highly Convolved Data-Independent-Acquired Protein Spectra. Patterns 1, 100137 (2020).
  6. Adhikari, S., et al. A high-stringency blueprint of the human proteome. Nat Commun 11, 5301 (2020).
  7. Wen, B., et al. Deep Learning in Proteomics. Proteomics 20, e1900335 (2020).