Development and implementation of algorithms and innovative software to support oncology research
Dr Paolo Romano
Background
Mass spectrometry (MS) produces high volumes of data in support of the oncological sciences, especially translational research. Most of the related processing can be carried out by combining existing tools at different levels, but little is currently available for automating the fundamental steps. The analysis of MALDI/TOF spectra requires a number of pre-processing steps, including joining the isotopic abundances of a given molecular species, normalizing signals against an internal standard, removing background noise, averaging multiple spectra from the same sample, and aligning spectra from different samples. Following these steps, a number of statistical tests can be run on the pre-processed data; these require adequate standardization of methods and flexible procedures, such as those made possible by workflow management systems.
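To make three of these pre-processing steps concrete, the following minimal Python sketch chains background removal, normalization against an internal standard, and averaging of replicate spectra. It is an illustration under stated assumptions, not the procedure implemented in any specific tool: the rolling-minimum baseline stands in for whatever noise-removal method is actually used, all replicates are assumed to share the same m/z axis, and the function names, window sizes, and example values are hypothetical.

```python
from statistics import mean


def subtract_baseline(intensities, half_window=2):
    """Subtract a rolling-minimum baseline (a simple stand-in for noise removal)."""
    n = len(intensities)
    corrected = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        corrected.append(intensities[i] - min(intensities[lo:hi]))
    return corrected


def normalize_to_standard(mz, intensities, standard_mz, tol=0.25):
    """Scale intensities so that the internal-standard peak has height 1.0."""
    in_window = [y for x, y in zip(mz, intensities) if abs(x - standard_mz) <= tol]
    if not in_window or max(in_window) <= 0:
        raise ValueError("internal standard peak not found")
    ref = max(in_window)
    return [y / ref for y in intensities]


def average_spectra(spectra):
    """Point-wise mean of replicate spectra that share the same m/z axis."""
    return [mean(values) for values in zip(*spectra)]


def preprocess(mz, replicates, standard_mz):
    """Baseline-correct and normalize each replicate, then average them."""
    cleaned = [
        normalize_to_standard(mz, subtract_baseline(rep), standard_mz)
        for rep in replicates
    ]
    return average_spectra(cleaned)


# Hypothetical toy data: two replicates of one sample on a shared m/z axis,
# with the internal standard assumed to sit at m/z 1001.0.
mz = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0]
rep1 = [10.0, 12.0, 50.0, 12.0, 10.0]
rep2 = [20.0, 24.0, 100.0, 26.0, 20.0]
averaged = preprocess(mz, [rep1, rep2], standard_mz=1001.0)
```

After normalization the internal-standard peak has the same height in every replicate, so the averaged spectrum is not dominated by the replicate with the strongest raw signal. Isotope joining and alignment across samples are left out of the sketch because they depend on choices not specified here.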
We have already developed original software for the pre-processing of mass spectra. Geena is a public tool for the automated pre-processing of MS data generated by MALDI/TOF instruments, with a simple and intuitive web interface (10.1186/s12859-016-0911-2, 10.1002/cpbi.59). GeenaR is a web tool providing a complete workflow for pre-processing, analyzing, visualizing, and comparing MALDI/TOF mass spectra based on public R modules (10.3389/fgene.2021.635814). Seradeg is software that screens serum spectra, computes the abundance and relative distribution of fibrinopeptide A (fpA) fragments, and assigns quality scores on the assumption that fpA undergoes gradual degradation as a result of the preservation process and that the best preserved sera contain greater abundances of fragments with higher molecular weights (10.1016/j.jprot.2017.12.004).
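The idea behind such a serum quality score can be sketched as follows. This is a hypothetical illustration of the stated assumption (better preserved sera carry more of the heavier fpA fragments) and not the scoring formula actually used by Seradeg: the score below is simply the abundance-weighted mean fragment mass, rescaled to the range 0 to 1 over a fixed fragment panel. The fragment masses and abundances are placeholders, not measured values.

```python
def relative_distribution(abundances):
    """Convert absolute fragment abundances into fractions that sum to 1."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("no fragment signal detected")
    return {mass: value / total for mass, value in abundances.items()}


def quality_score(abundances):
    """Hypothetical score in [0, 1]: higher when heavier fragments dominate.

    `abundances` maps each fragment mass of a fixed panel to its abundance;
    fragments that were not detected should be present with abundance 0,
    so that every serum is scored against the same mass range.
    """
    dist = relative_distribution(abundances)
    lightest, heaviest = min(dist), max(dist)
    if heaviest == lightest:
        return 1.0
    weighted_mass = sum(mass * fraction for mass, fraction in dist.items())
    return (weighted_mass - lightest) / (heaviest - lightest)


# Placeholder fragment panel (masses are illustrative, not real fpA fragments).
well_preserved = {1000.0: 10.0, 1200.0: 30.0, 1500.0: 60.0}
degraded = {1000.0: 60.0, 1200.0: 30.0, 1500.0: 10.0}

score_good = quality_score(well_preserved)
score_bad = quality_score(degraded)
```

With these placeholder values the well preserved serum scores higher than the degraded one, which is the ordering the assumption requires; any threshold for accepting or rejecting a sample would have to be calibrated on real data.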
Also relevant to oncology research is the CLIMA database and identification tool (10.1093/nar/gkn730), which we developed to archive STR profiles of human cell lines, thus allowing researchers worldwide to identify cell lines.
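As an illustration of how an STR profile can be compared against archived profiles, the sketch below computes the widely used Tanabe similarity score, defined as twice the number of shared alleles divided by the total number of alleles in the two profiles. The matching criteria actually applied by CLIMA are not described here, so this choice of score is an assumption; the profiles and the restriction to loci typed in both profiles are likewise illustrative.

```python
def tanabe_score(query, reference):
    """Tanabe similarity: 2 * shared alleles / (query alleles + reference alleles).

    Profiles map an STR locus name to the set of alleles observed at that locus.
    Only loci typed in both profiles are compared (an assumption of this sketch).
    """
    shared = 0
    total = 0
    for locus in set(query) & set(reference):
        q_alleles = set(query[locus])
        r_alleles = set(reference[locus])
        shared += len(q_alleles & r_alleles)
        total += len(q_alleles) + len(r_alleles)
    if total == 0:
        return 0.0
    return 2 * shared / total


# Illustrative profiles; alleles are kept as strings so that
# microvariants such as "9.3" are compared exactly.
query = {"TH01": {"6", "9.3"}, "TPOX": {"8", "11"}, "D5S818": {"11", "12"}}
same_line = {"TH01": {"6", "9.3"}, "TPOX": {"8", "11"}, "D5S818": {"11", "12"}}
other_line = {"TH01": {"6", "9.3"}, "TPOX": {"8"}, "D5S818": {"12", "13"}}

score_same = tanabe_score(query, same_line)
score_other = tanabe_score(query, other_line)
```

A score of 1.0 means the two profiles are identical at all compared loci; the cut-off above which two profiles are considered to come from the same cell line is a policy choice and is not fixed by the sketch.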
Hypothesis and significance
Mass spectrometry is one of the key tools for proteomics studies. However, a complete understanding of spectra, their contents, and their significance is still far from being achieved. By improving existing bioinformatics methods and tools for the analysis of mass spectra and by creating new ones, we expect to significantly improve the scientific outcome of proteomics studies, including, e.g., differential proteomics experiments and biomarker discovery. Our Operational Unit performs many mass spectrometry experiments, both for internal research and as a service for external units. The development of original software and the improvement of everyday analysis tools are therefore of paramount importance.
The adoption of flexible analysis tools, such as Galaxy, one of the best known and most widely appreciated workflow management systems, further contributes to the improvement of mass spectra data analysis by allowing new analysis workflows to be created and effectively enacted for a single experiment or study. The need to integrate the data generated in a study with general repositories and databases, usually available through dedicated infrastructures such as those of the European Bioinformatics Institute, further justifies the use of workflow management systems and the development of APIs that permit direct access to data and interoperability.
Significance and Innovation
The software tools that will be developed and/or improved in the context of the project represent a significant improvement for mass spectra data analysis, especially for studies based on MALDI/TOF technology, since little has been made available for this domain.
The concept behind the Seradeg software is also highly innovative, being one of the few attempts to define the quality of sera by mass spectrometry. Such an approach is of fundamental importance for differential proteomics and biomarker discovery when samples from healthy volunteers are to be compared with samples from patients. In this case, the former are often obtained from blood donors, and the conservation of samples is not always appropriate for long-term storage.
Considerable innovation can also derive from creating and distributing the data analysis workflows used in a given study. This approach is highly relevant to the reproducibility and replicability of studies. Researchers wishing to reproduce an experiment, or simply to adopt the method used in a successful experiment, can easily take and enact the same workflow, possibly with minimal changes.
Translational relevance and impact for the National Health System (SSN)
This project aims to develop enabling technologies in the fields of mass spectrometry and proteomics. As such, it does not in itself have a direct impact on the National Health System (SSN). It is our belief, however, that achieving its objectives will improve MS data analysis and make it more efficient. This will have an impact on the quality of oncology research, e.g. in differential proteomics and biomarker discovery. It could therefore indirectly help improve pre-clinical research.