Author
Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences
The Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences (IOCB Prague) is a leading scientific institution in the Czech Republic, recognized internationally. Its primary mission is basic research in the fields of chemical biology and medicinal chemistry, organic and material oriented chemistry, chemistry of natural compounds, biochemistry and molecular biology, physical chemistry, theoretical chemistry, and analytical chemistry.

Scientists from CIIRC CTU and IOCB Prague lead a benchmarking effort for AI-driven discovery of molecules

Tu, 11.3.2025
| Original article from: IOCB
Roman and Anton Bushuiev joined experts from 14 institutes in MassSpecGym, a project to benchmark AI methods for discovering natural molecules from mass spectrometry (MS) data, aiding drug development, ecology, and space research.
  • Photo: IOCB: Scientists from CIIRC CTU and IOCB Prague lead a benchmarking effort for AI-driven discovery of molecules
  • Video: PolarisHQ: MassSpecGym: A benchmark for the discovery and identification of molecules

In April 2024, brothers Roman and Anton Bushuiev, from the teams of Tomáš Pluskal at IOCB Prague and Josef Šivic at CIIRC CTU, initiated a collaboration between experts from 14 research institutes across the globe to benchmark AI methods for the discovery of molecules from mass spectrometry data. The collaborative project, titled MassSpecGym, aims to spark the development of next-generation machine learning models for identifying new molecules from nature, with applications spanning drug development, environmental science, and space exploration.

The first success came quickly: the results of the cross-disciplinary initiative were presented as a Spotlight poster at NeurIPS 2024, one of the world’s top machine learning conferences, held in Vancouver in December 2024.

The discovery of small molecules profoundly influences numerous scientific fields such as organic chemistry, molecular biology, drug development, and environmental analysis. Despite advancements, only a small fraction of life’s molecular diversity has been uncovered.

Photo: IOCB: Living organisms function as chemical factories, generating a vast diversity of molecules with unique structures and functions. However, the majority of these molecules remain unknown.

Tandem mass spectrometry (MS/MS) is a cornerstone instrumental technique for identifying molecular structures from biological and environmental samples, enabling applications such as discovering bioactive compounds for drug development, optimizing drug dosages in clinical settings, and detecting environmental pollutants at trace levels. At its core, a tandem mass spectrometer fragments molecules and records the masses of these fragments in so-called MS/MS spectra.
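To make the idea of an MS/MS spectrum more concrete, the following minimal Python sketch represents a spectrum as a list of (m/z, intensity) peaks, scales the intensities, and bins the peaks into a fixed-length vector of the kind a machine learning model could consume. The peak values, the function names, and the binning scheme are illustrative assumptions, not part of MassSpecGym or its data format.

```python
from typing import List, Tuple

# A peak is a (m/z, intensity) pair; this representation is an assumption
# made for illustration, not the MassSpecGym data format.
Peak = Tuple[float, float]


def normalize_spectrum(peaks: List[Peak]) -> List[Peak]:
    """Scale intensities so the most intense peak equals 1.0, sorted by m/z."""
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        return sorted((mz, 0.0) for mz, _ in peaks)
    return sorted((mz, intensity / max_intensity) for mz, intensity in peaks)


def bin_spectrum(peaks: List[Peak], bin_width: float = 1.0,
                 max_mz: float = 1000.0) -> List[float]:
    """Turn a peak list into a fixed-length vector, one slot per m/z bin.

    Each bin keeps the highest intensity that falls into it; peaks at or
    above max_mz are dropped.
    """
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            vector[index] = max(vector[index], intensity)
    return vector


# Hypothetical fragment peaks, invented for this example.
example = [(85.03, 120.0), (127.04, 480.0), (163.06, 960.0)]
normalized = normalize_spectrum(example)
vector = bin_spectrum(normalized, bin_width=1.0, max_mz=200.0)
```

Binning is only one of several possible encodings; it trades mass accuracy for a simple fixed-size input.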

“A typical biological or environmental sample produces thousands of tandem mass spectra, each representing a distinct molecule. Yet, annotating these spectra with molecular structures remains a challenge, with fewer than 10% of spectra successfully annotated using state-of-the-art machine learning methods. This leaves much of the chemical space uncovered, limiting our ability to unlock new scientific and technological advancements,” says Tomáš Pluskal from IOCB Prague.

Currently, the development of new AI methods for mass spectrometry is limited by the absence of well-standardized training datasets and evaluation protocols. The project “MassSpecGym: A benchmark for the discovery and identification of molecules” addresses this limitation.

“Machine learning benchmarks such as ImageNet revolutionized the field of AI by standardizing development, evaluation, and assessment of progress. Similarly, we propose a benchmark for molecular discovery to tackle the critical challenge of annotating tandem mass spectra and aim to foster a new generation of AI models for uncovering the undiscovered space of chemical structures present in nature,” explains doctoral student and the main author of the project Roman Bushuiev.


MassSpecGym comprises three core components: (i) the largest publicly available dataset of tandem mass spectra labeled with molecular structures, (ii) three machine-learning challenges that cast the process of molecular discovery from mass spectra as well-defined computational problems, and (iii) carefully selected held-out pairs of mass spectra and molecules designed to evaluate the ability of AI models to generalize to new chemical space. Additionally, MassSpecGym provides a user-friendly platform for developing and evaluating new AI models.

A research paper on MassSpecGym was selected for a Spotlight poster presentation at NeurIPS 2024 in Vancouver. NeurIPS is one of the most prestigious conferences in machine learning and is ranked by Google Scholar among the top ten publication venues across all areas of science.

This research was co-funded by EU projects FRONTIER (No. 101097822) and ELIAS (No. 101120237).

Read more: https://www.ciirc.cvut.cz/scientists-from-ciirc-ctu-and-iocb-prague-lead-a-global-benchmarking-effort-for-ai-driven-discovery-of-molecules/ 

Resources

Original article 

R. Bushuiev, A. Bushuiev, N. F. de Jonge, A. Young, F. Kretschmer, R. Samusevich, J. Heirman, F. Wang, L. Zhang, K. Dührkop, M. Ludwig, N. A. Haupt, A. Kalia, C. Brungs, R. Schmid, R. Greiner, B. Wang, D. S. Wishart, L.-P. Liu, J. Rousu, W. Bittremieux, H. Rost, T. D. Mak, S. Hassoun, F. Huber, J. J. J. van der Hooft, M. A. Stravs, S. Böcker, J. Sivic, T. Pluskal, “MassSpecGym: A benchmark for the discovery and identification of molecules”, Advances in Neural Information Processing Systems (NeurIPS), 2024. https://doi.org/10.48550/arXiv.2410.23326



LabRulez s.r.o. All rights reserved. Content available under a CC BY-SA 4.0 Attribution-ShareAlike