Structural elucidation using GC×GC-TOFMS and machine learning (Masaaki Ubukata, MDCW 2026)

- Photo: MDCW: Structural elucidation using GCxGC-TOFMS and machine learning for unknown metabolites in HeLa cell (Masaaki Ubukata, MDCW 2026)
- Video: LabRulez: Masaaki Ubukata: Structural elucidation using GC×GC-TOFMS and machine learning (MDCW 2026)
🎤 Presenter: Masaaki Ubukata (JEOL Ltd.)
Abstract
Since aroma components in food comprise numerous compounds, comprehensive two-dimensional gas chromatography–mass spectrometry (GCxGC-MS) is an effective technique. In GCxGC-MS, compound identification is commonly performed by searching mass spectral databases using electron ionization (EI). However, many compounds are often not registered in these databases. These require manual structural analysis, which can be difficult. To address the difficulty of manual structural analysis, we developed a structural elucidation method using machine learning-based mass spectral prediction. The predicted mass spectra can be used as a database, enabling structural estimation in the same way as conventional database searching. Furthermore, in this method, molecular formulas obtained from soft ionization (SI) data are used to narrow down candidate structures. However, a limitation was that multiple molecular formula candidates often remained. To overcome this, we also developed a technique to rank candidate formulas using a machine learning model. In this study, we report an application of the above method to the analysis of aroma compounds in spices by SPME-GCxGC-TOFMS.
In total, 518 compounds were detected by GCxGC-TOFMS. Integrated analysis using both EI and SI data enabled reliable identification of key aroma constituents of cardamom, including monoterpenes, sesquiterpenes, and furans. Compound annotation was based on combined evidence from NIST database searching, retention index, molecular ion and isotope information from SI, and accurate fragment masses from EI. When multiple candidate formulas were obtained, our machine learning model ranked them, allowing selection of the most plausible result. Even for database-unregistered compounds, molecular formulas and tentative structures could be estimated.
Video Transcription
Hello everyone, my name is Masaki Ubukata from Tokyo, Japan. Thank you for the opportunity to present our latest developments. In my previous presentation, I introduced our artificial intelligence–based structure analysis functionalities integrated into our software platform. Today, I will provide an updated overview of these AI-driven technologies, including a newly developed model for chemical formula recommendation, followed by a demonstration of metabolomics applications enabled by these tools.
Metabolomics focuses on the comprehensive analysis of small molecules generated through biological processes. In this context, the term “small molecules” represents a key analytical focus. Both GC-MS and LC-MS are widely used in metabolomics; however, GC-MS continues to offer significant advantages for analyzing low-molecular-weight metabolites. Comprehensive separation strategies such as GC×GC are particularly beneficial for resolving complex metabolomic samples. Furthermore, trimethylsilyl (TMS) derivatization remains a gold-standard sample preparation approach in GC-MS-based metabolomics workflows.
Qualitative GC-MS analysis typically involves multiple stages, including sample introduction, chromatographic separation, compound extraction, and final identification. Database searching, particularly using the NIST spectral library, is standard practice. However, a major challenge arises when compounds are not present in available databases. Manual interpretation of mass spectra can address this issue but is time-consuming and requires substantial expertise. This limitation motivated the development of our AI-based identification technologies, which aim to facilitate the characterization of unknown compounds beyond conventional database coverage.
To address this need, we developed a machine-learning-driven predicted electron ionization (EI) mass spectral database. Our high-resolution GC time-of-flight platform provides resolving power exceeding 30,000 and mass accuracy down to 1 ppm, combined with multiple ionization techniques including EI, chemical ionization (CI), photoionization (PI), and field ionization (FI). The latter three approaches provide softer ionization conditions, enabling reliable detection of molecular ions. The accompanying AI software incorporates a database containing over 200 million predicted EI spectra, generated through deep-learning models. Recent software updates also support comprehensive GC×GC data processing.
The workflow integrates both EI and soft-ionization data. Initial steps involve database matching, fragment-ion analysis using accurate-mass spectral information, and molecular-ion determination from soft-ionization measurements. When these results are consistent, compound identification can be achieved with high confidence. If discrepancies occur, AI-based structure elucidation is performed. In this stage, molecular-formula information is used to narrow candidate structures among millions of possible compounds. The system simultaneously evaluates substructural features, predicted spectra, and retention indices, enabling identification even for previously unknown compounds.
Two AI models underpin the spectral library: one predicts EI mass spectra from molecular structures, while the other predicts chromatographic retention indices. By combining predicted spectra and retention information for more than 200 million compounds—including structures sourced from public databases and in-silico-generated chemical spaces—we created a comprehensive AI-assisted identification framework. Validation studies conducted since 2022 demonstrate continuous improvements in model performance, with average cosine similarity between predicted and measured spectra increasing from 0.72 in early versions to 0.86 in the latest release.
Despite these advances, determining the correct molecular formula can still present challenges, particularly when multiple candidate compositions are possible. To address this, we developed an additional AI model that estimates probability distributions for elemental composition directly from mass spectral data. This model evaluates isotopic patterns, fragment masses, and elemental likelihoods, thereby guiding users toward the most probable molecular formula prior to library searching. As a result, analytical workflows can proceed efficiently without manual bottlenecks.
Finally, we demonstrate the application of these technologies in metabolomics. Using GC×GC-TOFMS with combined EI and FI ionization, we analyzed human cancer cell metabolites in collaboration with Professor Tugawa at Tokyo University of Agriculture and Technology. A novel metabolite, not listed in conventional databases, was successfully identified using our AI-assisted workflow. From more than 8,000 detected compounds, the system narrowed candidates to structural isomers with identical molecular formulas, correctly ranking the true compound as the top candidate. This example highlights the capability of AI-driven identification for real-world unknown metabolite analysis.
In conclusion, our platform integrates large-scale predicted spectral libraries, molecular-formula estimation models, and advanced chromatographic separation techniques to enhance confidence in GC-MS qualitative analysis. These combined technologies offer significant benefits for metabolomics and other applications requiring robust unknown-compound identification.
This text has been automatically transcribed from a video presentation using AI technology. It may contain inaccuracies and is not guaranteed to be 100% correct.
-Workshop-LOGO_s.webp)



