Exploring the Chemical Space of the Exposome: How Far Have We Gone?

- Photo: JACS Au 2024 4 (7), 2412-2425: graphical abstract.
In a research article recently published in JACS Au, researchers from the University of Amsterdam (Amsterdam, The Netherlands), Imperial College London (London, United Kingdom), and The University of Queensland (Woolloongabba, Australia) review the latest efforts in mapping the exposome chemical space and its subspaces and provide a view on how the integration of data-driven approaches could bridge the identified gaps.
The majority of chronic human diseases cannot be attributed to genetic factors alone. The Lancet Commission on Pollution and Health links 16% of premature deaths worldwide to pollution, highlighting humanity's encroachment beyond the Earth's safe thresholds for chemical exposure. Exposure to a vast array of chemicals affects everything from biodiversity to human health, with impacts on vaccine effectiveness, antibiotic resistance, autoimmune conditions, and mental health disorders. The exposome, encompassing all the environmental chemicals we encounter, remains largely unexplored, with conventional methods capturing only a small fraction of its complexity. Recent efforts to map the exposome and integrate data-driven approaches promise to fill crucial knowledge gaps, enhancing our understanding of environmental impacts on health.
The original article
Exploring the Chemical Space of the Exposome: How Far Have We Gone?
Saer Samanipour, Leon Patrick Barron, Denice van Herwerden, Antonia Praetorius, Kevin V. Thomas, and Jake William O’Brien
JACS Au 2024 4 (7), 2412-2425
DOI: 10.1021/jacsau.4c00220
licensed under CC-BY 4.0
Selected sections from the article follow. Formats and hyperlinks were adapted from the original.
Abstract
Around two-thirds of chronic human disease cannot be explained by genetics alone. The Lancet Commission on Pollution and Health estimates that 16% of global premature deaths are linked to pollution. Additionally, it is now thought that humankind has surpassed the safe planetary operating space for introducing human-made chemicals into the Earth System. Direct and indirect exposure to a myriad of chemicals, known and unknown, poses a significant threat to biodiversity and human health, from vaccine efficacy to the rise of antimicrobial resistance as well as autoimmune diseases and mental health disorders. The exposome chemical space remains largely uncharted due to the sheer number of possible chemical structures, estimated at over 10⁶⁰ unique forms. Conventional methods have cataloged only a fraction of the exposome, overlooking transformation products and often yielding uncertain results. In this Perspective, we have reviewed the latest efforts in mapping the exposome chemical space and its subspaces. We also provide our view on how the integration of data-driven approaches might be able to bridge the identified gaps.
Introduction
The number of chemical structures known to us is expanding exponentially; for example, the number of entries in the Chemical Abstracts Service (CAS) registry crossed the threshold of 100 million substances in 2015 and continues to grow. (1−7) A similar trend is observed for other chemical families and databases. (8−13) For example, PubChem currently includes more than 115 million unique structures, and this number is growing. Even though these numbers may seem large, compared to the true size of the chemical space (more than 10⁶⁰ for organic structures smaller than 500 Da) these lists cover less than 0.001% of the possible chemical space. (13−16) Furthermore, even for known structures, <1% have been experimentally evaluated for their environmental and biological activity (e.g., toxicity), due to the cost and complexities associated with such measurements. (3−5) In fact, according to Persson et al., around 80% of the chemicals defined as in use under REACH have yet to be assessed, even though the data may be available. (13) Several studies have linked chemical exposure to long-term adverse health outcomes. (17−22) For example, exposure to per- and polyfluoroalkyl substances (PFAS) has been shown to correlate with the symptoms of autoimmune disease as well as mental health issues. (18,23)
Our current chemical management strategy is mainly based on manual chemical registration and/or experimental measurements of those chemicals in environmental and biological samples. (1,3,4,24) Both approaches are extremely challenging, costly, and inherently passive or, at best, reactive. Chemical registration, with regulatory focus, takes place only for chemicals with large production volumes at the national or international level (e.g., REACH Regulation). (3,8,9,25,26) With digitalization, these chemical registries and patents as well as scientific publications have been mined to gain an approximate idea about the current exposome chemical space (i.e., all the organic chemicals that humans are exposed to during their lifetime). (27−29) For example, databases such as PubChem or the US-EPA CompTox dashboard are constantly updated with new chemicals coming from these mining exercises. (9,27) This process, even though sophisticated and powerful, is mainly centered on human-made chemicals, thus having limited coverage of structures produced via abiotic and biotic transformation and limited consideration for any future chemicals.
Chemical transformations can take place in the environment (e.g., photo- or biological degradation) or within human-made infrastructures, such as wastewater treatment plants. (30−33) Depending on the type of reactions and the structure of the parent compound, more than 100 new chemicals may be formed at even the first level of the transformation tree. (31,34,35) Considering the costs and complexity associated with performing transformation experiments, the expansion of such methods to the exposome chemical space is impossible (Figure 1). An alternative to this experimentally driven approach has been predictive models, where a combination of machine learning and heuristic methods is used. (30,31,35) However, these methods are highly uncertain and opaque, are limited to a few reaction pathways, and stop at shallow levels (e.g., the first or second level) of the transformation tree. (4,36) This implies that our current estimates of the coverage of the exposome chemical space are orders of magnitude smaller than its true size, given the number of possible reactions.
JACS Au 2024, 4, 7, 2412-2425: Figure 1. (a) A conceptual figure showing different chemical subspaces (i.e., the “relevant chemical space” for the exposome), including the unknown chemical space (gray), exposome chemical space (orange), measurable chemical space (blue), measured chemical space (magenta), and identified/characterized chemical space (light blue), whereas (b) shows the chemicals in US-EPA CompTox with 800 k unique structures. The Principal Component (PC) plot was generated using six elemental mass defects and the monoisotopic mass of the chemicals in US-EPA CompTox (details of these calculations and the scripts are provided elsewhere (37,38)). It should be noted that the size of the subspaces in panel (a) is meant only for visualization purposes and is not representative of the true size of these spaces. The empty spots in the PC space (panel b) suggest that the exposome chemical space may not be a smooth and continuous space, mainly due to the rules of organic chemistry.
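As a rough illustration of the projection described for Figure 1b, the sketch below computes Kendrick-style mass defects for several elemental/fragment bases together with the monoisotopic mass and feeds them to a PCA. The choice of bases, the toy mass list, and the Kendrick-style formulation are assumptions for illustration only; the actual calculations and scripts are those referenced in the caption (37,38).

```python
# Minimal sketch (not the authors' exact script): Kendrick-style mass defects
# for several elemental bases plus the monoisotopic mass form a feature matrix
# that is projected with PCA, mimicking the kind of plot shown in Figure 1b.
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical monoisotopic masses (Da); in the paper these come from the
# ~800 k structures in the US-EPA CompTox dashboard.
masses = np.array([180.0634, 278.1518, 413.2660, 499.9675, 151.0633, 290.1787])

# Exact masses of the bases used for the mass defects (the specific six bases
# are an assumption; the paper refers to its own scripts for the details).
bases = {"CH2": 14.015650, "H": 1.007825, "O": 15.994915,
         "CF2": 49.996806, "NH": 15.010899, "CO": 27.994915}

def kendrick_mass_defect(m, base_exact):
    km = m * round(base_exact) / base_exact   # Kendrick mass for this base
    return np.round(km) - km                  # mass defect relative to the base

features = np.column_stack(
    [kendrick_mass_defect(masses, b) for b in bases.values()] + [masses]
)

pcs = PCA(n_components=2).fit_transform(features)
print(pcs)  # coordinates of each chemical in the first two PCs
```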
Thus far, measuring/monitoring chemicals in, across, and between different media (e.g., water, soil, air, or biological material) has been the main strategy to map the exposome chemical space. This strategy is reliant on three complementary approaches, namely, targeted, suspect, and nontargeted analysis (NTA). (39−41) Targeted analysis could be quantitative and focused on a limited number of preselected structures; for example, less than a few hundred chemicals are actively and routinely monitored in different matrices. Suspect screening/analysis, on the other hand, has been employed to identify chemicals based on user curated lists of preselected compounds and/or presence in databases/libraries using full-scan high-resolution mass spectrometry (HRMS) data. (40,41) Finally, NTA is considered the most agnostic approach for chemical measurement and identification in samples, where the collected signals are translated into candidate structures and confirmed via target analysis. (40,41) Both suspect and nontarget analysis are reliant on full scan data generated via HRMS coupled to a separation technique such as gas or liquid chromatography (GC/LC-HRMS). For a structurally unknown chemical to be identified via suspect or NTA, it must be measurable with our current analytical strategies (i.e., separation and detection). (16,40,42) What is measurable/analyzable with our current analytical technologies is unknown. (42) We are aware of only the identified fraction of the measurable chemical space (see the description below). Therefore, there may be chemicals highly relevant to the exposome that are not measurable with routine methods and may require the development of specific methods for their analysis (e.g., PFOS or glyphosate). (23,43,44)
Currently, less than 10% of the signals acquired in suspect and NTA assays are successfully identified/annotated. (40,41,45,46) For example, even for a single surface water sample, preprocessing reveals thousands of chemical signals to be identified. The existing identification workflows, at best, can confirm fewer than a few hundred chemicals in complex samples. Therefore, the signals of the unidentified chemicals in those samples remain unused. (38,40,47) These unidentified signals may be of high enough quality to be identified using updated preprocessing strategies or expanded spectral libraries (i.e., retrospective analysis). (48,49) In fact, the retrospective analysis of combined data from multiple human cohorts resulted in additional inferences on the connection between chemicals and health outcomes. (49)
In this Perspective, we critically assess the knowledge and technological gaps in the comprehensive characterization/mapping of the exposome chemical space. We thereby aim to help provide the means for future developments toward more proactive chemical management.
Chemical Space
The concept of the chemical space was initially introduced within the field of drug discovery, where the central role was the exploration of drug-like structures. (50,51) Those efforts were based on using brute force and known organic chemistry rules to generate all possible structures within set boundaries, for example, the number and the type of elements. (15,50−52) This approach resulted in an extremely large number of possible structures, ranging between 10²⁰ and 10⁶⁰ for molecules containing 30 atoms or fewer. Chemical databases such as GDB-20 or ZINC are generated using this strategy, and they also contain a few estimated molecular descriptors such as logP (i.e., the partitioning coefficient between water and an organic phase). (14,53,54) The main objective of such databases is to enable queries for structural similarity and/or a specific functional group. (54) Given the approach used for generating such databases, they consist of structures that go beyond drug-like chemicals. (55)
The chemical space contains a myriad of structures that may or may not be relevant to a specific application. (16,56) For example, the ozonation degradation products of a natural product, although highly relevant to exposomics, may not be relevant to drug discovery. These selected subspaces of the chemical space are defined as the “relevant chemical space” (Figure 1), which is field/question dependent; in the case of the exposome, it is the “exposome chemical space”. (51,57) Another chemical subspace is the “measurable chemical space” (Figure 1). This subspace represents the chemicals within the chemical space that can be measured using current analytical techniques. The measurable chemical space focuses only on whether a chemical can be separated and ionized via one of the existing mass spectrometry ionization technologies. The measured chemical space is the chemical subspace in which all the structures have been previously measured (Figure 2). Being part of the measured chemical space does not imply that these chemicals have been identified (i.e., structurally confirmed). As an example, features in chromatograms that are not identified during NTA assays are part of the measured chemical space. The most well-known chemical subspace is the structurally confirmed/identified chemical space. This is composed of chemicals that are well-known and studied, for example, pharmaceuticals and pesticides. Except for the identified chemical space, the other subspaces are mostly unexplored and thus unknown. The relevant and measurable subspaces may overlap depending on the field. For example, in the field of exposomics, the relevant chemical space may be larger than the measurable one, while in drug discovery this may not hold true.
JACS Au 2024, 4, 7, 2412-2425: Figure 2. Depicts the criteria a chemical must meet to be measured and identified. Each step narrows the size of the accessible chemical space.
Exploration of Chemical Space
To explore such a vast chemical space, several cheminformatics tools have been built. (51,52,54,58,59) These techniques range from simple nearest neighbor search to generative models based on large language models. (52,58,60−63) These tools are mainly built either to focus on a very specific subspace of the total chemical space (e.g., drugs) or to explore the chemical space as a whole, as a visualization strategy. Recent developments in graph-based methods such as molecular networks have provided the means of a more detailed exploration of the chemical space. (15,62−64) However, the application of these tools has been limited to mostly visualizing the chemical space and to the drug discovery area, due to the sheer size and the diversity of the chemical space.
Within the exposomics community, the exploration of the chemical space has been focused on building large chemical databases (e.g., PubChem). (9,10,16,25) Additionally, recent works have used text-mining approaches to further enrich these lists and thus expand the known chemical space; these tools have also enabled the classification of chemicals based on their repeating units or the presence of specific functional groups. (9,29) The ultimate goal of these efforts has been the detailed characterization of the exposome chemical space. However, they are inherently limited to registered or identified chemicals.
Exposome Chemical Space
The exposome chemical space is the chemical subspace to which humans are exposed from conception to death. (4,5) The exposome chemical space is mostly unknown and may include human-made and natural chemicals, as well as their transformation products. The efforts to explore/map the exposome chemical space have been divided into computational and experimental approaches. (27,40) The computational approaches focus on building chemical databases of mainly human-made chemicals and then ranking (i.e., chemical prioritization) those structures based on the available metadata (e.g., volume of production). (3,9,10) As for the experimental strategies, the focus has been on actively measuring the chemicals in different environmental compartments, including the transformation products. (40,41)
Computational Approaches for Database Building
The main strategy for adding human-made chemicals to the list of chemicals associated with the exposome chemical space has been the mining of national and international chemical registries as well as of patents and scientific publications. One of the earliest efforts to keep track of human-made chemicals has been the Chemical Abstracts Service (CAS). Formed in 1904, CAS collects structural information for chemicals synthesized as early as 1800 (source: Chemical Abstracts Service). The main source for CAS is the scientific literature, where newly reported chemicals are registered and given an identification number (i.e., a CAS number). Other similarly formed databases such as PubChem, (9) ChemSpider, (65) NORMAN SusDat, and US-EPA CompTox act as hubs where the data from different chemical registries are gathered, curated, and made available for public use. While these larger databases are more generic and tend to cover different chemical families, there are also more specialized chemical databases such as DrugBank, (66) the Human Metabolome Database, (67) and FORIDENT. (68)
In addition to the literature-based chemical databases, there are also national and international chemical registries with a mainly regulatory focus. For example, the European Chemicals Agency (ECHA), formed in 2007, hosts the registry of chemicals used in, imported into, and/or exported from the European Union, with similar roles played by the Organization for Economic Co-operation and Development (OECD) and/or eChemportal. As these chemical databases are meant for regulatory purposes, they also include information regarding the volume of production/use as well as biological activity (e.g., toxicity) and physicochemical properties. However, these databases may have different production/use volume thresholds for registration. (8,69) Furthermore, these databases rarely include the transformation products of human-made chemicals unless those products are themselves actively produced and used for other purposes. Finally, these databases are limited to human-made chemicals, and their size is increasing by around 1500 new structures a year. (3)
Not all human-made chemicals, registered or not, are part of the exposome chemical space, owing to their total volume of production, physicochemical properties, and use type. For example, the potential for exposure to a chemical with a very small volume of production may be very low, as this chemical, once released into the environment, is infinitely diluted. Between 2010 and 2012, Howard and Muir published three very influential manuscripts in which they reported lists of high-priority chemicals to be further studied. (8,69,70) In those studies, the authors investigated all the existing North American chemical databases and selected the chemicals with a volume of production larger than 1 ton a year. This threshold was set to ensure the environmental detection of these chemicals. Additional filtering (i.e., chemical prioritization) based on physicochemical properties and expert knowledge was employed to narrow down these chemicals to those pertinent to the environmental and human exposome. Similar efforts have been carried out globally for mapping the exposome-relevant chemical space (e.g., SusDat). (71,72) It should be noted that these chemical prioritization approaches are designed to direct monitoring programs, given the costs and difficulties associated with them. Consequently, these databases cover only a small portion of the exposome chemical space.
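As a minimal sketch of such a prioritization step, the filter below keeps chemicals whose registered production volume and physicochemical properties fall within plausible detection and analysis windows. The field names, thresholds, and records are hypothetical; the original studies additionally relied on expert judgement that is not captured here.

```python
# Minimal sketch of a Howard/Muir-style prioritization filter (hypothetical
# fields, values, and thresholds; expert-knowledge filtering is not modeled).
chemicals = [
    {"name": "chem_A", "tonnes_per_year": 0.2, "log_kow": 2.1},
    {"name": "chem_B", "tonnes_per_year": 850.0, "log_kow": 4.7},
    {"name": "chem_C", "tonnes_per_year": 12.0, "log_kow": 6.8},
]

def prioritize(records, min_volume=1.0, max_log_kow=6.0):
    """Keep chemicals likely to be released in detectable amounts
    (production volume >= 1 t/a) and amenable to routine analysis."""
    return [r for r in records
            if r["tonnes_per_year"] >= min_volume and r["log_kow"] <= max_log_kow]

print([r["name"] for r in prioritize(chemicals)])   # -> ['chem_B']
```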
Experimental Approaches for Exposome Assessment
Detection, identification, and quantification of chemicals in exposure media and biological samples are additional approaches for mapping the exposome chemical space (Figure 2). (40,41) Typically, a combination of target, suspect, and NTA using HRMS is employed for the structural elucidation of the chemicals in the exposome chemical space. (39,56,73) Each of these approaches has its advantages and limitations, and they are usually combined to maximize coverage of the exposome chemical space.
Targeted analysis is a top-down approach where all of the necessary information for the unequivocal identification and potential quantification of a chemical in a sample is available to the analyst prior to the analysis. Targeted analysis is the main strategy for routine monitoring of chemicals in environmental and biological samples. (39,40,47,74,75) On the other hand, suspect and NTA are less certain and also tend to be only qualitative, (40,47,76) even though there have been several new developments in the semiquantification of known and unknown chemicals. (77) For suspect analysis/screening, a list of suspect analytes with as much information as possible (e.g., predicted retention behavior, fragmentation spectra) is compiled, implying that suspect analysis, similar to targeted analysis, is a top-down approach. The generated suspect list is used at a later stage for the detection and tentative identification of the chemicals in the analyzed samples. NTA is the most comprehensive but most uncertain approach for the identification of chemicals in environmental and biological samples. NTA is a bottom-up approach where minimal or no prior knowledge about the structure of the chemicals in samples is used during the identification process. The ultimate goal of most NTA workflows within the exposomics area is the unequivocal identification of all chemicals present in a sample. However, this process is extremely difficult, time-consuming, and uncertain, (40,41) especially when applied across multiple environmental compartments (e.g., air, water, soil, biological fluids, etc.) and spatiotemporally. Consequently, the new structures discovered in environmental samples using NTA strategies in the past five years amount to less than 2% of a database such as NORMAN SusDat. (38)
Transformation Products
Transformation products, natural or based on human-made processes, theoretically constitute a large portion of the exposome chemical space. Each human-made chemical could potentially have a large number of different transformation products, depending on the reaction pathways and the environmental conditions (e.g., biotic or abiotic). (30,31,35,78,79) Some of these transformation products may be more persistent than their parent compounds and hence be even more relevant for the exposome chemical space. However, most of these structures remain unknown, even though their importance to environmental and human health has been previously demonstrated (e.g., DDT and its metabolites DDE and DDD or disinfection byproducts). (80−82)
A combination of experimental and in silico approaches is typically employed for the structural elucidation of the transformation products of chemicals. (30,31,34−36) This task is carried out for one chemical and one reaction type at a time due to the complexity of such systems (e.g., photodegradation of pharmaceuticals). To elucidate the generated transformation products, a combination of NTA/suspect analysis and in silico prediction tools is used. (83−86) The currently available in silico tools are able to estimate the structure of a potential transformation product based on the parent structure and the reaction type. (85) These transformation product structures are used either for the generation of suspect lists or as potential candidate structures during the NTA workflows. Additionally, the generated transformation product structures may not have a chemical standard available or may not have been measured before, increasing the complexity and uncertainty of this task. Moreover, due to the uncertainties associated with the in silico transformation product structure estimation tools and the NTA workflows, the addition of transformation products to the list of chemicals in the exposome chemical space has been an extremely slow process. In fact, most of the chemicals present in databases such as PubChem or NORMAN SusDat are parent structures rather than transformation products, (36) indicating the need for their expansion with transformation products.
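To make the rule-based idea concrete, the sketch below applies a single transformation rule (aromatic hydroxylation, encoded as reaction SMARTS) to a toy parent compound using RDKit. RDKit and this specific rule are our choices for illustration; they are not the dedicated in silico prediction tools cited above.

```python
# Minimal sketch of rule-based transformation product generation with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

parent = Chem.MolFromSmiles("c1ccccc1Cl")            # chlorobenzene as a toy parent
rule = AllChem.ReactionFromSmarts("[cH:1]>>[c:1]O")  # aromatic hydroxylation rule

products = set()
for prods in rule.RunReactants((parent,)):
    mol = prods[0]
    Chem.SanitizeMol(mol)                 # clean up the generated product
    products.add(Chem.MolToSmiles(mol))   # canonical SMILES removes duplicates

# Each unique SMILES is a level-1 node of the transformation tree; applying
# the rule set again to these products would give the second level.
print(sorted(products))   # e.g., the chlorophenol isomers
```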
Measurable Exposome Chemical Space
The measurable exposome chemical space is the subspace of chemicals that can be measured via existing analytical strategies, in particular GC- and/or LC-HRMS. A recent review by Manz et al. highlighted that the majority of human exposome-related studies that employed HRMS-based NTA used LC-MS only (51%), followed by GC-MS only (32%), while ≈16% used both techniques together and 1% used direct-injection HRMS without any separation. (87) Across the 76 HRMS-based studies reviewed in total, there was no consistency in the application of different analytical platforms across chemical classes or the environmental compartments studied. The majority of applications lay in the food and consumer products space (n = 19 studies), followed by air (n = 15), soil/sediment (n = 13), dust and human samples (each n = 10), and then water (n = 9). Fundamentally, therefore, it seems that researchers have assumed that, at exposome-relevant concentrations, a large component of chemicals can be separated via a chromatographic approach and be ionized/fragmented via HRMS technology. It should be noted that slight deviations from the optimal experimental conditions may have an extreme impact on the measurable subspace explored by the method used. (16,40,41,87,88) There are several examples where more generic methods fail to cover highly relevant chemicals in the exposome chemical space (e.g., PFOS or glyphosate).
The separation space is dominated by reversed-phase liquid chromatography (RPLC) and gas chromatography. For RPLC in particular, there is often an assumption of linearity between the hydrophobicity and size of a chemical and its retention under the set experimental conditions. (89,90) There are several RPLC studies where the retention times of the internal standards are correlated with their octanol/water partitioning coefficient. (89,91,92) These linear models are then extrapolated to infer which portion of the chemical space is covered. This linearity assumption has been challenged by different studies focusing on retention time modeling. (93−95) A recent study has shown that chemicals with similar retention behavior in RPLC may have up to 6 orders of magnitude variance in their predicted partition coefficients. In the same study, a data-driven approach showed that 20,000 chemicals present in NORMAN SusDat (around 100 k unique structures) are not analyzable with RPLC. (95)
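The sketch below illustrates the linearity assumption discussed above: retention times of a handful of internal standards are regressed against log Kow, and the fit is then extrapolated to other chemicals. All values are hypothetical; the point is that this kind of extrapolation is exactly what the cited retention-modeling studies call into question.

```python
# Minimal sketch of the log Kow vs retention time linearity assumption in RPLC.
import numpy as np

log_kow_std = np.array([0.5, 1.2, 2.3, 3.1, 4.0, 4.8])    # internal standards (hypothetical)
rt_std_min = np.array([2.1, 4.0, 7.4, 9.9, 12.6, 15.0])   # observed retention times (min)

slope, intercept = np.polyfit(log_kow_std, rt_std_min, deg=1)   # linear fit

def predicted_rt(log_kow):
    return slope * log_kow + intercept

# Extrapolating to a very polar chemical: the model predicts elution near (or
# before) the void volume, i.e., outside the usable RPLC retention window,
# which is how such models end up flagging parts of the space as "not covered".
print(round(predicted_rt(-1.5), 2))
```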
In the detection space, the ionization efficiency (IE) is the main determining factor for the measurability of a chemical, which fundamentally, and potentially significantly, limits the measurable space covered. (77,96) The IE is a structure-dependent parameter indicating the magnitude of the generated signal for a specific chemical. Several recent studies have successfully predicted the IE of known and unknown structures. Therefore, the IE could be used as a parameter for categorizing chemicals as detectable or not detectable. It should be noted that there has been a study classifying chemicals as analyzable via GC-MS vs LC-MS. (42) However, that study did not emphasize the measurability bottleneck and was trained on well-known structures (e.g., simple hydrocarbons and pharmaceuticals).
Instrumental Perspective
In a simplistic sense, our current measurements of the chemical space rely on getting separated compounds into a mass spectrometer under conditions that are sufficient for them to be measured. This has heavily relied on the ability to easily introduce the chemicals into liquid (mainly RPLC) and gas chromatography, with separation based on interactions between the chemicals, a solid phase, and a mobile phase. (38,87,97) This implies that the compounds are already in a liquid or gaseous solution, which already excludes particles, low-solubility chemicals, and nonvolatiles. Mass spectrometry requires chemicals to be in a gaseous ionized state, and ionization has largely focused on soft ionization for LC (mainly electrospray (ESI) and atmospheric pressure chemical ionization (APCI)) to produce pseudo/molecular ions (precursor ions) and hard ionization (electron impact (EI)) for GC, which typically fragments the molecules. As such, chemicals with poor ionization efficiency will likely not be introduced to the mass spectrometer, and those that have very high ionization efficiency may fragment too much to provide sufficient identification of the parent compound. Even those that do ionize may be unstable or rearrange. Beyond ionization, precursors are typically fragmented using collision-induced dissociation, also known as collisionally activated dissociation, and then measured via tandem mass spectrometry. Only recently have we seen the introduction of electron-activated dissociation (EAD) into commercial mass spectrometers, (98) and as such, new libraries need to be developed to identify chemicals based on the spectra produced through EAD.
Rarely in the NTA space have we seen the application of other separation (e.g., electrophoresis) and ionization techniques such as inductively coupled plasma mass spectrometry (ICP-MS) and matrix-assisted laser desorption/ionization (MALDI). For the identification of chemicals that may impact humans through surface contact or inhalation of particles, other ambient ionization techniques have emerged, including direct analysis in real time (DART) and desorption electrospray ionization (DESI). (99,100) These platforms have been used heavily in forensic science for the identification of bulk drug or explosive materials or diagnostic markers in contact evidence, such as illicit drug metabolites in fingermarks. (40,87) Other more recent developments include rapid evaporative ionization mass spectrometry (REIMS) and laser desorption ionization (LDI), which have both been integrated within surgical blades to rapidly classify biological materials in real time through the identification of discriminating biomolecules. (101) While other ionization sources exist, they are rarely used in parallel or tandem configurations with the conventional sources. Even the application of positive/negative switching ionization remains limited. Beyond instrumental ionization techniques, chemical derivatization can be conducted offline or even online and in some cases may even lead to better chromatographic separation. (102)
Measured Chemical Space
Within the measured subspace of the exposome chemical space, a large portion of the detected chemicals, and thus of the collected analytical signals, remain unidentified even when those signals are of high quality. An example is the C6 to C16 PFAS chemicals present in lipidomics data sets, as lipidomics and exposomics studies use very similar experimental setups. Recent studies have indicated that retrospective analysis can be employed to further annotate/identify the unknowns in archived data sets. (48,103−105) The retrospective analysis of archived data, even though it has shown great potential, has not been widely applied for the exploration of the measured chemical space. This shortcoming has mainly been due to inadequate data processing tools, limited chemical and spectral databases, the hypothesis-driven approaches used in NTA experiments, and the limited computational power available to different research groups.
The data processing strategies used for the retrospective analysis of archived data mainly consist of typical NTA workflows. (40,48) There have been several extensive reviews of such workflows and all the included steps. (40,45,106−108) Once these data are processed and further annotated, the identified signals are aligned over multiple data sets for trend analysis and/or inference. (109−111) However, the confidence levels associated with those identifications may not be the same across different samples. (48,104,106) Depending on the quality of the generated data (e.g., signal-to-noise ratio), some identifications may be less reliable than others. (46) In addition, the unidentified signals cannot be aligned unless they were generated under the same experimental conditions. (46) These challenges have resulted in several attempts at assessing data quality as well as at chromatogram alignment. (40,89,92)
The quality of the data collected for NTA assays defines whether the generated signals can be confidently identified or not. (112,113) The quality of the acquired data may be compromised due to heavy matrix effects, instrumental issues, and/or the nature of the chemical itself. (40,41) For some chemicals to generate an HRMS signal of acceptable quality, they must be analyzed using specific conditions. (88) These issues with data quality can be detected only once the data go through a complete data processing workflow. A common approach to approximately assess the quality of the collected data is to look for the added internal standards in the samples. (45,46,104,106) This approach, even though effective, is computationally expensive and requires detailed metadata about the experimental setup (e.g., the separation selectivity, ionization efficiency, and data acquisition conditions). Such information may or may not be available for a specific study, depending on its objectives. Moreover, depending on their number and spread, the internal standards may not be sufficient for assessing the quality of the collected data. Some efforts have been put toward standardizing the NTA data generation procedures. (40,114−116) However, the proposed procedures are specific to different communities and do not translate across them. As an example, the metabolomics and exposomics communities use different retention index scales (alkylamide (117) vs the University of Athens scale (118,119)), making comparison of such data sets extremely difficult. Furthermore, these measures tend to be overly conservative, resulting in low levels of implementation within each community.
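A minimal sketch of the internal-standard check described above: each spiked standard is searched for in the feature list within m/z and retention time tolerances, and the recovery rate serves as a crude data-quality flag. Compound names, feature values, and tolerances are hypothetical.

```python
# Minimal sketch of an internal-standard-based data-quality check.
internal_standards = [("standard_1", 221.1265, 11.8),   # (name, m/z, RT in min), hypothetical
                      ("standard_2", 198.0972, 5.2)]
features = [(221.1262, 11.75, 3.2e5),                   # (m/z, RT in min, intensity)
            (150.0913, 4.10, 8.9e4)]

def found(std, feats, mz_tol_ppm=5.0, rt_tol_min=0.3):
    """True if any feature matches the standard within both tolerances."""
    _, mz, rt = std
    return any(abs(f_mz - mz) / mz * 1e6 <= mz_tol_ppm and abs(f_rt - rt) <= rt_tol_min
               for f_mz, f_rt, _ in feats)

detected = sum(found(s, features) for s in internal_standards)
print(f"{detected}/{len(internal_standards)} internal standards recovered")
```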
Another very important challenge to be tackled during the processing is the signal alignment. (40) The aligned signals are needed for trend analysis and signal (feature) prioritization. Existing tools are limited to either a single batch or fully identified structures. (89−92) Retention indices using a set of calibrant chemicals or retention mapping have been widely utilized for such alignments. (89,92,117−119) However, both of these approaches need the presence of a set of chemicals (i.e., calibrants) in all the samples. Such solutions may be very effective for small scale chromatogram alignment but are not able to address the challenges associated with the alignment of archived data, given that these data rarely include the retention index calibrants or the same set of internal standards. Additionally, the few added internal standards are meant to represent all of the chemicals present in those samples. These limitations imply that the alignment of data sets acquired on different instruments using different experimental conditions is still an open question and requires further development.
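For illustration, the sketch below converts a feature's retention time to a retention index by linear interpolation between bracketing calibrants, which is the basic operation behind the calibrant-based alignment strategies mentioned above. The calibrant retention times and index values are hypothetical.

```python
# Minimal sketch of calibrant-based retention indexing: a feature's retention
# time is mapped to an index so that features from different runs can be
# compared on a common scale.
import numpy as np

calibrant_rt = np.array([1.5, 3.2, 5.8, 8.9, 12.4])     # RT (min) of calibrants in this run
calibrant_index = np.array([100, 200, 300, 400, 500])   # assigned index values

def retention_index(rt):
    # piecewise-linear interpolation between the bracketing calibrant points
    return float(np.interp(rt, calibrant_rt, calibrant_index))

print(retention_index(7.1))   # a feature eluting between the third and fourth calibrants
```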
The last step of NTA assays consists of the identification workflows, including spectral matching against experimental or in silico predicted spectra. (40,41) Independent of the source of the reference spectra, i.e., experimental or predicted, the quality of the measured spectrum to be identified is essential. (112,113) In the field of exposomics, the efforts in assessing the quality of recorded spectra are limited to a recent study with limited applicability. (112) In addition to the spectral quality, spectral matching algorithms also play an important role during the identification process. These algorithms range from a simple dot product to spectral entropy and more data-driven approaches (e.g., deep learning). (106,120−124) For spectral library matching, the size of the spectral library and its coverage of the relevant chemical space are essential factors. For example, when merging all open and commercial libraries together, only 40% of the metabolic networks is covered. (125) As for in silico fragmentation tools, access to a large and relevant chemical subspace is important. (25,126−128) These approaches employ the structures of the different candidates retrieved from the chemical databases for fragmentation or fingerprint prediction. However, these approaches, even though powerful, are limited by the incomplete coverage of the exposome chemical space by current chemical databases.
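As an example of the simplest family of matching algorithms mentioned above, the sketch below computes a dot-product (cosine) score between a measured and a library MS/MS spectrum after greedy peak matching within an m/z tolerance. Both spectra are hypothetical, and real implementations typically add intensity weighting and other refinements.

```python
# Minimal sketch of dot-product (cosine) spectral matching.
import math

measured = [(91.054, 100.0), (119.049, 35.0), (147.044, 12.0)]   # (m/z, intensity)
library  = [(91.054, 95.0), (119.050, 40.0), (165.055, 8.0)]

def cosine_score(spec_a, spec_b, mz_tol=0.01):
    """Greedily match peaks within mz_tol, then take the cosine of the
    intensity vectors; unmatched peaks contribute only to the norms."""
    matched, used = 0.0, set()
    for mz_a, i_a in spec_a:
        for j, (mz_b, i_b) in enumerate(spec_b):
            if j not in used and abs(mz_a - mz_b) <= mz_tol:
                matched += i_a * i_b
                used.add(j)
                break
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return matched / (norm_a * norm_b)

print(round(cosine_score(measured, library), 3))   # close to 1 for this toy pair
```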
In addition to the previously mentioned approaches for structural elucidation, there has recently been a surge of data-driven tools that facilitate de novo identification and/or annotation of chromatographic signals. Many of these tools use a combination of machine learning, previously annotated spectra, and spectral similarity to provide additional inference into the structure of an unknown signal (e.g., molecular networking and Spec2LDA). (129−131) However, their application has been limited mainly to metabolomics studies, and therefore, they have not been adequately tested for the exploration of the exposome.
Identified Chemical Space
The identified chemical space is the subspace of the measured chemical space in which the structures are fully characterized (e.g., measured via GC/LC-HRMS and available as an analytical standard). This subspace is extremely small compared to the size of the exposome chemical space. A recent meta-analysis showed that all NTA studies in the past 5 years have resulted in around 1600 unique new structures (i.e., at confidence levels one and two), while every year around 700 new structures are introduced into the US market alone. (3,38) It should be noted that the true number of new structures introduced into the global market is extremely difficult to estimate. Considering the number of potential transformation products of these chemicals, the pace of NTA studies is far too slow to catch up with the rate of expansion of the exposome chemical space.
How to Move Forward
The main category of chemicals absent from the current exposome-related chemical databases is the transformation products of anthropogenic chemicals. Structure-based molecular networks (SBMNs), from drug discovery, combined with synthetic accessibility (computational synthetic chemistry) can be used to build the transformation tree of a chemical. (62,132,133) The already well-known transformation products would provide the distance metrics for the SBMNs, and the synthetic accessibility calculation would enable pruning of impossible structures from the trees. (133−135) Other data-driven approaches, such as generative models, can also provide the means of building such transformation trees. Ultimately, the structures in the pruned trees can be added to existing chemical databases.
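A minimal sketch of the SBMN idea under stated assumptions: Morgan fingerprints and Tanimoto similarity (computed with RDKit, our choice of toolkit) link a toy parent to candidate transformation products, and edges below a similarity cutoff are dropped as a naive stand-in for the synthetic-accessibility-based pruning described above.

```python
# Minimal sketch of a structure-based molecular network with similarity pruning.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = {
    "parent": "CC(=O)Oc1ccccc1C(=O)O",   # aspirin as a toy parent
    "tp_1":   "Oc1ccccc1C(=O)O",         # salicylic acid (known transformation product)
    "tp_2":   "O=C(O)c1ccccc1",          # benzoic acid (hypothetical candidate node)
}

# Morgan fingerprints (radius 2) for every structure in the toy network
fps = {name: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
       for name, s in smiles.items()}

edges = []
for name, fp in fps.items():
    if name == "parent":
        continue
    sim = DataStructs.TanimotoSimilarity(fps["parent"], fp)
    if sim >= 0.3:                        # similarity cutoff as a naive pruning rule
        edges.append(("parent", name, round(sim, 2)))

print(edges)   # retained parent-to-candidate edges with their similarities
```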
In terms of the measurable exposome chemical space, the combination of the modeled separation (e.g., retention time) and mass spectral behavior of chemicals can be used. Retention indices can provide a first glance into the connection between the structure of a molecule and its behavior in the chromatographic space. On the other hand, the ionization efficiency has great potential in connecting the structure of a chemical to its response in the mass spectrometer. The combination of these two metrics can provide a valuable training set to build models where the measurability of a chemical can be assessed based on its structure. A potential byproduct of such a strategy is that these models may be able to suggest the optimized experimental conditions for the analysis of a certain structure (e.g., reverse phase vs normal phase). In addition, the development and integration of models using complementary separation techniques in parallel to chromatography are needed (e.g., electrical separations such as capillary electrophoresis (CE) and ion mobility spectrometry). (101,136−139) Ultimately, modeling multiple separation spaces using data from orthogonal techniques may reduce porosity and extend the boundaries of the measured chemical space, as well as provide additional confirmation where overlaps exist (e.g., machine learning-based prediction of both retention behavior and collision cross section may increase confidence in prioritization workflows (89,140)).
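The sketch below illustrates how the two metrics could be combined: predicted retention behavior and predicted ionization efficiency serve as features for a simple classifier that labels a structure as measurable or not. The feature values, labels, and model choice are hypothetical placeholders for real retention-index and IE predictions.

```python
# Minimal sketch of a measurability model combining retention and IE features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [predicted retention index, predicted log ionization efficiency]
X_train = np.array([[350, 2.5], [420, 3.1], [120, -0.5],
                    [90, 0.2], [510, 1.8], [60, -1.2]])
y_train = np.array([1, 1, 0, 0, 1, 0])     # 1 = measurable with the generic method (toy labels)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

candidate = np.array([[200, 0.8]])         # a new structure's predicted behavior
print(model.predict(candidate), model.predict_proba(candidate).round(2))
```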
To further expand our coverage and understanding of the chemical space experimentally, harmonization of both complementary and orthogonal techniques is required, for example, through application of both GC- and LC-HRMS and HILIC and RPLC separation to the same samples. We also need to better understand the boundaries of each technique and the porosity within. For example, polar analytes typically have lower ionization efficiency; hence, further development is required for both separation and ionization techniques. Integration of ion chromatography (IC) and CE coupled with both HRMS and ICP-MS techniques needs to be considered. These instruments provide unique selectivity for very polar, inorganic, and ionized compounds (e.g., metals/metalloids, low molecular weight PFAS, and disinfection byproducts); hence, for NTA of drinking water, this is a particular knowledge gap where these platforms represent excellent solutions. (88) IC and CE separation techniques are orthogonal and inherently complement each other. (141)
Mass spectrometry is not the only option available for identification, and other techniques such as Nuclear Magnetic Resonance (NMR) spectroscopy can be coupled with LC separation. Proton NMR in particular should be considered for determining the structure of organic molecules, as it allows the connectivity of the atoms within a molecule to be elucidated and functional groups to be identified. (142,143)
However, this draws out a larger issue that arguably requires much more focus in the NTA community moving forward. Identification frameworks for chemical residues in environmental samples usually consider the value offered by HRMS data first, followed by evaluation of any increased confidence provided by supplementary chromatographic data. (144) In other fields such as forensic science, the combination of data generated by a much wider set of orthogonal/uncorrelated techniques is considered in far more depth. For example, the Scientific Working Groups for the Analysis of Seized Drugs (SWGDRUG) and for Fire and Explosions (SWGFEX) both categorize techniques broadly into those providing presumptive, indicative, and confirmatory evidence. The SWGFEX guidelines for postblast explosive identification are a particularly relevant example for trace chemical NTA (recommended guidelines for the forensic identification of postblast explosive residues). An array of confirmatory techniques is categorized by their ability to offer structural- or elemental-level detail, including Raman, FTIR, and X-ray diffraction, in addition to LC-MS and GC-MS. Energetic materials are well-known to be very challenging to measure by any single method or technique. Gaps in the measurable space that exist for methods that use confirmatory techniques are fundamentally considered. This is especially true for chromatography coupled to mass spectrometry. Given the potential undesirable outcomes of a “false negative” in this particular field, the combination of techniques is critical (e.g., to identify both inorganic and organic explosives). Even when using MS, the choice of ionization technique in LC-MS is extremely important (e.g., ESI is normally more suitable for nitrate esters, APCI is better for nitrotoluenes, and neither is particularly effective for the detection of some explosives like nitroglycerin or hexamethylene triperoxide diamine (145)). For the identification of intact/bulk drugs, SWGDRUG also considers NMR spectroscopy a confirmatory technique (Scientific Working Group for the Analysis of Seized Drugs (SWGDRUG) Recommendations, 2019). With regard to exposomics, NMR has been used for many years for the identification of biomolecules in exposome research. (146) Though generally less sensitive than MS, higher-field instruments, multiple scans, different probes, or hyperpolarisation used together with sample preconcentration methods may improve its contribution to the identification of new substances at trace concentrations in complex samples. (147)
As for the measured exposome chemical space, the main challenges are raw data quality assessment, incomplete preprocessing workflows, and identification workflows. Data quality assessment must become independent from the data preprocessing workflows, as the latter may be a major source of error in the final results, for example, a low-quality MS/MS spectrum due to the lack of deconvolution algorithms. The current NTA workflows are set up to focus on a single sample or batch of samples analyzed with one specific method. This limitation hinders large-scale retrospective analysis of archived data, as most of the signals may remain unidentified. Moreover, relying on the same set of internal standards may not be adequate. Therefore, the development of alignment algorithms based on the raw data or the raw feature lists is a must for being able to fully take advantage of the publicly available archived data. Finally, when it comes to identification, the current approaches are based on a set number of matched fragments and hard-set thresholds. This may not be the most adequate way forward, as different chemicals may need different parameters for their high-confidence identification. For example, for PFAS chemicals, having two or three fragments may be enough, while for hormone-like chemicals, sometimes even 100 fragments are not enough. Moreover, depending on the levels of background signal and matrix effects, the mass accuracy of the instrument may differ, resulting in a better match with incorrect candidates.
From a regulatory point of view, knowledge on what can be measured or not is essential. Chemicals that cannot be measured or are very difficult to measure (e.g., very mobile chemicals) are very difficult to regulate. Thus, for new chemicals to be introduced, evidence of the ready measurability of the parent and the most abundant transformation products may be considered as one of the necessary criteria. The high detection frequency of chemicals in the archived data can be further integrated as one of the strategies for early detection of chemicals of emerging concern.
Overall, we have highlighted here the most immediate scientific gaps related to mapping the exposome chemical space. It should be noted that there are many more challenges that need to be tackled. However, based on our assessment, it is clear that the current approaches do not provide the means for proactive chemical management. Therefore, the combination of data-driven approaches with existing strategies will be a necessary step forward to bridge these knowledge gaps.
Potential Impact
The expanded and mapped chemical space of the exposome, with its predicted physicochemical properties and biological activities, will unleash new waves of developments in chemical management, toxicology, and analytical technology. The predicted properties and biological activities will guide new chemical regulation. The transformation products added to the exposome chemical space will provide the means for replacing toxic chemicals with safe alternatives and thus for future safe- and sustainable-by-design chemicals. The measurability assessment will identify the portion of the exposome chemical space that cannot be analyzed with our current technology, stimulating the development of new analytical tools to further expand this coverage. The chemicals newly identified via retrospective analysis of archived data will provide insights into the connection between chemical exposure and observed health outcomes, and thus into the mechanistic relationships between exposure and certain health outcomes.
- Exploring the Chemical Space of the Exposome: How Far Have We Gone? Saer Samanipour, Leon Patrick Barron, Denice van Herwerden, Antonia Praetorius, Kevin V. Thomas, and Jake William O’Brien. JACS Au 2024 4 (7), 2412-2425. DOI: 10.1021/jacsau.4c00220
