Cheminformatic Feature Interrelations and Their Conceptual Parallels

 
Feature Interrelation Profiling
 
Ivan Čmelo
Department of Informatics and Chemistry
UCT Prague
 
ENBIK 2022
 
Many cheminformatic models are based on the presence of individual features

Natural product likeness (NPScore), synthetic accessibility assessment (SAScore, SYBA), etc.
 
 
Structure (example: tyrosine) → Structural features (example: extended connectivity substructures) → Scoring
[Figure: scores shown in the scoring step: 0.28, 0.13, 0.34, 0.11, 0.23, 0.27, 0.22, 0.15, 0.17]

How about "good" features in "bad" combinations, and vice versa?
 
Conceptual parallels for feature interrelations
 
Bioinformatics: quantifying gene co-inheritance
Medicine: quantifying disease co-morbidities
… among many others …
Linguistics: quantifying text difficulty (Flor et al.)
    prior: "text containing obscure words is harder to comprehend"
    proposed: "text containing strange word combinations is harder to comprehend"
 
For comparison, consider the sentences:
 
"Sheep slept quietly in the barn."    vs.    "Sheep slept furiously in the hospital."
 
🤔
 
Pointwise mutual information (PMI):

1. Structural features
2. Feature co-occurrences (CORM)
3. Co-occurrence probabilities (COPRM)
4. Pointwise mutual information values (PMIRM)
5. Optionally Z-scored (ZPMIRM)
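
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the fip3 implementation: the function names `pmi_profile` and `z_score` are hypothetical, probabilities are assumed to be counts divided by the number of structures, the logarithm base (2) is an assumption, and Z-scoring is assumed to be taken over all PMI values in the profile.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def pmi_profile(feature_sets):
    """Build a PMI profile from an iterable of per-structure feature sets.

    Returns a dict mapping a sorted feature pair (a, b) to its PMI value.
    Features must be mutually comparable (e.g. all strings or all ints).
    """
    n_structures = 0
    single_counts = Counter()
    pair_counts = Counter()  # step 2: co-occurrence counts (CORM)
    for features in feature_sets:
        feats = sorted(set(features))
        n_structures += 1
        single_counts.update(feats)
        pair_counts.update(combinations(feats, 2))

    profile = {}
    for (a, b), count in pair_counts.items():
        # step 3: co-occurrence probability (COPRM); normalization is an assumption
        p_ab = count / n_structures
        p_a = single_counts[a] / n_structures
        p_b = single_counts[b] / n_structures
        # step 4: PMI compares observed co-occurrence with independence
        profile[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return profile


def z_score(profile):
    """Step 5: standardize the PMI values of a profile (assumed global Z-score)."""
    values = list(profile.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {pair: 0.0 for pair in profile}
    return {pair: (value - mu) / sigma for pair, value in profile.items()}
```

A positive PMI value means the two features co-occur more often than independence would predict; a negative value means they co-occur less often.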
 
Relative feature tightness (RFT)

$\mathrm{RFT} = \operatorname{mean}_{i<j}\, \mathrm{PMI}_{\mathrm{ref}}(F_i, F_j)$: the mean over all combinations $F_i, F_j$ in a feature set $F_1 \dots F_n$, against a reference PMIRM profile $\mathrm{PMI}_{\mathrm{ref}}$

For a given vector of features, RFT quantifies how well the feature combination fits the reference interrelation profile (PMIRM).

1. Translate a molecule to its structural features
2. Look up the feature combinations in the reference interrelation profile (PMIRM)
3. Calculate the average of their PMI values

Molecule → Features → Feature combinations → PMIRM reference interrelation profile → Feature PMI values → RFT value
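
The RFT lookup-and-average procedure can be sketched as follows. This is a hedged illustration, not the library's API: `relative_feature_tightness` and its `missing` parameter are hypothetical names, the reference profile is assumed to be a dict keyed by sorted feature pairs, and the treatment of pairs absent from the reference (skipped by default) is an assumption.

```python
from itertools import combinations
from statistics import mean


def relative_feature_tightness(features, reference_profile, missing=None):
    """Average the reference PMI values over all feature pairs of one molecule.

    features: iterable of the molecule's structural features
    reference_profile: dict mapping sorted (a, b) pairs to PMI (or Z-scored PMI)
    missing: value substituted for pairs absent from the reference;
             None (default) skips such pairs -- an assumption of this sketch
    """
    feats = sorted(set(features))
    values = []
    for pair in combinations(feats, 2):
        if pair in reference_profile:
            values.append(reference_profile[pair])
        elif missing is not None:
            values.append(missing)
    # Molecules with fewer than two features, or with no known pairs, get no score
    return mean(values) if values else None
```

Passing a Z-scored profile (ZPMIRM) as `reference_profile` yields the Z-scored variant (ZRFT) in this sketch.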
 
Application to synthetic accessibility: ZRFT against ZINC PMI profile using ECFP4

[Diagram: ZINC structures → ZPMIRM profile; ZINC structures and Nonpher structures → ZINC ZRFT values and Nonpher ZRFT values]
… compare ZRFT values with SAScore, SYBA

ZINC: a database of commercially available molecules
    Sterling T, Irwin JJ (2015) ZINC 15 – ligand discovery for everyone. J Chem Inf Model 55(11)
Nonpher: a tool to produce synthetically less accessible molecules
    Voršilák M, Svozil D (2017) Nonpher: computational method for design of hard-to-synthesize structures. J Cheminform 9:20
SAScore: a well-established estimator of synthetic accessibility
    Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform 1:8
SYBA: a Bayesian classifier for estimating synthetic accessibility
    Voršilák, M., Kolář, M., Čmelo, I. et al. SYBA: Bayesian estimation of synthetic accessibility of organic compounds. J Cheminform 12:35
 
 
Application to synthetic accessibility: ZRFT against ZINC PMI profile using ECFP4

There seems to be a strong correlation between ZRFT and the established methods.
It is possible to gain information on synthetic accessibility based on bit interrelations in ECFP4 fingerprints alone.

More information in the paper:
    Čmelo, I., Voršilák, M. & Svozil, D. Profiling and analysis of chemical compounds using pointwise mutual information. J Cheminform 13, 3

Speculated improvements:
    Use fragments directly instead of ECFP4 bits (remove the factor of bit collisions)
    Use a reference interrelation profile (reduce inherent interrelations from fragment overlaps, etc.)
 
Pointwise Kullback–Leibler divergence (PKLD)

Structural features can be inherently linked by overlapping, etc.
Instead of marginal probabilities, another profile can be used as the baseline.
PKLD is a difference profile between a given pair of profiles.

PMI: observed co-occurrence probabilities vs. co-occurrence probabilities presuming feature independence
PKLD: observed co-occurrence probabilities vs. reference co-occurrence probabilities
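
The contrast between PMI and PKLD can be sketched in code. This is an illustration under assumptions, not the published definition: the form log2(p_observed / p_reference), the function name `pkld_profile`, and the decision to skip pairs missing from either profile are all choices made for this sketch.

```python
import math


def pkld_profile(observed, reference):
    """Pointwise difference profile between two co-occurrence probability profiles.

    observed, reference: dicts mapping feature pairs to co-occurrence probabilities.
    Where PMI divides by the product of marginal probabilities (independence),
    this sketch divides by the reference profile's co-occurrence probability.
    """
    profile = {}
    for pair, p_obs in observed.items():
        p_ref = reference.get(pair, 0.0)
        # Pairs unseen in either profile have no finite log-ratio; skipped here (assumption)
        if p_obs > 0 and p_ref > 0:
            profile[pair] = math.log2(p_obs / p_ref)
    return profile
```

The resulting profile can be used as the reference for the RFT average in the same way as a PMI profile.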
 
ZRFT against PKLD ZINC/Nonpher profile
 
There seems to be about the same correlation between ZRFT and the established methods as before.
However, the separation between ZINC and Nonpher structures seems to be much improved.
 
A fragment radius of 1 is not sufficient; a fragment radius of 2 seems optimal. Radii of 3+ yield very large, sparse profiles.
According to ROC, the performance of ZRFT now seems comparable with the established synthetic accessibility estimation methods.
 
RFT also seems to work for natural product likeness

COCONUT: a database of natural products
    Sorokina, M., Merseburger, P., Rajan, K. et al. COCONUT online: Collection of Open Natural Products database. J Cheminform 13, 2

The interrelation methodology is now being adapted for natural product likeness estimation by Bc. Kateřina Lišková.
Profile used: COCONUT/ZINC PKLD profile, ECFP6, 2048 bits
Natural products seem to have quantitatively different ECFP bit interrelations from generic ZINC substances.
 
Hybrid feature interrelation profiles
 
Summary
 
All tested chemical datasets had structural feature interdependencies.
These interdependencies differ between the tested datasets.
Differences in these interdependencies were used to assess synthetic accessibility of chemical structures, as well as natural product likeness.
(+) Can be made for arbitrary features, not just structural
(+) The features and their mutual relations can also be interpreted as a graph
(+) It is a generic measure, not a model!
(-) Many structures are needed for meaningful profiles
(-) Profiles get very large

Implemented as an open-source Python library (https://github.com/cmeloi/fip3)
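
The graph interpretation mentioned in the summary can be illustrated with a small sketch. It does not use the fip3 API; `profile_to_graph` and its `threshold` parameter are hypothetical, and keeping only edges above a threshold is just one possible way to make the graph sparse.

```python
from collections import defaultdict


def profile_to_graph(profile, threshold=0.0):
    """Turn an interrelation profile into a weighted, undirected adjacency dict.

    profile: dict mapping feature pairs (a, b) to an interrelation value (e.g. PMI)
    threshold: only pairs with a value above this become edges (assumption)
    """
    graph = defaultdict(dict)
    for (a, b), value in profile.items():
        if value > threshold:
            # Store the edge in both directions so neighbours are easy to look up
            graph[a][b] = value
            graph[b][a] = value
    return dict(graph)
```

Features become nodes and their interrelation values become edge weights, so standard graph tools (neighbourhoods, clustering) can be applied to a profile.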
 
Thank you!
 
Prof. Daniel Svozil, Ph.D.
Ing. Milan Voršilák, Ph.D.
Mgr. Wim Dehaen
Bc. Kateřina Lišková
 
Supported by the Ministry of Education of the Czech Republic (RVO 68378050-KAV-NPUI and LM2018130), as well as Junior Internal Grant of the UCT Prague (2021, #2103)

Many cheminformatic models, such as natural product likeness (NPScore) and synthetic accessibility estimators (SAScore, SYBA), score compounds by the presence of individual structural features. This presentation asks what happens with "good" features in "bad" combinations, draws parallels with bioinformatics, medicine, and linguistics, and quantifies feature interrelations using pointwise mutual information.


Uploaded on Oct 04, 2024

