
Dr. Olexandr Isayev and Prof. Denis Fourches
Laboratory for Molecular Modeling,
University of North Carolina at Chapel Hill, USA
Decline in Pharmaceutical R&D efficiency
The cost of developing a new
drug roughly doubles every
nine years.

10^33 drug-like chemicals*

10^8 compounds in PubChem

10^6 compounds in ChEMBL with known bioactivity

Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191-200

Need for novel bio/cheminformatics methods that:
(i) Fully exploit the potential of modern chemical and biological data streams;
(ii) Reliably forecast compounds' bioactivity and safety profiles;
(iii) Accelerate the translation from basic research to drug candidates.
* Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.
Ligand-based Virtual Screening
to identify potential hits

Screening filters:
- Empirical Rules/Filters
- Similarity Search
- QSAR Models
- Consensus

[Figure: virtual screening funnel, from ~10^6-10^9 molecules down to ~10^2-10^3 potential hits]
Quantitative Structure-Activity Relationships (QSAR)

Thousands of molecular descriptors are available for organic compounds:
constitutional, topological, structural, quantum mechanics based, fragmental, steric,
pharmacophoric, geometrical, thermodynamical, conformational, etc.

[Figure: compound structures and their activities, encoded as a descriptor matrix with samples (compounds 1..n) as rows and features (descriptors X1..Xm) as columns]

- Building of models using machine learning methods (NN, SVM, RF)
- Validation of models according to numerous statistical procedures, and their applicability domains.
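A minimal sketch of the two steps above (model building, then validation), using a k-nearest-neighbour classifier on a descriptor matrix with leave-one-out validation. The descriptor values, the labels, and the function names are made up for illustration; this is not the authors' modeling workflow, and it uses plain Python rather than any particular machine learning library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict 1 (active) or 0 (inactive) by majority vote of the k nearest compounds."""
    ranked = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = [train_y[i] for i in ranked[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0

def leave_one_out_accuracy(X, y, k=3):
    """Hold out each compound in turn and predict it from all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return float(correct) / len(X)

# Toy descriptor matrix: rows are compounds, columns are descriptors X1, X2 (made-up values)
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
     [0.90, 0.80], [0.80, 0.90], [0.85, 0.75]]
y = [0, 0, 0, 1, 1, 1]  # made-up activity labels

print(knn_predict(X, y, [0.12, 0.18]))
print(leave_one_out_accuracy(X, y))
```

Leave-one-out is only one of many validation schemes; an applicability-domain check (for example, rejecting queries that are too far from every training compound) would be a separate step not shown here.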
Discovery of Novel Antimalarial Compounds
Enabled by QSAR-Based Virtual Screening
- Severe infectious diseases (~700,000 deaths per year worldwide)
- Caused by unicellular eukaryotic parasites, mainly Plasmodium falciparum

- Modeling Set: 158 actives, 2,975 inactives. kNN and SVM with ISIDA descriptors

[Figure: screening workflow. Chembridge Database (454,638 chemicals), reduced by drug-likeness and similarity filters (44,112, then 39,944 chemicals), then QSAR models: 176 putative hits and 42 putative inactives]

External predictive power of QSAR models is critical to enable their application to virtual screening.
Technically challenging to compute molecular properties and descriptors for more than 10^9 compounds.
No cheminformatics architecture is able to screen >10^9 compounds.

EXPERIMENTAL CONFIRMATION (Dr. Guy, St Jude Res. Hosp)
- Most potent hit (SJ000565000) with EC50 = 95.6 nM and novel molecular scaffold
- 7 compounds with EC50 less than 2 µM
- 18 compounds with moderate activity (EC50 2-8 µM)
- All of the 42 putative inactives have EC50 >10 µM

14.2% hit rate >> HTS hit rate (0.1-5%)

[Figure: structure of SJ000565000]
Zhang, Fourches, et al. JCIM, 2013, 53, 475-492
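The 14.2% figure can be reproduced from the counts on this slide if it is read as confirmed actives over putative hits tested. That reading is an assumption: the slide does not spell out the calculation, and the sketch below also assumes the most potent hit is counted among the 7 compounds below 2 µM.

```python
def hit_rate(confirmed_actives, tested):
    """Fraction of tested compounds that were experimentally confirmed."""
    return float(confirmed_actives) / tested

potent = 7           # compounds with EC50 below 2 µM
moderate = 18        # compounds with EC50 between 2 and 8 µM
putative_hits = 176  # compounds predicted active by the QSAR models

rate = hit_rate(potent + moderate, putative_hits)
print("{:.1%}".format(rate))  # prints 14.2%
```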
Study Design

Chemical Datasets
Largest publicly available virtual libraries

GDB-13: 955 M compounds
GDB-13-ABCDE subset: 141 M
GDB-17 subset: 50 M

1 Blum and Reymond, 2009, J Am Chem Soc, 131, 8732-8733
2 Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875
Setup

Hardware Stack
- Intel Core i7 4770 CPU, 3.4GHz
- Intel H87-based motherboard
- 32GB of DDR3 1600 memory
- Nvidia Tesla K20 for GPU accelerated calculations

Software Stack
- Ubuntu 12.04
- Anaconda Scientific Python 2.7.6 Distribution
- Pandas / Pytables
- MKL optimized NumPy
- NUMBAPRO for CPU optimization
- RDKit
- C / CUDA
Chemical Library

Data Processing
- Data parsing from SMILES
- 2D structure generation
- Automatic curation

High throughput descriptor generator (30M/hr)
Mol weight, logP, H acceptors/donors, Rot bonds, Daylight fingerprints.
Screening & Modeling


Storage & Manipulation
- Smaller datasets (<1M) directly allocated in RAM
- Indexed, fully searchable, and accessible via high level API, e.g.,
  (data.MolWt > 150) & (data.logP == 3)
- Access in chunks or streaming compound by compound.
- Interactive analytics with IPython
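The query above reads like a boolean mask over descriptor columns. The following sketch mimics the same filter with plain Python while reading the table in fixed-size chunks, so that the whole library never has to sit in memory. The table contents, the column name smiles, and the helper names are hypothetical; the platform's actual storage API is not shown on the slides.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def select(rows, chunk_size=2):
    """Stream rows chunk by chunk, keeping those with MolWt > 150 and logP == 3."""
    hits = []
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            if float(row["MolWt"]) > 150 and float(row["logP"]) == 3:
                hits.append(row["smiles"])
    return hits

# Made-up table standing in for an on-disk descriptor store (values are illustrative only)
TABLE = u"""smiles,MolWt,logP
CCO,46.07,-0.1
c1ccccc1CCCCCC,162.27,3.0
CCCCCCCCCCCC,170.34,3.0
CC(=O)Oc1ccccc1C(=O)O,180.16,1.3
"""

print(select(csv.DictReader(io.StringIO(TABLE))))
```

Exact equality on a floating-point descriptor such as logP is fragile in practice; a tolerance or a range condition is usually the safer query.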
Modeling & Screening
- Rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models
- GPU accelerated similarity search: ~1M Tanimoto/s
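As a CPU-only illustration of what a similarity screen computes, the sketch below stores each fingerprint as a Python integer, scores every library compound against a probe, and keeps the best matches. It is a stand-in for the idea, not the GPU kernel used by the platform; the fingerprints and compound IDs are made up.

```python
import heapq

def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    union = popcount(a | b)
    return float(popcount(a & b)) / union if union else 0.0

def screen(probe, library, top_k=3):
    """Return the top_k (score, compound_id) pairs most similar to the probe."""
    scored = ((tanimoto(probe, fp), cid) for cid, fp in library)
    return heapq.nlargest(top_k, scored)

probe = 0b10110110  # made-up 8-bit fingerprint
library = [
    ("cmpd-1", 0b10110110),
    ("cmpd-2", 0b10110100),
    ("cmpd-3", 0b01001001),
    ("cmpd-4", 0b10100110),
]

for score, cid in screen(probe, library, top_k=2):
    print(cid, round(score, 3))
```

Using `heapq.nlargest` keeps only the current best `top_k` results, so the library can be an arbitrarily long stream rather than a list held in memory.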
GPU - Case Study 1
Fast Computation of Molecular Properties for
Extremely Large Chemical Libraries

[Figure panels: GDB-13, subset of 141 M; GDB-17, random sample of 50 M]

Our GPU-accelerated cheminformatics platform is able to compute
key molecular properties for GDB-13 (855M), GDB-13-ABCDE
(141M), and a subset of GDB-17 (50M) compounds.
GPU - Case Study 2
Virtual Screening of Very Large Chemical
Libraries to Identify Bioactive Compounds

- Lacosamide (trade name Vimpat) is an anticonvulsant drug used to
  prevent seizures for patients treated for epilepsy;

- Functionalized amino acid;

- Many active analogues have been synthesized in Prof. Harold Kohn's
  laboratory* at UNC-CH.

[Figure: structure of Lacosamide]
*Wang et al., 2011, ACS Chem Neurosci, 2, 90-106
Similarity search using Lacosamide as molecular probe
200M compound subset of GDB-13/17

[Figure: structures of Analog 1, Analog 2, Analog 3, Analog 4, Analog 5]
The GPU-accelerated screening platform was able to retrieve:
- known active analogues of lacosamide,
- several functionalized amino acids present in GDB-13,
- a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.

Compound ID        Tanimoto Ts
Analog 2           0.997
Analog 3           0.995
Analog 1           0.994
Analog 4           0.992
Analog 5           0.978
Gdb13-a10573585    0.977
Gdb13-b28137563    0.977
Gdb13-a36264983    0.976
Gdb13-a36264952    0.976
Gdb13-a10616005    0.976
Gdb13-a3011053     0.976
Gdb13-b21242261    0.976
Gdb17-44140083     0.976
Gdb13-a30878321    0.975
Gdb13-b3485216     0.975
In Summary
GPU-accelerated cheminformatics platform for high
performance virtual screening of extremely large
chemical libraries.
Tested for (1) the analysis of the largest publicly available
dataset, GDB-13 (~900M compounds), and (2) the
screening of a ~200M compound library by similarity
search using an anticonvulsant drug as the molecular
probe.
Our platform aims to virtually screen billions of
compounds using similarity filters and QSAR models.
Acknowledgements
Professor Alex Tropsha (UNC-CH)
Colleagues at MML laboratory
NVIDIA & Mark Berger for generous hardware
donation

Funding
- NSF ABI program
- Office of Naval Research
Molecular fingerprints - bit string encodings of structural features
and/or calculated molecular properties.

Information about the presence of molecular fragments:
1 = fragment is present
0 = fragment is absent
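A small sketch of the presence/absence encoding described above: each position in the bit string corresponds to one fragment from a fixed key list. The fragment names and their order are hypothetical. This is the structural-key style of fingerprint; hashed path-based fingerprints such as the Daylight type assign bits differently.

```python
# Hypothetical fragment dictionary; real key sets contain hundreds of entries
FRAGMENT_KEYS = ["amide", "aromatic ring", "ether", "hydroxyl", "nitro"]

def fingerprint(fragments_present, keys=FRAGMENT_KEYS):
    """Encode a set of fragments as a bit string: 1 = present, 0 = absent."""
    return "".join("1" if key in fragments_present else "0" for key in keys)

print(fingerprint({"amide", "aromatic ring", "ether"}))  # prints 11100
```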
Similarity Search
Similarity searching using fingerprint representations of molecules is one of the
most widely used approaches for chemical database mining: it assumes that
similar compounds possess similar biological activities.

Tanimoto Coefficient

From J. Bajorath, SSS Cheminformatics, Obernai 2008
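The formula on this slide did not survive as text. For reference, the standard definition of the Tanimoto coefficient for two binary fingerprints A and B is given below; the symbols a, b, and c are the conventional ones and are not taken from the slide.

```latex
T(A,B) = \frac{c}{a + b - c}
% a = number of bits set to 1 in A
% b = number of bits set to 1 in B
% c = number of bits set to 1 in both A and B
% Example: a = 5, b = 4, c = 4 gives T = 4 / (5 + 4 - 4) = 0.8
```

T ranges from 0 (no bits in common) to 1 (identical fingerprints).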
