DataWarrior Tutorial

Know Your Molecule
Examining published pharmacology of the chemical space around your series using
ChEMBL, KNIME & DataWarrior
A frequent goal after HTS (high throughput screen) is to decide in which series to invest chemistry
effort and what issues need to be addressed within each series. One element of that decision is to
understand what published data can tell you about the chemical space surrounding the hit molecule.
The following workflow has been designed to answer that question using three freely available*
resources:
1. ChEMBL – a large database of published compounds and their pharmacology
2. KNIME – a data pipelining tool
3. DataWarrior – an SAR (structure activity relationship) analysis tool
NB: As configured, the KNIME workflow sends your molecular structures to the ChEMBL servers. If you
don’t want that to happen (e.g. for confidentiality/IP reasons), you will need to download the
database & reconfigure the workflow.
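To make the NB above concrete, here is a minimal Python sketch of how a structure ends up inside a web request. It assumes the public ChEMBL REST endpoint https://www.ebi.ac.uk/chembl/api/data/similarity/<SMILES>/<threshold>; the KNIME node may well call a different service. The function name and the example SMILES are illustrative only. The code builds the URL but does not send it.

```python
from urllib.parse import quote

# Assumption: base URL of the public ChEMBL REST similarity service.
# The KNIME workflow described in this guide may use a different endpoint.
CHEMBL_SIMILARITY_URL = "https://www.ebi.ac.uk/chembl/api/data/similarity"


def similarity_search_url(smiles: str, threshold_percent: int = 75) -> str:
    """Build (but do not send) a similarity-search request URL."""
    # The service may impose a narrower range than the one checked here.
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in the range 1 to 100")
    # Percent-encode every reserved character so '(', '=', '#', '/' survive in the path.
    encoded = quote(smiles, safe="")
    return f"{CHEMBL_SIMILARITY_URL}/{encoded}/{threshold_percent}?format=json"


# Benzoic acid is used as a stand-in; it is not the tutorial's test molecule.
example_url = similarity_search_url("c1ccccc1C(=O)O")
print(example_url)
```

The point of the sketch is that the full structure is part of the outgoing request, which is why a local copy of the database is needed for confidential compounds.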
Know Your Molecule: published example
• In the paper above (open access) the authors describe a process for hit triage which
includes searching the ChEMBL database for analogues of the hits. This information
helped to inform the decision about which series to prioritise and what issues
(particularly undesirable pharmacology) might need to be screened for & designed
out of the series.
• This guide describes carrying out a similarity search of the ChEMBL
database & provides methods for examining the resulting data. It uses one of
the hits presented in the paper: compound 7 in Table 2, CHEMBL587310.
ChEMBL: a short introduction
• ChEMBL or ChEMBLdb (https://www.ebi.ac.uk/chembldb/) is a manually curated chemical
database of bioactive molecules with drug-like properties.* It is maintained by the European
Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at
the Wellcome Trust Genome Campus, Hinxton, UK.
• ChEMBL statistics (May 2015):
– Targets: 10,774, Compound records: 1,715,667, Distinct compounds: 1,463,270, Activities:
13,520,737, Publications: 59,610.
• The ChEMBL group have also developed tools and resources for data mining.**
– Kinase SARfari, integrated chemogenomics workbench focussed on kinases.
– GPCR SARfari, focused on GPCRs.
– ChEMBL-Neglected Tropical Diseases (ChEMBL-NTD), repository for Open Access primary
screening and medicinal chemistry data directed at neglected tropical diseases.
– Malaria data service, sponsored by the Medicines for Malaria Venture (MMV). Includes
compounds from Malaria Box screening set, as well as donated malaria data in ChEMBL-NTD.
– ADME SARfari, tool for predicting and comparing cross-species ADME targets.
* Gaulton, A et al. (2011). "ChEMBL: a large-scale bioactivity database for drug discovery". Nucleic Acids Research 40:
D1100–7. doi:10.1093/nar/gkr777.
** Bellis, L J et al. (2011). "Collation and data-mining of literature bioactivity data for drug discovery". Biochemical Society
Transactions 39: 1365–1370. doi:10.1042/BST0391365.
KNIME: a short introduction
KNIME (Konstanz Information Miner) is an open source graphical workflow tool for
manipulating and analyzing data. Many tools have been provided to allow KNIME
to handle chemical structures.
Lots of help is available on KNIME. If you are new to it, there are some great videos
on YouTube. You can get to them from within KNIME: Help -> Welcome Page, click
on Learning Hub, then KNIME TV & follow the links from there, or click on the link.
1. To start using the workflow described in this guide, install KNIME (see
http://www.knime.org/downloads/overview & choose the appropriate installer
for your system – for Windows we recommend KNIME + all free extensions).
2. Open KNIME (double click Knime.exe) & set up a default KNIME workspace. You
are then ready to start importing the workflow described in this guide (slide 6).
Note: it is recommended that you change the default workspace location so that it is not
inside the folder of a specific version of KNIME; your workflows will then remain available
to you if you update KNIME.
DataWarrior: a short introduction
• “Combines dynamic graphical views,
interactive filtering with chemical
intelligence”
• Upgraded January 2015 and plans for
continued support in place
• Linux, Mac and Windows installers available
here
(http://www.openmolecules.org/datawarrior/download.html)
• Some on-line documentation provided
DataWarrior can read data and structures from either
1. Excel-compatible files (.csv) with the structures encoded as SMILES strings
2. Standard SDFs
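As an illustration of the first input option, the sketch below writes a small table with a SMILES column in CSV form using only the Python standard library. The column names, compound identifiers and values are hypothetical and are not taken from the workflow output.

```python
import csv
import io

# Hypothetical records: column names, IDs and pAct values are illustrative only.
rows = [
    {"compound_id": "example-1", "smiles": "c1ccccc1C(=O)O", "pAct": 6.2},
    {"compound_id": "example-2", "smiles": "CCO", "pAct": 4.8},
]


def to_csv_text(records):
    """Serialise records to CSV text with one structure (as SMILES) per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["compound_id", "smiles", "pAct"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

To produce a file rather than a string, the same writer can be pointed at a file opened with open(path, "w", newline="").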
Detailed instructions: opening workflow & inputting structures
• Having downloaded and installed KNIME (see earlier slide on KNIME), download & install the
workflow ‘KnowYourMolecule.zip’ (download the file from this link).
• File/Import KNIME workflow
• click radio button ‘Select Archive File’, browse to KnowYourMolecule.zip
• Under Target, browse to a destination (eg Local workspace)
• Click through Next/Finish, you should see the workflow in the KNIME explorer
window
• Double click on the workflow & it will open up in the main window as below
• Double click on the MarvinSketch node. As downloaded,
this node should contain the structure of
CHEMBL587310. The following descriptions assume you
have used this structure as the input molecule; it is
advisable to run through the workflow once with
CHEMBL587310 as the input before using your own
input molecule.
• Notes:
• You can paste a valid SMILES string into MarvinSketch &
it will convert it to the structure.
• You can use multiple structures; a similarity search will
be carried out on each. This could be useful for two
somewhat different members of a series, for example.
To avoid confusion in later nodes it’s probably not
advisable to use multiple diverse structures.
Detailed instructions: running through the workflow & selecting
compounds for biology data annotation
1. With a structure present, execute the MarvinSketch node (to execute a node, right click on it & hit
‘execute’; other useful options are ‘configure’ and viewing the output, usually at the bottom.
If the node executes correctly it turns green).
2. Configure the structure search node (right click & configure, or double click). Select the compound
search tab, which gives you ‘search type’ options for substructure or similarity search. For this
example we have used a similarity search with a default similarity value of 75. Then execute the
node. This may take half a minute or so as a live call is made to the ChEMBL database.
Note: your structure is being sent out of your firewall to perform a structure search on the
ChEMBL servers. If you need to carry out this search inside your firewall you will need to
reconfigure this workflow to use downloaded versions of the ChEMBL database.
3. Execute the nodes ‘Morgan sim’ to carry out a similarity calculation between the input structures &
structures found in ChEMBL (which means you will have a similarity value even when you opt for a
substructure search).
4. Execute the RDKit Interactive table. Then right click on this node & select ‘view RDKit interactive
table view’. This will give you a table in which rows can be sorted and selected. Expand the table to
full screen, scroll around & have a look at the structures returned to check they are relevant to
your example.
4.1 You may find a similarity value below which the structures are too dissimilar to your query
molecule to be relevant. If so you can set a cutoff value in the ‘Row Filter’ node. By default
the workflow uses a cutoff of 0.3.
4.2 Alternatively you may decide that you want to select the compounds manually. You can do
this through the interactive table as described in the next slide.
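The similarity calculation in step 3 and the cutoff described under step 4 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the guide does not say which metric the ‘Morgan sim’ node uses, so the Tanimoto coefficient is assumed here, and the fingerprints are made-up sets of "on" bits rather than real Morgan fingerprints (which would normally come from a toolkit such as RDKit). All names are hypothetical.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def filter_by_similarity(query_bits, hits, cutoff=0.3):
    """Keep hits whose similarity to the query is at least `cutoff`, best first."""
    kept = []
    for name, bits in hits.items():
        score = tanimoto(query_bits, bits)
        if score >= cutoff:
            kept.append((name, score))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up fingerprints for illustration only.
query = {1, 5, 9, 12}
hits = {
    "hit_A": {1, 5, 9, 12},     # identical to the query
    "hit_B": {1, 5, 20, 31},    # shares two bits with the query
    "hit_C": {40, 41, 42, 43},  # shares nothing with the query
}

kept = filter_by_similarity(query, hits)
for name, score in kept:
    print(f"{name}: {score:.2f}")
```

Raising the cutoff trades recall for relevance: fewer, more closely related analogues are passed on for biology annotation.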
Detailed instructions: option – manually select structures
1. With the interactive structure viewer open, you can choose which compounds will be annotated with
their known biology. Follow these steps to select a set for inclusion or exclusion.
1. Right click & select View: RDKit interactive
table
2. Do Hilite -> Clear hilite to make sure there are
no previous hilite settings
3. CTRL click to select a set of compounds for
inclusion or exclusion
4. Do Hilite -> Hilite selected
5. Close the view & execute the Hilite Row Filter
node. Check (right click & check the output)
that the rows you wanted have been
separated from those you don’t. If not, reset
the Hilite Row Filter (right click, reset, then
re-execute)
6. Connect the upper (to annotate the selected
compounds with biological data) or the lower
(to annotate unselected compounds) output
from the Hilite Row Filter to the ChEMBL
biology data lookup.
Note: to connect the node, left click on the
triangle on the right of the node & drag to the
triangle on the left of the ChEMBL biology data
lookup node
Detailed instructions: producing output files for analysis
1. Whether by the default similarity value filter (slide 7) or by manually selecting compounds for annotation
(slide 8), you should now have a set of compounds being sent to the ‘ChEMBL biology data lookup’ node.
Execute this node, which will add biology data (often multiple rows) for each compound on the input
stream.
2. Take a look at the output (right click & select the bottom item ‘Connected to: Joined Table’); this is the
raw data. If there isn’t too much data, you may be able to make the decisions you need with this data
alone. If you want to write it out as is, create a new CSV writer node as below & write the data out.
However, the workflow makes the assumption that you might have quite a lot of data & that it would be
helpful to filter it & split it into several output files suitable for viewing in DataWarrior.
3. To write out the data for each set of filtered output, double click on (= configure) each of the CSV writers
on the right-hand side, browse to a suitable folder, then enter a suitable file name for each. Four
output files have been provided because of the number of different data types stored in ChEMBL and also
how they are most appropriately handled by DataWarrior.
4. When all 4 CSV writers have a different filename/location set, you can press the double white arrow to
complete execution of the workflow. Depending on the nature of the data retrieved, it is possible that
some of the four files may be empty. If you want to follow through the logic of how the data is split,
double click on the metanodes (ie those containing other nodes) & check the configuration of the nodes
within them.
Note: you can of course modify the behavior of the workflow by editing nodes, their settings & data
flow. It is recommended that you save a version of the workflow before you do this!
Detailed instructions: viewing output in DataWarrior
1. Launch DataWarrior (see slide 5 for installation instructions). You will also need 4 DataWarrior files you can use as
templates, download them in this zip file (then unzip) to any handy location.
2. File -> Open and browse to a file, the first shown below using the test molecule CHEMBL587310 is
KYM_pAct_split_by_target_name.csv (produced by the 2nd down of the CSV writers). A matching template has
been created for each of the four files. File -> Open Special -> Apply Template then browse to
‘KYM_pAct_split_by_target_name.dwar’. Technically this file is a DataWarrior file – created by Save, and it contains
the test data from CHEMBL587310, but conveniently, you can use .dwar files as templates too.
3. This template has been set up to show scatter plots of pAct (from IC50, Ki, EC50 or XC50 in ChEMBL) on the Y axis
vs similarity to the (most similar) query compound on the X axis. In the example below you can see that the parent
compound (CHEMBL587310) has been retrieved from within ChEMBL and of course it has a similarity of 1 to itself!
Other similar compounds to that parent also have activity against the target Plasmodium falciparum. The scatter
plots have been split by the category ‘target_name’; to split by categories, right click and select ‘split view by categories’.
To avoid having too many scatter plots, only targets with at least 4 active (sub 1uM) compounds with data
are included in this .csv file. The other active compounds are in the .csv file called ‘KYM_pAct_no_split.csv’
(see next slide).
4. Interestingly you can also see that some similar compounds have activity against some other targets. For example,
some compounds with high potency against the Adenosine A3 receptor have been found. However, they are at the
lower end of the similarity scale and on inspection of the structures you might consider that there is little evidence
that the series represented by CHEMBL587310 would be likely to show A3 activity. In truth, it is likely that you
would have removed these compounds by the filtering steps described in slide 7 before getting this far, but we
have included them so that there is plenty of data so we can demonstrate some different aspects of analysis of the
output.
5. In this example, the data points have been coloured by compound id (right click on a scatter plot & select Set
Marker Colour…), which gives a sense that the compounds active against A3 are the same set with activity against
A1, and similarly the compounds with cannabinoid receptor activity show CB1 and CB3 activity. DataWarrior offers
nice additional features for SAR analysis (use Help->Help to show the options). One option here is to use clustering
(Chemistry -> Cluster Compounds…; for the example, accept the defaults). Having created the clusters, right click in
the scatter plot and select Set Marker Colour…, then colour by Cluster No and check the Color by categories box.
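The pAct values on the Y axis and the ‘sub 1uM’ activity threshold mentioned in step 3 can be related with a short calculation. The sketch below assumes the usual convention pAct = -log10(potency in mol/L), so that 1uM corresponds to a pAct of 6. The exact conversion used inside the workflow is not shown in this guide, and the function and table names here are illustrative.

```python
import math

# Power of ten that converts each unit to mol/L.
UNIT_EXPONENT = {"M": 0, "mM": -3, "uM": -6, "nM": -9}


def p_act(value: float, unit: str = "nM") -> float:
    """Return -log10 of a potency value (IC50, Ki, EC50, ...) expressed in mol/L."""
    if value <= 0:
        raise ValueError("potency must be positive")
    return -(math.log10(value) + UNIT_EXPONENT[unit])


def is_active(value: float, unit: str = "nM") -> bool:
    """'Sub 1uM' in this guide corresponds to pAct greater than 6 under this convention."""
    return p_act(value, unit) > 6.0


for value, unit in [(10, "nM"), (250, "nM"), (5, "uM")]:
    print(f"{value} {unit}: pAct = {p_act(value, unit):.2f}, active = {is_active(value, unit)}")
```

Working on the log scale is what lets compounds spanning several orders of magnitude in potency share one readable Y axis.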
Detailed instructions: viewing the other output options
• The view below is created using the template KYM_pAct_no_split.dwar and shows data for targets with only 1 to 3
active (sub 1uM) compounds, coloured by target_name
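The division between the ‘split by target’ file (targets with at least 4 active compounds) and the ‘no split’ file (targets with 1 to 3 active compounds) can be sketched as follows. This is only an approximation of the logic described in this guide: the real workflow does this inside KNIME metanodes, and details such as whether inactive rows for a target are carried along are assumptions here. The field names and records are hypothetical.

```python
from collections import defaultdict


def split_by_active_count(records, min_actives=4, active_pact=6.0):
    """Partition records by how many distinct active compounds their target has."""
    actives = defaultdict(set)
    for rec in records:
        if rec["pAct"] > active_pact:
            actives[rec["target_name"]].add(rec["compound_id"])

    split_rows, no_split_rows = [], []
    for rec in records:
        n_active = len(actives.get(rec["target_name"], set()))
        if n_active >= min_actives:
            split_rows.append(rec)      # enough actives for its own scatter plot
        elif n_active >= 1:
            no_split_rows.append(rec)   # 1 to 3 actives: pooled in one view
        # Targets with no actives are dropped in this sketch.
    return split_rows, no_split_rows


# Hypothetical records for illustration only.
records = (
    [{"target_name": "Target A", "compound_id": f"cpd{i}", "pAct": 7.0} for i in range(4)]
    + [{"target_name": "Target A", "compound_id": "cpd9", "pAct": 5.0}]
    + [{"target_name": "Target B", "compound_id": f"cpd{i}", "pAct": 6.5} for i in range(2)]
    + [{"target_name": "Target C", "compound_id": "cpd0", "pAct": 4.0}]
)

split_rows, no_split_rows = split_by_active_count(records)
print(len(split_rows), len(no_split_rows))  # prints "5 2"
```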
1. The view below is created using the template KYM_split_by_bioactivity_type.dwar. The range of biological
endpoints is generally larger in this output file and more work may be needed to understand what’s there. For
the scatter plots the Y axis is now ‘value’ which might be a dose-response value (usually low is potent) or a
percent inhibition (usually high = potent), and other endpoint types too. To understand what’s here it’s a good
idea to mouse over some of the more similar compounds & examine the data:
1. The data panel here changes on mouse over, but it’s hard to see the assay_description in full which is often
the most useful information about the data value.
2. Another way to see that is to select some points (click & drag over some points on one or more scatter plots),
then shift click on assay_description in the data table at the top. The selected rows will come first & this
column can be word-wrapped.
• The final template KYM_pAct_weak_or_inact.dwar contains data points with IC50, Ki, EC50 or XC50 values which are
qualified with ‘>’ or fall in the range 1uM to 10mM
