
Core genomes and signature genes that define Streptococcus pyogenes


The signature genes tool developed for the SEED and implemented at the National Microbial
Pathogen Data Resource (NMPDR), www.nmpdr.org, was used to compare the translated genomes of all
completely sequenced strains of Streptococcus pyogenes (group A streptococcus or GAS) to define a core
genome for this human pathogen. The tool allows the user to select a reference genome to compare with
any number of genomes selected in a comparison set. The commonality factor is set to 80% by default but
may be reset by the user. For example, the 80% common core of GAS with respect to the strain with the
largest genome (MGAS 10750) contains 1,472 proteins that have bidirectional, best BlastP hits (BBH), at
an E-value of 1 x 10^-10 or less, in 10 of the 12 available genomes. Increasing the stringency of the analysis
to 100% reduces the number of core proteins to 1,359. In addition to determining the proteins in common
to a set of genomes, we used the signature genes tool to define a signature set of proteins that
distinguishes the strains having the same M-type, e.g. M1 and M12. The information generated by this
genome comparison could be used to design a microarray for the simultaneous analysis of the core GAS
genome as well as signatures for each sequenced strain or M-type. The bioinformatics analysis reveals
interesting consistencies and inconsistencies which generate hypotheses for testing on microarrays.
Because protein functions in NMPDR are organized in subsystems, it is possible to infer functional
differences imparted by gene signatures. Subsystems annotation is used for metabolic reconstruction,
analysis of central machinery and signaling pathways, finding missing genes, integrating regulatory
networks, detection of horizontally transferred genes, and prediction of the functions of hypothetical
proteins. These genome annotation and comparison tools provide unprecedented information about GAS
biology and pathogenesis, and can eventually be applied to all sequenced bacterial pathogens.

Signature Genes Tool: compare and contrast


Compare: find proteins in common to a set of genomes:
Select a reference, or given, genome
Select one or more genomes to compare it with in the inclusion set 1
Select a commonality factor, set to 80% by default
Tool returns table of proteins shared by all or most of genomes in set 1,
with score indicative of the proportion of set 1 that contains each protein

Results include access to comparative analysis environment for every
protein found, including subsystems if desired
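
The compare steps above can be sketched in a few lines of Python. This is only an illustration, not the NMPDR implementation: the function name `core_proteins`, the `bbh_hits` input layout, and the use of the fraction of set 1 as the score are all assumptions made for the example.

```python
import math

def core_proteins(bbh_hits, comparison_genomes, commonality=0.80):
    """Return {protein: score} for reference proteins found in enough genomes.

    bbh_hits (hypothetical layout): maps each reference protein id to the set
    of genomes in which it has a bidirectional best hit.
    The score is the fraction of the comparison set containing the protein.
    """
    genomes = set(comparison_genomes)
    n = len(genomes)
    # Round before ceil so floating-point noise (e.g. 0.7 * 10) cannot
    # push the required count up by one.
    required = math.ceil(round(commonality * n, 9))
    core = {}
    for protein, found_in in bbh_hits.items():
        present = len(found_in & genomes)
        if present >= required:
            core[protein] = present / n
    return core

# With 12 genomes, an 80% commonality factor requires hits in 10 of them.
genomes = [f"g{i}" for i in range(12)]
hits = {
    "p1": set(genomes),       # present everywhere
    "p2": set(genomes[:10]),  # present in 10 of 12
    "p3": set(genomes[:9]),   # present in 9 of 12
}
core80 = core_proteins(hits, genomes)
core100 = core_proteins(hits, genomes, commonality=1.0)
```

Raising `commonality` to 1.0 keeps only proteins found in every genome, which mirrors the drop from the 80% core to the 100% core reported here.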

Streptococcus pyogenes core genomes:

100% S. pyogenes core: 1,359 proteins
80% S. pyogenes core: 1,472 proteins


M1 core: strains SF370 and MGAS 5005, 1,580 proteins
M3 core: strains MGAS 315 and SSI-1, 1,820 proteins
M12 core: strains MGAS 2096 and MGAS 9429, 1,668 proteins
Invasive core: strains MGAS 5005, MGAS 315 and SSI-1, 1,589 proteins
Virulence related core: invasive strains filtered with keyword virul*, 78 proteins

Contrast: find proteins that distinguish a set of genomes:


Select a reference, or given, genome
Select one or more genomes to compare it with in the inclusion set 1
Select one or more genomes to contrast with in the exclusion set 2
Tool finds genes shared by all or most of the genomes in set 1 but not
present in all or most of the genomes in set 2, with a score indicative of the
match with the search parameters
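
A minimal sketch of the contrast step, under stated assumptions: `signature_proteins`, its `bbh_hits` input, and the scoring rule (fraction of genomes that behave as requested, so 1.0 is a perfect score) are hypothetical and chosen only to illustrate the inclusion/exclusion logic, not to reproduce the tool's actual scoring.

```python
import math

def signature_proteins(bbh_hits, inclusion, exclusion, commonality=1.0):
    """Return {protein: score} for proteins in set 1 but absent from set 2.

    bbh_hits (hypothetical layout): maps each reference protein id to the set
    of genomes in which it has a bidirectional best hit.
    """
    inclusion, exclusion = set(inclusion), set(exclusion)
    # Minimum number of set 1 genomes that must contain the protein.
    min_in = math.ceil(round(commonality * len(inclusion), 9))
    # Maximum number of set 2 genomes that may contain the protein.
    max_out = len(exclusion) - math.ceil(round(commonality * len(exclusion), 9))
    total = len(inclusion) + len(exclusion)
    result = {}
    for protein, found_in in bbh_hits.items():
        n_in = len(found_in & inclusion)
        n_out = len(found_in & exclusion)
        if n_in >= min_in and n_out <= max_out:
            # Genomes matching the request: present in set 1, absent in set 2.
            result[protein] = (n_in + len(exclusion) - n_out) / total
    return result

# Toy data; protein ids "a", "b", "c" are invented for the example.
hits = {
    "a": {"SF370", "MGAS 5005"},              # M1 strains only
    "b": {"SF370", "MGAS 5005", "MGAS 315"},  # also in an M3 strain
    "c": {"SF370"},                           # missing from one M1 strain
}
m1_signature = signature_proteins(
    hits, inclusion=["SF370", "MGAS 5005"], exclusion=["MGAS 315", "SSI-1"]
)
```

Lowering `commonality` below 1.0 relaxes both conditions at once, which corresponds to the "all or most" wording of the tool description.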

Streptococcus pyogenes signatures:


M1 signature: strains SF370 and MGAS 5005 only, 11 proteins perfect score
M3 signature: strains MGAS 315 and SSI-1 only, 44 proteins perfect score
M12 signature: strains MGAS 2096 and MGAS 9429 only, 23 proteins perfect score

Use of core genomes:


Define genes to be spotted on microarrays
Define minimal set of roles for mathematical modelling of metabolism

Use of protein signatures:


Discover mechanism for phenotype expression
Possible target for diagnostics or therapeutics

Pathogen-specific gateways to data

Search box restricted to organisms of interest


User forums with inquiry labs for teaching and exploration
Pathogen information describes genotypes, phenotypes,
serotypes, taxonomy, physiology, epidemiology
Literature aggregator and links to open access journals
Pathogens in the news
Resource links including strain collections
Virtual proteome

Annotation status tables:

Immediate access to genes whose functions are
known with some degree of certainty
Named genes in subsystems
Named genes not in subsystems
Hypothetical genes in subsystems
Gateway to genes about which nothing is known
Hypothetical genes not in subsystems
List of genes with links to NMPDR analysis tools
Exploration in comparative framework first step to
formulating working hypotheses about functions
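
The four annotation status categories listed above amount to a two-by-two classification. The sketch below shows one way to express it; the function name and the rule that an empty or "hypothetical" function string marks a hypothetical gene are assumptions for illustration only.

```python
def annotation_status(function, in_subsystem):
    """Place a gene in one of the four annotation status categories.

    function: assigned function text (assumed: empty or containing
              "hypothetical" means the function is unknown).
    in_subsystem: True if the gene is attached to at least one subsystem.
    """
    hypothetical = (not function) or "hypothetical" in function.lower()
    name = "Hypothetical" if hypothetical else "Named"
    where = "in subsystems" if in_subsystem else "not in subsystems"
    return f"{name} genes {where}"

status_slo = annotation_status("Streptolysin O", True)
status_unknown = annotation_status("hypothetical protein", False)
```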

Subsystems approach to genome annotation


Subsystems annotation provides researchers with corrected
functional annotations in a structured biological context
Consistency across genomes achieved by vertical annotation of
functions rather than horizontal focus on single genomes
More than 500 distinct subsystems have been developed
Metabolic pathways
Complex structures
Genotype-phenotype associations
Subsystems integrate genomic and functional contexts of genes in
metabolic reconstructions or populated subsystem spreadsheets
Metabolic reconstructions summarize all subsystems in a given
genome
Populated subsystems compare all genomes in a given
subsystem

Example: Streptococcal virulome


Open SS from subsystem search tree, or
from a protein page, e.g. SLO

Clustered genes within genome share color


Closely related roles merged in single column
marked with *
Mouse over column headers for full names of
functional roles
Defined in a few genomes and extended to others
via homology

Exploration of physical, genomic context


Protein context graphic and table
Focus protein highlighted green
Genes within about 6 kbp upstream and downstream shown
Genes with conserved proximity shown as blue arrows with functional
coupling scores, fc-sc

Pins
Color-matched orthologs allow comparative analysis of functional
clustering and chromosomal rearrangements
Redraw the display to show genomes selected from commentary table

Compare regions
Subset of PINS opens with the 5 highest-scoring genomes
Size and number of compared regions may be reset by user

CL
Finds clusters containing the focus protein in other genomes
Useful for genes without functional coupling scores, fc-sc

fc-sc
Measures conservation of gene proximity and phylogenetic distance
Returns table listing pairs of proximal orthologs

Protein context
Focus gene is green
Proximity of blue genes
conserved in at least
four other species (not
strains)
Click to show functional
coupling scores, fc-sc
Click on score to see
table of paired homologs
in other genomes

Orthologous BBH
Bidirectional Best Hits
precomputed, reciprocal
BlastP results
Compare functional
annotations
Select and align
Click any gene id to
refocus display
Link out via alias ids
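
Bidirectional best hits can be derived from two sets of BlastP results, one per search direction. The sketch below assumes each result row has already been reduced to a `(query, subject, evalue, bitscore)` tuple and that the best hit is the one with the highest bit score; both choices, and the function names, are assumptions for the example. The E-value cutoff of 1e-10 is the one quoted in the abstract.

```python
def best_hits(rows, max_evalue=1e-10):
    """Map each query to its highest-scoring subject under the E-value cutoff."""
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue  # too weak to count as a hit
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Toy rows: (query, subject, evalue, bitscore); ids are invented.
a_vs_b = [
    ("a1", "b1", 1e-50, 200.0),
    ("a1", "b2", 1e-20, 90.0),
    ("a2", "b2", 1e-30, 120.0),
    ("a3", "b3", 1e-5, 40.0),   # fails the E-value cutoff
]
b_vs_a = [
    ("b1", "a1", 1e-48, 198.0),
    ("b2", "a1", 1e-20, 90.0),
    ("b2", "a2", 1e-30, 120.0),
    ("b3", "a3", 1e-5, 40.0),
]
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Requiring agreement in both directions is what separates a BBH from a one-way best hit: `a1` also hits `b2`, but `b2` prefers `a2`, so only the reciprocal pairs are kept.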

Pins locates SLS precursor in strain M5


Sag operon performs
biosynthesis (genes 1,2,3,8)
and transport (5-7) of the
bacteriocin-like toxin
streptolysin S (SLS)
Tiny sagA orf was missed in
the first annotation of the M5
genome; known to secrete
streptolysin S
Comparison of streptolysin S
biosynthesis protein D
(SagD, shown as gene 1)
among strep genomes points
to location of sagA in M5
Confirmed with BLASTX
Clostridium and Listeria don't
express the toxin; what is the
purpose of the biosynthesis
and transport proteins?

Streptolysin S subsystem locates orthologous
components in Listeria and Clostridium botulinum
Hypotheses generated by vertical annotation of functions rather
than horizontal focus on single genomes

Variant codes
1.1111 All elements of the Sag operon as in S. pyogenes are present, i.e.,
SagA/SagB,C,D,E/SagF/SagG,H,I (ABC transporter)
2.0111 All elements but SagA (missed gene call; CDS exists)
3.0102 Missing SagA and SagF homologs but having an ABC transporter for
which closest homologs are teichoic acid or Bacillus multidrug export cassettes
4.0100 Only SagB,C,D,E: biosynthetic pathway with no precursor or transport

Exploration of functional, biological context


Populated Subsystem Spreadsheet

Columns represent functional roles, mouse over header for definition


Genomes (rows) shown may be expanded and sorted by several criteria
Cells populated with specific, annotated genes linked to context pages
Functional variants defined by the annotated roles
Variant code -1 indicates subsystem is not functional
Codes defined in annotator's notes at bottom of page
Diagram of subsystem often provided

Protein families
FigFams taken from single column of functional roles
Pfam, TigrFams, COG, and others presented for comparison
Essential genes on genomic scale
Genome-scale essentiality data from studies of 10 species
Click on the red shaded portions of the bars to explore the genes
Candidate drug or therapeutic targets
Experimental verification of essential or virulent function

Close orthologs of proteins with experimentally determined structure
