
NCBI Field Guide

NCBI Molecular Biology


Resources

NCBI Databases

January 2008
The National Center for Biotechnology Information

Bethesda, MD

Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Web Access: www.ncbi.nlm.nih.gov

NCBI Databases and Services

• GenBank: largest sequence database
• Free public access to biomedical literature
– PubMed: free MEDLINE
– PubMed Central: full-text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest-volume sequence search service
• VAST: structure similarity searches
• Software and Databases

Types of Databases

• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

Entrez Nucleotides

Primary
• GenBank / EMBL / DDBJ       113,617,111
Derivative
• RefSeq                        2,605,104
• Third Party Annotation            5,822
• PDB                               8,295

Total                         116,236,515

What is GenBank?
NCBI’s Primary Sequence Database

• Nucleotide only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration

[Figure: submissions and updates flow among the three collaborating databases: GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry).]

GenBank: NCBI’s Primary Sequence Database

Release 163, December 2007

Non-WGS
  80,388,382 records
  83,874,179,730 bases
Whole Genome Shotgun
  26,177,471 records
  106,505,691,578 bases
Total
  106,565,853 records
  190,379,871,308 bases

• full release every two months
• incremental updates daily
• available only via ftp: ftp.ncbi.nih.gov/genbank/

The Growth of GenBank

[Figure: growth of GenBank through December 2007.]

WGS: 106.5 billion bases
Non-WGS: 83.9 billion bases
Doubling time: 12-14 months
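
A doubling time implies exponential growth. As a rough illustration only (not an NCBI forecast), the sketch below projects a database size forward under the assumption that the doubling time stays constant. The function name `project_bases` is hypothetical, and applying the slide's 12-14 month range to the combined WGS plus non-WGS total is also an assumption.

```python
def project_bases(current_bases: float, months_ahead: float, doubling_months: float) -> float:
    """Size after `months_ahead` months, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Total bases in Release 163 (December 2007), taken from the slide.
TOTAL_DEC_2007 = 190_379_871_308

if __name__ == "__main__":
    for doubling in (12, 14):  # the range quoted on the slide
        projected = project_bases(TOTAL_DEC_2007, 24, doubling)
        print(f"doubling every {doubling} months: "
              f"{projected / 1e9:.0f} billion bases after 24 months")
```

The spread between the two results shows how sensitive such a projection is to the assumed doubling time.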


Organization of GenBank: Traditional Divisions

Records are divided into 18 Divisions.
• 12 Traditional
• 6 Bulk

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian
PHG  Phage
SYN  Synthetic (cloning vectors)
ENV  Environmental Samples
UNA  Unannotated

Entrez query: gbdiv_xxx[Properties]


Organization of GenBank: Bulk Divisions

Records are divided into 18 Divisions.
• 12 Traditional
• 6 Bulk

Bulk Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized

EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA
PAT  Patent

Entrez query: gbdiv_xxx[Properties]
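
The `gbdiv_xxx[Properties]` pattern can be filled in with any of the 18 division codes. The sketch below builds such a term and, as one possible use, a URL for the NCBI E-utilities ESearch service. Nothing is fetched. The helper names `division_term` and `esearch_url` are hypothetical; lowercasing the division code and the database name `nucleotide` are assumptions, not something the slides state.

```python
from urllib.parse import urlencode

# Division codes as listed on the two "Organization of GenBank" slides.
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "ENV", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC", "PAT"}


def division_term(code: str) -> str:
    """Return an Entrez term restricting a search to one GenBank division."""
    code = code.upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


def esearch_url(term: str, db: str = "nucleotide") -> str:
    """Build (but do not fetch) an E-utilities ESearch URL for the term."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': term})}"


if __name__ == "__main__":
    term = division_term("PLN")
    print(term)
    print(esearch_url(term))
```

Terms can be combined with other Entrez fields using AND, for example an organism name plus a division restriction.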


A Traditional GenBank Record: The Flatfile Format

Sections: Header (LOCUS through REFERENCE), Feature Table (FEATURES), Sequence (ORIGIN).

LOCUS       AF124527                2540 bp    mRNA    linear   PLN 29-JAN-2004
DEFINITION  Prunus persica ethylene receptor (ETR1) mRNA, complete cds.
ACCESSION   AF124527
VERSION     AF124527.1  GI:6841074
KEYWORDS    .
SOURCE      Prunus persica (peach)
  ORGANISM  Prunus persica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids I; Rosales; Rosaceae; Amygdaloideae; Prunus.
REFERENCE   1  (bases 1 to 2540)
  AUTHORS   Bassett,C.L., Artlip,T.S. and Callahan,A.M.
  TITLE     Characterization of the peach homologue of the ethylene receptor,
            PpETR1, reveals some unusual features regarding transcript
            processing
  JOURNAL   Planta 215 (4), 679-688 (2002)
   PUBMED   12172852
REFERENCE   2  (bases 1 to 2540)
  AUTHORS   Bassett,C.B., Artlip,T.S. and Nickerson,M.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1999) Appalachian Fruit Research Station,
            USDA-ARS, 45 Wiltshire Road, Kearneysville, WV 25430, USA
FEATURES             Location/Qualifiers
     source          1..2540
                     /organism="Prunus persica"
                     /mol_type="mRNA"
                     /cultivar="Loring"
                     /db_xref="taxon:3760"
                     /dev_stage="III B/C fruit"
     gene            1..2540
                     /gene="ETR1"
     CDS             269..2485
                     /gene="ETR1"
                     /codon_start=1
                     /product="ethylene receptor"
                     /protein_id="AAF28893.1"
                     /db_xref="GI:6841075"
                     /translation="MEACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVK
                     KSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSMHSRTVAIVMTTAKVLTAVVSCATA
                     LMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRH
                     TILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVF
                     SSNRALKISPNSPVARMRPLAGKHMPGEVVAVRVPLLHLSNFQINDWPELSTKRYALM
                     VLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLAR
                     REAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTPEQRLMVETILKSSHLL
                     ATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNLIKPVASVKKLSVSLNLAADLP
                     VQAVGDEKRLMQIVLNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPEFFPAQSDNH
                     FYLRVQVKDSGSGINPQDIPKLFTKFAQTQSLATRNSGGSGLGLAICKRFVNLMEGHI
                     WIESEGPGKGCTAIFIVKLGFAERSNESKLPFLTKVQANHVQTNFPGLKVLVMDDNGS
                     VTKGLLVHLGCDVTTVSSIDEFLHVISQEHKVVFMDVCMPGIDGYELAVRIHEKFTKR
                     HERPVLVALTGNIDKMTKENCMRVGMDGVILKPVSVDKMRSVLSELLEHRVLFEAM"
ORIGIN
        1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
       61 gaagaagaag cttctttgat gtgttggggt gccaatctaa agaggaagaa gaaggcctct
      121 aatgtattga ggtcggctgt ctgggctgcc gatctgtgtt gaatggatag tttggtagag
      181 atgcttcaac gacatagggt ggctgaaaag ggtttgaaga aagtgaagga ggaaaccaag
...
     2401 tatactgaaa cctgtctcag ttgataaaat gaggagtgtt ttatcagaac tgttggagca
     2461 tcgagtttta tttgaggcta tgtaagatat aggaaaattg ttctagtgaa ggaaagattt
     2521 aaatggaaaa aaaaaaaaaa
//
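
To make the three parts of the flatfile concrete, here is a minimal parsing sketch. It is a simplified illustration, not a complete GenBank parser: it reads only single-line header fields, the first simple CDS location, and the sequence. The function name `parse_flatfile` and the shortened `SAMPLE` text (an excerpt of the AF124527 record) are assumptions made for the example.

```python
import re

# Shortened excerpt of the record shown on the slide (first sequence line only).
SAMPLE = """\
LOCUS       AF124527                2540 bp    mRNA    linear   PLN 29-JAN-2004
DEFINITION  Prunus persica ethylene receptor (ETR1) mRNA, complete cds.
ACCESSION   AF124527
VERSION     AF124527.1  GI:6841074
FEATURES             Location/Qualifiers
     CDS             269..2485
                     /gene="ETR1"
                     /product="ethylene receptor"
ORIGIN
        1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
//
"""


def parse_flatfile(text: str) -> dict:
    """Split a flatfile into header fields, a CDS location, and the sequence."""
    record = {"header": {}, "cds": None, "sequence": ""}
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
        elif section == "header" and line[:1].isalpha():
            # Keyword lines start in column 1; continuation lines are skipped.
            key, _, value = line.partition(" ")
            record["header"][key] = value.strip()
        elif section == "features":
            match = re.match(r"\s+CDS\s+(\d+)\.\.(\d+)", line)
            if match and record["cds"] is None:
                record["cds"] = (int(match.group(1)), int(match.group(2)))
        elif section == "sequence":
            # Drop the position numbers and spacing, keep the bases.
            record["sequence"] += re.sub(r"[\d\s]", "", line)
    return record


if __name__ == "__main__":
    rec = parse_flatfile(SAMPLE)
    start, end = rec["cds"]
    print(rec["header"]["ACCESSION"], rec["header"]["VERSION"])
    print("CDS length:", end - start + 1)
```

For real work, an established parser such as Biopython's `Bio.SeqIO.parse(handle, "genbank")` handles multi-line fields, joined locations, and qualifiers that this sketch ignores.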
Traditional GenBank Record

ACCESSION U07418
VERSION U07418.1 GI:466461

Accession
• Stable
• Reportable
• Universal
Version
• Tracks changes in sequence
GI number
• NCBI internal use

• well annotated
• the sequence is the data
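
The three identifiers on the VERSION line can be separated mechanically. The following sketch, with hypothetical names `SeqId` and `parse_version_line`, splits the line into the stable accession, the version number that tracks sequence changes, and the GI number. It assumes the `ACCESSION.VERSION  GI:number` layout shown on the slide.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str      # stable identifier, e.g. U07418
    version: int        # changes when the sequence changes
    gi: Optional[int]   # GI number, if present on the line


def parse_version_line(line: str) -> SeqId:
    """Parse a line such as 'VERSION U07418.1 GI:466461'."""
    fields = line.split()
    if fields and fields[0] == "VERSION":
        fields = fields[1:]
    accession, _, version = fields[0].partition(".")
    gi = None
    for field in fields[1:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return SeqId(accession, int(version), gi)


if __name__ == "__main__":
    print(parse_version_line("VERSION U07418.1 GI:466461"))
```

Comparing two parsed identifiers by accession alone says whether they are the same record; comparing accession and version together says whether they carry the same sequence.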


Bulk Divisions

• Batch submission and HTG (email and ftp)
• Inaccurate
• Poorly characterized

• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
GenBank Bulk Sequence: EST

[Figure: example EST record, labeled "poorly characterized".]

Expressed Sequence Tags in Entrez

Total 49 million records

Human 8.1 million
Mouse 4.9 million
Arabidopsis 1.5 million
Cow 1.5 million
Pig 1.5 million
Zebrafish 1.4 million
Xenopus tropicalis 1.3 million
Rice 1.2 million
Maize 1.2 million
Wheat 1.0 million
Rat 0.9 million
Ciona intestinalis 0.7 million
Whole Genome Shotgun Projects

ftp.ncbi.nih.gov/genbank/wgs/

• >600 Projects
• >600 Taxa
– 423 bacteria
– 186 eukaryotes
• 62 fungi
• 87 animals
• 5 flowering plants
Mammalian WGS

• Duck-billed platypus
• Nine-banded armadillo
• Northern tree shrew
• Domestic rabbit
• Pika
• Guinea pig
• Mouse
• Rat
• Thirteen-lined ground squirrel
• Small-eared galago
• Mouse lemur
• Orangutan
• Human
• Chimpanzee
• Gorilla
• Rhesus macaque
• Tenrec
• African elephant
• Dog
• Cat
• Horse
• European hedgehog
• Eurasian shrew
• Little brown bat
• Cow
• Gray short-tailed opossum

Plant WGS

Derivative Databases

Entrez Protein: Derivative Database

Data Source                  Sequences
GenPept                     14,068,403
RefSeq                       4,456,950
Third Party Annotation           5,751
Swiss-Prot                     296,496
PIR                             21,713
PRF                             12,079
PDB                            110,035
(PAT Division                  920,869)

Total                       18,971,426

BLAST nr total               5,879,848
(no patents or env_nr; now 6 million)

GenPept: GenBank CDS translations

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Redundant Proteins

20 Proteins

GenPept
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

NCBI RefSeq
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Swiss-Prot
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

PRF
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Etc.
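
The redundancy shown above can be detected by grouping FASTA records on their sequence and reading the source database from each definition line. The sketch below is one simple way to do that, under the assumption that deflines follow the `gi|number|db|accession|` layout seen on the slide. The function names are hypothetical, and the sequences in `FASTA` are shortened placeholders rather than the full proteins.

```python
from collections import defaultdict

# Deflines copied from the slide; sequences shortened to the first 10 residues.
FASTA = """\
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRR
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRR
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRR
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRR
"""


def parse_defline(defline: str):
    """Return (gi, database tag, accession) from an NCBI-style defline."""
    parts = defline.lstrip(">").split("|")
    gi, db = int(parts[1]), parts[2]
    # PRF entries leave the accession slot empty and put the ID in the next field.
    accession = parts[3] or parts[4].split()[0]
    return gi, db, accession


def read_fasta(text: str):
    """Yield (defline, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines() + [">"]:   # sentinel flushes the last record
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())


def group_identical(text: str) -> dict:
    """Map each distinct sequence to the list of records that carry it."""
    groups = defaultdict(list)
    for header, sequence in read_fasta(text):
        groups[sequence].append(parse_defline(header))
    return dict(groups)


if __name__ == "__main__":
    for sequence, records in group_identical(FASTA).items():
        print(sequence, "->", [f"{db}:{acc}" for _, db, acc in records])
```

Keeping one representative per group is the basic idea behind a non-redundant protein set; which representative to keep is a separate policy decision.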
Protein Sequences from Structures

>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ
Primary vs. Derivative Sequence Databases

[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. RefSeq (curators, genome assembly) and UniGene (algorithms) are derived from GenBank and are updated continually by NCBI.]

RefSeq: NCBI’s Derivative Sequence Database

• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
– chicken
– honeybee
– sea urchin
• Chromosome records
– Human genome
– microbial
– organelle

Entrez query: srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/

Genomes: Two Paths

• NCBI Eukaryotic Genomes
– Since 1999
– Map Viewer
– UniGene
– HomoloGene
– Contigs, Transcripts and Proteins

• Microbial Genomes
• Outside Eukaryotic Genomes (Plants, Fungi)
– Since 1993
– Comparative Proteomics
  • Clusters of Orthologous Groups (COGs)
  • Protein Clusters
– Chromosomes and Proteins

Selected RefSeq Accession Numbers

NM_  Curated mRNA
NP_  Curated Protein
NR_  Curated non-coding RNA
XM_  Predicted mRNA
XP_  Predicted Protein
XR_  Predicted non-coding RNA

NG_  Reference Genomic Sequence

NC_  Microbial replicons, organelle genomes

NT_  Contig
NW_  WGS Supercontig

Two Paths to RefSeq

[Figure: two routes from GenBank records to RefSeq records.
Human MLH1 sequences (NCBI annotated genomes and selected model organisms): GenBank mRNA records (U07343, AU127758, BC006850, ...) and genomic records (AC006583, AC011816, ...) lead to RefSeq NM_000249, NT_022517 (36974983..37032341), and NC_000003 (37009983..37067341).
Arabidopsis MLH1 sequences (submitted genomes and annotation): GenBank records AL161471, AL161472, ..., AL161595, AL161596, AJ270058, AJ270060, and protein CAB78038 lead to RefSeq transcript NM_116983 and NC_003075.]

GenBank to RefSeq: NCBI Organisms

RefSeqs: Annotation Reagents

[Figure: genomic DNA (NC, NT, NW) is scanned to produce model mRNA (XM), model RNA (XR), and model protein (XP). These are compared ("=?") with curated mRNA (NM), curated RNA (NR), and curated protein (NP) RefSeqs, which are built from GenBank sequences.]
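
Because RefSeq uses a distinct accession series, the record type can be read from the two-letter prefix before the underscore. The sketch below encodes the prefixes that appear in this deck (NC, NT, NW, XM, XP, XR, NM, NP, NR). The function name `classify_accession` is hypothetical, and labeling XR and NR as non-coding RNA is inferred from the accession list earlier in the deck rather than stated next to those prefixes.

```python
from typing import Optional

# Prefix meanings as they appear in this deck.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "model mRNA",
    "XP": "model protein",
    "XR": "model non-coding RNA",
    "NC": "genomic DNA",
    "NT": "genomic DNA",
    "NW": "genomic DNA",
}


def classify_accession(accession: str) -> Optional[str]:
    """Return the RefSeq record type, or None if the prefix is not listed."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore: not in the RefSeq accession format
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_000249", "NT_022517", "NC_003075", "U07343", "BC006850"):
        print(acc, "->", classify_accession(acc) or "not a listed RefSeq prefix")
```

A check like this is a quick way to separate curated or model RefSeq records from primary GenBank records in a mixed list of search results.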
RefSeq Benefits

• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
Mouse Assembly

[Figure: mouse assembly display. Track labels: WGS, BAC, RefSeq Contig, GenBank, RefSeq Transcript, UniGene, Other Transcript.]

Expressed Sequences

UniGene
GEO
NCBI Expressed Sequences

51,794,823 mRNA sequences
  50,591,152 GenBank (48,992,049 EST Division)
  1,202,049 Reference Sequences

What is UniGene?

A gene-oriented view of sequence entries

• MegaBlast-based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene-oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
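
The idea of turning pairwise similarity hits into gene-oriented clusters can be sketched with a union-find structure. This is a generic single-linkage illustration, not the UniGene pipeline itself. The function name `cluster` is hypothetical; the accessions are human MLH1 records named elsewhere in this deck plus a made-up `EST_X`, and the listed hit pairs are invented for the example.

```python
def cluster(ids, similar_pairs):
    """Group IDs so that any two linked by a chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)       # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    ids = ["U07343", "AU127758", "BC006850", "EST_X"]   # EST_X is made up
    hits = [("U07343", "BC006850"), ("AU127758", "BC006850")]  # hypothetical hits
    for members in cluster(ids, hits):
        print(sorted(members))
```

Sequences with no hits remain singleton clusters, which corresponds to the uncharacterized ESTs that appear alongside known genes.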
EST hits: Human mRNA

[Figure: thrombin mRNA with 5’ EST hits and 3’ EST hits.]

UniGene

[Figure: UniGene collections grouped as Chordates, Plants, Fungi et al., and Invertebrates.]

Gene Catalog: X. tropicalis

[Figure: X. tropicalis gene catalog showing the MLH1 cluster and uncharacterized ESTs.]

Associating Sequences: Human Thrombin

Expression Data
Other NCBI Databases

• Structure: imported structures (PDB)
  Cn3D viewer, NCBI curation
• CDD: conserved domain database
  Protein families (COGs and KOGs)
  Single domains (PFAM, SMART, CD)
• dbSNP: nucleotide polymorphism
• Gene: gene records
  Unifies LocusLink and Microbial Genomes
• HomoloGene: neighboring function for Gene


MMDB: Molecular Modeling Data Base

• Derived from experimentally determined PDB records
• Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation (secondary structure elements)
– Inclusion of Taxonomy, Citation
– Conversion to ASN.1 data description language
• Structure neighbors determined by
Vector Alignment Search Tool (VAST)
Cn3D 4.1: Bacillus thuringiensis Toxin

VAST: Structure Neighbors

Vector Alignment Search Tool

For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors, then align the vectors.

[Figure: SSE vectors numbered 1 to 6 on human IL-4, and an alignment of IL-4 and leptin.]
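
The first step, representing each secondary structure element as a vector, can be illustrated in a few lines. The sketch below covers only that representation and one simple comparison (the angle between two vectors); it is not the VAST alignment procedure. The function names and all coordinates are hypothetical.

```python
import math


def sse_vector(start, end):
    """Vector from the first to the last point of an SSE (x, y, z)."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(u, v) -> float:
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # Made-up endpoints for two helices, in angstroms.
    helix_1 = sse_vector((0.0, 0.0, 0.0), (0.0, 0.0, 15.0))
    helix_2 = sse_vector((5.0, 0.0, 15.0), (5.0, 12.0, 3.0))
    print(f"inter-helix angle: {angle_between(helix_1, helix_2):.1f} degrees")
```

Reducing each element to a vector makes two structures comparable by the relative geometry of a handful of vectors instead of thousands of atomic coordinates.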
Protein Domains

• Structural Domain
– Discrete, independently folding unit of a protein
• Conserved Domain (sequence-based)
– Protein region with a recognizable position-specific pattern of sequence conservation
• Sequence-based domains often roughly correspond to structural domains
• Domains often have distinct, identifiable functions

NCBI’s Conserved Domain Database

• PSI-BLAST-based score matrices
• Searchable with RPS-BLAST
• Sources
– SMART
– PFAM
– COGs
– NCBI curated domains
• structure-informed alignments
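
A position-specific score matrix assigns each residue a different score at each position of a domain model. As a simplified illustration of scoring a sequence against such a matrix, the sketch below slides a tiny matrix along a protein fragment and reports the best ungapped match. It is not RPS-BLAST. The function name `best_match`, the three-column matrix, its scores, and the penalty for unlisted residues are all made up; the fragment is taken from the MLH1 translation shown earlier.

```python
def best_match(pssm, sequence, missing=-1):
    """Return (start, score) of the best ungapped window, or None if too short."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position scores; unlisted residues get the penalty.
        score = sum(column.get(residue, missing)
                    for column, residue in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy matrix: one dict of residue scores per position (values are invented).
TOY_PSSM = [
    {"G": 5},
    {"E": 4, "D": 3},
    {"V": 3, "I": 2},
]

if __name__ == "__main__":
    fragment = "NRIAAGEVIQRPAN"   # from the MLH1 translation
    start, score = best_match(TOY_PSSM, fragment)
    print(f"best window {fragment[start:start + len(TOY_PSSM)]} "
          f"at offset {start}, score {score}")
```

Unlike a single substitution matrix, the same residue can score well at one position and poorly at another, which is what lets a profile capture a domain's conservation pattern.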
Src Domains

Four 3d domains
Three conserved domains

Structure vs Conserved Domain

[Figure: Cn3D view with the SH3, SH2, and TyrKC domains labeled and the conserved phosphotyrosine binding residues of SH2 highlighted.]

NCBI’s SNP Database

• Primary Database and Derivative (RefSNP)
• Single Nucleotide Polymorphism
• Repeat polymorphisms
• Insertion-Deletion Polymorphisms
• 29 Species
• Over 46 million submissions (submitted SNPs)
• Over 26 million reference SNPs
The Gene Database

• Gene Centered Information
• Unifies NCBI-annotated and Submitted Genomes
• 3.5 million records for 4,845 taxa

Human         36,465    Sea Urchin        30,604
Chimpanzee    31,555    Mosquito          13,789
Mouse         64,019    Fruit Fly         12,936
Rat           37,848    C. elegans        21,053
Dog           20,183    Fungi            266,454
Cow           29,590    Green Plants     105,216
Chicken       19,971    Archaea          111,192
Zebrafish     47,775    Bacteria       1,653,182
