
Sequence Formats

Sequences

 DNA and protein sequences


 Can be read and written in a variety of formats
 Sequence formats are ASCII text
 Each format defines a required arrangement of characters, symbols and keywords that mark the parts of an entry
 e.g. the sequence, ID name, comments, etc.
 A program relies on this arrangement to find each part of a sequence entry
Sequences

 Never any hidden, unprintable 'control' characters in any sequence format.
 All standard sequence formats can be printed out or viewed simply by displaying their file.
MS Word
 Microsoft Word format is not a sequence format.
 If using a word processor to type a sequence:
 Save the sequence to a file as ASCII text
 Try selecting: File, Save As, Save as type: Text
 Do not use word processors to write sequences
 Simple text editors should be used:
 Notepad
 Wordpad
 For UNIX
 Pico
 nedit
Some common formats

Single sequence per file    Multiple sequences per file       Either single or multiple sequences per file
gcg                         Multiple sequence format (msf)    fasta
staden                      clustal
embl                        phylip
Plus some others, e.g. MacVector, GeneWorks, DNA Strider etc.
Each sequence analysis program has its own format for storing sequence data!
FASTA format

 First line is a description line
 A single line
 Contains a greater-than (">") symbol in the first column
 Followed by lines of sequence data
 It is recommended that all lines of text be shorter than 80 characters in length
FASTA format

Description line
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Sequence
Standard IUB/IUPAC amino acid
and nucleic acid codes
 Lower-case letters are accepted and are mapped into upper-case
 No numerical digits in the sequence.
 The nucleic acid codes supported are:

A  adenosine M  A C (amino)
C  cytidine S  G C (strong)
G  guanine W  A T (weak)
T  thymidine BGTC
U  uridine DGAT
R  G A (purine) HACT
Y  T C (pyrimidine) VGCA
K  G T (keto) N  A G C T (any)
-  gap of indeterminate length
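The code table above can be turned into a small lookup. The following is a minimal Python sketch, not part of the original slides; the names IUPAC_NUCLEOTIDES, normalize_nucleotides and matches are made up for illustration. It maps lower case to upper case and rejects digits, as the two rules above require.

```python
# Ambiguity codes from the table above; each code maps to the bases it may stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}
GAP = "-"


def normalize_nucleotides(sequence):
    """Upper-case a nucleotide string and reject characters outside the table."""
    sequence = sequence.upper()
    bad = sorted({ch for ch in sequence
                  if ch not in IUPAC_NUCLEOTIDES and ch != GAP})
    if bad:
        raise ValueError("unsupported characters: " + "".join(bad))
    return sequence


def matches(code, base):
    """True if the plain base is one of the bases the code may stand for."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


print(normalize_nucleotides("acgtn-"))
print(matches("R", "a"), matches("Y", "g"))
```

A digit such as "1" in the input raises ValueError, which mirrors the rule that sequences contain no numerical digits.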
Standard IUB/IUPAC amino acid
and nucleic acid codes
 The accepted amino acid codes are:
A  alanine P  proline
B  aspartate or asparagine Q  glutamine
C  cystine R  arginine
D  aspartate S  serine
E  glutamate T  threonine
F  phenylalanine U  selenocysteine
G  glycine V  valine
H  histidine W  tryptophan
I  isoleucine Y  tyrosine
K  lysine Z  glutamate or glutamine
L  leucine X  any
M  methionine *  translation stop
N  asparagine -  gap of indeterminate length
FASTA format
 Multiple sequences
 Blank lines inserted
> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA

> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
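Reading and writing FASTA by the rules above takes only a few lines. The sketch below is an illustration rather than a reference implementation; parse_fasta and format_fasta are hypothetical names, and the 60-character line width is an assumption chosen to stay under the recommended 80.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are skipped
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line.upper())  # lower case is mapped to upper case
    if description is not None:
        yield description, "".join(chunks)


def format_fasta(description, sequence, width=60):
    """Write one record with sequence lines of at most `width` characters."""
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


example = "> mysequence\nACGT\nTTGA\n\n> mysequence2\naccgta\n"
for description, sequence in parse_fasta(example):
    print(description, sequence)
```

Sequence lines that appear before the first ">" line are ignored in this sketch; a stricter reader might treat them as an error.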
Genbank File Format
 Sequence Data Entries
 First line
 Begins with ‘LOCUS’ in the first 5 spaces
 Followed by genetic locus name or identifier
 Length of the sequence
 Type of sequence
 Second line
 DEFINITION in the first 10 spaces
 Followed by free form text to identify the sequence.
 Third line
 ACCESSION in the first 9 spaces
 Spaces 13 - 18 must hold the primary accession number
 Fourth line
 ORIGIN in the first 6 spaces
 Nothing else is required on this line, it indicates that the nucleic acid
sequence begins on the next line.
Genbank File Format
 Fifth line
 Begins the nucleotide sequence.
 The first 9 spaces of each sequence line may either be blank
or may contain the position in the sequence of the first
nucleotide on the line.
 The next 66 spaces hold the nucleotide sequence in six
blocks of ten nucleotides.
 Each of the six blocks begins with a blank space followed by
ten nucleotides.
 Thus the first nucleotide is in space 11 of the line while the
last is in space 75.
 Last line
 Must have // in the first 2 spaces to indicate termination of the
sequence.
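The column rules above (position in spaces 1 to 9, then six blocks of one blank plus ten nucleotides, ending in space 75) can be sketched in Python as follows. The function names are hypothetical, the lower-case output is a choice made to match the example entry that follows, and the LOCUS line is simplified to the three items the slide lists.

```python
def genbank_sequence_lines(sequence):
    """Lay out a nucleotide string as GenBank-style sequence lines."""
    sequence = sequence.lower()
    lines = []
    for start in range(0, len(sequence), 60):
        row = sequence[start:start + 60]
        blocks = [row[i:i + 10] for i in range(0, len(row), 10)]
        # Spaces 1-9: position of the first nucleotide, right-justified.
        # Each block is then one blank followed by up to ten nucleotides.
        lines.append(f"{start + 1:>9}" + "".join(" " + b for b in blocks))
    return lines


def minimal_genbank_entry(locus, seq_type, definition, accession, sequence):
    """Build the minimal entry described above: LOCUS, DEFINITION,
    ACCESSION, ORIGIN, the sequence lines and the // terminator."""
    header = [
        f"LOCUS       {locus} {len(sequence)} bp {seq_type}",
        f"DEFINITION  {definition}",
        f"ACCESSION   {accession}",  # accession number starts in space 13
        "ORIGIN",
    ]
    return "\n".join(header + genbank_sequence_lines(sequence) + ["//"]) + "\n"


print(minimal_genbank_entry("DMTPIG", "DNA", "test entry", "X57576", "ACGT" * 20))
```

A full line produced this way is 75 characters long, with the first nucleotide in space 11, as stated above.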
Genbank File Format

LOCUS    name    size bp    type    date

name: GenBank locus name
size: total base count
type: DNA, RNA, PROTEIN, MASK, or TEXT
date: dd-MON-yyyy
Genbank Example
LOCUS       NM_079846    1190 bp    mRNA    linear   INV    15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
VERSION     NM_079846.1  GI:17864111
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1190)
  AUTHORS   Shaw-Lee,R.L., Lissemore,J.L. and Sullivan,D.T.
  TITLE     Structure and expression of the triose phosphate isomerase (Tpi)
            gene of Drosophila melanogaster
  JOURNAL   Mol. Gen. Genet. 230 (1-2), 225-229 (1991)
  MEDLINE   92079900
   PUBMED   1720860
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AE003772.1.
FEATURES             Location/Qualifiers
     source          1..1190
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E1-99E2"
     gene            1..1190
                     /gene="Tpi"
                     /note="TPI; TPIS; CG2171; CT6334"
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
Genbank Example
     CDS             181..924
                     /gene="Tpi"
                     /EC_number="5.3.1.1"
                     /note="Nucleotide sequence of the Celera sequence differs
                     from the published sequence for this transcript."
                     /codon_start=1
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
                     /product="Triose phosphate isomerase"
                     /protein_id="NP_524585.1"
                     /db_xref="GI:17864112"
                     /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPA
                     IYLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFG
                     ESDALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVV
                     VAYEPVWAIGTGQTATPDQAQEVHAFLRQWLSDNISKEVSASLRIQYGGSVTAANAKE
                     LAKKPDIDGFLVGGASLKPEFVDIINARQ"
     misc_feature    187..921
                     /note="TIM; Region: Triosephosphate isomerase"
BASE COUNT      279 a    368 c    323 g    220 t
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac ttgaaattat cagttccaaa cactctaata gcagtcccct tgttttgtcc
      121 cccgatccgc agttctacgc caatttcagc accgattgca ccgacagcaa cagcaacaac
      181 atgagccgaa agttctgcgt gggaggcaac tggaagatga acggcgacca gaagtccatc
      241 gccgagatcg ccaagaccct gagctcggcc gccctcgacc ccaacacgga ggtggtcatc
      301 ggctgcccgg ccatctacct gatgtacgcc cgcaacctgc tgccctgcga gctgggtctg
      361 gccggccaga atgcctacaa ggtggccaag ggcgcattca ccggcgagat ctcccctgcg
      421 atgctgaagg
//
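Going the other way, a program can pull the key fields and the sequence out of an entry like the one above. The following is a rough Python sketch under stated assumptions: read_genbank is a hypothetical name, only the first line of DEFINITION is kept, and everything between ORIGIN and // is treated as sequence once digits and blanks are removed.

```python
import re


def read_genbank(text):
    """Extract locus name, definition, accession and sequence from one entry."""
    fields = {"locus": "", "definition": "", "accession": "", "sequence": ""}
    in_sequence = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator line ends the entry
        if in_sequence:
            # Drop the position numbers and the blanks between blocks.
            chunks.append(re.sub(r"[\s\d]", "", line))
        elif line.startswith("LOCUS"):
            parts = line.split()
            fields["locus"] = parts[1] if len(parts) > 1 else ""
        elif line.startswith("DEFINITION"):
            fields["definition"] = line[10:].strip()
        elif line.startswith("ACCESSION"):
            parts = line.split()
            fields["accession"] = parts[1] if len(parts) > 1 else ""
        elif line.startswith("ORIGIN"):
            in_sequence = True
    fields["sequence"] = "".join(chunks)
    return fields


sample = """LOCUS       NM_079846    1190 bp    mRNA    linear   INV    15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac
//
"""
record = read_genbank(sample)
print(record["locus"], len(record["sequence"]))
```

Writing `">" + record["locus"]` followed by the sequence would give a FASTA version of the same entry.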
EMBL File Format

 European Molecular Biology Laboratory


 First line
 Begins with two letters ID
 Followed by the EMBL identifier
 Second line
 AC, followed by accession number
 Third line
 DE, followed by a free form text definition
 Fourth line
 SQ, followed by the length of the sequence
 After the sequence length there is a blank space and the two letters
BP.
EMBL File Format
 Fifth line
 Nucleotide sequence begins
 Each line of sequence begins with four blank spaces
 Next 66 spaces hold the nucleotide sequence in 6
blocks of 10 nucleotides.
 Each of the six blocks begins with a blank space
followed by ten nucleotides.
 Thus the first nucleotide is in space 6 of the line while
the last is in space 70.
 The last line ~ terminator line
 Two characters // in the first two spaces
 Multiple sequences may appear in each file
EMBL Example (1)
LINE 1 :ID ID_name
LINE 2 :AC Accession number
LINE 3 :DE Describe the sequence any way you want
LINE 4 :SQ Length BP
LINE 5 : ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTA...
LINE 6 : ACGT...
LINE 7 ://
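The seven-line skeleton above can be generated from the layout rules on the previous slides (four blanks, then six blocks of one blank plus ten nucleotides, ending in space 70). This is a sketch only: the function names are hypothetical, and the three blanks after each two-letter line code are an assumption taken from the full EMBL example rather than from the rules themselves.

```python
def embl_sequence_lines(sequence):
    """Lay out a nucleotide string as EMBL-style sequence lines."""
    lines = []
    for start in range(0, len(sequence), 60):
        row = sequence[start:start + 60]
        blocks = [row[i:i + 10] for i in range(0, len(row), 10)]
        # Four blanks, then each block as one blank plus up to ten nucleotides,
        # so the first nucleotide lands in space 6.
        lines.append("    " + "".join(" " + b for b in blocks))
    return lines


def minimal_embl_entry(identifier, accession, description, sequence):
    """Build the minimal entry: ID, AC, DE, SQ, sequence lines and //."""
    lines = [
        f"ID   {identifier}",
        f"AC   {accession}",
        f"DE   {description}",
        f"SQ   {len(sequence)} BP",
    ]
    lines += embl_sequence_lines(sequence)
    lines.append("//")
    return "\n".join(lines) + "\n"


print(minimal_embl_entry("ID_name", "X00000", "Describe the sequence", "ACGT" * 17))
```

Because each entry ends with its own // line, several entries can simply be concatenated into one file.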
EMBL Example (2)
EMBL Example (3)
ID   DMTPIG     standard; DNA; INV; 3419 BP.
XX
AC   X57576; S70377;
XX
SV   X57576.1
XX
DT   20-JAN-1992 (Rel. 30, Created)
DT   19-AUG-1996 (Rel. 49, Last updated, Version 10)
XX
DE   D.melanogaster Tpi gene for Triosephosphate isomerase
XX
KW   glycolytic enzyme; tpi gene; triosephosphate isomerase.
XX
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila.
XX
RN   [1]
RP   1-3419
RA   Sullivan D.T.;
RT   ;
RL   Submitted (07-FEB-1991) to the EMBL/GenBank/DDBJ databases.
RL   D.T. Sullivan, Biological Research Laboratories, 130 College Pl, Syracuse
RL   University, Syracuse, NY 13244, USA
XX
RN   [3]
RX   MEDLINE; 92079900.
RA   Shaw-Lee R.L., Lissemore J.L., Sullivan D.T.;
RT   "Structure and expression of the triose phosphate isomerase (Tpi) gene of
RT   Drosophila melanogaster.";
RL   Mol. Gen. Genet. 230:225-229(1991).
XX
DR   FLYBASE; FBgn0003738; Tpi.
DR   SWISS-PROT; P29613; TPIS_DROME.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3419
FT                   /db_xref="taxon:7227"
FT                   /germline
FT                   /organism="Drosophila melanogaster"
FT                   /strain="Oregon-R"
FT                   /clone_lib="EMBL-4"
FT   CDS             join(2237..2773,2830..3036)
FT                   /db_xref="FLYBASE:FBgn0003738"
FT                   /db_xref="SWISS-PROT:P29613"
FT                   /gene="Tpi"
FT                   /EC_number="5.3.1.1"
FT                   /product="triosephosphate isomerase"
FT                   /protein_id="CAA40804.1"
FT                   /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPAI
FT                   YLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFGES
FT                   DALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVVVAY
FT                   EPVWAIGTGKTATPDQAQEVHASLRQWLSDNISKEVSASLRIQYGGSVTAANAKELAKK
FT                   PDIDGFLVGGASLKPEFLDIINARQ"
FT   mRNA            join(2004..2028,2186..2773,2830..3036)
FT                   /gene="Tpi"
FT   prim_transcript 2004..3296
FT   exon            2008..2032
FT                   /number=1
FT   exon            2189..2773
FT                   /number=2
FT   exon            2830..3296
FT                   /number=3
FT   intron          2033..2188
FT                   /number=1
FT   intron          2774..2829
FT                   /number=2
FT   misc_feature    2147..2151
FT                   /note="intron 1 lariat sequence"
FT   misc_feature    2789..2793
FT                   /note="intron 2 lariat sequence"
FT   polyA_signal    3258..3262
XX
SQ   Sequence 3419 BP; 855 A; 933 C; 849 G; 778 T; 4 other;
     gatctcgagc gagaaatgtg gaacatagtg gaggcctcca gtggcgccga gctgggtgaa        60
     accagctacg agttcccttc ccccgctccg gttcccagcg cagcagtgaa cgaaatagca       120
     gttccacagt cccaccagct cctcctgctc ctgcgaagcc ctcagttccg tccgcctcct       180
     atgacaacca caactacagt ttcagccagg atgaggacga agatgatgat gatctggagt       240
     ttgaggacgt attcgtgccg gccagctctg ttccaaatcc cgttcagcct ggcatagatc       300
     ccgtggaact gcgtcgctcc ctggctttgg tcatgaggga gaaattgcga tcggatgaca       360
     cggactccag gccaatgggc aacaatcagg atcttcccat agatgaacag tccagggaga       420
     gaccgctctc cactcaaaca tctcccacaa atggcccact tccggctctt ctgagggcca       480
     aactgcttgc tgggcaactc nnnncaatag cgctcactgc ctgccaggat ccacggcgag       540
     tcctgctccc caggagcaat ccggtatctt tgtgatcgat agtgaggcga gtcccggctc       600
     aaatgggcac aagcctaagt atcgaaaggg cacggcattc actcggagtt cgctgaagaa       660
     gagccgatcc tgcaactgta gctccatcgc taagggacga ggggtccacg acgagcccag       720
     cagtaatctc tgcagggatc aggagtcctc tgtacttcca cagcatccgc agccagccaa       780
     ccatcccaca gagaactttt
//
National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

 First line
 Begins with a greater than symbol (>)
 Immediately followed by 2 character sequence type specifier
Specifier   Sequence type
P1          protein, complete
F1          protein, fragment
DL          DNA, linear
DC          DNA, circular
RL          RNA, linear
RC          RNA, circular
N1          functional RNA, other than tRNA
N3          tRNA
 Then a semicolon (;)
 Followed by sequence name or identification code for the NBRF
database
 Four to six letters and numbers

>P1;CBRT
National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

 Second line contains two kinds of information
 First:
 Sequence name
 Followed by three characters: blank, hyphen, blank (" - ")
 Second:
 Organism or organelle name
>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)
National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

 Amino acid or nucleic acid sequence begins on line three
 Free format
 May be interrupted by blanks for ease of reading
 Protein sequences
 May contain special punctuation to indicate various indeterminacies in the sequence
 The last character in the sequence must be an asterisk (*).
NBRF/PIR Example
LINE 1 :>P1;CBRT
LINE 2 :Cytochrome b - Rat mitochondrion (SGC1)
LINE 3 :M T N I R K S H P L F K I I N H S F I D L P A P S
LINE 4 : VTHICRDVN Y GWL IRY
LINE 5 :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
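The three-part layout (type and code, then name and organism, then free-format sequence ending in an asterisk) could be read with a sketch like the one below. It is illustrative only; parse_pir is a hypothetical name, and the special punctuation that protein sequences may contain is passed through unchanged rather than interpreted.

```python
SEQUENCE_TYPES = {
    "P1": "protein, complete", "F1": "protein, fragment",
    "DL": "DNA, linear", "DC": "DNA, circular",
    "RL": "RNA, linear", "RC": "RNA, circular",
    "N1": "functional RNA, other than tRNA", "N3": "tRNA",
}


def parse_pir(text):
    """Parse one NBRF/PIR entry into a dictionary."""
    lines = text.splitlines()
    header = lines[0]
    if not header.startswith(">") or header[3:4] != ";":
        raise ValueError("first line must look like >P1;CODE")
    type_code = header[1:3]
    if type_code not in SEQUENCE_TYPES:
        raise ValueError("unknown sequence type: " + type_code)
    # Line two: sequence name, then " - ", then organism or organelle name.
    name, _, organism = lines[1].partition(" - ")
    # Line three onward: free format, so all blanks are removed.
    body = "".join("".join(lines[2:]).split())
    if not body.endswith("*"):
        raise ValueError("sequence must end with an asterisk")
    return {
        "type": SEQUENCE_TYPES[type_code],
        "code": header[4:].strip(),
        "name": name.strip(),
        "organism": organism.strip(),
        "sequence": body[:-1],
    }


entry = parse_pir(
    ">P1;CBRT\n"
    "Cytochrome b - Rat mitochondrion (SGC1)\n"
    "M T N I R K S H P L\n"
    " VTHICRDVN Y GWL IRY\n"
    "TWIGG*\n"
)
print(entry["code"], entry["organism"], entry["sequence"])
```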
MolGen/Stanford File Format

 Molecular Genetics Group at Stanford U.


 First line ~ comment line
 Begins with a semi-colon (;)
 Followed by descriptive text
 May be as many comment lines as desired
 Need not be present
 Second line
 Must be present
 Contains an identifier or name for the sequence
MolGen/Stanford File Format

 Third line
 Sequence begins
 Occupies up to 80 spaces
 Spaces may be included in the sequence for
ease of reading.
 Terminated with 1 or 2
 1 indicates a linear sequence
 2 marks a circular sequence
MolGen/Stanford Example
LINE 1 :; Describe the sequence any way you want
LINE 2 :ECTRNAGLY2
LINE 3 :ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
LINE 4 : GCTTA GG G C T A1
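A reader for this format follows directly from the rules above: optional comment lines starting with a semicolon, one identifier line, then the sequence terminated by 1 (linear) or 2 (circular). The sketch below is an assumption-laden illustration; parse_stanford is a hypothetical name and a single entry per input is assumed.

```python
def parse_stanford(text):
    """Parse one MolGen/Stanford entry into a dictionary."""
    lines = text.splitlines()
    comments = []
    i = 0
    # Any number of leading comment lines, including none.
    while i < len(lines) and lines[i].startswith(";"):
        comments.append(lines[i][1:].strip())
        i += 1
    if i >= len(lines):
        raise ValueError("missing identifier line")
    identifier = lines[i].strip()
    # Blanks inside the sequence are only there for readability.
    body = "".join("".join(lines[i + 1:]).split())
    if not body or body[-1] not in "12":
        raise ValueError("sequence must be terminated with 1 or 2")
    return {
        "comments": comments,
        "identifier": identifier,
        "sequence": body[:-1],
        "topology": "linear" if body[-1] == "1" else "circular",
    }


entry = parse_stanford(
    "; Describe the sequence any way you want\n"
    "ECTRNAGLY2\n"
    "ACGCACGTAC ACGTACGTAC A C G T\n"
    " GCTTA GG G C T A1\n"
)
print(entry["identifier"], entry["topology"], entry["sequence"])
```

Note that the trailing 2 of the identifier ECTRNAGLY2 is part of the name; only the last character of the sequence lines is read as the terminator.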
PHYLIP File Format

 Interleaved and Sequential formats
 Created and used by several phylogenetics programs
PHYLIP File Format
 Interleaved format
 Similar to the output of alignment programs
 The first part of the file contains the first part of each of the sequences
 Only the first parts of the sequences should be preceded by names
 Then the second part of each sequence, and so on.

18 206
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1kauf    MNTTDCFIAL VQAIREIKAL FLSRTTG-KM ELTLYNGEKK TFYSRPNNHD
ken1-76   MNTTDCFIAL LRAFREIKTL FLSRVRG-KM EFTLYNGEKK TFYSRPNNHD
ken34-84  MNTTDCFIAL VRAIREFKIL FSLRPLARKM EFTLYNGIKK TFYSRPNKHD
ken       MNTTDCFIAL VQAIREIKLL FKG--IR-KM KLTLYNGEKK TFYSRPNSHD
uga97-1   MNTTDCFIAL VQAIREIKSL FRS--SR-KM EFTLYNGEKK TFYSRPNNHD
bec1-65   MKTTDCFNVL FEIFHRFGQT FKA--DR-KM EFTLYNGEKK TFYSRPNTHG
zim88-3   MKTTDCFDVL LEIFHRFRQT FKT--DR-KM EFTLYNGEKK TFYSRPNTHG
knp10-90  MKTTDCFNVL LETFHRFRNV FKT--DR-KM EFTLYNGDKK TFYSRPNTHG
zim96-3   MKTTGCFDVL IEIAHRLRQL NKT--DR-KM EFTLYNGEKK TFYSRPNTHG
zim7-83   MKTTDCFNVL LEIIYRFRHT FKT--DR-KM EFTLYNGEKK TFYSRPNKHG
knp196-9  MKTTDCFSVL FEIFHRLRHT LKT--ER-KM EFTLYNGERK TFYSRPNKHG
zam4-96   MKTTDCFDAL LEAFHRLRQT FKT--DR-KM EFTLYNGEKK TFYSRPNRHG

          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSSPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFE WVYDSPENLT VEAIRQLEEL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFD WVYESPENLT IQAIGQLEEL TGLDLREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LRAIEQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LQAIEQLEEL TGLELHEGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKRLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP

          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA

PHYLIP File Format

 Sequential format
 All of one sequence is given, possibly on multiple lines, before the next starts.

18 206 YF
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GGWKANVQRK
          LK----
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDDDFYP WTPDPSDVLV FVPYDQEPLN GEWKTKVQQK
          LK----
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKASVQRK
          LKGAGQ
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKANVQRK
          LKGAGQ
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GEWKAKVQRK
          LK----
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFAC…
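The difference between the two PHYLIP layouts is easiest to see when the same alignment is written both ways. The sketch below is a simplified illustration: write_phylip is a hypothetical name, names are padded or cut to ten characters, continuation lines are indented by ten blanks, and 50 residues per line in blocks of ten is an assumption copied from the slide examples.

```python
def _blocks(chunk):
    """Split a stretch of sequence into space-separated blocks of ten."""
    return " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))


def write_phylip(records, interleaved=False, width=50):
    """Write (name, sequence) pairs as PHYLIP text.

    All sequences must have the same length because they are aligned."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_chars = lengths.pop()
    out = [f"{len(records)} {n_chars}"]  # header: number of sequences, length
    if interleaved:
        # First part of every sequence, then the second part of every one...
        for start in range(0, n_chars, width):
            for name, seq in records:
                prefix = name[:10].ljust(10) if start == 0 else " " * 10
                out.append(prefix + _blocks(seq[start:start + width]))
            out.append("")
    else:
        # All of one sequence before the next one starts.
        for name, seq in records:
            for start in range(0, n_chars, width):
                prefix = name[:10].ljust(10) if start == 0 else " " * 10
                out.append(prefix + _blocks(seq[start:start + width]))
    return "\n".join(out).rstrip("\n") + "\n"


records = [("a121", "MNTTNCFIAL" * 6), ("a241", "MNTTDCFIAL" * 6)]
print(write_phylip(records, interleaved=True))
print(write_phylip(records, interleaved=False))
```

In the interleaved output only the first block of lines carries the names, which is the rule stated on the interleaved slide.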
Protein Data Bank (PDB) File Format
 Each line is 80 columns wide and is terminated by an
end-of-line indicator.
 The first 6 columns of every line contain a "record
name".
 The list of ATOM records in each polymer chain must be
terminated by a TER record.
 ATOM records for polymer atoms must include non-
blank chain ID fields.
 To use the automatic validation check, the coordinate file
must include a complete CRYST1 record defining the
unit cell and space group information.
 Each file should terminate with a line containing only the
word END.
Protein Data Bank (PDB) File Format
COLUMNS   DATA TYPE      FIELD        DEFINITION
---------------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier, left-justified.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.
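Because ATOM records are defined by column position rather than by separators, they are best read by slicing. The sketch below follows the column table above (column n of the table is string index n - 1); the function names are hypothetical, and the writer simply left-justifies the atom name in columns 13 to 16, which is a simplification of how real files align atom names.

```python
def format_atom_record(serial, name, res_name, chain_id, res_seq,
                       x, y, z, occupancy, temp_factor, element=""):
    """Build an 80-column ATOM line with blank altLoc, iCode, segID, charge."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain_id:1}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{temp_factor:6.2f}"
        f"{'':10}{element:>2}{'':2}"
    )


def parse_atom_record(line):
    """Slice one ATOM line into the fields of the column table."""
    line = line.ljust(80)  # tolerate lines whose trailing blanks were trimmed
    if line[0:6] != "ATOM  ":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "altLoc": line[16].strip(),
        "resName": line[17:20].strip(),
        "chainID": line[21].strip(),
        "resSeq": int(line[22:26]),
        "iCode": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "tempFactor": float(line[60:66]),
        "segID": line[72:76].strip(),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


line = format_atom_record(1, "N", "ASP", "A", 35, 34.731, -5.686, 15.0, 1.0, 98.44, "N")
atom = parse_atom_record(line)
print(len(line), atom["resName"], atom["chainID"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace would fail on real files, since adjacent fields such as large coordinates can run together without a blank between them.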
Protein Data Bank (PDB) File Format
Pattern:
RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94

Where:
RTyp: Record Type
Num: Serial number of the atom. Each atom has a unique serial number.
Atm: Atom name (IUPAC format).
Res: Residue name (IUPAC format).
Ch: Chain to which the atom belongs (in this case, L for light chain
of an antibody).
ResN: Residue sequence number.
X, Y, Z: Cartesian coordinates specifying atomic position in space.
Occ: Occupancy factor
Temp: Temperature factor (atoms disordered in the crystal have high
temperature factors).
PDB: The PDB data file unique identifier.
Line: Line (record) number in the data file.
PDB Example
HEADER LYASE 06-JUL-99 1QU4
TITLE CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE
TITLE 2 DECARBOXYLASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ORNITHINE DECARBOXYLASE;
COMPND 3 CHAIN: A, B, C, D;
COMPND 4 EC: 4.1.1.17;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI;
SOURCE 3 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 4 EXPRESSION_SYSTEM_COMMON: BACTERIA;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: B21/DG3;
SOURCE 6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID
KEYWDS POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA
KEYWDS 2 BARREL, LYASE
EXPDTA X-RAY DIFFRACTION
AUTHOR N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
AUTHOR 2 E.J.GOLDSMITH
REVDAT 2 29-DEC-99 1QU4 1 JRNL COMPND REMARK
REVDAT 1 17-NOV-99 1QU4 0
JRNL AUTH N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
JRNL AUTH 2 E.J.GOLDSMITH
JRNL TITL X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM
JRNL TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE
JRNL TITL 3 STRUCTURE IN COMPLEX WITH
JRNL TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE
JRNL REF BIOCHEMISTRY V. 38 15174 1999
JRNL REFN ASTM BICHAW US ISSN 0006-2960
REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.90 ANGSTROMS.
REMARK …
DBREF 1QU4 A 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 B 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 C 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 D 1 425 SWS P07805 DCOR_TRYBB 21 445
SEQRES 1 A 425 GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES 2 A 425 ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES 3 A 425 LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES 4 A 425 PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES 5 A 425 GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES 6 A 425 TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES 7 A 425 THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES 8 A 425 ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES 9 A 425 PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES 10 A 425 SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES 11 A 425 MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES 12 A 425 LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES 13 A 425 THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES 14 A 425 PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES 15 A 425 GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES 16 A 425 PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES 17 A 425 ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES 18 A 425 GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES 19 A 425 GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES 20 A 425 PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES 21 A 425 LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES 22 A 425 GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES 23 A 425 ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES 24 A 425 GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES 25 A 425 SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES 26 A 425 PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES 27 A 425 LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES 28 A 425 PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES 29 A 425 GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES 30 A 425 GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES 31 A 425 VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES 32 A 425 THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES 33 A 425 VAL ARG GLU LEU LYS SER GLN LYS SER
HET PLP A 600 15
HET PLP B 600 15
HET PLP C 600 15
HET PLP D 600 15
HETNAM PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN PLP VITAMIN B6 COMPLEX
FORMUL 5 PLP 4(C8 H10 N1 O6 P1)
HELIX 1 1 LEU A 45 LEU A 59 1 15
HELIX 2 2 LYS A 69 ASN A 71 5 3
HELIX 3 3 ASP A 73 GLY A 84 1 12
HELIX 4 4 SER A 91 ILE A 101 1 11
HELIX 5 5 PRO A 104 GLU A 106 5 3
HELIX 6 6 GLN A 116 SER A 126 1 11
HELIX 7 7 CYS A 135 HIS A 146 1 12
HELIX 8 8 LYS A 173 GLU A 175 5 3
HELIX 9 9 ASP A 176 LEU A 187 1 12
HELIX 10 10 ALA A 205 LEU A 225 1 21
HELIX 11 11 LYS A 247 PHE A 263 1 17
HELIX 12 12 GLY A 276 ALA A 281 1 6
HELIX 13 13 PHE A 326 HIS A 333 1 8
HELIX 14 14 THR A 390 THR A 394 5 5
HELIX 15 15 SER A 396 PHE A 400 5 5
SHEET 1 A 6 GLN A 365 PRO A 373 0
SHEET 2 A 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A 6 PHE A 40 ASP A 44 -1 O PHE A 40 N ALA A 287
SHEET 6 A 6 THR A 404 VAL A 408 1 O THR A 404 N PHE A 41
SHEET 1 A1 6 GLN A 365 PRO A 373 0
SHEET 2 A1 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A1 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A1 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A1 6 TRP A 380 PHE A 383 -1 N LEU A 381 O VAL A 288
SHEET 6 A1 6 PRO A 338 PRO A 340 -1 O LEU A 339 N LEU A 382
CRYST1 66.800 151.700 85.350 90.00 102.30 90.00 P 1 21 1 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.014970 0.000000 0.003264 0.00000
SCALE2 0.000000 0.006592 0.000000 0.00000
SCALE3 0.000000 0.000000 0.011992 0.00000
ATOM 1 N ASP A 35 34.731 -5.686 15.000 1.00 98.44 N
ATOM 2 CA ASP A 35 34.249 -5.884 13.629 1.00 98.39 C
ATOM 3 C ASP A 35 33.320 -4.750 13.203 1.00 98.13 C
ATOM 4 O ASP A 35 33.474 -3.594 13.603 1.00 98.29 O
ATOM 5 CB ASP A 35 33.558 -7.247 13.545 1.00 98.38 C
ATOM 6 CG ASP A 35 33.566 -7.887 12.170 1.00 98.36 C
ATOM 7 OD1 ASP A 35 33.717 -9.133 12.114 1.00 98.26 O
ATOM 8 OD2 ASP A 35 33.419 -7.182 11.148 1.00 98.39 O
ATOM 9 N GLU A 36 32.332 -5.073 12.378 1.00 97.79 N
ATOM 10 CA GLU A 36 31.446 -4.080 11.787 1.00 95.51 C
ATOM 11 C GLU A 36 32.259 -2.944 11.199 1.00 90.65 C
ATOM 12 O GLU A 36 32.220 -1.813 11.692 1.00 94.96 O
ATOM 13 CB GLU A 36 30.419 -3.638 12.840 1.00 97.63 C
ATOM 14 CG GLU A 36 29.111 -3.155 12.261 1.00 98.19 C
ATOM 15 CD GLU A 36 27.791 -3.597 12.824 1.00 98.33 C
ATOM 16 OE1 GLU A 36 27.308 -4.727 12.601 1.00 98.28 O
ATOM 17 OE2 GLU A 36 27.115 -2.806 13.520 1.00 98.43 O
ATOM 18 N GLY A 37 33.018 -3.192 10.131 1.00 52.86 N
ATOM 19 CA GLY A 37 33.624 -2.167 9.299 1.00 39.88 C
ATOM 20 C GLY A 37 32.598 -1.167 8.712 1.00 34.34 C
ATOM 21 O GLY A 37 32.236 -1.162 7.531 1.00 31.44 O
ATOM 22 N ASP A 38 32.135 -0.248 9.564 1.00 37.23 N
ATOM 23 CA ASP A 38 31.136 0.700 9.138 1.00 36.44 C
ATOM 24 C ASP A 38 31.794 1.722 8.228 1.00 33.49 C
ATOM 25 O ASP A 38 33.029 1.896 8.156 1.00 34.06 O
ATOM 26 CB ASP A 38 30.500 1.242 10.405 1.00 42.06 C
ATOM 27 CG ASP A 38 29.583 0.207 11.047 1.00 44.59 C
ATOM 28 OD1 ASP A 38 29.408 -0.876 10.434 1.00 45.72 O
ATOM 38 CA PHE A 40 32.728 6.727 7.615 1.00 20.51 C
...
CONECT1117911177
CONECT1118011177
MASTER 482 0 4 60 80 0 0 611176 4 64 132
END
Conversion of Sequence Formats
 seqret (EMBOSS)
 gcg GCG 9.x and 10.x format
 embl
 swiss
 fasta
 genbank
 nbrf
 pir NBRF (PIR)
 codata CODATA format.
 strider DNA strider format
 clustal
 phylip PHYLIP non-interleaved multiple alignment format.
 acedb ACeDB format
 msf Wisconsin Package GCG's MSF multiple sequence format.
 hennig86 Hennig86 format
 jackknifer Jackknifer format
 jackknifernon Jackknifernon format
 nexus
 paup Nexus/PAUP format
 treecon Treecon format
 mega Mega format
 ig IntelliGenetics format.
 staden
 text
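A conversion with seqret can be scripted. The sketch below only assembles the command line; it assumes EMBOSS is installed and that the -sequence, -outseq, -sformat and -osformat qualifiers behave as in common seqret usage, which should be checked against the help output of the installed version. The helper name seqret_command and the file names are made up for illustration.

```python
import shlex


def seqret_command(infile, outfile, out_format, in_format=None):
    """Assemble a seqret command line as a list of arguments.

    The qualifier names are an assumption based on common EMBOSS usage."""
    cmd = ["seqret", "-sequence", infile, "-outseq", outfile,
           "-osformat", out_format]
    if in_format:
        cmd += ["-sformat", in_format]
    return cmd


cmd = seqret_command("tpi.gb", "tpi.fasta", "fasta", in_format="genbank")
print(shlex.join(cmd))
# To run it where EMBOSS is available:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The format names passed in are the ones listed above, for example fasta, genbank, embl or phylip.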
Conversion of Sequence Formats

 Web-based
 WWW READSEQ Sequence Conversion at NIH
 http://bimas.cit.nih.gov/molbio/readseq/
 WWW READSEQ at Human Genome Mapping Project (HGMP) Center
 http://www.hgmp.mrc.ac.uk/Registered/Webapp/readseq/
Conversion of Sequence Formats
 Windows-based program
 SeqVerter
 Downloadable from GeneStudio, Inc. (free)

 http://www.genestudio.com/seqverter.htm

 Read: ABI traces, Clustal, DCSE, DNASIS, DNAStar, DNAStrider (including binary), EMBL, FASTA, GDE, GenBank, IBI/Pustell, Macaw, MSF, Nexus/PAUP, PHYLIP Interleaved, PIR/NBRF, SCF 2.0 and SCF 3.0 traces, Swiss-Prot, and TreeCon.
 Write: Clustal, DNASIS, DNAStar, FASTA and FASTA-SequIn, GenBank, IBI/Pustell, MSF, Nexus/PAUP, PHYLIP Interleaved, and TreeCon.
 Tutorial also available
