Sequences
Description line
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Sequence
Standard IUB/IUPAC amino acid and nucleic acid codes
Lower-case letters are accepted and are mapped into upper-case
No numerical digits in the sequence.
The nucleic acid codes supported are:
A  adenosine          M  A C (amino)
C  cytidine           S  G C (strong)
G  guanine            W  A T (weak)
T  thymidine          B  G T C
U  uridine            D  G A T
R  G A (purine)       H  A C T
Y  T C (pyrimidine)   V  G C A
K  G T (keto)         N  A G C T (any)
                      -  gap of indeterminate length
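The rules above (upper-case mapping, no digits, a fixed code set) can be sketched as a small validator. This is a minimal illustration, not a reference implementation; the names `IUPAC_NUCLEOTIDES` and `normalize_nucleotides` are hypothetical, and stripping whitespace before checking is an assumption about how line breaks should be treated.

```python
# IUPAC nucleic acid codes from the table above, mapped to the bases they stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}
GAP = "-"  # gap of indeterminate length


def normalize_nucleotides(raw: str) -> str:
    """Upper-case a sequence and reject anything outside the accepted codes."""
    # Assumption: whitespace and line breaks are layout only and can be dropped.
    seq = "".join(raw.split()).upper()
    for position, symbol in enumerate(seq, start=1):
        if symbol != GAP and symbol not in IUPAC_NUCLEOTIDES:
            raise ValueError(f"invalid symbol {symbol!r} at position {position}")
    return seq


print(normalize_nucleotides("acgt\nryn-"))
```

Digits fail the check because they are not in the code table, which matches the "no numerical digits" rule.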
Standard IUB/IUPAC amino acid and nucleic acid codes
The accepted amino acid codes are:
A  alanine                   P  proline
B  aspartate or asparagine   Q  glutamine
C  cysteine                  R  arginine
D  aspartate                 S  serine
E  glutamate                 T  threonine
F  phenylalanine             U  selenocysteine
G  glycine                   V  valine
H  histidine                 W  tryptophan
I  isoleucine                Y  tyrosine
K  lysine                    Z  glutamate or glutamine
L  leucine                   X  any
M  methionine                *  translation stop
N  asparagine                -  gap of indeterminate length
FASTA format
Multiple sequences
Blank lines inserted
> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT
> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA
> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
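A multi-sequence FASTA file like the one above can be read by treating every line that starts with `>` as the start of a new record. The following is a minimal sketch; the function name `parse_fasta` is hypothetical, and skipping blank lines and upper-casing the residues are assumptions based on the rules stated earlier.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA text holding one or more records."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between or inside records
        if line.startswith(">"):
            name = line[1:].strip()  # description line without the '>'
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()  # sequence may span several lines
    return records


example = """> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATCGATGCAT
> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
"""
records = parse_fasta(example.splitlines())
print(list(records))  # ['mysequence', 'mysequence2']
```

A dictionary keeps the sketch short; real files can repeat a description, in which case a list of (description, sequence) pairs would be safer.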
GenBank File Format
Sequence Data Entries
First line
Begins with ‘LOCUS’ in the first 5 spaces
Followed by genetic locus name or identifier
Length of the sequence
Type of sequence
Second line
DEFINITION in the first 10 spaces
Followed by free form text to identify the sequence.
Third line
ACCESSION in the first 9 spaces
Spaces 13 - 18 must hold the primary accession number
Fourth line
ORIGIN in the first 6 spaces
Nothing else is required on this line; it indicates that the nucleic acid
sequence begins on the next line.
GenBank File Format
Fifth line
Begins the nucleotide sequence.
The first 9 spaces of each sequence line may either be blank
or may contain the position in the sequence of the first
nucleotide on the line.
The next 66 spaces hold the nucleotide sequence in six
blocks of ten nucleotides.
Each of the six blocks begins with a blank space followed by
ten nucleotides.
Thus the first nucleotide is in space 11 of the line while the
last is in space 75.
Last line
Must have // in the first 2 spaces to indicate termination of the
sequence.
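The sequence-line layout described above (nine columns for the position, then six blocks of a blank plus ten nucleotides, closed by `//`) can be sketched as a formatter. This is an illustration under those stated rules only; the function name `genbank_sequence_lines` is hypothetical, and right-justifying the position within its nine columns is an assumption.

```python
from typing import List


def genbank_sequence_lines(seq: str) -> List[str]:
    """Lay out a nucleotide sequence as GenBank-style sequence lines, ending with //."""
    lines = []
    for start in range(0, len(seq), 60):  # 60 nucleotides per line
        chunk = seq[start:start + 60]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        # Columns 1-9: position of the first nucleotide on the line.
        # Each block is preceded by one blank, so nucleotides run from column 11 to 75.
        lines.append(f"{start + 1:>9}" + "".join(" " + block for block in blocks))
    lines.append("//")  # terminator in the first two columns
    return lines


for line in genbank_sequence_lines("ACGT" * 20):
    print(line)
```

A full line is 75 characters long, which agrees with the statement that the last nucleotide sits in space 75.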
GenBank File Format
misc_feature 187..921
/note="TIM; Region: Triosephosphate isomerase"
First line
Begins with a greater than symbol (>)
Immediately followed by 2 character sequence type specifier
Specifier Sequence type
P1 protein, complete
F1 protein, fragment
DL DNA, linear
DC DNA, circular
RL RNA, linear
RC RNA, circular
N1 functional RNA, other than tRNA
N3 tRNA
Then a semicolon (;)
Followed by sequence name or identification code for the NBRF
database
Four to six letters and numbers
>P1;CBRT
National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format
Second line
Organism or organelle name
>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)
National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format
Protein sequences
May contain special punctuation to indicate various
indeterminacies in the sequence
The last character in the sequence must be an
asterisk (*).
NBRF/PIR Example
LINE 1 :>P1;CBRT
LINE 2 :Cytochrome b - Rat mitochondrion (SGC1)
LINE 3 :M T N I R K S H P L F K I I N H S F I D L P A P S
LINE 4 : VTHICRDVN Y GWL IRY
LINE 5 :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
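The three-part structure of an NBRF/PIR entry (type specifier and name, description line, sequence ending in `*`) can be sketched as a parser. This is a minimal sketch assuming one entry per input string; the names `SEQUENCE_TYPES` and `parse_nbrf` are hypothetical, and any punctuation inside the sequence is kept untouched because its meaning is not spelled out here.

```python
# Two-character specifiers from the table above.
SEQUENCE_TYPES = {
    "P1": "protein, complete", "F1": "protein, fragment",
    "DL": "DNA, linear", "DC": "DNA, circular",
    "RL": "RNA, linear", "RC": "RNA, circular",
    "N1": "functional RNA, other than tRNA", "N3": "tRNA",
}


def parse_nbrf(text: str) -> dict:
    """Split a single NBRF/PIR entry into type, name, description and sequence."""
    lines = text.strip().splitlines()
    if len(lines) < 3:
        raise ValueError("expected a header, a description and a sequence")
    header, description = lines[0].strip(), lines[1].strip()
    if not header.startswith(">") or ";" not in header:
        raise ValueError("first line must look like >XX;NAME")
    type_code, name = header[1:].split(";", 1)
    if type_code not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type specifier {type_code!r}")
    # Spaces inside sequence lines are layout only; remove them and join the lines.
    body = "".join("".join(line.split()) for line in lines[2:])
    if not body.endswith("*"):
        raise ValueError("sequence must end with an asterisk (*)")
    return {
        "type": SEQUENCE_TYPES[type_code],
        "name": name,
        "description": description,
        "sequence": body[:-1],  # drop the terminating asterisk
    }


entry = """>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)
M T N I R K S H P L F K I I N H S F I D L P A P S
 VTHICRDVN Y GWL IRY
TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
"""
print(parse_nbrf(entry)["name"])  # CBRT
```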
MolGen/Stanford File Format
Third line
Sequence begins
Occupies up to 80 spaces
Spaces may be included in the sequence for
ease of reading.
Terminated with 1 or 2
1 indicates a linear sequence
2 marks a circular sequence
MolGen/Stanford Example
LINE 1 :; Describe the sequence any way you want
LINE 2 :ECTRNAGLY2
LINE 3 :ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
LINE 4 : GCTTA GG G C T A1
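The example above can be read with a short parser. This is only a sketch: the roles of the first two lines (a comment starting with `;`, then the sequence name) are inferred from the example rather than stated as rules, and the function name `parse_stanford` is hypothetical.

```python
def parse_stanford(text: str) -> dict:
    """Parse one MolGen/Stanford entry: comment line(s), name, sequence ending in 1 or 2."""
    lines = [line for line in text.splitlines() if line.strip()]
    comments = []
    # Assumption from the example: leading lines that start with ';' are comments.
    while lines and lines[0].lstrip().startswith(";"):
        comments.append(lines.pop(0).lstrip()[1:].strip())
    if not lines:
        raise ValueError("missing sequence name line")
    name = lines.pop(0).strip()
    # Spaces in the sequence are for readability only, so remove them.
    body = "".join("".join(line.split()) for line in lines)
    if not body or body[-1] not in ("1", "2"):
        raise ValueError("sequence must end with 1 (linear) or 2 (circular)")
    return {
        "comments": comments,
        "name": name,
        "sequence": body[:-1],
        "topology": "linear" if body[-1] == "1" else "circular",
    }


entry = """; Describe the sequence any way you want
ECTRNAGLY2
ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
 GCTTA GG G C T A1
"""
print(parse_stanford(entry)["topology"])  # linear
```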
PHYLIP File Format
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
…
PDB File Format
Where:
RTyp: Record Type
Num: Serial number of the atom. Each atom has a unique serial number.
Atm: Atom name (IUPAC format).
Res: Residue name (IUPAC format).
Ch: Chain to which the atom belongs (in this case, L for light chain
of an antibody).
ResN: Residue sequence number.
X, Y, Z: Cartesian coordinates specifying atomic position in space.
Occ: Occupancy factor
Temp: Temperature factor (atoms disordered in the crystal have high
temperature factors).
PDB: The PDB data file unique identifier.
Line: Line (record) number in the data file.
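The fields listed above occupy fixed columns in an ATOM record, so slicing by column is more reliable than splitting on whitespace. The sketch below is an illustration, not a full parser; the column ranges are an assumption taken from the published PDB fixed-width layout rather than from this text, and the function name `parse_atom_record` is hypothetical. The sample line is assembled field by field so that each column width is visible.

```python
def parse_atom_record(line: str) -> dict:
    """Slice one fixed-width ATOM record into the fields described above."""
    return {
        "record_type": line[0:6].strip(),          # RTyp, columns 1-6
        "serial": int(line[6:11]),                 # Num,  columns 7-11
        "atom_name": line[12:16].strip(),          # Atm,  columns 13-16
        "residue_name": line[17:20].strip(),       # Res,  columns 18-20
        "chain": line[21],                         # Ch,   column 22
        "residue_number": int(line[22:26]),        # ResN, columns 23-26
        "x": float(line[30:38]),                   # X,    columns 31-38
        "y": float(line[38:46]),                   # Y,    columns 39-46
        "z": float(line[46:54]),                   # Z,    columns 47-54
        "occupancy": float(line[54:60]),           # Occ,  columns 55-60
        "temperature_factor": float(line[60:66]),  # Temp, columns 61-66
    }


# Sample record built from fixed-width pieces (widths 6, 5, 1, 4, 1, 3, 1, 1, 4, 4, 8, 8, 8, 6, 6).
sample = (
    "ATOM  " "    1" " " " N  " " " "ASP" " " "A" "  35" "    "
    "  34.731" "  -5.686" "  15.000" "  1.00" " 98.44"
)
atom = parse_atom_record(sample)
print(atom["atom_name"], atom["residue_name"], atom["chain"], atom["residue_number"])
```

Column slicing still works when adjacent numeric fields run together with no space between them, which whitespace splitting cannot handle.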
PDB Example
HEADER LYASE 06-JUL-99 1QU4
TITLE CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE
TITLE 2 DECARBOXYLASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ORNITHINE DECARBOXYLASE;
COMPND 3 CHAIN: A, B, C, D;
COMPND 4 EC: 4.1.1.17;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI;
SOURCE 3 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 4 EXPRESSION_SYSTEM_COMMON: BACTERIA;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: B21/DG3;
SOURCE 6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID
KEYWDS POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA
KEYWDS 2 BARREL, LYASE
EXPDTA X-RAY DIFFRACTION
AUTHOR N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
AUTHOR 2 E.J.GOLDSMITH
REVDAT 2 29-DEC-99 1QU4 1 JRNL COMPND REMARK
REVDAT 1 17-NOV-99 1QU4 0
JRNL AUTH N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
JRNL AUTH 2 E.J.GOLDSMITH
JRNL TITL X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM
JRNL TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE
JRNL TITL 3 STRUCTURE IN COMPLEX WITH
JRNL TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE
JRNL REF BIOCHEMISTRY V. 38 15174 1999
JRNL REFN ASTM BICHAW US ISSN 0006-2960
REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.90 ANGSTROMS.
REMARK …
DBREF 1QU4 A 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 B 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 C 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 D 1 425 SWS P07805 DCOR_TRYBB 21 445
SEQRES 1 A 425 GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES 2 A 425 ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES 3 A 425 LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES 4 A 425 PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES 5 A 425 GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES 6 A 425 TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES 7 A 425 THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES 8 A 425 ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES 9 A 425 PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES 10 A 425 SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES 11 A 425 MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES 12 A 425 LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES 13 A 425 THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES 14 A 425 PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES 15 A 425 GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES 16 A 425 PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES 17 A 425 ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES 18 A 425 GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES 19 A 425 GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES 20 A 425 PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES 21 A 425 LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES 22 A 425 GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES 23 A 425 ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES 24 A 425 GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES 25 A 425 SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES 26 A 425 PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES 27 A 425 LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES 28 A 425 PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES 29 A 425 GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES 30 A 425 GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES 31 A 425 VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES 32 A 425 THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES 33 A 425 VAL ARG GLU LEU LYS SER GLN LYS SER
HET PLP A 600 15
HET PLP B 600 15
HET PLP C 600 15
HET PLP D 600 15
HETNAM PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN PLP VITAMIN B6 COMPLEX
FORMUL 5 PLP 4(C8 H10 N1 O6 P1)
HELIX 1 1 LEU A 45 LEU A 59 1 15
HELIX 2 2 LYS A 69 ASN A 71 5 3
HELIX 3 3 ASP A 73 GLY A 84 1 12
HELIX 4 4 SER A 91 ILE A 101 1 11
HELIX 5 5 PRO A 104 GLU A 106 5 3
HELIX 6 6 GLN A 116 SER A 126 1 11
HELIX 7 7 CYS A 135 HIS A 146 1 12
HELIX 8 8 LYS A 173 GLU A 175 5 3
HELIX 9 9 ASP A 176 LEU A 187 1 12
HELIX 10 10 ALA A 205 LEU A 225 1 21
HELIX 11 11 LYS A 247 PHE A 263 1 17
HELIX 12 12 GLY A 276 ALA A 281 1 6
HELIX 13 13 PHE A 326 HIS A 333 1 8
HELIX 14 14 THR A 390 THR A 394 5 5
HELIX 15 15 SER A 396 PHE A 400 5 5
SHEET 1 A 6 GLN A 365 PRO A 373 0
SHEET 2 A 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A 6 PHE A 40 ASP A 44 -1 O PHE A 40 N ALA A 287
SHEET 6 A 6 THR A 404 VAL A 408 1 O THR A 404 N PHE A 41
SHEET 1 A1 6 GLN A 365 PRO A 373 0
SHEET 2 A1 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A1 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A1 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A1 6 TRP A 380 PHE A 383 -1 N LEU A 381 O VAL A 288
SHEET 6 A1 6 PRO A 338 PRO A 340 -1 O LEU A 339 N LEU A 382
CRYST1 66.800 151.700 85.350 90.00 102.30 90.00 P 1 21 1 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.014970 0.000000 0.003264 0.00000
SCALE2 0.000000 0.006592 0.000000 0.00000
SCALE3 0.000000 0.000000 0.011992 0.00000
ATOM 1 N ASP A 35 34.731 -5.686 15.000 1.00 98.44 N
ATOM 2 CA ASP A 35 34.249 -5.884 13.629 1.00 98.39 C
ATOM 3 C ASP A 35 33.320 -4.750 13.203 1.00 98.13 C
ATOM 4 O ASP A 35 33.474 -3.594 13.603 1.00 98.29 O
ATOM 5 CB ASP A 35 33.558 -7.247 13.545 1.00 98.38 C
ATOM 6 CG ASP A 35 33.566 -7.887 12.170 1.00 98.36 C
ATOM 7 OD1 ASP A 35 33.717 -9.133 12.114 1.00 98.26 O
ATOM 8 OD2 ASP A 35 33.419 -7.182 11.148 1.00 98.39 O
ATOM 9 N GLU A 36 32.332 -5.073 12.378 1.00 97.79 N
ATOM 10 CA GLU A 36 31.446 -4.080 11.787 1.00 95.51 C
ATOM 11 C GLU A 36 32.259 -2.944 11.199 1.00 90.65 C
ATOM 12 O GLU A 36 32.220 -1.813 11.692 1.00 94.96 O
ATOM 13 CB GLU A 36 30.419 -3.638 12.840 1.00 97.63 C
ATOM 14 CG GLU A 36 29.111 -3.155 12.261 1.00 98.19 C
ATOM 15 CD GLU A 36 27.791 -3.597 12.824 1.00 98.33 C
ATOM 16 OE1 GLU A 36 27.308 -4.727 12.601 1.00 98.28 O
ATOM 17 OE2 GLU A 36 27.115 -2.806 13.520 1.00 98.43 O
ATOM 18 N GLY A 37 33.018 -3.192 10.131 1.00 52.86 N
ATOM 19 CA GLY A 37 33.624 -2.167 9.299 1.00 39.88 C
ATOM 20 C GLY A 37 32.598 -1.167 8.712 1.00 34.34 C
ATOM 21 O GLY A 37 32.236 -1.162 7.531 1.00 31.44 O
ATOM 22 N ASP A 38 32.135 -0.248 9.564 1.00 37.23 N
ATOM 23 CA ASP A 38 31.136 0.700 9.138 1.00 36.44 C
ATOM 24 C ASP A 38 31.794 1.722 8.228 1.00 33.49 C
ATOM 25 O ASP A 38 33.029 1.896 8.156 1.00 34.06 O
ATOM 26 CB ASP A 38 30.500 1.242 10.405 1.00 42.06 C
ATOM 27 CG ASP A 38 29.583 0.207 11.047 1.00 44.59 C
ATOM 28 OD1 ASP A 38 29.408 -0.876 10.434 1.00 45.72 O
ATOM 38 CA PHE A 40 32.728 6.727 7.615 1.00 20.51 C
...
CONECT1117911177
CONECT1118011177
MASTER 482 0 4 60 80 0 0 611176 4 64 132
END
Conversion of Sequence Formats
seqret (EMBOSS)
gcg GCG 9.x and 10.x format
embl
swiss
fasta
genbank
nbrf
pir NBRF (PIR)
codata CODATA format.
strider DNA strider format
clustal
phylip PHYLIP non-interleaved multiple alignment format.
acedb ACeDB format
msf Wisconsin Package GCG's MSF multiple sequence format.
hennig86 Hennig86 format
jackknifer Jackknifer format
jackknifernon Jackknifernon format
nexus
paup Nexus/PAUP format
treecon Treecon format
mega Mega format
ig IntelliGenetics format.
staden
text
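A conversion with seqret can be scripted by building the command line and handing it to the operating system. The sketch below assumes that EMBOSS is installed and that `seqret` accepts the `-sequence`, `-outseq` and `-osformat` qualifiers; check these against the local EMBOSS documentation before relying on them. The helper names `seqret_command` and `convert` and the file names are hypothetical.

```python
import subprocess
from typing import List

# Output format names copied from the list above.
SEQRET_FORMATS = {
    "gcg", "embl", "swiss", "fasta", "genbank", "nbrf", "pir", "codata",
    "strider", "clustal", "phylip", "acedb", "msf", "hennig86", "jackknifer",
    "jackknifernon", "nexus", "paup", "treecon", "mega", "ig", "staden", "text",
}


def seqret_command(infile: str, outfile: str, out_format: str) -> List[str]:
    """Build the argument list for a seqret conversion (qualifier names are assumed)."""
    if out_format not in SEQRET_FORMATS:
        raise ValueError(f"unknown output format {out_format!r}")
    return ["seqret", "-sequence", infile, "-outseq", outfile, "-osformat", out_format]


def convert(infile: str, outfile: str, out_format: str) -> None:
    """Run seqret; raises CalledProcessError if the program reports a failure."""
    subprocess.run(seqret_command(infile, outfile, out_format), check=True)


print(" ".join(seqret_command("in.fasta", "out.gb", "genbank")))
```

Keeping the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.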
Conversion of Sequence Formats
Web-based
WWW READSEQ Sequence Conversion at NIH
http://bimas.cit.nih.gov/molbio/readseq/
WWW READSEQ at Human Genome Mapping Project (HGMP) Center
http://www.hgmp.mrc.ac.uk/Registered/Webapp/readseq/
Conversion of Sequence Formats
Windows-based program
SeqVerter
Downloadable from GeneStudio, Inc. (free)
http://www.genestudio.com/seqverter.htm