
Fundamentals in Sequence Analysis 1.

(part 1)
Review of Basic biology + database searching in Biology.

Hugues Sicotte NCBI

The Flow of Biotechnology Information


Gene Function

> DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA

> Protein sequence MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE PDEAEQDCIEFGKKIANI

Prerequisites to Sequence Analysis


Basic biology, so that you can understand the language of the databases: the Central Dogma (transcription, translation, prokaryotes, eukaryotes, CDS, 3' UTR, 5' UTR, introns, exons, promoters, operons, codons, start codons, stop codons, snRNA, hnRNA, tRNA, secondary structure, tertiary structure). Before you can analyze sequences, you have to understand their structure. You also need to know about basic biological database searching.

Central Dogmas of Molecular Biology


1) The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix with a backbone of deoxyribose sugars and phosphates, carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). [Note that only 4 values need to be encoded (A, C, G, T), which can be done using 2 bits. But to allow ambiguity codes (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
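The note on the 4-bit alphabet can be illustrated with a small sketch. It assumes one bit per base, so that an ambiguity code is simply the OR of the bases it allows; the particular bit assignment and the helper names `encode` and `compatible` are choices made for this example, not something the slides specify.

```python
# One bit per base, so an ambiguity code is the OR of the bases it allows.
# The bit assignment is an assumption made for this sketch.
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# A few IUPAC ambiguity symbols and the bases they stand for.
AMBIGUITY = {
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any base
}


def encode(symbol):
    """Return the 4-bit code for a base or an ambiguity symbol."""
    if symbol in BASE_BITS:
        return BASE_BITS[symbol]
    code = 0
    for base in AMBIGUITY[symbol]:
        code |= BASE_BITS[base]
    return code


def compatible(a, b):
    """Two symbols can match if they share at least one allowed base."""
    return (encode(a) & encode(b)) != 0


print(encode("N"))           # all four bits set
print(compatible("R", "G"))  # G is a purine
print(compatible("Y", "A"))  # A is not a pyrimidine
```

With 2 bits there would be no room left for N, which is why the wider alphabet is the usual choice.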

Central Dogmas of Molecular Biology


3) Each base on one side of the double helix faces its complementary base on the other: A pairs with T, and G pairs with C.
4) The biochemical processes that read off the DNA (replication and transcription) always read it from the 5' side towards the 3' side.
5) A gene can be located on either the plus strand or the minus strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g. if the sequence on the + strand is ACGTGATCGATGCTA, the minus strand must be read off by reading the complement of this sequence going backwards, i.e. TAGCATCGATCACGT.
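Rules 3 to 5 amount to the reverse-complement operation. A minimal sketch, assuming an uppercase sequence containing only A, C, G and T (the function name is chosen for this example):

```python
# Complement each base (rule 3), then reverse so the result reads 5' to 3' (rule 4).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# The example from rule 5 above.
print(reverse_complement("ACGTGATCGATGCTA"))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.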

Central Dogmas of Molecular Biology


6) DNA information is copied over to mRNA that acts as a template to produce proteins.

We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules. (But let's not forget the various RNA genes.)

Prokaryotic genes
Prokaryotes (intronless protein-coding genes)

[Figure: DNA with the upstream (5') promoter, the gene region and the downstream (3') side; the DNA strand shown reads TAC where the mRNA reads ATG.]

Transcription: the gene is encoded on the minus strand, and the reverse complement is read into mRNA.

[Figure: mRNA laid out as 5' UTR, CoDing Sequence (CDS) starting at ATG, and 3' UTR.]

Translation: tRNAs read off the codons, 3 bases at a time, starting at the start codon and continuing until a STOP codon is reached. The result is the protein.

Why does Nature bother with the mRNA?


Why would the cell want to have an intermediate between DNA and the proteins it encodes? Gene information can be amplified by having many copies of an RNA made from one copy of DNA. Regulation of gene expression can be effected by having specific controls at each element of the pathway between DNA and proteins. The more elements there are in the pathway, the more opportunities there are to control it in different circumstances. In Eukaryotes, the DNA can then stay pristine and protected, away from the caustic chemistry of the cytoplasm.

Prokaryotic genes (operons)


Prokaryotes (operon structure)
[Figure: a single upstream promoter followed, going downstream, by Gene 1, Gene 2 and Gene 3.]

In prokaryotes, genes that are part of the same operational pathway are sometimes grouped together under a single promoter. They are then transcribed together as one polycistronic mRNA, from which each gene (3 in this example) is translated separately.

Bacterial Gene Structure of signals

Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
  -35 sequence (T82 T84 G78 A65 C54 A45; the numbers give the percent conservation of each base), followed by 15-20 bases
  -10 sequence (T80 A95 T45 A60 A50 T96), followed by 5-9 bases
- Start of transcription: the initiation start is a purine in 90% of cases (sometimes it is the A in CAT)
- Translation binding site (Shine-Dalgarno), 10 bp upstream of AUG (AGGAGG)
- One or more Open Reading Frames: from the start codon (unless the sequence is partial) until the next in-frame stop codon on that strand, separated by intercistronic sequences.
- Termination

Genetic Code
How does an mRNA specify an amino acid sequence? The answer lies in the genetic code. It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. Similarly, two-nucleotide combinations could only specify 16 amino acids. Three nucleotides give 64 combinations, which is enough: each amino acid is specified by a particular combination of three nucleotides, called a codon.
Each group of 3 nucleotides codes for one amino acid.
The first codon is the start codon, and usually codes for the amino acid Methionine (M, which has the codon ATG).
The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by * to indicate the STOP codon.
A coding region (abbreviation CDS) starts at the START codon and ends at the STOP codon.
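Reading a CDS codon by codon can be sketched as follows. The table is the standard genetic code written in the compact TCAG ordering; the function name `translate` and the decision to stop silently at the first stop codon are choices made for this sketch. The test sequence is the start of the DNA example shown at the beginning of these slides, taken from its ATG.

```python
# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(cds):
    """Translate an uppercase DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # STOP codon: no amino acid is added
            break
        protein.append(amino_acid)
    return "".join(protein)


# Start of the example DNA sequence, beginning at its ATG.
print(translate("ATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAG"))  # MKIVYWSGTGNTE
```

The output matches the first residues of the protein sequence shown next to that DNA example.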

Codon table
Note the degeneracy of the genetic code. Each amino acid might have up to six codons that specify it. Different organisms have different frequencies of codon usage. A handful of species vary from the codon association described above, and use different codons for different amino acids. How do tRNAs recognize to which codon to bring an amino acid? The tRNA has an anticodon on its mRNA-binding end that is complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
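The degeneracy can be checked by counting codons per amino acid. This sketch assumes the standard genetic code (species with variant codes would need a different table) and uses names chosen for the example.

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Number of codons that specify each amino acid ("*" is the stop signal).
CODONS_PER_AMINO_ACID = Counter(CODON_TABLE.values())

most_degenerate = sorted(
    aa for aa, n in CODONS_PER_AMINO_ACID.items() if n == 6
)
print(most_degenerate)              # amino acids with six codons
print(CODONS_PER_AMINO_ACID["M"])   # methionine has a single codon
```

Leucine, arginine and serine are the six-codon cases; methionine and tryptophan have one codon each.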

RNA

RNA has the same primary structure as DNA. It consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of the sugar. The differences between DNA and RNA are that:
1. RNA has a hydroxyl group on the 2' carbon of the sugar (thus, the difference between deoxyribonucleic acid and ribonucleic acid).
2. Instead of using the base thymine, RNA uses another base called uracil.
3. Because of the extra hydroxyl group on the sugar, RNA does not readily form the long, regular double helix that DNA does, and it usually exists as a single-stranded molecule. However, regions of double helix can form where there is some base pair complementation (U and A, G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure.
4. Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.

tRNA ( transfer RNA)


tRNA is a small RNA that has a very specific secondary and tertiary structure, such that it can bind an amino acid at one end and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA.

Secondary structure of tRNA

Three-dimensional tertiary structure

Bacterial Gene Prediction

Most of the consensus sequences are known from E. coli studies, so for each bacterium the exact distribution of the consensus will change.
Most modern gene prediction programs need to be trained, i.e. they find their own consensus and assembly rules given a few example genes. A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates. E.g. the selfid program (selfid@igs.cnrs-mrs.fr)

Open Reading Frames


The simplest bacterial gene prediction techniques simply:
1) identify all open reading frames (ORFs),
2) blastx them against known proteins, and
3) retain first the ORFs with the best homology.
4) This usually densely covers the bacterial genome with genes.
rRNA and tRNA are detected separately using tRNAscan or blastn.

Open Reading Frames (ORF)


On a given piece of DNA, there can be 6 possible frames. The ORF can be on either the plus or the minus strand, and in any of 3 possible frames.
Frame 1: the 1st base of the start codon can be at base 1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon can be at base 2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon can be at base 3, 6, 9, 12, ...
(Frames -1, -2, -3 are on the minus strand.) Some programs have other conventions for naming frames (0..5, 1-6, etc.). Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.

http://www.ncbi.nlm.nih.gov/gorf/gorf.html
try with gi=41 ( or your own piece of DNA)
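The six-frame scan can be sketched in a few lines. This is a simplified illustration, not the algorithm of the NCBI ORF finder: it assumes ATG as the only start codon, reports only ORFs closed by a stop codon, and numbers coordinates from the 5' end of whichever strand is being read. The names `find_orfs` and `min_codons` are chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (frame, start, end, orf) for every ATG...stop ORF in six frames.

    Frames are 1, 2, 3 on the plus strand and -1, -2, -3 on the minus strand.
    start and end are 1-based, inclusive, and counted on the strand being read.
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        frame = sign * (offset + 1)
                        orfs.append((frame, start + 1, i + 3, strand[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

In the toy sequence the ATG sits at base 3, so the ORF is reported in frame 3; feeding in the reverse complement of the same sequence reports it in frame -3.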

Eukaryotic Central Dogma


In Eukaryotes (cells where the DNA is sequestered in a separate nucleus), the DNA does not contain a contiguous copy of the coding sequence; rather, exons must be spliced together. (Many eukaryotic genes contain no introns! This is particularly true in lower organisms.)
mRNA (messenger RNA) contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm, where the ribosomes can make it into protein.

Eukaryotic Nuclear Gene Structure
Gene prediction for Pol II transcribed genes.
Upstream enhancer elements.
Upstream promoter elements:
  GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
  TATA promoter (-30 nt) (70%, 15 nt consensus (Bucher et al. (1990)))
  14-20 nt spacer DNA
  CAP site (8 bp)
Transcription initiation.
Transcript region, interrupted by introns.
Translation initiation (Kozak signal, 12 bp consensus), 6 bp prior to the initiation codon.
polyA signal (AATAAA 99%, other)

introns
Transcript region, interrupted by introns. Each intron:
starts with a donor site consensus (G100 T100 A62 A68 G84 T63 ...)
has a branch site near the 3' end of the intron (one not very conserved consensus: UACUAAC)
ends with an acceptor site consensus (12 Py .. N C65 A100 G100)

[Figure: intron with the branch site UACUAAC and the acceptor AG marked.]

Exons
The exons of the transcript region are composed of:
5' UTR (mean length of 769 bp), with a specific base composition that depends on the local G+C content of the genome
AUG (or other start codon)
Remainder of coding region
Stop codon
3' UTR (mean length of 457 bp), with a specific base composition that depends on the local G+C content of the genome

Structure of the Eukaryotic Genome


~6-12% of human DNA encodes proteins(higher fraction in nematode)

~10% of human DNA codes for UTR


~90% of human DNA is noncoding.

Non-Coding Eukaryotic DNA

Untranslated regions (UTRs)
introns (there can be genes within introns of another gene!)
intergenic regions:
- repetitive elements
- pseudogenes (dead genes that may (or may not) have been retroposed back in the genome as a single-exon gene)

Pseudogenes
Pseudogenes: DNA sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of promoter, for example), in translation, or both.
Processed pseudogenes: gene retroposed back in the genome after being processed by the splicing apparatus. Thus it is fully spliced and has a polyA tail. The insertion process flanks the mRNA sequence with short direct repeats. Thus there is no promoter, unless the gene is accidentally retroposed downstream of a promoter sequence. Do not confuse with single-exon genes.

Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site.
- Retroposons (can get copied back into the genome).
- Telltale sign: direct or inverted repeats flank the repeated element. That repeat was the priming site for the RNA that was inserted.
- LINEs (Long INterspersed Elements): L1 is 1-7 kb long, 50,000 copies. They have two ORFs! This will cause problems for gene prediction programs.
- SINEs (Short INterspersed Elements)

Low-Complexity Elements
When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation! Sequences like ATATATACTTATATA, which are mostly two letters, are called low-complexity. Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified. The low-complexity sequence can also be hidden at the translated protein level.
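One simple way to put a number on "low-complexity" is the Shannon entropy of the base composition in a window. This is only an illustrative measure, not the algorithm used by any particular masking program; the window size and threshold below are arbitrary assumptions, as are the function names.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits of the letter composition (0 for one letter, 2 for equal ACGT)."""
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq, window=12, threshold=1.5):
    """Return 0-based start positions of windows whose entropy is below threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


print(round(shannon_entropy("ATATATACTTATATA"), 2))  # mostly two letters
print(shannon_entropy("ACGTACGT"))                   # all four letters equally
print(low_complexity_windows("ATATATACTTATATA"))
```

The example sequence from the text scores well below the 2 bits of a balanced sequence, so every window of it is flagged.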

Masking
To avoid finding spurious matches in alignment programs, you should always mask out the low-complexity regions of the query sequence. Before predicting genes, it is a good idea to mask out repeats (at least those containing ORFs). Before running blastn against a genomic record, you must mask out the repeats. Most used programs:

CENSOR:
Repeat Masker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

More Non-Protein genes


rRNA - ribosomal RNA is one of the structural components of the ribosome. It has sequence complementarity to regions of the mRNA so that the ribosome knows where to bind to an mRNA it needs to make protein from.
snRNA - small nuclear RNA is involved in the machinery that processes RNAs (splicing) before they travel from the nucleus to the cytoplasm.
hnRNA - heterogeneous nuclear RNA: the unprocessed primary transcripts (pre-mRNA) found in the nucleus.

Protein Processing & localization.

The protein as read off from the mRNA may not be in the final form that will be used in the cell.
Some proteins contain a signal peptide, located at the N-terminus (beginning). This signal peptide is used to guide the protein towards its final cellular localization. It is cleaved out at the cleavage site once the protein has reached (or is near) its final destination.
Various post-translational modifications (e.g. phosphorylation) may also be applied.
The final protein is called the mature peptide.

Convention for nucleotides in database


Because the mRNA is actually read off the minus strand of the DNA, nucleotide sequences are always quoted on the minus strand.
In bioinformatics, the sequence format does NOT make a difference between Uracil and Thymine. There is no symbol for Uracil: it is always represented by a T.
Even genomic sequence follows that convention. A gene on the plus strand is quoted so that it is in the same strand as its product mRNA.

Biology Information on the Internet



Introduction to Databases Searching the Internet for Biology Information.
General Search methods Biology Web sites

Introduction to Genbank file format. Introduction to Entrez and Pubmed Ref: Chapters 1,2,5,6 of Bioinformatics

Databases:
A collection of records.
Each record has many fields.
Each field contains specific information.
Each field has a data type, e.g. money, currency, text field, integer, date, address (text field), citation (text field).
Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.
Spreadsheet (flat-file) version of a database:

gi       Accession  version  date      Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        06/01/00  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        10/12/99  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        02/04/99  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/96  MAM               9913   bos taurus    29+X+Y

gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y

Gi = Genbank Identifier: Unique Key: Primary Key

GI Changes with each update of the sequence record.


Accession Number: Secondary key: Points to same locus and sequence despite sequence updates.

Accession + Version Number equivalent to Gi


Relational Database (normalizing a database for repeated subelements: splitting it into smaller databases, and relating the sub-databases to the first one using the primary key).

gi       Accession  version  date        Genbank Division  taxid
6226959  NM_000014  3        01/06/2000  PRI               9606
6226762  NM_000014  2        12/10/1999  PRI               9606
4557224  NM_000014  1        04/02/1999  PRI               9606
41       X63129     1        06/06/1996  MAM               9913

taxid  organism      Number of Chromosomes
9606   homo sapiens  22 diploid + X+Y
9913   bos taurus    29+X+Y
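The split into two tables can be tried out with Python's built-in sqlite3 module. This is a sketch only: the table and column names are invented for the example, the date column is left out for brevity, and an in-memory SQLite database stands in for whatever database system is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxid       INTEGER PRIMARY KEY,
        organism    TEXT,
        chromosomes TEXT
    );
    CREATE TABLE sequence (
        gi        INTEGER PRIMARY KEY,
        accession TEXT,
        version   INTEGER,
        division  TEXT,
        taxid     INTEGER REFERENCES taxon(taxid)
    );
    """
)
# The organism data is stored once per taxid instead of once per sequence.
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?)",
    [(9606, "homo sapiens", "22 diploid + X+Y"), (9913, "bos taurus", "29+X+Y")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?, ?)",
    [
        (6226959, "NM_000014", 3, "PRI", 9606),
        (6226762, "NM_000014", 2, "PRI", 9606),
        (4557224, "NM_000014", 1, "PRI", 9606),
        (41, "X63129", 1, "MAM", 9913),
    ],
)


def organism_for_gi(gi):
    """Join the two tables on taxid to recover the flat-file view of one record."""
    return conn.execute(
        "SELECT t.organism, t.chromosomes "
        "FROM sequence AS s JOIN taxon AS t ON s.taxid = t.taxid "
        "WHERE s.gi = ?",
        (gi,),
    ).fetchone()


print(organism_for_gi(41))
```

The join rebuilds the spreadsheet row on demand, so the chromosome count for human is written once rather than three times.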

Types of Relational databases.


The Internet can be thought of as one enormous relational database.
The links/URL are the primary keys.

SQL (Structured Query Language)


Sybase; Oracle ; Access; (Databases systems)
Sybase used at NCBI.

SRS(One type of database querying system of use in Biology)

Indexed searches.
To allow easy searching of a database, make an index. An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields)
Genbank division
PRI        6226959; 6226762; 4557224
MAM        41
Accession
NM_000014  6226959; 6226762; 4557224
X63129     41
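Building such an index is a single pass over the records. A minimal sketch, using a list of dictionaries as a stand-in for the database and a helper name (`build_index`) chosen for this example:

```python
RECORDS = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]


def build_index(records, field):
    """Map each value of `field` to the list of primary keys (gi) that have it."""
    index = {}
    for record in records:
        index.setdefault(record[field], []).append(record["gi"])
    return index


division_index = build_index(RECORDS, "division")
accession_index = build_index(RECORDS, "accession")
print(division_index["MAM"])
print(accession_index["NM_000014"])
```

A lookup is then a dictionary access instead of a scan through every record.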

Indexed searches.
Boolean Query: Merging and Intersecting lists:
AND (in both lists) (e.g. human AND genome)
+human +genome
human && genome

OR (in either list) (e.g. human OR genome)
human || genome
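With an index in hand, AND and OR reduce to set intersection and union on the lists of primary keys. The document numbers below are made up for illustration, and the function names are chosen for this sketch; both functions expect at least one term.

```python
# Hypothetical index: search term -> set of document identifiers containing it.
TERM_INDEX = {
    "human": {1, 2, 3, 5},
    "genome": {2, 3, 4},
}


def query_and(index, *terms):
    """Documents found in ALL of the lists (intersection)."""
    return set.intersection(*(index.get(term, set()) for term in terms))


def query_or(index, *terms):
    """Documents found in ANY of the lists (union)."""
    return set().union(*(index.get(term, set()) for term in terms))


print(sorted(query_and(TERM_INDEX, "human", "genome")))
print(sorted(query_or(TERM_INDEX, "human", "genome")))
```

AND can only shrink the result and OR can only grow it, which is why the OR query finds the most matches.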

Search strategies
Search engines use complex strategies that go beyond Boolean queries.
Phrase matching:
human genome -> "human genome"

Togetherness: documents with human close to genome are scored higher.
Term expansion & synonyms:
human -> homo sapiens

Neighbours:
human genome -> genome projects, chromosomes, genetics

Frequency of links (www.google.com)

To avoid this term mapping, enclose your query terms in quotes: "human" AND "genome"

Search strategies
To require that ALL the terms in your query be important, precede them with a +. This also prevents term mapping.
To force the order of the words to be important, group phrases within quotes, e.g. "biology of mammals".

Indexed searches.
Example

Find the advanced query page at http://www.altavista.com
Type human (and hit the Search button)
Type genome
Type human AND genome
Type "human genome" (finds the fewest matches)
Type human OR genome (finds the most matches)

Search Engines:
Web Spiders: collect all web pages. Since web pages change all the time and new ones appear, spiders must constantly roam the web and re-index, or depend on people submitting their own pages.
www.google.com (BEST!)
www.infoseek.com
www.lycos.com
www.excite.com
www.webcrawler.com
www.looksmart.com (country specific)

Search Engines:
www.google.com (BEST!) Google ranks pages according to how many pages with those terms refer to the pages you are asking for. Not only must a document contain ALL the search terms, but other documents which refer to it must also contain all the terms. Great when you know what you are looking for! You can also use quotes to require immediate proximity and order of terms. E.g. type
"Web server for the blast program"

But Google only indexes about 40% of the web, so you may have to use other web spiders.

(Disclaimer: I don't own stock in that company, but I'd like to.)

Search Engines:
Curated Collections: Not comprehensive: Contains list of best sites for commonly requested topics, but is missing important sites for more specialized topics (like biology)
www.yahoo.com (Has travel maps too!)

Answer-based curated collections: easy to use, English-like queries. First looks at a list of predefined answers, then refines the answers based on user interaction. Also answers new questions.
www.askjeeves.com www.magellan.com www.altavista.com(has translation TOOLS) www.hotbot.com

Search Engines:
Meta-Search Engines: poll several search engines and return the consensus of all results. Likely to miss sites, but the sites returned are very relevant to the query. The other operating mode is to return the sum of all the results, which then becomes very sensitive to a very detailed query.
www.metacrawler.com www.savvysearch.com www.1blink.com (fast) www.metafind.com www.dogpile.com

Virtual Libraries: Curated collections of links for Biologists.(by Biologists)


Pedro's BioMolecular Research Tools (1996):
http://www.public.iastate.edu/~pedro/

Virtual Library: Bio Sciences


http://vlib.org/Biosciences.html

Publications and abstract search.


http://www.ncbi.nlm.nih.gov/

Expasy server
http://www.expasy.ch

EBI Biocatalog (software & databases list)


http://www.ebi.ac.uk/biocat/

Biological Databases
Nucleotide databases:
Genbank: International Collaboration
NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia). A bank: no curation. Submission to these databases is required for publication in a journal.

Organism-specific databases (Exercise: find the URLs using search engines)


FlyBase ChickGBASE pigbase wormpep YPD (Yeast Protein Database) SGD(Saccharomyces Genome Database)

Protein Databases:
NCBI: translated proteins from Genbank submissions.
Swiss Prot: free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. The 1998 version is free of any licensing.
http://www.expasy.ch (latest pay version). NCBI has the latest free version.

EMBL
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT

PIR

Structure databases:
PDB: Protein structure database.
http://www.rcsb.org/pdb/

MMDB: NCBI's version of PDB, with Entrez links.


http://www.ncbi.nlm.nih.gov

Genome Mapping Information:


http://www.il-st-acad-sci.org/health/genebase.html

NCBI(Human) Genome Centers:


Stanford, Washington University

Research Centers and Universities

Literature databases:
NCBI: Pubmed: all biomedical literature.
www.ncbi.nlm.nih.gov Abstracts and links to publisher sites for full text retrieval/ordering and journal browsing.

Publisher web sites. Biomednet: commercial site for literature search.

Pathways Database:
KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html

Database Identifiers: Primary keys


GI (changes with each sequence update for NCBI only)
Annotation may change without the gi changing!

Accession (stable)
Version (changes with each sequence update). Version also refers to Accession.version.
Secondary accession: records may have been merged in the past, so the records which were not chosen as the primary were made secondary.

Primary Databases
A primary Database is a repository of data derived from experiments or from research knowledge.
Genbank (nucleotide repository), Protein DB, Swissprot and PDB (MMDB) are primary databases.
Pubmed (literature)
Genome mapping databases
KEGG database (pathways)

Secondary Databases
A secondary database contains information derived from other sources.
Refseq (curated collection of Genbank at NCBI)
Unigene (clustering of ESTs at NCBI)

Organism-specific databases are often a mix between primary and secondary.

Genbank Records
A Bank: no attempt at reconciliation. Submit a sequence, get an Accession Number!
Sequences cannot be modified without the submitter's consent. No attempt at reconciliation (not a unique collection per LOCUS/gene). Entries are of various sequence quality and come from different sources, so they are separated into various divisions:
High-quality sequences in taxon-specific divisions.
Low-quality sequences in usage-specific databases.

A Collaboration between NCBI, EMBL and DDBJ. They contain (nearly) the same information, only the data format differs.
EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez EMBL records are patched up to add that information.

Refseq and LocusLink


Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occurring allele of a protein-expressing gene.
www.ncbi.nlm.nih.gov/LocusLink
Special non-Genbank Accession numbers:
NM_nnnnnn  mRNA refseq
NP_nnnnnn  protein refseq
NC_nnnnnn  refseq genomic contig
NT_nnnnnn  temporary genomic contig
NX_nnnnnn  predicted gene

Genbank divisions
Sequences in Genbank are split into various categories based on 1) the quality and type of the sequences; 2) for the high-quality nucleotide sequences, organism-dependent divisions.

Genbank Entry type (and query to restrict to that field):
mRNA (1/10000 errors): biomol_mRNA [PROP]
cDNA (EST, 95-99% accuracy, single pass): gbdiv_EST [PROP]
genomic: biomol_genomic [PROP]
in HTGS division (>99% accuracy): gbdiv_HTG [PROP]
GSS (low-quality genome survey sequences): gbdiv_GSS [PROP]
rest of Genbank (1/10000 accuracy):
Human: gbdiv_PRI [PROP]
mouse: gbdiv_ROD [PROP]
bovine: gbdiv_MAM [PROP]
STS (EST or cDNA used in mapping): gbdiv_STS [PROP]

FASTA Format

>identifier descriptive text
nucleotide or amino-acid sequence, on multiple lines if needed.


Example:

MOST important data format!!!

>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ...
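Reading FASTA takes only a few lines. The sketch below assumes the whole file fits in memory as a string and splits the defline at the first space into identifier and description; the function names are chosen for this example.

```python
def _record(header, chunks):
    """Split a defline into (identifier, description) and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


EXAMPLE = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC
"""

for identifier, description, sequence in parse_fasta(EXAMPLE):
    print(identifier.split("|"))  # gi number, database, accession.version, locus
    print(description)
    print(len(sequence))
```

Splitting the identifier on the | character exposes the gi number and the accession.version separately.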

Modified FASTA Format


1) A few tools follow the convention that lower case sequences are masked. (repeat masker, some versions of blast, megablast, blastz) 2) A few analysis tools (like CLUSTAL) want a simplified identifier on the defline.. So they can have a short string for the alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ...
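Both conventions are easy to handle in code. This sketch assumes the NCBI-style defline layout gi|number|database|accession.version|locus seen in the earlier example, and it replaces lowercase (soft-masked) bases with N; the function names and the choice of N as the mask character are assumptions made for the example.

```python
import re


def simplify_defline(defline):
    """Reduce '>gi|41|emb|X63129.1|BTA1AT text' to '>X63129.1'.

    Deflines that do not follow the gi|...|...|acc.ver layout keep
    their first word unchanged.
    """
    identifier = defline.lstrip(">").split()[0]
    fields = identifier.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return ">" + fields[3]
    return ">" + identifier


def hard_mask(seq, mask_char="N"):
    """Replace soft-masked (lowercase) bases with mask_char."""
    return re.sub(r"[a-z]", mask_char, seq)


print(simplify_defline(">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"))
print(hard_mask("GACCAGatatatatCCTGACC"))
```

Hard masking is useful for tools that ignore case, since they would otherwise still align the masked region.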

WIM now will talk about GCG

Feature table (NCBI;EMBL/DDBJ)


http://www.ncbi.nlm.nih.gov/collab/FT/index.html

Genbank Data format


41
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
DEFINITION  B.taurus mRNA for alpha-1-antitrypsin.
ACCESSION   X63129
NID         g41
VERSION     X63129.1  GI:41
KEYWORDS    alpha-1 antitrypsin; serine protease inhibitor; serpin.
SOURCE      Bos taurus.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae;
            Bos.

Genbank References
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
REFERENCE   1  (bases 1 to 1380)
  AUTHORS   Sinha,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple
            University, 3400 North Broad Street, Philadelphia, PA 19140, USA
REFERENCE   2  (bases 1 to 1380)
  AUTHORS   Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  TITLE     Complete cDNA sequence of bovine alpha 1-antitrypsin
  JOURNAL   Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
  MEDLINE   92223096
FEATURES             Location/Qualifiers

Genbank Source Qualifier


LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
FEATURES             Location/Qualifiers
     source          1..1380
                     /organism="Bos taurus"
                     /db_xref="taxon:9913"
                     /tissue_type="liver"
                     /cell_type="hepatocyte"
                     /clone_lib="lambda gt11"
                     /clone="2f-Ic"
     mRNA            <1..>1380
     sig_peptide     33..104
...

Genbank mRNA+CDS features


     mRNA            <1..>1380
     sig_peptide     33..104
     CDS             33..1283
                     /codon_start=1
                     /product="alpha-1-antitrypsin"
                     /protein_id="CAA44840.1"
                     /db_xref="PID:g42"
                     /db_xref="GI:42"
                     /db_xref="SWISS-PROT:P34955"
                     /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAAC
                     HKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKG
                     LGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLED
                     VKNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVN
                     YISFKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLAS
                     WVLLLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPK
                     LSISETYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGT
                     EAVGSTFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
     mat_peptide     105..1280
                     /product="alpha-1-antitrypsin"
     polyA_signal    1343..1348
     polyA_site      1368
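The feature locations in this record can be parsed with a small regular expression. This sketch only covers the simple forms that appear here (a single position, a start..end range, and the < and > markers for partial ends); compound locations such as join(...) or complement(...) are deliberately not handled, and the function name is chosen for this example.

```python
import re

# Matches "1368", "33..1283" and "<1..>1380".
LOCATION = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")


def parse_location(text):
    """Return (start, end, partial_5prime, partial_3prime); 1-based, inclusive."""
    match = LOCATION.match(text)
    if match is None:
        raise ValueError("unsupported location: " + text)
    less_than, start, greater_than, end = match.groups()
    start = int(start)
    end = int(end) if end is not None else start
    return start, end, less_than == "<", greater_than == ">"


cds_start, cds_end, _, _ = parse_location("33..1283")
cds_length = cds_end - cds_start + 1
print(cds_length, cds_length % 3)   # a complete CDS is a whole number of codons
print(parse_location("<1..>1380"))  # the mRNA is partial at both ends
print(parse_location("1368"))       # polyA_site is a single position
```

For this record the CDS spans 1251 bases, i.e. 417 codons including the stop codon, which is consistent with the signal peptide (33..104) plus the mature peptide (105..1280) plus three bases for the stop.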

Genbank Sequence format

...
BASE COUNT      357 a    413 c    322 g    288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
      121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
      181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
      301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
      361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
...
     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
//

EMBL DATA FORMAT


Embl: http://www.ebi.ac.uk/Databases/ http://www.ebi.ac.uk/cgi-bin/emblfetch Use Accession X63129

DDBJ DATA FORMAT


DDBJ: http://www.ddbj.nig.ac.jp/ http://ftp2.ddbj.nig.ac.jp:8000/getstarte.html Use Accession X63129 Flat file format same as NCBI/Genbank format.

Entrez
Index-based search system. Each field in the database is searchable individually or as an aggregate.
(e.g. CDS [FKEY]); the default is the aggregate [ALL FIELDS] *

All primary databases are interlinked as one big relational database.


(e.g. Pubmed links in Genbank records)

Phrase matching.
Human genome -> "human genome"

Entrez
Available neighbours (related documents or related sequences) In Pubmed searches: Term mapping to neighbouring documents and neighbouring terms. Term mapping to chemical names.
In Pubmed: term [All Fields] is term-mapped to chemical names + MeSH terms + text fields, unless the term is within double quotes.

Entrez
http://www.ncbi.nlm.nih.gov/Entrez/
Tutorials: http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
http://www.ncbi.nlm.nih.gov/Database.tut1.html

SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html

1. Core data: protein sequence data, the citation information and the taxonomic data.
2. Annotation:
Function(s) of the protein
Domains and sites, for example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
Post-translational modification(s), for example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
Secondary structure
Quaternary structure, for example homodimer, heterotrimer, etc.
Similarities to other proteins
Disease(s) associated with deficiencies in the protein
Sequence conflicts, variants, etc.

SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S

REBASE (Restriction enzymes dataBASE)


Restriction enzymes have a pattern recognition sequence; the actual cutting site is within that pattern or a few bases away from it.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format):
http://rebase.neb.com/rebase/rebase.f19.html
ID  enzyme name
ET  enzyme type
OS  microorganism name
PT  prototype
RS  recognition sequence, cut site
MS  methylation site (type)
CR  commercial sources for the restriction enzyme
CM  commercial sources for the methylase
RN  [count]
RA  authors
RL  jour, vol, pages, year, etc.

Exercises
You can work in teams for this.
1a) Use the first 6000 bases of your genomic piece [or find a bacterial genomic or mRNA sequence in Entrez with length between 2000:10000].
b) Use the ORF finder to find the gene(s). Compare the answer you get to the annotation you can infer from using blastn against Genbank and from using blastx against a protein database.
Do the Entrez exercises (separate Word document).
