(part 1)
Review of Basic biology + database searching in Biology.
> DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA
We often concentrate on protein-coding genes, because proteins are the building blocks of cells and make up the majority of bio-active molecules (but let's not forget the various RNA genes).
Prokaryotic genes
Prokaryotes (intronless protein coding genes)
[Figure: structure of a prokaryotic gene. Labels: TAC, gene region, downstream (3'), DNA, ATG, mRNA, 3' UTR, protein.]
Transcription: the gene is encoded on the minus strand, and its reverse complement is read into mRNA.
Translation: tRNAs read off the codons, 3 bases at a time, starting at the start codon and continuing until a STOP codon is reached. The result is the protein.
[Figure labels: Gene 1, Gene 2, Gene 3]
In prokaryotes, genes that are part of the same functional pathway are sometimes grouped together under a single promoter (an operon). They are then transcribed together as a single polycistronic mRNA, from which the 3 proteins are translated separately.
Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
Genetic Code
How does an mRNA specify an amino acid sequence? The answer lies in the genetic code. It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. Similarly, two-nucleotide combinations could only specify 16 amino acids. The conclusion is that each amino acid is specified by a particular combination of three nucleotides, called a codon (4 x 4 x 4 = 64 possible codons): each group of 3 nucleotides codes for one amino acid. The first codon is the start codon, and usually codes for the amino acid methionine (M, which has the codon ATG). The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by * to indicate the STOP codon. A coding region (abbreviation CDS) starts at the START codon and ends at the STOP codon.
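The reading process described above can be sketched in a few lines. This is only an illustration, assuming the standard genetic code and a DNA coding strand as input; the function name `translate` and the example sequence are hypothetical choices, not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a DNA coding sequence 3 bases at a time.

    Translation stops at (and includes) the first STOP codon, written "*".
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# 4 bases give 4, 16 and 64 possible words of length 1, 2 and 3;
# only length 3 is enough to cover 20 amino acids.
print([4 ** n for n in (1, 2, 3)])
print(translate("ATGAAAATCGTATAG"))
```

The table is built from the compact 64-letter string rather than typed out codon by codon, which keeps the sketch short at the cost of being less readable than a full codon table.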
Codon table
Note the degeneracy of the genetic code. Each amino acid might have up to six codons that specify it. Different organisms have different frequencies of codon usage. A handful of species vary from the codon association described above, and use different codons for different amino acids. How do tRNAs recognize the codon to which they should bring an amino acid? The tRNA has an anticodon on its mRNA-binding end that is complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
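To make the degeneracy concrete, the following sketch counts how many codons specify each amino acid in the standard genetic code, and derives the anticodon that is strictly complementary to a codon. The names `codons_per_amino_acid` and `anticodon` are illustrative, and the anticodon function assumes perfect Watson-Crick pairing (it ignores wobble pairing at the third position).

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; "*" marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degeneracy: number of codons that map to each amino acid (or to STOP).
codons_per_amino_acid = Counter(CODON_TABLE.values())

def anticodon(codon):
    """Return the anticodon (5'->3', RNA letters) complementary to a codon."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    rna = codon.upper().replace("T", "U")
    return "".join(pairs[base] for base in reversed(rna))

print(codons_per_amino_acid["L"], codons_per_amino_acid["M"])
print(anticodon("AUG"))
```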
RNA
RNA has the same primary structure as DNA. It consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of the sugar. The differences between DNA and RNA are that:
1. RNA has a hydroxyl group on the 2' carbon of the sugar (thus the difference between deoxyribonucleic acid and ribonucleic acid).
2. Instead of the base thymine, RNA uses another base called uracil.
3. Because of the extra hydroxyl group on the sugar, RNA does not adopt the same stable double helix as DNA and usually exists as a single-stranded molecule. However, regions of double helix can form where there is some base pair complementarity (U and A, G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure.
4. Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.
Most of the consensus sequences are known from E. coli studies, so for each bacterium the exact distribution of the consensus will change.
Most modern gene prediction programs need to be trained, i.e. they find their own consensus and assembly rules given a few example genes. A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates. E.g. the selfid program (selfid@igs.cnrs-mrs.fr)
Frame 3: 1st base of start codon can either start at base 3,6,9,12,...
(Frames -1, -2, -3 are on the minus strand.) Some programs have other conventions for naming frames (0..5, 1-6, etc.). Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.
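A six-frame ORF scan can be sketched as follows. This is a simplified illustration, not the algorithm of any particular ORF finder: it assumes ATG as the only start codon, uses the frame naming above (1, 2, 3 on the plus strand; -1, -2, -3 on the minus strand), and reports coordinates on the strand being read. The function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=2):
    """Scan all six frames for ATG ... STOP open reading frames.

    Returns a list of (frame, start, end) tuples. start and end are
    1-based positions on the strand being read; end includes the STOP.
    min_codons counts the codons before the STOP codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            frame = strand * (offset + 1)
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((frame, start + 1, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs

# The ATG starts at base 3, so the ORF is reported in frame 3.
print(find_orfs("CCATGAAATAGCC"))
```

Comparing the reported ORFs with blastx hits against a protein database is the usual way to decide which candidates are likely to be real genes.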
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
try with gi=41 ( or your own piece of DNA)
Eukaryotic Nuclear Gene Structure
Gene prediction for Pol II transcribed genes.
- Upstream enhancer elements.
- Upstream promoter elements.
- TATA promoter (-30 nt; 70%, 15 nt consensus; Bucher et al. (1990)).
- 14-20 nt spacer DNA.
- CAP site (8 bp): transcription initiation.
- Transcript region, interrupted by introns.
- Translation initiation (Kozak signal, 12 bp consensus), 6 bp prior to the initiation codon.
- polyA signal (AATAAA 99%, other).
introns
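Transcript region, interrupted by introns. Each intron:
- starts with a donor site consensus (G100 T100 A62 A68 G84 T63 ...; the numbers give the percentage of sites with that base);
- has a branch site near the 3' end of the intron (a not very well conserved consensus, UACUAAC);
- ends with an acceptor site consensus (12 Py .. N C65 A100 G100).
[Figure labels: UACUAAC (branch site), AG (acceptor site)]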
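As a rough illustration of these signals, the sketch below checks a candidate intron for the invariant GT (GU in RNA) at the donor site, the invariant AG at the acceptor site, and an exact copy of the UACUAAC branch consensus. Since the branch site is poorly conserved, an exact match is a deliberately crude assumption; the function name `splice_site_report` and the example sequence are hypothetical.

```python
def splice_site_report(intron):
    """Report which canonical splice signals a candidate intron carries."""
    rna = intron.upper().replace("T", "U")
    return {
        "donor_GU": rna.startswith("GU"),        # 5' end of the intron
        "acceptor_AG": rna.endswith("AG"),       # 3' end of the intron
        "has_branch_site": "UACUAAC" in rna,     # exact consensus only
    }

print(splice_site_report("GTAAGTCCCTACTAACTTTTTTTTCAG"))
```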
Exons
The exons of the transcript region are composed of:
- 5' UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome)
- AUG (or other start codon)
- Remainder of coding region
- Stop codon
- 3' UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)
Untranslated regions (UTRs)
Introns (there can be genes within the introns of another gene!)
Intergenic regions:
- repetitive elements
- pseudogenes (dead genes that may or may not have been retroposed back into the genome as a single-exon gene)
Pseudogenes
Pseudogenes: DNA sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of a promoter, for example), in translation, or both. Processed pseudogenes: a gene retroposed back into the genome after being processed by the splicing apparatus. It is thus fully spliced and has a polyA tail. The insertion process flanks the mRNA sequence with short direct repeats. It therefore has no promoter, unless it is accidentally retroposed downstream of a promoter sequence. Do not confuse with single-exon genes.
Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site.
Low-Complexity Elements
When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation! Sequences like ATATATACTTATATA, which are mostly made of two letters, are called low-complexity. Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified. The low-complexity sequence can also be hidden at the translated protein level.
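One crude way to spot such sequences is to measure how much of a sequence is made of its two most common letters. This is a toy measure for illustration only; real low-complexity filters use more refined scores. The function name `two_letter_fraction` is hypothetical.

```python
from collections import Counter

def two_letter_fraction(seq):
    """Fraction of the sequence covered by its two most common letters."""
    counts = Counter(seq.upper())
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(seq)

# The example from the text is almost entirely A and T.
print(two_letter_fraction("ATATATACTTATATA"))
# A sequence using all four bases evenly scores much lower.
print(two_letter_fraction("ACGTACGTACGT"))
```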
Masking
To avoid finding spurious matches in alignment programs, you should always mask the query sequence. Before predicting genes, it is a good idea to mask out repeats (at least those containing ORFs). Before running blastn against a genomic record, you must mask out the repeats. Most used programs:
CENSOR:
Repeat Masker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
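Once a masking program has reported where the repeats are, applying the mask is simple. The sketch below assumes the repeat positions are already known as 1-based inclusive intervals (the output format of real masking programs differs), and shows both hard masking (replace with N) and soft masking (lowercase). The function name `mask` is hypothetical.

```python
def mask(seq, intervals, soft=False):
    """Mask the given 1-based inclusive (start, end) intervals.

    Hard masking replaces bases with N; soft masking lowercases them.
    """
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start - 1, end):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)

print(mask("ACGTACGTAC", [(3, 6)]))
print(mask("ACGTACGTAC", [(3, 6)], soft=True))
```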
The protein as read off from the mRNA may not be in the final form that will be used in the cell. Some proteins contain a signal peptide (located at the N-terminus, i.e. the beginning); this signal peptide is used to guide the protein towards its final cellular localization. The signal peptide is cleaved off at the cleavage site once the protein has reached (or is near) its final destination. Various post-translational modifications (e.g. phosphorylation) may follow. The final protein is called the mature peptide.
Introduction to Genbank file format. Introduction to Entrez and Pubmed Ref: Chapters 1,2,5,6 of Bioinformatics
Databases:
A collection of Records.
Each record has many fields.
Each field contains specific information.
Each field has a data type, e.g. money, currency, text field, integer, date, address (text field), citation (text field).
Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.
[Figure labels: spreadsheet; flat-file version of a database]
gi       Accession  version  date      Genbank Division  taxid  organism
6226959  NM_000014  3        06/01/00  PRI               9606   Homo sapiens
6226762  NM_000014  2        10/12/99  PRI               9606   Homo sapiens
4557224  NM_000014  1        02/04/99  PRI               9606   Homo sapiens
41       X63129     1        06/06/96  MAM               9913   Bos taurus
gi       Accession  version  date        Genbank Division  taxid  organism
6226959  NM_000014  3        01/06/2000  PRI               9606   Homo sapiens
6226762  NM_000014  2        12/10/1999  PRI               9606   Homo sapiens
4557224  NM_000014  1        04/02/1999  PRI               9606   Homo sapiens
41       X63129     1        06/06/1996  MAM               9913   Bos taurus
Gi = Genbank Identifier: Unique Key : Primary Key
Relational Database: normalizing a database that has repeated sub-elements means splitting it into smaller tables, and relating the sub-tables to the first one using a key (here the taxid).
gi       Accession  version  date        Genbank Division  taxid
6226959  NM_000014  3        01/06/2000  PRI               9606
6226762  NM_000014  2        12/10/1999  PRI               9606
4557224  NM_000014  1        04/02/1999  PRI               9606
41       X63129     1        06/06/1996  MAM               9913

taxid  organism      Number of Chromosomes
9606   Homo sapiens  22 diploid + X+Y
9913   Bos taurus    29+X+Y
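The split into two tables can be tried out with an in-memory SQLite database. This is only a sketch: the table and column names (`record`, `taxon`) are illustrative choices, and the real NCBI schema is not shown in the slides. The organism name is stored once per taxid and recovered with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxid INTEGER PRIMARY KEY,
    organism TEXT
);
CREATE TABLE record (
    gi INTEGER PRIMARY KEY,               -- unique identifier of each record
    accession TEXT,
    version INTEGER,
    division TEXT,
    taxid INTEGER REFERENCES taxon(taxid) -- link to the taxon table
);
""")
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?)",
    [(9606, "Homo sapiens"), (9913, "Bos taurus")],
)
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?, ?, ?)",
    [
        (6226959, "NM_000014", 3, "PRI", 9606),
        (6226762, "NM_000014", 2, "PRI", 9606),
        (4557224, "NM_000014", 1, "PRI", 9606),
        (41, "X63129", 1, "MAM", 9913),
    ],
)

# Rebuild the original flat view by joining the two tables on taxid.
rows = conn.execute("""
    SELECT r.gi, r.accession, t.organism
    FROM record AS r JOIN taxon AS t ON r.taxid = t.taxid
    ORDER BY r.gi
""").fetchall()
for row in rows:
    print(row)
```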
Indexed searches.
To allow easy searching of a database, make an index. An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields)
Genbank division
  PRI        6226959; 6226762; 4557224
  MAM        41
Accession
  NM_000014  6226959; 6226762; 4557224
  X63129     41
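Building such an index amounts to grouping primary keys by the value of a field. A minimal sketch, using the four example records and a hypothetical helper named `build_index`:

```python
from collections import defaultdict

RECORDS = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]

def build_index(records, field):
    """Map each value of `field` to the list of primary keys (gi) having it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["gi"])
    return dict(index)

print(build_index(RECORDS, "division"))
print(build_index(RECORDS, "accession"))
```

A search on a field then becomes a single dictionary lookup instead of a scan over every record.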
Indexed searches.
Boolean Query: Merging and Intersecting lists: AND (in both lists) (e.g. human AND genome)
+human +genome human && genome
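With an index in hand, Boolean queries reduce to set operations on the lists of keys: AND is an intersection and OR is a union. The sketch below uses a made-up index of document numbers; the function names `query_and` and `query_or` are hypothetical, and both expect at least one term.

```python
# Hypothetical index: term -> set of document identifiers containing it.
INDEX = {
    "human": {1, 2, 3, 5},
    "genome": {2, 3, 4},
}

def query_and(index, *terms):
    """Documents that contain ALL the terms (intersection of the lists)."""
    return set.intersection(*(index.get(term, set()) for term in terms))

def query_or(index, *terms):
    """Documents that contain ANY of the terms (union of the lists)."""
    return set().union(*(index.get(term, set()) for term in terms))

print(query_and(INDEX, "human", "genome"))
print(query_or(INDEX, "human", "genome"))
```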
Search strategies
Search engines use complex strategies that go beyond Boolean queries.
Phrase matching:
human genome -> "human genome"
Togetherness: documents with human close to genome are scored higher.
Term expansion & synonyms:
human -> homo sapiens
Neighbours:
human genome -> genome projects, chromosomes, genetics
To avoid this term mapping, enclose your query terms in quotes: "human" AND "genome". To require that ALL the terms in your query be important, precede them with a +. This also prevents term mapping. To force the order of the words to be important, group them within a quoted string: "biology of mammals".
Indexed searches.
Example
Find the advanced query page at http://www.altavista.com
- type human (and hit the Search button)
- type genome
- type human AND genome
- type "human genome" (finds the fewest matches)
- type human OR genome (finds the most matches)
Search Engines:
Web Spiders: a collection of ALL web pages. Since web pages change all the time and new ones appear, spiders must constantly roam the web and re-index, or depend on people submitting their own pages.
www.google.com (BEST!)
www.infoseek.com
www.lycos.com
www.exite.com
www.webcrawler.com
www.looksmart.com (country specific)
Search Engines:
www.google.com (BEST!) Google ranks pages according to how many pages with those terms refer to the pages you are asking for. Not only must one document contain ALL the search terms, but other documents which refer to this one must also contain all the terms. Great when you know what you are looking for! You can also use quotes to require immediate proximity and order of terms. E.g. type
"Web server for the blast program"
But Google only indexes about 40% of the web, so you may have to use other web spiders.
Search Engines:
Curated Collections: Not comprehensive: Contains list of best sites for commonly requested topics, but is missing important sites for more specialized topics (like biology)
www.yahoo.com (Has travel maps too!)
Answer-based curated collections: easy-to-use English-like queries. First looks at a list of predefined answers, then refines the answers based on user interaction. Also answers new questions.
www.askjeeves.com www.magellan.com www.altavista.com(has translation TOOLS) www.hotbot.com
Search Engines:
Meta-Search Engines: poll several search engines, and return the consensus of all results. A meta-search engine is likely to miss sites, but the sites it returns are very relevant to the query. The other operating mode is to return the sum of all the results; it then becomes very sensitive to a very detailed query.
www.metacrawler.com www.savvysearch.com www.1blink.com (fast) www.metafind.com www.dogpile.com
Expasy server
http://www.expasy.ch
Biological Databases
Nucleotide databases:
Genbank: International Collaboration
NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia). A bank: no curation. Submission to these databases is required for publication in a journal.
Protein Databases:
NCBI: Swiss Prot:(Free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. 1998 version free of any licensing)
http://www.expasy.ch(latest pay version) NCBI has the latest free version. Translated Proteins from Genbank Submissions
EMBL
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
PIR
Structure databases:
PDB: Protein structure database.
Http://www.rscb.org/pdb/
Literature databases:
NCBI: Pubmed: all biomedical literature.
www.ncbi.nlm.nih.gov Abstracts and links to publisher sites for full-text retrieval/ordering and journal browsing.
Pathways Database:
KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html
Accession (stable); version (changes with each sequence update). "Version" also refers to the combined Accession.version identifier.
Secondary accession: records may have been merged in the past, so the records which were not chosen as the primary were made secondary.
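The Accession.version convention is easy to handle programmatically. The sketch below splits an identifier into its stable accession and its version number, and keeps the latest version seen for each accession; the function names are hypothetical, and the identifiers are built from the example records above.

```python
def split_accession(identifier):
    """Split 'NM_000014.3' into ('NM_000014', 3); version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None

def latest_versions(identifiers):
    """Return the highest version number seen for each accession."""
    latest = {}
    for identifier in identifiers:
        accession, version = split_accession(identifier)
        if version is not None and version > latest.get(accession, 0):
            latest[accession] = version
    return latest

print(split_accession("NM_000014.3"))
print(latest_versions(["NM_000014.1", "NM_000014.3", "NM_000014.2", "X63129.1"]))
```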
Primary Databases
A primary Database is a repository of data derived from experiments or from research knowledge.
The following are primary databases:
- Genbank (nucleotide repository)
- Protein DB, Swissprot
- PDB (MMDB)
- Pubmed (literature)
- Genome mapping databases
- Kegg database (pathways)
Secondary Databases
A secondary database contains information derived from other sources.
Refseq (curated collection of Genbank at NCBI)
Unigene (clustering of ESTs at NCBI)
Genbank Records
A bank: no attempt at reconciliation. Submit a sequence, get an accession number!
Sequences cannot be modified without the submitter's consent. There is no attempt at reconciliation (it is not a unique collection per LOCUS/gene). Entries are of various sequence quality and come from different sources, so they are separated into various divisions:
- High-quality sequences go in taxon-specific divisions.
- Low-quality sequences go in usage-specific databases.
A Collaboration between NCBI, EMBL and DDBJ. They contain (nearly) the same information, only the data format differs.
EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez EMBL records are patched up to add that information.
Genbank divisions
Sequences in Genbank are split into various categories based on: 1) the quality and type of the sequences; 2) the organism: the high-quality nucleotide sequences are divided into organism-dependent divisions.
FASTA Format
Genbank References
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
REFERENCE   1  (bases 1 to 1380)
  AUTHORS   Sinha,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple
            University, 3400 North Broad Street, Philadelphia, PA 19140, USA
REFERENCE   2  (bases 1 to 1380)
  AUTHORS   Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  TITLE     Complete cDNA sequence of bovine alpha 1-antitrypsin
  JOURNAL   Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
  MEDLINE   92223096
FEATURES             Location/Qualifiers
...
BASE COUNT      357 a    413 c    322 g    288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct
      121 acgctgtcca agagacagat gatacatccc accaggaagc
      181 ccaacctggc caactttgcc ttcagcatat accaccattt
      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt
      301 ccaagggcaa cactcacact gagatcctga agggcctggg
      361 cagaggctga gatccacaaa ggctttcagc atcttctcca
...
     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
//
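The ORIGIN block of a record like this one can be turned back into a plain sequence by dropping the position numbers and spaces. The sketch below is a minimal illustration that handles only the sequence lines and the terminating //, not a full Genbank parser; the function name `origin_to_sequence` is hypothetical.

```python
from collections import Counter

def origin_to_sequence(lines):
    """Join the sequence blocks of ORIGIN lines, stopping at the // line."""
    blocks = []
    for line in lines:
        line = line.strip()
        if line == "//":
            break
        # The first token of each line is the position number; skip it.
        blocks.extend(token for token in line.split() if not token.isdigit())
    return "".join(blocks)

sequence = origin_to_sequence(["        1 gaccagccct gacctaggac", "//"])
print(sequence)
# Counting the letters reproduces the kind of figures shown in BASE COUNT.
print(Counter(sequence))
```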
Entrez
Index-based search system. Each field in the database is searchable individually or as an aggregate.
(e.g. CDS [FKEY]); the default is the aggregate [ALL FIELDS] *
Phrase matching.
Human genome -> "human genome"
Entrez
Available neighbours (related documents or related sequences). In Pubmed searches: term mapping to neighbouring documents and neighbouring terms, and term mapping to chemical names.
In Pubmed: term [All Fields] is term-mapped to chemical names + MeSH terms + text fields, unless the term is within double quotes.
Entrez
http://www.ncbi.nlm.nih.gov/Entrez/
Tutorials:
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
http://www.ncbi.nlm.nih.gov/Database.tut1.html
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
1. Core data: protein sequence data, the citation information and the taxonomic data.
2. Annotation:
- Function(s) of the protein
- Domains and sites, for example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
- Post-translational modification(s), for example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
- Secondary structure
- Quaternary structure, for example homodimer, heterotrimer, etc.
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the protein
- Sequence conflicts, variants, etc.
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
Exercises
You can work in teams for this.
1a) Use the first 6000 bases of your genomic piece [or find a bacterial genomic or mRNA sequence in Entrez with length between 2000:10000].
b) Use the ORF finder to find the gene(s). Compare the answer you get to the annotation you can infer from using blastn against Genbank and from using blastx against a protein database.
Do the Entrez exercises (separate Word document).