In this workshop you will learn the theory behind BLAST, the Basic Local Alignment Search Tool, and how to optimize sequence similarity searches for nucleotide sequences. Example searches and special applications such as Primer-BLAST and Genome BLAST will be discussed. The NCBI BLAST Web services have a new organization with a simplified interface that provides easier access to important options. The redesign also offers several new features, including a more powerful taxonomic limit, automatic adjustment of search parameters for short sequences, easy tracking of and access to recent searches, and the ability to store search strategies.
How to get from here:
to here:
to here:
Global Alignment
Local Alignment
What is BLAST?
A sequence comparison algorithm, optimized for speed, used to search sequence databases for optimal local alignments to a query. (An algorithm is a procedure for solving a mathematical problem in a finite number of steps that frequently involves repetition of an operation.) The BLAST algorithm is fast and sensitive. Sequences are filtered to remove low-complexity regions so that the reported alignments are meaningful.
c) A set of high-scoring segment pairs (HSPs) for the best alignment is selected, and the largest HSPs are returned as the search output. Local hits on the same accession number are assembled virtually.
All that is being scored is whether or not two bases at a given position are the same.
o All matches are given the same score (+2 for blastn, +1 for Megablast), as are all mismatches (-3 for blastn, -2 for Megablast).
o Mutational analysis shows that transitions are more likely than transversions (this is not taken into account in BLAST).
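The match/mismatch scheme described above can be sketched in a few lines of Python. This is only an illustration of position-by-position scoring for two already aligned, equal-length sequences, not BLAST's own implementation; the function name `score_ungapped` and the example sequences are made up for this sketch.

```python
def score_ungapped(seq_a, seq_b, match=2, mismatch=-3):
    """Score two equal-length nucleotide strings position by position.

    Every match gets the same reward and every mismatch the same penalty;
    transitions and transversions are not distinguished.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(
        match if a == b else mismatch
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


# Hypothetical example: 7 identical positions and 1 mismatch.
blastn_style = score_ungapped("ACGTACGT", "ACGTTCGT")               # +2 / -3
megablast_style = score_ungapped("ACGTACGT", "ACGTTCGT", 1, -2)     # +1 / -2
print(blastn_style, megablast_style)
```

With the same alignment, the two reward/penalty pairs give different raw scores, which is one reason raw scores are only comparable within a single scoring system.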
Gap opening and gap extension costs
The presence of a gap is more significant than the length of the gap.
o Gap opening costs are higher than extension costs.
There is no widely accepted theory for selecting gap costs.
Low gap costs allow gaps everywhere in the alignment.
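A small Python sketch may make the opening/extension distinction concrete. It assumes the common affine convention in which a gap of length k costs one opening charge plus k extension charges; the default values 5 and 2 are illustration values chosen for this sketch, and `gap_cost` is a hypothetical helper, not part of any BLAST package.

```python
def gap_cost(length, open_cost=5, extend_cost=2):
    """Affine gap cost: one opening charge plus one extension charge per position."""
    if length < 1:
        raise ValueError("a gap has at least one position")
    return open_cost + extend_cost * length


# One long gap is cheaper than two short gaps covering the same number of
# positions, because the opening cost is paid only once.
one_gap_of_four = gap_cost(4)
two_gaps_of_two = gap_cost(2) + gap_cost(2)
print(one_gap_of_four, two_gaps_of_two)
```

Under this convention the existence of a gap dominates its cost, while each additional gapped position adds comparatively little.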
The default setting filters repetitive or low-complexity sequences. The filter programs are SEG (protein; masked residues are replaced by X) and DUST (nucleic acid; masked residues are replaced by N). Sequences can also be masked for filtering by writing them in lower-case letters. Another option masks only for the purpose of constructing the lookup table used by BLAST, so that no hits are found based on low-complexity sequence or repeats (if the repeat filter is checked). The BLAST extensions are performed without masking, so they can extend through low-complexity sequence.
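To picture what lower-case masking does to a nucleotide query, the following Python sketch replaces every lower-case base with N, the masking character used for nucleic acids. It only mimics the visible effect of hard masking; it does not detect low-complexity regions the way DUST does, and `mask_lowercase` is a hypothetical helper written for this illustration.

```python
def mask_lowercase(sequence, mask_char="N"):
    """Replace lower-case residues with a masking character, keep the rest."""
    return "".join(mask_char if base.islower() else base for base in sequence)


# Hypothetical query in which the user has marked a repeat in lower case.
query = "ACGTTGCAaaaaaaaaaaGGCTAGCT"
print(mask_lowercase(query))
```

For a protein query the same idea would apply with `mask_char="X"`.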
E-value threshold
The E-value represents the number of hits with an alignment score equal to or better than S that would be expected by chance (background noise) when searching a database of a particular size. The default E-value for blastn, blastp, blastx and tblastn is 10. At this setting, 10 hits with scores equal to or better than the defined alignment score, S, are expected to occur by chance. The E-value threshold can be increased or decreased to alter the stringency of the search. Increase the E-value when searching with a short query, since it is likely to be found many times by chance in a given database.
Statistics
$$E = K m n e^{-\lambda S}$$

E = Expectation value: the number of matches expected to occur randomly with a given score. In general terms, the smaller E is, the more likely the match is significant.
K = A variable with a value dependent upon the substitution matrix used and adjusted for search base size.
m = Length of query (in nucleotides or amino acids)
n = Size of database (in nucleotides or amino acids)
mn = Size of the search space (more on this later)
λ = A statistical parameter used as a natural scale for the scoring system.
S = Raw score: the sum of substitution scores (ungapped BLAST) or of substitution and gap scores.

The E-value decreases exponentially as the score (S) increases. The E-value reflects the size of the database and the scoring system in use. A convenient way to create a significance threshold for reporting hits is to alter the E-value: when the Expect value threshold is increased from the default value of 10, more hits can be reported.
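The relationship between score, search space and E-value can be explored numerically. The Python sketch below assumes the formula reads E = K·m·n·e^(−λS), with K, m, n, λ and S as defined in this section. The values of K and λ used in the example are invented for illustration only; real values depend on the scoring system and are reported by BLAST with each search. The function name `expect_value` is likewise hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustration values only (not taken from a real search).
K_DEMO, LAMBDA_DEMO = 0.5, 1.0
QUERY_LEN, DB_LEN = 2673, 10_000_000

for score in (10, 20, 30, 40):
    e = expect_value(score, QUERY_LEN, DB_LEN, K_DEMO, LAMBDA_DEMO)
    print(f"S = {score:2d}  E = {e:.3g}")

# Doubling the database size doubles E for the same score.
small_db = expect_value(30, QUERY_LEN, DB_LEN, K_DEMO, LAMBDA_DEMO)
large_db = expect_value(30, QUERY_LEN, 2 * DB_LEN, K_DEMO, LAMBDA_DEMO)
print(large_db / small_db)
```

The loop shows the exponential drop of E with increasing S, and the last lines show why the same alignment is less significant in a larger database.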
Example:
Torpedo marmorata (marbled electric ray) mRNA for chloride channel protein
2,673 bp linear mRNA
Accession: X56758.1 GI: 64424
Formatting options
BLAST search parameters specified in the query
Which database was searched?
Which matrix was used during the search?
Date on which the queried database was built?
Size of the database at that time?
(BLASTN) Higher eukaryotic genomes contain large amounts of repetitive DNA. The most abundant interspersed repeat in the human genome is the Alu element. Alus tend to occur near genes, within the introns of genes, or in the regions between genes. In some cases, their presence and absence can fairly accurately show the intron-exon structure of a gene. Demonstrate this by performing a nucleotide-nucleotide BLAST search against the Alu database with the genomic sequence of the human von Hippel-Lindau syndrome gene (Accession AF010238). Note that the exons appear in the BLAST graphic as places where the Alu elements do not align.

(Genome BLAST) Use Entrez Gene to find the entry for the human glyceraldehyde-3-phosphate dehydrogenase gene. Click on the Map Viewer link to find the map location and the contig containing the GAPD gene. Zoom in to see the exon-intron structure of the gene. How many exons are there? Now use human genome BLAST to verify the location and structure of this gene. Use the GAPD RefSeq (NM_002046) to perform this search. How many contigs do you hit in the human genome? Click on the Genome View button to see the distribution of these hits on the genome. Look at some of the high-scoring single hits to see what is unusual about them. How can you account for these results?
(3)
MPG Bioinformatics Support Service http://www.biochem.mpg.de/iv
Wiki on bioinformatics tools developed in the Max Planck Society http://www.bioinfowiki.mpg.de/
N. Gaedeke
Last update: September 3rd, 2013