
Nucleotide BLAST

In this workshop you will learn the theory behind BLAST, the Basic Local Alignment Search Tool, and how to optimize sequence similarity searches for nucleotide sequences. Example searches as well as special applications such as Primer-BLAST and Genome-BLAST will be discussed. The NCBI BLAST Web services have a new organization with a simplified interface that provides easier access to important options. The redesign also offers several new features, including a more powerful taxonomic limit, automatic adjustment of search parameters for short sequences, easy tracking of and access to recent searches, and the ability to store search strategies.
How to get from here, to here, to here: (figures not reproduced)

Scope of Similarity Searching


Find statistically significant matches, based on sequence similarity, to a nucleotide sequence of interest.
Obtain information on the inferred function of a gene.
Localize a sequence in the genome.
Find orthologous genes in different organisms.

Global Alignment

Local Alignment

What is BLAST?

BLAST is a sequence comparison algorithm (an algorithm being a procedure for solving a mathematical problem in a finite number of steps that frequently involves repetition of an operation). It is optimized for speed and is used to search sequence databases for optimal local alignments to a query. The BLAST algorithm is fast and sensitive. Sequences are filtered to remove low-complexity regions, so that the resulting sequence alignments are meaningful.

How does BLAST work?


BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them. A third step produces a non-redundant output.

a) The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a matrix. The query sequence is broken into words with a word length of 11 for nucleic acids. The maximum number of words is L - W + 1, where L is the sequence length and W is the word length. Only those words that score higher than the neighbourhood word score threshold (T) are kept (using a scoring matrix), and a so-called lookup table of these words is created; the word hits found through it are the seeds for high-scoring segment pairs (HSPs). (The blastn algorithm parses nucleotide sequences into 11-letter "words"; the same is done for every sequence in the database being searched, and exact word matches are identified in the database sequences.)

b) Word hits are then extended in both directions in an attempt to generate an alignment with a score "S" exceeding the threshold. The "T" parameter dictates the speed and sensitivity of the search.

c) A set of high-scoring segment pairs for the best alignment is selected, and the largest HSPs are returned as the search output. Local hits on the same accession number are assembled virtually.
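
The word search and extension steps can be illustrated with a minimal Python sketch. This is a simplified teaching model, not the NCBI implementation: the function names and the `x_drop` stopping rule are assumptions chosen for illustration, only exact word matches are used as seeds, and the extension is ungapped.

```python
from collections import defaultdict


def split_into_words(seq, w=11):
    """Return (offset, word) pairs; a sequence of length L yields L - w + 1 words."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def build_lookup_table(query, w=11):
    """Map every query word to the list of query offsets where it occurs."""
    table = defaultdict(list)
    for offset, word in split_into_words(query, w):
        table[word].append(offset)
    return table


def find_seeds(table, subject, w=11):
    """Scan a database sequence and report exact word hits as (query_offset, subject_offset)."""
    seeds = []
    for s_off, word in split_into_words(subject, w):
        for q_off in table.get(word, []):
            seeds.append((q_off, s_off))
    return seeds


def extend_ungapped(query, subject, q_off, s_off, w=11,
                    match=2, mismatch=-3, x_drop=10):
    """Extend a seed in both directions without gaps.

    Extension in one direction stops when the running score has fallen more
    than x_drop (an illustrative value) below the best score seen so far.
    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def extend(qi, si, step):
        best = score = 0
        best_len = n = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(q_off + w, s_off + w, 1)
    left_score, left_len = extend(q_off - 1, s_off - 1, -1)
    total = w * match + left_score + right_score
    return total, q_off - left_len, q_off + w + right_len
```

Each seed returned by `find_seeds` can be passed to `extend_ungapped`; keeping only extensions whose score exceeds a chosen threshold corresponds to the selection of HSPs described in step c).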

The Scoring Matrix

Nucleic Acid Scoring Matrices (unitary matrix)

All that is being scored is whether or not two bases at a given position are the same.
   o All matches are given the same score (+2 for blastn, +1 for Megablast), as are all mismatches (-3 for blastn, -2 for Megablast).
   o Mutational analysis shows that transitions are more likely than transversions (this is not taken into account in BLAST).

Gap opening and gap extension costs

The presence of a gap is more significant than the length of the gap.
   o Gap opening costs are higher than extension costs.
There is no widely accepted theory for selecting gap costs.
Low gap costs allow gaps everywhere in the alignment.
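
As a rough illustration of how a unitary matrix and gap costs combine into a raw score, the sketch below scores an alignment that is already given. The match and mismatch defaults are the blastn values quoted in the text; the gap opening and extension values (5 and 2) and the convention that a gap of length k costs `gap_open + k * gap_extend` are assumptions made for this example only.

```python
def score_alignment(aligned_query, aligned_subject,
                    match=2, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw score of a pairwise alignment written with '-' for gap positions.

    Every aligned pair of bases scores `match` or `mismatch` (unitary matrix).
    The first position of a gap costs gap_open + gap_extend, every further
    position costs gap_extend (illustrative values, see lead-in).
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += match if q == s else mismatch
            in_gap = False
    return score


# One gap of length 2 is cheaper than two separate gaps of length 1,
# because the opening cost is paid only once.
one_long_gap = score_alignment("ACGT--ACGT", "ACGTTTACGT")
two_short_gaps = score_alignment("AC-GTAC-GT", "ACTGTACTGT")
```

With these settings `one_long_gap` is higher than `two_short_gaps`, which reflects the statement that the presence of a gap matters more than its length.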

Filters & Masking

The default setting will filter repetitive or low-complexity sequences.
The filter programs are SEG (proteins; masked residues are replaced by X) and DUST (nucleic acids; masked bases are replaced by N).
Sequences can also be masked for filtering by writing them in lower-case letters.
Another option masks only for the purpose of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked). The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence.
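
The difference between replacing masked bases and masking only for the lookup table can be sketched as follows. This is a simplified model under the assumption that masked regions are marked in lower case; the function names are hypothetical and the code does not reproduce what DUST itself computes.

```python
def hard_mask(seq):
    """Replace every lower-case (masked) base by N, as a nucleotide filter would."""
    return "".join("N" if base.islower() else base for base in seq)


def lookup_words(seq, w=11):
    """Words used for the lookup table when masking applies to the lookup table only.

    A word is skipped if it overlaps a lower-case (masked) position, so no hit
    can start inside the masked region. The sequence itself is left unchanged,
    so a later extension step can still run through the masked bases.
    """
    words = []
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        if word.isupper():
            words.append((i, word))
    return words


example = "ACGTACGTACGT" + "aaaaaaaa" + "ACGTACGTACGT"
masked = hard_mask(example)
seed_offsets = [offset for offset, _ in lookup_words(example)]
```

In this example only the words that lie entirely in the upper-case flanks remain available as seeds.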

E-value threshold

The E-value represents the number of hits with an alignment score equal to or better than "S" that would be "expected" by chance (background noise) when searching a database of a particular size.
The default E value for blastn, blastp, blastx and tblastn is 10. At this setting, 10 hits with scores equal to or better than the defined alignment score, S, are expected to occur by chance.
The E-value can be increased or decreased to alter the stringency of the search.
Increase the E value when searching with a short query, since it is likely to be found many times by chance in a given database.

Statistics

$E = K m n \, e^{-\lambda S}$

E = Expectation value: the number of matches expected to occur randomly with a given score. In general terms, the smaller E is, the more likely the match is significant.
K = A variable with a value dependent upon the substitution matrix used and adjusted for search base size.
m = Length of query (in nucleotides or amino acids).
n = Size of database (in nucleotides or amino acids).
mn = Size of the search space (more on this later).
$\lambda$ = A statistical parameter used as a natural scale for the scoring system.
S = Raw score: the sum of substitution scores (ungapped BLAST) or of substitution and gap scores.

The E value decreases exponentially as the score (S) increases. The E value reflects the size of the database and the scoring system in use. A convenient way to create a significance threshold for reporting hits is to alter the E value: when the Expect value threshold is increased from the default value of 10, more hits can be reported.
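
The relation between E, K, m, n, the scale parameter lambda and S can be explored numerically with a short sketch. The values passed for `k` and `lam` below are placeholders chosen for illustration, not parameters reported by BLAST; real values depend on the scoring system and appear in the search statistics of a BLAST report.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def min_score_for_threshold(e_threshold, query_len, db_len, k, lam):
    """Smallest raw score S whose E value does not exceed e_threshold.

    Obtained by solving E = K * m * n * exp(-lambda * S) for S.
    """
    return math.log(k * query_len * db_len / e_threshold) / lam


# Placeholder parameters (assumptions, for illustration only).
K_DEMO, LAMBDA_DEMO = 0.5, 1.0

e_small_db = expect_value(40, query_len=100, db_len=1_000_000, k=K_DEMO, lam=LAMBDA_DEMO)
e_large_db = expect_value(40, query_len=100, db_len=2_000_000, k=K_DEMO, lam=LAMBDA_DEMO)
s_cutoff = min_score_for_threshold(10, query_len=100, db_len=1_000_000, k=K_DEMO, lam=LAMBDA_DEMO)
```

Doubling the database size doubles E for the same score, while each additional unit of raw score divides E by a factor of e raised to lambda. Raising the threshold lowers `s_cutoff`, which is why more hits can be reported.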

Example:

Torpedo marmorata (marbled electric ray) mRNA for chloride channel protein
2,673 bp linear mRNA
Accession: X56758.1 GI: 64424

Formatting options

Review details of the search process

BLAST search parameters specified in the query
Which database was searched?
Which matrix was used during the search?
On which date was the searched database built?
What was the size of the database at that time?

BLAST practice questions


(1) (BLASTN) A frequent use of nucleotide-nucleotide BLAST is to check oligonucleotides for hybridization or PCR. The goal most people have when doing this is to make sure that the primer will give a unique product from the target genome or cDNA population. Because BLAST is local and searches both strands, one can simply concatenate a pair of +/- strand primers and use them in a single search. Combine the following pair of candidate PCR primers in a nucleotide-nucleotide search against the nr (non-redundant) and the default (human genomic + transcript) nucleotide databases and identify the gene amplified.

F12   GTCAAGTGGCAACTCCGTCAG
R8    TTGAGAGATGGATTGTTGCGC

(2) (BLASTN) Higher eukaryotic genomes contain large amounts of repetitive DNA. The most abundant interspersed repeat in the human genome is the Alu element. Alus tend to occur near genes, within the introns of genes, or in the regions between genes. In some cases, their presence and absence can fairly accurately show the intron-exon structure of a gene. Demonstrate this by performing a nucleotide-nucleotide BLAST search against the Alu database with the genomic sequence of the human Von Hippel-Lindau syndrome gene (Accession AF010238). Note that the exons appear in the BLAST graphic as places where the Alu elements do not align.

(3) (Genome BLAST) Use Entrez Gene to find the entry for the human glyceraldehyde-3-phosphate dehydrogenase gene. Click on the Map Viewer link to find the map location and the contig containing the GAPD gene. Zoom in to see the exon-intron structure of the gene. How many exons are there? Now use human genome BLAST to verify the location and structure of this gene. Use the GAPD RefSeq (NM_002046) to perform this search. How many contigs do you hit in the human genome? Click on the Genome View button to see the distribution of these hits on the genome. Look at some of the high-scoring single hits to see what is unusual about them. How can you account for these results?
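
For question (1), the primer pair can be combined into a single query before it is pasted into the search form. The helper below is a small convenience sketch; its name and the FASTA title are hypothetical, and it simply joins the two primers as described in the question.

```python
def make_primer_query(forward, reverse, name="primer_pair"):
    """Concatenate a forward and a reverse primer into one FASTA-formatted query."""
    query = forward.upper() + reverse.upper()
    return f">{name}\n{query}\n"


F12 = "GTCAAGTGGCAACTCCGTCAG"
R8 = "TTGAGAGATGGATTGTTGCGC"
fasta = make_primer_query(F12, R8, name="F12_R8")
print(fasta)
```

Because the combined query is short, the automatic parameter adjustment for short sequences mentioned in the introduction is relevant to this search.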

MPG Bioinformatics Support Service
http://www.biochem.mpg.de/iv
Wiki on bioinformatics tools developed in the Max Planck Society
http://www.bioinfowiki.mpg.de/
N. Gaedeke
Last update: September 3rd, 2013
