
Aylwyn Scally, Wellcome Trust Sanger Institute

March 2012

Key tasks in sequence analysis

Data handling
Alignment to a reference sequence
Alignment file handling
Variant calling
SNPs, genotypes
structural variation

Sequence assembly

Data handling
Important to have a data hierarchy corresponding to experimental factors
species: hsa
strain/subspecies/population: YRI, CEU
individual: NA12878, NA19240
sequencing technology: SLX, 454
library: NA12878-WG, NA19240-WG
lane/run: 297_1, 297_2, 505_7, 505_8
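
The hierarchy above maps naturally onto a directory layout with one level per experimental factor. A minimal sketch, assuming that layout; the function name `lane_path`, the root directory `data`, and the particular combination of values are illustrative and not taken from the slides.

```python
from pathlib import PurePosixPath

# Levels in the order given by the hierarchy, outermost first.
LEVELS = ("species", "population", "individual", "technology", "library", "lane")

def lane_path(root, **factors):
    """Build root/species/population/individual/technology/library/lane."""
    missing = [level for level in LEVELS if level not in factors]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    return PurePosixPath(root).joinpath(*(factors[level] for level in LEVELS))

# Hypothetical example values drawn from the labels in the hierarchy.
path = lane_path(
    "data",
    species="hsa",
    population="CEU",
    individual="NA12878",
    technology="SLX",
    library="NA12878-WG",
    lane="297_1",
)
print(path)
```

Keeping the level order in one tuple means every script that reads or writes lane data agrees on where a file lives.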

Raw sequence data


FASTQ format
original Sanger standard for capillary data
derived from FASTA format
sequence and an associated per-base quality score
PHRED quality scores encoded as ASCII printable characters (ASCII 33-126)
standard offset 33, but older Solexa/Illumina variants used 64
@title
sequence
+optional_text
quality

@SRR010930.8436795/1
ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA
+
=>=??A?<@B@A:?B?D;AC@@CAAAD<AAA:99?:@=?@B@77C><4
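
Reading such records can be sketched in a few lines of Python. This assumes the common four-lines-per-record layout (no wrapped sequence lines) and the standard quality offset of 33; the function name `read_fastq` is a hypothetical helper, not part of any library.

```python
import io

def read_fastq(handle):
    """Yield (title, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:  # end of input (or a blank line) stops the reader
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {title!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield title[1:], seq, qual

# The example record from the slide.
example = io.StringIO(
    "@SRR010930.8436795/1\n"
    "ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA\n"
    "+\n"
    "=>=??A?<@B@A:?B?D;AC@@CAAAD<AAA:99?:@=?@B@77C><4\n"
)
records = list(read_fastq(example))
title, seq, qual = records[0]
scores = [ord(ch) - 33 for ch in qual]  # decode with the standard offset 33
print(title, len(seq), min(scores), max(scores))
```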

PHRED quality scores


Encodes the probability of an erroneous call
quality score $Q = -10 \log_{10} P$
error probability $P = 10^{-Q/10}$
example: a call with $Q = 30$ has error probability $P = 10^{-3}$, i.e. 1 in 1000
ASCII encoding
encoding  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4
Q score   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
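
The conversions between quality score, error probability and ASCII character can be sketched as below. The function names are hypothetical; the offsets 33 and 64 are the ones named in the FASTQ slide.

```python
import math

def phred_to_prob(q):
    """Error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Quality score Q = -10 log10(P)."""
    return -10 * math.log10(p)

def encode(q, offset=33):
    """ASCII character for an integer quality score."""
    return chr(q + offset)

def decode(ch, offset=33):
    """Integer quality score for an ASCII character."""
    return ord(ch) - offset

print(phred_to_prob(30))        # error probability for Q = 30
print(encode(0), encode(19))    # first and last entries of the table above
print(decode("h", offset=64))   # older Solexa/Illumina offset
```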

Alignment pipeline
Data processing
Get data
Prepare and index reference
sequence names; alternate haplotypes etc.

Alignment
Align data
by lane or smaller unit, to optimise throughput

SAM file processing
Sort by position
Merge alignments
Improve alignments
Merge libraries
Index final alignment

Alignment pipeline
[Figure: Fastq files at the lane/plex level are aligned to BAM files, which are merged per library, improved, and merged again per sample/platform]

bwa
bwa index [-a bwtsw|div|is] [-c] <in.fasta>
Burrows-Wheeler transform construction algorithm
bwtsw for vertebrate-sized genomes, is for smaller genomes

bwa aln [options] <prefix> <in.fq>


align each single-ended fastq file individually
<prefix> is name of reference file
options control alignment parameters, scoring matrix, seed length

bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>


generates pairwise alignment from sai files produced by bwa aln
produces SAM output

bwa bwasw [options] <prefix> <query.fa>


alignment of long reads in query.fa
produces SAM output
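
The paired-end workflow described above (index, `aln` per fastq file, then `sampe`) can be sketched as a small command builder. This only assembles and prints the command lines; it does not run bwa. The file names and the helper names `bwa_paired_end_commands` and `render` are hypothetical, and the sketch assumes `bwa aln` and `bwa sampe` write their results to standard output, which is then redirected to a file.

```python
import shlex

def bwa_paired_end_commands(ref, fq1, fq2, out_sam, trim_q=20):
    """Return (argv, stdout_file) pairs; stdout_file is None when nothing is redirected."""
    sai1, sai2 = fq1 + ".sai", fq2 + ".sai"
    return [
        (["bwa", "index", "-a", "bwtsw", ref], None),
        (["bwa", "aln", "-q", str(trim_q), ref, fq1], sai1),
        (["bwa", "aln", "-q", str(trim_q), ref, fq2], sai2),
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], out_sam),
    ]

def render(commands):
    """Format each command as a shell line, adding a redirect where needed."""
    lines = []
    for argv, stdout_file in commands:
        line = shlex.join(argv)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        lines.append(line)
    return lines

lines = render(bwa_paired_end_commands("ref.fa", "lane_1.fq", "lane_2.fq", "lane.sam"))
print("\n".join(lines))
```

Building the argument lists separately from the rendering step makes it easy to hand the same lists to a job scheduler instead of a shell.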

bwa usage notes


bwa finds matches up to a finite edit distance
by default for 100-bp reads allows 5 edits

Important to quality-clip reads


-q in bwa aln, e.g. set to 20

Non-ACGT bases on reads are treated as mismatches

Parallelise for speed
split data into 1 Gbp blocks
bwa takes ~8 hrs per block

Check for truncated BAM files


e.g. with samtools flagstat
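
A complementary truncation check, not the samtools method named above, is to look for the 28-byte empty BGZF block that the SAM/BAM specification defines as an end-of-file marker. A minimal sketch, assuming the BAM was written by a tool that appends this marker (files from older tools may lack it); `has_bgzf_eof` is a hypothetical helper name.

```python
import os

# The 28-byte empty BGZF block defined as the end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200"
    "1b000300 00000000 00000000"
)

def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A missing marker suggests the file was cut short while being written or copied; a present marker does not by itself prove the contents are intact, so it is worth combining with a read-count check.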

Alignment improvement
Library duplicate removal
samtools, Picard

Realignment around indels


GATK

Base quality recalibration


GATK

Library duplicate removal


PCR amplification step in library preparation can result
in duplicate DNA fragments
PCR-free protocols exist but require larger volumes of
input DNA
Generally a low number of duplicates in good libraries;
increases with depth of sequencing

Duplicates can result in false SNP calls


manifest as high read depth support

Removal method
Identify read-pairs where outer ends map to the same
position on the genome and remove all but one copy
samtools rmdup
Picard/GATK MarkDuplicates
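
The removal method can be illustrated with a toy version in Python. This is a sketch of the idea only, not the samtools or Picard implementation: read pairs are plain tuples, and keeping the pair with the highest mean base quality is an assumption (the slide says only that all but one copy is removed). All names and values are hypothetical.

```python
from collections import namedtuple

ReadPair = namedtuple("ReadPair", "name chrom start end strand mean_qual")

def find_duplicates(pairs):
    """Return names of pairs to discard: among pairs sharing the same outer
    coordinates and strand, all but the highest-quality one."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.strand)
        if key not in best or pair.mean_qual > best[key].mean_qual:
            best[key] = pair
    keep = {pair.name for pair in best.values()}
    return [pair.name for pair in pairs if pair.name not in keep]

# Hypothetical read pairs: "a" and "b" share outer ends, "c" does not.
pairs = [
    ReadPair("a", "X", 100, 350, "+", 30.0),
    ReadPair("b", "X", 100, 350, "+", 35.0),
    ReadPair("c", "X", 120, 350, "+", 20.0),
]
print(find_duplicates(pairs))
```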

Realignment
Short indels in the sample relative to reference pose
difficulties for alignment
Indels occurring near the ends of reads often not aligned
correctly
Aligners prefer to introduce SNPs rather than an indel

Realignment algorithm
Input set of known indel sites and a BAM file
Previously published indel sites, dbSNP, 1000 Genomes, or
estimate from alignment

At each site, model the indel and reference haplotypes and select best fit with data
New BAM file produced, modified where indels have been
introduced by realignment
Implemented in GATK (IndelRealigner function)

Additional alignment issues


Separate chromosomal BAMs
easier to process in parallel

Realign/assemble unmapped reads


recover sequence missed due to reference
incompatibility or incompleteness

SAM/BAM
Sequence Alignment/Map format
unified format for storing read alignments to a
reference genome

BAM (Binary Alignment/Map) format


binary equivalent of SAM

Features

stores alignments from most alignment programs


supports multiple sequencing technologies
supports indexing for quick retrieval
reads can be classed into logical groups
e.g. lanes, libraries, individuals

SAM file format


Header
Alignment lines (one per read)
11 mandatory fields
several optional fields (format TAG:TYPE:VALUE)
Col  Field  Type  Description
1    QNAME  str   query name of the read or the read pair
2    FLAG   int   bitwise flag (pairing, mapped, mate mapped, etc.)
3    RNAME  str   reference sequence name
4    POS    int   1-based leftmost position of clipped alignment
5    MAPQ   int   mapping quality (Phred scaled)
6    CIGAR  str   extended CIGAR string (details of alignment)
7    RNEXT  str   mate reference name (= if same as RNAME)
8    PNEXT  int   position of mate/next segment
9    TLEN   int   observed template length
10   SEQ    str   segment sequence
11   QUAL   str   ASCII of Phred-scaled base quality
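
Splitting an alignment line into these fields can be sketched with the standard library alone. The helper names are hypothetical; the record used is the example alignment from these slides, joined with tabs. For real work a library such as Pysam is the more robust choice.

```python
import re
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord", "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags"
)
INT_FIELDS = {1, 3, 4, 7, 8}  # zero-based indices of FLAG, POS, MAPQ, PNEXT, TLEN

def parse_sam_line(line):
    """Parse one tab-separated alignment line into a SamRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 mandatory fields")
    fixed = [int(v) if i in INT_FIELDS else v for i, v in enumerate(cols[:11])]
    tags = {}
    for item in cols[11:]:  # optional fields, format TAG:TYPE:VALUE
        tag, typ, value = item.split(":", 2)
        tags[tag] = (typ, value)
    return SamRecord(*fixed, tags)

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

line = "\t".join([
    "IL4_315:7:105:408:43", "177", "X", "1741", "0", "1S35M", "X", "56845228", "0",
    "ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG",
    "+1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>",
])
rec = parse_sam_line(line)
print(rec.rname, rec.pos, parse_cigar(rec.cigar))
```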

SAM format
Example
Field  Value
QNAME  IL4_315:7:105:408:43
FLAG   177
RNAME  X
POS    1741
MAPQ   0
CIGAR  1S35M
RNEXT  X
PNEXT  56845228
TLEN   0
SEQ    ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG
QUAL   +1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>

http://picard.sourceforge.net/explain-flags.html
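
The same decoding can be done locally. A minimal sketch of unpacking the bitwise FLAG field, using the bit meanings from the SAM specification; the short labels and the function name `explain_flag` are this sketch's own wording.

```python
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
]

def explain_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

# FLAG of the example record: 177 = 1 + 16 + 32 + 128
print(explain_flag(177))
```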

SAM/BAM file processing tools


samtools
C program and library
http://samtools.sourceforge.net

view: SAM-BAM conversion


sort, index, merge multiple BAM files
flagstat: summary counts of mapping flags

Picard
Java program suite
http://picard.sourceforge.net

MarkDuplicates, CollectAlignmentSummaryMetrics,
CreateSequenceDictionary, SamToFastq, MeanQualityByCycle

Pysam
Python interface to samtools API
http://code.google.com/p/pysam/

Variant calling
Call SNPs with genotypes (heterozygous and
homozygous), indels and structural variants
Tools
samtools, bcftools
GATK, SOAPsnp, Dindel
SVMerge

File formats:
VCF, pileup

Filters and calling protocols


depth, quality, strand bias, multiple samples

Indels harder to call accurately than SNPs


structural variation harder still

Variant Call Format (VCF)


Stores polymorphism data with annotation
SNPs, insertions, deletions and structural variants

Can be indexed for fast data retrieval


Variant calls across many samples
Metadata
e.g. dbSNP accession, filter status, validation status

Arbitrary tags can be used to describe new types of variant
Note: binary BCF produced by samtools
get vcf with samtools mpileup | bcftools view

VCF Format
Header
Arbitrary number of INFO definition lines starting with ##
Column definition line starts with single #

Mandatory columns

Chromosome (CHROM)
Position of the start of the variant (POS)
Unique identifiers of the variant (ID)
Reference allele (REF)
Comma separated list of alternate non-reference alleles
(ALT)
Phred-scaled quality score (QUAL)
Site filtering information (FILTER)
User extensible annotation (INFO)
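
Reading the eight mandatory columns of a data line can be sketched as follows. The helper names are hypothetical and the record is the example site from these slides, joined with tabs; a dedicated VCF library would handle headers and edge cases that this sketch ignores.

```python
def parse_info(info):
    """Turn 'DP=31;AF1=0.7' into a dict; keys without '=' are flags set to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def parse_vcf_line(line):
    """Parse the mandatory columns of one tab-separated VCF data line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 mandatory columns")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),  # comma-separated alternate alleles
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": parse_info(info),
    }

line = "\t".join(["3", "74393", ".", "G", "T", "999", ".",
                  "DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2"])
site = parse_vcf_line(line)
print(site["CHROM"], site["POS"], site["REF"], site["ALT"], site["INFO"]["DP"])
```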

VCF format
Example
Column   Value
CHROM    3
POS      74393
ID       .
REF      G
ALT      T
QUAL     999
FILTER   .
INFO     DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2
FORMAT   GT:PL:DP:GQ
sample1  1/1:181,57,0:19:57
sample2  1/1:90,15,0:5:16
sample3  0/0:0,12,85:4:7
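
The per-sample columns are read against the FORMAT column. A minimal sketch, assuming a biallelic site where the PL field lists Phred-scaled likelihoods for genotypes 0/0, 0/1 and 1/1 in that order, so the lowest value marks the most likely genotype; the function names are hypothetical.

```python
def parse_sample(fmt, sample):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(fmt.split(":"), sample.split(":")))

def most_likely_genotype(pl_field):
    """Genotype with the lowest PL at a biallelic site (order 0/0, 0/1, 1/1)."""
    genotypes = ("0/0", "0/1", "1/1")
    pls = [int(v) for v in pl_field.split(",")]
    if len(pls) != len(genotypes):
        raise ValueError("expected three PL values for a biallelic site")
    return genotypes[pls.index(min(pls))]

fmt = "GT:PL:DP:GQ"
samples = ["1/1:181,57,0:19:57", "1/1:90,15,0:5:16", "0/0:0,12,85:4:7"]
for text in samples:
    fields = parse_sample(fmt, text)
    print(fields["GT"], most_likely_genotype(fields["PL"]), fields["DP"])
```

For all three example samples the called genotype agrees with the lowest PL value.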

see H. Li, Bioinformatics 27(21): 2987-2993 (2011) for details of likelihood and population genetic calculations

More information
SNP calling and genotyping
Samtools
http://bioinformatics.oxfordjournals.org/content/25/16/2078.long
http://samtools.sourceforge.net

GATK
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

VCF
VCFtools
http://vcftools.sourceforge.net
Danecek et al. Bioinformatics 27(15): 2156-2158 (2011)

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Structural variation
Read depth and pairing information used to detect events
deviations from the expected fragment size
presence/absence of mate pairs
excessive/reduced read depth (CNV)
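
The first signal, deviation from the expected fragment size, can be illustrated with a toy outlier test. This is a sketch only, not the method of any particular caller: it flags read pairs whose template length lies more than a chosen number of median absolute deviations from the median. The threshold of 5 and the example values are hypothetical.

```python
import statistics

def discordant_pairs(tlens, n_mads=5.0):
    """Return indices of pairs whose fragment size deviates strongly from the median."""
    sizes = [abs(t) for t in tlens]  # TLEN is signed; use its magnitude
    median = statistics.median(sizes)
    mad = statistics.median([abs(s - median) for s in sizes])
    return [i for i, s in enumerate(sizes) if abs(s - median) > n_mads * mad]

# Hypothetical template lengths; the sixth pair spans far more than the rest.
tlens = [300, 310, 295, 305, 290, 2500, 300, 315]
print(discordant_pairs(tlens))
```

Median-based statistics are used here because a few very large fragments would otherwise inflate a mean and standard deviation and hide themselves.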

Several methods/tools released


SVMerge pipeline
makes SV predictions using a collection of callers
Input is one BAM file per sample
callers run individually and outputs converted into standard BED format
calls merged
computationally validated using local de novo assembly
http://svmerge.sourceforge.net/

Assembly
Tools
ABySS
http://www.bcgsc.ca/platform/bioinfo/software/abyss

SGA
https://github.com/jts/sga

SOAPdenovo
http://soap.genomics.org.cn/soapdenovo.html

ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog

Cortex
http://cortexassembler.sourceforge.net/

Velvet
http://www.ebi.ac.uk/~zerbino/velvet/

Assembly metrics
N50, N10, N90 etc.
x% of assembly is in fragments larger than Nx
Number of contigs, mean/max contig length
Realignment
fraction of read pairs mapped correctly
correct homozygous SNPs
identify breakpoints
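
The Nx statistics can be computed directly from a list of contig lengths. A minimal sketch, taking Nx as the length of the contig at which the running total, counted from the longest contig down, first reaches x% of the assembly; the function name `nx` and the contig lengths are hypothetical.

```python
def nx(lengths, x):
    """Length L such that contigs of length >= L hold at least x% of the assembly."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    raise ValueError("no contigs given")

contigs = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths, total 290
print(nx(contigs, 50), nx(contigs, 10), nx(contigs, 90))
```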

Thanks to Thomas Keane and the Vertebrate Resequencing team at WTSI for several slides
