March 2012
Data handling
Alignment to a reference sequence
Alignment file handling
Variant calling
SNPs, genotypes
structural variation
Sequence assembly
Data handling
Important to have a data hierarchy
corresponding to experimental factors
Example hierarchy (figure):
species: hsa
strain/subspecies/population: YRI, CEU
individual: NA12878, NA19240
sequencing technology: SLX, 454
library: NA12878-WG, NA19240-WG
lane/run: 297_1, 297_2, 505_7, 505_8
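One way to make such a hierarchy concrete is to mirror it in a directory layout. The sketch below is only an illustration: the function name `lane_path`, the `data` root and the ordering of levels are assumptions, not something the slides prescribe, and pairing lane 297_1 with the NA12878-WG library is a reading of the figure.

```python
from pathlib import PurePosixPath

# Levels of the hierarchy, from most general to most specific.
LEVELS = ("species", "population", "individual", "technology", "library", "lane")

def lane_path(root, **factors):
    """Build a directory path for one lane from its experimental factors."""
    missing = [level for level in LEVELS if level not in factors]
    if missing:
        raise ValueError(f"missing hierarchy levels: {missing}")
    return PurePosixPath(root, *(str(factors[level]) for level in LEVELS))

# Hypothetical root directory "data"; factor values are taken from the figure.
path = lane_path(
    "data",
    species="hsa", population="CEU", individual="NA12878",
    technology="SLX", library="NA12878-WG", lane="297_1",
)
print(path)  # data/hsa/CEU/NA12878/SLX/NA12878-WG/297_1
```

Keeping every level in the path means files can be merged by library or by sample later without consulting a separate lookup table.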
Example FASTQ record:
@SRR010930.8436795/1
ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA
+
=>=??A?<@B@A:?B?D;AC@@CAAAD<AAA:99?:@=?@B@77C><4
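A FASTQ record is four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The following minimal sketch parses the record shown above; it assumes Sanger (Phred+33) quality encoding, which the slides do not state, and the function name `parse_fastq_record` is made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, sequence and Phred scores."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Sanger/Phred+33 encoding (quality = ASCII code - 33).
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores

record = [
    "@SRR010930.8436795/1",
    "ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA",
    "+",
    "=>=??A?<@B@A:?B?D;AC@@CAAAD<AAA:99?:@=?@B@77C><4",
]
name, seq, scores = parse_fastq_record(record)
print(name, len(seq), min(scores), max(scores))
```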
Alignment pipeline
Get data
Prepare and index reference
DATA PROCESSING
ALIGNMENT
Align data
by lane or smaller unit, to optimise throughput
Sort by position
Merge alignments
Improve alignments
Merge libraries
Index final alignment
SAM FILE PROCESSING
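The "sort by position" and "merge alignments" steps can be illustrated on toy data. This is a sketch of the idea only, not how samtools implements it: alignments are reduced to made-up `(reference, position, read name)` tuples, and reference names are compared as strings, whereas real tools order references as listed in the BAM header.

```python
import heapq

def sort_key(aln):
    """Order alignments by reference name, then leftmost position."""
    rname, pos, _qname = aln
    return (rname, pos)

def merge_sorted_lanes(*lanes):
    """Merge position-sorted lanes into one position-sorted list."""
    return list(heapq.merge(*lanes, key=sort_key))

# Toy alignments as (reference, position, read name); values are made up.
lane_a = sorted([("1", 500, "a2"), ("1", 100, "a1"), ("2", 50, "a3")], key=sort_key)
lane_b = sorted([("1", 300, "b1"), ("2", 10, "b2")], key=sort_key)

merged = merge_sorted_lanes(lane_a, lane_b)
print([name for _, _, name in merged])  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

Because each lane is already sorted, the merge only needs one pass over the inputs, which is why sorting per lane before merging keeps the later steps cheap.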
Alignment pipeline (figure)
Fastq files are aligned per lane/plex, giving one BAM per lane
Lane BAMs are merged into one BAM per library
Library BAMs go through improvement
Improved BAMs are merged into one BAM per sample/platform
bwa
bwa index [-a bwtsw|div|is] [-c] <in.fasta>
-a selects the Burrows-Wheeler transform construction algorithm
bwtsw for vertebrate sized genomes, is for smaller genomes
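For intuition, the Burrows-Wheeler transform itself can be written in a few lines. This is the naive sorted-rotations construction, shown only to illustrate what the index is built from; it is not the `bwtsw` or `is` algorithm that bwa uses, and it would be far too slow and memory-hungry for a genome.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text using sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

print(bwt("banana"))  # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```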
Alignment improvement
Library duplicate removal
samtools, Picard
Removal method
Identify read-pairs where outer ends map to the same
position on the genome and remove all but one copy
samtools rmdup
Picard/GATK MarkDuplicates
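The removal method described above can be sketched as follows. The record layout (dicts with `library`, `chrom`, `start`, `end`, `strand`, `qual_sum`) is hypothetical, and keeping the pair with the highest summed base quality is an assumption about which copy to retain; it is not a description of what samtools rmdup or MarkDuplicates do internally.

```python
from collections import defaultdict

def remove_duplicates(pairs):
    """Keep one read pair per (library, chrom, outer start, outer end, strand)."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["library"], pair["chrom"], pair["start"], pair["end"], pair["strand"])
        groups[key].append(pair)
    # Assumption: retain the pair with the highest summed base quality.
    return [max(group, key=lambda p: p["qual_sum"]) for group in groups.values()]

# Toy read pairs; all values are made up.
pairs = [
    {"name": "p1", "library": "NA12878-WG", "chrom": "1", "start": 100, "end": 350,
     "strand": "+", "qual_sum": 900},
    {"name": "p2", "library": "NA12878-WG", "chrom": "1", "start": 100, "end": 350,
     "strand": "+", "qual_sum": 1200},
    {"name": "p3", "library": "NA12878-WG", "chrom": "1", "start": 100, "end": 360,
     "strand": "+", "qual_sum": 800},
]
kept = remove_duplicates(pairs)
print([p["name"] for p in kept])  # ['p2', 'p3']
```

The library is part of the key because duplicates arise within a library, so pairs from different libraries with the same coordinates are kept.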
Realignment
Short indels in the sample relative to reference pose
difficulties for alignment
Indels occurring near the ends of reads often not aligned
correctly
Aligners prefer to introduce SNPs rather than an indel
Realignment algorithm
Input set of known indel sites and a BAM file
Previously published indel sites, dbSNP, 1000 Genomes, or
estimate from alignment
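A toy example shows why an indel near a read end tends to be reported as SNPs, and how a known indel site helps. The sequences and helper names below are invented for illustration, and the scoring is a plain mismatch count; this is not the GATK realignment algorithm.

```python
def mismatches(ref_segment, read_segment):
    """Count positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(ref_segment, read_segment))

def ungapped_mismatches(ref, read, start):
    """Mismatches when the read is placed at start with no gaps."""
    return mismatches(ref[start:start + len(read)], read)

def deletion_mismatches(ref, read, start, del_pos, del_len):
    """Mismatches when a known deletion of ref[del_pos:del_pos + del_len] is applied."""
    left_len = del_pos - start
    left = mismatches(ref[start:del_pos], read[:left_len])
    right_start = del_pos + del_len
    right_ref = ref[right_start:right_start + len(read) - left_len]
    return left + mismatches(right_ref, read[left_len:])

# Made-up reference; the sample lacks the three bases ref[8:11] ("TTG").
ref = "ACGTACGTTTGACCAGTCAGGA"
read = "ACGTACGTACCA"  # the deletion falls 4 bases from the read's end

ungapped = ungapped_mismatches(ref, read, start=0)
gapped = deletion_mismatches(ref, read, start=0, del_pos=8, del_len=3)
print(ungapped, gapped)  # 3 0
```

With only four bases beyond the deletion, three mismatches can look cheaper to an aligner than opening a gap; supplying the known site lets the read be rescored against the gapped placement.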
SAM/BAM
Sequence Alignment/Map format
unified format for storing read alignments to a
reference genome
Features
Mandatory fields
Field   Type  Description
QNAME   str   query name of the read or the read pair
FLAG    int   bitwise flag (pairing, mapped, mate mapped, etc.)
RNAME   str   reference sequence name
POS     int   1-based leftmost position of clipped alignment
MAPQ    int   mapping quality (Phred scaled)
CIGAR   str   extended CIGAR string (details of alignment)
RNEXT   str   mate reference name (= if same as RNAME)
PNEXT   int   position of mate/next segment
TLEN    int   observed template length
SEQ     str   segment sequence
QUAL    str   ASCII of Phred-scaled base quality
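The eleven mandatory fields map naturally onto a typed record. The sketch below splits one tab-separated alignment line using only the standard library; `SamRecord` and `parse_sam_line` are illustrative names, optional tags after the eleventh column are ignored, and the test line is the example record from this deck.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str

# Fields stored as integers, following the Type column of the table.
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}

def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields; optional tags are ignored."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    values = [
        int(value) if name in INT_FIELDS else value
        for name, value in zip(SamRecord._fields, columns)
    ]
    return SamRecord(*values)

line = "\t".join([
    "IL4_315:7:105:408:43", "177", "X", "1741", "0", "1S35M", "X", "56845228", "0",
    "ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG",
    "+1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>",
])
rec = parse_sam_line(line)
print(rec.qname, rec.flag, rec.pos, rec.cigar)
```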
SAM format
Example
QNAME   IL4_315:7:105:408:43
FLAG    177
RNAME   X
POS     1741
MAPQ    0
CIGAR   1S35M
RNEXT   X
PNEXT   56845228
TLEN    0
SEQ     ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG
QUAL    +1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>
http://picard.sourceforge.net/explain-flags.html
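The FLAG and CIGAR fields of the example can also be decoded by hand. The sketch below lists the flag bits 0x1 to 0x400 from the SAM specification and splits a CIGAR string into (length, operation) pairs; the function names and the short labels are illustrative choices.

```python
import re

# Flag bits 0x1 to 0x400 as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
]

def explain_flag(flag):
    """List the labels of all bits set in a SAM flag."""
    return [label for bit, label in FLAG_BITS if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

print(explain_flag(177))
print(parse_cigar("1S35M"), reference_span("1S35M"))
```

For the example record, flag 177 is 128 + 32 + 16 + 1, and the CIGAR 1S35M means one soft-clipped base followed by 35 aligned bases, matching the 36-base sequence.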
Picard
Java program suite
http://picard.sourceforge.net
MarkDuplicates, CollectAlignmentSummaryMetrics,
CreateSequenceDictionary, SamToFastq, MeanQualityByCycle
Pysam
Python interface to samtools API
http://code.google.com/p/pysam/
Variant calling
Call SNPs with genotypes (heterozygous and
homozygous), indels and structural variants
Tools
samtools, bcftools
GATK, SOAPsnp, Dindel
SVMerge
File formats:
VCF, pileup
VCF Format
Header
Arbitrary number of meta-information lines (e.g. INFO definitions) starting with ##
Column definition line starts with single #
Mandatory columns
Chromosome (CHROM)
Position of the start of the variant (POS)
Unique identifiers of the variant (ID)
Reference allele (REF)
Comma separated list of alternate non-reference alleles
(ALT)
Phred-scaled quality score (QUAL)
Site filtering information (FILTER)
User extensible annotation (INFO)
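The eight mandatory columns can be read with a small standard-library parser. This is a minimal sketch, not a replacement for VCFtools: `parse_vcf_site` and `parse_info` are illustrative names, INFO values are left as strings, and the test line is built from the example site in this deck.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

def parse_info(info):
    """Turn 'key=value;flag' into a dict; flags without '=' map to True."""
    if info == ".":
        return {}
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def parse_vcf_site(line):
    """Parse the eight mandatory columns of one VCF data line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 columns")
    site = dict(zip(VCF_COLUMNS, columns))
    site["POS"] = int(site["POS"])
    site["ALT"] = site["ALT"].split(",")  # comma-separated alternate alleles
    site["QUAL"] = None if site["QUAL"] == "." else float(site["QUAL"])
    site["INFO"] = parse_info(site["INFO"])
    return site

line = "\t".join(["3", "74393", ".", "G", "T", "999", ".",
                  "DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2"])
site = parse_vcf_site(line)
print(site["CHROM"], site["POS"], site["REF"], site["ALT"], site["INFO"]["DP"])
```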
VCF format
Example
CHROM    3
POS      74393
ID       .
REF      G
ALT      T
QUAL     999
FILTER   .
INFO     DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2
FORMAT   GT:PL:DP:GQ
sample1  1/1:181,57,0:19:57
sample2  1/1:90,15,0:5:16
sample3  0/0:0,12,85:4:7
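The per-sample columns are read by pairing each value with its key from the FORMAT column. The sketch below decodes the three samples of the example; it assumes a biallelic, diploid site, where PL holds Phred-scaled likelihoods for 0/0, 0/1 and 1/1 in that order and the smallest value marks the most likely genotype. The function names are illustrative.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def describe_genotype(gt):
    """Classify a diploid GT string such as 0/0, 0/1 or 1/1."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"

def most_likely_genotype(pl):
    """Assumes a biallelic diploid site: PL order is 0/0, 0/1, 1/1."""
    likelihoods = [int(x) for x in pl.split(",")]
    return ("0/0", "0/1", "1/1")[likelihoods.index(min(likelihoods))]

fmt = "GT:PL:DP:GQ"
samples = {
    "sample1": "1/1:181,57,0:19:57",
    "sample2": "1/1:90,15,0:5:16",
    "sample3": "0/0:0,12,85:4:7",
}
for name, field in samples.items():
    call = parse_sample(fmt, field)
    print(name, describe_genotype(call["GT"]), most_likely_genotype(call["PL"]), call["DP"])
```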
More information
SNP calling and genotyping
Samtools
http://bioinformatics.oxfordjournals.org/content/25/16/2078.long
http://samtools.sourceforge.net
GATK
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
VCF
VCFtools
http://vcftools.sourceforge.net
Danecek et al. Bioinformatics 27(15): 2156-2158 (2011)
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Structural variation
Read depth and pairing information used to detect events
deviations from the expected fragment size
presence/absence of mate pairs
excessive/reduced read depth (CNV)
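The first signal, deviation from the expected fragment size, can be sketched with a robust outlier test on template lengths. This is an illustration only: the cutoff of five median absolute deviations and the function name are assumptions, and real callers combine this with read depth and mate-pair evidence.

```python
from statistics import median

def flag_abnormal_inserts(tlens, n_mads=5.0):
    """Return indices of pairs whose |TLEN| deviates strongly from the median."""
    sizes = [abs(t) for t in tlens]
    centre = median(sizes)
    mad = median([abs(s - centre) for s in sizes])
    # Guard against a zero spread when most fragments have the same size.
    spread = max(mad, 1)
    return [i for i, s in enumerate(sizes) if abs(s - centre) > n_mads * spread]

# Toy template lengths; the sign only reflects which mate is leftmost.
tlens = [300, 310, -295, 305, 290, 2500, 300, -40]
print(flag_abnormal_inserts(tlens))  # [5, 7]
```

In this toy data the 2500 bp pair is consistent with a deletion in the sample relative to the reference, and the 40 bp pair with an insertion.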
Assembly
Tools
Abyss
http://www.bcgsc.ca/platform/bioinfo/software/abyss
SGA
https://github.com/jts/sga
SOAPdenovo
http://soap.genomics.org.cn/soapdenovo.html
ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog
Cortex
http://cortexassembler.sourceforge.net/
Velvet
http://www.ebi.ac.uk/~zerbino/velvet/
Assembly metrics
N50, N10, N90 etc
Nx: x% of the assembly is in fragments of length Nx or larger
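The Nx definition translates directly into code: sort the fragment lengths from largest to smallest and accumulate until x% of the total is reached. The contig lengths below are made up and the function name `nx` is an illustrative choice.

```python
def nx(lengths, x=50):
    """Return Nx: fragments of this length or larger hold at least x% of the assembly."""
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * x:
            return length
    raise ValueError("no fragments given")

contigs = [100, 80, 60, 40, 20]  # toy contig lengths, total 300
print(nx(contigs, 10), nx(contigs, 50), nx(contigs, 90))  # 100 80 40
```

N10 is always at least N50, which is at least N90, since a smaller share of the assembly is covered by fewer and longer fragments.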