Академический Документы
Профессиональный Документы
Культура Документы
...
Thomas Girke
Slide 1/53
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
Slide 2/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
Overview
Slide 3/53
RNA-Seq Technology
Sample 1
Sample 2
1. mRNA
Isolation
2. Illumina
Sequencing
Sample 1
Sample 2
3. Align Sequences
against Genome
Gene A
Gene B
Gene A
Gene B
Overview
Slide 4/53
3. Normalization
Main adjustment for sequencing depth and compositional bias.
5. Specialty applications
Splice variant discovery (semi-quantitative), gene discovery, antisense
expressions, etc.
6. Cluster Analysis
Identication of genes with similar expression proles across many
samples.
Overview
Slide 5/53
What features?
Genes, transcript models, exons
Alternative splicing
Often restricted to splice junction analysis
Objective: discovery vs. quantication
Overview
Slide 6/53
Overview
Slide 7/53
Overview
Slide 8/53
12345678901234 5678901234567890123456789012345
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
TTAGATAAAGGATA*CTG
aaaAGATAA*GGATA
gcctaAGCTAA
ATAGCT..............TCAGC
ttagctTAGGC
CAGCGGCAT
SAM Format
r001 163 ref 7 30 8M2I4M1D3M
r002
0 ref 9 30 3S6M1P1I4M
r003
0 ref 9 30 5S6M
r004
0 ref 16 30 6M14N5M
r003 2064 ref 29 17 6H5M
r001
83 ref 37 30 9M
= 37 39
* 0
0
* 0
0
* 0
0
* 0
0
= 7 -39
TTAGATAAAGGATACTG
AAAAGATAAGGATA
GCCTAAGCTAA
ATAGCTTCAGC
TAGGC
CAGCGGCAT
*
*
* SA:Z:ref,29,-,6H5M,17,0;
*
* SA:Z:ref,9,+,5S6M,30,1;
* NM:i:1
Link
Overview
Slide 9/53
Normalization Required
(c)
0.8
(a)
q
q
q
q
q
q q
q
q
q
q
q q
q
q
q
qq
-2
0.2
0.0
Density
(b)
0.4
-6
-4
-2
q
q
qq q
q
q q
qqqq
q
q
q q
q q
q q
q q
q
q qq
q q qq
qq
qq
q q
q
qq
qq qq
q
q qq
q
q
q
q q qqq
q q qqqq
qq
q
qqqqq
q qq
q qqq
qq qqq
q
qqqqqq
q qq q
qqqqqqq
q qqqq q
q
qqqqqqq
qqqqqqq
q q qq
qqqqqqq
qqq qq
qqqqqqq
qqqqqqq
qqqqqqq
qqqq qq
qqqqqqq
q
qqqqqqq
qqqqqq
qqqqqqq
qqqqqqq
q qq q
qqqqqqq
qqqqqqq
qqqqqqq
qqq qqq
qqqqqqq
qqqqqqq
qqqqqqqq
qqqqq
-4
qqqqqqq
q
qqqqqq
qqqqqqqq
qq qq
qq qqqq
qqqqqqq
qqqqqqqq
qqqqqqq
qqq
qq qqq
qq qqqq
qq qq
qq qq
qq qq
qqqqqqqq
qqqqqq
q
qq qq
qqqqq qq
qqqq
qqqqqqq
qqqqqqq
qqqqqqqq
qqq q q
qqq qqqq
qqqqqqq
qq
q qqqqqq
qqqqqqq
qqqqqqq
qq
qqqqqqq
qqqqqqq
qq qqqq
-5
-6
0.4
0.0
Density
qqqqqqq
qqqqqq
q qqq
qqqqq q
qqqqqqqq
qqqqqq
qqqqqqq
qqqqqqq
qqqqqqq
qqqq qq
qqqqqqq
qqq
qqqqqqq
qqqqqq
q
q q qq q
qqqqqqq
qqqqqqq
q
q qqqqq
q qqq q
qqqqqqq
q
qqqqqqq
q qqqqq
q qq q
q
qqqq qq
qqq q
q qqqqq
qq q q
q qqq qq
q qqq qq
q
qqqqqq
qq q
qqqq qq
qqq q q
qq q q
qq q
q qq qqq
q
q q
q q qq
qqqqqq
q qqq q
qq q
qqqqqqq
q qqqqq
qq qqqq
qq q
q
qqqqqqqq
q
qq q
q qqqqqq
qqqqqq
qqq qqq
q q
qq qq
qq q
q q q qqq
qqqq qq
qq q q q
qqq q
q
qq
q qqqqq
qqq
q
q
q
q qq
qqqqqq
q qq qq
q qqq
q
q q
q q q
q q qqq
q
q
q q qq
q
qq
q
qq q qq
q
q qq qq
q q q
qq qq q
q qq
q q
q qq q
qq qq
qq
q q q
q
q q qq
q
q
q q qqq
q
q
q
q qq q
q
q q qq
q
q
q
qq q qq
q q
q
q q
q
q
q
q
q
q
q
q
q
qq
q
q
q
q q qq
q
q
q
q
q
q
q
q
q
q
q
q
q
q q
q qq
q
q
q
q q
q qq q q q q
q
q
q
q
q q q q qq q
q
q q q qq q
q q
qq
q
q
q
q
q
q
q
q q q
q q q
q
q
q q q
q
q
q qq q q q q
q
q
q
q
qq
q
q
q
q
q
q
q
qq
q
q
q
q
q q
q q q
q
q
q
q q
q
q
q
q
q q q qq q
q
q
q
q q
q
q q q
q
qq
q q
q
q
q
q
q
q
q
q qq
q
q
q
q q
q
q
qq
q qq q
q q
q
q qq qq q
q
q
q
q
q
q
qq
q
q
q
q
q
q
q
q
qq
q
q
q
q
qq q
q
q
q
q q
q
q
q
q q q q qq
q
q
q q q
q
q
q
q
q
q
q
q
q
q
q
q qq
q
qq q
q
q
q
q
q
q
q q
q
q
q
q
q q
q
q
qq q
q
q
q
q
q
q
q
q
q
q
q
qq q q q
q
q
q q q
q
q
q
q
q q
q
q
q qq q
qq q q q
q
q
q
q
q
q
q
q
q q q q qq q q q q q q
qq
q
q q q qq
q qq
q
q
q
qq q q q q q q q q
q
qq q
q
q
q
q
q
q q q q
q
q
q
q
q
qq
q q
q
q
q q
q q q q q qq qq q qq q qq q q qqq qq qq
q q q
q qq
q
q qq
q
q
q
q qq
q
q
q qq q q
q
q
q
q q q q qqq q q q
q
q
q
q
q qq q q q q
q
q q q q qqq q qqqqq q qq q q q q
q
qq
q
q q q qq q
q
q q
q q
qq q
q
q
q q qq q q qq q q qq
q q q q qq qqq qq q
q q
q
q q
q q q qq q q qq qqq q q q q q q q q qqqq q q q q
q q qq
q
q
q qq
q qq
q
q
q
q q q q q q q q q qqqqq qq qq q qqqq q
qq q
q
qq
q
q
q qq q q
q q q q q q q q qqq q q qq qqq qq q qq q q q q q
q
q
q
q
q
q qq
q
q
q
q
q
q q q q q qqqq qqq q qqqqqqqqq q q q q q q
q q qqq q q q
q q q q q qq q q
q
q
q q
qq q qq q
q
q
q
qq
q
q qqq q q q q q q qq q qq q qqq q q q
q q q q q qqqqqq q q q qqq qqq qqq q q q q qq qq qqq
q
q
q
q q q qq q q q q qq q q q q
q
qq q
q q qq q qq q qqq qq q qqqq qq q qqqq q qq
q q q
q
q q q q qq qqqqqqqqqqqqq q qqq qqqqqqqq qqq q q q q
q
qq qq qq q q q q q qq q qq q q q qqq
q q qq q q
q qqqqq qqqqqqqqqq qqqq qqq q q qqq q
q
q
qq q q q q
q q q qq qqqqqqqqq qqqqq qqqqqqqqq qqqqqq qq qq q
q
q qqqqqqqqqq q qqqq qqqqq q q qq q q qq q
qq qqq q q q qq q q q
q
q
qq qq
q
q q
q
q
qq q
q qq qqqqq qq q qqq qqq q qq qqq q
q qq qqqqqqqqq qq qqqqq qq qqqqqqqqq qqq q qqq q q q
qq q q
q
q qq q
q qqqqqq q qqq q qq q qqq q
q
q
q qq q
qq qq qqq qq q q qq q q q q q q qq
q q q qq qqqq qqqq qqq qq qqqqqqqqqqqqqqqq qq qqqqqqq
q qq qq qqqqqq qqqqqqqqqqq qqqq q q q qq
qq
q q q q q qqq
qqqq qqqqqqqqqq qqq qqq q qq qqq q q q q q
q q q qq qq q
q
q
q
q
q
qq qqqq qqqq qqqqq qqqqqq qqq
q q qqqqq qqqqqqqq qqqqqqqqqqqqqq qqq qqq qq qqq qq q
q
q qqq q
q
q q q q qq q qq qqqqqq q
qq
q
q qqqqqqq qq qqqqqqqqq qqqqqqqqqqq qqq q q q
q q q q qqq q q qq q
qq q
q q q
q
q qq
q qqqqqq qqqqqq qqqqqqqqqq qq q
q q q q qq qqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq qqq q qqq qqq
q qqqqqqqq qqqqqqqqqq qq
q
qqqqqqqqqq qq q qqqq q qq q q q
q q qqqqqqqqq qq q qq qq qqq qq q q q q q
q qqqqqqqqqqqqq qq qqqqqqqqqqqqqqqq q qqq q q
q q q q q q qq q q q q
q
q
q q qqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqq qqq
q
q qq
q qq qq qq q qqqqq q q q
q
q
qq
q
q q qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqq q q q q
q
qqqqq q qqqqq qq qqqqqqq
qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqq q qq q q
qqqq qqqqq qq q qqq qq q q q
q q qq q
q q qqqqqqqqqqqqqqqqqqqqqq q qqqq q
q
q
qqqqq qqqqqqqqqqqqqqqqq qq q
q
qq qq qq qqq qq q q qqq q q
q q qqqqq qqqqqqqqqqqqqqqqqqqq q q q q q
qqq q q qqqq qqqqqqqq q
q
q
qqqqqqqqqqqqqqqqqqqqqqqq qqqqqq qqq qq q q
q
q q q q q qqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqq qq q q q q q q
qq q
qq q
q q qq q qq q q q q
qq q
qq q q q qqqq qqqq qq q
qqqqq qqqqqqqqqqqqqqqqqqqqqqqq
q
qq q q q
q q
qq qqq q qqq qqqq qqq q q
qqq qq qqqq qqqqq qqqq q q q
qq qq qqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q
q
q q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q
q qqqq q qqqqqqqqqqqqqqqq qq q q
q
q
q qqqqqqqqqqqqqqqqqqqqqqqqqq qq q q
qqqq qqqqq qqq q
q
qqqq qqqqq qqq q
q
q q qqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q q q
q qqq qq q
q q qqqqqqqqqqqqqqq qq qq
q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q qq
q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q q
q
q q q q qqq
qqq qqqqqq qqqqqqqqqqqq q q q
q
q qq qq qqqq q
qq
q
qq q qqqqqqqqqqqqq qq q q q
q
qqqqq qqqqqqqqqqqqqqqqqqqqqqqqqq
q qqq q qq qqq q q q
q q q q qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqq qq q q q q q
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q
q
qqq qqqq qqqqqqqqqqq qq q q q q
q
q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q q
q q q qqqq
qq q q
q q q qq qqqqqqqqqq qqqqq q q
qq q
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqq q q
q q qqq qqqqqq q q q
qqqq qqq qqqqqqqqqq qqqqqqqq q q q
q
qq
q q qq q q
q
qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q q q
q qq qq qqqqqqqqqqqqqqqqqq qqqqq q q
q q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqq qq q
q
q q qq q qq q
qq qqqqq qqqqqqqq qqqqqqqq q
qq qqq qqqq q q q q q
q q
q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq
q q
q
q
q qq qqqqqqqqqqqqqqqqqqq qq qq
qq q q qqq q
q q q qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q q q
qq qqqqqqqqqq q q q q
q qq q qq
q qqqq qqqqqqqqqqqq qq q
q qqq qqqqqqqqqqqqqqq qqqqqqq q q q
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq q
q qqq qq qqqqq q q
q q qqqq qq
q qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q qq q
qqqqqqqqqqqqqq qqqq q q
qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q
qqqqqqqqq qqq q qqq q
qq qqqqqqqqqqqqqqqqqqqqq qqq q q q q
q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q
q qqqqqqqqqqqqqqqqqqqqq qq q q
qqqq qqqqqqqqq q q q q q q q q q q q q
q
qqq qqq q qqqqqq
q qqq q qqq
qq qqqq qq qq q q
q q q q q qqqqqqqqqq qqqqqqqqqqqqqq qq qqqqqq qq q
qqq qqqqqq qqqqq qq qqqqq qq
q
q
qq
q q
qqqqqqqqqq qqq q
q
qq
q q
qqq qqqqq qqqqq qq qqqq qq qq q q
q
q qqq qq q q
q q
q qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq q
qq q q qqqqqqq q qq q
qqqqqqqqqqqqqqqqqqqqqqq qqqq qqqqq
q qqqqqqqqqqqqqqqq qqq
q
qq q qqq q q q q
q q
q
q q qqq
q q q q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q
q
qqqq qqqqqq q qqq q
q
q qqqqqqqqqqqqqqqqqq qqqqq q q
q q qqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqq q q
q qq qqqqqqqqqqqqqqq qqq qq
qqqq qqqqqq qqqqq q q
qqqqq qqqqqqqqqqqqqqqqqqq qqqq q q qq q q q q
q
q qqqqq qq q qq q
q qq qq q q q qq q
q
q
q
qq q
q q q q qq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q q q q q qq
q
q qqqqqq qqqq qqqqqqqqqqqqqqqqqq qq qq q qq q q qq
q qqq q qq qqq qq qqq qqq q q
q q qq qqqqqqqqqqqqqqqqqq qq q q q q
q
q qq q qqq qqqqqqqqqqqqqqqqqqqqqq
q q
q q q qq q qq
q q
qq qq qqq q qq qqq q
q
q q
q qq q q
q q q qqqqqqqqqqqq qqqqqqqqqqqqqqq q qqq qqq
qq q qqqqqqqqqqqqqqqq qqqq qq q
q
q
q qqq q qqqqqqqqqq qqq q q
q
qqq q q qq qq q q qq
q q q q q qqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqq qqq
q
q
q
qqq qqq qqqqqq qq q qq q q q q
qq
q
q qqqq qqqqqqqqqqqqqqqqqq q q q q q q
q qqqqqqq q q
q q q q q qqqqqqqqqqqqqqqqqqqqqq qqq q qq q q q q q
q qqqqqqqqqqq q q q q q q
q
q q q q qq
q
q qq
q
q qq q
q q qqq q qqqq qqqqqqqqqqqqqqqqqq q qqq q
q q q q q qqqqqq qqqqqqqqq qqqqqqqqqqqqq qqqq q q q
q qqq qqqqqqqqqqqqqqq qq q q q q
q
qq q q q q q
q qq qq qqqqqqqq qq qqqqqqqq qq
qq
q
q qq qq qq qq qq q q
q
q q q q q q qqqqqqqqqqqqqqqqq qqqqqqq qq q q q
q
q
q qq q
q q q qq q qq q q q qq q q
q q q q q qq qqqqqq qqqqqqqqqqqqq qqq
q
q q q q qq qqqqq qqqqqqqqqqqqqqqqqqqq q qq
q
q q
q
qq qqq qqq q qqq q q q
q qq
q
q qq
qqq q qq qqq qqqqqq q q
q q q q q qqqqqqqqqqqqqqqqqqqqqqq qqqq q q
q
q
q qqq
q
q
q
q
q q q q qqqqqqqqqq qqqqq q qqq q qqq q q
q
qqqq q qqq
q
q
qqqqqq qq qqq q q qq
q
q qq q q
q q q q q qqqqqqqq qqq qqqqqqqqqq qq qq
qq
q
q qq
q
q q q q qqqqqqqqq qqq qqqq qq qqqqqq qq qq q
q qq q q q q
q
q
q
q q q q q q qqq qq
q
q
q
q
q q q q q qqqqqq qqqqqqqqqq q q q q qq
q
q q q q q qqqqqqqqqq qqqqqqq qq qq qqq
q q q qq qq q q
q
q
q q
q
q qq
q
q q q q qqq q qqqq q qq qqqqqq q qq q q
q q
q q q q qq qqq qqqqqqqq q q q
q
q
qq
q q
q
q q q q qq qq qqq q q
q
q q q q qq qqq qq q qq q qq q qq q
q
q
q
q
q q q q qq q q q qq q q q qq q q
q
q q q q qq qq qqqq q qq q
q
qq q qq qq q qq q
q
q q q q qq q
q q
q q q q q q q q q q qq q
q
q
q
q
q q
q
q q q q qq q q q q q qq qq q qq
q q
q
q q q q q qq q qq q qq q q
q
q
q
q
q
q
qq
q
q q q q qq qq qq qq q q
q
q
q q q q qqq q q qq q q q q q
qq
q
q
qq q q
q q q q qqq qqq
q
q q
q
q
q
q
q q q q qqq qq q q q q q q q q q q q q
q
q
q q q q qq q q qq q q q
q
q q q
q q
q
q q q q q qq
q
q
qq
q qq
q
q q q q
q
q
q q q
q
q q q q q qq
qq q
qq q q
q
qq
q q q
q
q
q
q q q qq q
q
q q
q
q
q q q qq q
q
q q
q
q
q q
q
q
q
q
q
q q q q q qq q q q
q
q
q
q
q
q q
q
q q
q
q
q
q
q
q
q q qq qqq
q
q
q
q q
q
q
q q qq
q
q q
q
q
q
q q q
q
q
qq q
q
q
qq
q
q q q
q
q
q q
q
q
q
q
q q
q
q
qq q
q
q
q
q
qq
q
q
q
q q q
q
q
q
q
q
q
q
q
q
qq
q q
q q q q qq
q
q
q q
q q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q q q q qq
q
q
q
q
q
q q q q q q q
q
q
q
q q q
q
q
q
q q
q
q
qq
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q q q
q
q
q
q
q
q
q q
q
q
q q
q
q
q
q
q
q q
q
q
q
q
q
q
-20
-15
A = log2( Liver
q
q q
qq
q
q
Housekeeping genes
Unique to a sample
-10
NL . Kidney
NK )
Log ratio distributions (a and b) and MA plot (c) for two tissue samples (from
Robinson and Oshlack, 2010).
Analysis of RNA-Seq Data with R/Bioconductor
Overview
Slide 10/53
Overview
Slide 11/53
Overview
Slide 12/53
Overview
Slide 13/53
Overview
Slide 14/53
Overview
Slide 15/53
GenomicRanges
Rsamtools
Link
Link
: BAM support
Link
Link
DEXSeq
QuasR
Link
Link
Link
: RNA-Seq workows
Overview
Slide 16/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
Slide 17/53
Link
Direct your R session into the resulting Rrnaseq directory. It contains four
slimmed down FASTQ les (SRA023501 Link ) from A. thaliana, as well as the
corresponding reference genome sequence (FASTA) and annotation (GFF) le.
Start the analysis by opening in your R session the Rrnaseq.R script
which contains the code shown in this slide show in pure text format.
Link
The FASTQ les are organized in the provided targets.txt le. This is the only le
in this analysis workow that needs to be generated manually, e.g. in a spreadsheet
program. To import targets.txt, we run the following commands from R:
> download.file("http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Wor
> targets <- read.delim("./data/targets.txt")
> targets
1
2
3
4
RNA-Seq Analysis
Slide 18/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
Slide 19/53
RNA-Seq Analysis
Slide 20/53
Rsubread is an R/Bioc package that implements an extremely fast aligner for RNA-Seq data. It is currently only available for
OS X and Linux, but not for Windows.
RNA-Seq Analysis
Slide 21/53
Link .
library(Rsamtools)
dir.create("results") # Note: all output data will be written to directory 'results'
input <- input <- paste("./data/", targets$FileName, sep="")
output <- paste("./results/", targets$FileName, sep="")
reference <- "./data/tair10chr.fasta"
for(i in seq(along=input)) {
unlink(paste(output[i], ".tophat", sep=""), force=TRUE, recursive=TRUE)
tophat_command <- paste("tophat -p 4 -g 1 --segment-length 15 -i 30 -I 3000 -o ", output[i], ".tophat
# -G: supply GFF with transcript model info (preferred!)
# -g: ignore all alginments with >g matches
# -p: number of threads to use for alignment step
# -i/-I: min/max intron lengths
# --segment-length: length of split reads (25 is default)
system(tophat_command)
sortBam(file=paste(output[i], ".tophat/accepted_hits.bam", sep=""), destination=paste(output[i], ".top
indexBam(paste(output[i], ".tophat/accepted_hits.bam", sep=""))
}
RNA-Seq Analysis
Slide 22/53
Alignment Summary
The following enumerates the number of reads in each FASTQ le and how many of them
aligned to the reference. Note: the percentage of aligned reads is 100% in this particular
example because only alignable reads were selected when generating the sample FASTQ les
for this exercise. For QuasR this step can be omitted because the qAlign function generats
this information automatically.
> library(ShortRead); library(Rsamtools)
> Nreads <- countLines(dirPath="./data", pattern=".fastq$")/4
> bfl <- BamFileList(paste0("./results/", targets$FileName, ".bam"), yieldSize=50000
> Nalign <- countBam(bfl)
> (read_statsDF <- data.frame(FileName=names(Nreads), Nreads=Nreads, Nalign=Nalign$r
+
Perc_Aligned=Nalign$records/Nreads*100))
SRR064154.fastq
SRR064155.fastq
SRR064166.fastq
SRR064167.fastq
RNA-Seq Analysis
Slide 23/53
Quality Reports
The following shows how to create read quality reports with QuasRs qQCReport function or
with the custom seeFastq function.
> qQCReport(proj, pdfFilename="results/qc_report.pdf")
> source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/fastqQuali
> myfiles <- paste0("data/", targets$FileName); names(myfiles) <- targets$SampleName
> fqlist <- seeFastq(fastq=myfiles, batchsize=50000, klength=8)
> pdf("results/fastqReport.pdf", height=18, width=4*length(myfiles)); seeFastqPlot(f
RNA-Seq Analysis
Slide 24/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
Slide 25/53
RNA-Seq Analysis
Slide 26/53
Storing annotation ranges in TranscriptDb databases makes many operations more robust and
convenient.
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromGFF(file="data/TAIR10_GFF3_trunc.gff",
+
format="gff3",
+
dataSource="TAIR",
+
species="Arabidopsis thaliana")
> saveDb(txdb, file="./data/TAIR10.sqlite")
> txdb <- loadDb("./data/TAIR10.sqlite")
> eByg <- exonsBy(txdb, by="gene")
RNA-Seq Analysis
Slide 27/53
RNA-Seq Analysis
Slide 28/53
RNA-Seq Analysis
Slide 29/53
RNA-Seq Analysis
Slide 30/53
RPKM: reads per kilobase of exon model per million mapped reads
> returnRPKM <- function(counts, gffsub) {
+
geneLengthsInKB <- sum(width(reduce(gffsub)))/1000 # Length of exon union
+
millionsMapped <- sum(counts)/1e+06 # Factor for converting to million of
+
rpm <- counts/millionsMapped # RPK: reads per kilobase of exon model.
+
rpkm <- rpm/geneLengthsInKB # RPKM: reads per kilobase of exon model per m
+
return(rpkm)
+ }
> countDFrpkm <- apply(countDF, 2, function(x) returnRPKM(counts=x, gffsub=eByg))
> countDFrpkm[1:4,]
AT1G01010
AT1G01020
AT1G01030
AT1G01040
RNA-Seq Analysis
Slide 31/53
library(ape)
d <- cor(countDFrpkm, method="spearman")
hc <- hclust(dist(1-d))
plot.phylo(as.phylo(hc), type="p", edge.col=4, edge.width=3, show.node.label=TRUE, no.margin=TRUE)
SRR064155.fastq
SRR064154.fastq
SRR064167.fastq
SRR064166.fastq
RNA-Seq Analysis
Slide 32/53
RNA-Seq Analysis
Slide 33/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
DEG Analysis
Slide 34/53
SRR064154.fastq_SRR064155.fastq SRR064166.fastq_SRR064167.fastq
185.90466
1177.8237
504.78356
1248.7285
12.26145
210.7279
542.11595
1559.4639
RNA-Seq Analysis
DEG Analysis
Slide 35/53
library(DESeq)
countDF <- read.table("./results/countDF")
conds <- targets$Factor
cds <- newCountDataSet(countDF, conds) # Creates object of class CountDataSet derived from eSet class
counts(cds)[1:4, ] # CountDataSet has similar accessor methods as eSet class.
AT1G01010
AT1G01020
AT1G01030
AT1G01040
>
>
>
>
>
>
>
cds <- estimateSizeFactors(cds) # Estimates library size factors from count data. Alternatively, one can provi
cds <- estimateDispersions(cds) # Estimates the variance within replicates
res <- nbinomTest(cds, "AP3", "TRL") # Calls DEGs with nbinomTest
res <- na.omit(res)
res2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
res2foldpadj <- res2fold[res2fold$padj <= 0.05, ]
res2foldpadj[1:4,1:8]
5
6
7
14
RNA-Seq Analysis
DEG Analysis
Slide 36/53
DEG analysis with classical edgeR approach. Note: raw read count data are expected by all methods!
>
>
>
>
>
>
>
library(edgeR)
countDF <- read.table("./results/countDF")
y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("AP3", "TRL")) # Computes exact test for the negative binomial distribution.
topTags(et, n=4)
FDR
4.060290e-126
1.907910e-113
2.543314e-113
6.781044e-106
RNA-Seq Analysis
DEG Analysis
Slide 37/53
library(edgeR)
countDF <- read.table("./results/countDF")
y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
## Filtering and normalization
keep <- rowSums(cpm(y)>1) >= 2; y <- y[keep, ]
y <- calcNormFactors(y)
design <- model.matrix(~0+group, data=y$samples); colnames(design) <- levels(y$samples$group) # Design matrix
## Estimate dispersion
y <- estimateGLMCommonDisp(y, design, verbose=TRUE) # Estimates common dispersions
RNA-Seq Analysis
DEG Analysis
Slide 38/53
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
setlist <- list(edgeRexact=rownames(edge2foldpadj), edgeRglm=rownames(edgeglm2foldpadj), DESeq=as.character(re
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts, mymain="DEG Comparison")
DEG Comparison
edgeRglm
edgeRexact
DESeq
RPKM
0
1
0
5
0
4
2
24
7
0
3
33
RNA-Seq Analysis
DEG Analysis
Slide 39/53
library(lattice); library(gplots)
y <- countDFrpkm[rownames(edgeglm2foldpadj)[1:20],]
colnames(y) <- targets$Factor
y <- t(scale(t(as.matrix(y))))
y <- y[order(y[,1]),]
levelplot(t(y), height=0.2, col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="Expression Values
Expression Values (DEG Filter: FDR 1%, FC > 2)
1.0
ATCG00270
ATCG00120
ATCG00020
ATMG00160
ATCG00130
ATCG00140
ATCG00280
Gene ID
ATCG00340
AT2G01021
ATCG00490
ATCG00480
ATMG00030
ATCG00350
ATCG00170
AT2G01008
ATMG00090
AT1G01070
AT1G01050
AT4G00050
AT3G01120
AP3 AP3 TRL TRL
RNA-Seq Analysis
DEG Analysis
Slide 40/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
GO Analysis
Slide 41/53
The following performs GO term enrichment analysis of one of the identied DEG sets using the GOstats Link package.
Another package, among many others, to consider here is the goseq Link that considers gene length bias in RNA-Seq data.
>
>
>
>
+
+
>
>
1
2
3
4
GOMFID
GO:0008324
GO:0015075
GO:0015077
GO:0015078
transmembrane
transmembrane
transmembrane
transmembrane
transporter
transporter
transporter
transporter
RNA-Seq Analysis
GO Analysis
Slide 42/53
a
a
a
a
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
Slide 43/53
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064154.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064155.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064166.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064167.fastq.ba
To view area of interest, enter its coordinates Chr1:49,457-51,457 in position menu on top.
RNA-Seq Analysis
Slide 44/53
library(ggbio)
AP3 <- readGAlignmentsFromBam("./results/SRR064154.fastq.bam", use.names=TRUE, param=ScanBamParam(which=GRange
TRL <- readGAlignmentsFromBam("./results/SRR064166.fastq.bam", use.names=TRUE, param=ScanBamParam(which=GRange
p1 <- autoplot(AP3, geom = "rect", aes(color = strand, fill = strand))
p2 <- autoplot(TRL, geom = "rect", aes(color = strand, fill = strand))
p3 <- autoplot(txdb, which=GRanges("Chr1", IRanges(49457, 51457)), names.expr = "gene_id")
tracks(AP3=p1, TRL=p2, Transcripts=p3, heights = c(0.3, 0.3, 0.4)) + ylab("")
strand
AP3
>
>
>
>
>
>
>
TRL
strand
+
Transcripts
AT1G01100
AT1G01100
AT1G01100
AT1G01100
50 kb
50.5 kb
RNA-Seq Analysis
51 kb
Slide 45/53
Task 1 Store the identiers of the upregulated genes from each of the four DEG
methods in separate components of a list. Note: the denition of up and down
is arbitrary and one needs to check how it is dened by the dierent DEG
methods!
Task 2 Do the same for the downregulated genes.
Task 3 Compare the overlaps among the dierent up/down sets in a single 4-way venn
diagram.
RNA-Seq Analysis
Slide 46/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
RNA-Seq Analysis
Slide 47/53
SRR064154.fastq SRR064155.fastq
2
4
2
1
3
3
6
1
RNA-Seq Analysis
Slide 48/53
library(DEXSeq)
samples <- as.character(targets$Factor); names(samples) <- targets$FileName
countDFdex[is.na(countDFdex)] <- 0
## Construct ExonCountSet from scratch
exset <- newExonCountSet2(countDF=countDFdex) # fData(exset)[1:4,]
## Performs normalization
exset <- estimateSizeFactors(exset)
## Evaluate variance of the data by estimating dispersion using Cox-Reid (CR) likelihood estimation
exset <- estimateDispersions(exset)
....
Done
>
>
>
>
>
>
>
>
>
>
FALSE
20
TRUE
1
RNA-Seq Analysis
Slide 49/53
DEXSeq Plots
Sample plot showing tted expression of exons
>
>
>
>
Parent=AT1G01100
TRL
Fitted expression
1000
100
10
1
E001
50090
50213
E002
50336
E003
50459
50582
RNA-Seq Analysis
E004
50705
E005
50828
50951
E006
51074
51197
Slide 50/53
Session Information
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats
graphics
other attached packages:
[1] DEXSeq_1.8.0
[10] GO.db_2.10.1
[19] ape_3.0-11
[28] QuasR_1.2.2
utils
datasets
ggbio_1.10.7
RSQLite_0.11.4
GenomicFeatures_1.14.0
Rbowtie_1.2.0
ggplot2_0.9.3.1
DBI_0.2-7
AnnotationDbi_1.24.0
GenomicRanges_1.14.2
RNA-Seq Analysis
grDevices methods
BiocInstaller_1.12.0
XML_3.98-1.1
digest_0.6.3
labeling_0.2
splines_3.0.2
base
xtable_1.7-1
Matrix_1.1-0
Biobase_2.22.0
XVector_0.2.0
ath112150
gplots_2.
rtracklay
IRanges_1
GSEABase_1.24.0
annotate_1.40.0
gdata_2.13.2
latticeExtra_0.6-26
statmod_1.4.18
Hmisc
bioma
genef
munse
stats
Slide 51/53
Outline
Overview
RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Dierential Exon Usage
References
References
Slide 52/53
References I
Anders, S., Huber, W., 2010. Dierential expression analysis for sequence count data.
Genome Biol 11 (10).
URL http://www.hubmed.org/display.cgi?uids=20979621
Anders, S., Reyes, A., Huber, W., Oct 2012. Detecting dierential usage of exons
from RNA-seq data. Genome Res 22 (10), 20082017.
URL http://www.hubmed.org/display.cgi?uids=22722343
Robinson, M. D., McCarthy, D. J., Smyth, G. K., Jan 2010. edgeR: a Bioconductor
package for dierential expression analysis of digital gene expression data.
Bioinformatics 26 (1), 139140.
URL http://www.hubmed.org/display.cgi?uids=19910308
Robinson, M. D., Oshlack, A., Mar 2010. A scaling normalization method for
dierential expression analysis of RNA-seq data. Genome Biol 11 (3).
URL http://www.hubmed.org/display.cgi?uids=20196867
Trapnell, C., Hendrickson, D. G., Sauvageau, M., Go, L., Rinn, J. L., Pachter, L.,
Jan 2013. Dierential analysis of gene regulation at transcript resolution with
RNA-seq. Nat Biotechnol 31 (1), 4653.
URL http://www.hubmed.org/display.cgi?uids=23222703
References
Slide 53/53