Вы находитесь на странице: 1из 5

RNA-seq Analysis Exercise

Galaxy provides the tools necessary to creating and executing a complete RNA-seq analysis
pipeline. This exercise introduces these tools and guides you through a simple pipeline using
some example datasets. Familiarity with Galaxy and the general concepts of RNA-seq analysis
are useful for understanding this exercise. This exercise should take 1-2 hours. You can check
your work by looking at the history and visualization at the bottom of this page, which contain
the datasets for the completed exercise.
Input Datasets
Below are small samples of datasets from the Illumina BodyMap 2.0 project; specifically, the
datasets are paired-end 50bp reads from adrenal and brain tissues. The sampled reads map
mostly to a 500Kb region of chromosome 19, positions 3-3.5 million
(chr19:3000000:3500000).
RNA-seq data from adrenal tissue:

Galaxy Dataset | adrenal_1.fastq


Forward RNA-seq reads from BodyMap 2.0 project, adrenal tissue, mapping to
chr19:3000000:3500000

and

Galaxy Dataset | adrenal_2.fastq


Reverse RNA-seq reads from BodyMap 2.0 project, adrenal tissue, mapping to
chr19:3000000:3500000

RNA-seq data from brain tissue:

Galaxy Dataset | brain_1.fastq


Forward RNA-seq reads from BodyMap 2.0 project, brain tissue, mapping to
chr19:3000000:3500000

and

Galaxy Dataset | brain_2.fastq


Reverse RNA-seq reads from BodyMap 2.0 project, brain tissue, mapping to
chr19:3000000:3500000

You'll also need one additional dataset: a gene annotation in GTF format. Here is iGenomes
gene annotation for the UCSC hg19 build:

Galaxy Dataset | iGenomes UCSC hg19, chr19 gene annotation


iGenomes gene annotation for chr19 of hg19 build

Here's a history containing all five datasets; click on the green plus to import (copy) it into
your workspace. Use this history to complete the exercise.

Galaxy History | RNA-seq exercise datasets


RNA-seq exercise datasets
33.39 MB

Datasets needed for Galaxy RNA-seq exercise



Dataset Annotation
iGenomes gene
annotation for chr19 of
1 iGenomes UCSC hg19, chr19 gene annotation hg19 build

Forward RNA-seq reads


2 adrenal_1.fastq from BodyMap 2.0
project, adrenal tissue,
mapping to
chr19:3000000:3500000

Reverse
3 adrenal_2.fastq RNA-seq reads from
BodyMap 2.0 project,
adrenal tissue, mapping
to chr19:
3000000:3500000

Forward RNA-seq reads


4 brain_1.fastq from BodyMap 2.0
project, brain tissue,
mapping to
chr19:3000000:3500000
Reverse RNA-seq reads
from BodyMap 2.0
5 brain_2.fastq project, brain tissue,
mapping to chr19:
3000000:3500000
Understanding and QCing the reads
You should understand the reads a bit before analyzing them. Quality control (QC) may be
needed as well.

Step 1: Run a quality control check on your data using the [NGS: QC and manipulation
>] FastQC tool. Often, it is useful to trim reads to remove base positions that have a low
median (or bottom quartile) score. For this exercise, assume a median quality score of below
20 to be unusable. Given this criterion, is trimming needed for the datasets? If so, which base
pairs should be trimmed? The datasets do not pass FastQC's quality control checks for per-
sequence GC content and sequence duplication levels. This is expected because FastQC is
designed for DNA sequencing data, not RNA sequencing. What unique characteristics of RNA-
seq data will cause these two QC checks to fail?

Step 2: If necessary, trim the reads based on your answers to step A using [NGS: QC and
manipulation >] FASTQ Quality Trimmer

Map processed reads


The next step is mapping the processed reads to the genome. The major challenge when
mapping RNA-seq reads is that the reads, because they come from RNA, often cross splice
junction boundaries; splice junctions are not present in a genome's sequence, and hence
typical NGS mappers such as Bowtie and BWA are not ideal without modifying the genome
sequence. Instead, it is better to use a mapper such as Tophat that is designed to map RNA-
seq reads.

Step 1: Use the [NGS: RNA Analysis >] Tophat tool to map RNA-seq reads to the hg19 build.
Because the reads are paired, you'll need to set mean inner distance between pairs; this is the
average distance in basepairs between reads, not the total insert/fragment size. Use a mean
inner distance of 110 for BodyMap data.
Look at the documentation to understand the datasets that Tophat produces. What percentage
of left reads were aligned (mapped)? What percentage of read pairs aligned uniquely (as
opposed to those that have multiple alignments)? How many splice junctions did Tophat find?
Are most splice junctions supported by (i.e. spanned by) more or less than 10 reads? (Hint: the
score column in the splice junctions dataset is useful for answering this question.)

Step 2: To view Tophat's output, create a simple Galaxy visualization by selecting Visualization
> New Track Browser from the main Galaxy menu at the top. Create the visualization using
the hg19 build and add datasets to your visualization by clicking on the Add Datasets to
Visualization button. Add (a) Tophat's accepted hits BAM datasets, which are the mapped
reads; (b) Tophat's splice junctions datasets, which denote the junctions found by mapping the
reads; and (c) the iGenomes gene annotation.
Navigate to chr19 using the select box at the top of visualization and look at the data. Zoom
in to view the data at greater detail. You can zoom in by (a) double-clicking anywhere on the
visualization to zoom in on that area or (b) dragging on the base number area at the top of the
visualization to create a zoom area; or (c) clicking on the location bar and entering the region
with data, chr19:3000000:3500000
You should be able to see: (a) the reads mapped by Tophat, including reads mapped across
introns (there are a lot of reads!); (b) the splice junctions produced by Tophat; and (c) how
Tophat's reads and junction correspond to the annotated genes. Find a gene locus where there
are mapped reads, and then find an example of a splice junction between 2 known exons, and
find an example where a splice junction might be found but is not.

Assemble and analyze transcripts


After mapping the reads, the next step is to assemble the reads into complete transcripts that
can be analyzed for differential expression and phenomena such as splicing events and
transcriptional start sites.

Step 1: Run [NGS: RNA Analysis >] Cufflinks on each BAM dataset produced by Tophat using
de novo assembly; this will assemble the reads into transcripts. Here is the documentation for
the datasets that Cufflinks produces. Look at the assembled transcripts dataset and find some
transcripts that have multiple exons.

Step 2: Add the Cufflinks' assembled transcripts datasets to the visualization you created
earlier in order to view the transcripts alongside the mapped reads, junctions, and reference
genes. Can you find examples where Cufflinks assembled a complete or almost complete
transcript?

Step 3a: Run [NGS: RNA Analysis >] Cuffmerge on two datasets of assembled transcripts from
Step 1 and use the UCSC genes annotation as the reference annotation. Cuffmerge produces a
merged transcripts dataset that includes all transcript in both datasets. The merged set of
transcripts are needed in subsequent steps. You can find information about this dataset by
reading the Cuffmerge documentation.

Step 3b: Run [NGS: RNA Analysis >] Cuffdiff on (a) the merged transcripts produced by
Cuffmerge and (b) Tophat's accepted hits datasets for each dataset to perform differential
expression analyses on the two samples. Cuffdiff produces quite a few output datasets; you'll
want to browse the Cuffdiff documentation to get a sense of what they do.
Look at the transcript FPKM tracking dataset at the top of your history. Find an entry for a
novel isoform--class code 'j'--and an entry for an isoform that matches a reference isoform--
class code '='. What is the nearest gene and transcription start site for each entry and what is
each transcript's FPKM value?
Look at transcript differential expression testing dataset (second from the top of your history)
and find two transcripts that are differentially expressed. What are the p and q values for each
differentially expressed transcript?

On your own
How many novel transcripts/isoforms were found in Cuffdiff? (Hint: use [Filter and Sort >]
Filter on the transcript FPKM dataset.) Which genes are these novel transcripts in and are they
all expressed in both tissues?

What genes are differentially expressed? (Hint: use [Filter and Sort >] Filter on the gene
differential expression testing dataset.)

Which genes are expressed at greater than 10FPKM in Adrenal tissue? (Hint: use [Filter and
Sort >] Filter on the gene FPKM dataset.)
[This is a difficult question to answer.] Create a workflow from your history (in history's
options menu, select 'Extract Workflow'), import the full Adrenal and Brain datasets from the
BodyMap data library (go to the Shared Data tab, then Data Libraries, then search for
'bodymap'), and run the workflow on the full datasets. How many differentially expressed
genes do you find?

More Resources
A history with the exercise completed (coming soon!):
And a visualization of mapped/aligned reads, assembled transcripts, and gene annotation
(coming soon!):
Full RNA-seq BodyMap datasets are available in the data library 'Illumina BodyMap 2.0' and
available for use. Use the Shared Data --> Data Libraries link in the header above to find this
library.
Tophat documentation
Cufflinks/compare/merge/diff documentation
Nature Protocols paper describing RNA-seq analysis using Tophat-Cuff* pipeline
iGenomes data

Вам также может понравиться