Reproducing the Bioinformatics Analysis and Data Visualization of a Research Paper
Nishant Patil
Table of Contents
REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS 2
Rationale of Study
Research Concept
C. Application Questions
D. Hypothesis
F. Assumptions
H. Importance of Study
E. References
Rationale
The main purpose of this research is to reproduce the bioinformatics analysis and the data
visualization of the data published by Wang et al. (2018). Data scientists all over the world use R
to analyze their data. R gives access to many ready-to-use packages, offers strong data
visualization capabilities, and is free to license, making it a preferred tool. Although R
offers these great capabilities, it requires an analysis of the requirements and development of
code to program solutions in order to plot complex data (SAS, n.d., para. 3). The scope of this
research also includes using programs like the TopHat algorithm to map gene reads and the
Cuffdiff algorithm to compare gene mapping for bioinformatics analysis (Center for
Computational Biology at Johns Hopkins University, 2016). The output of the TopHat and
Genetics researchers collect large numbers of gene samples and other data. Traditional methods of
data analysis may not suffice because of the volume and complexity of the collected data.
Scientists use commands such as prefetch, fastq-dump, TopHat, bamToBed, and Cuffdiff
to preprocess the genetic data and prepare it for analysis and graphing. Data visualization tools
in R allow scientists to interpret data more easily through graphic representation. One of the main
strengths of R is data manipulation and visualization. R’s visualization components make it easy
For a person who is passionate about computer programming, it is important to see how
that technology can apply to various fields. This research project is a good opportunity to explore
how computer programming applies to the field of genetics. It will also provide insight into how
Research Concept
How well can one reproduce the bioinformatics analysis and graphs from the
In their research, Wang et al. (2018) studied genetic reprogramming of mice during early
trimethylation (H3K9me3) on Long Terminal Repeats, which are long, repetitive sequences of
DNA (Wang et al., 2018). Mammalian fertilized eggs undergo epigenetic modifications
following fertilization, and thus the mouse genome becomes demethylated during early
embryonic development. LTRs become hypomethylated and need to be regulated, and previous
studies have shown that H3K9me3 modifiers have regulated LTRs (Wang et al., 2018).
chromosomes that are tightly bound) in LTR regulation was unknown. The study by Wang et al.
(2018) determined the molecular details behind the reprogramming of H3K9me3-dependent
heterochromatin.
Application Questions
2. Foundational subproblem: How can one visualize bioinformatics analysis data using R
programming?
3. Applied subproblem: How well can one reproduce the bioinformatics analysis of the
4. Applied subproblem: How well can one reproduce the graphs of published data using R?
Hypothesis
1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the
data analysis that Wang et al. (2018) produced with a satisfactory degree of
similarity. The independent variable is the bioinformatics algorithms, and the dependent
2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the
visualization that Wang et al. (2018) have produced with a satisfactory degree of
similarity. The independent variable is the packages used in R programming, and the
Definition of Terms
1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store
3. Cuffdiff: A program that you can use to find significant changes in transcript expression,
splicing, and promoter use (Trapnell Lab at the University of Washington's Department
5. Gene read: Each nucleotide sequence in a gene is called a read (Porter, 2007, para. 3).
6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which
tries to take the good parts of base and lattice graphics and none of the bad parts
(HGCC) is a computer network composed of 1 head node and 22 compute nodes and
serves multiple functions related to genomic projects and data storage (Department of
8. Open Source: Open source software is software with source code that anyone can
9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,
para. 1).
10. R: R is a system for statistical computation and graphics (R Core Team, 2018).
11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample
12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in
conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at
Assumptions
1. The R programming licensing will remain free during the course of the internship.
2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the
researcher will use will remain free during the course of the internship.
3. The mentor will supervise the researcher during the course of the internship.
5. The researcher assumes that the published figures from the paper by Wang et al. (2018)
are accurate.
research.
2. Delimitation: The scope of this research is limited to reproducing the data analysis and
visualization results of the published data from the study by Wang et al. (2018).
3. Delimitation: Only the TopHat and CuffDiff algorithms are in the scope to map and
4. Limitation: The researcher will use RNA-seq data, which limits the embryonic stages that
the researcher can analyze and will cause some reproduced graphs to look different from the originals,
because the RNA-seq results may complement the ChIP-seq data of the original figures.
5. Limitation: The time window available for the internship is from September 10th, 2018
Importance of Study
The study of genetics involves collecting large amounts of complex data. Analyzing that data
can be overwhelming, challenging, and time-consuming. This research aims to
reproduce the figures and analysis of a published study in order to learn how figures are
produced and how to interpret these figures. The process of reproduction is useful knowledge for
helping researchers to understand how to conduct genetic research in the future, how to account
Chapter 2: Literature Review
This research consists of two foundational subproblems: bioinformatics analysis and data
RNA-seq or ChIP-seq data, converting them to an appropriate format, mapping the genetic data to an
appropriate genome, comparing differential gene expression, and analysis of the data ("RNA
Sequencing Analysis with TopHat", n.d., p.4). The second foundational subproblem involves
importation of output from the previous subproblem into R programs for data visualization. This
(Wickham, 2015, para. 1), and programs in R. The output of this stage consists of various graphs
The first foundational subproblem of the research is the bioinformatics analysis. As a part
of bioinformatics analysis, the researcher collects and processes RNA sequencing samples. The
researcher retrieves data samples using prefetch, a bioinformatics command that "allows
command-line downloading of SRA (Sequence Read Archive), dbGaP, and ADSP data"
(Sequence Read Archive, 2018a, para. 2). The researcher may use options such as '-h' to
get help with the documentation of a command. The '-h' option "displays ALL
options, general usage, and version information" (Sequence Read Archive, 2018a, para. 2), which
makes it useful for technical help. The researcher may use the '-f' option to "force object
download" to ensure proper retrieval of raw data (Sequence Read Archive, 2018a, para. 2). The
'-l' option lists "the contents of a kart file" (Sequence Read Archive, 2018a, para. 2).
After downloading the data, the next step is to perform a file conversion from SRA to the
'.fastq.gz' file type. The 'fastq-dump' command allows conversion of "SRA data into fastq
format" (Sequence Read Archive, 2018b, para. 1). The researcher may use various options
available for this command. As with prefetch, the '-h' option is handy for easy reference to the
documentation. The '-M' option is useful to filter data "by sequence length" (Sequence Read
Archive, 2018b, para. 2). The '-O' option specifies the output directory in which to store the
information (Sequence Read Archive, 2018b, para. 2). There are two types of data: paired-end
sequence data and single-read sequence data. FASTQ files generated by the 'fastq-dump'
command differ depending on whether the SRA data is paired-end or single-read.
The next important step is to align the "RNA-Seq reads to a genome in order to identify
exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,
2016, para. 1). The researcher can use the TopHat command for this purpose. TopHat produces
different outputs depending on whether the sequences are paired-end or single-read. One of the
outputs of the TopHat command is an 'accepted_hits.bam' file, which is essential for further
computation (Center for Computational Biology at Johns Hopkins University, 2016, para. 19).
The next step is to use the bamToBed command, which "is a conversion utility that
converts sequence alignments in BAM format to BED" records (Quinlan Lab at University of
Utah, 2017, p. 46). The researcher runs the bamToBed command on the 'accepted_hits.bam' file
generated by the TopHat command. The researcher can use command options
such as "-mate1", "-split", "-ed", and "-tag" to control how the BAM alignments are reported (Quinlan
The last important step is to compare differential gene expression using Cuffdiff "to find
significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the
command compares differences in gene mapping for analysis in an experiment. This step
bioinformatic data output from the first foundational subproblem. The researcher will use the R
programming language for statistical analysis and graphical representation of data. The
researcher imports the output of the first subproblem into R programs for data visualization. One
of the advantages R offers is a wide range of open-source (i.e. free) packages that provide
advanced algorithms and complex graphing capabilities (Krill, 2015, para. 5). Some of the
important packages the researcher may use in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8).
Packages like ggplot2 and plotly provide specialized functionality related to plotting graphs.
Different options like bar graphs, pie charts, scatter plots, and heat maps are available for
visualization. Researchers can use these two packages together since they complement each
other. In ggplot2, "you provide the data, tell 'ggplot2' how to map variables to aesthetics, what
graphical primitives to use, and it takes care of the details" (Wickham et al., 2018, p. 1). 'ggplot'
is the most important function available in this package. This function allows the researcher to
create a new ggplot graph to visualize data (Wickham et al., 2018, p. 112). The researcher may
use this function in conjunction with functions like 'geom_bar' to create bar charts (Wickham et
al., 2018, p. 43). After creating a graph in ggplot2, the researcher can use 'plotly' to "easily translate
visualizations directly from R" for access from anywhere on the internet (Sievert et al.,
2018, p. 1). 'ggplotly' is one of the important functions; it converts a ggplot2 graph to a plotly graph
(Sievert et al., 2018, p. 22). Specialized packages such as 'RColorBrewer' create beautiful color
palettes for data visualization (Neuwirth, 2015, p. 1). 'dplyr' is a package that provides a "fast,
consistent tool for working with data frame like objects, both in memory and out of memory"
(Wickham, François, Henry, & Müller, 2018, p. 1). The researcher may use this package to work
with datasets using functions like 'bind' (Wickham, François, Henry, & Müller, 2018, p. 2),
'select' (Wickham, François, Henry, & Müller, 2018, p. 3), and 'filter' (Wickham, François,
Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is another useful package that
provides a tool for report generation in R (Xie et al., 2018, p. 1). 'knit' and 'stitch' are two
important functions in this package. 'knit' converts the data in an input file to a proper format
(Davis, 2018, p. 27). 'stitch' automatically creates "a report based on an R script and a template"
(Xie and Friendly, 2018, p. 65). This package is useful when the researcher wants to
create templates with documentation for other researchers to reference. The output of this
collected. Analysis alone is not enough for a clear understanding of that data. That is where the
data visualization aspect of bioinformatics comes into the picture. Data visualization complements
the analysis by providing clear and easily understandable graphs. Technology helps this
process by providing tools for data analysis as well as tools for visualization. The software
packages and commands available with technology allow one to reproduce the bioinformatics
Chapter 3: Methodology
The main requirement specified by the mentor is to reproduce the bioinformatics analysis
and the data visualization of the published mouse genetic data which is referred to in the
analysis and the graphs must meet a certain degree of accuracy to the results of this paper.
This research will follow the engineering design process (EDP). The research involves
requirements from the mentor, analysis of those requirements, and designing and developing a
solution in order to replicate the bioinformatics analysis and the graphs of the research paper by
Wang et al. (2018). Finally, as a part of testing, the mentor will compare the visualizations and
analyses of this research with those from the paper and provide feedback. Following are the
1. This research will use R programming since R is open-source software and is available
for free.
2. The mentor will provide the same data that was used during research by Wang et al.
(2018).
3. The mentor will review the results of the visualization and the bioinformatics analysis.
The researcher will review the requirements provided by the mentor to reproduce the
bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the
research paper by Wang et al. (2018). The researcher will make reproductions of figures 1e and
2a, and then provide an analysis of the trends. The researcher will also create a gene ontology
table, a figure similar to figure 2c from the research paper by Wang et al. (2018), but this figure
will be similar only in concept, because the researcher will attempt the reproduction with
RNA-seq data, which leads to different gene ontology terms and y-values. The figures the researcher
Figure 1e:
Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
Figure 2c:
Figure 2c.
2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of
Springer Nature.
Figure 2a:
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages
(before and after) were used for comparison, with the 2 cell stage as an example stage to
Before reproducing the graphs, the researcher will carry out the following steps to prepare the
1. The researcher will fetch mouse gene data provided by the mentor through HGCC
2. The researcher will access a script file to set up and execute the TopHat algorithm (an
algorithm that maps genetic data to corresponding parts of a genome) for processing of
the mouse gene data through HGCC for mapping the mouse genes to the mouse genome
(a set of chromosomes). The output of this algorithm (.bam and .bed files) is used as
3. The researcher will access a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compares differences in gene expression) on the mouse gene data from the
various stages of embryo development using this algorithm. The output of the CuffDiff
between two embryonic stages, which is used as input when further processing data for
visualization.
Since the goal of this research is to reproduce the graphs of the research paper using the same
data, statistical analysis of the data does not apply to this research and is not part of the scope.
The results of this processing are in the form of data tables, which one can understand better
through data visualization (i.e., plotting graphs). The researcher will carry out the following steps
1. The researcher will import the data from the '.diff' files generated by the CuffDiff comparisons
to other embryonic cell stages into an R program by reading the files into R.
Locate the folder that contains the '.diff' file on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
2. The researcher will write R programs to filter the imported data for selective processing
Use the subset function in R programming to filter out the data based on a filter applied
processing.
3. The researcher will extract a list and the number of genes specific to each embryonic
development stage.
Create lists of genes intersecting the previous and the next stage CuffDiff comparison
'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from
the CuffDiff comparison to the previous stage with the downregulated genes in the next
stage. For downregulated genes specific to a stage, intersect the downregulated genes
from the CuffDiff comparison to the previous stage with the upregulated genes in the
next stage.
4. The researcher will write a bar graph R program to reproduce figure 1e from the imported data.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
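The intersection logic in step 3 can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the script used in this research: the two data frames are hypothetical stand-ins for CuffDiff '.diff' output, the gene names and values are invented, and the column name log2_fold_change is a simplified placeholder for the fold-change column of a real file.

```r
# Hypothetical stand-ins for two CuffDiff comparisons (invented genes and values).
# A real '.diff' file is tab-delimited and could be read with read.delim().
prev_vs_stage <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
  log2_fold_change = c(2.1, -1.8, 3.0, 0.2),   # previous stage -> this stage
  significant = c("yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)
stage_vs_next <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
  log2_fold_change = c(-1.5, 2.2, 1.1, -0.3),  # this stage -> next stage
  significant = c("yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Keep only significant genes, split by direction of change.
up_genes <- function(tbl) {
  subset(tbl, significant == "yes" & log2_fold_change > 0)$gene
}
down_genes <- function(tbl) {
  subset(tbl, significant == "yes" & log2_fold_change < 0)$gene
}

# Upregulated and stage-specific: rises on entering the stage, falls on leaving it.
stage_up <- intersect(up_genes(prev_vs_stage), down_genes(stage_vs_next))
# Downregulated and stage-specific: falls on entering the stage, rises on leaving it.
stage_down <- intersect(down_genes(prev_vs_stage), up_genes(stage_vs_next))

# Number of genes in each list, ready to be plotted as bar heights.
stage_counts <- c(up = length(stage_up), down = length(stage_down))
print(stage_counts)
```

In this toy input, GeneC is excluded because it rises in both comparisons, and GeneD is excluded because it is not significant.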
After the reproduction is complete following these steps, the researcher will analyze the trends of
The researcher will carry out the following steps for producing gene ontology tables for genes
(www.geneontology.org) and input four lists total: two lists for combined 2 cell, 4 cell,
and 8 cell stages' upregulated and downregulated genes and two lists for combined ICM
and morula stages' upregulated and downregulated genes. The Gene Ontology
2. The researcher will then make Excel tables out of gene ontology terms with the highest P
values with an emphasis on highest P values first and then uniqueness of the terms for a
variety of gene ontology terms. The x-axis will be the gene ontology terms and the y-axis
3. The researcher will write a bar graph R program to plot each of the Excel tables
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
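The bar graph in step 3 might look like the following sketch, which assumes the ggplot2 package is installed. The table contents are invented, and the y-axis column (value) is only a placeholder because the quantity plotted on the y-axis is not specified here. A real table saved from Excel would first need to be exported to a format R can read, such as CSV.

```r
library(ggplot2)

# Hypothetical gene ontology table (invented terms and values).
go_table <- data.frame(
  term = c("term A", "term B", "term C"),
  value = c(5.2, 3.8, 2.1)
)

# One bar per gene ontology term, ordered from the largest value to the smallest.
go_plot <- ggplot(go_table, aes(x = reorder(term, -value), y = value)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Gene ontology term", y = "Value (placeholder)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(go_plot)
```

Using stat = "identity" tells geom_bar to plot the values in the table directly instead of counting rows.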
After the gene ontology graphs are complete following these steps, the researcher will analyze
The researcher will carry out the following steps to make the reproduction of figure 2a:
1. The researcher will take the lists of upregulated and downregulated genes specific to each
embryonic stage and extract the start and end locations of the genes from the CuffDiff
output. The genes and their respective start and end locations will be saved as an Excel
file.
2. The researcher will input this Excel file into an R script which will generate a matrix
from this data, normalize the matrix, and then run a k-means clustering on the matrix.
3. The researcher will create a heat map R program to visualize the final matrix.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
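Steps 2 and 3 can be sketched as follows, assuming the ggplot2 package is installed. The matrix here is hypothetical (six invented genes across three invented stages), and the choices of row-wise z-score normalization and three clusters are assumptions made for illustration; they are not confirmed as the settings of the original script.

```r
library(ggplot2)

set.seed(1)  # k-means starts from random centers; a fixed seed keeps runs repeatable

# Hypothetical matrix: 6 genes (rows) by 3 stages (columns).
expr <- matrix(
  c(9, 1, 1,
    8, 2, 1,
    1, 9, 2,
    2, 8, 1,
    1, 2, 9,
    2, 1, 8),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), c("stage1", "stage2", "stage3"))
)

# Normalize each gene (row) to mean 0 and standard deviation 1.
# scale() works on columns, so the matrix is transposed before and after.
norm <- t(scale(t(expr)))

# Group genes with similar normalized profiles into 3 clusters.
clusters <- kmeans(norm, centers = 3, nstart = 25)$cluster

# Order rows by cluster so that similar genes sit together in the heat map.
norm <- norm[order(clusters), ]

# Reshape to long format (one row per gene and stage pair) for ggplot2.
long <- as.data.frame(as.table(norm))
names(long) <- c("gene", "stage", "z")

heat <- ggplot(long, aes(x = stage, y = gene, fill = z)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  labs(x = "Stage", y = "Gene", fill = "Row z-score")

print(heat)
```

Normalizing by row makes the colors show where each gene peaks relative to its own average, so genes with very different absolute levels can still cluster together.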
After the reproduction is complete following these steps, the researcher will analyze the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher will carry out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher will use the following
Table 1
Test case #: Identifies each analysis scenario with a number.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
Expected figure : A picture of the original figure from the paper by Wang et al. (2018)
Reproduced figure: The figure the researcher generates as a result of following the method to
Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. A rating of 1 means that the reproduced figure is a different type of figure than
the expected figure. A rating of 2 means that the reproduced figure is the correct type of graph,
but with some error in the data being graphed, such as incorrect processing of data or plotting a
wrong column as an axis. A rating of 3 means that the reproduced figure is the same type of
figure as the expected figure and graphs the correct data, but cannot be numerically accurate due
to a data limitation. A rating of 4 means that the reproduced figure is the correct type of graph
and has plotted the correct values, but has minor errors that prevent it from being an exact
reproduction. A rating of 5 means that the reproduced figure is visually identical to the
expected figure.
Table 2
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Determines whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. For example, the test conclusion for a reproduced graph can be
either a "satisfactory match" if the same conclusion is derived from both figures or an
After completion of the above testing, the researcher will provide the following products to the
The mentor will review the results provided by the researcher, compare those with the original
research paper results, and validate that the comparisons of results are satisfactory.
The researcher will reproduce data visualization similar to the below sample graph as a part of
The researcher will conduct the research under the supervision and guidance of the
mentor. The timeline for this research will span from November 1, 2018 to December 7, 2018.
The intention of this research is to reproduce the data analysis and graphs of the original
research. Since the researcher will use the same data and similar tools and techniques used by
Wang et al. (2018), the authors of the research paper, this research is quite feasible and
replicable.
References
Center for Computational Biology. (2016). A spliced read mapper for RNA-Seq
https://ccb.jhu.edu/software/tophat/manual.shtml
https://www.sas.com/en_us/insights/big-data/data-visualization.html
http://genetics.emory.edu/about/index.html
Krill, P. (2015, June 30). Why R? The pros and cons of the R language. Retrieved from
https://www.infoworld.com/article/2940864/application-development/r-programming-
language-statistical-data-analysis.html
https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf
https://opensource.com/resources/what-open-source
Patil, N. (2018, November 18). Reproducing the Bioinformatics Analysis and Data
Porter, S. (2007, January 28). Basics: How do you sequence a genome? part III, reads and
sequence-genome-part-iii-reads-and-chromats
https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
R Core Team. (2018, July 2). R Language Definition (Version 3.5.1) [Software PDF].
project.org/doc/manuals/r-release/R-lang.pdf
RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from
https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf
Sequence Read Archive. (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch
Sequence Read Archive. (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., &
Despouy, P. (2018, July 20). Package 'plotly' (Version 4.8.0) [Software PDF].
https://cran.r-project.org/web/packages/plotly/plotly.pdf
https://www.medicinenet.com/script/main/art.asp?articlekey=16836
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of
http://had.co.nz/ggplot2/
http://r-pkgs.had.co.nz/intro.html
Wickham, H., Chang, W., Henry, L., Pederson, T. L., Takahashi, K., Wilke, C., & Woo, K. (July
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Wickham, H., François, R., Henry, L., & Müller, K. (2018, October 16). Package 'dplyr' (Version
https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., … Foster, Z. (February 20,
https://cran.r-project.org/web/packages/knitr/knitr.pdf
https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-
protocol-299461
http://www.chipseq.com/chromatin-immunoprecipitation/