Patil Nishant Projectproposal

Running Head: REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS
Reproducing the Bioinformatics Analysis and Data Visualization of a Research Paper
Nishant Patil
Wheeler High School Center for Advanced Studies
Internship at Yao Lab at Emory University
Table of Contents
Rationale of Study
Research Concept
Chapter 1: The Problem and Its Setting
A. Statement of the Problem
B. Foundation Context for the Study
C. Application Questions
D. Hypothesis
E. Definition of Terms
F. Assumptions
G. Delimitations and Limitations
H. Importance of Study
Chapter 2: Review of Literature
Chapter 3: Methodology
A. Evaluation and Needs of End Product
B. Type of Design and its Underlying Assumptions
C. Develop and Prototype Solution
D. Management Plan, Timeline, and Feasibility
E. References
Rationale
The main purpose of this research is to reproduce the bioinformatics analysis and the data
visualization that Wang et al. (2018) published. Data scientists all over the world use R
to analyze their data. R gives access to many ready-to-use packages, offers strong data
visualization capabilities, and is free to license, which makes it a widely preferred tool. Even so,
plotting complex data in R requires analyzing the requirements and developing code to program
a solution (SAS, n.d., para. 3). The scope of this research also includes using the TopHat
algorithm to map gene reads and the Cuffdiff algorithm to compare gene expression between
mapped samples for bioinformatics analysis (Center for Computational Biology at Johns
Hopkins University, 2016). The output of the TopHat and Cuffdiff commands is an essential
input for data visualization using R.
Genetics researchers collect a lot of gene samples and other data. Traditional methods of
data analysis may not suffice due to the huge volume and complexity of the collected data.
Scientists use various commands such as prefetch, fastq-dump, TopHat, bamToBed, and Cuffdiff
for preprocessing the genetic data to prepare it for analysis and graphing. Data visualization tools
in R allow scientists to interpret data more easily through graphic representation. One of the main
strengths of R is data manipulation and visualization: its visualization components make it easy
to turn large volumes of numbers into easy-to-understand plots.
For a person who is passionate about computer programming, it is important to see how
that technology can apply to various fields. This research project is a good opportunity to explore
how computer programming applies to the field of genetics. It will also provide insight into how
programmers can effectively represent data visually using R.
Research Concept
Chapter 1: The Problem and Its Setting
Statement of the Problem

How well can one reproduce the bioinformatics analysis and graphs from the
"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo
development" research paper?
Foundation Context for the Study
In their research, Wang et al. (2018) studied epigenetic reprogramming in mice during early
embryonic stages, specifically the marks of the heterochromatin marker histone 3 lysine 9
trimethylation (H3K9me3) on long terminal repeats (LTRs), which are long, repetitive sequences
of DNA. Mammalian fertilized eggs undergo epigenetic modifications following fertilization,
and thus the mouse genome becomes demethylated during early embryonic development. LTRs
become hypomethylated and need to be regulated, and previous studies have shown that
H3K9me3 modifiers regulate LTRs (Wang et al., 2018). However, the involvement of
H3K9me3-dependent heterochromatin (heterochromatin being tightly packed chromatin) in LTR
regulation was unknown. The study by Wang et al. (2018) identified the molecular details
behind the reprogramming of H3K9me3-dependent heterochromatin.
Application Questions
1. Foundational subproblem: How can one analyze bioinformatics data?
2. Foundational subproblem: How can one visualize bioinformatics analysis data using R
programming?
3. Applied subproblem: How well can one reproduce the bioinformatics analysis of
published genetic data?
4. Applied subproblem: How well can one reproduce the graphs of published data using R?
Hypothesis
1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the
data analysis that Wang et al. (2018) produced with a satisfactory degree of
similarity. The independent variable is the bioinformatics algorithms, and the dependent
variable is the degree of similarity.
2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the
visualization that Wang et al. (2018) produced with a satisfactory degree of
similarity. The independent variable is the packages used in R programming, and the
dependent variable is the degree of similarity.
Definition of Terms
1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store
biological data (Stöppler, 2016, para. 1).
2. ChIP-seq: A technique to analyze the histone modifications, DNA modifications, or
binding sequence of proteins within a genome ("What is Chromatin Immunoprecipitation
(ChIP)?" n.d., para. 1).
3. Cuffdiff: A program used to find significant changes in transcript expression,
splicing, and promoter use (Trapnell Lab at the University of Washington's Department
of Genome Sciences, 2017).
4. Data visualization: Data visualization is the presentation of data in a pictorial or
graphical format (SAS, n.d., para. 1).
5. Gene read: A nucleotide sequence obtained by sequencing genetic material is called a
read (Porter, 2007, para. 3).
6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which
tries to take the good parts of base and lattice graphics and none of the bad parts
(Wickham, 2013, para. 1).
7. Human Genetics Computer Cluster (HGCC): Human Genetics Computer Cluster
(HGCC) is a computer network composed of 1 head node and 22 compute nodes and
serves multiple functions related to genomic projects and data storage (Department of
Human Genetics at Emory University School of Medicine, 2017, para. 3).
8. Open Source: Open source software is software with source code that anyone can
inspect, modify, and enhance (Opensource.com, 2013, para. 3).
9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,
para. 1).
10. R: R is a system for statistical computation and graphics (R Core Team, 2018).
11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample
(Mackenzie, 2018, para. 1).
12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in
conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at
Johns Hopkins University, 2016, para. 1).
Assumptions
1. The R license will remain free during the course of the internship.
2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the
researcher will use will remain free during the course of the internship.
3. The mentor will supervise the researcher during the course of the internship.
4. The researcher will conduct research at Yao Lab at Emory University.
5. The researcher assumes that the published figures from the paper by Wang et al. (2018)
are accurate.
Delimitations and Limitations
1. Delimitation: R programming is the choice of language for data visualization in this
research.
2. Delimitation: The scope of this research is limited to reproducing the data analysis and
visualization results of the published data from the study by Wang et al. (2018).
3. Delimitation: Only the TopHat and Cuffdiff algorithms are in the scope to map and
compare genetic datasets.
4. Limitation: The researcher will use RNA-seq data, which limits the embryonic stages that
the researcher can analyze and will cause some reproductions to look different from the
original figures, which are based on ChIP-seq data; such reproductions may complement
the original figures rather than duplicate them.
5. Limitation: The time window available for the internship is from September 10th, 2018
to December 11th, 2018 only.
Importance of study
The study of genetics involves collecting a large amount of complex data. Analyzing that data
can be an overwhelming, challenging, and time-consuming job. This research aims to
reproduce the figures and analysis of a published study in order to learn how figures are
produced and how to interpret them. The process of reproduction is useful knowledge for
helping researchers understand how to conduct genetic research in the future, how to account
for differences, and how to interpret the sources of those differences.
Chapter 2: Literature Review
This research consists of two foundational subproblems: bioinformatics analysis and data
visualization. The bioinformatics foundational subproblem consists of collecting the sequenced
RNA-seq or ChIP-seq data, converting it to an appropriate format, mapping the genetic data to
an appropriate genome, comparing differential gene expression, and analyzing the data ("RNA
Sequencing Analysis with TopHat", n.d., p. 4). The second foundational subproblem involves
importing the output of the previous subproblem into R programs for data visualization. This
subproblem involves the use of various programs and packages in R, a package being the
"fundamental unit of shareable code" (Wickham, 2015, para. 1). The output of this stage
consists of various graphs that data scientists refer to for trends and related analysis.

The first foundational subproblem of the research is the bioinformatics analysis. As a part
of bioinformatics analysis, the researcher collects and processes RNA sequencing samples. The
researcher retrieves data samples using prefetch, a bioinformatics command that "allows
command-line downloading of SRA (Sequence Read Archive), dbGaP, and ADSP data"
(Sequence Read Archive, 2018a, para. 2). The researcher may use various options such as '-h' to
get help regarding the documentation of commands. The '-h' option "displays ALL
options, general usage, and version information" (Sequence Read Archive, 2018a, para. 2), which
makes it useful for technical help. The researcher may use the '-f' option to "force object
download" to ensure proper retrieval of raw data (Sequence Read Archive, 2018a, para. 2). The
'-l' option lists "the contents of a kart file" (Sequence Read Archive, 2018a, para. 2).
After downloading the data, the next step is to perform a file conversion from SRA to the
'.fastq.gz' file type. The 'fastq-dump' command allows conversion of "SRA data into fastq
format" (Sequence Read Archive, 2018b, para. 1). The researcher may use various options
available for this command. As always, the '-h' option is handy for easy reference of
documentation. The '-M' option is useful to filter data "by sequence length" (Sequence Read
Archive, 2018b, para. 2). The '-O' option helps to specify the output directory to store the
information (Sequence Read Archive, 2018b, para. 2). There are two types of data: paired-end
sequence data and single-read sequence data. The FASTQ files that the 'fastq-dump'
command generates differ depending on whether the SRA data is paired-end or single-read.
The next important step is to align the "RNA-Seq reads to a genome in order to identify
exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,
2016, para. 1). The researcher can use the TopHat command for this purpose. TopHat produces
different outputs depending on whether the sequences are paired-end or single-read. One of the
outputs of the TopHat command is an 'accepted_hits.bam' file, which is essential for further
computation (Center for Computational Biology at Johns Hopkins University, 2016, para. 19).
The next step is to use the bamToBed command, which "is a conversion utility that
converts sequence alignments in BAM format to BED" records (Quinlan Lab at University of
Utah, 2017, p. 46). The researcher runs the bamToBed command on the 'accepted_hits.bam' file
generated by the TopHat command. The researcher can use various command options such as
"-mate1", "-split", "-ed", and "-tag" to control how the alignments in the BAM file are
reported in the BED output (Quinlan Lab at University of Utah, 2017, p. 46).
The last important step is to compare differential gene expression using Cuffdiff "to find
significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the
University of Washington's Department of Genome Sciences, 2017). In simple terms, this
command compares gene expression between the mapped samples of an experiment. This step
concludes the first foundational subproblem of the research.
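The five preprocessing steps described above can be summarized as an ordered list of command lines. The sketch below is written in Python purely for illustration and only assembles and prints the commands; it does not run them. The accession ID, Bowtie index name, annotation file, second BAM file, and working directory are hypothetical placeholders, and a real analysis may need additional options.

```python
import shlex


def build_pipeline(accession, bowtie_index, annotation_gtf, other_bam, workdir="work"):
    """Return the preprocessing commands, in order, as argument lists."""
    fastq = f"{workdir}/{accession}.fastq.gz"
    tophat_out = f"{workdir}/{accession}_tophat"
    bam = f"{tophat_out}/accepted_hits.bam"
    return [
        # 1. Download the SRA record.
        ["prefetch", accession],
        # 2. Convert SRA to gzipped FASTQ in the working directory.
        ["fastq-dump", "--gzip", "-O", workdir, accession],
        # 3. Map reads to the genome; TopHat writes accepted_hits.bam.
        ["tophat", "-o", tophat_out, bowtie_index, fastq],
        # 4. Convert BAM alignments to BED records (written to standard output).
        ["bamToBed", "-i", bam],
        # 5. Compare expression between this sample and another mapped sample.
        ["cuffdiff", "-o", f"{workdir}/cuffdiff", annotation_gtf, bam, other_bam],
    ]


# Hypothetical inputs, for illustration only.
commands = build_pipeline(
    accession="SRR0000000",
    bowtie_index="mm_index",
    annotation_gtf="genes.gtf",
    other_bam="work/other_stage_tophat/accepted_hits.bam",
)

for command in commands:
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes the order of the steps explicit and shows how the output of one step ('accepted_hits.bam') becomes the input of the next.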
The second foundational subproblem of the research is data visualization of the
bioinformatic data output from the first foundational subproblem. The researcher will use the R
programming language for statistical analysis and graphical representation of data. The
researcher imports the output of the first subproblem into R programs for data visualization. One
of the advantages R offers is a wide range of open-source (i.e. free) packages that provide
advanced algorithms and complex graphing capabilities (Krill, 2015, para. 5). Some of the
important packages the researcher may use in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8).
Packages like ggplot2 and plotly provide specialized functionality related to plotting graphs.
Different options like bar graphs, pie charts, scatter plots, and heat maps are available for
visualization. Researchers can use these two packages together since they complement each
other. In ggplot2, "you provide the data, tell 'ggplot2' how to map variables to aesthetics, what
graphical primitives to use, and it takes care of the details" (Wickham et al., 2018, p. 1). 'ggplot'
is the most important function available in this package. This function allows the researcher to
create a new ggplot graph to visualize data (Wickham et al., 2018, p. 112). The researcher may
use this function in conjunction with functions like 'geom_bar' to create bar charts (Wickham et
al., 2018, p. 43). After creating a graph in ggplot2, the researcher can use 'plotly' to "easily
translate 'ggplot2' graphs to an interactive web-based version and/or create custom web-based
visualizations directly from R" for ease of access from anywhere on the internet (Sievert et al.,
2018, p. 1). 'ggplotly' is one of the important functions that allows conversion of ggplot2
graphs to plotly (Sievert et al., 2018, p. 22). Specialized packages such as 'RColorBrewer'
create color palettes for data visualization (Neuwirth, 2015, p. 1). 'dplyr' is a package that
provides a "fast, consistent tool for working with data frame like objects, both in memory and
out of memory" (Wickham, François, Henry, & Müller, 2018, p. 1). The researcher may use this package to work
with datasets using functions like 'bind' (Wickham, François, Henry, & Müller, 2018, p. 2),
'select' (Wickham, François, Henry, & Müller, 2018, p. 3), and 'filter' (Wickham, François,
Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is another useful package that
provides a tool for report generation in R (Xie et al., 2018, p. 1). 'knit' and 'stitch' are two
important functions in this package. 'knit' converts the data in an input file to a proper format
(Davis, 2018, p. 27). 'stitch' automatically creates "a report based on an R script and a template"
(Xie and Friendly, 2018, p. 65). This package is very useful in cases where the researcher can
create templates with documentation for other researchers to reference. The output of this
subproblem consists of various graphs for data visualization.

Understanding bioinformatics data requires thorough analysis of the sample data
collected. Analysis alone is not enough for a clear understanding of that data. That is where
the data visualization aspect of bioinformatics comes into the picture. Data visualization
complements the analysis by providing clear and easily understandable graphs. Technology helps
this process by providing tools for data analysis as well as tools for visualization. The software
packages and commands available allow one to reproduce the bioinformatics
analysis and graphs by reusing the raw data.

Chapter 3: Methodology
Evaluation and Needs of End Product
The main requirement specified by the mentor is to reproduce the bioinformatics analysis
and the data visualization of the published mouse genetic data reported in the
"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo
development" research paper by Wang et al. (2018). The reproductions of the bioinformatics
analysis and the graphs must meet a certain degree of accuracy relative to the results of this paper.
Type of Design and its Underlying Assumptions
This research will follow the engineering design process (EDP). The research involves
requirements from the mentor, analysis of those requirements, and designing and developing a
solution in order to replicate the bioinformatics analysis and the graphs of the research paper by
Wang et al. (2018). Finally, as a part of testing, the mentor will compare the visualizations and
analyses of this research with those from the paper and provide feedback. The following are the
underlying assumptions of this design:
1. This research will use R programming since it is open-source software and available
for free.
2. The mentor will provide the same data that was used during research by Wang et al.
(2018).
3. The mentor will review the results of the visualization and the bioinformatics analysis.
Develop and Prototype Solution

The researcher will review the requirements provided by the mentor to reproduce the
bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the
research paper by Wang et al. (2018). The researcher will make reproductions of figures 1e and
2a, and then provide an analysis of the trends. The researcher will also create a gene ontology
table, a figure similar to figure 2c from the research paper by Wang et al. (2018). This figure
will be similar only in concept, because the researcher will be attempting reproductions with
RNA-seq data, leading to different gene ontology terms and y-values. The figures the researcher
will reproduce are shown below:
Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during
embryonic development. Reprinted from "Reprogramming of H3K9me3-dependent
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming of
H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et al.
2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of
Springer Nature.
Figure 2a. A heatmap showing H3K9me3 domains during mouse embryo development.
Reprinted from "Reprogramming of H3K9me3-dependent
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Wang et al. (2018):
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages
(before and after) were used for comparison, with the 2 cell stage as an example stage to
demonstrate which stages to compare to.
Before reproducing the graphs, the researcher will carry out the following steps to prepare the
raw data for further processing for visualization:
1. The researcher will fetch mouse gene data provided by the mentor through HGCC
(Human Genetics Computer Cluster), a network of computers for genome projects, by
logging into their HGCC account.
2. The researcher will access a script file to set up and execute the TopHat algorithm (an
algorithm that maps genetic data to corresponding parts of a genome) through HGCC to
map the mouse gene data to the mouse genome (a set of chromosomes). The output of
this algorithm ('.bam' and '.bed' files) is used as input in the next step.
3. The researcher will access a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compares differences in gene expression) on the mouse gene data from the
TopHat output and compare modifications of H3K9me3-dependent heterochromatin at
various stages of embryo development using this algorithm. The output of the Cuffdiff
algorithm is a '.diff' file generated from comparison of differential gene expression
between two embryonic stages, which is used as input when further processing data for
visualization.
Since the goal of this research is to reproduce the graphs of the research paper using the same
data, statistical analysis of the data does not apply to this research and is not part of the scope.
The results of this processing are data tables, which one can understand better
through data visualization (i.e. plotting graphs). The researcher will carry out the following steps
to make the reproduction of figure 1e:
1. The researcher will import the data from the '.diff' files generated by the Cuffdiff
comparisons to other embryonic cell stages into an R program by reading the files into R.
Below is the programming logic for this step:
Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
2. The researcher will write R programs to filter the imported data for selective processing
based on predefined criteria for status and significance of genes.
Use the subset function in R programming to filter the data on particular columns
(status, significant, etc.) and remove undesirable data from processing.
3. The researcher will extract a list and the number of genes specific to each embryonic
development stage.
Create lists of genes intersecting the previous and the next stage Cuffdiff comparison
'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from
the Cuffdiff comparison to the previous stage with the downregulated genes from the
comparison to the next stage. For downregulated genes specific to a stage, intersect the
downregulated genes from the Cuffdiff comparison to the previous stage with the
upregulated genes from the comparison to the next stage.
4. The researcher will write a bar graph R program to reproduce figure 1e from the imported
data.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
as 'geom_bar', etc. to define the bar graph to be created.
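The filtering and intersection logic in steps 2 and 3 can be sketched as follows. The sketch is in Python rather than R and is illustrative only. It assumes the tab-separated layout of Cuffdiff's gene expression '.diff' output (columns named gene, status, log2(fold_change), and significant) and assumes that the first sample in each comparison is the earlier stage, so that a positive fold change means higher expression in the later stage. The gene names and values are made up.

```python
import csv
import io


def read_diff(text):
    """Parse tab-separated Cuffdiff output into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def significant_genes(rows, direction):
    """Keep genes with status OK and significant == yes that change in `direction`."""
    genes = set()
    for row in rows:
        if row["status"] != "OK" or row["significant"] != "yes":
            continue
        change = float(row["log2(fold_change)"])
        if (direction == "up" and change > 0) or (direction == "down" and change < 0):
            genes.add(row["gene"])
    return genes


def stage_specific(prev_vs_stage, stage_vs_next):
    """Return (upregulated, downregulated) gene sets specific to the middle stage."""
    up = significant_genes(prev_vs_stage, "up") & significant_genes(stage_vs_next, "down")
    down = significant_genes(prev_vs_stage, "down") & significant_genes(stage_vs_next, "up")
    return up, down


# Made-up sample data standing in for two '.diff' files.
HEADER = "gene\tstatus\tlog2(fold_change)\tsignificant\n"
PREV_VS_STAGE = HEADER + (
    "GeneA\tOK\t2.5\tyes\n"
    "GeneB\tOK\t-1.8\tyes\n"
    "GeneC\tOK\t3.0\tno\n"
    "GeneD\tNOTEST\t0\tno\n"
)
STAGE_VS_NEXT = HEADER + (
    "GeneA\tOK\t-2.0\tyes\n"
    "GeneB\tOK\t1.2\tyes\n"
    "GeneC\tOK\t-3.0\tyes\n"
)

up, down = stage_specific(read_diff(PREV_VS_STAGE), read_diff(STAGE_VS_NEXT))
print("upregulated:", sorted(up), "count:", len(up))
print("downregulated:", sorted(down), "count:", len(down))
```

The two counts per stage are the values that a bar graph such as figure 1e would plot, one pair of bars for each embryonic stage.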
After the reproduction is complete following these steps, the researcher will analyze the trends of
the graph to provide bioinformatics analysis of the data.
The researcher will carry out the following steps to produce gene ontology tables for genes
specific to stages of embryonic development (a reproduction of figure 2c):
1. The researcher will go to the Gene Ontology Consortium website
(www.geneontology.org) and input four lists total: two lists for combined 2 cell, 4 cell,
and 8 cell stages' upregulated and downregulated genes and two lists for combined ICM
and morula stages' upregulated and downregulated genes. The Gene Ontology
Consortium will analyze these lists for biological processes.
2. The researcher will then make Excel tables out of the gene ontology terms, prioritizing
the terms with the highest P values first and then the uniqueness of the terms, so that the
tables show a variety of gene ontology terms. The x-axis will be the gene ontology terms
and the y-axis will be the P value of the gene ontology term.
3. The researcher will write a bar graph R program to plot each of the Excel tables.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data. Use geometric functions such
as 'geom_bar', etc. to define the bar graph to be created.
After the gene ontology graphs are complete following these steps, the researcher will analyze
the trends of the graphs to provide bioinformatics analysis of the data.
The researcher will carry out the following steps to make the reproduction of figure 2a:
1. The researcher will take the lists of upregulated and downregulated genes specific to each
embryonic stage and extract the start and end locations of the genes from the Cuffdiff
output. The genes and their respective start and end locations will be saved as an Excel
file.
2. The researcher will input this Excel file into an R script which will generate a matrix
from this data, normalize the matrix, and then run a k-means clustering on the matrix.
3. The researcher will create a heat map R program to visualize the final matrix.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the matrix. Use geometric functions such
as 'geom_tile', etc. to define the heat map to be created.
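The normalize-then-cluster logic of step 2 can be sketched as follows. The sketch is in Python rather than R and is illustrative only. The proposal does not state the normalization method or the number of clusters, so row-wise z-score scaling and k = 2 are assumptions here, and the matrix values are made up (rows stand for genes, columns for embryonic stages).

```python
import math


def zscore_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (constant rows become zeros)."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        scaled.append([(v - mean) / std if std else 0.0 for v in row])
    return scaled


def kmeans(rows, k, iterations=20):
    """Plain Lloyd's algorithm; the first k rows seed the centroids for repeatability."""
    centroids = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign every row to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c])) for row in rows
        ]
        # Move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up matrix: two rising rows and two falling rows.
matrix = [
    [1.0, 2.0, 9.0],
    [9.0, 2.0, 1.0],
    [2.0, 4.0, 18.0],
    [18.0, 4.0, 2.0],
]

normalized = zscore_rows(matrix)
labels = kmeans(normalized, k=2)

# Order rows by cluster so that similar rows sit together in the heat map.
row_order = sorted(range(len(matrix)), key=lambda i: labels[i])
print("cluster labels:", labels)
print("row order for heat map:", row_order)
```

Normalizing before clustering groups rows by the shape of their trend across stages rather than by absolute magnitude, which is why the third row clusters with the first even though its values are twice as large.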
After the reproduction is complete following these steps, the researcher will analyze the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher will carry out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher will use the following
format for testing prototypes:
Table 1
Data Visualization Testing
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)
Test case #: A unique number that identifies each test scenario.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
researcher has attempted a reproduction of.
Expected figure: A picture of the original figure from the paper by Wang et al. (2018).
Reproduced figure: The figure the researcher generates as a result of following the method to
reproduce the corresponding figure.
Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. A rating of 1 means that the reproduced figure is a different type of figure than
the expected figure. A rating of 2 means that the reproduced figure is the correct type of graph,
but with some error in the data being graphed, such as incorrect processing of data or plotting a
wrong column as an axis. A rating of 3 means that the reproduced figure is the same type of
figure as the expected figure and graphs the correct data, but cannot be numerically accurate due
to a data limitation. A rating of 4 means that the reproduced figure is the correct type of graph
and has plotted the correct values, but has minor errors that prevent it from being an exact
reproduction. A rating of 5 means that the reproduced figure is visually identical to the
expected figure.
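The rating scale above can be read as a sequence of checks applied in order. The sketch below, in Python and illustrative only, encodes one such reading; the function name, the four yes/no inputs, and the order in which the checks are applied are assumptions made for illustration and are not part of the testing format itself.

```python
def degree_of_similarity(same_figure_type, correct_data, data_limited, minor_errors):
    """Map four yes/no observations about a reproduced figure to a rating from 1 to 5."""
    if not same_figure_type:
        return 1  # a different type of figure than the expected figure
    if not correct_data:
        return 2  # correct type of graph, but an error in the data being graphed
    if data_limited:
        return 3  # correct type and data, but not numerically accurate due to a data limitation
    if minor_errors:
        return 4  # correct type and values, with minor errors
    return 5      # visually identical to the expected figure


# Example: a correct bar graph built from RNA-seq data instead of ChIP-seq data.
print(degree_of_similarity(True, True, True, False))
```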
Table 2
Bioinformatics Analysis Testing
Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
Test case #: Unique number to identify the test case.
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
has attempted a reproduction of.
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Determines whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. For example, test conclusion of a reproduced graph can be
either a "satisfactory match" if the same conclusion is derived from both figures or an
"unsatisfactory match" if the analyses of the graphs are not complementary.
After completion of the above testing, the researcher will provide the following products to the
mentor for review:
● Data visualization graphs for bioinformatics analysis
● Bioinformatics analysis data results
The mentor will review the results provided by the researcher, compare those with the original
research paper results, and will validate that the comparisons of results are satisfactory.
The researcher will reproduce data visualization similar to the below sample graph as a part of
this research using R programming (Patil, 2018):
Management Plan, Timeline, and Feasibility
The researcher will conduct the research under the supervision and guidance of the
mentor. The timeline for this research will span from November 1st, 2018 to December 7, 2018.
The intention of this research is to reproduce the data analysis and graphs of the original
research. Since the researcher will use the same data and similar tools and techniques used by
Wang et al. (2018), the authors of the research paper, this research is quite feasible and
replicable.
References
Center for Computational Biology (2016). A spliced read mapper for RNA-Seq
[Software]. Retrieved from
https://ccb.jhu.edu/software/tophat/manual.shtml
Data Visualization: What it is and why it matters. (n.d.). Retrieved from
https://www.sas.com/en_us/insights/big-data/data-visualization.html
Emory University School of Medicine Department of Genetics. (n.d.). Retrieved from
http://genetics.emory.edu/about/index.html
Krill, P. (2015, June 30). Why R? The pros and cons of the R language. Retrieved from
https://www.infoworld.com/article/2940864/application-development/r-programming-language-statistical-data-analysis.html
Neuwirth, E. (February 19, 2015). Package 'RColorBrewer' (Version 1.1-2) [Software
PDF]. Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf
Opensource.com (2013). What is open source? Retrieved from
https://opensource.com/resources/what-open-source
Patil, N. (2018, November 18). Reproducing the Bioinformatics Analysis and Data
Visualization of a Research Paper. Unpublished manuscript.
Porter, S. (2007, January 28). Basics: How do you sequence a genome? part III, reads and
chromats. Retrieved from
https://digitalworldbiology.com/archive/basics-how-do-you-sequence-genome-part-iii-reads-and-chromats
Quinlan Lab at University of Utah (December 08, 2017). Bedtools Documentation
(Version 2.27.0) [Software PDF]. Retrieved from
https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
R Core Team. (July 2, 2018). R Language Definition (Version 3.5.1) [Software PDF].
Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from
https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf
Sequence Read Archive (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].
National Center for Biotechnology Information. Retrieved from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch
Sequence Read Archive (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].
National Center for Biotechnology Information. Retrieved from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., &
Despouy, P. (July 20, 2018). Package 'plotly' (Version 4.8.0) [Software PDF].
Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/web/packages/plotly/plotly.pdf
Stöppler, M. C. (2016). Definition of Bioinformatics. Retrieved from
https://www.medicinenet.com/script/main/art.asp?articlekey=16836
Trapnell Lab at University of Washington's Department of Genome Sciences (2017). Cufflinks
[Software]. Retrieved from
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of
H3K9me3-dependent heterochromatin during mammalian embryo development. Nature
Cell Biology, 20(5), 620-631. doi:10.1038/s41556-018-0093-4
Wickham, H. (2013). ggplot2 [Software]. Retrieved from
http://had.co.nz/ggplot2/
Wickham, H. (2015). R Packages [Software]. Retrieved from
http://r-pkgs.had.co.nz/intro.html
Wickham, H., Chang, W., Henry, L., Pederson, T. L., Takahashi, K., Wilke, C., & Woo, K. (2018,
July 3). Package 'ggplot2' (Version 3.0.0) [Software PDF]. Comprehensive R Archive
Network. Retrieved from
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Wickham, H., François, R., Henry, L., & Müller, K. (2018, October 16). Package 'dplyr' (Version
0.7.7) [Software PDF]. Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., … Foster, Z. (2018,
February 20). Package 'knitr' (Version 1.20) [Software PDF]. Comprehensive R Archive
Network. Retrieved from
https://cran.r-project.org/web/packages/knitr/knitr.pdf
Add:
https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461
http://www.chipseq.com/chromatin-immunoprecipitation/
