
Running Head: REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS 1

Reproducing the Bioinformatics Analysis and Data Visualization of a Research Paper
Nishant Patil

Wheeler High School Center for Advanced Studies

Internship at Yao Lab at Emory University

Table of Contents

Rationale of Study

Research Concept

Chapter 1: The Problem and Its Setting

A. Statement of the Problem

B. Foundation Context of the Study

C. Application Questions

D. Hypothesis

E. Definition of Terms

F. Assumptions

G. Delimitations and Limitations

H. Importance of Study

Chapter 2: Review of Literature

Chapter 3: Methodology

A. Evaluation and Needs of End Product

B. Type of Design and its Underlying Assumptions

C. Develop and Prototype Solution

D. Management Plan, Timeline, and Feasibility

E. References

Rationale

The main purpose of this research is to match the bioinformatics analysis and the data

visualization of the published data by Wang et al. (2018). Data scientists all over the world use R

for analysis of their data. R gives access to many ready-to-use packages and great data

visualization capabilities and has free licensing, making it a widely preferred tool. Although R

offers these great capabilities, it requires an analysis of the requirements and development of

code to program solutions in order to plot complex data (SAS, n.d., para. 3). The scope of this

research also includes using programs like the TopHat algorithm to map gene reads and the

Cuffdiff algorithm to compare gene mapping for bioinformatics analysis (Center for

Computational Biology at Johns Hopkins University, 2016). The output of the TopHat and

Cuffdiff commands is an essential input for data visualization using R.

Genetics researchers collect a lot of gene samples and other data. Traditional methods of

data analysis may not suffice due to the huge volume and complexity of the collected data.

Scientists use various commands such as prefetch, fastq-dump, TopHat, bamToBed, and Cuffdiff

for preprocessing the genetic data to prepare it for analysis and graphing. Data visualization tools

in R allow scientists to more easily discern data through graphic representation. One of the main

strengths of R is data manipulation and visualization. R’s visualization components make it easy

to turn huge numbers in the data into easy-to-understand plots.

For a person who is passionate about computer programming, it is important to see how

that technology can apply to various fields. This research project is a good opportunity to explore

how computer programming applies to the field of genetics. It will also provide insight into how

programmers can effectively represent data visually using R.

Research Concept

Chapter 1: The Problem and Its Setting

Statement of the Problem



How well can one reproduce the bioinformatics analysis and graphs from the

"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo

development" research paper?

Foundation Context for the Study

In their research, Wang et al. (2018) studied reprogramming in mice during early embryonic stages, specifically the marks of the heterochromatin marker histone 3 lysine 9 trimethylation (H3K9me3) on long terminal repeats (LTRs), which are long, repetitive sequences of DNA. Mammalian fertilized eggs undergo epigenetic modifications following fertilization, and thus the mouse genome becomes demethylated during early embryonic development. LTRs become hypomethylated and need to be regulated, and previous studies have shown that H3K9me3 modifiers regulate LTRs (Wang et al., 2018). However, the involvement of H3K9me3-dependent heterochromatin (heterochromatin being tightly packed chromatin) in LTR regulation was unknown. The study by Wang et al. (2018) uncovered the molecular details behind the reprogramming of H3K9me3-dependent heterochromatin.

Application Questions

1. Foundational subproblem: How can one analyze bioinformatics data?

2. Foundational subproblem: How can one visualize bioinformatics analysis data using R

programming?

3. Applied subproblem: How well can one reproduce the bioinformatics analysis of published genetic data?



4. Applied subproblem: How well can one reproduce the graphs of published data using R?

Hypothesis

1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the

data analysis that Wang et al. (2018) have produced with a satisfactory degree of

similarity. The independent variable is the bioinformatics algorithms, and the dependent

variable is the degree of similarity.

2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the

visualization that Wang et al. (2018) have produced with a satisfactory degree of

similarity. The independent variable is the packages used in R programming, and the

dependent variable is the degree of similarity.

Definition of Terms

1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store

biological data (Stöppler, 2016, para. 1).

2. ChIP-seq: A technique to analyze the histone modifications, DNA modifications, or

binding sequence of proteins within a genome ("What is Chromatin Immunoprecipitation

(ChIP)?" n.d., para. 1).

3. Cuffdiff: A program used to find significant changes in transcript expression,

splicing, and promoter use (Trapnell Lab at the University of Washington's Department

of Genome Sciences, 2017).



4. Data visualization: Data visualization is the presentation of data in a pictorial or

graphical format (SAS, n.d., para. 1).

5. Gene read: Each nucleotide sequence produced by sequencing is called a read (Porter, 2007, para. 3).

6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which

tries to take the good parts of base and lattice graphics and none of the bad parts

(Wickham, 2013, para. 1).

7. Human Genetics Computer Cluster (HGCC): Human Genetics Computer Cluster

(HGCC) is a computer network composed of 1 head node and 22 compute nodes and

serves multiple functions related to genomic projects and data storage (Department of

Human Genetics at Emory University School of Medicine, 2017, para. 3).

8. Open Source: Open source software is software with source code that anyone can

inspect, modify, and enhance (Opensource.com, 2013, para. 3).

9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,

para. 1).

10. R: R is a system for statistical computation and graphics (R Core Team, 2018).

11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample

(Mackenzie, 2018, para. 1).

12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in

conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at

Johns Hopkins University, 2016, para. 1).

Assumptions

1. The R programming licensing will remain free during the course of the internship.

2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the

researcher will use will remain free during the course of the internship.

3. The mentor will supervise the researcher during the course of the internship.

4. The researcher will conduct research at Yao Lab at Emory University.

5. The researcher assumes that the published figures from the paper by Wang et al. (2018)

are accurate.

Delimitations and Limitations

1. Delimitation: R programming is the choice of language for data visualization in this

research.

2. Delimitation: The scope of this research is limited to reproducing the data analysis and

visualization results of the published data from the study by Wang et al. (2018).

3. Delimitation: Only the TopHat and Cuffdiff algorithms are in the scope to map and

compare genetic datasets.

4. Limitation: The researcher will use RNA-seq data, which limits the embryonic stages that the researcher can analyze and will cause some reproduced graphs to differ from the originals, because the reproductions may only complement the ChIP-seq data of the original figures.

5. Limitation: The time window available for the internship is from September 10th, 2018

to December 11th, 2018 only.

Importance of Study

The study of genetics involves collection of a lot of complex data. Analysis of that data

can be a very overwhelming, challenging, and time-consuming job. This research aims to

reproduce the figures and analysis of a published research paper in order to learn how figures are produced and how to interpret them. The process of reproduction provides useful knowledge that helps researchers understand how to conduct genetic research in the future, how to account for differences, and how to interpret the sources of those differences.

Chapter 2: Literature Review

This research consists of two foundational subproblems: bioinformatics analysis and data

visualization. The bioinformatics foundational subproblem consists of collecting the sequenced RNA-seq or ChIP-seq data, converting the data to an appropriate format, mapping the genetic data to an appropriate genome, comparing differential gene expression, and analyzing the data ("RNA Sequencing Analysis with TopHat", n.d., p. 4). The second foundational subproblem involves importing the output of the previous subproblem into R programs for data visualization. This subproblem involves the use of programs and various packages in R, a package being a "fundamental unit of shareable code" (Wickham, 2015, para. 1). The output of this stage consists of various graphs

that data scientists refer to for trends and related analysis.



The first foundational subproblem of the research is the bioinformatics analysis. As a part of bioinformatics analysis, the researcher collects and processes RNA sequencing samples. The researcher retrieves data samples using prefetch, a bioinformatics command that "allows command-line downloading of SRA (Sequence Read Archive), dbGaP, and ADSP data" (Sequence Read Archive, 2018a, para. 2). Several options modify its behavior. The '-h' option "displays ALL options, general usage, and version information" (Sequence Read Archive, 2018a, para. 2), which makes it useful for technical help. The researcher may use the '-f' option to "force object download" to ensure proper retrieval of raw data (Sequence Read Archive, 2018a, para. 2). The '-l' option lists "the contents of a kart file" (Sequence Read Archive, 2018a, para. 2).

After downloading the data, the next step is to perform a file conversion from SRA to the '.fastq.gz' file type. The 'fastq-dump' command allows conversion of "SRA data into fastq format" (Sequence Read Archive, 2018b, para. 1). The researcher may use various options available for this command. As with prefetch, the '-h' option is handy for easy reference to the documentation. The '-M' option is useful to filter data "by sequence length" (Sequence Read Archive, 2018b, para. 2). The '-O' option specifies the output directory in which to store the converted files (Sequence Read Archive, 2018b, para. 2). There are two types of data: paired-end sequence data and single-read sequence data. The 'fastq-dump' command produces different FASTQ output depending on whether the SRA data is paired-end or single-read.

The next important step is to align the "RNA-Seq reads to a genome in order to identify

exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,

2016, para. 1). The researcher can use the TopHat command for this purpose. TopHat produces

different outputs depending on whether the sequences are paired-end or single-read. One of the

outputs of the TopHat command is an 'accepted_hits.bam' file, which is essential for further

computation (Center for Computational Biology at Johns Hopkins University, 2016, para. 19).

The next step is to use the bamToBed command, which "is a conversion utility that

converts sequence alignments in BAM format to BED" records (Quinlan Lab at University of

Utah, 2017, p. 46). The researcher runs the bamToBed command on the 'accepted_hits.bam' file generated by the TopHat command. The researcher can use various command options such as "-mate1", "-split", "-ed", and "-tag" to control how the BED records are written (Quinlan Lab at University of Utah, 2017, p. 46).

The last important step is to compare differential gene expression using Cuffdiff "to find

significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the

University of Washington's Department of Genome Sciences, 2017). In simple terms, this

command compares gene expression levels between the samples in an experiment. This step

concludes the first foundational subproblem of the research.
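
The preprocessing steps described above can be summarized as an ordered list of commands. The following is a minimal sketch, written in Python only to assemble the command lines (it does not run them). The accession, genome index, annotation, and file names are hypothetical placeholders, and the options shown are a small set of commonly documented ones, not necessarily the options used in this research.

```python
import shlex

# Sketch only: builds argument lists for the preprocessing commands without
# running them. All names passed in are hypothetical placeholders.

def build_pipeline(accession, bowtie_index, annotation_gtf, other_bam, paired=True):
    """Return the commands for one sample as lists of arguments, in run order."""
    out_dir = f"{accession}_tophat"
    if paired:
        # With --split-files, fastq-dump writes one FASTQ file per mate.
        dump = ["fastq-dump", "--split-files", "--gzip", accession]
        reads = [f"{accession}_1.fastq.gz", f"{accession}_2.fastq.gz"]
    else:
        dump = ["fastq-dump", "--gzip", accession]
        reads = [f"{accession}.fastq.gz"]
    bam = f"{out_dir}/accepted_hits.bam"
    return [
        ["prefetch", accession],                          # download the SRA record
        dump,                                             # convert SRA to FASTQ
        ["tophat", "-o", out_dir, bowtie_index] + reads,  # map reads to the genome
        ["bamToBed", "-i", bam],                          # BAM to BED, written to stdout
        ["cuffdiff", "-o", f"{accession}_cuffdiff", annotation_gtf, bam, other_bam],
    ]

# Usage: print the commands for a hypothetical paired-end sample.
for command in build_pipeline("SRR0000001", "mm10_index", "genes.gtf", "other_stage.bam"):
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes the difference between paired-end and single-read samples explicit: only the 'fastq-dump' options and the read files passed to TopHat change.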

The second foundational subproblem of the research is data visualization of the

bioinformatic data output from the first foundational subproblem. The researcher will use the R

programming language for statistical analysis and graphical representation of data. The

researcher imports the output of the first subproblem into R programs for data visualization. One

of the advantages R offers is a wide range of open-source (i.e. free) packages that provide

advanced algorithms and complex graphing capabilities (Krill, 2015, para. 5). Some of the

important packages the researcher may use in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8).

Packages like ggplot2 and plotly provide specialized functionality related to plotting graphs.

Different options like bar graphs, pie charts, scatter plots, and heat maps are available for

visualization. Researchers can use these two packages together since they complement each

other. In ggplot2, "you provide the data, tell 'ggplot2' how to map variables to aesthetics, what

graphical primitives to use, and it takes care of the details" (Wickham et al., 2018, p. 1). 'ggplot'

is the most important function available in this package. This function allows the researcher to

create a new ggplot graph to visualize data (Wickham et al., 2018, p. 112). The researcher may

use this function in conjunction with functions like 'geom_bar' to create bar charts (Wickham et

al., 2018, p. 43). After creating a graph in ggplot2, they can use 'plotly' to "easily translate

'ggplot2' graphs to an interactive web-based version and/or create custom web-based

visualizations directly from R" for ease of access from anywhere on the internet (Sievert et al., 2018, p. 1). 'ggplotly' is one of the important functions in this package; it converts a ggplot2 graph to a plotly graph (Sievert et al., 2018, p. 22). Specialized packages such as 'RColorBrewer' provide color palettes for data visualization (Neuwirth, 2015, p. 1). 'dplyr' is a package that provides a "fast, consistent tool for working with data frame like objects, both in memory and out of memory"

(Wickham, François, Henry, & Müller, 2018, p. 1). The researcher may use this package to work

with datasets using functions like 'bind' (Wickham, François, Henry, & Müller, 2018, p. 2),

'select' (Wickham, François, Henry, & Müller, 2018, p. 3), and 'filter' (Wickham, François,

Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is another useful package that

provides a tool for report generation in R (Xie et al., 2018, p. 1). 'knit' and 'stitch' are two

important functions in this package. 'knit' converts the data in an input file to a proper format

(Davis, 2018, p. 27). 'stitch' automatically creates "a report based on an R script and a template"

(Xie and Friendly, 2018, p. 65). This package is very useful in cases where the researcher can

create templates with documentation for other researchers to reference. The output of this

subproblem consists of various graphs for data visualization.



Understanding bioinformatics data requires thorough analysis of the sample data collected. Analysis alone, however, is not enough for a clear understanding of that data. That is where the data visualization aspect of bioinformatics comes into the picture. Data visualization complements the analysis by providing clear and easily understandable graphs. Technology helps this process by providing tools for data analysis as well as tools for visualization. The available software packages and commands allow one to reproduce the bioinformatics analysis and graphs by reusing the raw data.



Chapter 3: Methodology

Evaluation and Needs of End Product

The main requirement specified by the mentor is to reproduce the bioinformatics analysis and the data visualization of the published mouse genetic data presented in the "Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo development" research paper by Wang et al. (2018). The reproductions of the bioinformatics analysis and the graphs must match the results of this paper to a satisfactory degree.

Type of Design and its Underlying Assumptions

This research will follow the engineering design process (EDP). The research involves

requirements from the mentor, analysis of those requirements, and designing and developing a

solution in order to replicate the bioinformatics analysis and the graphs of the research paper by

Wang et al. (2018). Finally, as a part of testing, the mentor will compare the visualizations and

analyses of this research with those from the paper and provide feedback. The following are the

underlying assumptions of this design:

1. This research will use R programming since it is open-source software and available

for free.

2. The mentor will provide the same data that was used during research by Wang et al.

(2018).

3. The mentor will review the results of the visualization and the bioinformatics analysis.

Develop and Prototype Solution



The researcher will review the requirements provided by the mentor to reproduce the

bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the

research paper by Wang et al. (2018). The researcher will make reproductions of figures 1e and 2a, and then provide an analysis of the trends. The researcher will also create a gene ontology table, a figure similar to figure 2c from the research paper by Wang et al. (2018). This figure will be similar only in concept, because the researcher will be attempting the reproduction with RNA-seq data, which leads to different gene ontology terms and y-values. The figures the researcher will reproduce are shown below:

Figure 1e:

Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during

embryonic development. Reprinted from "Reprogramming of H3K9me3-dependent

heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell

Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.

Figure 2c:

Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.

Figure 2a:

Figure 2a. A heatmap showing H3K9me3 domains during mouse embryo development. Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.

The following figure demonstrates the logic behind reproducing figure 1e from the paper by

Wang et al. (2018):

Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.

Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages

(before and after) were used for comparison, with the 2 cell stage as an example stage to

demonstrate which stages to compare to.
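
The comparison rule in this schematic can be sketched in a few lines. The example below is written in Python for illustration only (the research itself uses R), and the stage names and their order are assumptions chosen for the example, not values taken from the data.

```python
# Hypothetical stage labels in developmental order (an assumption for this sketch).
STAGES = ["oocyte", "2cell", "4cell", "8cell", "morula", "ICM"]

def neighbor_comparisons(stages, stage):
    """Return the (previous, next) stages that `stage` is compared against.

    A missing neighbor (for the first or last stage) is returned as None.
    """
    i = stages.index(stage)
    previous_stage = stages[i - 1] if i > 0 else None
    next_stage = stages[i + 1] if i + 1 < len(stages) else None
    return previous_stage, next_stage

# Usage: the 2 cell stage is compared to the stage before it and the stage after it.
print(neighbor_comparisons(STAGES, "2cell"))
```

The first and last stages in the list have only one neighbor, so only one comparison is available for them.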

Before reproducing the graphs, the researcher will carry out the following steps to prepare the

raw data for further processing for visualization:

1. The researcher will fetch mouse gene data provided by the mentor through HGCC (Human Genetics Computer Cluster), a network of computers for genome projects, by logging into their HGCC account.

2. The researcher will access a script file to set up and execute the TopHat algorithm (an algorithm that maps genetic data to corresponding parts of a genome) on HGCC in order to map the mouse gene data to the mouse genome (a set of chromosomes). The output of this algorithm ('.bam' and '.bed' files) is used as input in the next step.



3. The researcher will access a script file to set up and execute the Cuffdiff algorithm (an algorithm that compares differences in gene expression) on the mouse gene data from the TopHat output and compare modifications of H3K9me3-dependent heterochromatin at various stages of embryo development using this algorithm. The output of the Cuffdiff algorithm is a '.diff' file generated from comparison of differential gene expression between two embryonic stages, which is used as input when further processing data for visualization.

Since the goal of this research is to reproduce the graphs of the research paper using the same

data, statistical analysis of the data does not apply to this research and is not part of the scope.

The results of this processing are in the form of data tables, which one can understand better

through data visualization (i.e. plotting graphs). The researcher will carry out the following steps

to make the reproduction of figure 1e:

1. The researcher will import the data from the '.diff' files generated by the Cuffdiff comparisons between embryonic stages into an R program by reading the files into R.

Below is the programming logic for this step:

Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate

'.diff' file and import it into an R data table by reading each line of data.

2. The researcher will write R programs to filter the imported data for selective processing

based on predefined criteria for status and significance of genes.

Below is the programming logic for this step:

Use the subset function in R programming to filter the data on particular columns (status, significant, etc.) and remove undesirable data from processing.

3. The researcher will extract a list and the number of genes specific to each embryonic

development stage.

Below is the programming logic for this step:

Create lists of genes by intersecting the Cuffdiff comparison '.diff' files for the previous and the next stage. For upregulated genes specific to a stage, intersect the genes upregulated in the Cuffdiff comparison to the previous stage with the genes downregulated in the comparison to the next stage. For downregulated genes specific to a stage, intersect the genes downregulated in the Cuffdiff comparison to the previous stage with the genes upregulated in the comparison to the next stage.

4. The researcher will write a bar graph R program to reproduce figure 1e from the imported data.

Below is the programming logic for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.
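
Steps 1 through 3 above can be sketched as follows. This is an illustrative Python outline, not the R program used in the research. It assumes that each '.diff' file is the tab-separated gene-level table written by Cuffdiff, with the columns 'gene', 'status', 'log2(fold_change)', and 'significant', and that the earlier stage is the first sample in every comparison. The function names are hypothetical.

```python
import csv
import io

def read_diff(handle):
    """Read a tab-separated Cuffdiff table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

def significant_genes(rows, direction):
    """Genes with status OK, significant == yes, and the requested direction."""
    genes = set()
    for row in rows:
        if row["status"] != "OK" or row["significant"] != "yes":
            continue  # drop untested, failed, or non-significant rows
        change = float(row["log2(fold_change)"])
        if (direction == "up" and change > 0) or (direction == "down" and change < 0):
            genes.add(row["gene"])
    return genes

def stage_specific(previous_vs_stage, stage_vs_next):
    """Apply the intersection rule for one stage.

    Upregulated: up relative to the previous stage AND down in the next stage.
    Downregulated: down relative to the previous stage AND up in the next stage.
    """
    up = significant_genes(previous_vs_stage, "up") & significant_genes(stage_vs_next, "down")
    down = significant_genes(previous_vs_stage, "down") & significant_genes(stage_vs_next, "up")
    return up, down

# Usage with two tiny made-up tables; real files would be opened with open(path).
HEADER = "gene\tstatus\tlog2(fold_change)\tsignificant\n"
previous_vs_stage = read_diff(io.StringIO(HEADER + "GeneA\tOK\t2.0\tyes\nGeneB\tOK\t-1.5\tyes\n"))
stage_vs_next = read_diff(io.StringIO(HEADER + "GeneA\tOK\t-2.0\tyes\nGeneB\tOK\t1.0\tyes\n"))
up, down = stage_specific(previous_vs_stage, stage_vs_next)
print(len(up), len(down))  # the two counts that would become bar heights
```

The sizes of the two sets for each stage are the values that the bar graph in step 4 would plot.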

After the reproduction is complete following these steps, the researcher will analyze the trends of

the graph to provide bioinformatics analysis of the data.

The researcher will carry out the following steps to produce gene ontology tables for genes specific to stages of embryonic development (a reproduction of figure 2c):

1. The researcher will go to the Gene Ontology Consortium website

(www.geneontology.org) and input four lists total: two lists for combined 2 cell, 4 cell,

and 8 cell stages' upregulated and downregulated genes and two lists for combined ICM

and morula stages' upregulated and downregulated genes. The Gene Ontology

Consortium will analyze these lists for biological processes.

2. The researcher will then make Excel tables of the gene ontology terms with the highest P values, prioritizing the highest P values first and then the uniqueness of the terms so that a variety of gene ontology terms is represented. The x-axis will be the gene ontology terms and the y-axis will be the P value of the gene ontology term.

3. The researcher will write a bar graph R program to plot each of the Excel tables.

Below is the programming logic for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.
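
The selection of terms in step 2 can be sketched as follows. This Python example is illustrative only and rests on a stated assumption: that "highest P values" refers to the most significant terms once P values are expressed as -log10(P), a scale often used for gene ontology bar charts. The function name and the input format (pairs of term and raw P value) are hypothetical, and "uniqueness" is reduced here to keeping each term only once.

```python
import math

def top_go_terms(results, n):
    """Return up to n (term, -log10(P)) pairs, most significant first.

    `results` is an iterable of (term, p_value) pairs with p_value > 0.
    If a term appears more than once, its most significant entry is kept.
    """
    best = {}
    for term, p_value in results:
        score = -math.log10(p_value)
        if term not in best or score > best[term]:
            best[term] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Usage with made-up terms and P values.
example = [("term A", 1e-5), ("term B", 1e-3), ("term A", 1e-4), ("term C", 1e-8)]
for term, score in top_go_terms(example, 2):
    print(term, round(score, 2))
```

Each returned pair corresponds to one bar: the term is the x-axis label and the score is the bar height under the assumption above.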

After the gene ontology graphs are complete following these steps, the researcher will analyze

the trends of the graphs to provide bioinformatics analysis of the data.

The researcher will carry out the following steps to make the reproduction of figure 2a:

1. The researcher will take the lists of upregulated and downregulated genes specific to each

embryonic stage and extract the start and end locations of the genes from the Cuffdiff

output. The genes and their respective start and end locations will be saved as an Excel

file.

2. The researcher will input this Excel file into an R script which will generate a matrix

from this data, normalize the matrix, and then run a k-means clustering on the matrix.

3. The researcher will create a heat map R program to visualize the final matrix.

Below is the programming logic for this step:



Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_tile', etc. to define the heat map to be created.
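
The normalization and clustering in step 2 can be sketched as follows. This is an illustrative Python version, not the R script used in the research. The text does not specify the normalization method or the layout of the matrix, so the sketch assumes a row-wise z-score on a matrix with one row per gene and one column per stage, and it uses a simple deterministic seeding so that repeated runs give the same clusters. All function names are hypothetical.

```python
import statistics

def zscore_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (constant rows become zeros)."""
    scaled = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        scaled.append([(x - mean) / sd if sd else 0.0 for x in row])
    return scaled

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_centers(rows, k):
    """Start from the first row, then repeatedly add the row farthest from all centers."""
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(squared_distance(r, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(rows, k, max_iterations=100):
    """Plain k-means; returns one cluster label (0 to k-1) per row."""
    centers = seed_centers(rows, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [statistics.fmean(column) for column in zip(*members)]
    return labels

# Usage with a made-up matrix: two rising rows and two falling rows.
matrix = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [9, 6, 3]]
print(kmeans(zscore_rows(matrix), 2))
```

Sorting the rows by cluster label before plotting groups genes with similar patterns into contiguous bands of the heat map.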

After the reproduction is complete following these steps, the researcher will analyze the trends of

the graph to provide bioinformatics analysis of the data. After completing the reproductions and

their analysis, the researcher will carry out testing of the bioinformatics analysis results and data

visualization plots before submitting them to the mentor. The researcher will use the following

format for testing prototypes:

Table 1

Data Visualization Testing

Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)

Test case #: Unique number to identify the test case.

Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the researcher has attempted to reproduce.

Expected figure: A picture of the original figure from the paper by Wang et al. (2018).

Reproduced figure: The figure the researcher generates as a result of following the method to

reproduce the corresponding figure.

Degree of similarity: The visual similarity between the expected figure and the reproduced figure

on a scale of 1-5. A rating of 1 means that the reproduced figure is a different type of figure than

the expected figure. A rating of 2 means that the reproduced figure is the correct type of graph,

but with some error in the data being graphed, such as incorrect processing of data or plotting a

wrong column as an axis. A rating of 3 means that the reproduced figure is the same type of

figure as the expected figure and graphs the correct data, but cannot be numerically accurate due

to a data limitation. A rating of 4 means that the reproduced figure is the correct type of graph and has plotted the correct values, but has minor errors that prevent it from being an exact reproduction. A rating of 5 means that the reproduced figure is visually identical to the expected figure.

Table 2

Bioinformatics Analysis Testing

Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion

Test case #: Unique number to identify the test case.

Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher

has attempted a reproduction of.

Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated

figure.

Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced

figure.

Conclusion: Determines whether the analysis of the associated figure is a satisfactory match to

the analysis of the reproduced figure. For example, the conclusion for a reproduced graph can be

either a "satisfactory match" if the same conclusion is derived from both figures or an

"unsatisfactory match" if the analyses of the graphs are not complementary.

After completion of the above testing, the researcher will provide the following products to the

mentor for review:

● Data visualization graphs for bioinformatics analysis

● Bioinformatics analysis data results

The mentor will review the results provided by the researcher, compare those with the original

research paper results, and will validate that the comparisons of results are satisfactory.

The researcher will produce data visualizations similar to the sample graph below as a part of this research using R programming (Patil, 2018):

Management Plan, Timeline, and Feasibility

The researcher will conduct the research under the supervision and guidance of the

mentor. The timeline for this research will span from November 1st, 2018 to December 7th, 2018.

The intention of this research is to reproduce the data analysis and graphs of the original

research. Since the researcher will use the same data and similar tools and techniques used by

Wang et al. (2018), the authors of the research paper, this research is quite feasible and

replicable.

References

Center for Computational Biology at Johns Hopkins University. (2016). TopHat: A spliced read mapper for RNA-Seq

[Software]. Retrieved from

https://ccb.jhu.edu/software/tophat/manual.shtml

SAS. (n.d.). Data visualization: What it is and why it matters. Retrieved from

https://www.sas.com/en_us/insights/big-data/data-visualization.html

Emory University School of Medicine Department of Genetics. (n.d.). Retrieved from

http://genetics.emory.edu/about/index.html

Krill, P. (2015, June 30). Why R? The pros and cons of the R language. Retrieved from

https://www.infoworld.com/article/2940864/application-development/r-programming-language-statistical-data-analysis.html

Neuwirth, E. (February 19, 2015). Package 'RColorBrewer' (Version 1.1-2) [Software

PDF]. Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf

Opensource.com (2013). What is open source? Retrieved from

https://opensource.com/resources/what-open-source

Patil, N. (2018, November 18). Reproducing the Bioinformatics Analysis and Data

Visualization of a Research Paper. Unpublished manuscript.

Porter, S. (2007, January 28). Basics: How do you sequence a genome? part III, reads and chromats. Retrieved from

https://digitalworldbiology.com/archive/basics-how-do-you-sequence-genome-part-iii-reads-and-chromats

Quinlan Lab at University of Utah (December 08, 2017). Bedtools Documentation

(Version 2.27.0) [Software PDF]. Retrieved from

https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf

R Core Team. (July 2, 2018). R Language Definition (Version 3.5.1) [Software PDF].

Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf

RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from

https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf

Sequence Read Archive (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].

National Center for Biotechnology Information. Retrieved from



https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch

Sequence Read Archive (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].

National Center for Biotechnology Information. Retrieved from

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., &

Despouy, P. (July 20, 2018). Package 'plotly' (Version 4.8.0) [Software PDF].

Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/web/packages/plotly/plotly.pdf

Stöppler, M. C. (2016). Definition of Bioinformatics. Retrieved from

https://www.medicinenet.com/script/main/art.asp?articlekey=16836

Trapnell Lab at the University of Washington's Department of Genome Sciences. (2017). Cufflinks [Software]. Retrieved from

http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/

Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of

H3K9me3-dependent heterochromatin during mammalian embryo development. Nature

Cell Biology, 20(5), 620-631. doi:10.1038/s41556-018-0093-4

Wickham, H. (2013). ggplot2 [Software]. Retrieved from

http://had.co.nz/ggplot2/

Wickham, H. (2015). R Packages [Software]. Retrieved from

http://r-pkgs.had.co.nz/intro.html

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., & Woo, K. (July

3, 2018). Package 'ggplot2' (Version 3.0.0) [Software PDF]. Comprehensive R Archive

Network. Retrieved from



https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Wickham, H., François, R., Henry, L., & Müller, K. (October 16, 2018). Package 'dplyr' (Version

0.7.7) [Software PDF]. Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/web/packages/dplyr/dplyr.pdf

Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., … Foster, Z. (February 20,

2018). Package 'knitr' (Version 1.20) [Software PDF]. Comprehensive R Archive

Network. Retrieved from

https://cran.r-project.org/web/packages/knitr/knitr.pdf

https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461

http://www.chipseq.com/chromatin-immunoprecipitation/
