
Data Collection, Processing, and Analysis

The researcher reviewed the requirements provided by the mentor for reproducing the
bioinformatics analysis and the data visualization (i.e., graphs) in the research paper by
Wang et al. (2018). The researcher reproduced figures 1e and 2a and then analyzed the trends
they showed. The researcher also created gene ontology tables and, from them, a figure similar
to figure 2c from the research paper by Wang et al. (2018). This figure was similar only in
concept, because the researcher attempted the reproductions with RNA-seq data, which led to
different gene ontology terms and y-values. The figures the researcher attempted to reproduce
are shown below:

Figure 1e:

Figure 1e. ChIP-seq graph: Established (red) and

disappeared (blue) H3K9me3 marks during embryonic development. Reprinted from

"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo

development," by Wang et al. 2018, Nature Cell Biology, 20, p. 621. Copyright 2018 by

Macmillan Publishers Limited, part of Springer Nature.


Figure 2c:

Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming

of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et

al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part

of Springer Nature.

Figure 2a:

Figure 2a. A heatmap showing H3K9me3 domains during mouse

embryo development. Reprinted from "Reprogramming of H3K9me3-dependent

heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell

Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.

The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Wang et al. (2018):

Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.

Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages

(before and after) were used for comparison, with the 2 cell stage as an example stage to

demonstrate which stages to compare to.

Before reproducing the graphs, the researcher carried out the following steps to prepare the raw

data for further processing for visualization:

1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human

Genetic Computer Cluster), a network of computers for genome projects, by logging into

their HGCC account.

2. The researcher accessed a script file to set up and execute the TopHat algorithm (an
algorithm that maps RNA-seq reads to the corresponding parts of a genome) on HGCC,
aligning the mouse gene data to the mouse genome (a set of chromosomes). The output of
this algorithm (.bam and .bed files) was used as input in the next step.

3. The researcher accessed a script file to set up and execute the CuffDiff algorithm (an
algorithm that compares differences in gene expression) on the TopHat output and used it
to compare modifications of H3K9me3-dependent heterochromatin at various stages of
embryo development. The output of the CuffDiff algorithm is a '.diff' file generated from
the comparison of gene expression between two embryonic stages, which was used as input
when further processing the data for visualization.

Since the goal of this research was to reproduce the graphs of the research paper using the

RNA-seq data, statistical analysis of the data did not apply to this research and was not part of

the scope. The results of this processing were in the form of data tables which one can

understand better through data visualization (i.e. plotting graphs). The researcher carried out the

following steps to make the reproduction of figure 1e:

1. The researcher imported the data from the '.diff' files generated by the CuffDiff
comparisons between embryonic cell stages into an R program by reading the files into R.

Below is the programming logic the researcher used for this step:

Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate

'.diff' file and import it into an R data table by reading each line of data.
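
The import step could be sketched in R as follows. This is a minimal illustration, not the researcher's script. It assumes the '.diff' file uses the tab-delimited column layout of CuffDiff's gene_exp.diff output. The file written here is a small stand-in with made-up values, and the name diff_path is hypothetical.

```r
# Stand-in for a CuffDiff '.diff' file (values are illustrative only).
diff_path <- tempfile(fileext = ".diff")
writeLines(c(
  paste(c("test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)",
          "test_stat", "p_value", "q_value", "significant"), collapse = "\t"),
  paste(c("G1", "G1", "GeneA", "chr1:100-900", "2cell", "4cell",
          "OK", "1.0", "8.0", "3.0", "4.2", "0.0001", "0.001", "yes"),
        collapse = "\t"),
  paste(c("G2", "G2", "GeneB", "chr2:500-1500", "2cell", "4cell",
          "NOTEST", "0.0", "0.1", "0.0", "0.0", "1", "1", "no"),
        collapse = "\t")
), diff_path)

# Read the tab-delimited file into a data frame, one row per gene.
# check.names = FALSE keeps the column name "log2(fold_change)" unchanged.
diff_data <- read.delim(diff_path, stringsAsFactors = FALSE,
                        check.names = FALSE)

str(diff_data)
```

With the default check.names = TRUE, R would rename the fold-change column to a syntactically valid name, so keeping the original name is a choice made here for readability.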

2. The researcher wrote R programs to filter the imported data for selective processing

based on predefined criteria for status and significance of genes.

Below is the programming logic the researcher used for this step:

Use the subset function in R to keep only the rows that meet the criteria set on particular
columns (status, significant, etc.), removing undesirable data from processing.
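
A minimal sketch of such a filter, under stated assumptions: the exact criteria are not given in the text, so keeping rows with status "OK" and significant "yes" is an assumption, and the column name log2fc is a hypothetical shortened name for the fold-change column. The data frame is made up for illustration.

```r
# Toy rows with the columns used for filtering (values are illustrative only).
diff_data <- data.frame(
  gene        = c("GeneA", "GeneB", "GeneC", "GeneD"),
  status      = c("OK", "NOTEST", "OK", "OK"),
  log2fc      = c(3.0, 0.0, -2.5, 0.2),
  significant = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Keep genes that were tested successfully and flagged as significant
# (assumed criteria; the original thresholds are not stated).
kept <- subset(diff_data, status == "OK" & significant == "yes")

# Split the kept genes by direction of change.
up   <- subset(kept, log2fc > 0)$gene
down <- subset(kept, log2fc < 0)$gene

print(kept)
```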

3. The researcher extracted a list and the number of genes specific to each embryonic

development stage.

Below is the programming logic the researcher used for this step:

Create lists of genes intersecting the previous and the next stage CuffDiff comparison

'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from

the CuffDiff comparison to the previous stage with the downregulated genes in the next

stage. For downregulated genes specific to a stage, intersect the downregulated genes

from the CuffDiff comparison to the previous stage with the upregulated genes in the

next stage.
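
The intersection logic above can be sketched with base R's intersect function. The gene names and the four input vectors are hypothetical; in practice they would come from the filtered '.diff' tables for the two comparisons around a stage.

```r
# Hypothetical gene lists from the two CuffDiff comparisons around one stage.
up_vs_prev   <- c("GeneA", "GeneB", "GeneC")  # higher than in the previous stage
down_vs_prev <- c("GeneD", "GeneE")           # lower than in the previous stage
up_in_next   <- c("GeneD", "GeneF")           # rises in the next stage
down_in_next <- c("GeneA", "GeneC", "GeneG")  # drops in the next stage

# Upregulated only at this stage: up relative to previous, down afterwards.
stage_up <- intersect(up_vs_prev, down_in_next)

# Downregulated only at this stage: down relative to previous, up afterwards.
stage_down <- intersect(down_vs_prev, up_in_next)

# Number of stage-specific genes in each direction.
counts <- c(up = length(stage_up), down = length(stage_down))
print(counts)
```

Repeating this for every stage gives one pair of counts per stage, which is the kind of table a bar graph can be drawn from.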

4. The researcher wrote a bar graph R program to reproduce figure 1e.

Below is the programming logic the researcher used for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.
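
A minimal sketch of such a bar graph with ggplot2 follows. The counts are made up and are not results from this research. The stage labels, the column names, and the choice of red for upregulated and blue for downregulated genes are assumptions made for illustration.

```r
library(ggplot2)

# Made-up counts of stage-specific genes (not results from the study).
counts <- data.frame(
  stage     = factor(rep(c("2 cell", "4 cell", "8 cell"), each = 2),
                     levels = c("2 cell", "4 cell", "8 cell")),
  direction = rep(c("upregulated", "downregulated"), times = 3),
  n_genes   = c(120, 80, 95, 60, 70, 110)
)

# stat = "identity" makes geom_bar use n_genes as the bar height;
# position = "dodge" places the two directions side by side.
p <- ggplot(counts, aes(x = stage, y = n_genes, fill = direction)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c(upregulated = "red", downregulated = "blue")) +
  labs(x = "Embryonic stage", y = "Number of genes", fill = NULL)

print(p)
```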

After the reproduction was complete following these steps, the researcher analyzed the trends of

the graph to provide bioinformatics analysis of the data.

The researcher carried out the following steps for producing gene ontology tables for

genes specific to stages of embryonic development for a reproduction of figure 2c:

1. The researcher went to the Gene Ontology Consortium website (www.geneontology.org)

and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'

upregulated and downregulated genes and two lists for combined ICM and morula stages'

upregulated and downregulated genes. The Gene Ontology Consortium analyzed these

lists for biological processes.

2. The researcher then made Excel tables out of the gene ontology terms with the highest P
values. Terms were chosen first by highest P value and then by uniqueness, so that the
tables covered a variety of gene ontology terms. The x-axis was the gene ontology term, and
the y-axis was the P value of that term from the analysis.

3. The researcher wrote a bar graph R program to plot each of the Excel tables.

Below is the programming logic the researcher used for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.
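
The term selection in step 2 was done by hand in Excel. A comparable selection could be sketched in R as below. The terms and P values are made up, and top_terms is a hypothetical helper. The sort direction is left as an argument, since it depends on whether the table stores raw P values or a transformed score such as -log10(P).

```r
# Made-up gene ontology results (terms and P values are illustrative only).
go <- data.frame(
  term    = c("cell cycle", "cell cycle", "translation",
              "RNA processing", "DNA repair"),
  p_value = c(1e-8, 3e-6, 2e-5, 4e-3, 1e-2),
  stringsAsFactors = FALSE
)

# Order the terms, drop repeated terms, and keep the first n.
top_terms <- function(go, n, decreasing) {
  ordered <- go[order(go$p_value, decreasing = decreasing), ]
  unique_terms <- ordered[!duplicated(ordered$term), ]
  head(unique_terms, n)
}

# decreasing = TRUE follows the "highest P values first" description in the text.
selected <- top_terms(go, n = 3, decreasing = TRUE)
print(selected)
```

The resulting table has one row per term and can be passed to a bar graph with the term on the x-axis and the P value on the y-axis.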

After the gene ontology graphs were completed following these steps, the researcher analyzed

the trends of the graphs to provide bioinformatics analysis of the data.

The researcher carried out the following steps to make the reproduction of figure 2a:

1. The researcher took the lists of upregulated and downregulated genes specific to each

embryonic stage and extracted the start and end locations of the genes from the CuffDiff

output. The genes and their respective start and end locations were saved as an Excel file.

2. The researcher input this Excel file into an R script, which generated a matrix from this
data, normalized the matrix, and then ran k-means clustering on the matrix.

3. The researcher created a heat map R program to visualize the final matrix.

Below is the programming logic for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_tile', etc. to define the heat map to be created.
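
Steps 2 and 3 can be sketched as follows. The text does not specify the contents of the matrix, the normalization, or the number of clusters, so this sketch assumes a genes-by-stages matrix of random values, row-wise z-score normalization, and three clusters. All names are hypothetical.

```r
library(ggplot2)

set.seed(1)  # make the random toy data and clustering reproducible

# Toy matrix: 30 genes (rows) by 5 stages (columns), random values.
stages <- c("2cell", "4cell", "8cell", "morula", "ICM")
mat <- matrix(rnorm(30 * 5), nrow = 30,
              dimnames = list(paste0("gene", 1:30), stages))

# Normalize each row to mean 0 and standard deviation 1 (z-score).
z <- t(scale(t(mat)))

# Group genes with similar profiles using k-means (3 clusters assumed).
km <- kmeans(z, centers = 3, nstart = 10)

# Order rows by cluster so that each cluster forms a block in the heat map.
ord <- order(km$cluster)

# Reshape to long format: one row per (gene, stage) cell.
long <- data.frame(
  gene  = factor(rep(rownames(z)[ord], times = ncol(z)),
                 levels = rownames(z)[ord]),
  stage = factor(rep(colnames(z), each = nrow(z)), levels = colnames(z)),
  value = as.vector(z[ord, ])
)

# geom_tile draws one colored tile per cell of the matrix.
p <- ggplot(long, aes(x = stage, y = gene, fill = value)) +
  geom_tile() +
  labs(x = "Stage", y = "Gene", fill = "z-score")

print(p)
```

Ordering the rows by cluster before plotting is a design choice made here so that the clusters are visible as bands. Without it, the tiles would appear in the original gene order.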

After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data.

After completing the reproductions and their analysis, the researcher tested the bioinformatics
analysis results and data visualization plots before submitting them to the mentor. The
researcher used the following format for testing prototypes:

Table 1

Data Visualization Testing

Test case #    Figure being reproduced    Expected figure    Reproduced figure    Degree of similarity (1-5)

1
2
3

Test case #: Unique number to identify the test case.

Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the

researcher attempted a reproduction of.

Expected figure: A picture of the original figure from the paper by Wang et al. (2018).

Reproduced figure: The figure the researcher generated as a result of following the method to

reproduce the corresponding figure.

Degree of similarity: The visual similarity between the expected figure and the reproduced figure

on a scale of 1-5. The judgement for numerical ratings is explained in the previous section.

Table 2

Bioinformatics Analysis Testing

Test case #    Associated figure    Analysis of the associated figure    Analysis of the reproduced figure    Conclusion

1
2
3

Test case #: Unique number to identify the test case.

Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher

attempted a reproduction of.

Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated

figure.

Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced

figure.

Conclusion: Whether the analysis of the associated figure is a satisfactory match to the
analysis of the reproduced figure. The criteria are provided in the previous section.

After completion of the above testing, the researcher provided the following products to the
mentor for review:

● Data visualization graphs for bioinformatics analysis

● Bioinformatics analysis data results

The mentor reviewed the results provided by the researcher, compared them with the results of
the original research paper, and confirmed that the results of the research were produced
properly.
