The researcher reviewed the requirements provided by the mentor for reproducing the
bioinformatics analysis and the data visualization (i.e., graphs) in the research paper by
Wang et al. (2018). The researcher reproduced figures 1e and 2a and then analyzed the trends
they showed. The researcher also created a gene ontology table, a figure similar to figure 2c
from the paper by Wang et al. (2018). This figure was similar only in concept, because the
researcher attempted the reproduction with RNA-seq data, which led to different gene ontology
terms and y-values. The figures the researcher attempted to reproduce are shown below.
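Figure 1e. Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian
embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 621. Copyright 2018 by
Macmillan Publishers Limited, part of Springer Nature.
Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming of
H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et al. 2018,
Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of
Springer Nature.
Figure 2a. Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian
embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by
Macmillan Publishers Limited, part of Springer Nature.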
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Wang et al. (2018):
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages
(before and after) were used for comparison, with the 2 cell stage as an example stage to
Before reproducing the graphs, the researcher carried out the following steps to prepare the raw
data:
1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human
Genetic Computer Cluster), a network of computers for genome projects, by logging into
2. The researcher accessed a script file to set up and execute the TopHat algorithm (an
algorithm that maps RNA-seq reads to the corresponding parts of a genome) to process
the mouse gene data through HGCC, mapping and aligning the mouse genes to the
mouse genome (a set of chromosomes). The output of this algorithm (.bam and .bed files)
3. The researcher accessed a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compared differences in gene expression) on the mouse gene data from the
between two embryonic stages, which was used as input when further processing data for
visualization.
Since the goal of this research was to reproduce the graphs of the research paper using the
RNA-seq data, statistical analysis of the data did not apply to this research and was not part of
the scope. The results of this processing were in the form of data tables which one can
understand better through data visualization (i.e. plotting graphs). The researcher carried out the
1. The researcher imported the data from the '.diff' files generated by the Cuffdiff comparisons
to other embryonic cell stages by reading the files into an R program.
Below is the programming logic the researcher used for this step:
Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
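As an illustration only, the import step might look like the sketch below. It is written in Python rather than the researcher's R, it is not the researcher's script, and the sample rows are made up. The column names are a subset of the tab-separated header that Cuffdiff writes to its gene expression '.diff' output; the function name `read_diff` is hypothetical.

```python
import csv
import io

def read_diff(handle):
    """Read a tab-separated Cuffdiff '.diff' table into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

# In-memory stand-in for a '.diff' file; the values are invented for illustration.
sample = io.StringIO(
    "gene\tlocus\tsample_1\tsample_2\tlog2(fold_change)\tq_value\tsignificant\n"
    "GeneA\tchr1:100-900\t2cell\t4cell\t2.5\t0.001\tyes\n"
    "GeneB\tchr2:50-400\t2cell\t4cell\t-1.8\t0.004\tyes\n"
)
rows = read_diff(sample)
print(len(rows), rows[0]["gene"])
```

For a file on disk, the same function can be called as `read_diff(open(path, newline=""))`, where `path` points to the '.diff' file.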
2. The researcher wrote R programs to filter the imported data for selective processing
Below is the programming logic the researcher used for this step:
Use the subset function in R programming to filter out the data based on a filter applied
processing.
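A filtering step of this kind could be sketched as follows, in Python rather than R. The filter itself is an assumption: the sketch keeps only rows that Cuffdiff flags as significant and splits them by the sign of the log2 fold change, which may differ from the criterion the researcher applied. The function name `split_by_direction` and the gene names are hypothetical.

```python
def split_by_direction(rows):
    """Return (up, down) sets of gene names among rows flagged significant."""
    up, down = set(), set()
    for row in rows:
        if row["significant"] != "yes":
            continue  # assumed filter: drop non-significant comparisons
        change = float(row["log2(fold_change)"])
        if change > 0:
            up.add(row["gene"])    # higher in the second sample
        elif change < 0:
            down.add(row["gene"])  # lower in the second sample
    return up, down

# Invented rows in the shape produced by reading a '.diff' file.
diff_rows = [
    {"gene": "GeneA", "log2(fold_change)": "2.5", "significant": "yes"},
    {"gene": "GeneB", "log2(fold_change)": "-1.8", "significant": "yes"},
    {"gene": "GeneC", "log2(fold_change)": "3.1", "significant": "no"},
]
up_genes, down_genes = split_by_direction(diff_rows)
print(sorted(up_genes), sorted(down_genes))
```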
3. The researcher extracted a list and the number of genes specific to each embryonic
development stage.
Below is the programming logic the researcher used for this step:
Create lists of genes by intersecting the Cuffdiff comparison '.diff' files for the previous
and the next stage. For upregulated genes specific to a stage, intersect the genes upregulated
in the Cuffdiff comparison to the previous stage with the genes downregulated in the next
stage. For downregulated genes specific to a stage, intersect the genes downregulated in the
Cuffdiff comparison to the previous stage with the genes upregulated in the next stage.
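The intersection logic above can be sketched with set operations. This is a Python illustration rather than the researcher's R code; the function name `stage_specific`, the choice of the 4 cell stage as the example, and the gene names are all hypothetical.

```python
def stage_specific(up_vs_prev, down_vs_prev, up_in_next, down_in_next):
    """Intersect the two neighbouring comparisons to get stage-specific genes."""
    # Up relative to the previous stage AND down again in the next stage.
    specific_up = up_vs_prev & down_in_next
    # Down relative to the previous stage AND up again in the next stage.
    specific_down = down_vs_prev & up_in_next
    return specific_up, specific_down

# Example for the 4 cell stage with invented gene names.
up_2cell_to_4cell = {"GeneA", "GeneB", "GeneC"}
down_2cell_to_4cell = {"GeneD", "GeneE"}
up_4cell_to_8cell = {"GeneD", "GeneF"}
down_4cell_to_8cell = {"GeneA", "GeneB", "GeneG"}

specific_up, specific_down = stage_specific(
    up_2cell_to_4cell, down_2cell_to_4cell,
    up_4cell_to_8cell, down_4cell_to_8cell,
)
# The list and the number of genes specific to the stage.
print(sorted(specific_up), len(specific_up))
print(sorted(specific_down), len(specific_down))
```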
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data.
The researcher carried out the following steps for producing gene ontology tables for
and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'
upregulated and downregulated genes and two lists for combined ICM and morula stages'
upregulated and downregulated genes. The Gene Ontology Consortium analyzed these
2. The researcher then made Excel tables out of the gene ontology terms with the highest P
values. Terms were chosen first for the highest P values and then for uniqueness, so that the
tables covered a variety of gene ontology terms. The x-axis was the gene ontology terms, and
the y-axis was the P value of the gene ontology term which resulted from the analysis.
3. The researcher wrote a bar graph R program to plot each of the Excel tables.
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
After the gene ontology graphs were completed following these steps, the researcher analyzed
The researcher carried out the following steps to make the reproduction of figure 2a:
1. The researcher took the lists of upregulated and downregulated genes specific to each
embryonic stage and extracted the start and end locations of the genes from the CuffDiff
output. The genes and their respective start and end locations were saved as an Excel file.
2. The researcher input this Excel file into an R script which generated a matrix from this
data, normalized the matrix, and then ran a k-means clustering code on the matrix.
3. The researcher created a heat map R program to visualize the final matrix.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
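The normalize-then-cluster step that feeds the heat map can be sketched as below. This is a Python illustration under stated assumptions, not the researcher's R script: the matrix layout (one row per gene, one column per embryonic stage), the row-wise z-score normalization, the seeding of centroids from the first k rows, and the function names `zscore_rows` and `kmeans` are all assumptions, and the values are invented.

```python
import math

def zscore_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (constant rows become zeros)."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        scaled.append([(x - mean) / sd if sd else 0.0 for x in row])
    return scaled

def kmeans(rows, k, iterations=100):
    """Plain Lloyd's algorithm; the first k rows seed the centroids for repeatability."""
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every row to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its assigned rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented matrix: rows are genes, columns are embryonic stages.
matrix = [
    [1, 2, 9],
    [9, 2, 1],
    [2, 3, 10],
    [8, 1, 0],
]
cluster_labels = kmeans(zscore_rows(matrix), k=2)
print(cluster_labels)
```

Sorting the rows by cluster label before plotting groups genes with similar patterns together in the heat map.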
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher carried out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher used the following
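Table 1
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity
1 | | | |
2 | | | |
3 | | | |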
Test case #: A number identifying each analysis scenario.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
Expected figure: A picture of the original figure from the paper by Wang et al. (2018)
Reproduced figure: The figure the researcher generated as a result of following the method to
Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. The judgement for numerical ratings is explained in the previous section.
Table 2
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. The criteria are provided in the previous section.
After completion of the above testing, the researcher will provide the following products to the
The mentor reviewed the results provided by the researcher, compared them with the results of
the original research paper, and confirmed that the results of the research were produced
properly.