Data Collection Process

Data Collection, Processing, and Analysis
The researcher reviewed the requirements provided by the mentor for reproducing the
bioinformatics analysis and the data visualization (i.e., graphs) referenced in the research
paper by Wang et al. (2018). The researcher reproduced figures 1e and 2a and then analyzed the
trends they showed. The researcher also created a gene ontology table and a corresponding
figure similar to figure 2c from Wang et al. (2018). This figure was similar only in concept,
because the researcher worked from RNA-seq data, which led to different gene ontology terms and
y-values. The figures the researcher attempted to reproduce are shown below:
Figure 1e:
Figure 1e. ChIP-seq graph: Established (red) and
disappeared (blue) H3K9me3 marks during embryonic development. Reprinted from
"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo
development," by Wang et al. 2018, Nature Cell Biology, 20, p. 621. Copyright 2018 by
Macmillan Publishers Limited, part of Springer Nature.

Figure 2c:
Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming
of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et
al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part
of Springer Nature.
Figure 2a:
Figure 2a. A heatmap showing H3K9me3 domains during mouse
embryo development. Reprinted from "Reprogramming of H3K9me3-dependent
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Wang et al. (2018):
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Marks and genes specific to the circled stage were found by comparing that stage with the
stages immediately before and after it. The 2 cell stage is used as the example stage to show
which stages are compared.
Before reproducing the graphs, the researcher carried out the following steps to prepare the raw
data for further processing for visualization:
1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human
Genetic Computer Cluster), a network of computers for genome projects, by logging into
their HGCC account.
2. The researcher accessed a script file to set up and execute the TopHat algorithm (an
algorithm that maps sequencing reads to the corresponding parts of a genome) through HGCC, in
order to align the mouse gene data to the mouse genome (a set of chromosomes). The output of
this algorithm (.bam and .bed files) was used as input in the next step.
3. The researcher accessed a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compares differences in gene expression) on the mouse gene data from the
TopHat output and used it to compare modifications of H3K9me3-dependent heterochromatin at
various stages of embryo development. The output of the Cuffdiff algorithm is a '.diff' file
generated from the comparison of gene expression between two embryonic stages, which was used
as input when further processing the data for visualization.
Since the goal of this research was to reproduce the graphs of the research paper using
RNA-seq data, statistical analysis of the data was outside the scope of this research. The
results of this processing were data tables, which are easier to understand through data
visualization (i.e., plotting graphs). The researcher carried out the following steps to
reproduce figure 1e:
1. The researcher imported the data from the '.diff' files generated by the Cuffdiff
comparisons to other embryonic cell stages into an R program by reading the files into R.
Below is the programming logic the researcher used for this step:
Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
2. The researcher wrote R programs to filter the imported data for selective processing
based on predefined criteria for status and significance of genes.
Below is the programming logic the researcher used for this step:
Use the subset function in R programming to filter the data on particular columns (status,
significant, etc.) and remove undesirable data from processing.
3. The researcher extracted a list and the number of genes specific to each embryonic
development stage.
Create lists of genes by intersecting the Cuffdiff comparison '.diff' files for the previous
and the next stage. For upregulated genes specific to a stage, intersect the upregulated genes
from the Cuffdiff comparison to the previous stage with the downregulated genes in the next
stage. For downregulated genes specific to a stage, intersect the downregulated genes
from the Cuffdiff comparison to the previous stage with the upregulated genes in the
next stage.
4. The researcher wrote a bar graph R program to reproduce figure 1e.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
as 'geom_bar', etc. to define the bar graph to be created.
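The import, filtering, and intersection steps above can be sketched in R roughly as follows. This is a minimal illustration and not the researcher's actual script: the gene names and values are invented, the inline text stands in for real '.diff' files (which would be read from a file path with read.delim), and the column names follow the standard Cuffdiff 'gene_exp.diff' layout. The sketch also assumes that each comparison lists the earlier stage as the first sample, so a positive log2 fold change means higher expression in the later stage.

```r
# Toy stand-ins for two Cuffdiff '.diff' tables (hypothetical genes and values).
# Real files would be read with read.delim("<path to .diff file>").
prev_vs_stage_txt <- paste(
  "gene\tstatus\tlog2(fold_change)\tsignificant",
  "GeneA\tOK\t2.1\tyes",
  "GeneB\tOK\t-1.8\tyes",
  "GeneC\tNOTEST\t0\tno",
  "GeneD\tOK\t3.0\tyes",
  sep = "\n"
)
stage_vs_next_txt <- paste(
  "gene\tstatus\tlog2(fold_change)\tsignificant",
  "GeneA\tOK\t-2.5\tyes",
  "GeneB\tOK\t1.9\tyes",
  "GeneC\tOK\t0.4\tno",
  "GeneD\tOK\t1.2\tyes",
  sep = "\n"
)

# Step 1: import a tab-delimited '.diff' table into a data frame.
read_diff <- function(txt) {
  read.delim(text = txt, stringsAsFactors = FALSE)
}

# Step 2: keep only genes that were successfully tested and called significant.
filter_diff <- function(diff) {
  subset(diff, status == "OK" & significant == "yes")
}

prev_vs_stage <- filter_diff(read_diff(prev_vs_stage_txt))
stage_vs_next <- filter_diff(read_diff(stage_vs_next_txt))

# read.delim() renames the 'log2(fold_change)' column to 'log2.fold_change.'.
up_from_prev   <- prev_vs_stage$gene[prev_vs_stage$log2.fold_change. > 0]
down_from_prev <- prev_vs_stage$gene[prev_vs_stage$log2.fold_change. < 0]
up_in_next     <- stage_vs_next$gene[stage_vs_next$log2.fold_change. > 0]
down_in_next   <- stage_vs_next$gene[stage_vs_next$log2.fold_change. < 0]

# Step 3: stage-specific genes rise into the stage and fall after it, or the reverse.
stage_up   <- intersect(up_from_prev, down_in_next)
stage_down <- intersect(down_from_prev, up_in_next)

# Gene counts per direction: the kind of table a bar graph would be drawn from.
counts <- data.frame(
  direction = c("upregulated", "downregulated"),
  n_genes   = c(length(stage_up), length(stage_down))
)
print(counts)
```

With the toy data, GeneD is excluded from both lists because it rises in both comparisons, which is the behavior the intersection is meant to enforce.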
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data.
The researcher carried out the following steps for producing gene ontology tables for
genes specific to stages of embryonic development for a reproduction of figure 2c:
1. The researcher went to the Gene Ontology Consortium website (www.geneontology.org)
and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'
upregulated and downregulated genes and two lists for combined ICM and morula stages'
upregulated and downregulated genes. The Gene Ontology Consortium analyzed these
lists for biological processes.
2. The researcher then made Excel tables of the gene ontology terms with the highest P
values. Terms were chosen first for the highest P values and then for uniqueness, so that the
tables covered a variety of gene ontology terms. In the resulting graphs, the x-axis was the
gene ontology term and the y-axis was the P value of that term from the analysis.
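3. The researcher wrote a bar graph R program to plot each of the Excel tables.
Below is the programming logic for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R programming to
plot the data from the Excel tables. Use geometric functions such as 'geom_bar', etc. to
define the bar graph to be created.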
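The bar graph step can be sketched with ggplot2 as shown here. The gene ontology terms and P values are invented placeholders for one Excel table, and the data frame is built inline so the example is self-contained; in practice the table could be loaded from the Excel file, for example with the readxl package, which is an assumption and not something the method states.

```r
library(ggplot2)

# Hypothetical gene ontology terms and P values, standing in for one Excel table.
go_table <- data.frame(
  term    = c("cell cycle", "RNA processing", "translation"),
  p_value = c(0.004, 0.012, 0.031)
)

# Keep the bars in table order instead of ggplot2's default alphabetical order.
go_table$term <- factor(go_table$term, levels = go_table$term)

# x-axis: gene ontology term; y-axis: P value, as described in the steps above.
go_plot <- ggplot(go_table, aes(x = term, y = p_value)) +
  geom_bar(stat = "identity") +
  labs(x = "Gene ontology term", y = "P value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(go_plot)
```

Using stat = "identity" makes geom_bar draw each bar at the value already in the table, since the default behavior is to count rows.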
After the gene ontology graphs were completed following these steps, the researcher analyzed
the trends of the graphs to provide bioinformatics analysis of the data.
The researcher carried out the following steps to make the reproduction of figure 2a:
1. The researcher took the lists of upregulated and downregulated genes specific to each
embryonic stage and extracted the start and end locations of the genes from the Cuffdiff
output. The genes and their respective start and end locations were saved as an Excel file.
2. The researcher input this Excel file into an R script that generated a matrix from the
data, normalized the matrix, and then ran k-means clustering on the matrix.
3. The researcher created a heat map R program to visualize the final matrix.
Below is the programming logic for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R programming to
plot the final matrix. Use geometric functions such as 'geom_tile', etc. to define the heat
map to be created.
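The normalization, clustering, and heat map steps can be sketched as follows. The method does not say what the matrix contained or how it was normalized, so everything specific here is an assumption made for illustration: one row per gene, one column per stage, invented values, row-wise z-score normalization, and two clusters.

```r
library(ggplot2)

set.seed(1)  # k-means starts from random centers; fix the seed for repeatability

# Hypothetical matrix: 6 genes (rows) by 4 embryonic stages (columns).
stages <- c("2cell", "4cell", "8cell", "morula")
mat <- matrix(
  c(9, 8, 1, 1,
    8, 9, 2, 1,
    9, 9, 1, 2,
    1, 2, 9, 8,
    2, 1, 8, 9,
    1, 1, 9, 9),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:6), stages)
)

# Normalize each gene (row) to mean 0 and standard deviation 1.
norm_mat <- t(scale(t(mat)))

# Cluster the genes into two groups and order the rows by cluster.
km <- kmeans(norm_mat, centers = 2, nstart = 10)
gene_order <- rownames(norm_mat)[order(km$cluster)]

# Reshape to long format (one row per gene/stage cell) for geom_tile.
long <- data.frame(
  gene  = factor(rep(rownames(norm_mat), times = ncol(norm_mat)), levels = gene_order),
  stage = factor(rep(colnames(norm_mat), each = nrow(norm_mat)), levels = stages),
  value = as.vector(norm_mat)
)

heat <- ggplot(long, aes(x = stage, y = gene, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  labs(x = "Embryonic stage", y = "Gene", fill = "Z-score")

print(heat)
```

Ordering the gene factor by cluster places genes from the same cluster in adjacent rows, which is what makes the block structure visible in the heat map.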
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher carried out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher used the following
format for testing prototypes:
Table 1
Data Visualization Testing
Test case #    Figure being reproduced    Expected figure    Reproduced figure    Degree of similarity (1-5)
1
2
3
Test case #: A unique number identifying each test case.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
researcher attempted to reproduce.
Expected figure: A picture of the original figure from the paper by Wang et al. (2018).
Reproduced figure: The figure the researcher generated as a result of following the method to
reproduce the corresponding figure.
Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. The judgment for numerical ratings is explained in the previous section.
Table 2
Bioinformatics Analysis Testing
Test case #    Associated figure    Analysis of the associated figure    Analysis of the reproduced figure    Conclusion
1
2
3
Test case #: Unique number to identify the test case.
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
attempted a reproduction of.
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. The criteria are provided in the previous section.
After completion of the above testing, the researcher provided the following products to the
mentor for review:
● Data visualization graphs for bioinformatics analysis
● Bioinformatics analysis data results
The mentor reviewed the results provided by the researcher, compared them with the original
research paper results, and confirmed that the results of the research were produced properly.
