Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 168 (2020) 97–104
www.elsevier.com/locate/procedia

Complex Adaptive Systems Conference with Theme:
Leveraging AI and Machine Learning for Societal Challenges, CAS 2019

K-means Clustering and Principal Components Analysis of Microarray Data of L1000 Landmark Genes

Carly L. Clayman a,*, Satish M. Srinivasan a, Raghvinder S. Sangwan a

a School of Professional Studies, Pennsylvania State University, Engineering Department, 30 Swedesford Rd, Malvern, Pennsylvania, 19355, USA

Abstract

Dimensionality reduction methods such as principal component analysis (PCA) are used to select relevant features, and k-means clustering performs well when applied to data with low effective dimensionality. This study integrated PCA and k-means clustering using the L1000 dataset, containing gene microarray data from 978 landmark genes, which have been previously shown to predict expression of ~81% of the remaining 21,290 target genes with low error. Groups within the L1000 dataset were characterized using both microarray and clinical metadata to assess whether 978 landmark genes would improve clustering results, compared to a random set of 978 genes. The role of clinical variables, including morphological diagnosis, was assessed across k-means clustering groups within homogeneous tissue samples in the L1000 dataset. Results show that the 978 landmark genes better differentiated k-means clusters, relative to 978 randomly selected non-landmark genes. K-means clusters generated from the landmark genes showed more separation of cluster groups when plotted against the first two principal components, which capture a greater proportion of variation for the 978 landmark genes. These results suggest that the 978 landmark genes better represent the overall genetic profile of these heterogeneous samples. Future studies will implement predictive analytics techniques to further investigate the interaction of microarray data and clinical variables such as cancer stage.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Complex Adaptive Systems Conference with Theme: Leveraging AI and Machine Learning for Societal Challenges
Keywords: Landmark Genes; L1000; Microarray; K-Means Clustering; Principal Components Analysis; Dimensionality Reduction

Corresponding Author Email: cuc1134@psu.edu

1877-0509 © 2020 The Authors. Published by Elsevier B.V.


10.1016/j.procs.2020.02.265
98 Carly L. Clayman et al. / Procedia Computer Science 168 (2020) 97–104

Introduction

Computational costs of biological data analysis call for increasingly efficient methods of determining which genetic and clinical
factors are most relevant for understanding the overall genetic and clinical profiles of human patients [1]. This challenge is especially
difficult given that distinct individuals may possess distinct profiles of genetic expression, and genetic conditions may be more
optimally represented using a varying number of genetic features. Dimensionality reduction methods, such as random forest analysis,
k-means clustering, and principal components analysis are used to capture essential elements and explain larger datasets using a
subset of relevant features. Previous studies have demonstrated the utility of k-means clustering as it relates to both protein expression
and cancer outcomes [2]. Principal component analysis (PCA) is an effective method of data representation when linear relationships
are present [3]. PCA can detect multiple cancer types while also selecting relevant features [3].
The dataset analyzed in the present study was derived from the L1000 dataset previously analyzed by Chen and colleagues [4]. This dataset contains 978 landmark genes, which characterize the remaining genes in the dataset by predicting expression of ~81% of the target genes with low error [4]. This study aimed to build upon the analysis performed by Chen and colleagues by establishing a methodology for characterizing subgroups within the L1000 dataset using both genetic and clinical data. The goal was to assess to what extent the 978 landmark genes would offer improved clustering results, compared to a random set of 978 genes selected from the L1000 dataset.

Dataset

Dataset Description

The dataset, previously analyzed by Chen et al. [4], consists of 978 landmark genes and 1,300 non-landmark genes within a set of 22,678 total genes. These genes were measured across 129,157 observations (samples) in the dataset. The observations were associated with metadata containing clinical variables such as morphological diagnosis and Ann Arbor stage. This clinical metadata was extracted using methodology documented previously in the literature [5,6]. Subsets of genes in the dataset (the 978 landmark genes, or 3 separate sets of 978 non-landmark genes randomly selected from the 22,678 total genes) were selected for analysis of a heterogeneous and a homogeneous dataset. The heterogeneous dataset consisted of various tissue types with highly variable clinical characteristics, while the homogeneous dataset consisted of samples derived from lymphoma / leukemia tissue.
The previously determined 978 landmark genes were well suited for clustering analysis: they provide 978 numeric features for clustering, along with 1 categorical variable (BSM vs. GSM id) with which to assess clustering effectiveness and the distribution of clinical attributes among clusters. Three random samples of 978 non-landmark genes were also used for clustering analysis. The clinical variable Ann Arbor cancer stage was assessed for clustering effectiveness within one subset of cancer tissue-derived samples, for which metadata was consistent across samples with varying lymphoma / leukemia morphological diagnoses. This homogeneous subset was chosen because it had sufficient and consistent metadata for clustering analysis, including various levels of Ann Arbor stage, so that clustering effectiveness could be visualized within levels of each of these categorical factors.
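The random baseline described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sample_gene_sets`, the toy gene ids, and the fixed seed are assumptions introduced here, and the text does not state whether the three random sets were allowed to share genes (this sketch draws them independently, so they may).

```python
import random

def sample_gene_sets(all_genes, landmark_genes, set_size, n_sets, seed=0):
    """Draw n_sets random subsets of non-landmark genes, each of size set_size."""
    landmark = set(landmark_genes)
    # Candidate pool: every gene that is not a landmark gene (input order kept).
    pool = [gene for gene in all_genes if gene not in landmark]
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    # Each subset is drawn independently, so two subsets may share genes.
    return [rng.sample(pool, set_size) for _ in range(n_sets)]

# Toy example: 20 hypothetical gene ids, the first 5 treated as "landmark" genes.
genes = [f"gene_{i}" for i in range(20)]
landmarks = genes[:5]
random_sets = sample_gene_sets(genes, landmarks, set_size=5, n_sets=3)
```

In the setting of the paper, `set_size` would be 978 and `n_sets` would be 3, so that each random feature set matches the landmark set in size.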

Methods

K-Means Clustering and Principal Components Analysis

K-means clustering was chosen because it has been shown to effectively differentiate groups within clinical datasets, including cancer data. K-means clustering computes the distance between samples and forms clusters by representing a gene as a vector of expression values, i.e. gene G_x = <e_1, e_2, e_3, …, e_n>, where n is the number of tissue samples and e_y is the expression value of gene G_x in tissue y. Because these distances are computed over every feature, the k-means clustering algorithm is computationally intensive. Dimensionality reduction by selecting the 978 landmark genes is used to specify relevant features so that k-means clustering can be applied efficiently to a smaller subset of data.
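The clustering step can be illustrated with a small, self-contained sketch of Lloyd's algorithm, the standard iterative procedure behind k-means. The paper does not say which implementation or settings were used, so everything here is an assumption made for illustration: the function names, the toy data, and the choice of 10 random restarts.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans_once(rows, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen starting rows."""
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments are stable, so the run converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    within_ss = sum(
        squared_distance(row, centroids[lab]) for row, lab in zip(rows, labels)
    )
    return labels, centroids, within_ss

def kmeans(rows, k, n_starts=10, seed=0):
    """Run several random starts; keep the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(rows, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Toy data: two well-separated groups of three samples with two features each.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids, within_ss = kmeans(samples, k=2)
```

Every assignment step compares each sample with each centroid across all features, which is why restricting the feature set to 978 genes reduces the cost of each iteration.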
K-means clustering was performed separately for each of the 4 models described below. Following this, total within- and between-
cluster sum of squares was computed. Total within-cluster sum of squares represents variance within clusters such that a low value
indicates high similarity within clusters. Between-cluster sum of squares represents variance between clusters such that a high value
indicates low similarity between clusters. Consequently, a relatively higher ratio of between- to total within-cluster sum of squares
indicates improved clustering effectiveness as a greater proportion of total variance in the dataset is explained by clustering groups.
K-means clustering results were plotted against the first two principal components.
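The sum-of-squares quantities described above can be computed directly from a set of cluster labels. The sketch below is illustrative only; the function names and toy data are assumptions. It reports both the between-to-within ratio and the between-to-total ratio, since the text does not fully specify which of the two is meant by the reported percentages.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cluster_sums_of_squares(rows, labels):
    """Return (within_ss, between_ss, total_ss) for a given partition."""
    grand_mean = mean_vector(rows)
    # Total sum of squares: spread of all samples around the grand mean.
    total_ss = sum(squared_distance(row, grand_mean) for row in rows)
    # Within-cluster sum of squares: spread of samples around their own centroid.
    within_ss = 0.0
    for cluster in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        centroid = mean_vector(members)
        within_ss += sum(squared_distance(row, centroid) for row in members)
    # Between-cluster sum of squares is the remainder: total = within + between.
    between_ss = total_ss - within_ss
    return within_ss, between_ss, total_ss

# Toy data: two tight clusters that are far apart along the first feature.
rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
within_ss, between_ss, total_ss = cluster_sums_of_squares(rows, labels)
between_to_within = between_ss / within_ss
between_to_total = between_ss / total_ss
```

For this toy partition the within-cluster sum of squares is 4 and the between-cluster sum of squares is 100, so nearly all of the total variation (100 of 104) is explained by cluster membership.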
Clustering results for each model were compared with respect to within- and between-cluster sum of squares and the ratio of between- to within-cluster sum of squares. Two direct methods were used to define the optimal number of clusters: 1) the silhouette width method and 2) the elbow method, based on within-cluster sum of squares. Lastly, k-means clustering results based on non-normalized and normalized data were plotted against the first 2 principal components and visually inspected for clustering effectiveness. Clustering effectiveness was also assessed based on concordance of clustered groups with the clinical variable of Ann Arbor stage, and based on accuracy, macro precision, macro recall, and macro F1 for classification with GSM or BSM id.
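Accuracy and the macro-averaged scores can be sketched as below. One step is an assumption made here: the text does not describe how unlabeled clusters were matched to clinical classes, so this sketch assigns each cluster the most common true label among its members (a majority vote). The function names and the toy labels are likewise hypothetical.

```python
from collections import Counter

def map_clusters_to_labels(clusters, truth):
    """Predict, for each sample, the majority true label of its cluster (assumed rule)."""
    mapping = {}
    for cluster in set(clusters):
        members = [t for c, t in zip(clusters, truth) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return [mapping[c] for c in clusters]

def macro_scores(predicted, truth):
    """Accuracy plus unweighted (macro) averages of per-class precision, recall, F1."""
    classes = sorted(set(truth))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(p == cls and t == cls for p, t in zip(predicted, truth))
        fp = sum(p == cls and t != cls for p, t in zip(predicted, truth))
        fn = sum(p != cls and t == cls for p, t in zip(predicted, truth))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(p == t for p, t in zip(predicted, truth)) / len(truth),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }

# Toy example: two clusters scored against a binary stage label.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["high", "high", "low", "low", "low", "high"]
predicted = map_clusters_to_labels(clusters, truth)
scores = macro_scores(predicted, truth)
```

Macro averaging weights every class equally regardless of its size, so it can diverge from accuracy when class sizes are unbalanced.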

Experimental Design

Four models were constructed to assess how well k-means clustering, visualized across the first two principal components, differentiates groups within the data, and to better characterize how these groups vary with clinical variables. The models are listed in Table 1.

Table 1. Experiments.
Model  Dataset        Features / predictors                    Response variable
1      Homogeneous    978 landmark genes                       Categorical clinical variables (GSM id or BSM id, Ann Arbor cancer stage, and sex)
2      Homogeneous    Random sample of 978 non-landmark genes  Categorical clinical variables (GSM id or BSM id, Ann Arbor cancer stage, and sex)
3      Heterogeneous  978 landmark genes                       Categorical clinical variables (GSM id or BSM id)
4      Heterogeneous  Random sample of 978 non-landmark genes  Categorical clinical variables (GSM id or BSM id)

Results

K-means clustering & PCA for homogeneous samples using 978 landmark vs. non-landmark genes as features

An analysis of a homogeneous sample of lymphoma / leukemia tissue samples was performed using the 978 landmark genes (Model 1) vs. 978 randomly selected non-landmark genes (Model 2) as features. While both landmark and non-landmark genes displayed 2 optimal clusters by the silhouette width and gap statistic methods, an optimal number of 4 clusters was indicated by within-cluster sum of squares (elbow method) (Figure 1).
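The silhouette width used above can be computed for any partition from pairwise distances. The sketch below is a plain implementation of the standard definition, with hypothetical function names and toy data; to choose the number of clusters, one would compute it for the k-means partition at each candidate k and keep the k with the largest mean value.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(rows, labels):
    """Mean silhouette width of a partition with at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, (row, lab) in enumerate(zip(rows, labels)):
        # a: mean distance to the other members of the sample's own cluster.
        own = [euclidean(row, other)
               for j, (other, other_lab) in enumerate(zip(rows, labels))
               if other_lab == lab and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(row, other)
                for other, other_lab in zip(rows, labels) if other_lab == c)
            / labels.count(c)
            for c in clusters if c != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: two groups separated along the first feature.
rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = mean_silhouette(rows, [0, 0, 1, 1])  # partition matching the two groups
bad = mean_silhouette(rows, [0, 1, 0, 1])   # partition cutting across the groups
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate samples that sit closer to another cluster than to their own. Because the silhouette rewards separation and the elbow method tracks diminishing returns in within-cluster sum of squares, the two criteria need not agree on k.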


Fig. 1. Between-cluster sum of squares (first column), within-cluster sum of squares (second column), and ratio of between- to within-cluster sum of squares (third
column) are plotted against each value of k (2 through 10) for k-means clusters. Optimal number of clusters is indicated by the vertical line at a given level of k for
each metric. Metrics for defining optimal clusters were computed with the elbow (first column), silhouette width (second column), and gap statistic (third column).
Results are shown for A) 978 landmark genes and B) 978 randomly selected non-landmark genes as features, which display 4 optimal clusters for the elbow method
and 2 optimal clusters for silhouette and gap statistic methods.

Minimal overlap between samples is indicated when 2 clusters are plotted against the first 2 principal components, consistent with an optimal number of 2 clusters based on the silhouette and gap statistic methods (Figure 2). Clusters generated from the 978 landmark genes show less overlap, a greater between-cluster sum of squares, and more variability captured by the first 2 principal components (PCA1: 12.7%; PCA2: 5.4%), compared to results generated from 978 non-landmark genes as features (PCA1: 8.1%; PCA2: 6.5%). Using non-landmark genes as features, however, results in improved clustering consistency with the clinical variable of Ann Arbor stage, corresponding with relatively higher accuracy, macro precision, macro recall, and macro F1 scores when samples are clustered into high Ann Arbor stage scores (III or IV) vs. non-high scores (I, II, NA) (Figure 3).
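The percentages quoted for PCA1 and PCA2 are the shares of total variance captured by each principal component. The sketch below shows how such shares can be computed on a toy matrix. It uses power iteration with deflation as a dependency-free stand-in for a library PCA routine; the paper does not say which implementation was used, and the function names and data are assumptions.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns (features) of rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(matrix, n_iter=200):
    """Largest eigenvalue and its eigenvector by power iteration."""
    p = len(matrix)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-symmetric start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: no variance left to capture
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

def explained_variance_ratios(rows, n_components=2):
    """Share of total variance captured by each of the leading components."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    ratios = []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        ratios.append(value / total_variance)
        # Deflation: remove the component just found before finding the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    return ratios

# Toy data: most variance lies along feature 1, some along feature 2, none along 3.
rows = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
ratios = explained_variance_ratios(rows)
```

With 978 features, the first two components capturing under 25% of the variance means the two-dimensional cluster plots show only part of the structure that k-means operates on.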


Fig. 2. K-means cluster results are depicted as blocks of color. These cluster groups are plotted against the first two principal components (x-axis and y-axis) of the respective gene expression data used to generate the k-means clusters. Results are shown for A) 978 landmark genes and B) 978 randomly selected non-landmark genes used as features. More separation of clusters is displayed for the landmark genes compared to the non-landmark genes as features, especially as the number of cluster groups increases.


Fig. 3. Results of k-means clustering are depicted in stacked bar graphs across each value of k (1 through 4) for the Ann Arbor stage variable. Results are shown for A)
978 landmark genes and B) 978 randomly selected non-landmark genes used as features. (C) Accuracy, macro precision, macro recall, and macro F1 scores were
relatively higher when using 978 non-landmark genes vs. the 978 landmark genes as features.

K-means clustering & PCA for heterogeneous samples using 978 landmark vs. non-landmark genes as features

An analysis of heterogeneous tissue samples was performed using 978 landmark genes vs. 978 randomly selected non-landmark genes. A subset of one quarter of the observations was analyzed for landmark and non-landmark genes, given computational constraints of the PAM clustering method. Plots were generated for within-cluster sum of squares, between-cluster sum of squares, and the ratio of between- to within-cluster sum of squares using 978 landmark genes or 978 non-landmark genes as features (Figure 4A). The optimal number of clusters was 4 based on the elbow method and 2 based on silhouette width (Figure 4B). K-means clusters were plotted and compared with PAM and hierarchical clustering (Figure 4C). Accuracy, macro precision, macro recall, and macro F1 were computed (Figure 4D). Distributions of observations within each k-means cluster (2 - 4) are depicted in stacked bar graphs based on GSM or BSM id (Figure 5).


Fig. 4. Clustering results from landmark (left panels) and non-landmark genes (right panels) for the heterogeneous dataset. (A) Within- and between-cluster sum of squares and their ratio are plotted across each level of k for landmark and non-landmark genes. (B) The optimal number of clusters is 4 using the elbow method and 2 using the silhouette method for both 978 landmark genes and 978 non-landmark genes as features. (C) K-means clustering for distinguishing clusters using 978 landmark genes vs. 978 non-landmark genes was validated based on cluster plots, which depict more separation of clusters when using k-means clustering, relative to more overlap of clusters when using partitioning around medoids (PAM) and hierarchical clustering. (D) Macro precision, macro recall, and macro F1 are greater for k-means clustering with landmark genes as features, while accuracy is greater for each clustering method with non-landmark genes as features.

Although an optimal number of 2 or 4 clusters is suggested, there is a greater degree of overlap between samples within clusters for 3 and 4 clusters using 978 non-landmark genes as features (Figure 5B), compared to the same analysis performed using the 978 landmark genes as features (Figure 5A). However, it should be noted that the first two principal components capture a lower percentage of the overall variation in the data when using the non-landmark genes as features (9.4% and 6.2% for the first and second principal component, respectively) (Figure 6B), compared to using the landmark genes as features (13.1% and 9.2% for the first and second principal component, respectively) (Figure 6A).


Fig. 5. GSM and BSM ids are depicted in stacked bar graphs across each value of k (1 through 4) for clusters generated from A) 978 landmark genes vs. B) 978 non-
landmark genes. More consistency within clusters is shown for 978 landmark genes, compared to non-landmark genes.


Fig. 6. K-means cluster results are depicted as blocks of color for k=2 through k=5. These cluster groups are plotted against the first two principal components (x-axis and y-axis) of the respective gene expression data used to generate the k-means clusters from A) 978 landmark genes vs. B) 978 non-landmark genes. Greater cluster separation is shown using the 978 landmark genes as features across 2 to 5 cluster groups, compared to greater cluster overlap using the non-landmark genes as features.

Discussion

As the number of k-means cluster groups (the value of k) increases, there is an apparent increase in the visual overlap of each
given cluster. This is consistent with the increase in within- and between-cluster sum of squares as the value of k increases. The
optimal number of clusters is generally present at k values of 2 (gap statistic) and 4 (elbow plot based on within-cluster sum of
squares). The optimal number of clusters was consistent when using either the 978 landmark genes or 978 non-landmark genes as
features. However, the effectiveness of clustering varied when using the 978 landmark genes vs. the 978 non-landmark genes as
features. Variability depending upon the set of genes selected is also evident in the distinct appearance of the cluster plots, which
show more visual overlap and a lower between-cluster sum of squares for the non-landmark genes, compared to the landmark genes
for both the homogeneous and heterogeneous datasets. K-means clustering results coincide with the clinical variable of Ann Arbor
cancer stage to a greater extent when using non-landmark genes as features, compared to landmark genes. Results from each model
are depicted below (Table 2).
For the heterogeneous dataset, the percentage of variation captured by each of the first two principal components was greater for
the 978 landmark genes (PCA1: 13.1%; PCA2: 9.2%) vs. the 978 non-landmark genes (PCA1: 9.4%; PCA2: 6.2%), with similar
results for the homogeneous dataset. These results are consistent with the ratio of between to total within-sum of squares error, which
represents clustering effectiveness or the amount of total variability in the dataset captured by the clustering groups. The 978
landmark genes generated improvement of clustering results for both the homogeneous and the heterogeneous datasets. However, the
978 landmark genes contributed more relative improvement of clustering effectiveness than the randomly selected non-landmark
genes for the heterogeneous dataset compared to the homogeneous dataset. This is expected given that the 978 landmark genes can
predict the expression values of the remaining genes in the entire dataset with a high level of accuracy. Consequently, these 978
genes may contribute to relative improvement of clustering effectiveness for a dataset with a high degree of variability since these
978 genes may capture more of the overall variability in the dataset, compared to a homogeneous dataset with less inherent
variability.

Table 2. Summary of Results.

Model  Dataset        Features / predictors                    Between to total   Optimal no. of    Optimal no. of      PCA 1  PCA 2  Accuracy  Macro      Macro   Macro
                                                               within-cluster SS  clusters (elbow)  clusters                                    precision  recall  F1
                                                               (K=4)                                (silhouette width)
1      Homogeneous    978 landmark genes                       24%                4                 2                   12.7%  5.4%   .343      .431       .425    .342
2      Homogeneous    Random sample of 978 non-landmark genes  22%                4                 2                   8.1%   6.5%   .362      .444       .442    .362
3      Heterogeneous  978 landmark genes                       38%                4                 2                   13.1%  9.2%   .489      .544       .541    .488
4      Heterogeneous  Random sample of 978 non-landmark genes  23%                4                 2                   9.4%   6.2%   .511      .503       .503    .478

Methods similar to those used here may be used to evaluate whether certain subsets of genes in a dataset can predict cluster results. This study depicted the use of 978 landmark genes as a more effective method of identifying distinct clusters of individuals, according to visualization of data clusters against the first two principal components of the data, when assessing large heterogeneous datasets. Clusters in these plots are more distinct compared to cluster plots generated by using 978 randomly selected non-landmark genes in the dataset, supporting the use of these landmark genes as a representation of the genetic profile of these samples when assessing heterogeneous datasets [4]. Landmark genes also capture more of the variation in the data for both the homogeneous and heterogeneous datasets studied here. Despite this, non-landmark genes of the homogeneous dataset established clustering into groups that were more consistent with clinical variables, compared to the 978 landmark genes. These results are consistent with findings of previous studies, which have shown that gene subsets derived from various biological processes may establish distinct clustering results by capturing specific aspects of microarray data [7].
Certain genes or clinical variables may be more predictive of clustering results than others. When assessing separation of groups,
the role of sets of individual genes and clinical variables may be examined further. Clustering analysis may be utilized to inform
future studies on the ability of genes to predict clinical variables as well as the ability of clinical variables to characterize clusters
derived from gene expression results, as examined in this study. Future studies may also build upon this analysis using predictive
analytics techniques to further develop understanding of how to investigate the relationship between genetic and clinical variables by
accounting for both coding and non-coding genetic variants [8]. This may be especially relevant toward applications for personalized
medicine such as treatment responsiveness depending upon the combination of genetic and clinical variables. Future studies may
assess whether clustering results based on gene expression levels predict various indices of cancer stage and whether specific clinical
groups may possess more clusters than other groups.

Acknowledgements

This work was supported by Pennsylvania State University.

References

[1] Duan, Qiaonan, St Patrick Reid, Neil R Clark, Zichen Wang, Nicholas F Fernandez, Andrew D Rouillard, Ben Readhead, Sarah R Tritsch, Rachel Hodos, Marc
Hafner, Mario Niepel, Peter K Sorger, Joel T Dudley, Sina Bavari, Rekha G Panchal, Avi Ma’ayan. (2016) “L1000CDS2: LINCS L1000 characteristic direction
signatures search engine.” Npj Systems Biology and Applications 2: 1-25.
[2] Duncan, R., B. Carpenter, L.C. Main, C. Telfer, and G.I. Murray. (2008) “Characterization and protein expression profiling of annexins in colorectal cancer.”
British Journal of Cancer 98(2): 426-433.
[3] Chen, Xi., Jin Xie, and Qingcong Yuan. (2018) “A Method to Facilitate Cancer Detection and Type Classification from Gene Expression Data using a Deep
Autoencoder and Neural Network.” ArXiv:1812.08674v1.
[4] Chen, Yifei, Yi Li, Rajiv Narayan, Aravind Subramanian, and Xiaohui Xie. (2016) “Gene expression inference with deep learning.” Bioinformatics 32(12): 1832–
1839.
[5] Enache, Oana. M., David L. Lahr, Ted E. Natoli, Lev Litichevskiy, David Wadden, Corey Flynn, Joshua Gould, Jacob K. Asiedu, Rajiv Narayan, Aravind
Subramanian. (2018) “The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and
annotations.” BioRxiv.
[6] Subramanian, Aravind, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, John F. Davis, Andrew A. Tubelli, Jacob
K. Asiedu, David L. Lahr, Jodi E. Hirschman, Zihan Liu, Melanie Donahue, Bina Julian, Mariya Khan, David Wadden, Ian Smith, Daniel Lam, Arthur Liberzon,
Courtney Toder, Mukta Bagul, Marek Orzechowski, Oana M. Enache, Federica Piccioni, Alice H. Berger, Alyhan Shamji, Angela N. Brooks, Anita Vrcic, Corey
Flynn, Jacqueline Rosains, David Takeda, Desiree Davison, Justin Lamb, Kristin Ardlie, Larson Hogstrom, Nathanael S. Gray, Paul A. Clemons, Serena Silver,
Xiaoyun Wu, Wen-Ning Zhao, Willis Read-Button, Xiaohua Wu, Stephen J. Haggarty, Lucienne V. Ronco, Jesse S. Boehm, Stuart L. Schreiber, John G. Doench,
Joshua A. Bittker, David E. Root, Bang Wong, Todd R. Golub. (2017) “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.”
Cell 171(6): 1437-1452.e17.
[7] Chopra, Pankaj, Jaewoo Kang, Jiong Yang, HyungJun Cho, Haenam Stanley Kim, and Min-Goo Lee. (2008) “Microarray data mining using landmark gene-guided
clustering.” BMC Bioinformatics 9(92): 1–13.
[8] Quang, Daniel, Yifei Chen, and Xiaohui Xie. (2015) “DANN: A deep learning approach for annotating the pathogenicity of genetic variants.” Bioinformatics
31(5): 761–763.
