
Accepted Manuscript

Hybrid Method for the Analysis of Time Series Gene Expression Data
Lixin Han, Hong Yan
PII: S0950-7051(12)00085-8
DOI: 10.1016/j.knosys.2012.04.003
Reference: KNOSYS 2273
To appear in: Knowledge-Based Systems
Received Date: 2 September 2011
Revised Date: 23 March 2012
Accepted Date: 1 April 2012
Please cite this article as: L. Han, H. Yan, Hybrid Method for the Analysis of Time Series Gene Expression Data,
Knowledge-Based Systems (2012), doi: 10.1016/j.knosys.2012.04.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Hybrid Method for the Analysis of Time Series Gene Expression Data

Lixin Han(1,2,3) and Hong Yan(2,4)

1 College of Computer and Information, Hohai University, China
2 Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
3 State Key Laboratory of Novel Software Technology, Nanjing University, China
4 School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia

Abstract-- Time series analysis plays an increasingly important role in the study of gene expression data. Problems such as a large amount of noise and a small number of replicates pose computational challenges in time series expression data analysis. This paper proposes a hybrid method for analyzing time series gene expression data (HMTS). The HMTS method combines K-means clustering, regression analysis and piecewise polynomial curve fitting: K-means clustering divides the noisy time series into different clusters, regression analysis deletes outliers within each cluster, and all time series data are then divided into multiple segments, each of which is fitted with a polynomial curve. The HMTS method can obtain good estimates, especially when there is noise in the data.
Index Terms-- Time series analysis, Gene expression, Regression analysis, Function approximation.
1. Introduction
Time series analysis plays an increasingly important role in many applications [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. For example, in biological systems, various proteins are synthesized under different conditions, mRNA is transcribed continuously, and new proteins are built to replace those lost through protein degradation. Thus, gene expression is a temporal process and the data can be considered to be a time series [12].
There exists a large amount of time-series gene expression data. Time series expression data analysis encounters many computational challenges because these data often contain a large amount of noise and the number of data points is small. For
example, it is difficult to obtain a continuous representation of gene expression time
series due to noise and a small number of replicates. In addition, some data points
may be missing. This makes straightforward techniques such as interpolation of
individual time series difficult [12]. In this paper, we propose a hybrid method for
analyzing time series gene expression data (HMTS). This method employs a
combination of K-means clustering, regression analysis and piecewise polynomial
curve fitting in order to reduce noise and outliers and to obtain a good fit.
2. Related work
Mukhopadhyay and Chatterjee [13] present a method of establishing causality and
pathway maps for microarray time series. In this method, they employ Granger
causality to identify causality, and compute the minimal spanning tree for pathway
detection. This method cannot guarantee functional causality. Arnold et al. [1]
examine causal modeling methods based on Granger causality and graphical modeling
for time series analysis. They produce causal relationships between time-persistent
features, instead of temporal variables. Later, Lozano [14] presents a grouped
graphical Granger modeling method to analyze causality in gene expressions. This
method mainly employs a combination of regression methods and variable selection
in graphical Granger modeling, and leverages the group structure among the temporal variables, in order to scale to the large number of time series involved in genome-wide microarray analysis.
Hwang and Valls Pereira [15] employ Monte Carlo simulations to demonstrate that the estimated persistence of ARCH [16] and GARCH [17] models is exaggerated by the existence of structural breaks in persistence parameters.
Magni et al. [18] develop a software tool for clustering time series gene expression
data. In the software package, they provide end-users with two clustering algorithms,
Bayesian clustering (BC) [19] and temporal abstraction clustering (TAC) [20], for
analyzing short time series together with the well-known hierarchical clustering and
self-organizing maps.
Jiang and Yan [21] propose a new algorithm that employs a combination of the
Hilbert-Huang transform (HHT) and wavelet transform to analyze spectral properties
of short genes. The novelty of this method lies in the introduction of the HHT
algorithm to biological knowledge for discovering the spectral patterns of very short
gene sequences. Liu, Dai and Yan [22] develop the local weighted approximation
method for missing microarray data estimation. Zhang, Liu and Yan [23] propose a
regularized spline regression method for gene expression profile analysis.
Lin et al. [24] employ hidden Markov models (HMMs) to classify clinical time series expression data under the varying response rates of individual patients. This method takes into consideration not only the classification of the time series expression datasets but also the differences in patient response rates.
Smith et al. [25] and Smith [26] propose a multi-segment alignment method of computing alignments for sparse gene-expression time series and assessing their similarity. In addition, they propose a method of computing clustered alignments by simultaneously clustering genes and computing a common alignment for the genes within a cluster.
Li et al. [27] present an unsupervised conditional random fields (CRF) model for
clustering gene expression time series. This model makes use of the local
characteristic of Markov random fields to control the learning process of their model
in order to reduce computational complexity and facilitate faster convergence.
Goel et al. [28] present a method of system estimation of metabolic time-series data.
This method is composed of a model-free phase and a model-based phase. The
model-free phase reveals inconsistencies within the data, and leads to numerical
representations of fluxes as functions of the variables affecting them. The
model-based phase presents the mathematical formulation of the processes in the
biological system.
Kim et al. [29] propose a method of inferring biomolecular networks from multiple
time-series data. They employ linear time-varying systems instead of linear
time-invariant models to infer biomolecular networks. They model the network
inference as an optimization problem, and employ random perturbations to reduce the
possible number of solutions for the optimization problem.
Gennemark and Wedelin [30] present benchmarks that represent ordinary differential equation (ODE) identification problems as optimization problems, so that different methods can be evaluated and compared, and define a suitable file format for such problems.

Costa et al. [31] present a method of constraint-based mixture estimation of hidden Markov models in order to analyze and classify clinical time series. This method can handle noise, missing data and mislabeled samples, and its use of mixture estimation makes classification tasks easy to perform on sub-groups of patients.
Hermans and Tsiporkova [32] present a method of pasting expression profiles from
different microarray cell synchronization experiments together, in order that a curve
of the merged multiple datasets can cover different phases of the cell cycle. In this
method, a dynamic time warping (DTW) alignment is introduced to determine the
optimal pasting overlap between the experiments.
Futschik and Herzel [33] study the influence of the choice of background model on
microarray time series analysis. Their study indicates that the randomized and
Gaussian background models ignore the dependency structure within time series data.
Thus, they employ the AR(1)-based background model to represent the data structure
in the time series data in order to avoid overestimating the number of periodically
expressed genes.
Tsiporkova and Boeva [34] present a method of combining the procedures of recursive hybrid aggregation and hierarchical merging. A recursive hybrid aggregation algorithm uses a set of different aggregation operators to recursively extract a set of genes. A hierarchical merge procedure then uses dynamic time warping alignment to combine the multiple-experiment expression profiles of the selected genes.
In contrast to the above work, our HMTS method employs a combination of K-means clustering, regression analysis and piecewise polynomial curve fitting. In the HMTS method, K-means clustering is used to divide the noisy time series into different clusters, and regression analysis is used to delete outliers within each cluster. Lastly, all time series data are divided into multiple segments, and polynomial curve fitting is applied to each segment.
3. The HMTS method for the analysis of time series data
In this section, we briefly review the regression analysis and polynomial fitting
methods, and then propose the HMTS method.
3.1 Regression analysis for removing outliers
In the early days, few papers studied regression diagnostics, and therefore
Chatterjee and Hadi [35] employ regression analysis to perform outliers and influence
analysis of individual observations. They found inter-relationships among the existing
variety of measures based on residuals, the prediction matrix, volume of confidence
ellipsoids, influence functions, and partial influence, to study outliers, influential
observations, and high leverage points. These relationships allow them to choose
three suitable measures for analyzing those outliers which excessively affect the
regression equation.
We introduce a multiple linear regression model that is useful for describing the following five groups of measures. The multiple linear regression model can be formulated as $Y = X\beta + \varepsilon$, where $Y$ denotes an $N \times 1$ vector of values of the dependent variable, $X$ denotes an $N \times p$ full-column rank matrix of known predictors, $\beta$ denotes a $p \times 1$ vector of unknown coefficients to be estimated, and $\varepsilon$ denotes an $N \times 1$ vector of independent random variables, each of which has zero mean and unknown variance $\sigma^2$.
These measures are described as follows [35]:
(1) The measures based on residuals are used to inspect the least squares residuals. They are among the earliest methods of detecting model failures. Some important formulae in the measures based on residuals are described below:

$e_i = y_i - x_i\hat{\beta}$, where $x_i$ and $y_i$ denote the ith rows of $X$ and $Y$, $\hat{\beta}$ denotes the $p \times 1$ vector of estimated coefficients, and $e_i$ denotes the ith residual.

$P = X(X^TX)^{-1}X^T$, where $X$ denotes the $N \times p$ full-column rank matrix of known predictors.

$t_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1 - p_i}}$, whose calibration point is approximately $N(0,1)$, where $e_i$ denotes the ith residual, $p_i$ denotes the ith diagonal element of $P$, and $\hat{\sigma}$ denotes the residual mean square estimate.

$t_i^* = t_i\sqrt{\dfrac{N - p - 1}{N - p - t_i^2}}$, whose calibration point is approximately $t(N - p - 1)$. $t_i^*$ denotes a monotonic transformation of $t_i$, and follows a t-distribution with $(N - p - 1)$ degrees of freedom. $t_i^*$ reflects larger deviations than $t_i$.
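For concreteness, these residual measures can be computed directly in MATLAB, the language of our experimental programs (see Section 4). The following is a minimal sketch on synthetic data; the design matrix, response and variable names are our own illustrative assumptions rather than part of [35]:

% A minimal sketch on synthetic data (our own illustration, not part of [35])
N = 50; p = 3;                          % sample size and number of predictors
X = [ones(N,1) randn(N,p-1)];           % N-by-p full-column-rank predictor matrix
y = X*[1; 2; -1] + 0.5*randn(N,1);      % synthetic response
beta_hat = X \ y;                       % least squares estimate of beta
e  = y - X*beta_hat;                    % residuals e_i
P  = X/(X'*X)*X';                       % prediction (hat) matrix
pl = diag(P);                           % leverages p_i
sigma_hat = sqrt(e'*e/(N-p));           % residual mean square estimate
t  = e./(sigma_hat*sqrt(1-pl));         % standardized residuals, approx. N(0,1)
ts = t.*sqrt((N-p-1)./(N-p-t.^2));      % studentized residuals, approx. t(N-p-1)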
(2) The measures based on the prediction matrix project an N-dimensional vector $Y$ perpendicularly into a p-dimensional subspace, in order to generate the predicted values. Some important formulae in the measures based on the prediction matrix are described below:

$p_i = x_i(X^TX)^{-1}x_i^T$, where $2p/N$ is used as a calibration point for $p_i$, and $p_i$ denotes the ith diagonal element of the hat matrix $P$. $p_i$ can be regarded as the amount of leverage of the response value $y_i$ on the corresponding fitted value $\hat{y}_i$. $X$ denotes the $N \times p$ full-column rank matrix of known predictors, and $x_i$ denotes the ith row of $X$.
(3) The measures based on the volume of confidence ellipsoids consider the change in the size of the confidence ellipsoid with and without the ith observation. Some important formulae in these measures are described below:

$p_i^* = 1 - \mathrm{AP}_i = p_i + \dfrac{e_i^2}{e^Te}$, where $2(p+1)/N$ is used as a calibration point for $p_i^*$, $e$ denotes the vector of residuals when $Y$ is regressed on $X$, and $e_i$ denotes the ith residual. $\mathrm{AP}_i$ cannot discern between a high leverage point in the factor space and an outlier in the response-factor space.

$\mathrm{LD}_i = N\log\left[\dfrac{N}{N-1}\cdot\dfrac{N-p-1}{N-p-1+t_i^{*2}}\right] + \dfrac{t_i^{*2}(N-1)}{(1-p_i)(N-p-1)} - 1$, where $\chi^2_{\alpha,p+1}$ is used as a calibration point for $\mathrm{LD}_i$. The likelihood distance is related to the asymptotic confidence region $\left\{\beta : 2\left[L(\hat{\beta}) - L(\beta)\right] \le \chi^2_{\alpha,p+1}\right\}$, where $\chi^2_{\alpha,p+1}$ denotes the upper $\alpha$ point of the $\chi^2$ distribution with $(p+1)$ degrees of freedom.

$\mathrm{CVR}_i = \dfrac{1}{1-p_i}\left(\dfrac{N-p-t_i^2}{N-p-1}\right)^p$, where $|\mathrm{CVR}_i - 1| \ge 3p/N$ is used as a rough calibration point for $\mathrm{CVR}_i$.

$\mathrm{CW}_i = \dfrac{1}{2}\log\mathrm{CVR}_i + \dfrac{p}{2}\log\dfrac{F_{(\alpha;p,N-p-1)}}{F_{(\alpha;p,N-p)}}$, where $F_{(\alpha;a,b)}$ denotes the upper $\alpha$-point of the F-distribution with the appropriate degrees of freedom.
(4) The measures based on the influence functions introduce the influence function

$\mathrm{IF}_i(x_i, y_i; F, T) = \lim_{\varepsilon\to 0}\dfrac{T[(1-\varepsilon)F + \varepsilon\,\delta_{x_i,y_i}] - T[F]}{\varepsilon}$,

where $T(\cdot)$ denotes a vector-valued statistic based on a random sample from the cumulative distribution function $F$, and $\delta_{x_i,y_i}$ denotes the distribution taking the value 1 at $(x_i, y_i)$ and 0 otherwise. $\mathrm{IF}_i$ measures the influence on $T$ of adding the observation point $(x_i, y_i)$ to a very large sample.

Some important formulae in the measures based on influence functions are described below:

$C_i = \dfrac{t_i^2}{p}\cdot\dfrac{p_i}{1-p_i}$, where $F(\alpha; p, N-p)$ is used as a calibration point for $C_i$.

$\mathrm{WK}_i = |t_i^*|\sqrt{\dfrac{p_i}{1-p_i}}$, where $2\sqrt{p/N}$ is used as a calibration point for $\mathrm{WK}_i$.

$W_i = \mathrm{WK}_i\sqrt{\dfrac{N-1}{1-p_i}}$, where $3\sqrt{p}$ is used as a calibration point for $W_i$. $W_i$ is more sensitive than $\mathrm{WK}_i$ to $p_i$; the advantage of $\mathrm{WK}_i$, according to the authors, is that it is easier to interpret.

$C_i^* = \sqrt{\dfrac{N-p}{p}}\,\mathrm{WK}_i$, where $2\sqrt{(N-p)/N}$ is used as a calibration point for $C_i^*$.
(5) The influence measures above suppose that all regression coefficients are of equal interest. This assumption may be unreasonable, and thus the measures based on partial influence are more useful. Such measures reflect the one or few dimensions that make an observation influential, and allow the regression coefficients to be of unequal interest. Some important formulae in the measures based on partial influence are described below:

$D_{ij} = \dfrac{t_i^2 w_{ij}^2}{W_j^TW_j(1-p_i)}$, where $D_{ij}$ measures the influence of the ith observation on the jth coefficient, $W_j$ denotes the vector of residuals when $X_j$ is regressed on $X_{[j]}$ ($X_{[j]}$ denotes the matrix $X$ with the jth column deleted), and $w_{ij}$ denotes the ith element of $W_j$.

$D_{ij}^* = \dfrac{t_i^* w_{ij}}{\sqrt{W_j^TW_j(1-p_i)}}$, where $2/\sqrt{N}$ is used as a calibration point for $D_{ij}^*$.

$\Delta_{ij}^2 = \dfrac{w_{ij}^2}{W_j^TW_j}$, where $2/N$ is used as a calibration point for $\Delta_{ij}^2$, and $\Delta_{ij}^2$ denotes the change in the ith diagonal element of the prediction matrix when $X_j$ is added to or deleted from the regression model. $\Delta_j^2 = (\Delta_{1j}^2, \ldots, \Delta_{Nj}^2)^T$ denotes the normalized vector of squared residuals acquired from the regression of $X_j$ on all the other columns of $X$.
The experiment results [35] show that $\mathrm{WK}_i$, $\mathrm{CW}_i$, and $D_{ij}$, or alternatively $C_i^*$, $\mathrm{CVR}_i$, and $D_{ij}^*$, seem able to detect influential observations.
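Continuing the synthetic sketch above, the following minimal MATLAB fragment (again our own illustration) computes three of these influence diagnostics and applies the calibration points listed in this section:

% Continuing the sketch above (our own illustration): influence diagnostics
C   = (t.^2/p).*(pl./(1-pl));                          % Cook's distance C_i
WK  = abs(ts).*sqrt(pl./(1-pl));                       % Welsch-Kuh distance WK_i
CVR = ((N-p-t.^2)/(N-p-1)).^p./(1-pl);                 % covariance ratio CVR_i
flagged = (WK > 2*sqrt(p/N)) | (abs(CVR-1) >= 3*p/N);  % calibration-point tests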
3.2. Polynomial curve fitting for fitting time series data
Curve fitting can be used to obtain a continuous representation of time series gene
expression data. Curve fitting methods can be classified into two different categories
[36]. Firstly, interpolation is a method of finding a curve passing through all the
known data points. Secondly, approximation is a method of constructing a function
that approximately fits the known data points. In function approximation, the curve
does not need to pass through all the known data points, therefore it has a higher

robustness to overcoming noise influence than interpolation. There is a considerable
amount of noise in gene expression microarray data. Thus, we choose function
approximation.
The flexibility of simple polynomial models makes them suitable for fitting
complicated curves. We employ polynomial curve fitting by least squares, in order to
fit time series data [36]. Given

$s_n(a, x) = \sum_{j=0}^{n} a_j x^j$,

where the function $s_n(\cdot)$ is a polynomial of degree $n$, function approximation can be regarded as globally optimizing an objective function, that is,

$\min_{a \in E^{n+1}} F(a) = \sum_{i=0}^{m} \omega(x_i)\left(f(x_i) - \sum_{j=0}^{n} a_j x_i^j\right)^2$,

where $f(x_i)$ is the function value at the ith data point and $\omega(x_i)$ is the weight at the ith data point.
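To make the objective concrete, the following is a minimal MATLAB sketch of this weighted least squares fit; the data, the degree and the uniform weights $\omega(x_i) = 1$ are our own illustrative assumptions, not part of [36]:

% A minimal sketch (our own illustration) of the weighted least squares fit
x = linspace(0, 2*pi, 30)';            % sample points x_i
f = sin(x) + 0.1*randn(size(x));       % noisy function values f(x_i)
n = 5;                                 % degree of the polynomial s_n
w = ones(size(x));                     % weights omega(x_i), uniform here
V = x.^(0:n);                          % columns are x_i^j for j = 0..n
a = (sqrt(w).*V) \ (sqrt(w).*f);       % coefficients a minimizing F(a)
s = V*a;                               % fitted values s_n(a, x_i)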
3.3. The HMTS method
The HMTS method is described below:
Step 1. The noise of the time series in every cluster is reduced. The widely used K-means clustering is employed to acquire cluster information corresponding to the specific clusters, in order to divide the noisy time series into different clusters and to ensure that the mean profile of each cluster can be used for the subsequent polynomial fit. Within each cluster, we employ regression analysis [35] to delete outliers. An n x 2 matrix of intervals is used to detect outliers: if the signs of the two elements in the ith row of the matrix are the same, the interval does not contain zero, which indicates that the ith observation is an outlier, because the corresponding residual is larger than expected in 95% of new observations. A sketch of this step is given below.
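The following is a minimal MATLAB sketch of Step 1 under our own assumptions: the expression matrix G and the cluster count k are stand-ins, each series is regressed on its cluster mean profile, and the residual intervals rint come from the regress function of the MATLAB Statistics Toolbox.

% A minimal sketch of Step 1 under our own assumptions (see the text above)
G = randn(200, 18);                          % stand-in genes-by-timepoints matrix
k = 10;                                      % assumed number of clusters
idx = kmeans(G, k);                          % divide the noisy series into clusters
T = size(G, 2);
for c = 1:k
    m = mean(G(idx == c, :), 1)';            % mean profile of cluster c
    Xc = [ones(T,1) m];                      % regress each series on the cluster mean
    for r = find(idx == c)'
        [~,~,~,rint] = regress(G(r,:)', Xc); % 95% intervals for the residuals
        out = sign(rint(:,1)) == sign(rint(:,2));  % same sign: interval excludes zero
        G(r, out) = NaN;                     % delete the outlying observations
    end
end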

Step 2. The time series data are fitted. All time series data are divided into multiple segments. Polynomial curve fitting is used to fit each segment, and least squares is used to obtain error estimates or predictions. Specifically, a Vandermonde matrix is created; the elements of the Vandermonde matrix are powers of x, that is, $v_{i,j} = x_i^{n-j}$. The backslash operator is used to solve the least squares problem, that is, $p = V\backslash y$. A recursive digital filter is used to evaluate every polynomial. A sketch of this step is given below.
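The following is a minimal MATLAB sketch of Step 2 on a synthetic series; the segment count and the polynomial degree are assumed parameters.

% A minimal sketch of Step 2 (our own illustration)
x = linspace(0, 20, 60)';                % time points of one cleaned series
y = sin(x) + 0.3*randn(size(x));         % stand-in expression values
nseg = 3; n = 3;                         % assumed segment count and degree
edges = round(linspace(1, numel(x), nseg+1));
yfit = zeros(size(y));
for s = 1:nseg
    r = (edges(s):edges(s+1))';          % indices of the current segment
    V = x(r).^(n:-1:0);                  % Vandermonde matrix, v(i,j) = x_i^(n-j)
    pcoef = V \ y(r);                    % backslash solves the least squares problem
    yfit(r) = polyval(pcoef, x(r));      % Horner (recursive) evaluation
end

For a single point x0, the same Horner recursion can also be realized with a recursive digital filter, since v = filter(1, [1 -x0], pcoef) satisfies v(end) = polyval(pcoef, x0).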
The HMTS method employs a combination of noise reduction and polynomial curve fitting to analyze time series gene expression data. The HMTS algorithm consists of K-means clustering, regression analysis and piecewise polynomial curve fitting, whose time complexities are O(n²), O(n), and O(n³), respectively. Thus, the time complexity of the HMTS method is O(n³).
In contrast to the traditional noise reduction method [35], our HMTS method uses
the K-means clustering algorithm for preprocessing, in order to divide the noisy time
series into different clusters. We take into consideration not only past values of the
same time series but also present and past values of the exogenous time series.
In contrast to the traditional polynomial curve fitting method [36], our HMTS method divides all time series data into multiple segments and employs polynomial curve fitting on each segment. Thus, our method attains better fitting results than the traditional polynomial curve fitting method.
In contrast to Smith et al. [25], Smith [26] and Tsiporkova and Boeva [34], our HMTS method employs K-means clustering to divide the noisy time series into different clusters, employs regression analysis to delete outliers within each cluster, and divides all time series data into multiple segments so that polynomial curve fitting can be applied to each segment. Especially when there is noise in the data, the HMTS method can remove noise, delete useless outliers and obtain good estimates.
4. Experiment results and discussion
In this section, we conduct experiments to evaluate the performance of the HMTS
method. In these experiments, the experimental programs are written in MATLAB.
4.1. Comparative analysis using simulated data
In this subsection, simulated data are used to understand the functionality of the HMTS method and to compare it with other methods.
The HMTS method is compared to the polynomial curve fitting method [36], the 1-D data interpolation method [37], and the cubic spline data interpolation method [37] by generating a sine curve. In Figure 1, the noisy data are produced by adding random numbers with a normal distribution of mean 0 and standard deviation 0.1, multiplied by ten, to the noise-free data. The experiment result in Figure 1 shows that the HMTS method and the polynomial curve fitting method create smoother curves than the 1-D data interpolation method and the cubic spline data interpolation method. The experiment result in Figure 1 also shows that even when there is noise, our method can remove more noise and achieve a better fit to the noise-free data than the polynomial curve fitting method, the 1-D data interpolation method, and the cubic spline data interpolation method. In Figure 2, the x-axis denotes the different methods, where x = 1, 2, 3 and 4 respectively denote the HMTS method, the polynomial curve fitting method, the 1-D data interpolation method and the cubic spline data interpolation method, and the y-axis denotes the root mean squared (RMS) error, which is used to measure the precision of the fit. The experiment result in Figure 2 shows that the HMTS method outperforms the polynomial curve fitting method, the 1-D data interpolation method and the cubic spline data interpolation method in RMS error, that is, the HMTS method achieves a better fit than these three methods.
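The simulated-data setup can be sketched in a few lines of MATLAB; the noise scale below follows the description above literally (standard deviation 0.1 multiplied by ten), and the degree-9 polynomial fit is only an assumed stand-in for the methods being scored.

% A minimal sketch (our own illustration) of the simulated-data experiment
x = linspace(0, 20, 200)';
clean = sin(x);                            % noise-free sine curve
noisy = clean + 10*0.1*randn(size(x));     % N(0, 0.1) noise multiplied by ten
yfit  = polyval(polyfit(x, noisy, 9), x);  % an assumed stand-in fit to score
rms_err = sqrt(mean((yfit - clean).^2));   % RMS error: precision of the fit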
[Figure 1 contains six panels over x = 0 to 20 (y from -2 to 2): raw data without noise; data with noise; our method for fitting data; the polynomial curve fitting method; the 1-D data interpolation method; and the cubic spline data interpolation method.]
Fig.1 The comparative result of generating a sine curve.



Fig.2 The comparative result of the root mean squared error in the above four methods.

The HMTS method is compared to Futschik and Herzel's method [33] by generating a sine curve. In Figure 3, the x-axis denotes a value range from 0 to 10 and the y-axis denotes the error estimate. Again, the noisy data are produced by adding random numbers with a normal distribution of mean 0 and standard deviation 0.1, multiplied by ten, to the noise-free data. Method 1 denotes Futschik and Herzel's method and method 2 denotes the HMTS method. The experiment result in Figure 3 shows that even when there is noise, the HMTS method can remove more noise and produces smaller errors than Futschik and Herzel's method. In Figure 4, the x-axis denotes the different methods, that is, method 1 denotes Futschik and Herzel's method and method 2 denotes the HMTS method, and the y-axis denotes the root mean squared (RMS) error, which is used to measure the precision of the fit. The experiment result in Figure 4 shows that even when there is noise, the HMTS method outperforms Futschik and Herzel's method in RMS error, that is, the HMTS method achieves a better fit than Futschik and Herzel's method.

Fig.3 The comparative result of the error estimate in the HMTS method and the Futschik and Herzel's
method.



Fig.4 The comparative result of the root mean squared error in the above two methods.

4.2. Comparative analysis using real microarray gene expression data
In this subsection, real microarray gene expression data are used to verify the functionality of the HMTS method and to compare this approach with the traditional curve fitting methods. In every data set, a gene expression matrix is given, where each column corresponds to a time point of an individual array experiment and each row corresponds to a gene that is regarded as cell cycle regulated. Each element represents the degree to which the data point for a gene at a certain time is cell cycle regulated; the higher the score, the stronger the evidence of regulation. Each element is computed from the combined Cy5/Cy3 measurements, where Cy5 is the red fluorescent dye and Cy3 is the green fluorescent dye. Inductions or repressions of equal magnitude are represented as numerically equal values of opposite sign [38]. We set the missing values in these data sets to zero.
The yeast cell cycle data are called the combined data set. The yeast cell cycle data contain the tab-delimited data for the alpha factor, elutriation, cdc15, and cdc28 time courses. The combined data set created by Spellman et al. [38] uses independent synchronization methods, that is, alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. This microarray data can be found at [39]. The HMTS method is compared to the polynomial curve fitting method, the 1-D data interpolation method, and the cubic spline data interpolation method on the alpha, elutriation, cdc15, and cdc28 data sets. The experiment results for these data sets are described as follows:
There are 6178 rows and 18 columns in the alpha data set. For a more detailed understanding of the experiment results, we show the area where most of the data points are densely distributed. The experiment result in Figure 5 shows that the HMTS method retains less noise than the polynomial curve fitting method, the 1-D data interpolation method, and the cubic spline data interpolation method. The experiment result in Figure 5 also shows that the 1-D data interpolation method fits the raw data points more closely than the other methods. However, the other methods can remove more noise and delete more outliers than the 1-D data interpolation method.


Fig.5 The comparative experimental result of the alpha data set.

There are 6178 rows and 14 columns in the elutriation time course data set. For a more detailed understanding of the experiment results, we show the area where most of the data points are densely distributed. The experiment result in Figure 6 shows that the HMTS method retains less noise and fewer outliers than the polynomial curve fitting method, the 1-D data interpolation method and the cubic spline data interpolation method.


Fig.6 The comparative experimental result of the elutriation time courses data set.

There are 6178 rows and 24 columns in the cdc15 data set. For a more detailed understanding of the experiment results, we show the area where most of the data points are densely distributed. The experiment result in Figure 7 shows that the HMTS method and the polynomial curve fitting method retain less noise and fewer outliers than the 1-D data interpolation method and the cubic spline data interpolation method. The experiment result in Figure 7 also shows that the 1-D data interpolation method fits the raw data points more closely than the other methods. However, the other methods can remove more noise and delete more outliers than the 1-D data interpolation method.


Fig.7 The comparative experimental result of the cdc15 data set.

There are 6178 rows and 17 columns in the cdc28 data set. For a more detailed understanding of the experiment results, we show the area where most of the data points are densely distributed. The experiment result in Figure 8 shows that the HMTS method retains less noise and fewer outliers than the polynomial curve fitting method, the 1-D data interpolation method and the cubic spline data interpolation method. The experiment result in Figure 8 also shows that the 1-D data interpolation method fits the raw data points more closely than the other methods. However, the other methods can remove more noise and delete more outliers than the 1-D data interpolation method.



Fig.8 The comparative experimental result of the cdc28 data set.
5. Conclusion
In recent years, there has been considerable interest in the analysis of time series.
Gene expression is a temporal process, and therefore time series analysis plays an
important role in the study of gene expression data. Time series expression data have
unique features and problems, and therefore the analysis of these data encounters
many computational challenges. In this paper, we propose a novel hybrid method, called HMTS, for acquiring a continuous representation of the time course of the expression of all genes. The HMTS method consists of preprocessing and fitting. In preprocessing, K-means clustering is used to divide the noisy time series into different clusters, so that the mean profile of each cluster can be used for the subsequent polynomial fit, and regression analysis is used to delete outliers within each cluster. Thus, we take into consideration not only past values of the same time series but also the present and past values of the exogenous time series. In fitting, all time series data are divided into multiple segments, and polynomial curve fitting is applied to each segment. The experiments show that the HMTS method can remove noise, delete useless outliers and obtain a good fit.
Acknowledgements
The authors thank the anonymous reviewers for their constructive comments. This
work is supported by grants from the Hong Kong Research Grant Council (project
CityU123809), the National Natural Science Foundation of China (project 60673186,
project 60571048, project 60873264 and project 60971088), the Qing Lan Project,
and the State Key Laboratory Foundation of Novel Software Technology at Nanjing
University under grant KFKT2009B04.
References
[1] Andrew Arnold, Yan Liu, Naoki Abe. Temporal causal modeling with graphical Granger methods. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007. ACM, 2007, pp. 66-75.
[2] R. de A. Araújo. A class of hybrid morphological perceptrons with application in time series forecasting. Knowledge-Based Systems 24 (4) (2011) 513-529.
[3] E. Hadavandi, H. Shavandi, A. Ghanbari. Integration of genetic fuzzy systems and artificial neural networks for stock price forecasting. Knowledge-Based Systems 23 (8) (2010) 800-808.
[4] V. Cho. MISMIS - A comprehensive decision support system for stock market investment. Knowledge-Based Systems 23 (6) (2010) 626-633.
[5] M.S. Khan, F. Coenen, D. Reid, R. Patel, L. Archer. A sliding windows based dual support framework for discovering emerging trends from temporal data. Knowledge-Based Systems 23 (4) (2010) 316-322.
[6] R. de A. Araújo. A robust automatic phase-adjustment method for financial forecasting. Knowledge-Based Systems 27 (2012) 245-261.
[7] Puteri N.E. Nohuddin, Frans Coenen, Rob Christley, Christian Setzkorn, Yogesh Patel, Shane Williams. Finding interesting trends in social networks using frequent pattern mining and self organizing maps. Knowledge-Based Systems 29 (2012) 104-113.
[8] Vit Niennattrakul, Dararat Srisai, Chotirat Ann Ratanamahatana. Shape-based template matching for time series data. Knowledge-Based Systems 26 (2012) 1-8.
[9] Yi-Shian Lee, Lee-Ing Tong. Forecasting time series using a methodology based on autoregressive integrated moving average and genetic programming. Knowledge-Based Systems 24 (2011) 66-72.
[10] Hailin Li, Chonghui Guo. Piecewise cloud approximation for time series mining. Knowledge-Based Systems 24 (2011) 492-500.
[11] Joong Hyuk Chang. Mining weighted sequential patterns in a sequence database with a time-interval weight. Knowledge-Based Systems 24 (2011) 1-9.
[12] Ziv Bar-Joseph. Analyzing time series gene expression data. Bioinformatics 20 (16) (2004) 2493-2503.
[13] Nitai D. Mukhopadhyay, Snigdhansu Chatterjee. Causality and pathway search in microarray time series experiment. Bioinformatics 23 (4) (2007) 442-449.
[14] Aurélie C. Lozano, Naoki Abe, Yan Liu, Saharon Rosset. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics 25 (12) (2009) i110-i118.
[15] Soosung Hwang, Pedro L. Valls Pereira. The effects of structural breaks in ARCH and GARCH parameters on persistence of GARCH models. Communications in Statistics - Simulation and Computation 37 (3) (2008) 571-578.
[16] Robert F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50 (4) (1982) 987-1007.
[17] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31 (1986) 307-327.
[18] Paolo Magni, Fulvia Ferrazzi, Lucia Sacchi, Riccardo Bellazzi. TimeClust: a clustering tool for gene expression time series. Bioinformatics 24 (3) (2008) 430-432.
[19] Fulvia Ferrazzi, Paolo Magni, Riccardo Bellazzi. Random walk models for Bayesian clustering of gene expression profiles. Applied Bioinformatics 4 (4) (2005) 263-276.
[20] L. Sacchi, R. Bellazzi, C. Larizza, P. Magni, T. Curk, U. Petrovic, B. Zupan. TA-clustering: cluster analysis of gene expression profiles through temporal abstractions. International Journal of Medical Informatics 74 (7-8) (2005) 505-517.
[21] Rong Jiang, Hong Yan. Studies of spectral properties of short genes using the wavelet subspace Hilbert-Huang transform (WSHHT). Physica A: Statistical Mechanics and its Applications 387 (16-17) (2008) 4223-4247.
[22] Chao-Chun Liu, Dao-Qing Dai, Hong Yan. The theoretic framework of local weighted approximation for microarray missing value estimation. Pattern Recognition 43 (8) (2010) 2993-3002.
[23] Wei-Feng Zhang, Chao-Chun Liu, Hong Yan. Clustering of temporal gene expression data by regularized spline regression and an energy based similarity measure. Pattern Recognition 43 (12) (2010) 3969-3976.
[24] Tien-ho Lin, Naftali Kaminski, Ziv Bar-Joseph. Alignment and classification of time series gene expression in clinical studies. Bioinformatics 24 (13) (2008) i147-i155.
[25] Adam A. Smith, Aaron Vollrath, Christopher A. Bradfield, Mark Craven. Clustered alignments of gene-expression time series data. Bioinformatics 25 (12) (2009) i119-i127.
[26] http://pages.cs.wisc.edu/~aasmith/smith_dissertation.pdf (2009).
[27] Chang-Tsun Li, Yinyin Yuan, Roland Wilson. An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics 24 (21) (2008) 2467-2473.
[28] Gautam Goel, I-Chun Chou, Eberhard O. Voit. System estimation from metabolic time-series data. Bioinformatics 24 (21) (2008) 2505-2511.
[29] Jongrae Kim, Declan G. Bates, Ian Postlethwaite, Pat Heslop-Harrison, Kwang-Hyun Cho. Linear time-varying models can reveal non-linear interactions of biomolecular regulatory networks using multiple time-series data. Bioinformatics 24 (10) (2008) 1286-1292.
[30] Peter Gennemark, Dag Wedelin. Benchmarks for identification of ordinary differential equations from time series data. Bioinformatics 25 (6) (2009) 780-786.
[31] Ivan G. Costa, Alexander Schönhuth, Christoph Hafemeister, Alexander Schliep. Constrained mixture estimation for analysis and robust classification of clinical time series. Bioinformatics 25 (12) (2009) i6-i14.
[32] Filip Hermans, Elena Tsiporkova. Merging microarray cell synchronization experiments through curve alignment. Bioinformatics 23 (2) (2007) e64-e70.
[33] Matthias E. Futschik, Hanspeter Herzel. Are we overestimating the number of cell-cycling genes? The impact of background models on time-series analysis. Bioinformatics 24 (8) (2008) 1063-1069.
[34] Elena Tsiporkova, Veselka Boeva. Fusing time series expression data through hybrid aggregation and hierarchical merge. Bioinformatics 24 (16) (2008) i63-i69.
[35] S. Chatterjee, A.S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1 (3) (1986) 379-416.
[36] Kecun Zhang, Yingliang Zhao. Algorithm and Analysis for Numerical Calculation. Beijing: Science Press, 2003.
[37] C. de Boor. A Practical Guide to Splines. Springer-Verlag, 1978.
[38] Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, Bruce Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9 (12) (1998) 3273-3297.
[39] http://genome-www.stanford.edu/ (1999).
