
Analysis of Gene Microarray Data using Association Rule Mining

Jitendra Agrawal¹, Ramesh C. Jain²

¹ School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidhyalaya, Bhopal (M.P.), India
² Samrat Ashok Technological Institute, Vidisha (M.P.), India

Abstract: With the development of new technologies and revolutionary changes in biomedicine and biotechnology,
biological data has grown explosively over the last few years. Because of this volume, most recent studies
focus on analysing microarray data and extracting useful and interesting information from it. Microarray technology is
a powerful tool for analysing thousands of gene expression values in biomedical research, giving researchers a
means of looking at how genes are expressed under conditions such as drug treatment or disease. Association rule
mining is one data mining technique that can extract biological information and help discover interactions between
different genes. This paper describes work carried out in bioinformatics to find associations between gene probes
based on correlations in their expression levels, and reports a successful implementation.

Keywords: Association Rule Mining, Gene Expression, Microarray Data


I. INTRODUCTION
A GENE is a segment of DNA that contains the formula
for the chemical composition of one particular protein. Since
microarray technology was introduced, scientists have been
developing informatics tools for analysing this kind of data
and extracting information from it. Owing to the characteristics
of microarray data, data mining approaches have become
suitable tools for analysing it. Many techniques can be
applied to microarray data; association rule mining is one
such data mining technique.

The term data mining refers to finding relevant and useful
information in databases. It is frequently described in the
literature as the process of extracting interesting information
or patterns from large databases, and the term has been
stretched beyond its limits to cover any form of data analysis.
Data mining [4] is the search for relationships and global
patterns that exist in large databases but are hidden among
the vast amount of data, such as the relationship between
patient data and medical diagnoses. These relationships
represent valuable knowledge about the database.

Microarray technology is generating huge amounts of data
about the expression level of thousands of genes, or even
whole genomes, across different experimental conditions.

Jitendra Agrawal is with the Rajiv Gandhi Proudyogiki
Vishwavidhyalaya, Bhopal (M.P.), India
Ramesh C. Jain, Director, is with the Samrat Ashok
Technological Institute, Vidisha (M.P.), India

To extract biological knowledge and to fully understand
such datasets, it is essential to include external biological
information about genes and gene products in the analysis of
expression data. This paper implements Association Rule
Mining (ARM) [5],[8],[9] on microarray data. On examining
the dataset, we observed that the data is initially unsuitable
for ARM, since the technique requires a discretized or
binary dataset. Therefore, to discover intrinsic associations
among co-expressed patterns in the gene expression data, we
first applied the Max-50% approach to discretize the values
for each experiment. Under this approach, a gene is said to
be Expressed whenever its expression value is greater than
the mean expression value for the particular experiment
(row), and Repressed whenever its value is less than that
mean. All Expressed values are denoted as 1 and all
Repressed values as 0. The ARM methodology is then applied
to the discretized gene expression dataset, in which genes
were extracted with gene expression pathways, and it
revealed significant relationships among these gene
attributes and expression patterns. We found that ARM is
able to integrate multiple gene probes and expression data in
the same analytic framework and to extract meaningful
associations among heterogeneous sources of data. The
paper is organized as follows: Part II surveys the papers that
motivated this work, Part III describes the implementation
and its results, Part IV analyses the generated rules, Part V
gives the conclusion, and Part VI describes future work.
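
The mean-based Expressed/Repressed rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the toy values are invented, rows are taken to be experiments and columns genes, and mapping a value exactly equal to the row mean to 0 is an assumption, since the text does not cover that case.

```python
from statistics import mean

def binarize_by_row_mean(matrix):
    """Return a 0/1 matrix: 1 where a value exceeds its row mean, else 0."""
    binary = []
    for row in matrix:
        row_mean = mean(row)
        # A value equal to the mean is not covered by the text;
        # treating it as 0 (Repressed) is an assumption.
        binary.append([1 if value > row_mean else 0 for value in row])
    return binary

# Invented toy values: 2 experiments (rows) x 4 genes (columns).
toy = [
    [0.2, 1.8, 1.1, 0.9],
    [2.0, 0.5, 0.5, 3.0],
]
binarized = binarize_by_row_mean(toy)
```

Each row is thresholded against its own mean, so the same raw value can be Expressed in one experiment and Repressed in another.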
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 1, JANUARY 2012, ISSN 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 94
II. LITERATURE REVIEW
Gene-expression data may be generated from DNA arrays or
SAGE approaches, and the initial values are continuous.
According to [7], generating the Boolean matrix from such
continuous data is a critical step that may impinge on the
interpretation of the rules generated. The authors describe
three different approaches to generating such matrices, all
of which try to capture a gene-expression pattern. The arrows
indicate tags that were found to associate through both
methods. If the columns of the matrix are denoted C1,
C2, ..., Cn, then the ARD technique discovers association
rules, which are expressions X → Y where X and Y are sets of
columns (often called itemsets) and X ∩ Y = ∅. The
relevance of a rule X → Y can be measured by its
frequency and its confidence [5],[8]. The frequency is the
number of rows where all columns in X and Y have a value
of 1 (true) simultaneously. This number is often divided by
the number of rows to provide a relative frequency. The
confidence is the ratio between the frequency and the
number of rows where all the columns in X have a value of
1; that is, it estimates the conditional probability of
observing the properties denoted by Y when the properties
denoted by X are true. Thus, a rule with a confidence of
1 means that 100% of the rows that have 1 in the columns
in X also have 1 in the columns in Y. The classical
association-rule mining task concerns the discovery of every
rule whose frequency and confidence are greater than or
equal to user-defined thresholds.
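
The frequency and confidence definitions above translate directly into code. The sketch below uses an invented Boolean matrix and hypothetical function names. Returning None when no row satisfies X is a choice made here, because the ratio is undefined in that case.

```python
def frequency(rows, columns):
    """Number of rows in which every column index in `columns` is 1."""
    return sum(1 for row in rows if all(row[c] == 1 for c in columns))

def relative_frequency(rows, x, y):
    """Frequency of the rule X -> Y divided by the number of rows."""
    return frequency(rows, x | y) / len(rows)

def confidence(rows, x, y):
    """frequency(X and Y) / frequency(X); None when X never holds."""
    support_x = frequency(rows, x)
    if support_x == 0:
        return None
    return frequency(rows, x | y) / support_x

# Invented Boolean matrix: 5 rows, columns C1..C3 stored at indices 0..2.
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]
X, Y = {0}, {1}  # the rule C1 -> C2
rule_frequency = frequency(rows, X | Y)
rule_confidence = confidence(rows, X, Y)
```

In this toy matrix C1 and C2 are both 1 in three rows and C1 is 1 in four rows, so the rule C1 → C2 has frequency 3 and confidence 3/4.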
The idea of using high confidence is taken from [12], which
presents an association rule mining method for mining
high-confidence rules that describe interesting gene
relationships in microarray data sets. Microarray data sets
typically contain an order of magnitude more genes than
experiments, rendering many data mining methods
impractical because they are optimized for sparse data sets.
A new family of row-enumeration rule mining algorithms has
emerged to facilitate mining in dense data sets. These
algorithms rely on the support measure to prune infrequent
relationships and so reduce the search space. This is a major
shortcoming, since it results in the pruning of many
potentially interesting rules with low support but high
confidence. The authors of [12] therefore propose MAXCONF,
a row-enumeration rule mining method for mining
high-confidence rules from microarray data. It is a
support-free algorithm that directly uses the confidence
measure to prune the search space effectively.

In [6], a new cohesion-based parameter is proposed for
association mining tasks in which two further criteria are
treated as equally important: a) accuracy of the knowledge
extracted in a rule with respect to known biological
functions, and b) predictability of biological interactions
from discovered rules. Most support- and/or confidence-based
techniques address only predictability, or neither of the two,
and require tedious post-processing to unearth the actually
interesting rules from the bulky output set. That work
exploits the notions of direct interaction (DI) and
cohesion [6] to develop a sound methodology for binding
genes under common affinity groups and mining intra-group
associations. To evaluate soundness, the method is applied
to yeast cell cycle data and the results are analysed with the
help of known biological interactions in BIND [6]. Impressive
values are reported for both accuracy and predictability.

III. RESULT AND IMPLEMENTATION
1. Brief Description of Dataset Used
Briefly, the experiment investigates the temporal program of
gene expression accompanying the metabolic shift from
fermentation to respiration. This shift occurs when fermenting
yeast cells, inoculated into a glucose-rich medium, turn to
aerobic utilization of the ethanol produced during
fermentation once the fermentable sugar is exhausted. The
dataset contains whole-genome expression levels during this
metabolic change. Experiments are numbered by time point
from one to seven (T1-T7) and correspond to samples
harvested at successive two-hour intervals after an initial
nine hours of growth. The dataset is combined with
external information about metabolic pathways and
transcriptional regulators that bind to promoter regions. This
annotated dataset [10],[11] was first transformed into a
transaction dataset, and association rules were then extracted
under the constraint that gene annotations appear in the
antecedent and gene expression patterns in the consequent.
We evaluated our method by mining the data using the
co-expression patterns [6].
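
One way to picture the transaction transformation and the antecedent/consequent constraint is sketched below. The item naming scheme (the "annot:" and "expr:" prefixes), the function names and the example labels are all hypothetical; the paper does not specify how its transactions were encoded.

```python
def to_transaction(annotations, expression_bits):
    """Build one transaction (a set of items) for a gene.

    annotations: iterable of annotation labels (placeholder strings).
    expression_bits: 0/1 values for time points T1..Tn.
    """
    items = {f"annot:{label}" for label in annotations}
    items |= {f"expr:T{i}={bit}" for i, bit in enumerate(expression_bits, start=1)}
    return items

def satisfies_constraint(antecedent, consequent):
    """True when the antecedent holds only annotation items and
    the consequent holds only expression items."""
    return (
        len(antecedent) > 0
        and len(consequent) > 0
        and all(item.startswith("annot:") for item in antecedent)
        and all(item.startswith("expr:") for item in consequent)
    )

# Invented example gene; the labels are placeholders, not taken from the dataset.
transaction = to_transaction(
    ["pathway:example", "regulator:example"],
    [0, 0, 1, 1, 1, 1, 1],
)
```

A rule miner would keep only candidate rules for which `satisfies_constraint` returns True, so annotations can only predict expression patterns and not the reverse.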

A snapshot of the microarray gene expression dataset
[1],[2],[3] used is shown below; details are given in
Figure 1.1.

Figure 1.1: The microarray dataset used.
2. Methodology
In this work, we concentrate on analysing microarrays, which
are specifically designed for understanding the relationships
between genes, by implementing Association Rule Mining
(ARM). ARM cannot be applied directly to microarray
datasets, because they consist of continuous values that do
not by themselves show any direct relation. For example,
gene1 is said to be expressed at a particular time t1
(experiment) if and only if its value is greater than the mean
value of the whole experiment t1 (Gavin Sherlock et al. [3]).
We therefore need some pre-processing to convert the values
into discrete values to which Association Rule Mining can
easily be applied. During pre-processing of the microarray
data we also encountered some redundant or misplaced GeneIDs.

While implementing Association Rule Mining on the microarray
experiments to identify common gene relationships
across experiments, we went through the following steps:
- Removing Redundant GeneIDs or Misplaced IDs.
- Applying discretization methods.
- Applying Association Rule Mining.
- Analysing the generated rules.

A. Removing Redundant GeneIDs or Misplaced IDs
During this process we observed that some fields in the
Gene Probe / GeneID position are represented as EMPTY.
Counting them, we found that 89 GeneIDs had this
redundancy. To avoid misinterpretation, we removed them
from the data.
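
The clean-up step can be sketched as a simple filter. The record layout (a GeneID paired with a list of values), the function name and the sample values are invented for illustration; only the EMPTY marker comes from the text, and matching it case-insensitively after trimming whitespace is an assumption.

```python
def drop_empty_gene_ids(records, placeholder="EMPTY"):
    """Return the records whose GeneID is not the placeholder,
    together with the number of records that were removed."""
    kept = [
        (gene_id, values)
        for gene_id, values in records
        if gene_id.strip().upper() != placeholder
    ]
    return kept, len(records) - len(kept)

# Invented records; the numeric values are not from the paper's dataset.
records = [
    ("YIL124W", [0.4, 1.2]),
    ("EMPTY", [0.1, 0.3]),
    ("YIR043C", [0.9, 1.5]),
    ("EMPTY", [2.0, 0.7]),
]
cleaned, removed = drop_empty_gene_ids(records)
```

Returning the removed count alongside the cleaned records makes it easy to compare against the number reported in the text.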

B. Applying Discretization Methods
The gene expression data is usually represented as a series of
continuous numbers, which makes it difficult to analyse directly.
Some data mining methods, e.g. association rule mining,
require the data to be transformed by a discretization
technique [4] into discretized or binary form. In addition,
decision tree classification performs better on categorical
data sets. A proper data discretization method is therefore
needed for microarray data analysis to produce
understandable results.

Each quantitative value in the gene expression data gives
rise to one Boolean value, that is, true (1) or false (0).
Becquet et al. (2002) [7] proposed three different
discretization procedures: the max minus x%,
mid-ranged, and x% of highest value approaches.
Assume a given expression value is denoted as d.
A. max minus x%
The max minus x% method (Becquet et al.) [7] consists of
identifying the maximum expression value (MaxValue).
It assigns a value of 1 when the expression
value is greater than (MaxValue - x)/100. Otherwise the
expression is assigned a value of 0. We have
taken x = 50:

\[
v =
\begin{cases}
1 & \text{if } d > (\mathit{MaxValue} - 50)/100 \\
0 & \text{if } d \le (\mathit{MaxValue} - 50)/100
\end{cases}
\]
These methods have various limitations. For
example, in the max minus x% method, if the maximum
value in the data set turns out to be noise, the whole
discretization will be wrong because it is based on a wrongly
defined maximum value.
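
The following sketch illustrates the max minus x% rule and its sensitivity to a noisy maximum. The cut-off is passed in as a function because its exact form is uncertain: `threshold_as_printed` applies (MaxValue - x)/100 literally as the text states it, while `threshold_percent_of_max` assumes the alternative reading in which the cut-off is the maximum minus x% of the maximum. Which reading the authors used is not confirmed by the text; all names and values here are invented.

```python
def threshold_as_printed(max_value, x):
    """Cut-off exactly as printed in the text: (MaxValue - x) / 100."""
    return (max_value - x) / 100

def threshold_percent_of_max(max_value, x):
    """Assumed alternative reading: MaxValue minus x% of MaxValue."""
    return max_value * (100 - x) / 100

def binarize_max_minus_x(values, x, threshold_fn):
    """Return 1 for values strictly above the cut-off, else 0."""
    cutoff = threshold_fn(max(values), x)
    return [1 if d > cutoff else 0 for d in values]

# Invented expression values.
values = [10.0, 40.0, 55.0, 80.0]
bits_percent = binarize_max_minus_x(values, 50, threshold_percent_of_max)
bits_printed = binarize_max_minus_x(values, 50, threshold_as_printed)

# The same values with the maximum replaced by a noisy outlier.
with_outlier = binarize_max_minus_x(
    [10.0, 40.0, 55.0, 1000.0], 50, threshold_percent_of_max
)
```

Under the percent-of-maximum reading, a single outlier raises the cut-off so far that every other value falls below it, which is the limitation noted above.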

After applying the binarization technique [7] (max-50%)
to the yeast cell cycle microarray data, we obtain
the data set in binary [2], or discretized, form:

Table: Binarized values using the Max-50% method.

C. Applying Association Rule on Discretized Data
We used the software "Concept Explorer" (ConExp) version
1.3 (available at http://www.sourceforge.net) [13].
The dataset was supplied to the ConExp tool in CSV
(comma-separated values) format. For the analysis we set the
minimum confidence to 0.7, i.e. confidence values greater
than or equal to 70%, and the minimum support to 2.

Figure: The ConExp software implementing Association
Rule Mining.
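
For readers without ConExp, the effect of the two thresholds can be illustrated with a brute-force enumeration of rules between single genes. This is not ConExp's procedure and makes no claim about its internals. It is a swapped-in sketch that assumes rows are experiments, columns are genes, and the minimum support of 2 is an absolute row count. The gene labels and matrix are invented.

```python
from itertools import permutations

def single_antecedent_rules(rows, genes, min_support=2, min_confidence=0.7):
    """Enumerate rules a -> b between single genes.

    rows: 0/1 lists, one per experiment; genes: column labels.
    min_support is treated as an absolute row count (an assumption).
    """
    rules = []
    for a, b in permutations(range(len(genes)), 2):
        support_a = sum(row[a] for row in rows)
        support_ab = sum(1 for row in rows if row[a] == 1 and row[b] == 1)
        if support_a == 0 or support_ab < min_support:
            continue
        conf = support_ab / support_a
        if conf >= min_confidence:
            rules.append((genes[a], genes[b], support_ab, conf))
    return rules

# Invented binarized matrix: 4 experiments x 3 placeholder genes.
genes = ["g1", "g2", "g3"]
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
rules = single_antecedent_rules(rows, genes)
```

Rules sharing an antecedent can afterwards be grouped into one rule with several consequents, which is the shape of the rules listed in the next section.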
IV. ANALYSING THE RULES GENERATED
After applying association rule mining to the microarray
dataset, we obtained the generated rules. We considered
only single-antecedent rules for analysis; some of them
are as follows:

'YIL124W' ==> 'YIR043C' 'YJL114W' 'YJL210W' 'YKL035W' 'YKL140W';
'YJL114W' ==> 'YIL124W' 'YIR043C' 'YJL210W' 'YKL035W' 'YKL140W';
'YKL011C' ==> 'YIR043C' 'YJL062W' 'YJL210W' 'YKL108W';
'YKL035W' ==> 'YIL124W' 'YIR043C' 'YJL114W' 'YJL210W' 'YKL140W';
'YKL108W' ==> 'YIR043C' 'YJL062W' 'YJL210W' 'YKL011C';
'YKL140W' ==> 'YIL124W' 'YIR043C' 'YJL114W' 'YJL210W' 'YKL035W';

The above rules were generated with a minimum support
threshold of 4% and a minimum confidence threshold of 70%.
The analysis is based on these rules, which support the
underlying premise for gene expression microarray data
that the genes are related in a co-expressed manner.

Consider Rule 1:
'YIL124W' ==> 'YIR043C' 'YJL114W' 'YJL210W' 'YKL035W' 'YKL140W';
Searching for all of these GeneIDs / gene probes in the
discretized dataset, we made the following observations:


Figure: Matching expression levels for Rule 1.

A 1 in the data set indicates that the gene is Expressed
in the particular experiment phase.
If YIL124W is Expressed, then at the same time
YIR043C, YJL114W, YJL210W, YKL035W and YKL140W
are also Expressed.
On inspection, the generated rules show very good accuracy.
Hence the rules generated match the required analysis and
the interestingness of the dataset.
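
The manual check described above can be automated along the following lines: parse a rule in the printed format, then count the experiments in which the antecedent is Expressed and those in which every consequent gene is Expressed as well. The function names are hypothetical, and the 0/1 profiles are invented placeholders rather than the paper's data.

```python
import re

def extract_ids(text):
    """Pull GeneID-like tokens (capitals, digits, hyphens) out of a string."""
    return re.findall(r"[A-Z0-9-]+", text)

def parse_rule(line):
    """Split a line like "'A' ==> 'B' 'C';" into (antecedent, consequent) lists."""
    left, right = re.split(r"=+>", line, maxsplit=1)
    return extract_ids(left), extract_ids(right)

def rule_holds(profiles, antecedent, consequent):
    """Return (fired, matched): experiments where all antecedent genes are 1,
    and how many of those also have every consequent gene at 1."""
    n = len(next(iter(profiles.values())))
    fired = [t for t in range(n) if all(profiles[g][t] == 1 for g in antecedent)]
    matched = [t for t in fired if all(profiles[g][t] == 1 for g in consequent)]
    return len(fired), len(matched)

rule = "'YIL124W' ==> 'YIR043C' 'YJL114W' 'YJL210W' 'YKL035W' 'YKL140W';"
antecedent, consequent = parse_rule(rule)

# Invented 0/1 profiles over T1..T7, for illustration only; not the paper's data.
profiles = {gene: [0, 0, 1, 1, 1, 0, 0] for gene in antecedent + consequent}
fired, matched = rule_holds(profiles, antecedent, consequent)
```

The ratio matched/fired is the rule's confidence on the supplied profiles, so a rule that holds in every experiment where it fires has confidence 1.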
V. CONCLUSION
With the advances in microarray technology, the
expression levels of thousands of genes can be measured
simultaneously and efficiently during important
biological processes and across collections of related samples.
Analysing microarray data to identify localized co-expressed
gene patterns has become a new focus of
researchers, as such gene patterns are essential in revealing
gene functions, gene regulation, subtypes of cells, and
the cellular processes of gene regulation networks. We handled
the microarray data through co-expressed patterns based on
expression levels being either high or low. To apply
Association Rule Mining, the microarray data must first be
discretized; mining then produces rules according to the
discretized gene expression levels. Implemented in this way,
the approach has given us new insights for biological research.
VI. FUTURE WORK
While implementing ARM to mine co-expressed gene
patterns, we identified a number of issues that could be
investigated further.
- Although there have been some encouraging results
on co-expressed pattern mining from microarray
datasets, the number of resulting patterns is still not
small, which makes them difficult for biologists
to analyse. New approaches may consider
how to use biological discoveries on
gene networks as prior knowledge for mining
interesting frequent patterns. Prior knowledge of
gene relationships could act not only as a post-filter
to pick out more interesting patterns, but could
also be built into the early pruning process to
enhance mining efficiency.
- Further exploration of the resulting co-expressed
patterns is another interesting research
direction. Gene association rule mining from
microarray data has been well studied in the
literature. Hence, association rule mining combined
with classifier building on microarray data could be
further explored.
- Based on the interestingness scheme of the mining
methodology, we could further extend the
co-tendency patterns from microarray datasets. That is,
new efficient mining algorithms could be
designed around expression or repression
values, identified according to the time series
to which they belong.

Finally, although the time-series pattern mining algorithm
can deliver the detailed and complete time-series based
information between genes/gene clusters, the genes/gene
clusters that act as the activator/inhibitor have the same
affecting time periods as the genes/gene clusters that are
activated/inhibited. In gene regulatory networks, there also
exist genes/gene clusters that regulate each other but have
different affecting time periods. Future work can be done to
mine such gene patterns so that they may be found with
reference to the experimental conditions.

REFERENCES
[1] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano,
Identification of genetic networks by strategic gene
disruptions and gene overexpressions under a Boolean
model. Theoretical Computer Science, vol. 298, pp. 235-251,
2003.
[2] T. Akutsu, S. Miyano, and S. Kuhara, Identification of
genetic networks from a small number of gene expression
patterns under the Boolean network model. in Pacific
Symposium on Biocomputing, vol. 4, pp. 17-28, 1999.
[3] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M.
Eisen, P. Brown, D. Botstein, and B. Futcher, Comprehensive
identification of cell cycle-regulated genes of the yeast
Saccharomyces cerevisiae by microarray hybridization.
Molecular Biology of the Cell, vol. 9, pp. 3273-3297, 1998.
[4] J. Han and M. Kamber, Data Mining: Concepts and
Techniques, 2nd Edition. Morgan Kaufmann Publishers,
ISBN 978-81-312-0535-8.
[5] R. Agrawal, T. Imielinski, and A. N. Swami, Mining
association rules between sets of items in large databases. in
ACM SIGMOD Intl Conf. on Management of Data, pp. 207-216,
1993.
[6] R. Bhattacharyya, Improving Accuracy of Discovered
Knowledge through Direct Interaction and Cohesion-based
Framework: A Study in Cell Cycle Data of Yeast. in Seventh
International Conference on Advances in Pattern
Recognition, 2009.
[7] C. Becquet, S. Blachon, B. Jeudy, J.-F. Boulicaut, and O.
Gandrillon, Strong association-rule mining for large scale
gene expression data analysis: a case study on human SAGE
data. Genome Biology, vol. 3, no. 12, pp. 1-16, 2002.
[8] R. Agrawal and R. Srikant, Fast algorithms for
mining association rules. in Proc. 20th Int. Conf. Very Large
Data Bases, pp. 487-499, 1994.
[9] S. Brin, R. Motwani, and C. Silverstein, Beyond Market
Baskets: Generalizing Association Rules to Correlations. in
Proc. ACM SIGMOD Conf., pp. 265-276, May 1997.
[10] J. L. DeRisi, V. R. Iyer, and P. O. Brown, Exploring the
metabolic and genetic control of gene expression on a
genomic scale. Science, vol. 278, pp. 680-686, 1997.
[11] J. L. DeRisi, V. R. Iyer, and P. O. Brown, Exploring the
Metabolic and Genetic Control of Gene Expression on a
Genomic Scale. Science, vol. 278, issue 5338, pp. 680-686,
24 October 1997.
[12] T. McIntosh and S. Chawla, High-Confidence Rule Mining
for Microarray Analysis. IEEE/ACM Transactions on
Computational Biology and Bioinformatics, vol. 4, no. 4,
October-December 2007.
[13] http://sourceforge.net/projects/conexp/files/conexp/1.3/conexp-1.3.zip/download

R. C. Jain is the Director of Samrat Ashok
Technological Institute (Engg. College), Vidisha (M.P.),
India. He has 35 years of teaching experience
and 15 years of research experience. He is actively
involved in research, with interests in Soft
Computing, Data Mining and Ad hoc Networks. He
has published more than 125 research papers in national and
international journals, has produced 7 Ph.D.s, and has
10 Ph.D.s in progress.

J. Agrawal is an Assistant Professor at the School of
Information Technology, Rajiv Gandhi
Proudyogiki Vishwavidyalaya, Bhopal (M.P.),
India. He is working on association mining
algorithms to improve their efficiency under the
guidance of Dr. R. C. Jain. His areas of interest are
DBMS, Data Mining, and Soft Computing.
