
PSO based fuzzy clustering for replicated gene expression data

Abstract:

An integration of particle swarm optimization (PSO) and the K-Means
algorithm is becoming one of the popular strategies for solving
clustering problems, especially unsupervised gene clustering. It is known
as the PSO-based k-means clustering algorithm (PSO-KM). In the field of
genetics, thousands of gene expression levels are measured
simultaneously using microarray technology. Gene clustering is then used
to discover similarity of biological function among the genes.
Clustering can also be thought of as a form of data compression, where a
large number of samples are converted into a small number of
representative prototypes or clusters. Depending on the data and the
application, different types of similarity measures may be used to
identify classes, where the similarity measure controls how the clusters
are formed. In this paper a new PSO algorithm for clustering gene
datasets is proposed, based on the PSO and K-Means clustering
algorithms. PSO is a promising method in gene clustering because it
provides stronger global convergence towards an optimal solution. This
study proposes an enhanced cluster matching to further improve PSO-KM.
In the proposed scheme, fuzzy clustering, a data point can belong to
more than one cluster, and associated with each point are membership
grades that indicate the degree to which it belongs to the different
clusters. This lets particles perform better in searching for the
optimum in a collaborative manner.
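
The membership grades mentioned above can be illustrated with a short
sketch. It assumes the standard fuzzy c-means membership formula with
Euclidean distance and fuzzifier m = 2; the abstract does not say which
formula the proposed scheme uses, so the function name and parameters
here are illustrative only.

```python
import numpy as np

def fuzzy_memberships(data, centroids, m=2.0, eps=1e-12):
    """Return an (n_points, n_clusters) matrix of membership grades.

    Assumed formula (standard fuzzy c-means):
    u[i, j] = d[i, j]^(-2/(m-1)) / sum_k d[i, k]^(-2/(m-1))
    """
    # Euclidean distance from every point to every centroid.
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.maximum(dist, eps)  # avoid division by zero at a centroid
    inv = dist ** (-2.0 / (m - 1.0))
    # Normalise so that the grades of each point sum to 1.
    return inv / inv.sum(axis=1, keepdims=True)

# Toy profiles: two points sit on the centroids, one lies halfway between.
profiles = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
u = fuzzy_memberships(profiles, centroids)
```

Each row of `u` sums to 1, and the halfway point receives equal grades
for both clusters, which is the sense in which a point can belong to
more than one cluster.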

EXISTING SYSTEM:
The PSO-based k-means clustering algorithm (PSO-KM) causes the
dimensionality of the clustering problem to expand in the PSO search
space, and the sequence of clusters represented in a particle is not
evaluated. This study proposes an enhanced cluster matching to further
improve PSO-KM. In the proposed scheme, prior to the PSO updating
process, each cluster centroid encoded in a particle is matched with the
closest corresponding centroid in the global best particle. On this
basis, the sequence of centroids is evaluated and optimized with the
closest distance. This lets particles perform better in searching for
the optimum in a collaborative manner. Gene clustering is becoming
popular because of mature microarray technology and increasing computing
power. In a DNA microarray experiment, numerical gene expression levels
are obtained from well-prepared genes of interest through laser
excitation of hybridized targets and are preprocessed using software.
Microarray technology allows a huge number of gene expression levels to
be monitored simultaneously for a whole genome through a single chip.
Cluster analysis plays an important role in extracting useful
information from the massive raw data.
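
The cluster matching step described above can be sketched as follows.
This is a minimal illustration, not the authors' implementation: it
assumes Euclidean distance and a greedy one-to-one assignment in which
each centroid of the global best particle, in order, claims the nearest
unclaimed centroid of the particle. The name `match_centroids` is
hypothetical.

```python
import numpy as np

def match_centroids(particle, gbest):
    """Reorder the centroids of `particle` to line up with `gbest`.

    Both arrays have shape (k, d): k centroids of dimension d.
    Assumption: greedy matching, one particle centroid per gbest centroid.
    """
    k = gbest.shape[0]
    dist = np.linalg.norm(gbest[:, None, :] - particle[None, :, :], axis=2)
    used = np.zeros(k, dtype=bool)
    order = []
    for g in range(k):
        # Ignore particle centroids that were already claimed.
        row = np.where(used, np.inf, dist[g])
        j = int(np.argmin(row))
        used[j] = True
        order.append(j)
    return particle[order]

gbest = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
particle = np.array([[9.0, 1.0], [1.0, 0.0], [4.0, 6.0]])
matched = match_centroids(particle, gbest)
```

After matching, row i of `matched` is the particle's counterpart of row
i of `gbest`, so a per-dimension PSO update compares like with like.
Flattening a particle with `matched.reshape(-1)` gives k * d values,
which is the expansion of the search space noted above.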

DISADVANTAGES OF EXISTING SYSTEM:


There is a demand for observing and analyzing interactions among
thousands of genes in massive datasets.
PSO-KM causes the dimensionality of the clustering problem to expand in
the PSO search space.
PSO-KM (CM) is also less sensitive to the initial conditions.
The sequence of clusters represented in a particle is not evaluated.

PROPOSED SYSTEM:
In the field of genetics, thousands of gene expression levels are
measured simultaneously using microarray technology. Gene clustering is
then used to discover similarity of biological function among the genes,
and many clustering algorithms are used for this purpose. In this paper
a new PSO algorithm for clustering gene datasets is proposed, based on
PSO-KM and automatic clustering algorithms. PSO-KM is a promising method
in gene clustering because it provides stronger global convergence
towards an optimal solution. By using a spectral algorithm, the cluster
number can be selected automatically during the clustering process,
which reduces the overall time taken to cluster the genes. A
population-based random search technique known as particle swarm
optimization (PSO) has been applied to data clustering; it involves no
complex operations such as selection, crossing and mutation. A new
variant of PSO, called quantum-behaved particle swarm optimization
(QPSO), has been proposed to improve the global search ability of the
original PSO. The iterative equation of QPSO is different from that of
PSO. The main drawback of this algorithm is that it leads to premature
convergence, since the particle is guided by both global best and
personal best positions.

To overcome this drawback, a new version of the QPSO algorithm was
introduced, known as multi-elitist quantum-behaved particle swarm
optimization (MEQPSO) [9]. In the MEQPSO algorithm, the particles'
search is influenced by a position that may lie in a more promising
search region than the global best position, so the particles have more
chance to search this region and find the global optimal solution. As a
result, MEQPSO has better overall performance than the original QPSO.
Its main disadvantage is that it cannot select the cluster number
automatically during the clustering process. It is therefore combined
with a prominent automatic clustering algorithm, the spectral clustering
algorithm. Combining MEQPSO with spectral clustering provides better
overall convergence to the best solution by automatically selecting the
cluster number during the clustering process.

ADVANTAGES OF PROPOSED SYSTEM:

It is easier to implement than the earlier approaches, since it does not
undergo any complex operations such as selection.
It was proved that this iterative equation makes PSO-KM more globally
convergent than PSO, since it needs no velocity vectors for particles
and has fewer parameters to adjust.
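
The velocity-free iterative equation referred to above is not written
out in the text. The sketch below assumes it is the commonly published
quantum-behaved PSO (QPSO) update, in which each particle is drawn
towards a random point between its personal best and the global best and
scattered around it in proportion to its distance from the mean of the
personal bests. The function name, the default `beta` and the array
layout are assumptions for illustration.

```python
import numpy as np

def qpso_step(positions, pbest, gbest, beta=0.75, rng=None):
    """One velocity-free QPSO position update for all particles.

    positions, pbest: shape (n_particles, dim); gbest: shape (dim,).
    beta (contraction-expansion coefficient) is the main tunable parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, dim = positions.shape
    mbest = pbest.mean(axis=0)                      # mean of personal bests
    phi = rng.random((n, dim))
    attractor = phi * pbest + (1.0 - phi) * gbest   # point between pbest and gbest
    u = 1.0 - rng.random((n, dim))                  # in (0, 1], keeps the log finite
    sign = np.where(rng.random((n, dim)) < 0.5, -1.0, 1.0)
    spread = beta * np.abs(mbest - positions) * np.log(1.0 / u)
    return attractor + sign * spread

rng = np.random.default_rng(0)
swarm = rng.random((5, 3))
new_swarm = qpso_step(swarm, swarm.copy(), swarm[0], rng=rng)
```

Note that no velocity array is stored or updated; the only parameter to
tune in this form is `beta`.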

INTRODUCTION:

Much research has generated a large number of gene datasets, so
clustering can be applied in molecular biology to analyze gene
expression data. Using clustering algorithms, genes with similar
expression patterns are assigned to clusters according to a
dissimilarity measure between any two genes. The ultimate goal of the
clustering process is to identify genes with the same functions or the
same regulatory mechanisms. Hierarchical and k-means approaches were
used in earlier work. Evolutionary approaches followed; their
fundamental strategy is to imitate the evolution process of nature and
evolve the clustering solutions from one generation to the next. The
genetic k-means algorithm was then used in the clustering process, which
combines the robust nature of the genetic algorithm with the high
performance of the k-means algorithm.

In 1995, a population-based random search technique known as particle
swarm optimization (PSO) was introduced, and it has since been applied
to data clustering. It is easier to implement than the earlier
approaches, since it does not undergo any complex operations such as
selection, crossing and mutation. A new variant of PSO, called
quantum-behaved particle swarm optimization (QPSO) [6-8], has been
proposed to improve the global search ability of the original PSO. The
iterative equation of QPSO is different from that of PSO. It was proved
that this iterative equation makes QPSO more globally convergent than
PSO, since it needs no velocity vectors for particles and has fewer
parameters to adjust. The main

drawback of this algorithm is that it leads to premature convergence,
since the particle is guided by both global best and personal best
positions. To overcome this drawback, a new version of the QPSO
algorithm was introduced, known as multi-elitist quantum-behaved
particle swarm optimization (MEQPSO) [9]. In the MEQPSO algorithm, the
particles' search is influenced by a position that may lie in a more
promising search region than the global best position, so the particles
have more chance to search this region and find the global optimal
solution. As a result, MEQPSO has better overall performance than the
original QPSO. Its main disadvantage is that it cannot select the
cluster number automatically during the clustering process, so it is
combined with a prominent automatic clustering algorithm, the spectral
clustering algorithm. Combining MEQPSO with spectral clustering provides
better overall convergence to the best solution by automatically
selecting the cluster number during the clustering process.

The rest of this paper is organized as follows. Section II explains the
spectral clustering and MEQPSO algorithms. Section III provides details
on how the clustering process is done automatically using the proposed
spectral MEQPSO algorithm. Finally, the paper is concluded in Section
IV.

In the MEQPSO algorithm, a parameter called the growth rate is
calculated to find the degree of evolution of each particle. The growth
rate is increased when the fitness value of the particle at the t-th
iteration is better than that at the (t-1)-th iteration of the same
particle. The algorithm uses two best positions, pbest and gbest. The
pbest (personal best) of each particle tracks the coordinates in the
problem space associated with the best solution (fitness) it has
achieved so far. The gbest is the global best value over the swarm,
obtained by treating the whole population in the problem space as each
particle's topological neighbors. On each iteration, the best position
of every particle is updated. The pbest positions that have a better
fitness value than the gbest position obtained before are placed in a
candidate area. The update of the gbest position is based on the
selection probability pc. Before updating, a random number is generated.
If the random number is greater than pc and the candidate area is not
empty, the gbest position is replaced by the pbest position with the
highest growth rate, selected from the candidate area. If not, the gbest
position is taken to be the position with the best fitness value among
the particles in the present population. The algorithm terminates when
the limit on the number of iterations is reached.
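
The gbest update just described can be sketched as below. Several
details are assumptions made only for illustration: fitness is
minimised, the growth rate increases by one per improvement, and the
fallback picks the particle with the best personal-best fitness. The
function names are hypothetical.

```python
import numpy as np

def update_growth(growth, fitness_now, fitness_prev):
    """Increase the growth rate of each particle whose fitness improved.

    Assumption: lower fitness is better and the increment is 1.
    """
    return growth + (fitness_now < fitness_prev)

def select_gbest(pbest_fit, gbest_fit, growth, pc, rng):
    """Return the index of the particle whose pbest becomes the new gbest."""
    # Candidate area: personal bests that beat the previous gbest.
    candidates = np.flatnonzero(pbest_fit < gbest_fit)
    if rng.random() > pc and candidates.size > 0:
        # Multi-elitist choice: the candidate with the highest growth rate.
        return int(candidates[np.argmax(growth[candidates])])
    # Otherwise fall back to the best fitness in the population (assumed).
    return int(np.argmin(pbest_fit))

rng = np.random.default_rng(0)
growth = update_growth(np.zeros(4, dtype=int),
                       np.array([1.0, 0.5, 2.0, 0.4]),
                       np.array([1.5, 0.9, 1.0, 0.8]))
pbest_fit = np.array([1.0, 0.3, 2.0, 0.4])
rates = np.array([0, 1, 0, 3])
elitist = select_gbest(pbest_fit, 0.6, rates, pc=0.0, rng=rng)
fallback = select_gbest(pbest_fit, 0.6, rates, pc=1.0, rng=rng)
```

With `pc=0.0` the candidate with the highest growth rate is chosen even
though another particle has a better fitness; with `pc=1.0` the random
number can never exceed `pc`, so the plain best-fitness rule applies.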
Gene expression profiles
We'll assume we have a 2D matrix of gene expression measurements:
rows represent genes;
columns represent different experiments, time points, individuals, etc.
(what we can measure using one* microarray).
We'll refer to individual rows or columns as profiles; a row is a
profile for a gene.
* Depending on the number of genes being considered, we might actually
use several arrays per experiment, time point, or individual.
Task definition: clustering gene expression profiles
Given: expression profiles for a set of genes or
experiments/individuals/time points (whatever columns represent).
Do: organize profiles into clusters such that instances in the same
cluster are highly similar to each other, and instances from different
clusters have low similarity to each other.
Motivation for clustering
Exploratory data analysis: understanding general characteristics of
data; visualizing data.
Generalization: infer something about an instance (e.g. a gene) based on
how it relates to other instances.
Everyone else is doing it.
The clustering landscape
There are many different clustering algorithms. They differ along
several dimensions:
hierarchical vs. partitional (flat);
hard (no uncertainty about which instances belong to a cluster) vs. soft
clusters;
disjunctive (an instance can belong to multiple clusters) vs.
non-disjunctive;
deterministic (same clusters produced every time for a given data set)
vs. stochastic;
distance (similarity) measure used.
Distance/similarity measures
Many clustering methods employ a distance (similarity) measure to assess
the distance between a pair of instances, a cluster and an instance, or
a pair of clusters.
Given a distance value, it is straightforward to convert it into a
similarity value; it is not necessarily straightforward to go the other
way.
We'll describe our algorithms in terms of distances.
Evaluating clustering results
Given random data without any structure, clustering algorithms will
still return clusters.
The gold standard: do clusters correspond to natural categories?
Do clusters correspond to categories we care about? (There are lots of
ways to partition the world.)
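
The point about structureless data can be checked with a small
experiment. The sketch below uses a plain k-means written for
illustration (not the PSO-based method of this paper) on uniform random
noise; the data size, `k` and the seeds are arbitrary assumptions.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain k-means: returns (labels, centroids) for the rows of `data`."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)          # assign each row to nearest centroid
        for j in range(k):
            if np.any(labels == j):           # leave empty clusters where they are
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# 200 "profiles" of pure uniform noise: there is no real structure to find.
noise = np.random.default_rng(1).random((200, 5))
labels, centroids = kmeans(noise, k=4)
```

Every row still receives one of the four labels, so the existence of
clusters in the output says nothing by itself about structure in the
data; that is why the questions above about natural categories matter.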
Purpose
Clustering analysis is based on partitioning a collection of data points
into a number of subgroups, where the objects inside a cluster (a
subgroup) show a certain degree of closeness or similarity. It has been
playing an important role in solving many problems in pattern
recognition and data processing.
Scope

Clustering is useful in several exploratory pattern-analysis, grouping,
decision-making, and machine-learning situations, including data mining,
document retrieval, image segmentation, and pattern classification.
List Of Modules:
1. Distance matrix construction
2. Distance calculation
3. Pair selection
4. Checking for matched clusters using fuzzy clustering
5. Principal component analysis
6. Self-organizing map for particle swarm optimization
7. Best particle finding
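
Modules 1 and 2 are only named in the list above. As a rough
illustration of what they might compute, the sketch below builds a
pairwise distance matrix over gene profiles and converts it to
similarities. Euclidean distance and the 1 / (1 + d) conversion are
assumptions; the text does not say which measures the modules use.

```python
import numpy as np

def distance_matrix(profiles):
    """Pairwise Euclidean distances between the rows (gene profiles)."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def to_similarity(dist):
    """Map distances in [0, inf) to similarities in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + dist)

# Three toy genes measured in two experiments; genes 0 and 2 are identical.
profiles = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
dist = distance_matrix(profiles)
sim = to_similarity(dist)
```

The matrix is symmetric with zeros on the diagonal, and identical
profiles get similarity 1, the largest possible value under this
conversion.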
