In Silico Analysis of Microarray Data For Breast Cancer

INTRODUCTION
Cancer is a class of diseases or disorders characterized by uncontrolled division of cells and the ability of these to spread, either by direct growth into adjacent tissue through invasion, or by implantation into distant sites by metastasis (where cancer cells are transported through the bloodstream or lymphatic system). Cancer may affect people at all ages, but risk tends to increase with age. It is one of the principal causes of death in developed countries. There are many types of cancer. Severity of symptoms depends on the site and character of the malignancy and whether there is metastasis. A definitive diagnosis usually requires the histological examination of tissue by a pathologist. This tissue is obtained by biopsy or surgery. Most cancers can be treated and some cured, depending on the specific type, location, and stage. Once diagnosed, cancer is usually treated with a combination of surgery, chemotherapy and radiotherapy. As research develops, treatments are becoming more specific for the type of cancer pathology. Drugs that target specific cancers already exist for several types of cancer. If untreated, cancers may eventually cause illness and death, though this is not always the case. The unregulated growth that characterizes cancer is caused by damage to DNA, resulting in mutations to genes that encode for proteins controlling cell division. Many mutation events may be required to transform a normal cell into a malignant cell. These mutations can be caused by radiation, chemicals or physical agents that cause cancer, which are called carcinogens, or by certain viruses that can insert their DNA into the human genome. Mutations occur spontaneously, and may be passed down from one cell generation to the next as a result of mutations within germ lines. 
However, some carcinogens also appear to work through non-mutagenic pathways that affect the level of transcription of certain genes without causing genetic mutation. Many forms of cancer are associated with exposure to environmental factors such as tobacco smoke, radiation, alcohol, and certain viruses. Some risk factors can be avoided or reduced.
Breast cancer is a cancer originating from breast tissue, most commonly from the inner lining of milk ducts or the lobules that supply the ducts with milk. The primary risk factors for breast cancer include sex, age, lack of childbearing or breastfeeding, high hormone levels and race. Breast cancer is classified by TNM stage (tumor, nodes, metastasis), grade, receptor status and the presence or absence of genes as determined by DNA testing. Breast cancer, like other cancers, occurs because of an interaction between the environment and a defective gene. Normal cells divide as many times as needed and stop. They attach to other cells and stay in place in tissues. Cells become cancerous when mutations destroy their ability to stop dividing, to attach to other cells and to stay where they belong. When cells divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those mistakes. Mutations known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in the error-correcting mechanisms. These mutations are either inherited or acquired after birth. Presumably, they permit further mutations, which in turn allow uncontrolled division, lack of attachment, and metastasis to distant organs. Normal cells will commit cell suicide (apoptosis) when they are no longer needed. Until then, they are protected from cell suicide by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these protective pathways are mutated in a way that turns them permanently "on", rendering the cell incapable of committing suicide when it is no longer needed. This is one of the steps that cause cancer in combination with other mutations. Normally, the PTEN protein turns off the PI3K/AKT pathway when the cell is ready for cell suicide.
In some breast cancers, the gene for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and the cancer cell does not commit suicide. Mutations that can lead to breast cancer have been experimentally linked to estrogen exposure. Failure of immune surveillance, the removal of malignant cells throughout one's life by the immune system, is another contributing factor. Abnormal growth factor signalling in the interaction between stromal cells and epithelial cells can facilitate malignant cell growth.
The extensive use of DNA microarray technology in the characterization of the cell transcriptome is leading to an ever
increasing amount of microarray data from cancer studies. Although similar questions for the same type of cancer are addressed in different studies, a comparative analysis of their results is hampered by the use of heterogeneous microarray platforms and analysis methods. Gene expression profiling by DNA microarrays has become an important tool for studying the transcriptome of cancer cells, and has been successfully used in many studies of tumour classification and of identification of marker genes associated with cancer. With an increasing number of microarray data becoming available, the comparison of studies with similar research goals, to identify genes being differentially expressed in normal versus tumour tissue, has gained high importance. In general, the evaluation of multiple data sets promises to yield more reliable and more valid results since these results are based on a larger number of samples and the effects of individual study-specific biases are weakened. However, the comparison of results from different microarray studies is hampered by the fact that different studies use different protocols, microarray platforms and analysis techniques. The question whether the results of gene expression measurements obtained by different platforms can be compared has been addressed in several studies. It has been found that results derived from the measurements like lists of tumour subtype marker genes or measures of intra-study correlation of gene expression patterns can be compared and thus intervalidated between different platforms. However, the measures of gene expression themselves could not be directly compared between different platforms. Some studies propose methods for meta-analysis of microarray data with the goal to identify significantly differentially expressed genes across studies by using statistical techniques that avoid the direct comparison of gene expression values. 
The goal of this study is to investigate the benefit of performing supervised classification analyses across disparate sources of microarray data. Methods of supervised classification analysis render it possible to automatically build classifiers that distinguish among specimens on the basis of predefined class label information (phenotypes), and in many cancer research studies the application of these methods has shown promising results of improved tumor diagnosis and prognosis.
Objectives:
1. To identify and retrieve microarray data for breast tumor cells.
2. To cluster tumorous breast cells based on the expression profiles of their genes.
3. To apply clustering algorithms such as Self-Organizing Maps (SOM) and K-means.
4. To compare and analyze the expression patterns on the basis of similarity.
REVIEW OF LITERATURE
Worldwide, breast cancer comprises 22.9% of all non-skin cancer incidence among women, making it the most common cause of cancer death. In 2008, breast cancer caused 458,503 deaths worldwide. The World Cancer Research Fund estimated that 38% of breast cancer cases in the US are preventable through reducing alcohol intake, increasing physical activity levels and maintaining a healthy weight. Smoking tobacco also increases risk of breast cancer. In the United States, 10 to 20 percent of patients with breast cancer and patients with ovarian cancer have a first- or second-degree relative with one of these diseases.
Mutations in either of two major susceptibility genes, breast cancer susceptibility gene 1 (BRCA1) and breast cancer susceptibility gene 2 (BRCA2), confer a lifetime risk of breast cancer of between 60 and 85 percent and a lifetime risk of ovarian cancer of between 15 and 40 percent. However, mutations in these genes account for only 2 to 3 percent of all breast cancers.
Breast cancer cell lines provide a useful starting point for the discovery and functional analysis of genes involved in breast cancer. Thirty-eight established breast cancer cell lines were studied to determine recurrent genetic alterations and the extent to which these cell lines resemble uncultured tumors. DNA copy number profiles were generated by comparative genomic hybridization (CGH) for most of the publicly available breast cancer cell lines (Kallioniemi et al., 2000).
Immunotherapy approaches to fight cancer are based on the principle of mounting an immune response against a self-antigen expressed by the tumor cells. In order to reduce potential autoimmune side effects, the antigens used should be as tumor-specific as possible. A complementary approach to experimental tumor antigen discovery is to screen the human genome in silico, particularly the databases of "Expressed Sequence Tags" (ESTs), in search of tumor-specific and tumor-associated antigens. The public databases currently provide a massive number of ESTs from several hundred cDNA tissue libraries, including tumoral tissues of various types. A novel method of EST database screening was described that allows new potential tumor-associated genes to be efficiently selected. The resulting list of candidates is enriched in known genes described as being expressed in tumor cells (Vinals et al., 2001).
Tracing the ways in which molecular biology has affected the clinical management of breast cancer is very challenging. One technique can be used to compare the expression patterns of thousands of genes between different cell types.
Cancer researchers use a tumor's gene expression profile to distinguish it from other tumor types. Only 5% of patients with a good profile developed metastasis by 5 years after initial therapy and 15% developed metastasis by 10 years after therapy. In multivariate analysis, the gene expression signature was the strongest predictor of metastasis-free survival. This gene expression profile may be used in tailoring therapy for breast cancer and it can greatly reduce the
number of patients who would receive unnecessary adjuvant systemic treatment (Kristine, 2002).
Sequence analysis of individual targets is an important step in annotation and validation. Investigation of the human breast cancer BCA3 gene was carried out with LION Target Engine and with other bioinformatics tools. LION Target Engine confirmed that the BCA3 gene is located at 11p15.4. A significant number of new orthologs were also identified, and these were the basis for a high-quality protein secondary structure prediction. Sequence conservation from multiple sequence alignment (MSA), splice variant identification (SVI), secondary structure prediction and predicted phosphorylation sites suggested that the removal of interaction sites through alternative splicing might play a modulatory role in BCA3 (Canaves and Leon, 2003).
Genome-wide monitoring of gene expression using DNA microarrays represents one of the latest breakthroughs in experimental biology. As microarray analysis emerges from its infancy, there is widespread hope that microarrays will significantly improve our ability to explore the genetic changes associated with cancer etiology and development and ultimately lead to the discovery of new biomarkers for disease diagnosis (Macgregor, 2003).
Breast cancer is a heterogeneous disease whose evolution is difficult to predict by using classic histoclinical prognostic factors. Prognostic classification can benefit from molecular analysis such as large-scale expression profiling. Hierarchical clustering (HCL) identified relevant clusters of co-expressed proteins and clusters of tumors. Protein expression profiling may be a clinically useful approach to assess breast cancer heterogeneity (Bertucci et al., 2004).
Inflammatory breast cancer (IBC) is a rare but aggressive form of breast cancer with a five-year survival limited to 40%. Diagnosis based on clinical/pathological criteria may be difficult.
Gene expression profiling has the potential to contribute to a better understanding of IBC and to provide new diagnostic and predictive factors for IBC as well as potential therapeutic targets (Jones et al., 2004).
Correlating risk factors with genomic data promises to enable individualized treatment for patients, but it requires interpretation of complex, multivariate patterns in gene expression data, as well as assessment of their ability to improve clinical predictions. DNA microarray data from samples of primary breast tumors were analyzed using non-linear statistical analysis to assess multiple patterns of interactions of groups of genes that have predictive value for the individual patient with respect to lymph node metastasis and cancer recurrence. Aggregate patterns of gene expression (metagenes) were identified with about 90% accuracy (Potamias et al., 2004).
Publicly available human genomic sequence data provide an unprecedented opportunity for researchers to decode the functionality of the human genome. The Cancer Genome Anatomy Project (CGAP) and Gene Expression Omnibus (GEO) are two bioinformatics infrastructures for studying functional genomics. The feasibility of incorporating Internet-available bioinformatics databases to discover human breast cancer-related genes was explored. Several tools, including Gene Finder, Virtual Northern and the SAGE digital gene expression displayer (DGED), were used to analyze differential gene expression between benign and malignant breast tissue. A pilot study was performed using both expressed sequence tags (ESTs) and SAGE to analyze the expression of a panel of known genes, including the high-abundance genes actin and G3PDH, BRCA1 and p53, and two breast cancer-related genes, Her2/neu and MUC1. The analysis produced 53 differentially expressed genes under screening criteria of a greater than 5-fold difference and p < 0.01. Combining multiple high-throughput analyses was an effective data mining strategy in cancer gene identification (Shen et al., 2005).
Characteristic patterns of gene expression have emerged, reflecting molecular differences among different types of cancer.
The application of microarray-based technologies in the investigation of molecular markers has evolved from descriptive biological definitions of the tumor landscape to more functional and mechanistic studies with correlations to clinically important traits within the field of breast cancer management. Array-based molecular profiling represents a technological breakthrough in how biological samples, for instance tumor specimens, can be analyzed (Ingrid et al., 2006).
The identification of changes in gene expression in multifactorial diseases such as breast cancer is a major goal of DNA microarray experiments. Consideration of a gene's behaviour in a wide variety of experiments can improve the statistical reliability of identifying genes with moderate changes between samples. The approach was evaluated via the re-identification of breast cancer-specific genes. It successfully prioritized several genes associated with breast tumors for which the expression difference between normal and breast cancer cells was marginal and would have been difficult to recognize using conventional analysis methods (Yang et al., 2006).
Genome-wide expression microarray studies have revealed that the biological and clinical heterogeneity of breast cancer can be partly explained by information embedded within a complex but ordered transcriptional architecture. Comprising this architecture are gene expression networks, or signatures, reflecting biochemical and behavioral properties of tumors that might be harnessed to improve disease subtyping, patient prognosis and prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate knowledge of pathways and other biological phenomena in the signature discovery process are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic mechanisms that better inform therapeutic options (Miller and Edison, 2007).
Gene expression measurements from breast cancer (BRCA) tumors are established clinical predictive tools to identify tumor subtypes, identify patients showing poor/good prognosis and identify patients likely to have disease recurrence. Expression data from 9 published microarray studies examining estrogen receptor negative (ER-) BRCA tumor cases were analyzed using the Oncomine database. Some of the genes identified correspond to key transcripts previously seen in array studies, while others are newly defined.
Many of the genes identified as overexpressed in ER- tumors were previously identified as expression markers for neoplastic transformation in multiple human cancers (David et al., 2008).
Germline mutations of the highly penetrant BRCA1 and BRCA2 genes have been associated with hereditary breast cancer risk, while polymorphic variants of the two genes still have an unknown role in breast pathogenesis. The aim of the study was to characterize polymorphic variants of the BRCA1 and BRCA2 genes in familial breast cancer. A total of 110 patients affected by familial breast and/or ovarian cancer were consecutively enrolled according to family history and BRCA mutation risk. All of them were screened for BRCA1 and BRCA2 pathogenic mutations, SNPs and intronic variants. In silico analyses were also performed using different computational methods to identify genetic variations that can alter the expression and function of the two genes. BRCA1 was mutated in 14% and BRCA2 in 3% of cases, while 80% of patients presented at least one polymorphism. A neural network splicing prediction model identified one BRCA1 and one BRCA2 intronic variant able to cause alternative splicing. Furthermore, Q356R BRCA1 and N289H BRCA2 appear to have a possibly harmful role, also owing to their location in functional regions of the two genes. However, in silico data are not always consistent with biological evidence. In conclusion, the SNP profile provides a basis for DNA-based cancer risk classification and helps to define the gene alterations that could influence the biochemical activity of the protein or modify drug sensitivity (Tommasi et al., 2008).
Cancer patients make antibodies to tumor-derived proteins that are potential biomarkers for early detection. To detect antibodies to tumor antigens in patient sera, novel high-density custom protein microarrays (NAPPA) were adapted. They were probed with sera from patients with early-stage breast cancer and healthy women. Custom in situ protein microarrays can be used to detect serum tumor antigen-specific antibodies and enable the rapid, simultaneous detection of immunogenic tumor antigens from patient sera.
These autoantibodies were being evaluated as potential biomarkers for the early diagnosis of breast cancer (Anderson et al., 2009).
Detection of circulating tumor cells (CTC) may provide diagnostic and prognostic information in breast cancer patients. Deregulation of miRNAs is frequent in tumor progression. A phase I preclinical study was performed by means of computational tools for miRNA profiling, including MIRGATOR, MIRBASE, SMIRNAdb, Gene HUB-GEPIS and MICRORNA.ORG. The computational tools identified a set of miRNAs, and the bioinformatics approach is
a useful high-throughput method to select associated miRNAs. The selected miRNAs should be further evaluated for CTC detection (Calvo et al., 2009).
Estrogen receptor positive breast cancer can be prevented by tamoxifen. Microarray technology has identified genomic classifiers for breast cancer prognosis and prediction. Meta-analysis of microarray data was conducted to identify individual genes associated with the estrogen receptor (ER) status of breast cancer (Holko et al., 2009).
Breast carcinoma is one of the most common causes of cancer-related death. Serial Analysis of Gene Expression (SAGE) is a comprehensive profiling method that allows for global, unbiased and quantitative characterization of transcriptomes. Four normal breast tissues, eleven primary breast tumors and three breast metastatic tissues were retrieved and the data were analyzed by Correspondence Analysis, Hierarchical Clustering, Support Tree and Significance Analysis of Microarray. Comprehensive analysis of whole transcriptomes of primary breast cancer tissues compared with normal breast tissue and breast cancer metastatic tissues revealed that clinically disparate outcomes could be linked to a relatively small number of transcripts (Margossian et al., 2009).
The ability to compare genome-wide expression profiles in human tissue samples has the potential to add an invaluable molecular pathology aspect to the detection and evaluation of multiple diseases. Applications include initial diagnosis, evaluation of disease subtype, monitoring of response to therapy and the prediction of disease recurrence. The derivation of molecular signatures that can predict tumor recurrence in breast cancer has been a particularly intense area of investigation, and a number of studies have shown that molecular signatures can outperform currently used clinicopathologic factors in predicting relapse in this disease.
However, many of these predictive models have been derived using relatively simple computational algorithms and whether these models are at a stage of development worthy of large-cohort clinical trial validation is currently a subject of debate. In this review, we focus on the derivation of optimal molecular signatures from high-dimensional data and discuss some of the expected future developments in the field (Goodison et al., 2010).
Breast tumors consist of several different tissue components. Despite this heterogeneity, most gene expression analysis has traditionally been performed without prior microdissection of the tissue sample. Thus the gene expression profiles obtained reflect the mRNA contribution from the various tissue components. The differentially expressed genes were validated in independent gene expression data from a set of laser capture microdissected invasive ductal carcinomas (Hyat et al., 2010).
The receptor tyrosine kinase HER2 is an oncogene amplified in invasive breast cancer, and its overexpression in mammary epithelial cell lines is a strong determinant of a tumorigenic phenotype. Accordingly, HER2-overexpressing mammary tumors are commonly indicative of poor prognosis in patients. Several quantitative proteomic studies have employed two-dimensional gel electrophoresis in combination with MS/MS, which provides only limited information about the molecular mechanisms underlying HER2/neu signaling. A SILAC-based approach was used to compare the proteomic profile of normal breast epithelial cells with that of Her2/neu-overexpressing mammary epithelial cells, isolated from primary mammary tumors arising in mouse mammary tumor virus-Her2/neu transgenic mice. Twenty-three proteins with relevant annotated functions in breast cancer were identified as showing substantial differential expression. These included overexpression of creatine kinase, retinol-binding protein 1, thymosin β4 and tumor protein D52, which correlated with the tumorigenic phenotype of Her2-overexpressing cells. The differential expression pattern of two genes, gelsolin and retinol-binding protein 1, was further validated in normal and tumor tissues. Finally, an in silico analysis of published cancer microarray data sets revealed a 23-gene signature, which can be used to predict the probability of metastasis-free survival in breast cancer patients (Pandey et al., 2010).
Genome-wide DNA methylation changes in a cell line model of breast cancer metastasis have been identified. The complex epigenetic changes that were observed, along with concurrent karyotype analysis, led to the hypothesis that complex genomic alterations in cancer cells are superimposed over promoter-specific methylation events that are responsible for the gene-specific changes observed in breast cancer metastasis (Wendy et al., 2010).
Aberrant miRNA activity has been reported in many diseases, and often numerous miRNAs are concurrently deregulated. There was coincident loss of expression of 6 miRNAs with metastatic potential in breast cancer. A computational method, miR-AT!, was used to investigate combinatorial activity among this group of miRNAs. A number of genes previously implicated in cancer metastasis are among the predicted combinatorial targets, including TGFB1, ARPC3 and RANKL (Sultana et al., 2011).
2.1 Cluster analysis of microarray information

Microarray experiments generate mountains of data, which have to be stored and analyzed. It is difficult for humans to handle very large numeric data sets. Therefore a general concept is to try to reduce the dimensionality of the data. A number of clustering techniques have been used to group genes based on their expression patterns. The user has to choose the appropriate method for each task. The basic concept in clustering is to identify and group together similarly expressed genes and then to correlate the observations with biology. The idea is that co-regulated and functionally related genes are grouped into clusters. Clustering provides the framework for this analysis. The hard part is to analyze the biological processes and consequences. However, clustering can be a very useful tool. The other side of the coin is the visualization of the information. Before clustering, a large number of requirements have to be met. The data have to be of good quality, the chip design should be correct, the chip should contain interesting genes, sample preparation has to be flawless, and the data have to be properly treated (outliers, normalization etc.). Clustering does not improve the quality of the data. If poor-quality data are used, the outcome is useless as well.
2.2 Principles of clustering

Clustering organizes the data into a small number of (relatively) homogeneous groups. Usually the expression values are normalized first. At this stage, it is of interest to look at the changes in expression patterns, not to follow the actual numeric changes. Thus, the methods are used to find similar expression motifs irrespective of the expression level. Therefore, both low and high expression level genes can end up in the same cluster if the expression profiles are correlated by shape. The majority of clustering methods have been
available for a long time. During the last few years, some new methods dedicated to microarray analysis have been developed, too. The methods can be described and classified in different ways. The most commonly used methods are described here. Analysis methods can be grouped as supervised and unsupervised. Supervised methods assign some predefined classes to a data set, whereas in unsupervised methods no prior assumptions are applied. Hierarchical clustering, K-means, self-organizing maps (SOMs), and principal component analysis (PCA) have been commonly used. There are also other methods, such as multidimensional scaling (MDS), minimum description length (MDL), gene shaving (GS), decision trees, and support vector machines (SVMs).
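To illustrate the point made above that low and high expression level genes can fall into the same cluster when their profiles are correlated by shape, the following minimal sketch standardizes each profile and compares profiles with a correlation-based distance. The function names and the three example profiles are hypothetical and chosen only for illustration; a real analysis would typically rely on an established statistics package.

```python
import math

def standardize(profile):
    """Centre a profile on its mean and scale it to unit standard deviation."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if sd == 0:
        return [0.0] * n  # flat profile: there is no shape to compare
    return [(x - mean) / sd for x in profile]

def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

def correlation_distance(a, b):
    """1 - r: near 0 for identical shapes, near 2 for opposite shapes."""
    return 1.0 - pearson(a, b)

# Hypothetical profiles over four conditions (illustrative values only).
low_gene = [1.0, 2.0, 3.0, 2.0]
high_gene = [100.0, 200.0, 300.0, 200.0]  # same shape, 100 times the level
opposite_gene = [3.0, 2.0, 1.0, 2.0]      # mirror-image shape

d_same = correlation_distance(low_gene, high_gene)
d_opp = correlation_distance(low_gene, opposite_gene)
print(d_same, d_opp)
```

With this distance the low-level and high-level genes are essentially indistinguishable, whereas a plain Euclidean distance on the raw values would place them far apart.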
2.3 Hierarchical clustering

Hierarchical clustering is a statistical method for finding relatively homogeneous clusters. The hierarchical clustering algorithm either iteratively joins the two closest clusters starting from single clusters (agglomerative, bottom-up approach) or iteratively partitions clusters starting from the complete set (divisive, top-down approach). After each step, a new distance matrix between the newly formed clusters and the other clusters is calculated. If there are N cases, N-1 clustering steps are needed. There are several methods of hierarchical cluster analysis, including single linkage (minimum method, nearest neighbor), complete linkage (maximum method, furthest neighbor) and average linkage (UPGMA). For a set of N genes to be clustered, and an NxN distance (or similarity) matrix, the hierarchical clustering is performed as follows:
1. Assign each gene to a cluster of its own.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all genes have been merged into a single cluster.
Step 3 can be performed in different ways depending on the chosen approach. In single linkage clustering, the distance between one cluster and another is considered to be equal to the shortest distance from any member of one cluster to any member of the other cluster. In complete linkage clustering, the distance between one cluster and another cluster is considered to be equal to the longest distance from any member of one cluster to any member of the other cluster. In average linkage, the distance between one cluster and another cluster is considered to be equal to the average of the distances between every member of one cluster and every member of the other cluster. The hierarchical clustering can be represented as a tree, or a dendrogram. Branch lengths represent the degree of similarity between the genes. The method does not provide clusters as such. Conceptually, different clusters and sizes of clusters can be obtained by moving along the trunk or branches of the tree and deciding at which level to cut the tree. Hierarchical clustering is often applied in the analysis of patient samples to organize the data based on the cases. Indeed, for patient samples the hierarchical clustering method is most often the best option. Usually, in addition to patient-based organization, genes are also clustered by applying two-way clustering. On one axis are the samples (patients) and on the other axis the genes.
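The agglomerative procedure and the three linkage rules described in this section can be sketched as follows. This is a minimal illustration, not an optimized implementation: it assumes Euclidean distance between profiles, recomputes every cluster-to-cluster distance from the original pairwise distances instead of updating a cluster-level matrix, and uses hypothetical function names and toy profiles.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters given as lists of gene indices."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":      # shortest member-to-member distance
        return min(pairs)
    if linkage == "complete":    # longest member-to-member distance
        return max(pairs)
    if linkage == "average":     # mean of all member-to-member distances
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(profiles, linkage="average"):
    """Return the N-1 merges as (members_a, members_b, distance) tuples."""
    n = len(profiles)
    dist = [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]           # each gene starts alone
    merges = []
    while len(clusters) > 1:                     # repeat until one cluster
        best = None
        for a in range(len(clusters)):           # find the closest pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]       # merge the pair
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical profiles: genes 0 and 1 rise, genes 2 and 3 fall.
profiles = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1],
            [2.0, 1.0, 0.0], [2.1, 1.0, 0.1]]

merges = agglomerate(profiles, "average")
final = {name: agglomerate(profiles, name)[-1][2]
         for name in ("single", "complete", "average")}
for members_a, members_b, d in merges:
    print(members_a, members_b, round(d, 3))
```

The list of merges, read in order, is the information a dendrogram displays; cutting the tree at a chosen distance corresponds to stopping the merges once that distance is exceeded. On the final merge, the single linkage distance is the smallest of the three and the complete linkage distance the largest, as the definitions imply.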
2.4 Self-organizing map

Kohonen's self-organizing map (SOM) is a neural network that uses unsupervised learning, for which no prior knowledge of classes is required. SOMs are usually used to visualize and interpret large high-dimensional data sets. In a SOM, every input is connected to every output via connections with variable weights. The output nodes are also highly interconnected. The SOM tries to learn to map similar input vectors (gene expression profiles) to similar regions of the output array of nodes. The method maps the multidimensional distances of the feature space to two-dimensional distances in the output map. The SOM algorithm is iterative. The map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time, the models become ordered on the grid so that similar models are close to each other and dissimilar models are far from each other. The order and organization of the nodes (tentative clusters) therefore contain more information than just the partition of genes into clusters. In SOMs, the number of clusters has to be predetermined. The value is given by the dimensions of the two-dimensional grid or array.
One has to experiment with the actual number of clusters. Most programs facilitate the analysis of all the genes within a cluster or clusters. One should find array dimensions that minimize the number of poorly fitting genes in the clusters. The appropriate number of clusters is difficult to determine, but the square root of the number of genes is a reasonable initial estimate, although it depends on the data set.
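To make the iterative training concrete, the following is a minimal sketch of a SOM on a small rectangular grid. It is an illustration under stated assumptions, not the algorithm as implemented in any particular package: the Gaussian neighbourhood, the linearly decaying learning rate and radius, the toy profiles and the names `train_som`, `best_matching_unit` and `initial_cluster_estimate` are all choices made for this example.

```python
import math
import random

def initial_cluster_estimate(n_genes):
    """Square-root rule of thumb for a starting number of clusters."""
    return round(math.sqrt(n_genes))

def best_matching_unit(nodes, x):
    """Grid position of the node whose model vector is closest to x."""
    return min(nodes,
               key=lambda pos: sum((w - v) ** 2 for w, v in zip(nodes[pos], x)))

def train_som(profiles, rows, cols, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    # One model vector per grid node, initialised from randomly chosen profiles.
    nodes = {(r, c): list(rng.choice(profiles))
             for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay                       # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)     # so does the neighbourhood
        for x in rng.sample(profiles, len(profiles)):
            bmu = best_matching_unit(nodes, x)
            for pos, w in nodes.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                # Pull the node (and, more weakly, its grid neighbours) toward x.
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return nodes

# Toy expression profiles (hypothetical values).
profiles = [[0.1, 0.2, 0.1], [0.0, 0.3, 0.2], [1.0, -1.0, 0.9], [1.1, -0.9, 1.0]]
nodes = train_som(profiles, rows=3, cols=3)    # 3 x 3 grid gives 9 tentative clusters
assignment = [best_matching_unit(nodes, p) for p in profiles]
```

Because neighbouring nodes are updated together, nearby grid positions end up holding similar model vectors, which is why the position of a cluster on the map carries information.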
2.5 K-means clustering

K-means is a least-squares partitioning method for which the number of groups, K, has to be provided. The algorithm computes cluster centroids, uses them as new cluster seeds, and assigns each object to the nearest seed. It is also possible to estimate K from the data by treating the task as a mixture density estimation problem. The procedure is as follows:
1. The genes are arbitrarily divided into K groups, and the reference vector, i.e. the location of each centroid, is calculated.
2. Each gene is examined and assigned to the cluster whose centroid is nearest.
3. The positions of the centroids are recalculated.
4. Steps 2 and 3 are repeated until the assignment of genes to the K clusters no longer changes.
During the course of the iterations, the program tries to minimize the sum, over all groups, of the squared within-group residuals, which are the distances of the objects to the respective group centroids. Convergence is reached when the objective function (i.e. the residual sum of squares) cannot be lowered any more. The obtained groups are geometrically as compact as possible around their respective centroids. K-means partitioning is an NP-hard problem (no algorithm is known that solves it in polynomial time), and the iterative procedure only reaches a local minimum, so there is no guarantee that the absolute minimum of the objective function has been reached. Therefore, it is good practice to repeat the analysis several times using randomly selected initial group centroids and to check whether these analyses produce comparable results.
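The procedure and the advice to repeat it from several random starts can be sketched as follows. This is a minimal illustration rather than the Genesis implementation; as an assumption of the sketch, the initial centroids are seeded from randomly chosen genes instead of from an arbitrary partition, and the names `kmeans` and `kmeans_with_restarts` are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, rng, max_iter=100):
    """One K-means run; returns (labels, centroids, residual sum of squares)."""
    centroids = [list(p) for p in rng.sample(profiles, k)]    # step 1 (seeding)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every gene to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in profiles]
        if new_labels == labels:        # converged: assignments are stable
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:                 # an emptied cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    rss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(profiles, labels))
    return labels, centroids, rss

def kmeans_with_restarts(profiles, k, n_starts=10, seed=0):
    """Repeat K-means from random starts and keep the lowest-RSS solution."""
    rng = random.Random(seed)
    runs = [kmeans(profiles, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Toy profiles forming two compact groups (hypothetical values).
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, rss = kmeans_with_restarts(profiles, k=2)
```

Comparing the residual sum of squares across the restarts gives a quick check on whether the runs produce comparable results.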
MATERIALS AND METHODS

Progress in microarray data generation for breast cancer now provides considerable resources that may be used for in silico analysis of gene expression and for defining genes that are important in the development of breast cancer. This information can further be used to identify and validate novel drug targets. A further advance is the availability of computational tools that can be used for the analysis of microarray data. Numerous databases are now available which contain microarray data for various kinds of cancers. Most of these are accessible over the Internet through convenient Web browser interfaces. Many also permit downloading of microarray data. The work for the present study can be summarized in the following steps:
1. Organism/disease: breast cancer.
2. Operating systems: Windows and Linux.
3. Tools and software: available online/offline, e.g. Genesis, Bioinformatics Research Tools, etc.
4. Research hub resources: SMD, PMC, etc.
5. Retrieval of microarray data: datasets prepared using different databases.
6. Analysis of microarray data using different clustering techniques, for example hierarchical clustering, K-means clustering, self-organizing maps and principal component analysis.
To identify the genes responsible for the occurrence of breast cancer with the help of microarray technology, gene expression data from wet labs are collected and maintained in online databases, to be worked on further with bioinformatics tools and techniques. The objectives can be met by applying tools that help to demarcate the genes of interest and by studying their expression levels. The microarray-based approaches used are enumerated below.
METHOD
Algorithm for clustering analysis:
1. Raw data were collected and downloaded from SMD.
2. Five experiments were taken, and their log-transformed ratios were arranged in a separate Excel file for 5000 genes, with all other parameters, such as clone ID, cluster ID and accession number, kept the same.
3. The data sets were uploaded into the GENESIS software.
4. K-means clustering and self-organizing maps were generated.
5. Expression profile images were generated.
6. Clusters of gene expression were compared based on their similarity.
7. The number of co-expressed genes was counted manually.
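The comparison of clusters and the counting of shared genes were done manually in this study. Purely as an illustration, the same count could be obtained by intersecting cluster memberships. In the sketch below the gene identifiers and memberships are invented placeholders, not data from the study, and `shared_genes` and `overlap_table` are hypothetical helper names.

```python
def shared_genes(cluster_a, cluster_b):
    """Gene IDs present in both clusters, sorted for stable output."""
    return sorted(set(cluster_a) & set(cluster_b))

def overlap_table(som_clusters, kmeans_clusters):
    """Number of shared genes for every (SOM cluster, K-means cluster) pair."""
    return {(s, k): len(shared_genes(s_genes, k_genes))
            for s, s_genes in som_clusters.items()
            for k, k_genes in kmeans_clusters.items()}

# Placeholder memberships (hypothetical, for illustration only).
som_clusters = {"S1": ["g1", "g2", "g3"], "S2": ["g4", "g5"]}
kmeans_clusters = {"K1": ["g1", "g4", "g5"], "K2": ["g2", "g3"]}

counts = overlap_table(som_clusters, kmeans_clusters)
for pair, n in counts.items():
    print(pair, n)
```

Pairs with a high count relative to the cluster sizes are candidates for clusters that the two methods recovered in common.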
3.1 Software for clustering and visualization
Several software packages are available for clustering and visualization of gene expression patterns. Many programs are freely available on the Internet. Both the GeneSpring and Kensington packages available at CSC contain several options for clustering. Despite the initial time required to learn these programs, they provide some benefits, mainly because several analyses can be done within a single package. These packages are not inherently better than freely available programs, although they might have a more user-friendly interface.
Table 3.1: Software for cluster analysis and visualization.
Program | Author | Platform | Description
Genesis | Alexander Sturn (2000) | Windows, Linux | Clustering, visualization
Cluster (Version 2.11) | Michael Eisen (1998) | Windows | Performs hierarchical clustering, self-organizing maps
SAM (Version 2.0) | Rob Tibshirani (2005) | Excel Add-in | Significance Analysis of Microarrays: supervised learning
ScanAlyze (Version 2.5.3) | Michael Eisen (2002) | Windows | Processes fluorescent images of microarrays
TreeView (Version 1.6) | Michael Eisen (2002) | Windows | Graphical results of analysis from Cluster
Expression Profiler (Version 1.0) | EBI (2007) | Web | Analysis and clustering of gene expression data
GeneCluster (Version 2.0) | Whitehead Institute/MIT (2004) | Java | Self-organizing maps
J-Express (Version 7.8.1, 7.8.2, 7.8.3) | MolMine (2011) | Java | Clustering and visualization
SMD (Stanford Microarray Database)

1. The SMD (http://www.smd.Stanford.EDU/) project stores raw and normalized data from microarray experiments, as well as their corresponding image files.
2. In addition, SMD provides interfaces for data retrieval, analysis and visualization. SMD is funded by the National Cancer Institute at the US National Institutes of Health, the National Science Foundation, and the Howard Hughes Medical Institute. The database is a joint project of the Departments of Biochemistry and Genetics at the School of Medicine, Stanford University.
3. Already normalized data for breast cancer cell lines were therefore obtained for gene expression analysis. The details of the data are given below.
3.3 Normalization
1. Normalization refers to computational data transformations intended to remove certain systematic biases from microarray data, such as dye effects, intensity dependence and spatial or print-tip effects. (In this context, it does not necessarily have anything to do with the normal or Gaussian distribution.)
2. The normalization constant is calculated by the database (Stanford Microarray Database). The first step is to select good spots on which to base the normalization. The threshold value is initially set to > 0.65. If fewer than 10% of the spots in the print pass this criterion, the program uses > 0.60. If fewer than 10% of the spots in the print pass the 0.60 threshold, the program uses > 0.55. All spots that pass the 0.55 threshold are used in the normalization calculation, regardless of how many there are. If at least 10% of the spots pass a threshold, the program uses those passing spots in the calculation and does not try a lower threshold value.
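The cascading choice of threshold described in item 2 can be written as a short function. This is a sketch of the stated rule only, not SMD's code: the text does not name the per-spot quantity that is compared with the threshold, so `scores` is a hypothetical list of per-spot quality values, and `select_good_spots` is a hypothetical name.

```python
def select_good_spots(scores, thresholds=(0.65, 0.60, 0.55), min_fraction=0.10):
    """Return (threshold used, indices of the spots that pass it)."""
    for i, threshold in enumerate(thresholds):
        passing = [idx for idx, s in enumerate(scores) if s > threshold]
        is_last = i == len(thresholds) - 1
        # Accept this threshold if enough spots pass; the last threshold
        # is always accepted, however few spots pass it.
        if is_last or len(passing) >= min_fraction * len(scores):
            return threshold, passing

# 20 hypothetical spots: only 1 passes 0.65 (5%), but 4 pass 0.60 (20%).
scores = [0.70, 0.62, 0.62, 0.62] + [0.50] * 16
threshold, good = select_good_spots(scores)
print(threshold, good)
```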
3.1. Normalized microarray data analysis
Total intensity normalization relies on the assumption that most genes do not respond to experimental conditions, so the average log ratio on the array should be zero. A single, global, multiplicative adjustment is performed so that the average log ratio is zero for well-measured spots. All spots are normalized using the same constant, regardless of whether they were used in the calculation. In the database, this is done by dividing the raw value of every intensity by the normalization constant (Fig. 3.1).
Fig.3.1: Normalization of experiments by log transformation ratios.
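The global adjustment described above can be illustrated with a small calculation. The sketch assumes base-2 logarithms and a hypothetical list of raw ratios; `normalization_constant` and `normalize` are illustrative names, and the constant is taken to be the value whose removal makes the mean log ratio of the well-measured spots equal to zero.

```python
import math

def normalization_constant(ratios, good_spots):
    """Constant that centres the mean log2 ratio of the good spots on zero."""
    logs = [math.log2(ratios[i]) for i in good_spots]
    return 2 ** (sum(logs) / len(logs))

def normalize(ratios, good_spots):
    """Divide every raw ratio by the same constant, good spot or not."""
    c = normalization_constant(ratios, good_spots)
    return [r / c for r in ratios]

raw = [2.0, 4.0, 8.0, 1.0]       # hypothetical raw ratios
good = [0, 1, 2]                 # indices of well-measured spots
normalized = normalize(raw, good)
mean_log = sum(math.log2(normalized[i]) for i in good) / len(good)
print(normalized, mean_log)
```

Dividing by one constant is equivalent to subtracting one offset from every log ratio, which is why the adjustment is both global and multiplicative.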
3.3 Centralization
1. Centralization is the process of moving a distribution so that it is centered over the expected mean.
2. For log-transformed intensity ratios, an intensity-dependent centralization might help to correct the dye bias.
3.2 Centroidal K-means clustering
Here each gene is represented by a series of ratio values that are relative to the expression level of that gene in the reference sample. Since the reference sample is usually independent of the experiment, it is preferable for the analysis to be independent of the gene expression observed in the reference sample, and this is exactly what mean and/or median centering achieves. After applying this procedure, the values of each gene reflect the variation from some property of the series of observed values, such as the mean or median (Fig. 3.2).
Fig. 3.2: Centroidal K-means clustering
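Mean and median centering of a single gene can be shown in a few lines. The sketch below uses hypothetical log ratios, and `center_gene` is an illustrative name rather than a function of any package mentioned here.

```python
from statistics import mean, median

def center_gene(values, how="mean"):
    """Subtract the gene's own mean (or median) from each of its values."""
    if how == "mean":
        reference = mean(values)
    elif how == "median":
        reference = median(values)
    else:
        raise ValueError(f"unknown centering: {how}")
    return [v - reference for v in values]

log_ratios = [1.0, 2.0, 6.0]                      # one gene, three experiments
mean_centered = center_gene(log_ratios, "mean")
median_centered = center_gene(log_ratios, "median")
print(mean_centered, median_centered)
```

After centering, a gene's values describe how each experiment deviates from that gene's typical level, so the expression level in the reference sample no longer affects the comparison.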
3.4 Genesis
1. Genesis is a versatile and transparent software suite for large-scale gene expression cluster analysis (Release: Genesis 1.7.6, Date: 2010-09-20; Genesis Server: Release 1.1.0).
2. The software enables data import and visualization, data normalization, and clustering using: (1) Hierarchical Clustering, (2) k-means, (3) Self Organizing Maps, (4) Principal Component Analysis and (5) Support Vector Machines.
3. An important and valuable feature of this software is the ability to calculate and compare clustering results from different algorithmic approaches. For example, one can begin with Hierarchical Clustering to get a first impression of the number of patterns hidden in the dataset and then use this information to adjust the parameters for k-means and SOM clustering.
4. PCA can be used to visualize these clusters in 3D space, to get an impression of cluster size, integrity and distribution, and to retrieve the most significant patterns in a study. It can also reveal some information about the number of clusters in the dataset, provided that the data clouds of genes representing a cluster can be distinguished in the principal component space.
Clustering and visualization procedures were executed with this software (Fig. 3.3).
Fig.3.3 GENESIS Software
3.5 Visualization
1. Scatter plots are very useful for the initial analysis and comparison of data sets. It is customary to present the clustering results by grouping the genes of each cluster next to each other.
2. In SOM-based analysis, the order of the clusters also contains information about the relationships between clusters.
3. For the manual analysis of the goodness of clustering, one has to look at the expression patterns within the generated clusters.
4. Genes within a cluster should follow the average expression pattern of the cluster. Because every gene has to be assigned to a cluster, genes with unique expression patterns do not fit well in any group.
RESULTS AND DISCUSSION

4.1 Based on the raw data, the 5000 genes were divided into 9 different clusters by K-means clustering. The distributed genes (Fig. 4.1) showed distinct expression patterns: some genes were under-expressed and some were over-expressed.
Fig. 4.1: Expression profile for K-means clustering
4.2 Single cluster image for K-means clustering: The cluster image for a particular K-means cluster shows the expression of genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN columns denote the log-transformed ratios for the five different experiments, while all other parameters, such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession Number, remain the same for all five experiments (Fig. 4.2).
Fig. 4.2: Single cluster image for K-means clustering
4.3 Expression profile for Self Organizing Map (SOM) clustering:
Based on the raw data, the 5000 genes were divided into 9 different clusters by the Self Organizing Map. The distributed genes (Fig. 4.3) showed distinct expression patterns: some genes were under-expressed and some were over-expressed.
Fig. 4.3: Expression profile for Self Organizing Map (SOM) clustering.
4.4 Single cluster image for SOM clustering: The cluster image for a particular SOM cluster shows the expression of genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN columns denote the log-transformed ratios for the five different experiments, while all other parameters, such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession Number, remain the same for all five experiments (Fig. 4.4).
Fig. 4.4: Single cluster image for Self Organizing Map
4.1 Distribution of genes in different clusters using the SOM and K-means clustering algorithms: After the genes were clustered by K-means and by Self Organizing Maps, the distribution of the genes across the clusters was compared between the two algorithms (Table 4.1).
Table 4.1: Distribution of genes in different clusters using the SOM and K-means clustering algorithms.
Self Organizing Map
Cluster | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9
Genes | 936 | 209 | 1052 | 40 | 49 | 1099 | 880 | 280 | 455
Genes (%) | 19 | 4 | 21 | 1 | 1 | 22 | 18 | 6 | 9
K-means
Cluster | K1 | K2 | K3 | K4 | K5 | K6 | K7 | K8 | K9
Genes | 455 | 766 | 537 | 679 | 435 | 587 | 358 | 581 | 602
Genes (%) | 9 | 15 | 11 | 14 | 9 | 12 | 7 | 12 | 12
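The percentages in Table 4.1 follow directly from the gene counts. The short check below recomputes them and confirms that both methods account for all 5000 genes; the dictionary and function names are illustrative only.

```python
som_counts = {"S1": 936, "S2": 209, "S3": 1052, "S4": 40, "S5": 49,
              "S6": 1099, "S7": 880, "S8": 280, "S9": 455}
kmeans_counts = {"K1": 455, "K2": 766, "K3": 537, "K4": 679, "K5": 435,
                 "K6": 587, "K7": 358, "K8": 581, "K9": 602}

def percentages(counts):
    """Share of genes in each cluster, rounded to the nearest whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}

som_pct = percentages(som_counts)
kmeans_pct = percentages(kmeans_counts)
print(sum(som_counts.values()), som_pct)
print(sum(kmeans_counts.values()), kmeans_pct)
```

The SOM distribution is markedly more uneven (1% to 22% per cluster) than the K-means distribution (7% to 15%).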
The expression of genes in clusters S4 and K5 (Figure 4.5) was found to be similar, and the genes were over-expressed, as shown by their similar pattern of peaks. These genes were expressed beyond the usual/expected degree, leading to disruption of the normal growth control processes of cells. Depending on the fluorescence ratios, over-expressed genes are assigned positive values, ranging upward from 0.
Figure 4.5: Comparison of Cluster no. 4 of SOM (S4) and Cluster no. 5 of K-means (K5) based on their expression profile.
The expression of genes in clusters S7 and K9 (Figure 4.6) was found to be similar, and most of the genes were under-expressed: they were not expressed to the usual/expected degree, leading to disruption of the normal growth control processes of cells. Depending on the fluorescence ratios, under-expressed genes are assigned negative values. The values may range from 0 to -1.
Figure 4.6: Comparison of Cluster no. 7 of SOM (S7) and Cluster no. 9 of K-means (K9) based on their expression profile.
The expression of genes in clusters S1 and K8 (Figure 4.7) was found to be similar; some genes were under-expressed while others were over-expressed, because some genes were transcribed into mRNA and some were not, leading to disruption of the normal growth control processes of cells.
Figure 4.7: Comparison of Cluster no. 1 of SOM (S1) and Cluster no. 8 of K-means (K8) based on their expression profile.
The expression of genes in clusters S8 and K1 (Figure 4.8) was found to be similar; some genes were under-expressed while others were over-expressed, because some genes were transcribed into mRNA and some were not, leading to disruption of the normal growth control processes of cells.
Figure 4.8: Comparison of Cluster no. 8 of SOM (S8) and Cluster no. 1 of K-means (K1) based on their expression profile.
STATISTICAL DATA FOR CLUSTER ANALYSIS:

From Table 4.1 and the cluster comparisons above, it was seen that the genes within individual clusters were highly co-expressed and that some clusters produced by K-means clustering and Self Organizing Maps (SOM) had the same expression profile image: (S1,K8), (S8,K1), (S7,K9) and (S4,K5). Since a single breast tumor cell sample was used, a one-sample t-test was conducted on the sizes of these clusters to test the effectiveness of the sample and to assess the accuracy of the distribution of genes.
Table 4.2. Similar clusters showing overexpressed and underexpressed genes:

SOM cluster | x | x^2
S1 | 936 | 876096
S4 | 40 | 1600
S7 | 880 | 774400
S8 | 280 | 78400
Total | 2136 | 1730496

K-means cluster | y | y^2
K1 | 455 | 207025
K5 | 435 | 189225
K8 | 581 | 337561
K9 | 602 | 362404
Total | 2073 | 1096215

For the SOM clusters, with n = 4 and population mean $\mu = 520$:
$\bar{x} = \frac{\sum x}{n} = \frac{2136}{4} = 534$
$s_x^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{1730496 - 4(534)^2}{4-1} = \frac{1730496 - 1140624}{3} = \frac{589872}{3} = 196624$
$s_x = \sqrt{196624} = 443.42$
$t_x = \frac{\bar{x} - \mu}{s_x/\sqrt{n}} = \frac{534 - 520}{443.42/\sqrt{4}} = 0.0315 \times 2 = 0.063$

For the K-means clusters, with n = 4 and $\mu = 520$:
$\bar{y} = \frac{\sum y}{n} = \frac{2073}{4} = 518.25$
$s_y^2 = \frac{\sum y^2 - n\bar{y}^2}{n-1} = \frac{1096215 - 4(518.25)^2}{4-1} = \frac{1096215 - 1074332.25}{3} = \frac{21882.75}{3} = 7294.25$
$s_y = \sqrt{7294.25} = 85.4$
$t_y = \frac{\bar{y} - \mu}{s_y/\sqrt{n}} = \frac{518.25 - 520}{85.4/\sqrt{4}} = -0.0205 \times 2 = -0.041$, so $|t_y| = 0.041$

Table value: $t_3(5\%) = 3.182$, for $n - 1 = 3$ degrees of freedom.

Since the calculated values of |t| are smaller than the table value at the 5% probability level with three degrees of freedom, it can be concluded that there is no significant difference between the sample means (534 and 518.25) and the population mean (520). In other words, the four cluster pairs (K5,S4), (K9,S7), (K8,S1) and (K1,S8) can be regarded as a random sample from a population whose average cluster size is 520 genes. In the statistical analysis of the distribution of under-expressed and over-expressed genes, both clustering methods gave t values close to zero and far below the table value. Therefore, the distributions of genes produced by the two clustering algorithms were considered comparable.
4.2. Genes present in clusters of similar expression profile and co-expressed: Comparing the results of SOM and K-means showed that the expression patterns of most of the clusters were almost the same and that these clusters also contained almost the same genes. The co-expressed genes selected after the comparative analysis of clusters showing similar expression profiles are listed in Table 4.3.
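The one-sample t statistics for the cluster sizes in Table 4.2 can be re-derived with a short script. This is a checking sketch under stated assumptions: the population mean of 520 is taken from the text, the test uses n - 1 = 3 degrees of freedom for the four clusters per method, the critical value 3.182 is the standard two-sided 5% value for 3 degrees of freedom, and `one_sample_t` is an illustrative name.

```python
import math

def one_sample_t(values, mu):
    """Return (mean, sample variance, t statistic) for a one-sample t-test."""
    n = len(values)
    mean = sum(values) / n
    variance = (sum(v * v for v in values) - n * mean ** 2) / (n - 1)
    t = (mean - mu) / (math.sqrt(variance) / math.sqrt(n))
    return mean, variance, t

MU = 520              # population mean stated in the text
T_CRITICAL = 3.182    # two-sided 5% point of Student's t, 3 degrees of freedom

som_sizes = [936, 40, 880, 280]        # S1, S4, S7, S8
kmeans_sizes = [455, 435, 581, 602]    # K1, K5, K8, K9

mean_x, var_x, t_x = one_sample_t(som_sizes, MU)
mean_y, var_y, t_y = one_sample_t(kmeans_sizes, MU)

for name, t in (("SOM", t_x), ("K-means", t_y)):
    verdict = "not significant" if abs(t) < T_CRITICAL else "significant"
    print(f"{name}: t = {t:.3f} ({verdict} at the 5% level)")
```

Note that the SOM sizes have a much larger variance than the K-means sizes, so a t value near zero here reflects the closeness of the means to 520 rather than similar cluster sizes.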
Table 4.3: Genes which were present in clusters of similar expression profile and co-expressed.
S.No | Gene No. | Gene Name
1 | 36 | Paxillin
2 | 55 | G-protein coupled receptor 19
3 | 67 | Homo sapiens cDNA FLI40901 fis, clone UTERU2003704
4 | 90 | Complement component 8, alpha polypeptide
5 | 146 | Damage-specific DNA binding protein 1, 127kDa
6 | 156 | Connective tissue growth factor
7 | 186 | Rhotekin
8 | 211 | Hypothetical protein MGC2747
9 | 212 | Homo sapiens HC6 (HC6), mRNA, complete cds
10 | 216 | Brain acyl-CoA hydrolase
11 | 231 | ESTs, weakly similar to A43932 mucin 2 precursor, intestinal, human (fragments) [Homo sapiens]
12 | 250 | Glycoprotein IX (platelet)
13 | 253 | Protein kinase C, theta
14 | 256 | Potassium voltage-gated channel, shaker-related subfamily, member 1 (episodic ataxia with myokymia)
15 | 261 | Nucleoporin 214 kDa
16 | 342 | Sperm associated antigen 11
17 | 347 | Microtubule-associated protein 1B
18 | 349 | Ribosomal protein L15
19 | 357 | JTV1 gene
20 | 368 | DKFZP566C134 protein
21 | 384 | Spindlin
22 | 391 | EST
23 | 419 | Synaptotagmin VI
24 | 431 | Cofilin 2 (muscle)
25 | 437 | BRCA1 associated RING domain 1
26 | 453 | Integrin, alpha 5 (fibronectin receptor, alpha polypeptide)
27 | 497 | Uncharacterized hematopoietic stem/progenitor cells protein MDS029
28 | 514 | Methyl-CpG binding domain protein
29 | 542 | ESTs
30 | 582 | Steroid sulfatase (microsomal), aryl sulfatase C, isozyme S
31 | 665 | Annexin A7
32 | 786 | ESTs, weakly similar to L1 repeat, Tf subfamily, member 18 [Mus musculus]
33 | 793 | Homo sapiens cDNA FL1 37399 fis, clone BRAMY 2027587
34 | 816 | ESTs
35 | 819 | Major Histocompatibility Complex, class I, C
36 | 827 | Zinc finger protein 177
37 | 862 | MAD, mothers against decapentaplegic homolog 3 (Drosophila)
38 | 892 | Homo sapiens, clone IMAGE:3610040, mRNA
39 | 904 | Copin-like protein
40 | 906 | Pleckstrin homology, Sec 7 and coiled/coil domains 4
41 | 912 | Sulfide dehydrogenase like (yeast)
42 | 929 | CLLL8 protein
43 | 941 | ESTs
44 | 972 | Cytochrome P450, subfamily XXVIA, polypeptide 1
45 | 1001 | Tripartite motif-containing 5
Clustering of the normalized data using K-means and Self Organizing Maps was carried out, and in both cases the entire dataset was clustered into nine different clusters based on the expression profile of genes in breast cancer. The data set was retrieved from the Stanford Microarray Database (SMD), Department of Pathology, Stanford University. Genesis was used to normalize the raw data and to cluster the genes with the clustering algorithms. In the case of SOM, clusters S1, S3, S6 and S7 had the largest numbers of genes, and the genes clustered in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of K-means, clusters K2, K4, K6 and K9 had the largest numbers of genes, and the genes clustered in K2, K4 and K9 were highly co-expressed. The expression of genes in some of the clusters was found to be similar; some genes were under-expressed while others were over-expressed. The over-expressed genes were expressed beyond the usual/expected degree. Depending on the fluorescence ratios, the over-expressed genes were assigned positive values. In contrast, the under-expressed genes were not expressed up to the usual degree and were assigned negative values. The log-transformed ratio is positive, ranging upward from 0, for over-expressed genes, and ranges from 0 to -1 for under-expressed genes. The over-expression and under-expression of genes in cancer cells disrupt the normal growth control processes of cells.
SUMMARY AND CONCLUSION

Breast cancer, like other cancers, occurs because of an interaction between the environment and a defective gene. Normal cells divide as many times as needed and stop. They attach to other cells and stay in place in tissues. Cells become cancerous when mutations destroy their ability to stop dividing, to attach to other cells and to stay where they belong. Mutations known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in genes involved in error-correcting mechanisms. Microarray experiments generate large amounts of data, which have to be stored and analyzed. A number of clustering techniques have been used to group genes based on their expression patterns, and the user has to choose the appropriate method for each task. The basic concept in clustering is to identify and group together similarly expressed genes and then to correlate the observations with biology. The idea is that co-regulated and functionally related genes are grouped into clusters. The microarray data analysis for breast cancer was carried out using Genesis. The data set was retrieved from SMD. Genesis was used to normalize the raw data retrieved from SMD and to cluster the genes with the clustering algorithms. The results were generated with the SOM and K-means techniques. The genes were clustered into 9 different clusters with both techniques, based on the expression profile of those genes in breast cancer. In the case of SOM, clusters S1, S3, S6 and S7 had the largest numbers of genes, and the genes clustered in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of K-means, clusters K2, K4, K6 and K9 had the largest numbers of genes, and the genes clustered in K2, K4 and K9 were highly co-expressed. The expression of genes in some of the clusters was found to be similar; some genes were under-expressed while others were over-expressed.
CONCLUSION
Thus, it was concluded from the statistical analysis that the clustering results obtained by the two techniques were comparable and approximately accurate. Forty-five genes were identified which were co-expressed in different clusters. In future work, promoter analysis can be carried out to analyze the regulatory systems of these forty-five genes, and drug targets can be identified with the help of this regulatory system analysis. Where the functions of these genes are unknown, they can be predicted on the basis of the known genes in the same cluster.
REFERENCES
Anderson, K.S., Sibani, S., Wong, J. and Raphael, J. (2009). Using custom protein microarrays to identify autoantibody biomarkers for the early detection of breast cancer: Cancer Res; 69: 223-233
Bertucci, F., Birnbaum, D. and Hassoun, J. (2004). Protein Expression Profiling Identifies Subclasses of Breast Cancer and Predicts Prognosis, American Association for Cancer Research.2: 12-29
Calvo, M.B., Antolin, N. S. and Vilamil, M.V. (2009). MicroRNA for circulating tumor cells detection in breast cancer: In-silico and in-vitro analysis, Tumor Biology,11:280-285
Cannaves, J. M. and Leon, D. A. (2003). In silico stydy of breast cancer associated gene 3 using LION Target Engine and other tools. Biotechniques 7: 1222-1228
David, S., Larson, G., Glackin,C. and Shove, O. (2008). Meta-analysis of breast cancer microarray studies in conjunction with conserved cis-elements suggest patterns for coordinate regulation.BMC Bioinformatics 9: 1186-1471
Goodison, S., Sun, Y. and Urquidi, V. (2010). Derivation of cancer diagnostic and prognostic signatures from gene expression data.Bioanalysis 5: 855-862
Holko, M., Scholtens, D. and Khan, S. A. (2009). Differential gene expression associated with estrogen receptor status of breast cancer identified by microarray meta analysis. American Association for Cancer Research 69: 5472
Hyat, M., Tramm,T., Alsner, J. and Finak,G. (2010). In-silico ascription of gene expression differences to tumor and stromal cells in a model to study impact on breast cancer outcome.Public Library of Science 5:1371
39
Ingrid, A. H., Kristen, M.C. and Sofia, K.G. (2006). Microarrays in breast cancer research and clinical practice. Endocrine Related Cancer 13: 1017-1031
Jones, C., Mckay, A., Cossu, A. and Davies, S. (2004). Gene expression profiling for Molecular Characterization of Inflammatory Breast Cancer and Prediction of Response to Chemotherapy.Cancer Res 64:8558
Kristine, D. N. (2002). Microarrays and Breast Cancer: Predicting Metastatic Disease. Medscape Hematology-Oncology.Absract nr 2067
Kallioniemi, A., Veldman, R. and Jiang, Y. (2000). Comparative Genomic Hybridization Analysis of 38 Breast Cancer cell lines-A Basis for Interpreting Complementary DNA Microarray data.American Association of Cancer Research, 60:4519
Macgregor, P.F. (2003). Gene expression in cancer-applications of microarrays. Expert Review of Molecular Diagnostics, 3:185-200
Margossian, A., Diaz, J. and Corvalan, A. (2009). In silico analysis of Breast Cancer Transcriptome Libraries distinguishes Tumor Subclasses. Cancer Res; 69:1165
Miller, D.L. and Edison, T. L. (2007). Expression genomics in breast cancer researchmicroarrays at the crossroads of biology and medicine. Breast Cancer Research, 9:206
Pandey, A., Sukumar, S., Cole, R.N. and Chen, H. (2010). Proteomic characterization of Her 2/neu over expressing breast cancer cells. Proteomics, John Wiley and Sons Ltd,10 :3800-3810
Potamias, G., Analyti, A. and Tollis, Y. (2004). Breast cancer microarrays and Biomedical Informatics-The Prognochip Project.1st International Advanced Research Workshop on In silico Oncology: Advances and Challenges, Sparta, Greece.
40
Shen, D., He, J. and Chang, H.R. (2005). In silico identification of breast cancer genes by combined multiple high throughput analysis. International Journal of Molecular Medicine 15 :205-212
Sultana, Z., Craig, D. B. and Dombkowski, A.A. (2011). In silico analysis of combinatorial microRNA Activity Reveals Target Genes and Pathways Associated with Breast Cancer Metastasis. Cancer Inform 17 :13-29
Tommasi, S., Pilato, B., Pinto, R. and Bruno, M. (2008). Molecular and in silico analysis of BRCA1 and BRCA2 variants.Mutat Res 644 :64-70
Vinals, C., Gauli, S. and Coche.T.(2001). Using in silico transcriptomics to search for tumor-associated antigens for immunotherapy. Vaccine 19 :2607-2614
Wendy, K., Andrews, J., Pilon, J. and Hodgson, A. (2010). Gene expression Profiling predicts clinical outcomes of breast cancer Nature 415:530-536
Yang, Y., Choi, J. and Yoon, S. (2006). Large scale data mining approach for gene-specific standardization of microarray gene expression data. Oxford Journal, Bioinformatics 22 :2818-2904
41
WEBSITES
1. http://www.helsinki.fi/biochipcenter 2. http://www.microarrays.btk.utu.fi 3. http://www.med.uio.no/dnr/microarray/ 4. http://www.genome.tugraz.at/genesisclient/1.7.2/install.htm 5. http://www.smd.Stanford.EDU/ 6. http://www.ncbi.nlm.nih.gov.in 7. http://www.aacr.gov
42
