
ENTROPY BASED FEATURE EXTRACTION AND KERNEL F-SCORE FEATURE SELECTION FOR COMPUTER AIDED MEDICAL IMAGE CLASSIFICATION SYSTEM

S. Shanthi 1, Dr. V. Murali Bhaskaran 2, V. Kavin Kumar 3, V. Dhivya 4
1 Senior Lecturer, 2 Principal, Paavai College of Engineering, Salem, Tamilnadu, 3,4 PG Students
1,3,4 Kongu Engineering College, Perundurai, Tamilnadu
1 shanthi.kongumca@gmail.com

Abstract

Advances in image acquisition and storage technology have led to tremendous growth in very large and detailed image databases. These images, if analyzed, can reveal useful information to human users. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the images. Image mining is more than just an extension of data mining to the image domain. Breast cancer represents the second leading cause of cancer deaths in women today and it is the most common type of cancer in women. In this paper, we propose entropy-based feature extraction and kernel F-score feature selection methods for classifying tumours in digital mammograms. We investigate the use of a data mining technique, association rule mining, for classification.

Keywords: Image mining, Kernel F-score, Preprocessing, Association rule mining.

1. Introduction

A vast amount of image data, such as medical images, satellite images, and digital photographs, is generated every day. These images, if analyzed, can reveal useful information to human users. Unfortunately, it is difficult or even impossible for a human to discover the underlying knowledge and patterns when handling a large collection of images. Image mining is rapidly gaining attention among researchers in the fields of data mining, information retrieval, and multimedia databases because of its potential for discovering useful image patterns that may push various research fields to new frontiers. Image mining systems that can automatically extract semantically meaningful information from image data are increasingly in demand.

Research in image mining can be broadly classified into two main directions. The first direction involves domain-specific applications, where the focus is to extract the most relevant image features into a form suitable for data mining [10]. The second direction involves general applications, where the focus is to generate image patterns that may be helpful in understanding the interaction between high-level human perception of images and low-level image features. Different methods, such as neural networks [2] and association rules [8, 9], have been used to classify images. A few interesting studies and successful applications involving image mining have been reported. For example, [3] describes ROI-based association rule mining for brain image classification.

In this paper, we investigate entropy-based feature extraction and kernel F-score feature selection. Association rules are then used for image categorization.

Figure 1. Image Mining Process

Figure 1 shows the image mining process. The images from an image database are first preprocessed to improve their quality. These images then undergo various transformations and feature extraction to generate the important features from the images [10]. With the generated features, mining can be carried out using data mining techniques to discover significant patterns. The resulting patterns are evaluated and interpreted to obtain the final knowledge, which can be applied to applications.

2. Materials and Methods

2.1 Mammography Data Collection

Gaining access to real medical images for experimentation is very difficult because of privacy issues and heavy bureaucratic hurdles. The data collection used in our experiments was taken from the Mammographic Image Analysis Society (MIAS) [12]. The data set consists of three broad categories: normal, benign and malign. In addition, the abnormal cases are further divided into six categories: microcalcification, circumscribed masses, spiculated masses, ill-defined masses, architectural distortion and asymmetry. All the images also include the locations of any abnormalities that may be present. The existing data in the collection consists of the location of the abnormality, its radius, the breast position (left or right), the type of breast tissue (fatty, fatty-glandular or dense) and the tumour type if one exists (benign or malign). All the mammograms are medio-lateral oblique views.

2.2. Data Preprocessing

The process of building the classification model includes preprocessing and extraction of visual features from already labeled images [7, 10]. Mammograms are difficult images to interpret, and a preprocessing phase is necessary to improve the quality of the images and make the feature extraction phase more reliable. Preprocessing significantly improves the effectiveness of data mining techniques [4]. The digitization process can introduce noise, which needs to be reduced by applying image processing techniques [11].

Figure 2. Original Images

Figure 3. The Preprocessing Stage

2.3 Feature Extraction

This paper uses a statistical approach to extract texture features from digital mammograms. The gray-level histogram moments method is normally used for this purpose. Entropy [5] is an important texture feature computed with this method; it is used to build a robust descriptor for correctly classifying abnormal and normal regions of mammograms. Entropy measures the randomness of the intensity distribution. In most feature descriptors, Shannon's measure is used to compute entropy.

Shannon entropy: a measure of randomness,

$$S = -\sum_{i=0}^{N_g - 1} H_i \log_2 (H_i)$$

where $N_g$ is the number of gray levels and $H_i$ is the normalized histogram value of gray level $i$.
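As a minimal sketch of the entropy feature, the snippet below computes the Shannon entropy of a region from its gray-level histogram. The function name `shannon_entropy`, the flat-list input format, and the reading of H_i as the fraction of pixels with gray level i are assumptions made for illustration; the paper does not give an implementation.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of integer gray levels.

    Assumption: H_i is the normalized histogram, i.e. the fraction of
    pixels in the region that have gray level i.
    """
    total = len(pixels)
    if total == 0:
        raise ValueError("region contains no pixels")
    entropy = 0.0
    # Only gray levels that occur are visited; empty bins contribute 0.
    for count in Counter(pixels).values():
        h_i = count / total
        entropy -= h_i * math.log2(h_i)
    return entropy

# Hypothetical toy regions: a flat patch and a two-level patch.
flat_region = [10] * 16
two_level_region = [0, 0, 255, 255]
flat_entropy = shannon_entropy(flat_region)        # 0.0: no randomness
two_level_entropy = shannon_entropy(two_level_region)  # 1.0: one bit
```

A perfectly uniform region has zero entropy, while a region split evenly between two gray levels has an entropy of one bit, which is why the feature can help separate smooth tissue from more irregular texture.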
2.4. Feature Selection - Kernel F-score feature selection

The F-score method is a basic and simple technique that measures the discrimination between two classes with real values [6]. In the F-score method, the F-score of each feature in the dataset is computed according to the equation below. To select features from the whole dataset, a threshold is then obtained as the average of the F-scores of all features. If the F-score of a feature is larger than the threshold, that feature is added to the feature space; otherwise, it is removed from the feature space. Given training vectors $x_k$, $k = 1, \ldots, m$, if the numbers of positive and negative instances are $n_+$ and $n_-$, respectively, then the F-score of the i-th feature is defined as follows:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the i-th feature over the whole, positive, and negative data sets, respectively; $x_{k,i}^{(+)}$ is the i-th feature of the k-th positive instance, and $x_{k,i}^{(-)}$ is the i-th feature of the k-th negative instance. The numerator shows the discrimination between the positive and negative sets, and the denominator defines the discrimination within each of the two sets. The larger the F-score of a feature, the more discriminative that feature is. A disadvantage of the F-score method is that it does not take the mutual information between features into account.

In the proposed feature selection method, kernel F-score feature selection (KFFS) is used both to transform a non-linearly separable dataset into a linearly separable one and to decrease the computational cost of the classification algorithm. First, the input space (features) of the dataset is mapped to kernel space using a Linear (Lin) or Radial Basis Function (RBF) kernel function. In this way, the dataset is transformed to a high-dimensional feature space. After the transformation from input space to kernel space, the F-score values of the high-dimensional features are calculated using the F-score formula. The mean of the calculated F-scores is then computed and used as the threshold. If the F-score of a feature is larger than the threshold, that feature is selected; otherwise, it is removed from the feature space. With the KFFS method, irrelevant or redundant features are removed from the high-dimensional input feature space. The reason for using kernel functions is to transform a non-linearly separable medical dataset into a linearly separable feature space.
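The F-score selection described in Section 2.4 can be sketched as follows. The function names are hypothetical, and the kernel mapping is an assumption: the text says only that features are mapped to kernel space with a linear or RBF kernel, so this sketch takes one plausible reading in which each sample is represented by its RBF kernel values against all training samples. It should not be read as the authors' exact procedure.

```python
import math

def f_scores(pos, neg):
    """F-score of each feature; pos and neg are lists of feature vectors."""
    n_pos, n_neg = len(pos), len(neg)
    scores = []
    for i in range(len(pos[0])):
        p = [row[i] for row in pos]
        q = [row[i] for row in neg]
        mean_p, mean_q = sum(p) / n_pos, sum(q) / n_neg
        mean_all = (sum(p) + sum(q)) / (n_pos + n_neg)
        # Numerator: separation of the class means from the overall mean.
        between = (mean_p - mean_all) ** 2 + (mean_q - mean_all) ** 2
        # Denominator: sum of the two within-class sample variances.
        within = (sum((v - mean_p) ** 2 for v in p) / (n_pos - 1)
                  + sum((v - mean_q) ** 2 for v in q) / (n_neg - 1))
        if within > 0:
            scores.append(between / within)
        else:
            # Degenerate case (no within-class spread); handling is an assumption.
            scores.append(math.inf if between > 0 else 0.0)
    return scores

def select_above_mean(scores):
    """Indices of features whose F-score exceeds the mean F-score."""
    threshold = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > threshold]

def rbf_kernel_features(samples, gamma=0.1):
    """Assumed mapping: row j holds the RBF kernel of sample j against every sample."""
    def kernel(a, b):
        return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
    return [[kernel(a, b) for b in samples] for a in samples]

def kernel_f_score_selection(pos, neg, gamma=0.1):
    """Map to kernel space, then apply the mean-threshold F-score rule."""
    rows = rbf_kernel_features(pos + neg, gamma)
    return select_above_mean(f_scores(rows[:len(pos)], rows[len(pos):]))

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
pos = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
neg = [[7.0, 5.0], [8.0, 6.0], [9.0, 4.0]]
plain_scores = f_scores(pos, neg)
plain_selected = select_above_mean(plain_scores)
kernel_selected = kernel_f_score_selection(pos, neg)
```

On the toy data, feature 0 has class means far from the overall mean and small within-class variance, so it scores well above the mean threshold and is kept, while feature 1 has identical class means and is dropped. The `gamma` value is an arbitrary illustrative choice.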

3. Association Rule Mining

Association rule mining has been extensively investigated in the data mining literature. Many efficient algorithms have been proposed, the most popular being apriori [1]. Association rule mining typically aims at discovering associations between items in a transactional database. Given a set of transactions D = {T1, ..., Tn} and a set of items I = {i1, ..., im} such that any transaction T in D is a set of items in I, an association rule is an implication A => B where the antecedent A and the consequent B are subsets of a transaction T in D, and A and B have no common items. For the association rule to be acceptable, the conditional probability of B given A has to be higher than a threshold called the minimum confidence. Association rule mining is normally a two-step process: in the first step, frequent item-sets are discovered, and in the second step, association rules are derived from the frequent item-sets.

In our approach, we used the apriori algorithm to discover association rules among the features extracted from the mammography database and the category to which each mammogram belongs. We constrained the association rules to be discovered such that the antecedent of a rule is composed of a conjunction of features from the mammogram, while the consequent of the rule is always the category to which the mammogram belongs. Once the association rules are found, they are used to construct a classification system that categorizes the mammograms as normal, malign or benign. The most delicate part of classification with association rule mining is the construction of the classifier itself. Although we have the knowledge extracted from the database by finding the existing association rules, the main question is how to build a powerful classifier from these associations.
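A minimal sketch of class-constrained rule mining in the apriori style is given below, assuming each mammogram has already been reduced to a set of discretized feature items plus a class label. The item names, thresholds, function names, and the simple rule-counting classifier are all hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

def mine_class_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Mine rules 'feature set => class' level by level (apriori style).

    transactions: list of (set_of_feature_items, class_label).
    Returns a list of (antecedent, class_label, support, confidence).
    """
    n = len(transactions)
    items = sorted({item for feats, _ in transactions for item in feats})
    current = [frozenset([item]) for item in items]  # level-1 candidates
    rules = []
    while current:
        frequent = []
        for itemset in current:
            covered = [label for feats, label in transactions if itemset <= feats]
            if len(covered) / n < min_support:
                continue  # infrequent antecedent: prune it and its supersets
            frequent.append(itemset)
            for label, count in Counter(covered).items():
                support = count / n                # P(antecedent and class)
                confidence = count / len(covered)  # P(class | antecedent)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((itemset, label, support, confidence))
        # Join step: keep (k+1)-itemsets whose k-subsets are all frequent.
        frequent_set = set(frequent)
        next_level = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in frequent_set
                for sub in combinations(union, len(a))
            ):
                next_level.add(union)
        current = sorted(next_level, key=sorted)
    return rules

def classify(rules, features):
    """Assign the class backed by the most matching rules (None if no rule fires)."""
    votes = Counter(label for antecedent, label, _, _ in rules
                    if antecedent <= features)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical discretized feature items for six labeled mammograms.
transactions = [
    ({"entropy=high", "tissue=dense"}, "abnormal"),
    ({"entropy=high", "tissue=fatty"}, "abnormal"),
    ({"entropy=high", "tissue=dense"}, "abnormal"),
    ({"entropy=low", "tissue=fatty"}, "normal"),
    ({"entropy=low", "tissue=dense"}, "normal"),
    ({"entropy=low", "tissue=fatty"}, "normal"),
]
rules = mine_class_rules(transactions)
predicted = classify(rules, {"entropy=high", "tissue=fatty"})
```

Restricting the consequent to a class label keeps the rule set small, because only feature item-sets are enumerated and each is paired with the labels it co-occurs with. Returning `None` when no rule applies leaves the handling of uncovered images to the caller.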

The association rules are generated from the database in such a manner that their consequent is a category from the classification classes. The association rules can imply either normal or abnormal. When a new image has to be classified, the categorization system returns the association rules that apply to that image. The first intuition in building the classification system is to assign the image to the class that has the most applicable rules. This classification works when the number of rules extracted for each class is balanced. In other cases, further tuning of the classification system is required. Tuning the classifier mainly consists of finding optimal confidence intervals such that both the overall recognition rate and the recognition rate of abnormal cases are at their maximum.

In dealing with medical images, it is very important that the false negative rate be as low as possible. It is better to misclassify a normal image than an abnormal one. That is why, in our tuning phase, we take into consideration the recognition rate of abnormal images. It is important not only to recognize some images, but to be able to recognize those that are abnormal.

By applying the apriori algorithm with additional constraints on the form of the rules to be discovered, we generate a relatively small set of association rules associating sets of features with class labels. These association rules constitute our classification model. The discovery of association rules in the mammogram feature database represents the training phase of our classifier. To classify a new mammogram, it suffices to extract the features from the image as was done for the training set, and to apply the association rules to the extracted features to identify the class the new mammogram falls into.

4. Conclusions and Future Work

Mammography is one of the best methods for breast cancer detection, but in some cases radiologists cannot detect tumours despite their experience. Computer-aided methods such as those presented in this paper could assist medical staff and improve the accuracy of detection.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM SIGMOD Int. Conf. Management of Data, pages 207-216, Washington, D.C., May 1993.
[2] Fabio Del Frate, Fabio Pacifici, Giovanni Schiavon and Chiara Solimini. Use of Neural Networks for Automatic Classification From High-Resolution Images. In IEEE Transactions on Geoscience and Remote Sensing, Vol. 45, No. 4, April 2007.
[3] Haiwei Pan, Qilong Han, Guisheng Yin. A ROI-Based Mining Method with Medical Domain Knowledge Guidance. In Proceedings of the International Conference on Internet Computing in Science and Engineering, 2008.
[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[5] N. Kapur. Measure of information and their applications, 1st edition. Wiley Eastern Limited, New Delhi, 1994.

[6] Kemal Polat, Salih Gunes. A New Feature Selection Method on Classification of Medical Datasets: Kernel F-score Feature Selection. Expert Systems with Applications, vol. 36, pp. 10367-10373, 2009.
[7] S. K. Kinoshita, P. M. d. Azevedo-Marques, R. R. P. Jr., J. A. H. Rodrigues, and R. M. Rangayyan. Content based retrieval of mammograms using visual features related to breast density patterns. In Journal of Digital Imaging, 20(2):172-190, 2007.
[8] Marcela X. Ribeiro, Agma J. M. Traina, Caetano Traina Jr., and Paulo M. Azevedo-Marques. An Association Rule-Based Method to Support Medical Image Diagnosis With Efficiency. In IEEE Transactions on Multimedia, Vol. 10, No. 2, February 2008.
[9] Marcela X. Ribeiro, Agma J. M. Traina, Caetano Traina Jr., Natalia A. Rosa, and Paulo M. A. Marques. How To Improve Medical Image Diagnosis through Association Rules: The IDEA Method. In 21st IEEE International Symposium on Computer-Based Medical Systems.
[10] Naga R. Mudigonda, Rangaraj M. Rangayyan and J. E. Leo Desautels. Detection of Breast Masses in Mammograms by Density Slicing and Texture Flow Field Analysis. In IEEE Transactions on Medical Imaging, Vol. 20, No. 12, December 2001.
[11] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing, 2nd edition. Addison-Wesley, 1993.
[12] http://www.wiau.man.ac.uk/services/MIAS/MIASweb.html
