of

Master of Technology (Computer Science)

3rd Semester

SUBMITTED BY:

MD FARHAN HAIDER

M.Tech (CS)3rd Semester

Enrollment No.: A160025

Associate Professor (Dean)

Department of CS&IT, MANUU, Hyderabad

SCHOOL OF COMPUTER SCIENCE & INFORMATION TECHNOLOGY

MAULANA AZAD NATIONAL URDU UNIVERSITY, HYDERABAD

GACHIBOWLI, HYDERABAD - 500032, INDIA

NOVEMBER, 2018

MAULANA AZAD NATIONAL URDU UNIVERSITY

Gachibowli, Hyderabad-500032

Certificate

This is to certify that the Dissertation Part-1 Progress Report entitled “LUNG DISEASE PREDICTION

SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” submitted by MD FARHAN HAIDER bearing Roll No

A160025 in partial fulfillment of the requirements for the award of Master of Technology (CS) Degree during

2017-2019 at the Department of CS&IT is an authentic work carried out by him/her under my guidance and

supervision.

The results presented in this report have been verified and are found to be satisfactory. The results

embodied in this dissertation have not been submitted to any other University or Institute for the award of any

other degree or diploma.

Supervisor’s Signature

Department of CS&IT

CANDIDATE’S DECLARATION

I hereby declare that the thesis work presented in this Dissertation Part-1 Progress Report entitled “LUNG

DISEASE PREDICTION SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” towards the partial fulfillment

of the requirement for the award of the degree of Master of Technology (Computer Science) submitted in the

Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, Telangana, India is an authentic

record of my own work carried out under the guidance of MR. ABDUL WAHID, Associate Professor (Dean),

Department of CS&IT, Maulana Azad National Urdu University, Hyderabad (Telangana).

I have not submitted the matter embodied in this progress report for the award of any other degree or diploma

to any other University or Institute.

Date:

Place: MD FARHAN HAIDER

ACKNOWLEDGEMENT

I express my sincere gratitude towards my Supervisor MR. ABDUL WAHID, Associate Professor (Dean),

Department of CS&IT, MANUU Hyderabad for consistently providing me with the required guidance to

I am deeply indebted to the Coordinator, MR. MOHAMMAD ISLAM, Department of CS&IT, MANUU, for his valuable suggestions and support. In spite of his extremely busy schedule in the Department, he was always available to share with me his deep insights, wide knowledge and extensive experience.

Again I sincerely thank Professor Abdul Wahid, Dean School of Computer Science & Information Technology,

Dr. Pradeep Kumar, Head Department of Computer Science & Information Technology and all other faculty

members of our department for their valuable feedback during internal evaluations.

ABSTRACT

Data mining techniques began gaining popularity nearly three decades ago, but until the last few years they were rarely used in health care organizations. Researchers have now started paying attention to this field. They have found that the health care sector holds a very large volume of data, most of it highly unorganized. If this data is organized properly using data mining techniques, it can be used for the prediction of various diseases.

I will develop a hybrid approach that combines two techniques, Naïve Bayes and the K-means algorithm. Fourteen parameters are considered for the prediction of lung disease. The K-means algorithm is used to group the attributes, and the Naïve Bayes algorithm is then used to predict, from those attributes, whether a patient is likely to have lung disease.

TABLE OF CONTENTS

DESCRIPTION PAGE NO

Contents I-II

List of Tables IV

1. Introduction 1-4

1.1) Introduction

2. Objectives 5-6

2.1) Objectives

4.1) Methodologies

6. Tools 19-21

6.1) Tools

References 24-26

LIST OF FIGURES

Figure 4.2 Taking dataset and preprocess 13

Figure 4.3 Clustering using k-means 14

Figure 4.4 Classification using Naïve Bayes 15

Figure 6.1 WEKA tool 21

LIST OF TABLES

Table 3.2 Research Papers using naïve bayes 8

Table 3.3 Research Papers using k-means clustering 9

Table 3.4 Research Papers using WEKA 9

Table 5.1 Plan of Work 18

CHAPTER 1

INTRODUCTION

1.1) Introduction

Lung cancer is the leading cause of cancer-related death and is responsible for more than a quarter

of all deaths due to cancer in the United States. It accounts for 13-14% of all cancer diagnoses,

making it the second most commonly diagnosed malignancy in both men and women (not counting

skin cancers). Until the 20th century, however, lung cancer was a relatively rare disease. That

changed with the advent of wide-scale cigarette smoking, which remains the leading cause of lung

cancer today.

There are two main types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung

cancer (SCLC). The majority of lung cancer patients have NSCLC, which usually grows and

spreads more slowly and has a better 5-year overall survival rate than SCLC.

Worldwide, lung cancer accounts for more deaths than any other cancer in both men and women. It has been the fifth leading cause of death in the world over the past 10 years (World Health Organization 2016). According to the World Health Organization (WHO) report, lung cancer is the leading cause of cancer death across the world, accounting for 1.58 million deaths, or about 27% of all cancer deaths. The death rate began declining in 1991 in men and in 2003 in women.

Early detection of lung cancer is essential in reducing loss of life. Earlier treatment, however, requires the ability to detect lung cancer in its early stages. Early diagnosis requires an accurate and reliable diagnostic procedure that allows physicians to distinguish benign lung disease from malignant disease.

Health data is growing rapidly worldwide. Because this data is very large and complex, processing it with traditional data processing techniques is very difficult. To make the task tractable, machine-learning techniques such as k-nearest neighbours (KNN), support vector machines (SVM) and decision trees (DT) have been used. Tools such as Python (pandas) and Weka are widely used in the data analytics field.

The two main concepts that we will come across repeatedly throughout this work are:

K-Means Clustering

Naïve Bayes

1.2) K-Means Clustering

K-means is one of the simplest learning algorithms for solving clustering problems. The procedure is simple: it partitions a given data set into a chosen number of clusters, k. It defines k centroids, one for each cluster, which should initially be placed as far away from each other as possible. Each point in the data set is then assigned to its nearest centroid. When no point is pending, the first grouping is done. We then re-calculate k new centroids as the means of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data points and their nearest new centroid. This produces a loop in which the k centroids change their location step by step until no more changes occur.

Algorithm:-

1) Select k centers at random from the data set.

2) Divide the data into k clusters by assigning each point to its nearest center.

3) Calculate the mean of each of the k clusters to find the new centers.

4) Repeat steps 2 and 3 until the centers do not change.
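The four steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the WEKA implementation used in this work: the function names, the toy two-dimensional points, the fixed random seed and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, labels) by following steps 1 to 4 of the algorithm."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 3: move each center to the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster simply keeps its previous center.
            new_centers.append(mean_point(members) if members else centers[j])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data invented for illustration: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(data, k=2)
print(centers)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initial centers.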

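1.3) Naïve Bayes

The Naïve Bayes classifier is based on Bayes' theorem. It makes a strong independence assumption: given the class, every attribute is treated as independent of the others. For this reason it is also known as an independent feature model. Naïve Bayes is mainly used when the dimensionality of the input is high. For each class to be predicted, it combines the prior probability of the class with the conditional probability of each input attribute given that class, and it outputs the most probable class.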

Bayes theorem:-

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

Where

P(H|X) is the posterior probability of H conditioned on X

P(X|H) is the probability of X conditioned on H (the likelihood)

P(H) is the prior probability of H

P(X) is the prior probability of X
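To make the roles of the four terms concrete, here is a small numerical sketch. All numbers are hypothetical and chosen only for the arithmetic: H stands for "the patient has lung disease" and X for some observed evidence, such as "the patient is a smoker". P(X) is expanded with the law of total probability over H and not-H.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """P(H|X) from Bayes' theorem, with P(X) expanded over H and not-H."""
    # P(X) = P(X|H) P(H) + P(X|not H) P(not H)
    evidence = p_x_given_h * prior_h + p_x_given_not_h * (1.0 - prior_h)
    return p_x_given_h * prior_h / evidence


# Hypothetical values: P(H) = 0.10, P(X|H) = 0.80, P(X|not H) = 0.20
p = posterior(prior_h=0.10, p_x_given_h=0.80, p_x_given_not_h=0.20)
print(round(p, 4))
```

With these made-up values, P(X) = 0.8 × 0.1 + 0.2 × 0.9 = 0.26, so P(H|X) = 0.08 / 0.26, roughly 0.31. Observing X raises the probability of H from the prior 0.10 to about 0.31.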

Naïve Bayes will predict whether or not the patient is likely to develop lung disease.
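A minimal sketch of such a classifier for nominal attributes is shown below. It is not the WEKA NaiveBayes implementation; the attribute meanings (smoker, chronic cough), the six training rows and the use of add-one (Laplace) smoothing are assumptions introduced for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # Distinct values seen per attribute, needed for add-one smoothing.
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class H that maximizes P(H) * product of P(x_i | H)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Work in log space: log P(H) + sum of log P(x_i | H).
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (smoker, chronic_cough) -> class label.
rows = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["disease", "disease", "disease", "healthy", "healthy", "healthy"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("yes", "yes")))
print(predict(model, ("no", "no")))
```

P(X) is the same for every class, so the sketch compares only the numerators of Bayes' theorem. Smoothing keeps an attribute value that was never seen with a class from forcing that class's probability to zero.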

The hybrid of k-means clustering and naïve Bayes has been used for the prediction of other diseases, such as heart disease [1], [2], and has been reported to produce better results than the simple approaches.

The clustered dataset obtained after applying the K-means algorithm serves as the trained model. The values of a new record are compared with it, Bayes' theorem is applied, and the probability that the patient has lung disease is obtained.

K-means clustering has the ability to handle massive data and cluster those data efficiently

and quickly.

CHAPTER 2

OBJECTIVES

2.1) Objectives

To study and analyze existing systems for lung disease and identify issues and challenges.

To design a system that predicts lung disease more accurately than existing systems.

Prediction of lung disease is a very complicated task, and at present it depends mainly on the individual medical practitioner. If the knowledge of individual medical practitioners is combined in one data set, it will be very useful for the younger generation of medical practitioners and will ultimately help patients. In this work a hybrid approach is used for lung disease prediction: the most popular clustering technique, ‘K-Means', is combined with the ‘Naïve Bayes' algorithm as the classifier. Because it is a hybrid, the approach is well suited to complex problems and is expected to produce results with good accuracy.

CHAPTER 3

LITERATURE SURVEY

3.1) Literature Survey:

A variety of research papers were studied and analyzed during the literature survey, covering the disease prediction methods that have been employed over the years using k-means clustering or Naïve Bayes. The methodologies used in these studies and their findings are presented below.

Table 3.3: Research Papers using naïve Bayes

[1] Data mining techniques are widely used for computing and discovering patterns in large data sets. The data mining approach was developed by researchers in the mid-1990s, and it has been observed to be a very important technique for extracting unknown patterns and vital information from large data sets.

[2] Rucha Shinde et al. proposed a heart disease prediction system using naïve Bayes and k-means clustering. K-means clustering is used to increase the efficiency of the output. The authors describe this as the most effective model for predicting patients with heart disease. The model could answer complex queries, each with its own strength with respect to ease of model interpretation, access to detailed information and accuracy.

[3] Priyanka D proposed a system that implements the K-means clustering algorithm. It runs a number of iterations from random starting points, assigning each observation to the nearest of k clusters, in order to achieve fast execution and stable, accurate results. The research uses compactness and connectedness as complementary measures of accuracy. It finds, through a software prototype, that the method is more efficient and effective at predicting heart disease than the other three techniques considered.

[4] The main aim of this work is to develop a prototype Health Care Prediction System using Naive Bayes. The system will discover and extract hidden information related to diseases (heart attack, cancer and diabetes) from a historical heart disease database. It will answer complex queries for diagnosing disease and so help healthcare practitioners make intelligent clinical decisions, which traditional decision support systems cannot. By enabling effective treatment, it also helps to reduce treatment costs. It further aims to enhance visualization and ease of interpretation.

[5] Some implementations of K-means only allow numerical values for attributes. In that case, it may be necessary to convert the data set into the standard spreadsheet format and convert categorical attributes to binary. It may also be necessary to normalize the values of attributes that are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in WEKA. This is because the WEKA SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters.
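For a K-means implementation that only accepts numbers, the preprocessing described above can be sketched by hand as follows. This is an illustration only and does not reproduce WEKA's internal handling of mixed attributes; the helper names and the three sample records (age, income, smoker) are hypothetical.

```python
import math


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Convert a categorical attribute to one binary column per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical records: "age" and "income" are on very different scales.
ages = [30, 45, 60]
incomes = [20000, 50000, 80000]
smoker = ["no", "yes", "yes"]

rows = [
    [age, income] + flags
    for age, income, flags in zip(
        min_max_normalize(ages), min_max_normalize(incomes), one_hot(smoker)
    )
]
print(rows)
print(euclidean(rows[1], rows[2]))
```

Without the normalization step, a difference of a few thousand in "income" would dominate the distance and "age" would have almost no influence on the clusters.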

CHAPTER 4

PROPOSED METHOD

4.1) Methodologies

The methodologies used in our proposed system will be based on the combination of k-means clustering and the naïve Bayes algorithm:

i) Clustering

ii) Classification

Data related to lung diseases will be analyzed for data mining through Weka.

K-means clustering can handle massive data and cluster it efficiently and quickly.

A simple and straightforward iterative method will be used to partition the data set into k clusters.
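The clustering and classification stages listed above can be combined in more than one way, and this report does not fix the exact wiring. The sketch below shows one possible reading, stated as an assumption: K-means is run on the numeric attributes, the resulting cluster index is appended to each record as an extra nominal attribute, and Naïve Bayes is trained on the enlarged records. The patient data, attribute meanings and function names are all hypothetical, and the first k points seed the centers only to keep the example deterministic.

```python
from collections import Counter, defaultdict


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )


def k_means(points, k, max_iter=100):
    """Plain k-means; assumption: the first k points are the initial centers."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def train_nb(rows, labels):
    """Collect the counts needed by a nominal Naive Bayes classifier."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict_nb(model, row):
    """Most probable class, using add-one smoothing (an assumption)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            score *= (value_counts[(i, label)][value] + 1) / (count + len(domains[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical patients: numeric (age, cigarettes per day), nominal (cough,).
numeric = [(35, 0), (40, 2), (38, 1), (62, 20), (58, 25), (65, 18)]
nominal = [("no",), ("no",), ("yes",), ("yes",), ("yes",), ("no",)]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

# Stage i) clustering: learn the centers, then tag each record with its cluster.
centers = k_means(numeric, k=2)
rows = [nom + (nearest(num, centers),) for num, nom in zip(numeric, nominal)]

# Stage ii) classification: train Naive Bayes on the cluster-tagged records.
model = train_nb(rows, labels)

# A new patient is tagged with the nearest learned cluster before prediction.
new_numeric, new_nominal = (60, 22), ("no",)
new_row = new_nominal + (nearest(new_numeric, centers),)
print(predict_nb(model, new_row))
```

In this toy run the numeric attributes are clustered unscaled for brevity; with real data they would normally be normalized first so that no single attribute dominates the distance.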

4.2) Proposed Method

Firstly, I will preprocess the data, because real-world data is dirty: incomplete, noisy and inconsistent. Data is incomplete when it lacks attribute values, lacks attributes of interest, or contains only aggregate values. It is noisy when it contains errors or outliers, and inconsistent when it contains discrepancies in names or codes. I will then apply the clustering algorithm to the dataset and, after clustering, use classification to predict lung disease.

First, run the Weka software, launch the Explorer window and select the “Preprocess” tab. Then open the lung dataset and note what information is available about it (e.g. the number of instances, attributes and classes). What type of attributes does this dataset contain (nominal or numeric)? What are the classes in this dataset? Which attribute has the greatest standard deviation? What does this tell you about that attribute? After loading the dataset, under “Filter”, choose the “Standardize” filter and apply it to all attributes. What does it do? How does it affect the attributes’ statistics? Click “Undo” to restore the data, then choose the “Normalize” filter and apply it to all the attributes. What does it do? How does it affect the attributes’ statistics? How does it differ from “Standardize”? Click “Undo” again to return the data to its original state. At the bottom right of the window there should be a graph that visualizes the dataset. Make sure “Class: class (Nom)” is selected in the drop-down box, then click “Visualize All”. What can you interpret from these graphs? Which attribute(s) discriminate best between the classes in the dataset? How do the Standardize and Normalize filters affect these graphs? Under “Filter”, choose the Attribute Selection filter. What does it do? Are the attributes it selects the same as the ones you chose as discriminatory above? How does its behavior change as you alter its parameters?
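The difference between the two filters can be previewed outside WEKA with a short sketch. Standardizing rescales an attribute to zero mean and unit standard deviation, while normalizing rescales it to the range [0, 1]. The code below is a hand-written illustration, not WEKA code; the sample values are invented, and the use of the sample standard deviation is an assumption.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Rescale to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Invented attribute values for illustration.
values = [2.0, 4.0, 6.0, 8.0]

z = standardize(values)
n = normalize(values)
print("standardized:", [round(v, 3) for v in z])
print("normalized:  ", [round(v, 3) for v in n])
print("mean / stdev after Standardize:", round(statistics.mean(z), 3), round(statistics.stdev(z), 3))
print("min / max after Normalize:     ", min(n), max(n))
```

Both transformations keep the order and relative spacing of the values, so the shape of each attribute's histogram is unchanged; only the axis scale differs.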

Clustering in WEKA:

Clustering divides the records in a database into different groups. Records in the same group have similar properties. The differences between groups should be as large as possible, and the differences within a group should be as small as possible. There are no predefined classes, which is why clustering comes under unsupervised learning.

Load the data file browsers .arff into WEKA using the same steps we used to load data into

the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the

columns, the attribute data, the distribution of the columns, etc. With this data set, we are

looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab.

Click Choose and select technique from the choices that appear.

Classification in WEKA:

Classification is the process of finding a set of models that describe and distinguish data classes and concepts, so that the model can be used to predict the class of objects whose label is unknown. Classification is a two-step process.

The first step is model construction: a classification model is built from training data that describes a set of predetermined classes. Every object in the training data must be pre-classified, i.e. its class label must be known. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. The model is represented as classification rules, decision trees, or mathematical formulae.

The second step is model usage: the model is used to classify future or unknown objects. Before that, the accuracy of the model is estimated by assigning class labels to the data objects in a test data set and comparing the known label of each test sample with the result from the model. The accuracy rate is the percentage of test set samples that are correctly classified by the model. The test set must be independent of the training set; otherwise over-fitting will make the estimate too optimistic. If the accuracy is acceptable, the model is used to classify data tuples whose class labels are not known.

There are four basic steps involved in classification with WEKA:

Prepare the data

Choose Classify and apply an algorithm

Generate trees

Analyze the result or output

First, prepare and load the data, which should be in .arff format. After loading the data, choose Classify, then choose a classification algorithm and generate the trees.

4.3) Performance Measurement

After the training process is completed, the system will be tested for its performance. The testing will be done on the test dataset, which is a part of the dataset that has not been used for training. The ratio of training dataset to testing dataset can be 75:25, 65:35, 60:40 and so on. The ratio is chosen on the basis of the size of the available dataset, so that enough data is available both for training the system and for testing it.
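A hold-out split of this kind can be sketched as follows. The function name, the fixed random seed and the stand-in dataset of 20 record identifiers are assumptions made for the example; the three fractions are the ratios mentioned above.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    # Shuffling first avoids a split that follows the order of the file.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in dataset: 20 record identifiers.
records = list(range(20))

for fraction in (0.75, 0.65, 0.60):
    train, test = train_test_split(records, train_fraction=fraction)
    print(f"{fraction:.2f}: {len(train)} training records, {len(test)} test records")
```

Every record lands in exactly one of the two parts, which keeps the test dataset independent of the data used for training.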

The performance of the proposed method will then be measured using a confusion matrix. Because both the input data and the output of the system are discrete, a confusion matrix is a suitable choice for evaluating the final performance of our system. The performance will be measured by comparing the total number of True Positives and True Negatives with the total number of False Positives and False Negatives predicted by the system, which gives a clear idea of how well the system performs.

True Positive (TP) : Observation is positive, and is predicted to be positive.

False Negative (FN) : Observation is positive, but is predicted negative.

True Negative (TN) : Observation is negative, and is predicted to be negative.

False Positive (FP) : Observation is negative, but is predicted positive.

Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
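The four counts and the accuracy ratio can be computed directly from paired lists of actual and predicted labels. In the sketch below the label names and the eight example predictions are hypothetical; "disease" is treated as the positive class.

```python
def confusion_counts(actual, predicted, positive="disease"):
    """Return (TP, FN, TN, FP) for a two-class problem."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive, predicted positive
        elif a == positive:
            fn += 1  # positive, predicted negative
        elif p == positive:
            fp += 1  # negative, predicted positive
        else:
            tn += 1  # negative, predicted negative
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test-set labels and predictions.
actual = ["disease", "disease", "disease", "healthy",
          "healthy", "healthy", "healthy", "disease"]
predicted = ["disease", "healthy", "disease", "healthy",
             "disease", "healthy", "healthy", "disease"]

tp, fn, tn, fp = confusion_counts(actual, predicted)
print("TP, FN, TN, FP =", tp, fn, tn, fp)
print("accuracy =", accuracy(tp, fn, tn, fp))
```

Here TP = 3, FN = 1, TN = 3 and FP = 1, so the accuracy is (3 + 3) / 8 = 0.75.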

CHAPTER 5

TIMETABLE

(PLAN OF WORK)

5.1) Plan of Work:

complete chapter 1.

supervisor.

Pending

chapters.

Receive feedback on previous work.

main chapters.

CHAPTER 6

TOOLS

6.1) Tools

The WEKA tool will be used to implement K-Means clustering and Naïve Bayes:

K-Means for clustering

Naïve Bayes for classification

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can

either be applied directly to a dataset or called from your own Java code. Weka contains tools for

data pre-processing, classification, regression, clustering, association rules, and visualization. It is

also well-suited for developing new machine learning schemes.

WEKA contains “clusterers” for finding groups of similar instances in a dataset. The clustering schemes available in WEKA include k-Means, EM, Cobweb, X-means and FarthestFirst. Clusters can be visualized and compared to “true” clusters (if given). Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution.

In the ‘Clusterer’ box click on the ‘Choose’ button. In the pull-down menu select WEKA → Clusterers, and select the cluster scheme ‘SimpleKMeans’. Some implementations of K-means only allow numerical values for attributes, but WEKA's SimpleKMeans automatically handles a mixture of categorical and numerical attributes; therefore, we do not need to use a filter.

Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning

schemes available in WEKA include decision trees and lists, instance-based classifiers, support

vector machines, multi-layer perceptrons, logistic regression, and Bayes' nets. “Meta”- classifiers

include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.

Figure 6.1: WEKA Tools

CHAPTER 7

TENTATIVE OUTCOMES

7.1) Tentative Outcomes

The proposed system aims to use a hybrid algorithm for lung disease prediction that combines K-Means and Naïve Bayes (NB). Compared with the existing simple algorithms, it is intended to be more efficient, easier to train, faster to train, and to give better results. The results of the proposed system are expected to be more precise and more accurate than those produced by the existing simple algorithms for lung disease.

This research aims to extend the capabilities of K-Means clustering with the help of NB, by taking the clusters produced by K-means and classifying them with NB for the lung disease problem.

The primary outcomes that are expected from the proposed system are as follows:

Lung disease prediction system will be developed by combining Naïve Bayes and K-Means

algorithm.

The prediction system may be faster, less computationally expensive, time efficient and

produce more accurate results.

The proposed system will help doctors to efficiently predict lung diseases in the initial

stages for better treatment.

REFERENCES

World Health Organization (2011) The top ten causes of death. World Health Organization (2013)

Deaths from coronary heart disease.

[1] P. V. Maral, “Heart Disease Prediction Using Naive Bayes and K-Means Techniques,”

Novat. Publ. Int. J. Res. Publ. Eng. Technol., vol. 3, no. 6, pp. 2454–7875, 2017.

[2] R. Shinde, S. Arjun, P. Patil, and P. J. Waghmare, “An Intelligent Heart Disease

Prediction System Using K-Means Clustering and Naïve Bayes Algorithm,” Int. J. Comput.

Sci. Inf. Technol., vol. 6, no. 1, pp. 637–639, 2015.

[3] D. Priyanka and M. S. S. Banu, “Prediction on Lung Disease Using K means

Algorithm,” vol. 1, no. 11, pp. 239–242, 2015.

[4] G. Singh, K. Bagwe, S. Shanbhag, S. Singh, and S. Devi, “Heart disease prediction

using Naïve Bayes,” Int. Res. J. Eng. Technol., vol. 4, no. 3, pp. 1–3, 2017.

[5] S. Jain, M. Aalam, and M. Doja, “K-means clustering using weka interface,” Proc.

4th Natl. Conf., 2010.

[6] W. Zhang and F. Gao, “An improvement to naive bayes for text

classification,” Procedia Eng., vol. 15, pp. 2160–2164, 2011.

[7] K. Vanitha and G. R. L. Rani, “Analysis of Classification and Clustering

Algorithms using Weka For Banking Data,” no. 0976, pp. 104–107.

[8] S. Singhal and M. Jena, “W-06. Study on WEKA Tool for Data Preprocessing

, Classification and Clustering,” India - WEKA, vol. 2, no. 6, pp. 250–253, 2013.

[9] P. Ramachandran, N. Girija, and T. Bhuvaneswari, “Early Detection and Prevention of Cancer using Data Mining Techniques,” Int. J. Comput. Appl., vol. 97, no. 13, pp. 975–8887, 2014.

[10] S. Vijiyarani and S. Sudha, “Disease Prediction in Data Mining Technique – A Survey,” Int. J. Comput. Appl. Inf. Technol., vol. II, no. I, pp. 2278–7720, 2013.

[11] T. Karthikeyan and P. Thangaraju, “PCA-NB Algorithm to Enhance the

Predictive Accuracy,” vol. 6, no. 1, pp. 381–387, 2014.

[12] D. Kavinya, “Lung Disease Classification Using Support Vector Machine,”

vol. 3, no. 3, pp. 84–86, 2015.

[13] A. Trivedi, “Evaluation of Student Classification Based On Decision Tree,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 4, no. 2, pp. 111–112, 2014.

[14] U. Sharma, “Suitability of neural network for disease prediction : a

comprehensive literature review,” vol. 5, no. 6, pp. 12–20, 2017.

[15] C. H. Chen, W. T. Huang, T. H. Tan, C. C. Chang, and Y. J. Chang, “Using K-

nearest neighbor classification to diagnose abnormal lung sounds,” Sensors

(Switzerland), vol. 15, no. 6, pp. 13132–13158, 2015.

[16] M. Makinaci, “Support vector machine approach for classification of

cancerous prostate regions,” Int. Enformatika Conf., vol. 1, no. 7, pp. 166–169, 2005.

[17] A. Kumar, M. Kamaleshwar, S. K. K, S. K. R. S, and J. Arunnehru, “An

Improved Disease Prediction System Using Machine Learning,” no. 4, 2018.

[18] P. Mirajkar and A. Pradesh, “An Integrated Cancer Prediction System Using

Data Mining Techniques,” vol. 3, no. 1, pp. 1497–1501, 2018.

[19] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, “A

Lung Cancer Outcome Calculator Using Ensemble Data Mining on SEER Data

Categories and Subject Descriptors,” Kdd, 2011.

[20] R. Ada, & Kaur, “A Study of Detection of Lung Cancer Using Data Mining

Classification Techniques,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3,

pp. 131–134, 2013.

[21] B. Sciences, B. G. Krishna, and A. Pradesh, “a Predictive Model for Heart

Disease Using clustering techniques,” vol. 8, no. 3, pp. 529–534, 2017.

[22] V. Krishnaiah, G. Narsimha, and N. Subhash Chandra, “Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques,” Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 1, pp. 39–45, 2013.

[23] Nur Hafieza Ismail, Fadhilah Ahmad, Azwa Abdul Aziz, “Implementing WEKA as a

data mining tool to analyze students academic performance using naïve Bayes classifier”,
