


Dissertation Part-I Progress Report

Master of Technology (Computer Science)

3rd Semester


M.Tech (CS) 3rd Semester
Enrollment No.: A160025


Mr. Abdul Wahid

Associate Professor (Dean)
Department of CS&IT, MANUU, Hyderabad


Gachibowli, Hyderabad-500032

This is to certify that the Dissertation Part-1 Progress Report entitled “LUNG DISEASE PREDICTION
A160025 in partial fulfillment of the requirements for the award of Master of Technology (CS) Degree during
2017-2019 at the Department of CS&IT is an authentic work carried out by him/her under my guidance and

The results presented in this report have been verified and are found to be satisfactory. The results
embodied in this dissertation have not been submitted to any other University or Institute for the award of any
other degree or diploma.

Supervisor’s Signature

DRC Member’s Signature Head

Department of CS&IT

I hereby declare that the thesis work presented in this Dissertation Part-1 Progress Report entitled “LUNG
of the requirement for the award of the degree of Master of Technology (Computer Science) submitted in the
Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, Telangana, India is an authentic
record of my own work carried out under the guidance of MR. ABDUL WAHID, Associate Professor (Dean),
Department of CS&IT, Maulana Azad National Urdu University, Hyderabad (Telangana).

I have not submitted the matter embodied in this progress report for the award of any other degree or diploma
to any other University or Institute.


I express my sincere gratitude towards my Supervisor MR. ABDUL WAHID, Associate Professor (Dean),
Department of CS&IT, MANUU Hyderabad for consistently providing me with the guidance required for
the timely and successful completion of this report.

I am deeply indebted to Coordinator MR. MOHAMMAD ISLAM, Department of CS&IT, MANUU for his
valuable suggestions and support. In spite of his extremely busy schedule in the Department, he was
always available to share with me his deep insights, wide knowledge and extensive experience.

Again I sincerely thank Professor Abdul Wahid, Dean School of Computer Science & Information Technology,
Dr. Pradeep Kumar, Head Department of Computer Science & Information Technology and all other faculty
members of our department for their valuable feedback during internal evaluations.

Data mining techniques began gaining popularity nearly three decades ago.
Until the last few years, however, data mining was not widely used in health
care organizations. Researchers have now started paying attention to this
field and have found that the health care sector holds a very large volume of
data, most of it highly unorganized. If this data is organized properly using
data mining techniques, it can be used to predict various diseases.
I will develop a hybrid approach using two techniques, Naïve Bayes and the
K-means algorithm. Fourteen parameters are considered for the prediction of
lung disease. The approach uses the k-means algorithm to group the attributes
and naïve Bayes to predict whether a patient has lung disease.

Contents

List of Figures

List of Tables

1. Introduction
1.1) Introduction
1.2) K-means clustering
1.3) Naïve Bayes
1.4) K-Means – Naïve Bayes Hybrid

2. Objectives
2.1) Objectives

3. Literature Survey
3.1) Literature Survey

4. Proposed Method
4.1) Methodologies
4.2) Proposed Method
4.3) Performance Measurement

5. Time Table (Plan of Work)
5.1) Plan of Work

6. Tools
6.1) Tools

7. Tentative Outcomes
7.1) Tentative Outcomes

References


Figure No.    Name of the Figure

Figure 4.1    K-means clustering process
Figure 4.2    Taking dataset and preprocess
Figure 4.3    Clustering using k-means
Figure 4.4    Classification using Naïve Bayes
Figure 6.1    WEKA tool


Table No.    Name of the Table

Table 3.1    Base Papers
Table 3.2    Research Papers using naïve bayes
Table 3.3    Research Papers using k-means clustering
Table 3.4    Research Papers using WEKA
Table 5.1    Plan of Work


1.1) Introduction

Lung cancer is the leading cause of cancer-related death and is responsible for more than a quarter
of all deaths due to cancer in the United States. It accounts for 13-14% of all cancer diagnoses,
making it the second most commonly diagnosed malignancy in both men and women (not counting
skin cancers). Until the 20th century, however, lung cancer was a relatively rare disease. That
changed with the advent of wide-scale cigarette smoking, which remains the leading cause of lung
cancer today.
There are two main types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung
cancer (SCLC). The majority of lung cancer patients have NSCLC, which usually grows and
spreads more slowly and has a better 5-year overall survival rate than SCLC.
Lung cancer accounts for more deaths than any other cancer in both men and women. It has been
the fifth leading cause of death in the world over the past 10 years (World Health Organization
2016). According to the WHO (World Health Organization) report, lung cancer is the leading
cause of cancer death across the world, accounting for 1.58 million deaths, or about 27% of all
cancer deaths. The death rate began declining in 1991 in men and in 2003 in women.
Early detection of lung cancer is essential to reducing the loss of life. Earlier treatment,
however, requires the ability to detect lung cancer in its early stages. Early diagnosis requires an
accurate and reliable diagnostic procedure that allows physicians to distinguish benign lung
disease from malignant disease.

Health data is growing rapidly throughout the world. Because it is very large and complex,
processing it with traditional data processing techniques is very difficult. To simplify the task,
machine-learning techniques such as k-nearest neighbours (KNN), support vector machines (SVM)
and decision trees (DT) have been used. Tools such as Python (pandas) and Weka are widely used
in the data analytics field.

The two main concepts that we will come across repeatedly throughout this work are:

• K-Means Clustering
• Naïve Bayes

1.2) K-Means Clustering

K-means is one of the simplest learning algorithms for solving the clustering problem. The
procedure is simple: it partitions a given data set into a chosen number of clusters, k. It defines
k centroids, one for each cluster, which should initially be placed as far away from each other
as possible. Each point in the data set is then assigned to its nearest centroid. When no point
is pending, an early grouping is complete. We then recalculate k new centroids as the means of
the clusters resulting from the previous step. Once we have the k new centroids, a new binding
is made between the same data points and the nearest new centroid. This generates a loop, in
which the k centroids change their location step by step until no more changes occur.

The advantages of the k-means clustering algorithm are simplicity and speed.

1) Select k initial centers at random from the data.
2) Divide the data into k clusters by assigning each point to its nearest center.
3) Calculate the mean of each of the k clusters to find the new centers.
4) Repeat steps 2 and 3 until the centers do not change.
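
The four steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the WEKA SimpleKMeans implementation that this work plans to use: the function name kmeans, the toy points and the choice to keep the old center of an empty cluster are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick k data points at random
    labels = [0] * len(points)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center (Euclidean distance)
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # step 3: recompute each center as the mean of its cluster
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        # step 4: stop when the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy data (hypothetical): two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initial centers.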

1.3) Naïve Bayes

The Naïve Bayes classifier is based on Bayes' theorem. It makes a strong independence
assumption: given the class, every attribute is assumed to be independent of the others. For this
reason it is also known as an independent feature model. Naïve Bayes is mainly used when the
dimensionality of the inputs is high. Despite its simplicity, it can be competitive with more
sophisticated classifiers. The probability of each input attribute is estimated for each state of
the class to be predicted.

Bayes' theorem:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
P(H|X) is the posterior probability of H conditioned on X
P(X|H) is the probability of X conditioned on H (the likelihood)
P(H) is the prior probability of H
P(X) is the prior probability of X

Naïve Bayes will predict whether or not the patient is likely to have lung disease.
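
As a small worked example of Bayes' theorem, the sketch below computes the posterior P(H|X) from a prior and two conditional probabilities, obtaining P(X) by the law of total probability. The function name posterior and all three input probabilities are hypothetical values chosen for illustration; they are not taken from any lung disease dataset.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """Return P(H|X) = P(X|H) P(H) / P(X)."""
    # P(X) by the law of total probability over H and not-H
    p_x = p_x_given_h * prior_h + p_x_given_not_h * (1.0 - prior_h)
    return p_x_given_h * prior_h / p_x

# Hypothetical numbers: H = "patient has the disease", X = "symptom observed"
p = posterior(prior_h=0.10, p_x_given_h=0.80, p_x_given_not_h=0.20)
print(round(p, 3))  # prints 0.308
```

Here P(X) = 0.8 × 0.1 + 0.2 × 0.9 = 0.26, so the posterior is 0.08 / 0.26 ≈ 0.308: observing X raises the probability of H from 10% to about 31%.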

1.4) K-Means – Naïve Bayes Hybrid:

The k-means clustering and naïve Bayes hybrid approach has been used for the prediction of
some other diseases and has been shown to produce better results than either simple approach.
The clustered dataset obtained after applying the K-means algorithm is compared with the
training dataset. Bayes' theorem is then applied to obtain the probability that the patient has
lung disease.

• K-means clustering can handle massive data and cluster it efficiently and quickly.

• The naïve Bayes algorithm will be used as the classification algorithm.
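
One possible reading of this hybrid can be sketched as follows: k-means first groups a numeric attribute into clusters, and the cluster index is then used as a categorical input to a naïve Bayes classifier. This is an assumption about how the two stages fit together, not the method confirmed by the cited papers. The function names, the two attributes (age and smoker), the toy records, the deterministic choice of initial centers (which needs k of at least 2) and the add-one smoothing are all hypothetical.

```python
from collections import Counter, defaultdict

def nearest(value, centers):
    """Index of the center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def kmeans_1d(values, k, iters=20):
    """1-D k-means; initial centers are spread over the sorted values (assumes k >= 2)."""
    srt = sorted(values)
    centers = [srt[(i * (len(srt) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centers)].append(v)
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    prior = Counter(labels)
    counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, y)][v] += 1
    return prior, counts

def predict_nb(model, row, n_values):
    """Pick the class with the largest prior * product of smoothed likelihoods."""
    prior, counts = model
    total = sum(prior.values())
    best, best_score = None, -1.0
    for y, n_y in prior.items():
        score = n_y / total
        for j, v in enumerate(row):
            score *= (counts[(j, y)][v] + 1) / (n_y + n_values[j])  # add-one smoothing
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical toy records: age, smoker flag, diagnosis
ages = [25, 30, 28, 62, 70, 66]
smoker = ["no", "no", "yes", "yes", "yes", "no"]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

centers = kmeans_1d(ages, k=2)                            # stage 1: clustering
rows = [(nearest(a, centers), s) for a, s in zip(ages, smoker)]
model = train_nb(rows, labels)                            # stage 2: classification
prediction = predict_nb(model, (nearest(68, centers), "yes"), n_values=[2, 2])
print(prediction)
```

For a new 68-year-old smoker, the age falls in the older cluster and the classifier returns "disease" on this toy data.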


2.1) Objectives

The objectives of this research are as follows:

• To study different disease prediction algorithms and review the literature.

• To study and analyze existing systems for lung disease and identify issues and challenges.

• To develop a k-means – naïve Bayes hybrid system for lung disease.

• To design a system for lung disease prediction based on patient data.

• To design a system that predicts lung disease more accurately than already existing systems.

• To implement a system using hybrid algorithms for increasing efficiency.

• To test and validate the proposed system.

Prediction of lung disease is a very complicated task, and at present it mainly depends upon
the individual medical practitioner. If the knowledge of individual medical practitioners is
combined in one data set, it will be very useful for the younger generation of medical
practitioners and will ultimately help patients. In this work a hybrid approach is used for
lung disease prediction: the popular clustering technique 'K-Means' is combined with the
'Naïve Bayes' algorithm as the classifier. Because it is a hybrid, this approach is well suited
to complex problems and is expected to produce results with good accuracy.


3.1) Literature Survey:

A variety of research papers were studied and analyzed during the literature survey, covering the
disease prediction methods that have been employed over the years using k-means clustering
or Naïve Bayes. The methodologies used in these studies and their findings are presented below.
Table 3.1: Base Papers

Table 3.2: Research Papers using k-means clustering

Table 3.3: Research Papers using naïve Bayes

Table 3.4: Research Papers using WEKA

[1] Data mining techniques are widely used for computation and for discovering patterns in
large data sets. The data mining approach was introduced by researchers in the mid-1990s, and
it has been observed to be a very important technique for extracting unknown patterns and
vital information from large data sets.

[2] Rucha Shinde proposed a heart disease prediction system using naïve Bayes and k-means
clustering, in which k-means clustering is used to increase the efficiency of the output. The
authors report it as the most effective model for predicting patients with heart disease. The
model could answer complex queries, each with its own strength with respect to ease of model
interpretation, access to detailed information and accuracy.

[3] Priyanka D proposed a system implementing the K-means clustering algorithm. It runs a
number of iterations from random starting points, assigning each observation to the nearest of
k clusters, with the aim of low running time and stable, accurate results. The work uses
compactness and connectedness as complementary measures of accuracy and reports, through a
software prototype, that the method predicts heart disease more efficiently and effectively than
the other three techniques.

[4] The main aim of this work is to develop a prototype Health Care Prediction System using
Naive Bayes. The system discovers and extracts hidden knowledge related to diseases (heart attack,
cancer and diabetes) from a historical heart disease database. It can answer complex queries
for diagnosing disease and so assist health care practitioners in making intelligent clinical
decisions, which traditional decision support systems cannot. By supporting effective treatment,
it also helps to reduce treatment costs. It also aims to enhance visualization and ease of
interpretation.

[5] Some implementations of K-means only allow numerical values for attributes. In that case, it
may be necessary to convert the data set into the standard spreadsheet format and convert
categorical attributes to binary. It may also be necessary to normalize the values of attributes that
are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides
filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in
WEKA. This is because the WEKA SimpleKMeans algorithm automatically handles a mixture of
categorical and numerical attributes. The WEKA SimpleKMeans algorithm uses the Euclidean
distance measure to compute distances between instances and clusters.


4.1) Methodologies

The methodology of our proposed system is based on a combination of k-means clustering and
the naïve Bayes algorithm, applied in two stages:

i) Clustering
ii) Classification

• Data related to lung diseases will be analyzed for data mining through Weka.

• K-means clustering and naïve Bayes techniques will be used.

• K-means clustering can handle massive data and cluster it efficiently and quickly.

• A simple and straightforward iterative method will be used to partition the data set
into k clusters.

• The naïve Bayes algorithm will be used as the classification algorithm.

Figure 4.1: k-means clustering process

4.2) Proposed Method

First, I will preprocess the data, because real-world data is dirty: incomplete, noisy and
inconsistent. It is incomplete when it lacks attribute values or attributes of interest, or
contains only aggregate values; noisy when it contains errors or outliers; and inconsistent
when it contains discrepancies in names or codes. I will then apply the clustering algorithm to
the dataset and, after clustering, use classification to predict lung disease.

Data preprocessing steps in Weka:

First, run the Weka software, launch the Explorer window and select the “Preprocess” tab. Then
open the lung dataset and note what information you have about it (e.g. number of instances,
attributes and classes). What type of attributes does this dataset contain (nominal or numeric)?
What are the classes in this dataset? Which attribute has the greatest standard deviation? What
does this tell you about that attribute? After loading the data set, under “Filter”, choose the
Standardize filter and apply it to all attributes. What does it do? How does it affect the attributes’
statistics? Click “Undo” to restore the data, then choose the “Normalize” filter and apply it to
all the attributes. What does it do? How does it affect the attributes’ statistics? How does it
differ from “Standardize”? Click “Undo” again to return the data to its original state. At the
bottom right of the window there should be a graph that visualizes the dataset. Make sure
“Class: class (Nom)” is selected in the drop-down box and click “Visualize All”. What can you
interpret from these graphs? Which attribute(s) discriminate best between the classes in the
dataset? How do the Standardize and Normalize filters affect these graphs? Under “Filter”, choose
the Attribute Selection filter. What does it do? Are the attributes it selects the same as the ones
you chose as discriminatory above? How does its behavior change as you alter its parameters?

Figure 4.2: Taking dataset and preprocess
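
To make the difference between the two filters concrete, the sketch below applies both transformations to one numeric attribute in plain Python. It reflects the usual definitions (Normalize rescales values to the range [0, 1]; Standardize gives zero mean and unit standard deviation) rather than WEKA's source code. The use of the sample standard deviation, the function names and the age values are assumptions made for the example, and the attribute must contain at least two distinct values.

```python
import statistics

def normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]

ages = [35.0, 45.0, 55.0, 65.0]  # hypothetical attribute values
normalized = normalize(ages)
standardized = standardize(ages)
print(normalized)
print(standardized)
```

After normalizing, the attribute's minimum and maximum become 0 and 1 while its shape is unchanged; after standardizing, its mean becomes 0 and its standard deviation 1, which is what the attribute statistics panel should show in each case.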

Clustering in WEKA:

Clustering divides the records in a database into different groups. Records in the same group
have similar properties. The differences between groups should be as large as possible, and
the differences within a group as small as possible. Because there is no predefined class,
clustering comes under unsupervised learning.

Steps involved in WEKA

Load the data file browsers.arff into WEKA using the same steps we used to load data into
the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the
columns, the attribute data, the distribution of the columns, etc. With this data set, we are
looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab.
Click Choose and select a technique from the choices that appear.

Figure 4.3: Clustering by using k-means

Classification in WEKA:
Classification is the process of finding a set of models that describe and distinguish data classes
and concepts, so that the model can be used to predict the class of objects whose label is
unknown. Classification is a two-step process. The first step is model construction, which
describes a set of predetermined classes: a classification model is built using training data.
Every object in the training data must be pre-classified, i.e. its class label must be known; each
tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The model is represented as classification rules, decision trees, or mathematical formulae. The
second step is model usage, for classifying future or unknown objects. The accuracy of the model
is estimated first: the model is tested by assigning class labels to the data objects in a test data
set, and the known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set must be independent of the training set; otherwise over-fitting will go undetected and
the accuracy estimate will be too optimistic. If the accuracy is acceptable, the model is used to
classify data tuples whose class labels are not known.

Steps involved in WEKA

Basically, there are four steps involved in classification in WEKA:
• Prepare the data
• Choose Classify and apply the algorithm
• Generate trees
• Analyze the result or output
First, prepare and load the data, which should be in .arff format. After loading the data,
choose Classify, then choose the classification algorithm and generate the trees.

Figure 4.4: Classification by using naïve bayes

4.3) Performance Measurement

After the training process is completed, the system will be tested for its performance. The testing
will be done on the test dataset, the part of the dataset that has not been used for training. The
ratio of training dataset to testing dataset can be 75:25, 65:35, 60:40 and so on. The ratio is
decided on the basis of the size of the available dataset, so that enough data is available both
for training the system and for testing it.
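
A split of this kind can be sketched in plain Python as below. The function name train_test_split, the fixed random seed and the stand-in records are assumptions made for the example; in WEKA the same effect is obtained with the percentage split test option.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                  # stand-in for 100 patient records
train, test = train_test_split(records, train_fraction=0.75)
print(len(train), len(test))                # prints 75 25
```

Shuffling before cutting prevents any ordering in the original file (for example, all positive cases first) from leaking into the split, and no record appears in both parts.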

The performance of the proposed method will then be measured using a confusion matrix. Because
both the input data and the output of the system are discrete, a confusion matrix is a natural
choice for evaluating the final performance of our system. Performance will be measured by
comparing the total number of True Positives and True Negatives with the total number of False
Positives and False Negatives predicted by the system, which gives a clear idea of how well the
system performs.
• True Positive (TP): Observation is positive, and is predicted to be positive.
• False Negative (FN): Observation is positive, but is predicted negative.
• True Negative (TN): Observation is negative, and is predicted to be negative.
• False Positive (FP): Observation is negative, but is predicted positive.

Classification Rate or Accuracy is given by the relation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
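
The sketch below counts the four confusion-matrix cells from paired lists of actual and predicted labels and computes accuracy as (TP + TN) divided by the total number of observations, the standard definition. The function names, the label values "yes" and "no" and the eight example observations are hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, TN, FP) for paired lists of labels."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # positive, predicted positive
        elif a == positive:
            fn += 1          # positive, predicted negative
        elif p == positive:
            fp += 1          # negative, predicted positive
        else:
            tn += 1          # negative, predicted negative
    return tp, fn, tn, fp

def accuracy(tp, fn, tn, fp):
    """Fraction of all observations that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

# Hypothetical test-set labels
actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]
tp, fn, tn, fp = confusion_counts(actual, predicted)
acc = accuracy(tp, fn, tn, fp)
print(tp, fn, tn, fp, acc)  # prints 3 1 3 1 0.75
```

In this example six of the eight observations are classified correctly (three true positives and three true negatives), giving an accuracy of 0.75.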


5.1) Plan of Work:

Table 5.1: Plan of Work

Academic Calendar   Activity                                                   Status

Week 1-2            Literature Searching                                       Done
Week 3-12           Literature Survey and Review                               Done
Week 12-17          Start work on the first draft. Aim to complete chapter 1.  In Progress
Week 18             Submit draft of chapter 1 to the supervisor.               Pending
Week 18-28          Work on the first draft of the remaining chapters.
Week 29             Submit the first draft to the supervisor.                  Pending
                    Receive feedback on previous work.
Week 30             Receive feedback on the first draft of the main chapters.  Pending


6.1) Tools

The WEKA tool will be used to implement:

• K-means for clustering

• Naïve Bayes for classification

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It is
also well-suited for developing new machine learning schemes.

WEKA contains “clusterers” for finding groups of similar instances in a dataset. The clustering
schemes available in WEKA are k-Means, EM, Cobweb, X-means and FarthestFirst. Clusters can be
visualized and compared to “true” clusters (if given). Evaluation is based on log-likelihood if the
clustering scheme produces a probability distribution.
In the ‘Clusterer’ box click on the ‘Choose’ button. In the pull-down menu select WEKA → Clusterers,
and select the cluster scheme ‘SimpleKMeans’. Some implementations of K-means only allow
numerical values for attributes, but WEKA's SimpleKMeans handles a mixture of categorical and
numerical attributes automatically; therefore, we do not need to use a filter.

Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning
schemes available in WEKA include decision trees and lists, instance-based classifiers, support
vector machines, multi-layer perceptrons, logistic regression, and Bayes' nets. “Meta”- classifiers
include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.

Figure 6.1: WEKA Tools


7.1) Tentative Outcomes

The proposed system aims to use a hybrid of K-means and NB for lung disease prediction that is
more efficient, can be trained more easily and in less time than the existing simple algorithms,
and gives better results. The results from the proposed system are expected to be more precise
and more accurate than those produced by the existing simple algorithms for lung disease.

This research aims to extend the capabilities of K-means clustering with the help of NB, by taking
the clusters produced by K-means and classifying them with NB for the lung disease problem.

The primary outcomes that are expected from the proposed system are as follows:

• A lung disease prediction system will be developed by combining Naïve Bayes and K-means.

• Weka tools will be used to reduce the execution time of the algorithms.

• The prediction system may be faster, less computationally expensive and more time efficient,
and may produce more accurate results.

• The proposed system will help doctors to efficiently predict lung diseases in the initial
stages for better treatment.


World Health Organization (2011) The top ten causes of death. World Health Organization (2013)
Deaths from coronary heart disease.

[1] P. V. Maral, “Heart Disease Prediction Using Naive Bayes and K-Means Techniques,”
Novat. Publ. Int. J. Res. Publ. Eng. Technol., vol. 3, no. 6, pp. 2454–7875, 2017.
[2] R. Shinde, S. Arjun, P. Patil, and P. J. Waghmare, “An Intelligent Heart Disease
Prediction System Using K-Means Clustering and Naïve Bayes Algorithm,” Int. J. Comput.
Sci. Inf. Technol., vol. 6, no. 1, pp. 637–639, 2015.
[3] D. Priyanka and M. S. S. Banu, “Prediction on Lung Disease Using K means
Algorithm,” vol. 1, no. 11, pp. 239–242, 2015.
[4] G. Singh, K. Bagwe, S. Shanbhag, S. Singh, and S. Devi, “Heart disease prediction
using Naïve Bayes,” Int. Res. J. Eng. Technol., vol. 4, no. 3, pp. 1–3, 2017.
[5] S. Jain, M. Aalam, and M. Doja, “K-means clustering using weka interface,” Proc.
4th Natl. Conf., 2010.
[6] W. Zhang and F. Gao, “An improvement to naive bayes for text
classification,” Procedia Eng., vol. 15, pp. 2160–2164, 2011.
[7] K. Vanitha and G. R. L. Rani, “Analysis of Classification and Clustering
Algorithms using Weka For Banking Data,” no. 0976, pp. 104–107.
[8] S. Singhal and M. Jena, “W-06. Study on WEKA Tool for Data Preprocessing,
Classification and Clustering,” India - WEKA, vol. 2, no. 6, pp. 250–253, 2013.
[9] P. Ramachandran, N. Girija, T. Bhuvaneswari, and A. Professor, “Early
Detection and Prevention of Cancer using Data Mining Techniques,” Int. J. Comput.
Appl., vol. 97, no. 13, pp. 975–8887, 2014.
[10] S. Vijiyarani, S. Sudha, and M. P. Research Scholar, “Disease Prediction in
Data Mining Technique – A Survey,” Int. J. Comput. Appl. Inf. Technol., vol. II, no. I,
pp. 2278–7720, 2013.
[11] T. Karthikeyan and P. Thangaraju, “PCA-NB Algorithm to Enhance the
Predictive Accuracy,” vol. 6, no. 1, pp. 381–387, 2014.
[12] D. Kavinya, “Lung Disease Classification Using Support Vector Machine,”
vol. 3, no. 3, pp. 84–86, 2015.
[13] A. Trivedi, “Evaluation of Student Classification Based On Decision
Tree,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 4, no. 2, pp. 111–112, 2014.
[14] U. Sharma, “Suitability of neural network for disease prediction: a
comprehensive literature review,” vol. 5, no. 6, pp. 12–20, 2017.
[15] C. H. Chen, W. T. Huang, T. H. Tan, C. C. Chang, and Y. J. Chang, “Using K-
nearest neighbor classification to diagnose abnormal lung sounds,” Sensors
(Switzerland), vol. 15, no. 6, pp. 13132–13158, 2015.
[16] M. Makinaci, “Support vector machine approach for classification of
cancerous prostate regions,” Int. Enformatika Conf., vol. 1, no. 7, pp. 166–169, 2005.
[17] A. Kumar, M. Kamaleshwar, S. K. K, S. K. R. S, and J. Arunnehru, “An
Improved Disease Prediction System Using Machine Learning,” no. 4, 2018.
[18] P. Mirajkar and A. Pradesh, “An Integrated Cancer Prediction System Using
Data Mining Techniques,” vol. 3, no. 1, pp. 1497–1501, 2018.
[19] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, “A
Lung Cancer Outcome Calculator Using Ensemble Data Mining on SEER Data,” Kdd, 2011.
[20] R. Ada, & Kaur, “A Study of Detection of Lung Cancer Using Data Mining
Classification Techniques,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3,
pp. 131–134, 2013.
[21] B. Sciences, B. G. Krishna, and A. Pradesh, “A Predictive Model for Heart
Disease Using Clustering Techniques,” vol. 8, no. 3, pp. 529–534, 2017.
[22] V. Krishnaiah, G. Narsimha, N. Subhash, and C. #3, “Diagnosis of Lung
Cancer Prediction System Using Data Mining Classification Techniques,” Int. J.
Comput. Sci. Inf. Technol., vol. 4, no. 1, pp. 39–45, 2013.
[23] Nur Hafieza Ismail, Fadhilah Ahmad, Azwa Abdul Aziz, “Implementing WEKA as a
data mining tool to analyze students academic performance using naïve Bayes classifier”,