Академический Документы
Профессиональный Документы
Культура Документы
Martin Backstrom
LiTH-IMT/BIT30-A-EX--16/535--SE
Lung cancer is one of the most serious and common cancer types
of today, with very uncomfortable and potentially cumbersome di-
agnostic techniques in x-ray, CT, CT-PET scans, bronchoscopies and
biopsies. Completing all these steps can also take a long time and be
time consuming for hospital staff. So finding a new safer and faster
technique to diagnose cancer would be of great benefit.
III
ACKNOWLEDGEMENTS
I would like to thank my girlfriend Ellen for all the support she have
given me during this project. To everybody at VilleValla Pub for the
great times we have had the last couple of years. Also, to everybody
at the lung clinic that have taken their time to provide me with the
necessary data, and I would lastly like to thank my supervisor Linda
for the help she have provided throughout the project.
IV
CONTENTS
List of Figures VI
1 introduction 1
1.1 Aims and objectives 1
1.2 Limitations 2
2 theoretical background 3
2.1 Lung cancer 3
2.1.1 Symptoms 3
2.1.2 Diagnosis 3
2.1.3 Types and stages 4
2.1.4 Treatment 4
2.1.4.1 Radiotherapy 5
2.1.4.2 Chemotherapy 5
2.1.4.3 Hypothesis 5
2.2 Machine learning 5
2.2.1 PCA 6
2.2.2 KNN 7
2.2.3 Cross validation 10
2.3 Electronic nose 11
2.3.1 How it works 11
2.3.2 Sample cycle 12
3 materials 14
3.1 Sample gathering 14
3.2 Data analysis 14
4 methods 16
4.1 Sample gathering and preparation 16
4.2 Electronic nose 17
4.3 Data analysis 17
4.3.1 Patient selection 20
5 results 21
5.1 Data storage 21
5.2 Healthy vs. Sick lung 21
5.3 Lung vs. Exhaled samples 24
6 discussion 30
6.1 Data storage 30
6.2 Data preparation 30
6.3 Healthy vs. Sick lung 32
6.4 Lung vs. Exhaled sample 33
6.5 Future directions 34
7 conclusion 35
Bibliography 36
V
LIST OF FIGURES
VI
LIST OF FIGURES VII
VIII
1
INTRODUCTION
1
1.2 limitations 2
1.2 limitations
Since the objectives in this thesis aside from creating effective data
storage is to study the possibility to distinguish healthy contra cancer
diseased lungs. Only patients with one-sided lung cancer are candi-
dates. The patients should also have no other conditions that could
affect the results. However tumour position and cancer stage are not
factors that are taken into account during this study. A large limita-
tion that have to be considered when analysing the results from this
thesis work was the small amount of acquired data, this came from a
lack of suitable patients in and that some suitable patients declined to
be apart of the study, as well as a late start for the sample gathering
at the lung clinic.
2
THEORETICAL BACKGROUND
This part of the report is covering the basic theory used throughout
the thesis. Covering Lung cancer in general, the machine learning
methods used in the analytical work to enable prediction and the
electronic nose used in this study.
2.1.1 Symptoms
2.1.2 Diagnosis
3
2.1 lung cancer 4
cells since they use more energy, due to the fact that they are splitting
them self at a higher rate than other cells. If the CT or CT/PET in-
dicates tumours, a bronchoscopy and biopsy is done, to visually see
the tumour and taking a piece of the tumour to examine more closely
in a laboratory. [7]
2.1.4 Treatment
2.1.4.1 Radiotherapy
Radiotherapy is a cancer removal technique where the cancerous cells
are irradiated to destroy the cancer cells. For lung cancer patients
with a chance of getting cured the use of an external beam radiother-
apy is normally used [8].
In radiotherapy a radioactive beam is aimed directly at the cancer
cells. If the chance for survival is low and the cancer cells block the
airways internal radiotherapy is used. This technique uses a small
tube containing radioactive material in one end. This tube is then in-
serted into the airways and aims the material directly onto the cancer
cells, destroying them and freeing up the airways to make breathing
easier.
Another method that can be used is stereotactic radiotherapy where
many small beams are aimed at a small point, this method is more
accurate and spares the surrounding tissue more, but requires more
sophisticated and more expensive machines that might not exist in
all hospitals.
Side effect for radiotherapy include chest pain, redness and/or hair
loss at the radiation site among others [8].
2.1.4.2 Chemotherapy
Chemotherapy is the use of very powerful medication either taken
intravenously or as a pill. The medication kills the cancer cells and
is primarily used to shrink the cells before surgery, or after surgery
removing remaining cells to avoid that the cancer will return. It can
also be used to reduce the symptoms and decrease the spreading.
However chemotherapy can often weaken the immune system which
makes the patient very susceptible for infections. Chemotherapy can
also fatigue the patients, other side effects include nausea, vomiting
and hair loss. [8]
2.1.4.3 Hypothesis
The difference created in the organs with cancerous tissue gives the
hypothesis that lung tissue with cancer cells give a different smell
than healthy lungs. The assumption from this is that an E-nose can
sense this different smell.
both tiring and time consuming if a person were to do it. Instead you
can write a program that takes in the known data and gives out pre-
diction to which class the unknown sample should belong to while
freeing up time for the user, and removing any potential human er-
rors that are likely to occur after looking at raw data for along time.
There are several different techniques which can do this today. For
instance artificial neural networks (ANN) or support vector machines
(SVM) can be used. But for this thesis a combination of a principal
component analysis (PCA) and K-nearest neighbour (KNN) is used.
These two methods will be described below together with cross vali-
dation which is used to evaluate how well the system performs.
2.2.1 PCA
P
( Xi X ) (Yi Y )
cov( X, Y ) = P1
(1)
i
d
1j xj
0
1 x = 11 x1 + 12 x2 + ... + 1p xd = (2)
j =1
2.2.2 KNN
KNN works best for smaller sample sets since all data have to be
stored to check against, meaning that for larger sets other techniques
might be a better choice or some data preparation like dimension
reduction might be needed to reduce the amount of data that have to
be stored.
When dealing with only 2 classes an odd k will always give one
class more votes whilst if there are more than 2 classes there is a
possibility that two classes gets the same amount of votes and then
some kind of separator between the two is needed. One way of doing
this is to decrease the k for this data classification neglecting the vote
from the data point furthest away for this case and repeat the voting.
2.2 machine learning 9
There are several ways to determine how good a classification is. One
way to do this is by using n-fold cross validation. This technique
divides the tests into n parts where n 1 parts are used to train the
system while the last part is used to evaluate the system. This is
repeated, switching parts used for evaluation and sum up the error
in each part and dividing this by how many parts there are.
The extreme case of n-fold cross validation that is going to be used
in this study is called leave one out. In this case only one test is used
for validation of the system while the rest are used for training. The
error is calculated according to Equation 5 [11].
n
1
Error =
n Errorn (5)
1
Figure 8: An example of what the sensory responses look like for one
sensor.
3
M AT E R I A L S
In this chapter all computer programs and materials are listed and
their basic functionality
14
3.2 data analysis 15
Figure 9: An example of what the sensor response file will look like.
4
METHODS
This chapter will cover the methods of work in all the steps in the
thesis work, from the sample gathering at Lungkliniken via sampling
with the E-nose, and lastly how the data analysis is performed.
16
4.2 electronic nose 17
the last quarter of each sensor by the mean of the last half of the
baseline purge, this is then stored into either one or both training
data matrices, one used for all samples, and one used for samples
only drawn from the lungs.
In Figure 11 the two thick black lines describe where the baseline
and measure points means are taken from.
Figure 11: A sensor response showing from where the baseline and
measure point means are taken from
When data have been read and written onto one or both of the
training matrices it is time to train the program. This is done by
finding the principal components containing the largest variance. The
principal components are set to explain 99.5 % of the variance, this
was chosen to get at least two principal components since early tests
found that the first principal component explained over 99 % of the
variance.
The eigenvectors with eigenvalues explaining over 99.5 % for the co-
variance matrix, (without the class column) are then multiplied with
the original data to create a set with fewer dimensions, and the class
is added again as a last column. This creates a N m + 1 matrix
with N being the total number of samples in the set and m being the
eigenvalues required to explain 99.5+ % of the variance, and the last
column for which class each sample belong to.
4.3 data analysis 20
Cii
Sensitivityi = m (7)
Ci j
j
The precision, which also have to be repeated for each class indi-
vidually like for sensitivity. In Equation ?? the precision calculation
is shown, also here the i represent the confusion matrix row, and the
j represent the column.
Cii
Precisioni = m (8)
Cji
j
21
5.2 healthy vs. sick lung 22
A mean was also calculated for both sample groups to give a sense
of how close the centre of both classes are to each other. This mean cal-
culation gave that the healthy sample mean point was [3.8434,-2.1639]
while the cancerous mean point was calculated to be [3.8454,-2.1655].
Figure 13: Plot for healthy and sick lung samples in PC1 and PC2
with a unknown sample to classify, and the points used
for voting.
30
As can be seen in Table 1 a k = 7 gives Error = 1
30 Errorn =
1
12
= 0.40. In Figure 14 the confusion matrix is shown, and from this
30
it can be seen that the accuracy using k = 7 can be calculated to be
7+11
30 = 0.60, as it should since the accuracy should be 1 Error. From
the confusion matrix the sensitivity and precision are calculated and
the results for these are presented in Table 2.
Figure 14: Confusion matrix between healthy and sick lung samples.
(a) PC1
(b) PC2
Figure 15: The two principal components used to describe 99.5+ % for
healthy, cancerous and exhaled samples showing patient
number on the x-axis.
5.3 lung vs. exhaled samples 26
In Figure 16 the samples are plotted with PC1 and PC2 values, with
the same notations as in Figure 15. Also here the unknown sample
is marked with a green pentagram and the seven closest data points
are marked with a magenta coloured ring.
The mean calculated for all three classes where calculated, for the
exhaled samples the mean was calculated to be [3.9945,-0.8155], for
healthy samples the mean was [3.9423,-0.8137], and for the cancerous
samples the mean was [3.9443,-0.8134].
Figure 16: Plot for exhaled, healthy and sick lung samples in both PC
showing an unknown sample and marking the 7 closest
points to used for voting.
Figure 18: PC1 and PC2 with lung and exhaled data points.
30
6.2 data preparation 31
results from this thesis work. Meaning that an other technique might
be of better use, or possibly accepting the larger more time consum-
ing calculations required when using more variables. However test
using classifications with all 30 dimensions show no significant differ-
ences, the classification error between healthy and sick lung actually
get worse from 40 % to 43.33 % using all dimensions, while compar-
ing healthy, sick and exhaled samples gives a slight decrease in error,
from 44.44 % to 40 % the confusion matrices are shown in Figure
20-21.
Figure 20: Confusion matrix when classifying between the lungs us-
ing all 30 sensors.
Figure 21: Confusion matrix when classifying exhaled and both lung
samples using all 30 sensors.
6.3 healthy vs. sick lung 32
This shows that the PCA method might not be as reliable as you
would like, when around 0.2 % variance give an change of error by
3.33 % for the lung comparison and 0.35 % variance gives a change
of error by 4.44 % for the other comparison but these changes are
so small using as small of a data aset as in this study that no real
conclusions on whether the PCA method works well or not can be
made from these results.
One thing that will possibly affect the results more is the method
used to prepare data. Where means and normalizations have been
made creating an approximation of the sensory response.
If this method was changed to using the responses appearance
instead, by creating a polynomial explaining the curvature of the
response followed by a comparison of these response polynomials
could possibly decrease the errors even more. The decision of using
the normalizing technique was made early since early tests showed
that discrimination between Coke
R and Pepsi
R fumes was possible
using this technique, and it was therefore seen as a simpler to imple-
ment method that could give positive results.
mostly due to scaling, but the pattern is also the opposite compared
to the first PC. Here the value for healthy samples in a patient is
normally larger than the cancerous sample, some exceptions occur
here as well but not as many.
These patterns gives the statement that healthy and cancerous sam-
ples can be differentiated some validity, however the difference be-
tween patients are to large to give any definitive distinction between
the samples. As covered earlier more data have to be gathered to give
any definitive answers to whether the two classes can be separated.
This makes it less likely that it can be used to get good classifica-
tions since a larger between class separation is needed. However in
this case the exhaled samples are so far away from the lung samples
making it easy to differentiate lung samples from exhaled samples, as
seen by the large accuracy, sensitivity and precision calculated from
confusion matrix 19.
The large spread between exhaled samples can be interesting to
look further on in a later study, where exhaled air samples from
healthy subjects are compared to exhaled air from cancer patients.
This is since a large in class variation requires a much larger between
class difference to make good classifications. If there is a large enough
between class variance the possibility to create a diagnostic method
using an electronic nose appear.
This field is very interesting and have really good potential to get
used for diagnostic purposes in the future if the classification tech-
niques are improved and using larger patient data.
The test results from this study must be seen as inconclusive, since
some patterns can be seen for every patient individually, but showing
a low classification rate using all samples. Both looking between the
lungs and looking between the two lung classes and the exhaled class.
Looking at the aims and objectives we can say that the data stor-
age works well as the data structures look for the moment, and is
somewhat effective, specially using the functionality of reading in
the training matrix. This method should not suffer from a larger
scale study either, just adding up all samples at the end of a day will
require much less than 5 minutes (if not more than 45 samples are
taken every day).
Discrimination between a healthy and sick lung is not possible at
this stage using this method, with an accuracy of 60 % this method
cannot be seen as a reliable classifier, at least without more data.
Discrimination between lung and exhaled samples are however
possible with a high accuracy, sensitivity and precision using this
technique, meaning that airways and oral cavity probably introduce
some contamination to the samples.
35
BIBLIOGRAPHY
36