Exploratory Data PDF

Exploratory Data Analysis :
In [1]: import pandas as pd

import seaborn as sn
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
In [2]: haber = pd.read_csv('haberman.csv')

print(haber.columns)
haber['surv_status'].value_counts()
Index(['age', 'op_year', 'axil_nodes', 'surv_status'], dtype='object')
Out[2]: 1 225
2 81
Name: surv_status, dtype: int64
Observation 1: It is an imbalanced dataset containing 225 Survivors & 81 Non Survivors.
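To put a number on the imbalance, the class shares can be computed with value_counts(normalize=True). The sketch below rebuilds the label column from the two counts printed above instead of reading haberman.csv, so it runs on its own. The mapping 1 = survivor, 2 = non-survivor is the same assumption used in the rest of this notebook.

```python
import pandas as pd

# Rebuild the label column from the counts printed in Out[2]
# (225 records with status 1, 81 records with status 2).
surv_status = pd.Series([1] * 225 + [2] * 81, name="surv_status")

# normalize=True returns class shares instead of raw counts.
class_share = surv_status.value_counts(normalize=True)
print(class_share.round(3))

# Ratio of the majority class to the minority class.
imbalance_ratio = 225 / 81
print("Majority : minority =", round(imbalance_ratio, 2))
```

On the real data the same call would be haber['surv_status'].value_counts(normalize=True).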
The code below was adapted from this Stack Overflow question:
--[https://stackoverflow.com/questions/17679089/pandas-dataframe-groupby-two-columns-and-get-counts]
I wanted to see how the data looks when it is grouped by more than one column. After trying several column combinations, I settled on
surv_status & axil_nodes:
Observation 2:
The grouped counts produced by the code below suggest that Axil Nodes carries useful signal for predicting the survival status.
For surv_status = 1 (assuming that it corresponds to Survival Records) : Out of 225 records, 190 have axil_nodes <= 5.
That is close to 84% of the Survival Records.
For surv_status = 2 (assuming that it corresponds to Non Survival Records) : Out of 81 records, 35 have axil_nodes > 5.
That is close to 43% of the Non Survival Records.
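In [3]: haber1 = haber.copy()  # copy, so the helper column is not added to haber itself

haber1['counter'] = 1
# Number of records for each (surv_status, axil_nodes) pair
group_data = haber1.groupby(['surv_status','axil_nodes'])['counter'].sum()
#print(group_data)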
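The grouped counts can also be collapsed around a threshold to get figures like the ones quoted in Observation 2. The following is a minimal sketch, not the notebook's own code: the helper name threshold_table and the small demo frame are made up for illustration, and the threshold of 5 is taken from the text.

```python
import pandas as pd

def threshold_table(df, threshold=5):
    """Count records per surv_status, split by axil_nodes <= threshold."""
    side = df["axil_nodes"].le(threshold).map(
        {True: "<= threshold", False: "> threshold"}
    )
    return pd.crosstab(df["surv_status"], side.rename("axil_nodes"))

# Made-up demo rows (NOT the Haberman data), only to show the output shape.
demo = pd.DataFrame({
    "surv_status": [1, 1, 1, 1, 2, 2, 2],
    "axil_nodes":  [0, 2, 5, 9, 1, 7, 23],
})

table = threshold_table(demo)
print(table)

# Row-wise shares: fraction of each class on either side of the threshold.
row_share = table.div(table.sum(axis=1), axis=0)
print(row_share.round(2))
```

On the real data this would be called as threshold_table(haber).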
Observation 3: The patients' ages range from 30 to 83. Patients with 5 or fewer axil nodes have a better chance of survival. Using the counts from Observation 2,
236 patients have axil_nodes <= 5: 190 survived and 46 (81 - 35) did not. (Survived % : 81 & Non Survived % : 19)
2D Scatter Plot :
In [4]: sn.set_style('whitegrid')
sn.FacetGrid(haber, hue = 'surv_status', height = 5) \
.map(plt.scatter, 'age', 'axil_nodes')
plt.legend()
plt.title('2-D Scatter Plot')
plt.show()
Observation 4:
A good number of patients were diagnosed with cancer despite having no axil nodes. Most of them survived, but a few didn't.
All patients who underwent treatment at the age of 30 or 31 survived.
Pair Plot :
In [5]: plt.close()
sn.set_style('whitegrid')
sn.pairplot(haber,vars = ['age', 'op_year','axil_nodes'] ,height = 3)
plt.suptitle('Pair-Plots')
plt.show()
Observation 5:
Axil Nodes looks like a better attribute for this analysis than age & op_year.
1D Scatter Plot for Survival Records :

In [6]: surv_status_1 = haber.loc[haber['surv_status'] == 1 ];
surv_status_2 = haber.loc[haber['surv_status'] == 2 ];
plt.plot(surv_status_1['axil_nodes'], np.zeros_like(surv_status_1['axil_nodes']),'o')
#plt.plot(surv_status_2['axil_nodes'], np.zeros_like(surv_status_2['axil_nodes']),'o')
plt.title('1-D Scatter Plot Survival Records')
plt.xlabel('Axil Nodes')
plt.ylabel('Units')
plt.show()
Observation 6:
The survival records are densely concentrated below roughly 18 Axil Nodes.
1D Scatter Plot for Non Survival Records :

In [7]: plt.plot(surv_status_2['axil_nodes'], np.zeros_like(surv_status_2['axil_nodes']),'o')
plt.title('1-D Scatter Plot for Non Survival Records')
plt.xlabel('Axil Nodes')
plt.ylabel('Units')
plt.show()
Observation 7:
The non survival records are densely concentrated below roughly 15 Axil Nodes.
Histogram based on Axil Nodes :

In [8]: sn.FacetGrid(haber, hue = 'surv_status', height = 5) \
.map(sn.distplot,'axil_nodes')
plt.title('Histogram based on Axil Nodes')
plt.legend()
plt.show()
Observation 8:
We can clearly see that a lot of survivors had Axil Nodes = 0.
Histogram based on Age :

In [9]: sn.FacetGrid(haber, hue = 'surv_status', height = 5) \
.map(sn.distplot,'age')
plt.legend()
plt.title('Histogram based on Age')
plt.show()
Observation 9:
The two distributions overlap heavily here. A large share of the Non Survival Records fall between ages 40 & 50.
Histogram based on Op Year :

In [10]: sn.FacetGrid(haber, hue = 'surv_status', height = 5) \
.map(sn.distplot,'op_year') \
.add_legend()
plt.title('Histogram based on Op Year')
plt.show()
Observation 10:
The bulk of the surgeries were performed between 1958 & 1968.
PDF, CDF Values & Curves :

In [11]: counts, bin_edges = np.histogram(surv_status_1['axil_nodes'],bins = 5,density = True)
pdf = counts / sum(counts)
print('*******************SURVIVAL DATA*********************')
print('Counts :', counts)
print('Sum :', sum(counts))
print('PDF :' , pdf)
print('Bin Edges :', bin_edges)
cdf = np.cumsum(pdf)
print('CDF :',cdf)
plt.plot(bin_edges[1:],pdf,label = 'PDF (Survival)')
plt.plot(bin_edges[1:],cdf,label = 'CDF (Survival)')
#plt.legend()
print('*******************NON SURVIVAL DATA*********************')

counts, bin_edges = np.histogram(surv_status_2['axil_nodes'],bins = 5,density = True)
pdf = counts / sum(counts)
print('Counts :', counts)
print('Sum :', sum(counts))
print('PDF :' , pdf)
print('Bin Edges :', bin_edges)
cdf = np.cumsum(pdf)
print('CDF :',cdf)
plt.plot(bin_edges[1:],pdf,label = 'PDF (Non Survival)')
plt.plot(bin_edges[1:],cdf,label = 'CDF (Non Survival)')
plt.legend()
plt.title('PDF & CDF Curves for Survival Records & Non Survival Records')
plt.ylabel('% Units')
plt.show()
*******************SURVIVAL DATA*********************
Counts : [0.09951691 0.00531401 0.00241546 0.00096618 0.00048309]
Sum : 0.10869565217391305
PDF : [0.91555556 0.04888889 0.02222222 0.00888889 0.00444444]
Bin Edges : [ 0. 9.2 18.4 27.6 36.8 46. ]
CDF : [0.91555556 0.96444444 0.98666667 0.99555556 1. ]
*******************NON SURVIVAL DATA*********************
Counts : [0.0688509 0.01780627 0.00712251 0.00118708 0.00118708]
Sum : 0.09615384615384613
PDF : [0.71604938 0.18518519 0.07407407 0.01234568 0.01234568]
Bin Edges : [ 0. 10.4 20.8 31.2 41.6 52. ]
CDF : [0.71604938 0.90123457 0.97530864 0.98765432 1. ]
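The printed values can be read as follows. With density = True, np.histogram returns densities (each count divided by the total count and by the bin width), which is why 'Sum' equals 1 / bin width (1 / 9.2 and 1 / 10.4) rather than 1. Dividing by that sum turns the densities into per-bin probabilities, and the cumulative sum gives the CDF. The first CDF value says that about 91.6% of survivors have at most 9.2 axil nodes, against about 71.6% of non survivors with at most 10.4. The sketch below repeats the calculation on a small made-up array; the helper name binned_pdf_cdf is hypothetical.

```python
import numpy as np

def binned_pdf_cdf(values, bins=5):
    """Return per-bin probabilities, their cumulative sum, and the bin edges."""
    counts, bin_edges = np.histogram(values, bins=bins)  # raw counts per bin
    pdf = counts / counts.sum()                          # share of records per bin
    cdf = np.cumsum(pdf)                                 # share up to each right edge
    return pdf, cdf, bin_edges

# Made-up values (NOT the Haberman data).
values = np.array([0, 0, 0, 1, 2, 3, 4, 7, 9, 10])
pdf, cdf, bin_edges = binned_pdf_cdf(values)

print("PDF :", pdf)
print("CDF :", cdf)
print("Bin Edges :", bin_edges)

# With density=True the densities integrate to 1 over the bin widths,
# and normalising them gives the same per-bin probabilities as above.
density, _ = np.histogram(values, bins=5, density=True)
print("Density * width sums to :", np.sum(density * np.diff(bin_edges)))
```

Because the two classes get different bin edges in the notebook (the maximum is 46 for survivors and 52 for non survivors), the two curves are not evaluated at the same node counts; passing a shared list of edges through the bins argument would make them directly comparable.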
Mean, SD, Median, Quantile, Percentile, MAD Values :

In [12]: from statsmodels import robust
print('*******SURVIVAL DATA BASED ON AXIL NODES**************')
print('Mean :', np.mean(surv_status_1['axil_nodes']))
print('Standard Deviation :', np.std(surv_status_1['axil_nodes']))
print('Median :', np.median(surv_status_1['axil_nodes']))
print('Quantiles :', np.percentile(surv_status_1['axil_nodes'],np.arange(0,100,25)))
print('90th Percentile :', np.percentile(surv_status_1['axil_nodes'],90))
print('Median Abs Deviation :', robust.mad(surv_status_1['axil_nodes']))
*******SURVIVAL DATA BASED ON AXIL NODES**************

Mean : 2.7911111111111113
Standard Deviation : 5.857258449412131
Median : 0.0
Quantiles : [0. 0. 0. 3.]
90th Percentile : 8.0
Median Abs Deviation : 0.0
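Note that np.arange(0,100,25) yields the percentiles 0, 25, 50 and 75, so the 'Quantiles' line prints the minimum, first quartile, median and third quartile; the maximum is not included. For survivors this means at least half of the records have zero axil nodes and three quarters have at most 3. A minimal sketch on made-up numbers follows, with the interquartile range added as an extra summary that the notebook itself does not compute:

```python
import numpy as np

# Made-up values (NOT the Haberman data).
values = np.array([0, 0, 1, 2, 3, 4, 8, 20])

# Same call pattern as the notebook: percentiles 0, 25, 50, 75.
print("Percentile points :", np.arange(0, 100, 25))
minimum, q1, median, q3 = np.percentile(values, np.arange(0, 100, 25))
print("Min, Q1, Median, Q3 :", minimum, q1, median, q3)

# Interquartile range: spread of the middle 50% of the values.
iqr = q3 - q1
print("IQR :", iqr)

# Adding 100 to the list also reports the maximum.
five_number = np.percentile(values, [0, 25, 50, 75, 100])
print("Five-number summary :", five_number)
```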
In [13]: print('*******NON SURVIVAL DATA BASED ON AXIL NODES**************')

print('Mean :', np.mean(surv_status_2['axil_nodes']))
print('Standard Deviation :', np.std(surv_status_2['axil_nodes']))
print('Median :', np.median(surv_status_2['axil_nodes']))
print('Quantiles :', np.percentile(surv_status_2['axil_nodes'],np.arange(0,100,25)))
print('90th Percentile :', np.percentile(surv_status_2['axil_nodes'],90))
print('Median Abs Deviation :', robust.mad(surv_status_2['axil_nodes']))
*******NON SURVIVAL DATA BASED ON AXIL NODES**************

Mean : 7.45679012345679
Standard Deviation : 9.128776076761632
Median : 4.0
Quantiles : [ 0. 1. 4. 11.]
90th Percentile : 20.0
Median Abs Deviation : 5.930408874022408
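The 'Median Abs Deviation' values are not raw medians of absolute deviations. By default, robust.mad from statsmodels divides the raw value by a normalising constant of about 0.6745 (the 0.75 quantile of the standard normal), so that the result is comparable to a standard deviation for normally distributed data. The 5.93 reported for non survivors therefore corresponds to a raw MAD of 4.0, and the 0.0 for survivors reflects that at least half of them have exactly zero axil nodes. The sketch below reproduces the scaling with NumPy only; scaled_mad is a hypothetical helper and the constant is hard-coded rather than imported.

```python
import numpy as np

# 0.75 quantile of the standard normal distribution, hard-coded here
# (assumed to match the default scaling used by statsmodels' robust.mad).
NORMAL_SCALE = 0.6744897501960817

def scaled_mad(values, scale=NORMAL_SCALE):
    """Median absolute deviation from the median, divided by `scale`."""
    values = np.asarray(values, dtype=float)
    raw_mad = np.median(np.abs(values - np.median(values)))
    return raw_mad / scale

# Made-up values (NOT the Haberman data): one extreme value barely moves the MAD.
values = [1, 2, 3, 4, 100]
print("Raw MAD    :", scaled_mad(values, scale=1.0))
print("Scaled MAD :", scaled_mad(values))
print("Std Dev    :", np.std(values))

# A raw MAD of 4.0 scales to the 5.93 printed for non survivors.
print("4.0 / scale :", 4.0 / NORMAL_SCALE)
```

The gap between the standard deviation and the MAD in both classes is a sign of the long right tail in axil_nodes, which is also why the mean sits well above the median.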
Box Plots :
In [16]: sn.boxplot(x = 'surv_status', y = 'axil_nodes',data = haber)
plt.title('Box Plots : Axil_Nodes VS Surv_Status')
plt.show()
Observation 11:
Patients who had no axil nodes (Axil Nodes = 0) had a greater chance of survival.
In [17]: sn.boxplot(x = 'surv_status', y = 'age',data = haber)

plt.title('Box Plots : Age VS Surv_Status')
plt.show()
Observation 12:
The plot shows that all patients who were treated before the age of 34 survived.
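A claim like this is easier to trust when it is checked directly against the data instead of being read off a box plot. One way to do that, sketched under the assumption that surv_status = 2 marks non survivors, is to look up the youngest non survivor: everyone younger than that age survived. The demo frame and the helper name youngest_non_survivor_age are made up for illustration.

```python
import pandas as pd

def youngest_non_survivor_age(df):
    """Smallest age among records with surv_status == 2."""
    return df.loc[df["surv_status"] == 2, "age"].min()

# Made-up demo rows (NOT the Haberman data).
demo = pd.DataFrame({
    "age":         [30, 31, 33, 34, 34, 41, 52],
    "surv_status": [1,  1,  1,  1,  2,  2,  1],
})

cutoff = youngest_non_survivor_age(demo)
print("Youngest non-survivor age :", cutoff)

# Every record below the cutoff is a survivor by construction.
younger = demo[demo["age"] < cutoff]
print("Records below cutoff :", len(younger))
print("All survived :", bool((younger["surv_status"] == 1).all()))
```

On the real data this would be called as youngest_non_survivor_age(haber).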
Violin Plots :
In [19]: sn.violinplot(x = 'surv_status', y = 'op_year', data = haber)
plt.title('Violin Plots : Surv_Status VS Op_Year')
plt.show()
Observation 13:
The plot shows that most of the surgeries took place between 1960 and 1965.
Conclusion :
Patients with fewer Axil Nodes have a better chance of survival.
All patients who were treated before the age of 34 survived.
Most of the surgeries took place between 1960 and 1965.
