
Experiment No. 1
Aim: To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms.

Theory

What is Data Preprocessing?
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Steps
1. Check missing values
2. Encode categorical variables (Label Encoding and One-Hot Encoding)
3. Split data into training and testing sets
4. Feature scaling

Code and Output

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('50_Startups.csv')

In [3]:
data.head(5)

Out[3]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [4]:

data.shape

Out[4]:
(50, 5)

In [5]:

data.columns #features

Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

Checking missing values

In [6]:
# check for missing values
data.isnull().any()
# It is observed that no column has missing values

Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

Handling missing values

1. Drop rows having null values

2. Fill missing values with mean/median/mode or any relevant value
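A minimal sketch of the second option, filling missing values with the column mean via `fillna` (the toy frame here is illustrative, since 50_Startups.csv turns out to have no nulls):

```python
import numpy as np
import pandas as pd

# toy frame with one missing value (illustrative; the real data set has none)
df = pd.DataFrame({'Marketing Spend': [100.0, np.nan, 300.0]})

# fill NaNs with the column mean instead of dropping the row
df['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())
print(df['Marketing Spend'].tolist())  # [100.0, 200.0, 300.0]
```

For a categorical column, `df['State'].fillna(df['State'].mode()[0])` would be the analogous fill with the mode.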

In [7]:
# Dropping null rows
data.dropna(inplace=True)
data.isnull().any()
#No null values now

Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

In [8]:
print(data.shape)

(50, 5)

Handling categorical variables

In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()

Out[17]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [18]:

data2['Profit'].unique()

Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])

In [160]:

from sklearn.preprocessing import LabelEncoder


label_encoder = LabelEncoder()

In [19]:

data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])

In [20]:

data_LE.head()

Out[20]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

In [21]:
data_LE_df = pd.DataFrame(data_LE)

In [22]:
data_LE_df.dropna(inplace=True)

In [23]:
data_LE_df

Out[23]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

5 131876.90 99814.71 362861.36 2 156991.12

6 134615.46 147198.87 127716.82 0 156122.51

7 130298.13 145530.06 323876.68 1 155752.60

8 120542.52 148718.95 311613.29 2 152211.77

9 123334.88 108679.17 304981.62 0 149759.96

10 101913.08 110594.11 229160.95 1 146121.95

11 100671.96 91790.61 249744.55 0 144259.40

12 93863.75 127320.38 249839.44 1 141585.52


13 R&D Spend Administration
91992.39 135495.07 Marketing
252664.93 Spend State
0 Profit
134307.35

14 119943.24 156547.42 256512.92 1 132602.65

15 114523.61 122616.84 261776.23 2 129917.04

16 78013.11 121597.55 264346.06 0 126992.93

17 94657.16 145077.58 282574.31 2 125370.37

18 91749.16 114175.79 294919.57 1 124266.90

19 86419.70 153514.11 0.00 2 122776.86

20 76253.86 113867.30 298664.47 0 118474.03

21 78389.47 153773.43 299737.29 2 111313.02

22 73994.56 122782.75 303319.26 1 110352.25

23 67532.53 105751.03 304768.73 1 108733.99

24 77044.01 99281.34 140574.81 2 108552.04

25 64664.71 139553.16 137962.62 0 107404.34

26 75328.87 144135.98 134050.07 1 105733.54

27 72107.60 127864.55 353183.81 2 105008.31

28 66051.52 182645.56 118148.20 1 103282.38

29 65605.48 153032.06 107138.38 2 101004.64

30 61994.48 115641.28 91131.24 1 99937.59

31 61136.38 152701.92 88218.23 2 97483.56

32 63408.86 129219.61 46085.25 0 97427.84

33 55493.95 103057.49 214634.81 1 96778.92

34 46426.07 157693.92 210797.67 0 96712.80

35 46014.02 85047.44 205517.64 2 96479.51

36 28663.76 127056.21 201126.82 1 90708.19

37 44069.95 51283.14 197029.42 0 89949.14

38 20229.59 65947.93 185265.10 2 81229.06

39 38558.51 82982.09 174999.30 0 81005.76

40 28754.33 118546.05 172795.67 0 78239.91

41 27892.92 84710.77 164470.71 1 77798.83

42 23640.93 96189.63 148001.11 0 71498.49

43 15505.73 127382.30 35534.17 2 69758.98

44 22177.74 154806.14 28334.72 0 65200.33

45 1000.23 124153.04 1903.93 2 64926.08

46 1315.46 115816.21 297114.46 1 49490.75

47 0.00 135426.92 0.00 0 42559.73

48 542.05 51743.15 0.00 2 35673.41

49 0.00 116983.80 45173.06 0 14681.40
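The Steps section also lists One-Hot Encoding, which is not shown above. A minimal sketch with `pd.get_dummies`, using State values from this data set; unlike label encoding, it implies no ordering between states:

```python
import pandas as pd

# one-hot encode State instead of label encoding it,
# producing one binary column per state
df = pd.DataFrame({'State': ['New York', 'California', 'Florida']})
one_hot = pd.get_dummies(df['State'], prefix='State')
print(list(one_hot.columns))  # ['State_California', 'State_Florida', 'State_New York']
```

The resulting columns can be joined back with `pd.concat([df, one_hot], axis=1)` and the original State column dropped.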

Splitting into training and testing sets

In [26]:
from sklearn.model_selection import train_test_split
# note: Profit is still present in the feature frame here
X_train, X_test, y_train, y_test = train_test_split(data_LE_df, data_LE_df['Profit'], test_size=0.2)

In [27]:
X_train.head()

Out[27]:

R&D Spend Administration Marketing Spend State Profit

25 64664.71 139553.16 137962.62 0 107404.34

0 165349.20 136897.80 471784.10 2 192261.83

10 101913.08 110594.11 229160.95 1 146121.95

14 119943.24 156547.42 256512.92 1 132602.65

35 46014.02 85047.44 205517.64 2 96479.51

In [28]:
y_train.head()

Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
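Note that the split above keeps Profit inside X_train as well as in y_train, which would leak the target into the features. A sketch of separating features from the target before splitting, on a small stand-in for the label-encoded frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# small stand-in for the label-encoded startup data
df = pd.DataFrame({
    'R&D Spend': [1.0, 2.0, 3.0, 4.0, 5.0],
    'State': [0, 1, 2, 0, 1],
    'Profit': [10.0, 20.0, 30.0, 40.0, 50.0],
})

X = df.drop(columns='Profit')   # features only
y = df['Profit']                # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, 'Profit' in X_train.columns)  # (4, 2) False
```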

Feature Scaling

In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()

In [30]:
X_train = standard_X.fit_transform(X_train)
X_test = standard_X.transform(X_test)  # reuse the training-set statistics; do not refit on the test set

In [31]:
pd.DataFrame(X_train) #SCALED

Out[31]:

0 1 2 3 4

0 -0.147778 0.768777 -0.732925 -1.248168 -0.078585

1 2.099133 0.672035 2.246595 1.187282 2.114855

2 0.683470 -0.286287 0.081064 -0.030443 0.922208

3 1.085838 1.387929 0.325194 -0.030443 0.572754

4 -0.563993 -1.217028 -0.129964 1.187282 -0.360975

5 -0.949166 0.003426 -0.422023 -1.248168 -0.832442

6 -1.590858 -0.053492 -1.561117 -1.248168 -2.475335

7 0.158509 1.286864 0.710993 1.187282 0.022449

8 -0.730373 -1.292275 -0.402355 -1.248168 -0.760949


9 0.521545 0.970048 0.557805 1.187282 0.385810

10 1.316921 0.986533 0.926449 -0.030443 1.171146

11 -0.607378 -2.447162 -0.205725 -1.248168 -0.529776

12 -0.352435 -0.560869 -0.048589 -0.030443 -0.353236

13 0.964891 0.151737 0.372172 1.187282 0.503335

14 -1.244827 0.325357 -1.647149 1.187282 -1.051661

15 -1.578762 -2.430403 -1.964309 1.187282 -1.932723

16 -1.139408 -1.912880 -0.310728 1.187282 -0.755177

17 -1.561502 -0.096031 0.687583 -0.030443 -1.575565

18 -0.968390 -1.229294 -0.496328 -0.030443 -0.843843

19 1.631008 0.008009 1.455935 1.187282 1.872917

20 0.018321 0.342926 1.188029 1.187282 -0.140519

21 -1.095932 1.324489 -1.711408 -1.248168 -1.169496

22 -0.083778 -0.462735 0.755901 -0.030443 -0.044215

23 0.090208 0.935743 -0.767847 -0.030443 -0.121772

24 0.150110 0.114601 0.395109 -1.248168 0.427751

25 0.462077 0.620929 0.290849 -1.248168 0.616818

26 1.099212 1.102714 0.816992 1.187282 1.079621

27 1.413268 1.047333 -0.824374 -1.248168 1.180707

28 1.580460 -0.985886 1.303923 -0.030443 1.440884

29 1.352154 -0.679013 1.274406 1.187282 1.203160

30 -0.951188 0.313476 -0.169154 -0.030443 -0.510155

31 0.655773 -0.971355 0.264783 -1.248168 0.874064

32 -0.554798 1.429699 -0.082837 -1.248168 -0.354945

33 -1.063279 -0.811085 -0.643327 -1.248168 -1.006698

34 -1.568537 0.207705 -1.947316 1.187282 -1.176585

35 0.503839 0.323101 0.265630 -0.030443 0.804948

36 0.060431 0.157781 0.742964 -0.030443 -0.002386

37 0.110850 -0.167035 0.701417 -1.248168 0.207550

38 0.456649 -0.155796 0.667992 -0.030443 0.357287

39 0.337715 1.277416 -1.964309 1.187282 0.318772
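StandardScaler standardises each feature to zero mean and unit variance. The other common form of feature scaling is min-max normalisation to [0, 1]; a minimal sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# rescale each feature to the [0, 1] range
X = np.array([[1.0], [3.0], [5.0]])
minmax = MinMaxScaler()
X_scaled = minmax.fit_transform(X)
print(X_scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, since the minimum and maximum define the range.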

Result
The data set was pre-processed by checking for missing values, label encoding the categorical State variable, splitting the data into training and testing sets, and feature scaling (standardisation).

Conclusion
Real-world data is generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Hence, it is essential to pre-process data so that algorithms can be applied without hindrance.
