Вы находитесь на странице: 1из 9

9/21/2018 Random forest

RANDOM FOREST

ensemble of multiple trees

Ensemble learning technique ( collection of many individual components to create a big tree). used as boosting
technique ( sum of the whole is more than sum of individual parts ). Forest -->collection of trees. Many decision
trees (each tree = one model with different subsets of features(different combinations of feature) and subsets of
data) -> combine output of all these trees -> predict sample based on the maximum occurence of output from the
different DT models. Hence it makes up for the difficiencies in the individual models.

The term ensemble is used when more than one machine learning model/algorithm is bundled to give out the
average

Outcome of all models are taken and majority voting is made the decision

Random forest has classifier for classification and regressor for regression

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import RandomForestRegressor

In [6]: datafile = "D:\komal\SIMPLILEARN\MY COURSES\IN PROGRESS\MACHINE LEARNING RECOR


DINGS\Jul 28 Sat - Aug 25 Sat\Drive downloads\Machine Learning _ Jul 28 - Aug
25 _ Sayan\Decision Trees/titanicdata.htm"

In [7]: #BeautifulSoup is the library used for web scrapping

from bs4 import BeautifulSoup


with open(datafile,"r",encoding="Latin-1") as f:
soup = BeautifulSoup(f,"html.parser")

In [8]: table = soup.find('table')

In [9]: import pandas as pd


data = data = pd.read_html(str(table).encode('ascii', errors='replace'), flavo
r='bs4')[0]

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 1/9
9/21/2018 Random forest

In [10]: data.head()

Out[10]:
Boat Unnamed:
Name Age Class/Dept Ticket Joined Job
[Body] 7

AB??-AL-
MUN??, Mr 3rd Class 2699?18
0 27 Cherbourg ? 15? NaN
N??s??f Passenger 15s 9d
Q??sim

ABBING, Mr 3rd Class 5547?7 Blacksmith


1 42 Southampton ?? NaN
Anthony Passenger 11s ?

ABBOTT,
3rd Class CA2673?
2 Mrs Rhoda 39 Southampton ? A? NaN
Passenger 20 5s
Mary 'Rosa'

ABBOTT, Mr
3rd Class CA2673?
3 Rossmore 16 Southampton Jeweller ? ?[190] NaN
Passenger 20 5s
Edward

ABBOTT, Mr
3rd Class CA2673?
4 Eugene 13 Southampton Scholar ? ?? NaN
Passenger 20 5s
Joseph

In [11]: def cleanup(value):


return value.replace("?", " ")

data['Name']= data['Name'].apply(cleanup)
data['Boat [Body]']= data['Boat [Body]'].apply(cleanup)
data['Age'] = data['Age'].apply(pd.to_numeric, errors='coerce')
data = data[["Name","Age","Class/Dept","Boat [Body]"]]

data.head()

Out[11]:
Name Age Class/Dept Boat [Body]

0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15

1 ABBING, Mr Anthony 42.0 3rd Class Passenger

2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A

3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190]

4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 2/9
9/21/2018 Random forest

In [12]: def checkPass(class_type):


if "Passenger" in class_type:
return "Passenger"
else:
return "Crew"

data["Crew/Pass"]=data["Class/Dept"].apply(checkPass)
data.head()

Out[12]:
Name Age Class/Dept Boat [Body] Crew/Pass

0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger

1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger

2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger

3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger

4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger

In [13]: def checkClass(class_type):


if "Passenger" in class_type:
return class_type.split(" ")[0]
else:
return "Crew"

data["Class"]=data["Class/Dept"].apply(checkClass)
data.head()

Out[13]:
Boat
Name Age Class/Dept Crew/Pass Class
[Body]

3rd Class
0 AB -AL-MUN , Mr N s f Q sim 27.0 15 Passenger 3rd
Passenger

3rd Class
1 ABBING, Mr Anthony 42.0 Passenger 3rd
Passenger

ABBOTT, Mrs Rhoda Mary 3rd Class


2 39.0 A Passenger 3rd
'Rosa' Passenger

ABBOTT, Mr Rossmore 3rd Class


3 16.0 [190] Passenger 3rd
Edward Passenger

3rd Class
4 ABBOTT, Mr Eugene Joseph 13.0 Passenger 3rd
Passenger

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 3/9
9/21/2018 Random forest

In [14]: def checkAdult(age):


if age>=18:
return "Adult"
else:
return "Child"

data["Adult/Child"]=data["Age"].apply(checkAdult)
data.head()

Out[14]:
Boat
Name Age Class/Dept Crew/Pass Class Adult/Child
[Body]

AB -AL-MUN , Mr N s f 3rd Class


0 27.0 15 Passenger 3rd Adult
Q sim Passenger

3rd Class
1 ABBING, Mr Anthony 42.0 Passenger 3rd Adult
Passenger

ABBOTT, Mrs Rhoda 3rd Class


2 39.0 A Passenger 3rd Adult
Mary 'Rosa' Passenger

ABBOTT, Mr Rossmore 3rd Class


3 16.0 [190] Passenger 3rd Child
Edward Passenger

ABBOTT, Mr Eugene 3rd Class


4 13.0 Passenger 3rd Child
Joseph Passenger

In [15]: def checkGender(name):


firstname = name[name.index(",")+2:]
salutation = firstname.split(" ")[0]
if salutation in ["Mr","Master"]:
return "Male"
else:
return "Female"

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 4/9
9/21/2018 Random forest

In [16]: data["Gender"]=data["Name"].apply(checkGender)
data.head()

Out[16]:
Boat
Name Age Class/Dept Crew/Pass Class Adult/Child Gender
[Body]

AB -AL-MUN , Mr 3rd Class


0 27.0 15 Passenger 3rd Adult Male
N s f Q sim Passenger

ABBING, Mr 3rd Class


1 42.0 Passenger 3rd Adult Male
Anthony Passenger

ABBOTT, Mrs
3rd Class
2 Rhoda Mary 39.0 A Passenger 3rd Adult Female
Passenger
'Rosa'

ABBOTT, Mr
3rd Class
3 Rossmore 16.0 [190] Passenger 3rd Child Male
Passenger
Edward

ABBOTT, Mr 3rd Class


4 13.0 Passenger 3rd Child Male
Eugene Joseph Passenger

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 5/9
9/21/2018 Random forest

In [17]: def checkSurvival(boat):


if boat.strip()==" " or "[" in boat:
return 0
else:
return 1
data["Survival"]=data["Boat [Body]"].apply(checkSurvival)
data.head()

Out[17]:
Boat
Name Age Class/Dept Crew/Pass Class Adult/Child Gender Survival
[Body]

AB -AL-
MUN , Mr 3rd Class
0 27.0 15 Passenger 3rd Adult Male 1
NsfQ Passenger
sim

ABBING,
3rd Class
1 Mr 42.0 Passenger 3rd Adult Male 1
Passenger
Anthony

ABBOTT,
Mrs
3rd Class
2 Rhoda 39.0 A Passenger 3rd Adult Female 1
Passenger
Mary
'Rosa'

ABBOTT,
Mr 3rd Class
3 16.0 [190] Passenger 3rd Child Male 0
Rossmore Passenger
Edward

ABBOTT,
Mr 3rd Class
4 13.0 Passenger 3rd Child Male 1
Eugene Passenger
Joseph

In [18]: data.groupby(['Crew/Pass'])['Survival'].sum()*100/data.groupby(['Crew/Pass'])[
'Survival'].count()

Out[18]: Crew/Pass
Crew 90.217391
Passenger 90.310651
Name: Survival, dtype: float64

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 6/9
9/21/2018 Random forest

In [19]: def compare(group,data):


return data.groupby([group])['Survival'].sum()*100/data.groupby([group])[
'Survival'].count()

compare("Class",data)

Out[19]: Class
1st 89.714286
2nd 88.395904
3rd 91.396333
Crew 90.217391
Name: Survival, dtype: float64

In [20]: compare("Gender",data)

Out[20]: Gender
Female 95.840555
Male 88.557743
Name: Survival, dtype: float64

In [21]: compare("Adult/Child",data)

Out[21]: Adult/Child
Adult 89.699955
Child 95.964126
Name: Survival, dtype: float64

In [22]: trainingData=data[["Age","Crew/Pass","Class","Adult/Child","Gender","Survival"
]]
trainingData.head()

Out[22]:
Age Crew/Pass Class Adult/Child Gender Survival

0 27.0 Passenger 3rd Adult Male 1

1 42.0 Passenger 3rd Adult Male 1

2 39.0 Passenger 3rd Adult Female 1

3 16.0 Passenger 3rd Child Male 0

4 13.0 Passenger 3rd Child Male 1

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 7/9
9/21/2018 Random forest

In [23]: def catToNum(series):


series = series.astype('category')
return series.cat.codes

catData=trainingData[["Crew/Pass","Class","Adult/Child","Gender"]].apply(catTo
Num)
trainingData[["Crew/Pass","Class","Adult/Child","Gender"]]=catData
trainingData.head()

C:\Users\hariz\Anaconda3\lib\site-packages\pandas\core\frame.py:3137: Setting
WithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/st


able/indexing.html#indexing-view-versus-copy
self[k1] = value[k2]
Out[23]:
Age Crew/Pass Class Adult/Child Gender Survival

0 27.0 1 2 0 1 1

1 42.0 1 2 0 1 1

2 39.0 1 2 0 0 1

3 16.0 1 2 1 1 0

4 13.0 1 2 1 1 1

In [24]: trainingData = trainingData.dropna()


len(trainingData)

Out[24]: 2426

In [25]: from sklearn.model_selection import train_test_split


train, test = train_test_split(trainingData, test_size = 0.2)

In [26]: test.head()

Out[26]:
Age Crew/Pass Class Adult/Child Gender Survival

1990 30.0 0 3 0 1 1

485 32.0 0 3 0 1 1

1591 17.0 1 1 1 1 1

1704 31.0 0 3 0 1 1

2318 34.0 0 3 0 1 1

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 8/9
9/21/2018 Random forest

In [27]: #n_estimators specifies the number of trees to have

from sklearn.ensemble import RandomForestClassifier


clf = RandomForestClassifier(n_estimators=1000,max_leaf_nodes=15)

In [28]: clf

Out[28]: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',


max_depth=None, max_features='auto', max_leaf_nodes=15,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)

In [31]: from sklearn.metrics import accuracy_score

In [32]: def checkAccuracy(clf):


clf=clf.fit(train[["Age","Crew/Pass","Class","Adult/Child","Gender"]],trai
n["Survival"])
predictions = clf.predict(test[["Age","Crew/Pass","Class","Adult/Child","G
ender"]])
return accuracy_score(test["Survival"], predictions)

In [33]: checkAccuracy(clf)

Out[33]: 0.8930041152263375

In [34]: #There are known issues while installing xgboost on windows. Hence, commented
the below code

In [35]: #from xgboost.sklearn import XGBClassifier

In [36]: #clf = XGBClassifier(n_estimators=1000)

In [37]: #checkAccuracy(clf)

In [38]: #clf

file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 9/9

Вам также может понравиться