
Activity 1

1. Read the blog post (https://machinelearningmastery.com/feature-selection-machine-learning-python/).
2. Choose a dataset from sklearn.datasets (https://scikit-learn.org/stable/datasets/index.html).
3. Apply at least 2 of the variable-filtering methods covered in the blog post to the dataset. Comment on the filtering results.
4. Parameter tuning: read the presentation (https://orbi.uliege.be/bitstream/2268/163521/1/slides.pdf).
5. Choose a dataset from sklearn.datasets (https://scikit-learn.org/stable/datasets/index.html).
6. Build a GBRT model (sklearn.ensemble.GradientBoostingClassifier or sklearn.ensemble.GradientBoostingRegressor, depending on the particular problem of the chosen dataset).
7. Use GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to optimize the parameters (page 27 of the presentation): learning_rate, max_depth, min_samples_leaf, max_features.
8. Comment on the optimization results.

Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python.

Each recipe was designed to be complete and standalone, so that you can copy and paste it directly into your project and use it immediately.

The original blog recipes use the Pima Indians onset of diabetes dataset; here each recipe is demonstrated instead on the iris dataset from sklearn.datasets, a multiclass classification problem where all of the attributes are numeric.

1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of
features.

In [1]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import load_iris

# Load data
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(X.shape)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)

# summarize selected features
print(X[0:5, :])

print(X_new[0:5, :])

(150, 4)
(150, 2)
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]]
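
Beyond the transformed array, the fitted selector also exposes the per-feature statistics, which makes it easier to comment on why these two columns were kept. A minimal sketch (not part of the original recipe), reusing the X and y loaded above:

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.scores_)        # chi-squared statistic for each original feature
print(selector.pvalues_)       # corresponding p-values
print(selector.get_support())  # boolean mask of the columns that were kept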

2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.
The example below uses RFE with the logistic regression algorithm to select the top 2 features. The choice of algorithm does not matter too much
as long as it is skillful and consistent.

In [2]:
# Feature Extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load data
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(X.shape)

# feature extraction
model = LogisticRegression(max_iter=300, solver='lbfgs')
rfe = RFE(model, n_features_to_select=2, verbose=1)
fit = rfe.fit(X, y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

(150, 4)
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Num Features: 2
Selected Features: [False False True True]
Feature Ranking: [3 2 1 1]
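
RFE asks you to fix the number of features to keep up front. If you would rather let cross-validation choose that number, scikit-learn also provides RFECV; the following is a small sketch of that variant (an addition to the original recipe), reusing the same data and estimator type:

from sklearn.feature_selection import RFECV

rfecv = RFECV(LogisticRegression(max_iter=300), step=1, cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # mask of the selected features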

3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in
the transformed result.

In the example below, we use PCA and select 2 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component
Analysis Wikipedia article.

In [3]:
# Feature Extraction with PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# load data
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(X.shape)

# feature extraction
pca = PCA(n_components=2)
fit = pca.fit(X)

# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

(150, 4)
Explained Variance: [0.92461872 0.05306648]
[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]]
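
The cell above only inspects the fitted components; to actually obtain the reduced dataset you would call transform on the fitted PCA object. As a further optional variant, n_components can be given as a fraction between 0 and 1 to keep just enough components to explain that share of the variance. A brief sketch under those assumptions:

X_reduced = pca.transform(X)     # project onto the 2 principal components
print(X_reduced.shape)

pca_95 = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_95 = pca_95.fit_transform(X)
print(X_95.shape)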

4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the iris dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

In [4]:
# Feature Importance with Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris

# load data
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(X.shape)

# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, y)
print(model.feature_importances_)

(150, 4)
[0.13590552 0.09380199 0.30342801 0.46686448]
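
These importances only rank the features; to turn them into an actual filter (for example, to satisfy step 3 of the activity) the fitted model could be wrapped in SelectFromModel. A minimal sketch, not part of the original recipe:

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(model, prefit=True, threshold='median')  # keep features above the median importance
X_selected = sfm.transform(X)
print(X_selected.shape)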

Parameter Tuning

In [5]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# load data
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(X.shape)

param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}

# Instantiate model and train
gbc = GradientBoostingClassifier(n_estimators=1000)
gs_cv = GridSearchCV(gbc, param_grid).fit(X, y)

# Predictions
# pred = gbc.predict(X_train)
# print(gbc.predict_proba(X)[1])

# best hyperparameter setting
gs_cv.best_params_

(150, 4)

Out[5]: {'learning_rate': 0.1,
 'max_depth': 4,
 'max_features': 1.0,
 'min_samples_leaf': 17}

In [6]:
# summarize results
print("Best: %f using %s" % (gs_cv.best_score_, gs_cv.best_params_))
means = gs_cv.cv_results_['mean_test_score']
stds = gs_cv.cv_results_['std_test_score']
params = gs_cv.cv_results_['params']
# for mean, stdev, param in zip(means, stds, params):
#     print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.960000 using {'learning_rate': 0.1, 'max_depth': 4, 'max_features': 1.0, 'min_samples_leaf': 17}
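
Because GridSearchCV refits the best parameter combination on the full dataset by default (refit=True), the tuned model is already available without retraining by hand; a short sketch of how it could be reused:

best_model = gs_cv.best_estimator_               # refitted with the best hyperparameters found above
print(best_model.get_params()['learning_rate'])

# or rebuild an equivalent model explicitly from the best parameter dictionary
tuned = GradientBoostingClassifier(n_estimators=1000, **gs_cv.best_params_)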

In [8]:
# Train on a train/test split with the tuned parameters
# (note: learning_rate is set to 0.01 here, lower than the grid-search optimum of 0.1)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=1000,
                                   learning_rate=0.01,
                                   max_depth=4,
                                   max_features=1.0,
                                   min_samples_leaf=17)
model.fit(X_train, y_train)
model.feature_importances_

Out[8]: array([0.00873227, 0.00673273, 0.41499729, 0.5695377 ])
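
The cell above splits off a 20% test set but never scores it; a brief sketch of how the fitted model could be evaluated on that held-out portion:

print(model.score(X_test, y_test))  # mean accuracy on the held-out test set
print(model.predict(X_test))        # class predictions for the test samples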
