Finding the importance of features with forests of trees

In a classification problem, not all features are equally important for predicting the label of a record. Classification algorithms use different approaches to measure which features matter most. For example, XGBoost offers three metrics for measuring feature importance: weight, cover, and gain.

In the following example, I use a forest of randomized trees to evaluate feature importance on a classification task with synthetic data and labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Create synthetic data and labels for the classification model
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=True)

# Use ExtraTreesClassifier for the model
forest = ExtraTreesClassifier(n_estimators=150, random_state=0)

# Train the model on the synthetic data and compute feature importances
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # sort in descending order

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("{0}. feature {1}: {2}".format(f + 1, indices[f], importances[indices[f]]))
```

The above code will give the following output:

```
Feature ranking:
1. feature 3: 0.7275378616791859
2. feature 5: 0.04782038837422823
3. feature 1: 0.029892177779750752
4. feature 7: 0.02936523022130539
5. feature 4: 0.02884410936359628
6. feature 9: 0.028589731215027326
7. feature 0: 0.02791793639608188
8. feature 2: 0.027551788416172958
9. feature 6: 0.026578769169484397
10. feature 8: 0.025902007385166802
```
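The `feature_importances_` attribute is the mean of the per-tree importances, and the individual trees remain accessible via `forest.estimators_`. As a sketch reusing the same data, you can report the spread across trees alongside the mean to get a feel for how stable each ranking is:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Same synthetic data and model as above
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=0,
                           n_repeated=0, n_classes=2, random_state=0, shuffle=True)
forest = ExtraTreesClassifier(n_estimators=150, random_state=0)
forest.fit(X, y)

# Mean importance per feature, plus standard deviation across the 150 trees
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

for rank, idx in enumerate(np.argsort(importances)[::-1], start=1):
    print(f"{rank}. feature {idx}: {importances[idx]:.4f} +/- {std[idx]:.4f}")
```

A large standard deviation relative to the mean suggests the trees disagree about that feature, so small differences in its ranking should not be over-interpreted.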
