In a classification problem, not all features contribute equally to predicting the label of a record. Classification algorithms use different approaches to determine which features matter most. For example, XGBoost can measure feature importance with one of three metrics: weight, cover, or gain.
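As a rough illustration (separate from the main example below), the following sketch shows how those three XGBoost metrics could be queried through the xgboost Python package; the data and model settings here are arbitrary and only meant for demonstration.
import xgboost as xgb
from sklearn.datasets import make_classification
# Arbitrary synthetic data, just to have something to fit
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)
# get_score() maps feature names (f0, f1, ...) to the chosen importance metric
booster = model.get_booster()
for metric in ("weight", "cover", "gain"):
    print(metric, booster.get_score(importance_type=metric))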
In the following example, I use a forest of randomized trees (scikit-learn's ExtraTreesClassifier) to evaluate the importance of features on a classification task with synthetic data and labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
# Create synthetic data and labels for the classification model
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=True)
# Use ExtraTreesClassifier for the model
forest = ExtraTreesClassifier(n_estimators=150,
                              random_state=0)
# Train the model using the synthetic data and compute feature importance
forest.fit(X, y)
importances = forest.feature_importances_
# print(importances)
indices = np.argsort(importances)[::-1] # sort in descending order
# print(indices)
# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("{0}. feature {1}: {2}".format(f + 1, indices[f], importances[indices[f]]))
Running the code above gives the following output:
Feature ranking:
1. feature 3: 0.7275378616791859
2. feature 5: 0.04782038837422823
3. feature 1: 0.029892177779750752
4. feature 7: 0.02936523022130539
5. feature 4: 0.02884410936359628
6. feature 9: 0.028589731215027326
7. feature 0: 0.02791793639608188
8. feature 2: 0.027551788416172958
9. feature 6: 0.026578769169484397
10. feature 8: 0.025902007385166802
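Feature 3 clearly dominates the ranking. Since forest.feature_importances_ is an average of the impurity-based importances computed by the individual trees, it can also be useful to check how much that estimate varies across the forest. Here is a minimal sketch that reuses forest, X, indices, and importances from the code above (the std name is just for illustration):
# Spread of each feature's importance across the individual trees in the forest
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
for f in range(X.shape[1]):
    print("feature {0}: {1:.4f} +/- {2:.4f}".format(
        indices[f], importances[indices[f]], std[indices[f]]))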