[Python] Running a decision tree classifier on the breast cancer dataset

A decision tree (DT) is one of the supervised learning methods used for classification and regression. It is considered a white-box model because the trained model can easily be explained with boolean logic. The goal of the DT algorithm is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Popular decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5, C5.0, and CART (Classification and Regression Trees). In this post, I will use the DT implementation from the scikit-learn library, which uses an optimized version of the CART algorithm.
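Because the model is just a set of nested if/else rules, scikit-learn can print those rules directly with tree.export_text. Here is a minimal sketch (the shallow max_depth=2 and random_state=0 are illustrative choices, not the settings used later in this post):

from sklearn.datasets import load_breast_cancer
from sklearn import tree

# Fit a deliberately shallow tree so the printed rules stay short
data = load_breast_cancer()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

# export_text renders the learned decision rules as plain text
print(tree.export_text(clf, feature_names=list(data.feature_names)))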

When the DT algorithm learns a model from the given data, it may not use all of the features present in the data. Some features can be very informative, while others contribute nothing to the model. The scikit-learn library provides an attribute for finding the important features used by the model.

This post shows how to find the list of important features. The DecisionTreeClassifier() [1] class of scikit-learn has an attribute "feature_importances_" that returns the feature importances as an array. If there are k features in the data, the array will contain k numerical values, one per feature, so you can map each feature to its importance value.

Here is the complete code to get feature importance and evaluate the DT model on breast cancer data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
from sklearn import tree
import numpy as np


def main():
    """
    Train a decision tree on the breast cancer data, print the important
    features, and evaluate the model with cross-validation.
    """
    # initialize some variables
    tree_depth = 6
    rseed = 123

    # fetch breast cancer data and labels
    idata = load_breast_cancer()
    X, y, headers = idata['data'], idata['target'], idata['feature_names']

    # train the model on the full data and map feature names to importances
    clf = tree.DecisionTreeClassifier(max_depth=tree_depth, random_state=rseed)
    clf.fit(X, y)
    feature_importance_1 = dict(zip(headers, clf.feature_importances_))
    # keep only the features the tree actually used (importance > 0)
    feature_importance_2 = {k: v for k, v in feature_importance_1.items() if v != 0}
    print(feature_importance_2)

    # Evaluate the model with out-of-fold predicted probabilities
    # (cross_val_predict uses stratified 5-fold CV by default for classifiers)
    clf = tree.DecisionTreeClassifier(max_depth=tree_depth, random_state=rseed)
    preds = cross_val_predict(clf, X, y, method="predict_proba")
    print("AUC-ROC: ", roc_auc_score(y, preds[:, 1]))
    # np.round thresholds the positive-class probability at 0.5
    print("F1 Score: ", f1_score(y, np.round(preds[:, 1])))
    print("Accuracy: ", accuracy_score(y, np.round(preds[:, 1])))


if __name__ == "__main__":
    main()

The above code prints the following feature importances. All features with an importance of 0 are filtered out of the dictionary, so only the features actually used by the model remain.

{'mean texture': 0.03143202180001385, 'mean smoothness': 0.007067371363261426, 'mean concavity': 0.00883421420407678, 'mean concave points': 0.005679137702620788, 'perimeter error': 0.007368625861991585, 'area error': 0.0020599271469194155, 'smoothness error': 0.001011061323324746, 'compactness error': 0.037943720930179815, 'worst radius': 0.7062764596695585, 'worst texture': 0.05057276067335216, 'worst area': 0.01116564968199479, 'worst smoothness': 0.007441128537094316, 'worst concavity': 0.008790189880800664, 'worst concave points': 0.11435773122481126}
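If you want the features ranked by importance rather than listed in dataset order, a small follow-up sketch (assuming the feature_importance_2 dictionary from the code above is still in scope) is:

# Sort the non-zero importances from highest to lowest
ranked = sorted(feature_importance_2.items(), key=lambda kv: kv[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")

With the output above, 'worst radius' and 'worst concave points' come out on top.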

The performance metrics for the classifier are as follows:

AUC-ROC: 0.9131190211933831
F1 Score: 0.9361702127659575
Accuracy: 0.9209138840070299
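These numbers are computed from the out-of-fold predictions pooled across all folds. If you also want to see how much the AUC varies between folds, cross_val_score can report per-fold scores. A minimal sketch, assuming X, y, and the same classifier settings from the code above:

from sklearn.model_selection import cross_val_score

# Per-fold AUC-ROC with an explicit 5-fold split (the same default used above)
fold_aucs = cross_val_score(
    tree.DecisionTreeClassifier(max_depth=6, random_state=123),
    X, y, cv=5, scoring="roc_auc"
)
print(fold_aucs, fold_aucs.mean())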

References:

  1. A decision tree classifier
