In real-world applications, datasets with a large number of features are very common. High dimensionality not only makes computation expensive but also makes the data harder to interpret. To analyze such datasets, we need methods that reduce their dimensionality while preserving most of the statistical information in the data. Principal component analysis (PCA) is one of the oldest and most widely used of these methods: it reduces the dimensionality of a dataset consisting of many interrelated variables while retaining as much of the variation present in the dataset as possible. The original dataset is transformed into a new set of variables, the principal components (PCs). The PCs maximize variance, are uncorrelated, and are linear functions of the variables in the original dataset. PCA can be based on either the covariance matrix or the correlation matrix [1].
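To make the idea concrete, here is a minimal sketch (my own toy illustration, not part of this post's main example) that computes principal components by hand with numpy, using the eigenvectors of the covariance matrix of a small random matrix:

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))        # 100 samples, 5 features

X_centered = X_toy - X_toy.mean(axis=0)  # PCA works on mean-centered data
cov = np.cov(X_centered, rowvar=False)   # 5 x 5 covariance matrix

# Eigenvectors of the covariance matrix are the principal axes;
# eigenvalues give the variance captured along each axis.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
components = eigvecs[:, order[:2]]       # keep the top 2 components

X_reduced = X_centered @ components      # project onto the 2 PCs
print(X_reduced.shape)                   # (100, 2)

This is essentially what PCA() does for you, just with a more careful, SVD-based implementation.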
In this post, I will explain how to use the PCA class of the sklearn library to reduce the dimension of a dataset. Depending on the shape of the input data and the number of components to extract, it uses either the LAPACK implementation of the full SVD or a randomized truncated SVD following Halko et al. (2009). I will also show the effect of dimensionality reduction on the accuracy of a machine learning model, using a logistic regression model on the breast cancer classification data.
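If you prefer to choose the solver yourself rather than let the library decide, you can set the “svd_solver” parameter explicitly. The snippet below is a small sketch on random data (the variable names are mine) showing the two solvers mentioned above:

import numpy as np
from sklearn.decomposition import PCA

X_demo = np.random.RandomState(0).rand(200, 50)

# "full" runs the exact LAPACK SVD; "randomized" runs the truncated SVD
# of Halko et al.; "auto" (the default) picks between them based on the
# data shape and the number of components requested.
pca_full = PCA(n_components=10, svd_solver="full").fit(X_demo)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X_demo)

print(pca_full.explained_variance_ratio_.sum())
print(pca_rand.explained_variance_ratio_.sum())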
You can use the default values of all parameters of the PCA() function. However, setting the “n_components” parameter is recommended, because by default all components are kept: n_components = min(n_samples, n_features), which can be a big number (it only drops to min(n_samples, n_features) - 1 when svd_solver="arpack"). “n_components” is the number of features you want to keep in the transformed dataset.
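A quick sketch (using the same breast cancer data as the main example below) makes the difference between the default and an explicit value visible:

from sklearn import datasets
from sklearn.decomposition import PCA

X, y = datasets.load_breast_cancer(return_X_y=True)

# Default: all 30 components are kept, so nothing is actually dropped.
pca_default = PCA().fit(X)
print(pca_default.n_components_)               # 30

# Explicit: keep only the 10 directions with the most variance.
pca_10 = PCA(n_components=10).fit(X)
print(pca_10.transform(X).shape)               # (569, 10)
print(pca_10.explained_variance_ratio_.sum())  # fraction of variance retained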
Here is the complete code to reduce the dimension of the breast cancer data to 10 using PCA():
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
def load_bc_data():
    """
    Use sklearn's datasets to load breast cancer data
    """
    bc = datasets.load_breast_cancer()
    return bc.data, bc.target

if __name__ == "__main__":
    """
    Run PCA to reduce the dimension of the data and test impact on accuracy
    """
    # load breast cancer data
    X, y = load_bc_data()
    print("Original dimension of the data: {0}".format(X.shape))
    # print(X)

    # accuracy before reducing dimension
    lr = LogisticRegression(max_iter=500, random_state=101, solver="liblinear")
    y_pred = cross_val_predict(lr, X, y, cv=5)
    print("Accuracy before PCA: ", accuracy_score(y, y_pred))

    # accuracy after reducing dimension
    pca = PCA(n_components=10, random_state=101, svd_solver="auto")
    X = pca.fit_transform(X)
    print("Dimension after PCA: {0}".format(X.shape))
    # print(X)
    y_pred = cross_val_predict(lr, X, y, cv=5)
    print("Accuracy after PCA: ", accuracy_score(y, y_pred))
The output of the above code is as follows:
Original dimension of the data: (569, 30)
Accuracy before PCA: 0.9507908611599297
Dimension after PCA: (569, 10)
Accuracy after PCA: 0.9490333919156415
After running PCA, the features of the transformed dataset are the principal components rather than the original measurements. We also see a slight drop in accuracy after reducing the number of features from 30 to 10. You can change “n_components” to a lower value and observe its effect on classification performance, as in the sketch below.
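A quick way to do that is to loop over a few values of “n_components” and repeat the cross-validation from the script above. The sketch below reuses lr, y, cross_val_predict, and accuracy_score from that script and assumes you kept a copy of the original 30-feature matrix in a variable such as X_full (my own name) before transforming it:

for k in (2, 5, 10, 20, 30):
    pca = PCA(n_components=k, random_state=101, svd_solver="auto")
    X_k = pca.fit_transform(X_full)  # X_full: untouched copy of the original features
    y_pred_k = cross_val_predict(lr, X_k, y, cv=5)
    print(k, accuracy_score(y, y_pred_k))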
References: