[R] How to use Cross-validated Area Under the ROC Curve (cvAUC) – an example using Breast Cancer Data

The cvAUC() function from the R package "cvAUC" computes cross-validated estimates of the area under the ROC curve (AUC). It takes the predictions and true labels from each fold as arguments and returns the AUC for each fold along with the mean AUC across the k folds. The AUC equals the probability that the classifier scores a randomly drawn positive sample higher than a randomly drawn negative sample. The function is a simple wrapper around the AUC functionality in the ROCR package.
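To make the probabilistic interpretation concrete, here is a minimal base-R sketch. The scores and labels are made-up toy values (not from the Breast Cancer data), and the variable names are my own. It computes the AUC once by comparing every positive score with every negative score, and once with the equivalent rank-based (Mann-Whitney) formula.

```r
# Toy scores and labels (hypothetical values, for illustration only)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1, 1, 1, 0, 0, 0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score;
# a tie counts as half a correctly ordered pair
wins <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
pairwise_auc <- mean(wins)

# Equivalent rank-based (Mann-Whitney) form
r <- rank(scores)
n_pos <- length(pos)
n_neg <- length(neg)
rank_auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(pairwise_auc, rank_auc))  # both equal 8/9
```

Eight of the nine positive/negative pairs are ordered correctly here, so both forms give 8/9.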

The format of the function is as follows:

cvAUC(predictions, labels, label.ordering = NULL, folds = NULL)
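As a rough illustration of the two input styles, the sketch below calls cvAUC() on simulated data (the data, the seed, and names such as fold_id are assumptions for this example). In the first call, predictions and labels are plain vectors and folds is a vector of fold ids. In the second, predictions and labels are lists with one element per fold, and folds is left as NULL.

```r
library(cvAUC)

set.seed(1)
n <- 203
labels <- rbinom(n, 1, 0.5)            # simulated 0/1 labels
predictions <- rnorm(n, mean = labels) # positives score higher on average
fold_id <- rep(1:5, length.out = n)    # fold id for every observation

# Style 1: vectors plus a vector of fold ids
out_vec <- cvAUC(predictions, labels, folds = fold_id)

# Style 2: one list element per fold, no folds argument
out_list <- cvAUC(split(predictions, fold_id), split(labels, fold_id))

print(out_vec$fold.AUC)
print(out_list$fold.AUC)
```

Both calls should report the same per-fold AUCs; the list style is the one used in the rest of this post.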

In this post, I am using the BreastCancer dataset from the mlbench package to show how to use the cvAUC() function. I am using the XGBoost classifier to get the 5-fold predictions. The data and labels are manually divided into five groups, with each class split separately so that every group contains both classes. In each fold, one group is used as the test set and the remaining four groups are used as the train set. The XGBoost model is trained on the train set and tested on the test set. The predicted probabilities and labels of the test data are stored in a list. After all k folds have run, the cvAUC() function is called to calculate the Cross-validated Area Under the ROC Curve.
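The fold bookkeeping described above can be checked on a small example. This is only a sketch: make_stratified_folds is a hypothetical helper written for this illustration, and y_toy is a made-up label vector, not the real data.

```r
# Hypothetical helper: split the indices of each class into k groups
make_stratified_folds <- function(y, k) {
  idx0 <- which(y == 0)
  idx1 <- which(y == 1)
  f0 <- split(sample(idx0), rep(1:k, length.out = length(idx0)))
  f1 <- split(sample(idx1), rep(1:k, length.out = length(idx1)))
  lapply(seq_len(k), function(v) c(f0[[v]], f1[[v]]))
}

set.seed(42)
y_toy <- rep(c(0, 1), times = c(60, 40))  # toy labels: 60 negatives, 40 positives
folds <- make_stratified_folds(y_toy, 5)

# Every index appears in exactly one test fold
all_idx <- sort(unlist(folds))
print(identical(all_idx, seq_along(y_toy)))

# Share of positives in each fold matches the overall share
pos_share <- sapply(folds, function(v) mean(y_toy[v]))
print(pos_share)
```

Because each observation is in a test fold exactly once, every observation gets exactly one out-of-fold prediction.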

Here is the complete R code. The code has been tested on R version 4.1.1 using RStudio 1.4.1106, and it runs without errors.

library(xgboost)
library(cvAUC)
library(mlbench)
library(pROC)
library(MLmetrics)

cvDataIndices <- function(Y, V){
  # Randomly divide data indices into V groups, splitting each class
  # separately so that every group contains both classes
  Y0 <- split(sample(which(Y==0)), rep(1:V, length.out=length(which(Y==0))))
  Y1 <- split(sample(which(Y==1)), rep(1:V, length.out=length(which(Y==1))))
  folds <- vector("list", length=V)
  for (v in seq(V)) {
    folds[[v]] <- c(Y0[[v]], Y1[[v]])
  }
  return(folds)
}

trainAndTestModel <- function(X, y, Xte){
  # train XGBoost model on train data and predict on test data
  # compute the ratio of negative count to positive count,
  # used to up-weight the less frequent positive class
  sumpos <- sum(y == 1)
  sumneg <- sum(y == 0)
  r <- sumneg/sumpos
  
  # train the model with the logistic objective so that
  # the predictions are probabilities
  p <- list(objective = "binary:logistic", scale_pos_weight = r)
  dtrain <- xgb.DMatrix(X, label = y)
  model <- xgb.train(params = p, data = dtrain, verbose = 0, nrounds = 25)
  
  # predict the probability of the positive class for the test data
  pred <- predict(model, Xte)
  return(pred)
}

# Wisconsin Breast Cancer Database
data("BreastCancer")

# generate data
X <- subset(BreastCancer, select = -c(Class,Id)) # remove Id and Class
X <- as.matrix(sapply(X, as.numeric)) # keep data as numeric

# generate data labels
y <- as.character(BreastCancer$Class)
y[y=="benign"] <- 0
y[y=="malignant"] <- 1
y <- as.integer(y)

# generate random indices for k fold
k <- 5
cvfolds <- cvDataIndices(y, k)

# lists to store actual labels and predicted probabilities
preds <- vector("list", length = k) 
actuals <- vector("list", length = k)

# run k-fold and get predictions
j <- 1
for (v in cvfolds){
  # test data
  Xte <- X[v,]
  yte <- y[v]
  # train data
  Xtr <- X[-v,]
  ytr <- y[-v]
  
  # run ML model and get predicted probabilities
  pred <- trainAndTestModel(Xtr, ytr, Xte)
  print(c("auc in fold ", j, auc(yte, pred)))
  print(c("accuray in fold ", j, Accuracy(yte, round(pred))))
  
  # store probs and labels
  preds[[j]] <- pred
  actuals[[j]] <- yte
  j <- j+1
}

# compute and plot cross-validated AUC
out <- cvAUC(preds, actuals)
print(c("AUC in each fold", out$fold.AUC))
print(c("Mean AUC", out$cvAUC))
# Plot the ROC curve of each fold
plot(out$perf, col="black", lty=3, main=paste0(k,"-fold CV AUC"))
# Plot the vertically averaged ROC curve across folds
plot(out$perf, col="blue", lty=1, avg="vertical", add=TRUE)
text(0.8,0.2, paste("mean AUC:", format(round(out$cvAUC, 4), nsmall=4)))
text(0.8,0.1, paste("AUC:", format(round(min(out$fold.AUC), 4), nsmall=4), "-", format(round(max(out$fold.AUC), 4), nsmall=4)))

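The above code produces output and a plot like the following. The fold assignment is random and no seed is set, so the exact numbers vary from run to run: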

[1] "auc in fold " "1" "0.99622892635315"
[1] "accuray in fold " "1" "0.964539007092199"
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "auc in fold " "2" "0.982563405797101"
[1] "accuray in fold " "2" "0.964285714285714"
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "auc in fold " "3" "0.973958333333333"
[1] "accuray in fold " "3" "0.921428571428571"
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "auc in fold " "4" "0.963255494505495"
[1] "accuray in fold " "4" "0.949640287769784"
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "auc in fold " "5" "0.993818681318681"
[1] "accuray in fold " "5" "0.956834532374101"
[1] "AUC in each fold" "0.99622892635315" "0.982563405797101" "0.973958333333333" "0.963255494505494" "0.993818681318681"
[1] "Mean AUC" "0.981964968261552"
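As a quick arithmetic check on the output above, the reported "Mean AUC" is simply the average of the five fold AUCs. The sketch below copies the printed values into a vector (the variable names are my own) and recomputes the mean along with the range of fold AUCs shown on the plot.

```r
# Fold AUCs copied from the printed output above
fold_auc <- c(0.99622892635315, 0.982563405797101, 0.973958333333333,
              0.963255494505494, 0.993818681318681)

mean_auc <- mean(fold_auc)
print(mean_auc)          # agrees with the printed "Mean AUC"
print(range(fold_auc))   # lowest and highest fold AUC
```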

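[Figure: 5-fold CV AUC plot showing the ROC curve of each fold (dotted black) and the vertically averaged ROC curve (solid blue)]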
