# [R] How to use Cross-validated Area Under the ROC Curve (cvAUC) – an example using Breast Cancer Data

The cvAUC() function from the R package “cvAUC” computes cross-validated estimates of the area under the ROC curve (AUC). It takes the predictions and true labels from each fold as arguments and returns the AUC of each fold together with the mean AUC across the k folds. The area under the ROC curve equals the probability that the classifier scores a randomly drawn positive sample higher than a randomly drawn negative sample. The function is a simple wrapper around the AUC functionality of the ROCR package.
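To make the probability interpretation concrete, here is a minimal base-R sketch. The scores and labels are made-up toy values, not taken from the breast cancer data. It counts how often a positive sample outscores a negative one and checks the result against the equivalent rank-sum formula.

```r
# Toy scores and labels (made-up values, for illustration only)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score;
# a tie counts as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairwise <- mean(wins)

# Same quantity via the rank-sum (Mann-Whitney) formula
r <- rank(scores)
n_pos <- length(pos)
n_neg <- length(neg)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_pairwise)  # 8 of the 9 positive/negative pairs are ordered correctly
print(auc_rank)
```

Both values come out as 8/9 (about 0.889), because only the pair (0.35, 0.6) is ranked the wrong way round.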

The format of the function is as follows:

``cvAUC(predictions, labels, label.ordering = NULL, folds = NULL)``
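As a quick illustration of the two input styles, the sketch below calls cvAUC() on made-up predictions for two folds. It passes them first as lists with one element per fold, then as flat vectors plus a `folds` vector of fold ids. The variable names and numbers are assumptions chosen for the example.

```r
library(cvAUC)

# Made-up predictions and labels for two folds (illustration only)
preds_by_fold  <- list(c(0.9, 0.4, 0.7, 0.2), c(0.8, 0.3, 0.6, 0.1))
labels_by_fold <- list(c(1, 0, 1, 0), c(1, 0, 0, 1))

# Style 1: one list element per fold
out <- cvAUC(predictions = preds_by_fold, labels = labels_by_fold)
print(out$fold.AUC)  # AUC of each fold
print(out$cvAUC)     # mean of the fold AUCs

# Style 2: flat vectors plus a vector of fold ids
all_preds  <- unlist(preds_by_fold)
all_labels <- unlist(labels_by_fold)
fold_ids   <- rep(1:2, each = 4)
out2 <- cvAUC(all_preds, all_labels, folds = fold_ids)
print(out2$cvAUC)
```

In this toy case the first fold is perfectly separated (AUC 1) and the second orders only half of its positive/negative pairs correctly (AUC 0.5), so the cross-validated AUC is 0.75. Every fold must contain both classes, otherwise its AUC is undefined.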

In this post, I use the BreastCancer dataset (the Wisconsin Breast Cancer Database) from the mlbench package to show how to use the cvAUC() function, with an XGBoost classifier producing the 5-fold predictions. The sample indices are randomly divided into five groups, separately within each class, so that every group contains both benign and malignant samples. In each fold, one group serves as the test set and the remaining four groups form the train set. The XGBoost model is trained on the train set and evaluated on the test set, and the predicted probabilities and true labels of the test data are stored in a list. After all k folds have been processed, the cvAUC() function is called to calculate the cross-validated area under the ROC curve.

Here is the complete R code. It has been tested on R version 4.1.1 using RStudio 1.4.1106 and runs without any errors.

```r
library(xgboost)
library(cvAUC)
library(mlbench)
library(pROC)
library(MLmetrics)

cvDataIndices <- function(Y, V){
  # Randomly divide data indices into V groups, stratified by class
  # so that each group contains both negatives (0) and positives (1)
  Y0 <- split(sample(which(Y==0)), rep(1:V, length.out=length(which(Y==0))))
  Y1 <- split(sample(which(Y==1)), rep(1:V, length.out=length(which(Y==1))))
  folds <- vector("list", length=V)
  for (v in seq(V)) {
    folds[[v]] <- c(Y0[[v]], Y1[[v]])
  }
  return(folds)
}

trainAndTestModel <- function(X, y, Xte){
  # Train an XGBoost model on the train data and predict on the test data

  # ratio of negative count to positive count, used to reweight positives
  sumpos <- sum(y == 1)
  sumneg <- sum(y == 0)
  r <- sumneg/sumpos

  # train the model; the parameter list must be passed via `params`,
  # otherwise the objective and class weight are silently ignored
  p <- list(objective = "binary:logistic", scale_pos_weight = r)
  dtrain <- xgb.DMatrix(X, label = y)
  model <- xgb.train(params = p, data = dtrain, verbose = 0, nrounds = 25)

  # test the model: returns predicted probabilities of the positive class
  pred <- predict(model, Xte)
  return(pred)
}

# Wisconsin Breast Cancer Database
data("BreastCancer")

# generate data
X <- subset(BreastCancer, select = -c(Class,Id)) # remove Id and Class
X <- as.matrix(sapply(X, as.numeric)) # keep data as numeric

# generate data labels
y <- as.character(BreastCancer$Class)
y[y=="benign"] <- 0
y[y=="malignant"] <- 1
y <- as.integer(y)

# generate random indices for k fold
k <- 5
cvfolds <- cvDataIndices(y, k)

# lists to store actual labels and predicted probabilities
preds <- vector("list", length = k)
actuals <- vector("list", length = k)

# run k-fold and get predictions
j <- 1
for (v in cvfolds){
  # test data
  Xte <- X[v,]
  yte <- y[v]
  # train data
  Xtr <- X[-v,]
  ytr <- y[-v]

  # run ML model and get predicted probabilities
  pred <- trainAndTestModel(Xtr, ytr, Xte)
  # per-fold AUC (pROC) and accuracy at a 0.5 threshold (MLmetrics)
  print(c("auc in fold ", j, pROC::auc(yte, pred)))
  print(c("accuracy in fold ", j, Accuracy(yte, round(pred))))

  # store probs and labels
  preds[[j]] <- pred
  actuals[[j]] <- yte
  j <- j+1
}

# compute and plot cross-validated AUC
out <- cvAUC(preds, actuals)
print(c("AUC in each fold", out$fold.AUC))
print(c("Mean AUC", out$cvAUC))
#Plot fold AUCs
plot(out$perf, col="black", lty=3, main=paste0(k,"-fold CV AUC"))
#Plot CV AUC