Best answer

The short answer to this question is "**it depends on the data**". No single value will be suitable for all datasets.

According to XGBoost's documentation, for a binary classification problem,

scale_pos_weight = number of negative class records / number of positive class records.

In your case, scale_pos_weight = number of class 0 records / number of class 1 records (assuming class 1 is the minority, positive class).

However, if your data is highly imbalanced, the above formula might over-correct and not give you the best results. In that case, sqrt(number of class 0 records / number of class 1 records) sometimes provides better results.
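As a quick illustration, here is a minimal sketch comparing the two heuristics on a hypothetical 950/50 class split (the `label` array below is made up for the example):

```python
import numpy as np

# Hypothetical labels: 950 negative (class 0) vs. 50 positive (class 1) records
label = np.array([0] * 950 + [1] * 50)

neg = np.sum(label == 0)
pos = np.sum(label == 1)

ratio_weight = neg / pos          # documented heuristic: 950 / 50 = 19.0
sqrt_weight = np.sqrt(neg / pos)  # damped heuristic for extreme imbalance

print(ratio_weight, sqrt_weight)
```

Note how the square-root variant pulls the weight much closer to 1, which can help when the plain ratio makes the model over-predict the positive class.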

In my opinion, one should run a grid search to find the optimal value of scale_pos_weight. Without it, when the number of class 0 records is very high compared to the number of class 1 records, you get poor recall [*tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives*]. So, in GridSearchCV, use recall as the scoring parameter; the grid search will then find the value of scale_pos_weight that returns the best recall.

Here is a template for the GridSearch code:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Upper bound for the search: the negative-to-positive class ratio
max_spw = int(np.sum(label == 0) / np.sum(label == 1))

model = xgb.XGBClassifier()

# range() needs integer arguments; a step of 5 keeps the grid small
xgb_grid_params = {
    'scale_pos_weight': list(range(1, max_spw + 1, 5))
}

gs = GridSearchCV(model, param_grid=xgb_grid_params, scoring="recall", cv=5, verbose=7)
gs.fit(data, label)
print(gs.best_params_)
```