Solutions for Modeling Imbalanced Data


What to do when modeling really imbalanced data?

Fundraising data is usually really imbalanced: for every 20,000 constituents, fewer than a thousand might give, sometimes half that. Most predictive modeling strategies are designed to work with balanced, normally distributed data, not imbalanced, highly skewed data like ours.

Using downsampling in randomForest models can significantly reduce the false positives and false negatives caused by how scarce donors are compared to non-donors. Weighting the classes helps too, but not by much.

AF16 Model

At the beginning of FY16, I built a predictive model using caret and randomForest. It was an OK model but, in retrospect, had a serious problem: it rarely predicted who would actually donate.

Note that the accuracy of 97% was based on the fact that, of the 20K outcomes tested, 19K were correctly predicted to be non-donors. At the same time, we got as many false positives and false negatives as we did accurate donor predictions. Clearly we've got a problem (see the Balanced Accuracy stat near the bottom of the confusion matrix output, which is more like 77%).
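For context, here's a minimal sketch of the kind of baseline model and confusion matrix I'm talking about. The data frame constituents and its donated outcome column are placeholders, not the actual FY16 data set:

    library(caret)          # model training and confusionMatrix()
    library(randomForest)   # the underlying rf engine

    # Hypothetical data: `constituents` has a factor outcome `donated`
    # ("Yes"/"No") plus predictor columns; all names are placeholders.
    set.seed(16)
    in_train <- createDataPartition(constituents$donated, p = 0.75, list = FALSE)
    training <- constituents[in_train, ]
    testing  <- constituents[-in_train, ]

    rf_fit <- train(donated ~ ., data = training, method = "rf")
    preds  <- predict(rf_fit, newdata = testing)

    # Overall Accuracy looks great because "No" dominates;
    # Balanced Accuracy (near the bottom of the output) tells the real story.
    confusionMatrix(preds, testing$donated, positive = "Yes")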

Dealing with Rare Cases

The problem here is that the donor cases are so rare that the model mostly just guesses that people won’t give.

Google and the Prospect-DMM list suggested two alternatives:

  • weighted modeling, where you penalize the model for predicting that someone won't give
  • downsampling, i.e., sampling a smaller number of the majority class to compare against the minority class

I built a series of models for both solutions, then compared the AUC of each model.

Downsampled Models

I built a matrix of possible sample sizes, i.e., how many of the minority and majority classes should be sampled. Then I looped through that matrix, building a model for each combination.

[Figure: possible downsampling ratios]
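A rough sketch of that grid-and-loop, reusing the hypothetical training data and donated outcome from the baseline sketch above; the grid values here are illustrative, not my exact parameters:

    library(randomForest)

    # Grid of candidate sample sizes: how many "Yes" (minority) and "No"
    # (majority) cases to draw for each tree.
    sizes <- expand.grid(minority = c(50, 100, 250),
                         majority = c(50, 100, 250, 500, 1000, 5000))

    # One stratified, downsampled forest per row of the grid.
    # The sampsize vector follows the order of levels(training$donated),
    # assumed here to be c("No", "Yes").
    down_models <- lapply(seq_len(nrow(sizes)), function(i) {
      randomForest(donated ~ ., data = training,
                   strata   = training$donated,
                   sampsize = c(sizes$majority[i], sizes$minority[i]))
    })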

Weighted Models

I built a similar matrix of possible weights and a model for each.

[Figure: possible weights]
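The same idea for the weighted models, this time using randomForest's classwt argument; the weight values below are placeholders:

    # Grid of candidate class weights, penalizing the model for picking "No".
    weights <- expand.grid(no_wt  = c(0.5, 1),
                           yes_wt = c(5, 10, 20, 50))

    # One weighted forest per row of the grid. As with sampsize, the classwt
    # vector follows the order of levels(training$donated): c("No", "Yes").
    weighted_models <- lapply(seq_len(nrow(weights)), function(i) {
      randomForest(donated ~ ., data = training,
                   classwt = c(weights$no_wt[i], weights$yes_wt[i]))
    })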

Comparing Models

After I had built all the models based on the parameters above (I'm not going to show that code because building that many models took FOREVER; suffice it to say, I used lapply() and a healthy dose of patience), I generated ROC curves and calculated the AUC, or area under the curve, for each.

[Figure: ROC curves for each model]
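Roughly, the ROC/AUC step looks like this using the pROC package; down_models, weighted_models, and testing come from the sketches above:

    library(pROC)

    # Predicted probability of the positive class ("Yes") on the test set,
    # then a ROC curve and its AUC for each candidate model.
    model_auc <- sapply(c(down_models, weighted_models), function(m) {
      probs <- predict(m, newdata = testing, type = "prob")[, "Yes"]
      as.numeric(auc(roc(response = testing$donated, predictor = probs)))
    })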

Plotting AUC

AUC, or area under the (ROC) curve, gives us a single number per model, so plotting it makes it much easier to compare multiple models. That line chart above looks cool but is largely useless for real comparisons.

Clearly, the sampled models performed much better than the weighted models, many of which performed worse than the gray, bog-standard randomForest and the orange caret::randomForest reference models.

[Figure: AUC by model]
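Something along these lines produces that comparison plot, assuming the model_auc vector from the previous sketch; the baseline AUC here is computed from the hypothetical rf_fit model in the first sketch:

    library(ggplot2)

    # AUC of the untuned baseline model, for reference lines.
    base_probs    <- predict(rf_fit, newdata = testing, type = "prob")[, "Yes"]
    reference_auc <- as.numeric(auc(roc(testing$donated, base_probs)))

    # One point per candidate model, colored by tuning strategy.
    auc_df <- data.frame(
      model = seq_along(model_auc),
      type  = rep(c("downsampled", "weighted"),
                  c(length(down_models), length(weighted_models))),
      auc   = model_auc)

    ggplot(auc_df, aes(x = model, y = auc, colour = type)) +
      geom_point() +
      geom_hline(yintercept = reference_auc, linetype = "dashed")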

The Best Models

Interestingly, the best models were those where the minority class sample size was set to 50; the majority sample sizes for the top three models were 500, 50, and 1,000, respectively.

Surprisingly, the worst of the sampled models (which underperformed against the reference models) also used 50 for the minority size, but had MUCH larger majority sizes.

Plotting sample ratio against AUC reveals that sample ratio is inversely related to AUC. The data is noisy and uncertain for ratios smaller than about 25; after that point, however, AUC drops off logarithmically.

[Figures: sample ratio vs. AUC]
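A sketch of that ratio-versus-AUC plot, again reusing the illustrative objects from the earlier sketches (only the downsampled models, which come first in model_auc):

    # Sample ratio = majority size / minority size, plotted against AUC.
    ratio_df <- data.frame(ratio = sizes$majority / sizes$minority,
                           auc   = model_auc[seq_len(nrow(sizes))])

    ggplot(ratio_df, aes(x = ratio, y = auc)) +
      geom_point() +
      geom_smooth() +
      scale_x_log10()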

Summary and General Ending Thoughts

Long story short (too late!), downsampling seemed to be the real winner here. I haven't tried combining the two, i.e., using downsampling to get a good ratio AND using class weights to penalize choosing "not gonna give"; building that sort of model seems like a good next step (a rough sketch is below).
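If you wanted to try that combination, the call might look something like this; the specific sample sizes and weights are placeholders, and I haven't tested this:

    # Untested next step: combine a good downsampling ratio with class
    # weights in the same forest. Vectors follow levels(training$donated),
    # assumed to be c("No", "Yes").
    combo_model <- randomForest(donated ~ ., data = training,
                                strata   = training$donated,
                                sampsize = c(500, 50),   # majority, minority
                                classwt  = c(1, 10))     # weight "Yes" more heavily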

Below are some useful links that helped me figure out what was going on. I didn't link to the Prospect-DMM mailing list because you have to be logged in to access the archives, but there was a great discussion about this problem earlier this fall that got me thinking about it.