### What to do when modeling really imbalanced data?

Fundraising data is usually really imbalanced: for every 20,000 constituents, fewer than a thousand might give, sometimes half that. Most predictive modeling strategies are designed to work with balanced, normally distributed data, not imbalanced, highly skewed data like ours.

Using downsampling in randomForest models can significantly help with the false-positive/false-negative problems caused by how scarce donors are compared to non-donors. Weighting the classes helps too, but not by much.

### AF16 Model

At the beginning of FY16, I built a predictive model using caret and randomForest. It was an OK model but, in retrospect, had a serious problem: it rarely predicted who would actually donate.

Note that the 97% accuracy rests on the fact that of the 20K outcomes tested, 19K were correctly predicted to be non-donors. At the same time, we got as many false positives and false negatives as we did accurate donor predictions. Clearly we've got a problem (see the Balanced Accuracy stat at the bottom, which is closer to 77%):

```r
confusionMatrix(
  # the original model predicted for leadership levels, too,
  # which I don't care about in terms of accuracy
  fct_collapse(predictedoutcomes$rf, donor = c('donor', 'leadership')),
  fct_collapse(predictedoutcomes$actual, donor = c('donor', 'leadership'))
)

## Confusion Matrix and Statistics
## not the exact real-life numbers
##
##           Reference
## Prediction donor no gift
##    donor     300     250
##    no gift   250   19500
##
##                Accuracy : 0.9744
##                  95% CI : (0.9721, 0.9765)
##     No Information Rate : 0.9691
##     P-Value [Acc > NIR] : 4.31e-06
##
##                   Kappa : 0.5615
##  Mcnemar's Test P-Value : 0.1732
##
##             Sensitivity : 0.56000
##             Specificity : 0.98761
##          Pos Pred Value : 0.59022
##          Neg Pred Value : 0.98600
##              Prevalence : 0.03089
##          Detection Rate : 0.01730
##    Detection Prevalence : 0.02931
##       Balanced Accuracy : 0.77380
##
##        'Positive' Class : donor
```
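To see exactly where the misleading 97% comes from, the headline statistics can be recomputed by hand from the four cells of the matrix. A base-R sketch (the counts above are rounded, so these values differ slightly from the printed stats):

```r
# rounded counts from the confusion matrix above
tp <- 300; fp <- 250; fn <- 250; tn <- 19500

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # dominated by 19.5K true negatives
sensitivity <- tp / (tp + fn)                   # barely better than a coin flip on donors
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2  # the honest number

round(c(accuracy = accuracy, balanced = balanced), 4)
## accuracy balanced
##   0.9754   0.7664
```

Accuracy stays sky-high no matter how badly the model does on donors, because the non-donor cell swamps everything else; balanced accuracy averages the per-class rates instead, so it can't hide behind the majority class.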

### Dealing with Rare Cases

The problem here is that the donor cases are so rare that the model mostly just guesses that people won’t give.

Google and the Prospect-DMM list suggested two alternatives:

- weighted modeling, where you penalize the model for predicting that someone won't give
- downsampling, i.e., sampling a smaller number of the majority class to compare against

I built a series of models for both solutions, then compared the AUC of each model.
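For context, downsampling is conceptually just sampling row indices by class before training. A minimal base-R sketch (the function and variable names here are illustrative, not from my actual code):

```r
# pick n_x rows of the minority class and n_y rows of the majority class
downsample_idx <- function(outcome, minority, n_x, n_y) {
  min_idx <- which(outcome == minority)
  maj_idx <- which(outcome != minority)
  c(sample(min_idx, min(n_x, length(min_idx))),
    sample(maj_idx, min(n_y, length(maj_idx))))
}

set.seed(1)
outcome <- factor(c(rep("donor", 600), rep("no gift", 19400)))
idx <- downsample_idx(outcome, "donor", 50, 500)
table(outcome[idx])
##   donor no gift
##      50     500
```

In practice randomForest can do this internally via its `strata` and `sampsize` arguments, which is what the downsampled models below rely on.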

#### Downsampled Models

I built a matrix of possible sample sizes, i.e., how many of the minority and majority classes should be sampled. Then I looped through that matrix, building a model for each possible combination.

```r
possiblesizes
## # A tibble: 18 × 2
##      n_x   n_y
##    <dbl> <dbl>
##  1    50    50
##  2    50   500
##  3    50  1000
##  4    50  5000
##  5    50 20000
##  6    50 60000
##  7   500    50
##  8   500   500
##  9   500  1000
## 10   500  5000
## 11   500 20000
## 12   500 60000
## 13  1000    50
## 14  1000   500
## 15  1000  1000
## 16  1000  5000
## 17  1000 20000
## 18  1000 60000

# plot the possible sizes for clarity
possiblesizes %>%
  ggplot(aes(x = n_x, y = n_y)) +
  geom_jitter(size = 3, width = 50) +
  ggtitle("Possible Sample Sizes")
```
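A grid like that is easy to rebuild with base R's `expand.grid()`; this is a sketch of the idea, not my exact code. Each row then becomes the `sampsize` argument to `randomForest()`, with `strata` set to the outcome so the sampling happens per class:

```r
# all combinations of minority (n_x) and majority (n_y) sample sizes
possiblesizes <- expand.grid(n_y = c(50, 500, 1000, 5000, 20000, 60000),
                             n_x = c(50, 500, 1000))[, c("n_x", "n_y")]
nrow(possiblesizes)  # 18 combinations, one model each

# then, per row i (requires the randomForest package; 'train' is hypothetical):
# randomForest(outcome ~ ., data = train, strata = train$outcome,
#              sampsize = c(possiblesizes$n_x[i], possiblesizes$n_y[i]))
```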

#### Weighted Models

I built a similar matrix of possible weights and a model for each.

```r
possibleweights
## # A tibble: 25 × 2
##      p_x   p_y
##    <dbl> <dbl>
##  1   0.1   0.1
##  2   0.1   0.3
##  3   0.1   0.5
##  4   0.1   0.7
##  5   0.1   0.9
##  6   0.3   0.1
##  7   0.3   0.3
##  8   0.3   0.5
##  9   0.3   0.7
## 10   0.3   0.9
## # ... with 15 more rows

# plot the possible weights for clarity
possibleweights %>%
  ggplot(aes(x = p_x, y = p_y)) +
  geom_point(size = 3) +
  ggtitle("Possible Class Weights")
```
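The weight grid is the same `expand.grid()` trick, with each (p_x, p_y) pair then going to randomForest's `classwt` argument. Again a sketch with hypothetical names:

```r
# all combinations of minority (p_x) and majority (p_y) class weights
possibleweights <- expand.grid(p_y = seq(0.1, 0.9, by = 0.2),
                               p_x = seq(0.1, 0.9, by = 0.2))[, c("p_x", "p_y")]
nrow(possibleweights)  # 25 combinations, one model each

# then, per row i (requires the randomForest package; 'train' is hypothetical):
# randomForest(outcome ~ ., data = train,
#              classwt = c(possibleweights$p_x[i], possibleweights$p_y[i]))
```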

### Comparing Models

After I had built all the models based on the parameters above (which I'm not going to show, because building that many models took FOREVER; suffice it to say I used `lapply()` and a healthy dose of patience), I generated ROC curves and calculated the AUC, or area under the curve, for each.

```r
# plot all the ROCs
plot(FY16sampledrocs[[1]], main = "ROC")
```
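My AUC numbers came from the ROC objects themselves, but it's worth knowing that AUC needs no ROC machinery at all: it equals the probability that a randomly chosen positive outscores a randomly chosen negative, which the rank-sum (Mann-Whitney) identity computes directly. A base-R sketch:

```r
# AUC from the rank-sum identity: average ranks handle ties correctly
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(c(0.9, 0.8, 0.2, 0.1), c(TRUE, TRUE, FALSE, FALSE))   # 1: perfect separation
auc_rank(c(0.6, 0.4, 0.6, 0.4), c(TRUE, FALSE, TRUE, FALSE))   # 0.5: no better than chance
```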

#### Plotting AUC

Plotting AUC, or area under the (ROC) curve, gives us a single number per model, so we can more easily compare multiple models. That line chart above looks cool, but is largely useless for real comparisons.

Clearly, the sampled models performed much better than the weighted models, many of which performed worse than the gray, bog-standard randomForest model and the orange caret-trained model we actually used.

```r
FY16allaucs %>%
  ggplot(aes(x = rownum, y = auc, color = modeltype)) +
  geom_point() +
  ylim(.75, 1) +
  # 0.8507755 for bog-standard RF model
  geom_hline(aes(yintercept = rfreferenceauc), color = 'gray') +
  # 0.8410452 for the caret model's AUC, which is what we actually used
  geom_hline(aes(yintercept = caretreference), color = 'orange')
```

### The Best Models

Interestingly, the best models were those where the minority-class sample size was set to 50; the majority sample sizes for the top three models were 500, 50, and 1000, respectively:

| roundedauc | sampleratio | n_x | n_y  |
|-----------:|------------:|----:|-----:|
| 0.910      | 10          | 50  | 500  |
| 0.907      | 1           | 50  | 50   |
| 0.900      | 20          | 50  | 1000 |

Surprisingly, the worst of the sampled models (which underperformed the reference models) also used 50 for the minority size, but had MUCH larger majority sizes:

| roundedauc | sampleratio | n_x | n_y   |
|-----------:|------------:|----:|------:|
| 0.836      | 120         | 500 | 60000 |
| 0.831      | 400         | 50  | 20000 |
| 0.795      | 1200        | 50  | 60000 |

Plotting sample ratio against AUC reveals that the two are inversely related. The data is noisy and uncertain for ratios smaller than about 25; beyond that point, AUC drops off logarithmically.

### Summary and General Ending Thoughts

Long story short (too late!), downsampling seemed to be the real winner here. I haven't tried combining the two, i.e., using downsampling to get a good ratio AND using class weights to penalize choosing “not gonna give”; building that sort of model seems like a good next step.
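A sketch of what that combined model might look like, assuming the randomForest package; the data frame, predictors, and weights here are toy placeholders, not tested results:

```r
library(randomForest)

# toy stand-in data so the call runs end to end
set.seed(1)
train <- data.frame(
  outcome         = factor(c(rep("donor", 600), rep("no gift", 19400))),
  giving_capacity = rnorm(20000),
  past_contacts   = rpois(20000, 2)
)

# downsample via strata/sampsize AND penalize the majority class via classwt
rf_combined <- randomForest(
  outcome ~ ., data = train,
  strata   = train$outcome,
  sampsize = c(50, 500),            # the best-performing sizes from above
  classwt  = c(0.9, 0.1)            # weights in factor-level order: donor, no gift
)
```

Whether the class weights add anything on top of a good sampling ratio is exactly the open question; comparing this model's AUC against the best downsampled-only model would answer it.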

Below are some useful links that helped me figure out what was going on. I didn't link to the prospect-dmm mailing list below because you have to be logged in to access the archives, but there was a great discussion about this problem earlier this fall that got me thinking about it.