Solutions for Modeling Imbalanced Data


What to do when modeling really imbalanced data?

Fundraising data is usually really imbalanced: for every 20,000 constituents, fewer than a thousand might give, sometimes half that. Most predictive modeling strategies are designed to work with balanced, normally distributed data, not imbalanced, highly skewed data like ours.

Using downsampling in randomForest models can significantly help with the false positive/false negative problems caused by how scarce donors are, compared to non-donors. Weighting the different classes helps, too, but not by much.

AF16 Model

At the beginning of FY16, I built a predictive model using caret and randomForest. It was an OK model but, in retrospect, had some serious problems with predicting who would actually donate.

Note that the accuracy of 97% was based on the fact that, of the 20K outcomes tested, 19K were correctly predicted to be non-donors. At the same time, we got as many false positives and false negatives as we did accurate donor predictions. Clearly we’ve got a problem (see the Balanced Accuracy stat at the bottom of the confusion matrix, which is more like 77%).
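For context, here’s a minimal sketch of the kind of model I mean. The data, column names (events_attended, years_on_file, gave), and class balance are all invented for illustration; the real model used our actual constituent data and a lot more predictors.

    library(caret)
    library(randomForest)

    set.seed(42)

    # Toy stand-in for constituent data: ~5% donors, a couple of made-up
    # predictors (the real data has many more, and they actually matter)
    n <- 20000
    donors <- data.frame(
      events_attended = rpois(n, 1),
      years_on_file   = sample(1:40, n, replace = TRUE),
      gave            = factor(ifelse(runif(n) < 0.05, "Donor", "NonDonor"))
    )

    train_idx <- createDataPartition(donors$gave, p = 0.8, list = FALSE)
    training  <- donors[train_idx, ]
    testing   <- donors[-train_idx, ]

    rf_fit <- train(gave ~ ., data = training, method = "rf",
                    trControl = trainControl(method = "cv", number = 5))

    # Accuracy looks great because non-donors dominate; Balanced Accuracy
    # (the mean of sensitivity and specificity) tells the real story
    confusionMatrix(predict(rf_fit, testing), testing$gave, positive = "Donor")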

Dealing with Rare Cases

The problem here is that the donor cases are so rare that the model mostly just guesses that people won’t give.

Google and the Prospect-DMM list suggested two alternatives:

  • weighted modeling, where you penalize the model for predicting that someone won’t give
  • downsampling, i.e., sampling a smaller number of the majority class to compare against

I built a series of models for both solutions, then compared the AUC of each model.

Downsampled Models

I built a matrix of possible sample sizes, i.e., how many of the minority and majority classes should be sampled. Then I looped through that matrix, building a model for each possible combination.

[Figure: possible downsampling ratios]
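I’m not reproducing the real grid or loop here, but a downsampled randomForest boils down to passing strata plus a per-class sampsize. A rough sketch, reusing the hypothetical training data from above (the grid values are just examples):

    library(randomForest)

    # Hypothetical grid of per-class sample sizes; the real grid had more values
    sample_grid <- expand.grid(minority = c(50, 100, 250),
                               majority = c(50, 500, 1000, 5000))

    predictors <- setdiff(names(training), "gave")

    # One downsampled forest per row: strata plus a per-class sampsize tell
    # randomForest how many of each class to draw for every tree.
    # sampsize order follows the factor levels of gave: Donor, then NonDonor.
    downsampled_models <- lapply(seq_len(nrow(sample_grid)), function(i) {
      randomForest(x = training[, predictors],
                   y = training$gave,
                   strata   = training$gave,
                   sampsize = c(sample_grid$minority[i], sample_grid$majority[i]))
    })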

Weighted Models

I built a similar matrix of possible weights and a model for each.

[Figure: possible class weights]
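Likewise, a hedged sketch of the weighted approach using randomForest’s classwt argument; the weights below are placeholders, not the ones I actually tried:

    # Hypothetical grid of class weights: a heavier weight on Donor penalizes
    # the forest more for missing the rare class (order: Donor, then NonDonor)
    weight_grid <- expand.grid(donor_wt = c(2, 5, 10, 50, 100), nondonor_wt = 1)

    weighted_models <- lapply(seq_len(nrow(weight_grid)), function(i) {
      randomForest(x = training[, predictors],
                   y = training$gave,
                   classwt = c(weight_grid$donor_wt[i], weight_grid$nondonor_wt[i]))
    })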

Comparing Models

After I had built all the models based on the parameters above (which I’m not going to show because building that many models took FOREVER; suffice it to say, I used lapply() and a healthy dose of patience), I generated ROC curves and calculated the AUC, or area under the curve, for each.

[Figure: ROC curves for all of the models]
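One way to do that (not necessarily the exact code I used) is with the pROC package, scoring every model against the held-out test set from the earlier sketch:

    library(pROC)

    # Probability of "Donor" for each model on the held-out test set, then an
    # ROC curve and its AUC; testing and predictors come from the sketches above
    model_auc <- sapply(c(downsampled_models, weighted_models), function(m) {
      probs <- predict(m, testing[, predictors], type = "prob")[, "Donor"]
      as.numeric(auc(roc(response  = testing$gave,
                         predictor = probs,
                         levels    = c("NonDonor", "Donor"))))
    })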

Plotting AUC

Plotting AUC, or area under the (ROC) curve, gives us a single number per model, so we can more easily compare multiple models. That line chart above looks cool, but is largely useless for real comparisons.
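Getting from the AUC vector above to a comparison plot is a one-liner (again, a sketch rather than my actual plotting code):

    # One AUC per model makes for a much easier comparison than overlaid ROC curves
    plot(seq_along(model_auc), model_auc, type = "b",
         xlab = "model", ylab = "AUC", main = "AUC by model")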

Clearly, the sampled models performed much better than the weighted models, many of which performed worse than the gray, bog-standard randomForest model and the orange caret-trained randomForest model.

[Figure: AUC by model]

The Best Models

Interestingly, the best models were those where the minority class sample size was set to 50; the majority sample sizes for the top three models were 500, 50, and 1000, respectively.

Surprisingly, the worst of the sampled models (which underperformed against the reference models) also used 50 for the minority size, but had MUCH larger majority sizes.

Plotting sample ratio against AUC reveals that sample ratio is inversely related to AUC. The data is noisy and uncertain for ratios smaller than about 25; after that point, however, AUC drops off logarithmically.

[Figures: AUC vs. majority/minority sample ratio]
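For reference, that ratio plot can be rebuilt from the sample-size grid; a sketch assuming the sample_grid and model_auc objects from the earlier sketches:

    # Majority-to-minority ratio for each downsampled model, plotted against AUC
    # (the downsampled models come first in model_auc, in sample_grid order)
    sample_grid$ratio <- sample_grid$majority / sample_grid$minority
    sample_grid$auc   <- model_auc[seq_len(nrow(sample_grid))]

    plot(sample_grid$ratio, sample_grid$auc, log = "x",
         xlab = "majority/minority sample ratio", ylab = "AUC")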

Summary and General Ending Thoughts

Long story short (too late!), downsampling seemed to be the real winner here. I haven’t tried combining the two, i.e., using downsampling to get a good ratio AND using class weights to penalize choosing “not gonna give”; building that sort of model seems like a good next step.
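In randomForest terms, that combination might look something like the following (untested, with placeholder sample sizes and weights):

    # Untested idea: downsample to a reasonable ratio AND up-weight the donors;
    # both arguments can be passed to the same randomForest call
    combined_rf <- randomForest(x = training[, predictors],
                                y = training$gave,
                                strata   = training$gave,
                                sampsize = c(50, 500),  # Donor, NonDonor
                                classwt  = c(10, 1))    # Donor, NonDonor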

Below are some useful links that helped me figure out what was going on. I didn’t link to the Prospect-DMM mailing list below because you have to be logged in to access the archives, but there was a great discussion about this problem earlier this fall that got me thinking about it.

Interpreting the Intercept


One of the things that threw me for a loop when I first started building linear models was how to interpret all those numbers that show up when you hit summary(lm(foo ~ bar)).

I started off by joining the Stanford Statistical Learning class (which is starting up again in January–I’m planning on going through it again, maybe even making it through the whole class this time!) and they kept saying “well, here’s the intercept, but nobody cares about that.” Of course, that elicited more curiosity than anything, so here’s some info about how to interpret the intercept.

Don’t try to interpret the intercept

Truthfully, a lot of times, the intercept is kind of meaningless and should be ignored. For example, I’ve got a simulated data sample here with two columns:

  • lifetime giving
  • class year

Let’s run a linear model with class year as the predictor and lifetime giving as the outcome:
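The original code and output aren’t reproduced here, so here’s a crude simulated stand-in, with numbers picked to land in the same ballpark as the results described below:

    set.seed(1)

    # Crude simulated stand-in: older class years give more, with values chosen
    # so the fit lands in roughly the same ballpark as the output quoted below
    class_year      <- sample(1950:2016, 500, replace = TRUE)
    lifetime_giving <- pmax(0, 26469 - 13 * class_year + rnorm(500, sd = 300))
    giving          <- data.frame(class_year, lifetime_giving)

    fit <- lm(lifetime_giving ~ class_year, data = giving)
    summary(fit)  # the intercept is the estimated giving at class year 0000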

Should Class Year be numeric?

After typing up this post, particularly the summary at the bottom, I’ve decided that class year should probably be a categorical variable, i.e., a factor, rather than numeric.

If I had to do it over again, I’d probably round it by decade and then turn it into a factor, particularly because, at least in our system, we store non-alumni with a class year of 0000. We also have a few that are 1111 (some legacy pseudo-alumni) and a handful that are 9999 (no clue; I’m scared to ask).
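For what it’s worth, that bucketing might look something like this (a sketch, with a made-up catch-all level for the oddball codes):

    # One way to bucket class year by decade and treat it as a factor, with the
    # oddball codes (0000, 1111, 9999) dumped into a catch-all level
    giving$class_decade <- factor(
      ifelse(giving$class_year %in% c(0, 1111, 9999),
             "NonAlum/Unknown",
             paste0(floor(giving$class_year / 10) * 10, "s"))
    )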

In any case, I still think it’s a good example to help clear up how to think about the intercept and how to wrap your brain around interpreting it. But it’s clearly a bad way to build a model.

Here, the intercept value is 26,469, which is to say that, according to the model, someone with a class year of 0000 (Mary and Joseph, perhaps?) would have an estimated lifetime giving of $26K.

A quick plot of the data illustrates what’s going on: older alumni are better donors than younger alumni, so the slope of the line is negative. Extrapolate that all the way back to 0 AD and you’ve got an estimated $26K.

[Figure: lifetime giving vs. class year]

Adjust the predictor for meaningful intercepts

We can make the intercept interpretable by making our oldest alumni have a class year of zero. Millikin was founded in 1901–for the sake of simplicity, we’ll make that our zero year by subtracting 1901 from all our class years.
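Concretely, with the simulated data from above (a sketch, not the original code):

    # Shift class years so the founding year (1901) becomes zero, then refit
    giving$years_since_1901 <- giving$class_year - 1901

    fit2 <- lm(lifetime_giving ~ years_since_1901, data = giving)
    summary(fit2)  # the intercept is now average giving at "Millikin Year Zero"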

Now we’ve got something interpretable: at Millikin Year Zero, the average lifetime giving is $1540 and for every year going forward, that average drops by $13.

The intercept is the baseline

Obviously, the model itself is a bad one: clearly, there’s a ton of variation, especially with those middle-aged donors, not to mention that this is a simulated data set, so it’s not representative of our constituent population.

With all that in mind, this model doesn’t help us draw any real conclusions about the data. Hopefully, though, you get an idea of how to think about the intercept, namely, it’s what your baseline outcome would be if you zeroed out all your predictors.


A great explanation of how linear regression works


This video (by the masterful Sean Rule) is quite possibly the best explanation of how linear regression works that I’ve ever seen.

Sidebar re: Fundraising

Linear regression is the technique that underlies most predictive modeling.

Most of the time, we’re taking some combination of factors on the x axis with giving on the y axis: something like “donor code + number of events attended + age” on the x axis and lifetime giving on the y axis.

We want to see how this conglomeration of factors (once we’ve transformed them into numbers in various ways) affects that y variable, e.g., should we expect a higher amount of lifetime giving for certain combinations of our variables?

Creating the mathematical model to answer that question is usually done via linear regression.
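In R terms, that boils down to a call like the one below; the data, column names, and effect sizes are all invented for illustration:

    set.seed(2)

    # Invented data: a donor-code factor, events attended, and age
    n <- 1000
    constituents <- data.frame(
      donor_code      = factor(sample(c("Alum", "Parent", "Friend"), n, replace = TRUE)),
      events_attended = rpois(n, 2),
      age             = sample(22:90, n, replace = TRUE)
    )
    constituents$lifetime_giving <- with(constituents,
      50 * events_attended + 20 * (age - 22) + rnorm(n, sd = 500))

    # The coefficients show how each factor shifts expected lifetime giving
    fit <- lm(lifetime_giving ~ donor_code + events_attended + age, data = constituents)
    summary(fit)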

A couple questions that come up when thinking about linear regression:

Why are we only concerned with the VERTICAL distance on the y axis between each point and the line?

Sean points out: “You minimize the vertical (offset) distance because you’re checking the error between the model (the “best fit” line) and the actual “performance” of the data. By checking the vertical distance, the x – coordinate (input variable) remains consistent between y (data dependent value) and “y hat” (the predicted y-value).”

Or to put it another way: for each value on the x axis, we want to see how far off we are on the y axis. We know our x values; the point of linear regression is to tell us the value of y based on the known value of x. So all we care about is the y distance from the line for any specific x point.

Why do we square the errors instead of just taking absolute value?

Sean covers this briefly about 2:10 in, but let me try to build it out, starting by saying: it’s complex, it’s tricky, and the main answer is a dumb one, namely: “that’s how we’ve always done it.”

There’s a better reason that involves calculus, but to be honest with you, I don’t quite get it (the curse of being an English major, I suppose).
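As far as I can tell, the gist is that squared error is smooth everywhere, so setting its derivative to zero gives a tidy closed-form answer, while absolute value has a sharp corner at zero that calculus doesn’t handle as gracefully. If you’d rather see the difference in practice, here’s a rough sketch (made-up data, base R’s optim()) that fits a line both ways:

    set.seed(3)
    x <- runif(100, 0, 10)
    y <- 2 + 3 * x + rnorm(100)

    # Least squares: minimize the sum of squared vertical distances (what lm() does)
    ls_fit  <- optim(c(0, 0), function(b) sum((y - (b[1] + b[2] * x))^2))

    # Least absolute deviations: minimize the sum of absolute vertical distances
    lad_fit <- optim(c(0, 0), function(b) sum(abs(y - (b[1] + b[2] * x))))

    ls_fit$par   # close to coef(lm(y ~ x))
    lad_fit$par  # similar here, but it punishes big misses less heavily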

Suffice it to say, you probably could use absolute value and get something that works. If you’re really smart, you’ll know the limitations and advantages of each method. I am not really smart, so I’m gonna do it the way most everybody does. That’s a horrible answer, but it ticks the important boxes, namely:

  1. Does it work to give us good predictions most of the time? Yes, yes it does.
  2. Does it treat negative differences as just as important as positive ones? Yes, yes it does.

Ok, that’s good enough for me!