Solutions for Modeling Imbalanced Data

What to do when modeling really imbalanced data?

Fundraising data is usually really imbalanced: for every 20,000 constituents, fewer than a thousand might give, sometimes half that. Most predictive modeling strategies are designed to work with balanced, normally distributed data, not imbalanced, highly skewed data like ours.

Using downsampling in randomForest models can significantly help with the false positive/false negative problems caused by how scarce donors are compared to non-donors. Weighting the classes helps, too, but not by much.

AF16 Model

At the beginning of FY16, I built a predictive model using caret and randomForest. It was an OK model but, in retrospect, it had some serious problems predicting who would actually donate.

Note that the accuracy of 97% was based on the fact that, of the 20K outcomes tested, 19K were correctly predicted to be non-donors. At the same time, we got as many false positives and false negatives as we did accurate donor predictions. Clearly we've got a problem (see the Balanced Accuracy stat at the bottom, which is more like 77%):
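Those stats come out of caret's confusionMatrix(). Here's a stand-in with made-up predictions, just to show the shape of that output and how a high Accuracy can coexist with a much lower Balanced Accuracy:

    library(caret)

    # made-up predicted vs. actual donor status
    actual <- factor(c(rep("NonDonor", 190), rep("Donor", 10)))
    preds  <- factor(c(rep("NonDonor", 188), rep("Donor", 2),
                       rep("NonDonor", 5), rep("Donor", 5)),
                     levels = levels(actual))

    confusionMatrix(preds, actual, positive = "Donor")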

Dealing with Rare Cases

The problem here is that the donor cases are so rare that the model mostly just guesses that people won’t give.

Google and the Prospect-DMM list suggested two alternatives:

  • weighted modeling, where you penalize the model for predicting that a constituent won't give
  • downsampling, i.e. sampling a smaller number of the majority class for comparison

I built a series of models for both solutions, then compared the AUC of each model.

Downsampled Models

I built a matrix of possible sample sizes, i.e. how many of the minority and majority classes should be sampled. Then I looped through that matrix, building a model for each possible combination.

possible downsampling ratios
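The gist of it looked something like this (the data and the sizes here are made-up stand-ins; randomForest's strata and sampsize arguments do the actual downsampling):

    library(randomForest)

    # made-up training set with a rare Donor class
    train <- data.frame(x1 = rnorm(2000), x2 = rnorm(2000),
                        donor = factor(c(rep("Donor", 100), rep("NonDonor", 1900))))

    # matrix of possible sample sizes for the minority and majority classes
    sizes <- expand.grid(minority = c(25, 50, 100),
                         majority = c(50, 100, 500, 1000))

    # one downsampled model per row; sampsize is per class, in the order of the
    # factor levels (Donor first, then NonDonor)
    rf_sampled <- lapply(seq_len(nrow(sizes)), function(i) {
      randomForest(donor ~ ., data = train, strata = train$donor,
                   sampsize = c(sizes$minority[i], sizes$majority[i]))
    })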

Weighted Models

I built a similar matrix of possible weights and a model for each.
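Same idea, using randomForest's classwt argument (again, the grid is a made-up stand-in, reusing the train data from the sketch above):

    # how hard to weight the rare Donor class relative to NonDonor
    weights <- expand.grid(donor_wt = c(1, 5, 10, 20), nondonor_wt = 1)

    rf_weighted <- lapply(seq_len(nrow(weights)), function(i) {
      randomForest(donor ~ ., data = train,
                   classwt = c(weights$donor_wt[i], weights$nondonor_wt[i]))
    })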


Comparing Models

After I had built all the models based on the parameters above (which I’m not going to show because building that many models took FOREVER; suffice it to say, I used lapply() and a healthy dose of patience), I generated ROC curves and calculated the AUC, or area under the curve, for each.
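The ROC/AUC piece, using the pROC package and continuing the sketch above:

    library(pROC)

    # out-of-bag vote share for the Donor class works as a score per constituent
    probs <- rf_sampled[[1]]$votes[, "Donor"]

    roc_obj <- roc(response = train$donor, predictor = probs)
    auc(roc_obj)    # one number per model, easy to compare
    plot(roc_obj)   # the ROC curve itself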


Plotting AUC

AUC, or area under the (ROC) curve, boils each model down to a single number, which makes comparing lots of models much easier. The line chart of ROC curves above looks cool, but it's largely useless for real comparisons.

Clearly, the sampled models performed much better than the weighted models, many of which performed worse than the reference models: the gray, bog-standard randomForest and the orange caret-tuned randomForest.


The Best Models

Interestingly, the best models were those where the minority class sample size was set to 50; the majority sample sizes for the top three models were 500, 50 and 1000 respectively:

Surprisingly, the worst of the sampled models (the ones that underperformed the reference models) also used 50 for the minority size, but had MUCH larger majority sizes:

Plotting sample ratio against AUC shows the two are inversely related. The data is noisy and uncertain for ratios smaller than about 25; after that point, though, AUC drops off logarithmically.



Summary and General Ending Thoughts

Long story short (too late!), downsampling seemed to be the real winner here. I haven't tried combining the two, i.e. using downsampling to get a good ratio AND using class weights to penalize choosing “not gonna give”; building that sort of model seems like a good next step.

Below are some useful links that helped me figure out what was going on. I didn't link to the prospect-dmm mailing list because you have to be logged in to access the archives, but there was a great discussion about this problem earlier this fall that got me thinking about it.


Distributing leaflet maps without a server

Distributing confidential maps is hard

I’ve been trying, for a while now, to figure out how to use R to replace MapPoint–the challenge has been distributing a map of confidential data to non-R users in my office.

Making the maps isn’t THAT hard: leaflet makes it pretty easy to draw up a map and the ggmap library has the geocode() function, so that’s solved too.

The real problem is, after I build a leaflet map of the VP's prospects with giving and wealth rating data, how do I send it to the VP's admin assistant so she can work out his travel plans? I can't put that kind of confidential data on RPubs. Running and administering a Shiny server (not to mention building a Shiny app) is too much work for one-off stuff like this; I just want to be able to save the file and email it.

Use saveWidget()

Turns out, the htmlwidgets library has a saveWidget() function. I did something like this:
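Something along these lines (the prospect data and coordinates below are invented):

    library(leaflet)
    library(htmlwidgets)

    # hypothetical prospects, already geocoded (e.g. via ggmap::geocode())
    prospects <- data.frame(name = c("Prospect A", "Prospect B"),
                            lat  = c(39.74, 39.76),
                            lng  = c(-86.16, -86.15),
                            stringsAsFactors = FALSE)

    m <- leaflet(prospects) %>%
      addTiles() %>%
      addMarkers(lng = ~lng, lat = ~lat, popup = ~name)

    # selfcontained = FALSE writes the html plus a folder of supporting files
    saveWidget(m, file = "~/Desktop/prospect-map.html", selfcontained = FALSE)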

That makes an html file and a directory of helper files (css, javascript, etc.) on my desktop. I zip the html and the directory up into a single zip file, email that off and call it a day.

It still requires internet access to download the map tiles (by default from OpenStreetMap); I'd like to get that sorted out later on so development officers can throw the html file in Dropbox and open it up when they're on the road, but for now, using saveWidget() seems to be the way to go!


Bind_rows() instead of rbind_all() is awesome

I use dplyr's rbind_all() all the time to mash multiple data frames together: most of the time, I'll pull multiple data sets, then stack them on top of each other to do some sort of summary, something like:
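(The data frames here are throwaway stand-ins for the real pulls.)

    library(dplyr)

    # two hypothetical pulls with the same columns
    fy15 <- data.frame(id = 1:3, giving = c(100, 250, 50))
    fy16 <- data.frame(id = 4:6, giving = c(75, 500, 25))

    # stack them, then summarize; rbind_all() worked the same way before deprecation
    bind_rows(fy15, fy16) %>%
      summarise(total_giving = sum(giving), gifts = n())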

Today, I happened to look at the documentation for rbind_all() and found out it's been deprecated in favor of bind_rows().

Ok whatever, I thought. Then I happened to look at the arguments for bind_rows( ... , .id = NULL).

Here's the super cool part: you pass .id a string that becomes the name of a new column in the new data frame. That column is populated either by a sequence of numbers (in whatever order you passed the data frames you're binding together) or, and here's where the real money is, by the names you gave the data frames in the first place. Like this:
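(Reusing the made-up fy15 and fy16 data frames from above.)

    # naming the list elements makes .id fill the new column with those names
    bind_rows(list(fy15 = fy15, fy16 = fy16), .id = "source")
    # the source column comes back as "fy15" for the first three rows
    # and "fy16" for the rest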

Normally, I end up adding a source column or some such to each of the smaller individual data frames before I bind them all together. It's a pain and a good way to screw something up; I often forget to add the column till it's too late, and then I have to go back, make the edit and re-run the code. This solves the problem, and rather elegantly.


You should use “perl = T”

Today, I was trying to use gsub() to replace a bunch of underscores with spaces, then capitalize the words, something like this:
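(The vector here is a made-up example of the kind of column names I was cleaning up.)

    x <- c("first_name", "lifetime_giving", "event_attendance")

    x <- gsub("_", " ", x)            # this part worked fine
    gsub("\\b([a-z])", "\\U\\1", x)   # but \U errors (or silently fails) without perl = TRUE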

The first bit, replacing the underscores with spaces, was working fine, but that \\U bit kept throwing all sorts of errors, and when it didn't, it just didn't work.

As it turns out, R’s built-in regexp engine doesn’t like \\U; you need Perl-style regular expressions to use it. Fortunately that’s easy: just add perl = TRUE as an argument to gsub().

That makes the code above look like this:
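(Same made-up vector as above.)

    x <- c("first_name", "lifetime_giving", "event_attendance")

    x <- gsub("_", " ", x)
    gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
    # "First Name" "Lifetime Giving" "Event Attendance"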

The page about regexp in R says you should always use perl = T. Seems like good advice.


Organize an R Project

If I’m working on a project that’s more than about 50 lines of code, I’ll often end up with several scripts. Based on a conversation in the R community on G+, I’ve started organizing my projects a bit better:

Always use an RStudio project

First of all, I always use an R project. Always.

Ok, maybe not always. Occasionally, I just want to dink around in the R console. I won’t make a new R project for that. But if I’ve got enough code that I want to do any sort of debugging, then I’ll start writing an R script. And if I’ve got even one script, I’ll make an R project for it.

This means I’ve got a directory for just about every project I work on, which is a little clutter-y, but not as bad as having a million scripts lying around that may or may not be interconnected.

Use a numeric system to name your scripts

I’ll often write more than one script. My cutoff for “I need a new script here” is usually when my “load the data” section is bigger than half my screen, or when I find myself at a good stopping place and realize I still have lots of work to do.

When I have more than one script, I’ll start a numerical naming system with a base script called “00-build-something-or-other.R”. This script is usually just a bunch of source() commands. The one that’s running as I type this looks something like this:
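(The file names below are invented, but it's basically just this.)

    # 00-build-phonathon.R: run everything else, in order
    source("01-get-data.R",    echo = TRUE)
    source("02-clean-data.R",  echo = TRUE)
    source("03-build-model.R", echo = TRUE)
    source("04-segment.R",     echo = TRUE)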

Using the echo = T argument for source() makes it so you can actually see the commands that are churning along in the background. For a long time, I didn’t add this, and it drove me crazy that I could run an individual script in RStudio and see all the commands, but not when I ran my base script. Adding echo = T solves that problem.

Also, you can use the beepr package to get notifications when this first script finishes. Doing the following will play the end-of-the-level Super Mario music when your job finally gets done (I’d give $5 for a button/keyboard shortcut that adds an audible notification in RStudio):
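(Using the same invented file name as the sketch above.)

    library(beepr)

    source("00-build-phonathon.R", echo = TRUE)
    beep("mario")   # end-of-the-level Super Mario music when the whole run finishes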

The next scripts are 01-xxxx, 02-xxxx and so on

The next scripts are named 01-xxxxx.R, 02-xxxxx.R, and so on, with each script doing a single job. I try to break up my scripts in places where I won’t have to re-run the previous script each time I screw something up in the one I’m currently working on. Usually, that means each script starts with something like:
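That is, each script reads back in whatever the previous one saved (the file and object names here are invented):

    # 02-clean-data.R: pick up where 01-get-data.R left off, so 01 doesn't have
    # to be re-run every time something in this script blows up
    constituents <- read.csv("output/01-constituents.csv", stringsAsFactors = FALSE)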

Occasionally, too, I’ll use an if() statement to make sure the object I really need exists. Usually, this isn’t necessary if I’m using my 00- script to call everything, but very rarely, I’ll have one script call another.

I did this the other day while I was pulling data for our spring phonathon, for example. I wanted to run a predictive model on our NonAlumni Never-Givers. Since I was training the model on last year’s giving, I didn’t expect the data to change, which meant the model didn’t need to refresh every time I rebuilt the spring data. So in the middle of my segmentation script, I had a bit of code like this:
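(The object and file names here are stand-ins.)

    # only build the model if it isn't already sitting in the workspace
    if (!exists("never_giver_model")) {
      source("02-build-never-giver-model.R", echo = TRUE)
    }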

Use /data and /output directories

Every project I build also has two subdirectories in it: /data and /output. Any files I need that are specific to that project (usually .csvs of weird data, hand-reviewed notes, stuff that doesn’t live in our main database or warehouse) get thrown into /data. Anything that I dump out, whatever the final output is (or, rather, all the drafts of the final output), gets dumped into /output.

This keeps my main project directory fairly clean: the only things that should be in there are the .Rproj file itself, any R scripts (which should sort themselves into order, because they’re named 00-xxxxx, 01-xxxxx, 02-xxxxx, etc.) and the two directories.

It also means I never have to wonder about what folder something is in: unless it’s a reference file that lives somewhere else (and even then, if it’s small and I don’t plan on it changing, I’ll copy it into /data), I can always type read.csv('data/' and then whack Tab and all my files come up. The same goes for writing out data: I know I can always do write.tidy('output/ and I’m good to go.

An important side note there: I use the write.tidy() and read.tidy() functions from my muadc package all the time. They’re wrappers around the standard read/write.csv functions (which in turn are wrappers around the read/write.table functions), but they make my life a LOT easier, if nothing else because I know the arguments are always going to be the same.

Stay Organized

Organizing my projects this way has really helped: when I come back to a project in six months (I’m sure I’ll be back in August, pulling phonathon data together again), I’ll immediately be able to see how the scripts relate to each other, what happened where and what’s important.


Comparing Dr Who episodes by decade

A little while ago, io9 rated every Dr. Who episode from best to worst.

I immediately noticed that a bunch of their favorites were from the reboot, despite the fact that there’s a lot more content in the older series. So I decided to pull the data into R to see if I was imagining things. I know this isn’t fundraising-related, but it IS R-related, and it was a fun project to work on over lunch.

Here’s a plot of all the episodes with year on the x-axis and rank on the y-axis. Remember that higher rank is worse.


It definitely LOOKS like the new stuff is better, but I’ll bet we can do better than just eyeballing it.

Here’s what I found, grouping all the episodes by rounding the year to the nearest decade (i.e. 1953 becomes 1950; 1958 becomes 1960):
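Here's the decade-rounding step (episodes is a made-up stand-in for the scraped io9 list, with year and rank columns):

    library(dplyr)

    # stand-in for the io9 rankings: one row per episode, higher rank = worse
    episodes <- data.frame(year = c(1963, 1968, 1977, 1989, 2006, 2011),
                           rank = c(120, 95, 40, 210, 3, 12))

    # round(year, -1) rounds to the nearest decade: 1953 -> 1950, 1958 -> 1960
    episodes <- mutate(episodes, decade = round(year, -1))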

Let’s Look at Averages

In terms of mean rank, the reboot was WAY better than everything else (and the 1990s stuff had an overall average rank that was much worse):
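(Continuing the sketch above.)

    episodes %>%
      group_by(decade) %>%
      summarise(mean_rank = mean(rank), episodes = n()) %>%
      arrange(mean_rank)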

You can see there that, on average, the 2000s were the best and the 1990s were the worst.

T-test to really see

But these are just averages; it seems like we can do better. I mean, there could be an outlier sitting out there yanking those averages down (or up, as it were).

With that in mind, I ran t-tests comparing each decade to the others:
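pairwise.t.test() wraps all of those comparisons up in one call; roughly (with the full episode list in place of the little sketch above):

    # p-values for every decade-vs-decade comparison, then flag which ones
    # clear the .05 bar
    p_vals <- pairwise.t.test(episodes$rank, episodes$decade)$p.value
    p_vals < 0.05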

Here I’m asking whether the p-values are less than .05, or, to put it another way, whether there’s less than a 5% chance that the difference we’re seeing between each pair of decade averages is just down to chance. TRUE values mark the comparisons where the difference is statistically significant at that level.

Old Stuff Was A Crapshoot

The other thing you can see is that the old stuff was all over the map. The plot above hints at it, but there’s no statistically significant difference in ranks between the 60s, 70s, 80s and 90s. I found that surprising: everybody knows the stuff at the very end of the first series was awful, right? Not io9, apparently.

In any case, if you’re introducing your friends to Dr. Who, start with the reboot–that old stuff is like a box of chocolates….


Moving Data in and out of R via the Clipboard

There’s some interesting stuff in this blog post about reading data into R, particularly about using scan() to read data in (apparently you can use it to type data in, one entry per line).

That said, it’s a pain in the butt to depend on scan()–most of the time when I’m pushing data around, I use R’s ability to read data frames from the clipboard.

I tend to use read.csv(file='clipboard') more than readClipboard(), mostly because I always forget about the latter.

One important note: by default, R only uses a small chunk of the Windows clipboard when writing data out (I have no idea how/if this works at all on Linux and Mac), about 32KB. That’s not enough for a decent-sized data frame/spreadsheet, but it’s pretty easy to bump that limit up.

If you do write.table(foo, file = 'clipboard-4096'), just about anything should fit in there.
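Putting the pieces together (Windows only; foo is a throwaway data frame):

    foo <- data.frame(id = 1:3, giving = c(100, 250, 50))

    # bigger clipboard buffer, tab-separated so Excel splits it into columns,
    # and no row names
    write.table(foo, file = "clipboard-4096", sep = "\t", row.names = FALSE)

    # and coming back the other way
    read.delim("clipboard")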

I’ve got a function named write.clip() in my muadc R package that does this for me, because I’m a lazy bum and got tired of typing “sep = '\t', row.names = F”.


3D Scatterplot in R

Today, I was doing some analysis of a sub-group of our constituents and was trying to get an idea of how connected these folks felt to our school.

I thought I’d do that by running a summary of their lifetime giving, event attendance and number of development officer contacts:
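Something like this (made-up data standing in for the real constituent pull; the column names are invented):

    set.seed(1)
    constituents <- data.frame(lifetime_giving = round(rlnorm(500, 5, 2), 2),
                               events_attended = rpois(500, 2),
                               do_contacts     = rpois(500, 1))

    summary(constituents)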

That shows me the five-number summary (plus the mean) for each column. Obviously, I’ve got some serious outliers in there: somebody who’s attended over 200 events!

But I can’t tell whether the person with 200+ events is the same person who’s given a ton of money, so I tried running some scatterplots comparing the numbers:

lifetime giving vs event attendance
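(One pairwise comparison at a time, with the made-up data from above.)

    plot(lifetime_giving ~ events_attended, data = constituents,
         xlab = "events attended", ylab = "lifetime giving")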

That’s not awful, but I’d have to run a bunch of these to compare all three variables. If I could run a legitimately 3D scatterplot in R, I’d be able to see all three variables at once.

Turns out, it’s REALLY easy.
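A minimal sketch using the threejs package (one of several htmlwidget options for this; same made-up data as above):

    library(threejs)

    # x, y, z: one point per constituent
    scatterplot3js(constituents$events_attended,
                   constituents$do_contacts,
                   constituents$lifetime_giving)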

3D scatterplot: click to zoom, rotate and pan (embedding a live htmlwidget in WordPress made me crazy)

That gives you a really cool, pannable 3D scatterplot so you can see how each person compares on all three variables at once.

To tell the truth, it’s a bit gimmicky (and what do you do when you’ve got more than three variables? You’re back in the same boat). Still, it’s just so cool and fun that I had to share!


Data frames are lists

I’ve got another more expansive post in the hopper that I need to edit and post, but here’s a quick one to reiterate a point that you probably already know, but is really important:

Data Frames are lists

This point, that data frames are lists where each column of the data frame is an item (usually a named vector) within the list, is really, really useful. Here are some things this tidbit lets you do:

  1. Grab a single column as a vector with double brackets, à la constituents[[8]]. If you use single brackets, you’ll get a data frame with just that column (because you’re asking for a list containing just the 8th item, in this case).
  2. Run lapply() and sapply() over each column, i.e. you can easily loop through a data frame, applying the same function to each column.

    You might think this isn’t a big deal. “Normally the columns of my data frames are all different types of data, first name here, lifetime giving there,” you might say.

    But trust me, it won’t be long till you’ve got a data frame and you think “oh, I’ll just write a little for loop to loop through each column and…” That’s what lapply() and/or sapply() are for. Knowing that the scaffolding for that task is already built into one handy function will save your bacon.

  3. Converting a list of vectors to a data frame is as simple as calling as.data.frame() on the list (see the sketch after this list). And while lists can be a bit fiddly, with this one, you know what you’re going to get.
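A quick sketch of all three points (the data frame is invented):

    constituents <- data.frame(first_name      = c("Ada", "Grace", "Hedy"),
                               lifetime_giving = c(100, 2500, 50),
                               events_attended = c(1, 12, 0),
                               stringsAsFactors = FALSE)

    constituents[[2]]            # 1. double brackets: a plain vector
    constituents[2]              #    single brackets: a one-column data frame

    sapply(constituents, class)  # 2. the same function run over every column

    as.data.frame(list(id = 1:3, score = c(9, 7, 8)))  # 3. list of vectors -> data frame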

In short, just knowing that a data frame is a special kind of list makes it a lot easier to handle data frames when the need for crazy sorts of data manipulation comes up (and trust me, it will).


Simple Parallelization

Recently, I’ve been working on some machine learning projects at work. A lot of these require serious computing time; my work machine isn’t super beefy, but it’s no slouch, and still some of these models take a couple of hours to build.


Without really thinking about it, I assumed I was using all my processing power. And then I looked at my Task Manager. It turns out that, of the 4 cores I’ve got, R was only using one!

There’s 3 more boxes there, buddy–use ’em!

Getting multi-core support in R

A bit of research revealed that R is really bad at supporting multiple cores. It’s baffling to me, but that’s the way it is. Apparently, there are various solutions to this, but they involve installing and using extra packages and then making sure your code is parallelizable. Sounds like a recipe for disaster if you ask me; I screw enough up on my own, I don’t need to add a layer of complexity on top of that.

An alternative, easier solution is to use Revolution Analytics’ distribution of R, Revolution R Open, which comes with multi-core support out of the box.

Just download and install it, and when you fire up RStudio the next time, it’ll find it and (probably) start using it (if not, you can go into Global Options in RStudio and point it at that version of R).

Now my packages won’t update

Revolution R Open seems to run just fine, but a couple of weeks in, I realized I had a problem: my packages weren’t updating (and I really wanted the newest version of dplyr!).

Turns out, Revolution R Open is set not to update packages by default. The idea is that packages are pinned to a snapshot of CRAN, so things don’t get updated and break halfway through a project.

This doesn’t really bother me: I’m rarely spending more than a couple of weeks on a single project, nor do I have any massive dependencies that would break between upgrades. So I followed the instructions in their FAQ to set my package repository back to CRAN (basically, you just need to edit your default repository setting).
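For reference, pointing the repository back at CRAN boils down to one line in a startup file like Rprofile.site or ~/.Rprofile:

    # use CRAN instead of the pinned snapshot repository
    options(repos = c(CRAN = "https://cran.r-project.org"))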

And sure enough, I was back in business with dplyr 0.4 (and a host of newly updated packages)!
