The formattable package is AWESOME


Two things I’ve really missed from Excel, when I moved to using R, were conditional formatting and being able to format numbers as currency or percentages without losing data.

I wrote an as.currency() function a while ago, but it turns numbers to strings, which means that I can no longer math those strings without coercing them back into numbers. And coercion is always scary. Do you want to lose data? Because that’s how you lose data.

The formattable package solves both these problems.

It’s got a couple of functions, percent() and currency(), that only affect how numbers are PRINTED, so they come out looking proper even on the terminal, but they stay numeric, so you can continue to use them as numbers.
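
A quick sketch of the idea (assuming the formattable package is installed; the values here are made up):

```r
library(formattable)

p <- percent(c(0.10, 0.125))  # prints as 10.00%, 12.50%
m <- currency(55000)          # prints as $55,000.00

# but underneath, they're still plain numbers:
p * 2           # still prints as percentages
m + 1000        # still prints as currency
as.numeric(p)   # 0.100 0.125 -- the underlying values, nothing lost
```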

It also has a host of conditional formatting options, so you can change the text color of a column based on its values, add data bars (ie. built-in sparklines within the column itself) or format one column based on the values of another (something I’m pretty sure you can do in Excel, but I’ve never been able to figure out how).

Straight out of the vignette (because I want to get back to playing with this stuff, instead of writing this post):

id  price  rating  market_share  revenue      profit
 1     10       5       10.00%    55,000.00    25,300.00
 2     15       4       12.00%    36,400.00    11,500.00
 3     12       4        5.00%    12,000.00    (8,200.00)
 4      8       3        3.00%   (25,000.00)  (46,000.00)
 5      9       4       14.00%    98,100.00    65,000.00
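
A table like that comes out of a formattable() call along these lines (a sketch reconstructed from memory of the vignette, so the exact formatters there may differ):

```r
library(formattable)

products <- data.frame(
  id           = 1:5,
  price        = c(10, 15, 12, 8, 9),
  rating       = c(5, 4, 4, 3, 4),
  market_share = percent(c(0.10, 0.12, 0.05, 0.03, 0.14)),
  revenue      = accounting(c(55000, 36400, 12000, -25000, 98100)),
  profit       = accounting(c(25300, 11500, -8200, -46000, 65000))
)

formattable(products, list(
  rating = color_bar("lightgreen"),  # in-column data bars
  profit = formatter("span",         # red text for losses
    style = x ~ style(color = ifelse(x < 0, "red", "black")))
))
```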

.Last.value stores … the last value


Ok, here’s a quick (but handy) one:

The value of the last expression you evaluated at the console is stored in the variable .Last.value. So if you’re going along, doing something interactively, and you print some tibble and realize you need to see the whole thing (but you don’t want to do that super expensive left join into the database again), you can do:
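
For example (the expensive query is imagined here; the point is just rescuing the already-printed result):

```r
mtcars                  # stand-in for that expensive result you just printed
rescued <- .Last.value  # .Last.value still holds it -- grab it before the
                        # next command clobbers it
# View(rescued)         # in RStudio: the full, filterable view, no re-query
```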

And boom! Your tibble prints all nice and neat and filterable (if that’s what you’re in to).


Solutions for Modeling Imbalanced Data


What to do when modeling really imbalanced data?

Fundraising data is usually really imbalanced–for every 20,000 constituents, fewer than a thousand might give, sometimes half that. Most predictive modeling strategies are designed to work with balanced, normally distributed data, not imbalanced, highly skewed data like ours.

Using downsampling in randomForest models can significantly help with the false positive/false negative problems caused by how scarce donors are, compared to non-donors. Weighting the different classes helps, too, but not by much.

AF16 Model

At the beginning of FY16, I built a predictive model using caret and randomForest. It was an ok model but, in retrospect, it had some serious problems with not predicting who would actually donate.

Note that the accuracy of 97% rests on the fact that, of the 20K outcomes tested, 19K were correctly predicted to be non-donors. At the same time, we got as many false positives and false negatives as we did accurate donor predictions. Clearly we’ve got a problem (the Balanced Accuracy stat at the bottom of the output is more like 77%).

Dealing with Rare Cases

The problem here is that the donor cases are so rare that the model mostly just guesses that people won’t give.

Google and the Prospect-DMM list suggested two alternatives:

  • weighted modeling, where you penalize the model for predicting that a constituent won’t give
  • downsampling, ie. sampling a smaller number of the majority class to compare

I built a series of models for both solutions, then compared the AUC of each model.

Downsampled Models

I built a matrix of possible sample sizes, ie. how many of the minority and majority classes should be sampled. Then I looped through that matrix, building a model for each possible combination.
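
In sketch form (this isn’t my actual code–the data, grid values and variable names are all made up):

```r
library(randomForest)

set.seed(42)
# toy stand-in for the real constituent file: ~5% donors
train <- data.frame(
  giving  = rexp(2000),
  events  = rpois(2000, 1),
  donated = factor(rep(c("no", "yes"), c(1900, 100)))
)

# candidate sample sizes for each class
grid <- expand.grid(minority = c(25, 50), majority = c(50, 200))

models <- lapply(seq_len(nrow(grid)), function(i) {
  randomForest(donated ~ ., data = train,
               strata   = train$donated,
               # per-stratum draws, in factor-level order: "no" then "yes"
               sampsize = c(grid$majority[i], grid$minority[i]),
               ntree    = 100)
})
```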

possible downsampling ratios

Weighted Models

I built a similar matrix of possible weights and a model for each.
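
Sketched the same way (again, toy data and made-up weights):

```r
library(randomForest)

set.seed(42)
train <- data.frame(
  giving  = rexp(2000),
  events  = rpois(2000, 1),
  donated = factor(rep(c("no", "yes"), c(1900, 100)))
)

# candidate class weights: increasingly punish missing a donor
weights <- list(c(no = 1, yes = 5), c(no = 1, yes = 10), c(no = 1, yes = 25))

weighted_models <- lapply(weights, function(w) {
  randomForest(donated ~ ., data = train, classwt = w, ntree = 100)
})
```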


Comparing Models

After I had built all the models based on the parameters above (which I’m not going to show because building that many models took FOREVER; suffice it to say, I used lapply() and a healthy dose of patience), I generated ROC curves and calculated the AUC, or area under the curve, for each.
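
The ROC/AUC step looks roughly like this (I’m using pROC here as a stand-in for whatever ROC package you prefer; the model and data are toys):

```r
library(randomForest)
library(pROC)

set.seed(1)
make_data <- function(n) data.frame(
  giving  = rexp(n),
  donated = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(.9, .1)))
)
train <- make_data(1000)
test  <- make_data(400)

fit   <- randomForest(donated ~ giving, data = train, ntree = 100)
probs <- predict(fit, test, type = "prob")[, "yes"]

roc_obj <- roc(test$donated, probs)  # the ROC curve
auc(roc_obj)                         # one number per model, easy to compare
```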


Plotting AUC

Plotting AUC, or area under the (ROC) curve, gives us a single number per model, so we can more easily compare multiple models. That line chart above looks cool, but is largely useless for real comparisons.

Clearly, the sampled models performed much better than the weighted models, many of which performed worse than the gray, bog standard randomForest and the orange caret::randomForest models.


The Best Models

Interestingly, the best models were those where the minority class was set to 50–the majority sample sizes for the top three models were 500, 50 and 1000 respectively:

Surprisingly, the worst of the sampled models (which underperformed against the reference models) also used 50 for the minority size, but had MUCH larger majority sizes:

Plotting sample ratio against AUC reveals that sample ratio is inversely proportional to AUC. The data is noisy and uncertain for ratios smaller than about 25. After that point, however, AUC drops off logarithmically.



Summary and General Ending Thoughts

Long story short (too late!), downsampling seemed to be the real winner here. I haven’t tried combining the two, ie. using downsampling to get a good ratio AND using class weights to penalize choosing “not gonna give”–building that sort of model seems like a good next step.

Below are some useful links that helped me figure out what was going on. I didn’t link to the prospect-dmm mailing list below because you have to be logged in to access the archives, but there was a great discussion about this problem earlier this fall that got me thinking about it.

Distributing leaflet maps without a server


Distributing confidential maps is hard

I’ve been trying, for a while now, to figure out how to use R to replace MapPoint–the challenge has been distributing a map of confidential data to non-R users in my office.

Making the maps isn’t THAT hard: leaflet makes it pretty easy to draw up a map and the ggmap library has the geocode() function, so that’s solved too.

The real problem is, after I build a leaflet map of the VP’s prospects with giving and wealth rating data, how do I send it to the VP’s admin assistant so she can work out his travel plans? I can’t put that kind of confidential data on Rpubs. Running and administering a shiny server (not to mention building a shiny app) is too much work for one-off stuff like this–I just want to be able to save the file and email it.

Use saveWidget()

Turns out, the htmlwidgets library has a saveWidget() function. I did something like this:
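
The gist of it (the coordinates, file name and popup text are all invented for the example):

```r
library(leaflet)
library(htmlwidgets)

m <- leaflet() %>%
  addTiles() %>%   # default Open Street Map tiles
  addMarkers(lng = -86.80, lat = 33.52, popup = "Prospect: rated $100K+")

# selfcontained = FALSE writes the html plus a folder of css/js helpers
saveWidget(m, file = "prospect-map.html", selfcontained = FALSE)
```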

That makes an html file and a directory of supporting files (css, javascript, etc) on my desktop. I zip the html and the directory up into a single zip file, email that off and call it a day.

It still requires internet access to download the map (by default from Open Street Maps)–I’d like to get that sorted out later on so development officers can throw the html file in Dropbox and open it up when they’re on the road–but for now, using saveWidget() seems to be the way to go!


Bind_rows() instead of rbind_all() is awesome


I use dplyr’s rbind_all() all the time to mash multiple data frames together–most of the time, I’ll pull multiple data sets, but then stack them on top of each other to do some sort of summary, something like:
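
The pattern, roughly (toy data; note that in current dplyr you’d have to write bind_rows(), since rbind_all() is gone):

```r
library(dplyr)

# several separately-pulled data sets, stacked for one summary
alumni  <- tibble(type = "alumni",  giving = c(100, 50, 25))
parents <- tibble(type = "parents", giving = c(10, 75))

bind_rows(alumni, parents) %>%
  group_by(type) %>%
  summarize(total = sum(giving))
```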

Today, I happened to look at the documentation for rbind_all() and found out it’s been deprecated in favor of bind_rows().

Ok whatever, I thought. Then I happened to look at the arguments for bind_rows(..., .id = NULL).

Here’s the super cool part: you pass .id a string that becomes the name of a new column in the new data frame. And that column is populated by either a sequence of numbers (in whatever order you passed the data frames you’re binding together) or (and here’s where the real money is) you can name your data frames in the first place. Like this:
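
For instance (the data frames and names are invented):

```r
library(dplyr)

fy15 <- tibble(id = 1:2, giving = c(100, 50))
fy16 <- tibble(id = 1:2, giving = c(75, 80))

bind_rows(fy15 = fy15, fy16 = fy16, .id = "source")
# the new `source` column reads "fy15" or "fy16" on every row
```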

Normally, I end up adding a source column or some such to each of the smaller individual data frames before I bind them all together. It’s a pain and a good way to screw something up–I often forget to add the column till it’s too late, and then I have to go back, make the edit and re-run the code. This solves the problem, and rather elegantly so.

You should use “perl = T”


Today, I was trying to use gsub() to replace a bunch of underscores with spaces, then capitalize the words, something like this:
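
Roughly this (the example string is invented; the second call is the one that kept blowing up):

```r
x <- "annual_fund_giving"            # e.g. a crufty column name
x <- gsub("_", " ", x)               # works: "annual fund giving"
# gsub("\\b([a-z])", "\\U\\1", x)    # the \U capitalization trick -- no dice
```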

The first bit, replacing the underscores with spaces, was working fine, but that \\U bit kept throwing all sorts of errors and when it didn’t, it just didn’t work.

As it turns out, R’s built-in regexp engine doesn’t like \\U–you need to use Perl-style regular expressions to be able to use it. Fortunately that’s easy, just add perl = TRUE as an argument to gsub().

That makes the code above look like this:
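
A sketch (same invented example string):

```r
x <- "annual_fund_giving"
x <- gsub("_", " ", x)                        # "annual fund giving"
gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)  # "Annual Fund Giving"
```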

The help page about regular expressions in R says you should always use perl = TRUE. Seems like good advice.


Organize an R Project


If I’m working on a project that’s more than about 50 lines of code, I’ll often end up with several scripts. Based on a conversation on the R community on G+, I’ve started organizing my projects a bit better:

Always use an R Studio project

First of all, I always use an R project. Always.

Ok, maybe not always. Occasionally, I just want to dink around in the R console. I won’t make a new R project for that. But if I’ve got enough code that I want to do any sort of debugging, then I’ll start writing an R script. And if I’ve got even one script, I’ll make an R project for it.

This means I’ve got a directory for just about every project I work on, which is a little clutter-y, but not as bad as having a million scripts laying around that may or may not be interconnected.

Use a numeric system to name your scripts

I’ll often write more than one script–my cutoff for “I need a new script here” is usually if my “load the data” section is bigger than half my screen, or if I find myself at a good stopping place and realize I still have lots of work to do.

When I have more than one script, I’ll start a numerical naming system with a base script called “00-build-something-or-other.R”. This script is usually just a bunch of source() commands. The one that’s running as I type this looks like this:
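
Along these lines (the script names are placeholders, not my real files, and I fabricate one here so the sketch actually runs):

```r
# a 00- script is just a chain of source() calls, e.g.:
#   source("01-pull-gift-data.R",  echo = TRUE)
#   source("02-clean-and-merge.R", echo = TRUE)
#   source("03-build-segments.R",  echo = TRUE)
# a runnable miniature of the same idea:
demo_script <- file.path(tempdir(), "01-pull-gift-data.R")
writeLines('gifts <- data.frame(id = 1:3, amt = c(10, 20, 30))', demo_script)
source(demo_script, echo = TRUE)  # echo = TRUE prints each command as it runs
```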

Using the echo = T argument for source() makes it so you can actually see the commands that are churning along in the background. For a long time, I didn’t add this–it drove me crazy that I could run an individual script with all the commands from RStudio, but not when I ran my base script. Adding echo = T solves that problem.

Also, you can use the beepr package to get notifications when this first script finishes. Doing the following will play the end-of-the-level Super Mario music when your job finally gets done (I’d give $5 for a button/keyboard shortcut to include the audible notification in R Studio):
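
Something like this (assuming beepr is installed; the sleep stands in for your long-running build script):

```r
library(beepr)

Sys.sleep(1)           # pretend this is the slow 00- build script running
beep(sound = "mario")  # Super Mario end-of-level music when the job's done
```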

The next scripts are 01-xxxx, 02-xxxx and so on

The next scripts are named 01-xxxxx.R, 02-xxxxx.R, and so on, with each script doing a single job. I try to break up my scripts in places where I won’t have to re-run the previous script each time I screw something up in the one I’m currently working on. Usually, that means each script starts with something like:
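
For example (paths and names invented; the miniature below fakes the earlier script’s output so it runs):

```r
# pretend 01-xxxxx.R already saved its result to /output:
out <- file.path(tempdir(), "01-gift-data.csv")
write.csv(data.frame(id = 1:3, amt = c(10, 20, 30)), out, row.names = FALSE)

# then 02-xxxxx.R starts by reloading that, instead of re-running script 01:
gifts <- read.csv(out, stringsAsFactors = FALSE)
```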

Occasionally, too, I’ll use an if() statement to make sure the object I really need exists. Usually, this isn’t necessary if I’m using my 00- script to call everything, but very rarely, I’ll have one script call another.

I did this the other day while I was pulling data for our spring phonathon, for example. I wanted to run a predictive model on our NonAlumni Never-Givers. Since I was training the model on last year’s giving, I didn’t expect the data to change, so I didn’t need the model to refresh when I rebuilt the spring data, so in the middle of my segmentation script I had a bit of code like this:
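
It looked something like this (the file and script names are invented):

```r
# only (re)build the model if its scores aren't already on disk
if (!file.exists("output/nonalumni-model-scores.csv")) {
  # the expensive part only runs when its output is missing, e.g.:
  # source("05-build-nonalumni-model.R", echo = TRUE)
  message("would rebuild the model here")
}
```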

Use /data and /output directories

Every project I build also has two subdirectories in it: /data and /output. Any files I need that are specific to that project–usually .csvs of weird data, hand-reviewed notes, stuff that doesn’t live in our main database or warehouse–all get thrown into /data. Anything I dump out, whatever the final output is (or, rather, all the drafts of the final output), gets dumped into /output.

This keeps my main project directory fairly clean: the only thing that should be in there is the .Rproject file itself, any R scripts (which should sort themselves in order, because they’re named 00-xxxxx, 01-xxxxx, 02-xxxxx, etc) and the two directories.

It also means I never have to wonder about what folder something is in: unless it’s a reference file that lives somewhere else (and even then, if it’s small and I don’t expect it to change, I’ll copy it into /data), I can always type read.csv('data/' and then whack Tab and all my files come up. The same goes for writing out data–I know I can always do write.tidy('output/ and I’m good to go.

An important side note there: I use the write.tidy() and read.tidy() functions from my muadc package all the time–they’re wrappers around the standard read/write.csv functions (which in turn are wrappers around the read/write.table functions), but they make my life a LOT easier, if nothing else because I know the arguments are always going to be the same.
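
A guess at the shape of those wrappers (hypothetical reconstructions–the real muadc versions may differ):

```r
# not the actual muadc source, just the general idea
write.tidy <- function(x, file, ...) {
  write.table(x, file, sep = ",", row.names = FALSE, ...)
}
read.tidy <- function(file, ...) {
  read.csv(file, stringsAsFactors = FALSE, ...)
}
```

The win is consistency: the fiddly arguments live in one place, so every script reads and writes files exactly the same way.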

Stay Organized

Organizing my projects this way has really helped–now, when I return to a project 6 months from now (I’m sure I’ll be back in August, pulling phonathon data together again), I’ll be immediately able to see how the scripts relate to each other, what happened where and what’s important.


Comparing Dr Who episodes by decade


A little while ago, io9 rated every Dr. Who episode from best to worst.

I immediately noticed that a bunch of their favorites were from the reboot, despite the fact that there’s a lot more content in the older series. So I decided to pull the data into R to see if I was imagining things. I know this isn’t fundraising-related, but it IS R related, and it was a fun project to work on over lunch.

Here’s a plot of all the episodes with year on the x-axis and rank on the y-axis. Remember that higher rank is worse.


It definitely LOOKS like the new stuff is better, but I’ll bet we can learn more.

Here’s what I found, grouping all the episodes by rounding the year to the nearest decade (ie. 1953 becomes 1950; 1958 becomes 1960):
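
The rounding itself is a one-liner (sketch):

```r
years <- c(1953, 1958)
round(years / 10) * 10
# [1] 1950 1960
```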

Let’s Look at Averages

In terms of mean rank, the reboot was WAY better than everything (and the 1990s stuff had an overall average rank that was much worse):

You can see there that, on average, the post-2000 decades were the best and the 1990s were the worst.

T-test to really see

But this is just averages–seems like we can do better. I mean, there could be an outlier sitting out there yanking those averages down (or up, as it were).

With that in mind, I ran a t-test comparing each decade to the others:
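
Something like this, with pairwise.t.test() (the episode data here is a toy stand-in, not io9’s real rankings):

```r
# toy stand-in: each episode gets a rank and a decade
episodes <- data.frame(
  rank   = c(5, 40, 80, 120, 200, 10, 15, 150, 90, 60, 30, 170),
  decade = factor(rep(c(1960, 1970, 1980, 1990, 2000, 2010), each = 2))
)

# t-tests for every pair of decades, as a matrix of p-values
res <- pairwise.t.test(episodes$rank, episodes$decade, p.adjust.method = "none")
res$p.value < 0.05   # TRUE where two decades genuinely differ
```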

Here I’m asking whether the p-values are less than .05. Or to put it another way: is there less than a 5% chance that the difference we’re seeing in the averages for each decade occurred by chance? TRUE values mark the comparisons where the difference is statistically significant at the 5% level.

Old Stuff Was A Crapshoot

The other thing you can see is that the old stuff was all over the map. You can see this from the plot above, but there’s not a statistically significant difference in ranks between the 60s, 70s, 80s and 90s. I found that surprising–everybody knows that stuff at the very end of the first series was awful, right? Not io9, apparently.

In any case, if you’re introducing your friends to Dr. Who, start with the reboot–that old stuff is like a box of chocolates….


Moving Data in and out of R via the Clipboard


There’s some interesting stuff in this blog post about reading data into R, particularly about using scan() to read data in (apparently you can use it to type data in, one entry per line).

That said, it’s a pain in the butt to depend on scan()–most of the time when I’m pushing data around, I use R’s ability to read data frames from the clipboard.

I tend to use read.csv(file='clipboard') more than readClipboard(), mostly because I always forget about the latter.

One important note: by default, R only uses a small bit of the Windows clipboard to write files out (I have no idea how/if this works at all on Linux and Mac), something like 128KB. That’s not enough for a decent sized data frame/spreadsheet, but it’s pretty easy to bump that limit up.

If you do write.table(foo, file = 'clipboard-4096'), just about anything should fit in there.

I’ve got a function named write.clip() in my muadc R package that does this for me, because I’m a lazy bum and got tired of typing sep = '\t', row.names = F.
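
A guess at what write.clip() boils down to (hypothetical, not the actual muadc source; also note the 'clipboard-' connection is Windows-only):

```r
# hypothetical reconstruction of the wrapper
write.clip <- function(x, size = 4096) {
  write.table(x, file = paste0("clipboard-", size),
              sep = "\t", row.names = FALSE)
}
```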

3D Scatterplot in R


Today, I was doing some analysis of a sub-group of our constituents and was trying to get an idea of how connected these folks felt to our school.

I thought I’d do that by running a summary of their lifetime giving, event attendance and number of development officer contacts:
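
Something like this (the column names are invented, and the data is simulated just so the sketch runs):

```r
set.seed(7)
constituents <- data.frame(
  lifetime_giving = round(rexp(500, 1 / 5000), 2),
  events_attended = rpois(500, 2),
  do_contacts     = rpois(500, 1)
)

summary(constituents)  # min, quartiles, median, mean, max for each column
```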

That shows me the five-number breakdown for each column. Obviously, I’ve got some serious outliers in there–somebody who’s attended over 200 events!

But I can’t tell if the person w/ 200+ events is the same person who’s given a ton of money or not, so I tried running some scatterplots comparing the numbers:

lifetime giving vs event attendance

That’s not awful, but I have to run a bunch of these to compare all three variables–if I could run a legitimately 3D scatterplot in R, I’d be able to see all three variables at once.

Turns out, it’s REALLY easy.

3D scatterplot

click to zoom, rotate and pan
(embedding a live htmlwidget in WordPress made me crazy)
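
Here’s a sketch of how to build one with plotly (I can’t confirm plotly is what the original map used, and the data and column names are invented):

```r
library(plotly)

set.seed(7)
constituents <- data.frame(
  giving   = rexp(200, 1 / 5000),
  events   = rpois(200, 2),
  contacts = rpois(200, 1)
)

plot_ly(constituents, x = ~giving, y = ~events, z = ~contacts,
        type = "scatter3d", mode = "markers")
```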

That gives you a really cool, pannable 3D scatterplot so you can see how each person compares on all three variables at once.

To tell the truth, it’s a bit gimmicky (and what do you do when you’ve got more than three variables? you’re back in the same boat). Still, it’s just so cool and fun that I had to share!