Correlation testing in R

In his new book, Kevin MacDonnell argues that when you’re building a predictive model for something like major giving, you’ll want to prioritize your predictors, since you’ve probably got a bunch of possible predictors for the outcome, any combination of which may or may not be any good.

Kevin recommends running Pearson’s correlation test to see how each predictor (like “has a business phone” or “number of events attended”) correlates with the outcome (e.g. “number of major gifts” or “lifetime giving”).

Doing a correlation test in R is pretty simple. Let’s assume you’ve got your predictors and outcome in a data frame with one row per constituent, something like this:

id hasaddr hasbuphone hascell logOfLifeG
 1    TRUE       TRUE   FALSE   3.218876
 2    TRUE       TRUE   FALSE   5.828946
 3    TRUE       TRUE   FALSE   6.690842
 5    TRUE       TRUE   FALSE   4.382027
 8    TRUE       TRUE   FALSE   5.010635
 9    TRUE      FALSE   FALSE   5.703782

Test Predictors Against Lifetime Giving

You’ll remember that a data frame is just a list of vectors (each column is a vector), so we just need to sapply over that list, comparing each vector to the outcome (in this case our outcome is df$logOfLifeG, i.e. the log of lifetime giving).

You’ll note that I’m excluding the first and last columns–that’s where my constituent ids and outcome are.

Also, at the end of the function, I grab the 4th item in the list–cor.test() returns a list, and the 4th item is the actual data about how correlated the two variables are:
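Here’s a rough sketch of what that sapply call might look like (assuming your data frame is named df and your outcome column is logOfLifeG, like the sample above):

# a sketch: run cor.test for every predictor column and keep the estimate (item 4)
testresults <- sapply(df[2:(ncol(df) - 1)],
                      function(x) cor.test(as.numeric(x), df$logOfLifeG)[[4]])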

Ok, those are the correlation numbers for each predictor, but right now testresults is just a vector of values with a name for each entry.

Make It Readable

A vector is fine if you’re just going to View() it in RStudio, but it’s a pain to read. So let’s turn it into a data frame where the first column is the column names from the original data frame:
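Something like this should do it (a sketch, using the df and testresults objects from above):

# build a readable table: predictor names in one column, correlations in the other
testresults.df <- data.frame(predictors  = names(df)[2:(ncol(df) - 1)],
                             correlation = testresults)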

                        predictors correlation
hasaddr.estimate.cor       hasaddr   0.0249737
hasbuphone.estimate.cor hasbuphone   0.2008512
hascell.estimate.cor       hascell   0.0312318

Alrighty, now we’re getting somewhere (note that the row names are still there on the left: annoying, but not really worth dealing with). With a small data frame like this, you don’t really need to do anything else. But if you’ve got 20 or more predictors, you’re going to want to sort it.

Sort the Table

Since negative correlation (i.e. your outcome goes down when the predictor goes up) is just as important as positive correlation, we’ll create a new column with the absolute value of the correlation so we’ve got something to sort by accurately. Note that I’m using dplyr here:
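Here’s a sketch of that step (assuming the testresults.df data frame built above):

library(dplyr)

# sort by the absolute value of the correlation, biggest first, then drop the helper column
testresults.df <- testresults.df %>%
  mutate(abs.cor = abs(correlation)) %>%
  arrange(desc(abs.cor)) %>%
  select(-abs.cor)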

                        predictors correlation
hasbuphone.estimate.cor hasbuphone   0.2008512
hascell.estimate.cor       hascell   0.0312318
hasaddr.estimate.cor       hasaddr   0.0249737

And there’s your correlation table! Now you can start at the top of the table, building your model with your highly correlated predictors.

Excel vs. R (or Why You Should Use R Scripts)

I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.

The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then make a TON of extra formula columns, pivot tables, and various and sundry edits until we finally get to the product we need, which is often substantially different from the original.

Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Using an R script to take a data set from its original form to the new form solves most of those problems. As the author puts it:

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.

What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).
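A bare-bones sketch of that workflow might look like this (the file and column names here are made up):

library(dplyr)

# read the original export exactly as it arrived
raw <- read.csv('constituents_raw.csv', stringsAsFactors = FALSE)

# every change is an explicit, repeatable step
clean <- raw %>%
  filter(!is.na(id)) %>%
  mutate(logOfLifeG = log(lifetimegiving + 1))

# write the finished product back out
write.csv(clean, 'constituents_clean.csv', row.names = FALSE)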

The biggest advantage is that you can look at your script, line by line, and see what changes you’re making and in what order. So when you have a problem, you can find it a lot easier and fix it without having to redo your whole spreadsheet.

Scripts in RStudio

If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):

  1. Hit File -> New -> R Script. That’ll open a new script in the Source pane.
  2. Type any number of R commands in the new script.
  3. Save the file as foo.R
  4. Hit the “Source file” button at the top of the Source pane. This runs source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
  5. You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.
Here, I’m opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I’ve built in the Files pane in the bottom right.

That’s all there is to it! The only tricky thing to note about scripts is that you’ll want to put require('dplyr') (or whatever packages the script uses) at the top. That way, when you run the script next Tuesday right after firing up R, the packages it needs get loaded first.
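In other words, the first line or two of foo.R might look something like this:

# foo.R -- load the packages the rest of the script depends on
require('dplyr')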

One more quick trick: I don’t recommend using the “source on save” button in RStudio. This runs the script every time you save it. While it seems handy, more than once I’ve turned a split-second save on a minor edit into a two-minute, CPU-churning waste of time because my script had some serious analysis in it. If you’re smart enough not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t say I didn’t warn you.

Using Cut to Categorize Wealth Capacity

Occasionally, I want to categorize wealth capacity into different bins. For example, capacity (or giving, or whatever numeric variable) should fit into bins from 1–12 (we’ve always used WealthEngine’s categories here in the office, despite the fact that it’s incomprehensibly a 12-point scale instead of 10, and the lowest number is best instead of worst).

In Excel, I’d turn a number into a categorical variable (or factor, as R calls them) by using a lookup table and a bunch of nested IF statements. You can do the same with ifelse() in R if you want, but I’d recommend using cut() instead.

“Cut. It. Into bins in order to better describe wealth capacity.”
Somehow, that just doesn’t have the same ring.

To start, you’ll want something like this:
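(The dollar amounts in this sketch are made up, so substitute your own break points, and I’m assuming your capacity numbers live in df$capacity.)

# breaks are the minimum for each bin, from $0 up to infinity
breaks <- c(0, 5000, 10000, 25000, 50000, 100000, 250000,
            500000, 1000000, 2500000, 5000000, 10000000, Inf)

# WealthEngine-style labels: 1 is the best bin, 12 the worst
labels <- 12:1

df$capacity.rating <- cut(df$capacity, breaks = breaks,
                          labels = labels, right = FALSE)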

That should get you pretty close to where you want to go. A few notes about what’s going on here.

  1. You’re going to need to specify breaks, i.e. the minimum amount for each bin. If we’re talking about giving or wealth capacity, that means 0 and Inf should be the bottom and top of your breaks (the minimum of the worst bin is a capacity of $0; without setting the top break at infinity, R won’t know what bin to put your very best donors in).
  2. You’ll need one fewer label than break points–nobody has a wealth capacity of less than zero (your bottom break) or greater than infinity (the top break).
  3. You’re also going to want to specify right=F. This tells R that you’re giving it the minimum for each bin rather than the maximum (i.e. each break is the “left” side of its bin instead of the right).

To sum up: if you’re making a scale with 12 bins, you need 12 labels and 13 break points (with 0 and Inf on either end of that list of breaks).

Using a lookup table to specify capacity ratings

If you’ve already got your ratings in a table (I’d be willing to bet you’ve got something in a spreadsheet somewhere), you can read that into a data frame (named ratings below) and use it like this:
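Here’s a sketch, assuming ratings has a minimum column and a rating column:

# make sure the minimums are in ascending order (see the note below)
ratings <- ratings[order(ratings$minimum), ]

df$capacity.rating <- cut(df$capacity,
                          breaks = c(ratings$minimum, Inf),
                          labels = ratings$rating,
                          right = FALSE)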

The only trick here is that you need to make sure you sort the table so your minimums are in ASCENDING order, so your worst category is at the top of the table and the best, wealthiest, category is at the bottom of the lookup table.

Using cut() for other categorizations

That’s all there is to using cut() to categorize wealth capacity ratings into bins, but you can probably see how to apply this to other numeric columns (to lifetime giving, for example, to come up with rough categorical outcome variables).

I’d probably end up knocking together the lookup table in Excel and then using read.table(file="clipboard", sep="\t") to read it into R from the clipboard.

From there, it’s just a matter of running cut() on the data you want to transform!

R doesn’t care about leading zeros

Constituents in our database each have an ID number. Every ID has 8 digits, regardless of how big the number is–we pad the id out with leading zeroes (an id of 12345 would go into the database as 00012345).

Ids stay numeric when I load them into R to make joins easier (R barfs when trying to do joins on factors, i.e. categorical variables).

The leading zeros get removed once they hit R, but I realized today, via a bit of careless copy and paste, that 00012345 evaluates to the same thing as 12345.

Is he really trying to paste from Banner into a csv?

Or to put it another way:

> 00012345 == 12345
[1] TRUE

Handy, when it comes to copying and pasting ids from who knows where into R.

Categories: data manipulation

Re-normalizing data (in the database sense of normalization)

I can’t always hit our database directly, so I often end up working with data that looks like this:

ID Activities
1
2 DEC~FILE~STD~TTR
3
4 AQD~BEA~FILE~HOSP~MCL~MMA~PRP~PSI~WTENN
5 DEC~FILE~PSP~SCL~STD
6 FILE~NAL
7 FILE~HOME~MMA~SACO
8 FILE~MFORT

If I want to categorize those activities, for example, to say “Const #1 was in a Greek organization” because I think maybe that will be a predictor for lifetime giving, I need a good way to normalize the data in R.

I want it to look like it does in the database, something like:

ID Activities
2 DEC
2 FILE
2 STD
2 TTR
4 BEA
4 FILE
4 HOSP

and so on.

Re-normalizing data the R way

Doing this in R isn’t hard, but it took me several tries and scads of Googling to find the R way to do it:

df <- read.csv('that-annoying-list.csv', stringsAsFactors = FALSE)
activities <- strsplit(df$activities, "~")
normalizedlist <- data.frame(user = rep(df$id, sapply(activities, length)),
                             item = unlist(activities))

Here's what's going on:

  1. We read the data into the df object using read.csv().
  2. strsplit splits the list of activities by the delimiter "~". This returns a big nested list--the top-level entries are by person, the second level of entries are each activity.
  3. We build a new data frame that starts with id, repeated as many times as each list item is long. The next column is just all the values of the list dumped straight out.

I originally tried doing this job with ddply, breaking the data frame apart by id and using strsplit and then reassembling it all. It worked but was a disaster (the above version works in seconds--my original took about 20 minutes). This is really the R way to normalize a list of codes.

Now that I've got the codes broken apart, I can use dplyr's join functions to categorize each code and figure out what constituents were in each category.
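For example, something like this (the lookup table here is made up for illustration; yours will have your institution's actual codes and categories):

library(dplyr)

# a hypothetical lookup table mapping activity codes to categories
codecats <- data.frame(item = c('DEC', 'STD', 'TTR'),
                       category = c('category A', 'category B', 'category C'),
                       stringsAsFactors = FALSE)

categorized <- left_join(normalizedlist, codecats, by = 'item')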

PS. "Normalizing" a list, in statistics means something different than it does in the database world--here's we're just talking about normalization in the Boyce-Codd sense of the word.

Getting started – use dplyr

If you’ve already installed R and RStudio, there’s one more thing you’re going to need before you really get started using R for predictive modeling for fundraising: dplyr.

dplyr is an R package (which is to say “add-on code”) that makes using R for basic data manipulation substantially less painful.

It’s dangerous to go alone. Take this. [offers dplyr].

To get dplyr, fire up R and then type

install.packages('dplyr')

When that finishes, the package will be installed but not loaded, so do

library(dplyr)

Write like you think

dplyr provides two major advantages. The first is that it allows you to write code the way you think, namely starting from a set of data and working towards the final result.

There’s a host of articles online about how great this is, so I’m not going to spell it out for you–suffice it to say, it makes thinking through and solving problems a lot easier.
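For a taste of what that looks like (the data frame and columns here are hypothetical):

library(dplyr)

# start from the data and work step by step toward the answer
df %>%
  filter(hasbuphone) %>%           # just constituents with a business phone
  arrange(desc(logOfLifeG)) %>%    # biggest lifetime givers first
  head(10)                         # top ten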

A grammar of data manipulation

The other thing that dplyr does very well is providing a sort of grammar of data manipulation, specifically a set of verbs that you typically use to solve most common data problems, stuff like sorting, rearranging and renaming columns, adding new calculated columns, etc.

The author of dplyr, the inestimable Hadley Wickham, has a great tutorial on these verbs.

Learning these 5 functions (plus just a couple more like ifelse and grepl, which I’ll cover in later posts) will solve the overwhelming majority of the tedious sorts of data manipulation tasks you’ll find yourself doing every time you start in a data job.
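To give you the flavor, here’s a quick sketch of those verbs on a made-up gifts data frame:

library(dplyr)

gifts %>%
  filter(amount > 0) %>%                  # keep only the rows you want
  select(id, fund, amount) %>%            # keep only the columns you want
  mutate(log.amount = log(amount)) %>%    # add a calculated column
  arrange(desc(log.amount))               # sort

gifts %>%
  summarise(total.giving = sum(amount))   # collapse everything to one summary row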

Categories: Getting started

Getting started – Use RStudio

There’s scads of articles online about getting R installed on your machine, so I’m not going to spend a lot of time talking about that.

What I will say is that you MUST use RStudio.

A look at RStudio.

It’s an IDE for R, which is to say, it’s a UI that wraps around R and makes everything you do in R LOTS easier. Literally everything.

Some advantages include:

  • See a list of the objects you’re currently working with.
  • See a list of packages you’ve got installed and load one with a single click of the mouse
  • Edit scripts and see the results at the same time.
  • Have plots show up in the same window that you’re working in.
  • See files in the current directory
  • Read help files while typing the commands at the same time
  • The View() function, which gives you a well-formatted, legible view of your data (this feature alone is worth the price of admission, which, come to think of it, is the next point)
  • Free (as in beer AND as in freedom)

Disadvantages compared to using R by itself include:

  • Doesn’t make you feel like it’s 1986

I’d list the lack of remote access here, but setting up RStudio to be accessible via the web is pretty easy, too.

In short, unless you’re running a really old computer and are short on resources, there is approximately zero reason to use R any other way.

Go install it now. Trust me, it’s worth it.

Categories: Getting started

Kicking this off

Alrighty, here’s the typical first post for a blog:

I’m working on learning predictive modeling in R for work (I work for the development office at Millikin University), so it seems like it might be worthwhile to start documenting my progress and sharing it on my blog.

So we’ll see how this goes…

Categories: Getting started