Creating Yes/No Predictors in R


When you’re getting ready to create a predictive model, you spend a LOT of your time trying to whip the data into shape to be useful.

One of the main things I find myself doing is transforming text strings into Yes/No variables. Things like addresses and phone numbers aren't particularly useful data for building a model in and of themselves; there's too much variation between everyone's street addresses for them to mean much (setting aside, for now, the interesting idea of using Forbes Wealthiest Zip Codes data or some such).

On the other hand, transforming a mailing address or email address into a predictor that says "Yes, this person has an address" or "No, we don't have an email for this person" can be really useful, because it minimizes that variation.

Using ifelse to create binary predictors

To do that, we can use ifelse. Here's what we'll do. First, let's mock up a little sample data frame:
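Something like this will do. The values here are invented; all that matters is that some of the contact info is blank, NA, or junk:

df <- data.frame(id = 1:5,
                 email = c("anne@example.com", "", NA, "dinah@example.com", ""),
                 busphone = c("555-0100", "()-", "555-0102", NA, "555-0104"),
                 stringsAsFactors = FALSE)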

Next, we'll transform the data (using dplyr, of course), adding a new column which is our transformed variable:
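Something along these lines, flagging the email column (the column and label names are whatever you like):

library(dplyr)

# Flag anyone whose email is missing or blank
df <- df %>%
  mutate(hasemail = ifelse(is.na(email) | email == "", "no email", "has email"))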

There you have it: your new binary predictor variable.

Binary transformation as a function

If you’re doing a lot of these sorts of transformations, you’re going to want to use a function so you don’t have to type the same thing over and over.

Here’s the function I wrote to do this (note that the comments above the function are in roxygen style because I’m planning to eventually turn this into a package):
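It looks something like this (the name make_binary, its arguments, and the default labels are stand-ins; the moving parts match the notes below):

#' Turn a text column into a Yes/No predictor
#'
#' @param x a data frame whose first column is the constituent id
#' @param i the index of the column to transform
#' @param yes the label to use when the column contains data
#' @param no the label to use when it doesn't
#' @return a data frame with the id column and the new factor column
make_binary <- function(x, i, yes = "Y", no = "N") {
  # Treat NA, empty strings, and "()-" placeholders as missing
  flag <- ifelse(is.na(x[, i]) | x[, i] == "" | grepl("\\(\\)-", x[, i]),
                 no, yes)
  # Keep only the id plus the new predictor, with "no" as the first level
  out <- data.frame(id = x[, 1], factor(flag, levels = c(no, yes)))
  names(out)[2] <- paste0("has", names(x)[i])
  out
}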

You’ll notice a few changes from what we did above:

  1. The function accepts a data frame and asks for a column index. This is so you only have to get your data into one big flat file and can reuse this function on columns 2, 5, 23, 75 or whatever.
  2. There's an extra condition in the ifelse: grepl("\\(\\)-", x[,i]). My data has a bunch of missing phone numbers that look like this: "()-". That's just as much empty data as an NA, so I wrote a regular expression to find those and treat them as empty.
  3. I ordered the levels of the factor, because I want to assume that “no address” is the default.
  4. The output of the function is only id and the newly transformed column. I’m assuming you want to build all your predictors and then join all/some/any of them up into one big data file to feed to the linear regression model.
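With the sketch above, building a predictor from, say, the third column of your flat file and setting it aside for a later join looks like this:

hasbusphone <- make_binary(df, 3, yes = "has phone", no = "no phone")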

Correlation testing in R


In his new book, Kevin MacDonell argues that when you're building a predictive model for something like major giving, you're gonna want to prioritize your predictors, assuming you've got a bunch of possible predictors for the outcome, any combination of which may or may not be any good.

Kevin recommends doing Pearson’s correlation test to see how each predictor like “has a business phone” and “number of events attended” correlates to the outcome (ie. “number of major gifts” or “lifetime giving”).

Doing a correlation test in R is pretty simple. Let’s assume you’ve got your predictors and outcome in a data frame with one row per constituent, something like this:

id hasaddr hasbuphone hascell logOfLifeG
 1    TRUE       TRUE   FALSE   3.218876
 2    TRUE       TRUE   FALSE   5.828946
 3    TRUE       TRUE   FALSE   6.690842
 5    TRUE       TRUE   FALSE   4.382027
 8    TRUE       TRUE    TRUE   5.010635
 9    TRUE      FALSE   FALSE   5.703782

Test Predictors Against Lifetime Giving

You'll remember that a data frame is just a list of vectors (each column is a vector), so we just need to sapply over that list, comparing each vector to the outcome. In this case, our outcome is df$logOfLifeG, ie. the log of lifetime giving.

You'll note that I'm excluding the first and last columns; that's where my constituent ids and outcome are.

Also, at the end of the function, I grab the 4th item in the list: cor.test() returns a list, and the 4th item (the estimate) is the actual number telling you how correlated the two vectors are:
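Something like this (note the as.numeric(): cor.test() wants numeric rather than logical input, so the TRUE/FALSE columns get coerced to 0/1):

# Run cor.test() between each predictor column and the outcome,
# keeping just the 4th element (the correlation estimate)
testresults <- sapply(df[, 2:(ncol(df) - 1)],
                      function(x) cor.test(as.numeric(x), df$logOfLifeG)[4])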

Ok, those are the correlation numbers for each predictor, but right now testresults is just a vector of values with a name for each entry.

Make It Readable

A vector is fine if you're just going to View() it in RStudio, but it's a pain to read. So let's turn it into a data frame where the first column holds the predictor names from the original data frame:
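A sketch of that step; unlist() flattens what sapply() handed back so data.frame() ends up with one correlation per row:

testresults <- data.frame(predictors = names(df)[2:(ncol(df) - 1)],
                          correlation = unlist(testresults))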

                        predictors correlation
hasaddr.estimate.cor       hasaddr   0.0249737
hasbuphone.estimate.cor hasbuphone   0.2008512
hascell.estimate.cor       hascell   0.0312318

Alrighty, now we’re getting somewhere (note that the row names are still there on the left: annoying, but not really worth dealing with). With a small data frame like this, you don’t really need to do anything else. But if you’ve got 20 or more predictors, you’re going to want to sort it.

Sort the Table

Since negative correlation (ie. your outcome goes down when the predictor goes up) is just as important as positive correlation, we'll create a new column that's the absolute value of the correlation so we've got something to sort by accurately. Note that I'm using dplyr here:
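Something like this (depending on your dplyr version, the row names may get dropped along the way):

library(dplyr)

testresults <- testresults %>%
  mutate(abscor = abs(correlation)) %>%
  arrange(desc(abscor))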

                        predictors correlation
hasbuphone.estimate.cor hasbuphone   0.2008512
hascell.estimate.cor       hascell   0.0312318
hasaddr.estimate.cor       hasaddr   0.0249737

And there’s your correlation table! Now you can start at the top of the table, building your model with your highly correlated predictors.

Using Cut to Categorize Wealth Capacity


Occasionally, I want to categorize wealth capacity into different bins. For example, capacity (or giving, or whatever numeric variable) should fit into bins from 1 to 12 (we've always used WealthEngine's categories here in the office, despite the fact that it's, incomprehensibly, a 12-point scale instead of 10, with the lowest number being best instead of worst).

In Excel, I’d turn a number into a categorical variable (or factor, as R calls them) by using a lookup table and a bunch of nested IF statements. You can do the same with ifelse() in R if you want, but I’d recommend using cut() instead.

“Cut. It. Into bins in order to better describe wealth capacity.”
Somehow, that just doesn’t have the same ring.

To start, you’ll want something like this:
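A sketch, assuming your data frame df has a numeric capacity column; the dollar cut-offs here are invented, not WealthEngine's actual break points:

# 13 break points define 12 bins; 0 and Inf bracket the ends
breaks <- c(0, 2500, 5000, 10000, 25000, 50000, 100000,
            250000, 500000, 1000000, 2500000, 5000000, Inf)

# 12 labels, one per bin; 12 is the lowest capacity and 1 the best
df$capacityrating <- cut(df$capacity, breaks = breaks,
                         labels = 12:1, right = FALSE)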

That should get you pretty close to where you want to go. A few notes about what’s going on here.

  1. You're going to need to specify breaks, ie. the minimum amount for each bin. If we're talking about giving or wealth capacity, that means 0 and Inf should be the bottom and top of your breaks (the minimum of the worst bin is a capacity of $0; without setting the top at infinity, R won't know which bin to put your very best donors in).
  2. You'll need one fewer label than break points: 13 breaks define 12 bins, and nobody has a wealth capacity of less than zero (your bottom break) or greater than infinity (the top break).
  3. You’re also going to want to specify right=F. This tells R that you’re giving it the minimum for each bin rather than the maximum (ie. the “left” side instead of the right).

To sum up: if you're making a scale with 12 bins, you need 12 labels and 13 break points (with 0 and Inf on either end of that list of breaks).

Using a lookup table to specify capacity ratings

If you’ve already got your ratings in a table (I’d be willing to bet you’ve got something in a spreadsheet somewhere), you can read that into a data frame (named ratings below) and use it like this:
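A sketch, assuming the ratings data frame has a column of bin minimums called minimum (sorted ascending) and a column of labels called rating:

# Append Inf so your very best donors have a top end to their bin
df$capacityrating <- cut(df$capacity,
                         breaks = c(ratings$minimum, Inf),
                         labels = ratings$rating,
                         right = FALSE)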

The only trick here is that you need to sort the table so your minimums are in ASCENDING order: the worst category at the top of the lookup table and the best, wealthiest category at the bottom.

Using cut() for other categorizations

That's all there is to using cut() to categorize wealth capacity ratings into bins, but you can probably see how to apply this to other numeric columns (to lifetime giving, for example, to come up with rough categorical outcome variables).

I’d probably end up knocking together the lookup table in Excel and then using read.table(file="clipboard", sep="\t") to read it into R from the clipboard.
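For example (header = TRUE assumes the first row you copied holds the column names):

ratings <- read.table(file = "clipboard", sep = "\t", header = TRUE)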

From there, it’s just a matter of running cut() on the data you want to transform!

Re-normalizing data (in the database sense of normalization)


I can’t always hit our database directly, so I often end up working with data that looks like this:

ID Activities
1
2 DEC~FILE~STD~TTR
3
4 AQD~BEA~FILE~HOSP~MCL~MMA~PRP~PSI~WTENN
5 DEC~FILE~PSP~SCL~STD
6 FILE~NAL
7 FILE~HOME~MMA~SACO
8 FILE~MFORT

If I want to categorize those activities, for example, to say “Const #1 was in a Greek organization” because I think maybe that will be a predictor for lifetime giving, I need a good way to normalize the data in R.

I want it to look like it does in the database, something like:

ID Activities
2 DEC
2 FILE
2 STD
2 TTR
4 AQD
4 BEA
4 FILE
4 HOSP

and so on.

Re-normalizing data the R way

Doing this in R isn't hard, but it took me several tries and scads of Googling to find the R way to do it:

df <- read.csv('that-annoying-list.csv', stringsAsFactors = FALSE)

# Split each tilde-delimited string into a character vector of activity codes
activities <- strsplit(df$activities, "~")

# Repeat each id once per activity code, then line the codes up alongside them
normalizedlist <- data.frame(user = rep(df$id, sapply(activities, length)),
                             item = unlist(activities))

Here's what's going on:

  1. We read the data into the df object using read.csv().
  2. strsplit splits the list of activities by the delimiter "~". This returns a big nested list: the top-level entries are one per person, and the second-level entries are the individual activities.
  3. We build a new data frame that starts with each id, repeated as many times as that person has activities. The next column is just all the values of the nested list dumped straight out.

I originally tried doing this job with ddply, breaking the data frame apart by id and using strsplit and then reassembling it all. It worked but was a disaster (the above version works in seconds--my original took about 20 minutes). This is really the R way to normalize a list of codes.

Now that I've got the codes broken apart, I can use dplyr's join functions to categorize each code and figure out what constituents were in each category.
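A sketch of that last step; the lookup table here is entirely made up, so swap in your own codes and categories:

library(dplyr)

# Hypothetical mapping of activity codes to broader categories
lookup <- data.frame(item = c("CODE1", "CODE2", "CODE3"),
                     category = c("Greek", "Athletics", "Service"),
                     stringsAsFactors = FALSE)

categorized <- left_join(normalizedlist, lookup, by = "item")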

PS. "Normalizing" a list means something different in statistics than it does in the database world; here we're just talking about normalization in the Boyce-Codd sense of the word.