Using Cut to Categorize Wealth Capacity

No Comments

Occasionally, I want to categorize wealth capacity into different bins. For example, capacity (or giving, or whatever numeric variable) should fit into bins from 1 – 12 (we’ve always used WealthEngine’s categories here in the office, despite the fact that it’s incomprehensibly a 12 point scale instead of 10 and the lowest number is best, instead of worst).

In Excel, I’d turn a number into a categorical variable (or factor, as R calls them) by using a lookup table and a bunch of nested IF statements. You can do the same with ifelse() in R if you want, but I’d recommend using cut() instead.

“Cut. It. Into bins in order to better describe wealth capacity.”
Somehow, that just doesn’t have the same ring.

To start, you’ll want something like this:

That should get you pretty close to where you want to go. A few notes about what’s going on here.

  1. You’re going to need specify breaks, ie. the minimum amount for each bin. If we’re talking about giving or wealth capacity, this means 0 and Inf should be the top and bottom of your breaks (the minimum of the worst bin is a capacity of $0; without setting the minimum at the top at infinity, R won’t know what bin to put your very best donors in).
  2. You’ll need one fewer labels than cut points–nobody has a wealth capacity of less than zero (your bottom break) or greater than infinity (the top break).
  3. You’re also going to want to specify right=F. This tells R that you’re giving it the minimum for each bin rather than the maximum (ie. the “left” side instead of the right).

To sum up: , if you’re making a scale with 12 bins, you need 12 labels and 13 break points (with 0 and Inf on either end of that list of breaks).

Using a lookup table to specify capacity ratings

If you’ve already got your ratings in a table (I’d be willing to bet you’ve got something in a spreadsheet somewhere), you can read that into a data frame (named ratings below) and use it like this:

The only trick here is that you need to make sure you sort the table so your minimums are in ASCENDING order, so your worst category is at the top of the table and the best, wealthiest, category is at the bottom of the lookup table.

Using cut() for other categorizations

That’s all there is to using cut to categorize wealth capacity ratings into bins, but you can probably see how to apply this to other numeric columns ( to lifetime giving to come up with rough categorical outcome variables, for example).

I’d probably end up knocking together the lookup table in Excel and then using read.table(file="clipboard", sep="\t") to read it into R from the clipboard.

From there, it’s just a matter of running cut() on the data you want to transform!

R doesn’t care about leading zeros

No Comments

Constituents in our database each have an I’d number. Every number has 8 digits, regardless of how big the number is–we pad the id out with leading zeroes (an id of 12345 would go into the database as 00012345).

Ids stay numeric when I load them into R to make joins easier (R barfs when trying to do joins on factors, ie. categorical variables).

The leading zeros get removed once they hit R, but I realized today, via a bit of careless copy and paste, that 00012345 evaluates to the same thing as 12345.

Is he really trying to paste from Banner into a csv?

Or to put it another way:

> 0012345==12345
TRUE

Handy, when it comes to copying and pasting ids from who knows where into R.

Categories: data manipulation Tags: Tags: , , , ,

Re-normalizing data (in the database sense of normalization)

No Comments

I can’t always hit our database directly, so I often end up working with data that looks like this:

ID Activities
1
2 DEC~FILE~STD~TTR
3
4 AQD~BEA~FILE~HOSP~MCL~MMA~PRP~PSI~WTENN
5 DEC~FILE~PSP~SCL~STD
6 FILE~NAL
7 FILE~HOME~MMA~SACO
8 FILE~MFORT

If I want to categorize those activities, for example, to say “Const #1 was in a Greek organization” because I think maybe that will be a predictor for lifetime giving, I need a good way to normalize the data in R.

I want it to look like it does in the database, something like:

ID Activities
2 DEC
2 FILE
2 STD
2 TTR
4 BEA
4 FILE
4 HOSP

and so on.

Re-normalizing data the R way

Doing this R isn’t hard, but it took me several tries and scads of Googling to find the R wa to do it:

df <- read.csv('that-annoying-list.csv') activities <- strsplit(df$activities, "~") normalizedlist <- data.frame(user = rep(df$id, sapply(activities, length)), item = unlist(activities))

Here's what's going on:

  1. We read the data into the df object using .
  2. strsplit splits the list of activities by the delimiter "~". This returns a big nested list--the top-level entries are by person, the second level of entries are each activity.
  3. We build a new data frame that starts with id, repeated as many times as each list item is long. The next column is just all the values of the list dumped straight out.

I originally tried doing this job with ddply, breaking the data frame apart by id and using strsplit and then reassembling it all. It worked but was a disaster (the above version works in seconds--my original took about 20 minutes). This is really the R way to normalize a list of codes.

Now that I've got the codes broken apart, I can use dplyr's join functions to categorize each code and figure out what constituents were in each category.

PS. "Normalizing" a list, in statistics means something different than it does in the database world--here's we're just talking about normalization in the Boyce-Codd sense of the word.

Getting started – use dplyr

No Comments

If you’ve already installed R and RStudio, there’s one more thing you’re going to need before you really get started using R for predictive modeling for fundraising: dplyr.

dplyr is an R package (which is to say “add-on code”) that makes using R for basic data manipulation substantially less painful.

It’s dangerous to go alone. Take this. [offers dplyr].

To get dplyr, fire up R and then type

install.packages('dplyr')

When that finishes, the package will be installed but not loaded, so do

library(dplyr)

Write like you think

dplyr provides two major advantages. The first is that it allows you to write code the way you think, namely starting from a set of data and working towards the final result.

There’s a host of articles online about how great this is, so I’m not going to spell it out for you–suffice it to say, it makes thinking through and solving problems a lot easier.

A grammar of data manipulation

The other thing that dplyr does very well is providing a sort of grammar of data manipulation, specifically a set of verbs that you typically use to solve most common data problems, stuff like sorting, rearranging and renaming columns, adding new calculated columns, etc.

The author of dplyr, the inestimable Hadley Wickham, has a great tutorial on these verbs.

Learning these 5 functions (plus just a couple more like ifelse and grepl, which I’ll cover in later posts) will solve the overwhelming majority of the tedious sorts of data manipulation tasks you’ll find yourself doing every time you start in a data job.

Categories: Getting started Tags: Tags: , ,

Getting started – Use RStudio

No Comments

There’s scads of articles online about getting R installed on your machine, so I’m not going to spend a lot time talking about that.

What I will say is that you MUST use RStudio.

a look at Rstudio.

It’s an IDE for R, which is to say, it’s a UI that wraps around R and makes everything you do in R LOTS easier. Literally everything.

Some advantages include:

  • See a list of the objects you’re currently working with.
  • See a list of packages you’ve got installed and load one with a single click of the mouse
  • Edit scripts and see the results at the same time.
  • Have plots show up in the same window that you’re working in.
  • See files in the current directory
  • Read help files while typing the commands at the same time
  • The View() function, which gives you,a well formatted, legible view of your data (this feature alone is worth the price of admission, which come to think of it is the next point)
  • Free (as in beer AND as in freedom)

Disadvantages compared to using R by itself include:

  • Doesn’t make you feel like it’s 1986

I’d list the lack of remote access here, but setting up rStudio to be accessible via the web is pretty easy, too.

in short, unless you’re running a really old computer and are short on resources, there is approximately zero reason┬áto use R any other way.

Go, install it now. Trust me , it’s worth it.

Categories: Getting started Tags: Tags: ,

Kicking this off

No Comments

Alrighty, here’s the typical first post for a blog:

I’m working in learning predictive modeling in R for work – I work for the development office at Millikin University. So it seems like it might be worthwhile to start documenting my progress and sharing them in my blog.

So so we’ll see how this goes…

Categories: Getting started