Data frames are lists


I’ve got another more expansive post in the hopper that I need to edit and post, but here’s a quick one to reiterate a point that you probably already know, but is really important:

Data Frames are lists

This point, that data frames are lists where each column of the data frame is an item (usually a named vector) within the list, is really really useful. Here’s some things this tidbit lets you do:

  1. Grab a single column as a vector with double brackets, a la constituents[[8]]. If you use single brackets, you'll get a data frame with just that column (because you're asking for a list containing the 8th item, not the item itself).
  2. Run lapply and sapply functions over each column, ie. you can easily loop through a data frame, performing the same function on each column.

    You might think this isn’t a big deal. “Normally the columns of my data frames are all different types of data, first name here, lifetime giving there,” you might say.

    But trust me, it won’t be long til you’ve got a data frame and you think “oh, I’ll just write a little for loop to loop through each column and…” That’s what lapply and/or sapply are for. Knowing that the scaffolding to do that task is already built into one handy function will save your bacon.

  3. Converting a list of vectors to a data frame is as simple as saying as.data.frame(myOriginalList). And while lists can be a bit fiddly, with this one, you know what you’re going to get.
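
Here's a quick sketch of all three tricks, using R's built-in mtcars data so it's self-contained (your constituents data frame works the same way):

    # 1. grab a single column as a vector with double brackets
    mtcars[[1]]        # the mpg column as a plain numeric vector
    mtcars[1]          # single brackets: a one-column data frame instead

    # 2. loop over every column with sapply/lapply, no for loop required
    sapply(mtcars, mean)      # the mean of each column, as a named vector
    lapply(mtcars, summary)   # a full summary of each column, as a list

    # 3. a list of equal-length vectors converts straight to a data frame
    mylist <- list(id = 1:3, giving = c(100, 250, 0))
    as.data.frame(mylist)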

In short, just knowing that a data frame is a special kind of list makes it a lot easier to handle data frames when the need for crazy sorts of data manipulation comes up (and trust me, it will).


Simple Parallelization


Recently, I’ve been working on some machine learning projects at work. A lot of these require serious computing time–my work machine isn’t super beefy, but it’s no slouch and still some of these models take a couple hours to build.

[Image: the "ain't nobody got time for that" meme]

Without really thinking about it, I assumed I was using all my processing power. And then, I looked at my Task Manager. And it turns out, of the 4 cores I’ve got, R was only using one!

There’s 3 more boxes there, buddy–use ’em!

Getting multi-core support in R

A bit of research revealed that R is really bad at supporting multiple cores. It's baffling to me, but that's the way it is. Apparently, there are various solutions to this, but they involve installing/using packages and then making sure your processes are parallelizable. Sounds like a recipe for disaster if you ask me–I screw enough up on my own, I don't need to add a layer of complexity on top of that.

An alternative, easier solution is to use Revolution Analytics' distribution of R, Open R, which comes with support for multiple cores out of the box.

Just download and install it, and when you fire up RStudio the next time, it'll find it and (probably) start using it (if not, you can go into Global Options in RStudio and specify that you want to use that version of R).

Now my packages won’t update

Open R seems to run just fine, but a couple weeks in, I realized I had a problem–my packages weren’t updating (and I really wanted the newest version of dplyr!).

Turns out, Open R is set to not update packages by default. The idea is that they snapshot all packages each year so things don't get updated and break halfway in.

This doesn't really bother me–I'm rarely spending more than a couple weeks on a single project, nor do I have any massive dependencies that would break between upgrades, so I followed the instructions in their FAQ to set my package repo back to CRAN (basically, you just need to edit your Rprofile.site).
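
The gist of the change is a single line in Rprofile.site, something like this (check the FAQ for the exact wording they recommend):

    # in Rprofile.site: point the default package repo back at CRAN
    # instead of the fixed snapshot repo
    options(repos = c(CRAN = "https://cran.r-project.org"))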

And sure enough, I was back in business with dplyr 0.4 (and a host of newly updated packages)!


New library for viewing dataframes: javascript datatables


Here’s something that might turn out to be really cool:

RStudio has released a package to integrate R data with the javascript datatables library.

Datatables, not to be confused with the R package by the same name, is a great way to easily make really usable tables online–you get sorting, filtering, pagination all for free without having to write a bunch of nested table/tr/td tags.

Installing this library lets you quickly turn your dataframe into a sortable, filterable table you can really play with. Truth be told, I often dump my data into Excel just before I report on it, for this sort of thing–often I can spot errors more quickly when I can click to resort/filter/etc.

Using datatables to do that sort of filtering might provide a handy alternative to printing data to the console or RStudio’s View() function.

Getting and using DT

Installing the DT library is pretty easy:
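
Something like this, depending on whether you grab it from RStudio's GitHub repo or from CRAN:

    # from RStudio's GitHub repo (needs devtools)...
    devtools::install_github("rstudio/DT")

    # ...or from CRAN
    install.packages("DT")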

From there, you can datatable-erize any dataframe with a simple call:
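
For example, with iris standing in for whatever data frame you're working with:

    library(DT)
    datatable(iris)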

Assuming you’re using RStudio, the new data should open in the Viewer pane, giving you something like this:

[Image: the datatable rendered in RStudio's Viewer pane, with sorting and filtering controls]

Caveats

The one caveat to be aware of is that using this on big data frames is a bad idea–I tried it on our constituent data (80K rows, 200 columns) and it effectively locked up R and RStudio.

So don’t do that.

For smaller data frames, it’s just fine. I have vague plans to wrap a function around this to get some different defaults (mostly to dump the overly large padding and the serif font).

In any case, enjoy the Christmas present–fun libraries to play with!


Assessing phonathon effectiveness


I was looking at our phonathon data from this semester last week. We made a major change in the way we did ask amounts about half way through–I was curious if that made a difference. So I thought I’d take R for a spin to see if I could assess our phonathon effectiveness.

To be more specific, were people more likely to give when we asked them off the scripts, or when we calculated an ask amount based on their previous gift?

To answer the question, I busted out some t tests, which compare two vectors (ie. groups of numbers) to see if there's a real difference between them.

Preparing the data

I’m going to use some sample data here, so you get the idea. The data looks like this:
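
Roughly this shape, anyway; the column names here are my own mockup, one row per call with a segment, an ask date and the result:

    # a mockup of the kind of data we're working with
    calls <- data.frame(
      id      = 1:6,
      segment = c("alumni lybunt", "alumni lybunt", "never-giver",
                  "never-giver", "sybunt friend", "sybunt friend"),
      askdate = as.Date(c("2014-09-15", "2014-10-20", "2014-09-22",
                          "2014-10-28", "2014-09-29", "2014-11-03")),
      result  = c("pledge", "refusal", "refusal", "gift", "pledge", "refusal"),
      stringsAsFactors = FALSE
    )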

First, I set up an indicator to show which ask method we were using. We made this change around Oct 3, so I just looked at dates–if the ask date was before Oct 3, the ask method was "script"; if it was on or after, it's "calculated".

I also added a boolean variable that indicates whether or not the donor made a commitment, ie. a pledge or a gift. At this point, I don’t care about amount–I just want to know if people were more or less likely to make any commitment at all. I’ll use TRUE for “made a commitment” and FALSE for “turned us down.”
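
In code, both steps fit in one dplyr mutate(), using the mockup's column names:

    library(dplyr)

    calls <- calls %>%
      mutate(
        # scripted asks before the Oct 3 change, calculated asks after
        askmethod  = ifelse(askdate < as.Date("2014-10-03"), "script", "calculated"),
        # TRUE if the donor made any commitment at all, FALSE if they turned us down
        madecommit = result %in% c("pledge", "gift")
      )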

Running a test on each segment

At this point, I could run a t.test comparing madecommit across the two ask methods and see what kind of results I get, but that doesn't seem fair–what if we called never-givers for the first 3 weeks, then maybe sybunt friends for a couple weeks, and only called alumni lybunts for the last week before October 3? We should at least compare similar segments to each other.

The bad news about this is that t.test() will bomb out if you don't have at least 2 observations on both sides of your split, which means that if nobody in our never-givers pool said yes, or if we didn't call anybody in the pool, things get wonky, so we'll need to account for that in our script.

I’m going to use dlply from the plyr package–it works by taking a data frame (that’s the “d” in dlply), splitting it up by whatever variable I tell it, applying the same function to each chunk and then returning a list (that’s the “l”). I’m returning a list b/c the results of t.test() are a list. Since I’m just planning on looking at the output, I don’t want to futz with it.

The variable I’m splitting on is segment, ie. alumni lybunt, never-giver, or whatever.
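
Here's roughly what that looks like, with a guard that skips segments that don't have at least two observations under both ask methods (the t.test() bomb-out mentioned above):

    library(plyr)

    testresults <- dlply(calls, .(segment), function(x) {

      # skip segments we didn't call both before and after the change,
      # or where one side has fewer than 2 calls
      if (length(unique(x$askmethod)) < 2 || any(table(x$askmethod) < 2)) {
        return(NULL)
      }

      # compare the commitment rate between the two ask methods
      t.test(as.numeric(madecommit) ~ askmethod, data = x)
    })

    testresults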

Ok, that’s a lot of output, but as you can see, we’ve got a lot of empty segments, which makes sense–we called most of our segments in one chunk, so we didn’t see before and after results.

Interpreting the results we do have

Now, to interpret the results we do have, let’s take a closer look at one chunk:
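
Since dlply() hands back a named list, you can pull one segment's test out by name (the name here comes from my mockup):

    # look at the t-test for a single segment; the printout ends with the
    # sample estimates (the mean in each group) and shows the p-value above them
    testresults[["alumni lybunt"]]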

Let's look at those last two numbers at the bottom of the printout first: the sample estimates, which compare the means of our two groups. Remember we're averaging the "made or didn't make a commitment" variable, which was a TRUE/FALSE value–R evaluates TRUE as 1 and FALSE as 0, so a low mean means we didn't have many commits and a high mean indicates lots of commits. Our x group was the scripted ask, so 44% made a commitment when we were doing scripted asks–27% made a commitment when we made a calculated ask.

However, we can't quite take this at face value. If we only called 2 sybunts with the scripted ask and they both happened to say yes, and then we switched to the calculated ask and called another 1000 and only 50% of those folks said yes, we'd have means of 1.0 and 0.5–it'd look like Plan B was a royal failure. But it could be that we got lucky and our 2 scripted asks just happened to be good.

That's, in essence, what the p-value tells us: how likely it is that the difference we're seeing between the mean of x and the mean of y occurred by chance. Truthfully, p-value interpretation is really complicated–suffice it to say that if your p-value is bigger than 0.05, you probably shouldn't rely on the difference you're seeing in the means.

In this example, the p-value is 1.01e-08, or .0000000101 (I think I got enough zeros in there). To put it another way: very small. So we can reliably say that there was a difference between the two methods and that our original, scripted method was more effective.

An alternate interpretation

Despite the statistical evidence that our first ask method was better, I'm not ready to chuck calculated ask amounts out the window, and here's why:

We didn't design a good experiment here–there could be lots of other factors at play that we haven't accounted for. What springs to mind first and foremost is that the folks we got in touch with first were the most likely to make commitments anyway.

Since we just split our groups based on date, all we can REALLY say is that folks in the earlier group were more likely to give. Is that because they liked the amounts we asked for? Or because they weren't dodging our calls in the first place?

Next semester, I'm hoping we'll be able to do better, more random testing.


Interpreting the Intercept


One of the things that threw me for a loop when I first started building linear models was how to interpret all those numbers that show up when you hit summary(lm(foo~bar)).

I started off by joining the Stanford Statistical Learning class (which is starting up again in January–I’m planning on going through it again, maybe even making it through the whole class this time!) and they kept saying “well, here’s the intercept, but nobody cares about that.” Of course, that elicited more curiosity than anything, so here’s some info about how to interpret the intercept.

Don’t try to interpret the intercept

Truthfully, a lot of times, the intercept is kind of meaningless and should be ignored. For example, I’ve got a simulated data sample here with two columns:

  • lifetime giving
  • class year

Let’s run a linear model with class year as the predictor and lifetime giving as the outcome:
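
Here's the call, on the assumption that the data frame is named df with columns lifegiving and classyear (rename to taste):

    # lifetime giving as a function of class year
    fit <- lm(lifegiving ~ classyear, data = df)
    summary(fit)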

Should Class Year be numeric?

After typing up this post, particularly the summary at the bottom, I’ve decided that class year should probably be a categorical variable, ie. a factor, rather than numeric.

If I had to do it over again, I’d probably round it by decade and then turn it into a factor, particularly because, at least in our system, we store non-alumni with a class year of 0000 (we also have a few that are 1111 (some legacy pseudo-alumni) and a handful that are 9999 (no clue–I’m scared to ask)).

In any case, I still think it’s a good example to help clear up how to think about the intercept and how to wrap your brain around interpreting it. But it’s clearly a bad way to build a model.

Here, the intercept value is 26,469, which is to say, according to the model, someone with a class year of 0000 (Mary and Joseph, perhaps?) would have an estimated lifetime giving of $26K.

A quick plot of the data illustrates what's going on–old alumni are better donors than younger alumni, so the slope of the line is negative. Extrapolate that all the way back to 0 AD and you've got an estimated $26K.

[Plot: lifetime giving by class year, with a downward-sloping fit line]

Adjust the predictor for meaningful intercepts

We can make the intercept interpretable by making our oldest alumni have a class year of zero. Millikin was founded in 1901–for the sake of simplicity, we’ll make that our zero year by subtracting 1901 from all our class years.
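
The subtraction can live right in the formula with I(), same made-up column names as before:

    # shift class year so that 1901, Millikin's founding, is year zero
    fit2 <- lm(lifegiving ~ I(classyear - 1901), data = df)
    summary(fit2)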

Now we’ve got something interpretable: at Millikin Year Zero, the average lifetime giving is $1540 and for every year going forward, that average drops by $13.

The intercept is the baseline

Obviously, the model itself is a bad one–clearly, there’s a ton of variation, especially with those middle aged donors, not to mention that this is a simulated data set, so it’s not representative of our constituent population.

With all that in mind, this model doesn’t help us draw any real conclusions about the data. Hopefully, though, you get an idea of how to think about the intercept, namely, it’s what your baseline outcome would be if you zeroed out all your predictors.


A great explanation of how linear regression works


This video (by the masterful Sean Rule) is quite possibly the best explanation of how linear regression works that I've ever seen.

Sidebar re: Fundraising

Linear regression is the technique that underlies most predictive modeling.

Most of the time we’re taking some combinations of factors on the x axis with giving on the y axis. Something like “donor code + number of events attended + age” on the x axis and lifetime giving on the y axis.

We want to see how this conglomeration of factors (once we've transformed them into numbers in various ways) affects that y variable, eg. should we expect higher lifetime giving for certain combinations of our variables.

Creating the mathematical model to answer that question is usually done via linear regression.
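
In R terms, that's just the right-hand side of an lm() formula; with illustrative column names, it looks something like this:

    # lifetime giving predicted by donor code, event attendance and age
    model <- lm(lifetimegiving ~ donorcode + eventsattended + age, data = constituents)
    summary(model)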

A couple questions that come up when thinking about linear regression:

Why are we only concerned with the VERTICAL distance on the y axis between each point and the line?

Sean points out: “You minimize the vertical (offset) distance because you’re checking the error between the model (the “best fit” line) and the actual “performance” of the data. By checking the vertical distance, the x – coordinate (input variable) remains consistent between y (data dependent value) and “y hat” (the predicted y-value).”

Or to put it another way: for each value on the x axis, we want to see how far off we are on the y axis. We know our x values–the point of linear regression is to tell us the value of y based on the known value of x. So all we care about is the y distance from the line for any specific x point.

Why do we square the errors instead of just taking absolute value?

Sean covers this briefly about 2:10 in, but let me try to build it out, starting by saying: it’s complex, tricky and the main answer is a dumb one, namely: “that’s how we’ve always done it”.

There’s a better reason that involves calculus, but to be honest with you, I don’t quite get it (the curse of being an English major, I suppose).

Suffice it to say, you probably could use absolute value and get something that works. If you’re really smart, you’ll know the limitations and advantages of each method. I am not really smart, so I’m gonna do it the way most everybody does. That’s a horrible answer, but it ticks the important boxes, namely:

  1. Does it work to give us good predictions most of the time? Yes, yes it does.
  2. Does it treat negative differences as just as important as positive ones? Yes, yes it does.

Ok, that’s good enough for me!

Find column names in R with grep


About half the time, when I’m working in R, I’m querying against a denormalized dump of data from our system of record. (If I was a real rockstar, I’d be querying against the database itself, but I’m not because of reasons.)

The worst part about this is that the column names are generally a wreck, a mix of ugly SQL names and overly pretty readable names. And since we’ve flattened the data, there’s a host of calculated columns with names like “AMT_MAX_GIFT_MAX_DT”. Which is hard to get exactly right for all 200 variables.

I want names!

Tl;dr I can never remember what half the names of these columns are. And because R abbreviates the output of str(), I can’t see them in the RStudio sidebar, either. Even if I could, looking through 200 variables would be a colossal pain, so I devised a way to solve that problem.

My grepnames() function makes it easy to find column names in R.

The grepnames() function
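
The full version lives in my muadc package (more on that below), but a simplified sketch of the idea looks like this:

    # search a data frame's column names with a regular expression;
    # returns the matching names and their column indexes
    grepnames <- function(pattern, df, ignore.case = TRUE) {

      hits <- grep(pattern, names(df), ignore.case = ignore.case)

      data.frame(
        colname   = names(df)[hits],
        colnumber = hits,
        stringsAsFactors = FALSE
      )
    }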

Using grepnames()

You use grepnames() like you would grep: you pass it a regular expression and a dataframe, and it returns a dataframe with column names that match the regular expression and their respective column indexes. Something like this:
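
For example, run against a hypothetical constituents data frame with a couple of donor-ish columns:

    grepnames("donor", constituents)

    #          colname colnumber
    # 1     DONOR_CODE        14
    # 2     Donor Status      87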

This isn't much different than doing grep("foo", names(df)), but it's less typing and if you mistype, you won't end up locking up R. Also, the output is slightly more informative.

By default, it’s not case sensitive – I’m working on the assumption that there’s no telling what a column is named, so trying to get the case right would just be a pain. Plus, you’re rarely doing complicated regular expressions – most often I end up passing it “donor” because I can’t remember how the donor code column is titled.

An R package with grepnames()

This function is part of my muadc package for R, which is on github. It's mostly an assortment of convenience functions, stuff I find myself doing over and over and so wrote functions for. If you have the devtools package installed, you can install it by doing devtools::install_github("crazybilly/muadc").

There are a couple functions which will be useless for you (they're specific to our office), but a few of them, like grepnames(), are pretty handy.

My eventual plan is to build out a full package for higher education fundraising (with a sample data set and some legit documentation) and submit it to CRAN, but I’ll need a bit more time to make that happen.

Until then, happy grepping!

Creating Yes/No Predictors in R


When you’re getting ready to create a predictive model, you spend a LOT of your time trying to whip the data into shape to be useful.

One of the main things I find myself doing is transforming text strings into Yes/No variables. Things like addresses and phone numbers aren’t particularly useful data for building a model in and of themselves–there’s too much variation in the data between everyone’s street addresses for it to mean much (setting aside, for now, the interesting idea of using Forbes Wealthiest Zip Codes data or some such).

On the other hand, transforming a mailing address or email address into a predictor that says "Yes, this person has an address" or "No, we don't have an email for this person" can be really useful, because it minimizes that variation.

Using ifelse to create binary predictors

To do that, we can use ifelse. Here’s what we’ll do. First, let’s mockup a little sample data frame:
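
Something like this mockup, with a mix of filled-in and missing contact info:

    # a little sample data frame: some constituents have contact info, some don't
    df <- data.frame(
      id        = 1:5,
      homephone = c("217-555-0123", "()-", NA, "217-555-0188", "()-"),
      email     = c("a@example.com", NA, NA, "d@example.com", "e@example.com"),
      stringsAsFactors = FALSE
    )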

Next, we'll transform the data (using dplyr, of course), adding a new column which is our transformed variable:
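
Sticking with the mockup's email column, the transformation is one mutate() with an ifelse() inside:

    library(dplyr)

    df <- df %>%
      mutate(hasemail = ifelse(is.na(email) | email == "", "no email", "has email"))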

There you have it: your new binary predictor variable.

Binary transformation as a function

If you’re doing a lot of these sorts of transformations, you’re going to want to use a function so you don’t have to type the same thing over and over.

Here’s the function I wrote to do this (note that the comments above the function are in roxygen style because I’m planning to eventually turn this into a package):
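
In rough outline, it looks like this (a simplified sketch; the function name binaryize is just a placeholder, and the list below covers the important bits):

    #' Make a yes/no predictor from a contact info column
    #'
    #' @param x a data frame with an id column
    #' @param i the index of the column to transform
    #'
    #' @return a data frame with the id column and the new yes/no factor
    binaryize <- function(x, i) {

      colname <- names(x)[i]

      # treat NA, empty strings and empty phone numbers like "()-" as missing
      yesno <- ifelse(
        is.na(x[, i]) | x[, i] == "" | grepl("\\(\\)-", x[, i]),
        paste("no", colname),
        paste("has", colname)
      )

      # order the levels so "no whatever" is the baseline
      yesno <- factor(yesno, levels = c(paste("no", colname), paste("has", colname)))

      data.frame(id = x$id, yesno)
    }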

You’ll notice a few changes from what we did above:

  1. The function accepts a data frame and asks for a column index. This is so you only have to get your data into one big flat file and can reuse this function on columns 2, 5, 23, 75 or whatever.
  2. There’s an extra bit of criteria in the ifelse: grepl("\\(\\)-",x[,i]). My data has a bunch of missing phone numbers that look like this: “()-“. It’s just as much empty data as an NA, so I wrote a regular expression to find those and consider them empty.
  3. I ordered the levels of the factor, because I want to assume that “no address” is the default.
  4. The output of the function is only id and the newly transformed column. I’m assuming you want to build all your predictors and then join all/some/any of them up into one big data file to feed to the linear regression model.

Correlation testing in R


In his new book, Kevin MacDonnell argues that when you’re building a predictive model for something like major giving, you’re gonna want to prioritize your predictors, assuming you’ve got a bunch of possible predictors for the outcome any combination of which may or may not be any good.

Kevin recommends doing Pearson’s correlation test to see how each predictor like “has a business phone” and “number of events attended” correlates to the outcome (ie. “number of major gifts” or “lifetime giving”).

Doing a correlation test in R is pretty simple. Let’s assume you’ve got your predictors and outcome in a data frame with one row per constituent, something like this:

    id hasaddr hasbuphone hascell logOfLifeG
     1    TRUE       TRUE   FALSE   3.218876
     2    TRUE       TRUE   FALSE   5.828946
     3    TRUE       TRUE   FALSE   6.690842
     5    TRUE       TRUE   FALSE   4.382027
     8    TRUE       TRUE    TRUE   5.010635
     9    TRUE      FALSE   FALSE   5.703782

Test Predictors Against Lifetime Giving

You'll remember that a dataframe is just a list of vectors (each column is a vector), so we just need to sapply over that list, comparing each vector to the outcome (in this case our outcome is df$logOfLifeG, ie. the log value of lifetime giving).

You'll note that I'm excluding the first and last columns–that's where my constituent ids and outcome are.

Also, at the end of the function, I grab the 4th item in the list–cor.test() returns a list and the 4th item in the list is the actual data about how correlated the items are:
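
Put together, it looks something like this (the as.numeric() is there because cor.test() wants numeric vectors, not logicals):

    # run a correlation test of each predictor against lifetime giving,
    # keeping the 4th item of each result (the correlation estimate)
    testresults <- sapply(df[, -c(1, ncol(df))], function(x) {
      cor.test(as.numeric(x), df$logOfLifeG)[4]
    })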

Ok, those are the correlation numbers for each predictor, but right now, testresults is just a bare set of values with an ugly name attached to each one.

Make It Readable

A vector is fine if you’re just going to View() it in RStudio, but it’s a pain to read. So let’s turn it into a data frame where the first column is a list of all the column names from the original list:
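
That's one data.frame() call wrapping what we already have, with unlist() flattening testresults into plain numbers:

    testresultsdf <- data.frame(
      predictors  = names(df)[-c(1, ncol(df))],
      correlation = unlist(testresults)
    )

    testresultsdf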

                            predictors correlation
    hasaddr.estimate.cor       hasaddr   0.0249737
    hasbuphone.estimate.cor hasbuphone   0.2008512
    hascell.estimate.cor       hascell   0.0312318

Alrighty, now we’re getting somewhere (note that the row names are still there on the left: annoying, but not really worth dealing with). With a small data frame like this, you don’t really need to do anything else. But if you’ve got 20 or more predictors, you’re going to want to sort it.

Sort the Table

Since negative correlation (ie. if your outcome goes down when the predictor goes up) is just as important as positive, we'll create a new column which is the absolute value of the correlation just so we've got something to accurately sort by. Note that I'm using dplyr here:
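
Here's the pipeline, with the helper column dropped again at the end:

    library(dplyr)

    testresultsdf %>%
      mutate(abscor = abs(correlation)) %>%   # magnitude of the correlation
      arrange(desc(abscor)) %>%               # biggest correlations first
      select(predictors, correlation)         # drop the helper column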

                            predictors correlation
    hasbuphone.estimate.cor hasbuphone   0.2008512
    hascell.estimate.cor       hascell   0.0312318
    hasaddr.estimate.cor       hasaddr   0.0249737

And there’s your correlation table! Now you can start at the top of the table, building your model with your highly correlated predictors.

Excel vs. R (or Why You Should Use R Scripts)


I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.

The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then make a TON of extra formula columns, pivot tables, and various and sundry edits until we finally get to the product we need, which is often substantially different from the original data.

Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Using an R script to take a data set from its original form to the new form solves most of those problems. As the author puts it:

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.

What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).
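
Here's a bare-bones sketch of that workflow (file names and column names are made up):

    library(dplyr)

    # 1. read in the original export
    constituents <- read.csv("constituent-dump.csv", stringsAsFactors = FALSE)

    # 2. make your changes in recorded, repeatable steps
    cleaned <- constituents %>%
      filter(!is.na(classyear)) %>%
      mutate(hasemail = !is.na(email) & email != "") %>%
      arrange(desc(lifetimegiving))

    # 3. write the cleaned-up version back out
    write.csv(cleaned, "constituents-clean.csv", row.names = FALSE)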

The biggest advantage is that you can look at your script, line by line, and see what changes you're making and in what order. So when you have a problem, you can find it a lot more easily and fix it without having to redo your whole spreadsheet.

Scripts in RStudio

If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):

  1. Hit File -> New -> R Script. That’ll open a new script in the Source pane.
  2. Type any number of R commands in the new script.
  3. Save the file as foo.R
  4. Hit the “Source file” button at the top of the Source pane. This runs source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
  5. You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.
Here, I'm opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I've built in the Files pane in the bottom right.

That's all there is to it! The only really tricky thing to note about using scripts is that you'll want to make sure that you put require('dplyr') (or whatever packages you're using) at the top of the script–that way when you run the script next Tuesday right after firing up R, those packages get loaded.

One more quick trick: I don’t recommend using the “source on save” button in RStudio. This runs the script every time you save it. While it seems to be handy, more than once I’ve ended up turning a split-second save on a minor edit into a 2 minute, CPU-churning waste of time–my script had some serious analysis in it. If you’re smart enough to know not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t tell me I didn’t warn you.