Assessing phonathon effectiveness


Last week I was looking at our phonathon data from this semester. We made a major change in the way we did ask amounts about halfway through, and I was curious whether that made a difference. So I thought I'd take R for a spin and see if I could assess our phonathon effectiveness.

To be more specific, were people more likely to give when we asked them off the scripts, or when we calculated an ask amount based on their previous gift?

To answer the question, I busted out some t-tests, which compare two vectors, i.e. groups of numbers, to see whether there's a real difference between them.

Preparing the data

I’m going to use some sample data here, so you get the idea. The data looks like this:
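Something along these lines; the column names and dates here are just placeholders I'm using for illustration, and the real file obviously has a lot more rows and columns:

    # Toy stand-in for the calling data (column names and dates are made up)
    calls <- data.frame(
      segment    = c(rep("Alumni Lybunt", 4), rep("Never Giver", 2), rep("Sybunt Friend", 2)),
      ask.date   = as.Date(c("2013-09-20", "2013-09-25", "2013-10-10", "2013-10-15",
                             "2013-09-22", "2013-09-28", "2013-10-08", "2013-10-12")),
      pledge.amt = c(50, 0, 25, 0, 0, 0, 0, 100)
    )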

First, I set up an indicator to show which ask method we were using. We made this change around Oct 3, so I just looked at dates: if the ask date was before Oct 3, the ask method was "script"; if it was on or after Oct 3, it was "calculated".

I also added a boolean variable that indicates whether or not the donor made a commitment, i.e. a pledge or a gift. At this point, I don't care about the amount; I just want to know whether people were more or less likely to make any commitment at all. I'll use TRUE for "made a commitment" and FALSE for "turned us down."
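In code, those two steps look roughly like this, using my placeholder column names from above:

    # Which ask method was in effect? Calls before the Oct 3 switch used the script;
    # calls on or after it used a calculated ask. (The year is just for the toy data.)
    calls$ask.method <- ifelse(calls$ask.date < as.Date("2013-10-03"), "script", "calculated")

    # TRUE if the donor made any commitment (pledge or gift), FALSE if they turned us down
    calls$made.commit <- calls$pledge.amt > 0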

Running a test on each segment

At this point, I could run a t.test to compare the commitment rates under each ask method and see what kind of results I get, but that doesn't seem fair. What if we called never-givers for the first three weeks, then maybe sybunt friends for a couple of weeks, and only called alumni lybunts for the last week before October 3? We should at least compare similar segments to each other.
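For reference, that naive, all-in-one comparison would look something like this with the toy data above (scripted asks as x, calculated asks as y):

    # Naive comparison: commitment rates for scripted vs. calculated asks,
    # ignoring which segments got called when
    with(calls, t.test(made.commit[ask.method == "script"],
                       made.commit[ask.method == "calculated"]))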

The bad news about running a separate test on each segment is that t.test() will bomb out if you don't have at least 2 observations on each side of the split, which means that if we didn't call anybody in a pool before (or after) the switch, or if nobody at all in a pool said yes, things get wonky. We'll need to account for that in our script.
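One way to handle that is a little wrapper that checks the counts before calling t.test() and just returns NULL when a chunk can't support a test; safe.t.test is a name I've made up for this sketch:

    # Guarded t.test: only run the test if this chunk has at least 2 scripted-ask
    # calls and at least 2 calculated-ask calls
    safe.t.test <- function(chunk) {
      x <- chunk$made.commit[chunk$ask.method == "script"]
      y <- chunk$made.commit[chunk$ask.method == "calculated"]
      if (length(x) < 2 || length(y) < 2) return(NULL)
      # t.test() can still choke if every answer in a chunk is identical,
      # so fall back to NULL on any error
      tryCatch(t.test(x, y), error = function(e) NULL)
    }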

I'm going to use dlply from the plyr package. It works by taking a data frame (that's the "d" in dlply), splitting it up by whatever variable I tell it, applying the same function to each chunk, and then returning a list (that's the "l"). I'm returning a list because the result of t.test() is itself a list, and since I'm just planning on looking at the output, I don't want to futz with it.

The variable I'm splitting on is segment, i.e. alumni lybunt, never-giver, or whatever.
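Putting that together (assuming the calls data frame and the safe.t.test wrapper sketched above), the call is a one-liner:

    library(plyr)

    # Split calls by segment, run the guarded t.test on each chunk,
    # and collect everything in a list with one element per segment
    results <- dlply(calls, .(segment), safe.t.test)
    results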

OK, that's a lot of output, but as you can see, we've got a lot of empty segments. That makes sense: we called most of our segments in one chunk, entirely before or entirely after the switch, so there are no before-and-after results to compare for them.

Interpreting the results we do have

Now, to interpret the results we do have, let’s take a closer look at one chunk:
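With my toy column names, pulling one segment out of the list by name looks like this; on real data the element is the familiar t.test printout, and segments that couldn't be tested just come back NULL:

    # Inspect one segment's test (the name is whatever appears in the segment column)
    results[["Alumni Lybunt"]]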

Let's look at those last two numbers first: these compare the means of our two groups. Remember we're averaging the "made or didn't make a commitment" variable, which was a TRUE/FALSE category. R evaluates TRUE as 1 and FALSE as 0, so a low mean means we didn't have many commits and a high mean indicates lots of commits. Our x group was the scripted ask, so 44% made a commitment when we were doing scripted asks, and 27% made a commitment when we made a calculated ask.

However, we can't quite take this at face value. If we only called 2 sybunts with the scripted ask and they both happened to say yes, and then we switched to the calculated ask, called another 1000, and only 50% of those folks said yes, we'd have means of 1.0 and 0.500. It'd look like Plan B was a royal failure, but it could be that we got lucky and our 2 scripted asks just happened to go well.

That's, in essence, what the p-value tells us: how likely is it that the difference we're seeing between the mean of x and the mean of y occurred by chance. Truthfully, p-value interpretation is really complicated; suffice it to say that if your p-value is bigger than 0.05, you probably shouldn't rely on the difference you're seeing in the means.

In this example, the p-value is 1.01e-08, or .0000000101 (I think I got enough zeros in there). To put it another way: very small. So we can reliably say that there was a difference between the two methods and that our original method was more effective.

An alternate interpretation

Despite the statistical evidence that our first ask method was better, I'm not ready to chuck calculated ask amounts out the window, and here's why:

We didn't design a good experiment here; there could be lots of other factors at play that we haven't accounted for. What springs to mind first and foremost is that the folks we got in touch with first were probably the folks most likely to make commitments anyway.

Since we just split our groups based on date, all we can REALLY say is that folks in the earlier group were more likely to give. Is that because they liked the amounts we asked for? Or because they weren't dodging our calls in the first place?

Next semester, I'm hoping we'll be able to set up a better, more randomized test.

Categories: assessment

One Reply to “Assessing phonathon effectiveness”

  1. I can’t pretend I understand anything in the foreign language boxes you’ve created here. But I’ve done phone-a-thons at my Alma Mater, and have received calls from them ever since. The way you’re processing data, and looking outside the box (or perhaps examining the same box from another angle) here is impressive. I can’t imagine being able to choose factors to study (time of day, female/male voices, weather, etc.) and having to select one. Here we go, Big Data…

    On the other hand…you have politicians like Mr. Bush in Texas pre-emptively allowing access to hundreds of thousands of e-mails under the false pretense of total transparency as he moves into larger political aspirations. Tons of information doesn’t equal “right/necessary” information.

    All that to say, I’d read your posts about phone-a-thoning before I’d read his e-mails….for what that’s worth.
