I was looking at our phonathon data from this semester last week. We made a major change in the way we did ask amounts about halfway through–I was curious if that made a difference. So I thought I’d take R for a spin to see if I could assess our phonathon effectiveness.
To be more specific, were people more likely to give when we asked them off the scripts, or when we calculated an ask amount based on their previous gift?
To answer the question, I busted out some t-tests, which compare two vectors (i.e. two groups of numbers) to see if there’s any real difference between them.
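If you haven’t run one before, here’s roughly what a t-test looks like in R. The vectors here are made up purely for illustration, not our real data:

```r
# Two made-up groups of gift amounts
group_a <- c(25, 50, 35, 40, 30, 45, 55, 20)
group_b <- c(10, 15, 20, 5, 25, 10, 15, 20)

# Welch two-sample t-test: is the difference in means likely to be real?
result <- t.test(group_a, group_b)
result$p.value  # a small p-value suggests the difference isn't just chance
```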
Preparing the data
I’m going to use some sample data here, so you get the idea. The data looks like this:

## Source: local data frame [16,084 x 5]
## 
##       segment   calldate resultcode commitamt     id
## 1  othernever 2014-09-14         VM        NA 378329
## 2  othernever 2014-09-14         NR        NA 148566
## 3  othernever 2014-09-14         NR        NA 100418
## 4  othernever 2014-09-14         W#        NA 759778
## 5  othernever 2014-09-14         VM        NA 582095
## 6  othernever 2014-09-14         LM        NA 887121
## 7  othernever 2014-09-14         NR        NA 104517
## 8  othernever 2014-09-14         LM        NA 209933
## 9  othernever 2014-09-14         W#        NA 320840
## 10 othernever 2014-09-14         VM        NA 336854
## ..        ...        ...        ...       ...    ...
First, I set up an indicator to show which ask method we were using. We made this change around Oct 3, so I just looked at dates–if the call date was before Oct 3, the ask method was “script”; if it was on or after Oct 3, it’s “calculated”.
I also added a boolean variable that indicates whether or not the donor made a commitment, i.e. a pledge or a gift. At this point, I don’t care about amount–I just want to know if people were more or less likely to make any commitment at all. I’ll use TRUE for “made a commitment” and FALSE for “turned us down.”

phonathon <- phonathon %>%
  mutate(
    askmethod = factor(ifelse(calldate >= as.POSIXct('2014-10-03'),
                              "calculated", "script")),
    madecommit = ifelse(grepl("PLCC", resultcode), TRUE, FALSE)
  )
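It’s worth sanity-checking an indicator like this before trusting it. Here’s a quick check on a toy data frame (the dates and result codes are made up), tabling the new column to confirm the split lands where you expect:

```r
library(dplyr)

# Toy data standing in for the real call file
calls <- data.frame(
  calldate = as.POSIXct(c('2014-09-14', '2014-10-01', '2014-10-03', '2014-10-20')),
  resultcode = c('P', 'NR', 'P', 'VM')
)

# Same indicator logic as above, applied to the toy data
calls <- calls %>%
  mutate(askmethod = factor(ifelse(calldate >= as.POSIXct('2014-10-03'),
                                   "calculated", "script")))

table(calls$askmethod)  # two calls on each side of the Oct 3 cutoff
```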
Running a test on each segment
At this point, I could run a t.test to compare the madecommit rates under each ask method and see what kind of results I get, but that doesn’t seem fair–what if we called nevergivers for the first 3 weeks, then maybe sybunt friends for a couple of weeks, and only called alumni lybunts for the last week before October 3? We should at least compare similar segments to each other.
The bad news about this is that t.test() will bomb out if you don’t have at least 2 observations on both sides of your group, which means that if nobody in our nevergivers pool said yes, or if we didn’t call anybody in the pool, things get wonky, so we’ll need to account for that in our script.
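You can see the failure mode directly with toy vectors: a Welch test refuses to run when one side has fewer than two observations, so it has to be caught or avoided:

```r
# One side has a single observation: t.test() throws an error,
# which we catch here just to show the failure mode
err <- tryCatch(t.test(c(1), c(0, 1, 1)),
                error = function(e) conditionMessage(e))
err  # the error message complains about not having enough observations
```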
I’m going to use dlply from the plyr package–it works by taking a data frame (that’s the “d” in dlply), splitting it up by whatever variable I tell it, applying the same function to each chunk, and then returning a list (that’s the “l”). I’m returning a list because the results of t.test() are a list, and since I’m just planning on looking at the output, I don’t want to futz with it.
The variable I’m splitting on is segment, i.e. alumni lybunt, nevergiver, or whatever.
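Here’s dlply on a tiny made-up data frame, so you can see the split-apply-return-a-list pattern before we point it at real data:

```r
library(plyr)

# Toy data: two segments, four calls each
toy <- data.frame(
  segment = rep(c("a", "b"), each = 4),
  madecommit = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# One result per segment, returned as a named list
rates <- dlply(toy, .(segment), function(x) mean(x$madecommit))
rates$a  # 0.75
rates$b  # 0.25
```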

dlply(phonathon, .(segment), function(x) {
  # segment the data into our two sections
  askscript <- filter(x, askmethod == 'script')
  askcalc <- filter(x, askmethod == 'calculated')
  # check to make sure you've actually got results to compare
  if (length(askscript$madecommit) - sum(is.na(askscript$madecommit)) >= 2 &
      length(askcalc$madecommit) - sum(is.na(askcalc$madecommit)) >= 2) {
    # do the test for each segment
    t.test(askscript$madecommit, askcalc$madecommit)
  }
})

## $AF15donor
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = NaN, df = NaN, p-value = NA
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  NaN NaN
## sample estimates:
## mean of x mean of y 
##         0         0 
## 
## 
## $alumLybunt
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = 5.7791, df = 973.908, p-value = 1.01e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1082967 0.2196614
## sample estimates:
## mean of x mean of y 
## 0.4371134 0.2731343 
## 
## 
## $alumNever
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = 0.013, df = 1083.668, p-value = 0.9896
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006639543  0.006728072
## sample estimates:
##   mean of x   mean of y 
## 0.006931317 0.006887052 
## 
## 
## $alumSybunt
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = 4.1869, df = 1866.558, p-value = 2.96e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01533117 0.04235085
## sample estimates:
##  mean of x  mean of y 
## 0.05432288 0.02548187 
## 
## 
## $buphone
## NULL
## 
## $currentparent
## NULL
## 
## $faculty
## NULL
## 
## $leadership
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = 0.3125, df = 47.848, p-value = 0.756
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2081296  0.2847254
## sample estimates:
## mean of x mean of y 
## 0.4000000 0.3617021 
## 
## 
## $otherlybunt
## NULL
## 
## $othernever
## NULL
## 
## $yaNever
## NULL
## 
## $yaSybunt
## NULL
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##          segment
## 1      AF15donor
## 2     alumLybunt
## 3      alumNever
## 4     alumSybunt
## 5        buphone
## 6  currentparent
## 7        faculty
## 8     leadership
## 9    otherlybunt
## 10    othernever
## 11       yaNever
## 12      yaSybunt
Ok, that’s a lot of output, but as you can see, we’ve got a lot of empty segments, which makes sense–we called most of our segments in one chunk, so we don’t have before-and-after results to compare for them.
Interpreting the results we do have
Now, to interpret the results we do have, let’s take a closer look at one chunk:

## $alumLybunt
## 
##  Welch Two Sample t-test
## 
## data:  askscript$madecommit and askcalc$madecommit
## t = 5.7791, df = 973.908, p-value = 1.01e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1082967 0.2196614
## sample estimates:
## mean of x mean of y 
## 0.4371134 0.2731343 
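One thing to keep in mind when reading those sample estimates: the mean of a logical vector in R is just the proportion of TRUEs, since TRUE counts as 1 and FALSE as 0. A toy vector makes this concrete:

```r
# Four made-up call outcomes: did each donor commit?
commits <- c(TRUE, FALSE, TRUE, TRUE)
mean(commits)  # 0.75, i.e. 75% made a commitment
```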
Let’s look at those last two numbers first: these compare the means of our two groups. Remember we’re averaging the “made or didn’t make a commitment” variable, which was a TRUE/FALSE category–R evaluates TRUE as 1 and FALSE as 0, so a low mean means we didn’t have many commits and a high mean indicates lots of commits. Our x group was the scripted ask, so 44% made a commitment when we were doing scripted asks–27% made a commitment when we made a calculated ask.
However, we can’t quite take this at face value. If we only called 2 sybunts with the scripted ask and they both happened to say yes, and then we switched to the calculated ask and called another 1000 and only 50% of those folks said yes, we’d have means of 1.0 and 0.5–it’d look like Plan B was a royal failure. But it could be that we just got lucky and our 2 scripted asks happened to go well.
That’s, in essence, what the p-value tells us: how likely is it that the difference we’re seeing between the mean of x and the mean of y occurred by chance? Truthfully, p-value interpretation is really complicated–suffice it to say that if your p-value is bigger than 0.05, you probably shouldn’t rely on the difference you’re seeing in the means.
In this example, the p-value is 1.01e-08, or .0000000101 (I think I got enough zeros in there). To put it another way: very small. So we can reliably say that there was a difference between the two methods and that our original method was more effective.
An alternate interpretation
Despite the statistical evidence that our first ask method performed better, I’m not ready to chuck calculated ask amounts out the window, and here’s why:
We didn’t design a good experiment here–there could be lots of other factors at play that we haven’t accounted for.
What springs to mind first and foremost is that the folks we got in touch with first were the most likely to make commitments.
Since we just split our groups based on date, all we can REALLY say is that folks in the earlier group were more likely to give. Is that because they liked the amounts we asked for? Or because they weren’t dodging our calls in the first place?
Next semester, I’m hoping we’ll be able to do better, more random testing.