Comparing Dr Who episodes by decade


A little while ago, io9 rated every Dr. Who episode from best to worst.

I immediately noticed that a bunch of their favorites were from the reboot, despite the fact that there’s a lot more content in the older series. So I decided to pull the data into R to see if I was imagining things. I know this isn’t fundraising-related, but it IS R related, and it was a fun project to work on over lunch.

Here’s a plot of all the episodes with year on the x-axis and rank on the y-axis. Remember that higher rank is worse.
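Something like the following would produce a plot along those lines, assuming the scraped rankings live in a data frame called episodes with year and rank columns (both names are placeholders):

```r
# Scatterplot of io9 rank (1 = best) against the year each episode aired.
library(ggplot2)

ggplot(episodes, aes(x = year, y = rank)) +
  geom_point() +
  labs(x = "Year aired", y = "io9 rank (1 = best)")
```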

[Figure: dr-who-episode-rankings (scatterplot of io9 rank by year aired)]

It definitely LOOKS like the new stuff is better, but I’ll bet we can do better than just eyeballing it.

Here’s what I found, grouping all the episodes by rounding the year to the nearest decade (i.e., 1953 becomes 1950; 1958 becomes 1960):
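The grouping itself is just a rounding step; a minimal sketch, still using the assumed episodes data frame:

```r
# Round each year to the nearest decade: 1953 -> 1950, 1958 -> 1960.
episodes$decade <- round(episodes$year, -1)
```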

Let’s Look at Averages

In terms of mean rank, the reboot was WAY better than everything (and the 1990s stuff had an overall average rank that was much worse):
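Sticking with the made-up episodes and decade names, the averages are one aggregate() call:

```r
# Mean io9 rank per decade (remember: a lower rank is better).
aggregate(rank ~ decade, data = episodes, FUN = mean)
```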

You can see there that, on average, the reboot-era decades were the best and the 1990s were the worst.

T-test to really see

But this is just averages–seems like we can do better. I mean, there could be an outlier sitting out there yanking those averages down (or up, as it were).

With that in mind, I ran t-tests comparing each decade to the others:
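A sketch of one way to get the whole grid of comparisons at once is pairwise.t.test(), again using the assumed episodes and decade names:

```r
# Run a t-test for every pair of decades, without pooling variances or
# adjusting the p-values.
pv <- pairwise.t.test(episodes$rank, episodes$decade,
                      pool.sd = FALSE, p.adjust.method = "none")

# TRUE wherever the difference between two decades is significant at the 5% level.
pv$p.value < 0.05
```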

Here I’m asking whether the p-values are less than .05. Or to put it another way: is there less than a 5% chance that the difference we’re seeing in the averages for each decade occurred by chance? TRUE values mark the comparisons where the difference is statistically significant, i.e. where there’s less than a 5% chance it happened by luck.

Old Stuff Was A Crapshoot

The other thing you can see is that the old stuff was all over the map. You can see this from the plot above, but there’s not a statistically significant difference in ranks between the 60s, 70s, 80s and 90s. I found that surprising–everybody knows that stuff at the very end of the original run was awful, right? Not io9, apparently.

In any case, if you’re introducing your friends to Dr. Who, start with the reboot–that old stuff is like a box of chocolates….


Assessing phonathon effectiveness


I was looking at our phonathon data from this semester last week. We made a major change in the way we did ask amounts about halfway through–I was curious whether that made a difference. So I thought I’d take R for a spin to see if I could assess our phonathon effectiveness.

To be more specific, were people more likely to give when we asked them off the scripts, or when we calculated an ask amount based on their previous gift?

To answer the question, I busted out some t-tests, which compare two vectors (i.e., groups of numbers) to see whether there’s a real difference between them.
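If you’ve never run one, t.test() just takes two numeric vectors; here’s a toy example with made-up numbers:

```r
# Two made-up vectors of gift amounts, just to show the mechanics of t.test().
group_a <- c(25, 50, 10, 100, 25, 50)
group_b <- c(10, 10, 25, 15, 10, 20)
t.test(group_a, group_b)
```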

Preparing the data

I’m going to use some sample data here, so you get the idea. The data looks like this:
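Something along these lines, with an ask date, a segment label, and the amount committed (the column names, segment labels, dates, and amounts are all illustrative stand-ins):

```r
# Made-up phonathon call records; every name and value here is illustrative.
calls <- data.frame(
  ask_date      = as.Date(c("2013-09-20", "2013-09-22", "2013-09-28", "2013-09-30",
                            "2013-10-05", "2013-10-07", "2013-10-10", "2013-10-12")),
  segment       = c("Alumni LYBUNT", "Alumni LYBUNT", "Never-giver", "Alumni LYBUNT",
                    "Alumni LYBUNT", "Alumni LYBUNT", "SYBUNT friend", "Alumni LYBUNT"),
  commit_amount = c(50, 0, 0, 25, 0, 10, 0, 0)   # dollars pledged/given; 0 = no commitment
)
```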

First, I set up an indicator to show which ask method we were using. We made this change around Oct 3, so I just looked at dates–if the ask date was before Oct 3, the ask method was “script”. If it was on or after Oct 3, it’s “calculated”.
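On the stand-in data, that’s one ifelse() on the date:

```r
# Calls dated before Oct 3 got the scripted ask; calls on or after Oct 3 got
# the calculated ask. (The year here is arbitrary; only the Oct 3 cutoff matters.)
calls$ask_method <- ifelse(calls$ask_date < as.Date("2013-10-03"),
                           "script", "calculated")
```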

I also added a boolean variable that indicates whether or not the donor made a commitment, i.e. a pledge or a gift. At this point, I don’t care about amount–I just want to know if people were more or less likely to make any commitment at all. I’ll use TRUE for “made a commitment” and FALSE for “turned us down.”
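That flag is just a comparison on the committed amount:

```r
# TRUE if the donor pledged or gave anything at all, FALSE if they turned us down.
calls$madecommit <- calls$commit_amount > 0
```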

Running a test on each segment

At this point, I could run a t.test to compare the madecommits under each ask method and see what kind of results I get, but that doesn’t seem fair–what if we called never-givers for the first 3 weeks, then maybe sybunt friends for a couple of weeks, and only called alumni lybunts for the last week before October 3? We should at least compare similar segments to each other.

The bad news is that t.test() will bomb out if you don’t have at least 2 observations on each side of the comparison, which means that if nobody in our never-givers pool said yes, or if we didn’t call anybody in the pool, things get wonky, so we’ll need to account for that in our script.

I’m going to use dlply from the plyr package–it works by taking a data frame (that’s the “d” in dlply), splitting it up by whatever variable I tell it, applying the same function to each chunk and then returning a list (that’s the “l”). I’m returning a list b/c the results of t.test() are a list. Since I’m just planning on looking at the output, I don’t want to futz with it.

The variable I’m splitting on is segment, i.e. alumni lybunt, never-giver, or whatever.
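A sketch of what that dlply() call could look like, with a guard for the too-few-observations problem mentioned above (the segment and column names are the stand-ins from earlier):

```r
library(plyr)

# For each segment, compare commitment rates under the two ask methods.
results <- dlply(calls, "segment", function(chunk) {
  x <- as.numeric(chunk$madecommit[chunk$ask_method == "script"])
  y <- as.numeric(chunk$madecommit[chunk$ask_method == "calculated"])

  # t.test() errors out when a side has fewer than 2 observations, or when the
  # data are constant (e.g. nobody in a pool said yes), so hand back NULL for
  # segments we can't compare.
  if (length(x) < 2 || length(y) < 2) return(NULL)
  tryCatch(t.test(x, y), error = function(e) NULL)
})

results
```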

Ok, that’s a lot of output, but as you can see, we’ve got a lot of empty segments, which makes sense–we called most of our segments in one chunk, so we didn’t see before and after results.

Interpreting the results we do have

Now, to interpret the results we do have, let’s take a closer look at one chunk:
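You can pull a single segment’s result out of that list by name (the segment name below matches the stand-in data; the numbers discussed next come from our real call data, not the toy example). A t.test() printout ends with the two sample means, which is where we’ll start:

```r
# Pull one segment's t-test result out of the list by name.
results[["Alumni LYBUNT"]]
```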

Let’s look at those last two numbers first: these compare the means of our two groups. Remember we’re averaging the “made or didn’t make a commitment” variable, which was a TRUE/FALSE category–R evaluates TRUE as 1 and FALSE as 0, so a low mean means we didn’t have many commits and a high mean indicates lots of commits. Our x group was the scripted ask, so 44% made a commitment when we were doing scripted asks–27% made a commitment when we made a calculated ask.

However, we can’t quite take this at face value. If we only called 2 sybunts with the scripted ask and they both happened to say yes, and then we switched to the calculated ask and called another 1000 and only 50% of those folks said yes, we’d have means of 1.0 and .500–it’d look like Plan B was a royal failure. But it could be that we got lucky and our 2 scripted asks just happened to be good.

That’s, in essence, what the p-value tells us: how likely it is that the difference we’re seeing between the mean of x and the mean of y occurred by chance. Truthfully, p-value interpretation is really complicated–suffice it to say that if your p-value is bigger than 0.05, you probably shouldn’t rely on the difference you’re seeing in the means.

In this example, the p-value is 1.01e-08, or .0000000101 (I think I got enough zeros in there). To put it another way: very small. So we can reliably say that there was a difference between the two methods and that our original method was more effective.

An alternate interpretation

Despite the statistical evidence that our first ask method was better, I’m not ready to chuck calculated ask amounts out the window, and here’s why: we didn’t design a good experiment here–there could be lots of other factors at play that we haven’t accounted for. What springs to mind first and foremost is that the folks we got in touch with first were the most likely to make commitments.

Since we just split our groups based on date, all we can REALLY say is that folks in the earlier group were more likely to give. Is that because they liked the amounts we asked for? Or because they weren’t dodging our calls in the first place?

Next semester, I’m hoping we’ll be able to do better, more random testing.
