New library for viewing data frames: JavaScript DataTables


Here’s something that might turn out to be really cool:

RStudio has released a package to integrate R data with the JavaScript DataTables library.

DataTables, not to be confused with R's data.table package, is a great way to easily make really usable tables online–you get sorting, filtering, and pagination for free without having to write a bunch of nested table/tr/td tags.

Installing this library lets you quickly turn your data frame into a sortable, filterable table you can really play with. Truth be told, I often dump my data into Excel for exactly this sort of thing just before I report on it–I can often spot errors more quickly when I can click around to re-sort and filter.

Using datatables to do that sort of filtering might provide a handy alternative to printing data to the console or RStudio’s View() function.
[Screenshot: filtering a data frame with DT]

Getting and using DT

Installing the DT library is pretty easy:
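```r
# Either of these should do it: install.packages() if DT is on CRAN for you,
# or the commented-out devtools line to grab it straight from RStudio's GitHub repo
install.packages("DT")
# devtools::install_github("rstudio/DT")
```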

From there, you can datatable-erize any data frame with a single call–something like this, with mtcars standing in for your own data:
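```r
library(DT)

# mtcars is just a stand-in for whatever data frame you're working with
datatable(mtcars)
```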

Assuming you’re using RStudio, the new data should open in the Viewer pane, giving you something like this:

[Screenshot: a DT datatable rendered in RStudio's Viewer pane]

Caveats

The one caveat to be aware of is that using this on big data frames is a bad idea–I tried it on our constituent data (80K rows, 200 columns) and it effectively locked up R and RStudio.

So don’t do that.

For smaller data frames, it’s just fine. I have vague plans to wrap a function around this to get some different defaults (mostly to dump the overly large padding and the serif font).

In any case, enjoy the Christmas present–fun libraries to play with!

Categories: Functions

Assessing phonathon effectiveness


I was looking at our phonathon data from this semester last week. We made a major change in the way we did ask amounts about halfway through, and I was curious whether it made a difference. So I thought I'd take R for a spin to see if I could assess our phonathon effectiveness.

To be more specific, were people more likely to give when we asked them off the scripts, or when we calculated an ask amount based on their previous gift?

To answer the question, I busted out some t-tests, which compare two vectors (i.e., groups of numbers) to see whether there's any real difference between them.
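Stripped down to nothing, the call looks like this (the numbers are made up purely to show the shape of it):

```r
# Two vectors of made-up 1s and 0s go in; a comparison of their means comes out
t.test(c(1, 0, 1, 1, 0, 1), c(0, 0, 1, 0, 0, 1))
```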

Preparing the data

I’m going to use some sample data here, so you get the idea. The data looks like this:
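```r
# A made-up stand-in for the real export: column names, dates, and values are
# all hypothetical, and the real file has far more rows and columns
df <- data.frame(
  askdate      = as.Date(c("2014-09-15", "2014-09-22", "2014-10-10", "2014-10-20")),
  segment      = c("Alumni Lybunt", "Never Giver", "Alumni Lybunt", "Sybunt Friend"),
  commitamount = c(50, 0, 25, 0)
)
```

(Those three columns are the only ones the code below leans on, and the year in the dates is a guess.)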

First, I set up an indicator to show which ask method we were using. We made the change around Oct 3, so I just looked at dates–if the ask date was before Oct 3, the ask method was "script"; if it was on or after Oct 3, it was "calculated".

I also added a Boolean variable that indicates whether or not the donor made a commitment, i.e., a pledge or a gift. At this point, I don't care about amount–I just want to know whether people were more or less likely to make any commitment at all. I'll use TRUE for "made a commitment" and FALSE for "turned us down."
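With the made-up column names from above, those two steps look something like this (the cutoff date's year is, again, a guess):

```r
# Before Oct 3 = scripted ask; on or after Oct 3 = calculated ask
df$askmethod <- ifelse(df$askdate < as.Date("2014-10-03"), "script", "calculated")

# TRUE if the donor made any commitment at all (pledge or gift), FALSE otherwise
df$madecommit <- df$commitamount > 0
```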

Running a test on each segment

At this point, I could run a t.test() to compare the madecommit values across the two ask methods and see what kind of results I get, but that doesn't seem fair–what if we called never-givers for the first 3 weeks, then maybe sybunt friends for a couple of weeks, and only called alumni lybunts for the last week before October 3? We should at least compare similar segments to each other.

The bad news is that t.test() will bomb out if you don't have at least 2 observations on both sides of your comparison, which means that if nobody in our never-givers pool said yes, or if we didn't call anybody in a pool at all, things get wonky–we'll need to account for that in our script.

I'm going to use dlply() from the plyr package–it works by taking a data frame (that's the "d" in dlply), splitting it up by whatever variable I tell it, applying the same function to each chunk, and then returning a list (that's the "l"). I'm returning a list because the results of t.test() are a list, and since I'm just planning on looking at the output, I don't want to futz with it.

The variable I'm splitting on is segment, i.e., alumni lybunt, never-giver, or whatever.
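Here's a sketch of that, using the made-up column names from earlier; wrapping t.test() in tryCatch() is one way to account for the segments where it bombs out, though it isn't necessarily how the original script handled it:

```r
library(plyr)

# Split the data frame by segment, run the same function on each chunk, and get
# back a list with one element per segment. tryCatch() swallows the segments
# where t.test() errors out (only one ask method called, too few observations,
# nobody varied at all) and returns NULL for them instead.
results <- dlply(df, .(segment), function(chunk) {
  tryCatch(t.test(madecommit ~ askmethod, data = chunk),
           error = function(e) NULL)
})

results
```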

OK, that's a lot of output, but as you can see, we've got a lot of empty segments, which makes sense–we called most of our segments in one chunk, so we didn't get before-and-after results for them.

Interpreting the results we do have

Now, to interpret the results we do have, let’s take a closer look at one chunk:
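```r
# "Alumni Lybunt" is a hypothetical segment name; use whichever labels show up
# in your own segment column
results[["Alumni Lybunt"]]
```

What comes back is a standard Welch two-sample t-test printout: a t statistic, degrees of freedom, the p-value, a confidence interval, and, at the bottom, the mean of each group.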

Let's look at the last two numbers in that printout first: they compare the means of our two groups. Remember, we're averaging the "made or didn't make a commitment" variable, which was a TRUE/FALSE category–R evaluates TRUE as 1 and FALSE as 0, so a low mean means we didn't have many commitments and a high mean indicates lots of them. Our x group was the scripted ask, so 44% made a commitment when we were doing scripted asks, versus 27% when we made a calculated ask.

However, we can't quite take this at face value. If we only called 2 sybunts with the scripted ask and they both happened to say yes, and then we switched to the calculated ask, called another 1000, and only 50% of those folks said yes, we'd have means of 1.0 and 0.500–it'd look like the calculated ask was a royal failure. But it could be that we just got lucky and our 2 scripted asks happened to go well.

That, in essence, is what the p-value tells us: how likely is it that the difference we're seeing between the mean of x and the mean of y occurred by chance? Truthfully, p-value interpretation is really complicated–suffice it to say that if your p-value is bigger than 0.05, you probably shouldn't rely on the difference you're seeing in the means.

In this example, the p-value is 1.01e-08, or .0000000101 (I think I got enough zeros in there). To put it another way: very small. So we can reliably say that there was a difference between the two methods and that our original, scripted method was more effective.

An alternate interpretation

Despite the statistical evidence that our first ask method was better, I'm not ready to chuck calculated ask amounts out the window, and here's why:

We didn't design a good experiment here–there could be lots of other factors at play that we haven't accounted for. What springs to mind first and foremost is that the folks we got in touch with first were the ones most likely to make commitments anyway.

Since we just split our groups based on date, all we can REALLY say is that folks in the earlier group were more likely to give. Is that because they liked the amounts we asked for? Or because they weren't dodging our calls in the first place?

Next semester, I'm hoping we'll be able to do better, more random testing.

Categories: assessment

Interpreting the Intercept


One of the things that threw me for a loop when I first started building linear models was how to interpret all those numbers that show up when you hit summary(lm(foo~bar)).

I started off by joining the Stanford Statistical Learning class (which is starting up again in January–I’m planning on going through it again, maybe even making it through the whole class this time!) and they kept saying “well, here’s the intercept, but nobody cares about that.” Of course, that elicited more curiosity than anything, so here’s some info about how to interpret the intercept.

Don’t try to interpret the intercept

Truthfully, a lot of times the intercept is kind of meaningless and should be ignored. For example, I've got a simulated data sample here (a stand-in version is sketched below) with two columns:

  • lifetime giving
  • class year
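To keep the later snippets self-contained, here's that stand-in. The column names and numbers are my own invention, so don't expect it to reproduce the exact coefficients quoted below:

```r
set.seed(1)

# Older class years tend toward higher lifetime giving, with plenty of noise
# and a floor at zero -- purely invented, not the post's original simulation
giving <- data.frame(classyear = sample(1940:2015, 500, replace = TRUE))
giving$lifetimegiving <- pmax(0, 30000 - 14 * giving$classyear +
                                   rnorm(500, sd = 3000))
```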

Let’s run a linear model with class year as the predictor and lifetime giving as the outcome:
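```r
# Same assumed column names as the stand-in data frame above
classyear_model <- lm(lifetimegiving ~ classyear, data = giving)
summary(classyear_model)
```

The intercept discussed below is the "(Intercept)" row at the top of the coefficients table in that summary.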

Should Class Year be numeric?

After typing up this post, particularly the summary at the bottom, I've decided that class year should probably be a categorical variable, i.e., a factor, rather than numeric.

If I had to do it over again, I'd probably round it by decade and then turn it into a factor, particularly because, at least in our system, we store non-alumni with a class year of 0000. We also have a few that are 1111 (some legacy pseudo-alumni) and a handful that are 9999 (no clue–I'm scared to ask).
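For what it's worth, that recoding might look something like this (same assumed giving data frame as above):

```r
# NA out the placeholder class years, then bin the real ones by decade
giving$classyear[giving$classyear %in% c(0, 1111, 9999)] <- NA
giving$classdecade <- factor(floor(giving$classyear / 10) * 10)
```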

In any case, I still think it’s a good example to help clear up how to think about the intercept and how to wrap your brain around interpreting it. But it’s clearly a bad way to build a model.

Here, the intercept value is 26,469, which is to say that, according to the model, someone with a class year of 0000 (Mary and Joseph, perhaps?) would have an estimated lifetime giving of $26K.

A quick plot of the data illustrates what's going on–older alumni are better donors than younger alumni, so the slope of the line is negative. Extrapolate that line all the way back to 0 AD and you've got an estimated $26K.
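If you want to reproduce the picture, a base-graphics sketch (same assumed column names) would be:

```r
plot(lifetimegiving ~ classyear, data = giving,
     xlab = "Class year", ylab = "Lifetime giving")
abline(classyear_model)  # add the fitted regression line
```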

[Plot: lifetime giving by class year, with the fitted regression line sloping downward]

Adjust the predictor for meaningful intercepts

We can make the intercept interpretable by making our oldest alumni have a class year of zero. Millikin was founded in 1901–for the sake of simplicity, we’ll make that our zero year by subtracting 1901 from all our class years.
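In code, with the assumed column names, that's just:

```r
# Re-center so the oldest possible class year, 1901, becomes year zero, then refit
giving$millikinyear <- giving$classyear - 1901
summary(lm(lifetimegiving ~ millikinyear, data = giving))
```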

Now we’ve got something interpretable: at Millikin Year Zero, the average lifetime giving is $1540 and for every year going forward, that average drops by $13.

The intercept is the baseline

Obviously, the model itself is a bad one–clearly, there's a ton of variation, especially among those middle-aged donors, not to mention that this is a simulated data set, so it's not representative of our constituent population.

With all that in mind, this model doesn’t help us draw any real conclusions about the data. Hopefully, though, you get an idea of how to think about the intercept, namely, it’s what your baseline outcome would be if you zeroed out all your predictors.

Categories: modeling