A great explanation about how linear regression works


This video (by the masterful Sean Rule) is quite possibly the best explanation of how linear regression works that I’ve ever seen.

Sidebar re: Fundraising

Linear regression is the technique that underlies most predictive modeling.

Most of the time we’re taking some combination of factors on the x axis with giving on the y axis. Something like “donor code + number of events attended + age” on the x axis and lifetime giving on the y axis.

We want to see how this conglomeration of factors (once we’ve transformed them into numbers in various ways) affects that y variable, e.g. should we expect a higher amount of lifetime giving for certain combinations of our variables?

Creating the mathematical model to answer that question is usually done via linear regression.
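
In R terms, that model is a one-liner with lm(). Here’s a minimal sketch on made-up data – the column names donorcode, events, age and lifetimegiving are stand-ins for whatever your file actually calls them:

```r
# mock data standing in for a real donor file
donors <- data.frame(
  donorcode      = factor(c("A", "B", "A", "B", "A", "B")),
  events         = c(0, 3, 1, 5, 2, 4),
  age            = c(34, 58, 45, 62, 29, 51),
  lifetimegiving = c(50, 1200, 300, 2500, 150, 1800)
)

# fit lifetime giving (the y) against the combination of factors (the x's)
fit <- lm(lifetimegiving ~ donorcode + events + age, data = donors)

# the coefficients show how each factor moves lifetime giving
summary(fit)
```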

A couple questions that come up when thinking about linear regression:

Why are we only concerned with the VERTICAL distance on the y axis between each point and the line?

Sean points out: “You minimize the vertical (offset) distance because you’re checking the error between the model (the “best fit” line) and the actual “performance” of the data. By checking the vertical distance, the x – coordinate (input variable) remains consistent between y (data dependent value) and “y hat” (the predicted y-value).”

Or to put it another way: for each value on the x axis, we want to see how far off we are on the y axis. We know our x values–the point of linear regression is to tell us the value of y based on the known value of x. So all we care about is the y distance from the line for any specific x point.
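
You can see those vertical distances directly in R. With the fit from the sketch above, the residuals are just each actual y minus its predicted y-hat:

```r
# vertical distances: actual y minus predicted y-hat
yhat <- fitted(fit)
res  <- donors$lifetimegiving - yhat

# R computes the same thing for you
all.equal(res, residuals(fit))  # TRUE
```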

Why do we square the errors instead of just taking absolute value?

Sean covers this briefly about 2:10 in, but let me try to build it out, starting by saying: it’s complex and tricky, and the main answer is a dumb one, namely: “that’s how we’ve always done it”.

There’s a better reason that involves calculus (the short version: a squared-error curve is smooth, so you can find its minimum by setting a derivative to zero, while absolute value has a sharp corner where that trick breaks down), but to be honest with you, I don’t quite get it (the curse of being an English major, I suppose).

Suffice it to say, you probably could use absolute value and get something that works. If you’re really smart, you’ll know the limitations and advantages of each method. I am not really smart, so I’m gonna do it the way most everybody does. That’s a horrible answer, but it ticks the important boxes, namely:

  1. Does it work to give us good predictions most of the time? Yes, yes it does.
  2. Does it treat negative differences as just as important as positive ones? Yes, yes it does.
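
For what it’s worth, that second box is easy to check in R: raw differences cancel each other out, while both squaring and absolute value count a miss in either direction as a miss.

```r
errors <- c(-100, 100)  # one miss below the line, one above

sum(errors)       # 0     -- raw differences cancel out
sum(abs(errors))  # 200   -- absolute value counts both misses
sum(errors^2)     # 20000 -- squaring counts both (and punishes big misses harder)
```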

Ok, that’s good enough for me!

Find column names in R with grep


About half the time, when I’m working in R, I’m querying against a denormalized dump of data from our system of record. (If I was a real rockstar, I’d be querying against the database itself, but I’m not because of reasons.)

The worst part about this is that the column names are generally a wreck, a mix of ugly SQL names and overly pretty readable names. And since we’ve flattened the data, there’s a host of calculated columns with names like “AMT_MAX_GIFT_MAX_DT”, which are hard to type exactly right across all 200 variables.

I want names!

Tl;dr: I can never remember what half of these columns are named. And because R abbreviates the output of str(), I can’t see them in the RStudio sidebar, either. Even if I could, looking through 200 variables would be a colossal pain, so I devised a way to solve that problem.

My grepnames() function makes it easy to find column names in R.

The grepnames() Function
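
The real thing lives in my muadc package (more on that below), but here’s a minimal sketch of the idea – the implementation details are illustrative, not a copy of the package code:

```r
# find column names matching a pattern, case-insensitively
grepnames <- function(pattern, df) {

  # match without regard to case, since column name casing is anyone's guess
  i <- grep(pattern, names(df), ignore.case = TRUE)

  # return the matches alongside their column indexes
  data.frame(colindex = i, colname = names(df)[i])
}
```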

Using grepnames()

You use grepnames() like you would grep: you pass it a regular expression and a dataframe, and it returns a dataframe with column names that match the regular expression and their respective column indexes. Something like this:
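
With some made-up column names for illustration:

```r
df <- data.frame(
  ID             = 1:3,
  DonorCode      = c("A", "B", "A"),
  AMT_MAX_GIFT   = c(100, 250, 50),
  LifetimeGiving = c(500, 1200, 75)
)

grepnames("gi", df)
#>   colindex        colname
#> 1        3   AMT_MAX_GIFT
#> 2        4 LifetimeGiving
```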

This isn’t much different than doing grep("foo", names(df)), but it’s less typing, and if you mistype, you won’t end up locking up R. Also, the output is slightly more informative.

By default, it’s not case sensitive – I’m working on the assumption that there’s no telling what a column is named, so trying to get the case right would just be a pain. Plus, you’re rarely doing complicated regular expressions – most often I end up passing it “donor” because I can’t remember how the donor code column is titled.

An R package with grepnames()

This function is part of my muadc package for R, which is on GitHub. It’s mostly an assortment of convenience functions, stuff I find myself doing over and over and so wrote functions for. If you have the devtools package installed, you can install it with devtools::install_github("crazybilly/muadc").
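
Spelled out, the install looks like this:

```r
# install devtools first if you don't already have it
install.packages("devtools")

# then grab muadc straight from GitHub
devtools::install_github("crazybilly/muadc")
library(muadc)
```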

There are a couple functions which will be useless for you (they’re specific to our office), but a few of them, like grepnames(), are pretty handy.

My eventual plan is to build out a full package for higher education fundraising (with a sample data set and some legit documentation) and submit it to CRAN, but I’ll need a bit more time to make that happen.

Until then, happy grepping!

Creating Yes/No Predictors in R


When you’re getting ready to create a predictive model, you spend a LOT of your time trying to whip the data into shape to be useful.

One of the main things I find myself doing is transforming text strings into Yes/No variables. Things like addresses and phone numbers aren’t particularly useful data for building a model in and of themselves–there’s too much variation in the data between everyone’s street addresses for it to mean much (setting aside, for now, the interesting idea of using Forbes Wealthiest Zip Codes data or some such).

On the other hand, transforming a mailing address or email address into a predictor that says “Yes, this person has an address” or “No, we don’t have an email for this person” can be really useful data, minimizing the variation.

Using ifelse to create binary predictors

To do that, we can use ifelse. Here’s what we’ll do. First, let’s mock up a little sample data frame:
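
Something along these lines, with phone numbers standing in for whatever contact field you’re transforming:

```r
# a little contact file: ids plus a phone column with some gaps
contacts <- data.frame(
  id    = 1:4,
  phone = c("555-867-5309", NA, "555-123-4567", ""),
  stringsAsFactors = FALSE
)
```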

Next, we’ll transform the data (using dplyr, of course), adding a new column which is our transformed variable:
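
With the sample data above, that’s a one-line mutate (hasphone is just a name I picked):

```r
library(dplyr)

# "yes" if we have a number, "no" if it's NA or empty
contacts <- contacts %>%
  mutate(hasphone = ifelse(is.na(phone) | phone == "", "no", "yes"))

contacts
#>   id        phone hasphone
#> 1  1 555-867-5309      yes
#> 2  2         <NA>       no
#> 3  3 555-123-4567      yes
#> 4  4                    no
```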

There you have it: your new binary predictor variable.

Binary transformation as a function

If you’re doing a lot of these sorts of transformations, you’re going to want to use a function so you don’t have to type the same thing over and over.

Here’s the function I wrote to do this (note that the comments above the function are in roxygen style because I’m planning to eventually turn this into a package):
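
The gist, sketched out below – the name binaryize() and the assumption that the id lives in column 1 are my stand-ins, not necessarily the final form:

```r
#' Binaryize a Column
#'
#' Transform a column of a data frame into a yes/no factor.
#'
#' @param x a data frame whose first column is an id
#' @param i the index of the column to transform
#'
#' @return a data frame with the id and the new yes/no factor
binaryize <- function(x, i) {

  # treat NA, empty strings and empty phone numbers like "()-" as missing
  missingdata <- is.na(x[, i]) | x[, i] == "" | grepl("\\(\\)-", x[, i])

  # "no" comes first so it's the default level
  newvar <- factor(ifelse(missingdata, "no", "yes"),
                   levels = c("no", "yes"), ordered = TRUE)

  # return only the id and the newly transformed column
  data.frame(id = x[, 1], newvar = newvar)
}
```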

You’ll notice a few changes from what we did above:

  1. The function accepts a data frame and asks for a column index. This is so you only have to get your data into one big flat file and can reuse this function on columns 2, 5, 23, 75 or whatever.
  2. There’s an extra bit of criteria in the ifelse: grepl("\\(\\)-", x[,i]). My data has a bunch of missing phone numbers that look like this: “()-”. It’s just as much empty data as an NA, so I wrote a regular expression to find those and consider them empty.
  3. I ordered the levels of the factor, because I want to assume that “no address” is the default.
  4. The output of the function is only id and the newly transformed column. I’m assuming you want to build all your predictors and then join all/some/any of them up into one big data file to feed to the linear regression model.