# Correlation testing in R

In his new book, Kevin MacDonnell argues that when you’re building a predictive model for something like major giving, you’ll want to prioritize your predictors. After all, you’ve probably got a bunch of possible predictors for the outcome, any combination of which may or may not be any good.

Kevin recommends running Pearson’s correlation test to see how each predictor (like “has a business phone” or “number of events attended”) correlates with the outcome (e.g. “number of major gifts” or “lifetime giving”).

Doing a correlation test in R is pretty simple. Let’s assume you’ve got your predictors and outcome in a data frame with one row per constituent, something like this:

id | hasaddr | hasbuphone | hascell | logOfLifeG
---|---------|------------|---------|-----------
1 | TRUE | TRUE | FALSE | 3.218876
2 | TRUE | TRUE | FALSE | 5.828946
3 | TRUE | TRUE | FALSE | 6.690842
5 | TRUE | TRUE | FALSE | 4.382027
8 | TRUE | TRUE | TRUE | 5.010635
9 | TRUE | FALSE | FALSE | 5.703782
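If you want to follow along, here’s a minimal sketch that builds a toy data frame matching the table above (the values are taken straight from the table; swap in your real constituent data):

```r
# Toy data frame matching the table above -- substitute your own data
df <- data.frame(
  id         = c(1, 2, 3, 5, 8, 9),
  hasaddr    = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  hasbuphone = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
  hascell    = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  logOfLifeG = c(3.218876, 5.828946, 6.690842, 4.382027, 5.010635, 5.703782)
)
str(df)
```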

### Test Predictors Against Lifetime Giving

You’ll remember that a data frame is just a list of vectors (each column is a vector), so we just need to `sapply` over that list, comparing each vector to the outcome; in this case our outcome is `df$logOfLifeG`, i.e. the log of lifetime giving.

You’ll note that I’m excluding the first and last columns: that’s where my constituent ids and outcomes are.

Also, at the end of the function, I grab the 4th item in the list; `cor.test()` returns a list, and the 4th item is the actual data about how correlated the items are:
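If you’re curious why index 4 is the right one, you can inspect a single `cor.test()` result yourself. A quick sketch, using a couple of made-up stand-in vectors:

```r
x <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)                        # e.g. hasbuphone
y <- c(3.218876, 5.828946, 6.690842, 4.382027, 5.010635, 5.703782) # e.g. logOfLifeG

result <- cor.test(as.numeric(x), y, method = 'pearson')
names(result)  # the 4th name is "estimate"
result[[4]]    # same as result$estimate: the Pearson r itself
```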

```r
testresults <- sapply(df[, 2:4], function(x) {
  unlist(
    cor.test(
      # the predictor
      as.numeric(x)
      # the outcome
      , df$logOfLifeG
      # further arguments to cor.test
      , method = 'pearson'
      # and the index of where the results are stored
    )[4]
  )
})
```

```r
##  hasaddr.estimate.cor hasbuphone.estimate.cor    hascell.estimate.cor
##            0.02497375              0.20085118              0.03123176
```

OK, those are the correlation numbers for each predictor, but right now `testresults` is just a vector of values with a name for each entry.

### Make It Readable

A vector is fine if you’re just going to `View()` it in RStudio, but it’s a pain to read. So let’s turn it into a data frame where the first column is a list of all the column names from the original list:

```r
testresults <- data.frame(
  # column 1 is the predictor
  predictors = names(df[, 2:4])
  # column 2 is the correlation data
  , correlation = testresults
)
```

 | predictors | correlation
---|-----------|------------
hasaddr.estimate.cor | hasaddr | 0.0249737
hasbuphone.estimate.cor | hasbuphone | 0.2008512
hascell.estimate.cor | hascell | 0.0312318

Alrighty, now we’re getting somewhere (note that the row names are still there on the left: annoying, but not really worth dealing with). With a small data frame like this, you don’t really need to do anything else. But if you’ve got 20 or more predictors, you’re going to want to sort it.
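That said, if the leftover row names do bother you, clearing them is a one-liner. A small sketch (the data frame is rebuilt here, with the values from the table above, so the snippet runs on its own):

```r
# Stand-in for the testresults data frame built above
testresults <- data.frame(
  correlation = c(0.0249737, 0.2008512, 0.0312318),
  row.names   = c("hasaddr.estimate.cor",
                  "hasbuphone.estimate.cor",
                  "hascell.estimate.cor")
)

# Drop the inherited row names; R falls back to plain 1, 2, 3, ...
rownames(testresults) <- NULL
```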

### Sort the Table

Since negative correlation (i.e. your outcome goes down when the predictor goes up) is just as important as positive correlation, we’ll create a new column that’s the absolute value of the correlation, just so we’ve got something to accurately sort by. Note that I’m using `dplyr` here:

```r
library(dplyr)

testresults %>%
  # get the absolute value of the correlation
  mutate(abscor = abs(correlation)) %>%
  # sort the data by the new absolute value column
  arrange(-abscor)
```
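If you’d rather not pull in `dplyr` for this step, base R’s `order()` does the same job. A sketch, with the `testresults` data frame rebuilt inline (values from the table above) so it runs on its own:

```r
# Stand-in for the testresults data frame built above
testresults <- data.frame(
  predictors  = c("hasaddr", "hasbuphone", "hascell"),
  correlation = c(0.0249737, 0.2008512, 0.0312318)
)

# absolute value column, then sort descending with order()
testresults$abscor <- abs(testresults$correlation)
testresults[order(-testresults$abscor), ]
```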

 | predictors | correlation
---|-----------|------------
hasbuphone.estimate.cor | hasbuphone | 0.2008512
hascell.estimate.cor | hascell | 0.0312318
hasaddr.estimate.cor | hasaddr | 0.0249737

And there’s your correlation table! Now you can start at the top of the table, building your model with your highly correlated predictors.
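To recap, the whole workflow fits in a dozen or so lines. Here’s a sketch putting the steps together; the data frame is a made-up stand-in (with values tweaked so no predictor column is constant, since `cor.test()` needs some variance in each predictor):

```r
library(dplyr)

# Made-up stand-in for your constituent data -- swap in your own
df <- data.frame(
  id         = c(1, 2, 3, 5, 8, 9),
  hasaddr    = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  hasbuphone = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
  hascell    = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  logOfLifeG = c(3.218876, 5.828946, 6.690842, 4.382027, 5.010635, 5.703782)
)

# correlation of each predictor column against the outcome
cors <- sapply(df[, 2:4], function(x)
  cor.test(as.numeric(x), df$logOfLifeG, method = 'pearson')$estimate)

# tidy, sorted correlation table
cor_table <- data.frame(predictors = names(df[, 2:4]),
                        correlation = unname(cors)) %>%
  mutate(abscor = abs(correlation)) %>%
  arrange(-abscor)
cor_table
```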