bind_rows() instead of rbind_all() is awesome


I use dplyr’s rbind_all() all the time to mash multiple data frames together. Most of the time, I’ll pull multiple data sets, then stack them on top of each other to do some sort of summary, something like:
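A minimal sketch of that stack-then-summarise pattern. The data frames and column names (gifts_2014, gifts_2015, donor_id, amount) are made up for illustration, and the stacking step uses bind_rows(), the replacement for rbind_all():

```r
library(dplyr)

# Hypothetical data: gifts pulled from two separate sources
gifts_2014 <- data.frame(donor_id = c(1, 2), amount = c(50, 100))
gifts_2015 <- data.frame(donor_id = c(1, 3), amount = c(75, 25))

# Stack the data frames on top of each other
all_gifts <- bind_rows(gifts_2014, gifts_2015)

# Then summarise across the combined rows
donor_totals <- all_gifts %>%
  group_by(donor_id) %>%
  summarise(total = sum(amount))
```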

Today, I happened to look at the documentation for rbind_all() and found out it’s been deprecated in favor of bind_rows().

Ok whatever, I thought. Then I happened to look at the arguments for bind_rows( ... , .id = NULL).

Here’s the super cool part: you pass .id a string that becomes the name of a new column in the combined data frame. That column is populated either by a sequence of numbers (in the order you passed the data frames you’re binding together) or, and here’s where the real money is, by the names of your data frames, if you name them in the first place. Like this:
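A short sketch of both behaviours, again with made-up data. The column name "source" and the labels fy2014 and fy2015 are hypothetical choices, not anything dplyr requires:

```r
library(dplyr)

# Hypothetical data frames to bind together
gifts_2014 <- data.frame(donor_id = c(1, 2), amount = c(50, 100))
gifts_2015 <- data.frame(donor_id = c(1, 3), amount = c(75, 25))

# Unnamed inputs: the "source" column records the position of each data frame
by_position <- bind_rows(gifts_2014, gifts_2015, .id = "source")

# Named inputs: the names become the values in the "source" column
by_name <- bind_rows(fy2014 = gifts_2014, fy2015 = gifts_2015, .id = "source")
```

Passing a named list, as in bind_rows(list(fy2014 = gifts_2014, fy2015 = gifts_2015), .id = "source"), should give the same result as the named-argument form.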

Normally, I end up adding a source column or some such to each of the smaller individual data frames before I bind them all together. It’s a pain and a good way to screw something up: I often forget to add the column until it’s too late, and then I have to go back, make the edit and re-run the code. This solves the problem, and rather elegantly so.

Getting started – use dplyr


If you’ve already installed R and RStudio, there’s one more thing you’re going to need before you really get started using R for predictive modeling for fundraising: dplyr.

dplyr is an R package (which is to say “add-on code”) that makes using R for basic data manipulation substantially less painful.

It’s dangerous to go alone. Take this. [offers dplyr].

To get dplyr, fire up R and then type

install.packages('dplyr')

When that finishes, the package will be installed but not loaded, so do

library(dplyr)

Write like you think

dplyr provides two major advantages. The first is that it allows you to write code the way you think, namely starting from a set of data and working towards the final result.

There’s a host of articles online about how great this is, so I’m not going to spell it out for you. Suffice it to say, it makes thinking through and solving problems a lot easier.
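For a quick taste, here is a small sketch using R’s built-in mtcars data. It assumes the "write like you think" style refers to chaining steps with the %>% pipe that dplyr makes available; the variable name top_mpg is just for illustration:

```r
library(dplyr)

# Start from the data and read top to bottom:
# take the cars, keep the 4-cylinder ones, sort by mpg, show the top three
top_mpg <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# The same thing without the pipe has to be read from the inside out:
# head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)
```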

A grammar of data manipulation

The other thing dplyr does very well is provide a sort of grammar of data manipulation: a set of verbs that handle most common data problems, stuff like sorting, rearranging and renaming columns, adding new calculated columns, etc.

The author of dplyr, the inestimable Hadley Wickham, has a great tutorial on these verbs.

Learning these 5 functions (plus just a couple more like ifelse and grepl, which I’ll cover in later posts) will take care of the overwhelming majority of the tedious data manipulation tasks you’ll find yourself doing every time you start on a data job.
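The post doesn’t list the verbs by name. The sketch below assumes they are the five usually taught first (filter, select, mutate, arrange and summarise) and runs them on the built-in mtcars data. The names powerful, hp_per_wt, by_cyl and mean_mpg are made up for the example:

```r
library(dplyr)

powerful <- mtcars %>%
  filter(hp > 100) %>%             # keep rows that match a condition
  select(mpg, cyl, hp, wt) %>%     # pick (and reorder) columns
  mutate(hp_per_wt = hp / wt) %>%  # add a new calculated column
  arrange(desc(hp_per_wt))         # sort rows, highest ratio first

by_cyl <- powerful %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))  # collapse to one row per group
```

group_by() isn’t one of the five, but summarise() is usually paired with it to get one summary row per group.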
