Bind_rows() instead of rbind_all() is awesome

1 Comment

I use dplyr’s rbind_all() all the time to mash multiple data frames together–most of the time, I’ll pull multiple data sets, but then stack them on top of each other to do some sort of summary, something like:

Today, I happened to look at the documentation for rbind_all() and found out it’s been deprecated in favor of bind_rows().

Ok whatever, I thought. Then I happened to look at the arguments for bind_rows( ... , .id = NULL).

Here’s the super cool part: you pass .id a string that becomes the name of a new column in the new data frame. And that column is populated by either a sequence of numbers (in whatever order you passed the data frames you’re binding together) or (and here’s where the real money at), you can name your data frames in the first place. Like this:

Normally, I end up adding a source column or some such each of the smaller individual data frames before I bind them all together. It’s a pain and a good way to screw something up–I often forget to add the column till it’s too late and then I have to go back, make the edit and re-run the code. This solves the problem, and rather elegantly so.

Data frames are lists

No Comments

I’ve got another more expansive post in the hopper that I need to edit and post, but here’s a quick one to reiterate a point that you probably already know, but is really important:

Data Frames are lists

This point, that data frames are lists where each column of the data frame is an item (usually a named vector) within the list, is really really useful. Here’s some things this tidbit lets you do:

  1. Grab a single column as a vector with double brackets, ala constituents[[8]]. If you use single brackets, you’ll get a data frame with just that column (because you’re asking for a list with the 8th item in this case).
  2. Do lapply and sapply functions over each column, ie. you can easily loop through a data frame, performing the same function on each column.

    You might think this isn’t a big deal. “Normally the columns of my data frames are all different types of data, first name here, lifetime giving there,” you might say.

    But trust me, it won’t be long til you’ve got a data frame and you think “oh, I’ll just write a little for loop to loop through each column and…” That’s what lapply and/or sapply are for. Knowing that the scaffolding to do that task is already built into one handy function will save your bacon.

  3. Converting a list of vectors to a data frame is as simple as saying as.data.frame(myOriginalList). And while lists can be a bit fiddly, with this one, you know what you’re going to get.

In short, just knowing that a data frame is a special kind of list makes it a lot easier to handle data frames when the need for crazy sorts of data manipulation comes up (and trust me, it will).

Categories: data manipulation Tags: Tags: