Organize an R Project

No Comments

If I’m working a project that’s more than about 50 lines of code, I’ll often end up with several scripts. Based on a conversation on the R community on G+, I’ve started organizing my projects a bit better:

Always use an R Studio project

First of all, I always use an R project. Always.

Ok, maybe not always. Occasionally, I just want to dink around in the R console. I won’t make a new R project for that. But if I’ve got enough code that I want to do any sort of debugging, then I’ll start writing an R script. And if I’ve got even one script, I’ll make an R project for it.

This means I’ve got a directory for just about every project I work on, which is a little clutter-y, but not as bad as having a million scripts laying around that may or may not be interconnected.

Use a numeric system to name your scripts

I’ll often write more than one script–my cut off for “I need a new script here” is usually if my “load the data” section is bigger than half my screen, or if I find myself at a good stopping place and realize I still have lots of work to do.

When I have more than script, I’ll start a numerical naming system with a base script called “00-build-something-or-other.R”. This script is usually just a bunch of source() commands. The one that’s running as I type this looks like this:

Using the echo = T argument for source() makes it so you can actually see the commands that are churning along in the background. For a long time, I didn’t add this–it drove me crazy that I could run an individual script with all the commands from RStudio, but not when I ran my base script. Adding echo = T solves that’s problem.

Also, you can use the beeper package to get notifications when this first script finishes. Doing the following will play the end-of-the-level Super Mario music when your job finally gets done (I’d give $5 for a button/keyboard shortcut to include the audible notification in R Studio):

The next scripts are 01-xxxx, 02-xxxx and so on

The next scripts are named 01-xxxxx.R, 02-xxxxx.R, and so on, with each script doing a single job. I try to break up my scripts in places where I won’t have to re-run the previous script each time I screw something up in the one I’m currently working on. Usually, that means each script starts with something like:

Occasionally, too, I’ll use an if() statement to make sure the object I really need exists. Usually, this isn’t necessary if I’m using my 00- script to call everything, but very rarely, I’ll have one script call another.

I did this the other day while I was pulling data for our spring phonathon, for example. I wanted to run a predictive model on our NonAlumni Never-Givers. Since I was training the model on last year’s giving, I didn’t expect the data to change, so I didn’t need the model to refresh when I rebuilt the spring data, so in the middle of my segmentation script I had a bit of code like this:

Use /data and /output directories

Every project I build also has two sub directories in it: /data and /output. Any files I need that are specific to that project, usually .csvs of weird data, hand-reviewed notes, stuff that doesn’t live in our main database or warehouse all get thrown into /data. Anything that I dump out, whatever the final output is (or, rather, all the drafts of the final output) get dumped into /output.

This keeps my main project directory fairly clean: the only thing that should be in there is the .Rproject file itself, any R scripts (which should sort themselves in order, because they’re named 00-xxxxx, 01-xxxxx, 02-xxxxx, etc) and the two directories.

It also means I never have to wonder about what folder something is in: unless it’s a reference file that lives somewhere else (and even then, if it’s small and I don’t plan it changing, I’ll copy it into /data), I can always type read.csv('data/' and then whack Tab and all my files come up. The same goes for writing out data–I know I can always do write.tidy('output/ and I’m good to go.

An important side note there: I use the write.tidy() and read.tidy() functions from my muadc package all the time–they’re wrappers around the standard read/write.csv functions (which in turn are wrappers around the read/write.table functions), but they make my life a LOT easier, if nothing else because I know the arguments are always going to be same.

Stay Organized

Organizing my projects this way has really helped–now, when I return to a project 6 months from now (I’m sure I’ll be back in August, pulling phonathon data together again), I’ll be immediately able to see how the scripts relate to each other, what happened where and what’s important.

Categories: Getting started Tags: Tags: , ,