I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.
The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then we make a TON of extra formula columns, pivot tables, various and asundry edits until we final get to the product we need, which is often substantially different than the original product.
Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.
Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?
Using an R script to take an data set from its original form to the new form solves most of those problems. As the author puts it:
I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.
What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).
The biggest advantage is that you can look at your script, line by line, and see what changes you’re making and in what order. So when you have a problem, you can find it a lot easier and fix it without having to redo your whole spreadsheet.
Scripts in RStudio
If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):
- Hit File -> New -> R Script. That’ll open a new script in the Source pane.
- Type any number of R commands in the new script.
- Save the file as foo.R
- Hit the “Source file” button at the top of the Source pane. This runs
source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
- You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.
Here, I’m opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I’ve built in the Files pane in the bottom right.
That’s all there is to it! The only really tricky thing to note about using scripts is that you’ll want to make sure that you put
require('dplyr') at the top of the script (or whatever packages you’re using)–that way when you run the script next Tuesday right after firing up R, dplyr (or whatever packages you’re using in the script) get loaded.
One more quick trick: I don’t recommend using the “source on save” button in RStudio. This runs the script every time you save it. While it seems to be handy, more than once I’ve ended up turning a split-second save on a minor edit into a 2 minute, CPU-churning waste of time–my script had some serious analysis in it. If you’re smart enough to know not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t tell me I didn’t warn you.