Moving Data in and out of R via the Clipboard

1 Comment

There’s some interesting stuff in this blog post about reading data into R, particularly about using scan() to read data in (apparently you can use it to type data in, one entry per line).

That said, it’s a pain in the butt to depend on scan()–most of the time when I’m pushing data around, I use R’s ability to read data frames from the clipboard.

I tend to use read.csv(file='clipboard') more than readClipboard(), mostly because I always forget about the latter.

One important note: by default, R only uses a small bit of the Windows clipboard to write files out (I have no idea how/if this works at all on Linux and Mac), something like 128KB. That’s not enough for a decent sized data frame/spreadsheet, but it’s pretty easy to bump that limit up.

If you do write.table( foo, file = 'clipboard-4096'), just about anything should fit in there.

I’ve got a function named write.clip() in my muadc R pacakge that does this for me, because I’m a lazy bum and got tired of typing “sep = '\t', row.names = F“.

Excel vs. R (or Why You Should Use R Scripts)

No Comments

I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.

The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then we make a TON of extra formula columns, pivot tables, various and asundry edits until we final get to the product we need, which is often substantially different than the original product.

Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Using an R script to take an data set from its original form to the new form solves most of those problems. As the author puts it:

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.

What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).

The biggest advantage is that you can look at your script, line by line, and see what changes you’re making and in what order. So when you have a problem, you can find it a lot easier and fix it without having to redo your whole spreadsheet.

Scripts in RStudio

If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):

  1. Hit File -> New -> R Script. That’ll open a new script in the Source pane.
  2. Type any number of R commands in the new script.
  3. Save the file as foo.R
  4. Hit the “Source file” button at the top of the Source pane. This runs source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
  5. You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.
Here, I'm opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I've built in the Files pane in the bottom right.

Here, I’m opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I’ve built in the Files pane in the bottom right.

That’s all there is to it! The only really tricky thing to note about using scripts is that you’ll want to make sure that you put require('dplyr') at the top of the script (or whatever packages you’re using)–that way when you run the script next Tuesday right after firing up R, dplyr (or whatever packages you’re using in the script) get loaded.

One more quick trick: I don’t recommend using the “source on save” button in RStudio. This runs the script every time you save it. While it seems to be handy, more than once I’ve ended up turning a split-second save on a minor edit into a 2 minute, CPU-churning waste of time–my script had some serious analysis in it. If you’re smart enough to know not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t tell me I didn’t warn you.

R doesn’t care about leading zeros

No Comments

Constituents in our database each have an I’d number. Every number has 8 digits, regardless of how big the number is–we pad the id out with leading zeroes (an id of 12345 would go into the database as 00012345).

Ids stay numeric when I load them into R to make joins easier (R barfs when trying to do joins on factors, ie. categorical variables).

The leading zeros get removed once they hit R, but I realized today, via a bit of careless copy and paste, that 00012345 evaluates to the same thing as 12345.

Is he really trying to paste from Banner into a csv?

Or to put it another way:

> 0012345==12345
TRUE

Handy, when it comes to copying and pasting ids from who knows where into R.

Categories: data manipulation Tags: Tags: , , , ,