.Last.value stores … the last value

Ok, here’s a quick (but handy) one:

The result of the last expression you ran at the console is stored in .Last.value. So if you’re going along, doing something interactively, and you print some tibble and realize you need to see the whole thing (but you don’t want to do that super expensive left join into the database again), you can do:
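
Something along these lines (a sketch: View() is RStudio’s filterable data viewer, which is what gives you the “filterable” bit below):

    # the expensive query you just ran printed a truncated tibble; .Last.value
    # still holds the full result, so hand it straight to RStudio's viewer
    View(.Last.value)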

And boom! Your tibble prints all nice and neat and filterable (if that’s what you’re into).

Categories: Getting started

bind_rows() instead of rbind_all() is awesome

I use dplyr’s rbind_all() all the time to mash multiple data frames together–most of the time, I’ll pull multiple data sets, but then stack them on top of each other to do some sort of summary, something like:
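
Something in this vein (a sketch–get_gift_data() and the years are invented stand-ins for however you actually pull the data):

    # pull a couple of fiscal years separately, then stack them for a summary
    fy2014 <- get_gift_data(2014)   # hypothetical query function
    fy2015 <- get_gift_data(2015)
    all_gifts <- rbind_all(list(fy2014, fy2015))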

Today, I happened to look at the documentation for rbind_all() and found out it’s been deprecated in favor of bind_rows().

Ok whatever, I thought. Then I happened to look at the arguments for bind_rows(..., .id = NULL).

Here’s the super cool part: you pass .id a string that becomes the name of a new column in the new data frame. And that column is populated by either a sequence of numbers (in whatever order you passed the data frames you’re binding together) or (and here’s where the real money’s at) the names you gave your data frames in the first place. Like this:
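
Something like this (a hedged sketch; the data frame names are made up):

    # name the data frames as you pass them in, and .id records which one
    # each row came from in a new "source" column
    all_gifts <- bind_rows(fy2014 = fy2014,
                           fy2015 = fy2015,
                           .id = "source")
    # every row of all_gifts now has source == "fy2014" or "fy2015"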

Normally, I end up adding a source column or some such to each of the smaller individual data frames before I bind them all together. It’s a pain and a good way to screw something up–I often forget to add the column till it’s too late, and then I have to go back, make the edit and re-run the code. This solves the problem, and rather elegantly so.

Organize an R Project

If I’m working on a project that’s more than about 50 lines of code, I’ll often end up with several scripts. Based on a conversation in the R community on G+, I’ve started organizing my projects a bit better:

Always use an RStudio project

First of all, I always use an R project. Always.

Ok, maybe not always. Occasionally, I just want to dink around in the R console. I won’t make a new R project for that. But if I’ve got enough code that I want to do any sort of debugging, then I’ll start writing an R script. And if I’ve got even one script, I’ll make an R project for it.

This means I’ve got a directory for just about every project I work on, which is a little clutter-y, but not as bad as having a million scripts laying around that may or may not be interconnected.

Use a numeric system to name your scripts

I’ll often write more than one script–my cutoff for “I need a new script here” is usually if my “load the data” section is bigger than half my screen, or if I find myself at a good stopping place and realize I still have lots of work to do.

When I have more than one script, I’ll start a numerical naming system with a base script called “00-build-something-or-other.R”. This script is usually just a bunch of source() commands. The one that’s running as I type this looks like this:
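
Something along these lines (the file names here are invented for illustration, but the shape is just a stack of source() calls):

    # 00-build-phonathon.R: run the whole pipeline, top to bottom
    source('01-pull-gift-data.R',  echo = TRUE)
    source('02-build-model.R',     echo = TRUE)
    source('03-build-segments.R',  echo = TRUE)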

Using the echo = T argument for source() makes it so you can actually see the commands that are churning along in the background. For a long time, I didn’t add this–it drove me crazy that I could see all the commands when I ran an individual script from RStudio, but not when I ran my base script. Adding echo = T solves that problem.

Also, you can use the beepr package to get notifications when this first script finishes. Doing the following will play the end-of-the-level Super Mario music when your job finally gets done (I’d give $5 for a button/keyboard shortcut to include an audible notification in RStudio):
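
Something like this (a sketch: beep() comes from the beepr package, and the exact sound name for the end-of-level music is a guess–poke through the package’s sound list to find the one you want):

    library(beepr)
    source('00-build-phonathon.R', echo = TRUE)
    beep(sound = 'mario')   # audible notification once the whole build is done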

The next scripts are 01-xxxx, 02-xxxx and so on

The next scripts are named 01-xxxxx.R, 02-xxxxx.R, and so on, with each script doing a single job. I try to break up my scripts in places where I won’t have to re-run the previous script each time I screw something up in the one I’m currently working on. Usually, that means each script starts with something like:
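
Something like this (a sketch with invented file and object names): each numbered script reloads the previous step’s saved output instead of re-running it:

    # 02-build-model.R: pick up where 01- left off
    gift_data <- read.csv('output/01-gift-data.csv', stringsAsFactors = FALSE)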

Occasionally, too, I’ll use an if() statement to make sure the object I really need exists. Usually, this isn’t necessary if I’m using my 00- script to call everything, but very rarely, I’ll have one script call another.

I did this the other day while I was pulling data for our spring phonathon, for example. I wanted to run a predictive model on our Non-Alumni Never-Givers. Since I was training the model on last year’s giving, I didn’t expect the data to change, so I didn’t need the model to refresh every time I rebuilt the spring data. So in the middle of my segmentation script, I had a bit of code like this:
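
A hedged reconstruction (the object and script names are invented): only build the model if it isn’t already sitting in the workspace:

    # skip the slow model-building step if the model already exists
    if (!exists('never_giver_model')) {
      source('02-build-never-giver-model.R', echo = TRUE)
    }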

Use /data and /output directories

Every project I build also has two subdirectories in it: /data and /output. Any files I need that are specific to that project (usually .csvs of weird data, hand-reviewed notes, stuff that doesn’t live in our main database or warehouse) get thrown into /data. Anything that I dump out, whatever the final output is (or, rather, all the drafts of the final output), gets dumped into /output.

This keeps my main project directory fairly clean: the only things that should be in there are the .Rproj file itself, any R scripts (which should sort themselves in order, because they’re named 00-xxxxx, 01-xxxxx, 02-xxxxx, etc.) and the two directories.

It also means I never have to wonder about what folder something is in: unless it’s a reference file that lives somewhere else (and even then, if it’s small and I don’t plan on it changing, I’ll copy it into /data), I can always type read.csv('data/' and then whack Tab and all my files come up. The same goes for writing out data–I know I can always do write.tidy('output/ and I’m good to go.

An important side note there: I use the write.tidy() and read.tidy() functions from my muadc package all the time–they’re wrappers around the standard read/write.csv functions (which in turn are wrappers around the read/write.table functions), but they make my life a LOT easier, if nothing else because I know the arguments are always going to be the same.
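
For the curious, here’s a guess at what wrappers like that might look like (the real muadc versions may differ–the point is just that the defaults are baked in):

    # write.tidy(): always write a csv without row names
    write.tidy <- function(x, file, ...) {
      write.csv(x, file = file, row.names = FALSE, ...)
    }
    # read.tidy(): always read a csv without turning strings into factors
    read.tidy <- function(file, ...) {
      read.csv(file, stringsAsFactors = FALSE, ...)
    }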

Stay Organized

Organizing my projects this way has really helped–now, when I return to a project 6 months from now (I’m sure I’ll be back in August, pulling phonathon data together again), I’ll be immediately able to see how the scripts relate to each other, what happened where and what’s important.

Categories: Getting started

Moving Data in and out of R via the Clipboard

There’s some interesting stuff in this blog post about reading data into R, particularly about using scan() to read data in (apparently you can use it to type data in, one entry per line).

That said, it’s a pain in the butt to depend on scan()–most of the time when I’m pushing data around, I use R’s ability to read data frames from the clipboard.

I tend to use read.csv(file='clipboard') more than readClipboard(), mostly because I always forget about the latter.
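
Something like this, assuming you’re on Windows and have just copied a range of cells out of Excel (which lands on the clipboard as tab-separated text):

    # read whatever is on the clipboard into a data frame
    foo <- read.csv(file = 'clipboard', sep = '\t', header = TRUE)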

One important note: by default, R only uses a small bit of the Windows clipboard to write files out (I have no idea how/if this works at all on Linux and Mac), something like 128KB. That’s not enough for a decent sized data frame/spreadsheet, but it’s pretty easy to bump that limit up.

If you do write.table(foo, file = 'clipboard-4096'), just about anything should fit in there.

I’ve got a function named write.clip() in my muadc R package that does this for me, because I’m a lazy bum and got tired of typing sep = '\t', row.names = F.
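
A guess at what a function like that might look like (the real muadc version may differ):

    # write a data frame to a roomy chunk of the Windows clipboard,
    # tab-separated and without row names, ready to paste into Excel
    write.clip <- function(x, size = 4096) {
      write.table(x, file = paste0('clipboard-', size),
                  sep = '\t', row.names = FALSE)
    }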

Simple Parallelization

Recently, I’ve been working on some machine learning projects at work. A lot of these require serious computing time–my work machine isn’t super beefy, but it’s no slouch and still some of these models take a couple hours to build.

[Image: “Ain’t nobody got time for that” meme]

Without really thinking about it, I assumed I was using all my processing power. And then, I looked at my Task Manager. And it turns out, of the 4 cores I’ve got, R was only using one!

There’s 3 more boxes there, buddy–use ’em!

Getting multi-core support in R

A bit of research revealed that R is really bad at supporting multiple cores. It’s baffling to me, but that’s the way it is. Apparently, there are various solutions to this, but they involve installing/using packages and then making sure your processes are parallelizable. Sounds like a recipe for disaster if you ask me–I screw enough up on my own, I don’t need to add a layer of complexity on top of that.

An alternative, easier solution is to use Revolution Analytics’ distribution of R, Revolution R Open, which comes with support for multiple cores out of the box.

Just download and install it, and when you fire up RStudio the next time, it’ll find it and (probably) start using it (if not, you can go into Global Options in RStudio and call out that you want to use that version of R).

Now my packages won’t update

Open R seems to run just fine, but a couple weeks in, I realized I had a problem–my packages weren’t updating (and I really wanted the newest version of dplyr!).

Turns out, Open R is set to not update packages by default. The idea is that they snapshot all packages each year so things don’t get updated and break halfway in.

This doesn’t really bother me–I’m rarely spending more than a couple weeks on a single project, nor do I have any massive dependencies that would break between upgrades, so I followed the instructions in the FAQ above to set my package repo back to CRAN (basically, you just need to edit your Rprofile.site).
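
The change amounts to adding something like this to Rprofile.site (a sketch–use whichever CRAN mirror you like):

    # point package installs and updates back at a live CRAN mirror
    options(repos = c(CRAN = 'https://cran.rstudio.com/'))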

And sure enough, I was back in business with dplyr 0.4 (and a host of newly updated packages)!

Categories: Getting started

Excel vs. R (or Why You Should Use R Scripts)

I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.

The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then we make a TON of extra formula columns, pivot tables, and various and sundry edits until we finally get to the product we need, which is often substantially different from the original.

Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Using an R script to take a data set from its original form to the new form solves most of those problems. As the author puts it:

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.

What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).
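
In practice, that looks something like this (a sketch with invented file and column names):

    # raw data in, carefully recorded manipulation, clean data out
    library(dplyr)
    raw <- read.csv('data/raw-gifts.csv', stringsAsFactors = FALSE)
    clean <- raw %>%
      filter(!is.na(gift_amount)) %>%                  # drop junk rows
      mutate(gift_year = substr(gift_date, 1, 4)) %>%  # add a calculated column
      arrange(desc(gift_amount))                       # sort
    write.csv(clean, 'output/clean-gifts.csv', row.names = FALSE)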

The biggest advantage is that you can look at your script, line by line, and see what changes you’re making and in what order. So when you have a problem, you can find it a lot easier and fix it without having to redo your whole spreadsheet.

Scripts in RStudio

If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):

  1. Hit File -> New -> R Script. That’ll open a new script in the Source pane.
  2. Type any number of R commands in the new script.
  3. Save the file as foo.R
  4. Hit the “Source file” button at the top of the Source pane. This runs source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
  5. You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.

Here, I’m opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I’ve built in the Files pane in the bottom right.

That’s all there is to it! The only really tricky thing to note about using scripts is that you’ll want to make sure you put require('dplyr') (or whatever packages you’re using) at the top of the script–that way, when you run the script next Tuesday right after firing up R, the packages the script depends on get loaded.

One more quick trick: I don’t recommend using the “Source on Save” option in RStudio. This runs the script every time you save it. While it seems handy, more than once I’ve ended up turning a split-second save on a minor edit into a 2-minute, CPU-churning waste of time–my script had some serious analysis in it. If you’re smart enough to know not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t tell me I didn’t warn you.

Getting started – use dplyr

If you’ve already installed R and RStudio, there’s one more thing you’re going to need before you really get started using R for predictive modeling for fundraising: dplyr.

dplyr is an R package (which is to say “add-on code”) that makes using R for basic data manipulation substantially less painful.

It’s dangerous to go alone. Take this. [offers dplyr].

To get dplyr, fire up R and then type

install.packages('dplyr')

When that finishes, the package will be installed but not loaded, so do

library(dplyr)

Write like you think

dplyr provides two major advantages. The first is that it allows you to write code the way you think, namely starting from a set of data and working towards the final result.

There’s a host of articles online about how great this is, so I’m not going to spell it out for you–suffice it to say, it makes thinking through and solving problems a lot easier.

A grammar of data manipulation

The other thing that dplyr does very well is providing a sort of grammar of data manipulation, specifically a set of verbs that you typically use to solve most common data problems, stuff like sorting, rearranging and renaming columns, adding new calculated columns, etc.
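
Here’s a quick taste of those verbs chained together (a sketch; the donors data frame and its columns are invented):

    library(dplyr)
    # filter / select / mutate / arrange, strung together with the pipe
    top_donors <- donors %>%
      filter(giving_total > 0) %>%                  # keep the rows you want
      select(id, class_year, giving_total) %>%      # keep the columns you want
      mutate(is_major = giving_total >= 10000) %>%  # add a calculated column
      arrange(desc(giving_total))                   # sort, biggest gifts first
    # group_by() + summarise() for the roll-ups
    donors %>%
      group_by(class_year) %>%
      summarise(total_given = sum(giving_total, na.rm = TRUE))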

The author of dplyr, the inestimable Hadley Wickham, has a great tutorial on these verbs.

Learning these 5 functions (plus just a couple more like ifelse and grepl, which I’ll cover in later posts) will solve the overwhelming majority of the tedious sorts of data manipulation tasks you’ll find yourself doing every time you start in a data job.

Categories: Getting started

Getting started – Use RStudio

There’s scads of articles online about getting R installed on your machine, so I’m not going to spend a lot of time talking about that.

What I will say is that you MUST use RStudio.

A look at RStudio.

It’s an IDE for R, which is to say, it’s a UI that wraps around R and makes everything you do in R LOTS easier. Literally everything.

Some advantages include:

  • See a list of the objects you’re currently working with.
  • See a list of packages you’ve got installed and load one with a single click of the mouse
  • Edit scripts and see the results at the same time.
  • Have plots show up in the same window that you’re working in.
  • See files in the current directory
  • Read help files while typing the commands at the same time
  • The View() function, which gives you a well-formatted, legible view of your data (this feature alone is worth the price of admission, which, come to think of it, is the next point)
  • Free (as in beer AND as in freedom)

Disadvantages compared to using R by itself include:

  • Doesn’t make you feel like it’s 1986

I’d list the lack of remote access here, but setting up RStudio to be accessible via the web is pretty easy, too.

In short, unless you’re running a really old computer and are short on resources, there is approximately zero reason to use R any other way.

Go, install it now. Trust me, it’s worth it.

Categories: Getting started

Kicking this off

Alrighty, here’s the typical first post for a blog:

I’m working on learning predictive modeling in R for work – I work for the development office at Millikin University. So it seems like it might be worthwhile to start documenting my progress and sharing it on my blog.

So we’ll see how this goes…

Categories: Getting started