The formattable package is AWESOME

Two things I really missed from Excel when I moved to R were conditional formatting and being able to format numbers as currency or percentages without losing data.

I wrote an as.currency() function a while ago, but it turns numbers into strings, which means that I can no longer math those strings without coercing them back into numbers. And coercion is always scary. Do you want to lose data? Because that’s how you lose data.

The formattable package solves both these problems.

It’s got a couple of functions, percent() and currency(), that only affect how numbers are PRINTED, so they come out looking proper even on the terminal, but they stay numeric so you can continue to use them as numbers.
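
For instance (a quick sketch with made-up numbers):

library(formattable)

p <- percent(c(0.10, 0.12, 0.05))
p        # prints as 10.00% 12.00% 5.00%
p * 2    # still numeric, so math works: 20.00% 24.00% 10.00%
currency(55000) + 100    # $55,100.00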

It also has a host of conditional formatting options, so you can change the text color of a column based on its values, add data bars (ie. built-in sparklines within the column itself), or format one column based on the values of another (something I’m pretty sure you can do in Excel, but I’ve never been able to figure out how).
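
Something along these lines (a sketch–df and the formatter choices are placeholders, not the vignette’s exact code):

library(formattable)

formattable(df, list(
  price  = color_tile("white", "orange"),  # shade cells from low to high
  rating = color_bar("lightpink")          # in-cell data bars
))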

Straight out of the vignette (because I want to get back to playing with this stuff, instead of writing this post):

id price rating market_share     revenue      profit
 1    10      5       10.00%   55,000.00   25,300.00
 2    15      4       12.00%   36,400.00   11,500.00
 3    12      4        5.00%   12,000.00  (8,200.00)
 4     8      3        3.00% (25,000.00) (46,000.00)
 5     9      4       14.00%   98,100.00   65,000.00

You should use “perl = T”

Today, I was trying to use gsub() to replace a bunch of underscores with spaces, then capitalize the words, something like this:
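
Roughly this, with a made-up column name standing in for my real data:

gsub("_(\\w)", " \\U\\1", "lifetime_giving_total")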

The first bit, replacing the underscores with spaces, was working fine, but that \\U bit kept throwing all sorts of errors, and when it didn’t, it just didn’t work.

As it turns out, R’s built-in regexp engine doesn’t like \\U–you need to use Perl-style regular expressions to be able to use it. Fortunately that’s easy, just add perl = TRUE as an argument to gsub().

That makes the code above look like this:
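
(Same made-up input string as before:)

gsub("_(\\w)", " \\U\\1", "lifetime_giving_total", perl = TRUE)
# [1] "lifetime Giving Total"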

The regular-expressions.info page about regexp in R says you should always use perl = TRUE. Seems like good advice.

Moving Data in and out of R via the Clipboard

There’s some interesting stuff in this blog post about reading data into R, particularly about using scan() to read data in (apparently you can use it to type data in, one entry per line).
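
For example (the variable name is invented):

# with no file argument, scan() reads from the console --
# type one value per line, then a blank line to finish
giving <- scan()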

That said, it’s a pain in the butt to depend on scan()–most of the time when I’m pushing data around, I use R’s ability to read data frames from the clipboard.

I tend to use read.csv(file='clipboard') more than readClipboard(), mostly because I always forget about the latter.

One important note: by default, R only writes a small bit of data to the Windows clipboard (I have no idea how/if this works at all on Linux and Mac)–something like 32KB. That’s not enough for a decent-sized data frame/spreadsheet, but it’s pretty easy to bump that limit up.

If you do write.table(foo, file = 'clipboard-4096'), that gives you a 4MB buffer, and just about anything should fit in there.

I’ve got a function named write.clip() in my muadc R package that does this for me, because I’m a lazy bum and got tired of typing “sep = '\t', row.names = F”.
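
For reference, here’s a minimal sketch of that kind of wrapper (assuming Windows; the default buffer size is a guess, not necessarily what the real write.clip() uses):

write.clip <- function(x, size = 4096) {
  # tab-separated with no row names, so it pastes cleanly into Excel
  write.table(x, file = paste0('clipboard-', size),
              sep = '\t', row.names = FALSE)
}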

3D Scatterplot in R

Today, I was doing some analysis of a sub-group of our constituents and was trying to get an idea of how connected these folks felt to our school.

I thought I’d do that by running a summary of their lifetime giving, event attendance and number of development officer contacts:
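
Something like this (the column names are placeholders for whatever your data uses):

summary(constituents[, c('lifetime_giving', 'events_attended', 'do_contacts')])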

That shows me the 5-number breakdown for each column. Obviously, I’ve got some serious outliers in there–somebody who’s attended over 200 events!

But I can’t tell if the person w/ 200+ events is the same person who’s given a ton of money or not, so I tried running some scatterplots comparing the numbers:
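
In base R, that’s something like this (placeholder column names again):

plot(constituents$events_attended, constituents$lifetime_giving,
     xlab = 'event attendance', ylab = 'lifetime giving')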

lifetime giving vs event attendance

That’s not awful, but I have to run a bunch of these to compare all three variables–if I could run a legitimately 3D scatterplot in R, I’d be able to see all three variables at once.

Turns out, it’s REALLY easy.
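
One way to get a pannable htmlwidget like the one below is plotly–this is a sketch, not necessarily the package behind the original widget, and the column names are placeholders:

library(plotly)

plot_ly(constituents,
        x = ~lifetime_giving, y = ~events_attended, z = ~do_contacts,
        type = 'scatter3d', mode = 'markers')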

3D scatterplot

click to zoom, rotate and pan
(embedding a live htmlwidget in WordPress made me crazy)

That gives you a really cool, pannable 3D scatterplot so you can see how each person compares on all three variables at once.

To tell the truth, it’s a bit gimmicky (and what do you do when you’ve got more than three variables? you’re back in the same boat). Still, it’s just so cool and fun that I had to share!

Data frames are lists

I’ve got another more expansive post in the hopper that I need to edit and post, but here’s a quick one to reiterate a point that you probably already know, but is really important:

Data Frames are lists

This point, that data frames are lists where each column of the data frame is an item (usually a named vector) within the list, is really, really useful. Here are some things this tidbit lets you do:

  1. Grab a single column as a vector with double brackets, a la constituents[[8]]. If you use single brackets, you’ll get a data frame with just that column (because you’re asking for a sub-list containing just the 8th item in this case).
  2. Run lapply and sapply over each column, ie. you can easily loop through a data frame, performing the same function on each column (see the sketch after this list).

    You might think this isn’t a big deal. “Normally the columns of my data frames are all different types of data, first name here, lifetime giving there,” you might say.

    But trust me, it won’t be long til you’ve got a data frame and you think “oh, I’ll just write a little for loop to loop through each column and…” That’s what lapply and/or sapply are for. Knowing that the scaffolding to do that task is already built into one handy function will save your bacon.

  3. Converting a list of vectors to a data frame is as simple as saying as.data.frame(myOriginalList). And while lists can be a bit fiddly, with this one, you know what you’re going to get.
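
Here’s a quick sketch of all three, with a made-up two-column data frame:

constituents <- data.frame(giving = c(100, 2500, 0),
                           events = c(2, 14, 0))

constituents[[1]]          # first column as a plain vector
constituents[1]            # a one-column data frame
sapply(constituents, max)  # apply max() to each column
as.data.frame(list(giving = c(100, 2500), events = c(2, 14)))  # list -> data frame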

In short, just knowing that a data frame is a special kind of list makes it a lot easier to handle data frames when the need for crazy sorts of data manipulation comes up (and trust me, it will).

Find column names in R with grep

About half the time, when I’m working in R, I’m querying against a denormalized dump of data from our system of record. (If I was a real rockstar, I’d be querying against the database itself, but I’m not because of reasons.)

The worst part about this is that the column names are generally a wreck, a mix of ugly SQL names and overly pretty readable names. And since we’ve flattened the data, there’s a host of calculated columns with names like “AMT_MAX_GIFT_MAX_DT”. Which is hard to get exactly right for all 200 variables.

I want names!

Tl;dr I can never remember what half the names of these columns are. And because R abbreviates the output of str(), I can’t see them in the RStudio sidebar, either. Even if I could, looking through 200 variables would be a colossal pain, so I devised a way to solve that problem.

My grepnames() function makes it easy to find column names in R.

The grepnames() function
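
The real version lives in my muadc package (more on that below), but the core of it is something like this sketch (the argument names and output columns here are approximations, not the package’s exact code):

grepnames <- function(pattern, df, ignore.case = TRUE) {
  # find the column names matching the pattern and return them
  # alongside their column indexes
  i <- grep(pattern, names(df), ignore.case = ignore.case)
  data.frame(colnumber = i, colname = names(df)[i])
}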

Using grepnames()

You use grepnames() like you would grep: you pass it a regular expression and a dataframe, and it returns a dataframe with column names that match the regular expression and their respective column indexes. Something like this:
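
The data frame and matches below are invented, but this is the general shape:

grepnames("donor", constituents)
#   colnumber      colname
# 1        12   DONOR_CODE
# 2        45 LYBUNT_DONOR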

This isn’t much different than doing grep("foo", names(df)), but it’s less typing, and if you mistype, you won’t end up locking up R. Also, the output is slightly more informative.

By default, it’s not case sensitive – I’m working on the assumption that there’s no telling what a column is named, so trying to get the case right would just be a pain. Plus, you’re rarely doing complicated regular expressions – most often I end up passing it “donor” because I can’t remember how the donor code column is titled.

An R package with grepnames()

This function is part of my muadc package for R, which is on github. It’s mostly an assortment of convenience functions, stuff I find myself doing over and over and so wrote functions for. If you have the devtools package installed, you can install it by doing devtools::install_github("crazybilly/muadc").

There are a couple of functions which will be useless for you (they’re specific to our office), but a few of them, like grepnames(), are pretty handy.

My eventual plan is to build out a full package for higher education fundraising (with a sample data set and some legit documentation) and submit it to CRAN, but I’ll need a bit more time to make that happen.

Until then, happy grepping!

Excel vs. R (or Why You Should Use R Scripts)

I ran across this great article today about when to use Excel vs. R. It’s a good article overall, but the real money is in the section labeled “When you need to be really clear about how you change your data”.

The basic argument is that most of us who use Excel to manipulate data start with a spreadsheet, then make a TON of extra formula columns, pivot tables, and various and sundry edits until we finally get to the product we need, which is often substantially different from the original data.

Worse, it takes at least an hour of work to reproduce all those changes if/when the original data changes, assuming you can remember what changes you made.

Remember when you rearranged the columns on this spreadsheet? Wait, which column did you put where?

Using an R script to take a data set from its original form to the new form solves most of those problems. As the author puts it:

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom.

What you end up doing is reading in an original file, doing your manipulation in carefully recorded, exact (and easily changeable!) steps, then eventually writing your data back out (usually as a .csv).

The biggest advantage is that you can look at your script, line by line, and see what changes you’re making and in what order. So when you have a problem, you can find it a lot more easily and fix it without having to redo your whole spreadsheet.
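
A toy version of that workflow (the file and column names are invented):

library(dplyr)

# raw data > script > clean data, in recorded, repeatable steps
raw <- read.csv('constituents-raw.csv')
clean <- raw %>%
  filter(!is.na(id)) %>%                  # drop records with no id
  mutate(donor = lifetime_giving > 0)     # flag donors
write.csv(clean, 'constituents-clean.csv', row.names = FALSE)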

Scripts in RStudio

If you’re using RStudio, it’s really easy to use scripts (if you’re not using RStudio…well, best of luck):

  1. Hit File -> New -> R Script. That’ll open a new script in the Source pane.
  2. Type any number of R commands in the new script.
  3. Save the file as foo.R
  4. Hit the “Source file” button at the top of the Source pane. This runs source('foo.R') in the Console pane, executing all the commands you’ve written line by line.
  5. You can also highlight a line or two and hit Ctrl+Enter to run that line in the Console–super handy for testing out commands, debugging on the fly, so to speak.

Here, I’m opening a new script. You can see the scripts I have open in the Source pane at the bottom left and a list of R scripts I’ve built in the Files pane in the bottom right.

That’s all there is to it! The only really tricky thing to note about using scripts is that you’ll want to make sure that you put require('dplyr') at the top of the script (or whatever packages you’re using)–that way, when you run the script next Tuesday right after firing up R, dplyr (or whatever packages the script uses) gets loaded.

One more quick trick: I don’t recommend using the “source on save” button in RStudio. This runs the script every time you save it. While it seems to be handy, more than once I’ve ended up turning a split-second save on a minor edit into a 2 minute, CPU-churning waste of time–my script had some serious analysis in it. If you’re smart enough to know not to use it on big, CPU/RAM/hard drive-intensive tasks, then go for it, but don’t tell me I didn’t warn you.

Using Cut to Categorize Wealth Capacity

Occasionally, I want to categorize wealth capacity into different bins. For example, capacity (or giving, or whatever numeric variable) should fit into bins from 1–12 (we’ve always used WealthEngine’s categories here in the office, despite the fact that it’s incomprehensibly a 12-point scale instead of 10, and the lowest number is best instead of worst).

In Excel, I’d turn a number into a categorical variable (or factor, as R calls them) by using a lookup table and a bunch of nested IF statements. You can do the same with ifelse() in R if you want, but I’d recommend using cut() instead.

“Cut. It. Into bins in order to better describe wealth capacity.”
Somehow, that just doesn’t have the same ring.

To start, you’ll want something like this:
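
Here’s a sketch with a smaller, made-up scale (four bins instead of twelve, so the shape is easier to see):

capacity <- c(0, 15000, 250000, 5000000)

cut(capacity,
    breaks = c(0, 10000, 100000, 1000000, Inf),  # bin minimums, from 0 to Inf
    labels = c('4', '3', '2', '1'),  # one fewer label than breaks; 1 is best
    right = FALSE)                   # breaks are minimums, not maximums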

That should get you pretty close to where you want to go. A few notes about what’s going on here.

  1. You’re going to need to specify breaks, ie. the minimum amount for each bin. If we’re talking about giving or wealth capacity, this means your breaks should run from 0 to Inf (the minimum of the worst bin is a capacity of $0; without setting the top break at infinity, R won’t know what bin to put your very best donors in).
  2. You’ll need one fewer label than breaks–nobody has a wealth capacity of less than zero (your bottom break) or greater than infinity (the top break).
  3. You’re also going to want to specify right=F. This tells R that you’re giving it the minimum for each bin rather than the maximum (ie. the “left” side instead of the right).

To sum up: if you’re making a scale with 12 bins, you need 12 labels and 13 break points (with 0 and Inf on either end of that list of breaks).

Using a lookup table to specify capacity ratings

If you’ve already got your ratings in a table (I’d be willing to bet you’ve got something in a spreadsheet somewhere), you can read that into a data frame (named ratings below) and use it like this:
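
Something like this, assuming ratings has a column of bin minimums and a column of labels (the column names are placeholders):

constituents$capacity_rating <- cut(constituents$capacity,
                                    breaks = c(ratings$minimum, Inf),
                                    labels = ratings$label,
                                    right = FALSE)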

The only trick here is that you need to make sure you sort the table so your minimums are in ASCENDING order, so your worst category is at the top of the table and the best, wealthiest, category is at the bottom of the lookup table.

Using cut() for other categorizations

That’s all there is to using cut() to categorize wealth capacity ratings into bins, but you can probably see how to apply this to other numeric columns (to lifetime giving, for example, to come up with rough categorical outcome variables).

I’d probably end up knocking together the lookup table in Excel and then using read.table(file="clipboard", sep="\t") to read it into R from the clipboard.

From there, it’s just a matter of running cut() on the data you want to transform!

R doesn’t care about leading zeros

Constituents in our database each have an id number. Every id has 8 digits, regardless of how big the number is–we pad the id out with leading zeroes (an id of 12345 would go into the database as 00012345).

Ids stay numeric when I load them into R to make joins easier (R barfs when trying to do joins on factors, ie. categorical variables).

The leading zeros get removed once they hit R, but I realized today, via a bit of careless copy and paste, that 00012345 evaluates to the same thing as 12345.

Is he really trying to paste from Banner into a csv?

Or to put it another way:

> 0012345 == 12345
[1] TRUE

Handy, when it comes to copying and pasting ids from who knows where into R.

Re-normalizing data (in the database sense of normalization)

I can’t always hit our database directly, so I often end up working with data that looks like this:

ID Activities
1
2 DEC~FILE~STD~TTR
3
4 AQD~BEA~FILE~HOSP~MCL~MMA~PRP~PSI~WTENN
5 DEC~FILE~PSP~SCL~STD
6 FILE~NAL
7 FILE~HOME~MMA~SACO
8 FILE~MFORT

If I want to categorize those activities, for example, to say “Const #1 was in a Greek organization” because I think maybe that will be a predictor for lifetime giving, I need a good way to normalize the data in R.

I want it to look like it does in the database, something like:

ID Activities
2 DEC
2 FILE
2 STD
2 TTR
4 AQD
4 BEA
4 FILE
4 HOSP

and so on.

Re-normalizing data the R way

Doing this in R isn't hard, but it took me several tries and scads of Googling to find the R way to do it:

df <- read.csv('that-annoying-list.csv', stringsAsFactors = FALSE)  # keep the codes as character, not factors

activities <- strsplit(df$activities, "~")

normalizedlist <- data.frame(user = rep(df$id, sapply(activities, length)),
                             item = unlist(activities))

Here's what's going on:

  1. We read the data into the df object using read.csv().
  2. strsplit splits the list of activities by the delimiter "~". This returns a big nested list--the top-level entries are by person, the second level of entries are each activity.
  3. We build a new data frame that starts with id, repeated as many times as each list item is long. The next column is just all the values of the list dumped straight out.

I originally tried doing this job with ddply, breaking the data frame apart by id and using strsplit and then reassembling it all. It worked but was a disaster (the above version works in seconds--my original took about 20 minutes). This is really the R way to normalize a list of codes.

Now that I've got the codes broken apart, I can use dplyr's join functions to categorize each code and figure out what constituents were in each category.

PS. "Normalizing" a list, in statistics means something different than it does in the database world--here's we're just talking about normalization in the Boyce-Codd sense of the word.