You should use “perl = T”


Today, I was trying to use gsub() to replace a bunch of underscores with spaces, then capitalize the words, something like this:
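A minimal sketch of that kind of attempt, using a made-up vector of names; the input and the exact pattern are assumptions for illustration:

```r
# Hypothetical input; any underscore-separated names would do.
x <- c("first_name", "amt_max_gift")

# Step 1: swap underscores for spaces (fine with the default engine).
spaced <- gsub("_", " ", x)

# Step 2: try to capitalize the first letter of each word.
# Without perl = TRUE, \\U is not treated as an "uppercase" directive,
# so the words do not come back capitalized.
attempt <- gsub("\\b([a-z])", "\\U\\1", spaced)
attempt
```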

The first bit, replacing the underscores with spaces, was working fine, but that \\U bit kept throwing all sorts of errors, and when it didn’t, it just didn’t work.

As it turns out, R’s default regexp engine doesn’t support \\U; you need Perl-style regular expressions to use it. Fortunately, that’s easy: just add perl = TRUE as an argument to gsub().

That makes the code above look like this:
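A minimal sketch of the working version, again with a made-up input vector (the names and the pattern are illustrative assumptions, not necessarily the exact code from that day):

```r
# Hypothetical input; any underscore-separated names would do.
x <- c("first_name", "amt_max_gift")

# Replace underscores with spaces.
spaced <- gsub("_", " ", x)

# \\b([a-z]) captures the first lowercase letter of each word;
# \\U\\1 writes it back in upper case, which only works with perl = TRUE.
pretty <- gsub("\\b([a-z])", "\\U\\1", spaced, perl = TRUE)
pretty
```

With this input, pretty holds "First Name" and "Amt Max Gift".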

The page about regexp in R says you should always use perl = T. Seems like good advice.


Find column names in R with grep


About half the time, when I’m working in R, I’m querying against a denormalized dump of data from our system of record. (If I were a real rockstar, I’d be querying against the database itself, but I’m not because of reasons.)

The worst part about this is that the column names are generally a wreck: a mix of ugly SQL names and overly pretty, readable ones. And since we’ve flattened the data, there’s a host of calculated columns with names like “AMT_MAX_GIFT_MAX_DT”, which are hard to get exactly right across all 200 variables.

I want names!

Tl;dr I can never remember what half the names of these columns are. And because R abbreviates the output of str(), I can’t see them in the RStudio sidebar, either. Even if I could, looking through 200 variables would be a colossal pain, so I devised a way to solve that problem.

My grepnames() function makes it easy to find column names in R.

The grepnames() function
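A minimal sketch of what such a function could look like. The argument names, the output column labels, and the sample data frame are assumptions for illustration; the version in the package may differ:

```r
# Hypothetical reconstruction: search a data frame's column names with a regex.
# Returns a data frame of matching names and their column positions.
grepnames <- function(pattern, df, ignore.case = TRUE, ...) {
  idx <- grep(pattern, names(df), ignore.case = ignore.case, ...)
  data.frame(index = idx, name = names(df)[idx], stringsAsFactors = FALSE)
}

# Made-up sample data with a mix of naming styles.
gifts <- data.frame(
  DONOR_ID = 1:2,
  Donor.Name = c("A. Smith", "B. Jones"),
  AMT_MAX_GIFT = c(100, 250),
  AMT_MAX_GIFT_MAX_DT = as.Date(c("2014-01-15", "2014-03-02"))
)

# Case-insensitive by default, so this finds both donor columns.
grepnames("donor", gifts)
```

Here the call returns two rows: DONOR_ID at position 1 and Donor.Name at position 2.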

Using grepnames()

You use grepnames() like you would grep: you pass it a regular expression and a dataframe, and it returns a dataframe with the column names that match the regular expression and their respective column indexes. For example, grepnames("donor", df) lists every column whose name contains “donor”, along with its position.

This isn’t much different from doing grep("foo", names(df)), but it’s less typing and if you mistype, you won’t end up locking up R. Also, the output is slightly more informative.
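For comparison, a sketch of the plain base R route, using a made-up data frame. grep() gives you either the positions or the names, depending on the value argument, but not both at once:

```r
# Made-up sample data; the column names are illustrative.
gifts <- data.frame(
  DONOR_ID = 1:2,
  Donor.Name = c("A. Smith", "B. Jones"),
  AMT_MAX_GIFT = c(100, 250)
)

# Positions of matching columns.
idx <- grep("donor", names(gifts), ignore.case = TRUE)

# The matching names themselves.
hits <- grep("donor", names(gifts), ignore.case = TRUE, value = TRUE)
```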

By default, it’s not case sensitive – I’m working on the assumption that there’s no telling what a column is named, so trying to get the case right would just be a pain. Plus, you’re rarely doing complicated regular expressions – most often I end up passing it “donor” because I can’t remember how the donor code column is titled.

An R package with grepnames()

This function is part of my muadc package for R, which is on GitHub. It’s mostly an assortment of convenience functions, stuff I found myself doing over and over and so wrote functions for. If you have the devtools package installed, you can install it by doing install_github("crazybilly/muadc").

There are a couple of functions that will be useless for you (they’re specific to our office), but a few of them, like grepnames(), are pretty handy.

My eventual plan is to build out a full package for higher education fundraising (with a sample data set and some legit documentation) and submit it to CRAN, but I’ll need a bit more time to make that happen.

Until then, happy grepping!