Programming with dplyr

Tags: R, dplyr, programming

This post is published on Medium and available as an Rmd notebook.

With `dplyr` version 1.0, there are new ways to write functions. The "Programming with dplyr" vignette in the package documentation is the best reference.

If you’re familiar with using `sym` to convert from standard to non-standard evaluation, the following progression should show you how to replace (and extend) your code. It should be mostly find-replace! If you have a function that takes a character vector and uses an `*_at` verb, compare the old approach (option 1 below) with the "Super version" (`max_by_3`) at the bottom.

Old option 1: use `*_at` verbs

This is the old way of writing a dplyr function:

max_by_at <- function(data, var, by="") {
  data %>%
    group_by_at(by) %>%
    summarise_at(var, max, na.rm = TRUE)
}

Let's try it out:

starwars %>% max_by_at("height", by="gender")
starwars %>% max_by_at(c("height", "mass"), by="gender")
starwars %>% max_by_at(c("height", "mass"), by=c("sex", "gender"))

That worked great, but it won't work for unquoted column names (bare names such as `height`):

testthat::expect_error(starwars %>% max_by_at(height, by=gender))
testthat::expect_error(starwars %>% max_by_at("height", by=gender))
testthat::expect_error(starwars %>% max_by_at(height, by="gender"))

Old(ish) option 2: use `across`

This works for strings and character vectors, but not for unquoted column names. `across` replaces the `*_at` verbs and offers the same functionality:

max_by_across <- function(data, var, by="") {
  data %>%
    group_by(across(by)) %>%
    summarise(across(var, max, na.rm = TRUE), .groups='keep')
}
starwars %>% max_by_across("height", by="gender")
starwars %>% max_by_across(c("height", "mass"), by="gender")
starwars %>% max_by_across(c("height", "mass"), by=c("sex", "gender"))
testthat::expect_error(starwars %>% max_by_across(height, by=gender))
testthat::expect_error(starwars %>% max_by_across("height", by=gender))
testthat::expect_error(starwars %>% max_by_across(height, by="gender"))
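Newer releases of tidyselect warn when a selection such as `across(by)` reads column names from a variable that holds a character vector, and they recommend wrapping the variable in `all_of()`. A minimal sketch of the same function written that way might look like the following. The name `max_by_all_of` and the `character()` default are assumptions made for this illustration, not part of the original function.

```r
library(dplyr)

# Hypothetical variant of max_by_across: all_of() states explicitly that
# `by` and `var` are character vectors of column names.
max_by_all_of <- function(data, var, by = character()) {
  data %>%
    group_by(across(all_of(by))) %>%
    # A lambda carries na.rm = TRUE instead of passing it through across()
    summarise(across(all_of(var), ~ max(.x, na.rm = TRUE)), .groups = "keep")
}

res <- starwars %>% max_by_all_of(c("height", "mass"), by = "gender")
res
```

Like the version above, this still accepts only strings, not unquoted column names.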

Old option 3: convert a string to a symbol with `sym`

max_by_1 <- function(data, var, by="") {
  data %>%
    group_by(!!sym(by)) %>%
    summarise(maximum = max(!!sym(var), na.rm = TRUE))
}

It doesn't work when you pass in unquoted column names:

testthat::expect_error(starwars %>% max_by_1(height))
testthat::expect_error(starwars %>% max_by_1(height, by=gender))

It does work for strings:

starwars %>% max_by_1("height")
starwars %>% max_by_1("height", by="gender")

But it doesn't work for character vectors with more than one element (so it's less general than `across`):

testthat::expect_error(starwars %>% max_by_1(c("height", "mass")))
testthat::expect_error(starwars %>% max_by_1("height", by=c("gender", "sex")))
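If you want to stay with the symbol-based style, one possible workaround for the grouping side is `rlang::syms()`, which turns a character vector into a list of symbols that can be spliced in with `!!!`. The sketch below is only an illustration: the name `max_by_syms` and the `character()` default are assumptions, and it uses the `.data` pronoun to look up the single summary column by its string name.

```r
library(dplyr)

# Hypothetical helper: groups by any number of string column names,
# but still summarises a single column given as one string.
max_by_syms <- function(data, var, by = character()) {
  data %>%
    group_by(!!!rlang::syms(by)) %>%
    summarise(maximum = max(.data[[var]], na.rm = TRUE), .groups = "keep")
}

res <- starwars %>% max_by_syms("height", by = c("gender", "sex"))
res
```

This handles several grouping columns, but it does not summarise several columns at once, which is where `across` remains more general.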

Check out this improved version!

It works for unquoted column names, so we can use it like a dplyr function with non-standard evaluation, and we can still pass in symbols created with `sym`.

max_by_2 <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

It works for unquoted column names, which is pretty cool:

starwars %>% max_by_2(height)
starwars %>% max_by_2(height, by=gender)
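For readers coming from older tidy-eval code, the `{{ }}` operator can be read as shorthand for capturing an argument with `enquo()` and unquoting it with `!!`. A rough equivalent written in that longer style is sketched below; the name `max_by_enquo` is made up for this comparison.

```r
library(dplyr)

# Hypothetical long-hand version of the embrace pattern:
# {{ var }} behaves like !!enquo(var)
max_by_enquo <- function(data, var, by) {
  var <- enquo(var)  # capture the expression the caller typed
  by <- enquo(by)
  data %>%
    group_by(!!by) %>%
    summarise(maximum = max(!!var, na.rm = TRUE))
}

res <- starwars %>% max_by_enquo(height, by = gender)
res
```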

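It does not work for strings out of the box. The calls below run without an error, but each string is treated as a literal value rather than a column name, so the result is not the column maximum: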

starwars %>% max_by_2("height")
starwars %>% max_by_2("height", by="gender")
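To see what goes wrong, it helps to inspect the result rather than just run the call. In the small check below, `max_by_2` is repeated so the block runs on its own, and the grouping column is passed unquoted so that only the string in `var` is under test.

```r
library(dplyr)

# max_by_2 as defined above, repeated so this block is self-contained
max_by_2 <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

res <- starwars %>% max_by_2("height", by = gender)

# Every group reports the literal string, because max("height") is "height"
unique(res$maximum)
```

The call succeeds, which makes this failure easy to miss: the `maximum` column is a character column holding the word "height" instead of a number.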

We can work around this with `sym`:

starwars %>% max_by_2(!!sym("height"))

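It does not work for vectors of unquoted column names. The first call runs, but it pools `height` and `mass` into a single vector and returns one overall maximum; the second throws an error: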

starwars %>% max_by_2(c(height, mass))
testthat::expect_error(starwars %>% max_by_2(height, by=c(gender, sex)))

We'll use `across()` to allow strings, vectors of unquoted column names, and even character vectors. The default for `by` becomes an empty vector, `c()`, and we simply wrap the `{{ }}` with `across()`:

max_by_3 <- function(data, var, by=c()) {
  data %>%
    group_by(across({{ by }})) %>%
    summarise(across({{ var }}, max, .names = "max_{.col}", na.rm = TRUE), .groups='keep')
}

It works for unquoted column names:

starwars %>% max_by_3(height)
starwars %>% max_by_3(height, by=gender)

It works for strings:

starwars %>% max_by_3("height")
starwars %>% max_by_3("height", by="gender")

It works for vectors of unquoted column names:

starwars %>% max_by_3(c(height, mass))
starwars %>% max_by_3(height, by=c(gender, sex))
starwars %>% max_by_3(c(height, mass), by=c(gender, sex))

It works for character vectors, and for a string in one argument combined with unquoted names in the other:

starwars %>% max_by_3(c("height", "mass"))
starwars %>% max_by_3("height", by=c(gender, sex))
starwars %>% max_by_3(c("height", "mass"), by=c("gender", "sex"))
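One caveat for newer installations: dplyr 1.1.0 deprecated passing extra arguments such as `na.rm = TRUE` through `across()`, so `max_by_3` may emit a deprecation warning there. A sketch of the same idea with the argument moved into a lambda is shown below. The name `max_by_4` is hypothetical, and everything else follows `max_by_3`.

```r
library(dplyr)

# Hypothetical update of max_by_3: na.rm = TRUE lives inside the lambda
# instead of being forwarded through across()
max_by_4 <- function(data, var, by = c()) {
  data %>%
    group_by(across({{ by }})) %>%
    summarise(
      across({{ var }}, ~ max(.x, na.rm = TRUE), .names = "max_{.col}"),
      .groups = "keep"
    )
}

# Unquoted names and strings can be mixed across arguments
res <- starwars %>% max_by_4(c(height, mass), by = c("gender", "sex"))
res
```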

Now you've seen how to write some very flexible functions using the new powers of dplyr programming. Enjoy!
