Choices with the group_by and summarize functions

You have an interesting choice in an R program when you use the group_by function with two or more arguments and follow it with the summarize function. Here is a practical illustration of how your choices can make a difference.

The palmerpenguins package provides an interesting dataset of body measurements for several penguin species, collected by researchers at Palmer Station, Antarctica. The group_by and summarize functions are part of the dplyr package, which is one of the packages loaded when you ask for the tidyverse library.

Illustration of group_by with two arguments followed by summarize

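library(tidyverse)       # loads dplyr, which provides group_by and summarize
library(palmerpenguins)  # provides the penguins dataset

penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n=n())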

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex        n
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      73
3 Chinstrap female    34
4 Chinstrap male      34
5 Gentoo    female    58
6 Gentoo    male      61

To simplify the output, I am removing some missing values. In a real analysis, you should not be so cavalier in just tossing out entire rows of data when one of the important variables is missing.

Notice the message produced by summarize. It is an informational message rather than a true warning, so you can suppress it by adding message=FALSE as a chunk option, or by specifying a .groups argument in the summarize function.

Two of the .groups options are “drop_last” and “drop”. The options don’t matter if you stop with summarize. If you continue onward with additional calculations, the options sometimes (but not always) become important. Here is an example where the .groups options are important.
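One way to see what each option does before going further is to ask which grouping variables survive summarize. The following is a minimal sketch, assuming the dplyr and palmerpenguins packages are installed. It uses dplyr's group_vars function, and it also tries "keep", a third .groups option that the text above does not discuss. The object names are only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Grouped data shared by all three comparisons
grouped <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex)

# Grouping variables that remain after summarize under each option
after_drop_last <- grouped |>
  summarize(n = n(), .groups = "drop_last") |>
  group_vars()

after_drop <- grouped |>
  summarize(n = n(), .groups = "drop") |>
  group_vars()

# "keep" is not covered in the text; it is included here for contrast
after_keep <- grouped |>
  summarize(n = n(), .groups = "keep") |>
  group_vars()

after_drop_last  # "species"
after_drop       # character(0)
after_keep       # "species" "sex"
```

Any calculation that follows summarize is carried out within whatever groups remain, which is why the choice matters only when you continue the pipeline.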

Calculating percentages using .groups=“drop”

penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n=n(), .groups="drop") |>
  mutate(total=sum(n)) |>
  mutate(pct=round(100*n/total))

# A tibble: 6 × 5
  species   sex        n total   pct
  <fct>     <fct>  <int> <int> <dbl>
1 Adelie    female    73   333    22
2 Adelie    male      73   333    22
3 Chinstrap female    34   333    10
4 Chinstrap male      34   333    10
5 Gentoo    female    58   333    17
6 Gentoo    male      61   333    18

With the .groups=“drop” option, all grouping is removed after the summarize function. This means that mutate(total=sum(n)) calculates the sum across all six rows produced by summarize. Effectively, this produces percentages in the second mutate function that add up to 100% across all six rows. Actually, they only add up to 99%, but this is because of rounding.
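The rounding effect can be checked directly. This is a small sketch, assuming dplyr and palmerpenguins are installed, that keeps the unrounded percentages and compares their sum with the sum of the rounded values. The name pct_all is only for illustration.

```r
library(dplyr)
library(palmerpenguins)

pct_all <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n = n(), .groups = "drop") |>
  mutate(pct = 100 * n / sum(n))   # unrounded percentages of the overall total

sum(pct_all$pct)          # 100, up to floating-point error
sum(round(pct_all$pct))   # 99, the total of the rounded values in the table
```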

Calculating percentages using .groups=“drop_last”

penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n=n(), .groups="drop_last") |>
  mutate(total=sum(n)) |>
  mutate(pct=round(100*n/total))

# A tibble: 6 × 5
# Groups:   species [3]
  species   sex        n total   pct
  <fct>     <fct>  <int> <int> <dbl>
1 Adelie    female    73   146    50
2 Adelie    male      73   146    50
3 Chinstrap female    34    68    50
4 Chinstrap male      34    68    50
5 Gentoo    female    58   119    49
6 Gentoo    male      61   119    51

The default for the .groups argument in summarize is .groups=“drop_last” when, as here, each group is summarized to a single row.

With the .groups=“drop_last” option, grouping is removed for sex, the last argument in group_by. This means that mutate(total=sum(n)) calculates the sum within each species. Effectively, this produces percentages in the second mutate function that add up to 100% within each species.

Calculating a third set of percentages

penguins |>
  filter(!is.na(sex)) |>
  group_by(sex, species) |>
  summarize(n=n(), .groups="drop_last") |>
  mutate(total=sum(n)) |>
  mutate(pct=round(100*n/total))

# A tibble: 6 × 5
# Groups:   sex [2]
  sex    species       n total   pct
  <fct>  <fct>     <int> <int> <dbl>
1 female Adelie       73   165    44
2 female Chinstrap    34   165    21
3 female Gentoo       58   165    35
4 male   Adelie       73   168    43
5 male   Chinstrap    34   168    20
6 male   Gentoo       61   168    36

If you reverse the order of the two arguments in the group_by function, you get a third option. In this case, grouping is removed for species instead of sex, so the data remain grouped by sex. This means that mutate(total=sum(n)) calculates the sum within each sex. Effectively, this produces percentages in the second mutate function that add up to 100% within each sex.
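If you want all three sets of percentages side by side, one possibility is to drop the grouping after summarize and then regroup explicitly before each calculation. The sketch below assumes dplyr and palmerpenguins are installed. The column names pct_overall, pct_within_species, and pct_within_sex are made up for this illustration.

```r
library(dplyr)
library(palmerpenguins)

all_pcts <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n = n(), .groups = "drop") |>
  # No grouping here: percentages of the overall total
  mutate(pct_overall = round(100 * n / sum(n))) |>
  # Regroup by species: percentages add up to about 100 within each species
  group_by(species) |>
  mutate(pct_within_species = round(100 * n / sum(n))) |>
  # A new group_by replaces the previous grouping: now within each sex
  group_by(sex) |>
  mutate(pct_within_sex = round(100 * n / sum(n))) |>
  ungroup()

all_pcts
```

Making the grouping explicit at each step takes a few more lines, but the result no longer depends on the order of the arguments in the first group_by or on the .groups default.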

The count function works differently

Notice that while the documentation on count says that

“df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())”

it is not exactly equivalent. This first becomes obvious when you see that replacing group_by and summarize with count does not produce a message about grouping.

penguins |>
  filter(!is.na(sex)) |>
  count(species, sex)

# A tibble: 6 × 3
  species   sex        n
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      73
3 Chinstrap female    34
4 Chinstrap male      34
5 Gentoo    female    58
6 Gentoo    male      61

When you follow up with mutate functions to compute total and pct, you get the following.

penguins |>
  filter(!is.na(sex)) |>
  count(species, sex) |>
  mutate(total=sum(n)) |>
  mutate(pct=round(100*n/total))

# A tibble: 6 × 5
  species   sex        n total   pct
  <fct>     <fct>  <int> <int> <dbl>
1 Adelie    female    73   333    22
2 Adelie    male      73   333    22
3 Chinstrap female    34   333    10
4 Chinstrap male      34   333    10
5 Gentoo    female    58   333    17
6 Gentoo    male      61   333    18

So count implicitly uses the .groups=“drop” argument, which is not the default when you combine group_by with summarize.
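You can confirm this without computing any percentages. The following sketch, assuming dplyr and palmerpenguins are installed, compares the result of count with the result of group_by plus summarize under .groups="drop". It relies on dplyr's is_grouped_df and group_vars functions. The object names are only for illustration.

```r
library(dplyr)
library(palmerpenguins)

by_count <- penguins |>
  filter(!is.na(sex)) |>
  count(species, sex)

by_summarize <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(n = n(), .groups = "drop")

is_grouped_df(by_count)      # FALSE: count returns an ungrouped tibble here
group_vars(by_summarize)     # character(0): "drop" removed all grouping

# The counts themselves agree between the two approaches
identical(by_count$n, by_summarize$n)
```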

The count function does not have a .groups argument. If you want to do something effectively the same as .groups=“drop_last”, you just need to insert a group_by statement after count.

penguins |>
  filter(!is.na(sex)) |>
  count(species, sex) |>
  group_by(species) |>
  mutate(total=sum(n)) |>
  mutate(pct=round(100*n/total))

# A tibble: 6 × 5
# Groups:   species [3]
  species   sex        n total   pct
  <fct>     <fct>  <int> <int> <dbl>
1 Adelie    female    73   146    50
2 Adelie    male      73   146    50
3 Chinstrap female    34    68    50
4 Chinstrap male      34    68    50
5 Gentoo    female    58   119    49
6 Gentoo    male      61   119    51

Notice that the percentages here add up to 100% within each species.

More information

Like most of the tidyverse documentation, the information on summarize is very complete, sometimes so complete as to be overwhelming. Look first at the summarise section in the vignette on grouping. The documentation page on count does not go into detail about the inconsistency between the count function and group_by followed by summarize, but it is still worth reading.