You have an interesting choice in an R program when you use the group_by function with two or more arguments and follow that with the summarize function. Here is a practical illustration of how your choices can make a difference.
The palmerpenguins package provides an interesting dataset of body measurements for several penguin species, collected by researchers at Palmer Station, Antarctica. The group_by and summarize functions are part of the dplyr package, which is one of the packages loaded when you load the tidyverse.
Illustration of group_by with two arguments followed by summarize
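The output below comes from a pipeline along the following lines. This is a minimal sketch rather than the exact chunk: it assumes the penguins data frame from palmerpenguins and a filter that drops rows where sex is missing, and the object name counts is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

counts <- penguins %>%
  filter(!is.na(sex)) %>%       # assumption: rows with missing sex are dropped
  group_by(species, sex) %>%    # two grouping variables
  summarize(n = n())            # no .groups argument, so dplyr prints a message

counts
```

The result is still grouped by species, which is what the "# Groups: species [3]" line in the output reports.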
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups: species [3]
species sex n
<fct> <fct> <int>
1 Adelie female 73
2 Adelie male 73
3 Chinstrap female 34
4 Chinstrap male 34
5 Gentoo female 58
6 Gentoo male 61
To simplify the output, I am removing some missing values. In a real analysis, you should not be so cavalier in just tossing out entire rows of data when one of the important variables is missing.
Notice the message produced by summarize. It is a message rather than a warning, so you can suppress it by adding message=FALSE as a chunk option, or by specifying a .groups argument in the summarize function.
Two of the .groups options are “drop_last” and “drop”. The options don’t matter if you stop with summarize. If you continue onward with additional calculations, the options sometimes (but not always) become important. Here is an example where the .groups options are important.
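A sketch of the kind of pipeline that produces the next table, using .groups="drop". It assumes the same penguins data and missing-value filter as before; the object name pct_overall and the use of round for the percentage are assumptions chosen to match the output shown.

```r
library(dplyr)
library(palmerpenguins)

pct_overall <- penguins %>%
  filter(!is.na(sex)) %>%                   # assumption: drop rows with missing sex
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop") %>%  # result has no grouping at all
  mutate(total = sum(n)) %>%                # one total across all six rows
  mutate(pct = round(100 * n / total))      # percentage of the overall total

pct_overall
```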
# A tibble: 6 × 5
species sex n total pct
<fct> <fct> <int> <int> <dbl>
1 Adelie female 73 333 22
2 Adelie male 73 333 22
3 Chinstrap female 34 333 10
4 Chinstrap male 34 333 10
5 Gentoo female 58 333 17
6 Gentoo male 61 333 18
With the .groups=“drop” option, all grouping is removed after the summarize function. This means that mutate(total=sum(n)) calculates the sum across all six rows produced by summarize. Effectively, this produces percentages in the second mutate function that add up to 100% across all six rows. Actually, they only add up to 99%, but this is because of rounding.
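Changing only the .groups value to "drop_last" gives the next table. Again this is a sketch under the same assumptions (penguins data, rows with missing sex removed, rounded percentages); pct_by_species is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

pct_by_species <- penguins %>%
  filter(!is.na(sex)) %>%                        # assumption: drop rows with missing sex
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop_last") %>%  # still grouped by species
  mutate(total = sum(n)) %>%                     # one total per species
  mutate(pct = round(100 * n / total))           # percentage within each species

pct_by_species
```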
# A tibble: 6 × 5
# Groups: species [3]
species sex n total pct
<fct> <fct> <int> <int> <dbl>
1 Adelie female 73 146 50
2 Adelie male 73 146 50
3 Chinstrap female 34 68 50
4 Chinstrap male 34 68 50
5 Gentoo female 58 119 49
6 Gentoo male 61 119 51
The default option for the .groups argument in summarize is .groups=“drop_last”.
With the .groups=“drop_last” option, grouping is removed for sex, the last argument in group_by. This means that mutate(total=sum(n)) calculates the sum within each species. Effectively, this produces percentages in the second mutate function that add up to 100% within each species.
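The order of the arguments to group_by decides which variable counts as "last". A sketch with the two arguments swapped, under the same assumptions as the earlier examples (pct_by_sex is an illustrative name), which corresponds to the table that follows:

```r
library(dplyr)
library(palmerpenguins)

pct_by_sex <- penguins %>%
  filter(!is.na(sex)) %>%                        # assumption: drop rows with missing sex
  group_by(sex, species) %>%                     # species is now the last grouping variable
  summarize(n = n(), .groups = "drop_last") %>%  # still grouped by sex
  mutate(total = sum(n)) %>%                     # one total per sex
  mutate(pct = round(100 * n / total))           # percentage within each sex

pct_by_sex
```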
# A tibble: 6 × 5
# Groups: sex [2]
sex species n total pct
<fct> <fct> <int> <int> <dbl>
1 female Adelie 73 165 44
2 female Chinstrap 34 165 21
3 female Gentoo 58 165 35
4 male Adelie 73 168 43
5 male Chinstrap 34 168 20
6 male Gentoo 61 168 36
If you reverse the order of the two arguments in the group_by function, you get a third option. In this case, grouping is removed for species instead of sex. This means that mutate(total=sum(n)) calculates the sum within each sex. Effectively, this produces percentages in the second mutate function that add up to 100% within each sex.
The count function works differently
Notice that while the documentation on count says that
“df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())”
it is not exactly equivalent. The first clue is that replacing group_by and summarize with count does not produce a message.
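A sketch of the count version, assuming the same data, filter, and percentage calculation as in the earlier examples (pct_count is an illustrative name). It corresponds to the table that follows:

```r
library(dplyr)
library(palmerpenguins)

pct_count <- penguins %>%
  filter(!is.na(sex)) %>%               # assumption: drop rows with missing sex
  count(species, sex) %>%               # ungrouped input gives ungrouped output
  mutate(total = sum(n)) %>%            # one total across all six rows
  mutate(pct = round(100 * n / total))  # percentage of the overall total

pct_count
```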
# A tibble: 6 × 5
species sex n total pct
<fct> <fct> <int> <int> <dbl>
1 Adelie female 73 333 22
2 Adelie male 73 333 22
3 Chinstrap female 34 333 10
4 Chinstrap male 34 333 10
5 Gentoo female 58 333 17
6 Gentoo male 61 333 18
So count behaves as if you had specified .groups=“drop”, which is not the default when you combine group_by with summarize.
The count function does not have a .groups argument. If you want the equivalent of .groups=“drop_last”, you just need to insert a group_by call after count.
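A sketch of that approach, under the same assumptions as the earlier examples (pct_count_grouped is an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

pct_count_grouped <- penguins %>%
  filter(!is.na(sex)) %>%               # assumption: drop rows with missing sex
  count(species, sex) %>%               # ungrouped counts
  group_by(species) %>%                 # restore the grouping that drop_last would keep
  mutate(total = sum(n)) %>%            # one total per species
  mutate(pct = round(100 * n / total))  # percentage within each species

pct_count_grouped
```

If later steps should not be grouped, adding ungroup() at the end of the pipeline removes the grouping again.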
# A tibble: 6 × 5
# Groups: species [3]
species sex n total pct
<fct> <fct> <int> <int> <dbl>
1 Adelie female 73 146 50
2 Adelie male 73 146 50
3 Chinstrap female 34 68 50
4 Chinstrap male 34 68 50
5 Gentoo female 58 119 49
6 Gentoo male 61 119 51
Notice that the percentages here add up to 100% within each species.
More information
Like most of the tidyverse documentation, the information on summarize is very complete, sometimes so complete as to be overwhelming. Look first at the summarise section in the vignette on grouping. The documentation page on count does not go into detail about the inconsistency between the count function and group_by followed by summarize, but it is still worth reading.