library(dplyr)  # mutate, case_when
library(glue)   # glue

pretty_p <- function(p_table) {
  # p_table is the output from the tidy function in the broom package
  p_table |>
    mutate(p.value =
      case_when(
        p.value < 0.001 ~ "p < 0.001",
        p.value >= 0.001 ~ glue("p = {signif(p.value, 2)}")))
}
Most of the statistical summary functions in R produce plain vanilla output. That’s actually good because it allows you to customize things the way you like. I have been finding that I use pretty close to the same customizations over and over again, so I thought I should standardize them in a few simple R functions.
pretty_p
The pretty_p function represents my efforts to avoid displaying small p-values using scientific notation. Some researchers get confused when they see something like 7.93e- 2. Even for those researchers who are used to scientific notation, it just makes things hard to read. This function replaces very small p-values with “p < 0.001” and rounds larger p-values to two significant figures.
Example
Here is an example of how you might use the pretty_p function. I am using the Palmer Penguins dataset. Here is the output from tidy without pretty_p.
lm(flipper_length_mm ~ species + island, data=penguins) |>
  tidy()
# A tibble: 5 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       189.        0.999   189.    0
2 speciesChinstrap    6.09      1.20      5.09  5.92e- 7
3 speciesGentoo      28.4       1.16     24.4   2.23e-76
4 islandDream         0.937     1.34      0.701 4.84e- 1
5 islandTorgersen     2.40      1.36      1.76  7.93e- 2
Here is the output from tidy with pretty_p.
lm(flipper_length_mm ~ species + island, data=penguins) |>
  tidy() |>
  pretty_p()
# A tibble: 5 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl> <glue>
1 (Intercept)       189.        0.999   189.    p < 0.001
2 speciesChinstrap    6.09      1.20      5.09  p < 0.001
3 speciesGentoo      28.4       1.16     24.4   p < 0.001
4 islandDream         0.937     1.34      0.701 p = 0.48
5 islandTorgersen     2.40      1.36      1.76  p = 0.079
pretty_n
I also like to replace counts with percentages, but show the numerator and denominator in parentheses afterwards. This function will take results from the count function (or from any dataset with a variable named “n”), compute a total and percentage, and then display the results in a nice format.
pretty_n <- function(n_table) {
  # n_table is output from the count function
  n_table |>
    mutate(total=sum(n)) |>
    mutate(pct=round(100*n/total)) |>
    mutate(pct=glue("{pct}% ({n}/{total})")) |>
    select(-n, -total)
}
Comment on the code
The mutate and glue functions are described in the comments on the pretty_p code. The round function is documented on the same help page as the signif function. The select function will keep or drop columns of data.
Example
Here is an example of how you might use the pretty_n function. Here is the output from count without pretty_n.
penguins |>
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
Here is the output from count with pretty_n.
penguins |>
  count(species) |>
  pretty_n()
# A tibble: 3 × 2
  species   pct
  <fct>     <glue>
1 Adelie    44% (152/344)
2 Chinstrap 20% (68/344)
3 Gentoo    36% (124/344)
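Since pretty_n only needs a column named n, it should also work on a table of counts that did not come from count. Here is a minimal sketch of that case. The hand_counts table and its values are made up for illustration, and the function definition is repeated only so the block runs on its own.

```r
library(dplyr)
library(glue)

# Same definition as in the text, repeated so this block is self-contained
pretty_n <- function(n_table) {
  n_table |>
    mutate(total=sum(n)) |>
    mutate(pct=round(100*n/total)) |>
    mutate(pct=glue("{pct}% ({n}/{total})")) |>
    select(-n, -total)
}

# Hypothetical counts typed in by hand (not produced by count)
hand_counts <- tibble(
  response=c("Yes", "No", "Unsure"),
  n=c(30, 50, 20))

result <- hand_counts |>
  pretty_n()
result
```

The result keeps the response column and replaces n with a pct column holding strings such as "30% (30/100)".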
pretty_mean
For continuous variables, I usually like to compute a few descriptive statistics: the mean, standard deviation, minimum, maximum, and the number of missing values. This is easy enough to do but the code is long and tedious. The pretty_mean function is just as long and tedious (if not more so), but once you enter it, you can save a lot of time starting with the second variable that you need to summarize. I actually used to like the summary function, but it does not include a standard deviation and it does not work well with the group_by function in dplyr.
pretty_mean <- function(d, v) {
  # d is a data frame or tibble, v is a variable in d.
  d |>
    summarize(n_missing=sum(is.na({{v}}))) -> d1
  d |>
    filter(!is.na({{v}})) |>
    summarize(
      across({{v}},
        list(
          mean=mean,
          sd=sd,
          min=min,
          max=max))) |>
    bind_cols(d1)
}
Comment on the code
Using tidyverse functions inside a loop or function is tricky. When an argument represents a variable name, you either enclose the argument in double curly braces (“embrace” is the term they use) or, if the name is passed as a character string, you specify it inside of .data[[]].
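The two approaches can be put side by side in a small sketch. The function names mean_embrace and mean_string and the tibble d are hypothetical, made up only to show the difference: embracing takes a bare variable name, while .data[[]] takes the name as a character string.

```r
library(dplyr)

# Hypothetical data for illustration
d <- tibble(weight=c(10, 20, NA, 40))

# Embracing: the caller passes a bare (unquoted) variable name
mean_embrace <- function(d, v) {
  d |>
    summarize(avg=mean({{v}}, na.rm=TRUE))
}

# .data pronoun: the caller passes the variable name as a character string
mean_string <- function(d, v_name) {
  d |>
    summarize(avg=mean(.data[[v_name]], na.rm=TRUE))
}

a <- mean_embrace(d, weight)
b <- mean_string(d, "weight")
```

Both calls should return the same one-row tibble. The string version is handy when the variable names come from a vector that you loop over.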
The across function, part of the dplyr/tidyverse libraries, allows you to select specific variable(s) for further operations within dplyr functions like summarize. I have had some difficulty with this function, but it seems to work well here.
The bind_cols function, part of the dplyr library, combines two data frames or tibbles side by side.
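Here is a small sketch of across and bind_cols working on their own, apart from pretty_mean. The tibble d and the object names are made up for illustration. When across is given a named list of functions, the output columns are named by joining the variable name and the function name with an underscore.

```r
library(dplyr)

# Hypothetical data for illustration
d <- tibble(x=c(1, 2, 3), y=c(10, 20, 30))

# One output column per variable/function pair: x_mean, x_max, y_mean, y_max
stats <- d |>
  summarize(across(c(x, y), list(mean=mean, max=max)))
names(stats)

# bind_cols places the tables side by side, matching rows by position
n_rows <- d |>
  summarize(n=n())
combined <- bind_cols(stats, n_rows)
names(combined)
```

This is the same pattern pretty_mean uses: one summarize call produces the statistics, another produces the missing value count, and bind_cols joins the two one-row results.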
Example
Here is the code you would write without the pretty_mean function.
penguins |>
  summarize(
    body_mass_g_mean=mean(body_mass_g, na.rm=TRUE),
    body_mass_g_sd=sd(body_mass_g, na.rm=TRUE),
    body_mass_g_min=min(body_mass_g, na.rm=TRUE),
    body_mass_g_max=max(body_mass_g, na.rm=TRUE),
    n_missing=sum(is.na(body_mass_g)))
# A tibble: 1 × 5
  body_mass_g_mean body_mass_g_sd body_mass_g_min body_mass_g_max n_missing
             <dbl>          <dbl>           <int>           <int>     <int>
1            4202.           802.            2700            6300         2
Here is the code with the pretty_mean function.
penguins |>
  pretty_mean(body_mass_g)
# A tibble: 1 × 5
  body_mass_g_mean body_mass_g_sd body_mass_g_min body_mass_g_max n_missing
             <dbl>          <dbl>           <int>           <int>     <int>
1            4202.           802.            2700            6300         2
Writing functions like these only makes sense if you find yourself doing the same sort of thing three or more times in one program. There is a nice side effect of writing the function. If you decide that you want a different appearance, or you want to round some of your results, or you want to feed the data into a nicely formatted table, it only takes a small modification in one spot.
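As one illustration of a small modification in one spot, here is a sketch of a variant of pretty_p where the cutoff and the number of significant digits become arguments. The name pretty_p_flex, its arguments, and the fake_tidy table are all hypothetical. The defaults are chosen to reproduce what pretty_p does.

```r
library(dplyr)
library(glue)

# Hypothetical variant of pretty_p: cutoff and digits are arguments
# whose defaults match the original function
pretty_p_flex <- function(p_table, cutoff=0.001, digits=2) {
  p_table |>
    mutate(p.value =
      case_when(
        p.value < cutoff ~ glue("p < {cutoff}"),
        p.value >= cutoff ~ glue("p = {signif(p.value, digits)}")))
}

# Made-up table with a p.value column, standing in for tidy() output
fake_tidy <- tibble(
  term=c("a", "b"),
  p.value=c(0.004, 0.2468))

default_out <- pretty_p_flex(fake_tidy)
strict_out <- pretty_p_flex(fake_tidy, cutoff=0.01, digits=3)
```

Every table that passes through the function picks up the change, so there is no need to hunt down each place where a p-value is displayed.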
Comment on the code
The mutate function modifies an existing variable or creates a new variable. The case_when function uses logic statements to assign values. Both are part of the dplyr/tidyverse libraries. The signif function, part of base R, rounds a value to a specified number of significant digits. The glue function, part of the glue package, will insert variables inside of a string when you surround the variable with curly braces.
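To see these pieces working on their own, here is a small sketch that applies the same logic to a plain vector rather than a tidy table. The vector p and its values are made up for illustration.

```r
library(dplyr)
library(glue)

# Made-up p-values, not from any real model
p <- c(0.00000012, 0.0312, 0.4837)

# signif keeps a number of significant digits; round keeps decimal places
signif(p, 2)
round(p, 2)

# case_when checks the conditions in order and uses the first one that is TRUE;
# glue fills in the value of the expression inside the curly braces
labels <- case_when(
  p < 0.001 ~ "p < 0.001",
  p >= 0.001 ~ glue("p = {signif(p, 2)}"))
labels
```

The first value falls below the cutoff and becomes "p < 0.001". The other two are rounded to two significant digits, giving "p = 0.031" and "p = 0.48".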