Functions to make plain R output look nicer

Most of the statistical summary functions in R produce plain vanilla output. That’s actually good because it allows you to customize things the way you like. I have been finding that I use pretty close to the same customizations over and over again, so I thought I should standardize them in a few simple R functions.

pretty_p

The pretty_p function represents my efforts to avoid displaying small p-values using scientific notation. Some researchers get confused when they see something like 7.93e- 2. Even for those researchers who are used to scientific notation, it just makes things hard to read. This function replaces very small p-values with “p < 0.001” and rounds larger p-values to two significant figures.

library(dplyr)  # mutate, case_when
library(glue)   # glue

pretty_p <- function(p_table) {
  # p_table is the output from the tidy function in the broom package
  p_table |>
    mutate(p.value =
      case_when(
        p.value <  0.001 ~ "p < 0.001",
        p.value >= 0.001 ~ glue("p = {signif(p.value, 2)}")))
}

Comment on the code

The mutate function modifies an existing variable or creates a new variable. The case_when function uses logical conditions to assign values. Both are part of the dplyr/tidyverse libraries. The signif function, part of base R, rounds a value to a specified number of significant digits. The glue function, part of the glue package, inserts the value of a variable or expression into a string wherever you surround it with curly braces.
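To make the roles of signif and glue concrete, here is a minimal sketch that runs them outside of mutate. The variable names p and label are just for illustration.

```r
library(glue)

# signif() keeps a fixed number of significant digits, whatever the magnitude
signif(0.0793, 2)    # 0.079
signif(0.4842, 2)    # 0.48

# glue() evaluates the expression inside the curly braces and pastes in the result
p <- 0.0793
label <- glue("p = {signif(p, 2)}")
label                # p = 0.079
```

One thing to be aware of: signif returns a number, so trailing zeros are dropped when glue converts it to text. A p-value of 0.10 would display as "p = 0.1".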

Example

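Here is an example of how you might use the pretty_p function. I am using the Palmer Penguins dataset (the penguins data frame from the palmerpenguins package), and the tidy function comes from the broom package. Here is the output from tidy without pretty_p.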

lm(flipper_length_mm ~ species + island, data=penguins) |>
  tidy()

# A tibble: 5 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       189.        0.999   189.    0       
2 speciesChinstrap    6.09      1.20      5.09  5.92e- 7
3 speciesGentoo      28.4       1.16     24.4   2.23e-76
4 islandDream         0.937     1.34      0.701 4.84e- 1
5 islandTorgersen     2.40      1.36      1.76  7.93e- 2

Here is the output from tidy with pretty_p.

lm(flipper_length_mm ~ species + island, data=penguins) |>
  tidy() |>
  pretty_p()

# A tibble: 5 × 5
  term             estimate std.error statistic p.value  
  <chr>               <dbl>     <dbl>     <dbl> <glue>   
1 (Intercept)       189.        0.999   189.    p < 0.001
2 speciesChinstrap    6.09      1.20      5.09  p < 0.001
3 speciesGentoo      28.4       1.16     24.4   p < 0.001
4 islandDream         0.937     1.34      0.701 p = 0.48 
5 islandTorgersen     2.40      1.36      1.76  p = 0.079

pretty_n

I also like to replace counts with percentages, but show the numerator and denominator in parentheses afterwards. This function will take results from the count function (or from any dataset with a variable named “n”), compute a total and percentage, and then display the results in a nice format.

pretty_n <- function(n_table) {
  # n_table is output from the count function
  n_table |>
    mutate(total=sum(n)) |>
    mutate(pct=round(100*n/total)) |>
    mutate(pct=glue("{pct}% ({n}/{total})")) |>
    select(-n, -total)
}

Comment on the code

The mutate and glue functions are described above. The round function, like signif, is part of base R and is documented on the same help page. The select function will keep or drop columns of data.
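The steps inside pretty_n can be traced one at a time on a tiny data frame. This is only a sketch: the data frame toy and its counts are made up for illustration and are not taken from the penguins data.

```r
library(dplyr)
library(glue)

# Hypothetical counts, not from the penguins data
toy <- data.frame(group = c("a", "b", "c"), n = c(1, 1, 2))

toy_pct <- toy |>
  mutate(total = sum(n)) |>                      # total is 4 in every row
  mutate(pct = round(100 * n / total)) |>        # 25, 25, 50
  mutate(pct = glue("{pct}% ({n}/{total})")) |>  # "25% (1/4)", "25% (1/4)", "50% (2/4)"
  select(-n, -total)                             # drop the helper columns

toy_pct
```

Because each percentage is rounded on its own, the displayed percentages will not always add up to exactly 100. Three equal counts, for example, would each show as 33%.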

Example

Here is an example of how you might use the pretty_n function. Here is the output from count without pretty_n.

penguins |>
  count(species)

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Here is the output from count with pretty_n.

penguins |>
  count(species) |>
  pretty_n()

# A tibble: 3 × 2
  species   pct          
  <fct>     <glue>       
1 Adelie    44% (152/344)
2 Chinstrap 20% (68/344) 
3 Gentoo    36% (124/344)

pretty_mean

For continuous variables, I usually like to compute a few descriptive statistics: the mean, standard deviation, minimum, maximum, and the number of missing values. This is easy enough to do but the code is long and tedious. The pretty_mean function is just as long and tedious (if not more so), but once you enter it, you can save a lot of time starting with the second variable that you need to summarize. I actually used to like the summary function, but it does not include a standard deviation and it does not work well with the group_by function in dplyr.

pretty_mean <- function(d, v) {
  # d is a data frame or tibble, v is a variable in d.
  # Count the missing values before they are filtered out.
  d1 <- d |>
    summarize(n_missing=sum(is.na({{v}})))
  # Summarize the non-missing values, then attach the missing count.
  d |>
    filter(!is.na({{v}})) |>
    summarize(
      across({{v}}, 
        list(
          mean=mean, 
          sd=sd,
          min=min,
          max=max))) |>
    bind_cols(d1)
}

Comment on the code

Using tidyverse functions inside a loop or function is tricky because they refer to columns by bare name. With an argument that represents a variable name, you either enclose the argument in double curly braces (“embrace” is the term they use) when it is passed as an unquoted name, or you look it up with .data[[ ]] when it is passed as a character string.
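Here is a minimal sketch of the two approaches side by side. The function names mean_embrace and mean_string and the data frame toy are hypothetical, made up only to show the difference.

```r
library(dplyr)

# Hypothetical data, used only for illustration
toy <- data.frame(x = c(1, 2, 3, NA))

# Column passed as a bare (unquoted) name: embrace it with {{ }}
mean_embrace <- function(d, v) {
  d |> summarize(m = mean({{ v }}, na.rm = TRUE))
}

# Column passed as a character string: look it up with .data[[ ]]
mean_string <- function(d, v) {
  d |> summarize(m = mean(.data[[v]], na.rm = TRUE))
}

mean_embrace(toy, x)     # m is 2
mean_string(toy, "x")    # m is 2
```

The embrace form lets you call the function the same way you would call a dplyr verb, which is why pretty_mean uses it.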

The across function, part of the dplyr/tidyverse libraries, allows you to select specific variable(s) for further operations within dplyr functions like summarize. I have had some difficulty with this function, but it seems to work well here.

The bind_cols function, part of the dplyr library, combines two data frames or tibbles side by side.

Example

Here is the code you would write without the pretty_mean function.

penguins |>
  summarize(
    body_mass_g_mean=mean(body_mass_g, na.rm=TRUE),
    body_mass_g_sd=sd(body_mass_g, na.rm=TRUE),
    body_mass_g_min=min(body_mass_g, na.rm=TRUE),
    body_mass_g_max=max(body_mass_g, na.rm=TRUE),
    n_missing=sum(is.na(body_mass_g)))

# A tibble: 1 × 5
  body_mass_g_mean body_mass_g_sd body_mass_g_min body_mass_g_max n_missing
             <dbl>          <dbl>           <int>           <int>     <int>
1            4202.           802.            2700            6300         2

Here is the code with the pretty_mean function.

penguins |>
  pretty_mean(body_mass_g)

# A tibble: 1 × 5
  body_mass_g_mean body_mass_g_sd body_mass_g_min body_mass_g_max n_missing
             <dbl>          <dbl>           <int>           <int>     <int>
1            4202.           802.            2700            6300         2

Writing functions like these only makes sense if you find yourself doing the same sort of thing three or more times in one program. There is a nice side effect of writing the function. If you decide that you want a different appearance or you want to round some of your results or you want to feed the data into a nicely formatted table, it only takes a small modification in one spot.
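As one illustration of a small modification in one spot, here is a sketch of a variant that rounds the mean and standard deviation. The name pretty_mean_rounded and the digits argument are hypothetical additions, and the toy data frame is made up.

```r
library(dplyr)

pretty_mean_rounded <- function(d, v, digits = 1) {
  # Same logic as pretty_mean, plus one extra mutate step for rounding
  d1 <- d |>
    summarize(n_missing = sum(is.na({{ v }})))
  d |>
    filter(!is.na({{ v }})) |>
    summarize(
      across({{ v }},
        list(mean = mean, sd = sd, min = min, max = max))) |>
    # Round only the columns whose names end in _mean or _sd
    mutate(across(ends_with(c("_mean", "_sd")), \(x) round(x, digits))) |>
    bind_cols(d1)
}

# Hypothetical data: mean 2.3, sd 1.5, min 1, max 4, one missing value
toy <- data.frame(x = c(1, 2, 4, NA))
toy_summary <- pretty_mean_rounded(toy, x)
toy_summary
```

Every program that calls the function picks up the new rounding without any other change.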