
I'm a frequent R user, but I have never understood why these two graphs look so different, or what I can do to make geom_line() match the display produced by stat_summary() (which is much better). Bonus question: what is a reasonable justification for keeping geom_line() working like that?

library(tidyverse)
df = structure(list(
  year_completed_cat = structure(
    c(5L, 4L, 5L, 4L, 4L, 6L, 6L, 4L, 6L, 4L, 6L, 5L, 4L, 4L,
      4L, 5L, 6L, 5L, 6L, 5L, 6L, 6L, 6L, 5L, 4L, 6L, 6L, 6L,
      6L, 5L, 4L, 6L, 6L, 5L, 5L, 6L, 6L, 4L, 4L, 6L, 6L, 6L,
      6L, 5L, 4L, 6L, 5L, 6L, 6L, 5L),
    levels = c("18", "19", "20", "21", "22", "23", "24"),
    class = "factor"),
  asqse_quest = structure(
    c(6L, 7L, 7L, 7L, 7L, 6L, 6L, 5L, 5L, 5L, 5L, 6L, 6L, 5L,
      7L, 6L, 5L, 6L, 7L, 5L, 6L, 7L, 6L, 7L, 7L, 7L, 5L, 7L,
      5L, 5L, 6L, 7L, 5L, 5L, 7L, 5L, 7L, 6L, 6L, 5L, 6L, 5L,
      6L, 6L, 6L, 5L, 6L, 6L, 5L, 5L),
    levels = c("2", "6", "12", "18", "24", "30", "36", "48", "60"),
    class = "factor"),
  asqse_total =
    c(205, 40, 80, 60, 40, 60, 120, 0, 20, 20, 70, 70, 35, 35,
      225, 140, 80, 215, 230, 110, 180, 155, 25, 165, 75, 60, 20,
      85, 20, 75, 30, 35, 25, 55, 160, 70, 140, 35, 140, 30, 40,
      40, 25, 40, 75, 5, 35, 205, 5, 40)),
  row.names = c(NA, -50L), class = "data.frame")

ggplot(df, aes(x = year_completed_cat, y = asqse_total, 
               group = asqse_quest, color = asqse_quest)) +
  geom_line() + geom_point()

ggplot(df, aes(x = year_completed_cat, y = asqse_total, 
               group = asqse_quest, color = asqse_quest)) +
  stat_summary(geom = "line", fun = mean)

Created on 2024-07-07 with reprex v2.1.0

    geom_line() draws lines connecting the raw data points. stat_summary() first summarises the data with fun (here, mean) and then draws lines connecting the summarised values within each group.
    – Edward
    Commented Jul 8 at 6:29

3 Answers


Summarise the data first, then use geom_line() to mimic stat_summary():

# One mean per (year_completed_cat, asqse_quest) pair, then a plain line plot
summarise(df, asqse_total = mean(asqse_total), .by = c(year_completed_cat, asqse_quest)) |>
  ggplot(aes(x = year_completed_cat, y = asqse_total,
             group = asqse_quest, color = asqse_quest)) +
  geom_line() + geom_point()



You have multiple asqse_total values for each year_completed_cat within a group, so geom_line() connects all of those points in order along the x axis: first all of the points for year_completed_cat 21 (a vertical line), then across to the next position on the x axis (the diagonal), then the points for year_completed_cat 22 (another vertical line), and so on.
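
One way to see this is to count the vertices each layer actually draws. The sketch below is only an illustration: `toy` is a small made-up data frame (not the question's `df`), and it uses ggplot2's `layer_data()` to inspect what each layer computes.

```r
library(ggplot2)

# Hypothetical toy data: two observations per x position, a single group
toy <- data.frame(
  year  = factor(c("21", "21", "22", "22")),
  total = c(10, 40, 20, 30)
)

# geom_line() on the raw data: every observation becomes a vertex of the path
p_raw <- ggplot(toy, aes(x = year, y = total, group = 1)) +
  geom_line()

# stat_summary() collapses each x position to its mean before drawing the line
p_mean <- ggplot(toy, aes(x = year, y = total, group = 1)) +
  stat_summary(geom = "line", fun = mean)

raw_vertices  <- layer_data(p_raw)   # one row per raw observation
mean_vertices <- layer_data(p_mean)  # one row per x position

nrow(raw_vertices)
nrow(mean_vertices)
```

With more vertices than x positions, the path has to run vertically within each x position before moving on, which is what produces the zig-zag in the first plot.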

If you want to draw a point for each data point, and then lines connecting the means, you can combine both of your approaches: a geom_point() layer plus a stat_summary(..., geom = "line") layer.
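
A minimal sketch of that combination, assuming a small made-up data frame (`toy2`, with hypothetical columns `year`, `quest` and `total` standing in for the question's variables); the same two layers should work unchanged with the question's `df` and its column names.

```r
library(ggplot2)

# Hypothetical toy data: 3 years x 2 groups, two observations per cell
toy2 <- data.frame(
  year  = factor(rep(c("21", "22", "23"), each = 4)),
  quest = factor(rep(c("24", "30"), times = 6)),
  total = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120)
)

p <- ggplot(toy2, aes(x = year, y = total, group = quest, color = quest)) +
  geom_point() +                           # layer 1: every raw observation
  stat_summary(geom = "line", fun = mean)  # layer 2: line through the per-year group means

p
```

The points keep the spread of the raw data visible, while the line only passes through one mean per year and group.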


If you prefer geom_line() to stat_summary(), you could use:

ggplot(
  df %>% summarise(.by = c(year_completed_cat, asqse_quest), mean = mean(asqse_total)), 
  aes(x = year_completed_cat, y = mean, group = asqse_quest, color = asqse_quest)) +
  geom_line()
