I am trying to create a metadata file with average pollution concentrations over the course of a year per site ID. I can easily calculate the mean, max, min, etc., but I cannot extract the dates on which the minimum and maximum concentrations occurred from the parent dataset and add them as columns in the new dataset. Example:

Parent dataset from which I am calculating the mean, min, max, etc:

ID Conc Date
1 3000 01-04-2022
1 3256 01-05-2022
1 6352 02-09-2022
1 7362 03-04-2022
2 5364 01-04-2022
2 6453 01-05-2022
2 3490 02-09-2022

and so on.

The desired output would look something like this:

ID Min Max min_date max_date
1 3000 7362 01-04-2022 03-04-2022
2 3490 6453 02-09-2022 01-05-2022
3 900 37267 01-05-2022 08-09-2022
4 3490 5666 02-09-2022 07-01-2022

I cannot seem to grab the min and max dates from the dataset. This is the code I have right now to calculate all of the other variables I need:

    annual_table <- all %>%
       group_by(NEAR_FID) %>%
       dplyr::summarize(
          avg = mean(Conc, na.rm = T),
          n_data_points = length(NEAR_FID),
          median = median(Conc),
          quant_95 = quantile(Conc, 0.95),
          quant_5 = quantile(Conc, 0.05),
          max = max(Conc),
          min = min(Conc))

I've tried various indexing approaches, but none of them work properly, and quite frankly I'd like a dplyr solution that I can just drop into this code rather than a long workaround with filtering and joining. Any ideas?

4 Answers

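Try indexing Date by the position of the extreme value, using which.min() and which.max() inside summarize():

library(dplyr)

all %>%
       group_by(NEAR_FID) %>%
       dplyr::summarize(
          avg = mean(Conc, na.rm = TRUE),
          n_data_points = length(NEAR_FID),
          median = median(Conc),
          quant_95 = quantile(Conc, 0.95),
          quant_5 = quantile(Conc, 0.05),
          max = max(Conc),
          min = min(Conc),
          # which.min()/which.max() give the position of the extreme value
          # within each group; that position picks the matching Date
          date_min = Date[which.min(Conc)],
          date_max = Date[which.max(Conc)]
  )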
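
One thing to watch if Conc contains missing values (the question's mean() call uses na.rm = TRUE, which suggests it might): which.min() and which.max() skip NA on their own, whereas min() and max() return NA unless na.rm = TRUE is set. The sketch below illustrates this on the sample rows from the question with one value replaced by NA. The data frame name df, the use of ID instead of NEAR_FID, and the injected NA are assumptions for illustration, not part of the original data.

```r
library(dplyr)

# Sample rows from the question; the NA in row 2 is injected for illustration.
df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2),
  Conc = c(3000, NA, 6352, 7362, 5364, 6453, 3490),
  Date = c("01-04-2022", "01-05-2022", "02-09-2022", "03-04-2022",
           "01-04-2022", "01-05-2022", "02-09-2022")
)

result <- df %>%
  group_by(ID) %>%
  summarise(
    # na.rm = TRUE keeps min()/max() from returning NA when a value is missing
    Min      = min(Conc, na.rm = TRUE),
    Max      = max(Conc, na.rm = TRUE),
    # which.min()/which.max() discard NA and return the first matching position
    min_date = Date[which.min(Conc)],
    max_date = Date[which.max(Conc)],
    .groups  = "drop"
  )

print(result)
```

If a group could consist entirely of NA, which.min() returns a zero-length result, so that case would need separate handling.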

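You need to coerce the Date column to a date. The dates in the question are not in year-month-day order, so ymd() is not the right parser for them; assuming they are day-month-year, lubridate's dmy() fits (use mdy() if the month comes first).

library(dplyr)
library(lubridate)

annual_table <- all %>%
       group_by(NEAR_FID) %>%
       # parse the text dates; dmy() assumes day-month-year order
       mutate(Date = dmy(Date)) %>%
       dplyr::summarize(
          avg = mean(Conc, na.rm = TRUE),
          n_data_points = length(NEAR_FID),
          median = median(Conc),
          quant_95 = quantile(Conc, 0.95),
          quant_5 = quantile(Conc, 0.05),
          max = max(Conc),
          min = min(Conc))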
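
To see why the coercion matters: as character strings, the dates compare alphabetically rather than chronologically, so any later sorting or comparison of the extracted dates can mislead. Here is a small base R sketch, assuming the strings are day-month-year (the question does not say; swap the format string if they are month-day-year):

```r
# Two date strings from the question, assumed here to be day-month-year.
a <- "02-09-2022"  # 2 September 2022 under that assumption
b <- "03-04-2022"  # 3 April 2022 under that assumption

# As plain strings, "02..." sorts before "03...", which is not chronological.
string_cmp <- a < b

# as.Date() with an explicit format yields real Date objects.
a_date <- as.Date(a, format = "%d-%m-%Y")
b_date <- as.Date(b, format = "%d-%m-%Y")
date_cmp <- a_date < b_date

print(c(string_cmp = string_cmp, date_cmp = date_cmp))
print(format(a_date, "%Y-%m-%d"))
```

lubridate's dmy() performs the same parsing without the explicit format string.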

We could use summarise(), grouping by year and ID and indexing Date with which.min() and which.max():

library(dplyr)
library(lubridate)

df %>% 
  group_by(group = year(dmy(Date)), ID) %>% 
  summarise(
    Min = min(Conc),
    Max = max(Conc),
    min_date = Date[which.min(Conc)],
    max_date = Date[which.max(Conc)], .groups = "drop") %>% 
  select(-group)

    ID   Min   Max min_date   max_date  
  <int> <int> <int> <chr>      <chr>     
1     1  3000  7362 01-04-2022 03-04-2022
2     2  3490  6453 02-09-2022 01-05-2022

In base R:

subset(df, ave(Conc, ID, FUN = \(x) x %in% range(x)) > 0) |>
   transform(time = c("max", "min")[order(ID, Conc) %% 2 + 1]) |>
   reshape(idvar = "ID", direction = "wide", sep = "_")
  ID Conc_min   Date_min Conc_max   Date_max
1  1     3000 01-04-2022     7362 03-04-2022
6  2     3490 02-09-2022     6453 01-05-2022

stack(with(df, tapply(Conc, ID, range))) |>
   setNames(c("Conc", "ID")) |>
   transform(time = c("min", "max")) |>
   merge(df, y = _) |>
   reshape(idvar = "ID", direction = "wide", sep = "_")

  ID Conc_min   Date_min Conc_max   Date_max
1  1     3000 01-04-2022     7362 03-04-2022
3  2     3490 02-09-2022     6453 01-05-2022

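In the tidyverse:

library(dplyr)
library(tidyr)

df %>%
   filter(Conc %in% range(Conc), .by = ID) %>%
   # sort so each ID's minimum comes before its maximum; otherwise the
   # "min"/"max" labels below follow row order and can end up swapped
   arrange(ID, Conc) %>%
   cbind(name = c("min", "max")) %>%
   pivot_wider(id_cols = ID, names_from = name,
               values_from = c(Conc, Date))
# A tibble: 2 × 5
     ID Conc_min Conc_max Date_min   Date_max  
  <int>    <int>    <int> <chr>      <chr>     
1     1     3000     7362 01-04-2022 03-04-2022
2     2     3490     6453 02-09-2022 01-05-2022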
