
I have a dataset where I am trying to calculate average concentrations of air pollution per ID. The issue is that some IDs require 3 separate averages due to time lags. Example data below (Time_difference is the difference in seconds between DateTime and lag(DateTime); a Time_difference greater than 600 indicates that a new average needs to be calculated beginning at that row; the first row of a new ID automatically has Time_difference = 0):

df1:

ID DateTime Time_difference Threshold
0 2022-10-13 22:30:01 0 0
0 2022-10-13 22:30:02 1 0
0 2022-10-13 22:30:03 1 0
0 2022-10-13 22:30:04 1 0
0 2022-10-13 22:30:05 1 0
0 2022-10-13 23:05:01 2096 1
0 2022-10-13 23:05:02 1 0
0 2022-10-13 23:05:03 1 0
0 2022-10-13 23:05:04 1 0
0 2022-10-13 23:05:05 1 0
0 2022-10-13 00:10:01 3896 2
0 2022-10-13 00:10:02 1 0
0 2022-10-13 00:10:03 1 0
0 2022-10-13 00:10:04 1 0
0 2022-10-13 00:10:05 1 0
1 2022-10-13 00:10:06 0 0
1 2022-10-13 00:10:07 1 0

.... etc
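For context, a minimal sketch of how Time_difference and Threshold could be derived from DateTime with dplyr. This assumes a hypothetical input data frame (here called raw) with just ID, DateTime (as POSIXct) and Concentration, the 600-second cutoff described above, and cumulative numbering of the breaks within each ID, as in the sample:

library(dplyr)

# 'raw' is a hypothetical input with just ID, DateTime (POSIXct) and Concentration
df1 <- raw %>%
    arrange(ID, DateTime) %>%
    group_by(ID) %>%
    mutate(
        # seconds since the previous reading within the same ID; 0 for the first row of an ID
        Time_difference = coalesce(as.numeric(difftime(DateTime, lag(DateTime), units = "secs")), 0),
        # number the gaps larger than 600 s: 1 for the first break, 2 for the second, ...
        Threshold = if_else(Time_difference > 600, cumsum(Time_difference > 600), 0L)
    ) %>%
    ungroup()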

All of these rows have a corresponding concentration level, but that is irrelevant for now. What I am trying to do is create a new column labeled "group": within each ID, all rows before the row where Threshold==1 should be "group 1", all rows from the row where Threshold==1 up to the row where Threshold==2 should be "group 2", and all rows from the row where Threshold==2 until the next ID should be "group 3":

ID DateTime Time_difference Threshold group
0 2022-10-13 22:30:01 0 0 group 1
0 2022-10-13 22:30:02 1 0 group 1
0 2022-10-13 22:30:03 1 0 group 1
0 2022-10-13 22:30:04 1 0 group 1
0 2022-10-13 22:30:05 1 0 group 1
0 2022-10-13 23:05:01 2096 1 group 2
0 2022-10-13 23:05:02 1 0 group 2
0 2022-10-13 23:05:03 1 0 group 2
0 2022-10-13 23:05:04 1 0 group 2
0 2022-10-13 23:05:05 1 0 group 2
0 2022-10-13 00:10:01 3896 2 group 3
0 2022-10-13 00:10:02 1 0 group 3
0 2022-10-13 00:10:03 1 0 group 3
0 2022-10-13 00:10:04 1 0 group 3
0 2022-10-13 00:10:05 1 0 group 3
1 2022-10-13 00:10:06 0 0 group 1
1 2022-10-13 00:10:07 1 0 group 1

.... etc

The idea would then be to group by the ID and the group and calculate the average concentrations.

I've been able to do this when there are just 2 groups per ID, but when there are 3 groups I cannot adapt the code. This is how I created the two groups (very inefficiently, I know):

# pull the DateTime of the Threshold==1 row for each ID
temp <- df1 %>%
    ungroup() %>%
    filter(Threshold == 1) %>%
    mutate(thresh_time = DateTime) %>%
    dplyr::select(ID, thresh_time)

# attach that time to every row of the ID (left_join already keeps all rows of df1)
df2 <- left_join(df1, temp, by = "ID")

# rows before the threshold time (or starting a new ID) become "group 1"
df3 <- df2 %>%
    mutate(group = ifelse(DateTime < thresh_time | ID != lag(ID), "group 1", "group 2"))

So for this solution I had to create yet another variable, thresh_time (the DateTime where Threshold==1), assign that time to every row of the corresponding ID, and then create the groups based on whether each row's DateTime falls before or after thresh_time. This only works for two groups, though, and I cannot for the life of me figure out how to adapt it to 3 groups, if that is even possible.

The final dataset would ideally look something like this:

ID group Concentration
0 group 1 55890
0 group 2 67491
0 group 3 87645
1 group 1 94827
1 group 2 61527
2 group 1 45362

Any ideas on how to get here? I have been at this for weeks...

  • Could use something simple such as mutate(df, Group = cumsum(Threshold > 0) + 1, .by = ID)? (For .by argument using dplyr version >= 1.1.0...)
    – Ben
    Commented May 22, 2023 at 18:04
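
For illustration, the suggestion in the comment above would look roughly like this (a sketch assuming dplyr >= 1.1.0 for the .by argument; paste() is only added to produce the "group N" labels used in the question):

library(dplyr)

df1 |>
    mutate(group = paste("group", cumsum(Threshold > 0) + 1), .by = ID)

On older dplyr versions, the same thing can be written with group_by(ID) before the mutate() and ungroup() afterwards.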

1 Answer


One way to create the groups to your specification is to assign "group 1" to every row of the group column, and then update the group number with a for loop:

fgroup <- function(dat) {
    dat$group <- "group 1"
    n <- 1
    for (k in 1:nrow(dat)) {
        # every non-zero Threshold starts a new group from this row onward
        if (dat$Threshold[k] > 0) {
            n <- n + 1
            dat$group[k:nrow(dat)] <- paste("group", n)
        }
    }
    return(dat)
}

Then, split the data frame by ID, apply the function to each piece with purrr::map(), and combine the results with dplyr::bind_rows().

# dat is the example data frame (df1 above); map() is from purrr, bind_rows() from dplyr
dat2 <- split(dat, dat$ID) |> map(fgroup) |> bind_rows()
dat2

   ID             DateTime Time_difference Threshold   group
1   0 2022-10-13 22:30:01                0         0 group 1
2   0 2022-10-13 22:30:02                1         0 group 1
3   0 2022-10-13 22:30:03                1         0 group 1
4   0 2022-10-13 22:30:04                1         0 group 1
5   0 2022-10-13 22:30:05                1         0 group 1
6   0 2022-10-13 23:05:01             2096         1 group 2
7   0 2022-10-13 23:05:02                1         0 group 2
8   0 2022-10-13 23:05:03                1         0 group 2
9   0 2022-10-13 23:05:04                1         0 group 2
10  0 2022-10-13 23:05:05                1         0 group 2
11  0 2022-10-13 00:10:01             3896         2 group 3
12  0 2022-10-13 00:10:02                1         0 group 3
13  0 2022-10-13 00:10:03                1         0 group 3
14  0 2022-10-13 00:10:04                1         0 group 3
15  0 2022-10-13 00:10:05                1         0 group 3
16  1 2022-10-13 00:10:06                0         0 group 1
17  1 2022-10-13 00:10:07                1         0 group 1

Then you can calculate the mean concentration for each group within each ID:

dat2 |> group_by(ID, group) |> summarise(mean(Concentration))
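
As a small variant, naming the summary column and dropping the grouping gives output in the same shape as the table in the question (the .groups argument only suppresses the regrouping message):

dat2 |>
    group_by(ID, group) |>
    summarise(Concentration = mean(Concentration), .groups = "drop")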
