I have a dataset where I am trying to calculate average concentrations of air pollution per ID. The issue is that some IDs require up to 3 separate averages because of gaps in time. Example data (Time_difference is the difference in seconds between DateTime and lag(DateTime); a Time_difference greater than 600 means a new average must start at that row; the first row of a new ID automatically has Time_difference = 0):
df1:
ID | DateTime | Time_difference | Threshold |
---|---|---|---|
0 | 2022-10-13 22:30:01 | 0 | 0 |
0 | 2022-10-13 22:30:02 | 1 | 0 |
0 | 2022-10-13 22:30:03 | 1 | 0 |
0 | 2022-10-13 22:30:04 | 1 | 0 |
0 | 2022-10-13 22:30:05 | 1 | 0 |
0 | 2022-10-13 23:05:01 | 2096 | 1 |
0 | 2022-10-13 23:05:02 | 1 | 0 |
0 | 2022-10-13 23:05:03 | 1 | 0 |
0 | 2022-10-13 23:05:04 | 1 | 0 |
0 | 2022-10-13 23:05:05 | 1 | 0 |
0 | 2022-10-14 00:10:01 | 3896 | 2 |
0 | 2022-10-14 00:10:02 | 1 | 0 |
0 | 2022-10-14 00:10:03 | 1 | 0 |
0 | 2022-10-14 00:10:04 | 1 | 0 |
0 | 2022-10-14 00:10:05 | 1 | 0 |
1 | 2022-10-14 00:10:06 | 0 | 0 |
1 | 2022-10-14 00:10:07 | 1 | 0 |
.... etc
All of these rows have a corresponding concentration level, but right now that is irrelevant. What I am trying to do is create a new column labeled "group", assigned within each ID as follows: rows before the row where Threshold == 1 are "group 1", rows from Threshold == 1 up to (but not including) the row where Threshold == 2 are "group 2", and rows from Threshold == 2 until the next ID are "group 3":
ID | DateTime | Time_difference | Threshold | group |
---|---|---|---|---|
0 | 2022-10-13 22:30:01 | 0 | 0 | group 1 |
0 | 2022-10-13 22:30:02 | 1 | 0 | group 1 |
0 | 2022-10-13 22:30:03 | 1 | 0 | group 1 |
0 | 2022-10-13 22:30:04 | 1 | 0 | group 1 |
0 | 2022-10-13 22:30:05 | 1 | 0 | group 1 |
0 | 2022-10-13 23:05:01 | 2096 | 1 | group 2 |
0 | 2022-10-13 23:05:02 | 1 | 0 | group 2 |
0 | 2022-10-13 23:05:03 | 1 | 0 | group 2 |
0 | 2022-10-13 23:05:04 | 1 | 0 | group 2 |
0 | 2022-10-13 23:05:05 | 1 | 0 | group 2 |
0 | 2022-10-14 00:10:01 | 3896 | 2 | group 3 |
0 | 2022-10-14 00:10:02 | 1 | 0 | group 3 |
0 | 2022-10-14 00:10:03 | 1 | 0 | group 3 |
0 | 2022-10-14 00:10:04 | 1 | 0 | group 3 |
0 | 2022-10-14 00:10:05 | 1 | 0 | group 3 |
1 | 2022-10-14 00:10:06 | 0 | 0 | group 1 |
1 | 2022-10-14 00:10:07 | 1 | 0 | group 1 |
.... etc
The idea would then be to group by ID and group and calculate the average concentrations.
I've been able to do this when there are just 2 groups per ID, but when there are 3 groups I cannot adapt the code. This is how I created the two groups (very inefficiently, I know):
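# Record the DateTime at which Threshold == 1 for each ID
temp <- df1 %>%
  ungroup() %>%
  filter(Threshold == 1) %>%
  mutate(thresh_time = DateTime) %>%
  dplyr::select(ID, thresh_time)
# Attach that time to every row of the same ID (all.x is a merge() argument, not a left_join() one)
df2 <- left_join(df1, temp, by = "ID")
# Rows before thresh_time, or the first row of a new ID, go to group 1
df3 <- df2 %>%
  mutate(group = ifelse(DateTime < thresh_time | ID != lag(ID), "group 1", "group 2"))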
So for this solution I had to create yet another variable, thresh_time (the DateTime at which Threshold == 1), assign it to every row of the same ID, and then assign the groups according to whether each row's DateTime is earlier or later than thresh_time. This only works for two groups, though, and I cannot for the life of me figure out how to adapt it to 3 groups, if that is even possible.
The final dataset would ideally look something like this:
ID | group | Concentration |
---|---|---|
0 | group 1 | 55890 |
0 | group 2 | 67491 |
0 | group 3 | 87645 |
1 | group 1 | 94827 |
1 | group 2 | 61527 |
2 | group 1 | 45362 |
Any ideas on how to get here? I have been at this for weeks...
mutate(df, Group = cumsum(Threshold > 0) + 1, .by = ID)
? (For the .by argument, use dplyr version >= 1.1.0...)
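To make the suggestion concrete, here is a minimal sketch of the cumsum idea applied to a cut-down version of the example data, followed by the averaging step. The Concentration values are invented for illustration only, and group_by() is used instead of .by so that it should also run on dplyr versions before 1.1.0; the two forms are intended to be equivalent here.

```r
library(dplyr)

# Cut-down toy data in the question's layout.
# Concentration values are made up purely for illustration.
df1 <- tibble(
  ID            = c(0, 0, 0, 0, 0, 0, 1, 1),
  Threshold     = c(0, 0, 1, 0, 2, 0, 0, 0),
  Concentration = c(10, 20, 30, 50, 70, 90, 5, 15)
)

result <- df1 %>%
  group_by(ID) %>%
  # Every non-zero Threshold marks the start of a new block, so the running
  # count of such rows (plus 1) is the block number within the ID.
  mutate(group = paste("group", cumsum(Threshold > 0) + 1)) %>%
  # Average the concentration within each ID/group combination.
  group_by(ID, group) %>%
  summarise(Concentration = mean(Concentration), .groups = "drop")

print(result)
```

Because the counter only looks at whether Threshold is non-zero, the same line should handle IDs with one, two, three or more blocks without any joins or helper columns.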