
I have a dataset where I am trying to calculate average concentrations of air pollution per ID. The issue is that some IDs require 3 separate averages due to time lags. Example data below (Time_difference is the difference in seconds between DateTime and lag(DateTime); a Time_difference greater than 600 indicates that a new average needs to be calculated beginning at that row; the first row of a new ID automatically has Time_difference = 0):

df1:

ID DateTime Time_difference Threshold
0 2022-10-13 22:30:01 0 0
0 2022-10-13 22:30:02 1 0
0 2022-10-13 22:30:03 1 0
0 2022-10-13 22:30:04 1 0
0 2022-10-13 22:30:05 1 0
0 2022-10-13 23:05:01 2096 1
0 2022-10-13 23:05:02 1 0
0 2022-10-13 23:05:03 1 0
0 2022-10-13 23:05:04 1 0
0 2022-10-13 23:05:05 1 0
0 2022-10-13 00:10:01 3896 2
0 2022-10-13 00:10:02 1 0
0 2022-10-13 00:10:03 1 0
0 2022-10-13 00:10:04 1 0
0 2022-10-13 00:10:05 1 0
1 2022-10-13 00:10:06 0 0
1 2022-10-13 00:10:07 1 0

.... etc
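For context, a minimal sketch of how Time_difference and Threshold could be derived from DateTime with dplyr. This assumes a hypothetical input data frame (here called raw) with just ID, DateTime (as POSIXct) and Concentration, the 600-second cutoff described above, and cumulative numbering of the breaks within each ID, as in the sample:

library(dplyr)

# 'raw' is a hypothetical input with just ID, DateTime (POSIXct) and Concentration
df1 <- raw %>%
    arrange(ID, DateTime) %>%
    group_by(ID) %>%
    mutate(
        # seconds since the previous reading within the same ID; 0 for the first row of an ID
        Time_difference = coalesce(as.numeric(difftime(DateTime, lag(DateTime), units = "secs")), 0),
        # number the gaps larger than 600 s: 1 for the first break, 2 for the second, ...
        Threshold = if_else(Time_difference > 600, cumsum(Time_difference > 600), 0L)
    ) %>%
    ungroup()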

All of these rows have a corresponding concentration level, but that is irrelevant for now. What I am trying to do is create a new column labeled "group": within each ID, all rows before the row where Threshold==1 should be "group 1", all rows from the row where Threshold==1 up to the row where Threshold==2 should be "group 2", and all rows from the row where Threshold==2 until the next ID should be "group 3":

ID DateTime Time_difference Threshold group
0 2022-10-13 22:30:01 0 0 group 1
0 2022-10-13 22:30:02 1 0 group 1
0 2022-10-13 22:30:03 1 0 group 1
0 2022-10-13 22:30:04 1 0 group 1
0 2022-10-13 22:30:05 1 0 group 1
0 2022-10-13 23:05:01 2096 1 group 2
0 2022-10-13 23:05:02 1 0 group 2
0 2022-10-13 23:05:03 1 0 group 2
0 2022-10-13 23:05:04 1 0 group 2
0 2022-10-13 23:05:05 1 0 group 2
0 2022-10-13 00:10:01 3896 2 group 3
0 2022-10-13 00:10:02 1 0 group 3
0 2022-10-13 00:10:03 1 0 group 3
0 2022-10-13 00:10:04 1 0 group 3
0 2022-10-13 00:10:05 1 0 group 3
1 2022-10-13 00:10:06 0 0 group 1
1 2022-10-13 00:10:07 1 0 group 1

.... etc

The idea would then be to group by the ID and the group and calculate the average concentrations.

I've been able to do this when there are just 2 groups per ID, but when there are 3 groups I cannot adapt the code. This is how I created the two groups (very inefficiently, I know):

# pull the DateTime of the Threshold==1 row for each ID
temp <- df1 %>%
    ungroup() %>%
    filter(Threshold == 1) %>%
    mutate(thresh_time = DateTime) %>%
    dplyr::select(ID, thresh_time)

# attach that time to every row of the ID (left_join already keeps all rows of df1)
df2 <- left_join(df1, temp, by = "ID")

# rows before the threshold time (or starting a new ID) become "group 1"
df3 <- df2 %>%
    mutate(group = ifelse(DateTime < thresh_time | ID != lag(ID), "group 1", "group 2"))

So for this solution I had to create yet another variable, thresh_time (the DateTime where Threshold==1), assign that time to every row of the corresponding ID, and then create the groups based on whether each row's DateTime falls before or after thresh_time. This only works for two groups, though, and I cannot for the life of me figure out how to adapt it to 3 groups, if that is even possible.

The final dataset would ideally look something like this:

ID group Concentration
0 group 1 55890
0 group 2 67491
0 group 3 87645
1 group 1 94827
1 group 2 61527
2 group 1 45362

Any ideas on how to get here? I have been at this for weeks...

  • Could use something simple such as mutate(df, Group = cumsum(Threshold > 0) + 1, .by = ID)? (For .by argument using dplyr version >= 1.1.0...)
    – Ben
    Commented May 22, 2023 at 18:04
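
For illustration, the suggestion in the comment above would look roughly like this (a sketch assuming dplyr >= 1.1.0 for the .by argument; paste() is only added to produce the "group N" labels used in the question):

library(dplyr)

df1 |>
    mutate(group = paste("group", cumsum(Threshold > 0) + 1), .by = ID)

On older dplyr versions, the same thing can be written with group_by(ID) before the mutate() and ungroup() afterwards.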

1 Answer


One way to create the groups to your specification is to assign "group 1" to every row of the group column, and then update the group number with a for loop:

fgroup <- function(dat) {
    dat$group <- "group 1"
    n <- 1
    for (k in 1:nrow(dat)) {
        # every non-zero Threshold starts a new group from this row onward
        if (dat$Threshold[k] > 0) {
            n <- n + 1
            dat$group[k:nrow(dat)] <- paste("group", n)
        }
    }
    return(dat)
}

Then, split the data frame by ID, apply the function to each piece with purrr::map(), and combine the results with dplyr::bind_rows().

# dat is the example data frame (df1 above); map() is from purrr, bind_rows() from dplyr
dat2 <- split(dat, dat$ID) |> map(fgroup) |> bind_rows()
dat2

   ID             DateTime Time_difference Threshold   group
1   0 2022-10-13 22:30:01                0         0 group 1
2   0 2022-10-13 22:30:02                1         0 group 1
3   0 2022-10-13 22:30:03                1         0 group 1
4   0 2022-10-13 22:30:04                1         0 group 1
5   0 2022-10-13 22:30:05                1         0 group 1
6   0 2022-10-13 23:05:01             2096         1 group 2
7   0 2022-10-13 23:05:02                1         0 group 2
8   0 2022-10-13 23:05:03                1         0 group 2
9   0 2022-10-13 23:05:04                1         0 group 2
10  0 2022-10-13 23:05:05                1         0 group 2
11  0 2022-10-13 00:10:01             3896         2 group 3
12  0 2022-10-13 00:10:02                1         0 group 3
13  0 2022-10-13 00:10:03                1         0 group 3
14  0 2022-10-13 00:10:04                1         0 group 3
15  0 2022-10-13 00:10:05                1         0 group 3
16  1 2022-10-13 00:10:06                0         0 group 1
17  1 2022-10-13 00:10:07                1         0 group 1

Then you can calculate the mean concentration for each group within each ID:

dat2 |> group_by(ID, group) |> summarise(mean(Concentration))
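
As a small variant, naming the summary column and dropping the grouping gives output in the same shape as the table in the question (the .groups argument only suppresses the regrouping message):

dat2 |>
    group_by(ID, group) |>
    summarise(Concentration = mean(Concentration), .groups = "drop")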
