Introduction

The experience sampling method (ESM, also called ecological momentary assessment or ambulatory assessment; Csikszentmihalyi et al., 1977) is an ecologically valid self-report diary technique used for capturing and quantifying data in daily life. It involves sampling individuals’ mental states, symptoms, and context by prompting them to fill out a questionnaire multiple times per day for several consecutive days. In this way, data rich in moment-to-moment information are obtained, providing a unique and detailed insight into the flow of an individual’s everyday life (Myin-Germeys et al., 2018). While ESM is not new and has its roots in ecological psychology (Larson & Csikszentmihalyi, 2014), an increasing number of researchers are now turning to ESM to examine within-person psychological processes (Stange et al., 2019). ESM is, for instance, often used to study how people feel in daily life, including both negative (NA) and positive (PA) affective states. In addition, ESM can provide crucial insights into the dynamics of these affective processes, notably regarding how people respond emotionally to real-life events (most often studied in terms of stress reactivity; Lardinois et al., 2011) or, conversely, how their emotions have become decoupled from events (which is captured in the notion of emotional inertia; Kuppens et al., 2010, 2012).

These core characteristics of people’s emotional functioning in daily life have proven crucial for understanding important differences in the experience of affect between people who do and do not suffer from mental disorders. For instance, NA and stress reactivity are consistently found to be elevated in individuals with psychosis (Lardinois et al., 2011; Myin-Germeys & van Os, 2007; Reininghaus et al., 2016b). Similarly, emotional inertia is increased in patients suffering from depression compared to healthy individuals (Kuppens et al., 2010) and may even prospectively predict the onset of depression (Kuppens et al., 2012). The potential clinical merit of ESM and of these dynamic parameters as markers of psychopathology, together with technological advances that make it possible to gather real-world data about the psychological state of individuals, has led to an increase in the use of ESM in recent years (van Berkel et al., 2017). The growth of ESM research has, however, brought with it increasing heterogeneity in study designs and preprocessing strategies. While the study design (i.e., included items and sampling frequency) is decided upon based on a specific rationale, preprocessing choices are typically made without a specific a priori justification. Studies frequently differ regarding the exclusion of participants based on a predefined compliance rate (i.e., the minimum number of completed assessments relative to assessment occasions; Trull & Ebner-Priemer, 2020), the exclusion of the first day of data collection (Stone et al., 2002), the grouping of ordinal variables into specific measures, and the centering of variables. Of all these preprocessing choices, however, only the effect of centering has been well studied (e.g., Hamaker & Grasman, 2014). For example, it is well known that person-mean centering is required when studying within-person associations (Enders & Tofighi, 2007).

Multiverse analysis (MA) is a statistical technique used to investigate the effect of different preprocessing choices on the statistical results and conclusions of an empirical study (Steegen et al., 2016). Specifically, it involves creating a collection of datasets, with each dataset stemming from a different combination of data preprocessing choices. The hypothesis is evaluated on each unique dataset, yielding a distribution of statistical results such as p-values. Using this distribution, one can then explore the impact of preprocessing choices on the robustness of conclusions by considering whether the same conclusion is reached under different scenarios (akin to sensitivity analyses that investigate the impact of alternative modeling choices; Steegen et al., 2016).
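
To make this logic concrete, the following minimal sketch (in R, the language also used for the analyses reported below) shows the core of a multiverse analysis. The object datasets and the function test_hypothesis() are hypothetical placeholders, standing for a list of preprocessed data frames and a routine that fits the hypothesis model to one dataset and returns its p-value.

# Conceptual sketch of a multiverse analysis. `datasets` is a hypothetical
# list with one preprocessed data frame per combination of preprocessing
# choices; `test_hypothesis()` is a hypothetical function that fits the
# hypothesis model to one dataset and returns the relevant p-value.
p_values <- vapply(datasets, test_hypothesis, numeric(1))

# Distribution of p-values across the multiverse, and the proportion of
# preprocessing scenarios in which the conclusion holds at alpha = .05.
hist(p_values)
mean(p_values < .05)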

While MA has been used in the past to examine the robustness of findings in cross-sectional survey data (Steegen et al., 2016; Stern et al., 2019), researchers have only recently started applying MA to investigate the validity of conclusions in ESM research. For example, Dejonckheere et al. (2019) demonstrated that, irrespective of the selected mood items, the inverse NA–PA association becomes stronger when anticipating personally relevant events (e.g., the release of exam results for students). While the study of Dejonckheere et al. (2019) highlights how MA can be used to strengthen the validity of ESM research, many frequently made preprocessing choices have not yet been empirically investigated, cautioning against the specification of a single particular combination of preprocessing choices (also known as knife-edge specification; Weston et al., 2019; Young & Holsteen, 2017).

One frequent preprocessing choice for which substantial heterogeneity exists is the use of a compliance cutoff for analysis. Researchers often exclude participants from analyses based on low compliance. However, what constitutes low compliance varies widely across studies [e.g., 8% (Gaudiano et al., 2018), 20% (Edwards et al., 2018), 33% (Lataster et al., 2010), and 50% (Li et al., 2019)] and often goes unreported (Trull & Ebner-Priemer, 2020). Similarly, some studies have excluded participants’ first day of ESM data, labeling it a familiarization day (e.g., Stone et al., 2002). This is based on the assumption that an initial elevation bias affects results (e.g., Shrout et al., 2018), although emerging evidence does not support this postulation (Arslan et al., 2020). Importantly, however, it remains unclear whether this choice affects conclusions. Using MA to provide greater clarity about the potential impact of these two preprocessing choices on often-studied dynamic processes in ESM research would provide meaningful information to the scientific community.
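
As an illustration, the sketch below shows how these two data exclusion rules could be implemented. It assumes a long-format data frame with the hypothetical columns id (participant), day (study day), and completed (whether the prompted questionnaire was filled out); these names are not taken from any specific study.

# Sketch of the two exclusion rules discussed above (hypothetical column names).
apply_exclusions <- function(esm, compliance_cutoff = 0.50, drop_first_day = TRUE) {
  # Person exclusion: compliance = completed beeps / prompted beeps per person.
  compliance <- tapply(esm$completed, esm$id, mean)
  keep_ids <- names(compliance)[compliance >= compliance_cutoff]
  out <- esm[esm$id %in% keep_ids, ]

  # Data exclusion: optionally drop day 1 as a familiarization day.
  if (drop_first_day) out <- out[out$day != 1, ]
  out
}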

Apart from investigating preprocessing choices that often vary across studies (i.e., compliance cutoff and exclusion of day 1), it is equally important to question preprocessing choices that appear undisputedly uniform. One such example is how researchers group items to form psychological constructs. A momentary psychological construct, such as NA, is typically defined as the mean value of a set of Likert-type items. However, this approach has been criticized (Jamieson, 2004): data collected using Likert-type items are ordinal, and arithmetic operations such as calculating a mean are, strictly speaking, not appropriate for ordinal data because the intervals between values cannot be considered equal (Wu & Leung, 2017). Nevertheless, Likert-type item responses are often treated as if they come from an interval scale (Wu & Leung, 2017). To justify this, the ordinal data must approach the properties of interval data (Chyung et al., 2017). For example, respondents must share similar interpretations of the intervals between response options (e.g., the interval between 1 and 2 must be considered equal to the interval between 4 and 5). Simulation research has demonstrated that the mean is biased when this is not the case (Lindstädt et al., 2020). As an alternative to computing the mean, the median or mode can be used (Jamieson, 2004; Lindstädt et al., 2020). However, the use of these alternatives, and their impact on statistical conclusions, has yet to be explored in the ESM literature.
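
The sketch below illustrates these three ways of computing a momentary composite score from a set of Likert-type items. Because base R has no built-in statistical mode, a small helper is defined; resolving ties by taking the first modal value is an arbitrary choice made for this illustration, not a convention taken from the literature.

# Statistical mode of a vector of Likert responses (ties resolved by taking
# the first modal value; an assumption of this sketch).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Composite score for one assessment, computed as the mean, median, or mode.
composite_score <- function(items, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  switch(method,
         mean   = mean(items, na.rm = TRUE),
         median = median(items, na.rm = TRUE),
         mode   = stat_mode(items))
}

# Example: six hypothetical NA items rated on a 1-7 scale.
composite_score(c(2, 2, 3, 1, 2, 5), "mean")    # 2.5
composite_score(c(2, 2, 3, 1, 2, 5), "median")  # 2
composite_score(c(2, 2, 3, 1, 2, 5), "mode")    # 2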

The current study

Using a large pooled ESM dataset of individuals with and without psychosis, the current study employs MA to demonstrate how it can be used to examine the influence of preprocessing choices on the robustness of conclusions. More specifically, we look at choices related to (1) person exclusion (i.e., based on various levels of compliance), (2) data exclusion (i.e., exclusion of the first assessment day), and (3) the calculation of constructs (i.e., composite scores calculated as the mean, median, or mode). Building upon existing research, we evaluated the following three hypotheses: (1) NA is elevated in individuals with psychosis compared to healthy individuals (hypothesis 1; see also Blanchard et al., 1998); (2) momentary stress is associated with momentary NA, and this association (i.e., stress reactivity) is stronger in individuals with psychosis than in healthy individuals for various types of momentary stress (i.e., social, activity, or event stress; hypotheses 2.1 to 2.3, respectively; see also Lataster et al., 2013; van Winkel et al., 2015; Reininghaus et al., 2016); and (3) emotional inertia (as a risk marker of depression) is elevated in individuals with psychosis (of whom more than 50% report comorbid depression; Buckley et al., 2009) compared to healthy individuals (hypothesis 3; see also Kuppens et al., 2010). For each hypothesis, a multiverse of datasets was created based on various combinations of each of the three preprocessing choices discussed above. The analysis plan underlying these analyses was preregistered and can be consulted online (https://osf.io/vefkg/?view_only=0d70ea4e0b8d4241901516131cc38cad).

Method

Sample and assessment procedure

This study combines data from methodologically similar experience sampling studies (Bak et al., 2001; Lataster et al., 2013; Myin-Germeys et al., 2001; Thewissen et al., 2008; Vaessen et al., 2018). All of these studies include patients with psychotic disorders and healthy individuals. General inclusion criteria were (1) age between 18 and 70 years and (2) sufficient proficiency in the Dutch language to comprehend the content of the questionnaires. For the clinical group (i.e., patients with psychosis), participants were required to have a Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) diagnosis of schizophrenia, schizoaffective disorder, or schizophreniform disorder. General exclusion criteria were (1) brain disease or (2) a history of head injury with loss of consciousness. Participants presenting with a family history of schizophrenia spectrum disorder were excluded from participation. Participants in these studies were recruited across Belgium and the Netherlands. The pooled sample comprised 456 participants: 233 individuals with and 223 without psychosis (26,892 longitudinal assessments in total). The mean age of the patient sample was 34.85 years (SD = 11.08), and 33.47% were female. The mean age of the sample of healthy individuals was 37.48 years (SD = 12.56), and 60.99% were female.

In each study, participants were asked to complete an ESM questionnaire ten times per day for six consecutive days. Questionnaires were filled out using paper-and-pencil booklets. A signal-contingent semi-random sampling scheme was used in which ten prompts to fill out the questionnaire were delivered via a digital wristwatch at semi-random times within 90-minute blocks between 7:30 AM and 10:30 PM (at least 15 minutes after the preceding assessment). The questionnaire consisted of items scored on seven-point Likert scales and several open-ended questions. For the current study, only a subset of the Likert scale items was used. A detailed overview is available on this project's preregistration webpage on the Open Science Framework (OSF) (https://osf.io/vefkg/?view_only=0d70ea4e0b8d4241901516131cc38cad).

Included baseline and ESM variables

Baseline variables included age, sex, and diagnostic group. Age was measured as the number of years since birth. Sex was a binary variable referring to biological sex at birth (male, coded as 0; female, coded as 1). The binary group variable (i.e., patient vs. control) was coded 1 if a participant had been diagnosed with a psychotic disorder and 0 for healthy individuals.

ESM variables selected for this study were based on previous work (e.g., Myin-Germeys et al., 2003; Reininghaus et al., 2016a) and included momentary NA and momentary stress. Momentary NA was measured using six items: “I feel uncertain,” “I am lonely,” “I am anxious,” “I feel irritated,” “I feel sad,” and “I feel guilty.” Momentary stress was conceptualized using indices of momentary social stress, activity stress, and event stress. Social stress was assessed with the item “I prefer being alone.” If participants were not in the presence of others, they were instructed to skip this question. Activity stress was framed as stress experienced in currently ongoing activities and was assessed using four items, of which the last was reverse-coded (“I would prefer doing something else,” “It [the ongoing activity] costs much effort,” “It [the ongoing activity] is challenging,” and “I am skilled to do this”). Except for momentary event stress, each variable was measured using items rated on 7-point Likert scales ranging from 1 “not at all” to 7 “very much.” Event stress was defined as the unpleasantness of the most important event since the last beep and was measured using a single item rating the (un)pleasantness of that event on a bipolar scale ranging from −3 “very unpleasant” to +3 “very pleasant.” Positive and neutral ratings (0 to +3) were coded 0, whereas negative ratings were reverse coded (i.e., −1, −2, and −3 became 1, 2, and 3, respectively). Finally, stress reactivity was operationalized as the effect of each momentary stress variable on momentary NA. Information on intraclass correlation coefficients (ICCs), means, and variances of all measured and calculated ESM variables can be found in Table 1 below.
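
For clarity, the recoding of the event (un)pleasantness rating into an event stress score can be written as in the following minimal sketch (the function and argument names are ours).

# Map the bipolar (un)pleasantness rating (-3 = very unpleasant to
# +3 = very pleasant) onto event stress: neutral and pleasant ratings
# (0 to +3) become 0, and unpleasant ratings are flipped to positive values.
event_stress <- function(pleasantness) {
  ifelse(pleasantness >= 0, 0, -pleasantness)
}

event_stress(c(-3, -1, 0, 2))  # returns 3 1 0 0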

Table 1 Variable characteristics

Preprocessing

Datasets were created for each unique combination of preprocessing choices. For hypotheses 1, 2.1, 2.3, and 3, these choices can be broken down as follows: exclusion based on compliance rates ranging from 0 to 50% (in 5% increments; 11 options), inclusion or exclusion of day 1 data (two options), and different central tendency measures to calculate NA (mean, median, or mode; three options). Combining these choices results in 66 (11 × 2 × 3) unique datasets. For hypothesis 2.2, there is one additional choice dimension, as activity stress, unlike social (hypothesis 2.1) and event stress (hypothesis 2.3), was measured with multiple items. The activity stress score can therefore also be computed as the mean, median, or mode of these items (three extra options), resulting in 198 (11 × 2 × 3 × 3) unique datasets.
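
The full grid of preprocessing combinations can be enumerated as in the sketch below, which reproduces the dataset counts given above; the labels for the choice dimensions are ours.

# Enumerate the multiverse of preprocessing combinations.
base_grid <- expand.grid(
  compliance_cutoff  = seq(0, 0.50, by = 0.05),     # 11 options
  exclude_first_day  = c(FALSE, TRUE),              # 2 options
  na_composite       = c("mean", "median", "mode")  # 3 options
)
nrow(base_grid)  # 66 datasets (hypotheses 1, 2.1, 2.3, and 3)

# Hypothesis 2.2 adds a choice for the multi-item activity stress composite;
# merge() on data frames without common columns returns the Cartesian product.
h22_grid <- merge(base_grid,
                  data.frame(activity_composite = c("mean", "median", "mode")))
nrow(h22_grid)   # 198 datasets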

Statistical analyses

Given that the obtained intensive longitudinal data have a hierarchical structure (i.e., repeated momentary measurements [level 1] nested within subjects [level 2]), linear mixed-effects models were used to test our hypotheses. In each model, age and sex were added as covariates. To avoid an overly small coefficient for the age covariate, age was rescaled as (Age − 20)/10, in line with recent work (Rintala et al., 2019). ESM predictor variables were centered on each participant’s person mean, so that predictor effects can be interpreted at the within-person level, relative to each person's own average (Enders & Tofighi, 2007; Hamaker & Grasman, 2014). Restricted maximum likelihood estimation was used to estimate the variance components of our models (Lafit et al., 2021). Results were considered statistically significant when the p-value was below .05. For the first hypothesis, a linear mixed-effects model with a random intercept was estimated, with NA as the dependent variable and diagnostic group (indicating psychosis) as the predictor variable. For hypotheses 2.1 to 2.3, we added momentary stress and its interaction with group to this model, including a random slope for momentary stress. For the third hypothesis, we again built upon the baseline model from hypothesis 1, now including the autoregressive parameter of NA (lagged within individuals and days, with the first beep of each day set to missing) as a predictor variable, as well as its interaction with group. The random-effect structure was again defined as a random intercept and a random slope for lagged momentary NA. Except for this last model, all models included a first-order autoregressive [AR(1)] error structure. Mathematical formulas for the models can be found in Table 2. All analyses were performed in R, version 3.5.3 (2019), using the nlme package (Pinheiro et al., 2020).
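
As an illustration of how such a model can be specified with nlme, the sketch below shows a stress reactivity model of the kind used for hypotheses 2.1 to 2.3. It is a sketch rather than the exact preregistered specification (see Table 2 and the OSF page for that), and the data frame d and its columns (id, beep, na_score, stress, group, age, sex) are hypothetical names.

library(nlme)

# Person-mean center the momentary stress predictor and rescale age.
d$stress_c <- d$stress - ave(d$stress, d$id,
                             FUN = function(x) mean(x, na.rm = TRUE))
d$age_r <- (d$age - 20) / 10

# Stress reactivity model: random intercept and random slope for stress,
# AR(1) residual correlation within persons, estimated with REML.
fit <- lme(
  fixed       = na_score ~ stress_c * group + age_r + sex,
  random      = ~ 1 + stress_c | id,
  correlation = corAR1(form = ~ beep | id),
  data        = d,
  na.action   = na.omit,
  method      = "REML"
)

# The stress_c:group coefficient indexes the group difference in reactivity.
summary(fit)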

Table 2 Formulas for the models used in this study

Results

Based on the predefined compliance cutoff scores and the exclusion of day 1 data, the number of included participants varied between 401 and 456. Table 3 provides a summary of the number of included participants under each exclusion criterion. The results concerning our first hypothesis (i.e., elevated momentary NA in individuals with psychosis compared to healthy individuals) indicated, as expected, significant differences in the mean level of NA between individuals with and without psychosis across all generated datasets. The estimated unstandardized coefficient for group ranged from .53 to .66, with all corresponding p-values < .001 (Fig. 1, row 1).

Table 3 Number of included participants under different exclusion criteria
Fig. 1

Frequencies of p-values across the generated datasets used to test hypotheses 1 to 3. Each row represents one hypothesis. Hypothesis 1 = momentary negative affect is elevated in individuals with psychosis compared to controls; hypotheses 2.1 to 2.3 = momentary negative affect is predicted by momentary social (2.1), activity (2.2), or event-related (2.3) stress, with group (individuals with psychosis vs. controls) moderating this effect; hypothesis 3 = momentary negative affect is predicted by momentary negative affect at the preceding beep, with group (individuals with psychosis vs. controls) moderating this effect. The red line indicates p = .05.

Concerning the second hypothesis, we examined whether there was an association between momentary stress and momentary NA and whether this effect was stronger for individuals with than without psychosis. In line with our hypotheses, we found that social, event, and activity stress were each positively associated with NA across all generated datasets (Fig. 1, rows 2 to 4). The estimated unstandardized coefficients ranged from .04 to .07 for social stress, .06 to .11 for event stress, and .01 to .08 for activity stress (p-values < .01). The interaction between diagnostic group and momentary stress was also significant across all datasets for activity and event stress (p < .05), with unstandardized coefficients ranging from .07 to .09 for activity stress and .03 to .07 for event stress, indicating a stronger association with momentary NA among individuals with psychosis than among healthy individuals. In contrast, for social stress, the interaction term was nonsignificant in those scenarios in which NA was computed as the mean (i.e., 22 out of the 66 datasets; Fig. 1, row 2), whereas it was consistently significant when NA was computed as the median or mode (p < .05).

For the third hypothesis, we investigated whether NA at time point t could be predicted by NA at the previous time point t − 1, with stronger effects anticipated for individuals with psychosis. The results only partially supported this (Fig. 1, row 5). While lagged NA was, as expected, a significant predictor of NA across all datasets (p < .001, unstandardized coefficients ranging from .19 to .21), the interaction term (group × lagged NA) was significant only in those scenarios in which NA and lagged NA were computed as the mean (p < .05, unstandardized coefficients ranging from .07 to .08). Nonsignificant results were observed when the median or mode was used (.31 < p < .66), indicating that preprocessing choices regarding the measure of central tendency affected the conclusion of whether individuals with psychosis do (or do not) show meaningfully greater inertia of NA than healthy individuals.

Post hoc analyses were conducted to further explore this finding using the lavaan R package (Rosseel, 2012), in which we examined the factor structure of NA at the between- and within-person level. As shown in Table 4, the factor loadings of NA were considerably higher at the between- than at the within-person level, suggesting that the negative emotions co-varied less within than across individuals. This finding was further supported by considerably lower internal consistency at the within- (α = .67) than at the between-person level (α = .93).

Given that people may intensely experience a particular constellation of negative emotions, we then used a leave-one-out approach (i.e., using only five out of the six NA items at a time) to investigate whether the results were disproportionately affected by any of the NA items considered in this study. The results revealed that conclusions differed as a function of the selected NA items (Table 5). Specifically, while excluding the item feeling irritated resulted in more datasets yielding a significant interaction term than the original MA (i.e., 55 instead of 44), the opposite was found for feeling lonely, anxious, sad, and guilty (Table 5). No difference was observed for feeling uncertain.

Similarly, as the within- and between-person factor loadings appeared to differ, we additionally investigated measurement invariance of the NA construct. To do this, we tested for configural invariance, metric invariance, and invariance of latent variances and covariances at the within- and between-person level. Additionally, and specifically for the between-person level, scalar invariance and invariance of latent means were investigated (see also Eisele et al., 2021; Lance et al., 2000; Ryu, 2014). The results show that, while there was configural invariance and partial metric invariance at the within-person level (when freeing equality constraints involving NA items 1 and 6, 2 and 6, 3 and 6, and 4 and 6), there was no invariance of latent variances and covariances at this level. At the between-person level, there was configural, metric, and scalar invariance, as well as invariance of (co)variances and latent means. The results of the measurement invariance analyses are summarized in Table 6.
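
A sketch of the type of two-level confirmatory factor analysis underlying these post hoc analyses is given below, assuming hypothetical item names na1 to na6 and a participant identifier id as the clustering variable. It illustrates the general approach rather than reproducing the exact models behind Tables 4 and 6.

library(lavaan)

# Two-level CFA: a single NA factor at the within-person level (level 1)
# and at the between-person level (level 2).
ml_cfa <- '
  level: 1
    na_within  =~ na1 + na2 + na3 + na4 + na5 + na6
  level: 2
    na_between =~ na1 + na2 + na3 + na4 + na5 + na6
'

fit <- cfa(ml_cfa, data = d, cluster = "id")

# Compare standardized loadings (and derived reliabilities) across levels.
summary(fit, fit.measures = TRUE, standardized = TRUE)
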

Table 4 Standardized factor loadings of the negative affect items at both the within- and between-person level for both patients with psychosis and healthy individuals
Table 5 Number of significant scenarios across the multiverse in a leave-one-out approach in which one item at a time was removed from the calculation of the negative affect composite score
Table 6 Measurement invariance analyses of NA within-person structure

Discussion

This study illustrated the use of MA in ESM research by reanalyzing established group differences in NA, stress reactivity, and emotional inertia between individuals with and without psychosis. Using a large pooled intensive longitudinal dataset, we investigated the robustness of statistical conclusions to preprocessing choices related to data exclusion (i.e., based on various levels of compliance and exclusion of the first assessment day) and the calculation of constructs (i.e., composite scores calculated as the mean, median, or mode). The findings revealed that, while the different data exclusion choices had no meaningful effect on conclusions, using the median or mode instead of the mean to compute NA can lead to different conclusions.

Prior research on NA, stress reactivity, and emotional inertia has consistently provided evidence for group differences between healthy individuals and those suffering from mental illness (Kuppens et al., 2010, 2012; Myin-Germeys & van Os, 2007; Reininghaus et al., 2016b). While our findings support these studies, it is essential to acknowledge that this support was, in some instances, dependent upon specific preprocessing choices. Specifically, when NA was computed as the mean, as opposed to the median or mode, our results deviated from those in the literature (e.g., Reininghaus et al., 2016b, on ultrahigh-risk and first-episode psychosis), suggesting that there is no difference in the degree to which social stress predicts NA in individuals with psychosis compared to healthy individuals. Similarly, the strength of emotional inertia of NA differed between individuals with psychosis and healthy individuals only when NA was computed as the mean (comparable to previous work on patients suffering from depression; Kuppens et al., 2010), and not when it was computed as the median or mode. Post hoc analyses suggest that these findings could be attributed to lower factor loadings of negative emotions at the within- than at the between-person level.

The observation of substantially different factor loadings for NA at the within- versus the between-person level is not novel and has been reported previously (Eadeh et al., 2019; Möwisch et al., 2019; Vansteelandt et al., 2005; Zelenski & Larsen, 2000). Zelenski and Larsen (2000) found a factor structure similar to the one we observed and attributed it to affect measurements behaving differently at different levels of analysis. Specifically, these authors argued that emotions are more strongly intercorrelated between individuals (e.g., someone who feels lonelier on average also feels sadder than the average person) than within individuals when the interrelation is assessed across short periods (e.g., someone who feels lonelier than usual may not feel sadder than usual in a specific situation). Our finding of lower within- than between-person factor loadings supports this notion and suggests that using the mean, median, or mode to compute scores may ignore the optimal weighting of items, which not only affects scale reliability but also impacts conclusions in subsequent analyses (see also McNeish & Wolf, 2020). Therefore, it may be necessary to re-evaluate the traditional two-factor structure at the within-person level (Brose et al., 2015). Future work could investigate whether more than one lower-order factor (e.g., an internalizing and an externalizing component) exists, and could also explore the predictive utility of specific negative emotions at the individual level instead of the composite factor NA. In a similar vein, it would be necessary to determine whether the differential factor loadings at the within- and between-person level found in this study generalize across patient populations. If this holds true, the implications are far-reaching, as it would reflect the need to operationalize NA within individuals in a more granular way than is frequently done in ESM research. Furthermore, the finding of only partial metric invariance at the within-person level adds another layer of complexity to ESM research investigating between-group differences in NA. More specifically, it suggests that the pattern of factor loadings may differ within individuals, irrespective of group, for several items. Therefore, future work is advised to evaluate measurement invariance before making substantive inferences based on ESM data.

While the sample size of the dataset (N = 456; 26,892 longitudinal assessments) was considerably larger than in previous ESM studies (N in the range of 42–99; e.g., Rauschenberg et al., 2017; van der Steen et al., 2017; Vasconcelos e Sa et al., 2016), four limitations should be considered in interpreting the findings. First, given that the operationalization of the ESM constructs used in this study varies across the literature (for examples of alternative conceptualizations of NA and social, activity, and event-related stress, see Gerritsen et al., 2019; Lüdtke et al., 2017; and Sitko et al., 2016), it is unclear to what extent the current findings would replicate under operationalizations different from the one used in this study. Second, the dataset used in this study came from a series of studies that collected ESM data using a digital wristwatch and paper-and-pencil booklets. Importantly, however, prior studies have consistently found that this method and fully electronic data collection yield data that are equivalent psychometrically and in patterns of findings (Green et al., 2006; Gwaltney et al., 2008; Jacobs et al., 2005). Third, multiverse scenarios were constructed based on what we considered reasonable preprocessing choices. However, it is important to acknowledge that MA is an exploratory methodology (Simonsohn et al., 2015) and that opinions about which preprocessing choices should be considered may vary among researchers and across labs. At the same time, including many choices may make the multiverse of possible scenarios unmanageable, as the number of datasets to consider quickly becomes so large that it is challenging to disentangle the effect of each preprocessing choice (Del Giudice & Gangestad, 2020). Fourth, we specified only one model per hypothesis. In addition to the unique datasets considered for each hypothesis, a multiverse of possible statistical models exists. Future work could include alternatives such as ordinal variants of mixed-effects models (e.g., Bürkner & Vuorre, 2019) or item response theory (IRT) models (e.g., Hedeker et al., 2006).

These limitations notwithstanding, the findings of this study have several implications for future research. First, we illustrated that MA is a useful and easy-to-use tool for investigating the impact of preprocessing decisions in experience sampling research. Second, our illustrative cases suggest that excluding data based on various compliance levels or excluding the first assessment day may be inadvisable: such exclusions reduce the power of analyses, and we found that the different data exclusion choices did not affect results. That said, we recommend that future work explore whether this finding replicates across various sampling frequencies. Third, the finding that different methods of calculating (i.e., mean, median, mode) or conceptualizing NA (i.e., excluding or including items) affected conclusions in our sample illustrates the importance of exploring the effect of different but equally reasonable ways of operationalizing ESM constructs in future work. In doing so, researchers should be transparent about the choices made and the reasoning behind them, making this explicit during the preregistration process (Kirtley et al., 2019).

Conclusion

This study showed that MA can be used as a valuable data-analytical technique to investigate how different, but equally reasonable, data preprocessing choices affect the statistical conclusions of ESM studies. We found that established group differences in NA, stress reactivity, and emotional inertia between individuals with psychosis and healthy individuals were not affected by different data exclusion choices. In contrast, calculating NA as the mean, median, or mode of its items did affect the conclusions of our study. Additional analyses revealed that this was related to different factor loading patterns at the within- and between-person level. These findings imply that considering only a single set of preprocessing choices risks undermining the validity of conclusions. Therefore, the most significant contribution of this study is that it empirically illustrates the necessity of conducting MA to come to terms with different preprocessing choices in ESM research.