Data Pre-Processing

Here, we demonstrate how measurement-burst data can be prepared for the Dynamic Measurement-Burst Model (DMBM). Because the empirical data used in the manuscript cannot be shared publicly, we simulate a small example dataset below.

The goal of this section is to illustrate two key preprocessing steps:

1. Transform burst data from long format into a semi-wide format
2. Construct and adjust the TINTERVAL variable required by Mplus

These steps ensure that:

  • within-burst dynamics can be modeled in long format

  • between-burst dynamics of process features can be modeled in wide format

  • Mplus does not waste computation time interpreting large gaps between bursts.

Simulate toy example

The manuscript uses third-party data that we are not permitted to share publicly, so we simulate example data with the same structure below.

# ============================================================
# 0) PACKAGES
#    tidyverse components (dplyr, tidyr, tibble) + lubridate
# ============================================================
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)

# ============================================================
# 1) SIMULATE "LONG FORMAT" measurement-burst data (toy example)
#    - 6 participants
#    - 3 bursts (W1/W2/W3)
#    - 5 ESM prompts per burst
#    - 2 example variables: confidence, depression
#    - Each burst spans TWO days
#    NOTE: no set.seed() is used, so exact values differ across runs
# ============================================================

n_id    <- 6
n_waves <- 3
k_prompts <- 5

ids <- sprintf("P%02d", 1:n_id)

wave_start <- tibble(
  W = 1:3,
  wave_gap_days = c(0, 30, 60)  # 30-day gaps between bursts
)

df_long <- expand_grid(
  UUID = ids,
  W    = 1:3,
  t    = 1:k_prompts
) %>%
  left_join(wave_start, by = "W") %>%
  group_by(UUID, W) %>%
  mutate(
    base_time = ymd_hms("2020-01-01 08:00:00", tz = "Europe/Paris") +
      days(wave_gap_days[1]),
    
    # First 3 prompts on Day 1, remaining on Day 2
    day_offset = if_else(t <= ceiling(k_prompts / 2), 0, 1),
    
    # Within-day spacing every 3 hours
    hour_offset = 3 * ((t - 1) %% ceiling(k_prompts / 2)),
    
    Date_Time = base_time +
      days(day_offset) +
      hours(hour_offset) +
      minutes(sample(0:20, 1))  # small random offset, one draw per person × burst
  ) %>%
  ungroup() %>%
  mutate(
    confidence = round(rnorm(n(), mean = 2.5 + 0.3 * W, sd = 0.8), 2),
    depression = round(rnorm(n(), mean = 4.0 - 0.2 * W, sd = 0.9), 2)
  ) %>%
  select(UUID, W, t, Date_Time, confidence, depression)

print(df_long, n = 20)
# A tibble: 90 × 6
   UUID      W     t Date_Time           confidence depression
   <chr> <int> <int> <dttm>                   <dbl>      <dbl>
 1 P01       1     1 2020-01-01 08:03:00       1.88       3.25
 2 P01       1     2 2020-01-01 11:03:00       2.57       3.84
 3 P01       1     3 2020-01-01 14:03:00       2.56       2.78
 4 P01       1     4 2020-01-02 08:03:00       2.47       4.32
 5 P01       1     5 2020-01-02 11:03:00       3          2.65
 6 P01       2     1 2020-01-31 08:06:00       2.39       5.06
 7 P01       2     2 2020-01-31 11:06:00       3.45       3.15
 8 P01       2     3 2020-01-31 14:06:00       2.11       5.11
 9 P01       2     4 2020-02-01 08:06:00       2.92       3.23
10 P01       2     5 2020-02-01 11:06:00       3.4        2.72
11 P01       3     1 2020-03-01 08:00:00       3.51       3.42
12 P01       3     2 2020-03-01 11:00:00       4.04       3.42
13 P01       3     3 2020-03-01 14:00:00       3.35       1.89
14 P01       3     4 2020-03-02 08:00:00       3.8        4.35
15 P01       3     5 2020-03-02 11:00:00       4.27       2.39
16 P02       1     1 2020-01-01 08:01:00       2.25       4.1 
17 P02       1     2 2020-01-01 11:01:00       1.77       4.25
18 P02       1     3 2020-01-01 14:01:00       2.84       3.92
19 P02       1     4 2020-01-02 08:01:00       2.61       3.69
20 P02       1     5 2020-01-02 11:01:00       2.37       3.98
# ℹ 70 more rows

Structure of measurement-burst data

The simulated data illustrate the typical structure of measurement-burst designs:

  • measurement occasions (t) nested within bursts (W)
  • bursts nested within persons (UUID)

Importantly, bursts are separated by large time gaps (e.g., 30 days), while observations within bursts are closely spaced (e.g., every 3 hours).

This distinction is crucial when constructing the TINTERVAL variable.
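To make this concrete, the spacing pattern can be read directly off the timestamps. The following base-R sketch (toy values, not the simulated data above) shows how the within-burst gaps differ from the between-burst gap:

```r
# Toy timestamps for one participant: two bursts of three prompts each,
# 3 hours apart within a burst, 30 days between bursts
times <- as.POSIXct("2020-01-01 08:00:00", tz = "UTC") +
  c(0, 3, 6, 720, 723, 726) * 3600

# Gaps between consecutive observations, in hours
gaps_h <- as.numeric(diff(times), units = "hours")
gaps_h
# [1]   3   3 714   3   3
```

The third gap (714 hours) is the inter-burst gap that the TINTERVAL adjustment below will compress.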

Data Transformation

To prepare the data for the DMBM, we first reshape the dataset into a semi-wide format.

In this format:

  • intensive longitudinal observations remain in long format

  • burst-level variables appear as separate columns per burst

This allows the model to simultaneously estimate:

  • within-burst dynamics

  • between-burst dynamics of process features.

Step: Convert bursts to columns

Each burst receives its own set of variables (e.g., confidence_W1, confidence_W2, confidence_W3).

df_long <- df_long %>%
  group_by(UUID, W) %>%
  arrange(Date_Time, .by_group = TRUE) %>%
  mutate(t = row_number()) %>%   # 1..k per person×wave
  ungroup()

# transform to wide format:

df_wide <- df_long %>%
  # choose the variables that should become wave-specific columns
  pivot_wider(
    id_cols = c(UUID, t, Date_Time),
    names_from = W,
    values_from = c(confidence, depression),
    names_glue = "{.value}_W{W}"
  )

print(df_wide, n = 20)
# A tibble: 90 × 9
   UUID      t Date_Time           confidence_W1 confidence_W2 confidence_W3
   <chr> <int> <dttm>                      <dbl>         <dbl>         <dbl>
 1 P01       1 2020-01-01 08:03:00          1.88         NA            NA   
 2 P01       2 2020-01-01 11:03:00          2.57         NA            NA   
 3 P01       3 2020-01-01 14:03:00          2.56         NA            NA   
 4 P01       4 2020-01-02 08:03:00          2.47         NA            NA   
 5 P01       5 2020-01-02 11:03:00          3            NA            NA   
 6 P01       1 2020-01-31 08:06:00         NA             2.39         NA   
 7 P01       2 2020-01-31 11:06:00         NA             3.45         NA   
 8 P01       3 2020-01-31 14:06:00         NA             2.11         NA   
 9 P01       4 2020-02-01 08:06:00         NA             2.92         NA   
10 P01       5 2020-02-01 11:06:00         NA             3.4          NA   
11 P01       1 2020-03-01 08:00:00         NA            NA             3.51
12 P01       2 2020-03-01 11:00:00         NA            NA             4.04
13 P01       3 2020-03-01 14:00:00         NA            NA             3.35
14 P01       4 2020-03-02 08:00:00         NA            NA             3.8 
15 P01       5 2020-03-02 11:00:00         NA            NA             4.27
16 P02       1 2020-01-01 08:01:00          2.25         NA            NA   
17 P02       2 2020-01-01 11:01:00          1.77         NA            NA   
18 P02       3 2020-01-01 14:01:00          2.84         NA            NA   
19 P02       4 2020-01-02 08:01:00          2.61         NA            NA   
20 P02       5 2020-01-02 11:01:00          2.37         NA            NA   
# ℹ 70 more rows
# ℹ 3 more variables: depression_W1 <dbl>, depression_W2 <dbl>,
#   depression_W3 <dbl>

Adding TINTERVAL to the data

Dynamic models in Mplus (DSEM) require a continuous time variable that represents, for each participant, the time elapsed since their first observation. This variable is then referenced in the TINTERVAL option in Mplus.

Rather than simply counting measurement occasions, this variable preserves the actual spacing between observations.

For example:

  • first observation → 0

  • second observation 3 hours later → 3

  • third observation 6 hours later → 6

This allows Mplus to correctly interpret temporal precedence and to correctly account for unequal spacing between observations.


Measurement-burst complication

Measurement-burst data introduce a practical complication:

The time gap between bursts can be extremely large.
For example, a 30-day gap corresponds to 720 hours.

If left unchanged, Mplus interprets these large gaps literally and inserts many missing time points internally, which can dramatically slow estimation.
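To see the scale of the problem, suppose the TINTERVAL grid width is 3 hours (the within-burst spacing in our toy data; the value is an assumption and depends on your design). A 30-day gap then implies hundreds of empty grid points between two bursts:

```r
gap_hours <- 30 * 24   # 720 hours between bursts
grid_step <- 3         # assumed TINTERVAL grid width in hours

# Empty grid points implied by the gap (slots between the last
# observation of one burst and the first of the next)
empty_slots <- gap_hours / grid_step - 1
empty_slots
# [1] 239
```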


Key idea

We only model dynamics within bursts.

Therefore:

  • long gaps between bursts do not carry meaningful information

  • compressing them does not affect estimation of the within-burst dynamics

We therefore:

  • keep the temporal ordering within bursts intact

  • replace large inter-burst gaps with a small fixed interval (e.g., 24 hours).

This preserves the temporal structure needed for estimation while avoiding unnecessary computational burden.
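The compression logic can be sketched on a single participant's elapsed-hours vector (base R, toy values) before applying it to the full dataset below:

```r
# Elapsed hours for one participant: burst 1 = (0, 3, 6), burst 2 starts 720 h later
hours <- c(0, 3, 6, 720, 723, 726)
burst <- c(1, 1, 1, 2, 2, 2)

# Observed gap between the end of burst 1 and the start of burst 2
gap <- min(hours[burst == 2]) - max(hours[burst == 1])  # 714 hours

# Shift burst 2 so it starts exactly 24 h after burst 1 ends;
# spacing within burst 2 is untouched
adjusted <- ifelse(burst == 2, hours - gap + 24, hours)
adjusted
# [1]  0  3  6 30 33 36
```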

This is done in three steps:

Step 1 — Create time interval variable

We first compute elapsed time since each participant’s first observation.

df_time <- df_long %>%
  group_by(UUID) %>%
  arrange(Date_Time, .by_group = TRUE) %>%
  mutate(
    seconds_since_start_pp = as.numeric(Date_Time - first(Date_Time), units = "secs"),
    hours_since_start_pp   = seconds_since_start_pp / 3600
  ) %>%
  ungroup()

df_time %>% select(UUID, W, t, Date_Time, hours_since_start_pp) %>% head(10)
# A tibble: 10 × 5
   UUID      W     t Date_Time           hours_since_start_pp
   <chr> <int> <int> <dttm>                             <dbl>
 1 P01       1     1 2020-01-01 08:03:00                   0 
 2 P01       1     2 2020-01-01 11:03:00                   3 
 3 P01       1     3 2020-01-01 14:03:00                   6 
 4 P01       1     4 2020-01-02 08:03:00                  24 
 5 P01       1     5 2020-01-02 11:03:00                  27 
 6 P01       2     1 2020-01-31 08:06:00                 720.
 7 P01       2     2 2020-01-31 11:06:00                 723.
 8 P01       2     3 2020-01-31 14:06:00                 726.
 9 P01       2     4 2020-02-01 08:06:00                 744.
10 P01       2     5 2020-02-01 11:06:00                 747.

Step 2 — Identify burst boundaries per participant

We determine the first and last observation of each burst.

# Adjust gaps between waves:
# 1) Get per-person wave boundaries (first/last Date_Time per wave)
pp_bounds <- df_long %>%
  group_by(UUID, W) %>%
  summarise(
    W_first = min(Date_Time),
    W_last  = max(Date_Time),
    .groups = "drop"
  ) %>%
  pivot_wider(
    id_cols = UUID,
    names_from = W,
    values_from = c(W_first, W_last),
    names_glue = "W{W}_{.value}"
  )

pp_bounds
# A tibble: 6 × 7
  UUID  W1_W_first          W2_W_first          W3_W_first         
  <chr> <dttm>              <dttm>              <dttm>             
1 P01   2020-01-01 08:03:00 2020-01-31 08:06:00 2020-03-01 08:00:00
2 P02   2020-01-01 08:01:00 2020-01-31 08:10:00 2020-03-01 08:13:00
3 P03   2020-01-01 08:17:00 2020-01-31 08:18:00 2020-03-01 08:00:00
4 P04   2020-01-01 08:20:00 2020-01-31 08:20:00 2020-03-01 08:09:00
5 P05   2020-01-01 08:13:00 2020-01-31 08:09:00 2020-03-01 08:06:00
6 P06   2020-01-01 08:08:00 2020-01-31 08:14:00 2020-03-01 08:20:00
# ℹ 3 more variables: W1_W_last <dttm>, W2_W_last <dttm>, W3_W_last <dttm>

Step 3 — Compress gaps between bursts

Large inter-burst gaps are replaced with a fixed interval (24 hours here).

# 2) Join bounds back to each row, compute adjusted seconds
df_adj <- df_time %>%
  left_join(pp_bounds, by = "UUID") %>%
  mutate(
    seconds_since_start_adjusted = case_when(
      W == 1 ~ seconds_since_start_pp,
      
      # Wave 2 starts 24h after Wave 1 ended (for that participant)
      W == 2 ~ {
        gap_W1_W2 <- as.numeric(W2_W_first - W1_W_last, units = "secs")
        seconds_since_start_pp - gap_W1_W2 + 86400
      },
      
      # Wave 3 starts 24h after Wave 2 ended
      W == 3 ~ {
        gap_W1_W2 <- as.numeric(W2_W_first - W1_W_last, units = "secs")
        gap_W2_W3 <- as.numeric(W3_W_first - W2_W_last, units = "secs")
        
        # start from pp time, compress W2->W3 gap to 24h, and keep W2 positioned 24h after W1
        seconds_since_start_pp - gap_W2_W3 + 86400 - gap_W1_W2 + 86400
      }
    ),
    tinterval_h = round(seconds_since_start_adjusted / 3600, 2) # transform to hours
  )

df_adj %>% select(UUID, Date_Time, W, t, hours_since_start_pp, tinterval_h) %>% head(12)
# A tibble: 12 × 6
   UUID  Date_Time               W     t hours_since_start_pp tinterval_h
   <chr> <dttm>              <int> <int>                <dbl>       <dbl>
 1 P01   2020-01-01 08:03:00     1     1                   0            0
 2 P01   2020-01-01 11:03:00     1     2                   3            3
 3 P01   2020-01-01 14:03:00     1     3                   6            6
 4 P01   2020-01-02 08:03:00     1     4                  24           24
 5 P01   2020-01-02 11:03:00     1     5                  27           27
 6 P01   2020-01-31 08:06:00     2     1                 720.          51
 7 P01   2020-01-31 11:06:00     2     2                 723.          54
 8 P01   2020-01-31 14:06:00     2     3                 726.          57
 9 P01   2020-02-01 08:06:00     2     4                 744.          75
10 P01   2020-02-01 11:06:00     2     5                 747.          78
11 P01   2020-03-01 08:00:00     3     1                1440.         102
12 P01   2020-03-01 11:00:00     3     2                1443.         105

Final dataset

The resulting dataset:

  • preserves within-burst timing

  • removes large inter-burst gaps

  • is ready for DSEM estimation in Mplus.

adjusted_df <- df_adj %>%
  # keep identifiers (UUID, Date_Time, t, tinterval_h); outcomes become wave-specific columns
  pivot_wider(
    id_cols = c(UUID,Date_Time, t, tinterval_h),
    names_from = W,
    values_from = c(confidence, depression),
    names_glue = "{.value}_W{W}"
  ) %>%
  # drop rows where ALL outcome columns are NA
  filter(if_any(starts_with("confidence_") | starts_with("depression_"), ~ !is.na(.x)))

print(adjusted_df, n = 20)
# A tibble: 90 × 10
   UUID  Date_Time               t tinterval_h confidence_W1 confidence_W2
   <chr> <dttm>              <int>       <dbl>         <dbl>         <dbl>
 1 P01   2020-01-01 08:03:00     1           0          1.88         NA   
 2 P01   2020-01-01 11:03:00     2           3          2.57         NA   
 3 P01   2020-01-01 14:03:00     3           6          2.56         NA   
 4 P01   2020-01-02 08:03:00     4          24          2.47         NA   
 5 P01   2020-01-02 11:03:00     5          27          3            NA   
 6 P01   2020-01-31 08:06:00     1          51         NA             2.39
 7 P01   2020-01-31 11:06:00     2          54         NA             3.45
 8 P01   2020-01-31 14:06:00     3          57         NA             2.11
 9 P01   2020-02-01 08:06:00     4          75         NA             2.92
10 P01   2020-02-01 11:06:00     5          78         NA             3.4 
11 P01   2020-03-01 08:00:00     1         102         NA            NA   
12 P01   2020-03-01 11:00:00     2         105         NA            NA   
13 P01   2020-03-01 14:00:00     3         108         NA            NA   
14 P01   2020-03-02 08:00:00     4         126         NA            NA   
15 P01   2020-03-02 11:00:00     5         129         NA            NA   
16 P02   2020-01-01 08:01:00     1           0          2.25         NA   
17 P02   2020-01-01 11:01:00     2           3          1.77         NA   
18 P02   2020-01-01 14:01:00     3           6          2.84         NA   
19 P02   2020-01-02 08:01:00     4          24          2.61         NA   
20 P02   2020-01-02 11:01:00     5          27          2.37         NA   
# ℹ 70 more rows
# ℹ 4 more variables: confidence_W3 <dbl>, depression_W1 <dbl>,
#   depression_W2 <dbl>, depression_W3 <dbl>
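
The adjusted time variable is what the TINTERVAL option in the Mplus VARIABLE command would point to. A minimal sketch (a 3-hour grid is assumed here to match the within-burst spacing; adapt the value to your design, and note that Mplus requires numeric data, so a character ID such as UUID would first need to be recoded numerically):

```text
VARIABLE:
  CLUSTER = UUID;               ! persons as clusters (recode to numeric for Mplus)
  TINTERVAL = tinterval_h (3);  ! time grid in 3-hour steps
```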