I have a dataset of short-term behaviors displayed by 30 individuals.
# Load packages
library(TraMineR)
# Function to generate a random non-numerical sequence
generate_random_sequence <- function(length) {
  alphabet <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")
  return(sample(alphabet, length, replace = TRUE))
}
# Generate 30 sequences with lengths between 15 and 40
num_sequences <- 30
min_length <- 15
max_length <- 40
# Create an empty data frame; sequences shorter than max_length stay NA on the right
sequence_data <- as.data.frame(matrix(NA, ncol = max_length, nrow = num_sequences))
# Populate the data frame with random sequences
for (i in seq_len(num_sequences)) {
  seq_length <- sample(min_length:max_length, 1)
  sequence_data[i, 1:seq_length] <- generate_random_sequence(seq_length)
}
# Create the sequence object using seqdef
sequences <- seqdef(sequence_data, informat = "STS")
I want to perform hierarchical cluster analysis (HCA) and then test whether a continuous variable x predicts which cluster each sequence falls into.
However, my sequences have very different lengths (15 to 40 states). I have tried dynamic time warping (DTW), but my understanding is that DTW relies on numeric distances between observations and therefore cannot be applied directly to categorical data. I'm at a loss: how can I align my sequences, or otherwise compare them despite their unequal lengths, so that I can perform the HCA?
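For reference, this is roughly the downstream step I have in mind once I have a pairwise dissimilarity matrix. It is only a sketch under assumptions: optimal matching via TraMineR's seqdist() is used as a placeholder measure because it can compare sequences of unequal length, not because I know it is the right choice here. The constant substitution cost, the indel cost, the number of clusters (k = 4), the Ward linkage, and the simulated x are all hypothetical stand-ins.

```r
library(TraMineR)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Rebuild toy data of the same shape as above: 30 sequences, lengths 15 to 40,
# padded with NA on the right
alphabet <- LETTERS[1:11]
n <- 30
max_len <- 40
m <- matrix(NA_character_, nrow = n, ncol = max_len)
for (i in seq_len(n)) {
  len <- sample(15:max_len, 1)
  m[i, seq_len(len)] <- sample(alphabet, len, replace = TRUE)
}
toy_seq <- seqdef(as.data.frame(m, stringsAsFactors = FALSE), informat = "STS")

# Placeholder dissimilarity: optimal matching (edit distance) with
# constant substitution costs; the cost values are arbitrary assumptions
costs <- seqsubm(toy_seq, method = "CONSTANT", cval = 2)
om_dist <- seqdist(toy_seq, method = "OM", indel = 1, sm = costs)

# Hierarchical clustering on the dissimilarity matrix
tree <- hclust(as.dist(om_dist), method = "ward.D2")
k <- 4  # hypothetical number of clusters
cluster <- cutree(tree, k = k)

# Hypothetical stand-in for my continuous variable x
x <- rnorm(n)

# Compare x across clusters (descriptive only)
print(tapply(x, cluster, mean))
```

With real data, x would be my measured covariate rather than random noise, and the final step would be replaced by a proper model of cluster membership on x.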