I have a dataset of short-term behaviors displayed by 30 individuals.

#Load packages
library(TraMineR)

# Function to generate a random non-numerical sequence
generate_random_sequence <- function(length) {
  alphabet <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")  
  return(sample(alphabet, length, replace = TRUE))
}

# Generate 30 sequences with lengths between 15 and 40
num_sequences <- 30
min_length <- 15
max_length <- 40

# Create a data frame
sequence_data <- as.data.frame(matrix(NA, ncol = max_length, nrow = num_sequences))

# Populate the data frame with random sequences
for (i in 1:num_sequences) {
  seq_length <- sample(min_length:max_length, 1)
  sequence_data[i, 1:seq_length] <- generate_random_sequence(seq_length)
}

# Create the sequence object using seqdef
sequences <- seqdef(sequence_data, informat = "STS")

I want to perform hierarchical cluster analysis (HCA) to see if a continuous variable x predicts which cluster each sequence falls into. However, my sequences are of wildly different lengths. I have tried running dynamic time warping (DTW), but my understanding is that, because DTW relies on distances, it cannot be applied to categorical data. I'm at a loss: how can I align my sequences so that I can perform the HCA?

1 Answer

Clustering categorical sequences is a typical task in sequence analysis (SA); see the Wikipedia page "Sequence analysis in social sciences" and the many references given there.

There are multiple ways of measuring dissimilarities between categorical sequences, and several of them handle sequences of different lengths directly, so no prior alignment is needed. See the review by Studer & Ritschard (2016). Many of these measures can be computed with the seqdist function of the TraMineR package.
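
To see why unequal lengths are not an obstacle, here is a minimal sketch using base R's adist, which computes the generalized Levenshtein distance between strings. This is not a TraMineR function; it is only the simplest edit distance, which optimal matching generalizes with user-defined costs. The three toy sequences are made up for illustration, with one character per state.

```r
# Toy categorical sequences of unequal length (illustrative data only)
toy <- c(s1 = "ABCAB", s2 = "ABA", s3 = "CCB")

# Levenshtein distance: each insertion, deletion, or substitution costs 1.
# Insertions and deletions absorb the length differences, so no padding
# or prior alignment is required.
d <- adist(toy)
print(d)

# The distance matrix can be clustered like any other dissimilarity matrix
hc <- hclust(as.dist(d), method = "average")
print(cutree(hc, k = 2))
```

The same idea carries over to seqdist, where the insertion/deletion and substitution costs can be tuned instead of being fixed at 1.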

I illustrate below using the optimal matching (OM) distance with INDELSLOG costs, i.e., indel and substitution costs based on the frequency of occurrence of the different tokens. The resulting distance matrix is then passed to hclust.

dist.om <- seqdist(sequences, method="OM", sm="INDELSLOG")
hcl <- hclust(as.dist(dist.om))
plot(hcl)
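
To relate the clusters to the continuous variable x from the question, one possible follow-up is to cut the tree into groups and compare x across them. The sketch below is only an illustration under several assumptions: the data are simulated as in the question, x is a hypothetical covariate drawn at random, the number of clusters (k = 3) and the Ward linkage are arbitrary choices, and the Kruskal-Wallis test is just one simple option among many.

```r
library(TraMineR)

set.seed(1)  # reproducible toy data

# Simulated data as in the question: 30 sequences of length 15 to 40,
# stored row-wise and padded with NA on the right
n <- 30
m <- matrix(NA_character_, nrow = n, ncol = 40)
for (i in seq_len(n)) {
  len <- sample(15:40, 1)
  m[i, seq_len(len)] <- sample(LETTERS[1:11], len, replace = TRUE)
}
sequences <- seqdef(as.data.frame(m))

# Hypothetical continuous covariate, one value per individual (assumption)
x <- rnorm(n)

# OM distances, hierarchical clustering, and a 3-cluster partition
dist.om <- seqdist(sequences, method = "OM", sm = "INDELSLOG")
hcl <- hclust(as.dist(dist.om), method = "ward.D2")
cluster <- factor(cutree(hcl, k = 3))

# Cluster sizes, then a rank-based test of whether x differs across clusters
print(table(cluster))
print(kruskal.test(x ~ cluster))
```

With real data, the number of clusters should be chosen from the dendrogram or a cluster quality measure rather than fixed in advance.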
