Sequence alignment for hierarchical cluster analysis on categorical sequence data

Question

I have a dataset of short-term behaviors displayed by 30 individuals.

# Load packages
library(TraMineR)

# Function to generate a random non-numerical sequence
generate_random_sequence <- function(length) {
  alphabet <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")  
  return(sample(alphabet, length, replace = TRUE))
}

# Generate 30 sequences with lengths between 15 and 40
num_sequences <- 30
min_length <- 15
max_length <- 40

# Create a data frame
sequence_data <- as.data.frame(matrix(NA, ncol = max_length, nrow = num_sequences))

# Populate the data frame with random sequences
for (i in 1:num_sequences) {
  seq_length <- sample(min_length:max_length, 1)
  sequence_data[i, 1:seq_length] <- generate_random_sequence(seq_length)
}

# Create the sequence object using seqdef
sequences <- seqdef(sequence_data, informat = "STS")

I want to perform hierarchical cluster analysis (HCA) to see whether a continuous variable x predicts which cluster each sequence falls into. However, my sequences have very different lengths. I have tried dynamic time warping (DTW), but my understanding is that, because DTW relies on numeric distances, it cannot be applied to categorical data. I'm at a loss: how can I align my sequences so that I can perform the HCA?

Gilbert · Accepted Answer · 2024-01-18 12:39:08Z

Clustering categorical sequences is a typical task in sequence analysis (SA); see the Wikipedia page "Sequence analysis in social sciences" and the many references given there.

There are multiple ways of measuring dissimilarities between categorical sequences, including between sequences of different lengths. See the review by Studer & Ritschard (2016). Many of these measures can be computed with the seqdist function of the TraMineR package.
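To see why unequal lengths are not an obstacle for edit-based measures, here is a minimal sketch using base R's adist (generalized Levenshtein distance). It is not part of TraMineR and serves only as a stand-in: it counts insertions, deletions and substitutions, which corresponds to an optimal matching distance with all costs set to 1. The toy sequences are made up for illustration.

```r
# Toy sequences of different lengths, one character per state (made-up data)
toy <- c(s1 = "AABBC", s2 = "ABC", s3 = "AABBCCDD")

# Generalized Levenshtein distance: minimal number of insertions,
# deletions and substitutions needed to turn one string into another
d <- adist(toy)
d
```

Insertions and deletions absorb the length differences, so no prior alignment or padding of the sequences is needed.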

I illustrate below using the optimal matching (OM) distance with INDELSLOG indel and substitution costs, which are derived from the frequency of occurrence of each state:

dist.om <- seqdist(sequences, method="OM", sm="INDELSLOG")
hcl <- hclust(as.dist(dist.om))
plot(hcl)
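To relate the clusters to the continuous variable x from the question, one possible next step is to cut the tree into groups and compare x across them. The sketch below is self-contained and runs in base R only, so it makes several assumptions: the dissimilarity matrix is a Levenshtein stand-in computed with adist (with TraMineR loaded you would use dist.om instead), x is simulated, and k = 3 is an arbitrary number of clusters. The Kruskal-Wallis test is just one simple way to check for an association, not the only reasonable choice.

```r
set.seed(42)  # reproducible toy data

# Stand-in data: 30 random strings over the letters A-K, lengths 15 to 40
n <- 30
toy <- vapply(seq_len(n), function(i) {
  paste(sample(LETTERS[1:11], sample(15:40, 1), replace = TRUE), collapse = "")
}, character(1))

# Stand-in dissimilarity matrix (assumption: replace with dist.om in practice)
d <- adist(toy)

# Hierarchical clustering (default complete linkage), then cut into k groups
k <- 3  # arbitrary choice for illustration
hcl_toy <- hclust(as.dist(d))
cluster <- factor(cutree(hcl_toy, k = k))

# Hypothetical continuous covariate, one value per individual
x <- rnorm(n)

# Cluster sizes, and a rank-based test of whether x differs across clusters
table(cluster)
kt <- kruskal.test(x ~ cluster)
kt
```

With random sequences and a random x, as here, no meaningful association should be expected; the point is only the workflow from dissimilarities to cluster membership to a test against x.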
