I want to cluster sequences using optimal matching with TraMineR::seqdist()
on data that contain missing values, i.e. sequences containing gaps.
library(TraMineR)
data(ex1)
sum(is.na(ex1))
# [1] 38
sq <- seqdef(ex1[1:13])
sq
# Sequence
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 *-D-D-D-D-D-D-D-D-D-D
# s4 A-A-*-*-B-B-B-B-D-D
# s5 A-*-A-A-A-A-*-A-A-A
# s6 *-*-*-C-C-C-C-C-C-C
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*
sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)
#     A->   B-> C->   D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000
When I run seqdist() with this matrix:
dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
I get:
Error: 'with.missing' must be TRUE when 'seqdata' or 'refseq' contains missing values
But when I add the option with.missing=TRUE to the seqdist() call, I get:
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
Error: [!] size of substitution cost matrix must be 5x5
So, how can I correctly compute the dissimilarities between sequences using seqdist()
with the output of seqsubm()
when the data contain missing values, i.e. when the sequences contain gaps?
Note: I'm not sure whether this makes sense at all. So far I have simply excluded observations with missing values, but with my data I lose a lot of observations that way. It would therefore be worthwhile to know whether such an option exists.
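For concreteness, here is a minimal sketch of what I guess might be needed. The 5x5 error suggests the cost matrix needs an extra row and column for the missing state, so the sketch assumes that passing with.missing = TRUE to seqsubm() as well produces a matrix of that size. The name sm.miss is just one I made up for illustration, and I have not confirmed that this is the recommended way to treat gaps.

```r
library(TraMineR)
data(ex1)

# Same state sequence object as above (columns 1 to 13 of ex1)
sq <- seqdef(ex1[1:13])

# Assumption: with.missing = TRUE makes seqsubm() treat the missing state
# as an additional state, so the matrix has one more row and column
sm.miss <- seqsubm(sq, method = "TRATE", with.missing = TRUE)
dim(sm.miss)

# Pass the enlarged matrix to seqdist(), again with with.missing = TRUE
dist.om <- seqdist(sq, method = "OM", indel = 1, sm = sm.miss,
                   with.missing = TRUE)
```

If this is right, the missing state is simply handled like any other state, with its substitution costs derived from the transition rates. Whether that is substantively sensible for my data is part of what I'm asking.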