Sequence analysis clustering CHI2 EUCLID error

Question

I am quite new to sequence analysis and am trying to identify clusters in an aggregated sequence matrix, focusing on state duration. However, when I use method='CHI2' or method='EUCLID' combined with step=1 (but not with other step values), I get the error:

Error in if (SCres > currentSCres) { : missing value where TRUE/FALSE needed

Any ideas why? There are some NaN values in the distance matrix; could they result from the sequences being of different lengths?

What the sequence object and distance matrix look like:

Sequence                                         
1    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a  
3    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5    b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a

Distance matrix
          1    2          3          4
2       NaN
3 289.92897  NaN
4 141.07472  NaN  263.22855
5  10.22425  NaN  290.10919  141.44473

Code:

library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4

SO = seqdef(DAT,right='DEL')
DM = seqdist(SO, method = "CHI2", step=1, full.matrix = F)
FIT = seqpropclust(SO, diss=DM, maxcluster=8, 
      properties=c("state", "duration", "spell.age","spell.dur",
        "transition","pattern", "AFtransition", "AFpattern","Complexity"))

I cannot reproduce the error. Please provide a minimum working example. In addition, could you specify the version of TraMineR you are using. — Gilbert, Commented Feb 6, 2020 at 12:42
Thank you for the fast response (and great package). The data is sensitive so I have made up some sequences which seem to recreate the same problem. I hope it is enough. — Rico, Commented Feb 6, 2020 at 19:56

Gilbert · Accepted Answer · 2020-02-08 16:05:50Z

The "CHI2" distance between two sequences x and y computed by TraMineR is the sum of the Chi-squared distances between the state distributions over the successive periods of length step. See Studer and Ritschard (2014, p. 8).

This means that for step=1 a Chi-squared distance is computed at each position. When one of the sequences has void values at some positions (e.g. the last position in your second sequence), the distance cannot be computed for these positions, and we get a NaN value for the CHI2 distance between this sequence and any other sequence.
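A minimal sketch of where such a NaN can come from, assuming the state distribution of a period is obtained by dividing the state counts by the number of non-void elements in that period. The helper state_props is hypothetical and only illustrates the arithmetic; it is not TraMineR's internal code, and voids are represented here by NA for convenience.

```r
# Illustrative only: a simplified per-period state distribution.
# state_props is a made-up helper, not a TraMineR function.
state_props <- function(x, alphabet) {
  # table() ignores NA by default, so a void position contributes no counts
  counts <- table(factor(x, levels = alphabet))
  # With zero non-void elements this is 0/0, which R evaluates to NaN
  as.numeric(counts) / sum(counts)
}

alphabet <- c("a", "b", "c", "e")

full_pos <- state_props("a", alphabet)            # position 25 of sequence 1
void_pos <- state_props(NA_character_, alphabet)  # position 25 of sequence 2

full_pos
void_pos
```

Any distance built from void_pos inherits the NaN, which matches the NaN column for sequence 2 in the question.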

To avoid that, you can use the following workarounds:

1) Set a step value large enough to be sure each sequence contains at least one non-void element in each period interval. For your example, the longest sequences are of length 25. To be sure the last period contains non-void elements, you have to set step=5.

DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",  
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
         "b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)
DM <- seqdist(SO, method = "CHI2", step=5)
DM
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [2,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [3,] 4.543441 4.543441 0.000000 2.028370 4.604927
## [4,] 4.543441 4.543441 2.028370 0.000000 4.604927
## [5,] 1.030776 1.030776 4.604927 4.604927 0.000000

2) Drop the columns with void elements:

SOdrop <- SO[,1:(ncol(SO)-1)]
SOdrop
DMd <- seqdist(SOdrop, method = "CHI2", step=1)
DMd
##          [,1]     [,2]      [,3]      [,4]     [,5]
## [1,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [2,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [3,] 10.04158 10.04158  0.000000  4.472136 10.34811
## [4,] 10.04158 10.04158  4.472136  0.000000 10.34811
## [5,]  2.50000  2.50000 10.348108 10.348108  0.00000

3) Fill the shorter sequences with missing values and treat the missing value as an additional possible state. By default, right='DEL' in seqdef, which creates voids. Here we set right=NA to get missing values instead.

SOm <- seqdef(DAT, right=NA)
DMm <- seqdist(SOm, method = "CHI2", step=1, with.missing=TRUE)
DMm
##          [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  0.000000  2.738613 10.408330 10.408330  2.500000
## [2,]  2.738613  0.000000 10.527741 10.527741  3.708099
## [3,] 10.408330 10.527741  0.000000  5.477226 10.704360
## [4,] 10.408330 10.527741  5.477226  0.000000 10.704360
## [5,]  2.500000  3.708099 10.704360 10.704360  0.000000
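Before picking one of the workarounds above, it can help to check which sequences are shorter than the longest one, since those are the ones ending in voids. The sketch below assumes TraMineR is installed and uses its seqlength function on the example data; the variable names len and short are illustrative.

```r
library(TraMineR)

DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
         "b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)

# Length of each sequence, not counting void positions
len <- seqlength(SO)

# Indices of the sequences that are shorter than the longest one
short <- which(len < max(len))
short
```

If short is empty, all sequences have the same length and step=1 should not produce NaN values for this reason.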

Now, the error reported in the question is NOT an error of seqdist, but of the seqpropclust function from the WeightedCluster package. The error is caused by the NaN values in the dissimilarity matrix.
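The same kind of error can be reproduced in base R, because a comparison involving NaN returns NA and `if` cannot branch on NA. The sketch below also shows a simple guard to run before clustering; check_diss is a hypothetical helper, not part of TraMineR or WeightedCluster.

```r
# A comparison with NaN yields NA, and `if (NA)` raises an error
msg <- tryCatch(
  if (NaN > 1) "greater" else "not greater",
  error = function(e) conditionMessage(e)
)
msg  # the error message, e.g. about a missing value where TRUE/FALSE is needed

# Hypothetical guard: stop early if the dissimilarities contain NA or NaN
check_diss <- function(diss) {
  if (anyNA(diss)) {
    stop("dissimilarity matrix contains NA/NaN values")
  }
  invisible(diss)
}

good <- matrix(c(0, 1, 1, 0), nrow = 2)
bad  <- matrix(c(0, NaN, NaN, 0), nrow = 2)

check_diss(good)                                   # passes silently
bad_result <- tryCatch(check_diss(bad), error = function(e) "rejected")
bad_result
```

Calling such a check on the output of seqdist before passing it to seqpropclust makes the cause visible at the point where the NaN values are created.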
