
I used ggseqdplot() from the ggseqplot package, which computes the entropy via the TraMineR package. To familiarise myself with the formula, I tried, unsuccessfully, to reproduce it by hand on simple example data.

Alphabet size: 3

Probability of each event: 1/26, 22/26, 3/26

The call ggseqdplot(mine_mvad.seq, with.entropy = TRUE) gives me 0.265087328, which is the default normalised entropy.

The formula is given on pp. 77-78. I tried:

- 1/log(3) / ((1/26*(log(1/26)) + 22/26*log(22/26) + 3/26*log(3/26)))

and

- (1/26*log(1/26) + 22/26*log(22/26) + 3/26*log(3/26))

but I don't get the right result. I must have missed something in the formula.

  • I have the basic formula with the natural logarithm, but I don't understand the expression "entropy is normalised, i.e., divided by the entropy of the alphabet", which is supposed to give the default result.
    – JC_
    Commented Dec 5, 2023 at 12:01

1 Answer


Shannon's entropy of a distribution $(p_1,...,p_n)$ is $-\sum_i p_i \log(p_i)$. In TraMineR, the entropy is computed with the natural logarithm.
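As a sketch in plain base R (not TraMineR code), this formula can be applied directly to the example distribution c(1/26, 22/26, 3/26); the helper function below is hypothetical, introduced only for illustration:

```r
# Shannon entropy with the natural logarithm (hypothetical helper,
# not a TraMineR function)
shannon <- function(p) -sum(p * log(p))

p <- c(1/26, 22/26, 3/26)
shannon(p)            # raw (non-normalized) entropy, about 0.5158
shannon(p) / log(3)   # normalized by the maximal entropy log(3), about 0.4695
```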

TraMineR computes both the longitudinal entropy (i.e., the entropy of the distribution within each sequence) with function seqient and the cross-sectional entropy of the state distribution at each position with function seqstatd. Function ggseqdplot (from package ggseqplot) plots the cross-sectional distributions and the cross-sectional entropies. The plotted values are taken from the outcome of seqstatd.

In your example, we do not know to what the distribution c(1/26, 22/26, 3/26) corresponds. If it is the distribution inside a sequence of length 26 you should compare with the outcome of seqient. If it is the distribution, say at the first position, across 26 sequences, then you should compare with the outcome of seqstatd (the entropy is the third element of the returned list or $Entropy). Both return normalized values. The normalization is done by dividing by the maximal possible entropy for the given alphabet, which is the log of the alphabet size. In both cases, you can get non-normalized values by specifying norm=FALSE.

To illustrate, I build a set of 26 sequences of length 26 using alphabet {"a","b","c"}.

library(TraMineR)
sdat <- c("a/1-b/22-c/3","b/2-c/24","c/5-b/10-a/11")
sda <- sdat[c(1,rep(2,22),rep(3,3))]
sdsts <- seqformat(sda, from="SPS", to="STS", 
              SPS.in = list(xfix = "", sdsep = "/"), stsep="-")
myseq <- seqdef(sdsts, cnames=1:26)
seqIplot(myseq)

[Index plot of myseq]

Now, we compute the cross-sectional entropies and display the first five values:

myseq.distr <- seqstatd(myseq)
myseq.distr$Entropy[1:5]
##         1         2         3         4         5 
## 0.4695343 0.3255263 0.1483905 0.1483905 0.1483905

The first value (i.e., the cross-sectional entropy at the first position) is obtained as:

ent <- -((1/26*(log(1/26)) + 22/26*log(22/26) + 3/26*log(3/26)))
ent/log(3)
## [1] 0.4695343

The cross-sectional distributions and entropies are plotted with

seqdHplot(myseq)

[Cross-sectional distributions and entropies of myseq]

We get the longitudinal entropy of the first sequence with

seqient(myseq[1,])
##       Entropy
## [1] 0.4695343

and non-normalized value with

seqient(myseq[1,], norm=FALSE)
##       Entropy
## [1] 0.5158361
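As a quick consistency check (plain arithmetic on the values above, not a TraMineR call), the non-normalized entropy is just the normalized one multiplied by log(3), the maximal entropy for a three-state alphabet:

```r
0.4695343 * log(3)   # recovers the non-normalized value, about 0.5158
```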

The value of the normalized entropy is the same as the entropy computed transversally at the first position because, by construction of our example data, the distribution within the first sequence is the same as the cross-sectional distribution at the first position.
