1

I want to randomly change states in a sequence dataset for the purposes of simulation. The goal is to see how different measures of cluster quality behave with different degrees of structure in the data.

If I were to introduce missings, there is the handy seqgen.missing() function in TraMineRextras, but it only adds missing states. How would I go about randomly picking a proportion pof sequences and randomly inserting a randomly selected element of the alphabet to them with p_g, p_l, and p_r probabilities for inserting them in the middle, left, and right?

2
  • This is a question about coding without statistical content. It should be asked on StackOverflow. In any case, you should explain to what you would like to change the randomly selected elements in the sequences. To a fixed state? A randomly selected element of the alphabet?
    – Gilbert
    Commented Jul 3, 2018 at 13:18
  • Ok. I understand. Please for free to vote for migrating it to the correct site. I’ll edit to clarify the points you raised.
    – Kenji
    Commented Jul 3, 2018 at 14:46

1 Answer 1

1

Below is a seq.rand.chg function (adapted from seqgen.missing) that randomly applies state changes to a proportion p.cases of sequences. For each randomly selected sequence the function randomly changes the state either

  1. When p.gaps > 0, at a proportion between 0 and p.gaps of the positions;

  2. When p.left > 0 and/or p.right > 0, at an at most p.left (p.right) proportion left (right) positions.

As in the seqgen.missing function, the p.gaps, p.left, and p.right are the maximum proportion of cases changed in each selected sequence. These are not exactly your probabilities p_g, p_l, and p_r. But it should be easy to adapt the function for that.

Here is the function:

seq.rand.chg <- function(seqdata, p.cases=.1, p.left=.0, p.gaps=0.1, p.right=.0){
  n <- nrow(seqdata)
  alph <- alphabet(seqdata)
  lalph <- length(alph)
  lgth <- max(seqlength(seqdata))

  nm <- round(p.cases * n, 0)
  ## selecting cases
  idm <- sort(sample(1:n, nm))
  rdu.r <- runif(n,min=0,max=p.right)
  rdu.g <- runif(n,min=0,max=p.gaps)
  rdu.l <- runif(n,min=0,max=p.left)

  for (i in idm){
    # inner positions
    gaps <- sample(1:lgth, round(rdu.g[i] * lgth, 0))
    seqdata[i,gaps] <- alph[sample(1:lalph, length(gaps), replace=TRUE)]
    # left positions
    nl <- round(rdu.l[i] * lgth, 0)
    if (nl>0) seqdata[i,1:nl] <- alph[sample(1:lalph, nl, replace=TRUE)]
    # right positions
    nr <- round(rdu.r[i] * lgth, 0)
    if (nr>0) seqdata[i,(lgth-nr+1):lgth] <- alph[sample(1:lalph, nr, replace=TRUE)]
  }

  return(seqdata)
}

We illustrate the usage of the function with the first three sequences of the mvad data

library(TraMineR)
data(mvad)
mvad.lab <- c("employment", "further education", "higher education",
              "joblessness", "school", "training")
mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad[, 17:62], states = mvad.shortlab,
                   labels = mvad.lab, xtstep = 6)
mvad.ori <- mvad.seq[1:3,]

## Changing up to 50% of states in 30% of the sequences
seed=11
mvad.chg <- seq.rand.chg(mvad.ori, p.cases = .3, p.gaps=0.5)

## plotting the outcome 
par(mfrow=c(3,1))
seqiplot(mvad.ori, with.legend=FALSE, main="Original sequences")
seqiplot(mvad.chg, with.legend=FALSE, main="After random changes")
seqlegend(mvad.ori, ncol=6 )

Sequences before and after changes]

We observe that changes were applied to the randomly selected 3rd sequence.

Not the answer you're looking for? Browse other questions tagged or ask your own question.