I am trying to conduct event sequence analysis on longitudinal survey data. I want to create a plot like the one on p. 44 of https://www.researchgate.net/publication/279560802_Exploratory_mining_of_life_event_histories, which I believe was generated using the seqpcplot() function in TraMineR.

This would allow me to identify common occupational states which participants transition through whilst in the survey (e.g. “full-time education >> full-time work” OR “full-time work >> part-time work >> family responsibilities”).

Unfortunately, different participants stay within the survey for different amounts of time, leading to sequences of varying length. This seems to cause TraMineR to create a missing data state ‘%’ at the end of all but the longest sequences (I think to make sure they are all the same length?). This additional state ‘%’ is then inserted into the seqpcplot() graph.

Here is a randomly generated example of the problem:

## Import libraries and set seed
library(TraMineR)
set.seed(123)



## Define functions

# Function which randomly generates sequences of varying length
ranseq <- function(x,y) {
  y[round(runif( round(runif(1, 1, x)), 1, length(y)) ) ]
}

# Function which creates dataframe from randomly generated sequences
rangen <- function(x,y,z) {
  # Create list of randomly generated sequences
  data <- list()
  for (i in 1:x) {
    a <- ranseq(y,z)
    b <- c(a, rep(NA, y-length(a) ) )
    data[[i]] <- b
  }
  # Convert to dataframe
  data <- data.frame(do.call(rbind, data))
  return(data)
}



## Generate sequences

# Define possible states of the sequence
states <- c("A","B","C","D","E","F")

# Run rangen function (no. rows, max seq length, possible states)
data <- rangen(300,25,states)



## Convert to sequence object

# Convert data to a state sequence object
# NOTE THAT ALL MISSING VALUES (NAs) BEFORE, WITHIN AND AFTER SEQUENCES ARE DELETED
data.seq <- seqdef(data = data, alphabet = states, states = states, labels = states, 
                   left = "DEL", right = "DEL", gaps = "DEL")
head(data.seq)

####################################################################################

  Sequence                         
1 E-C-E-F-A-D-E-D                  
2 F-C-D-D-B-E-B-A-C-F-E-D          
3 F-D-E-D-D-B-B-F-F-D-E-A-C-E-B-C  
4 B-C-C-C-B-B-B                    
5 B-E-A-C-E-B-D-B-B-E-E-C          
6 A-C-B-E-C-E-E-E-C-E-D-E-A-C-B-C-D

In this example, participants are assigned 1 of 6 potential states in each wave of the survey. The total length of the sequence varies between participants depending on how many times they have been interviewed (e.g. participant 4 has been interviewed 7 times, whilst participant 6 has been interviewed 17 times).

However, once this has been converted to an event sequence object, a final state ‘%’ has been added to the end of almost every sequence:

# Convert to event sequence object
data.eseq <- seqecreate(data.seq, tevent = "state")
head(data.eseq)

####################################################################################

 [1] (E)-1-(C)-1-(E)-1-(F)-1-(A)-1-(D)-1-(E)-1-(D)-1-(%)-0                                          
[2] (F)-1-(C)-1-(D)-2-(B)-1-(E)-1-(B)-1-(A)-1-(C)-1-(F)-1-(E)-1-(D)-1-(%)-0                        
[3] (F)-1-(D)-1-(E)-1-(D)-2-(B)-2-(F)-2-(D)-1-(E)-1-(A)-1-(C)-1-(E)-1-(B)-1-(C)-1-(%)-0            
[4] (B)-1-(C)-3-(B)-3-(%)-0                                                                        
[5] (B)-1-(E)-1-(A)-1-(C)-1-(E)-1-(B)-1-(D)-1-(B)-2-(E)-2-(C)-1-(%)-0                              
[6] (A)-1-(C)-1-(B)-1-(E)-1-(C)-1-(E)-3-(C)-1-(E)-1-(D)-1-(E)-1-(A)-1-(C)-1-(B)-1-(C)-1-(D)-1-(%)-0

This results in the following ‘seqpcplot’:

## Plot seqpcplot
# NOTE THAT 'missing' HAS BEEN SET TO "hide" AND 'with.missing' TO 'FALSE'
seqpcplot(seqdata = data.eseq, filter = list(type = "function", value = "linear"),
          order.align = "first", missing = "hide", with.missing = FALSE)

Here, virtually every sequence ends in the state ‘%’. This isn’t useful, because all it tells me is that these event sequences have ‘missing data’ attached to the end to account for the fact that they are shorter than the longest sequence in the dataset.

Question 1: Is there any way to format the data or the graph to remove this missing data state ‘%’?

Question 2: If not, why not? It seems to me it should be perfectly possible to plot event sequences of varying lengths on a graph like this without resorting to this ‘%’ category.

Thanks in advance for your time!

1 Answer

In seqecreate you can specify the event that ends observation time. So a simple solution is to specify the void attribute of the state sequence object ('%' by default) as the end.event:

data.eseq <- seqecreate(data.seq, tevent = "state", 
                        end.event = attr(data.seq,'void') )

This works only when tevent = 'state' and leaves the void symbol in the alphabet of the resulting event sequence.
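To see the effect of end.event on a small case, here is a minimal sketch that reuses the toy sequences from further down in this answer (the object names sqe.default and sqe.end are only illustrative). It converts the same state sequences with and without end.event so the printed event sequences and the alphabet can be compared side by side:

```r
library(TraMineR)

# Toy state sequences of unequal length; right = 'DEL' turns trailing
# missing values into the void symbol ('%' by default)
sq.dat <- c('AAAA', 'AAAC', 'ABC', 'ABAA', 'AC')
sqm <- seqdef(seqdecomp(sq.dat, sep = ''), right = 'DEL')

# Default conversion: the void symbol can appear as a trailing event
sqe.default <- seqecreate(sqm, tevent = "state")

# With end.event: the void symbol marks the end of observation instead
sqe.end <- seqecreate(sqm, tevent = "state",
                      end.event = attr(sqm, 'void'))

# Compare the printed sequences, then inspect the alphabet
print(sqe.default)
print(sqe.end)
alphabet(sqe.end)
```

Printing both objects makes it easy to check whether the trailing '%' event has gone, and alphabet() shows whether the void symbol is still listed.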

A better solution is to act on the state-to-event transformation matrix tevent: first, generate the matrix associated with the selected method, then empty the entries in the column associated with the void state. I illustrate below using the 'transition' tevent method.

sq.dat <- c('AAAA','AAAC','ABC','ABAA','AC')
sqm <- seqdef(seqdecomp(sq.dat, sep=''), right='DEL')
tm <- seqetm(sqm,method='transition')
tm[,which(colnames(tm)==attr(sqm,'void'))] <- ''
sqe <- seqecreate(sqm,tevent=tm)
alphabet(sqe)
##[1] "A"   "A>B" "A>C" "B>A" "B>C"
seqpcplot(sqe)

  • Hi @Gilbert, thanks so much for the feedback! I've tried it on my example above and it does exactly what you said. Namely, remove the '%' sequence object. I have two follow-up questions if that's OK: (1) When I plot this new event sequence data, the seqpcplot() still displays the '%' category, even though no sequences contain it. Do you know how to remove this from the plot? (2) When I apply this to my actual data, I get the error message: Error in seqpcplot_private(seqdata = seqdata, group = group, weights = weights, : [!] cannot link weight and id vector. Do you know what's causing this?
    – Misc584
    Commented Aug 10, 2020 at 17:46
  • The pc-plot of the event sequences obtained with tevent='state' is the same as that of the DSS of the state sequence. So try seqpcplot(seqdss(data.seq)). As for the error, I would need the data and the code.
    – Gilbert
    Commented Aug 12, 2020 at 8:00
  • Hi @Gilbert, that change has sorted it, thanks! It also stopped the error with my actual data. Thanks for all the help!
    – Misc584
    Commented Aug 12, 2020 at 15:44
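
The suggestion from the comments above (plotting the DSS, i.e. the sequence of distinct successive states, instead of the event sequences) can be sketched as follows. This is only an illustration on the toy sequences from the answer, not the asker's data; sq.dss is a hypothetical name.

```r
library(TraMineR)

# Toy state sequences of unequal length (same toy data as in the answer)
sq.dat <- c('AAAA', 'AAAC', 'ABC', 'ABAA', 'AC')
sqm <- seqdef(seqdecomp(sq.dat, sep = ''), right = 'DEL')

# Distinct successive states: runs of the same state collapse to one entry
sq.dss <- seqdss(sqm)
print(sq.dss)

# Parallel coordinate plot drawn directly from the DSS state sequences
seqpcplot(sq.dss)
```

Because seqpcplot() is given a state sequence object here, no conversion with seqecreate() is involved.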