%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Visualization tools in the seqHMM package} %\VignetteKeyword{categorical time series} %\VignetteKeyword{latent Markov models} %\VignetteKeyword{latent class models} \documentclass{article} \usepackage[authoryear,round,longnamesfirst]{natbib} \usepackage{amsmath} \usepackage{array} \usepackage{hyperref} % \usepackage[backend=bibtex]{biblatex} % \newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \author{Satu Helske and Jouni Helske\\ University of Turku, Finland} \title{Visualization tools in the \texttt{seqHMM} package} \begin{document} \maketitle \tableofcontents \section{Introduction} Visualization is a powerful tool throughout the analysis process from the first glimpses into the data to exploration and finally to presentation of the results. This vignette is supplementary material to the paper \citet{Helske2016} on the \texttt{seqHMM} package; here we discuss visualization of sequence data and hidden Markov models in more detail and give a more examples on the plotting features of the package. \subsection{Visualizing sequences} There are many options for graphical description of sequence data. Most of them either represent sequences or summarize them. \textit{Sequence index plot} is the most commonly used example of the former (see an illustration on the left-hand side of Figure \ref{fig:seqplotSingle}). Such a graph was proposed by \citet{Scherer2001} to show the observations of each subject in the order they appear, illustrating different states with different colours. The horizontal axis shows the time points while individuals are represented on the vertical axis; thus, each horizontal line shows the sequence of one individual. <>= library("seqHMM") data("biofam", package = "TraMineR") biofam_seq <- seqdef(biofam[, 10:25], start = 15, labels = c( "parent", "left", "married", "left+marr", "child", "left+child", "left+marr+ch", "divorced") ) data("biofam3c") marr_seq <- seqdef(biofam3c$married, start = 15, alphabet = c( "single", "married", "divorced"), cpal = c("violetred2", "darkgoldenrod2", "darkmagenta") ) child_seq <- seqdef(biofam3c$children, start = 15, alphabet = c("childless", "children"), cpal = c("darkseagreen1", "coral3") ) left_seq <- seqdef(biofam3c$left, start = 15, alphabet = c( "with parents","left home"), cpal = c("lightblue", "red3") ) @ <>= library(patchwork) p1 <- stacked_sequence_plot(biofam_seq[1:100, ], type = "i", legend_position = "none") p2 <- stacked_sequence_plot(biofam_seq[1:100, ], type = "d", legend_position = "right") p1 + p2 & ggplot2::xlab("Age") @ When the number of subjects is moderate, sequence index plots give an accurate representation of the data, offering an overview on the timing of transitions and on the durations of different episodes. Sequence index plots get more complex to comprehend when the number of individuals and states increases. Sequence analysis with clustering eases interpretation by grouping similar trajectories together. \citet{Piccarreta2010} suggested using multidimensional scaling for ordering sequences more meaningfully (similar sequences close to each other). \citet{Piccarreta2012} proposed smoothing techniques that reduce individual noise. Similar sequences are summarized into artificial sequences that are representative to the data. \citet{Gabadinho2011} introduced \textit{representative sequence plots} where only a few of the most representative sequences (observed or artificial) are shown. A similar approach, \textit{relative frequency sequence plot}, was introduced by \citet{Fasang2014}. The idea is to find a representative sequence (the medoid) in equal-sized neighbourhoods to represent the relative frequencies in the data. \textit{State distribution plots} \citep[also called tempograms or chronograms;][]{Billari2005,Widmer2009} summarize information in the whole data. Such graphs show the change in the prevalence of states in the course of time (see the right-hand side of Figure \ref{fig:seqplotSingle}). Also here, the horizontal axis represents time but vertical axis is now a percentage scale. These plots simplify the overall patterns but do not give information on transitions between different states. Other summary plots include, e.g., \textit{transversal entropy plots} \citep{Billari2001}, which describe how evenly states are distributed at a given time point, and \textit{mean time plots} \citep{Gabadinho2011}, which show the mean time spent in each state across the time points. <>= seq_data <- list( Original = biofam_seq[1:10, ], Marriage = marr_seq[1:10, ], Parenthood = child_seq[1:10, ], Residence = left_seq[1:10, ] ) stacked_sequence_plot( seq_data, sort_by = "start", sort_channel = "Original" ) & ggplot2::xlab("Age") @ Visualizing multichannel data is not a straightforward task. The \texttt{TraMineR} package \citep{Gabadinho2011} provides nice plotting options and summaries for simple sequence data, but there are no easy options for multichannel data. Combining states into a single-channel representation often works well if the state space is small and states at each time point are either completely observed or completely missing. In other cases it can be preferable to preserve the multichannel structure. The \textit{stacked sequence plot} shows sequence data plotted separately for each channel (see Figure \ref{fig:seqplotMulti}). The order of the subjects is kept the same in each plot and the plots are stacked on top of each other. Since the time axes are horizontally aligned, comparing timing in different life domains should be relatively easy. This approach also protects the privacy of the subjects; even though all data are shown, combining information across channels for a single individual is difficult unless the number of subjects is small. State distribution plots can then be used to show information on the prevalence and timing of combined states on a more general level. \subsection{Visualizing hidden Markov models} \label{sec:visHMM} Markovian models are often visualized as directed graphs where vertices (nodes) present states and edges (arrows, arcs) show transition probabilities between states. We have extended this basic graph in the hidden Markov model framework by presenting hidden states as pie charts, with emission probabilities as slices, and by adjusting the thickness of edges according to transition probabilities. Such graph allows for presenting a complex model in a very efficient way, guiding the viewer to the most important aspects of the model. The graph shows the essence of the hidden states and the dynamics between them. Figure \ref{fig:HMMplot} illustrates a HMM with five hidden states for the multichannel version of the \texttt{biofam} data. Following the common convention, hidden states are presented as vertices and transition probabilities are shown as edges. Initial state probabilities are given below the respective vertices. Almost all individuals (99\%) start from the first hidden state of living home unmarried and without children. Vertices are drawn as pie charts where the slices represent emitted observations or -- in a multichannel case -- combinations of observed states across channels. The size of the slice is proportional to the emission probability of the observed state (or in a multichannel model, the product of the emission probabilities across channels). In this model, the hidden states are very close to observed states, the largest emission probabilities for the combined observations are 0.94 or higher in all states. For emphasizing the relevant information, observations with small emission probabilities can be combined into one category -- here we've combined states with emission probabilities less than 2\%. The width of the edge depends on the probability of the transition; the most probable transitions are thus easy to detect. Here the most probable transition between two hidden states is the transition into parenthood (typically married/childless/left home $\to$ married/children/left home) with a probability of 0.19. <>= plot(hmm_biofam, layout = matrix(c( 1, 2, 3, 4, 2, 1, 1, 1, 1, 0 ), ncol = 2), # varying curvature of edges edge.curved = c(0, -0.8, 0.6, 0, 0, -0.8, 0), # thinner edges and arrows cex.edge.width = 0.8, edge.arrow.size = 1, # fixing axes to the right scale xlim = c(0.5, 4.5), ylim = c(-0.5, 1.5), rescale = FALSE, # different legend properties with.legend = "bottom", legend.prop = 0.3, ncol.legend = 2, # distance of vertex labels to vertices vertex.label.dist = 1.1, # threshold for emission probabilities not shown as separate slices combine.slices = 0.02, combined.slice.label = "others (emission prob. < 0.02)" ) @ \subsection{Example data} As an example we use the \texttt{biofam} data available in the \texttt{TraMineR} package \citep{Gabadinho2011}. It is a sample of 2000 individuals born in 1909--1972, constructed from the Swiss Household Panel survey in 2002 \citep{Mueller2007}. The data set contains sequences of annual family life statuses from age 15 to 30. Eight observed states are defined from the combination of five basic states: living with parents, left home, married, having children, and divorced. To show a more complex example, we have split the original data into three separate channels representing different life domains: marriage, parenthood, and residence. The data for each individual now includes three parallel sequences constituting of two or three states each: single/married/divorced, childless/parent, and living with parents / having left home. This three-channel version of the data is stored as a new data object called \texttt{biofam3c}. \section{Tools for visualizing multichannel sequence data} \subsection{Stacked sequence plots} The \texttt{stacked\_sequence\_plot} function is the simplest way of plotting multichannel sequence data in \texttt{seqHMM}. It can be used to illustrate state distributions or sequence index plots. The former is the default option, since index plots can take a lot of time and memory if data are large. \\ \noindent \textbf{Data types.} The \texttt{stacked\_sequence\_plot} function accepts two types of objects, either state sequence objects of type \texttt{stslist} from the \texttt{seqdef} function or a hidden Markov model object of \texttt{seqHMM} package. As an example of the former, we start by preparing three state sequence objects from the \texttt{biofam3c} data. Here we set the start of the sequence at age 15 and define the states in each channel with the \texttt{alphabet} argument. The colour palette is stored as an attribute and may be modified accordingly. <>= library("seqHMM") data("biofam3c") marr_seq <- seqdef( biofam3c$married, start = 15, alphabet = c("single", "married", "divorced"), cpal = c("violetred2", "darkgoldenrod2", "darkmagenta") ) child_seq <- seqdef( biofam3c$children, start = 15, alphabet = c("childless", "children"), cpal = c("darkseagreen1", "coral3") ) left_seq <- seqdef( biofam3c$left, start = 15, alphabet = c("with parents", "left home"), cpal = c("lightblue", "red3") ) @ The \texttt{stacked\_sequence\_plot} function is designed for multichannel data but also works for single-channel sequences. For multichannel data, when using \texttt{stslist} objects the different channels are given as a list. Figure \ref{fig:plottingsequences} illustrates a default plot. <>= p <- stacked_sequence_plot( x = list(Marriage = marr_seq, Parenthood = child_seq, Residence = left_seq) ) p @ The object returned by \texttt{stacked\_sequence\_plot} can be further modified with the functionalities of \texttt{patchwork} and \texttt{ggplot2} packages. For example, the x-axis label may be modified with the \texttt{ggplot2::xlab} function and \texttt{patchwork}'s \texttt{\&} operator as \texttt{p \& ggplot2::xlab("Age")} (In single-channel case, \texttt{+} could be used, but here \texttt{\&} modifies all three subfigures). \noindent \textbf{Choosing plots.} In the case of hidden Markov model, the \texttt{stacked\_sequence\_plot} function may be asked to plot observed sequences, most probable hidden state sequences, or both. The type of sequences is set with the \texttt{plots} argument. Figure \ref{fig:plottingsequencesHMM} shows state distributions for the observations as well as hidden states. <>= data("hmm_biofam") stacked_sequence_plot(x = hmm_biofam, plots = "both") @ \noindent \textbf{Sequence index plots} are called with the \texttt{type = "index"}. This type of plot shows the sequences as a whole. As the index plot is often difficult to interpret as such, sequences may be ordered in a more meaningful way. \\ <>= stacked_sequence_plot( hmm_biofam, type = "index", sort_by = "start", sort_channel = 3 ) @ \noindent \textbf{Sorting.} The \texttt{sort\_by} argument may be given a sorting variable or a sorting method. A sorting method may be one of \texttt{"start"}, \texttt{"end"}, or \texttt{"mds"}. The \texttt{"start"} and \texttt{"end"} options sort the sequences by the elements of the alphabet in the channel defined by the \texttt{sort\_channel} argument, starting from the start or the end of the sequences. \texttt{sort\_channel} should be either an integer or the name of the channel (name of the list component). Hidden states are automatically given a name \texttt{"Hidden states"}. The \texttt{"mds"} option sorts the sequences according to the scores of multidimensional scaling for the channel defined by the \texttt{sort\_channel}. Figure \ref{fig:seqIplot} shows sequences sorted by the third channel, residence. \noindent \textbf{Legend.} The \texttt{legend\_position} argument defines if and where the legend for the states is plotted. The default value \texttt{"right"} creates separate legends for each channel and positions them on the right-hand side of the plot. This argument can be a single value or a vector of length matching the number of channels; these value of passed to \texttt{ggplot2::theme(legend.position)}. The legend of one or multiple channels may be supressed altogether with the value \texttt{"none"}. \section{Tools for hidden Markov models} \subsection{\texttt{plot.hmm}: plotting hidden Markov models} A basic HMM graph is easily called with the \texttt{plot} method. Figure \ref{fig:code_plottingHMMbasic} illustrates the default plot. <>= plot(hmm_biofam) @ A simple default plot is a convenient way of visualizing the models during the analysis process, but for publishing it is often better to modify the plot to get an output that best illustrates the structure of the model at hand. \\ \noindent \textbf{Layout.} The default layout positions hidden states horizontally. This may be changed with the \texttt{layout} argument. The user may choose to position hidden states vertically or use a layout function from the \texttt{igraph} package. It is also possible to position the vertices manually by giving a two-column numerical matrix of x and y coordinates. \\ \noindent \textbf{Vertices.} By default, the vertices are drawn as pie charts showing the emission probabilities as slices. The pies may be omitted with the \texttt{pie = FALSE} argument. Initial probabilities are shown as vertex labels by default. Instead of these, the user may choose to print the names of the hidden states, use own labels, or omit the labels altogether with the \texttt{vertex.label} argument. The distance and the positions of the labels are modified with the \texttt{vertex.label.dist} and \texttt{vertex.label.pos} arguments. The size of the vertices is modified with the \texttt{vertex.size} argument. The colour palette for (combinations of) observed states is set with the \texttt{cpal} argument. Observations with small emission probabilities may be combined into one state with the \texttt{combine.slices} argument which sets the threshold for printing observed states. By defafult, observed states with emission probabilities less than 0.05 are combined. The colour and label for the combined state is modified with the \texttt{combined.slice.color} and \texttt{combined.slice.label} arguments, respectively. \\ \noindent \textbf{Edges.} By default, the plotting method draws transition probabilities between different states and omits transitions to same states. Self-loops may be drawn with the \texttt{loops} argument. The \texttt{edge.curved} argument tells the curvature of edges. These may be different for different edges; setting curvature to 0 or FALSE draws straight edges. The widths of the edges are modified with the \texttt{edge.width} and \texttt{cex.edge.width} arguments. The former sets the widths (by default, proportional to transition probabilities) and the latter sets an expansion factor to all edges. The size of the arrows is modified with the \texttt{edge.arrow.size} argument. The \texttt{trim} argument omits edges with transition probabilities less than the specified value (0 by default, i.e., no trimming). \\ \noindent \textbf{Legend.} The \texttt{with.legend} argument defines if and where the legend is plotted. The \texttt{ltext} argument may be used to modify the labels shown in the legend. Similarly to ssp figures, the legend may be modified with the \texttt{legend.prop}, \texttt{ncol.legend}, and \texttt{cex.legend} arguments. \\ \noindent \textbf{Texts.} Font families for labels are modified with the \texttt{vertex.label.family} and \texttt{edge.label.family} arguments. Printing of model parameters may be modified with the \texttt{label.signif}, \texttt{label.scientific}, and \texttt{label.max.length} arguments. The first rounds labels to the specified number of significant digits, the second one defines if scientific notation should be used to describe small numbers, and the last argument sets the maximum number of digits for labels. These three arguments are omitted for user-given labels. \noindent \textbf{Other modifications.} The plotting method also accepts other arguments in the \texttt{plot.igraph} function in the \texttt{igraph} package. Figure \ref{fig:HMMplot} in Section \ref{sec:visHMM} is an example of manual positioning of vertices. Here we have modified edge curvature and width, legend properties, and combined slices. The code for creating the figure is shown here. <>= plot(hmm_biofam, layout = matrix(c( 1, 2, 3, 4, 2, 1, 1, 1, 1, 0 ), ncol = 2), xlim = c(0.5, 4.5), ylim = c(-0.5, 1.5), rescale = FALSE, edge.curved = c(0, -0.8, 0.6, 0, 0, -0.8, 0), cex.edge.width = 0.8, edge.arrow.size = 1, legend.prop = 0.3, ncol.legend = 2, vertex.label.dist = 1.1, combine.slices = 0.02, combined.slice.label = "others (emission prob. < 0.02)" ) @ Figure \ref{fig:HMMplotLayout} shows an example of using a layout function from the \texttt{igraph} package. This is based on the same model as the other figures but here we have omitted pie graphs and instead show the names of the hidden states printed within the circles. This figure also shows the self-loops for transitions. We have omitted transition probabilities smaller than 0.01 and printed labels with three significant digits. <>= require("igraph") set.seed(1234) plot(hmm_biofam, layout = layout_nicely, pie = FALSE, vertex.size = 30, vertex.label = "names", vertex.label.dist = 0, edge.curved = FALSE, edge.width = 1, loops = TRUE, edge.loop.angle = -pi / 8, trim = 0.01, label.signif = 3, xlim = c(-1, 1.3) ) @ \subsection{\texttt{plot.mhmm}: plotting mixture hidden Markov models} The \texttt{plot.mhmm} function shows the submodels of a \texttt{mhmm} object similarly to the \texttt{mssplot} function. Also here the user may choose which submodels to plot with the \texttt{which.plots} argument and ask for a menu from which to choose the next plot with the \texttt{ask} argument. Instead of an interactive mode it is also possible to plot all submodels in the same figure similarly to the \texttt{gridplot} function. Note, however, that such a plot requires opening a large window. \section{Helper tools} \subsection{\texttt{mc\_to\_sc\_data} and \texttt{mc\_to\_sc}: from multichannel to single-channel} We also provide a function \texttt{mc\_to\_sc\_data} for the easy conversion of multichannel sequence data into a single channel representation. Plotting combined data is often useful in addition to (or instead of) showing separate channels. A similar function called \texttt{mc\_to\_sc} converts multichannel HMMs into single-channel representations. \subsection{\texttt{colorpalette}: ready-made colour palettes} The \texttt{colorpalette} data is a list of ready-made colour palettes with distinct colours. By default, the \texttt{seqHMM} package uses these palettes when determining colours for new state sequence objects or for plotting. A colour palette with $n$ colours is called with \texttt{colorpalette[[n]]}. \subsection{\texttt{plot\_colors}: show colours of colour palettes} The \texttt{plot\_colors} function plots colors and their labels for easy visualization of a colorpalette. Figure \ref{fig:colorpalette} shows an example from \texttt{colorpalette} with seven colours. <>= plot_colors(colorpalette[[7]]) @ \subsection{\texttt{separate\_mhmm}: reorganize MHMM into a list of HMMs} The \texttt{separate\_mhmm} function reorganizes the parameters of an \texttt{mhmm} object into a list where each list component is an object of class \texttt{hmm} consisting of parameters of the corresponding cluster. This gives more possibilities for plotting. For example, the user may define ssp figures for each cluster defined by an MHMM for plotting with the \texttt{gridplot} function. % \pagebreak \bibliographystyle{plainnat} \bibliography{references_for_vignettes} \end{document}