%\VignetteIndexEntry{Sequential Detector} \documentclass[a4paper]{article} \usepackage{Sweave} \usepackage[utf8]{inputenc} \usepackage[english]{babel} \usepackage{amsmath,amssymb,amsfonts,amsthm} \usepackage{natbib} \usepackage{filecontents} \usepackage[a4paper,left=3cm,right=2cm,top=2.5cm,bottom=2.5cm]{geometry} \bibliographystyle{apalike} \usepackage{listings} \lstset{ basicstyle=\footnotesize\ttfamily, columns=flexible, breaklines=true } \newtheorem{lsequence}{Learning sequence}[section] \newtheorem{tsequence}{Testing sequence}[section] \title{Evolving Tokenized Transducer Sequential Detector} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents{} \newpage \section{About} This package has been partly supported under the Competitiveness and Cohesion Operational Programme from the European Regional Development Fund, as part of the Integrated Anti-Fraud System project no. KK.01.2.1.01.0041 (IAFS). This package has also been partly supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS). The package comprises the Evolving Tokenized Transducer (ETT) \citep{krleza2019latent}, developed to learn and detect data sequences. Although ETT was primarily developed for process discovery purposes, it can be used to learn and detect any data sequence, e.g., data sequences from data streams or process instance flows from event streams. Although we use the terms \emph{data streams} and \emph{event streams} throughout this documentation, limited datasets and logs fall into this category as well. This package is still in development and will be significantly extended over the next years. We are also aware that things are currently not perfect, and that there are most certainly some bugs hidden around. As a book always has at least one more typo, a piece of software always has at least one more bug.
\subsection{Evolving Tokenized Transducer} ETT is a transducer, which is essentially a type of Moore machine, defined as \begin{equation} \begin{split} ETT=(Q,\Sigma,\Gamma,\mathcal{T},\Omega,\eta,\pi,\Delta,S,F) \end{split} \end{equation} where: \begin{itemize} \item $Q$ is a set of all states, which helps to define the ETT structure, \item $\Sigma$ is an input alphabet set taken from the input source, \item $\Gamma$ is an output alphabet set generated by ETT (the reason why this is a transducer), \item $\mathcal{T}$ is a set of tokens, which describe a set of current data sequences handled by ETT, \item $\Omega$ is a state and output alphabet mapping relation, \item $\eta$ is a state and token mapping relation, which assigns tokens to states, \item $\pi$ is a token selection function, which is used to select relevant tokens based on the input symbol and data item, \item $\Delta$ is a state-token transition and mapping function, which defines transitions in the ETT structure, \item $S \subseteq Q$ is a set of starting states, \item $F \subseteq Q$ is a set of final states. \end{itemize} ETT uses a set of auxiliary algorithms called \textbf{actions} for its work. \paragraph{ETT actions}: \begin{itemize} \item \textbf{Extending actions} - Capable of extending the ETT structure, which gives ETT the capability to learn new sequences, or to add new data items to the existing data sequences. \item \textbf{Pushing actions} - Perform token pushes throughout the ETT structure, which constitutes sequence detection. \end{itemize} \paragraph{ETT modes}: \begin{itemize} \item \textbf{Learning mode} - Uses both extending and pushing actions for intertwined data sequence learning and detection. \item \textbf{Detection mode} - Uses only pushing actions. ETT in this mode is not extended under any circumstances, i.e., it can only detect previously learned data sequences.
\end{itemize} \begin{figure}[ht] \centering \includegraphics[width=0.65\textwidth,keepaspectratio]{./figs/fig_synth_case2.png} \caption{An ETT example} \end{figure} The foundation of ETT is to have a single structure that comprises multiple tokens. Each one of these tokens represents one active data sequence. In the process discovery context, each token represents one process instance flow. ETT is a non-deterministic transducer, whose current context can be described by the set of tokens. By pushing tokens through the ETT structure on input data, we advance a data sequence or a process instance flow. Token pushing is performed by the \emph{pushing actions} described previously. If there is no token that can be pushed and ETT is in \emph{learning mode}, \emph{extension actions} are invoked to extend the ETT structure with a new data sequence element or process activity. After the \emph{extension actions} are performed, tokens are pushed. \subsection{Data and event stream processing} This Sequence Detector is capable of processing data and event streams. For this reason, ETT was implemented in C++, to improve its memory and processing performance. ETT is a simple mechanism that requires a limited input data structure. On the other hand, a data stream can be composed of various data sources, having different schemata and an abundant number of attributes. For that reason, we need some pre-processing and pre-classification, to compose a consolidated data stream suitable for passing into ETT. \begin{figure}[ht] \centering \includegraphics[width=0.8\textwidth,keepaspectratio]{./figs/fig_hybridarch.png} \caption{The Sequence Detector hybrid architecture} \end{figure} The Sequence Detector comprises the following two stages: \begin{itemize} \item \textbf{The pre-processing stage} - In which we take data items from multiple input data streams and rearrange them into a single output pre-processed data stream.
This involves data item aggregation, reordering and pre-processing (e.g., dimension reduction). \item \textbf{The pre-classification stage} - In which we take the pre-processed data stream and classify data items to create the input ETT alphabet $\Sigma$ and additional information needed for ETT. The reduced data items are then written to the output consolidated data stream $\mathcal{D}_c$. \end{itemize} \newpage The consolidated data stream at step $k$, denoted as $\mathcal{D}_c^{[k]}$, comprises $k$ ordered data items that are the result of the pre-processing and pre-classification stages, which can be defined as \begin{equation} \begin{split} \mathcal{D}_c^{[k]} = \{d_c^{[1]},...,d_c^{[k]}\}\\ |\mathcal{D}_c^{[k]}| = k, k \in \mathbb{N}\\ \mathcal{D}_c^{[k]} \subseteq \mathcal{I}^{[k]} \times T^{[k]} \times T^{[k]} \times \mathcal{C}^{[k]}\\ \forall 1 \leq j \leq k (d_c^{[j]} = (id^{[j]},t_s^{[j]},t_e^{[j]},{class}^{[j]}) \in \mathcal{D}_c^{[k]} \wedge j \in \mathbb{N})\\ \end{split} \end{equation} where at step $k$, $\mathcal{I}^{[k]}$ is the set of all used context identifiers, $T^{[k]}$ is the set of all timestamps, and $\mathcal{C}^{[k]}$ is the set of all output classes from the pre-processing procedures, which is the same as the input alphabet of the subsequent automaton. Each data item is a tuple that comprises a context identifier $id^{[j]} \in \mathcal{I}^{[k]}$, a starting timestamp $t_s^{[j]} \in T^{[k]}$, an ending timestamp $t_e^{[j]} \in T^{[k]}$, and a class of the data item $class^{[j]} \in \mathcal{C}^{[k]}$. \section{Overview} \subsection{Pre-processing} The pre-processing stage takes a number of input data streams and performs aggregation, consolidation and reordering on them. The result is a single output stream which is then passed into the pre-classification stage. <>= library(SeqDetect) library(xtable) library(dplyr) @ We define a set of input streams, each one coming from a specific IT system.
To pass a set of input streams into the pre-processing code, we need a named list whose element names relate to the names of IT systems and whose element values are data frames that represent a slice of the related data stream. First we create four exemplary slices of input data streams. \paragraph{ER registration system stream} \begin{footnotesize} <<>>= st1 <- data.frame(patient=c("C156","C156","E9383","C167"), time=c("05.12.2019. 10:30:20","05.12.2019. 11:59:07","07.12.2019. 08:34:12", "07.12.2019. 10:45:11"), age=c(12,12,26,76), fever=c(TRUE,TRUE,TRUE,FALSE), action=c("registration","release","registration","registration"), can_walk=c(TRUE,TRUE,FALSE,TRUE)) @ \end{footnotesize} <>= print(xtable(st1,caption = "ER registration system data stream slice")) @ Finish it by transforming string dates into POSIXct data types. \begin{footnotesize} <<>>= st1 <- transform.data.frame(st1,time=as.POSIXct(st1$time,format="%d.%m.%Y. %H:%M:%S")) @ \end{footnotesize} \paragraph{ER triage system} \begin{footnotesize} <<>>= st2 <- data.frame(patient=c("C156","C156","E9383","E9383","C167","C167"), time=c("05.12.2019. 10:41:00","05.12.2019. 12:12:00","07.12.2019. 09:56:00", "07.12.2019. 11:32:00","07.12.2019. 11:01:00","07.12.2019. 13:14:15"), diagnosis=c(NA,"J04.0",NA,"A41.9",NA,"N41.0"), action=c("biomarker","release","biomarker","hospital_ic","biomarker","hospital_nc"), description=c("suspect. laryng...","course of antibiotics", "high fever,in shock state!! URGENT!","septic shock? IC..", "cannot pee,catheter","urology hospitalization")) @ \end{footnotesize} <>= print(xtable(st2,caption = "ER triage system data stream slice"),size="\\footnotesize") @ Finish it by transforming string dates into POSIXct data types. \begin{footnotesize} <<>>= st2 <- transform.data.frame(st2,time=as.POSIXct(st2$time,format="%d.%m.%Y. 
%H:%M:%S")) @ \end{footnotesize} \paragraph{Bio-laboratory system} \begin{footnotesize} <<>>= st3 <- data.frame(request_id=c("2019_645553","2019_654331","2019_654331","2019_654331","2019_655376", "2019_655376"), request_org=c("ER","ER","ER","ER","ER","ER"), ext_id=c("C156","E9383","E9383","E9383","C167","C167"), date=c("05.12.2019.","07.12.2019.","07.12.2019.","07.12.2019.","07.12.2019.", "07.12.2019."), biomarker=c("WBC","WBC","CRP","LAC","WBC","CRP"), final=c(14.6,13.11,345.0,4.5,11.43,67.0),stringsAsFactors=FALSE) @ \end{footnotesize} <>= print(xtable(st3,caption = "Biolab system data stream slice")) @ Finish it by transforming string dates into POSIXct data types. \begin{footnotesize} <<>>= st3 <- transform.data.frame(st3,date=as.POSIXct(st3$date,format="%d.%m.%Y.")) @ \end{footnotesize} \paragraph{Hospital registration system} \begin{footnotesize} <<>>= st4 <- data.frame(patient_id=c("I93382","N94511"), ext_id=c("E9383","C167"), time_in=c("07.12.2019. 11:35:46","07.12.2019. 12:11:49"), diagnosis=c("A41.9","N41.0"), type=c("IC","NC"), time_release=c("15.12.2019. 08:52:11","11.12.2019. 14:02:11")) @ \end{footnotesize} <>= print(xtable(st4,caption = "Hospital system data stream slice")) @ Finish it by transforming string dates into POSIXct data types. \begin{footnotesize} <<>>= st4 <- transform.data.frame(st4,time_in=as.POSIXct(st4$time_in,format="%d.%m.%Y. %H:%M:%S"), time_release=as.POSIXct(st4$time_release,format="%d.%m.%Y. %H:%M:%S")) @ \end{footnotesize}\mbox{} The previous data streams could, for example, be read from Kafka. To pre-process these specific data streams, we need to create a pre-processor that inherits the \textit{HSC\_PP} class and implements the \textit{preprocess} method, as in the following example: \begin{footnotesize} <<>>= HSC_PP_Hospital <- function(...) { structure(list(),class = c("HSC_PP_Hospital","HSC_PP")) } preprocess.HSC_PP_Hospital <- function(x, streams, ...)
{ # perform some meaningful checking on the input data streams res <- data.frame(stringsAsFactors=FALSE) reg_stream <- streams[["registration_system"]] for(j in 1:nrow(reg_stream)) { el <- reg_stream[j,] cz <- case_when(el[,"action"]=="registration"~"ER registration", el[,"action"]=="release"~"ER release") res <- rbind(res,data.frame(id=el[,"patient"],class=cz,time=el[,"time"],out=cz,WBC=NA,CRP=NA,LAC=NA)) } triage_stream <- streams[["triage"]] for(j in 1:nrow(triage_stream)) { el <- triage_stream[j,] if(nrow(res[res$id==el[,"patient"] & res$class=="ER triage",])==0) res <- rbind(res,data.frame(id=el[,"patient"],class="ER triage", time=min(triage_stream[triage_stream[,"patient"]==el[,"patient"],"time"]), out="ER triage",WBC=NA,CRP=NA,LAC=NA)) } biolab_stream <- streams[["biolab"]] for(j in 1:nrow(biolab_stream)) { el <- biolab_stream[j,] if(nrow(res[res$id==el[,"ext_id"] & res[,"class"]=="Biomarker assessment",])==0) { if(el[,"request_org"]=="ER") { t1_time <- triage_stream[triage_stream[,"patient"]==el[,"ext_id"] & triage_stream[,"action"]=="biomarker","time"]+1 res <- rbind(res,data.frame(id=el[,"ext_id"],class="Biomarker assessment", time=t1_time,out="Biomarker assessment", WBC=NA,CRP=NA,LAC=NA)) } } res[res[,"id"]==el[,"ext_id"] & res[,"class"]=="Biomarker assessment", el[,"biomarker"]] <- el[,"final"] } hospital_stream <- streams[["hospital"]] for(j in 1:nrow(hospital_stream)) { el <- hospital_stream[j,] cz1 <- paste0("Admission to ",el[,"type"]) res <- rbind(res,data.frame(id=el[,"ext_id"],class=cz1,time=el[,"time_in"], out=cz1,WBC=NA,CRP=NA,LAC=NA)) res <- rbind(res,data.frame(id=el[,"ext_id"],class="Hospital release", time=el[,"time_release"],out="Hospital release", WBC=NA,CRP=NA,LAC=NA)) } res <- res[order(res[,"time"]),] # order this event stream slice return(list(obj=x,res=res)) } pp <- HSC_PP_Hospital() input_streams <- list(registration_system=st1,triage=st2,biolab=st3,hospital=st4) event_stream <- preprocess(pp,input_streams)$res @ \end{footnotesize} 
A list comprising \emph{obj} and \emph{res} elements has to be returned from the \emph{preprocess} method of a pre-processor class. The \emph{obj} element returns the same S3 object passed to the \emph{preprocess} method, i.e., the modified variable \emph{x}. The \emph{res} element must return the composed event stream. <>= print(xtable(event_stream,caption="The resulting event stream slice"),size="\\footnotesize") @ \subsubsection{Data and event stream slicing rules} Let us define $n$ input data streams as $I=\{ is_1,is_2,...,is_n \}$. Each data stream is sliced into the same number of slices $is_i=\{ s_1(is_i),...,s_k(is_i) \} \in I$. The event stream made by the pre-processing code can be defined as $E=\bigcup I=\{ s_1(E),...,s_k(E) \}$ and comprises the same number of slices as the input data streams. Each of the $k$ slices can be aggregated separately $s_i(E)=\bigcup_{j=1}^{n} s_i(is_j)$.\\ The timeframe of each event stream slice is bound by a minimal starting timestamp of the composing slices $t_s(s_i(E))=\min_{j=1}^n t_s(s_i(is_j))$ and a maximal ending timestamp of the composing slices $t_e(s_i(E))=\max_{j=1}^n t_e(s_i(is_j))$. Event stream slices cannot overlap time-wise. This means \begin{equation} \begin{split} \nexists 1 \leq i,j \leq k(j>i \wedge t_s(s_j(E)) \leq t_e(s_i(E))) \end{split} \end{equation} We need to take care of the current status of each individual input data stream, i.e., some of the input data streams could be lagging behind, time-wise. Once all input data streams reach a certain time point $tp_i$, we slice them all between $(tp_{i-1},tp_i ]$. This is the bound of the slice $i$, meaning \begin{equation} \begin{split} tp_{i-1} < t_s(s_i(E)) \leq t_e(s_i(E)) \leq tp_{i} \end{split} \end{equation} which can happen when slice $s_i(E)$ events do not occur exactly at time points $tp_{i-1}$ or $tp_i$.\\ We can determine the time point $tp_i$ by examining the last data items in all input data streams.
If we take $l_i=\vert is_i \vert$, and the last data item in the input data stream $is_i$ as $d_i^{[l_i]} \in is_i$, under the assumption that there is no back-logging and that each input data stream comprises data items from one unique data source, we can choose the time point as \begin{equation} \begin{split} tp_i = \min_{j=1}^n t_e(d_j^{[l_j]}) \end{split} \end{equation} If one of the systems allows back-logging, i.e., retroactive work, the whole slicing time point estimation must be adjusted for the allowed back-logging period. This can create some serious problems, for example when we try to capture ongoing fraudulent processes and react to ongoing alerts as close to real-time as possible. Overnight batch processing systems might destroy the capability of ETT to track process instances in real-time.\\ Modernizing IT systems and adopting service-oriented architectural principles can greatly improve the ability to track ongoing processes. \subsection{Pre-classification} Once we create a slice of the event stream, each event in the stream must have sufficient data for the next step: pre-classification. We begin by writing a concrete pre-classifier for the event stream. A pre-classifier is instantiated from a class that inherits the \emph{HSC\_PC} class and implements the \emph{classify} method. A simple implementation would be: \begin{footnotesize} <<>>= HSC_PC_Hospital <- function(...) { structure(list(),class = c("HSC_PC_Hospital","HSC_PC")) } classify.HSC_PC_Hospital <- function(x, event_stream, ...)
{ # perform some meaningful checking on the supplied event stream res <- data.frame(stringsAsFactors=FALSE) for(i in 1:nrow(event_stream)) { event <- event_stream[i,] symbol <- case_when( event$class=="ER registration" ~ "ER_REG", event$class=="ER triage" ~ "ER_TR", event$class=="ER release" ~ "ER_REL", event$class=="Biomarker assessment" ~ "BIO_A", event$class=="Admission to IC" ~ "IC", event$class=="Admission to NC" ~ "NC", event$class=="Hospital release" ~ "H_REL" ) if(symbol=="BIO_A") { symbol <- paste0(symbol,case_when(is.na(event$WBC) ~ "#WBC=NONE", event$WBC>11 ~ "#WBC=EL",TRUE ~ "#WBC=OK")) symbol <- paste0(symbol,case_when(is.na(event$CRP) ~ "#CRP=NONE", event$CRP>50 ~ "#CRP=EL",TRUE ~ "#CRP=OK")) symbol <- paste0(symbol,case_when(is.na(event$LAC) ~ "#LAC=NONE", event$LAC>4 ~ "#LAC=EL",TRUE ~ "#LAC=OK")) } res <- rbind(res,data.frame(id=event$id,class=event$class,time=event$time,out=event$out, .clazz=symbol)) } return(res) } pc <- HSC_PC_Hospital() consolidated_stream <- classify(pc,event_stream) @ \end{footnotesize} <>= print(xtable(consolidated_stream,caption="The consolidated data stream slice"),size="\\tiny") @ \subsection{Process discovery} After we have created the pre-processing and pre-classification classes, we can use them in the Sequence Detector. <>= seq_detector <- HybridSequenceClassifier(c("id","class","time","out"),"time","time", "id",preclassifier=pc,preprocessor=pp, pattern_field="out") seq_detector$process(input_streams) @ \noindent Any subsequent input data stream slices are processed by invoking the \emph{process} method again. The ETT created from the example is: <>= seq_detector$printMachines() @ \begin{lstlisting} <>= seq_detector$printMachines() @ \end{lstlisting} \noindent And the ETT structure ($\Gamma$ related) is: <>= seq_detector$plotMachines() @ \newpage \section{Detecting pre-learned sequences} \subsection{Pre-learning} There is often a need to detect specific sequences in process instances or in time-series datasets.
Once the Sequence Detector detects such a sequence, an alerting notification can be generated from the results. Let us define a sequence that we want to learn. \begin{lsequence}\label{seq:ps_learn1}\mbox{} <<>>= st <- data.frame(sequence=c("A","E","G"),alert=c(NA,NA,"Alert 1")) @ \end{lsequence} <>= st @ <>= print(xtable(st,caption="Learning sequence \\ref{seq:ps_learn1}")) @ <>= pp <- HSC_PP(c("sequence","alert"),"sequence_id",create_unique_key=TRUE,auto_id=TRUE) pc <- HSC_PC_Attribute("sequence") seq_detector <- HybridSequenceClassifier(c("sequence_id","sequence","alert"), "sequence_id","sequence_id", preclassifier=pc,preprocessor=pp, pattern_field="alert",reuse_states=FALSE) input_streams <- list(stream=st) seq_detector$process(input_streams,learn=TRUE) seq_detector$cleanKeys() @ We used the predefined pre-processor \emph{HSC\_PP}, which filters out the supplied fields and orders the output event stream according to one of the filtered fields. The predefined pre-classifier \emph{HSC\_PC\_Attribute} is used, which takes an attribute value as the input symbol for ETT.
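Conceptually, \emph{HSC\_PC\_Attribute} maps the value of the chosen attribute directly to the ETT input symbol. The following base-R sketch illustrates this idea only; it is not the package implementation, and the function name and the \texttt{.clazz} symbol column are used here merely by analogy with the earlier hospital example:

```r
# Illustration only: derive an input symbol from a single attribute,
# which is conceptually what HSC_PC_Attribute does.
classify_by_attribute <- function(event_stream, attribute) {
  res <- event_stream
  res$.clazz <- as.character(event_stream[[attribute]]) # symbol := attribute value
  res
}

events <- data.frame(sequence_id=1:3, sequence=c("A","E","G"))
out <- classify_by_attribute(events, "sequence")
out$.clazz # "A" "E" "G"
```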
The ETT created from the example is: <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \begin{lstlisting} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \end{lstlisting} \subsection{Simple testing streams} \begin{tsequence}\label{seq:ps_test1}\mbox{} <>= seq_test1 <- c("E","I","A","G","E","K","F","E","A","G","G","B","W","L") res_test1 <- seq_detector$process(list(stream=data.frame(sequence=seq_test1, alert=NA)),learn=FALSE) out_test1 <- data.frame() for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(sequence=res_test1$stream[i,"sequence"], alert=c_to_string(res_test1$explanation[[i]]$actual))) @ \end{tsequence} <>= out_test1 @ <>= print(xtable(out_test1,caption="Results for testing sequence \\ref{seq:ps_test1}, $\\lambda_n=\\infty$",label="tab:ps_test1_res")) @ The result in Table \ref{tab:ps_test1_res} is a bit unusual and, from the user's point of view, unexpected. However, if we look carefully, the learned sequence $A E G$ is actually there, right up to the point where \emph{Alert 1} appears. The reason for this is the current level of ETT noise tolerance, which is set to $\lambda_n=\infty$. If we set the noise tolerance to $\lambda_n=0$ instead,
<>= pp <- HSC_PP(c("sequence","alert"),"sequence_id",create_unique_key=TRUE,auto_id=TRUE) pc <- HSC_PC_Attribute("sequence") dd <- list(type="count",count=0,context_related=TRUE) seq_detector <- HybridSequenceClassifier(c("sequence_id","sequence","alert"), "sequence_id","sequence_id", preclassifier=pc,preprocessor=pp, pattern_field="alert",reuse_states=FALSE, decay_descriptors=list(d1=dd)) seq_detector$process(input_streams,learn=TRUE) seq_detector$cleanKeys() @ we get a somewhat different result for the same testing sequence, as seen in Table \ref{tab:ps_test1_res2}. <>= res_test1 <- seq_detector$process(list(stream=data.frame(sequence=seq_test1,alert=NA)), learn=FALSE) out_test1 <- data.frame() for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(sequence=res_test1$stream[i,"sequence"], alert=c_to_string(res_test1$explanation[[i]]$actual))) @ <>= out_test1 @ <>= print(xtable(out_test1,caption="Result for testing sequence \\ref{seq:ps_test1}, $\\lambda_n=0$",label="tab:ps_test1_res2")) @ \noindent We now define a slightly different testing sequence for noise tolerance $\lambda_n=0$. Part of it is the learned sequence. Later in the testing sequence there is a sub-sequence $A E B G$, which is not recognized due to the noise tolerance setting, as seen in Table \ref{tab:ps_test2_res}.
\begin{tsequence}\label{seq:ps_test2}\mbox{} <>= seq_test2 <- c("E","I","A","E","G","E","K","F","E","A","E","B","G","W","L") res_test2 <- seq_detector$process(list(stream=data.frame(sequence=seq_test2,alert=NA)), learn=FALSE) out_test2 <- data.frame() for(i in 1:nrow(res_test2$stream)) out_test2 <- rbind(out_test2,data.frame(sequence=res_test2$stream[i,"sequence"], alert=c_to_string(res_test2$explanation[[i]]$actual))) @ \end{tsequence} <>= out_test2 @ <>= print(xtable(out_test2,caption="Result for testing sequence \\ref{seq:ps_test2}, $\\lambda_n=0$",label="tab:ps_test2_res")) @ However, if we set the noise tolerance to $\lambda_n=1$, the sub-sequence $AEBG$ is recognized, as seen in Table \ref{tab:ps_test2_res2}. <>= dd <- list(type="count",count=1,context_related=TRUE) seq_detector <- HybridSequenceClassifier(c("sequence_id","sequence","alert"), "sequence_id","sequence_id", preclassifier=pc,preprocessor=pp, pattern_field="alert",reuse_states=FALSE, decay_descriptors=list(d1=dd)) seq_detector$process(input_streams,learn=TRUE) # learn seq_detector$cleanKeys() # clean all context keys res_test2 <- seq_detector$process(list(stream=data.frame(sequence=seq_test2,alert=NA)), learn=FALSE) out_test2 <- data.frame() for(i in 1:nrow(res_test2$stream)) out_test2 <- rbind(out_test2,data.frame(sequence=res_test2$stream[i,"sequence"], alert=c_to_string(res_test2$explanation[[i]]$actual))) @ <>= out_test2 @ <>= print(xtable(out_test2,caption="Result for testing sequence \\ref{seq:ps_test2}, $\\lambda_n=1$",label="tab:ps_test2_res2")) @ \newpage \section{Multi-contextual detection} \subsection{Multi-contextual learning sequences} When several sequences are intertwined in the same consolidated data stream $\mathcal{D}_c$, each sequence can be observed as a separate whole identified by a set of data, i.e., sequence context identifiers. A context identifier could be a customer, product, partner, or just an ongoing process identifier.
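The effect of a context identifier can be sketched in base R: splitting an intertwined stream by the context column yields one sub-sequence per context, which is conceptually what ETT tracks through per-context tokens. This is an illustration only, not the package mechanism:

```r
# Illustration only: partition an intertwined stream into per-context
# sub-sequences using the context identifier column.
stream <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                     sales=c(2,12,18,16,18,24,8))
by_context <- split(stream$sales, stream$product)
by_context$P45  # 2 18 24
by_context$P134 # 12 16 18 8
```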
The following learning sequence represents two products and their sales numbers. Let us define a sequence that we want to learn. \begin{lsequence}[]\label{seq:mc_learn1}\mbox{} <<>>= st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"), sales=c(2,12,18,16,18,24,8), alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134")) input_streams <- list(stream=st) @ \end{lsequence} <>= st @ <>= print(xtable(st,caption="A multi-contextual learning sequence \\ref{seq:mc_learn1}",label="tab:mc_learn1")) @ Now we define a Sequence Detector having noise tolerance $\lambda_n=0$. \begin{footnotesize} <>= pp <- HSC_PP(c("product","sales","alert"),"sequence_id",auto_id=TRUE) pc <- HSC_PC_Attribute("sales") dd <- list(type="count",count=0,context_related=TRUE) seq_detector <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id", "sequence_id",context_field="product",preclassifier=pc, preprocessor=pp,pattern_field="alert",reuse_states=FALSE, decay_descriptors=list(d1=dd)) seq_detector$process(input_streams,learn=TRUE) seq_detector$cleanKeys() @ \end{footnotesize} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \begin{lstlisting} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \end{lstlisting} Since we did not use the option of reusing states (reuse\_states=FALSE, see \citep{krleza2019latent} for details), two distinct ETTs were created, each for its own context. If there is a merging point in these two ETTs, such as the state \emph{18}, we can use the merging option to create one consolidated ETT.
We need to be careful with the merging option, as it will only perform merging for $ETT_1$ and $ETT_2$ when the following conditions are met: \begin{equation} \begin{split} \exists s_1, s_2 (s_1 \in Q(ETT_1) \wedge s_2 \in Q(ETT_2) \wedge \exists q((*,q,s_1) \in R_{\delta}(ETT_1) \wedge (*,q,s_2) \in R_{\delta}(ETT_2)) \wedge\\ \omega_{s_1} = \{(s_1,\gamma_1) | \gamma_1 \in \Gamma(ETT_1)\} \subseteq \omega(ETT_1) \wedge \omega_{s_2} = \{(s_2,\gamma_2) | \gamma_2 \in \Gamma(ETT_2)\} \subseteq \omega(ETT_2) \wedge\\ (\omega_{s_1} \cap \omega_{s_2} \neq \emptyset \vee (\omega_{s_1}=\emptyset \wedge \omega_{s_2}=\emptyset))) \end{split} \end{equation} In other words, we need to have a pair of states, each in its own ETT, that share the same input and output symbols, or have no output symbols at all. Since we did not define output symbols for the state \emph{18}, merging should be successful. <>= seq_detector$mergeMachines() @ <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \begin{lstlisting} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \end{lstlisting} \subsection{Multi-contextual testing sequences} We define the following testing sequence.
\begin{tsequence}\label{seq:mc_test1}\mbox{} \begin{footnotesize} <>= tt <- data.frame(product=c("P672","P113","P983","P23872","P5","P672","P2982","P983","P672", "P991","P983","P113","P2982","P344"), sales=c(2,11,12,98,8,18,298,16,24,25,18,16,43,101),alert=NA) test_streams <- list(stream=tt) @ \end{footnotesize}\end{tsequence} <>= tt @ <>= print(xtable(tt,caption="A multi-contextual testing sequence \\ref{seq:mc_test1}",label="tab:mc_test1")) @ \begin{footnotesize} <>= res_test1 <- seq_detector$process(test_streams,learn=FALSE) out_test1 <- data.frame() for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(product=res_test1$stream[i,"product"], sales=res_test1$stream[i,"sales"], alert=c_to_string(res_test1$explanation[[i]]$actual))) @ \end{footnotesize} <>= out_test1 @ The result can be seen in Table \ref{tab:mc_test1_res}. <>= print(xtable(out_test1,caption="Results for the testing sequence \\ref{seq:mc_test1}",label="tab:mc_test1_res")) @ \noindent The final state of the ETT is <>= seq_detector$printMachines() @ \begin{lstlisting} <>= seq_detector$printMachines() @ \end{lstlisting} \noindent If we add another slice of the testing stream \ref{seq:mc_test1} <<>>= tt <- data.frame(product=c("P115","P45","P22","P983","P9","P19","P73"), sales=c(91,43,52,8,1,105,35),alert=NA) test_streams <- list(stream=tt) @ <>= tt @ <>= print(xtable(tt,caption="Additional slice of the testing sequence \\ref{seq:mc_test1}",label="tab:mc_test1_as")) @ \newpage \begin{footnotesize} <>= res_test1 <- seq_detector$process(test_streams,learn=FALSE) out_test1 <- data.frame() for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(product=res_test1$stream[i,"product"], sales=res_test1$stream[i,"sales"], alert=c_to_string(res_test1$explanation[[i]]$actual))) @ \end{footnotesize} <>= out_test1 @ We get an additional sequence recognition alert, as seen in Table \ref{tab:mc_test1_as_res}.
<>= print(xtable(out_test1,caption="Results for the additional slice of the testing sequence \\ref{seq:mc_test1}",label="tab:mc_test1_as_res")) @ \section{Token decay mechanism} \subsection{Overview} The token decay mechanism is introduced for the following reasons: \begin{itemize} \item ETT noise tolerance. The decay mechanism enables ETT to tolerate extra symbols in the previously learned sequences, both within a context and globally. \item Hash table cleaning. As the hash table that keeps tokens gets bigger, ETT processing gets slower. Eventually, processing might fall behind the velocity of the processed data streams. We need to purge irrelevant tokens to keep ETT performing well while using as little memory as possible. \end{itemize} \noindent Token decay has three basic options: \begin{enumerate} \item \textbf{Context-related counting decay} - Works within one context. The user can define how many extra symbols within a context are acceptable. \item \textbf{Global counting decay} - Works globally. The user can define how many extra symbols are allowed between two token pushes. If a token is not pushed within the defined number of input symbols, it decays and gets removed. \item \textbf{Time decay} - The user can define the time needed for a token to decay. If the token does not get pushed within the designated time, it decays and gets removed. \end{enumerate} The decay mechanism is controlled by passing decay descriptors at the time of Sequence Detector creation, and cannot be changed later on, i.e., a new Sequence Detector needs to be created to use different decay descriptors. \paragraph{Context-related count decay} <>= dd1 <- list(type="count",count=1,context_related=TRUE) @ ($\lambda_n=1$) One extra symbol is allowed within the same context. For example, the learning sequence $A B C$ will be successfully recognized within the testing sequence $A B F C$, but not in the testing sequence $A B F T C$.
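To make the counting rule concrete, the following base-R sketch checks whether a learned sequence occurs with at most $\lambda_n$ extra symbols between consecutive elements. This illustrates the tolerance rule only; it is not the token mechanism implemented in the package:

```r
# Illustration only: does `pattern` occur in `x` as a subsequence with at
# most `lambda_n` extra (noise) symbols between consecutive elements?
matches_with_noise <- function(x, pattern, lambda_n) {
  rec <- function(pos, k) {
    if (k > length(pattern)) return(TRUE)
    cand <- which(x == pattern[k])
    # the first element may start anywhere; later ones must respect the gap
    cand <- cand[cand > pos & (pos == 0 | cand - pos - 1 <= lambda_n)]
    for (nxt in cand) if (rec(nxt, k + 1)) return(TRUE)
    FALSE
  }
  rec(0, 1)
}

matches_with_noise(c("A","B","F","C"), c("A","B","C"), 1)     # TRUE
matches_with_noise(c("A","B","F","T","C"), c("A","B","C"), 1) # FALSE
```

With \texttt{lambda\_n = Inf} the check degenerates to plain subsequence matching, which mirrors the $\lambda_n=\infty$ behaviour discussed earlier.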
\paragraph{Global count decay} <>= dd2 <- list(type="count",count=50,context_related=FALSE) @ If a token is not pushed within 50 symbols of its last push, it will decay and get removed. \paragraph{Time decay} <>= dd3 <- list(type="time",days=0,hours=1,minutes=10,context_related=FALSE) @ If a token is not pushed within 1 hour and 10 minutes of its last push, it will decay and get removed. \subsection{Example 1: Context-related count decay} Let us define a sequence that we want to learn. \begin{lsequence}\label{seq:td_learn1}\mbox{} <<>>= st <- data.frame(product=c("P45","P45"),sales=c(5,10),alert=c(NA,"Alert 1")) input_streams <- list(stream=st) @ \end{lsequence} <>= st @ <>= print(xtable(st,caption="Learning sequence \\ref{seq:td_learn1}",label="tab:td_learn1")) @ Now we define a Sequence Detector having context-related count decay $\lambda_n=1$. \begin{footnotesize} <>= pp <- HSC_PP(c("product","sales","alert"),"sequence_id",auto_id=TRUE) pc <- HSC_PC_Attribute("sales") dd <- list(type="count",count=1,context_related=TRUE) seq_detector <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id", "sequence_id",context_field="product",preclassifier=pc, preprocessor=pp,pattern_field="alert",reuse_states=FALSE, decay_descriptors=list(d1=dd)) seq_detector$process(input_streams,learn=TRUE) seq_detector$cleanKeys() @ \end{footnotesize} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \begin{lstlisting} <>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @ \end{lstlisting} \newpage \noindent For the testing we define both correct and incorrect contexts.
\begin{tsequence}\label{seq:td_test1}\mbox{}
<<>>=
tt <- data.frame(product=c("P113","P29","P113","P29","P29","P113","P29"), sales=c(5,5,7,8,9,10,10),alert=NA)
test_streams <- list(stream=tt)
@
\end{tsequence}
<>= tt @
<>= print(xtable(tt,caption="Testing sequence \\ref{seq:td_test1}",label="tab:td_test1")) @
The testing result is:
\begin{footnotesize}
<>=
res_test1 <- seq_detector$process(test_streams,learn=FALSE)
out_test1 <- data.frame()
for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(product=res_test1$stream[i,"product"], sales=res_test1$stream[i,"sales"], alert=c_to_string(res_test1$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test1 @
<>= print(xtable(out_test1,caption="Testing sequence \\ref{seq:td_test1} results",label="tab:td_test1_res")) @
\subsection{Example 2: Global count decay}
We use the same learning sequence \ref{seq:td_learn1}. However, this time we define a Sequence Detector with a global count decay.
\begin{footnotesize}
<>=
dd <- list(type="count",count=5,context_related=FALSE)
seq_detector <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id", "sequence_id",context_field="product",preclassifier=pc, preprocessor=pp,pattern_field="alert",reuse_states=FALSE, decay_descriptors=list(d1=dd))
seq_detector$process(input_streams,learn=TRUE)
seq_detector$cleanKeys()
@
\end{footnotesize}
For testing, we define both correct and incorrect contexts.
\begin{tsequence}\label{seq:td_test2}\mbox{}
<<>>=
tt <- data.frame(product=c("P29","P113","P114","P115","P113","P114","P115","P29"), sales=c(5,5,5,5,10,7,10,10),alert=NA)
test_streams <- list(stream=tt)
@
\end{tsequence}
<>= tt @
<>= print(xtable(tt,caption="Testing sequence \\ref{seq:td_test2}",label="tab:td_test2")) @
The testing result is:
\begin{footnotesize}
<>=
res_test2 <- seq_detector$process(test_streams,learn=FALSE)
out_test2 <- data.frame()
for(i in 1:nrow(res_test2$stream)) out_test2 <- rbind(out_test2,data.frame(product=res_test2$stream[i,"product"], sales=res_test2$stream[i,"sales"], alert=c_to_string(res_test2$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test2 @
<>= print(xtable(out_test2,caption="Testing sequence \\ref{seq:td_test2} results",label="tab:td_test2_res")) @
\subsection{Example 3: Time decay}
We use the same learning sequence \ref{seq:td_learn1}. However, this time we define a timestamp field for determining event timing.
\begin{footnotesize}
<<>>=
st <- data.frame(product=c("P21","P21"),timestamp=c("01.12.2019. 10:00:00","01.12.2019. 10:01:00"), sales=c(5,10),alert=c(NA,"Alert 1"))
@
\end{footnotesize}
<>= st @
<>= print(xtable(st,caption="Learning sequence \\ref{seq:td_learn1} with timestamp",label="tab:td_learn1_tsfield")) @
We finish by transforming the timestamp field into the POSIXct type.
<<>>=
st <- transform.data.frame(st,timestamp=as.POSIXct(st$timestamp, format="%d.%m.%Y. %H:%M:%S"))
input_streams <- list(stream=st)
@
We define a Sequence Detector with a time decay set to 1 hour.
\begin{footnotesize}
<>=
pp <- HSC_PP(c("product","sales","alert","timestamp"),"timestamp")
pc <- HSC_PC_Attribute("sales")
dd <- list(type="time",days=0,hours=1,minutes=0,context_related=FALSE)
seq_detector <- HybridSequenceClassifier(c("timestamp","product","sales","alert"), "timestamp","timestamp",context_field="product", preclassifier=pc,preprocessor=pp,pattern_field="alert", reuse_states=FALSE,decay_descriptors=list(d1=dd))
seq_detector$process(input_streams,learn=TRUE)
seq_detector$cleanKeys()
@
\end{footnotesize}
We define a new testing data stream.
\begin{tsequence}\label{seq:td_test3}\mbox{}\begin{footnotesize}
<>=
tt <- data.frame(product=c("P12","P13","P14","P15","P13","P14","P15","P12"), sales=c(5,5,5,5,10,10,10,10), timestamp=c("05.12.2019. 10:30:20","05.12.2019. 10:31:20", "05.12.2019. 10:32:20","05.12.2019. 10:33:20", "05.12.2019. 10:34:20","05.12.2019. 10:35:20", "05.12.2019. 10:40:20","05.12.2019. 12:30:20"),alert=NA)
@
\end{footnotesize}\end{tsequence}
<>= tt @
<>= print(xtable(tt,caption="Testing sequence \\ref{seq:td_test3}",label="tab:td_test3")) @
We finish by transforming timestamp fields into the POSIXct type.
<>=
tt <- transform.data.frame(tt,timestamp=as.POSIXct(tt$timestamp, format="%d.%m.%Y. %H:%M:%S"))
test_streams <- list(stream=tt)
@
The testing result is:
\begin{footnotesize}
<>=
res_test3 <- seq_detector$process(test_streams,learn=FALSE)
out_test3 <- data.frame()
for(i in 1:nrow(res_test3$stream)) out_test3 <- rbind(out_test3,data.frame(product=res_test3$stream[i,"product"], sales=res_test3$stream[i,"sales"], timestamp=as.character(res_test3$stream[i,"timestamp"], format="%d.%m.%Y. 
%H:%M:%S"), alert=c_to_string(res_test3$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test3 @
<>= print(xtable(out_test3,caption="Testing sequence \\ref{seq:td_test3} results",label="tab:td_test3_res")) @
\section{Statistical projections}
\subsection{ETT statistic collection and projection}
While working, ETT collects statistics about token pushes. At the moment, the collected statistics comprise:
\begin{itemize}
\item The number of pushes through an ETT transition
\item The number of inserted tokens in an ETT state
\end{itemize}
These counting statistics $\mathcal{S}_{\delta}$ can be used to create ETT projections that indicate the regularity of a sequence, as described in \citep{krleza2019latent}. Projections can reveal regular, irregular and anomalous sequences. Using a threshold $\Phi_{\mathcal{S}_{\delta}}$, we can sub-select the ETT structure as
\begin{equation} \begin{split} \mathcal{R}_{\delta_{\mathcal{S}^{+}}} \subseteq \{r_i | r_i \in \mathcal{R}_{\delta} \wedge \mathcal{S}_{\delta}(r_i) \geq \Phi_{\mathcal{S}_{\delta}} \}\\ \mathcal{R}_{\delta_{\mathcal{S}^{-}}} \subseteq \{r_i | r_i \in \mathcal{R}_{\delta} \wedge \mathcal{S}_{\delta}(r_i) < \Phi_{\mathcal{S}_{\delta}} \} \end{split} \end{equation}
resulting in
\begin{equation} \begin{split} ETT_{\mathcal{S}^{+}} \subseteq ETT[\mathcal{R}_{\delta}=\mathcal{R}_{\delta_{\mathcal{S}^{+}}}]\\ ETT_{\mathcal{S}^{-}} \subseteq ETT[\mathcal{R}_{\delta}=\mathcal{R}_{\delta_{\mathcal{S}^{-}}}] \end{split} \end{equation}
where we can isolate the following cases:
\begin{itemize}
\item \textbf{Regular data sequences}. Data sequences that can be obtained only by traversing through $ETT_{\mathcal{S}^{+}}$.
\item \textbf{Anomalous data sequences}. Data sequences that can be obtained only by traversing through $ETT_{\mathcal{S}^{-}}$.
\item \textbf{Irregular data sequences}. Data sequences that can be obtained by traversing through both $ETT_{\mathcal{S}^{-}}$ and $ETT_{\mathcal{S}^{+}}$.
These data sequences are made of regular and irregular subsequences.
\end{itemize}
\subsection{Example 1}
Let us define the sequences that we want to learn. These sequences must contain regular and irregular transitions.
\begin{lsequence}\label{seq:proj_learn1}\mbox{}
<>=
st <- data.frame(product=c("P1","P2"),sales=c(5,76),alert=c(NA,NA))
for(i in 1:400) {
  st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(10,58),alert=c(NA,NA)))
  st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(20,31),alert=c(NA,NA)))
}
st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(30,11), alert=c("Sequence 1","Sequence 2")))
input_streams <- list(stream=st)
@
\end{lsequence}
Now we define a Sequence Detector for learning the simple sequences we constructed above.
\begin{footnotesize}
<>=
pp <- HSC_PP(c("product","sales","alert"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Attribute("sales")
seq_detector <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id", "sequence_id",context_field="product",preclassifier=pc, preprocessor=pp,reuse_states=TRUE,pattern_field="alert")
seq_detector$process(input_streams,learn=TRUE)
seq_detector$cleanKeys()
@
\end{footnotesize}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
This results in two distinct ETTs, each comprising its own sequence. By examining the population values in the ETTs, we can see the statistics collected for the learned data stream.
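\noindent The collected counts also suggest where a projection threshold should sit. Reading the learning loop above roughly: the looped transitions between sales values $10$ and $20$ (context $P1$), and between $58$ and $31$ (context $P2$), were pushed about 400 times, while the opening and closing transitions were pushed only once. Any threshold between these two counts, e.g. $\Phi_{\mathcal{S}_{\delta}}=200$, therefore separates the regular loop from the rest of the structure:
\begin{equation*}
\begin{split}
\mathcal{S}_{\delta}(10 \rightarrow 20) \approx 400 \geq \Phi_{\mathcal{S}_{\delta}} &\Rightarrow (10 \rightarrow 20) \in \mathcal{R}_{\delta_{\mathcal{S}^{+}}}\\
\mathcal{S}_{\delta}(20 \rightarrow 30) = 1 < \Phi_{\mathcal{S}_{\delta}} &\Rightarrow (20 \rightarrow 30) \in \mathcal{R}_{\delta_{\mathcal{S}^{-}}}
\end{split}
\end{equation*}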
Using this Sequence Detector, we can detect only the original sequences, for example
\begin{tsequence}\label{seq:proj_test1}\mbox{}\begin{footnotesize}
<>=
tt <- data.frame(product=c("P29","P29","P34","P29","P29","P11","P34","P34", "P34","P11","P11"), sales=c(5,10,76,20,30,10,58,31,11,20,30),alert=NA)
test_streams <- list(stream=tt)
@
\end{footnotesize}\end{tsequence}
which results in
\begin{footnotesize}
<>=
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:proj_test1} results",label="tab:proj_test1_res")) @
Note that the sequence $10\ 20$ for $P11$, which is the regular part of the learned sequence, was not recognized as a separate sequence. The Sequence Detector instance can be cloned for later manipulations.
<>= seq_detector1 <- seq_detector$clone() @
\noindent Now we can make an ETT projection using $\Phi_{\mathcal{S}_{\delta}}=200$.
<>= res_is <- seq_detector$induceSubmachine(threshold=200) @
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
Four ETTs can be observed. These are two pairs of $ETT_{\mathcal{S}^{+}}$ and $ETT_{\mathcal{S}^{-}}$, where each $ETT_{\mathcal{S}^{-}}$ references its $ETT_{\mathcal{S}^{+}}$. Now we can detect regular sequences. Let us create additional alerts for the regular sequences.
\begin{footnotesize}
<>=
seq_detector$setOutputPattern(states=c("20"),transitions=c(),pattern="Reg.sequence 1")
seq_detector$setOutputPattern(states=c("31"),transitions=c(),pattern="Reg.sequence 2")
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:proj_test1} results",label="tab:proj_test1_res2")) @
At this point, we are able to detect both whole and regular sequences.
\subsection{Example 2}
We can also project the original Sequence Detector instance and discard the irregular part of the learned sequences. The Sequence Detector instance is reverted back to the original, as it was before the previous projection, and the projection is performed on the reverted instance.
<>=
seq_detector <- seq_detector1$clone()
res_is <- seq_detector$induceSubmachine(threshold=200,isolate=TRUE)
@
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
Only the ETTs that comprise the most regular sequences, those that satisfy the threshold $\Phi_{\mathcal{S}_{\delta}}=200$, can be observed. Isolating only the most regular sequences has its benefits and shortfalls:
\begin{itemize}
\item \textbf{Benefit}: We keep only the structures that are the most recurring, some would say the most common and regular behaviour, the most common process flows, etc. This helps to manage the size of the Sequence Detector instance in a Big Data environment, which is of utmost importance.
\item \textbf{Shortfall}: We lose the ability to detect irregular and anomalous sequences, which are important in detecting fraudulent behaviour and activities.
\end{itemize}
By processing the testing sequence \ref{seq:proj_test1} again, we can see that only regular subsequences yield alerts.
\begin{footnotesize}
<>=
seq_detector$setOutputPattern(states=c("20"),transitions=c(),pattern="Reg.sequence 1")
seq_detector$setOutputPattern(states=c("31"),transitions=c(),pattern="Reg.sequence 2")
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:proj_test1} results",label="tab:proj_test1_res3")) @
\section{Compressing Sequence Detector}
When processing Big Data, learning actions \citep{krleza2019latent} can generate many states and additional ETTs. This can cause a significant increase in memory consumption and required processing power, slowing down the processing of data items by a Sequence Detector object. Statistical projection is one way to deal with this complexity explosion: it helps to select and project only the most common sequences, reducing the overall size of the Sequence Detector. Even so, a Sequence Detector can create multiple ETTs having a substructure (a sub-graph) that is isomorphic. Since we have the ability to stack and reuse ETTs, isolating isomorphic ETT structures into referenced, child ETTs is a way to compress a Sequence Detector and reduce its size.
\subsection{Example 1 - Merging}
We can define two input data streams for which we know with certainty that they will produce two partially isomorphic ETTs.
\begin{lsequence}\label{seq:comp_learn1}\mbox{}
<>=
library(SeqDetect)
ldf1 <- data.frame(product=c("P1","P1","P1","P1"),sequence_id=c(1,3,5,7), sales=c(5,76,123,1),alert=c(NA,NA,NA,"Alert P1"))
ldf2 <- data.frame(product=c("P2","P2","P2","P2"),sequence_id=c(2,4,6,8), sales=c(21,76,123,42),alert=c(NA,NA,NA,"Alert P2"))
input_streams <- list(stream1=ldf1,stream2=ldf2)
@
\end{lsequence}
\noindent Then we define the Sequence Detector instance.
\begin{footnotesize}
<>=
pp <- HSC_PP(c("product","sales","alert","sequence_id"),"sequence_id")
pc <- HSC_PC_Attribute("sales")
seq_detector <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"), "sequence_id","sequence_id",context_field="product", preclassifier=pc,preprocessor=pp,reuse_states=TRUE, pattern_field="alert")
seq_detector$process(input_streams,learn=TRUE)
seq_detector$cleanKeys()
backup_detector <- seq_detector$clone()
@
\end{footnotesize}
\noindent We get two ETTs after the learning process.
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
\noindent Sequence detection works as expected. We define the following testing sequence...
\begin{tsequence}\label{seq:comp_test1}\mbox{}
<>=
tdf1 <- data.frame(product=c("P3","P3","P3","P3"),sequence_id=c(1,2,3,4), sales=c(5,76,123,1),alert=NA)
test_streams <- list(stream1=tdf1)
@
\end{tsequence}
\begin{footnotesize}
<>=
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:comp_test1} results",label="tab:comp_test1_res1")) @
Merging two ETTs is possible, but it will have a negative impact on the Sequence Detector by fusing the learned sequences together.
\begin{tsequence}\label{seq:comp_test2}\mbox{}
<>=
tdf1 <- data.frame(product=c("P4","P4","P4","P4"),sequence_id=c(1,2,3,4), sales=c(21,76,123,1),alert=NA)
test_streams <- list(stream1=tdf1)
@
\end{tsequence}
\begin{footnotesize}
<>=
seq_detector$mergeMachines()
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:comp_test2} results",label="tab:comp_test2_res1")) @
The results in Table \ref{tab:comp_test2_res1} are incorrect. Notice that we recognize the sequence \emph{21,76,123,1}, which is a combination of sequences for two different contexts in the learning sequence \ref{seq:comp_learn1}.
\subsection{Example 2 - Compression}
The previous result is not correct, as it wrongly detects the sequence for the product $P1$, while in fact it was a combination of sequences for $P1$ and $P2$. However, if we isolate the isomorphic parts of the ETTs...
\begin{tsequence}\label{seq:comp_test3}\mbox{}
<>=
tdf1 <- data.frame(product=c("P5","P5","P5","P5"),sequence_id=c(1,2,3,4), sales=c(21,76,123,1),alert=NA)
test_streams <- list(stream1=tdf1)
@
\end{tsequence}
\begin{footnotesize}
<>=
seq_detector <- backup_detector$clone()
seq_detector$compressMachines()
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
...we get the correct result. No detection occurs when an intertwined sequence is tested.
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\begin{lstlisting}
<>= seq_detector$printMachines(print_cache=FALSE,print_keys=FALSE) @
\end{lstlisting}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:comp_test3} results",label="tab:comp_test3_res1")) @
However, the correct sequence gets recognized successfully.
\begin{tsequence}\label{seq:comp_test4}\mbox{}
<>=
tdf1 <- data.frame(product=c("P6","P6","P6","P6"), sequence_id=c(1,2,3,4),sales=c(5,76,123,1), alert=NA)
test_streams <- list(stream1=tdf1)
@
\end{tsequence}
\begin{footnotesize}
<>=
res_test <- seq_detector$process(test_streams,learn=FALSE)
out_test <- data.frame()
for(i in 1:nrow(res_test$stream)) out_test <- rbind(out_test,data.frame(product=res_test$stream[i,"product"], sales=res_test$stream[i,"sales"], alert=c_to_string(res_test$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= out_test @
<>= print(xtable(out_test,caption="Testing sequence \\ref{seq:comp_test4} results",label="tab:comp_test4_res1")) @
\section{Other options}
\subsection{Saving and loading Sequence Detector}
Because it comprises C++ objects, a Sequence Detector object cannot be saved properly without serialization. First, we define a simple Sequence Detector object and a learning sequence.
\begin{lsequence}\label{seq:o_learn1}\mbox{}
<>=
st <- data.frame(product=c("P1","P1"),sales=c(5,76),alert=c(NA,"Alert"))
input_streams <- list(stream=st)
@
\end{lsequence}
\begin{footnotesize}
<>=
pp <- HSC_PP(c("product","sales","alert"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Attribute("sales")
seq_detector_oo <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id", "sequence_id",context_field="product",preclassifier=pc, preprocessor=pp,reuse_states=TRUE,pattern_field="alert")
seq_detector_oo$process(input_streams,learn=TRUE)
@
\end{footnotesize}
\noindent Before saving the Sequence Detector object instance, we need to perform the serialization procedure.
<>= seq_detector_oo$serialize() @
The results of the serialization can be checked in the \emph{cache} field of the Sequence Detector object.
<<>>=
c_to_string(names(seq_detector_oo$cache))
c_to_string(names(seq_detector_oo$cache[[1]]))
c_to_string(names(seq_detector_oo$cache[[1]][["states"]]))
saveRDS(seq_detector_oo,"test.RDS")
@
After loading, the deserialization method \emph{deserialize} can be invoked explicitly. However, this method is also invoked implicitly as soon as any of the Sequence Detector methods is invoked.
<>= new_seq_detector_oo <- readRDS("test.RDS") @
<>= file.remove("test.RDS") @
<>= new_seq_detector_oo$printMachines() @
\begin{lstlisting}
<>= new_seq_detector_oo$printMachines() @
\end{lstlisting}
\subsubsection{Serializing and deserializing into an external named list}
A Sequence Detector object can also be serialized to and deserialized from an external variable. Serialization is done into a named list.
<<>>=
sd_list <- new_seq_detector_oo$serializeToList()
c_to_string(names(sd_list))
c_to_string(names(sd_list[[1]]))
c_to_string(names(sd_list[[1]][["states"]]))
totally_new_sd_oo <- deserializeFromList(sd_list)
@
<>= totally_new_sd_oo$printMachines() @
\begin{lstlisting}
<>= totally_new_sd_oo$printMachines() @
\end{lstlisting}
\subsection{Merging ETTs}
When multiple ETTs are created, they can be merged into a single ETT. As mentioned in \citep{krleza2019latent}, if there is no common point among the ETTs, this will create a single ETT with a disconnected structure. Let us amend the previously generated Sequence Detector.
\begin{lsequence}\label{seq:merg_learn1}\mbox{}
<>=
st <- data.frame(product=c("P5","P5","P5"),sales=c(9,76,10), alert=c(NA,"Alert","Alert P5"))
input_streams <- list(stream=st)
totally_new_sd_oo$process(input_streams,learn=TRUE)
@
\end{lsequence}
<>= totally_new_sd_oo$printMachines() @
\begin{lstlisting}
<>= totally_new_sd_oo$printMachines() @
\end{lstlisting}
The result is two ETTs that share a common state $76$ with the pattern $Alert$. Merging the Sequence Detector results in a single connected ETT.
<<>>= totally_new_sd_oo$mergeMachines() @
<>= totally_new_sd_oo$printMachines() @
\begin{lstlisting}
<>= totally_new_sd_oo$printMachines() @
\end{lstlisting}
However, such a merged ETT intertwines the sequences learned for $P1$ and $P5$. This might be a good thing for process discovery, but not so good for detecting sequences in time-series datasets. For example, the following testing stream mixes the learned data sequences...
\begin{tsequence}\label{seq:merg_test1}\mbox{}
<>=
tt <- data.frame(product=c("P10","P10","P10"),sales=c(5,76,10),alert=NA)
test_streams <- list(stream=tt)
@
\end{tsequence}
\begin{footnotesize}
<>=
res_test1 <- totally_new_sd_oo$process(test_streams,learn=FALSE)
out_test1 <- data.frame()
for(i in 1:nrow(res_test1$stream)) out_test1 <- rbind(out_test1,data.frame(product=res_test1$stream[i,"product"], sales=res_test1$stream[i,"sales"], alert=c_to_string(res_test1$explanation[[i]]$actual)))
@
\end{footnotesize}
<>= print(xtable(out_test1,caption="Testing sequence \\ref{seq:merg_test1} results",label="tab:merg_test1_res1")) @
\bibliography{SeqDetect}
\end{document}