---
title: "Introduction to theftdlc"
author: "Trent Henderson"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{Introduction to theftdlc}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 7,
  fig.width = 7,
  warning = FALSE,
  fig.align = "center"
)
```

```{r setup, message = FALSE, warning = FALSE}
library(dplyr)
library(ggplot2)
library(theft)
library(theftdlc)
```

## Purpose

The [`theft`](https://hendersontrent.github.io/theft/) package for R facilitates user-friendly access to a structured analytical workflow for the extraction of time-series features from six different feature sets (or a set of user-supplied features): `"catch22"`, `"feasts"`, `"Kats"`, `"tsfeatures"`, `"tsfresh"`, and `"TSFEL"`. `theftdlc` extends this feature-based ecosystem by providing a suite of functions for analysing, interpreting, and visualising time-series features calculated using `theft`.

## Core calculation functions

To explore package functionality, we are going to use a dataset that comes standard with `theft` called `simData`. This dataset contains a collection of randomly generated time series for six different types of processes. The dataset can be accessed via:

```{r, message = FALSE, warning = FALSE, eval = FALSE}
theft::simData
```

The data has the following structure:

```{r, message = FALSE, warning = FALSE}
head(simData)
```

We will use `theft` to quickly calculate features using the `catch22` set:

```{r, message = FALSE, warning = FALSE}
feature_matrix <- calculate_features(data = simData, 
                                     id_var = "id", 
                                     time_var = "timepoint", 
                                     values_var = "values", 
                                     group_var = "process", 
                                     feature_set = "catch22",
                                     seed = 123)
```

## Data quality checks

The core `calculate_features` function in `theft` returns an object of class `feature_calculations`. Objects of this type are purposefully looked for by other functions in `theftdlc`. Because it is a class, simple methods such as `plot()` can be called on the object to produce a range of statistical graphics. The first is a visualisation of the data types of the calculated feature vectors. This is useful for inspecting which features might need to be dropped due to large proportions of undesirable (e.g., `NA`, `NaN`, etc.) values. We can specify the plot `type = "quality"` to make this graphic:

```{r, message = FALSE, warning = FALSE}
plot(feature_matrix, type = "quality")
```

## Data visualisation and low-dimensional projections

The package also comes with additional statistical and graphical functionality:

* Feature by time-series matrix as a heatmap
* Low-dimensional projections of the feature space, plotted as a scatterplot
* Pairwise feature correlation matrix as a heatmap

### Feature matrices

Calling `plot()` with `type = "matrix"` on a `feature_calculations` object produces a `ggplot` heatmap showing the feature vectors across the `x` axis and each time series down the `y` axis. Prior to plotting, the function hierarchically clusters the data across both rows and columns to visually highlight the empirical structure.
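As a rough illustration of the reordering this clustering step performs, here is a minimal sketch using base R's `stats::hclust` on a toy matrix (illustrative only, not `theftdlc`'s internal code):

```{r, eval = FALSE}
# Hierarchically cluster the rows and columns of a numeric matrix and
# use the dendrogram orderings to rearrange it, so that similar rows
# and columns sit next to each other in the heatmap
X <- matrix(rnorm(60), nrow = 10, ncol = 6)
row_order <- stats::hclust(stats::dist(X), method = "average")$order
col_order <- stats::hclust(stats::dist(t(X)), method = "average")$order
X_ordered <- X[row_order, col_order]
```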
Note that you have several options for the hierarchical clustering linkage algorithm to use:

* `"average"` (default)
* `"ward.D"`
* `"ward.D2"`
* `"single"`
* `"complete"`
* `"mcquitty"`
* `"median"`
* `"centroid"`

See the [`hclust` documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust) for more information. Note that the legend for this plot (and other matrix visualisations in `theftdlc`) has been discretised for visual clarity, as meaningful value differences can be difficult to interpret from continuous legends.

```{r, message = FALSE, warning = FALSE}
plot(feature_matrix, type = "matrix", norm_method = "RobustSigmoid")
```

You can control the normalisation type with the `norm_method` argument, and whether to rescale to the unit interval after normalisation with the `unit_int` argument. `norm_method` and all normalisation of feature vectors in `theftdlc` is handled by the [`normaliseR`](https://hendersontrent.github.io/normaliseR/) package. You can also control the hierarchical clustering method with the `clust_method` argument (the example above used defaults, so manual specification was not needed).

### Individual feature distributions

Plotting the entire feature matrix is useful, but sometimes we wish to understand the distributions of individual features. This is particularly useful if there are different groups in your data (such as in a time-series classification context). We can again use the `plot()` generic here to draw violin plots by setting `type = "violin"`. Note that for violin plots, we also need to tell the function which features we wish to plot (i.e., a character vector of feature names from the `names` column in your `feature_calculations` object). For simplicity, we will just plot two random features from `catch22` here:

```{r, message = FALSE, warning = FALSE}
plot(feature_matrix, type = "violin",
     feature_names = c("CO_f1ecac", "PD_PeriodicityWang_th0_01"))
```

Note that when using these `plot()` generics, you can pass additional arguments to certain geoms through the `...` argument in `plot()` to control the look of the plot. Below is a guide to where these arguments go depending on the plot type:

* `type = "quality"`---`...` goes to `ggplot2::geom_bar`
* `type = "matrix"`---`...` goes to `ggplot2::geom_raster`
* `type = "cor"`---`...` goes to `ggplot2::geom_raster`
* `type = "violin"`---`...` goes to `ggplot2::geom_point`

For example, we may wish to control the point size and transparency in the above plot (not rendered here for space):

```{r, eval = FALSE}
plot(feature_matrix, type = "violin",
     feature_names = c("CO_f1ecac", "PD_PeriodicityWang_th0_01"),
     size = 0.7, alpha = 0.9)
```

### Low-dimensional projections

Low-dimensional projections are a useful tool for visualising the structure of high-dimensional datasets in low-dimensional spaces. In machine learning for time-series data, we are often interested in representing a time-series dataset in a two-dimensional projection of the high-dimensional feature space. This projection can reveal structure in the dataset, including how different labelled classes are organised.
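To build intuition for what such a projection does, here is a minimal base-R sketch using `stats::prcomp` on a toy feature matrix (illustrative only; the `project` function introduced next handles this workflow for you):

```{r, eval = FALSE}
# Treat each time series as a row of feature values and keep the first
# two principal components as its 2D coordinates
X <- matrix(rnorm(200), nrow = 20, ncol = 10) # 20 series x 10 features
pca <- stats::prcomp(X, center = TRUE, scale. = TRUE)
embedding <- pca$x[, 1:2] # one (PC1, PC2) coordinate pair per series
```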
The `theftdlc` function `project` takes the `feature_calculations` object and performs one of the following dimensionality reduction techniques on it to reduce it to a bivariate state that can then be easily plotted:

* Principal components analysis (PCA)---`"PCA"`
* $t$-distributed stochastic neighbor embedding ($t$-SNE)---`"tSNE"`
* Classical multidimensional scaling (MDS)---`"ClassicalMDS"`
* Kruskal's non-metric multidimensional scaling---`"KruskalMDS"`
* Sammon's non-linear mapping non-metric multidimensional scaling---`"SammonMDS"`
* Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)---`"UMAP"`

The result is stored in a custom object class called `feature_projection`. `project` takes the following arguments:

* `data`---`feature_calculations` object containing the raw feature matrix produced by `theft::calculate_features`
* `norm_method`---character denoting the rescaling/normalising method to apply. Can be one of `"zScore"`, `"Sigmoid"`, `"RobustSigmoid"`, `"MinMax"`, or `"MaxAbs"`. Defaults to `"zScore"`
* `unit_int`---Boolean denoting whether to rescale into the unit interval $[0,1]$ after applying the normalisation method. Defaults to `FALSE`
* `low_dim_method`---character specifying the low-dimensional embedding method to use. Can be one of `"PCA"`, `"tSNE"`, `"ClassicalMDS"`, `"KruskalMDS"`, `"SammonMDS"`, or `"UMAP"`. Defaults to `"PCA"`
* `na_removal`---character defining the way to deal with `NA`s produced during feature calculation. Can be one of `"feature"` or `"sample"`. `"feature"` removes all features that produced any `NA`s in any sample, keeping the number of samples the same. `"sample"` omits all samples that produced at least one `NA`. Defaults to `"feature"`
* `seed`---integer to fix R's random number generator to ensure reproducibility. Defaults to `123`
* `...`---arguments to be passed to the respective function specified by `low_dim_method`

`project` returns an object of class `feature_projection`, which is essentially a named list comprised of four elements:

1. `"Data"`---the `feature_calculations` object supplied to `project`
2. `"ModelData"`---the wide matrix of filtered data supplied to the model fit
3. `"ProjectedData"`---a tidy `data.frame` of the two-dimensional embedding
4. `"ModelFit"`---the raw model object from the dimensionality reduction algorithm

```{r, message = FALSE, warning = FALSE}
low_dim <- project(feature_matrix,
                   norm_method = "RobustSigmoid",
                   unit_int = TRUE,
                   low_dim_method = "PCA",
                   seed = 123)
```

We can similarly call `plot()` on this object to produce a two-dimensional scatterplot of the results:

```{r, message = FALSE, warning = FALSE}
plot(low_dim)
```
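Because `feature_projection` is just a named list, its elements can also be accessed directly, for example to inspect the tidy embedding or the raw model fit. A quick exploratory sketch (what `summary()` reports for `"ModelFit"` depends on the backend used for the fit, so treat this as a starting point):

```{r, eval = FALSE}
head(low_dim$ProjectedData) # tidy data.frame of the 2D embedding
summary(low_dim$ModelFit)   # raw model object from the PCA fit
```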
As another example, a *t*-SNE version can be specified in a similar fashion, with any function parameters for the method supplied to the `...` argument to `project`. Shaded covariance ellipses can also be disabled when plotting `feature_projection` objects by setting `show_covariance = FALSE`. Here is an example where we modify the perplexity of the *t*-SNE algorithm:

```{r, message = FALSE, warning = FALSE}
low_dim2 <- project(feature_matrix,
                    norm_method = "RobustSigmoid",
                    unit_int = TRUE,
                    low_dim_method = "tSNE",
                    perplexity = 10,
                    seed = 123)

plot(low_dim2, show_covariance = FALSE)
```

### Pairwise correlations

You can plot correlations between feature vectors using `plot(type = "cor")` on a `feature_calculations` object:

```{r, message = FALSE, warning = FALSE}
plot(feature_matrix, type = "cor")
```

Similarly, you can control the normalisation type with the `norm_method` argument and the hierarchical clustering method with the `clust_method` argument (the example above used defaults, so manual specification was not needed).

## Time-series classification

### Feature-by-feature

Since feature-based time-series analysis has shown particular promise for classification problems, `theftdlc` includes functionality for exploring group separation. The function `classify` enables you to fit a range of classification models for statistical comparison using the resampling methodology presented in [this paper](https://arxiv.org/abs/2303.17809)^[T. Henderson, A. G. Bryant, and B. D. Fulcher, "Never a Dull Moment: Distributional Properties as a Baseline for Time-Series Classification", 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2023.], which provides a detailed review. This function is meant to serve as a fast answer that can guide analysis, not as a replacement for the development of a careful statistical pipeline. `classify` has the following arguments:

* `data`---`feature_calculations` object containing the raw feature matrix produced by `theft::calculate_features`, with a `group` column included as per `theft::calculate_features`
* `classifier`---function specifying the classifier to fit. Should be a function with two arguments: `formula` and `data`. Please note that `classify` z-scores data prior to modelling using the train set's information, so disabling default scaling if your function uses it is recommended. Defaults to `NULL`, which means the following linear SVM is fit: `classifier = function(formula, data){mod <- e1071::svm(formula, data = data, kernel = "linear", scale = FALSE, probability = TRUE)}`
* `train_size`---numeric value denoting the proportion of samples to use in the training set. Defaults to `0.75`
* `n_resamples`---integer denoting the number of resamples to calculate. Defaults to `30`
* `by_set`---Boolean specifying whether to compute classifiers for each feature set. Defaults to `TRUE` (see the "Multi-feature" section below for more on this). If `FALSE`, the function will instead find the best individually-performing features
* `use_null`---Boolean denoting whether to fit null models in which class labels are shuffled, in order to generate a null distribution that can be compared to performance on the correct class labels. This is known as permutation testing. Defaults to `FALSE`
* `seed`---integer to fix R's random number generator to ensure reproducibility. Defaults to `123`
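To make the resampling procedure concrete, here is a minimal standalone sketch of what a single resample conceptually involves: a train/test split, z-scoring with the training set's statistics, fitting the default linear SVM, and computing test-set accuracy. This is illustrative only, not `theftdlc`'s internal code, and the toy data frame and feature names are invented for the example:

```{r, eval = FALSE}
set.seed(123)

# Invented toy data: a wide matrix of two features with a group label
wide <- data.frame(group = factor(rep(c("A", "B"), each = 50)),
                   f1 = c(rnorm(50, 0), rnorm(50, 2)),
                   f2 = c(rnorm(50, 0), rnorm(50, 1)))

# One resample: 75% train / 25% test, mirroring train_size = 0.75
train_idx <- sample(seq_len(nrow(wide)), size = 0.75 * nrow(wide))
train <- wide[train_idx, ]
test <- wide[-train_idx, ]

# z-score using the *training* set's statistics, as classify does
for (f in c("f1", "f2")) {
  mu <- mean(train[[f]])
  sigma <- sd(train[[f]])
  train[[f]] <- (train[[f]] - mu) / sigma
  test[[f]] <- (test[[f]] - mu) / sigma
}

# Fit the default linear SVM and score accuracy on the held-out set
mod <- e1071::svm(group ~ ., data = train, kernel = "linear",
                  scale = FALSE, probability = TRUE)
preds <- predict(mod, newdata = test)
mean(preds == test$group)
```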
Since we are interested in individual features in this section, we will calculate both main and null results for each feature with the default linear SVM, using just `5` resamples for efficiency (in practice, we would use more!):

```{r, message = FALSE, warning = FALSE}
feature_classifiers <- classify(feature_matrix,
                                by_set = FALSE,
                                n_resamples = 5,
                                use_null = TRUE)
```

To show how simple it is to specify a different classifier, we can instead use a radial basis function SVM (though you are absolutely not limited to `e1071` models! You can use anything that works with R's `predict` generic, as `classify` internally constructs confusion matrices from model predictions):

```{r, message = FALSE, warning = FALSE}
myclassifier <- function(formula, data){
  mod <- e1071::svm(formula, data = data, kernel = "radial",
                    scale = FALSE, probability = TRUE)
}

feature_classifiers_radial <- classify(feature_matrix,
                                       classifier = myclassifier,
                                       by_set = FALSE,
                                       n_resamples = 5,
                                       use_null = TRUE)
```

While having raw classification results is useful, we often also want to statistically evaluate some facet of them. `theftdlc` includes the function `compare_features` for doing this. `compare_features` contains the following arguments:

* `data`---list object containing the classification outputs produced by `classify`
* `metric`---character denoting the classification performance metric to use in statistical testing. Can be one of `"accuracy"`, `"precision"`, `"recall"`, or `"f1"`. Defaults to `"accuracy"`
* `by_set`---Boolean specifying whether you want to compare feature sets (if `TRUE`) or individual features (if `FALSE`). Defaults to `TRUE`, but this is contingent on whether you computed by set or not in `classify`
* `hypothesis`---character denoting whether p-values should be calculated for each feature set or feature (depending on the `by_set` argument) individually relative to the null (if `use_null = TRUE` in `classify`) through `"null"`, or whether pairwise comparisons between each set or feature should be conducted on main model fits only through `"pairwise"`. Defaults to `"null"`
* `p_adj`---character denoting the adjustment made to p-values for multiple comparisons. Should be a valid argument to [`stats::p.adjust`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust). Defaults to `"none"` for no adjustment. `"holm"` is recommended as a starting point if adjustments are sought

We can use `compare_features` to evaluate how well each individual feature performs relative to its empirical null distribution (noting that we are using the defaults for the other arguments for code cleanliness):

```{r, message = FALSE, warning = FALSE}
feature_vs_null <- compare_features(feature_classifiers,
                                    by_set = FALSE,
                                    hypothesis = "null")

head(feature_vs_null)
```

Or to conduct pairwise comparisons between individual features:

```{r, message = FALSE, warning = FALSE}
pairwise_features <- compare_features(feature_classifiers,
                                      by_set = FALSE,
                                      hypothesis = "pairwise",
                                      p_adj = "holm")

head(pairwise_features)
```

We can then use `ggplot2` to summarise and visualise our results. Here is a pairwise correlation plot between the top 10 features in `catch22` for this toy problem.
We simply filter the original full feature data and make use of the `plot` generic defined for objects of class `feature_calculations`:

```{r, message = FALSE, warning = FALSE}
top_10 <- feature_vs_null %>%
  dplyr::slice_min(p.value, n = 10) %>%
  dplyr::select(c(feature_set, original_names, p.value))

feature_matrix_filt <- feature_matrix %>%
  dplyr::filter(feature_set %in% top_10$feature_set & names %in% top_10$original_names)

feature_matrix_filt <- structure(feature_matrix_filt,
                                 class = c("feature_calculations", "data.frame"))

plot(feature_matrix_filt, type = "cor")
```

We can also easily draw a violin plot of the top 10 features to visualise the distributions by group:

```{r, message = FALSE, warning = FALSE}
plot(feature_matrix_filt, type = "violin", feature_names = top_10$original_names)
```

Finally, `theftdlc` also contains a function `interval` for summarising the results of `classify`. `interval` takes the following arguments:

* `data`---list object containing the classification outputs produced by `classify`
* `metric`---character denoting the classification performance metric to calculate intervals for. Can be one of `"accuracy"`, `"precision"`, `"recall"`, or `"f1"`. Defaults to `"accuracy"`
* `by_set`---Boolean specifying whether to compute intervals for each feature set. Defaults to `TRUE`. If `FALSE`, the function will instead calculate intervals for each feature
* `type`---character denoting whether to calculate a $\pm$ SD interval with `"sd"`, a confidence interval based on the $t$-distribution with `"qt"`, or a quantile-based interval with `"quantile"`. Defaults to `"sd"`
* `interval`---numeric scalar denoting the width of the interval to calculate. Defaults to `1` if `type = "sd"` to produce a $\pm 1$ SD interval. Defaults to `0.95` if `type = "qt"` or `type = "quantile"` for a $95\%$ interval
* `model_type`---character denoting whether to calculate intervals for main models with `"main"` or null models with `"null"` (if `use_null = TRUE` was set in `classify`). Defaults to `"main"`

Evidently, we can use `interval` to produce a variety of different summaries. For example, we might wish to compute the $\pm 1$ SD interval for each feature's main model classification accuracy values (note that the defaults for the function do this for us, so we only need to set `by_set = FALSE` manually):

```{r, message = FALSE, warning = FALSE}
interval(feature_classifiers, by_set = FALSE)
```

### Multi-feature

Since `theft` contains entire sets of features, we can also use `classify` to compare them at the set level through the `by_set` argument. Let's try both `catch22` and a custom set of just the mean and standard deviation:

```{r, message = FALSE, warning = FALSE}
feature_matrix2 <- calculate_features(data = simData,
                                      group_var = "process",
                                      feature_set = "catch22",
                                      features = list("mean" = mean, "sd" = sd),
                                      seed = 123)

set_classifiers <- classify(feature_matrix2,
                            by_set = TRUE,
                            n_resamples = 5,
                            use_null = TRUE)

head(set_classifiers)
```

Note that `classify` automatically constructs an `"All features"` set (i.e., all features across all computed sets) when $\geq 2$ unique feature sets are detected in the feature data.
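`interval` works on these set-level results too. For example, drawing on the arguments documented earlier, a $95\%$ confidence interval based on the $t$-distribution can be requested with `type = "qt"` (a usage sketch; the defaults handle the rest):

```{r, eval = FALSE}
# 95% t-distribution confidence interval for each set's main models
interval(set_classifiers, type = "qt", interval = 0.95)
```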
Similar to the individual feature case, we can also combine `interval` with `ggplot2` to summarise our findings visually. Here is a comparison of mean accuracy $\pm 1$ SD between feature sets:

```{r, message = FALSE, warning = FALSE}
interval_calcs <- interval(set_classifiers)

interval_calcs %>%
  ggplot2::ggplot(ggplot2::aes(x = reorder(feature_set, -.mean), y = .mean,
                               colour = feature_set)) +
  ggplot2::geom_errorbar(ggplot2::aes(ymin = .lower, ymax = .upper)) +
  ggplot2::geom_point(size = 5) +
  ggplot2::labs(x = "Feature set", y = "Classification accuracy") +
  ggplot2::scale_colour_brewer(palette = "Dark2") +
  ggplot2::theme_bw() +
  ggplot2::theme(legend.position = "none",
                 panel.grid.minor = ggplot2::element_blank())
```

## Cluster analysis

`theftdlc` also supports quick and simple cluster analysis using [$k$-means](https://en.wikipedia.org/wiki/K-means_clustering), hierarchical clustering, or [Gaussian mixture models](https://en.wikipedia.org/wiki/Mixture_model) through the `cluster` function. `cluster` takes several key arguments similar to other `theftdlc` functions (though defaults are set for all of them, so only `data` is required for `cluster` to work):

* `data`---`feature_calculations` object containing the raw feature matrix produced by `theft::calculate_features`
* `norm_method`---character denoting the rescaling/normalising method to apply. Can be one of `"zScore"`, `"Sigmoid"`, `"RobustSigmoid"`, `"MinMax"`, or `"MaxAbs"`. Defaults to `"zScore"`
* `unit_int`---Boolean denoting whether to rescale into the unit interval $[0,1]$ after applying the normalisation method. Defaults to `FALSE`
* `clust_method`---character specifying the clustering algorithm to use. Can be one of `"kmeans"`, `"hclust"`, or `"mclust"`. Defaults to `"kmeans"`
* `k`---integer denoting the number of clusters to extract. Defaults to `2`
* `features`---character vector denoting the names of time-series features to use in the clustering algorithm. Defaults to `NULL` for no feature filtering and usage of the entire feature matrix
* `na_removal`---character defining the way to deal with `NA`s produced during feature calculation. Can be one of `"feature"` or `"sample"`. `"feature"` removes all features that produced any `NA`s in any sample, keeping the number of samples the same. `"sample"` omits all samples that produced at least one `NA`. Defaults to `"feature"`
* `seed`---integer to fix R's random number generator to ensure reproducibility. Defaults to `123`
* `...`---additional arguments to be passed to `stats::kmeans`, `stats::hclust`, or `mclust::Mclust` depending on `clust_method`

`cluster` returns an object of class `feature_clusters`, which is essentially a named list comprised of two elements:

1. `"Data"`---the `feature_calculations` object supplied to `cluster` with the cluster label appended
`"ModelFit"`---the raw model object from the clustering algorithm We can easily fit a $k$-means model with `k = 6` (since `theft::simData` contains data for six different temporal processes): ```{r, message = FALSE, warning = FALSE} feature_clusters <- cluster(feature_matrix, k = 6) ``` From here, it's easy to do any further analysis or data visualisation: ```{r, message = FALSE, warning = FALSE} feature_clusters$Data %>% dplyr::filter(names %in% c("CO_HistogramAMI_even_2_5", "DN_OutlierInclude_p_001_mdrmd")) %>% tidyr::pivot_wider(id_cols = c("id", "group", "cluster"), names_from = "names", values_from = "values") %>% ggplot2::ggplot(ggplot2::aes(x = CO_HistogramAMI_even_2_5, DN_OutlierInclude_p_001_mdrmd, colour = as.factor(cluster))) + ggplot2::stat_ellipse(ggplot2::aes(fill = as.factor(cluster)), geom = "polygon", alpha = 0.2) + ggplot2::geom_point() + ggplot2::labs(colour = "Cluster") + ggplot2::guides(fill = "none") + ggplot2::scale_fill_brewer(palette = "Dark2") + ggplot2::scale_colour_brewer(palette = "Dark2") + ggplot2::theme_bw() + ggplot2::theme(legend.position = "bottom", panel.grid.minor = ggplot2::element_blank()) ```