Clustering snow profiles

Objective

This vignette summarizes the methods for clustering snow profiles available in sarp.snowprofile.alignment, drawing on the papers cited throughout this vignette (Herla et al., 2022; Horton et al., submitted).

Sample profiles

We demonstrate clustering with the SPgroup2 snowprofileSet, which contains five snow profiles from a single date.

## Load packages
library(sarp.snowprofile.alignment)

## Sample profiles
plot(SPgroup2, SortMethod = 'hs')

Distance between profiles

Many clustering methods in R take a distance matrix of class dist as input. The distanceSP function provides several ways to produce a distance matrix from a snowprofileSet by making pairwise comparisons of snow profile similarities. distanceSP passes arguments to dtwSP and further on to simSP to configure the comparisons. We recommend checking the documentation of simSP for the available approaches to calculating similarities.

## Distance matrix using default settings
distmat1 <- distanceSP(SPgroup2)
print(distmat1)
#>                               VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                     0.2598810
#> VIR075907.2019-01-21 17:00:00                     0.3016338
#> VIR084723.2019-01-21 17:00:00                     0.3664091
#> VIR082519.2019-01-21 17:00:00                     0.3893727
#>                               VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                     0.3236395
#> VIR084723.2019-01-21 17:00:00                     0.3686663
#> VIR082519.2019-01-21 17:00:00                     0.3313452
#>                               VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                     0.4225680
#> VIR082519.2019-01-21 17:00:00                     0.4405521
#>                               VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                              
#> VIR082519.2019-01-21 17:00:00                     0.2002722
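
The comparison method can be configured by passing arguments through distanceSP. As a minimal sketch, the layerwise similarity measure (the same simType used later in this vignette via clusterSPconfig) could be requested directly:

## Distance matrix using a layerwise similarity measure
distmat_layerwise <- distanceSP(SPgroup2, simType = 'layerwise')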

Faster clustering

Pairwise distance calculations can be slow for large datasets. Three options for improving performance, each demonstrated below, are:

  1. Calculate the distances in parallel on multiple cores via the n_cores argument, which activates the parallel package.
  2. Set symmetric = FALSE to compute only one of dtwSP(A, B) or dtwSP(B, A), potentially losing some accuracy but cutting the number of alignments in half.
  3. Set fast_summary = TRUE to compute approximate distances nearly instantaneously by comparing summary statistics instead of aligning profiles with dynamic time warping.
## Check for same result when computing distances in parallel on multiple cores
# library(parallel)
# n_cores <- detectCores() - 1
# distmat1b <- distanceSP(SPgroup2, n_cores = n_cores)
# identical(distmat1, distmat1b)
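
## Halve the number of alignments by skipping the reverse direction
## (assumes the alignment-based distances are close enough to symmetric)
distmat1c <- distanceSP(SPgroup2, symmetric = FALSE)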

## Fast version of pairwise distances based on summary stats
distmat2 <- distanceSP(SPgroup2, fast_summary = TRUE)
print(distmat1)
#>                               VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                     0.2598810
#> VIR075907.2019-01-21 17:00:00                     0.3016338
#> VIR084723.2019-01-21 17:00:00                     0.3664091
#> VIR082519.2019-01-21 17:00:00                     0.3893727
#>                               VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                     0.3236395
#> VIR084723.2019-01-21 17:00:00                     0.3686663
#> VIR082519.2019-01-21 17:00:00                     0.3313452
#>                               VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                     0.4225680
#> VIR082519.2019-01-21 17:00:00                     0.4405521
#>                               VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                              
#> VIR082519.2019-01-21 17:00:00                     0.2002722
print(distmat2)
#>                               VIR077561.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                     0.3185841
#> VIR075907.2019-01-21 17:00:00                     0.8113144
#> VIR084723.2019-01-21 17:00:00                     0.6276566
#> VIR082519.2019-01-21 17:00:00                     0.3443081
#>                               VIR080858.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                     0.5644011
#> VIR084723.2019-01-21 17:00:00                     0.4471012
#> VIR082519.2019-01-21 17:00:00                     0.3125985
#>                               VIR075907.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                     0.3631210
#> VIR082519.2019-01-21 17:00:00                     0.7049122
#>                               VIR084723.2019-01-21 17:00:00
#> VIR080858.2019-01-21 17:00:00                              
#> VIR075907.2019-01-21 17:00:00                              
#> VIR084723.2019-01-21 17:00:00                              
#> VIR082519.2019-01-21 17:00:00                     0.4924477
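
Note that the approximate summary-based distances in distmat2 can differ considerably from the alignment-based distances in distmat1, so the fast_summary option trades accuracy for speed and is best reserved for large datasets.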

For clustering applications, the clusterSP function handles the distance calculations internally, but it is worth understanding how to control them: the clusterSPconfig function returns a list whose args_distance element contains arguments that are passed on to the distance calculations.

config <- clusterSPconfig(simType = 'layerwise', ddate = TRUE)
str(config)
#> List of 6
#>  $ args_distance     :List of 4
#>   ..$ rescale_resample: logi FALSE
#>   ..$ simType         : chr "layerwise"
#>   ..$ dims            : chr [1:3] "gtype" "hardness" "ddate"
#>   ..$ weights         : num [1:3] 0.375 0.125 0.5
#>  $ args_centers      :List of 3
#>   ..$ simType: chr "layerwise"
#>   ..$ dims   : chr [1:3] "gtype" "hardness" "ddate"
#>   ..$ weights: num [1:3] 0.375 0.125 0.5
#>  $ args_cluster      : list()
#>  $ args_simpleweights: list()
#>  $ verbose           : logi TRUE
#>  $ args_fast         : Named num [1:10] 2 0 1 0 0 1 0 0 0 0
#>   ..- attr(*, "names")= chr [1:10] "w_hs" "w_hn24" "w_h3d" "w_slab" ...
distmat3 <- do.call('distanceSP', c(list(SPgroup2), config$args_distance))

Clustering methods

We present three distinct clustering methods:

  1. Hierarchical clustering of the distance matrix with hclust.
  2. Partition clustering of the distance matrix with pam.
  3. k-dimensional barycenter averaging, which clusters by directly computing average profiles with averageSP.

All methods are applied using the clusterSP function with different type arguments. We use the already computed distance matrix distmat1; however, clusterSP can also compute distance matrices if only provided with a snowprofileSet. The output is a list of class clusterSP that contains a variety of information about the clustering solution and has a plot method plot.clusterSP to show the grouping of clustered profiles.

Hierarchical clustering

Hierarchical clustering organizes data into a tree of nested clusters. Agglomerative hierarchical clustering begins with each profile as a separate cluster and then iteratively merges the closest clusters until all are combined into one. This process forms a dendrogram representing the data’s hierarchical structure. The desired number of clusters can be set by specifying the number of groups \(k\) or by setting the threshold height for cutting the tree. The method is implemented using the stats::hclust function.

cl_hclust <- clusterSP(SPx = SPgroup2, k = 2, type = 'hclust', distmat = distmat1)
plot(cl_hclust)
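
For comparison, a minimal base-R sketch of the same approach that cuts the dendrogram at a height threshold instead of a fixed number of groups (the threshold 0.4 is arbitrary, for illustration only):

## Hierarchical clustering of the distance matrix with stats::hclust
hcl <- hclust(distmat1)
## Cluster assignments from cutting the tree at height 0.4
cutree(hcl, h = 0.4)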

Partitioning around medoids

Partitioning around medoids (PAM) is a partition-based clustering method in which data points are assigned to a predetermined number of distinct clusters \(k\). Data points are iteratively assigned to clusters to minimize the sum of squared distances between data points and the cluster centers. PAM uses actual data points as centers (medoids), in contrast to the more common k-means clustering, which uses the average of the data points in a cluster as its center (centroid). The method is implemented using the cluster::pam function.

cl_pam <- clusterSP(SPx = SPgroup2, k = 2, type = 'pam', distmat = distmat1)
plot(cl_pam)

Horton et al. (submitted) use a fuzzy variant of PAM in which data points are assigned partial membership values for each cluster, which can be done with the cluster::fanny function. Note that the example snowprofileSet in this vignette does not contain enough data points for fanny clustering.
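
With a larger snowprofileSet, a fuzzy clustering could be sketched as follows (commented out because it will not run on this small example):

## Fuzzy clustering of a distance matrix with cluster::fanny
# library(cluster)
# cl_fuzzy <- fanny(distmat, k = 2)
# cl_fuzzy$membership  ## partial membership of each profile in each cluster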

k-dimensional barycenter averaging

k-dimensional barycenter averaging (kdba) is a variant of k-means clustering that operates directly on sequences or time series (Petitjean et al., 2011). It computes the barycenter (centroid) of each cluster based on the average dissimilarity between the objects in the cluster and assigns each object to the closest barycenter. For snow profiles, the cluster barycenters are represented by the average snow profile of each cluster, computed with the dynamic time warping barycenter averaging (DBA) method described by Herla et al. (2022).

An initial clustering (which can be random or based on a ‘sophisticated guess’) is iteratively refined by assigning individual profiles to the most similar cluster and recomputing the cluster centroids at the end of every iteration. The result is sensitive to the initial conditions. An advantage of this method is that it considers the stratigraphy of the profiles in greater detail, but a disadvantage is that iterating over different values of \(k\) is slow.

cl_kdba <- clusterSP(SPx = SPgroup2, k = 2, type = 'kdba')
#> Iteration 1: k = 2
#> clustering: 1, 1, 1, 2, 2
#> RMSE: 0.155 0.077
#> Iteration 2: k = 2
#> clustering: 1, 1, 1, 2, 2
#> RMSE: 0.155 0.077
#> Clusters converged!
plot(cl_kdba, centers = 'n')
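
Because the result depends on the initial conditions and every run repeats the full averaging procedure, comparing candidate numbers of clusters means rerunning clusterSP for each value of \(k\); a minimal (slow) sketch:

## Rerun kdba clustering for several candidate cluster counts
# cl_list <- lapply(2:3, function(k) clusterSP(SPx = SPgroup2, k = k, type = 'kdba'))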

Representative profile

Producing representative profiles for each cluster can be useful. You can compute these profiles as centroids using averageSP or as medoids using medoidSP. Depending on the clustering method, these may already be computed and included in the output of clusterSP as medoid indices ($id.med) or as a snowprofileSet of centroid profiles ($centroids). To explicitly request specific representative profiles, use the centers argument with options ‘medoids’, ‘centroids’, ‘both’, or ‘none’, as sketched below. The plot method for clusterSP can plot the representative profiles beneath the cluster groups.
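
For example, to have both types of representative profiles computed during clustering (a sketch using the centers argument described above):

## Request both medoid and centroid profiles during clustering
cl_both <- clusterSP(SPx = SPgroup2, k = 2, type = 'pam', distmat = distmat1, centers = 'both')
cl_both$id.med     ## indices of the medoid profiles
cl_both$centroids  ## snowprofileSet of centroid profiles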

plot(cl_pam, centers = 'medoids')

plot(cl_kdba, centers = 'centroids')

Manipulate distance matrix

Horton et al. (submitted) apply clustering to forecast regions by manipulating the distance matrix to also account for spatial and temporal criteria. Here is an example that modifies the distance matrix based on an additional criterion: whether the uppermost layer of the profile is surface hoar.

## A binary vector identifying which profiles have SH on the surface
sh <- sapply(SPgroup2, function(x) x$layers$gtype[nrow(x$layers)] == 'SH')

## Construct a distance matrix
distmat_sh <- dist(sh)

## Create weighted average
distmat <- 0.25 * distmat1 + 0.75 * distmat_sh
cl_sh <- clusterSP(SPx = SPgroup2, k = 2, type = 'hclust', distmat = distmat)
plot(cl_sh)