kmed: Distance-Based K-Medoids

Abstract

The kmed vignette consists of four sequential parts of distance-based (k-medoids) cluster analysis. The first part defines the distance, which can be numerical, binary, categorical, or mixed. The next part applies a clustering algorithm to the pre-defined distance. Five k-medoids algorithms are presented, namely the simple and fast k-medoids, k-medoids, ranked k-medoids, increasing number of clusters in k-medoids, and simple k-medoids. After the clustering result is obtained, a validation step is required. The cluster validation applies internal and relative criteria. The last part visualizes the cluster result in a biplot or marked barplot.

2. Distance Computation

2.A. Numerical variables (`distNumeric`)

The distNumeric function calculates numerical distances. There are five distance options, namely Manhattan weighted by range (mrw), squared Euclidean weighted by range (ser), squared Euclidean weighted by squared range (ser.2), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The desired distance is selected through the method argument of distNumeric. The default method is mrw.
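The distance options can be written compactly. The following is a sketch inferred from the option names and from the manual calculation later in this section, not a quotation of the package documentation. The notation is introduced here for illustration: $x_{ir}$ is the value of object $i$ on variable $r$, $p$ is the number of variables, $R_r$ is the range of variable $r$, and $s_r^2$ is its variance.

```latex
\begin{aligned}
d^{\mathrm{mrw}}_{ij}   &= \sum_{r=1}^{p} \frac{|x_{ir} - x_{jr}|}{R_r}, &
d^{\mathrm{ser}}_{ij}   &= \sum_{r=1}^{p} \frac{(x_{ir} - x_{jr})^2}{R_r}, \\
d^{\mathrm{ser.2}}_{ij} &= \sum_{r=1}^{p} \frac{(x_{ir} - x_{jr})^2}{R_r^2}, &
d^{\mathrm{sev}}_{ij}   &= \sum_{r=1}^{p} \frac{(x_{ir} - x_{jr})^2}{s_r^2}, \\
d^{\mathrm{se}}_{ij}    &= \sum_{r=1}^{p} (x_{ir} - x_{jr})^2 .
\end{aligned}
```

Under this reading, the options differ only in whether the per-variable difference is taken in absolute value or squared, and in the quantity used to scale it.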

The distance computation for numerical variables is demonstrated on the iris data set. A manual calculation of the distance between the first and second objects is shown to illustrate what the distNumeric function does.

library(kmed)

iris[1:3,]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

2.A.1. Manhattan weighted by range (`method = "mrw"`)

By applying the distNumeric function with method = "mrw", the distances among all objects in the iris data set can be obtained.

num <- as.matrix(iris[,1:4])
rownames(num) <- rownames(iris)
# calculate the Manhattan weighted by range distance of all iris objects
# (method = "mrw" is the default; it is spelled out here for clarity)
mrwdist <- distNumeric(num, num, method = "mrw")
# show the distance among objects 1 to 3
mrwdist[1:3,1:3]

##           1         2         3
## 1 0.0000000 0.2638889 0.2530603
## 2 0.2638889 0.0000000 0.1558380
## 3 0.2530603 0.1558380 0.0000000

The Manhattan weighted by range distance between objects 1 and 2 is 0.2638889. To calculate this distance, the range of each variable is computed.

#extract the range of each variable
apply(num, 2, function(x) max(x)-min(x))

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          3.6          2.4          5.9          2.4

Then, the distance between objects 1 and 2 is

#the distance between objects 1 and 2
abs(5.1-4.9)/3.6 + abs(3.5 - 3.0)/2.4 + abs(1.4-1.4)/5.9 + abs(0.2-0.2)/2.4

## [1] 0.2638889

which is based on the data

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
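The manual calculation above can be generalized to any pair of objects. The following is a minimal sketch in base R, assuming the range-weighted Manhattan form used in the hand calculation; `mrw_manual` is a hypothetical helper written for this illustration and is not part of kmed.

```r
# Hypothetical helper (not part of kmed): range-weighted Manhattan distance
# between every pair of rows of a numeric matrix, using base R only.
mrw_manual <- function(x, ranges) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # sum of absolute differences, each divided by the variable's range
      d[i, j] <- sum(abs(x[i, ] - x[j, ]) / ranges)
    }
  }
  d
}

num <- as.matrix(iris[, 1:4])
rownames(num) <- rownames(iris)
# ranges are taken over the full data set, as in the text above
rng <- apply(num, 2, function(v) max(v) - min(v))
# distances among objects 1 to 3 only
d3 <- mrw_manual(num[1:3, ], rng)
round(d3, 7)
```

If the assumption holds, `d3` should agree with the 3 x 3 block printed by `mrwdist[1:3,1:3]`. The double loop is kept for readability and is only meant for small inputs.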

Weksi Budiaji

2022-08-29

1. Introduction

2. Distance Computation

2.A. Numerical variables (distNumeric)

2.A.1. Manhattan weighted by range (method = "mrw")

2.A.2. squared Euclidean weighted by range (method = "ser")

2.A.3. squared Euclidean weighted by squared range (method = "ser.2")

2.A.4. squared Euclidean weighted by variance (method = "sev")

2.A.5. squared Euclidean (method = "se")

2.B. Binary or Categorical variables

2.B.1. Simple matching (matching)

2.B.2. Co-occurrence distance (cooccur)

2.C. Mixed variables (distmix)

2.C.1 Gower (method = "gower")

2.C.2 Wishart (method = "wishart")

2.C.3 Podani (method = "podani")

2.C.4 Huang (method = "huang")

2.C.5 Harikumar and PV (method = "harikumar")

2.C.6 Ahmad and Dey (method = "ahmad")

3. K-medoids algorithms

3.A. Simple and fast k-medoids algorithm (fastkmed)

3.B. K-medoids algorithm

3.C. Rank k-medoids algorithm (rankkmed)

3.D. Increasing number of clusters k-medoids algorithm (inckmed)

3.E. Simple k-medoids algorithm (skm)

4. Cluster validation

4.A. Internal criteria

4.A.1. Silhouette (sil)

4.A.2. Centroid-based shadow value (csv)

4.A.3. Medoid-based shadow value (msv)

4.B. Relative criteria

Step 1 Creating a matrix of bootstrap replicates

Step 2 Transforming the bootstrap matrix into a consensus matrix

Step 3 Visualizing the consensus matrix in a heatmap

5. Cluster visualization

A. Biplot

B. Marked barplot

References
