NAME
Statistics::Gap - Perl extension for the "Gap Statistic"
SYNOPSIS
use Statistics::Gap;
&gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "unif", 90);
OR
use Statistics::Gap;
&gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "prop", 90);
Input file is can be in the "dense" format -
Sample Input file:
6 5
1 1 0 0 1
1 0 0 0 0
1 1 0 0 1
1 1 0 0 1
1 0 0 0 1
1 1 0 0 1
OR in "sparse" format -
Sample Input file:
8 10 41
1 1 4 1 8 1 10 1
1 1 2 1 3 1 5 1 7 1 9 1
2 1 3 1 4 1 5 1 7 1 9 1
2 1 4 1 5 1 9 1
1 1 4 1 6 1
1 1 2 1 3 1 4 1 5 1 6 1 10 1
2 1 4 1 5 1 7 1 8 1 10 1
2 1 4 1 7 1 9 1 10 1
DESCRIPTION
Given a dataset how does one automatically find the optimal number of
clusters that the dataset should be grouped into? - is one of the
prevailing problems. Statisticians Robert Tibshirani, Guenther Walther
and Trevor Hastie propose a solution for this problem in a Techinal
Report named - "Estimating the number of clusters in a dataset via the
Gap Statistic". This perl module implements the approach proposed in the
above paper.
If one tries to cluster a dataset (i.e. numerous observations described
in terms of a feature space) into n groups/clusters and if we plot the
graph of Within Cluster (Dis)Similarity along Y-axis and Number of
clusters along X-axis then this graph generally takes a form of a
elbow/knee depending upon the measure on the Y-axis. The Gap Statistic
seeks to locate this elbow/knee because the value on the X-axis at this
elbow is the optimal number of clusters for the data.
NOTE: Gap Statistic uses reference distribution in the process of
estimating the number of clusters. The appropriate methodology for
generation of this reference distribution is dependent on the data to be
clustered. This module was implemented for data with following
characteristics: 1. highly sparse - very few features occur in any given
observation. 2. high multivariate dimensionality (i.e. large feature
space) 3. binary feature frequency - feature either occurs or does not
occur in an observation but rarely occurs multiple times in the same
observation.
EXPORT
"gap" function by default.
INPUT
Prefix
The string that should be used to as a prefix while naming the
intermediate files and the .png files (graph files).
InputFile
The input dataset can be in "dense" or "sparse" matrix format. The input
matrix is expected in a plain text file where the first line in the file
gives the dimensions of the dataset and then the dataset in either of
the format should follow. The contexts / observations should be along
the rows and the features should be along the column.
eg: dense format -
6 5
1 1 0 0 1
1 0 0 0 0
1 1 0 0 1
1 1 0 0 1
1 0 0 0 1
1 1 0 0 1
The first line (6 5) gives the number of rows (observations) and the
number of columns (features) present in the following matrix. Each
following line/row records the presence (1) or absence (0) of the
feature at the column in the given observation. Thus features1 (1st
column) occurs once in the observation1 and infact once in all the other
observations too while the feature3 does not occur in observation1.
eg: sparse format -
8 10 41
1 1 4 1 8 1 10 1
1 1 2 1 3 1 5 1 7 1 9 1
2 1 3 1 4 1 5 1 7 1 9 1
2 1 4 1 5 1 9 1
1 1 4 1 6 1
1 1 2 1 3 1 4 1 5 1 6 1 10 1
2 1 4 1 5 1 7 1 8 1 10 1
2 1 4 1 7 1 9 1 10 1
The first line (8 10 41) gives the number of rows (observations), the
number of columns (features) and the number of non-zero elements present
in the following matrix. Each of the following line/row consists of pair
of values where the first number in the pair gives the column number
corresponding to the feature that occurs in the observation
corresponding to the row. In this format only features that are
present/observed (1) are recorded i.e. the features that are
absent/not-observed (0) are implied. Thus in the 1st row (observation1)
of the above matrix only features1, feature4, feature8 and feature10
occur while the other 7 features do not occur.
DistanceMeasure
The Distance Measure that should be used. Currrently this module
supports the following distance measure: 1. Squared Euclidean (string
that should be used as an argument: "squared") 2. Manhattan (string that
should be used as an argument: "manhattan") 3. Euclidean (string that
should be used as an argument: "euclidean")
ClusteringAlgorithm
The Clustering Measures that can be used are: 1. rb - Repeated
Bisections [Default] 2. rbr - Repeated Bisections for by k-way
refinement 3. direct - Direct k-way clustering 4. agglo - Agglomerative
clustering 5. graph - Graph partitioning-based clustering 6. bagglo -
Partitional biased Agglomerative clustering
K value
This is an approximate upper bound for the number of clusters that may
be present in the dataset. Thus for a dataset that you expect to be
seperated into 3 clusters this value should be set some integer value
greater than 3.
B value
Specifies the number of replicates that should be sampled from the
generated reference distribution using Monte Carlo Sampling.
ReferenceGenerationMethod
1. Uniform - While generating the reference distribution, all the
features in the feature set have equal probability of being selected for
the observation under consideration.
2. Proportional - Each feature is assigned a probability of being
selected depending upon its frequency of occurrence in the observed
data. Thus feature distribution is taken into consideration while
selecting the features for the reference distribution generation.
Percentage Confidence Interval
This parameter specifies the percentage confidence to be reported in the
log file. Since Statistics::Gap uses parametric bootstrap method for
reference distribution generation, it is critical to understand the
interval around the sample mean that could contain the population
("true") mean and with what certainty.
OUTPUT
1. A single integer number at STDOUT which is the Gap Statistic's
estimate of number of clusters present in the input dataset. 2. The
PREFIX.log file contains the log of various values at different K
values. The first table in the file gives values like Gap(k), log(W(k))
etc. for every K value experimented with. 3. The PREFIX.fig*.png files
are the graphical representations of different values which help locate
the knee/elbow.
PRE-REQUISITES
1. This module uses suite of C programs called CLUTO for clustering
purposes. Thus CLUTO needs to be installed for this module to be
functional. CLUTO can be downloaded from
http://www-users.cs.umn.edu/~karypis/cluto/
2. Following Perl Modules 1. GD
(http://search.cpan.org/~lds/GD-2.19/GD.pm) 2. GD::Text
(http://search.cpan.org/~mverb/GDTextUtil-0.86/Text.pm) 3.
GD::Graph::lines (http://search.cpan.org/~mverb/GDGraph-1.43/) 4.
GD::Graph::colour
(http://search.cpan.org/~mverb/GDGraph-1.43/Graph/colour.pm)
SEE ALSO
http://citeseer.ist.psu.edu/tibshirani00estimating.html
http://www-users.cs.umn.edu/~karypis/cluto/
AUTHOR
Anagha Kulkarni, University of Minnesota Duluth
kulka020 d.umn.edu
Guergana Savova, Mayo Clinic
savova.guergana mayo.edu
COPYRIGHT AND LICENSE
Copyright (C) 2006-2008, Anagha Kulkarni and Guergana Savova
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version. This program is distributed in the hope that
it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.