NAME Statistics::Gap - Perl extension for the "Gap Statistic" SYNOPSIS use Statistics::Gap; &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "unif", 90); OR use Statistics::Gap; &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "prop", 90); Input file is can be in the "dense" format - Sample Input file: 6 5 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 OR in "sparse" format - Sample Input file: 8 10 41 1 1 4 1 8 1 10 1 1 1 2 1 3 1 5 1 7 1 9 1 2 1 3 1 4 1 5 1 7 1 9 1 2 1 4 1 5 1 9 1 1 1 4 1 6 1 1 1 2 1 3 1 4 1 5 1 6 1 10 1 2 1 4 1 5 1 7 1 8 1 10 1 2 1 4 1 7 1 9 1 10 1 DESCRIPTION Given a dataset how does one automatically find the optimal number of clusters that the dataset should be grouped into? - is one of the prevailing problems. Statisticians Robert Tibshirani, Guenther Walther and Trevor Hastie propose a solution for this problem in a Techinal Report named - "Estimating the number of clusters in a dataset via the Gap Statistic". This perl module implements the approach proposed in the above paper. If one tries to cluster a dataset (i.e. numerous observations described in terms of a feature space) into n groups/clusters and if we plot the graph of Within Cluster (Dis)Similarity along Y-axis and Number of clusters along X-axis then this graph generally takes a form of a elbow/knee depending upon the measure on the Y-axis. The Gap Statistic seeks to locate this elbow/knee because the value on the X-axis at this elbow is the optimal number of clusters for the data. NOTE: Gap Statistic uses reference distribution in the process of estimating the number of clusters. The appropriate methodology for generation of this reference distribution is dependent on the data to be clustered. This module was implemented for data with following characteristics: 1. highly sparse - very few features occur in any given observation. 2. high multivariate dimensionality (i.e. large feature space) 3. binary feature frequency - feature either occurs or does not occur in an observation but rarely occurs multiple times in the same observation. EXPORT "gap" function by default. INPUT Prefix The string that should be used to as a prefix while naming the intermediate files and the .png files (graph files). InputFile The input dataset can be in "dense" or "sparse" matrix format. The input matrix is expected in a plain text file where the first line in the file gives the dimensions of the dataset and then the dataset in either of the format should follow. The contexts / observations should be along the rows and the features should be along the column. eg: dense format - 6 5 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 The first line (6 5) gives the number of rows (observations) and the number of columns (features) present in the following matrix. Each following line/row records the presence (1) or absence (0) of the feature at the column in the given observation. Thus features1 (1st column) occurs once in the observation1 and infact once in all the other observations too while the feature3 does not occur in observation1. eg: sparse format - 8 10 41 1 1 4 1 8 1 10 1 1 1 2 1 3 1 5 1 7 1 9 1 2 1 3 1 4 1 5 1 7 1 9 1 2 1 4 1 5 1 9 1 1 1 4 1 6 1 1 1 2 1 3 1 4 1 5 1 6 1 10 1 2 1 4 1 5 1 7 1 8 1 10 1 2 1 4 1 7 1 9 1 10 1 The first line (8 10 41) gives the number of rows (observations), the number of columns (features) and the number of non-zero elements present in the following matrix. Each of the following line/row consists of pair of values where the first number in the pair gives the column number corresponding to the feature that occurs in the observation corresponding to the row. In this format only features that are present/observed (1) are recorded i.e. the features that are absent/not-observed (0) are implied. Thus in the 1st row (observation1) of the above matrix only features1, feature4, feature8 and feature10 occur while the other 7 features do not occur. DistanceMeasure The Distance Measure that should be used. Currrently this module supports the following distance measure: 1. Squared Euclidean (string that should be used as an argument: "squared") 2. Manhattan (string that should be used as an argument: "manhattan") 3. Euclidean (string that should be used as an argument: "euclidean") ClusteringAlgorithm The Clustering Measures that can be used are: 1. rb - Repeated Bisections [Default] 2. rbr - Repeated Bisections for by k-way refinement 3. direct - Direct k-way clustering 4. agglo - Agglomerative clustering 5. graph - Graph partitioning-based clustering 6. bagglo - Partitional biased Agglomerative clustering K value This is an approximate upper bound for the number of clusters that may be present in the dataset. Thus for a dataset that you expect to be seperated into 3 clusters this value should be set some integer value greater than 3. B value Specifies the number of replicates that should be sampled from the generated reference distribution using Monte Carlo Sampling. ReferenceGenerationMethod 1. Uniform - While generating the reference distribution, all the features in the feature set have equal probability of being selected for the observation under consideration. 2. Proportional - Each feature is assigned a probability of being selected depending upon its frequency of occurrence in the observed data. Thus feature distribution is taken into consideration while selecting the features for the reference distribution generation. Percentage Confidence Interval This parameter specifies the percentage confidence to be reported in the log file. Since Statistics::Gap uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty. OUTPUT 1. A single integer number at STDOUT which is the Gap Statistic's estimate of number of clusters present in the input dataset. 2. The PREFIX.log file contains the log of various values at different K values. The first table in the file gives values like Gap(k), log(W(k)) etc. for every K value experimented with. 3. The PREFIX.fig*.png files are the graphical representations of different values which help locate the knee/elbow. PRE-REQUISITES 1. This module uses suite of C programs called CLUTO for clustering purposes. Thus CLUTO needs to be installed for this module to be functional. CLUTO can be downloaded from http://www-users.cs.umn.edu/~karypis/cluto/ 2. Following Perl Modules 1. GD (http://search.cpan.org/~lds/GD-2.19/GD.pm) 2. GD::Text (http://search.cpan.org/~mverb/GDTextUtil-0.86/Text.pm) 3. GD::Graph::lines (http://search.cpan.org/~mverb/GDGraph-1.43/) 4. GD::Graph::colour (http://search.cpan.org/~mverb/GDGraph-1.43/Graph/colour.pm) SEE ALSO http://citeseer.ist.psu.edu/tibshirani00estimating.html http://www-users.cs.umn.edu/~karypis/cluto/ AUTHOR Anagha Kulkarni, University of Minnesota Duluth kulka020 d.umn.edu Guergana Savova, Mayo Clinic savova.guergana mayo.edu COPYRIGHT AND LICENSE Copyright (C) 2006-2008, Anagha Kulkarni and Guergana Savova This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.