gov.sandia.cognition.learning.algorithm.clustering
Class PartitionalClusterer<DataType,ClusterType extends Cluster<DataType>>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
          extended by gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm<ResultType>
              extended by gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>
                  extended by gov.sandia.cognition.learning.algorithm.clustering.PartitionalClusterer<DataType,ClusterType>
Type Parameters:
DataType - The type of the data to cluster. This is typically defined by the divergence function used.
ClusterType - The type of Cluster created by the algorithm. This is typically defined by the cluster creator function used.
All Implemented Interfaces:
AnytimeAlgorithm<Collection<ClusterType>>, IterativeAlgorithm, StoppableAlgorithm, AnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>, BatchLearner<Collection<? extends DataType>,Collection<ClusterType>>, BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, DivergenceFunctionContainer<ClusterType,DataType>, CloneableSerializable, Randomized, Serializable, Cloneable

@CodeReview(reviewer="Justin Basilico",
            date="2011-03-09",
            comments={"Should make do a greedy splitting prioritization.","Should make an interface for incremental cluster creation for this to use."},
            changesNeeded=true)
@PublicationReference(author={"Ying Zhao","George Karypis"},
                      title="Hierarchical Clustering Algorithms for Document Datasets",
                      type=Journal,
                      year=2005,
                      publication="Data Mining and Knowledge Discovery",
                      pages={141,168},
                      url="http://www.springerlink.com/index/jx3825j42x4333m5.pdf")
public class PartitionalClusterer<DataType,ClusterType extends Cluster<DataType>>
extends AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>
implements BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, Randomized, DivergenceFunctionContainer<ClusterType,DataType>

The PartitionClusterer implements a partitional clustering algorithm, which is a type of hierarchical clustering algorithm. Such a clustering algorithm works by initially creating one cluster for all elements in the collection and then repeatedly partitioning each cluster into two clusters until the stopping condition is met or each leaf cluster has only one element. This implementation supports multiple methods for determining the distance between two clusters by supplying an ClusterToClusterDivergenceFunction object. There are two stopping conditions for the algorithm, which are parameters that can be set. The first is that the clustering will stop when some minimum number of elements per leaf cluster is reached, which defaults to 1. The second criteria is that the partitioning will stop when the improvement in the criterion function is below some threshold greater than one.

Since:
3.1.1
Author:
Natasha Singh-Miller
See Also:
Serialized Form

Field Summary
protected  ArrayList<ClusterType> clusters
          The current set of clusters.
protected  ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> clustersHierarchy
          The current set of hierarchical clusters.
protected  IncrementalClusterCreator<ClusterType,DataType> creator
          The merger used to merge two clusters into one element.
static double DEFAULT_MAX_CRITERION_DECREASE
          The default maximum decrease in the training criterion is 1.0.
static int DEFAULT_MAX_ITERATIONS
          The default maximum number of iterations 2147483647
static int DEFAULT_MIN_CLUSTER_SIZE
          The default minimum number of elements per cluster is 1.
protected  ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction
          The divergence function used to find the distance between two clusters.
protected  double maxCriterionDecrease
          The maximum decrease in training criterion allowed.
protected  int minClusterSize
          The minimum number of elements per cluster allowed.
protected  Random random
          The source of randomness used during initial partitioning.
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
data, keepGoing
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
maxIterations
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
DEFAULT_ITERATION, iteration
 
Constructor Summary
PartitionalClusterer()
          Creates a new instance of PartitionalClusterer.
PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, IncrementalClusterCreator<ClusterType,DataType> creator, int minClusterSize, double maxCriterionDecrease, Random random)
          Initializes the clustering to use the given metric between clusters, the given cluster creator, and the minimum number of elements per cluster to allow.
PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, IncrementalClusterCreator<ClusterType,DataType> creator, int maxIterations, int minClusterSize, double maxCriterionDecrease, Random random)
          Initializes the clustering to use the given metric between clusters, the given cluster creator, the minimum number of elements per cluster to allow, and the maximum decrease in the training criterion during partition to allow.
PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, IncrementalClusterCreator<ClusterType,DataType> creator, Random random)
          Initializes the clustering to use the given metric between clusters, and the given cluster creator.
 
Method Summary
protected  void cleanupAlgorithm()
          Called to clean up the learning algorithm's state after learning has finished.
 PartitionalClusterer<DataType,ClusterType> clone()
          This makes public the clone method on the Object class and removes the exception that it throws.
 ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(Collection<? extends DataType> data)
          Performs hierarchical clustering on the given elements.
 int getClusterCount()
          Gets the number of clusters.
 ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> getClustersHierarchy()
          Gets the hierarchy of clusters.
 IncrementalClusterCreator<ClusterType,DataType> getCreator()
          Gets the cluster creator.
 ClusterDivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
          Gets the divergence function used in clustering.
 double getMaxCriterionDecrease()
          Returns the maximum decrease in the training criterion allowed following a bisection.
 int getMinClusterSize()
          Returns the minimum number of elements per cluster to allow.
 Random getRandom()
          Gets the random number generator used by this object.
 ArrayList<ClusterType> getResult()
          Gets the current result of the algorithm.
protected  boolean initializeAlgorithm()
          Called to initialize the learning algorithm's state based on the data that is stored in the data field.
protected  void setClusters(ArrayList<ClusterType> clusters)
          Sets the clusters.
protected  void setClustersHierarchy(ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> clustersHierarchy)
          Sets the hierarchy of clusters.
 void setCreator(IncrementalClusterCreator<ClusterType,DataType> creator)
          Sets the cluster creator.
 void setDivergenceFunction(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
          Sets the divergence function.
 void setMaxCriterionDecrease(double maxCriterionDecrease)
          Sets the maximum decrease in the training criterion allowed following a bisection.
 void setMinClusterSize(int minClusterSize)
          Sets the minimum number of elements per cluster to allow.
 void setRandom(Random random)
          Sets the random number generator used by this object.
protected  boolean step()
          Called to take a single step of the learning algorithm.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
getData, getKeepGoing, learn, setData, setKeepGoing, stop
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
getMaxIterations, isResultValid, setMaxIterations
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.learning.algorithm.BatchLearner
learn
 
Methods inherited from interface gov.sandia.cognition.algorithm.AnytimeAlgorithm
getMaxIterations, setMaxIterations
 
Methods inherited from interface gov.sandia.cognition.algorithm.IterativeAlgorithm
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
 
Methods inherited from interface gov.sandia.cognition.algorithm.StoppableAlgorithm
isResultValid
 

Field Detail

DEFAULT_MIN_CLUSTER_SIZE

public static final int DEFAULT_MIN_CLUSTER_SIZE
The default minimum number of elements per cluster is 1.

See Also:
Constant Field Values

DEFAULT_MAX_CRITERION_DECREASE

public static final double DEFAULT_MAX_CRITERION_DECREASE
The default maximum decrease in the training criterion is 1.0.

See Also:
Constant Field Values

DEFAULT_MAX_ITERATIONS

public static final int DEFAULT_MAX_ITERATIONS
The default maximum number of iterations 2147483647

See Also:
Constant Field Values

divergenceFunction

protected ClusterDivergenceFunction<? super ClusterType extends Cluster<DataType>,? super DataType> divergenceFunction
The divergence function used to find the distance between two clusters.


creator

protected IncrementalClusterCreator<ClusterType extends Cluster<DataType>,DataType> creator
The merger used to merge two clusters into one element.


minClusterSize

protected int minClusterSize
The minimum number of elements per cluster allowed.


maxCriterionDecrease

protected double maxCriterionDecrease
The maximum decrease in training criterion allowed.


random

protected Random random
The source of randomness used during initial partitioning.


clusters

protected transient ArrayList<ClusterType extends Cluster<DataType>> clusters
The current set of clusters.


clustersHierarchy

protected transient ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType extends Cluster<DataType>>> clustersHierarchy
The current set of hierarchical clusters.

Constructor Detail

PartitionalClusterer

public PartitionalClusterer()
Creates a new instance of PartitionalClusterer.


PartitionalClusterer

public PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                            IncrementalClusterCreator<ClusterType,DataType> creator,
                            Random random)
Initializes the clustering to use the given metric between clusters, and the given cluster creator.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
random - The random number generator to use.

PartitionalClusterer

public PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                            IncrementalClusterCreator<ClusterType,DataType> creator,
                            int minClusterSize,
                            double maxCriterionDecrease,
                            Random random)
Initializes the clustering to use the given metric between clusters, the given cluster creator, and the minimum number of elements per cluster to allow.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
minClusterSize - The minimum number of elements per cluster to allow. Must be greater than zero.
maxCriterionDecrease - The maximum decrease in the training criterion to allow.
random - The random number generator to use.

PartitionalClusterer

public PartitionalClusterer(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                            IncrementalClusterCreator<ClusterType,DataType> creator,
                            int maxIterations,
                            int minClusterSize,
                            double maxCriterionDecrease,
                            Random random)
Initializes the clustering to use the given metric between clusters, the given cluster creator, the minimum number of elements per cluster to allow, and the maximum decrease in the training criterion during partition to allow.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
maxIterations - The maximum number of iterations.
minClusterSize - The minimum number of clusters to allow. Must be greater than zero.
maxCriterionDecrease - The maximum decrease in the training criterion to allow.
random - The random number generator to use.
Method Detail

clone

public PartitionalClusterer<DataType,ClusterType> clone()
Description copied from class: AbstractCloneableSerializable
This makes public the clone method on the Object class and removes the exception that it throws. Its default behavior is to automatically create a clone of the exact type of object that the clone is called on and to copy all primitives but to keep all references, which means it is a shallow copy. Extensions of this class may want to override this method (but call super.clone() to implement a "smart copy". That is, to target the most common use case for creating a copy of the object. Because of the default behavior being a shallow copy, extending classes only need to handle fields that need to have a deeper copy (or those that need to be reset). Some of the methods in ObjectUtil may be helpful in implementing a custom clone method. Note: The contract of this method is that you must use super.clone() as the basis for your implementation.

Specified by:
clone in interface CloneableSerializable
Overrides:
clone in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
A clone of this object.

clusterHierarchically

public ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(Collection<? extends DataType> data)
Description copied from interface: BatchHierarchicalClusterer
Performs hierarchical clustering on the given elements. It returns the root node of the hierarchy of clusters for the given elements.

Specified by:
clusterHierarchically in interface BatchHierarchicalClusterer<DataType,ClusterType extends Cluster<DataType>>
Parameters:
data - The elements to cluster.
Returns:
The root node of the hierarchical cluster for the given elements.

initializeAlgorithm

protected boolean initializeAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to initialize the learning algorithm's state based on the data that is stored in the data field. The return value indicates if the algorithm can be run or not based on the initialization.

Specified by:
initializeAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
True if the learning algorithm can be run and false if it cannot.

step

protected boolean step()
Description copied from class: AbstractAnytimeBatchLearner
Called to take a single step of the learning algorithm.

Specified by:
step in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
True if another step can be taken and false it the algorithm should halt.

cleanupAlgorithm

protected void cleanupAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to clean up the learning algorithm's state after learning has finished.

Specified by:
cleanupAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>

getResult

public ArrayList<ClusterType> getResult()
Description copied from interface: AnytimeAlgorithm
Gets the current result of the algorithm.

Specified by:
getResult in interface AnytimeAlgorithm<Collection<ClusterType extends Cluster<DataType>>>
Returns:
Current result of the algorithm.

getClusterCount

public int getClusterCount()
Gets the number of clusters.

Returns:
The number of clusters.

getDivergenceFunction

public ClusterDivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
Gets the divergence function used in clustering.

Specified by:
getDivergenceFunction in interface DivergenceFunctionContainer<ClusterType extends Cluster<DataType>,DataType>
Returns:
The divergence function.

setDivergenceFunction

public void setDivergenceFunction(ClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
Sets the divergence function.

Parameters:
divergenceFunction - The divergence function.

getCreator

public IncrementalClusterCreator<ClusterType,DataType> getCreator()
Gets the cluster creator.

Returns:
The cluster creator.

setCreator

public void setCreator(IncrementalClusterCreator<ClusterType,DataType> creator)
Sets the cluster creator.

Parameters:
creator - The creator for clusters.

getRandom

public Random getRandom()
Description copied from interface: Randomized
Gets the random number generator used by this object.

Specified by:
getRandom in interface Randomized
Returns:
The random number generator used by this object.

setRandom

public void setRandom(Random random)
Description copied from interface: Randomized
Sets the random number generator used by this object.

Specified by:
setRandom in interface Randomized
Parameters:
random - The random number generator for this object to use.

getMinClusterSize

public int getMinClusterSize()
Returns the minimum number of elements per cluster to allow. If the number of elements in a cluster is less than or equal to this number, it will not be bisected. Must be greater than zero.

Returns:
The minimum number of elements per cluster allowed.

setMinClusterSize

public void setMinClusterSize(int minClusterSize)
Sets the minimum number of elements per cluster to allow. If the number of elements in a cluster is less than or equal to this number, it will not be bisected.

Parameters:
minClusterSize - The new minimum number of elements per cluster allowed. Must be greater than zero.

getMaxCriterionDecrease

public double getMaxCriterionDecrease()
Returns the maximum decrease in the training criterion allowed following a bisection. If the decrease in criterion of a bisection step is greater than this value, partitioning will stop. To turn off this stopping criterion set this to 1.0 or greater.

Returns:
The maximum decrease in the training criterion following a bisection.

setMaxCriterionDecrease

public void setMaxCriterionDecrease(double maxCriterionDecrease)
Sets the maximum decrease in the training criterion allowed following a bisection. If the decrease in criterion of a bisection step is greater than this value, partitioning will stop. To turn off this stopping criterion set this to 1.0 or greater.

Parameters:
maxCriterionDecrease - The new maximum minimum distance.

setClusters

protected void setClusters(ArrayList<ClusterType> clusters)
Sets the clusters.

Parameters:
clusters - The clusters.

getClustersHierarchy

public ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> getClustersHierarchy()
Gets the hierarchy of clusters.

Returns:
The hierarchy of clusters.

setClustersHierarchy

protected void setClustersHierarchy(ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> clustersHierarchy)
Sets the hierarchy of clusters.

Parameters:
clustersHierarchy - The hierarchy of clusters.