gov.sandia.cognition.learning.algorithm.clustering
Class AgglomerativeClusterer<DataType,ClusterType extends Cluster<DataType>>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
          extended by gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm<ResultType>
              extended by gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>
                  extended by gov.sandia.cognition.learning.algorithm.clustering.AgglomerativeClusterer<DataType,ClusterType>
Type Parameters:
DataType - The type of the data to cluster. This is typically defined by the divergence function used.
ClusterType - The type of Cluster created by the algorithm. This is typically defined by the cluster creator function used.
All Implemented Interfaces:
AnytimeAlgorithm<Collection<ClusterType>>, IterativeAlgorithm, StoppableAlgorithm, AnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>, BatchLearner<Collection<? extends DataType>,Collection<ClusterType>>, BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, DivergenceFunctionContainer<ClusterType,ClusterType>, CloneableSerializable, Serializable, Cloneable

@CodeReview(reviewer="Kevin R. Dixon",
            date="2008-07-22",
            changesNeeded=true,
            comments={"I *really* don\'t like the use of \'continue\', but I will defer.","Please implement the sections previously marked as \'to do\'"},
            response=@CodeReviewResponse(respondent="Justin Basilico",date="2008-10-07",moreChangesNeeded=false,comments="The clusterer now supports hierarchical clustering."))
public class AgglomerativeClusterer<DataType,ClusterType extends Cluster<DataType>>
extends AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType>>
implements BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, DivergenceFunctionContainer<ClusterType,ClusterType>

The AgglomerativeClusterer implements an agglomerative clustering algorithm, which is a type of hierarchical clustering algorithm. Such a clustering algorithm works by initially creating one cluster for each element in the collection to cluster and then repeatedly merging the two closest clusters until the stopping condition is met or there is only one cluster remaining. This implementation supports multiple methods for determining the distance between two clusters by supplying an ClusterToClusterDivergenceFunction object. There are two stopping conditions for the algorithm, which are parameters that can be set. The first is that the clustering will stop when some minimum number of clusters is reached, which defaults to 1. The second criteria is that the clustering will stop when the distance between the two closest clusters is larger than a given value. This threshold can be used to create clusters when the number of clusters is not known ahead of time.

Since:
1.0
Author:
Justin Basilico
See Also:
Serialized Form

Nested Class Summary
static class AgglomerativeClusterer.HierarchyNode<DataType,ClusterType extends Cluster<DataType>>
          Holds the hierarchy information for the agglomerative clusterer.
 
Field Summary
protected  ArrayList<ClusterType> clusters
          The current set of clusters.
protected  ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> clustersHierarchy
          The current set of hierarchical clusters.
protected  ClusterCreator<ClusterType,DataType> creator
          The merger used to merge two clusters into one element.
static int DEFAULT_MAX_ITERATIONS
          The default maximum number of iterations 2147483647
static double DEFAULT_MAX_MIN_DISTANCE
          The default maximum minimum distance is 1.7976931348623157E308.
static int DEFAULT_MIN_NUM_CLUSTERS
          The default minimum number of clusters is 1.
protected  ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction
          The divergence function used to find the distance between two clusters.
protected  double maxMinDistance
          The maximum minimum distance between clusters allowed.
protected  ArrayList<Integer> minClusters
          The array of indexes that maps the cluster index to the closest cluster.
protected  ArrayList<Double> minDistances
          An array list mapping the cached minimum distance from the cluster with the given index to any other clusters.
protected  int minNumClusters
          The minimum number of clusters allowed.
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
data, keepGoing
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
maxIterations
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
DEFAULT_ITERATION, iteration
 
Constructor Summary
AgglomerativeClusterer()
          Creates a new instance of AgglomerativeClusterer.
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator)
          Initializes the clustering to use the given metric between clusters, and the given cluster creator.
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, double maxMinDistance)
          Initializes the clustering to use the given metric between clusters, the given cluster merger, and the maximum minimum distance between clusters to allow.
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, int minNumClusters)
          Initializes the clustering to use the given metric between clusters, the given cluster creator, and the minimum number of clusters to allow.
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, int minNumClusters, double maxMinDistance)
          Initializes the clustering to use the given metric between clusters, the given cluster merger, the minimum number of clusters to allow, and the maximum minimum distance between clusters to allow.
 
Method Summary
protected  void cleanupAlgorithm()
          Called to clean up the learning algorithm's state after learning has finished.
 AgglomerativeClusterer<DataType,ClusterType> clone()
          This makes public the clone method on the Object class and removes the exception that it throws.
 ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(Collection<? extends DataType> data)
          Performs hierarchical clustering on the given elements.
 ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> getClustersHierarchy()
          Gets the hierarchy of clusters.
 ClusterCreator<ClusterType,DataType> getCreator()
          Gets the cluster creator.
 ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
          Gets the divergence function used in clustering.
 double getMaxMinDistance()
          The maximum minimum distance between clusters that is allowed for the two clusters to be merged.
 int getMinNumClusters()
          The minimum number of clusters to allow.
 int getNumClusters()
          Gets the number of clusters.
 ArrayList<ClusterType> getResult()
          Gets the current result of the algorithm.
protected  boolean initializeAlgorithm()
          Called to initialize the learning algorithm's state based on the data that is stored in the data field.
protected  int mergeClusters(int firstIndex, int secondIndex, double distance)
          Merges two clusters together, creating a new BinaryTreeCluster and updating the internal state.
protected  void setClusters(ArrayList<ClusterType> clusters)
          Sets the clusters.
protected  void setClustersHierarchy(ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> clustersHierarchy)
          Sets the hierarchy of clusters.
 void setCreator(ClusterCreator<ClusterType,DataType> creator)
          Sets the cluster creator.
 void setDivergenceFunction(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
          Sets the divergence function.
 void setMaxMinDistance(double maxMinDistance)
          The maximum minimum distance between clusters that is allowed for the two clusters to be merged.
protected  void setMinClusters(ArrayList<Integer> minClusters)
          Sets the index of the closest cluster.
protected  void setMinDistances(ArrayList<Double> minDistances)
          Sets the minimum distances for each clusters.
 void setMinNumClusters(int minNumClusters)
          The minimum number of clusters to allow.
protected  boolean step()
          Called to take a single step of the learning algorithm.
protected  void updateMinDistance(int index)
          Updates the cached minimum distance for this cluster by comparing it to all the other clusters.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
getData, getKeepGoing, learn, setData, setKeepGoing, stop
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
getMaxIterations, isResultValid, setMaxIterations
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.learning.algorithm.BatchLearner
learn
 
Methods inherited from interface gov.sandia.cognition.algorithm.AnytimeAlgorithm
getMaxIterations, setMaxIterations
 
Methods inherited from interface gov.sandia.cognition.algorithm.IterativeAlgorithm
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
 
Methods inherited from interface gov.sandia.cognition.algorithm.StoppableAlgorithm
isResultValid
 

Field Detail

DEFAULT_MIN_NUM_CLUSTERS

public static final int DEFAULT_MIN_NUM_CLUSTERS
The default minimum number of clusters is 1.

See Also:
Constant Field Values

DEFAULT_MAX_MIN_DISTANCE

public static final double DEFAULT_MAX_MIN_DISTANCE
The default maximum minimum distance is 1.7976931348623157E308.

See Also:
Constant Field Values

DEFAULT_MAX_ITERATIONS

public static final int DEFAULT_MAX_ITERATIONS
The default maximum number of iterations 2147483647

See Also:
Constant Field Values

divergenceFunction

protected ClusterToClusterDivergenceFunction<? super ClusterType extends Cluster<DataType>,? super DataType> divergenceFunction
The divergence function used to find the distance between two clusters.


creator

protected ClusterCreator<ClusterType extends Cluster<DataType>,DataType> creator
The merger used to merge two clusters into one element.


minNumClusters

protected int minNumClusters
The minimum number of clusters allowed.


maxMinDistance

protected double maxMinDistance
The maximum minimum distance between clusters allowed.


clusters

protected ArrayList<ClusterType extends Cluster<DataType>> clusters
The current set of clusters.


clustersHierarchy

protected ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType extends Cluster<DataType>>> clustersHierarchy
The current set of hierarchical clusters.


minDistances

protected transient ArrayList<Double> minDistances
An array list mapping the cached minimum distance from the cluster with the given index to any other clusters.


minClusters

protected transient ArrayList<Integer> minClusters
The array of indexes that maps the cluster index to the closest cluster.

Constructor Detail

AgglomerativeClusterer

public AgglomerativeClusterer()
Creates a new instance of AgglomerativeClusterer.


AgglomerativeClusterer

public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                              ClusterCreator<ClusterType,DataType> creator)
Initializes the clustering to use the given metric between clusters, and the given cluster creator. The minimum number of clusters will be set to 1.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.

AgglomerativeClusterer

public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                              ClusterCreator<ClusterType,DataType> creator,
                              int minNumClusters)
Initializes the clustering to use the given metric between clusters, the given cluster creator, and the minimum number of clusters to allow.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
minNumClusters - The minimum number of clusters to allow. Must be greater than zero.

AgglomerativeClusterer

public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                              ClusterCreator<ClusterType,DataType> creator,
                              double maxMinDistance)
Initializes the clustering to use the given metric between clusters, the given cluster merger, and the maximum minimum distance between clusters to allow.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
maxMinDistance - The maximum minimum distance between clusters to allow.

AgglomerativeClusterer

public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
                              ClusterCreator<ClusterType,DataType> creator,
                              int minNumClusters,
                              double maxMinDistance)
Initializes the clustering to use the given metric between clusters, the given cluster merger, the minimum number of clusters to allow, and the maximum minimum distance between clusters to allow.

Parameters:
divergenceFunction - The distance metric between clusters.
creator - The method for creating clusters.
minNumClusters - The minimum number of clusters to allow. Must be greater than zero.
maxMinDistance - The maximum minimum distance between clusters to allow.
Method Detail

clone

public AgglomerativeClusterer<DataType,ClusterType> clone()
Description copied from class: AbstractCloneableSerializable
This makes public the clone method on the Object class and removes the exception that it throws. Its default behavior is to automatically create a clone of the exact type of object that the clone is called on and to copy all primitives but to keep all references, which means it is a shallow copy. Extensions of this class may want to override this method (but call super.clone() to implement a "smart copy". That is, to target the most common use case for creating a copy of the object. Because of the default behavior being a shallow copy, extending classes only need to handle fields that need to have a deeper copy (or those that need to be reset). Some of the methods in ObjectUtil may be helpful in implementing a custom clone method. Note: The contract of this method is that you must use super.clone() as the basis for your implementation.

Specified by:
clone in interface CloneableSerializable
Overrides:
clone in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
A clone of this object.

clusterHierarchically

public ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(Collection<? extends DataType> data)
Description copied from interface: BatchHierarchicalClusterer
Performs hierarchical clustering on the given elements. It returns the root node of the hierarchy of clusters for the given elements.

Specified by:
clusterHierarchically in interface BatchHierarchicalClusterer<DataType,ClusterType extends Cluster<DataType>>
Parameters:
data - The elements to cluster.
Returns:
The root node of the hierarchical cluster for the given elements.

initializeAlgorithm

protected boolean initializeAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to initialize the learning algorithm's state based on the data that is stored in the data field. The return value indicates if the algorithm can be run or not based on the initialization.

Specified by:
initializeAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
True if the learning algorithm can be run and false if it cannot.

step

protected boolean step()
Description copied from class: AbstractAnytimeBatchLearner
Called to take a single step of the learning algorithm.

Specified by:
step in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>
Returns:
True if another step can be taken and false it the algorithm should halt.

cleanupAlgorithm

protected void cleanupAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to clean up the learning algorithm's state after learning has finished.

Specified by:
cleanupAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends DataType>,Collection<ClusterType extends Cluster<DataType>>>

updateMinDistance

protected void updateMinDistance(int index)
Updates the cached minimum distance for this cluster by comparing it to all the other clusters.

Parameters:
index - The cluster to update.

mergeClusters

protected int mergeClusters(int firstIndex,
                            int secondIndex,
                            double distance)
Merges two clusters together, creating a new BinaryTreeCluster and updating the internal state.

Parameters:
firstIndex - The first cluster.
secondIndex - The second cluster.
distance - The distance between the clusters.
Returns:
The new, merged cluster.

getResult

public ArrayList<ClusterType> getResult()
Description copied from interface: AnytimeAlgorithm
Gets the current result of the algorithm.

Specified by:
getResult in interface AnytimeAlgorithm<Collection<ClusterType extends Cluster<DataType>>>
Returns:
Current result of the algorithm.

getNumClusters

public int getNumClusters()
Gets the number of clusters.

Returns:
The number of clusters.

getDivergenceFunction

public ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
Gets the divergence function used in clustering.

Specified by:
getDivergenceFunction in interface DivergenceFunctionContainer<ClusterType extends Cluster<DataType>,ClusterType extends Cluster<DataType>>
Returns:
The divergence function.

setDivergenceFunction

public void setDivergenceFunction(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
Sets the divergence function.

Parameters:
divergenceFunction - The divergence function.

getCreator

public ClusterCreator<ClusterType,DataType> getCreator()
Gets the cluster creator.

Returns:
The cluster creator.

setCreator

public void setCreator(ClusterCreator<ClusterType,DataType> creator)
Sets the cluster creator.

Parameters:
creator - The creator for clusters.

getMinNumClusters

public int getMinNumClusters()
The minimum number of clusters to allow. To create a cluster tree, set this value to 1. If the number of clusters drops to this number (or below) then the clustering will stop. Must be greater than zero.

Returns:
The minimum number of clusters allowed.

setMinNumClusters

public void setMinNumClusters(int minNumClusters)
The minimum number of clusters to allow. To create a cluster tree, set this value to 1. If the number of clusters drops to this number (or below) then the clustering will stop. Must be greater than zero.

Parameters:
minNumClusters - The new minimum number of clusters.

getMaxMinDistance

public double getMaxMinDistance()
The maximum minimum distance between clusters that is allowed for the two clusters to be merged. If there are no clusters that remain that have a distance between them less than or equal to this value, then the clustering will halt. To not have this value factored into the clustering, set it to something such as Double.MAX_VALUE.

Returns:
The maximum minimum distance between clusters.

setMaxMinDistance

public void setMaxMinDistance(double maxMinDistance)
The maximum minimum distance between clusters that is allowed for the two clusters to be merged. If there are no clusters that remain that have a distance between them less than or equal to this value, then the clustering will halt. To not have this value factored into the clustering, set it to something such as Double.MAX_VALUE.

Parameters:
maxMinDistance - The new maximum minimum distance.

setClusters

protected void setClusters(ArrayList<ClusterType> clusters)
Sets the clusters.

Parameters:
clusters - The clusters.

getClustersHierarchy

public ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> getClustersHierarchy()
Gets the hierarchy of clusters.

Returns:
The hierarchy of clusters.

setClustersHierarchy

protected void setClustersHierarchy(ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> clustersHierarchy)
Sets the hierarchy of clusters.

Parameters:
clustersHierarchy - The hierarchy of clusters.

setMinDistances

protected void setMinDistances(ArrayList<Double> minDistances)
Sets the minimum distances for each clusters.

Parameters:
minDistances - The array of minimum distances.

setMinClusters

protected void setMinClusters(ArrayList<Integer> minClusters)
Sets the index of the closest cluster.

Parameters:
minClusters - The array of closest cluster indices.