gov.sandia.cognition.learning.algorithm.tree
Class AbstractVectorThresholdMaximumGainLearner<OutputType>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
Type Parameters:
OutputType - The output category type for the training data.
All Implemented Interfaces:
BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable
Direct Known Subclasses:
VectorThresholdGiniImpurityLearner, VectorThresholdHellingerDistanceLearner, VectorThresholdInformationGainLearner

public abstract class AbstractVectorThresholdMaximumGainLearner<OutputType>
extends AbstractCloneableSerializable
implements VectorThresholdMaximumGainLearner<OutputType>

An abstract class for decider learners that produce a threshold function on a vector element by maximizing some gain value. It handles looping over the elements of the vector and, for each element, over the possible split points. Subclasses only need to define a method to compute the gain of a given split.
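
The outer loop described above might look roughly like the following sketch. It is an illustration only, not the library's implementation: DimensionLoopSketch, Split, and the perDimension callback are hypothetical names, and the callback stands in for a per-dimension search that returns {gain, threshold} or null when the dimension has no usable split. A null dimensionsToConsider is treated as "all dimensions", matching the field documentation.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch of the loop over dimensions; not the library's code. */
final class DimensionLoopSketch {

    /** The winning split: which vector element to test and against what threshold. */
    static final class Split {
        final int dimension;
        final double threshold;
        final double gain;

        Split(int dimension, double threshold, double gain) {
            this.dimension = dimension;
            this.threshold = threshold;
            this.gain = gain;
        }
    }

    /**
     * Picks the dimension whose best split has the highest gain.
     *
     * @param dimensionality total number of vector elements
     * @param dimensionsToConsider indices to search, or null for all of them
     * @param perDimension returns {gain, threshold} for a dimension, or null
     *        when that dimension has no usable split (assumed helper)
     * @return the best split found, or null if no dimension can be split
     */
    static Split findBestSplit(int dimensionality, int[] dimensionsToConsider,
            IntFunction<double[]> perDimension) {
        int count = (dimensionsToConsider == null)
            ? dimensionality : dimensionsToConsider.length;
        Split best = null;
        for (int i = 0; i < count; i++) {
            int dimension = (dimensionsToConsider == null)
                ? i : dimensionsToConsider[i];
            double[] gainAndThreshold = perDimension.apply(dimension);
            if (gainAndThreshold == null) {
                continue; // Nothing to split on in this dimension.
            }
            if (best == null || gainAndThreshold[0] > best.gain) {
                best = new Split(dimension, gainAndThreshold[1], gainAndThreshold[0]);
            }
        }
        return best;
    }
}
```

In this sketch ties keep the first dimension encountered; whether the library breaks ties the same way is not stated here.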

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Field Summary
protected  int[] dimensionsToConsider
          The array of dimensions for the learner to consider.
 
Constructor Summary
AbstractVectorThresholdMaximumGainLearner()
          Creates a new AbstractVectorThresholdMaximumGainLearner.
 
Method Summary
 DefaultPair<Double,Double> computeBestGainAndThreshold(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data, int dimension, DefaultDataDistribution<OutputType> baseCounts)
          Computes the best gain and threshold for a given dimension using the computeSplitGain method for each potential split point of values for the given dimension.
protected  DefaultPair<Double,Double> computeBestGainAndThreshold(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data, int dimension, DefaultDataDistribution<OutputType> baseCounts, ArrayList<DefaultWeightedValue<OutputType>> values)
          Computes the best gain and threshold for a given dimension using the computeSplitGain method for each potential split point of values for the given dimension, reusing the supplied values list as a workspace.
abstract  double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)
          Computes the gain of a given split.
protected static int getDimensionality(Collection<? extends InputOutputPair<? extends Vectorizable,?>> data)
          Figures out the dimensionality of the Vector data.
 int[] getDimensionsToConsider()
          Gets the dimensions that the learner is to consider.
 VectorElementThresholdCategorizer learn(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data)
          The learn method creates an object of ResultType using data of type DataType, using some form of "learning" algorithm.
 void setDimensionsToConsider(int[] dimensionsToConsider)
          Sets the dimensions that the learner is to consider.
 
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable
clone
 

Field Detail

dimensionsToConsider

protected int[] dimensionsToConsider
The array of dimensions for the learner to consider. If this is null, then all dimensions are considered.

Constructor Detail

AbstractVectorThresholdMaximumGainLearner

public AbstractVectorThresholdMaximumGainLearner()
Creates a new AbstractVectorThresholdMaximumGainLearner.

Method Detail

learn

public VectorElementThresholdCategorizer learn(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data)
Description copied from interface: BatchLearner
The learn method creates an object of ResultType using data of type DataType, using some form of "learning" algorithm.

Specified by:
learn in interface BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>
Parameters:
data - The data that the learning algorithm will use to create an object of ResultType.
Returns:
The object that is created based on the given data using the learning algorithm.

computeBestGainAndThreshold

public DefaultPair<Double,Double> computeBestGainAndThreshold(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data,
                                                              int dimension,
                                                              DefaultDataDistribution<OutputType> baseCounts)
Computes the best gain and threshold for a given dimension using the computeSplitGain method for each potential split point of values for the given dimension.

Parameters:
data - The data to use to compute the threshold.
dimension - The dimension to compute the threshold for.
baseCounts - Information about the base category counts.
Returns:
A pair containing the best gain computed and its associated threshold. If there is no good split point, null is returned. This can happen if there is no data or every value is the same.
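
A minimal sketch of such a per-dimension search follows, including the two cases where no split exists. It is not the library's code. BestThresholdSketch and SplitGain are hypothetical names, categories are plain int labels, counts are int arrays rather than DefaultDataDistribution, and two details are assumptions: the threshold is placed at the midpoint between adjacent distinct values, and the "positive" side holds the values at or above the threshold.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of a single-dimension threshold search. */
final class BestThresholdSketch {

    /** Stand-in for computeSplitGain, working on plain count arrays. */
    interface SplitGain {
        double gain(int[] baseCounts, int[] positiveCounts, int[] negativeCounts);
    }

    /**
     * @param values the value of one vector element for each example
     * @param labels category index (0..categoryCount-1) for each example
     * @return {bestGain, bestThreshold}, or null if there are fewer than two
     *         examples or every value is the same
     */
    static double[] bestGainAndThreshold(double[] values, int[] labels,
            int categoryCount, SplitGain gainFunction) {
        int n = values.length;
        if (n < 2) {
            return null;
        }

        // Sort example indices by value so candidate splits are adjacent pairs.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        int[] base = new int[categoryCount];
        for (int label : labels) {
            base[label]++;
        }
        int[] negative = new int[categoryCount]; // examples below the threshold
        int[] positive = base.clone();           // examples at or above it

        double[] best = null;
        for (int k = 1; k < n; k++) {
            // Move the example just passed from the positive to the negative side.
            int moved = order[k - 1];
            negative[labels[moved]]++;
            positive[labels[moved]]--;

            double previous = values[order[k - 1]];
            double current = values[order[k]];
            if (previous == current) {
                continue; // Equal values cannot be separated by a threshold.
            }
            double gain = gainFunction.gain(base, positive, negative);
            if (best == null || gain > best[0]) {
                best = new double[] { gain, (previous + current) / 2.0 };
            }
        }
        return best; // Still null when all values were identical.
    }
}
```

Updating the two count arrays incrementally keeps the scan linear after the sort, instead of recounting both sides for every candidate threshold.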

computeBestGainAndThreshold

protected DefaultPair<Double,Double> computeBestGainAndThreshold(Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>> data,
                                                                 int dimension,
                                                                 DefaultDataDistribution<OutputType> baseCounts,
                                                                 ArrayList<DefaultWeightedValue<OutputType>> values)
Computes the best gain and threshold for a given dimension using the computeSplitGain method for each potential split point of values for the given dimension. This overload reuses the supplied values list as a workspace instead of allocating a new one.

Parameters:
data - The data to use to compute the threshold.
dimension - The dimension to compute the threshold for.
baseCounts - Information about the base category counts.
values - A workspace to store the values of the data in. Recycled to avoid recreating a large array each time.
Returns:
A pair containing the best gain computed and its associated threshold. If there is no good split point, null is returned. This can happen if there is no data or every value is the same.
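
The workspace idea can be sketched as follows: a caller allocates one list and passes it to every per-dimension call, and each call clears and refills it. This is an illustration under assumptions, not the library's code. WorkspaceSketch and LabeledValue are hypothetical; LabeledValue stands in for DefaultWeightedValue<OutputType>, and sorting the workspace by value is assumed to be what the split search needs.

```java
import java.util.ArrayList;
import java.util.Comparator;

/** Hypothetical sketch of recycling one list across per-dimension searches. */
final class WorkspaceSketch {

    /** A value of one vector element paired with its example's category. */
    static final class LabeledValue {
        final double value;
        final int label;

        LabeledValue(double value, int label) {
            this.value = value;
            this.label = label;
        }
    }

    /**
     * Refills the workspace with the values of one dimension, sorted ascending.
     * The list's backing array is kept between calls, so only the small
     * LabeledValue objects are created anew.
     */
    static void fillSorted(double[][] inputs, int[] labels, int dimension,
            ArrayList<LabeledValue> workspace) {
        workspace.clear();                        // Drops old entries, keeps capacity.
        workspace.ensureCapacity(inputs.length);  // Grows at most once.
        for (int i = 0; i < inputs.length; i++) {
            workspace.add(new LabeledValue(inputs[i][dimension], labels[i]));
        }
        workspace.sort(Comparator.comparingDouble(v -> v.value));
    }
}
```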

computeSplitGain

public abstract double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                                        DefaultDataDistribution<OutputType> positiveCounts,
                                        DefaultDataDistribution<OutputType> negativeCounts)
Computes the gain of a given split. The base counts contain the category information before the split.

Parameters:
baseCounts - The base category information before splitting. Contains the sum of the positive and negative counts.
positiveCounts - The category information on the positive side of the split.
negativeCounts - The category information on the negative side of the split.
Returns:
The gain of the given split computed by comparing the positive and negative counts to the base counts.
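
As an illustration of what a gain function can look like, the sketch below computes the reduction in Gini impurity from plain count arrays. It is one possible gain, written under assumptions: GiniGainSketch is a hypothetical name, counts are double arrays indexed by category rather than DefaultDataDistribution objects, and it is not claimed to match what VectorThresholdGiniImpurityLearner actually does.

```java
/** Hypothetical sketch of a split gain based on Gini impurity reduction. */
final class GiniGainSketch {

    static double sum(double[] counts) {
        double total = 0.0;
        for (double c : counts) {
            total += c;
        }
        return total;
    }

    /** Gini impurity: 1 minus the sum of squared category proportions. */
    static double giniImpurity(double[] counts) {
        double total = sum(counts);
        if (total <= 0.0) {
            return 0.0; // An empty side is treated as pure.
        }
        double sumOfSquares = 0.0;
        for (double c : counts) {
            double p = c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /**
     * Gain = impurity before the split minus the size-weighted impurity of
     * the two sides. baseCounts is expected to be positive + negative.
     */
    static double computeSplitGain(double[] baseCounts, double[] positiveCounts,
            double[] negativeCounts) {
        double total = sum(baseCounts);
        if (total <= 0.0) {
            return 0.0;
        }
        double positiveWeight = sum(positiveCounts) / total;
        double negativeWeight = sum(negativeCounts) / total;
        return giniImpurity(baseCounts)
            - positiveWeight * giniImpurity(positiveCounts)
            - negativeWeight * giniImpurity(negativeCounts);
    }
}
```

With two balanced categories, a split that separates them perfectly yields a gain of 0.5 under this definition, and a split that leaves both sides as mixed as the parent yields 0.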

getDimensionsToConsider

public int[] getDimensionsToConsider()
Description copied from interface: VectorThresholdMaximumGainLearner
Gets the dimensions that the learner is to consider. Null means that all of them are included.

Specified by:
getDimensionsToConsider in interface VectorThresholdMaximumGainLearner<OutputType>
Returns:
The array of vector dimensions to consider. Null means all of them are considered.

setDimensionsToConsider

public void setDimensionsToConsider(int[] dimensionsToConsider)
Description copied from interface: VectorThresholdMaximumGainLearner
Sets the dimensions that the learner is to consider. Null means that all of them are included.

Specified by:
setDimensionsToConsider in interface VectorThresholdMaximumGainLearner<OutputType>
Parameters:
dimensionsToConsider - The array of vector dimensions to consider. Null means all of them are considered.

getDimensionality

protected static int getDimensionality(Collection<? extends InputOutputPair<? extends Vectorizable,?>> data)
Figures out the dimensionality of the Vector data.

Parameters:
data - The data.
Returns:
The dimensionality of the vectors in the data.