Class VectorThresholdInformationGainLearner<OutputType>

  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
          extended by gov.sandia.cognition.learning.algorithm.tree.VectorThresholdInformationGainLearner<OutputType>
Type Parameters:
OutputType - The output type of the data.
All Implemented Interfaces:
BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, PriorWeightedNodeLearner<OutputType>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable

public class VectorThresholdInformationGainLearner<OutputType>
extends AbstractVectorThresholdMaximumGainLearner<OutputType>
implements PriorWeightedNodeLearner<OutputType>

The VectorThresholdInformationGainLearner computes the best threshold over a dataset of vectors using information gain to determine the optimal index and threshold. This is an implementation of what is used in the C4.5 decision tree algorithm.

Information gain for a given split (sets X and Y) for two categories (a and b):
ig(X, Y) = entropy(X + Y)
– (|X| / (|X| + |Y|)) entropy(X)
– (|Y| / (|X| + |Y|)) entropy(Y)

entropy(Z) = - (Za / |Z|) log2(Za / |Z|) – (Zb / |Z|) log2(Zb / |Z|)

Za = number of a's in Z, and
Zb = number of b's in Z.
In the multi-class case, the entropy is defined as the sum over all of the categories (c) of -Zc / |Z| log2(Zc / |Z|).

Justin Basilico
See Also:
Serialized Form

Field Summary
protected  ArrayList<OutputType> categories
protected  int[] categoryCounts
protected  double[] categoryPriors
protected  double[] categoryProbabilities
          Following is scratch space used when computing weighted entropy.
Fields inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
Constructor Summary
          Creates a new instance of VectorDeciderLearner.
Method Summary
 double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)
          Computes the gain of a given split.
 void configure(Map<OutputType,Double> priors, Map<OutputType,Integer> trainCounts)
          Configure the node learner with prior weights and training counts.
Methods inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
computeBestGainAndThreshold, computeBestGainAndThreshold, getDimensionality, getDimensionsToConsider, learn, setDimensionsToConsider
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable

Field Detail


protected ArrayList<OutputType> categories


protected double[] categoryPriors


protected int[] categoryCounts


protected double[] categoryProbabilities
Following is scratch space used when computing weighted entropy. It is declared here so it can be allocated once, instead of during every entropy evaluation.

Constructor Detail


public VectorThresholdInformationGainLearner()
Creates a new instance of VectorDeciderLearner.

Method Detail


public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                               DefaultDataDistribution<OutputType> positiveCounts,
                               DefaultDataDistribution<OutputType> negativeCounts)
Description copied from class: AbstractVectorThresholdMaximumGainLearner
Computes the gain of a given split. The base counts contains the category information before the split.

Specified by:
computeSplitGain in class AbstractVectorThresholdMaximumGainLearner<OutputType>
baseCounts - The base category information before splitting. Contains the sum of the positive and negative counts.
positiveCounts - The category information on the positive side of the split.
negativeCounts - The category information on the negative side of the split.
The gain of the given split computed by comparing the positive and negative counts to the base counts.


public void configure(Map<OutputType,Double> priors,
                      Map<OutputType,Integer> trainCounts)
Description copied from interface: PriorWeightedNodeLearner
Configure the node learner with prior weights and training counts.

If the prior weights are not specified, this method will configure default priors that match the relative frequencies of the different categories in the training data. The frequencies are based on the given category counts from the training data.

Specified by:
configure in interface PriorWeightedNodeLearner<OutputType>
priors - Prior weights for each of the possible output values (i.e., the categories for the prediction task). If null, the method will estimate default priors from the training counts.
trainCounts - Frequency counts of the possible output values (i.e., categories) in the training data. This parameter should always be specified.