gov.sandia.cognition.learning.algorithm.tree
Class VectorThresholdInformationGainLearner<OutputType>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
          extended by gov.sandia.cognition.learning.algorithm.tree.VectorThresholdInformationGainLearner<OutputType>
Type Parameters:
OutputType - The output type of the data.
All Implemented Interfaces:
BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, PriorWeightedNodeLearner<OutputType>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable

public class VectorThresholdInformationGainLearner<OutputType>
extends AbstractVectorThresholdMaximumGainLearner<OutputType>
implements PriorWeightedNodeLearner<OutputType>

The VectorThresholdInformationGainLearner computes the best threshold over a dataset of vectors using information gain to determine the optimal index and threshold. This is an implementation of what is used in the C4.5 decision tree algorithm.

Information gain for a given split (sets X and Y) for two categories (a and b):
ig(X, Y) = entropy(X + Y)
– (|X| / (|X| + |Y|)) entropy(X)
– (|Y| / (|X| + |Y|)) entropy(Y)
with


entropy(Z) = - (Za / |Z|) log2(Za / |Z|) – (Zb / |Z|) log2(Zb / |Z|)


where
Za = number of a's in Z, and
Zb = number of b's in Z.
In the multi-class case, the entropy is defined as the sum over all of the categories (c) of -Zc / |Z| log2(Zc / |Z|).

Since:
2.0
Author:
Justin Basilico
See Also:
Serialized Form

Field Summary
protected  ArrayList<OutputType> categories
           
protected  int[] categoryCounts
           
protected  double[] categoryPriors
           
protected  double[] categoryProbabilities
          Following is scratch space used when computing weighted entropy.
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
dimensionsToConsider
 
Constructor Summary
VectorThresholdInformationGainLearner()
          Creates a new instance of VectorDeciderLearner.
 
Method Summary
 double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)
          Computes the gain of a given split.
 void configure(Map<OutputType,Double> priors, Map<OutputType,Integer> trainCounts)
          Configure the node learner with prior weights and training counts.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
computeBestGainAndThreshold, computeBestGainAndThreshold, getDimensionality, getDimensionsToConsider, learn, setDimensionsToConsider
 
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable
clone
 

Field Detail

categories

protected ArrayList<OutputType> categories

categoryPriors

protected double[] categoryPriors

categoryCounts

protected int[] categoryCounts

categoryProbabilities

protected double[] categoryProbabilities
Following is scratch space used when computing weighted entropy. It is declared here so it can be allocated once, instead of during every entropy evaluation.

Constructor Detail

VectorThresholdInformationGainLearner

public VectorThresholdInformationGainLearner()
Creates a new instance of VectorDeciderLearner.

Method Detail

computeSplitGain

public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                               DefaultDataDistribution<OutputType> positiveCounts,
                               DefaultDataDistribution<OutputType> negativeCounts)
Description copied from class: AbstractVectorThresholdMaximumGainLearner
Computes the gain of a given split. The base counts contains the category information before the split.

Specified by:
computeSplitGain in class AbstractVectorThresholdMaximumGainLearner<OutputType>
Parameters:
baseCounts - The base category information before splitting. Contains the sum of the positive and negative counts.
positiveCounts - The category information on the positive side of the split.
negativeCounts - The category information on the negative side of the split.
Returns:
The gain of the given split computed by comparing the positive and negative counts to the base counts.

configure

public void configure(Map<OutputType,Double> priors,
                      Map<OutputType,Integer> trainCounts)
Description copied from interface: PriorWeightedNodeLearner
Configure the node learner with prior weights and training counts.

If the prior weights are not specified, this method will configure default priors that match the relative frequencies of the different categories in the training data. The frequencies are based on the given category counts from the training data.

Specified by:
configure in interface PriorWeightedNodeLearner<OutputType>
Parameters:
priors - Prior weights for each of the possible output values (i.e., the categories for the prediction task). If null, the method will estimate default priors from the training counts.
trainCounts - Frequency counts of the possible output values (i.e., categories) in the training data. This parameter should always be specified.