gov.sandia.cognition.learning.algorithm.tree
Class VectorThresholdInformationGainLearner<OutputType>
java.lang.Object
gov.sandia.cognition.util.AbstractCloneableSerializable
gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
gov.sandia.cognition.learning.algorithm.tree.VectorThresholdInformationGainLearner<OutputType>
- Type Parameters:
OutputType
- The output type of the data.
- All Implemented Interfaces:
- BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, PriorWeightedNodeLearner<OutputType>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable
public class VectorThresholdInformationGainLearner<OutputType>
- extends AbstractVectorThresholdMaximumGainLearner<OutputType>
- implements PriorWeightedNodeLearner<OutputType>
The VectorThresholdInformationGainLearner computes the best split of a dataset of vectors by using information gain to select the optimal dimension index and threshold value. This is the split criterion used in the C4.5 decision tree algorithm.
Information gain for a given split (sets X and Y) for two categories (a and b):

    ig(X, Y) = entropy(X + Y)
             - (|X| / (|X| + |Y|)) * entropy(X)
             - (|Y| / (|X| + |Y|)) * entropy(Y)

with

    entropy(Z) = - (Za / |Z|) * log2(Za / |Z|) - (Zb / |Z|) * log2(Zb / |Z|)

where Za = number of a's in Z, and Zb = number of b's in Z.

In the multi-class case, the entropy is defined as the sum over all of the
categories (c) of -(Zc / |Z|) * log2(Zc / |Z|).
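The formulas above can be sketched as a small, self-contained Java program. The class and method names here are illustrative only and are not part of this API; the multi-class entropy definition is used so it also covers the two-category case:

```java
import java.util.HashMap;
import java.util.Map;

public class InfoGain {

    /** Multi-class entropy: -sum over categories c of (Zc / |Z|) * log2(Zc / |Z|). */
    static double entropy(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int zc : counts.values()) {
            if (zc > 0) {
                double p = zc / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /** ig(X, Y) = entropy(X + Y) - (|X|/(|X|+|Y|)) entropy(X) - (|Y|/(|X|+|Y|)) entropy(Y). */
    static double informationGain(Map<String, Integer> x, Map<String, Integer> y) {
        Map<String, Integer> combined = new HashMap<>(x);
        y.forEach((k, v) -> combined.merge(k, v, Integer::sum));
        double nx = x.values().stream().mapToInt(Integer::intValue).sum();
        double ny = y.values().stream().mapToInt(Integer::intValue).sum();
        double n = nx + ny;
        return entropy(combined) - (nx / n) * entropy(x) - (ny / n) * entropy(y);
    }

    public static void main(String[] args) {
        // Split of 8 examples: X = {3 a's, 1 b}, Y = {1 a, 3 b's}.
        Map<String, Integer> x = Map.of("a", 3, "b", 1);
        Map<String, Integer> y = Map.of("a", 1, "b", 3);
        System.out.printf("%.4f%n", informationGain(x, y));
    }
}
```

For this split, entropy(X + Y) is exactly 1 bit (a 4/4 class balance), each side has entropy of about 0.8113, so the gain is about 0.1887.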
- Since:
- 2.0
- Author:
- Justin Basilico
- See Also:
- Serialized Form
categories
protected ArrayList<OutputType> categories
categoryPriors
protected double[] categoryPriors
categoryCounts
protected int[] categoryCounts
categoryProbabilities
protected double[] categoryProbabilities
- The category count and probability arrays above are scratch space used when
computing weighted entropy. They are declared as fields so they can be
allocated once, instead of during every entropy evaluation.
VectorThresholdInformationGainLearner
public VectorThresholdInformationGainLearner()
- Creates a new instance of VectorThresholdInformationGainLearner.
computeSplitGain
public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
DefaultDataDistribution<OutputType> positiveCounts,
DefaultDataDistribution<OutputType> negativeCounts)
- Description copied from class:
AbstractVectorThresholdMaximumGainLearner
- Computes the gain of a given split. The base counts contains the
category information before the split.
- Specified by:
computeSplitGain
in class AbstractVectorThresholdMaximumGainLearner<OutputType>
- Parameters:
baseCounts - The base category information before splitting. Contains the sum of the positive and negative counts.
positiveCounts - The category information on the positive side of the split.
negativeCounts - The category information on the negative side of the split.
- Returns:
- The gain of the given split computed by comparing the positive and
negative counts to the base counts.
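As a sketch of what computeSplitGain evaluates, the gain can be written as the base entropy minus the size-weighted entropies of the two sides. This standalone version uses plain count maps in place of DefaultDataDistribution (the class and helper names are illustrative, not this API):

```java
import java.util.Map;

public class SplitGainSketch {

    /** Multi-class entropy of a set of category counts. */
    static double entropy(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int zc : counts.values()) {
            if (zc > 0) {
                double p = zc / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /**
     * Mirrors the computeSplitGain contract: base counts are the sum of the
     * positive and negative counts, and the gain compares base entropy to
     * the weighted entropies of the two sides.
     */
    static double splitGain(Map<String, Integer> base,
                            Map<String, Integer> pos,
                            Map<String, Integer> neg) {
        double n  = base.values().stream().mapToInt(Integer::intValue).sum();
        double np = pos.values().stream().mapToInt(Integer::intValue).sum();
        double nn = neg.values().stream().mapToInt(Integer::intValue).sum();
        return entropy(base) - (np / n) * entropy(pos) - (nn / n) * entropy(neg);
    }

    public static void main(String[] args) {
        Map<String, Integer> base = Map.of("a", 4, "b", 4); // sum of both sides
        Map<String, Integer> pos  = Map.of("a", 3, "b", 1);
        Map<String, Integer> neg  = Map.of("a", 1, "b", 3);
        System.out.printf("%.4f%n", splitGain(base, pos, neg));
    }
}
```

A candidate threshold with a higher gain separates the categories more cleanly; the learner picks the dimension/threshold pair maximizing this value.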
configure
public void configure(Map<OutputType,Double> priors,
Map<OutputType,Integer> trainCounts)
- Description copied from interface:
PriorWeightedNodeLearner
- Configure the node learner with prior weights and training counts.
If the prior weights are not specified, this method will
configure default priors that match the relative frequencies of
the different categories in the training data. The frequencies
are based on the given category counts from the training data.
- Specified by:
configure
in interface PriorWeightedNodeLearner<OutputType>
- Parameters:
priors - Prior weights for each of the possible output values (i.e., the categories for the prediction task). If null, the method will estimate default priors from the training counts.
trainCounts - Frequency counts of the possible output values (i.e., categories) in the training data. This parameter should always be specified.
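The default-prior behavior described above can be sketched as follows: when priors is null, each category's prior becomes its relative frequency in the training counts. The helper name is hypothetical and is not part of this API:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigureSketch {

    /**
     * If priors is null, estimate default priors as the relative frequencies
     * of the categories in the training counts; otherwise use the given priors.
     */
    static Map<String, Double> resolvePriors(Map<String, Double> priors,
                                             Map<String, Integer> trainCounts) {
        if (priors != null) {
            return priors;
        }
        double total = trainCounts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> estimated = new HashMap<>();
        trainCounts.forEach((category, count) -> estimated.put(category, count / total));
        return estimated;
    }

    public static void main(String[] args) {
        // 30 of 100 training examples are "spam", so its default prior is 0.3.
        Map<String, Integer> trainCounts = Map.of("spam", 30, "ham", 70);
        System.out.println(resolvePriors(null, trainCounts).get("spam"));
    }
}
```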