gov.sandia.cognition.learning.algorithm.tree
Class VectorThresholdGiniImpurityLearner<OutputType>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
          extended by gov.sandia.cognition.learning.algorithm.tree.VectorThresholdGiniImpurityLearner<OutputType>
Type Parameters:
OutputType - The type of the output categories to learn over.
All Implemented Interfaces:
BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable

@PublicationReference(author="Wikipedia",
                      title="Decision tree learning",
                      year=2010,
                      type=WebPage,
                      url="http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity")
public class VectorThresholdGiniImpurityLearner<OutputType>
extends AbstractVectorThresholdMaximumGainLearner<OutputType>

Learns vector thresholds based on the Gini impurity measure. It attempts to minimize the Gini impurity in splits. If f_i is the fraction of examples belonging to category i in split f, then the Gini impurity measure is defined as:
sum_i f_i * (1 - f_i)
Notice that sum_i f_i = 1, so the value will range between 0 and 1.

This measure is the one used in the Classification and Regression Tree (CART) algorithm.

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Field Summary
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
dimensionsToConsider
 
Constructor Summary
VectorThresholdGiniImpurityLearner()
          Creates a new instance of VectorThresholdGiniImpurityLearner.
 
Method Summary
 double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)
          Computes the split gain by computing the Gini impurity for the given split.
static
<DataType> double
giniImpurity(DefaultDataDistribution<DataType> counts)
          Computes the Gini impurity of a histogram.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
computeBestGainAndThreshold, computeBestGainAndThreshold, getDimensionality, getDimensionsToConsider, learn, setDimensionsToConsider
 
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable
clone
 

Constructor Detail

VectorThresholdGiniImpurityLearner

public VectorThresholdGiniImpurityLearner()
Creates a new instance of VectorThresholdGiniImpurityLearner.

Method Detail

computeSplitGain

public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                               DefaultDataDistribution<OutputType> positiveCounts,
                               DefaultDataDistribution<OutputType> negativeCounts)
Computes the split gain by computing the Gini impurity for the given split.

Specified by:
computeSplitGain in class AbstractVectorThresholdMaximumGainLearner<OutputType>
Parameters:
baseCounts - The histogram of counts before the split.
positiveCounts - The counts on the positive side of the threshold.
negativeCounts - The counts on the negative side of the threshold.
Returns:
The split gain by computing the gain in Gini impurity for the given split. Will be between 0.0 and 1.0.

giniImpurity

public static <DataType> double giniImpurity(DefaultDataDistribution<DataType> counts)
Computes the Gini impurity of a histogram. For each item in the histogram, it is the probability that it is randomly assigned to the wrong category, given the frequency of the different categories. This is computed by looping over all the categories and multiplying the fraction of elements in that category (f_i) times the probability of choosing a different category (1 - f_i). That is: sum_i f_i * (1 - f_i)

Type Parameters:
DataType - The type of data the counts are over.
Parameters:
counts - The distribution to compute the impurity over.
Returns:
The Gini impurity of the given distribution.