Class VectorThresholdHellingerDistanceLearner<OutputType>

  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner<OutputType>
          extended by gov.sandia.cognition.learning.algorithm.tree.VectorThresholdHellingerDistanceLearner<OutputType>
Type Parameters:
OutputType - The output category type for the training data.
All Implemented Interfaces:
BatchLearner<Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DeciderLearner<Vectorizable,OutputType,Boolean,VectorElementThresholdCategorizer>, VectorThresholdMaximumGainLearner<OutputType>, CloneableSerializable, Serializable, Cloneable

@PublicationReference(author={"David A. Cieslak","Nitesh V. Chawla"},
                      title="Increasing Skew Insensitivity of Decision Trees with Hellinger Distance",
                      publication="Notre Dame University Computer Science and Engineering Technical Reports",
public class VectorThresholdHellingerDistanceLearner<OutputType>
extends AbstractVectorThresholdMaximumGainLearner<OutputType>

A categorization tree decision function learner on vector data that learns a vector value threshold function using the Hellinger distance. The Hellinger distance is supposed to be less sensitive to skewed data than the more well known information gain method. It also behaves about the same as information gain on balanced data. Thus, it is thought that the Hellinger method may be superior to information gain.

For a given split (sets X and Y) for two categories (a and b)
d(X, Y) = sqrt( (sqrt(Xa / Na) - sqrt(Xb / Nb))^2
+ (sqrt(Ya / Na) - sqrt(Yb / Nb))^2)
Xa = number of a's in X,
Xb = number of b's in X,
Ya = number of a's in Y,
Yb = number of b's in Y,
Na = total number of a's (= Xa + Ya), and
Nb = total number of b's (= Xb + Yb).

The Hellinger distance ranges between 0 and sqrt(2), inclusive.

In a problem where there are more than two categories, the Hellinger distance is computed for each unique pair of categories and averaged to compute the Hellinger distance for that split.

Justin Basilico
See Also:
Serialized Form

Field Summary
Fields inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
Constructor Summary
          Creates a new VectorThresholdHellingerDistanceLearner.
Method Summary
 double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)
          Computes the split gain by computing the mean Hellinger distance for the given split.
Methods inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner
computeBestGainAndThreshold, computeBestGainAndThreshold, getDimensionality, getDimensionsToConsider, learn, setDimensionsToConsider
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable

Constructor Detail


public VectorThresholdHellingerDistanceLearner()
Creates a new VectorThresholdHellingerDistanceLearner.

Method Detail


public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                               DefaultDataDistribution<OutputType> positiveCounts,
                               DefaultDataDistribution<OutputType> negativeCounts)
Computes the split gain by computing the mean Hellinger distance for the given split. The gain is equal to the distance since the base has a distance of 0.0 with itself.

Specified by:
computeSplitGain in class AbstractVectorThresholdMaximumGainLearner<OutputType>
baseCounts - The histogram of counts before the split.
positiveCounts - The counts on the positive side of the threshold.
negativeCounts - The counts on the negative side of the threshold.
The split gain by computing the mean Hellinger distance for the given split.