gov.sandia.cognition.learning.algorithm.ensemble
Class IVotingCategorizerLearner<InputType,CategoryType>

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
          extended by gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm<ResultType>
              extended by gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner<Collection<? extends InputOutputPair<? extends InputType,OutputType>>,ResultType>
                  extended by gov.sandia.cognition.learning.algorithm.AbstractAnytimeSupervisedBatchLearner<InputType,CategoryType,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>
                      extended by gov.sandia.cognition.learning.algorithm.ensemble.IVotingCategorizerLearner<InputType,CategoryType>
Type Parameters:
InputType - The type of the input for the categorizer to learn. This is the type passed to the internal batch learner to learn each ensemble member.
CategoryType - The type of the category that is the output for the categorizer to learn. It is also passed to the internal batch learner to learn each ensemble member. It must have valid equals and hashCode methods.
All Implemented Interfaces:
AnytimeAlgorithm<WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>, IterativeAlgorithm, StoppableAlgorithm, AnytimeBatchLearner<Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>, BatchLearner<Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>, BatchLearnerContainer<BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>>>, SupervisedBatchLearner<InputType,CategoryType,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>, CloneableSerializable, Randomized, Serializable, Cloneable
Direct Known Subclasses:
CategoryBalancedIVotingLearner

@PublicationReference(author="Leo Breiman",
                      title="Pasting small votes for classification in large databases and on-line",
                      year=1999,
                      type=Journal,
                      publication="Machine Learning",
                      pages={85,103},
                      url="http://www.springerlink.com/content/mnu2r28218651707/fulltext.pdf")
public class IVotingCategorizerLearner<InputType,CategoryType>
extends AbstractAnytimeSupervisedBatchLearner<InputType,CategoryType,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>
implements Randomized, BatchLearnerContainer<BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>>>

Learns an ensemble in a manner similar to bagging, except that on each iteration the bag is built from two parts, each sampled from one of two disjoint sets: the examples that the ensemble currently gets correct and the examples that it currently gets incorrect. In effect, ivoting has properties similar to boosting, but it does not require that the learner for each ensemble member be able to use weights on the examples.
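
As a rough illustration of the two disjoint sets described above, the following sketch partitions example indices according to whether the ensemble currently classifies each example correctly. The class name CorrectIncorrectPartitionSketch and the method partition are hypothetical and are not part of the library; this is only a sketch of the idea, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the library's implementation.
class CorrectIncorrectPartitionSketch
{
    /**
     * Splits example indices into those the ensemble currently gets correct
     * (returned at position 0) and those it gets incorrect (position 1).
     */
    static List<ArrayList<Integer>> partition(boolean[] ensembleCorrect)
    {
        ArrayList<Integer> correct = new ArrayList<>();
        ArrayList<Integer> incorrect = new ArrayList<>();
        for (int i = 0; i < ensembleCorrect.length; i++)
        {
            if (ensembleCorrect[i])
            {
                correct.add(i);
            }
            else
            {
                incorrect.add(i);
            }
        }
        List<ArrayList<Integer>> result = new ArrayList<>();
        result.add(correct);
        result.add(incorrect);
        return result;
    }

    public static void main(String[] args)
    {
        boolean[] ensembleCorrect = { true, false, true, true, false };
        List<ArrayList<Integer>> parts = partition(ensembleCorrect);
        System.out.println("correct:   " + parts.get(0)); // [0, 2, 3]
        System.out.println("incorrect: " + parts.get(1)); // [1, 4]
    }
}
```

Every index lands in exactly one of the two lists, so the lists are disjoint and together cover the whole data set.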

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Nested Class Summary
static class IVotingCategorizerLearner.OutOfBagErrorStoppingCriteria<InputType,CategoryType>
          Implements a stopping criterion for IVoting that uses the out-of-bag error to determine when to stop learning the ensemble.
 
Field Summary
protected  Factory<? extends DataDistribution<CategoryType>> counterFactory
          Factory for counting votes.
protected  ArrayList<InputOutputPair<? extends InputType,CategoryType>> currentBag
          The current bag used to train the current ensemble member.
protected  ArrayList<Integer> currentCorrectIndices
          The indices of examples that the ensemble currently gets correct.
protected  boolean[] currentEnsembleCorrect
          A boolean for each example indicating if the ensemble currently gets the example correct or incorrect.
protected  ArrayList<Integer> currentIncorrectIndices
          The indices of examples that the ensemble currently gets incorrect.
protected  Evaluator<? super InputType,? extends CategoryType> currentMember
          The currently learned member of the ensemble.
protected  ArrayList<CategoryType> currentMemberEstimates
          The estimates of the current member for each example.
protected  ArrayList<DataDistribution<CategoryType>> dataFullEstimates
          The running estimate of the ensemble for each example.
protected  int[] dataInBag
          A counter for each example indicating how many times it exists in the current bag.
protected  ArrayList<? extends InputOutputPair<? extends InputType,CategoryType>> dataList
          The data represented as an array list.
protected  ArrayList<DataDistribution<CategoryType>> dataOutOfBagEstimates
          The running estimate of the ensemble for each example where an ensemble member can only vote on elements that were not in the bag used to train it.
static int DEFAULT_MAX_ITERATIONS
          The default maximum number of iterations is 100.
static double DEFAULT_PERCENT_TO_SAMPLE
          The default percent to sample is 0.1.
static double DEFAULT_PROPORTION_INCORRECT_IN_SAMPLE
          By default, use 50% incorrect (and 50% correct) examples in the percent to sample.
static boolean DEFAULT_VOTE_OUT_OF_BAG_ONLY
          The default value for out-of-bag-only voting, which is enabled by default.
protected  WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>> ensemble
          The current ensemble.
protected  BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner
          The learner used to produce each ensemble member.
protected  int numCorrectToSample
          The number of correct examples to sample on each iteration.
protected  int numIncorrectToSample
          The number of incorrect examples to sample on each iteration.
protected  double percentToSample
          The percent to sample on each iteration.
protected  double proportionIncorrectInSample
          The proportion of incorrect examples in each sample.
protected  Random random
          The random number generator to use.
protected  int sampleSize
          The size of sample to create on each iteration.
protected  boolean voteOutOfBagOnly
          Controls whether or not an ensemble member can vote on items it was trained on during learning.
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
data, keepGoing
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
maxIterations
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
DEFAULT_ITERATION, iteration
 
Constructor Summary
IVotingCategorizerLearner()
          Creates a new IVotingCategorizerLearner.
IVotingCategorizerLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner, int maxIterations, double percentToSample, double proportionIncorrectInSample, boolean voteOutOfBagOnly, Factory<? extends DataDistribution<CategoryType>> counterFactory, Random random)
          Creates a new IVotingCategorizerLearner.
IVotingCategorizerLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner, int maxIterations, double percentToSample, Random random)
          Creates a new IVotingCategorizerLearner.
 
Method Summary
protected  void cleanupAlgorithm()
          Called to clean up the learning algorithm's state after learning has finished.
protected  void createBag(ArrayList<Integer> correctIndices, ArrayList<Integer> incorrectIndices)
          Create the next sample (bag) of examples to learn the next ensemble member from.
 Factory<? extends DataDistribution<CategoryType>> getCounterFactory()
          Gets the factory used for creating the object for counting the votes of the learned ensemble members.
 boolean[] getCurrentEnsembleCorrect()
          Gets whether or not the current ensemble gets each example correct.
 List<DataDistribution<CategoryType>> getDataFullEstimates()
          Gets the current estimates for each data point.
 List<DataDistribution<CategoryType>> getDataOutOfBagEstimates()
          Gets the current out-of-bag estimates for each data point.
 BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> getLearner()
          Gets the learner used to learn each ensemble member.
 double getPercentToSample()
          Gets the percentage of the total data to sample on each iteration.
 double getProportionIncorrectInSample()
          Gets the proportion of incorrect examples to place in each sample.
 Random getRandom()
          Gets the random number generator used by this object.
 WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>> getResult()
          Gets the current result of the algorithm.
protected  boolean initializeAlgorithm()
          Called to initialize the learning algorithm's state based on the data that is stored in the data field.
 boolean isVoteOutOfBagOnly()
          Gets whether, during learning, ensemble members can only vote on items that are not in their bag (training set).
protected static
<DataType> void
sampleIndicesWithReplacementInto(ArrayList<Integer> fromIndices, ArrayList<? extends DataType> baseData, int numToSample, Random random, ArrayList<DataType> output, int[] dataInBag)
          Takes the given number of samples from the given list and places them in the given output list.
 void setCounterFactory(Factory<? extends DataDistribution<CategoryType>> counterFactory)
          Sets the factory used for creating the object for counting the votes of the learned ensemble members.
 void setLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner)
          Sets the learner used to learn each ensemble member.
 void setPercentToSample(double percentToSample)
          Sets the percentage of the data to sample (with replacement) on each iteration.
 void setProportionIncorrectInSample(double proportionIncorrectInSample)
          Sets the proportion of incorrect examples to place in each sample.
 void setRandom(Random random)
          Sets the random number generator used by this object.
 void setVoteOutOfBagOnly(boolean voteOutOfBagOnly)
          Sets whether, during learning, ensemble members can only vote on items that are not in their bag (training set).
protected  boolean step()
          Called to take a single step of the learning algorithm.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
clone, getData, getKeepGoing, learn, setData, setKeepGoing, stop
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
getMaxIterations, isResultValid, setMaxIterations
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.learning.algorithm.BatchLearner
learn
 
Methods inherited from interface gov.sandia.cognition.util.CloneableSerializable
clone
 
Methods inherited from interface gov.sandia.cognition.algorithm.AnytimeAlgorithm
getMaxIterations, setMaxIterations
 
Methods inherited from interface gov.sandia.cognition.algorithm.IterativeAlgorithm
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
 
Methods inherited from interface gov.sandia.cognition.algorithm.StoppableAlgorithm
isResultValid
 

Field Detail

DEFAULT_MAX_ITERATIONS

public static final int DEFAULT_MAX_ITERATIONS
The default maximum number of iterations is 100.

See Also:
Constant Field Values

DEFAULT_PERCENT_TO_SAMPLE

public static final double DEFAULT_PERCENT_TO_SAMPLE
The default percent to sample is 0.1.

See Also:
Constant Field Values

DEFAULT_PROPORTION_INCORRECT_IN_SAMPLE

public static final double DEFAULT_PROPORTION_INCORRECT_IN_SAMPLE
By default, use 50% incorrect (and 50% correct) examples in the percent to sample.

See Also:
Constant Field Values

DEFAULT_VOTE_OUT_OF_BAG_ONLY

public static final boolean DEFAULT_VOTE_OUT_OF_BAG_ONLY
The default value for out-of-bag-only voting, which is enabled by default.

See Also:
Constant Field Values

learner

protected BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner
The learner used to produce each ensemble member.


percentToSample

protected double percentToSample
The percent to sample on each iteration.


proportionIncorrectInSample

protected double proportionIncorrectInSample
The proportion of incorrect examples in each sample. Must be between 0.0 and 1.0 (inclusive).


voteOutOfBagOnly

protected boolean voteOutOfBagOnly
Controls whether or not an ensemble member can vote on items it was trained on during learning. By default, the ensemble member can only vote on out-of-bag values.


counterFactory

protected Factory<? extends DataDistribution<CategoryType>> counterFactory
Factory for counting votes.


random

protected Random random
The random number generator to use.


ensemble

protected transient WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>> ensemble
The current ensemble.


dataList

protected transient ArrayList<? extends InputOutputPair<? extends InputType,CategoryType>> dataList
The data represented as an array list.


dataFullEstimates

protected transient ArrayList<DataDistribution<CategoryType>> dataFullEstimates
The running estimate of the ensemble for each example. Updated in each iteration with the ensemble member created for that iteration. This is used instead of evaluating the ensemble in each iteration to make it so that each ensemble member is only evaluated once on each training example.


dataOutOfBagEstimates

protected transient ArrayList<DataDistribution<CategoryType>> dataOutOfBagEstimates
The running estimate of the ensemble for each example where an ensemble member can only vote on elements that were not in the bag used to train it.
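
To make the difference between the full and the out-of-bag running estimates concrete, here is a minimal sketch of how one new member's predictions might be folded into both tallies. It is an assumption-laden illustration: a plain Map of counts stands in for DataDistribution, each member is given a vote weight of 1, and the names RunningEstimateSketch, addVotes, and emptyTallies are hypothetical rather than part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a Map of counts stands in for DataDistribution.
class RunningEstimateSketch
{
    /** Adds one member's votes to the full and out-of-bag tallies. */
    static <C> void addVotes(
        List<C> memberEstimates,
        int[] dataInBag,
        List<Map<C, Integer>> fullEstimates,
        List<Map<C, Integer>> outOfBagEstimates)
    {
        for (int i = 0; i < memberEstimates.size(); i++)
        {
            C vote = memberEstimates.get(i);
            // The full estimate always receives the member's vote.
            fullEstimates.get(i).merge(vote, 1, Integer::sum);
            // The out-of-bag estimate only receives the vote when the
            // example was not in the bag used to train this member.
            if (dataInBag[i] == 0)
            {
                outOfBagEstimates.get(i).merge(vote, 1, Integer::sum);
            }
        }
    }

    /** Creates one empty tally per example. */
    static <C> List<Map<C, Integer>> emptyTallies(int size)
    {
        List<Map<C, Integer>> tallies = new ArrayList<>();
        for (int i = 0; i < size; i++)
        {
            tallies.add(new HashMap<C, Integer>());
        }
        return tallies;
    }

    public static void main(String[] args)
    {
        List<String> memberEstimates = Arrays.asList("spam", "ham", "spam");
        int[] dataInBag = { 2, 0, 0 }; // example 0 was sampled twice
        List<Map<String, Integer>> full = emptyTallies(3);
        List<Map<String, Integer>> oob = emptyTallies(3);
        addVotes(memberEstimates, dataInBag, full, oob);
        System.out.println(full); // [{spam=1}, {ham=1}, {spam=1}]
        System.out.println(oob);  // [{}, {ham=1}, {spam=1}]
    }
}
```

Because each member is evaluated once per example and its vote is accumulated, the ensemble never has to be re-evaluated from scratch on each iteration.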


currentEnsembleCorrect

protected transient boolean[] currentEnsembleCorrect
A boolean for each example indicating if the ensemble currently gets the example correct or incorrect.


currentCorrectIndices

protected transient ArrayList<Integer> currentCorrectIndices
The indices of examples that the ensemble currently gets correct.


currentIncorrectIndices

protected transient ArrayList<Integer> currentIncorrectIndices
The indices of examples that the ensemble currently gets incorrect.


sampleSize

protected transient int sampleSize
The size of sample to create on each iteration.


numCorrectToSample

protected transient int numCorrectToSample
The number of correct examples to sample on each iteration.


numIncorrectToSample

protected transient int numIncorrectToSample
The number of incorrect examples to sample on each iteration.


currentBag

protected transient ArrayList<InputOutputPair<? extends InputType,CategoryType>> currentBag
The current bag used to train the current ensemble member.


dataInBag

protected transient int[] dataInBag
A counter for each example indicating how many times it exists in the current bag.


currentMember

protected transient Evaluator<? super InputType,? extends CategoryType> currentMember
The currently learned member of the ensemble.


currentMemberEstimates

protected transient ArrayList<CategoryType> currentMemberEstimates
The estimates of the current member for each example.

Constructor Detail

IVotingCategorizerLearner

public IVotingCategorizerLearner()
Creates a new IVotingCategorizerLearner.


IVotingCategorizerLearner

public IVotingCategorizerLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner,
                                 int maxIterations,
                                 double percentToSample,
                                 Random random)
Creates a new IVotingCategorizerLearner.

Parameters:
learner - The learner to use to create the categorizer on each iteration.
maxIterations - The maximum number of iterations to run for, which is also the number of learners to create.
percentToSample - The percentage of the total size of the data to sample on each iteration. Must be positive.
random - The random number generator to use.

IVotingCategorizerLearner

public IVotingCategorizerLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner,
                                 int maxIterations,
                                 double percentToSample,
                                 double proportionIncorrectInSample,
                                 boolean voteOutOfBagOnly,
                                 Factory<? extends DataDistribution<CategoryType>> counterFactory,
                                 Random random)
Creates a new IVotingCategorizerLearner.

Parameters:
learner - The learner to use to create the categorizer on each iteration.
maxIterations - The maximum number of iterations to run for, which is also the number of learners to create.
percentToSample - The percentage of the total size of the data to sample on each iteration. Must be positive.
proportionIncorrectInSample - The proportion of incorrect examples to put in each sample. Must be between 0.0 and 1.0 (inclusive).
voteOutOfBagOnly - Controls whether ensemble members can only vote on out-of-bag examples (those not in their training bag) when accuracy is determined during learning.
counterFactory - The factory for counting votes.
random - The random number generator to use.
Method Detail

initializeAlgorithm

protected boolean initializeAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to initialize the learning algorithm's state based on the data that is stored in the data field. The return value indicates if the algorithm can be run or not based on the initialization.

Specified by:
initializeAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>
Returns:
True if the learning algorithm can be run and false if it cannot.

step

protected boolean step()
Description copied from class: AbstractAnytimeBatchLearner
Called to take a single step of the learning algorithm.

Specified by:
step in class AbstractAnytimeBatchLearner<Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>
Returns:
True if another step can be taken and false if the algorithm should halt.

createBag

protected void createBag(ArrayList<Integer> correctIndices,
                         ArrayList<Integer> incorrectIndices)
Create the next sample (bag) of examples to learn the next ensemble member from.

Parameters:
correctIndices - The list of indices the ensemble is currently getting correct.
incorrectIndices - The list of indices the ensemble is currently getting incorrect.

sampleIndicesWithReplacementInto

protected static <DataType> void sampleIndicesWithReplacementInto(ArrayList<Integer> fromIndices,
                                                                  ArrayList<? extends DataType> baseData,
                                                                  int numToSample,
                                                                  Random random,
                                                                  ArrayList<DataType> output,
                                                                  int[] dataInBag)
Takes the given number of samples from the given list and places them in the given output list. It samples with replacement, which means that a given item may appear multiple times in the bag. It also keeps track of how many times each item was sampled.

Type Parameters:
DataType - The data type to sample.
Parameters:
fromIndices - The indices into the given base data to sample from.
baseData - The list to sample from using the given list of indices.
numToSample - The number to sample. Must be non-negative.
random - The random number generator to use.
output - The list to add the samples to.
dataInBag - The array of counters for the number of times each example is sampled.
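
The behavior described above can be sketched as follows. This is a minimal illustration that follows the documented signature and description, not the library's source; the class name SampleWithReplacementSketch is hypothetical, and the handling of an empty fromIndices list (here, an exception from Random.nextInt) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not the library's implementation.
class SampleWithReplacementSketch
{
    static <DataType> void sampleIndicesWithReplacementInto(
        ArrayList<Integer> fromIndices,
        ArrayList<? extends DataType> baseData,
        int numToSample,
        Random random,
        ArrayList<DataType> output,
        int[] dataInBag)
    {
        for (int i = 0; i < numToSample; i++)
        {
            // Pick a random position in the index list. The same position
            // may be picked again on a later pass (sampling with replacement).
            // Assumption: fromIndices is non-empty when numToSample > 0;
            // otherwise nextInt(0) throws IllegalArgumentException.
            int index = fromIndices.get(random.nextInt(fromIndices.size()));
            output.add(baseData.get(index));
            // Track how many times this example now appears in the bag.
            dataInBag[index]++;
        }
    }

    public static void main(String[] args)
    {
        ArrayList<String> data =
            new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e"));
        ArrayList<Integer> incorrect = new ArrayList<>(Arrays.asList(1, 4));
        ArrayList<Integer> correct = new ArrayList<>(Arrays.asList(0, 2, 3));
        ArrayList<String> bag = new ArrayList<>();
        int[] dataInBag = new int[data.size()];
        Random random = new Random(42);

        // Build one bag from two parts: incorrect examples, then correct ones.
        sampleIndicesWithReplacementInto(incorrect, data, 2, random, bag, dataInBag);
        sampleIndicesWithReplacementInto(correct, data, 2, random, bag, dataInBag);
        System.out.println(bag + " " + Arrays.toString(dataInBag));
    }
}
```

Calling the method once per index list, as in main, mirrors how a bag could be assembled from the incorrect and correct sets while sharing a single dataInBag counter array.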

cleanupAlgorithm

protected void cleanupAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to clean up the learning algorithm's state after learning has finished.

Specified by:
cleanupAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>

getResult

public WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>> getResult()
Description copied from interface: AnytimeAlgorithm
Gets the current result of the algorithm.

Specified by:
getResult in interface AnytimeAlgorithm<WeightedVotingCategorizerEnsemble<InputType,CategoryType,Evaluator<? super InputType,? extends CategoryType>>>
Returns:
Current result of the algorithm.

getLearner

public BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> getLearner()
Gets the learner used to learn each ensemble member.

Specified by:
getLearner in interface BatchLearnerContainer<BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>>>
Returns:
The learner used for each ensemble member.

setLearner

public void setLearner(BatchLearner<? super Collection<? extends InputOutputPair<? extends InputType,CategoryType>>,? extends Evaluator<? super InputType,? extends CategoryType>> learner)
Sets the learner used to learn each ensemble member. Must be a supervised learning algorithm that takes in a collection of input-output pairs of the given data types and produces an evaluator for those data types.

Parameters:
learner - The learner used for each ensemble member.

getPercentToSample

public double getPercentToSample()
Gets the percentage of the total data to sample on each iteration.

Returns:
The percentage of the total data to sample on each iteration.

setPercentToSample

public void setPercentToSample(double percentToSample)
Sets the percentage of the data to sample (with replacement) on each iteration. Must be greater than zero. The percent is represented as a floating point number with 1.0 representing 100%.

Parameters:
percentToSample - The percent of the data to sample on each iteration. Must be greater than zero. Defaults to 0.1 (10%), per DEFAULT_PERCENT_TO_SAMPLE.

getProportionIncorrectInSample

public double getProportionIncorrectInSample()
Gets the proportion of incorrect examples to place in each sample.

Returns:
The proportion of incorrect examples in each sample.

setProportionIncorrectInSample

public void setProportionIncorrectInSample(double proportionIncorrectInSample)
Sets the proportion of incorrect examples to place in each sample. Must be between 0.0 and 1.0 (inclusive). The rest of the examples in the sample will be filled from the correct examples.

Parameters:
proportionIncorrectInSample - The proportion of incorrect examples in each sample. Must be between 0.0 and 1.0 (inclusive).
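
As a worked example of how the percent to sample and the proportion of incorrect examples could translate into per-iteration counts, consider the sketch below. The rounding rule, the minimum sample size of 1, and the names SampleSplitSketch and split are assumptions made for illustration; the library may compute sampleSize, numIncorrectToSample, and numCorrectToSample differently.

```java
import java.util.Arrays;

// Hypothetical illustration; the library's exact rounding is not documented here.
class SampleSplitSketch
{
    /**
     * Returns { sampleSize, numIncorrectToSample, numCorrectToSample }.
     */
    static int[] split(
        int dataSize,
        double percentToSample,
        double proportionIncorrectInSample)
    {
        if (percentToSample <= 0.0)
        {
            throw new IllegalArgumentException(
                "percentToSample must be greater than zero");
        }
        if (proportionIncorrectInSample < 0.0
            || proportionIncorrectInSample > 1.0)
        {
            throw new IllegalArgumentException(
                "proportionIncorrectInSample must be in [0.0, 1.0]");
        }

        // Assumption: round to the nearest integer and sample at least one.
        int sampleSize =
            Math.max(1, (int) Math.round(percentToSample * dataSize));
        int numIncorrect =
            (int) Math.round(proportionIncorrectInSample * sampleSize);
        // The rest of the sample is filled from the correct examples.
        int numCorrect = sampleSize - numIncorrect;
        return new int[] { sampleSize, numIncorrect, numCorrect };
    }

    public static void main(String[] args)
    {
        // 1000 examples with the documented defaults of 0.1 and 0.5.
        System.out.println(Arrays.toString(split(1000, 0.1, 0.5)));
        // prints [100, 50, 50]
    }
}
```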

isVoteOutOfBagOnly

public boolean isVoteOutOfBagOnly()
Gets whether, during learning, ensemble members can only vote on items that are not in their bag (training set).

Returns:
If out-of-bag-only voting is enabled.

setVoteOutOfBagOnly

public void setVoteOutOfBagOnly(boolean voteOutOfBagOnly)
Sets whether, during learning, ensemble members can only vote on items that are not in their bag (training set). In the vast majority of cases, this should be enabled. It is enabled by default.

Parameters:
voteOutOfBagOnly - If out-of-bag-only voting should be enabled.

getCounterFactory

public Factory<? extends DataDistribution<CategoryType>> getCounterFactory()
Gets the factory used for creating the object for counting the votes of the learned ensemble members.

Returns:
The factory used to create the vote counting objects.

setCounterFactory

public void setCounterFactory(Factory<? extends DataDistribution<CategoryType>> counterFactory)
Sets the factory used for creating the object for counting the votes of the learned ensemble members.

Parameters:
counterFactory - The factory used to create the vote counting objects.

getRandom

public Random getRandom()
Description copied from interface: Randomized
Gets the random number generator used by this object.

Specified by:
getRandom in interface Randomized
Returns:
The random number generator used by this object.

setRandom

public void setRandom(Random random)
Description copied from interface: Randomized
Sets the random number generator used by this object.

Specified by:
setRandom in interface Randomized
Parameters:
random - The random number generator for this object to use.

getDataFullEstimates

public List<DataDistribution<CategoryType>> getDataFullEstimates()
Gets the current estimates for each data point. Do not modify these counts, as doing so will change the behavior of the algorithm.

Returns:
The current estimates for each data point.

getDataOutOfBagEstimates

public List<DataDistribution<CategoryType>> getDataOutOfBagEstimates()
Gets the current out-of-bag estimates for each data point. Do not modify these counts, as doing so will change the behavior of the algorithm.

Returns:
The current out-of-bag estimates for each data point.
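
One plausible use of per-example out-of-bag tallies is to compute an out-of-bag error rate, read-only, without touching the counts. The sketch below is an assumption-based illustration: a Map of counts stands in for DataDistribution, examples with no out-of-bag votes are skipped, ties go to whichever category is encountered first, and the names OutOfBagErrorSketch and errorRate are hypothetical. It does not claim to reproduce how OutOfBagErrorStoppingCriteria computes its error.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a Map of counts stands in for DataDistribution.
class OutOfBagErrorSketch
{
    /**
     * Fraction of examples whose out-of-bag majority vote differs from the
     * true label, counting only examples that have at least one vote.
     */
    static <C> double errorRate(
        List<Map<C, Integer>> outOfBagEstimates,
        List<C> labels)
    {
        int counted = 0;
        int errors = 0;
        for (int i = 0; i < labels.size(); i++)
        {
            C best = null;
            int bestCount = 0;
            for (Map.Entry<C, Integer> entry
                : outOfBagEstimates.get(i).entrySet())
            {
                if (entry.getValue() > bestCount)
                {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            if (bestCount == 0)
            {
                continue; // No out-of-bag votes for this example yet.
            }
            counted++;
            if (!best.equals(labels.get(i)))
            {
                errors++;
            }
        }
        return counted == 0 ? 0.0 : (double) errors / counted;
    }

    public static void main(String[] args)
    {
        Map<String, Integer> first = new HashMap<>();
        first.put("spam", 2);
        first.put("ham", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("ham", 1);
        Map<String, Integer> third = new HashMap<>(); // no votes yet

        List<Map<String, Integer>> tallies = Arrays.asList(first, second, third);
        List<String> labels = Arrays.asList("spam", "spam", "spam");
        System.out.println(errorRate(tallies, labels)); // 0.5
    }
}
```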

getCurrentEnsembleCorrect

public boolean[] getCurrentEnsembleCorrect()
Gets whether or not the current ensemble gets each example correct. Do not modify these values, as doing so will change the behavior of the algorithm.

Returns:
The array of booleans regarding whether or not the ensemble gets an example correct.