gov.sandia.cognition.text.topic
Class LatentDirichletAllocationVectorGibbsSampler

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
          extended by gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm<ResultType>
              extended by gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
                  extended by gov.sandia.cognition.text.topic.LatentDirichletAllocationVectorGibbsSampler
All Implemented Interfaces:
AnytimeAlgorithm<LatentDirichletAllocationVectorGibbsSampler.Result>, IterativeAlgorithm, StoppableAlgorithm, AnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>, BatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>, CloneableSerializable, Randomized, Serializable, Cloneable
Direct Known Subclasses:
ParallelLatentDirichletAllocationVectorGibbsSampler

@PublicationReferences(references={@PublicationReference(author={"David M. Blei","Andrew Y. Ng","Michael I. Jordan"},title="Latent Dirichlet Allocation",year=2003,type=Journal,publication="Journal of Machine Learning Research",pages={993,1022},url="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf"),@PublicationReference(author="Gregor Heinrich",title="Parameter estimation for text analysis",year=2009,type=TechnicalReport,url="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.1327&rep=rep1&type=pdf")})
public class LatentDirichletAllocationVectorGibbsSampler
extends AbstractAnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
implements Randomized

A Gibbs sampler for performing Latent Dirichlet Allocation (LDA). It operates on input vectors that are expected to have positive integer counts. The LDA model uses a fixed set of latent topics as a generative model for term occurrences in documents. Thus, each document is a mixture of different topics. This implementation uses Gibbs sampling, a Markov Chain Monte Carlo algorithm, to estimate the parameters of the model.
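As an illustration of the expected input, the following minimal sketch turns tokenized documents into per-document term-count arrays over a shared vocabulary. The class TermCountSketch and its methods are hypothetical helpers, not part of this library, and wrapping each array in a Vectorizable is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the library: builds dense term-count
// arrays (one per document) over a shared vocabulary.
final class TermCountSketch {

    /** Maps each distinct term to a column index, in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<List<String>> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> document : documents) {
            for (String term : document) {
                // The index of a new term is the vocabulary size before insertion.
                vocabulary.putIfAbsent(term, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Counts term occurrences per document; every term must be in the vocabulary. */
    static List<int[]> countTerms(List<List<String>> documents,
                                  Map<String, Integer> vocabulary) {
        List<int[]> counts = new ArrayList<>();
        for (List<String> document : documents) {
            int[] row = new int[vocabulary.size()];
            for (String term : document) {
                row[vocabulary.get(term)]++;
            }
            counts.add(row);
        }
        return counts;
    }
}
```

Each resulting row holds integer counts, one entry per vocabulary term, which is the shape of data the sampler is described as expecting.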

Since:
3.1
Author:
Justin Basilico
See Also:
Serialized Form

Nested Class Summary
static class LatentDirichletAllocationVectorGibbsSampler.Result
          Represents the result of performing Latent Dirichlet Allocation.
 
Field Summary
protected  double alpha
          The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
protected  double beta
          The beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
protected  int burnInIterations
          The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
static double DEFAULT_ALPHA
          The default value of alpha is 5.0.
static double DEFAULT_BETA
          The default value of beta is 0.5.
static int DEFAULT_BURN_IN_ITERATIONS
          The default number of burn-in iterations is 2000.
static int DEFAULT_ITERATIONS_PER_SAMPLE
          The default number of iterations per sample is 100.
static int DEFAULT_MAX_ITERATIONS
          The default maximum number of iterations is 10000.
static int DEFAULT_TOPIC_COUNT
          The default topic count is 10.
protected  int documentCount
          The number of documents in the dataset.
protected  int[][] documentTopicCount
          For each document, the number of term occurrences assigned to each topic.
protected  int[] documentTopicSum
          The number of term occurrences in each document.
protected  int iterationsPerSample
          The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
protected  int[] occurrenceTopicAssignments
          The assignments of term occurrences to topics.
protected  Random random
          The random number generator to use.
protected  LatentDirichletAllocationVectorGibbsSampler.Result result
          The result probabilities.
protected  int sampleCount
          The number of model parameter samples that have been made.
protected  int termCount
          The number of terms in the dataset.
protected  int topicCount
          The number of topics for the algorithm to create.
protected  int[][] topicTermCount
          For each topic, the number of occurrences assigned to each term.
protected  int[] topicTermSum
          The number of term occurrences assigned to each topic.
 
Fields inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
data, keepGoing
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
maxIterations
 
Fields inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
DEFAULT_ITERATION, iteration
 
Constructor Summary
LatentDirichletAllocationVectorGibbsSampler()
          Creates a new LatentDirichletAllocationVectorGibbsSampler with default parameters.
LatentDirichletAllocationVectorGibbsSampler(int topicCount, double alpha, double beta, int maxIterations, int burnInIterations, int iterationsPerSample, Random random)
          Creates a new LatentDirichletAllocationVectorGibbsSampler with the given parameters.
 
Method Summary
protected  void cleanupAlgorithm()
          Called to clean up the learning algorithm's state after learning has finished.
 double getAlpha()
          Gets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
 double getBeta()
          Gets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
 int getBurnInIterations()
          Gets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
 int getDocumentCount()
          Gets the number of documents in the dataset.
 int getIterationsPerSample()
          Gets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
 Random getRandom()
          Gets the random number generator used by this object.
 LatentDirichletAllocationVectorGibbsSampler.Result getResult()
          Gets the current result of the algorithm.
 int getTermCount()
          Gets the number of terms in the dataset.
 int getTopicCount()
          Gets the number of topics (k) created by the topic model.
protected  boolean initializeAlgorithm()
          Called to initialize the learning algorithm's state based on the data that is stored in the data field.
protected  void readParameters()
          Reads the current set of parameters.
protected  int sampleTopic(int document, int term, double[] topicCumulativeProportions)
          Samples a topic for a given document and term.
 void setAlpha(double alpha)
          Sets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
 void setBeta(double beta)
          Sets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
 void setBurnInIterations(int burnInIterations)
          Sets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
 void setIterationsPerSample(int iterationsPerSample)
          Sets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
 void setRandom(Random random)
          Sets the random number generator used by this object.
 void setTopicCount(int topicCount)
          Sets the number of topics (k) created by the topic model.
protected  boolean step()
          Called to take a single step of the learning algorithm.
 
Methods inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
clone, getData, getKeepGoing, learn, setData, setKeepGoing, stop
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
getMaxIterations, isResultValid, setMaxIterations
 
Methods inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.algorithm.AnytimeAlgorithm
getMaxIterations, setMaxIterations
 
Methods inherited from interface gov.sandia.cognition.algorithm.IterativeAlgorithm
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
 
Methods inherited from interface gov.sandia.cognition.algorithm.StoppableAlgorithm
isResultValid
 

Field Detail

DEFAULT_TOPIC_COUNT

public static final int DEFAULT_TOPIC_COUNT
The default topic count is 10.

See Also:
Constant Field Values

DEFAULT_ALPHA

public static final double DEFAULT_ALPHA
The default value of alpha is 5.0.

See Also:
Constant Field Values

DEFAULT_BETA

public static final double DEFAULT_BETA
The default value of beta is 0.5.

See Also:
Constant Field Values

DEFAULT_MAX_ITERATIONS

public static final int DEFAULT_MAX_ITERATIONS
The default maximum number of iterations is 10000.

See Also:
Constant Field Values

DEFAULT_BURN_IN_ITERATIONS

public static final int DEFAULT_BURN_IN_ITERATIONS
The default number of burn-in iterations is 2000.

See Also:
Constant Field Values

DEFAULT_ITERATIONS_PER_SAMPLE

public static final int DEFAULT_ITERATIONS_PER_SAMPLE
The default number of iterations per sample is 100.

See Also:
Constant Field Values

topicCount

protected int topicCount
The number of topics for the algorithm to create.


alpha

protected double alpha
The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.


beta

protected double beta
The beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.
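To illustrate what "prior weight assigned to the counts" means for alpha and beta, here is a minimal sketch of the usual smoothed estimate (count + prior) / (total + size * prior), as described in the Heinrich reference cited above. DirichletSmoothingSketch is a hypothetical name, and this is not claimed to be the library's exact computation.

```java
// Hypothetical sketch: turns raw counts into probabilities using a symmetric
// Dirichlet prior weight (alpha for document-topic counts, beta for
// topic-term counts).
final class DirichletSmoothingSketch {

    /** Returns (counts[i] + prior) / (sum(counts) + counts.length * prior). */
    static double[] smooth(int[] counts, double prior) {
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        double denominator = total + counts.length * prior;
        double[] probabilities = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            probabilities[i] = (counts[i] + prior) / denominator;
        }
        return probabilities;
    }
}
```

A larger prior pulls the estimate toward the uniform distribution; with all counts at zero the result is exactly uniform.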


burnInIterations

protected int burnInIterations
The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.


iterationsPerSample

protected int iterationsPerSample
The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).


random

protected Random random
The random number generator to use.


documentCount

protected transient int documentCount
The number of documents in the dataset.


termCount

protected transient int termCount
The number of terms in the dataset.


documentTopicCount

protected transient int[][] documentTopicCount
For each document, the number of term occurrences assigned to each topic. Thus, the first index is a document index and the second is a topic index.


documentTopicSum

protected transient int[] documentTopicSum
The number of term occurrences in each document.


topicTermCount

protected transient int[][] topicTermCount
For each topic, the number of occurrences assigned to each term. Thus, the first index is a topic index and the second is a term index.


topicTermSum

protected transient int[] topicTermSum
The number of term occurrences assigned to each topic.


occurrenceTopicAssignments

protected transient int[] occurrenceTopicAssignments
The assignments of term occurrences to topics.
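The count fields above have to stay consistent whenever a term occurrence changes topic. The following sketch shows one way that bookkeeping can look; CountBookkeepingSketch is a hypothetical class whose field names only mirror the ones documented here, and updating the per-document sum on every change follows the Heinrich reference rather than this class's confirmed behavior.

```java
// Hypothetical sketch of the count arrays kept by a collapsed Gibbs sampler.
final class CountBookkeepingSketch {
    final int[][] documentTopicCount; // [document][topic]
    final int[] documentTopicSum;     // [document], occurrences per document
    final int[][] topicTermCount;     // [topic][term]
    final int[] topicTermSum;         // [topic], occurrences per topic

    CountBookkeepingSketch(int documentCount, int topicCount, int termCount) {
        documentTopicCount = new int[documentCount][topicCount];
        documentTopicSum = new int[documentCount];
        topicTermCount = new int[topicCount][termCount];
        topicTermSum = new int[topicCount];
    }

    /** Records that one occurrence of term in document belongs to topic. */
    void assign(int document, int term, int topic) {
        documentTopicCount[document][topic]++;
        documentTopicSum[document]++;
        topicTermCount[topic][term]++;
        topicTermSum[topic]++;
    }

    /** Removes one occurrence's assignment, for example before resampling it. */
    void unassign(int document, int term, int topic) {
        documentTopicCount[document][topic]--;
        documentTopicSum[document]--;
        topicTermCount[topic][term]--;
        topicTermSum[topic]--;
    }
}
```

Moving an occurrence from one topic to another is then an unassign followed by an assign, which leaves the per-document sum unchanged.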


sampleCount

protected transient int sampleCount
The number of model parameter samples that have been made.


result

protected transient LatentDirichletAllocationVectorGibbsSampler.Result result
The result probabilities. Note that if multiple samples are taken, this will be a sum of the probabilities for the different samples until the algorithm is done and they are turned into an average.
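The sum-then-average behavior described for the result can be sketched as follows. SampleAveragingSketch is a hypothetical class used only to illustrate the idea; it works on a single probability array rather than the full Result object.

```java
// Hypothetical sketch: accumulates probability samples as a running sum and
// converts the sum into an average once sampling is finished.
final class SampleAveragingSketch {
    private final double[] sum;
    private int sampleCount;

    SampleAveragingSketch(int size) {
        this.sum = new double[size];
    }

    /** Adds one sample of probabilities to the running sum. */
    void addSample(double[] probabilities) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += probabilities[i];
        }
        sampleCount++;
    }

    /** Returns the average over all samples added so far. */
    double[] average() {
        if (sampleCount == 0) {
            throw new IllegalStateException("No samples have been added.");
        }
        double[] result = new double[sum.length];
        for (int i = 0; i < sum.length; i++) {
            result[i] = sum[i] / sampleCount;
        }
        return result;
    }
}
```

Until the division happens, the accumulated values are sums and do not form a probability distribution on their own.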

Constructor Detail

LatentDirichletAllocationVectorGibbsSampler

public LatentDirichletAllocationVectorGibbsSampler()
Creates a new LatentDirichletAllocationVectorGibbsSampler with default parameters.


LatentDirichletAllocationVectorGibbsSampler

public LatentDirichletAllocationVectorGibbsSampler(int topicCount,
                                                   double alpha,
                                                   double beta,
                                                   int maxIterations,
                                                   int burnInIterations,
                                                   int iterationsPerSample,
                                                   Random random)
Creates a new LatentDirichletAllocationVectorGibbsSampler with the given parameters.

Parameters:
topicCount - The number of topics for the algorithm to create. Must be positive.
alpha - The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts. Must be positive.
beta - The beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts. Must be positive.
maxIterations - The maximum number of iterations to run for. Must be positive.
burnInIterations - The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins. Must be non-negative.
iterationsPerSample - The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations). Must be positive.
random - The random number generator to use.
Method Detail

initializeAlgorithm

protected boolean initializeAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to initialize the learning algorithm's state based on the data that is stored in the data field. The return value indicates if the algorithm can be run or not based on the initialization.

Specified by:
initializeAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
Returns:
True if the learning algorithm can be run and false if it cannot.

step

protected boolean step()
Description copied from class: AbstractAnytimeBatchLearner
Called to take a single step of the learning algorithm.

Specified by:
step in class AbstractAnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
Returns:
True if another step can be taken and false if the algorithm should halt.

sampleTopic

protected int sampleTopic(int document,
                          int term,
                          double[] topicCumulativeProportions)
Samples a topic for a given document and term.

Parameters:
document - The document index.
term - The term index.
topicCumulativeProportions - The array to use to store the proportions in.
Returns:
A topic index sampled from the topic probabilities of the given document and term.
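A minimal sketch of how a topic can be drawn from cumulative proportions is shown below. It assumes the standard collapsed Gibbs conditional from the Heinrich reference, where the weight of topic k is (documentTopicCount[k] + alpha) * (topicTermCount[k][term] + beta) / (topicTermSum[k] + termCount * beta). TopicSamplingSketch and its method signatures are hypothetical and are not the source of this method.

```java
import java.util.Random;

// Hypothetical sketch, not the library's source. The counts passed in are
// assumed to already exclude the occurrence that is being resampled.
final class TopicSamplingSketch {

    /** Fills cumulative[k] with the running sum of unnormalized topic weights. */
    static void fillCumulative(int[] documentTopicCount, int[][] topicTermCount,
                               int[] topicTermSum, int term, int termCount,
                               double alpha, double beta, double[] cumulative) {
        double total = 0.0;
        for (int k = 0; k < cumulative.length; k++) {
            double documentPart = documentTopicCount[k] + alpha;
            double termPart = (topicTermCount[k][term] + beta)
                / (topicTermSum[k] + termCount * beta);
            total += documentPart * termPart;
            cumulative[k] = total;
        }
    }

    /** Picks the first topic whose cumulative weight exceeds u times the total; u is in [0, 1). */
    static int pick(double[] cumulative, double u) {
        double threshold = u * cumulative[cumulative.length - 1];
        for (int k = 0; k < cumulative.length - 1; k++) {
            if (threshold < cumulative[k]) {
                return k;
            }
        }
        return cumulative.length - 1;
    }

    /** Computes the weights for one document and term, then draws a topic. */
    static int sampleTopic(int[] documentTopicCount, int[][] topicTermCount,
                           int[] topicTermSum, int term, int termCount,
                           double alpha, double beta, double[] cumulative,
                           Random random) {
        fillCumulative(documentTopicCount, topicTermCount, topicTermSum,
            term, termCount, alpha, beta, cumulative);
        return pick(cumulative, random.nextDouble());
    }
}
```

Passing in the cumulative array, as the documented method does with topicCumulativeProportions, lets the caller reuse one buffer across all occurrences instead of allocating a new array per draw.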

cleanupAlgorithm

protected void cleanupAlgorithm()
Description copied from class: AbstractAnytimeBatchLearner
Called to clean up the learning algorithm's state after learning has finished.

Specified by:
cleanupAlgorithm in class AbstractAnytimeBatchLearner<Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>

readParameters

protected void readParameters()
Reads the current set of parameters.


getResult

public LatentDirichletAllocationVectorGibbsSampler.Result getResult()
Description copied from interface: AnytimeAlgorithm
Gets the current result of the algorithm.

Specified by:
getResult in interface AnytimeAlgorithm<LatentDirichletAllocationVectorGibbsSampler.Result>
Returns:
Current result of the algorithm.

getTopicCount

public int getTopicCount()
Gets the number of topics (k) created by the topic model.

Returns:
The number of topics created by the topic model. Must be greater than zero.

setTopicCount

public void setTopicCount(int topicCount)
Sets the number of topics (k) created by the topic model.

Parameters:
topicCount - The number of topics created by the topic model. Must be greater than zero.

getAlpha

public double getAlpha()
Gets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.

Returns:
The alpha parameter.

setAlpha

public void setAlpha(double alpha)
Sets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.

Parameters:
alpha - The alpha parameter. Must be positive.

getBeta

public double getBeta()
Gets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.

Returns:
The beta parameter.

setBeta

public void setBeta(double beta)
Sets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.

Parameters:
beta - The beta parameter. Must be positive.

getBurnInIterations

public int getBurnInIterations()
Gets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins. Note that if this number is greater than the maximum number of iterations, it will only run up to the maximum number of iterations and will only generate one parameter sample.

Returns:
The number of burn-in iterations. Must be non-negative.

setBurnInIterations

public void setBurnInIterations(int burnInIterations)
Sets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins. Note that if this number is greater than the maximum number of iterations, it will only run up to the maximum number of iterations and will only generate one parameter sample.

Parameters:
burnInIterations - The number of burn-in iterations. Must be non-negative.
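The interaction between the burn-in count, the iterations per sample, and the maximum number of iterations can be sketched as below. The exact rule is an assumption made for illustration: iterations are numbered from 1, a sample is taken once the iteration reaches the burn-in count and then every iterationsPerSample iterations, and a single sample is taken at the end if none was taken earlier. SampleScheduleSketch is a hypothetical class.

```java
// Hypothetical sketch of a sampling schedule; the precise rule used by the
// library is not stated on this page and is assumed here.
final class SampleScheduleSketch {

    /** Counts how many parameter samples the assumed schedule would produce. */
    static int countSamples(int maxIterations, int burnInIterations,
                            int iterationsPerSample) {
        int samples = 0;
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            boolean pastBurnIn = iteration >= burnInIterations;
            if (pastBurnIn
                && (iteration - burnInIterations) % iterationsPerSample == 0) {
                samples++;
            }
        }
        // If burn-in never completed, fall back to a single final sample.
        return Math.max(samples, 1);
    }
}
```

Under this assumed rule, a burn-in count larger than the maximum number of iterations yields exactly one sample, which matches the note above.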

getIterationsPerSample

public int getIterationsPerSample()
Gets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).

Returns:
The number of iterations between samples.

setIterationsPerSample

public void setIterationsPerSample(int iterationsPerSample)
Sets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).

Parameters:
iterationsPerSample - The number of iterations between samples. Must be positive.

getRandom

public Random getRandom()
Description copied from interface: Randomized
Gets the random number generator used by this object.

Specified by:
getRandom in interface Randomized
Returns:
The random number generator used by this object.

setRandom

public void setRandom(Random random)
Description copied from interface: Randomized
Sets the random number generator used by this object.

Specified by:
setRandom in interface Randomized
Parameters:
random - The random number generator for this object to use.

getDocumentCount

public int getDocumentCount()
Gets the number of documents in the dataset.

Returns:
The number of documents.

getTermCount

public int getTermCount()
Gets the number of terms in the dataset.

Returns:
The number of terms.