gov.sandia.cognition.text.term.vector.weighter.global
Class EntropyGlobalTermWeighter

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.text.term.vector.AbstractVectorSpaceModel
          extended by gov.sandia.cognition.text.term.vector.weighter.global.AbstractGlobalTermWeighter
              extended by gov.sandia.cognition.text.term.vector.weighter.global.AbstractFrequencyBasedGlobalTermWeighter
                  extended by gov.sandia.cognition.text.term.vector.weighter.global.AbstractEntropyBasedGlobalTermWeighter
                      extended by gov.sandia.cognition.text.term.vector.weighter.global.EntropyGlobalTermWeighter
All Implemented Interfaces:
VectorFactoryContainer, VectorSpaceModel, GlobalTermWeighter, CloneableSerializable, Serializable, Cloneable

@PublicationReference(author="Susan T. Dumais",
                      title="Improving the retrieval of information from external sources",
                      year=1991,
                      type=Journal,
                      publication="Behavior Research Methods, Instruments, and Computers",
                      pages={229,236},
                      url="http://www.google.com/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Fwww.psychonomic.org%2Fsearch%2Fview.cgi%3Fid%3D5145&ei=o7joSdGEHY-itgPLre3tAQ&usg=AFQjCNEvm6PZEL6_Hk3XThI6DQ-gGx9EnQ&sig2=-gjFzNroJQirwGtwjaJvgQ")
public class EntropyGlobalTermWeighter
extends AbstractEntropyBasedGlobalTermWeighter

Implements the entropy global term weighting scheme. This scheme has been reported to work well with Latent Semantic Analysis (Dumais, 1991). For a term i, the global weight W(i) is:

    W(i) = 1 - E(i) / log(n)
    E(i) = - sum_j (p_ij log(p_ij))
    p_ij = tf_ij / gf_i

where

    n = The total number of documents
    gf_i = The total number of times that term i appears in the collection
    tf_ij = The number of times that term i appears in document j

This class uses an optimization for computing E(i):

    E(i) = - (sum_j (tf_ij log(tf_ij))) / gf_i + log(gf_i)

This follows from substituting p_ij = tf_ij / gf_i into E(i) and using sum_j tf_ij = gf_i. It allows sum_j (tf_ij log(tf_ij)) to be incrementally computed and then divided by gf_i when needed, instead of needing to compute p_ij each time.
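The relationship between the direct definition of E(i) and the running-sum form can be checked numerically. The following is a minimal, self-contained sketch in plain Java, not the class's actual implementation: the class name EntropyWeightSketch and its method names are hypothetical, natural logarithms are assumed, and the term is assumed to appear at least once in a collection of more than one document. The running-sum form divides sum_j (tf_ij log(tf_ij)) by gf_i, as the closing sentence of the description states.

```java
import java.util.Arrays;

/** Hypothetical helper used only to illustrate the formulas; not part of the library. */
final class EntropyWeightSketch {

    /** Direct form: E(i) = -sum_j p_ij log(p_ij), with p_ij = tf_ij / gf_i. */
    static double entropyDirect(double[] tf) {
        double gf = Arrays.stream(tf).sum();
        double e = 0.0;
        for (double t : tf) {
            if (t > 0.0) {            // 0 * log(0) is treated as 0
                double p = t / gf;
                e -= p * Math.log(p);
            }
        }
        return e;
    }

    /** Rearranged form: E(i) = log(gf_i) - (sum_j tf_ij log(tf_ij)) / gf_i. */
    static double entropyFromRunningSum(double[] tf) {
        double gf = 0.0;
        double sumTfLogTf = 0.0;      // can be maintained incrementally per document
        for (double t : tf) {
            if (t > 0.0) {
                gf += t;
                sumTfLogTf += t * Math.log(t);
            }
        }
        return Math.log(gf) - sumTfLogTf / gf;
    }

    /** Global weight: W(i) = 1 - E(i) / log(n), where n is the number of documents. */
    static double globalWeight(double entropy, int n) {
        return 1.0 - entropy / Math.log(n);
    }

    public static void main(String[] args) {
        double[] uneven = {5, 1, 0, 2};   // term counts of one term across n = 4 documents
        double direct = entropyDirect(uneven);
        double running = entropyFromRunningSum(uneven);
        System.out.println("direct  E(i) = " + direct);
        System.out.println("running E(i) = " + running);
        System.out.println("W(i)         = " + globalWeight(running, uneven.length));
    }
}
```

A term spread evenly over all documents has E(i) = log(n) and so a weight near 0, while a term confined to a single document has E(i) = 0 and a weight of 1.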

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Field Summary
protected  Vector entropy
          A vector caching the global entropy weight of the document collection.
 
Fields inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractEntropyBasedGlobalTermWeighter
termEntropiesSum
 
Fields inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractFrequencyBasedGlobalTermWeighter
documentCount, termDocumentFrequencies, termGlobalFrequencies
 
Fields inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractGlobalTermWeighter
vectorFactory
 
Constructor Summary
EntropyGlobalTermWeighter()
          Creates a new EntropyGlobalTermWeighter.
EntropyGlobalTermWeighter(VectorFactory<? extends Vector> vectorFactory)
          Creates a new EntropyGlobalTermWeighter.
 
Method Summary
 void add(Vector counts)
          Adds a document to the model.
 EntropyGlobalTermWeighter clone()
          This makes public the clone method on the Object class and removes the exception that it throws.
 int getDimensionality()
          Gets the dimensionality of the global weights.
 Vector getEntropy()
          Gets the entropy weight (global weight) vector for all of the terms.
 Vector getGlobalWeights()
          Gets the current vector of global weights.
 boolean remove(Vector counts)
          Removes the document from the model.
protected  void setEntropy(Vector entropy)
          Sets the cached entropy weight vector.
 
Methods inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractEntropyBasedGlobalTermWeighter
getTermEntropiesSum, growVectors, initializeVectors, setTermEntropiesSum
 
Methods inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractFrequencyBasedGlobalTermWeighter
getDocumentCount, getTermDocumentFrequencies, getTermGlobalFrequencies, setDocumentCount, setTermDocumentFrequencies, setTermGlobalFrequencies
 
Methods inherited from class gov.sandia.cognition.text.term.vector.weighter.global.AbstractGlobalTermWeighter
getVectorFactory, setVectorFactory
 
Methods inherited from class gov.sandia.cognition.text.term.vector.AbstractVectorSpaceModel
add, addAll, remove, removeAll
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.text.term.vector.VectorSpaceModel
add, addAll, remove, removeAll
 

Field Detail

entropy

protected Vector entropy
A vector caching the global entropy weight of the document collection. It may be null. Use getEntropy() to compute the proper value if it has not been updated yet.
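A nullable cache of this kind is commonly paired with invalidation whenever the underlying sums change. The sketch below illustrates that general pattern in plain Java under stated assumptions; it is not the library's code. The class LazyEntropyCacheSketch and all of its members are hypothetical, arrays stand in for Vector, and assigning a weight of 0 to unseen terms (or when fewer than two documents are present) is an arbitrary choice made only for this example.

```java
/** Hypothetical illustration of a lazily computed, invalidated-on-change weight cache. */
final class LazyEntropyCacheSketch {
    private final double[] globalFrequencies; // gf_i
    private final double[] tfLogTfSums;       // sum_j tf_ij log(tf_ij)
    private int documentCount = 0;            // n
    private double[] cachedWeights = null;    // null means stale: recompute on demand

    LazyEntropyCacheSketch(int dimensionality) {
        globalFrequencies = new double[dimensionality];
        tfLogTfSums = new double[dimensionality];
    }

    void add(double[] counts) {
        update(counts, +1.0);
        documentCount++;
    }

    void remove(double[] counts) {
        update(counts, -1.0);
        documentCount--;
    }

    /** Adjusts the running sums for one document and drops the cached weights. */
    private void update(double[] counts, double sign) {
        for (int i = 0; i < counts.length; i++) {
            double tf = counts[i];
            if (tf > 0.0) {
                globalFrequencies[i] += sign * tf;
                tfLogTfSums[i] += sign * tf * Math.log(tf);
            }
        }
        cachedWeights = null;
    }

    /** Returns the cached weights, computing them first if the cache is stale. */
    double[] getWeights() {
        if (cachedWeights == null) {
            double[] w = new double[globalFrequencies.length];
            for (int i = 0; i < w.length; i++) {
                double gf = globalFrequencies[i];
                if (gf > 0.0 && documentCount > 1) {
                    double e = Math.log(gf) - tfLogTfSums[i] / gf;
                    w[i] = 1.0 - e / Math.log(documentCount);
                }
                // otherwise w[i] stays 0.0 (this sketch's own convention)
            }
            cachedWeights = w;
        }
        return cachedWeights;
    }

    public static void main(String[] args) {
        LazyEntropyCacheSketch model = new LazyEntropyCacheSketch(2);
        model.add(new double[] {1, 4});
        model.add(new double[] {1, 0});
        System.out.println(java.util.Arrays.toString(model.getWeights()));
    }
}
```

Because only add and remove touch the running sums, repeated reads between updates reuse the same cached array rather than recomputing it.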

Constructor Detail

EntropyGlobalTermWeighter

public EntropyGlobalTermWeighter()
Creates a new EntropyGlobalTermWeighter.


EntropyGlobalTermWeighter

public EntropyGlobalTermWeighter(VectorFactory<? extends Vector> vectorFactory)
Creates a new EntropyGlobalTermWeighter.

Parameters:
vectorFactory - The vector factory to use.
Method Detail

clone

public EntropyGlobalTermWeighter clone()
Description copied from class: AbstractCloneableSerializable
This makes public the clone method on the Object class and removes the exception that it throws. Its default behavior is to automatically create a clone of the exact type of object that the clone is called on and to copy all primitives but to keep all references, which means it is a shallow copy. Extensions of this class may want to override this method (but call super.clone()) to implement a "smart copy", that is, a copy that targets the most common use case for copying the object. Because the default behavior is a shallow copy, extending classes only need to handle fields that need a deeper copy (or those that need to be reset). Some of the methods in ObjectUtil may be helpful in implementing a custom clone method. Note: The contract of this method is that you must use super.clone() as the basis for your implementation.

Specified by:
clone in interface CloneableSerializable
Overrides:
clone in class AbstractEntropyBasedGlobalTermWeighter
Returns:
A clone of this object.
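The "smart copy" contract described for clone() can be illustrated with a minimal sketch in plain Java. The classes CloneableBaseSketch and WeightHolderSketch are hypothetical stand-ins, not library types; wrapping CloneNotSupportedException in an IllegalStateException is an assumption made for this example and may differ from what AbstractCloneableSerializable does.

```java
/** Hypothetical base class: makes clone() public and hides the checked exception. */
class CloneableBaseSketch implements Cloneable {
    @Override
    public CloneableBaseSketch clone() {
        try {
            // Object.clone() creates an instance of the exact runtime type (shallow copy).
            return (CloneableBaseSketch) super.clone();
        } catch (CloneNotSupportedException e) {
            // Not expected, since this class implements Cloneable.
            throw new IllegalStateException(e);
        }
    }
}

/** Hypothetical subclass: starts from super.clone() and deepens only what needs it. */
final class WeightHolderSketch extends CloneableBaseSketch {
    int documentCount;   // primitive: already copied by the shallow copy
    double[] weights;    // reference: shared with the original unless copied here

    @Override
    public WeightHolderSketch clone() {
        WeightHolderSketch copy = (WeightHolderSketch) super.clone();
        copy.weights = (this.weights == null) ? null : this.weights.clone();
        return copy;
    }

    public static void main(String[] args) {
        WeightHolderSketch original = new WeightHolderSketch();
        original.documentCount = 3;
        original.weights = new double[] {0.5, 1.0};

        WeightHolderSketch copy = original.clone();
        copy.weights[0] = 0.0;   // does not affect the original's array
        System.out.println(original.weights[0] + " vs " + copy.weights[0]);
    }
}
```

Only the array field needs explicit handling here; the primitive field is carried over by the shallow copy that super.clone() performs.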

add

public void add(Vector counts)
Description copied from interface: VectorSpaceModel
Adds a document to the model.

Specified by:
add in interface VectorSpaceModel
Overrides:
add in class AbstractEntropyBasedGlobalTermWeighter
Parameters:
counts - The document to add.

remove

public boolean remove(Vector counts)
Description copied from interface: VectorSpaceModel
Removes the document from the model.

Specified by:
remove in interface VectorSpaceModel
Overrides:
remove in class AbstractEntropyBasedGlobalTermWeighter
Parameters:
counts - The document to remove.
Returns:
True if this object changed as a result of the removal.

getDimensionality

public int getDimensionality()
Description copied from interface: GlobalTermWeighter
Gets the dimensionality of the global weights.

Returns:
The dimensionality of the global weights. -1 if unknown.

getGlobalWeights

public Vector getGlobalWeights()
Description copied from interface: GlobalTermWeighter
Gets the current vector of global weights.

Returns:
The global weights.

getEntropy

public Vector getEntropy()
Gets the entropy weight (global weight) vector for all of the terms.

Returns:
The entropy weight (global weight) vector for all of the terms.

setEntropy

protected void setEntropy(Vector entropy)
Sets the cached entropy weight vector.

Parameters:
entropy - The cached entropy weight vector.