gov.sandia.cognition.text.spelling
Class SimpleStatisticalSpellingCorrector

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.text.spelling.SimpleStatisticalSpellingCorrector
All Implemented Interfaces:
Evaluator<String,String>, CloneableSerializable, Serializable, Cloneable

@PublicationReference(author="Peter Norvig",
                      title="How to Write a Spelling Corrector",
                      year=2009,
                      type=WebPage,
                      url="http://norvig.com/spell-correct.html")
public class SimpleStatisticalSpellingCorrector
extends AbstractCloneableSerializable
implements Evaluator<String,String>

A simple statistical spelling corrector, based on word counts, that considers possible one- and two-character edits of the input word.

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Nested Class Summary
static class SimpleStatisticalSpellingCorrector.Learner
          A learner for the SimpleStatisticalSpellingCorrector.
 
Field Summary
protected  char[] alphabet
          The alphabet of lower-case characters used for replaces and inserts.
protected  DefaultDataDistribution<String> wordCounts
          Maps known words to the number of times they've been seen.
 
Constructor Summary
SimpleStatisticalSpellingCorrector()
          Creates a new, default SimpleStatisticalSpellingCorrector with a default alphabet.
SimpleStatisticalSpellingCorrector(char[] alphabet)
          Creates a new SimpleStatisticalSpellingCorrector with a given alphabet.
SimpleStatisticalSpellingCorrector(DefaultDataDistribution<String> wordCounts, char[] alphabet)
          Creates a new SimpleStatisticalSpellingCorrector with the given word counts and alphabet.
 
Method Summary
 void add(String word)
          Adds a word to the dictionary of counts for the spelling corrector.
 void add(String word, int count)
          Adds a given number of counts for a word to the dictionary of counts for the spelling corrector.
static char[] createDefaultAlphabet()
          Creates the default alphabet, which consists of the lower-case English letters.
 String evaluate(String word)
          Evaluates the function on the given input and returns the output.
 String findBest(Iterable<String> words, String defaultBestWord)
          Finds the best word in a given list of words by selecting the one with the highest count in the dictionary.
 char[] getAlphabet()
          Gets the alphabet of lower-case characters that can be used for replaces and inserts.
 DefaultDataDistribution<String> getWordCounts()
          Gets the dictionary of word counts.
protected  Set<String> knownTwoCharacterEdits(Iterable<String> oneCharacterEdits)
          Creates the set of known two-character edits for a given list of one-character edits.
protected  void possibleOneCharacterEdits(String word, Collection<String> result)
          Lists all possible one-character edits for a given word by looking at character deletes, transposes, replaces, and inserts.
 void setAlphabet(char[] alphabet)
          Sets the alphabet of lower-case characters that can be used for replaces and inserts.
 void setWordCounts(DefaultDataDistribution<String> wordCounts)
          Sets the dictionary of word counts.
 
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordCounts

protected DefaultDataDistribution<String> wordCounts
Maps known words to the number of times they've been seen.


alphabet

protected char[] alphabet
The alphabet of lower-case characters used for replaces and inserts.

Constructor Detail

SimpleStatisticalSpellingCorrector

public SimpleStatisticalSpellingCorrector()
Creates a new, default SimpleStatisticalSpellingCorrector with a default alphabet.


SimpleStatisticalSpellingCorrector

public SimpleStatisticalSpellingCorrector(char[] alphabet)
Creates a new SimpleStatisticalSpellingCorrector with a given alphabet.

Parameters:
alphabet - The alphabet to use.

SimpleStatisticalSpellingCorrector

public SimpleStatisticalSpellingCorrector(DefaultDataDistribution<String> wordCounts,
                                          char[] alphabet)
Creates a new SimpleStatisticalSpellingCorrector with the given word counts and alphabet.

Parameters:
wordCounts - The initial word counts.
alphabet - The alphabet to use.
Method Detail

createDefaultAlphabet

public static char[] createDefaultAlphabet()
Creates the default alphabet, which consists of the lower-case English letters.

Returns:
The default alphabet.
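
As an illustration of what such an alphabet looks like, here is a minimal sketch that builds the 26 lower-case English letters. The class name DefaultAlphabetSketch is hypothetical, and this is not the library's source.

```java
/** Hypothetical sketch; not the library's implementation. */
final class DefaultAlphabetSketch
{
    /** Builds the array 'a' through 'z'. */
    static char[] createDefaultAlphabet()
    {
        final char[] alphabet = new char[26];
        for (int i = 0; i < alphabet.length; i++)
        {
            // Lower-case English letters are contiguous in Unicode.
            alphabet[i] = (char) ('a' + i);
        }
        return alphabet;
    }
}
```

A text with accented or non-English letters would presumably need a different array passed to the constructor or to setAlphabet.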

add

public void add(String word)
Adds a word to the dictionary of counts for the spelling corrector.

Parameters:
word - The word to add an occurrence of.

add

public void add(String word,
                int count)
Adds a given number of counts for a word to the dictionary of counts for the spelling corrector.

Parameters:
word - The word to add.
count - The count of occurrences.
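
The two add overloads amount to incrementing a per-word counter. The sketch below shows that idea with a plain java.util.Map standing in for DefaultDataDistribution<String>. The class WordCountsSketch and its get method are hypothetical, and the sketch assumes words are stored exactly as given, with no case normalization, which may differ from the library's behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the dictionary of word counts. */
final class WordCountsSketch
{
    private final Map<String, Integer> counts = new HashMap<>();

    /** Adds one occurrence of the word. */
    void add(final String word)
    {
        add(word, 1);
    }

    /** Adds the given number of occurrences of the word. */
    void add(final String word, final int count)
    {
        // merge() inserts the count for a new word, or sums it
        // with the existing count for a known word.
        counts.merge(word, count, Integer::sum);
    }

    /** Returns the count for the word, or 0 if it was never added. */
    int get(final String word)
    {
        return counts.getOrDefault(word, 0);
    }
}
```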

evaluate

public String evaluate(String word)
Description copied from interface: Evaluator
Evaluates the function on the given input and returns the output.

Specified by:
evaluate in interface Evaluator<String,String>
Parameters:
word - The input to evaluate.
Returns:
The output produced by evaluating the input.

findBest

public String findBest(Iterable<String> words,
                       String defaultBestWord)
Finds the best word in a given list of words by selecting the one with the highest count in the dictionary. If none of the words is in the dictionary, the given default best word is returned.

Parameters:
words - The list of words.
defaultBestWord - The default word to return if none are in the dictionary.
Returns:
The word with the highest count.
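
The selection rule described above can be sketched as a single pass over the candidates. This is an illustration only: the class FindBestSketch is hypothetical, the counts are passed in as a plain Map rather than read from the corrector's wordCounts field, and keeping the first candidate on a tie is an assumption, since the documentation does not specify tie-breaking.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of highest-count selection. */
final class FindBestSketch
{
    static String findBest(
        final Iterable<String> words,
        final String defaultBestWord,
        final Map<String, Integer> wordCounts)
    {
        String best = defaultBestWord;
        int bestCount = 0;
        for (final String word : words)
        {
            // Unknown words count as zero, so they never win.
            final int count = wordCounts.getOrDefault(word, 0);
            if (count > bestCount)
            {
                best = word;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(final String[] args)
    {
        final Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 10);
        counts.put("thee", 2);
        // Prints "the", the candidate with the highest count.
        System.out.println(
            findBest(Arrays.asList("teh", "thee", "the"), "teh", counts));
    }
}
```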

possibleOneCharacterEdits

protected void possibleOneCharacterEdits(String word,
                                         Collection<String> result)
Lists all possible one-character edits for a given word by looking at character deletes, transposes, replaces, and inserts.

Parameters:
word - The word to get the edits for.
result - The collection to write the edits into.
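
The four edit types named above (deletes, transposes, replaces, inserts) can be sketched in plain Java as follows, in the style of the Norvig article cited in the class annotation. This is a minimal illustration, not the library's source: the class name OneCharacterEditsSketch is hypothetical, and the alphabet is passed as an explicit parameter so the example is self-contained.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch; not the library's implementation. */
final class OneCharacterEditsSketch
{
    /** Writes every string one edit away from word into result. */
    static void possibleOneCharacterEdits(
        final String word,
        final char[] alphabet,
        final Collection<String> result)
    {
        final int length = word.length();
        for (int i = 0; i <= length; i++)
        {
            final String head = word.substring(0, i);
            final String tail = word.substring(i);

            if (i < length)
            {
                // Delete the character at position i.
                result.add(head + tail.substring(1));
            }
            if (i + 1 < length)
            {
                // Transpose the characters at positions i and i + 1.
                result.add(head + tail.charAt(1) + tail.charAt(0)
                    + tail.substring(2));
            }
            for (final char c : alphabet)
            {
                if (i < length)
                {
                    // Replace the character at position i with c.
                    result.add(head + c + tail.substring(1));
                }
                // Insert c at position i (at the end when i == length).
                result.add(head + c + tail);
            }
        }
    }

    public static void main(final String[] args)
    {
        // A Set drops the duplicates that different edits can produce.
        final Set<String> edits = new LinkedHashSet<>();
        possibleOneCharacterEdits("ab", new char[] {'a', 'b'}, edits);
        System.out.println(edits);
    }
}
```

Note that a replace can reproduce the original word (replacing a character with itself), so callers that care should filter it out.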

knownTwoCharacterEdits

protected Set<String> knownTwoCharacterEdits(Iterable<String> oneCharacterEdits)
Creates the set of known two-character edits for a given list of one-character edits.

Parameters:
oneCharacterEdits - The list of one-character edits.
Returns:
The set of known two-character edits, which are the two-character edits that are in the dictionary.
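
One way to read this method: apply a second round of one-character edits to each first-round edit, and keep only the results found in the dictionary. The sketch below illustrates that reading under stated assumptions. The class KnownTwoCharacterEditsSketch and the helper editsOf are hypothetical, the dictionary is a plain Set<String> rather than the corrector's word counts, and the library's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the library's implementation. */
final class KnownTwoCharacterEditsSketch
{
    /** Compact one-edit generator: deletes, transposes, replaces, inserts. */
    static void editsOf(
        final String word, final char[] alphabet, final Collection<String> out)
    {
        for (int i = 0; i <= word.length(); i++)
        {
            final String head = word.substring(0, i);
            final String tail = word.substring(i);
            if (!tail.isEmpty())
            {
                out.add(head + tail.substring(1));
            }
            if (tail.length() > 1)
            {
                out.add(head + tail.charAt(1) + tail.charAt(0)
                    + tail.substring(2));
            }
            for (final char c : alphabet)
            {
                if (!tail.isEmpty())
                {
                    out.add(head + c + tail.substring(1));
                }
                out.add(head + c + tail);
            }
        }
    }

    /** Keeps only the second-round edits that the dictionary contains. */
    static Set<String> knownTwoCharacterEdits(
        final Iterable<String> oneCharacterEdits,
        final char[] alphabet,
        final Set<String> dictionary)
    {
        final Set<String> known = new LinkedHashSet<>();
        final List<String> secondEdits = new ArrayList<>();
        for (final String edit : oneCharacterEdits)
        {
            secondEdits.clear();
            editsOf(edit, alphabet, secondEdits);
            for (final String candidate : secondEdits)
            {
                // Filtering here avoids storing the full two-edit set.
                if (dictionary.contains(candidate))
                {
                    known.add(candidate);
                }
            }
        }
        return known;
    }
}
```

Filtering against the dictionary inside the loop matters because the unfiltered set of two-character edits grows roughly with the square of the one-character edit count.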

getWordCounts

public DefaultDataDistribution<String> getWordCounts()
Gets the dictionary of word counts.

Returns:
The word counts.

setWordCounts

public void setWordCounts(DefaultDataDistribution<String> wordCounts)
Sets the dictionary of word counts.

Parameters:
wordCounts - The dictionary of word counts.

getAlphabet

public char[] getAlphabet()
Gets the alphabet of lower-case characters that can be used for replaces and inserts.

Returns:
The alphabet of lower-case characters.

setAlphabet

public void setAlphabet(char[] alphabet)
Sets the alphabet of lower-case characters that can be used for replaces and inserts.

Parameters:
alphabet - The alphabet of lower-case characters.