gov.sandia.cognition.text.document.extractor
Class TextDocumentExtractor

java.lang.Object
  extended by gov.sandia.cognition.util.AbstractCloneableSerializable
      extended by gov.sandia.cognition.text.document.extractor.AbstractDocumentExtractor
          extended by gov.sandia.cognition.text.document.extractor.AbstractSingleDocumentExtractor
              extended by gov.sandia.cognition.text.document.extractor.TextDocumentExtractor
All Implemented Interfaces:
DocumentExtractor, SingleDocumentExtractor, CloneableSerializable, Serializable, Cloneable

public class TextDocumentExtractor
extends AbstractSingleDocumentExtractor

Extracts text from plain text documents.

Since:
3.0
Author:
Justin Basilico
See Also:
Serialized Form

Field Summary
static String CONTENT_TYPE
          The content type of plain text documents: "text/plain".
static List<String> DEFAULT_TEXT_FILE_EXTENSIONS
          The default set of file extensions for text files.
 
Constructor Summary
TextDocumentExtractor()
          Creates a new TextDocumentExtractor.
 
Method Summary
 boolean canExtract(URI uri)
          Determines if the given file can be extracted by this extractor.
 boolean canExtract(URLConnection connection)
          Determines if the given file can be extracted by this extractor.
 Document extractDocument(URLConnection connection)
          Attempts to extract a document from the given file.
 
Methods inherited from class gov.sandia.cognition.text.document.extractor.AbstractSingleDocumentExtractor
extractAll, extractAll, extractAll, extractDocument, extractDocument
 
Methods inherited from class gov.sandia.cognition.text.document.extractor.AbstractDocumentExtractor
canExtract
 
Methods inherited from class gov.sandia.cognition.util.AbstractCloneableSerializable
clone
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gov.sandia.cognition.text.document.extractor.DocumentExtractor
canExtract
 

Field Detail

CONTENT_TYPE

public static final String CONTENT_TYPE
The content type of plain text documents: "text/plain".

See Also:
Constant Field Values

DEFAULT_TEXT_FILE_EXTENSIONS

public static final List<String> DEFAULT_TEXT_FILE_EXTENSIONS
The default set of file extensions for text files.

Constructor Detail

TextDocumentExtractor

public TextDocumentExtractor()
Creates a new TextDocumentExtractor.

Method Detail

canExtract

public boolean canExtract(URI uri)
                   throws IOException
Description copied from interface: DocumentExtractor
Determines if the given file can be extracted by this extractor.

Parameters:
uri - The URI of the file to extract.
Returns:
true if this extractor can extract the file; false otherwise.
Throws:
IOException - If there is an IO error.
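
Example (illustrative sketch):
This page does not say how the URI check decides whether a file is plain text. One plausible approach, given that the class exposes DEFAULT_TEXT_FILE_EXTENSIONS, is to compare the file extension in the URI path against a list of known text extensions. The sketch below uses only the JDK. The class name ExtensionCheckSketch, the method hasTextExtension, and the extension values "txt" and "text" are assumptions made for illustration, not part of the library.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this is NOT the library's implementation.
final class ExtensionCheckSketch {

    // Hypothetical list: the actual contents of DEFAULT_TEXT_FILE_EXTENSIONS
    // are not given on this page.
    static final List<String> ASSUMED_TEXT_EXTENSIONS = Arrays.asList("txt", "text");

    // Returns true if the last path segment of the URI ends in an assumed text extension.
    static boolean hasTextExtension(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return false; // opaque URIs (for example mailto:) have no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return false; // no extension, or the name ends with a dot
        }
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ASSUMED_TEXT_EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(hasTextExtension(URI.create("file:///tmp/notes.TXT")));  // true
        System.out.println(hasTextExtension(URI.create("file:///tmp/report.pdf"))); // false
    }
}
```

The comparison is case-insensitive so that "notes.TXT" and "notes.txt" are treated alike; whether the library does the same is not stated here.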

canExtract

public boolean canExtract(URLConnection connection)
                   throws IOException
Description copied from interface: DocumentExtractor
Determines if the given file can be extracted by this extractor.

Parameters:
connection - The connection to the file to extract.
Returns:
true if this extractor can extract the file; false otherwise.
Throws:
IOException - If there is an IO error.
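
Example (illustrative sketch):
A connection-based check could plausibly compare the content type reported by the connection with CONTENT_TYPE ("text/plain"). The page does not confirm that the library works this way, so treat the following as a sketch under that assumption. The class name ContentTypeCheckSketch and both method names are hypothetical; URLConnection.getContentType() is the standard JDK call.

```java
import java.net.URLConnection;
import java.util.Locale;

// Illustrative sketch only; this is NOT the library's implementation.
final class ContentTypeCheckSketch {

    // Same value as the CONTENT_TYPE constant documented above.
    static final String CONTENT_TYPE = "text/plain";

    // Returns true if a Content-Type header value names the text/plain media type.
    // Parameters such as "; charset=UTF-8" are ignored.
    static boolean isPlainTextHeader(String headerValue) {
        if (headerValue == null) {
            return false; // the content type is unknown
        }
        int semicolon = headerValue.indexOf(';');
        String mediaType = semicolon >= 0 ? headerValue.substring(0, semicolon) : headerValue;
        return CONTENT_TYPE.equals(mediaType.trim().toLowerCase(Locale.ROOT));
    }

    // Applies the header check to whatever content type the connection reports.
    static boolean reportsPlainText(URLConnection connection) {
        return isPlainTextHeader(connection.getContentType());
    }

    public static void main(String[] args) {
        System.out.println(isPlainTextHeader("text/plain; charset=UTF-8")); // true
        System.out.println(isPlainTextHeader("text/html"));                 // false
    }
}
```

Keeping the string parsing separate from the connection makes the rule easy to exercise without opening a network connection.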

extractDocument

public Document extractDocument(URLConnection connection)
                         throws IOException
Description copied from interface: SingleDocumentExtractor
Attempts to extract a document from the given file.

Parameters:
connection - The connection to the file to extract.
Returns:
The document extracted from the given file.
Throws:
IOException - If there is an IO error.
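
Example (illustrative sketch):
For plain text, extraction amounts to reading the characters behind the connection. The sketch below shows that step with the JDK alone and returns a String rather than a Document, because the Document type is not described on this page. The class name ReadTextSketch, its method names, and the choice of UTF-8 are assumptions; the library may select the character encoding differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; this is NOT the library's implementation.
final class ReadTextSketch {

    // Reads every byte from the connection and decodes it with the given charset.
    static String readText(URLConnection connection, Charset charset) throws IOException {
        try (InputStream in = connection.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), charset);
        }
    }

    // Demo helper: writes the text to a temporary file, reads it back through
    // a file: URLConnection, and deletes the file afterwards.
    static String roundTripThroughTempFile(String text) {
        try {
            Path file = Files.createTempFile("read-text-sketch", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                URLConnection connection = file.toUri().toURL().openConnection();
                return readText(connection, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripThroughTempFile("hello\nworld"));
    }
}
```

As in the documented method, an I/O failure while reading surfaces as an IOException from readText; only the demo helper wraps it in an unchecked exception for convenience.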