gov.sandia.cognition.learning.data
Class DatasetUtil

java.lang.Object
  extended by gov.sandia.cognition.learning.data.DatasetUtil

public class DatasetUtil
extends Object

Static class containing utility methods for handling Collections of data in the learning package.

Since:
2.0
Author:
Kevin R. Dixon

Constructor Summary
DatasetUtil()
           
 
Method Summary
static ArrayList<Vector> appendBias(Collection<? extends Vector> dataset)
          Appends a bias (constant 1.0) to the end of each Vector in the dataset; the original dataset is unmodified.
static ArrayList<Vector> appendBias(Collection<? extends Vector> dataset, double biasValue)
          Appends "biasValue" to the end of each Vector in the dataset; the original dataset is unmodified.
static
<EntryType>
MultiCollection<EntryType>
asMultiCollection(Collection<EntryType> collection)
          Takes a collection and returns a multi-collection version of that collection.
static void assertDimensionalitiesAllEqual(Iterable<? extends Vectorizable> data)
          Asserts that all of the dimensionalities of the vectors in the given set of data are the same.
static void assertInputDimensionalitiesAllEqual(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data)
          Asserts that all of the dimensionalities of the input vectors in the given set of input-output pairs are the same.
static void assertInputDimensionalitiesAllEqual(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data, int dimensionality)
          Asserts that all of the dimensionalities of the input vectors in the given set of input-output pairs equal the given dimensionality.
static Collection<Vector> asVectorCollection(Collection<? extends Vectorizable> collection)
          Takes a collection of Vectorizable objects and returns a collection of Vector objects of the same size.
static Matrix computeOuterProductDataMatrix(ArrayList<? extends Vector> data)
          Computes the outer-product Matrix of the given set of data: XXt = [ x1 x2 ... xn ] * [ x1 x2 ... xn ]^T.
static double computeOutputMean(Collection<? extends InputOutputPair<?,? extends Number>> data)
          Computes the mean of the output data.
static double computeOutputVariance(Collection<? extends InputOutputPair<?,? extends Number>> data)
          Computes the variance of the output of a given set of input-output pairs.
static double computeWeightedOutputMean(Collection<? extends InputOutputPair<?,? extends Number>> data)
          Computes the weighted mean of the output data.
static
<OutputType>
DataDistribution<OutputType>
countOutputValues(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
          Creates a data histogram over the output values from the given data.
static ArrayList<ArrayList<Double>> decoupleVectorDataset(Collection<? extends Vector> dataset)
          Takes a dataset of M-dimensional Vectors and turns it into M datasets of Doubles.
static ArrayList<ArrayList<InputOutputPair<Double,Double>>> decoupleVectorPairDataset(Collection<? extends InputOutputPair<? extends Vector,? extends Vector>> dataset)
          Takes a set of equal-dimension Vector-Vector InputOutputPairs and turns them into a collection of Double-Double InputOutputPairs.
static
<OutputType>
Set<OutputType>
findUniqueOutputs(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
          Creates a set containing the unique output values from the given data.
static int getDimensionality(Iterable<? extends Vectorizable> data)
          Gets the dimensionality of the vectors in the given set of data.
static int getInputDimensionality(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data)
          Gets the dimensionality of the input vectors in the given set of input-output pairs.
static double getWeight(InputOutputPair<?,?> pair)
          Gets the weight of a given input-output pair.
static double getWeight(TargetEstimatePair<?,?> pair)
          Gets the weight of a given target-estimate pair.
static
<InputType>
List<InputType>
inputsList(Iterable<? extends InputOutputPair<? extends InputType,?>> data)
          Creates a list containing all of the input values from the given data.
static
<OutputType>
List<OutputType>
outputsList(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
          Creates a list containing all of the output values from the given data.
static
<DataType> DefaultPair<LinkedList<DataType>,LinkedList<DataType>>
splitDatasets(Collection<? extends InputOutputPair<? extends DataType,Boolean>> data)
          Splits a dataset of input-output pairs into two datasets, one for the inputs that have a "true" output and another for the inputs that have a "false" output.
static
<InputType,CategoryType>
Map<CategoryType,List<InputType>>
splitOnOutput(Iterable<? extends InputOutputPair<? extends InputType,? extends CategoryType>> data)
          Splits a dataset according to its output value (usually a category) so that all the inputs for that category are given in a list.
static double sumWeights(Collection<? extends InputOutputPair<?,?>> data)
          Gets the sum of the weights of the elements of the dataset.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DatasetUtil

public DatasetUtil()
Method Detail

appendBias

public static ArrayList<Vector> appendBias(Collection<? extends Vector> dataset)
Appends a bias (constant 1.0) to the end of each Vector in the dataset; the original dataset is unmodified. The resulting Vectors will have one greater dimension and look like: [ x1 x2 ] -> [ x1 x2 1.0 ]

Parameters:
dataset - Dataset to append a bias term to. The Vectors can be of different dimensionality.
Returns:
Dataset with 1.0 appended to each Vector in the dataset

appendBias

public static ArrayList<Vector> appendBias(Collection<? extends Vector> dataset,
                                           double biasValue)
Appends "biasValue" to the end of each Vector in the dataset; the original dataset is unmodified. The resulting Vectors will have one greater dimension and look like: [ x1 x2 ] -> [ x1 x2 biasValue ]

Parameters:
dataset - Dataset to append a bias term to. The Vectors can be of different dimensionality.
biasValue - Bias value to append to the samples
Returns:
Dataset with "biasValue" appended to each Vector in the dataset
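
The behavior described for both appendBias overloads can be illustrated with a minimal sketch. It uses plain double[] arrays as a stand-in for the library's Vector type, and the class name AppendBiasSketch is hypothetical; it is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in: double[] plays the role of the library's Vector.
final class AppendBiasSketch {

    // Returns new arrays with biasValue appended; the input arrays are not modified.
    static ArrayList<double[]> appendBias(Collection<double[]> dataset, double biasValue) {
        ArrayList<double[]> result = new ArrayList<double[]>(dataset.size());
        for (double[] x : dataset) {
            // copyOf allocates a fresh array one element longer than the original.
            double[] extended = Arrays.copyOf(x, x.length + 1);
            extended[x.length] = biasValue;
            result.add(extended);
        }
        return result;
    }

    // Overload mirroring the documented default bias of 1.0.
    static ArrayList<double[]> appendBias(Collection<double[]> dataset) {
        return appendBias(dataset, 1.0);
    }

    public static void main(String[] args) {
        // Vectors of different dimensionality are accepted, as the parameter notes say.
        List<double[]> data = Arrays.asList(new double[] {2.0, 3.0}, new double[] {4.0});
        for (double[] row : appendBias(data)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because each row is copied before the bias is written, the caller's dataset keeps its original dimensions.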

decoupleVectorPairDataset

public static ArrayList<ArrayList<InputOutputPair<Double,Double>>> decoupleVectorPairDataset(Collection<? extends InputOutputPair<? extends Vector,? extends Vector>> dataset)
Takes a set of equal-dimension Vector-Vector InputOutputPairs and turns them into a collection of Double-Double InputOutputPairs. This is useful when each element of the Vector-Vector pairs can be treated as independent of the other elements.

Parameters:
dataset - Collection of Vector-Vector InputOutputPairs. All Vectors (both inputs and outputs) must have equal dimension.
Returns:
ArrayList of ArrayList of Double-Double InputOutputPairs. The outer ArrayList contains a dataset for each element in the vector (and thus there are as many ArrayList<InputOutputPair<Double,Double>> as there are elements in the Vectors). Each ArrayList<InputOutputPair<Double,Double>> has as many elements as the original dataset.

decoupleVectorDataset

public static ArrayList<ArrayList<Double>> decoupleVectorDataset(Collection<? extends Vector> dataset)
Takes a dataset of M-dimensional Vectors and turns it into M datasets of Doubles.

Parameters:
dataset - M-dimensional Vectors. An IllegalArgumentException is thrown if the Vectors do not all have the same dimensionality.
Returns:
M datasets of dataset.size() Doubles
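
The decoupling described above amounts to transposing the dataset: N vectors of dimension M become M lists of N values. The following is a minimal sketch under the assumption that double[] stands in for Vector; the class name DecoupleSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical stand-in: double[] plays the role of the library's Vector.
final class DecoupleSketch {

    // Turns N vectors of dimension M into M lists holding N values each.
    static ArrayList<ArrayList<Double>> decouple(Collection<double[]> dataset) {
        ArrayList<ArrayList<Double>> columns = new ArrayList<ArrayList<Double>>();
        int m = -1;
        for (double[] x : dataset) {
            if (m < 0) {
                // The first vector fixes the dimensionality M.
                m = x.length;
                for (int i = 0; i < m; i++) {
                    columns.add(new ArrayList<Double>(dataset.size()));
                }
            } else if (x.length != m) {
                throw new IllegalArgumentException("All vectors must have the same dimensionality");
            }
            for (int i = 0; i < m; i++) {
                columns.get(i).add(x[i]);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(decouple(Arrays.asList(
                new double[] {1.0, 2.0}, new double[] {3.0, 4.0}, new double[] {5.0, 6.0})));
    }
}
```

The same loop structure would apply to the pair version, with each output list holding Double-Double pairs instead of single values.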

splitDatasets

public static <DataType> DefaultPair<LinkedList<DataType>,LinkedList<DataType>> splitDatasets(Collection<? extends InputOutputPair<? extends DataType,Boolean>> data)
Splits a dataset of input-output pairs into two datasets, one for the inputs that have a "true" output and another for the inputs that have a "false" output.

Type Parameters:
DataType - The type of the data.
Parameters:
data - Collection of InputOutputPairs to split according to the output flag
Returns:
DefaultPair of LinkedLists where the first dataset corresponds to the inputs where the output was "true" and the second dataset corresponds to the inputs where the output was "false".
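
As an illustration of the true/false split, here is a minimal sketch that uses java.util.Map.Entry as a stand-in for both InputOutputPair (key = input, value = output) and DefaultPair (key = "true" list, value = "false" list). These stand-ins and the class name SplitDatasetsSketch are assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: Map.Entry replaces InputOutputPair and DefaultPair.
final class SplitDatasetsSketch {

    // Key of the result holds inputs labeled true, value holds inputs labeled false.
    static <T> Map.Entry<LinkedList<T>, LinkedList<T>> splitDatasets(
            Collection<? extends Map.Entry<? extends T, Boolean>> data) {
        LinkedList<T> trues = new LinkedList<T>();
        LinkedList<T> falses = new LinkedList<T>();
        for (Map.Entry<? extends T, Boolean> pair : data) {
            // A null output would fail on unboxing here; this sketch assumes non-null flags.
            if (pair.getValue()) {
                trues.add(pair.getKey());
            } else {
                falses.add(pair.getKey());
            }
        }
        return new SimpleEntry<LinkedList<T>, LinkedList<T>>(trues, falses);
    }

    public static void main(String[] args) {
        List<SimpleEntry<String, Boolean>> data = Arrays.asList(
                new SimpleEntry<String, Boolean>("a", true),
                new SimpleEntry<String, Boolean>("b", false),
                new SimpleEntry<String, Boolean>("c", true));
        Map.Entry<LinkedList<String>, LinkedList<String>> split = splitDatasets(data);
        System.out.println("true: " + split.getKey() + ", false: " + split.getValue());
    }
}
```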

splitOnOutput

public static <InputType,CategoryType> Map<CategoryType,List<InputType>> splitOnOutput(Iterable<? extends InputOutputPair<? extends InputType,? extends CategoryType>> data)
Splits a dataset according to its output value (usually a category) so that all the inputs for that category are given in a list. It maps the category value to its list.

Type Parameters:
InputType - The type of the input values.
CategoryType - The type of the output values.
Parameters:
data - The input-output pairs to split.
Returns:
A mapping of category to a list of all of the inputs for that category.
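
Grouping inputs by their output category can be sketched as follows. Map.Entry stands in for InputOutputPair (key = input, value = category), and the use of a LinkedHashMap is a choice of this sketch; the documentation does not say which Map implementation the library returns.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: Map.Entry replaces InputOutputPair.
final class SplitOnOutputSketch {

    // Maps each category to the list of inputs that carry it.
    static <I, C> Map<C, List<I>> splitOnOutput(
            Iterable<? extends Map.Entry<? extends I, ? extends C>> data) {
        // LinkedHashMap keeps categories in first-seen order (an assumption of this sketch).
        Map<C, List<I>> byCategory = new LinkedHashMap<C, List<I>>();
        for (Map.Entry<? extends I, ? extends C> pair : data) {
            List<I> inputs = byCategory.get(pair.getValue());
            if (inputs == null) {
                inputs = new ArrayList<I>();
                byCategory.put(pair.getValue(), inputs);
            }
            inputs.add(pair.getKey());
        }
        return byCategory;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Double, String>> data = Arrays.asList(
                new SimpleEntry<Double, String>(1.0, "cat"),
                new SimpleEntry<Double, String>(2.0, "dog"),
                new SimpleEntry<Double, String>(3.0, "cat"));
        System.out.println(splitOnOutput(data));
    }
}
```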

computeOuterProductDataMatrix

public static Matrix computeOuterProductDataMatrix(ArrayList<? extends Vector> data)
Computes the outer-product Matrix of the given set of data: XXt = [ x1 x2 ... xn ] * [ x1 x2 ... xn ]^T. The outer-product data Matrix is useful for tasks such as computing the Principal Components Analysis of the dataset. For example, finding the eigenvectors of the outer-product data Matrix is equivalent to finding the left singular Vectors ("U" from the SingularValueDecomposition) of the dataset. Note that if the input dataset has a size of "N" and each Vector in the dataset has "M" dimensions, then the return Matrix (XXt) is an (MxM) Matrix. This method computes the return Matrix without explicitly forming the data matrix, potentially saving quite a lot of memory.

Parameters:
data - Input dataset where each of "N" Vectors has dimension of "M"
Returns:
Outer product Matrix of the input dataset, having dimensions (MxM).
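
One way to obtain XXt without forming the (MxN) data matrix is to accumulate the per-sample outer products, since XXt equals the sum over i of xi * xi^T. The sketch below shows that accumulation with plain arrays standing in for Vector and Matrix; whether the library uses exactly this loop is not stated in the documentation, and the class name OuterProductSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: double[] replaces Vector, double[][] replaces Matrix.
final class OuterProductSketch {

    // Accumulates sum_i x_i * x_i^T, an M x M matrix, one sample at a time.
    static double[][] outerProductSum(List<double[]> data) {
        if (data.isEmpty()) {
            return new double[0][0];
        }
        int m = data.get(0).length;
        double[][] xxt = new double[m][m];
        for (double[] x : data) {
            if (x.length != m) {
                throw new IllegalArgumentException("All vectors must have dimension " + m);
            }
            for (int r = 0; r < m; r++) {
                for (int c = 0; c < m; c++) {
                    xxt[r][c] += x[r] * x[c];
                }
            }
        }
        return xxt;
    }

    public static void main(String[] args) {
        double[][] xxt = outerProductSum(Arrays.asList(
                new double[] {1.0, 2.0}, new double[] {3.0, 4.0}));
        System.out.println(Arrays.deepToString(xxt));
    }
}
```

Only the (MxM) accumulator is held in memory, which is where the saving comes from when N is much larger than M.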

computeOutputMean

public static double computeOutputMean(Collection<? extends InputOutputPair<?,? extends Number>> data)
Computes the mean of the output data.

Parameters:
data - The data to compute the mean of the output.
Returns:
The mean of the output values of the given data.

computeWeightedOutputMean

public static double computeWeightedOutputMean(Collection<? extends InputOutputPair<?,? extends Number>> data)
Computes the weighted mean of the output data.

Parameters:
data - The data to compute the weighted mean of the output.
Returns:
The weighted mean of the output values of the given data.
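
The difference between the plain and the weighted mean can be shown with a short sketch. It assumes the usual definition of a weighted mean, sum(w_i * y_i) / sum(w_i); the documentation does not spell out the formula the library uses, and the parallel-array representation and the class name OutputMeanSketch are hypothetical.

```java
// Hypothetical sketch: outputs and weights are given as parallel arrays.
final class OutputMeanSketch {

    // Arithmetic mean; an empty array yields NaN in this sketch.
    static double mean(double[] outputs) {
        double sum = 0.0;
        for (double y : outputs) {
            sum += y;
        }
        return sum / outputs.length;
    }

    // Weighted mean, assumed here to be sum(w_i * y_i) / sum(w_i).
    static double weightedMean(double[] outputs, double[] weights) {
        if (outputs.length != weights.length) {
            throw new IllegalArgumentException("outputs and weights must have the same length");
        }
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < outputs.length; i++) {
            weightedSum += weights[i] * outputs[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1.0, 3.0}));
        System.out.println(weightedMean(new double[] {1.0, 3.0}, new double[] {3.0, 1.0}));
    }
}
```

With every weight equal to 1.0, which is the default weight described under getWeight, the two functions agree.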

computeOutputVariance

public static double computeOutputVariance(Collection<? extends InputOutputPair<?,? extends Number>> data)
Computes the variance of the output of a given set of input-output pairs.

Parameters:
data - The data.
Returns:
The variance of the output of the data.

findUniqueOutputs

public static <OutputType> Set<OutputType> findUniqueOutputs(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
Creates a set containing the unique output values from the given data.

Type Parameters:
OutputType - The type of the output values.
Parameters:
data - The data to collect the unique output values from.
Returns:
The set of unique output values. Implemented as a linked hash set.

countOutputValues

public static <OutputType> DataDistribution<OutputType> countOutputValues(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
Creates a data histogram over the output values from the given data.

Type Parameters:
OutputType - The type of the output values.
Parameters:
data - The data to collect the output values from.
Returns:
The histogram of output values.
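
Collecting unique outputs and counting them can be sketched with standard collections. A LinkedHashSet matches the implementation note for findUniqueOutputs; the Map of counts is only a stand-in for DataDistribution, and Map.Entry stands in for InputOutputPair (value = output). The class name OutputValuesSketch is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: Map.Entry replaces InputOutputPair, Map replaces DataDistribution.
final class OutputValuesSketch {

    // Unique outputs, iterated in the order they were first seen.
    static <O> Set<O> findUniqueOutputs(Iterable<? extends Map.Entry<?, ? extends O>> data) {
        Set<O> unique = new LinkedHashSet<O>();
        for (Map.Entry<?, ? extends O> pair : data) {
            unique.add(pair.getValue());
        }
        return unique;
    }

    // Number of occurrences of each output value.
    static <O> Map<O, Integer> countOutputValues(Iterable<? extends Map.Entry<?, ? extends O>> data) {
        Map<O, Integer> counts = new LinkedHashMap<O, Integer>();
        for (Map.Entry<?, ? extends O> pair : data) {
            Integer previous = counts.get(pair.getValue());
            counts.put(pair.getValue(), previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> data = Arrays.asList(
                new SimpleEntry<Integer, String>(1, "b"),
                new SimpleEntry<Integer, String>(2, "a"),
                new SimpleEntry<Integer, String>(3, "b"));
        System.out.println(findUniqueOutputs(data));
        System.out.println(countOutputValues(data));
    }
}
```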

inputsList

public static <InputType> List<InputType> inputsList(Iterable<? extends InputOutputPair<? extends InputType,?>> data)
Creates a list containing all of the input values from the given data.

Type Parameters:
InputType - The type of the input values.
Parameters:
data - The data to collect the input values from.
Returns:
A list containing the input values.

outputsList

public static <OutputType> List<OutputType> outputsList(Iterable<? extends InputOutputPair<?,? extends OutputType>> data)
Creates a list containing all of the output values from the given data.

Type Parameters:
OutputType - The type of the output values.
Parameters:
data - The data to collect the output values from.
Returns:
A list containing the output values.

asMultiCollection

public static <EntryType> MultiCollection<EntryType> asMultiCollection(Collection<EntryType> collection)
Takes a collection and returns a multi-collection version of that collection. If the given collection is a multi-collection, it casts it to that value and returns it. If it is not a multi-collection, it creates a new, singleton multi-collection with the given collection and returns it.

Type Parameters:
EntryType - The entry type of the collection.
Parameters:
collection - A collection.
Returns:
A multi-collection version of the given collection.

asVectorCollection

public static Collection<Vector> asVectorCollection(Collection<? extends Vectorizable> collection)
Takes a collection of Vectorizable objects and returns a collection of Vector objects of the same size.

Parameters:
collection - The collection of Vectorizable objects to convert.
Returns:
The corresponding collection of Vector objects.

getInputDimensionality

public static int getInputDimensionality(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data)
Gets the dimensionality of the input vectors in the given set of input-output pairs. It finds the first non-null vector and returns its dimensionality. If there are no non-null vectors, then -1 is returned.

Parameters:
data - The data to find the input dimensionality of.
Returns:
The dimensionality of the first non-null input in the given data. -1 if there are no non-null inputs.

assertInputDimensionalitiesAllEqual

public static void assertInputDimensionalitiesAllEqual(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data)
Asserts that all of the dimensionalities of the input vectors in the given set of input-output pairs are the same.

Parameters:
data - A collection of input-output pairs.
Throws:
DimensionalityMismatchException - If the dimensionalities are not all equal.

assertInputDimensionalitiesAllEqual

public static void assertInputDimensionalitiesAllEqual(Iterable<? extends InputOutputPair<? extends Vectorizable,?>> data,
                                                       int dimensionality)
Asserts that all of the dimensionalities of the input vectors in the given set of input-output pairs equal the given dimensionality.

Parameters:
data - A collection of input-output pairs.
dimensionality - The dimensionality that all the inputs must have.
Throws:
DimensionalityMismatchException - If the dimensionalities are not all equal to the given dimensionality.

getDimensionality

public static int getDimensionality(Iterable<? extends Vectorizable> data)
Gets the dimensionality of the vectors in the given set of data. It finds the first non-null vector and returns its dimensionality. If there are no non-null vectors, then -1 is returned.

Parameters:
data - The data to find the dimensionality of.
Returns:
The dimensionality of the first non-null vector in the given data. -1 if there are no non-null vectors.

assertDimensionalitiesAllEqual

public static void assertDimensionalitiesAllEqual(Iterable<? extends Vectorizable> data)
Asserts that all of the dimensionalities of the vectors in the given set of data are the same.

Parameters:
data - A collection of data.
Throws:
DimensionalityMismatchException - If the dimensionalities are not all equal.
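
The lookup and the assertion described above can be sketched together. Here double[] stands in for Vectorizable, IllegalArgumentException stands in for DimensionalityMismatchException, and skipping null entries during the check is an assumption of this sketch rather than documented behavior.

```java
import java.util.Arrays;

// Hypothetical stand-in: double[] replaces Vectorizable.
final class DimensionalitySketch {

    // Dimensionality of the first non-null vector, or -1 if there is none.
    static int getDimensionality(Iterable<double[]> data) {
        for (double[] x : data) {
            if (x != null) {
                return x.length;
            }
        }
        return -1;
    }

    // Throws if any non-null vector differs in length from the first non-null one.
    static void assertDimensionalitiesAllEqual(Iterable<double[]> data) {
        int expected = getDimensionality(data);
        for (double[] x : data) {
            // Null entries are skipped here; that is a choice made for this sketch.
            if (x != null && x.length != expected) {
                throw new IllegalArgumentException(
                        "Expected dimensionality " + expected + " but found " + x.length);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(getDimensionality(Arrays.<double[]>asList(null, new double[3])));
        assertDimensionalitiesAllEqual(Arrays.asList(new double[2], new double[2]));
    }
}
```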

getWeight

public static double getWeight(InputOutputPair<?,?> pair)
Gets the weight of a given input-output pair. If it is a weighted input-output pair (implements the WeightedInputOutputPair interface), then it casts it to retrieve its weight. Otherwise, it returns 1.0.

Parameters:
pair - The pair to get the weight of.
Returns:
The weight of the given pair, if it exists, otherwise 1.0.

getWeight

public static double getWeight(TargetEstimatePair<?,?> pair)
Gets the weight of a given target-estimate pair. If it is a weighted target-estimate pair (implements the WeightedTargetEstimatePair interface), then it casts it to retrieve its weight. Otherwise, it returns 1.0.

Parameters:
pair - The pair to get the weight of.
Returns:
The weight of the given pair, if it exists, otherwise 1.0.

sumWeights

public static double sumWeights(Collection<? extends InputOutputPair<?,?>> data)
Gets the sum of the weights of the elements of the dataset. It loops over the items and calls getWeight on each one.

Parameters:
data - The dataset to compute the sum of the weights of.
Returns:
The sum of the weights of the elements in the dataset.
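
The weight lookup with its 1.0 fallback, and the summation built on it, can be sketched as follows. The nested interfaces Pair and WeightedPair are hypothetical stand-ins for InputOutputPair and WeightedInputOutputPair, and the class name WeightSketch is likewise an assumption.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical stand-ins for InputOutputPair and WeightedInputOutputPair.
final class WeightSketch {

    interface Pair { }

    interface WeightedPair extends Pair {
        double getWeight();
    }

    // Weighted pairs report their own weight; every other pair counts as 1.0.
    static double getWeight(Pair pair) {
        if (pair instanceof WeightedPair) {
            return ((WeightedPair) pair).getWeight();
        }
        return 1.0;
    }

    // Loops over the items and adds up getWeight for each one.
    static double sumWeights(Collection<? extends Pair> data) {
        double sum = 0.0;
        for (Pair pair : data) {
            sum += getWeight(pair);
        }
        return sum;
    }

    public static void main(String[] args) {
        Pair plain = new Pair() { };
        WeightedPair heavy = new WeightedPair() {
            public double getWeight() {
                return 2.5;
            }
        };
        System.out.println(sumWeights(Arrays.asList(plain, heavy, plain)));
    }
}
```

For a dataset with no weighted pairs, the sum is simply the number of elements.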