Package gPy :: Module Data :: Class Data

Class Data

Factors whose data is stored in a table in an sqlite database

There is one row in the table for each non-zero value. These are meant to be used for factors with many variables, but not too many non-zero values: contingency tables for example. This object is a sensible choice when most of the information required from the data can be gleaned in a few passes over it.

At present all databases are stored in RAM.
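
The storage idea described above (one row per non-zero cell, held in an in-memory sqlite database) can be sketched as follows. This is only an illustration written for Python 3: the table name `counts`, the column names and the data are hypothetical and are not taken from gPy.

```python
import sqlite3

# Hypothetical schema: one column per variable plus a count column.
db = sqlite3.connect(":memory:")  # the database lives in RAM only
db.execute("CREATE TABLE counts (a INTEGER, b INTEGER, c INTEGER, cnt INTEGER)")

# Three binary variables give 8 possible cells, but only 3 are non-zero,
# so only 3 rows are stored.
rows = [(0, 0, 1, 12), (1, 0, 1, 5), (1, 1, 0, 7)]
db.executemany("INSERT INTO counts VALUES (?, ?, ?, ?)", rows)

# A single pass over the table yields the number of datapoints.
total = db.execute("SELECT SUM(cnt) FROM counts").fetchone()[0]
stored_rows = db.execute("SELECT COUNT(*) FROM counts").fetchone()[0]
```

With many variables the number of possible cells grows exponentially, while the number of stored rows is bounded by the number of datapoints.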

Instance Methods

__init__(self, data=None, variables=(), domain=None, new_domain_variables=None, must_be_new=False, check=False, convert=False)
Initialise a Data object

__del__(self)

__getstate__(self)

__iter__(self)
Iterates over those joint instantiations of the data which have non-zero counts associated with them
Returns: Iterator

__setstate__(self, state)

h_score(self, precision=1.0)
Return the gPyC.lgh score for the entire data set
Returns: Float

marginal(self, variables)
Return the marginal dataset containing only variables
Returns: Data object

conditional_entropy(self, x, y)
Return the conditional entropy H(x|y) for variable sets x and y using the empirical distribution given by the data

entropy(self, x)
Return the entropy of the marginal empirical distribution given by x and the data

_bic_search2(self, child, n, child_df, old_parents, further_parents, store, pa_lim, highest_llh)

_bic_search(self, child, n, child_df, lower, bic_lower, upper, upper_bound, store)
Compute the BIC score for every parent set for child which is a proper superset of lower and a subset of the union of lower and upper, and add it to the dictionary store.

bic_search(self, child, pa_lim)
Branch and bound search for all parent sets for child which do not have a higher scoring subset

loglikelihood(self, adg)
The log-likelihood of adg with MLE parameters (up to an additive constant)

bic_complexity_penalty(self, adg)

mutual_information(self, x, y)
Return the mutual information between the variable sets x and y in the empirical distribution determined by the data
Returns: Float

total_count(self)
Return the number of datapoints in the data
Returns: Integer

populate(self, records, variables=None)
Simply inserts the records into the database

qhs(self)

ub(self, qpa, alpha, ri)
The upper bound self provides on the score of a smaller parent set

qh(self, h=0)
Return the number of instantiations (i.e. cells) having a value greater than h

make_family_scores_naively(self, pa_size_lim=4, precision=10.0, batch_size=65536)
Make scores for all parent sets for all variables where the size of the parent set is at most pa_size_lim.

makeFactorsn(self, n, block=1000000)
Yield counts for all marginals with n variables in blocks of block
Returns: List

_countsfromdata(self, count, mults, marginals_including)

makeFactorsn_old(self, n)
Return counts for all marginals with n variables
Returns: List

family_score(self, child, parents, precision=1.0)

score_adg(self, adg, precision=1.0)
Get BDeu score for an adg

h_scores(self, precision=1.0, textfun=<type 'str'>)

makeCPT(self, child, parents, force_cpt=False, check=False, prior=0)

makeFactor(self, variables=None)
Simple way to get a factor
Returns: Factor object

__getitem__(self, variables=None)
Simple way to get a factor
Returns: Factor object

Inherited from Variables.SubDomain: __add__, __div__, __iadd__, __idiv__, __imul__, __isub__, __mul__, __rdiv__, __repr__, __rmul__, __str__, __sub__, copy, drop_variable, drop_variables, inst2index, insts, insts_indices, marginalise_onto, sumout, table_size, uses_default_domain, variables, varvalues

Inherited from Variables.Domain: add_domain_variable, add_domain_variables, add_domain_variables_from_rawdata, change_domain_variable, change_domain_variables, common_domain, known_variable, numvals, values

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__

Class Variables

db = sqlite.connect(':memory:')
The common database for objects of this class
Type: sqlite3.Connection object

cursor = db.cursor()

Instance Variables

n
Number of datapoints in the data
Type: Integer

table
The name of the table containing the object's data.
Type: String

Inherited from Variables.SubDomain (private): _variables

Inherited from Variables.Domain (private): _domain, _instd, _numvals

Properties

Inherited from object: __class__

Imports: sqlite


Method Details

__init__(self, data=None, variables=(), domain=None, new_domain_variables=None, must_be_new=False, check=False, convert=False)
(Constructor)

Initialise a Data object

Parameters:
  • data (Tuple) - If None, an empty Data object is created. If a file object, it is assumed to be connected to a CSV file in the format that IO.read_csv can read. If a string, it is assumed to be the name of a CSV file in that format. If a tuple, it is assumed to be like one returned by IO.read_csv.

    Unless data is None, any values for variables and new_domain_variables are ignored.

  • variables (Sequence) - Variables in the data
  • new_domain_variables (Dict or None) - A dictionary containing a mapping from any new variables to their values.
  • domain (Variables.Domain or None) - A domain for the model. If None the internal default domain is used.
  • must_be_new (Boolean) - Whether domain variables in new_domain_variables have to be new
  • check (Boolean) - Whether to check that (1) variables is of the right form, (2) each variable has an associated set of values, and (3) data is the right size and type.
  • convert (Boolean) - If True, data is converted to a list.
Raises:
  • TypeError - If check is set and convert is not set and data is of the wrong type.
  • VariableError - If check is set and there is a variable in variables which does not have associated values; or if a variable in new_domain_variables already exists with values different from its values in new_domain_variables; or if must_be_new is set and the variable already exists.
Overrides: object.__init__

__iter__(self)

Iterates over those joint instantiations of the data which have non-zero counts associated with them

On each iteration a tuple of non-negative integers is returned. The final number is the count; the preceding numbers encode the joint instantiation. For example, (1,0,2,23) states that the instantiation (1,0,2) has occurred 23 times. Here (1,0,2) means that the first variable takes value index 1 (its 2nd value), the second takes value index 0 (its 1st value) and the third takes value index 2 (its 3rd value). Variables and values are ordered lexicographically.

Returns: Iterator
Iterator over non-zero count instantiations
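
As an illustration of this encoding, the following sketch decodes the example tuple. The helper `split_record` and the variable names and values are hypothetical; only the layout of the tuple (value indices followed by a count) comes from the description above.

```python
def split_record(record):
    """Split an iteration tuple into (instantiation, count)."""
    return tuple(record[:-1]), record[-1]

# Hypothetical variables and values, both sorted lexicographically.
values = {"A": ["no", "yes"], "B": ["high", "low"], "C": ["x", "y", "z"]}
variables = sorted(values)

inst, count = split_record((1, 0, 2, 23))
# Map each value index back to the value it stands for.
decoded = {v: values[v][i] for v, i in zip(variables, inst)}
```

Here `decoded` maps A to its 2nd value, B to its 1st value and C to its 3rd value, and `count` is 23.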

h_score(self, precision=1.0)

Return the gPyC.lgh score for the entire data set

Parameters:
  • precision (Float) - BDe-like precision
Returns: Float
gPyC.lgh score

marginal(self, variables)

Return the marginal dataset containing only variables

The returned Data object will have the same domain as self.

self is not altered.

Parameters:
  • variables (Iterable, e.g. list, tuple, set) - Variables in returned marginal table
Returns: Data object
New marginal dataset

_bic_search(self, child, n, child_df, lower, bic_lower, upper, upper_bound, store)

Compute the BIC score for every parent set for child which is a proper superset of lower and a subset of the union of lower and upper, and add it to the dictionary store. bound is a bound on the log-likelihood (modulo an additive constant) of any possible parent set. bic_lower is the BIC score of lower.

bic_search(self, child, pa_lim)

Branch and bound search for all parent sets for child which do not have a higher scoring subset

TODO: use a random graph as an input
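
The selection criterion (keep only parent sets with no higher scoring subset) can be illustrated with an exhaustive filter. This is not the branch and bound search itself, which avoids scoring most parent sets; the function `prune_dominated` and the scores are hypothetical.

```python
def prune_dominated(scores):
    """Keep parent sets that have no proper subset with a higher score."""
    kept = {}
    for pa, score in scores.items():
        # For frozensets, other < pa tests for a proper subset.
        if not any(other < pa and s > score for other, s in scores.items()):
            kept[pa] = score
    return kept

scores = {
    frozenset(): -10.0,
    frozenset({"a"}): -8.0,
    frozenset({"b"}): -11.0,        # worse than the empty set, so pruned
    frozenset({"a", "b"}): -7.5,
}
kept = prune_dominated(scores)
```

A branch and bound search reaches the same result more cheaply by using an upper bound to skip supersets that cannot beat a subset already scored.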

loglikelihood(self, adg)

The log-likelihood of adg with MLE parameters (up to an additive constant)

The missing constant is the log of the multinomial coefficient, which is the same for all adgs.
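
A minimal sketch of this quantity, assuming the standard decomposition of the log-likelihood into one term per family (a child and its parents). The record layout follows the tuples described for populate and __iter__; the function names and the `families` mapping are hypothetical and do not reflect how gPy represents an adg. Natural logarithms are assumed.

```python
from math import log
from collections import Counter

def family_loglik(records, child, parents):
    """Sum of n_jk * log(n_jk / n_j) over observed (parent, child) cells."""
    joint, parent_totals = Counter(), Counter()
    for record in records:
        pa = tuple(record[p] for p in parents)
        joint[pa, record[child]] += record[-1]
        parent_totals[pa] += record[-1]
    return sum(n * log(n / parent_totals[pa]) for (pa, _), n in joint.items())

def loglikelihood(records, families):
    """families maps each child position to a tuple of parent positions."""
    return sum(family_loglik(records, c, pa) for c, pa in families.items())

# Two binary variables; variable 1 has variable 0 as its only parent.
records = [(0, 0, 3), (0, 1, 1), (1, 1, 4)]
ll = loglikelihood(records, {0: (), 1: (0,)})
```

Only non-zero cells contribute, so a single pass over the stored rows per family is sufficient.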

mutual_information(self, x, y)

Return the mutual information between the variable sets x and y in the empirical distribution determined by the data

Parameters:
  • x (Iterable, frozenset most efficient) - Variable set
  • y (Iterable, frozenset most efficient) - Variable set
Returns: Float
The mutual information
Raises:
  • ValueError - If x and y are not disjoint
  • TypeError - If either x or y is not an iterable
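
The empirical mutual information can be computed from sparse count records as in the sketch below. This standalone function is an assumption-laden illustration, not gPy's implementation: it takes tuples of column positions rather than variable names, and it uses natural logarithms.

```python
from math import log
from collections import Counter

def mutual_information(records, x, y):
    """I(x;y) in nats from sparse records; x and y are tuples of positions."""
    if set(x) & set(y):
        raise ValueError("x and y must be disjoint")
    n = sum(r[-1] for r in records)
    cxy, cx, cy = Counter(), Counter(), Counter()
    for r in records:
        kx = tuple(r[p] for p in x)
        ky = tuple(r[p] for p in y)
        cxy[kx, ky] += r[-1]
        cx[kx] += r[-1]
        cy[ky] += r[-1]
    # Sum of p(x,y) * log(p(x,y) / (p(x) p(y))) over non-zero cells.
    return sum(c / n * log(c * n / (cx[kx] * cy[ky]))
               for (kx, ky), c in cxy.items())

dependent = mutual_information([(0, 0, 5), (1, 1, 5)], (0,), (1,))
independent = mutual_information(
    [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)], (0,), (1,))
```

For two perfectly dependent binary variables the result is log 2, and for independent variables it is 0.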

total_count(self)

Return the number of datapoints in the data

Returns: Integer
The number of datapoints in the data

populate(self, records, variables=None)

Simply inserts the records into the database

Each record is a tuple of integers. Each integer, except the last, corresponds to a value. The last value is the count. If a joint instantiation occurs more than once, only the last count is used.

Assumes lexicographic order of variables if variables is None, otherwise the order given by variables.
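
The "only the last count is used" behaviour can be reproduced in sqlite as sketched below. How populate achieves it is not stated here, so the primary key and the `INSERT OR REPLACE` statement are assumptions, as are the table and column names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical table: the instantiation columns form the primary key,
# so a repeated instantiation overwrites the earlier row.
db.execute(
    "CREATE TABLE t (a INTEGER, b INTEGER, cnt INTEGER, PRIMARY KEY (a, b))")

records = [(0, 1, 4), (1, 1, 9), (0, 1, 6)]  # (0, 1) appears twice
db.executemany("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", records)

stored = sorted(db.execute("SELECT a, b, cnt FROM t"))
```

Note that the counts for the repeated instantiation are not added together: the later count of 6 replaces the earlier count of 4.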

ub(self, qpa, alpha, ri)

The upper bound self provides on the score of a smaller parent set, where:

Parameters:
  • qpa - Size of contingency table for smaller parent set
  • alpha - Effective sample size
  • ri - Number of values of the child

qh(self, h=0)

Return the number of instantiations (i.e. cells) having a value greater than h

Parameters:
  • h (Integer) - Threshold
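
In terms of the count records described for __iter__, this amounts to the following one-line sketch. The standalone function is illustrative only and takes the records as an explicit argument.

```python
def qh(records, h=0):
    """Number of cells whose count (last tuple entry) exceeds h."""
    return sum(1 for r in records if r[-1] > h)

records = [(0, 0, 1), (0, 1, 3), (1, 1, 10)]
cells_above_0 = qh(records)
cells_above_1 = qh(records, h=1)
cells_above_5 = qh(records, h=5)
```

With the default h=0 the result is the number of non-zero cells, which is the number of rows stored.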

make_family_scores_naively(self, pa_size_lim=4, precision=10.0, batch_size=65536)

Make scores for all parent sets for all variables where the size of the parent set is at most pa_size_lim. No pruning!

makeFactorsn(self, n, block=1000000)

Yield counts for all marginals with n variables in blocks of block

Marginals are ordered according to how they are generated by the generator Utils.subseteqn.

Parameters:
  • n (Int) - The number of variables in the marginals
Returns: List
counts where counts[i][idx] is the count for the idxth instantiation of the ith marginal
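
The shape of the returned counts can be illustrated with the sketch below. It makes several assumptions: `itertools.combinations` stands in for Utils.subseteqn (whose ordering may differ), instantiations are indexed with the last variable varying fastest, and everything is returned at once rather than yielded in blocks.

```python
from itertools import combinations

def marginal_tables(records, numvals, n):
    """Return dense count lists for every n-variable marginal."""
    tables = []
    for positions in combinations(range(len(numvals)), n):
        size = 1
        for p in positions:
            size *= numvals[p]
        counts = [0] * size
        for r in records:
            idx = 0
            for p in positions:  # mixed-radix index, last variable fastest
                idx = idx * numvals[p] + r[p]
            counts[idx] += r[-1]
        tables.append(counts)
    return tables

records = [(0, 0, 1, 12), (1, 0, 1, 5), (1, 1, 0, 7)]
tables = marginal_tables(records, numvals=[2, 2, 2], n=2)
```

Here `tables[i][idx]` plays the role of counts[i][idx]: the count for the idx-th instantiation of the i-th marginal.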

makeFactorsn_old(self, n)

Return counts for all marginals with n variables

Marginals are ordered according to how they are generated by the generator Utils.subseteqn.

Parameters:
  • n (Int) - The number of variables in the marginals
Returns: List
counts where counts[i][idx] is the count for the idxth instantiation of the ith marginal

makeCPT(self, child, parents, force_cpt=False, check=False, prior=0)

Parameters:
  • prior - The Dirichlet prior parameter (the same parameter value is used for all instances!). Note that there may be some problems with this method: a different prior is used by the BDeu score. However, in practice, this prior seems to be OK for parameter estimation. I was lazy and it was simple to implement (cb). If prior is zero, then the parameters are the maximum likelihood estimates.
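
One plausible reading of this parameter is that the same pseudo-count is added to every cell before normalising. The sketch below implements that reading; it is an assumption rather than a description of makeCPT, and the function name, arguments and record layout are hypothetical. It covers only parent configurations that occur in the data.

```python
from collections import Counter

def make_cpt(records, child, parents, child_numvals, prior=0):
    """Estimate P(child | parents) as (n_jk + prior) / (n_j + r * prior)."""
    joint, totals = Counter(), Counter()
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[pa, r[child]] += r[-1]
        totals[pa] += r[-1]
    cpt = {}
    for pa, n_j in totals.items():
        denom = n_j + child_numvals * prior
        cpt[pa] = [(joint[pa, k] + prior) / denom
                   for k in range(child_numvals)]
    return cpt

records = [(0, 0, 3), (0, 1, 1), (1, 1, 4)]
mle = make_cpt(records, child=1, parents=(0,), child_numvals=2)
smoothed = make_cpt(records, child=1, parents=(0,), child_numvals=2, prior=1)
```

With prior=0 the estimates are the relative frequencies (maximum likelihood); with a positive prior no probability is exactly zero.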

makeFactor(self, variables=None)

Simple way to get a factor

Parameters:
  • variables (Iterable) - The variables in the required factor. If None, then all of self's variables are used, which could produce a very large object!
Returns: Factor object
Marginal table

__getitem__(self, variables=None)
(Indexing operator)

Simple way to get a factor

Parameters:
  • variables (Iterable) - The variables in the required factor. If None, then all of self's variables are used, which could produce a very large object!
Returns: Factor object
Marginal table

Instance Variable Details

n

Number of datapoints in the data
Get Method:
unreachable.n(self) - Return the current number of datapoints stored
Type:
Integer

table

The name of the table containing the object's data. This is read-only (it is defined by a 'property') and is always equal to: 'table%d' % id(self)
Get Method:
unreachable.table(self) - Return the name of the table storing self's data
Type:
String