# Class Data

Factors whose data is stored in a table in an sqlite database

There is one row in the table for each non-zero value. These are meant to be used for factors with many variables, but not too many non-zero values: contingency tables for example. This object is a sensible choice when most of the information required from the data can be gleaned in a few passes over it.

At present all databases are stored in RAM.
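
The storage idea described above can be sketched with the standard `sqlite3` module. This is a minimal illustration only: the table name `demo_counts` and the column names `a`, `b`, `c` and `cnt` are hypothetical and are not taken from the class itself.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical layout: one integer column per variable plus a count column.
conn.execute(
    "CREATE TABLE demo_counts (a INTEGER, b INTEGER, c INTEGER, cnt INTEGER)"
)

# Only cells with a non-zero count get a row; all other cells are implicit zeros.
rows = [(0, 0, 1, 5), (1, 0, 2, 23), (1, 1, 0, 2)]
conn.executemany("INSERT INTO demo_counts VALUES (?, ?, ?, ?)", rows)

stored_cells = conn.execute("SELECT COUNT(*) FROM demo_counts").fetchone()[0]
total = conn.execute("SELECT SUM(cnt) FROM demo_counts").fetchone()[0]
```

With variables of sizes 2, 2 and 3 the full contingency table would have 12 cells, but only the 3 non-zero ones are stored.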

 Instance Methods

• `__init__(self, data=None, variables=(), domain=None, new_domain_variables=None, must_be_new=False, check=False, convert=False)` - Initialise a Data object
• `__del__(self)`
• `__getstate__(self)`
• `__iter__(self)` (Iterator) - Iterates over those joint instantiations of the data which have non-zero counts associated with them
• `__setstate__(self, state)`
• `h_score(self, precision=1.0)` (Float) - Return the gPyC.lgh score for the entire data set
• `marginal(self, variables)` (Data object) - Return the marginal dataset containing only `variables`
• `conditional_entropy(self, x, y)` - Return the conditional entropy H(x|y) for variable sets `x` and `y` using the empirical distribution given by the data
• `entropy(self, x)` - Return the entropy of the marginal empirical distribution given by `x` and the data
• `_bic_search2(self, child, n, child_df, old_parents, further_parents, store, pa_lim, highest_llh)`
• `_bic_search(self, child, n, child_df, lower, bic_lower, upper, upper_bound, store)` - Compute the BIC score for every parent set for `child` which is a proper superset of `lower` and a subset of the union of `lower` and `upper` and add it to the dictionary `store`.
• `bic_search(self, child, pa_lim)` - Branch and bound search for all parent sets for `child` which do not have a higher scoring subset
• `loglikelihood(self, adg)` - The log-likelihood of `adg` with MLE parameters (up to an additive constant)
• `mutual_information(self, x, y)` (Float) - Return the mutual information between the variable sets `x` and `y` in the empirical distribution determined by the data
• `total_count(self)` (Integer) - Return the number of datapoints in the data
• `populate(self, records, variables=None)` - Simply inserts the records into the database
• `qhs(self)`
• `ub(self, qpa, alpha, ri)` - The upper bound `self` provides on the score of a smaller parent set
• `qh(self, h=0)` - Return the number of instantiations (i.e. cells) having a value greater than `h`
• `make_family_scores_naively(self, pa_size_lim=4, precision=10.0, batch_size=65536)` - Make scores for all parent sets for all variables where the size of the parent set is at most `pa_size_lim`.
• `makeFactorsn(self, n, block=1000000)` (List) - Yield counts for all marginals with `n` variables in blocks of `block`
• `_countsfromdata(self, count, mults, marginals_including)`
• `makeFactorsn_old(self, n)` (List) - Return counts for all marginals with `n` variables
• `family_score(self, child, parents, precision=1.0)`
• `h_scores(self, precision=1.0, textfun=)`
• `makeCPT(self, child, parents, force_cpt=False, check=False, prior=0)`
• `makeFactor(self, variables=None)` (Factor object) - Simple way to get a factor
• `__getitem__(self, variables=None)` (Factor object) - Simple way to get a factor

Inherited from `Variables.SubDomain`: `__add__`, `__div__`, `__iadd__`, `__idiv__`, `__imul__`, `__isub__`, `__mul__`, `__rdiv__`, `__repr__`, `__rmul__`, `__str__`, `__sub__`, `copy`, `drop_variable`, `drop_variables`, `inst2index`, `insts`, `insts_indices`, `marginalise_onto`, `sumout`, `table_size`, `uses_default_domain`, `variables`, `varvalues`

Inherited from `Variables.SubDomain` (private): `_decode_inst`, `_get_result_variables`, `_insts`, `_insts_indices`, `_pointwise_op`

Inherited from `Variables.Domain`: `add_domain_variable`, `add_domain_variables`, `add_domain_variables_from_rawdata`, `change_domain_variable`, `change_domain_variables`, `common_domain`, `known_variable`, `numvals`, `values`

Inherited from `object`: `__delattr__`, `__getattribute__`, `__hash__`, `__new__`, `__reduce__`, `__reduce_ex__`, `__setattr__`

 Class Variables
• `db` (sqlite3.Connection object) = `sqlite.connect(':memory:')` - The common database for objects of this class
• `cursor` = `db.cursor()`
 Instance Variables
• `n` (Integer) - Number of datapoints in the data
• `table` (String) - The name of the table containing the object's data.

Inherited from `Variables.SubDomain` (private): `_variables`

Inherited from `Variables.Domain` (private): `_domain`, `_instd`, `_numvals`

 Properties

Inherited from `object`: `__class__`

Imports: sqlite

 Method Details

### __init__(self, data=None, variables=(), domain=None, new_domain_variables=None, must_be_new=False, check=False, convert=False) (Constructor)

Initialise a Data object

Parameters:
• `data` (Tuple) - If `None`, then an empty Data object is created. If a file object, then assumed to be connected to a CSV file in the format that IO.read_csv can read. If a string, assumed to be the name of a CSV file in the format that IO.read_csv can read. If a tuple, assumed to be like one returned by IO.read_csv.

Unless `data` is `None`, any values for `variables` and `new_domain_variables` are ignored.

• `variables` (Sequence) - Variables in the data
• `new_domain_variables` (Dict or None) - A dictionary containing a mapping from any new variables to their values.
• `domain` (Variables.Domain or None) - A domain for the model. If None the internal default domain is used.
• `must_be_new` (Boolean) - Whether domain variables in `new_domain_variables` have to be new
• `check` (Boolean) - Whether to check that (1) `variables` is of the right form, (2) each variable has an associated set of values, and (3) `data` is the right size and type.
• `convert` (Boolean) - If `True`, `data` is converted to a list.
Raises:
• `TypeError` - If `check` is set and `convert` is not set and `data` is of the wrong type.
• `VariableError` - If `check` is set and there is a variable in `variables` which does not have associated values; or if a variable in `new_domain_variables` already exists with values different from its values in `new_domain_variables`; or if `must_be_new` is set and the variable already exists.
Overrides: object.__init__

### __iter__(self)

Iterates over those joint instantiations of the data which have non-zero counts associated with them

On each iteration a tuple of non-negative integers is returned. The final number is the count; the preceding numbers encode the joint instantiation. For example, (1,0,2,23) states that the instantiation (1,0,2) has occurred 23 times. In (1,0,2) the first variable takes value 1 (its 2nd value), the second takes value 0 (its 1st) and the third takes value 2 (its 3rd). Variables and values are ordered lexicographically.

Returns: Iterator
Iterator over non-zero count instantiations
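
The encoding described above can be decoded as in the following sketch. The `domain` dictionary and its variable and value names are invented for illustration; the only thing taken from the description is the record shape (value indices followed by a count) and the lexicographic ordering of variables and values.

```python
# Hypothetical domain: variable name -> its possible values.
domain = {"A": ["high", "low"], "B": ["no", "yes"], "C": ["x", "y", "z"]}

# Variables and values are assumed to be ordered lexicographically.
variables = sorted(domain)


def decode(record):
    """Split a record into ({variable: value}, count)."""
    *indices, count = record
    inst = {
        var: sorted(domain[var])[i] for var, i in zip(variables, indices)
    }
    return inst, count


inst, count = decode((1, 0, 2, 23))
```

Here `inst` maps `A` to its 2nd value, `B` to its 1st and `C` to its 3rd, and `count` is 23.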

### h_score(self, precision=1.0)

Return the gPyC.lgh score for the entire data set

Parameters:
• `precision` (Float) - BDe-like precision
Returns: Float
gPyC.lgh score

### marginal(self, variables)

Return the marginal dataset containing only `variables`

Returned Data object will have same domain as `self`

Does not alter `self`

Parameters:
• `variables` (Iterable, e.g. list, tuple, set) - Variables in returned marginal table
Returns: Data object
New marginal dataset
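
What marginalising a sparse count table involves can be sketched in plain Python. This is not the class's own implementation (which works on an SQLite table); `marginal_counts` and its arguments are hypothetical names, and records are assumed to have the shape (value indices..., count).

```python
from collections import Counter


def marginal_counts(records, variables, keep):
    """Sum counts over all variables not in `keep`; the input is not modified."""
    keep_idx = [i for i, v in enumerate(variables) if v in keep]
    out = Counter()
    for record in records:
        *inst, count = record
        out[tuple(inst[i] for i in keep_idx)] += count
    return out


records = [(0, 0, 1, 5), (1, 0, 2, 23), (1, 1, 0, 2)]
on_a = marginal_counts(records, ("A", "B", "C"), {"A"})
on_ab = marginal_counts(records, ("A", "B", "C"), {"A", "B"})
```

Rows that agree on the kept variables are merged, so the marginal never has more rows than the original data.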

### _bic_search(self, child, n, child_df, lower, bic_lower, upper, upper_bound, store)

Compute the BIC score for every parent set for `child` which is a proper superset of `lower` and a subset of the union of `lower` and `upper`, and add it to the dictionary `store`. `upper_bound` is a bound on the log-likelihood (modulo an additive constant) of any possible parent set. `bic_lower` is the BIC score of `lower`.

### bic_search(self, child, pa_lim)

Branch and bound search for all parent sets for `child` which do not have a higher scoring subset

TODO: use a random graph as an input
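
To make the target of the search concrete, the sketch below computes the same kind of result exhaustively, with no branch and bound. It assumes the usual BIC definition (maximised log-likelihood minus half of log n times the number of free parameters) and a strict "higher scoring" comparison; both are assumptions, as are all function names and the use of column positions instead of variable names.

```python
import math
from collections import Counter
from itertools import combinations


def bic(records, child, parents, numvals):
    """BIC of one family; records are (value indices..., count)."""
    n = sum(r[-1] for r in records)
    joint, pa = Counter(), Counter()
    for r in records:
        key = tuple(r[i] for i in parents)
        joint[key + (r[child],)] += r[-1]
        pa[key] += r[-1]
    llh = sum(c * math.log(c / pa[k[:-1]]) for k, c in joint.items())
    df = (numvals[child] - 1) * math.prod(numvals[p] for p in parents)
    return llh - 0.5 * math.log(n) * df


def unpruned_parent_sets(records, child, candidates, numvals, pa_lim):
    """Score every parent set up to size pa_lim, then drop dominated ones."""
    scores = {}
    for size in range(pa_lim + 1):
        for pa in combinations(candidates, size):
            scores[frozenset(pa)] = bic(records, child, pa, numvals)
    # Keep a set only if no proper subset scores strictly higher.
    return {
        pa: s
        for pa, s in scores.items()
        if not any(sub < pa and scores[sub] > s for sub in scores)
    }


# Column 1 copies column 0; column 2 is independent noise.
records = [(0, 0, 0, 25), (0, 0, 1, 25), (1, 1, 0, 25), (1, 1, 1, 25)]
kept = unpruned_parent_sets(records, 1, [0, 2], {0: 2, 1: 2, 2: 2}, 2)
```

In this toy data only the empty set and the set containing column 0 survive, since adding the noise column increases the penalty without improving the fit.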

### loglikelihood(self, adg)

The log-likelihood of `adg` with MLE parameters (up to an additive constant)

The missing constant is the log of the multinomial coefficient which is the same for all adgs.
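
A minimal sketch of this quantity follows, assuming the standard decomposition of the maximised log-likelihood into one term per family (a variable together with its parents). The graph representation (a dict from child column to a list of parent columns) and the function names are hypothetical, and natural logarithms are an assumption.

```python
import math
from collections import Counter


def family_loglik(records, child, parents):
    """Sum of n_jk * log(n_jk / n_j) over parent configs j and child values k."""
    joint, pa = Counter(), Counter()
    for r in records:
        key = tuple(r[i] for i in parents)
        joint[key + (r[child],)] += r[-1]
        pa[key] += r[-1]
    return sum(n * math.log(n / pa[k[:-1]]) for k, n in joint.items())


def loglikelihood(records, adg):
    """Log-likelihood with MLE parameters, omitting the multinomial constant."""
    return sum(
        family_loglik(records, child, parents)
        for child, parents in adg.items()
    )


# Two binary variables that always agree; records are (x0, x1, count).
records = [(0, 0, 10), (1, 1, 10)]
ll_edge = loglikelihood(records, {0: [], 1: [0]})
ll_empty = loglikelihood(records, {0: [], 1: []})
```

The graph with an edge fits better here because the second variable is fully determined by the first.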

### mutual_information(self, x, y)

Return the mutual information between the variable sets `x` and `y` in the empirical distribution determined by the data

Parameters:
• `x` (Iterable, frozenset most efficient) - Variable set
• `y` (Iterable, frozenset most efficient) - Variable set
Returns: Float
The mutual information
Raises:
• `ValueError` - If `x` and `y` are not disjoint
• `TypeError` - If either `x` or `y` is not an iterable
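
The entropy-based quantities on this page are related by I(x;y) = H(x) + H(y) - H(x,y) and H(x|y) = H(x,y) - H(y). The sketch below computes them from sparse count records; it uses column positions rather than variable names and natural logarithms, both of which are assumptions, and it is not the class's own code.

```python
import math
from collections import Counter


def entropy(records, columns):
    """Entropy (in nats) of the empirical marginal over `columns`."""
    total = sum(r[-1] for r in records)
    counts = Counter()
    for r in records:
        counts[tuple(r[i] for i in columns)] += r[-1]
    return -sum(c / total * math.log(c / total) for c in counts.values())


def mutual_information(records, x, y):
    x, y = frozenset(x), frozenset(y)  # TypeError if not iterable
    if x & y:
        raise ValueError("x and y must be disjoint")
    return (
        entropy(records, sorted(x))
        + entropy(records, sorted(y))
        - entropy(records, sorted(x | y))
    )


def conditional_entropy(records, x, y):
    x, y = frozenset(x), frozenset(y)
    return entropy(records, sorted(x | y)) - entropy(records, sorted(y))


dependent = [(0, 0, 10), (1, 1, 10)]
independent = [(0, 0, 5), (0, 1, 5), (1, 0, 5), (1, 1, 5)]
mi_dep = mutual_information(dependent, [0], [1])
mi_ind = mutual_information(independent, [0], [1])
```

For two binary variables that always agree, the mutual information equals the entropy of either one; for independent variables it is zero up to rounding.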

### total_count(self)

Return the number of datapoints in the data

Returns: Integer
The number of datapoints in the data

### populate(self, records, variables=None)

Simply inserts the records into the database

Each record is a tuple of integers. Each integer, except the last, corresponds to a value. The last value is the count. If a joint instantiation occurs more than once, only the last count is used.

Assumes lexicographic order of variables if `variables` is None, otherwise the order given by `variables`.
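
The "only the last count is used" behaviour can be reproduced in SQLite with a primary key over the variable columns and `INSERT OR REPLACE`. Whether `populate` is implemented this way is an assumption; the table and column names below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key makes each joint instantiation unique.
conn.execute(
    "CREATE TABLE demo (a INTEGER, b INTEGER, cnt INTEGER, PRIMARY KEY (a, b))"
)

# The instantiation (0, 1) appears twice; the later record should win.
records = [(0, 1, 4), (1, 0, 7), (0, 1, 9)]
conn.executemany("INSERT OR REPLACE INTO demo VALUES (?, ?, ?)", records)

stored = conn.execute("SELECT a, b, cnt FROM demo ORDER BY a, b").fetchall()
```

Note that the counts for a repeated instantiation are replaced, not summed.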

### ub(self, qpa, alpha, ri)

The upper bound `self` provides on the score of a smaller parent set, where:

Parameters:
• `qpa` - Size of contingency table for smaller parent set
• `alpha` - Effective sample size
• `ri` - Number of values of the child

### qh(self, h=0)

Return the number of instantiations (i.e. cells) having a value greater than `h`

Parameters:
• `h` (Integer) - Threshold
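
Because only non-zero cells are stored as rows, a count like this maps naturally onto a single SQL query. The sketch below is an illustration under that assumption; the table `demo`, the column `cnt` and the helper `count_cells_above` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (a INTEGER, b INTEGER, cnt INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [(0, 0, 5), (1, 0, 23), (1, 1, 2)],
)


def count_cells_above(conn, h=0):
    """Number of stored cells whose count is strictly greater than h."""
    query = "SELECT COUNT(*) FROM demo WHERE cnt > ?"
    return conn.execute(query, (h,)).fetchone()[0]
```

With the default threshold of 0 this is simply the number of stored rows.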

### make_family_scores_naively(self, pa_size_lim=4, precision=10.0, batch_size=65536)

Make scores for all parent sets for all variables where the size of the parent set is at most `pa_size_lim`. No pruning is performed.
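
The set of families covered by such an unpruned enumeration can be sketched as follows. The generator name and the variable names are hypothetical; only the size limit on parent sets comes from the description above.

```python
from itertools import combinations


def candidate_families(variables, pa_size_lim):
    """Yield (child, parent set) for every parent set of size <= pa_size_lim."""
    for child in variables:
        others = [v for v in variables if v != child]
        for size in range(pa_size_lim + 1):
            for parents in combinations(others, size):
                yield child, frozenset(parents)


families = list(candidate_families(["A", "B", "C", "D"], 2))
```

With 4 variables and a limit of 2, each child has 1 + 3 + 3 = 7 candidate parent sets, giving 28 families in total. The number grows quickly with the limit, which is why no pruning makes this approach naive.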

### makeFactorsn(self, n, block=1000000)

Yield counts for all marginals with `n` variables in blocks of `block`

Marginals are ordered according to how they are generated by the generator Utils.subseteqn.

Parameters:
• `n` (Int) - The number of variables in the marginals
Returns: List
`counts` where `counts[i][idx]` is the count for the `idx`th instantiation of the `i`th marginal

### makeFactorsn_old(self, n)

Return counts for all marginals with `n` variables

Marginals are ordered according to how they are generated by the generator Utils.subseteqn.

Parameters:
• `n` (Int) - The number of variables in the marginals
Returns: List
`counts` where `counts[i][idx]` is the count for the `idx`th instantiation of the `i`th marginal
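
The shape of the returned `counts` can be illustrated with the sketch below. Two assumptions are made: subsets are generated with `itertools.combinations` as a stand-in for `Utils.subseteqn` (whose ordering may differ), and an instantiation's index is its mixed-radix number with the first variable most significant.

```python
import math
from itertools import combinations


def all_marginal_counts(records, numvals, n):
    """Dense count tables for every n-variable subset of the columns."""
    counts = []
    for subset in combinations(range(len(numvals)), n):
        table = [0] * math.prod(numvals[i] for i in subset)
        for r in records:
            idx = 0
            for i in subset:
                idx = idx * numvals[i] + r[i]  # mixed-radix index
            table[idx] += r[-1]
        counts.append(table)
    return counts


# Records are (a, b, c, count) with 2, 2 and 3 values respectively.
records = [(0, 0, 1, 5), (1, 0, 2, 23), (1, 1, 0, 2)]
counts = all_marginal_counts(records, [2, 2, 3], 2)
```

Unlike the sparse storage used for the data itself, each marginal here is a dense list, so zero cells are stored explicitly.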

### makeCPT(self, child, parents, force_cpt=False, check=False, prior=0)

Parameters:
• `prior` - The Dirichlet prior parameter (the same parameter value is used for all instances). Note that there may be some problems with this method, since a different prior is used by the BDeu score. In practice, however, this prior seems to be adequate for parameter estimation, and it was simple to implement (cb). If `prior` is zero, the parameters are the maximum likelihood estimates.
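
The effect of `prior` can be sketched as adding the same pseudo-count to every cell before normalising. This is a plain-Python illustration under that assumption; the function name, the dictionary return type and the use of column positions are hypothetical and do not reflect what `makeCPT` actually returns.

```python
from collections import Counter


def estimate_cpt(records, child, parents, child_numvals, prior=0):
    """P(child | parents) for each parent configuration seen in the data."""
    joint, pa = Counter(), Counter()
    for r in records:
        key = tuple(r[i] for i in parents)
        joint[key, r[child]] += r[-1]
        pa[key] += r[-1]
    return {
        key: [
            (joint[key, k] + prior) / (n + child_numvals * prior)
            for k in range(child_numvals)
        ]
        for key, n in pa.items()
    }


# Records are (parent, child, count); both variables are binary.
records = [(0, 0, 3), (0, 1, 1), (1, 1, 4)]
mle = estimate_cpt(records, child=1, parents=[0], child_numvals=2)
smoothed = estimate_cpt(records, child=1, parents=[0], child_numvals=2, prior=1)
```

With `prior=0` an unseen child value gets probability exactly zero; a positive prior pulls every row of the table towards the uniform distribution.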

### makeFactor(self, variables=None)

Simple way to get a factor

Parameters:
• `variables` (Iterable) - The variables in the required factor. If `None`, then all of `self`'s variables are used: which could produce a very large object!
Returns: Factor object
Marginal table

### __getitem__(self, variables=None)(Indexing operator)

Simple way to get a factor

Parameters:
• `variables` (Iterable) - The variables in the required factor. If `None`, then all of `self`'s variables are used: which could produce a very large object!
Returns: Factor object
Marginal table

 Instance Variable Details

### n

Number of datapoints in the data
Get Method:
unreachable.n(self) - Return the current number of datapoints stored
Type:
Integer

### table

The name of the table containing the object's data. This is read-only (it is defined by a 'property') and is always equal to `'table%d' % id(self)`.
Get Method:
unreachable.table(self) - Return the name of the table storing `self`'s data
Type:
String

 Generated by Epydoc 3.0.1 on Thu Oct 15 15:34:05 2009 http://epydoc.sourceforge.net