Bayesian network learning by compiling to weighted MAX-SAT

This webpage includes materials for reproducing the results of the UAI 2008 paper named in the title above.

The slides of a talk on this work given at the Spring Workshop on Mining and Learning 2008 are also available.

The slides of a talk given at the Durham symposium on Mathematical Aspects of Graphical Models are available.

There is also some material on work done after submission of the UAI paper.

If anything does not work, drop me a line.


Resources

Resource                                        Requirements
--------                                        ------------
MaxWalkSat                                      C compiler
Weighted CNF files                              Nothing
Family scores                                   Python
Data                                            Nothing
Computing family scores                         Data, Python, gPy, SQLite, pysqlite
Filtering family scores                         Python
Making (weighted) CNF files                     Python
Extracting BN structures from Maxwalksat runs   Python, gPy

MaxWalkSat

MaxWalkSat is available from the Walksat home page.


Weighted CNF files

(Gzipped) weighted CNF files using the 'Total Order' Encoding are available in this directory. Note that the size of these files can be reduced considerably by deleting the comments at the start. A file giving the random seeds used with the Total Order encoding is also available. These inputs were used to produce the results in the 'Long' column of Table 6 in the paper.
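The comment-stripping can be done while decompressing. A minimal sketch, assuming the comments are DIMACS-style lines beginning with 'c' (demonstrated on a tiny stand-in file rather than the real ones):

```shell
# Make a tiny stand-in CNF file with two comment lines and one weighted clause.
printf 'c a comment line\nc another comment\n1000 1 2 0\n' > demo.cnf
# Drop every line starting with 'c' (assumed DIMACS-style comment syntax).
grep -v '^c' demo.cnf > demo.stripped.cnf
cat demo.stripped.cnf
```

For the real files, substitute `gzip -dc file.cnf.gz` for the `printf` stand-in.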

Using the given random seeds it is possible, on my machine at least, to reproduce the results given in the 'Long' column of Table 6. Here's how to get the score of 408,282 for the Mildew 10,000 dataset:

maxwalksat -seed 11870113 -hard -noise 10 100 -cutoff 10000000 -tries 100 < Mildew_10000_pascores_3._filtered.pck.cnf 
maxwalksat version 20
seed = 11870113
cutoff = 10000000
tries = 100
numsol = 1
targetcost = 0
heuristic = best, noise 10 / 100
allocating memory...
clauses contain explicit costs
numatom = 1060, numclause = 14208, numliterals = 41436
wff read in
                                           average             average       mean              standard
    lowest     worst    number                when                over      flips                 error
      cost    clause    #unsat    #flips     model   success       all      until        std         of
  this try  this try  this try  this try     found      rate     tries     assign        dev       mean
    408919     33187        35  10000000         *         0         *          *          *          *
    408983     33187        35  10000000         *         0         *          *          *          *
    409542     39343        35  10000000         *         0         *          *          *          *
    409198     34651        35  10000000         *         0         *          *          *          *
    408982     33187        35  10000000         *         0         *          *          *          *
    408282     33187        35  10000000         *         0         *          *          *          *
    409283     33187        35  10000000         *         0         *          *          *          *
    .....

Note that maxwalksat achieves this best score of 408,282 on only the 6th try; the following 94 tries (not shown) fail to match this early success.

Gzipped weighted CNF files using the 'Ancestor' encoding are available in this directory. These files were used to produce the scores in the 'Ancestor' column of Table 6 in the paper. Unfortunately the random seeds used to create those scores have been lost. Note that these files have the same names as those using the 'Total Order' encoding.


Family scores

The (filtered) 'family scores' mentioned in Table 4 are available in this directory. They are stored as pickled Python dictionaries. Here's an interaction with the Python interpreter showing how to extract information from the file asia_10000_pascores_3._filtered.pck:

cscipc001 ~/godot/research/uai08/data python
Python 2.5.1 (r251:54863, May  4 2007, 16:52:23)
[GCC 4.1.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cPickle
>>> foo = cPickle.load(open('asia_10000_pascores_3._filtered.pck'))
>>> foo.keys()
['Cancer', 'Bronchitis', 'TbOrCa', 'XRay', 'VisitAsia',
'Tuberculosis', 'Smoking', 'Dyspnea']
>>> foo['Cancer']
{frozenset(['XRay', 'Tuberculosis']): -796.08205890562385,
frozenset(['Bronchitis']): -2145.9112979259517, frozenset(['XRay',
'Tuberculosis', 'Dyspnea']): -700.72999859493575, frozenset(['XRay',
'Smoking', 'Dyspnea']): -668.48756881769805, frozenset([]):
-2168.6416529098788, frozenset(['TbOrCa', 'Tuberculosis']):
-23.720963290572399, frozenset(['Dyspnea']): -1975.7774261979357,
frozenset(['Smoking', 'TbOrCa']): -236.48304769890092,
frozenset(['TbOrCa', 'Bronchitis']): -302.71226261477568,
frozenset(['Smoking']): -1906.216217612091, frozenset(['XRay',
'Smoking', 'Tuberculosis']): -667.00410148876836, frozenset(['XRay',
'Bronchitis']): -844.45162448912743, frozenset(['VisitAsia',
'TbOrCa']): -307.28061528033868, frozenset(['TbOrCa']):
-309.98926825194212, frozenset(['Bronchitis', 'Dyspnea']):
-1942.0722846225835, frozenset(['XRay', 'Smoking']):
-706.48236515406461, frozenset(['XRay', 'Tuberculosis',
'Bronchitis']): -792.30478260022937, frozenset(['VisitAsia', 'TbOrCa',
'Bronchitis']): -300.96353114905651, frozenset(['XRay', 'Dyspnea']):
-779.76300946372794, frozenset(['Smoking', 'Bronchitis', 'Dyspnea']):
-1727.6342931018444, frozenset(['Smoking', 'Dyspnea']):
-1793.5086176705518, frozenset(['XRay']): -857.50825977628119,
frozenset(['Smoking', 'TbOrCa', 'Tuberculosis']): -22.811097057667212}
>>> foo['Cancer'][frozenset(['XRay', 'Tuberculosis'])]
-796.08205890562385
>>>
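With the same dictionary layout, the highest-scoring candidate parent set for a variable can be picked out with a single max over the inner dictionary. A minimal sketch (best_parent_set is a hypothetical helper; the scores are the 'Cancer' entries from the interaction above):

```python
# Family scores in the same {variable: {frozenset(parents): log_score}}
# layout as the pickled dictionaries (values copied from the example above).
scores = {
    'Cancer': {
        frozenset(): -2168.6416529098788,
        frozenset(['TbOrCa', 'Tuberculosis']): -23.720963290572399,
        frozenset(['Smoking', 'TbOrCa', 'Tuberculosis']): -22.811097057667212,
    }
}

def best_parent_set(family_scores, variable):
    """Return the (parent_set, log_score) pair with the maximal score."""
    candidates = family_scores[variable]
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

parents, score = best_parent_set(scores, 'Cancer')
print(sorted(parents), score)
```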
    

Data

The (gzipped) datasets, as described in Table 2, are available in this directory. Each file has 3 sections. The 1st section has one line per variable, naming the variable and its possible values like this:

Cancer:absent,present
Bronchitis:absent,present
TbOrCa:false,true
XRay:abnormal,normal
VisitAsia:no_visit,visit
Tuberculosis:absent,present
Smoking:smoker,nonsmoker
Dyspnea:absent,present

The 2nd section is a single line giving an ordering of the variables. For example:

Smoking,Cancer,Bronchitis,VisitAsia,Tuberculosis,TbOrCa,XRay,Dyspnea
    

The 3rd section has one full joint instantiation of the variables on each of its lines. The ith value on each line is the value of the ith variable according to the variable ordering given in the 2nd section. Here's an example:

nonsmoker,absent,present,no_visit,absent,false,normal,present
nonsmoker,absent,absent,no_visit,absent,false,normal,present
smoker,absent,present,no_visit,absent,false,normal,absent
smoker,absent,absent,no_visit,absent,false,normal,absent
smoker,present,absent,no_visit,absent,true,abnormal,present
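A sketch of a reader for this three-section format (parse_dataset is a hypothetical helper; it assumes the sections are separated by blank lines, which the real files may not use):

```python
def parse_dataset(text):
    """Split the three sections: variable definitions, ordering, records.
    Assumes blank lines separate the sections (an assumption)."""
    sections = [s.strip().splitlines() for s in text.strip().split('\n\n')]
    defs, (order_line,), rows = sections
    values = {}
    for line in defs:                      # "Variable:val1,val2" lines
        name, vals = line.split(':')
        values[name] = vals.split(',')
    order = order_line.split(',')
    # The ith value on a row is the value of the ith variable in the order.
    records = [dict(zip(order, row.split(','))) for row in rows]
    return values, order, records

sample = """Cancer:absent,present
Smoking:smoker,nonsmoker

Smoking,Cancer

nonsmoker,absent
smoker,present"""

values, order, records = parse_dataset(sample)
```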
    

Computing family scores

All of the following assumes you are running Linux/UNIX.

To compute your own 'family scores' dictionary from data you need to install the gPy Python package: just download the .tar.gz file, untar it, and refer to the README file. For SQLite and pysqlite just follow the links; you may well have SQLite installed already.

Once all is installed grab this script and run it like this:

python make_family_scores.py asia_100.data.gz
    

where the input file is as described in the Data section. This is currently a slow process on large data files. It will produce 3 files:

asia_100_family_scores_3_1.pck
asia_100.sqldata.pck
asia_100.db
    

The first of these has your family scores. The other two are just reformulations of the data: asia_100.db is an SQLite database file and asia_100.sqldata.pck is a Python wrapper object for it. If either of these last two files is already present, an error will occur, so just delete them first.


Filtering family scores

Use this script to filter a pickled (i.e. saved to disk) Python dictionary of family scores. Just supply the pickled file as a command line argument. (Some diagnostic stuff gets printed out.)


Making (weighted) CNF files

Run this script with a pickled dictionary of family scores as a command line argument. A '.cnf' file will be created. By default, the total order with families encoding is used. In addition, a file ending in conditioned_probadg.pck will be created, which stores the encoding used.


Extracting BN structures from Maxwalksat runs

Run this script with two arguments: first, a 'conditioned_probadg.pck' file as created when making a weighted CNF file (see above); second, a file containing the output of a Maxwalksat run. For example, suppose the following output from a maxwalksat run were saved in a file called 'foo.txt'.

cscipc001 ~ maxwalksat -sol -targetcost 241 < asia_100_pascores_3._filtered.cnf
maxwalksat version 20 (Huge)
seed = 53353416
cutoff = 100000
tries = 10
numsol = 1
targetcost = 241
heuristic = best, noise 50 / 100
allocating memory...
clauses contain explicit costs
numatom = 69, numclause = 207, numliterals = 510
wff read in
                                           average             average       mean              standard
    lowest     worst    number                when                over      flips                 error
      cost    clause    #unsat    #flips     model   success       all      until        std         of
  this try  this try  this try  this try     found      rate     tries     assign        dev       mean
       241        71         8      6105      6105       100      6105     6105.0          *          *
Begin assign with lowest # bad = 241 (atoms assigned true)
 3 10 16 22 25 26 29 39 42 43
 44 45 46 47 48 51 54 55 56 59
 60 63 66 68 69
End assign

total elapsed seconds = 0.003333
average flips per second = 1831683
number of solutions found = 1
mean flips until assign = 6105.000000
mean seconds until assign = 0.003333
mean restarts until assign = 1.000000
ASSIGNMENT ACHIEVING TARGET 241 FOUND
    

Then, assuming that asia_100_pascores_3._filtered_conditioned_probadg.pck is the file created alongside asia_100_pascores_3._filtered.cnf, the BN encoded by the atoms in the assignment of the above maxwalksat run can be recovered as follows:

cscipc001 python extract_bns.py asia_100_pascores_3._filtered_conditioned_probadg.pck foo.txt
Bayesian network:
Vertices:
['Bronchitis', 'Cancer', 'Dyspnea', 'Smoking', 'TbOrCa', 'Tuberculosis', 'VisitAsia', 'XRay']
Arrows:
Bronchitis -> Dyspnea
Bronchitis -> Smoking
Cancer -> TbOrCa
Smoking -> Cancer
TbOrCa -> XRay
Tuberculosis -> Cancer
Tuberculosis -> TbOrCa


Essential graph:
Vertices:
['Bronchitis', 'Cancer', 'Dyspnea', 'Smoking', 'TbOrCa', 'Tuberculosis', 'VisitAsia', 'XRay']
Arrows:
Cancer -> TbOrCa
Smoking -> Cancer
TbOrCa -> XRay
Tuberculosis -> Cancer
Tuberculosis -> TbOrCa
Lines:
Bronchitis - Dyspnea
Bronchitis - Smoking
    

As well as being printed out, both the graph and the essential graph will appear in a Tk window. Chickering's algorithm is used to compute the essential graph. Move the vertices around with the left mouse button to get a nice arrangement. Just kill the window to finish. If your Tkinter installation is not working, just comment out the appropriate lines.


Further work

Given a particular total ordering of the variables, the best parent set (from the available candidate parent sets) for any variable is determined, so there is no need to search for such a parent set, nor to represent parent sets explicitly. Each (single-literal) weighted clause representing a choice of parents can be replaced by one with the same weight but mentioning only the order relations corresponding to that choice.
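A minimal sketch of that observation (best_net_for_order is a hypothetical helper and the scores below are made up): given a total order, each variable simply takes its highest-scoring candidate parent set whose members all precede it in the order, with no explicit search over parent sets.

```python
def best_net_for_order(family_scores, order):
    """For a fixed total order, pick each variable's best order-consistent
    candidate parent set; return the network and its total log score."""
    position = {v: i for i, v in enumerate(order)}
    net, total = {}, 0.0
    for v in order:
        # Candidates all of whose members precede v in the order.
        compatible = {ps: s for ps, s in family_scores[v].items()
                      if all(position[p] < position[v] for p in ps)}
        best = max(compatible, key=compatible.get)
        net[v] = best
        total += compatible[best]
    return net, total

# Hypothetical family scores over three variables.
demo_scores = {
    'Smoking': {frozenset(): -10.0},
    'Cancer': {frozenset(): -30.0,
               frozenset(['Smoking']): -20.0,
               frozenset(['Dyspnea']): -15.0},   # ruled out by the order below
    'Dyspnea': {frozenset(): -25.0,
                frozenset(['Cancer']): -12.0},
}
net, total = best_net_for_order(demo_scores, ['Smoking', 'Cancer', 'Dyspnea'])
```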

This approach of searching in the space of orders can be found in M. Teyssier and D. Koller (2005), "Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks", Proceedings of the Twenty-First Conference on Uncertainty in AI (UAI), pp. 584-590. In that paper the authors use "greedy local hill-climbing with random restarts and a tabu list".

Here the search in the space of orders has been done by encoding the problem to weighted CNF and experimenting with solvers in UBCSAT. So far the "Iterated Robust Tabu Search" (irots) solver has produced the best results.

The weighted CNF files used for these experiments can be found in this directory (look for .cnf.gz files). These files use floating point weights, since UBCSAT allows them; for other solvers you may have to change these to integers. The results of learning BNs using this approach can be found in the same place (look for .out files).
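One way to do that conversion is to scale and round the weights. A sketch (integerise is a hypothetical helper; it assumes the weight is the first field of each clause line and that comment and header lines start with 'c' or 'p'):

```python
def integerise(lines, scale=100):
    """Scale and round floating-point clause weights to integers.
    The scale factor trades precision for weight magnitude."""
    out = []
    for line in lines:
        if line.startswith(('c', 'p')):   # keep comments / header untouched
            out.append(line)
        else:
            weight, rest = line.split(None, 1)
            out.append('%d %s' % (round(float(weight) * scale), rest))
    return out

converted = integerise(['c comment', 'p wcnf 3 2', '10.25 1 2 0'])
```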

The (BDeu) scores of the learned BNs are:

Mildew_100.out:with score -5721.10745197
Mildew_1000.out:with score -47102.3261124
Mildew_10000.out:with score -407643.891468
Water_100.out:with score -1500.9683917
Water_1000.out:with score -13262.3417867
Water_10000.out:with score -128705.64988
alarm_100.out:with score -1349.95493762
alarm_1000.out:with score -11240.5432394
alarm_10000.out:with score -105226.511917
asia_100.out:with score -245.644265388
asia_1000.out:with score -2317.41150668
asia_10000.out:with score -22466.3965465
carpo_100.out:with score -1834.4018744
carpo_1000.out:with score -17721.8687069
carpo_10000.out:with score -174132.08478
hailfinder_100.out:with score -6019.46992322
hailfinder_1000.out:with score -52473.245609
hailfinder_10000.out:with score -497730.346946
insurance_100.out:with score -1686.22587606
insurance_1000.out:with score -13887.3501469
insurance_10000.out:with score -132968.576882
    

When learning from a dataset of size 10,000 these scores are better than those achieved in the paper (see Table 6). UBCSAT runs took a minute or so. Also, these scores are usually higher than the true BN's score, which explains why (with the exception of asia) the learned BNs have positive structural Hamming distance from the true BN.

The structural Hamming distance from the true essential graph to the learned one is as follows:

Mildew_100.out:Essential graphs structural Hamming distance from true BN is  70
Mildew_1000.out:Essential graphs structural Hamming distance from true BN is  36
Mildew_10000.out:Essential graphs structural Hamming distance from true BN is  29
Water_100.out:Essential graphs structural Hamming distance from true BN is  73
Water_1000.out:Essential graphs structural Hamming distance from true BN is  69
Water_10000.out:Essential graphs structural Hamming distance from true BN is  51
alarm_100.out:Essential graphs structural Hamming distance from true BN is  41
alarm_1000.out:Essential graphs structural Hamming distance from true BN is  13
alarm_10000.out:Essential graphs structural Hamming distance from true BN is  2
asia_100.out:Essential graphs structural Hamming distance from true BN is  8
asia_1000.out:Essential graphs structural Hamming distance from true BN is  2
asia_10000.out:Essential graphs structural Hamming distance from true BN is  0
carpo_100.out:Essential graphs structural Hamming distance from true BN is  142
carpo_1000.out:Essential graphs structural Hamming distance from true BN is  82
carpo_10000.out:Essential graphs structural Hamming distance from true BN is  35
hailfinder_100.out:Essential graphs structural Hamming distance from true BN is  68
hailfinder_1000.out:Essential graphs structural Hamming distance from true BN is  59
hailfinder_10000.out:Essential graphs structural Hamming distance from true BN is  37
insurance_100.out:Essential graphs structural Hamming distance from true BN is  42
insurance_1000.out:Essential graphs structural Hamming distance from true BN is  29
insurance_10000.out:Essential graphs structural Hamming distance from true BN is  12
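For reference, a sketch of the distance computation under the usual definition (one unit for each edge that is missing, extra, or differently oriented between the two essential graphs); this is a hypothetical re-implementation and may differ in detail from the script actually used:

```python
def shd(g1, g2):
    """Structural Hamming distance between two essential graphs (PDAGs).
    Each graph maps frozenset({u, v}) -> (u, v) for an arrow u -> v,
    or '-' for an undirected line. Count every vertex pair on which
    the two graphs disagree (absent vs present vs orientation)."""
    keys = set(g1) | set(g2)
    return sum(1 for k in keys if g1.get(k) != g2.get(k))

# Two toy essential graphs over {A, B, C}: they agree on A -> B, disagree
# on the status of the B/C edge, and only the first has an A/C edge.
g1 = {frozenset(['A', 'B']): ('A', 'B'),
      frozenset(['B', 'C']): ('B', 'C'),
      frozenset(['A', 'C']): '-'}
g2 = {frozenset(['A', 'B']): ('A', 'B'),
      frozenset(['B', 'C']): '-'}
print(shd(g1, g2))
```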
    

Last modified: Thu Sep 4 14:23:02 BST 2008