JC's Project Suggestions 2003-04

(chosen in 03-04; to be taken in 04-05)



I am willing to consider self-defined projects in any of the following areas:

JC/1 - Visualising Hidden Markov Models with Python [CS/CSM, IP]

Hidden Markov models (HMMs) are simple but effective probabilistic models with applications in speech processing and bioinformatics. There exist well-known algorithms for (i) estimating the parameters of an HMM from data and (ii) finding the most likely 'path' through an HMM. The purpose of this project is to add a graphical component to an implementation of these algorithms which clarifies how they work.

Python is an object-oriented scripting language with a great deal of functionality. Python's Tkinter module adds graphical capabilities to Python programs.
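
To give a flavour of algorithm (ii), here is a minimal sketch of the Viterbi algorithm, which finds the most likely path through an HMM. The function name, the dictionary-based model representation and the toy weather model are all assumptions made for illustration; they are not taken from any existing implementation, and the probabilities are made up.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely state sequence."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy model (hypothetical numbers): hidden weather, observed activity.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

prob, path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
```

The tables `best` and `back` are exactly the kind of intermediate state that a graphical component could display step by step. A practical version would work with log probabilities to avoid underflow on long sequences.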

References:

  1. Biopython
  2. Python
  3. The IML module covers HMMs

JC/2 - Comparing techniques for part-of-speech tagging [CS/CSM, IP]

Many natural language processing applications require an assignment of parts of speech (noun, verb, etc.) to words in the process of analysing text. This provides the foundation for more abstract syntactic or semantic levels of analysis. Several standard methods are currently in use. One is based on Hidden Markov Models (see e.g. Chapter 7 of James Allen, Natural Language Understanding, 2nd ed., Benjamin/Cummings Publishing, 1995). Another is rule-based: see the transformation-based tagging papers at

http://www.cs.jhu.edu/~brill/

You can also use inductive logic programming to learn rules for tagging: see James Cussens, Part-of-speech tagging using Progol, available at

http://www-users.cs.york.ac.uk/~jc/research/pub.html

This project invites you to compare the different techniques, and to design and train your own part-of-speech tagger using one of the above methods. You will use the locally available Wall Street Journal corpus as the training material.
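
When comparing taggers it helps to have a simple yardstick. The sketch below is not one of the methods listed above: it is a most-frequent-tag baseline, which gives every known word the tag it received most often in training and every unknown word the most common tag overall. The function names and the corpus representation (a list of sentences, each a list of (word, tag) pairs) are assumptions; the real corpus files will need their own reader.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Return (lexicon, default_tag) learned from (word, tag) sentences."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen with it
    all_tags = Counter()            # tag -> total frequency
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its most frequent training tag."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def accuracy(gold_sentences, lexicon, default_tag):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        predicted = tag(words, lexicon, default_tag)
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total


# Tiny made-up training set using Penn Treebank style tags.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dog", "NN"), ("food", "NN"), ("smells", "VBZ")],
]
lexicon, default_tag = train_baseline(train)
print(tag(["The", "zebra", "barks"], lexicon, default_tag))
```

Reporting `accuracy` on held-out sentences for both the baseline and your own tagger makes the comparison between techniques concrete.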



JC/3 - Markov chain vs. search for learning Bayes nets from data [CS/CSM]

Learning Bayesian nets from data is an important task, for example in the analysis of gene expression data. The most commonly used approach employs a search algorithm which looks for the Bayesian net which most closely fits the data (perhaps with a penalty for overly complex nets). This approach has been explored in two previous projects, which will provide useful background to this project. In the Markov chain approach, a sequence of nets is generated and inferences about features of the 'true' network are computed from this sequence. For example, one could compute the probability that any two nodes are connected.

The goal of this project is to compare these two approaches. Search-based implementations are publicly available, but you will be required to implement a Markov chain approach to enable the comparison.
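
To make the Markov chain idea concrete, here is a rough sketch of one possible sampler; it is an illustration under stated assumptions, not the required design. It assumes discrete data given as a list of dictionaries, scores a net with the K2 metric (the marginal likelihood under uniform Dirichlet priors), and uses a Metropolis step that adds or deletes a single edge. All function names are hypothetical, and there is no burn-in or convergence checking.

```python
import math
import random
from collections import Counter
from itertools import combinations, permutations


def k2_log_score(data, parents, arity):
    """Log marginal likelihood of the data under the K2 metric."""
    total = 0.0
    for child, pa in parents.items():
        r = arity[child]
        n_ijk = Counter()  # (parent configuration, child value) -> count
        n_ij = Counter()   # parent configuration -> count
        for row in data:
            config = tuple(row[p] for p in sorted(pa))
            n_ijk[(config, row[child])] += 1
            n_ij[config] += 1
        total += sum(math.lgamma(r) - math.lgamma(n + r) for n in n_ij.values())
        total += sum(math.lgamma(n + 1) for n in n_ijk.values())
    return total


def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a directed cycle."""
    # This happens exactly when dst is already an ancestor of src.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def sample_structures(data, arity, n_steps, seed=0):
    """Estimate, for each pair of nodes, the probability they are connected."""
    rng = random.Random(seed)
    nodes = sorted(arity)
    pairs = list(permutations(nodes, 2))
    parents = {v: set() for v in nodes}          # start from the empty net
    score = k2_log_score(data, parents, arity)
    connected = {frozenset(p): 0 for p in combinations(nodes, 2)}
    for _ in range(n_steps):
        src, dst = rng.choice(pairs)             # symmetric proposal
        adding = src not in parents[dst]
        if not (adding and creates_cycle(parents, src, dst)):
            parents[dst] ^= {src}                # toggle the edge src -> dst
            new_score = k2_log_score(data, parents, arity)
            if rng.random() < math.exp(min(0.0, new_score - score)):
                score = new_score                # accept the move
            else:
                parents[dst] ^= {src}            # reject: undo the toggle
        for child in nodes:
            for p in parents[child]:
                connected[frozenset((p, child))] += 1
    return {pair: count / n_steps for pair, count in connected.items()}


# Made-up data: B always equals A, while C is unrelated to both.
data = [{"A": i % 2, "B": i % 2, "C": (i // 2) % 2} for i in range(40)]
probs = sample_structures(data, {"A": 2, "B": 2, "C": 2}, n_steps=2000, seed=1)
for pair, p in probs.items():
    print(sorted(pair), round(p, 2))
```

The returned frequencies are the kind of feature probability described above: the fraction of sampled nets in which two nodes are joined by an edge in either direction.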

This project is only suitable for a student taking the BAN module.


JC/4 - Decision networks in Python [CS/CSM]

Decision networks are graphical representations which are used to make rational decisions under conditions of uncertainty. Given probabilities concerning relevant events, the utilities of various outcomes and a set of available actions, one can construct a decision network to model the situation. "Solving" the network amounts to working out the best action to take. The goal of this project is to construct software for representing and solving decision networks. The software will be written in the Python language (an object-oriented scripting language). You are required to build a GUI for drawing networks. There is a fair amount of Python code available which you can build on. Ideally, this project would dovetail with JC's work on Python software to animate algorithms used in Bayesian networks (a formalism closely related to decision networks), but this is not a requirement.
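
As a minimal illustration of what "solving" means, the sketch below handles the simplest possible case: one uncertain event, one decision and one utility table. The best action is the one with the highest expected utility. The function names and the umbrella example (with made-up numbers) are assumptions for illustration; a real decision network would have many nodes and need proper inference.

```python
def expected_utilities(outcome_probs, utilities):
    """Map each action to the sum over outcomes of P(outcome) * U(action, outcome)."""
    return {
        action: sum(outcome_probs[o] * u for o, u in by_outcome.items())
        for action, by_outcome in utilities.items()
    }


def best_action(outcome_probs, utilities):
    """Return the action with the highest expected utility."""
    eu = expected_utilities(outcome_probs, utilities)
    return max(eu, key=eu.get)


# Hypothetical problem: should I take an umbrella?
weather = {"rain": 0.25, "dry": 0.75}
utility = {
    "take umbrella": {"rain": 70, "dry": 20},
    "leave umbrella": {"rain": 0, "dry": 100},
}

print(expected_utilities(weather, utility))
print(best_action(weather, utility))  # leave umbrella
```

Here taking the umbrella has expected utility 0.25 * 70 + 0.75 * 20 = 32.5 and leaving it has 0.25 * 0 + 0.75 * 100 = 75, so leaving it is the rational choice under these numbers.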


JC/5 - Learning Bayesian Networks in Python [CS/CSM]

Learning Bayesian networks from data is a 'hot' topic in machine learning. You are not required to come up with a new algorithm: your job is to implement an existing approach (or two). The software will be written in the Python language (an object-oriented scripting language). There is a fair amount of Python code available which you can build on. Ideally, this project would dovetail with JC's work on Python software to animate algorithms used in Bayesian networks (a formalism closely related to decision networks), but this is not a requirement.


JC/6 - Better EMBOSS support for Biopython [CS/CSM/IP]

Biopython is a collection of Python modules which implement useful code for dealing with biological data (running remote database queries, manipulating biological sequences, etc.). Python is an object-oriented scripting language with a very clear syntax which is becoming popular in bioinformatics. Biopython does not yet have all the utilities which the longer-established Bioperl project has. The goal of this project is to extend Biopython's interface with the EMBOSS suite of sequence analysis programs. The challenge here is to produce software of a sufficiently high quality that it is accepted into the Biopython distribution. This means that, as well as the actual code, proper documentation and test suites must be produced.

JC/7 - Better alignment program support for Biopython [CS/CSM/IP]

Biopython is a collection of Python modules which implement useful code for dealing with biological data (running remote database queries, manipulating biological sequences, etc.). Python is an object-oriented scripting language with a very clear syntax which is becoming popular in bioinformatics. Biopython does not yet have all the utilities which the longer-established Bioperl project has. The goal of this project is to extend Biopython's capabilities for interacting with computer programs that align biological sequences. At present there is support for the clustalw alignment program, but support for other programs such as T-Coffee and POA would improve Biopython. The challenge here is to produce software of a sufficiently high quality that it is accepted into the Biopython distribution. This means that, as well as the actual code, proper documentation and test suites must be produced.
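
Support for an external program usually starts with a small wrapper that builds the command line and runs it. The sketch below shows the general shape only. The class name, the `-key=value` option style and the program name `aligner` are all hypothetical; this is not Biopython's API, and each real program has its own options which you would need to take from its documentation.

```python
import subprocess


class AlignmentCommandline:
    """Hypothetical wrapper that builds and runs an external aligner command."""

    def __init__(self, program, infile, **options):
        self.program = program
        self.infile = infile
        self.options = options

    def args(self):
        """Return the argument list, e.g. ['prog', 'in.fasta', '-key=value']."""
        # Assumed option syntax; sorted so the command line is reproducible.
        extra = [f"-{key}={value}" for key, value in sorted(self.options.items())]
        return [self.program, self.infile, *extra]

    def run(self):
        """Run the program and return its standard output as text."""
        # check=True raises CalledProcessError if the program exits with an error.
        result = subprocess.run(
            self.args(), capture_output=True, text=True, check=True
        )
        return result.stdout


# 'aligner' is a placeholder name, so here we only build the command.
cmd = AlignmentCommandline("aligner", "seqs.fasta", outfile="out.aln")
print(cmd.args())  # ['aligner', 'seqs.fasta', '-outfile=out.aln']
```

Keeping command construction in `args` separate from execution in `run` means the test suite can check the generated command lines without the external program being installed.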