JC's Project Suggestions 2003-04
(chosen in 03-04; to be taken in 04-05)
I am willing to consider self-defined projects in any of the following
areas:
- Machine Learning
- Bioinformatics
- Inductive Logic Programming
- Logic Programming
- Statistical Natural Language Processing
- Bayesian Nets
- Statistical Computing
JC/1 - Visualising Hidden Markov Models with Python [CS/CSM, IP]
Hidden Markov models (HMMs) are simple but effective probabilistic
models with applications in speech processing and bioinformatics. There
exist well-known algorithms for (i) estimating the parameters of an HMM
from data and (ii) finding the most likely 'path' through an HMM. The
purpose of this project is to add a graphical component to an
implementation of these algorithms which clarifies how they work.
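Task (ii) is usually solved with the Viterbi algorithm. The following is
a minimal sketch of it in plain Python, not a prescribed design for the
project. The function name, the dictionary-based model layout and the
argument names are assumptions made for illustration only.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely state sequence.

    Assumed layout (illustrative): start_p[s], trans_p[r][s] and
    emit_p[s][o] are plain dictionaries of probabilities.
    """
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability.
            p, r = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], r)
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path
```

The table `best` holds one column per observation, which is the kind of
intermediate state a graphical component could display step by step.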
Python is an object-oriented scripting language with a great deal of
functionality. Python's Tkinter module adds graphical capabilities to
Python programs.
References:
- Biopython
- Python
- The IML module covers HMMs
JC/2 - Comparing techniques for part-of-speech tagging [CS/CSM, IP]
Many natural language processing applications require an assignment of
parts of speech (noun, verb etc) to words in the process of analysing
text. This provides the foundation for more abstract syntactic or
semantic levels of analysis. There are several standard methods
currently used: Hidden Markov Models (see e.g. Chapter 7 of James Allen,
Natural Language Understanding, 2nd ed., Benjamin/Cummings Publishing,
1995); or rule-based methods: see the transformation-based tagging
papers at
http://www.cs.jhu.edu/~brill/
You can also use inductive logic programming to learn rules for
tagging: see James Cussens. Part-of-speech tagging using Progol,
available at
http://www-users.cs.york.ac.uk/~jc/research/pub.html
This project invites you to compare the different techniques and to
design and train your own part-of-speech tagger using one of the above
methods.
You will use the locally available Wall Street Journal corpus as the
training material.
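When comparing taggers it helps to have a simple reference point and a
common accuracy measure. The sketch below is not one of the methods
listed above: it is a most-frequent-tag baseline, shown only as an
illustration. The function names and the representation of a corpus as
lists of (word, tag) pairs are assumptions, and any data passed to it in
examples would be made up rather than taken from the corpus.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from (word, tag) sentences."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    # Unknown words fall back to the overall most frequent tag.
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag a list of words using the learned lexicon."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

The same `accuracy` function can score the output of any tagger on
held-out text, so that different techniques are compared on equal terms.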
JC/3 - Markov chain vs. search for learning Bayes nets from data [CS/CSM]
Learning Bayesian nets from data is an important task, for example in
the analysis of gene expression data. In the most commonly used approach
a search algorithm is used which looks for the Bayesian net which most
closely fits the data (perhaps with a penalty included to penalise
overly complex nets). This approach has been explored in 2 previous
projects which will provide useful background to this project. In the
Markov chain approach a sequence of nets is generated and inferences
about features of the 'true' network are computed from this sequence.
For example, one could compute the probability that any two nodes are
connected.
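The last step can be sketched as follows. The code assumes that the
sequence of nets has already been generated and that each net is given
as a set of directed edges (parent, child); generating the sequence is
the substantial part of the project and is not shown. The function name
and the data layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def connection_probabilities(sampled_nets, nodes):
    """Estimate, for each pair of nodes, the probability of an edge.

    The estimate is the fraction of sampled nets that contain an edge
    between the two nodes, in either direction.
    """
    if not sampled_nets:
        raise ValueError("need at least one sampled net")
    counts = Counter()
    for edges in sampled_nets:
        # Ignore direction: (A, B) and (B, A) both connect A and B.
        counts.update({frozenset(edge) for edge in edges})
    n = len(sampled_nets)
    return {(a, b): counts[frozenset((a, b))] / n
            for a, b in combinations(sorted(nodes), 2)}
```

In practice the early part of the sequence is often discarded before
such averages are taken; whether and how to do that is a design choice
left open here.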
The goal of this project is to compare these two approaches.
Search-based implementations are publicly available, but you will be
required to implement a Markov chain approach to enable the comparison.
This project is only suitable for a student taking the BAN module.
JC/4 - Decision networks in Python [CS/CSM]
Decision networks are graphical representations which are used to make
rational decisions under conditions of uncertainty. Given probabilities
concerning relevant events, the utilities of various outcomes and a set
of actions available, one can construct a decision network to model the
situation. "Solving" the network amounts to working out the best action
to take. The goal of this project is to construct software for
representing and solving decision networks. The software will be
written in the Python language (an object-oriented, scripting
language). You are required to build a GUI for drawing networks. There
is a fair amount of Python code available which you can build on.
Ideally, this project would dovetail with JC's work on Python software
to animate algorithms used in Bayesian networks (a formalism closely
related to decision networks), but this is not a requirement.
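As a rough illustration of what "solving" means in the simplest case,
the sketch below handles a single decision with one uncertain event
whose probabilities do not depend on the action taken. A full decision
network is more general than this. The function names and the
dictionary layout are assumptions made for the example.

```python
def expected_utility(action, outcome_probs, utility):
    """Expected utility of an action.

    outcome_probs maps each outcome to its probability; utility maps
    (action, outcome) pairs to a number. Both layouts are illustrative.
    """
    return sum(p * utility[(action, outcome)]
               for outcome, p in outcome_probs.items())


def best_action(actions, outcome_probs, utility):
    """Return the action with maximum expected utility, plus all scores."""
    scores = {a: expected_utility(a, outcome_probs, utility) for a in actions}
    return max(scores, key=scores.get), scores


# Hypothetical example: whether to take an umbrella.
utility = {
    ("take", "rain"): 70, ("take", "dry"): 20,
    ("leave", "rain"): 0, ("leave", "dry"): 100,
}
choice, scores = best_action(["take", "leave"],
                             {"rain": 0.3, "dry": 0.7}, utility)
```

With these made-up numbers the expected utilities are 35 for "take" and
70 for "leave", so "leave" is chosen; raising the probability of rain
far enough reverses the decision.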
JC/5 - Learning Bayesian Networks in Python [CS/CSM]
Learning Bayesian networks from data is a 'hot' topic in machine
learning. You are not required to come up with a new algorithm: your
job is to implement an existing approach (or two). The software will be
written in the Python language (an object-oriented, scripting language).
There is a fair amount of Python code available which you can build on.
Ideally, this project would dovetail with JC's work on Python software
to animate algorithms used in Bayesian networks (a formalism closely
related to decision networks), but this is not a requirement.
JC/6 - Better EMBOSS support for Biopython [CS/CSM/IP]
Biopython is a collection of Python modules which implement useful code
for dealing with biological data (running remote database queries,
manipulating biological sequences, etc). Python is an
object-oriented, scripting language with a very clear syntax which is
becoming popular in bioinformatics. Biopython does not yet have all the
utilities which the longer established Bioperl project has. The goal of
this project is to extend Biopython's interface with the EMBOSS suite
of sequence analysis programs. The challenge here is to produce
software of a sufficiently high quality that it is accepted into the
Biopython distribution. This means that, as well as the actual code,
proper documentation and test suites must be produced.
JC/7 - Better alignment program support for Biopython [CS/CSM/IP]
Biopython is a collection of Python modules which implement useful code
for dealing with biological data (running remote database queries,
manipulating biological sequences, etc). Python is an
object-oriented, scripting language with a very clear syntax which is
becoming popular in bioinformatics. Biopython does not yet have all the
utilities which the longer established Bioperl project has. The goal of
this project is to extend Biopython's capabilities to better handle
interacting with computer programs for aligning biological sequences.
At present there is support for the clustalw alignment program, but
support for other programs such as T-Coffee and POA would improve
Biopython. The challenge here is to produce software of a sufficiently
high quality that it is accepted into the Biopython distribution. This
means that, as well as the actual code, proper documentation and test
suites must be produced.