Research proposal

Ontology-based Feature-Generation for Biological Sequence Categorization
A major challenge in bioinformatics over the next few years will be to identify the thousands of genes in an organism's genome and ascribe putative function to them. Whole prokaryotic and eukaryotic genome sequences are now becoming available. However, existing methodologies based on searching for protein homologs of known function can ascribe identity and function to only a portion of the genes present. For example, in chromosomes 2 and 4 of the model higher plant Arabidopsis thaliana, which were recently sequenced in full [Lin et al., 1999; Mayer et al., 1999], about 50% and 40% of the open reading frames, respectively, have not been ascribed any function. Here we propose to use a machine-learning and constructive-induction approach to assign putative function to unannotated sequences.

The sequence categorization problem is to take a collection of labeled sequences and compute a procedure that accurately assigns labels to unlabeled sequences. This problem is especially relevant in molecular biology, where most data is sequential (e.g., DNA or protein sequences) and the effective prediction of function or structure labels is an area of active research. Sequence categorization is therefore becoming an important tool for automatically predicting the structure and/or function of a biological sequence from the sequence itself.
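The problem statement above can be made concrete with a small sketch. The Python below is only an illustration of the input/output shape of the task (labeled sequences in, a labeling procedure out); the nearest-neighbour rule over k-mer counts, the function names, and the toy sequences and labels are all assumptions made for this example and are not the method proposed here.

```python
from collections import Counter
from typing import Callable, List, Tuple


def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count the overlapping substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlap(a: Counter, b: Counter) -> int:
    """Number of k-mer occurrences shared by two profiles (multiset intersection)."""
    return sum((a & b).values())


def train(labeled: List[Tuple[str, str]], k: int = 2) -> Callable[[str], str]:
    """Take labeled sequences and return a procedure that labels new sequences.

    Hypothetical toy learner: the label of the most similar training
    sequence (by shared k-mers) is assigned to the query.
    """
    profiles = [(kmer_counts(seq, k), label) for seq, label in labeled]

    def classify(seq: str) -> str:
        query = kmer_counts(seq, k)
        best_profile, best_label = max(profiles, key=lambda p: overlap(query, p[0]))
        return best_label

    return classify


# Toy data, invented for illustration only.
labeled = [("ATATATTTAA", "AT-rich"), ("GCGCGGCCGC", "GC-rich")]
classify = train(labeled)
print(classify("TTATAATA"))
```

Any learner with the same interface could be substituted for `train`; the point is only that the output of learning is itself a procedure for labeling unseen sequences.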

The problem of biological sequence categorization has been approached in the past using methods such as Hidden Markov Models, inductive logic programming, and feature-based learning algorithms. Most methods generate procedures that predict the function or structure of a sequence from relatively simple patterns, such as those used in PROSITE [Hofmann et al., 1999]. Hirsh and Noordewier [1994] have shown that such sequence patterns yield sub-optimal performance when predicting whether a given DNA sequence is a promoter, and suggested an alternative sequence representation that uses features such as A/T-composition and helical parameters.

We propose to develop a sequence categorization method in which feature-based machine learning systems are enhanced with a feature generation algorithm, enabling them to incorporate more sophisticated sequence-related features that go beyond simple patterns. The proposed project will extend and evaluate an existing domain-independent feature generation method for sequence categorization tasks (FGEN [Kudenko et al., 1998]). We plan to tailor FGEN specifically to biological sequences by providing ways to inject domain knowledge. Our aim is to use this new tool for structural and functional re-annotation of the chromosomes of Arabidopsis thaliana and of sequences in other higher-plant databases. We also believe it could be used to categorize the promoter regions of co-regulated genes. The proposed technology should make a significant contribution to the gene discovery programmes within the Centre for Novel Agricultural Products at the University of York.
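To illustrate the difference between a simple pattern and a composition-style feature such as the A/T-composition mentioned above, the following Python sketch turns a DNA sequence into a small feature dictionary that a feature-based learner could consume. It is a minimal sketch under stated assumptions: the function names and the example pattern strings are hypothetical, helical parameters are omitted, and nothing here reproduces the FGEN algorithm.

```python
from typing import Dict, Iterable


def at_composition(seq: str) -> float:
    """Fraction of bases in seq that are A or T (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "AT" for base in seq) / len(seq)


def featurize(seq: str, patterns: Iterable[str]) -> Dict[str, float]:
    """Map a sequence to named numeric features.

    Combines one composition feature with simple presence/absence
    features for each substring pattern. The patterns are supplied by
    the caller; those used below are placeholders for illustration.
    """
    upper = seq.upper()
    features = {"at_composition": at_composition(upper)}
    for pattern in patterns:
        features[f"has_{pattern}"] = 1.0 if pattern in upper else 0.0
    return features


# Example sequence and patterns are invented for illustration only.
print(featurize("TATAATGC", ["TATAAT", "GGCC"]))
```

Representing each sequence as a fixed set of named features is what allows a standard feature-based learner to be applied; domain knowledge would enter through the choice of which features to generate.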

Related Online Papers:

D. Kudenko, H. Hirsh: Feature Generation for Sequence Categorization, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI 98).
D. Kudenko, H. Hirsh: Feature-Based Learners for Description Logics, Proceedings of the International Workshop on Description Logics (DL 99).
H. Hirsh, D. Kudenko: Representing Sequences in Description Logics, Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI 97).


This page is maintained by Daniel Kudenko
Last updated on 21 May 2000.