Dimitar Kazakov's project proposals (2001/02)

DLK/1
Project 1: Evolutionary simulation of inherited kinship-driven altruism [CS]

The aim of the project is to assess the role of a hypothetically inherited feature (gene) promoting altruism between relatives as a factor for survival. The approach will simulate an environment in which two species, A and B, are randomly seeking for food, competing for the same food resources. Food consumption will result in increased individual energy level. Falling below a certain energy level would mean death. An encounter of two individuals of the same species results in the creation of an offspring if (1) their energy levels are sufficient and (2) they meet each other's criteria based on present and past energy levels. The initial energy level of the offspring is subtracted from that of the parents.

Some of the individuals of species A will be carriers of a gene forcing them to share their energy with the individuals they meet in proportion to the degree of kinship (shared genes). For example, identical twins would always make their energy levels equal etc. The gene will be passed with a certain probability from parent to child.

Evaluation will assess the progressive change of the following two parameters:

  1. the ratio between individuals of species A and B;
  2. the proportion of gene carriers within species A.
In the simulation, spatial phenomena (food discovery, encounter of another individual) will be represented as random processes with a certain probability. In this case, physical distance between individuals is ignored, and the encounter of each pair is equally probable. Genetic algorithms are most suited to the implementation of this method.

Note: The project does not aim to imply the existence of such gene in reality, and indeed nothing of the said above would change if we assumed altruistic behaviour being inherited not as a gene, but through upbringing.

Reading:


DLK/2
Project 2:  WordNet-based Optimal Lexical Semantic Tagging [CS]

Aims: The aims of the project are to: A semantic tag will be considered optimal if it belongs to the optimal set of semantic tags. A set of semantic tags SST is considered optimal if it consists of those semantic tags which have minimal entropy among the set of all relevant semantic tags, and cover the required minimum of words in the text. For the purposes of the project, the lexical resource WordNet will be used as a source of semantic tags.

Background: Natural Language Processing (NLP) aims to develop tools that are able to retrieve,  extract and classify linguistic  information, and,  to a certain  level,  understand language  and provide  translation or relevant feedback. The large area of  natural  language study is traditionally  divided into sub-domains, e.g. morphology, syntax, semantics, etc. This division can be reflected in the range and functionality of tools being used. For instance, part-of-speech tagging serves to assign the corresponding parts of speech:

John/proper_noun likes/verb_3rd_person chocolate/common_noun ./full_stop

Once the parts of speech are assigned to words, the sentence can be parsed, i.e. syntactically analysed. (Compare with programming languages - IPL, remember? - where each token is assigned a type: NUM, ID, 'IF', etc.)  However, the PoSs are often not sufficient.

John/noun loves/verb bananas/noun with/preposition chocolate/noun

John/noun loves/verb bananas/noun with/preposition passion/noun

The sentences above have exactly the same PoSs, but different parse trees. If the grammar is based on PoSs, there is no way of discriminating between the two sentences. A way around this is to use the word itself (e.g. chocolate)  instead of the PoS (noun) in the grammar rule. This technique, called lexicalisation, is not a panacea though, as it results in a large number of grammar rules with very low coverage. Semantic tags (classes) can provide the means to describe grammar rules that do not generate ambiguity yet have sufficient coverage:

NP --> noun/food prep/with noun/food

Method: The semantic classes of words contained in a document can be used as a crude description of its content. The problem is that each word can correspond to several semantic classes. The output of the lexical database WordNet for the word 'bank' is shown in Figure 1 (note the US spelling).
 
 
Overview of noun bank

1. depository financial institution, bank, banking concern, banking company -- (a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home")
2. bank -- (sloping land (especially the slope beside a body of water); "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents")
3. bank -- (a supply or stock held in reserve especially for future use (especially in emergencies))
4. bank, bank building -- (a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon")
5. bank -- (an arrangement of similar objects in a row or in tiers; "he operated a bank of switches")
6. savings bank, coin bank, money box, bank -- (a container (usually with a slot in the top) for keeping money at home; "the coin bank was empty")
7. bank -- (a long ridge or pile; "a huge bank of earth")
8. bank -- (the funds held by a gambling house or the dealer in some gambling games; "he tried to break the bank at Monte Carlo")
9. bank, cant, camber -- (a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force)
10. bank -- (a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning))

Overview of verb bank

1. bank -- (tip laterally; of boats and aircraft)
2. bank -- (enclose with a bank; "bank roads")
3. bank -- (do business with a bank or keep an account at a bank; "Where do you bank in this town?")
4. bank -- (be in the banking business)
5. deposit, bank -- (put into a bank account)
6. bank -- (cover with ashes, of fires, to control the rate of burning)
7. trust, swear, rely, bank -- (have confidence or faith in; "We can trust in God"; "Rely on your friends"; "bank on your good education")

Fig. 1: WordNet output for the word 'bank'

The semantic classes of each word can form a hierarchy (see Fig. 2):
 
 
depository financial institution, bank, banking concern, banking company
       => financial institution, financial organization
           => institution, establishment
               => organization, organisation
                   => social group
                       => group, grouping
Fig. 2: WordNet hierarchy of semantic classes

Lexical semantic tags can be obtained by manually annotating the text - a slow and costly procedure. Here we expect to use measures from Information Theory, such as entropy (defining the numerical measure of the uncertainty of an outcome), to find the most informative semantic classes serving to distinguish among nouns (resp. verbs) present in our text. Tags with the lowest entropy will be selected and used to

Reading:


DLK/3
Project 3:  Segmentation of Text with Genetic Algorithms [CS]

Background: Although  very complex  in  nature, natural  language  is yet  another information  code,  and as  such  should  be  subject to  the  general principles  of  information  theory  (IT). Unsupervised  learning  of language studies the models of language extracted from raw text, which are based on statistical techniques such as Hidden Markov Models (HMM) or  language biases otherwise  related to  IT -  Maximum Likelihood or Minimal  Description  Length (MDL).   Other  relevant techniques include de  Saussure's analogy or models based  on finite state  automata.  The  impoverished data  renders the task difficult.  The prize to win though is a representation or theory backed  by the  IT  principles, which  could  also be  of interest  to psycholinguists.  On a more  practical level, unsupervised learning of language can  provide a cheap alternative to learning  from annotated corpora.

The segmentation of text, or other sequential data, is often used as the first step in data processing. In the area of Natural Language Processing (NLP), this is done to segment text into tokens (words) as well as to produce the word constituents (morphemes) of each word. Segmentation methods vary from sets of hand-written rules to ones learned from annotated examples where the correct segmentation is provided, to unsupervised learning of segmentation, in which case only text with no annotation whatsoever is used.

Unsupervised learning combines a language, or data representation formalism, with a  bias  that allows for the definition of the best representation of a given input in this language, and a search technique to find this representation. For instance:

Language: represent words as a concatenation of a stem and suffix
Bias    : minimise the size of stem and suffix lexicons 
Search technique: genetic algorithm (GA)
Then the list of words work, works, worked, bird, birds might be segmented as  work+, work+s, work+ed, bird+, bird+s as it results in a prefix lexicon {work, bird} and suffix lexicon {'',-s, -ed} which contain less characters in comparison with any other segmentation (Kazakov, 1997).

Aims: The project aims at the implementation of a genetic algorithm (C++) that generates the optimal segmentation of a given text with an MDL-based bias.  The method efficiency will be tested (1) directly on annotated data or (2) indirectly, evaluating the generative power of the segmentations produced (e.g. using de Saussure's analogy principle: work+s, work+ing, and sleep+ing  are correct segmentations, because sleep+s is a valid word).

Method: The method described in (Kazakov, 1997) has two limitations: the number of segments per word is limited and it cannot be efficiently applied on more than about a thousand words at a time. Changes in the data representation formalism should tackle the former issue; for the improvement of the latter, the GA fitness function (reflecting the bias) would be computed in a much more efficient manner.

Reading:


DLK/4
Project 4:    Pretty Print of Graphs [CS/M, CS]

Aim: The aim of the project is to generate a suitable 2D representation of graphs from a textual description of their vertices and edges. The output should be in the internal format of  xfig or any other comparable graph editor accepting text input. The task is to minimise the number of crosspoints between edges. This is possible in the case of the so-called planar graphs. Otherwise, the number of crosspoints should be reasonably reduced (sub-optimal and heuristic-based strategies are allowed).

Method: Use theory of graphs, namely planar graphs, planarity testing and planar representations. Implement in C or Perl.

Reading:


DLK/5
Project 5:    Behaviour Learning in Game Environments (Student-defined project) [Alex Champandard, CS]

Aim: The project will focus on the use of machine learning methods, such as Inductive Logic Programming (ILP) and Genetic Algorithms (GA) in  the design of virtual actors for computer games.

Method: Use Quake II game engine, ILP engine Progol, and genetic algorithms. Implement in C++.

Reading:



DLK/6
Project 6:  Machine Learning of Motion Control in 3D Virtual Environment  (Student-defined project) [Chris Peake, CS]

Aim: The project should develop a mechanism to learn the low-level motor co-ordination of a virtual character given its initial position, and high-level task in a simulated 3D environment. Real world properties, such as gravity and friction could be added to make the simulation more realistic.

Method: Design and model a virtual character and its environment, and the underlying data structures...

Reading:
 


DLK/7
Project 7:    Speech Driven Character Animation (Student-defined project) [Andrew Coates, CS]

Aim:

Method:
 

  1. construct a computer model with a predefined set of actions. The character need not be complicated, but must have a resting state with several sets of actions that it can perform starting and ending in this resting state.

  2.  
  3. Use IBM voice recognition SW to recoginse voice commands.

  4.  
  5. Bring steps one and two together in order to control the character in real time using IBM speech recognition.


Reading:



DLK/8
Project 8:  Modelling the constraints that influence cycle commuter choices [CS, CS/M, MScIP]

Aim:

Research  in the Department  of Sociology  has been  investigating the ways in which different factors affect people's willingness or ability to  cycle to  work.  Influences  can come  from a  range  of different sources  -  the  workplace  itself, domestic circumstances,  personal preferences, bicycle technology and sources external to all of these.

Provisional data on these  factors already exists for 100 individuals, with  a further  survey planned  at two  companies which should yield additional data for up to 500 further individuals.  The objective will be to develop a system which can anticipate whether or not somebody is able to  cycle to  work, and  if not, which  are the  most significant constraints for them.  The system will need not just to rank different constraints but  to incorporate a number of  'packages' of constraints that interact  together.  For  example, with the  relationship between journey  distance  (d), the  availability  of  shower facilities  (s), personal fitness (f) and office dress code (dc), slight changes in the patterning can make a big difference:

Providing cycle promoters in  local authorities with information about how such patterns are formed, and where they occur across a workforce, can help make their work more effective.

Method:

Use  machine learning algorithms,  such as C4.5 and  Progol to analyse data and form hypotheses in the form of decision trees (in the case of C4.5) and first order  logic rules (when Progol is used). Both C4.5  and  Progol  can  help  the collection  of  additional  data  by selecting   the  attributes/features,  which   are  relevant   to  the hypotheses  formed.    In  addition,  Progol  can   be  provided  with potentially  relevant  higher-level concepts  that  are  based on  the information provided in the questionnaire, e.g.:

fit_and_well_dressed(Person):-
       fit(Person),
       well_dressed(Person).

The above  condition is only satisfied  for those people  who are both fit and well dressed. If Progol includes this composite feature in the theories  learned (and  never  makes  use of  the  features `fit'  and `well-dressed' separately), that might be a reason to add this feature to  the questionnaire  in  this very  form,  and not  as two  separate questions.

References:

DLK/9
Project 9: Evolutionary simulation of inherited kinship-driven altruism [MScIP]

The aim of the project is to assess the role of a hypothetically inherited feature (gene) promoting altruism between relatives as a factor for survival. The approach will simulate an environment in which two species, A and B, are randomly seeking for food, competing for the same food resources. Food consumption will result in increased individual energy level. Falling below a certain energy level would mean death. An encounter of two individuals of the same species results in the creation of an offspring if (1) their energy levels are sufficient and (2) they meet each other's criteria based on present and past energy levels. The initial energy level of the offspring is subtracted from that of the parents.

Some of the individuals of species A will be carriers of a gene forcing them to share their energy with the individuals they meet in proportion to the degree of kinship (shared genes). For example, identical twins would always make their energy levels equal etc. The gene will be passed with a certain probability from parent to child.

Evaluation will assess the progressive change of the following two parameters:

  1. the ratio between individuals of species A and B;
  2. the proportion of gene carriers within species A.
An existing tool would be used for the simulations.

Note: The project does not aim to imply the existence of such gene in reality, and indeed nothing of the said above would change if we assumed altruistic behaviour being inherited not as a gene, but through upbringing.

Reading:


DLK/10 (Joint supervision with Guiem Bernat)
Project 10: Behavioural cloning of football players [Student-defined: Jamie Hodkinson]

Aim: The project will use machine learning techniques (Inductive Logic Programming, Decision Trees) to model (copy) the recorded behaviour of single players within a team or the team as a whole. The project will use the setting of Robocup as a testbed.
 



Last change: 28 Feb 2000
kazakov@cs.york.ac.uk