Mark Bartlett - Document Clustering

Document Clustering

Multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. We have used hierarchical agglomerative clustering (HAC) to cluster multilingual parallel text drawn from web content. A clustering algorithm that takes constraints from parallel corpora has several potentially attractive features. Firstly, training samples in another language provide indirect evidence for a classification or clustering result. Secondly, constraints from both languages may help to eliminate biased language-specific usages, resulting in classes of better quality. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which may then be used in other applications; in this work, for example, we use these words for term reduction.
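To make the HAC step concrete, the following is a minimal sketch in Python, not the system described in the publications below. It assumes whitespace tokenisation, raw term-frequency vectors, cosine similarity and average linkage; the function names (`tf_vector`, `cosine`, `hac`) and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(docs, k):
    """Start with one cluster per document and repeatedly merge the two
    most similar clusters (average linkage) until k clusters remain."""
    vecs = [tf_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        candidates = [
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = max(candidates, key=lambda c: c[0])
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


docs = [
    "bank loan interest rate",
    "interest rate bank credit",
    "football match goal",
    "goal football league",
]
print(hac(docs, 2))
```

This naive version recomputes every pairwise linkage at each merge, which is fine for a handful of documents but would need a cached similarity matrix for a corpus of realistic size.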

We investigated the clustering of a significant parallel corpus for a pair of languages, one high-density (English) and one low-density (Bulgarian). Preliminary results show that the HAC algorithm can effectively cluster each side of a bilingual parallel corpus separately and still extract the same words that best describe the clusters in both the English and Bulgarian corpora.
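One way to picture this comparison is sketched below in Python. It is an illustration under stated assumptions, not the evaluation used in the papers: agreement between the two separately produced clusterings is measured with the Rand index over aligned document pairs, and the words describing a cluster are scored by how much more often they occur inside the cluster than outside it. The function names, the scoring rule and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def top_terms(docs, labels, n=2):
    """For each cluster, return the n terms with the highest
    (in-cluster count minus out-of-cluster count) score.
    Ties are broken alphabetically."""
    result = {}
    for cluster in sorted(set(labels)):
        inside, outside = Counter(), Counter()
        for doc, label in zip(docs, labels):
            target = inside if label == cluster else outside
            target.update(doc.lower().split())
        ranked = sorted(inside, key=lambda t: (-(inside[t] - outside[t]), t))
        result[cluster] = ranked[:n]
    return result


# Toy data: cluster labels for four aligned document pairs, as they might
# come out of clustering each language side separately.
labels_en = [0, 0, 1, 1]
labels_bg = [1, 1, 0, 0]  # same grouping, different cluster ids

en_docs = [
    "bank loan rate",
    "bank credit rate",
    "football goal",
    "football league goal",
]

print(rand_index(labels_en, labels_bg))
print(top_terms(en_docs, labels_en))
```

The Rand index ignores how clusters are numbered, which matters here because the two clusterings are produced independently and their cluster ids need not match.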

Additionally, I have worked on the construction of a (potentially commercial) tool for clustering documents by textual similarity. This was initially funded by a Capacity Building Grant for Knowledge Transfer to Industry from the White Rose University Consortium. A screenshot of an earlier version of the implemented system can be seen here.

Complete List of My Publications on Document Clustering

Hierarchical Agglomerative Clustering for Cross-Language Information Retrieval
Rayner Alfred, Dimitar Kazakov, Mark Bartlett and Elena Paskaleva
International Journal of Translation 19(1), pp. 139–162, 2007
Hierarchical Agglomerative Clustering of Parallel Corpora of Bulgarian-English Documents
Rayner Alfred, Elena Paskaleva, Dimitar Kazakov and Mark Bartlett
Proc. of Recent Advances in Natural Language Processing 2007 (RANLP 2007), 2007