Document Clustering
Multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. We have used the hierarchical agglomerative clustering (HAC) technique to cluster multilingual parallel text drawn from web content. A clustering algorithm that takes constraints from parallel corpora potentially has several attractive features. Firstly, training samples in another language provide indirect evidence for a classification or clustering result. Secondly, constraints from both languages may help to eliminate some biased language-specific usages, resulting in classes of better quality. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which may then be used for other applications. As an example, in this paper we utilise these words for term reduction.
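To make the HAC step concrete, here is a minimal sketch in plain Python. It assumes term-frequency vectors, cosine distance and average linkage; none of these choices is stated above, so they stand in for whatever configuration was actually used. The function names and the four toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def hac(docs, num_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per document and repeatedly merges the two
    clusters with the smallest average pairwise distance until only
    num_clusters remain. Returns a list of lists of document indices.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > num_clusters:
        best = None  # (average distance, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [
                    cosine_distance(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy documents, invented for illustration only.
docs = [
    "football match goal team",
    "team wins football cup",
    "parliament vote election law",
    "election law passes parliament",
]
clusters = hac(docs, 2)
print(clusters)
```

This naive version recomputes every pairwise distance at each merge, which is fine for a handful of documents but would need a cached distance matrix for a corpus of realistic size.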
We investigated the clustering of a significant parallel corpus for a pairing of a high-density and a low-density language, English and Bulgarian respectively. Preliminary results show that the HAC algorithm can effectively cluster the two sides of a bilingual parallel corpus separately and still produce the same extracted words that best describe these clusters for both the English and Bulgarian corpora.
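One way to check whether two separate clusterings of aligned documents agree, and to pull out descriptive words per cluster, is sketched below. The Rand index and the raw-frequency term ranking are assumptions chosen for illustration, not necessarily the measures used in this work. The documents and cluster labels are toy data.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two documents together,
    or both keep them apart. Cluster ids themselves need not match.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def top_terms(docs, labels, k=2):
    """The k most frequent words in each cluster (assumed ranking)."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc.lower().split())
    return {label: [t for t, _ in c.most_common(k)] for label, c in counts.items()}


# Toy aligned documents: docs_bg[i] is the hypothetical translation of docs_en[i].
docs_en = [
    "football team goal",
    "team football cup",
    "parliament law election",
    "election law parliament",
]
docs_bg = [
    "футбол отбор гол",
    "отбор футбол купа",
    "парламент закон избори",
    "избори закон парламент",
]

# Hypothetical labels from clustering each language on its own.
labels_en = [0, 0, 1, 1]
labels_bg = [1, 1, 0, 0]

agreement = rand_index(labels_en, labels_bg)
terms_en = top_terms(docs_en, labels_en)
terms_bg = top_terms(docs_bg, labels_bg)
print(agreement, terms_en, terms_bg)
```

Because the documents are aligned, the top terms of matching clusters can be paired across the two languages, and such a shortlist could serve as a reduced vocabulary for later processing.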
Additionally, I have worked on the construction of a (potentially commercial) tool for clustering documents by textual similarity. This was initially funded by a Capacity Building Grant for Knowledge Transfer to Industry from the White Rose University Consortium. A screenshot of an earlier version of the implemented system can be seen here.