Qumqum v 0.1

Arabic-English Generation-Heavy
Hybrid Machine Translation

What is Qumqum?

Qumqum is an Arabic-English machine translation system that follows the Generation-Heavy Hybrid Machine Translation (GHMT) approach. GHMT addresses the lack of resource symmetry between source and target languages: it exploits symbolic and statistical target language resources in source-poor/target-rich language pairs. The only source language resources expected are a syntactic parser and a simple one-to-many translation dictionary. No transfer rules or complex interlingual representations are used. Rich target language symbolic resources such as word lexical semantics, categorial variations and subcategorization frames are used to overgenerate multiple structural variations from a target-glossed syntactic dependency representation of source language sentences. This symbolic overgeneration, which accounts for possible translation divergences, is constrained by multiple statistical target language models, including surface n-grams and structural n-grams. The source-target asymmetry of systems developed in this approach makes them more easily retargetable (re-source-able) to new source languages, provided that a source language parser and a translation dictionary are available.
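The overgenerate-then-rank idea described above can be illustrated with a toy sketch. This is not Qumqum's implementation: the dictionary entries, the bigram counts, and the function names (overgenerate, score, translate) are all hypothetical, the "parse" is reduced to a verb with a subject and an object, and only a surface bigram model is used (no structural n-grams, categorial variations or subcategorization frames).

```python
import math
from itertools import product

# Hypothetical one-to-many translation dictionary.
# Keys are Arabic words in Buckwalter transliteration.
GLOSSES = {
    "qr>": ["read", "recite"],
    "Alwld": ["the boy", "the child"],
    "AlktAb": ["the book"],
}

# Toy bigram counts standing in for a statistical target language model.
BIGRAM_COUNTS = {
    ("<s>", "the"): 10,
    ("the", "boy"): 6,
    ("the", "child"): 3,
    ("boy", "read"): 5,
    ("child", "read"): 2,
    ("boy", "recite"): 1,
    ("read", "the"): 8,
    ("the", "book"): 7,
    ("book", "</s>"): 6,
}


def overgenerate(verb, subject, obj):
    """Yield every combination of glosses in two candidate word orders."""
    for v, s, o in product(GLOSSES[verb], GLOSSES[subject], GLOSSES[obj]):
        yield f"{s} {v} {o}"  # subject-verb-object
        yield f"{v} {s} {o}"  # verb-subject-object, mirroring the source order


def score(sentence):
    """Sum of log(count + 1) over bigrams: a crude, unnormalized fluency score."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(BIGRAM_COUNTS.get(pair, 0) + 1)
        for pair in zip(tokens, tokens[1:])
    )


def translate(verb, subject, obj):
    """Pick the overgenerated candidate that the target model scores highest."""
    return max(overgenerate(verb, subject, obj), key=score)


if __name__ == "__main__":
    print(translate("qr>", "Alwld", "AlktAb"))
```

All knowledge about word choice and word order in this sketch lives on the target side (the glosses and the counts), which is the asymmetry the approach relies on: moving to a new source language would only require a new parser and a new dictionary.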

The basic intuition of the GHMT approach parallels the experience of most language learners whose lack of symmetrical knowledge impairs their ability to translate into their newly learned language but does not hinder them as much when translating from the foreign language into their native tongue (where they are assisted by rich resources).

For more information, see the Publications section.

Qumqum Demo

THE DEMO IS CURRENTLY DISABLED. PLEASE TRY LATER.

Arabic is parsed with Dan Bikel's parser.

Demo Options

Buckwalter Arabic Encoding

Tim Buckwalter's Arabic Transliteration
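For readers unfamiliar with the encoding, the following sketch shows how Arabic script maps to Buckwalter transliteration, one ASCII character per Arabic letter. It covers only the hamza forms and the basic letters (diacritics and other symbols are left out), and the helper names to_buckwalter and from_buckwalter are made up for this example; they are not part of Qumqum.

```python
# Partial Buckwalter table: hamza forms and basic letters only.
# The two strings are aligned position by position.
ARABIC = "ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي"
LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEgfqklmnhwYy"

TO_BUCKWALTER = str.maketrans(ARABIC, LATIN)
FROM_BUCKWALTER = str.maketrans(LATIN, ARABIC)


def to_buckwalter(text):
    """Transliterate Arabic script; unmapped characters pass through unchanged."""
    return text.translate(TO_BUCKWALTER)


def from_buckwalter(text):
    """Convert Buckwalter transliteration back to Arabic script."""
    return text.translate(FROM_BUCKWALTER)


if __name__ == "__main__":
    print(to_buckwalter("الكتاب"))
    print(from_buckwalter("AlktAb"))
```

Because the mapping is one-to-one, the conversion is reversible for the characters covered here.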

Publications

Habash, Nizar. The Use of a Structural N-gram Language Model in Generation-Heavy Hybrid Machine Translation. In Proceedings of the Third International Conference of Natural Language Generation (INLG-04). Careys Manor, UK, July 2004.
Habash, Nizar. Matador: A Large Scale Spanish-English GHMT System. In Proceedings of the MT Summit, New Orleans, LA, pp. 149-156, 2003.
Habash, Nizar and Bonnie Dorr. A Categorial Variation Database for English. In Proceedings of NAACL/HLT-2003. Edmonton, Canada. 2003.
Habash, Nizar and Bonnie Dorr. Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation. AMTA-2002. Tiburon, California, USA.
Dorr, Bonnie and Nizar Habash. Interlingua Approximation: A Generation-Heavy Approach. AMTA-2002 Interlingua Reliability Workshop. Tiburon, California, USA.
Habash, Nizar. Generation-Heavy Hybrid Machine Translation. INLG-02. New York.
Habash, Nizar. A Reference Manual to the Linearization Engine oxyGen. University of Maryland Technical Report: LAMP-TR-079/ CS-TR-4295/ UMIACS-TR-2001-73/ MDA-904-96-C-1250.
Habash, Nizar. oxyGen: A Language Independent Language Realization Engine. AMTA-2000. Cuernavaca, Mexico.

Credits

Arabic parsing: Dan Bikel's parser
English statistical extraction: HALogen

Contacts

Nizar Habash habash@umiacs.umd.edu (www.nizarHabash.com)