Matador v 0.9

Spanish-English Generation-Heavy
Hybrid Machine Translation

What is Matador?

Matador is a Spanish-English machine translation system that follows the Generation-Heavy Hybrid approach to Machine Translation (GHMT). GHMT addresses the lack of resource symmetry between source and target languages: it exploits symbolic and statistical target language resources in source-poor/target-rich language pairs. Expected source language resources include a syntactic parser and a simple one-to-many translation dictionary. No transfer rules or complex interlingual representations are used. Rich target language symbolic resources, such as word lexical semantics, categorial variations and subcategorization frames, are used to overgenerate multiple structural variations from a target-glossed syntactic dependency representation of source language sentences. This symbolic overgeneration, which accounts for possible translation divergences, is constrained by multiple statistical target language models, including surface n-grams and structural n-grams. The source-target asymmetry of systems developed with this approach makes them easier to retarget (re-source) to new source languages, provided that a source language parser and translation dictionary are available.
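The overgenerate-then-constrain idea can be illustrated with a deliberately small Python sketch. It is not Matador's implementation: it only enumerates lexical alternatives from a one-to-many gloss (no structural variation, no dependency trees) and ranks them with an add-one-smoothed surface bigram model. The corpus, the glosses and all function names are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy data: none of this comes from Matador itself.
TARGET_CORPUS = ["I am hungry", "I am tired", "I have a book"]
# One-to-many glosses per source word position (word order assumed fixed).
GLOSSES = [["I"], ["have", "am"], ["hunger", "hungry"]]


def train_bigrams(sentences):
    """Count bigrams and their left contexts over a target-language corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def overgenerate(glosses):
    """Enumerate every combination of gloss choices as a candidate sentence."""
    return [" ".join(words) for words in product(*glosses)]


def best_candidate(glosses, corpus):
    """Pick the overgenerated candidate the target language model prefers."""
    contexts, bigrams, vocab_size = train_bigrams(corpus)
    return max(
        overgenerate(glosses),
        key=lambda c: log_prob(c, contexts, bigrams, vocab_size),
    )
```

Calling `best_candidate(GLOSSES, TARGET_CORPUS)` scores all four candidates and returns the one the toy target model finds most fluent. The point of the sketch is that all selection knowledge lives on the target side, while the source side contributes only a gloss list.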

The basic intuition of the GHMT approach parallels the experience of most language learners whose lack of symmetrical knowledge impairs their ability to translate into their newly learned language but does not hinder them as much when translating from the foreign language into their native tongue (where they are assisted by rich resources).

For more information, see the Publications section.

Matador Demo

Spanish is parsed with Connexor (on-line demo).

Demo Options

Explicit Diacritics

This option allows users to input Spanish diacritized characters (e.g. á or ñ) when no Spanish keyboard is available. The following table describes how these characters can be specified:

Diacritized Character	Explicitly Diacritized Character
á	a'
é	e'
í	i'
ó	o'
ú	u'
ñ	n~
ü	u"

Diacritized Character	Explicitly Diacritized Character
Á	A'
É	E'
Í	I'
Ó	O'
Ú	U'
Ñ	N~
Ü	U"
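
The mapping in the tables above can be sketched as a small conversion routine. This is only an illustration of the notation, not the demo's actual code: the function name is hypothetical, and how the demo treats an apostrophe that follows a vowel for other reasons (for example a closing quote) is not documented here, so this sketch always converts such pairs.

```python
# Mapping taken from the tables above: base letter + marker -> diacritized character.
EXPLICIT_TO_DIACRITIZED = {
    "a'": "á", "e'": "é", "i'": "í", "o'": "ó", "u'": "ú",
    "n~": "ñ", 'u"': "ü",
    "A'": "Á", "E'": "É", "I'": "Í", "O'": "Ó", "U'": "Ú",
    "N~": "Ñ", 'U"': "Ü",
}


def expand_explicit_diacritics(text: str) -> str:
    """Replace every two-character explicit sequence with its diacritized character."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in EXPLICIT_TO_DIACRITIZED:
            out.append(EXPLICIT_TO_DIACRITIZED[pair])
            i += 2  # consume the base letter and its marker together
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

For example, `expand_explicit_diacritics("El nin~o comio' una pin~a")` yields "El niño comió una piña".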

Publications

Habash, Nizar and Bonnie Dorr. A Categorial Variation Database for English. To appear in Proceedings of NAACL/HLT-2003. Edmonton, Canada. 2003.

Habash, Nizar and Bonnie Dorr. Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation. AMTA-2002. Tiburon, California, USA.
Dorr, Bonnie and Nizar Habash. Interlingua Approximation: A Generation-Heavy Approach. AMTA-2002 Interlingua Reliability Workshop. Tiburon, California, USA.
Habash, Nizar. Generation-Heavy Hybrid Machine Translation. INLG-02. New York.
Habash, Nizar. A Reference Manual to the Linearization Engine oxyGen. University of Maryland Technical Report: LAMP-TR-079 / CS-TR-4295 / UMIACS-TR-2001-73 / MDA-904-96-C-1250.
Habash, Nizar. oxyGen: A Language Independent Language Realization Engine. AMTA-2000. Cuernavaca, Mexico.

Credits

Spanish parsing: Connexor
English statistical extraction: HALogen

Contacts

Nizar Habash habash@umiacs.umd.edu (www.nizarHabash.com)