Catvar 2.0

The Categorial Variation Database (English)

What is a Catvar?

A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
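The cluster idea can be illustrated with a minimal Python sketch. It uses only the two example clusters quoted above; the in-memory layout (lists of word/tag pairs) and the names CLUSTERS, build_index and variants are assumptions made for illustration and do not reflect the format or tooling of the distributed Catvar files.

```python
from collections import defaultdict

# The two example clusters from the text; each entry is (lexeme, part-of-speech tag).
# This list-of-pairs layout is a hypothetical representation, not Catvar's file format.
CLUSTERS = [
    [("hunger", "V"), ("hunger", "N"), ("hungry", "AJ"), ("hungriness", "N")],
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]


def build_index(clusters):
    """Map each lexeme to the ids of the clusters that contain it."""
    index = defaultdict(set)
    for cluster_id, cluster in enumerate(clusters):
        for word, _pos in cluster:
            index[word].add(cluster_id)
    return index


INDEX = build_index(CLUSTERS)


def variants(word, pos=None):
    """Return all cluster members for `word`, optionally filtered by part of speech."""
    found = []
    for cluster_id in sorted(INDEX.get(word, ())):
        for entry in CLUSTERS[cluster_id]:
            if pos is None or entry[1] == pos:
                found.append(entry)
    return found


# Nominal variants of the adjective "hungry".
print(variants("hungry", pos="N"))  # [('hunger', 'N'), ('hungriness', 'N')]
```

Looking a word up by its surface form and then filtering by tag mirrors the typical use of such a resource: start from a word in one category and retrieve its counterparts in another.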

The database was developed for English using a combination of resources and algorithms including the LCS Verb and Preposition Databases (Dorr 2001), the Brown Corpus section of the Penn Treebank (Marcus et al. 1994), an English morphological analysis lexicon developed for PC-Kimmo (ENGLEX) (Antworth 1990), WordNet1.6 (Fellbaum 1998), an English Verb-Noun list extracted from Nomlex (Macleod et al. 1998), a similar list extracted from LDOCE (Procter 1978) and the Porter stemmer (Porter 1980).

In its first release, the database contained 28,305 clusters for 46,037 words. In this second release, there are 63,146 clusters and 109,807 words.

Access Catvar2.0 online

Download Catvar

Catvar2.1 (c) 2003 Copyright University of Maryland. All Rights Reserved.
Licensed under the Open Software License version 1.1

Catvar2.1 is now on GitHub (Jan 2, 2019).

Credits

ENGLEX (Antworth 1990) [extractor Nizar Habash, UMCP]
NOMLEX (Macleod et al. 1998) [extractor Bonnie Dorr and Greg Marton, UMCP]
LDOCE (Procter 1978) [extractor Rebecca Green, UMCP]
Brown Corpus (Marcus et al. 1994) [extractor Nizar Habash, UMCP]
WordNet1.6 (Fellbaum 1998) [extractor Nizar Habash, UMCP]
LCS Verb and Preposition Database (Dorr 2001) [extractor Nizar Habash, UMCP]
Nizar Habash provided some pairs not extractable from the above resources.

How to cite CATVAR

Habash, Nizar and Bonnie Dorr, A Categorial Variation Database for English, Proceedings of the North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, pp. 96--102, 2003. [PDF]

References

E.L. Antworth. 1990. PC-KIMMO: A Two-Level Processor for Morphological Analysis. Summer Institute of Linguistics, Dallas.
Bonnie J. Dorr. 2001. LCS Verb Database. Technical Report Online Software Database, University of Maryland, College Park, MD.
Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Catherine Macleod, Ralph Grishman, Adam Meyers, Leslie Barrett, and Ruth Reeves. 1998. NOMLEX: A Lexicon of Nominalizations. In Proceedings of EURALEX'98, Liege, Belgium.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313--330.
M.F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130--137.
P. Procter. 1978. Longman Dictionary of Contemporary English. Longman, London.

Contacts

Nizar Habash nizar.habash@nyu.edu
Bonnie Dorr bdorr@ihmc.us