MULTILINGUA

Barbara McGillivray
Home institution
Università degli Studi di Pisa
Country
Italy
MULTILINGUA Fellowship
March-June 2008
Project
Selectional Preference Acquisition. A preliminary contrastive study on Italian and English verbs.

This PhD research project has tested a new method for automatically acquiring selectional preferences for verbs from large corpora. Research results concerning such preferences are useful both to computational lexicography and to various areas of NLP. Earlier work on automatic selectional preference (SP) acquisition using a fixed wordnet suffers from low coverage, because a fixed resource rarely reflects the real distribution of arguments across the studied corpus, especially when dealing with special-domain texts. Also, from a multilingual perspective, some languages have corpus materials but no manually constructed wordnet resources. The present project investigates a syntax-driven distributional method for acquiring selectional preferences from large corpora for Italian and English verbs. The method extracts lexical headwords for chosen argument slots from a corpus, collects frequent fillers, generalizes these to a semantic class, and builds a computational lexicon containing predicate information on the type of arguments and preferences, together with their frequency distribution. The project investigated to what extent the method, first tried on Italian, provides similar results for English, which gives an indication of its stability across other languages. In all these cases, correspondence analysis was able to detect the structure underlying the data, highlighting semantic clusters of words. Testing this method on different corpora demonstrated how correspondence analysis reflects the real verbal distribution in texts and is affected by topical features of the corpus. The results obtained are encouraging, showing correspondence analysis to be a promising method for investigating and geometrically visualizing semantic associations between verbs and nouns that go beyond selectional preferences. The project has benefited from multilingual corpus tools and related expertise at Bergen.
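As an illustration of the correspondence analysis step mentioned above, the following is a minimal sketch and not the project's actual implementation. It assumes a small verb-by-noun table of co-occurrence counts, such as might be obtained by counting the headwords that fill one argument slot in a parsed corpus. The verbs, nouns and counts are invented for the example, and the function name `correspondence_analysis` is hypothetical. The sketch uses the standard formulation of correspondence analysis as a singular value decomposition of the standardized residuals of the table.

```python
import numpy as np

def correspondence_analysis(counts, n_dims=2):
    """Simple correspondence analysis of a contingency table via SVD.

    Returns principal coordinates for rows and columns, plus the
    inertia (squared singular value) of each retained dimension.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()              # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)            # row masses (verbs)
    c = P.sum(axis=0)            # column masses (nouns)
    expected = np.outer(r, c)    # expected proportions under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: scale singular vectors by singular values
    # and divide by the square roots of the masses.
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_coords[:, :n_dims], col_coords[:, :n_dims], sigma[:n_dims] ** 2

# Invented toy data: rows are verbs, columns are object-slot headwords.
verbs = ["eat", "drink", "drive"]
nouns = ["bread", "apple", "water", "wine", "car", "truck"]
counts = [
    [30, 25,  1,  0,  0,  0],   # eat
    [ 1,  0, 28, 22,  0,  0],   # drink
    [ 0,  0,  0,  1, 20, 15],   # drive
]

row_coords, col_coords, inertia = correspondence_analysis(counts)

for verb, xy in zip(verbs, row_coords):
    print(f"{verb:>6}: {xy.round(3)}")
for noun, xy in zip(nouns, col_coords):
    print(f"{noun:>6}: {xy.round(3)}")
```

Plotting the row and column coordinates in the same plane gives the kind of geometric view of verb-noun associations described in the abstract: nouns that tend to fill the slot of the same verbs lie close to one another and to those verbs.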
