Creating a system for lexical substitutions from scratch using crowdsourcing

Cited by: 22
Author
Biemann, Chris [1]
Affiliation
[1] Tech Univ Darmstadt, D-64289 Darmstadt, Germany
Keywords
Amazon Turk; Lexical substitution; Word sense disambiguation; Language resource creation; Crowdsourcing
DOI
10.1007/s10579-012-9180-5
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This article describes the creation and application of the Turk Bootstrap Word Sense Inventory for 397 frequent nouns, which is a publicly available resource for lexical substitution. This resource was acquired using Amazon Mechanical Turk. In a bootstrapping process with massive collaborative input, substitutions for target words in context are elicited and clustered by sense; then, more contexts are collected. Contexts that cannot be assigned to a current target word's sense inventory re-enter the bootstrapping loop and get a supply of substitutions. This process yields a sense inventory with its granularity determined by substitutions as opposed to psychologically motivated concepts. It comes with a large number of sense-annotated target word contexts. Evaluation on data quality shows that the process is robust against noise from the crowd, produces a less fine-grained inventory than WordNet and provides a rich body of high precision substitution data at low cost. Using the data to train a system for lexical substitutions, we show that amount and quality of the data is sufficient for producing high quality substitutions automatically. In this system, co-occurrence cluster features are employed as a means to cheaply model topicality.
Pages: 97-122
Page count: 26