Comparison of data annotation approaches using dependency tree annotation as a case study

Cited by: 0

Authors:

Zhou M. [1]

Gong C. [1]

Li Z. [1]

Zhang M. [1]

Affiliation:

[1] School of Computer Science and Technology, Soochow University, Suzhou

Source:

Qinghua Daxue Xuebao/Journal of Tsinghua University | 2022, Vol. 62, No. 5

Keywords:

Data annotation; Double-blind annotation; Human-model double-blind annotation; Model annotation followed by human corrections

DOI:

10.16511/j.cnki.qhdxxb.2022.22.010

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The two key considerations in data annotation are annotation quality and annotation cost. To reduce cost, data annotation in natural language processing usually applies automatic model annotation first and then has humans correct the output. Few studies have compared how different annotation approaches affect annotation quality and cost. Using dependency tree annotation by an experienced annotation team as a case study, this study compares three approaches: model annotation followed by human corrections, double-blind annotation, and human-model double-blind annotation, which fuses the first two. Human-model double-blind annotation effectively combines the advantages of the other two approaches: it reduces annotation cost and improves annotation quality by eliminating the identification tendency problem. © 2022, Tsinghua University Press. All rights reserved.

Citation

Pages: 908-916

Page count: 8
