An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus

Cited by: 4
Authors
Nguyen, Long H. B. [1]
Dien Dinh [1]
Phuoc Tran [2]
Affiliations
[1] Univ Sci, Fac Informat Technol, Hcm City, Vietnam
[2] Ton Duc Thang Univ, Fac Informat Technol, Hcm City, Vietnam
Keywords
Named entity translation; annotated bilingual corpus; English-Vietnamese; word alignment constraint
DOI
10.1145/2990191
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Manually annotating Named Entities (NEs) in a bilingual corpus is time-consuming, labor-intensive, and expensive, yet such corpora are necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach to constructing an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by proposing an NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately in each language, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning the NEs, using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. NE recognition quality also improves significantly: the F-measure rises from 84.85% to 88.66% on the English side and from 75.71% to 85.55% on the Vietnamese side. By providing additional semantic information to machine translation systems, the approach raises the BLEU score from 33.04% to 45.11%.
Pages: 17