An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus

Cited by: 4
Authors
Nguyen, Long H. B. [1 ]
Dien Dinh [1 ]
Phuoc Tran [2 ]
Affiliations
[1] Univ Sci, Fac Informat Technol, Hcm City, Vietnam
[2] Ton Duc Thang Univ, Fac Informat Technol, Hcm City, Vietnam
Keywords
Named entity translation; annotated bilingual corpus; English-Vietnamese; word alignment constraint;
DOI
10.1145/2990191
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Manually annotating Named Entities (NEs) in a bilingual corpus is time-consuming, labor-intensive, and expensive, yet such corpora are necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach that constructs an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by means of a proposed NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately on each language side, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning them, using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. The proposed method also significantly improves NE recognition quality: the F-measure rises from 84.85% to 88.66% on the English side and from 75.71% to 85.55% on the Vietnamese side. When the additional semantic information is provided to machine translation systems, the BLEU score increases from 33.04% to 45.11%.
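The abstract reports NE alignment quality as an F-measure. As a rough illustration only, the sketch below shows how such a score is commonly computed over sets of aligned (English NE, Vietnamese NE) pairs. The function name `alignment_f_measure` and the toy pairs are hypothetical and are not taken from the article; the article's exact matching criterion is not stated in this record, so exact pair matching is an assumption.

```python
def alignment_f_measure(predicted, gold):
    """Return (precision, recall, F-measure) for aligned NE pairs.

    Assumption: a predicted pair counts as correct only if it matches
    a gold pair exactly on both the English and the Vietnamese side.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data, invented for illustration only.
gold_pairs = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("Ton Duc Thang University", "Đại học Tôn Đức Thắng"),
}
predicted_pairs = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("Ton Duc Thang", "Tôn Đức Thắng"),  # boundary error on both sides
}

precision, recall, f_measure = alignment_f_measure(predicted_pairs, gold_pairs)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

In this toy case two of the three predicted pairs match the gold pairs, so precision, recall, and the F-measure all equal 2/3. A boundary error on either side makes the whole pair count as wrong under exact matching, which is one reason correcting NE boundaries before alignment can raise the alignment score.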
Pages: 17
Related papers
30 in total
  • [1] Building a Named Entity Annotated Bilingual English-Vietnamese Corpus
    Tuan-An Dao
    Hung-Thinh Truong
    Long Nguyen
    Dien Dinh
    PROCEEDINGS OF 2018 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2018, : 61 - 66
  • [2] Building an English-Vietnamese Bilingual Corpus for Machine Translation
    Quoc Hung Ngo
    Winiwarter, Werner
    2012 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2012), 2012, : 157 - 160
  • [3] Projecting dependency syntax labels from English into Vietnamese in English-Vietnamese bilingual corpus
    Tran P.
    Duong V.-D.
    Dinh D.
    Vo B.
    Nguyen H.
    Nguyen L.H.B.
    International Journal of Intelligent Information and Database Systems, 2020, 13 (01) : 17 - 32
  • [4] Improving Named Entity Recognition of English and Vietnamese Languages using Bilingual Constraints
    Thinh Truong
    An Dao
    Long Nguyen
    Dien Dinh
    PROCEEDINGS OF THE 2018 2ND INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL (NLPIR 2018), 2018, : 70 - 75
  • [5] Rule based English-Vietnamese bilingual terminology extraction from Vietnamese documents
    Ha Nguyen Tien
    Quyen Ngo The
    Huyen Nguyen Thi Minh
    Linh Ha My
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 56 - 62
  • [6] A Hybrid Method for Word Segmentation with English-Vietnamese Bilingual Text
    Quoc Hung Ngo
    Dinh Dien
    Winiwarter, Werner
    2013 INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND INFORMATION SCIENCES (ICCAIS), 2013
  • [7] Assessment of disease named entity recognition on a corpus of annotated sentences
    Jimeno, Antonio
    Jimenez-Ruiz, Ernesto
    Lee, Vivian
    Gaudan, Sylvain
    Berlanga, Rafael
    Rebholz-Schuhmann, Dietrich
    BMC BIOINFORMATICS, 2008, 9 (Suppl 3)
  • [8] Assessment of disease named entity recognition on a corpus of annotated sentences
    Antonio Jimeno
    Ernesto Jimenez-Ruiz
    Vivian Lee
    Sylvain Gaudan
    Rafael Berlanga
    Dietrich Rebholz-Schuhmann
    BMC Bioinformatics, 9
  • [9] An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition
    Hoxha, Klesti
    Baxhaku, Artur
    CYBERNETICS AND INFORMATION TECHNOLOGIES, 2018, 18 (01) : 95 - 108
  • [10] BanglaBioMed: A Biomedical Named-Entity Annotated Corpus for Bangla (Bengali)
    Sazzed, Salim
    PROCEEDINGS OF THE 21ST WORKSHOP ON BIOMEDICAL LANGUAGE PROCESSING (BIONLP 2022), 2022, : 323 - 329