Vietnamese treebank construction and entropy-based error detection

Cited by: 7
Authors
Phuong-Thai Nguyen [1]
Anh-Cuong Le [1]
Tu-Bao Ho [2]
Van-Hiep Nguyen [3]
Affiliations
[1] Vietnam Natl Univ, Univ Engn & Technol, Hanoi, Vietnam
[2] Japan Adv Inst Sci & Technol, Nomi, Japan
[3] Vietnam Acad Social Sci, Inst Linguist, Hanoi, Vietnam
Keywords
Treebank; Error detection; Entropy; MODELS;
DOI
10.1007/s10579-015-9308-5
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Treebanks, especially the Penn Treebank for English, play an essential role in both natural language processing (NLP) research and its applications. However, many languages still lack treebanks, and building one can be complicated and difficult. This work has a twofold objective. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. The major steps of the construction process are described with particular regard to specific properties of Vietnamese, such as the lack of word delimiters and the isolating nature of the language. These properties make sentences highly syntactically ambiguous, so it is difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define the annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools based on statistical machine learning methods for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. The second objective is to present our method for automatically finding errors and inconsistencies in treebank corpora and its application to the construction of the VTB. The method uses the Shannon entropy measure on the premise that the more the entropy of a treebank is reduced, the more errors have been corrected. It ranks error candidates with a scoring function based on conditional entropy. Our experiments showed that the method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. These subsets were only about one third the size of the whole candidate set, yet they contained 80-90 % of the total errors. The method can also be applied to languages similar to Vietnamese.
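The abstract does not give the scoring function itself, so the following is only a minimal sketch of the general idea, under stated assumptions. It takes (context, label) pairs, for example a word and its part-of-speech tag, computes the conditional entropy H(label | context), and ranks each minority-labelled item by how much that entropy would drop if the item were relabelled to the majority label of its context. The function names, the "relabel to majority" heuristic, and the toy data are all hypothetical illustrations, not the authors' method or data.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(pairs):
    """Return H(label | context) in bits for a list of (context, label) pairs."""
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    total = len(pairs)
    h = 0.0
    for labels in by_context.values():
        n = sum(labels.values())
        for count in labels.values():
            # p(context, label) * log2 p(label | context)
            h -= (count / total) * math.log2(count / n)
    return h


def rank_error_candidates(pairs):
    """Rank minority-labelled items by the entropy drop from relabelling them.

    Assumption (not from the paper): a candidate's score is the reduction in
    H(label | context) when that single item is changed to the majority label
    of its context.
    """
    base = conditional_entropy(pairs)
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    scored = []
    for i, (context, label) in enumerate(pairs):
        majority = by_context[context].most_common(1)[0][0]
        if label == majority:
            continue  # only minority labels are treated as error candidates
        edited = pairs[:i] + [(context, majority)] + pairs[i + 1:]
        drop = base - conditional_entropy(edited)
        scored.append((drop, i, context, label, majority))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy data, invented for illustration: (word, POS tag) pairs.
pairs = (
    [("cua", "E")] * 9 + [("cua", "N")]        # one suspicious tag among ten
    + [("hoc", "V")] * 3 + [("hoc", "N")] * 2  # a genuinely ambiguous word
)
ranked = rank_error_candidates(pairs)
for drop, index, word, tag, majority in ranked:
    print(f"item {index}: {word}/{tag} -> {majority}, entropy drop {drop:.3f} bits")
```

In this toy setting the lone deviating tag in an otherwise consistent context receives the highest score, while the tags of the ambiguous word score lower. Reviewing candidates from the top of such a ranking is one way a reviewer could concentrate on a high-error-density subset.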
Pages: 487-519
Page count: 33