Vietnamese treebank construction and entropy-based error detection

Cited by: 7
Authors
Phuong-Thai Nguyen [1]
Anh-Cuong Le [1]
Tu-Bao Ho [2]
Van-Hiep Nguyen [3]
Affiliations
[1] Vietnam Natl Univ, Univ Engn & Technol, Hanoi, Vietnam
[2] Japan Adv Inst Sci & Technol, Nomi, Japan
[3] Vietnam Acad Social Sci, Inst Linguist, Hanoi, Vietnam
Keywords
Treebank; Error detection; Entropy; MODELS;
DOI
10.1007/s10579-015-9308-5
Chinese Library Classification number
TP39 [Computer applications];
Subject classification codes
081203; 0835;
Abstract
Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both research into and the application of NLP. However, many languages still lack treebanks, and building a treebank can be very complicated and difficult. This work has a twofold objective. Our first objective is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. Major steps in the treebank construction process are described with particular regard to specific properties of Vietnamese, such as the lack of word delimiters and its isolating morphology. These properties make sentences highly syntactically ambiguous, and it is therefore difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. Our second objective is to present our method for automatically finding errors and inconsistencies in treebank corpora and its application to the construction of the VTB. This method uses Shannon entropy on the premise that the more the entropy of a treebank is reduced, the more errors have been corrected. The method ranks error candidates by using a scoring function based on conditional entropy. Our experiments showed that this method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. The size of these subsets was only about one third of the whole set, while these subsets contained 80 to 90 % of the total errors. This method can also be applied to languages similar to Vietnamese.
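The idea of ranking error candidates by conditional entropy can be illustrated with a minimal Python sketch. It is not the scoring function from the paper, which the abstract does not spell out. It assumes a corpus reduced to hypothetical (context, label) pairs, for example (word, part-of-speech tag), and scores each minority label by how much the conditional entropy H(label | context) would drop if its occurrences were relabelled to the majority label of that context. The function names and the toy tokens `w1` and `w2` are illustrative only.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(pairs):
    """Return H(label | context) in bits for a list of (context, label) pairs."""
    total = len(pairs)
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    h = 0.0
    for labels in by_context.values():
        n = sum(labels.values())
        for count in labels.values():
            # p(context, label) * log2 p(label | context)
            h -= (count / total) * math.log2(count / n)
    return h


def rank_candidates(pairs):
    """Rank (context, minority label) candidates by entropy reduction.

    Assumption of this sketch: a candidate's score is the drop in
    H(label | context) obtained by relabelling it to the majority label
    of its context. Returns (score, context, label) tuples, best first.
    """
    base = conditional_entropy(pairs)
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    scored = []
    for context, labels in by_context.items():
        majority, _ = labels.most_common(1)[0]
        for label in labels:
            if label == majority:
                continue
            relabelled = [
                (c, majority if (c == context and l == label) else l)
                for c, l in pairs
            ]
            scored.append((base - conditional_entropy(relabelled), context, label))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy corpus: hypothetical tokens w1 and w2 with mostly consistent tags.
corpus = [("w1", "V")] * 9 + [("w1", "N")] + [("w2", "N")] * 6 + [("w2", "V")] * 2
for score, context, label in rank_candidates(corpus):
    print(f"{context}/{label}: entropy reduction {score:.4f} bits")
```

In a workflow of this kind, an annotator would review only the top-ranked candidates rather than the whole candidate set.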
Pages: 487-519
Number of pages: 33