Vietnamese treebank construction and entropy-based error detection

Cited by: 7
Authors
Phuong-Thai Nguyen [1]
Anh-Cuong Le [1]
Tu-Bao Ho [2]
Van-Hiep Nguyen [3]
Affiliations
[1] Vietnam Natl Univ, Univ Engn & Technol, Hanoi, Vietnam
[2] Japan Adv Inst Sci & Technol, Nomi, Japan
[3] Vietnam Acad Social Sci, Inst Linguist, Hanoi, Vietnam
Keywords
Treebank; Error detection; Entropy; MODELS;
DOI
10.1007/s10579-015-9308-5
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both NLP research and NLP applications. However, many languages still lack treebanks, and building one can be complicated and difficult. This work has two objectives. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. The major steps of the construction process are described with particular regard to specific properties of Vietnamese, such as the lack of word delimiters and its isolating character. These properties make sentences highly ambiguous syntactically, which makes it difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were used not only to define the annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. The second objective is to present our method for automatically finding errors and inconsistencies in treebank corpora, and its application to the construction of the VTB. The method uses the Shannon entropy measure on the premise that the more the entropy of a treebank is reduced, the more errors have been corrected. It ranks error candidates with a scoring function based on conditional entropy. Our experiments showed that the method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. These subsets were only about one third the size of the whole set, yet they contained 80-90 % of the total errors. The method can also be applied to languages similar to Vietnamese.
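The abstract does not give the scoring function itself, so the following is only a minimal sketch of the general idea it describes: measure the conditional entropy of labels given their context, then rank suspicious annotations by how much the corpus entropy would drop if each were corrected. Everything here is an illustrative assumption rather than the authors' method: the (word, tag) pair representation, the majority-label "correction", and the names `conditional_entropy` and `rank_candidates`.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Return H(label | context) in bits for a list of (context, label) pairs.

    Probabilities are plain relative frequencies (an assumption of this sketch).
    """
    joint = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (context, _label), count in joint.items():
        # p(context, label) * log2 p(label | context)
        h -= (count / n) * math.log2(count / context_counts[context])
    return h


def rank_candidates(pairs):
    """Rank minority-label positions by the entropy reduction a correction gives.

    A "correction" here means replacing the label with the majority label
    for the same context, which is a simplification chosen for illustration.
    Returns a list of (entropy_reduction, position), largest reduction first.
    """
    labels_by_context = {}
    for context, label in pairs:
        labels_by_context.setdefault(context, Counter())[label] += 1
    majority = {c: counts.most_common(1)[0][0]
                for c, counts in labels_by_context.items()}

    base = conditional_entropy(pairs)
    scored = []
    for i, (context, label) in enumerate(pairs):
        if label == majority[context]:
            continue  # agrees with the majority, not an error candidate
        corrected = pairs[:i] + [(context, majority[context])] + pairs[i + 1:]
        scored.append((base - conditional_entropy(corrected), i))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


# Toy corpus: context "a" is tagged V nine times and N once,
# context "b" is tagged N three times and V twice.
toy = [("a", "V")] * 9 + [("a", "N")] + [("b", "N")] * 3 + [("b", "V")] * 2
ranking = rank_candidates(toy)
```

In this toy corpus the lone `N` tag on context `a` ranks first, because correcting it removes all uncertainty for that context, whereas correcting one of the `V` tags on `b` leaves that context ambiguous. Recomputing the entropy for every candidate is quadratic in corpus size, which is acceptable for a sketch but not for a real treebank.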
Pages: 487-519
Page count: 33
Source: Language Resources and Evaluation, 2015, 49: 487-519