Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Cited by: 1
Authors
Li, Fenfang [1]
Lv, Hui [1]
La, Duo [2]
Yong, Binbin [1]
Zhou, Qingguo [1]
Affiliations
[1] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou 730070, Gansu, Peoples R China
[2] Northwest Univ National, Key Lab Chinas Natl Linguist Informat Technol, Lanzhou 730070, Gansu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Low-resource language; Tibetan sentence boundary disambiguation; recurrent neural network; attention mechanism; shad
DOI
10.1145/3527663
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for building corpora for downstream tasks. This study performs Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results show that the proposed method achieves better results than established rule-based and statistical methods without relying on syntactic and part-of-speech (POS) tagging rules. The model is also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to assess its reliability and generalizability. The results indicate that the method is efficient not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied to research on downstream tasks, such as machine translation and automatic summarization.
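To make the syllable-level framing concrete, the following is a minimal sketch of how candidate boundaries might be prepared before classification. It is not the authors' implementation: the abstract does not describe preprocessing, so splitting syllables on the tsheg mark (U+0F0B), the function names `syllabify` and `shad_candidates`, the window size, and the `<pad>` token are all assumptions made for illustration. Each shad (U+0F0D) becomes one candidate, paired with the syllables on either side; a trained model (for example, an RNN with attention) would then label each candidate as sentence-ending or not.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (assumed syllable delimiter)
SHAD = "\u0f0d"   # Tibetan shad, the candidate sentence-boundary mark


def syllabify(text):
    """Split text into syllables, keeping every shad as its own token."""
    tokens, current = [], []
    for ch in text:
        if ch == SHAD:
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(SHAD)
        elif ch == TSHEG or ch.isspace():
            # Delimiters close the current syllable and are then dropped.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def shad_candidates(tokens, window=3, pad="<pad>"):
    """Return (index, left_context, right_context) for every shad token.

    Contexts are padded to a fixed length so they can be fed to a model
    that expects equal-sized inputs (hypothetical design choice).
    """
    candidates = []
    for i, tok in enumerate(tokens):
        if tok != SHAD:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        left = [pad] * (window - len(left)) + left
        right = right + [pad] * (window - len(right))
        candidates.append((i, left, right))
    return candidates


# Latin placeholders stand in for Tibetan syllables in this example.
sample = "ka" + TSHEG + "kha" + SHAD + " " + "ga" + TSHEG + "nga" + SHAD
sample_tokens = syllabify(sample)
sample_candidates = shad_candidates(sample_tokens, window=2)
```

Working on syllables rather than words means the candidate contexts never depend on a word segmenter, which matches the motivation stated in the abstract.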
Pages: 18