Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Cited by: 1
Authors
Li, Fenfang [1]
Lv, Hui [1]
La, Duo [2]
Yong, Binbin [1]
Zhou, Qingguo [1]
Affiliations
[1] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou 730070, Gansu, Peoples R China
[2] Northwest Univ National, Key Lab Chinas Natl Linguist Informat Technol, Lanzhou 730070, Gansu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Low-resource language; Tibetan sentence boundary disambiguation; recurrent neural network; attention mechanism; shad
DOI
10.1145/3527663
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Tibetan is a low-resource language with few existing electronic reference materials. Tibetan sentence boundary disambiguation (SBD) aims to segment long text into sentences and is the foundation for building corpora for downstream tasks. This study performs Tibetan SBD at the syllable level so that word segmentation (WS) errors do not degrade SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results show that the proposed method outperforms established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. The model is also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to assess its reliability and generalizability. The results indicate that the method is effective for both low-resource and high-resource languages. More importantly, the results of this study can be applied directly to research on downstream tasks such as machine translation and automatic summarization.
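To make the syllable-level framing concrete, the following is a minimal sketch of how candidate shads and their surrounding syllables might be extracted before classification. It is not the authors' implementation: the function names `tokenize_syllables` and `shad_candidates`, the `window` size, and the `<pad>` token are hypothetical choices made for illustration. The only facts relied on are that Tibetan syllables are separated by the tsheg (U+0F0B) and that the shad is U+0F0D.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg), separates syllables
SHAD = "\u0f0d"   # Tibetan shad, the ambiguous boundary mark


def tokenize_syllables(text):
    """Split text into syllables, keeping every shad as its own token."""
    tokens = []
    # The capturing group makes re.split keep each shad in the result.
    for piece in re.split(f"({SHAD})", text):
        if piece == SHAD:
            tokens.append(SHAD)
        else:
            # Drop whitespace, then split the run into syllables on the tsheg.
            compact = "".join(piece.split())
            tokens.extend(s for s in compact.split(TSHEG) if s)
    return tokens


def shad_candidates(text, window=3, pad="<pad>"):
    """Return (token_index, left_context, right_context) for each shad.

    Each context holds exactly `window` tokens, padded at the text edges.
    A classifier (assumed here, not shown) would label each candidate as
    sentence-ending or not.
    """
    tokens = tokenize_syllables(text)
    candidates = []
    for i, tok in enumerate(tokens):
        if tok != SHAD:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        left = [pad] * (window - len(left)) + left
        right = right + [pad] * (window - len(right))
        candidates.append((i, left, right))
    return candidates


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་བདེ་ལེགས།"
    for index, left, right in shad_candidates(sample, window=2):
        print(index, left, right)
```

Working on syllables rather than words means the context fed to the model never depends on a word segmenter, which is the motivation the abstract gives for the syllable-level design.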
Pages: 18