Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Cited by: 1

Authors:

Li, Fenfang [1]

Lv, Hui [1]

La, Duo [2]

Yong, Binbin [1]

Zhou, Qingguo [1]

Affiliations:

[1] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou 730070, Gansu, Peoples R China

[2] Northwest Univ National, Key Lab Chinas Natl Linguist Informat Technol, Lanzhou 730070, Gansu, Peoples R China

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2022 / Vol. 21 / No. 6

Funding:

National Natural Science Foundation of China;

Keywords:

Low-resource language; Tibetan sentence boundary disambiguation; recurrent neural network; attention mechanism; shad;

DOI:

10.1145/3527663

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405;

Abstract:

Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for building corpora for downstream tasks. This study implements Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect the accuracy of SBD. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracies for Tibetan syllable embedding and component embedding are 96.23% and 95.40%, respectively, and the F1 scores reach 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method achieves better results than the established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. The model is also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to assess its reliability and generalizability. The results demonstrate that this method is efficient not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied to research on downstream tasks, such as machine translation and automatic summarization.
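The syllable-level framing in the abstract can be illustrated with a minimal sketch of the data preparation step only. This is not the authors' code: the function names `tokenize` and `shad_windows`, the window size `k`, and the `<pad>` token are assumptions made for illustration, and the sample text uses Latin placeholder syllables joined by the real Tibetan tsheg (U+0F0B) and shad (U+0F0D) characters. The RNN-with-attention classifier that would label each candidate is not shown.

```python
import re

TSHEG = "\u0f0b"  # Tibetan tsheg, the mark that separates syllables
SHAD = "\u0f0d"   # Tibetan shad, the candidate sentence-ending mark


def tokenize(text):
    """Split text into syllables, keeping every shad as its own token."""
    tokens = []
    # The capturing group makes re.split return the shad separators too.
    for piece in re.split(f"({SHAD})", text):
        if piece == SHAD:
            tokens.append(SHAD)
        else:
            tokens.extend(s.strip() for s in piece.split(TSHEG) if s.strip())
    return tokens


def shad_windows(tokens, k=3, pad="<pad>"):
    """Return (index, left_context, right_context) for each shad token.

    Each context holds k tokens; positions past the text edges are padded.
    A classifier (hypothetical here) would decide from these contexts
    whether the shad at `index` ends a sentence.
    """
    padded = [pad] * k + tokens + [pad] * k
    windows = []
    for i, tok in enumerate(tokens):
        if tok == SHAD:
            j = i + k  # position of this shad inside the padded list
            windows.append((i, padded[j - k:j], padded[j + 1:j + 1 + k]))
    return windows


# Placeholder syllables (not real Tibetan words), real delimiters.
sample = "ka\u0f0bkha\u0f0bga\u0f0dnga\u0f0bca\u0f0d"
sample_tokens = tokenize(sample)
sample_windows = shad_windows(sample_tokens, k=2)
```

Working on syllables rather than words means this step needs no word segmenter, which is the motivation the abstract gives for the syllable-level design.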


Pages: 18

