Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Times cited: 1

Authors:

Li, Fenfang [1]

Lv, Hui [1]

La, Duo [2]

Yong, Binbin [1]

Zhou, Qingguo [1]

Affiliations:

[1] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou 730070, Gansu, Peoples R China

[2] Northwest Univ National, Key Lab Chinas Natl Linguist Informat Technol, Lanzhou 730070, Gansu, Peoples R China

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2022, Vol. 21, No. 6

Funding:

National Natural Science Foundation of China

Keywords:

Low-resource language; Tibetan sentence boundary disambiguation; recurrent neural network; attention mechanism; shad

DOI:

10.1145/3527663

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for building corpora for downstream tasks. This study implements Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method achieves better results than established rule-based and statistical methods, without relying on syntactic or part-of-speech (POS) tagging rules. German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus are used to validate the models' reliability and generalizability. The results demonstrate that this method is efficient not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied to research on downstream tasks, such as machine translation and automatic summarization.
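
The abstract frames SBD as a per-shad decision made from syllable-level context. The record does not describe the authors' preprocessing, so the following is only a minimal sketch of how candidate shads and their surrounding syllables might be extracted before being passed to a classifier. It assumes syllables are delimited by the tsheg (U+0F0B) and that only the plain shad (U+0F0D) is a candidate; the function names, the window size `k`, and the `<pad>` token are hypothetical, and other shad variants are ignored.

```python
import re

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)
SHAD = "\u0f0d"   # Tibetan shad, the candidate sentence-ending mark


def syllabify(text):
    """Split text into a flat list of syllables and shad tokens.

    Assumption: spaces carry no information and can be dropped.
    """
    tokens = []
    # The capture group keeps each shad as its own element in the split.
    for chunk in re.split(f"({re.escape(SHAD)})", text):
        if chunk == SHAD:
            tokens.append(SHAD)
        else:
            syllables = chunk.replace(" ", "").split(TSHEG)
            tokens.extend(s for s in syllables if s)
    return tokens


def shad_windows(tokens, k=3, pad="<pad>"):
    """Return (left, right) context windows of k tokens for every shad.

    Each pair is one classification instance: is this shad a sentence end?
    """
    padded = [pad] * k + tokens + [pad] * k
    windows = []
    for i, tok in enumerate(tokens):
        if tok == SHAD:
            j = i + k  # position of this shad in the padded list
            windows.append((padded[j - k:j], padded[j + 1:j + 1 + k]))
    return windows


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་བདེ་ལེགས།"
    for left, right in shad_windows(syllabify(sample), k=2):
        print(left, right)
```

Each window pair would then be mapped to syllable (or component) embeddings and scored by the model; that step depends on details this record does not give and is left out here.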

Pages: 18
