A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Cited by: 2
Authors
Wang, Chao [1]
Liu, Jingping [3,5]
Zhuang, Tianyi [1]
Li, Jiahang [1]
Liu, Juntao [1]
Xiao, Yanghua [1,2]
Wang, Wei [1]
Xie, Rui [4]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[2] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China
[3] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai, Peoples R China
[4] Meituan, Shanghai, Peoples R China
[5] Fudan Univ, Shanghai, Peoples R China
Keywords
Chinese abbreviation; Sequence-to-sequence model
DOI
10.1145/3488560.3498430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Abbreviations often used in our daily communication play an important role in natural language processing. Most of the existing studies regard the Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies in the process of abbreviation prediction, and the label prediction of each character should be conditioned on its previous labels. In this paper, we propose to formalize the Chinese abbreviation prediction task as a sequence generation problem, and a novel sequence-to-sequence model is designed. To boost the performance of our deep model, we further propose a multi-level pre-trained model that incorporates character, word, and concept-level embeddings. To evaluate our methods, a new dataset for Chinese abbreviation prediction is automatically built, which contains 81,351 pairs of full forms and abbreviations. Finally, we conduct extensive experiments on a public dataset and the built dataset, and the experimental results on both datasets show that our model outperforms the state-of-the-art methods. More importantly, we build a large-scale database for a specific domain, i.e., life services in Meituan Inc., with high accuracy of about 82.7%, which contains 4,134,142 pairs of full forms and abbreviations. The online A/B testing on Meituan APP and Dianping APP suggests that Click-Through Rate increases by 0.59% and 0.86% respectively when the built database is used in the searching system. We have released our API on http://kw.fudan.edu.cn/ddemos/abbr/ with over 87k API calls in 9 months.
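The sequence labeling formulation that the abstract contrasts against can be illustrated with a minimal sketch. It assumes a simple two-label scheme (KEEP/SKIP per character) and a greedy left-to-right alignment; the function names and the example pair 北京大学 / 北大 are illustrative assumptions and are not taken from the paper, whose own model generates the abbreviation with a sequence-to-sequence decoder instead.

```python
from typing import List


def labels_from_pair(full_form: str, abbreviation: str) -> List[str]:
    """Derive per-character KEEP/SKIP labels by greedy left-to-right alignment.

    Hypothetical helper: assumes the abbreviation is a subsequence of the
    full form, which is the usual premise of the labeling formulation.
    """
    labels = []
    j = 0  # index of the next abbreviation character still to be matched
    for ch in full_form:
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels.append("KEEP")
            j += 1
        else:
            labels.append("SKIP")
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full form")
    return labels


def apply_labels(full_form: str, labels: List[str]) -> str:
    """Rebuild the abbreviation by keeping only the characters labeled KEEP."""
    return "".join(ch for ch, lab in zip(full_form, labels) if lab == "KEEP")


labels = labels_from_pair("北京大学", "北大")
print(labels)
print(apply_labels("北京大学", labels))
```

In this scheme a tagger that predicts each label independently has no view of its earlier decisions, which is the label-dependency limitation the abstract points to; a generation model conditions each output character on the characters already produced.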
Pages: 1063-1071
Page count: 9
Related papers
50 in total
  • [1] A Realistic Drum Accompaniment Generator Using Sequence-to-Sequence Model and MIDI Music Database
    Akyuz, Yavuz Batuhan
    Gumustekin, Sevket
    2022 30TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU, 2022,
  • [2] A Sequence-to-Sequence Model for Semantic Role Labeling
    Daza, Angel
    Frank, Anette
    REPRESENTATION LEARNING FOR NLP, 2018, : 207 - 216
  • [3] Document Ranking with a Pretrained Sequence-to-Sequence Model
    Nogueira, Rodrigo
    Jiang, Zhiying
    Pradeep, Ronak
    Lin, Jimmy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 708 - 718
  • [4] MULTI-SCALE ALIGNMENT AND CONTEXTUAL HISTORY FOR ATTENTION MECHANISM IN SEQUENCE-TO-SEQUENCE MODEL
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 648 - 655
  • [5] Statistics of large-scale sequence searching
    Spang, R
    Vingron, M
    BIOINFORMATICS, 1998, 14 (03) : 279 - 284
  • [6] Large-Scale Pairwise Sequence Alignments on a Large-Scale GPU Cluster
    Savran, Ibrahim
    Gao, Yang
    Bakos, Jason D.
    IEEE DESIGN & TEST, 2014, 31 (01) : 51 - 61
  • [7] AN ANALYSIS OF INCORPORATING AN EXTERNAL LANGUAGE MODEL INTO A SEQUENCE-TO-SEQUENCE MODEL
    Kannan, Anjuli
    Wu, Yonghui
    Nguyen, Patrick
    Sainath, Tara N.
    Chen, Zhifeng
    Prabhavalkar, Rohit
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5824 - 5828
  • [8] A sequence-to-sequence model for joint bridge response forecasting
    Bahrami, Omid
    Wang, Wentao
    Hou, Rui
    Lynch, Jerome P.
    MECHANICAL SYSTEMS AND SIGNAL PROCESSING, 2023, 203
  • [9] A Hierarchical Sequence-to-Sequence Model for Korean POS Tagging
    Jin, Guozhe
    Yu, Zhezhou
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (02)
  • [10] A Sequence-to-Sequence Pronunciation Model for Bangla Speech Synthesis
    Ahmad, Arif
    Hussain, Mohammed Raihan
    Selim, Mohammad Reza
    Iqbal, Muhammed Zafar
    Rahman, Mohammad Shahidur
    2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018,