A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Cited by: 2
Authors
Wang, Chao [1]
Liu, Jingping [3,5]
Zhuang, Tianyi [1]
Li, Jiahang [1]
Liu, Juntao [1]
Xiao, Yanghua [1,2]
Wang, Wei [1]
Xie, Rui [4]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[2] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China
[3] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai, Peoples R China
[4] Meituan, Shanghai, Peoples R China
[5] Fudan Univ, Shanghai, Peoples R China
Keywords
Chinese abbreviation; Sequence-to-sequence model
DOI
10.1145/3488560.3498430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Abbreviations are common in daily communication and play an important role in natural language processing. Most existing studies treat Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies during abbreviation prediction, even though the label predicted for each character should be conditioned on the preceding labels. In this paper, we formalize Chinese abbreviation prediction as a sequence generation problem and design a novel sequence-to-sequence model. To boost the performance of our deep model, we further propose a multi-level pre-trained model that incorporates character-, word-, and concept-level embeddings. To evaluate our methods, we automatically build a new dataset for Chinese abbreviation prediction containing 81,351 pairs of full forms and abbreviations. We conduct extensive experiments on a public dataset and on the built dataset, and the results on both show that our model outperforms state-of-the-art methods. More importantly, we build a large-scale database for a specific domain, life services at Meituan Inc., which contains 4,134,142 pairs of full forms and abbreviations with an accuracy of about 82.7%. Online A/B testing on the Meituan and Dianping apps suggests that click-through rate increases by 0.59% and 0.86%, respectively, when the built database is used in the search system. We have released our API at http://kw.fudan.edu.cn/ddemos/abbr/, which has received over 87k API calls in 9 months.
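The sequence labeling formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary keep/drop label per character of the full form; the function names `apply_labels` and `derive_labels` and the example pair (北京大学, abbreviated as 北大) are illustrative assumptions and are not taken from the paper or its dataset.

```python
from typing import List, Optional


def apply_labels(full_form: str, labels: List[int]) -> str:
    """Build an abbreviation by keeping the characters labeled 1."""
    if len(full_form) != len(labels):
        raise ValueError("one label is required per character")
    return "".join(ch for ch, keep in zip(full_form, labels) if keep)


def derive_labels(full_form: str, abbreviation: str) -> Optional[List[int]]:
    """Recover keep/drop labels by greedy left-to-right matching.

    Returns None when the abbreviation is not a subsequence of the
    full form, i.e. when this labeling scheme cannot express the pair.
    """
    labels = [0] * len(full_form)
    j = 0  # position of the next abbreviation character to match
    for i, ch in enumerate(full_form):
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels[i] = 1
            j += 1
    return labels if j == len(abbreviation) else None


# Illustrative pair (an assumption, not from the paper's data).
labels = derive_labels("北京大学", "北大")
abbreviation = apply_labels("北京大学", labels)
```

In this sketch each label is assigned per character; a sequence generation model instead emits the abbreviation one character at a time, so each output can be conditioned on the characters already produced.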
Pages: 1063-1071
Number of pages: 9