A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Cited by: 2
Authors
Wang, Chao [1]
Liu, Jingping [3, 5]
Zhuang, Tianyi [1]
Li, Jiahang [1]
Liu, Juntao [1]
Xiao, Yanghua [1, 2]
Wang, Wei [1]
Xie, Rui [4]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[2] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China
[3] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai, Peoples R China
[4] Meituan, Shanghai, Peoples R China
[5] Fudan Univ, Shanghai, Peoples R China
Keywords
Chinese abbreviation; Sequence-to-sequence model
DOI
10.1145/3488560.3498430
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Abbreviations, which are common in daily communication, play an important role in natural language processing. Most existing studies treat Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies during prediction, even though the label of each character should be conditioned on the labels that precede it. In this paper, we formalize Chinese abbreviation prediction as a sequence generation problem and design a novel sequence-to-sequence model for it. To boost the performance of our deep model, we further propose a multi-level pre-trained model that incorporates character-, word-, and concept-level embeddings. To evaluate our methods, we automatically build a new dataset for Chinese abbreviation prediction containing 81,351 pairs of full forms and abbreviations. We conduct extensive experiments on a public dataset and on the built dataset, and the results on both show that our model outperforms state-of-the-art methods. More importantly, we build a large-scale database for a specific domain, life services at Meituan Inc., which contains 4,134,142 pairs of full forms and abbreviations with an accuracy of about 82.7%. Online A/B testing on the Meituan and Dianping apps suggests that click-through rate increases by 0.59% and 0.86%, respectively, when the built database is used in the search system. We have released our API at http://kw.fudan.edu.cn/ddemos/abbr/, which received over 87k API calls in 9 months.
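The contrast the abstract draws between the two task formulations can be made concrete with a small sketch. It is illustrative only and is not the authors' model: the keep/skip label scheme, the function names, the `<eos>` marker, and the example pair 北京大学 / 北大 (Peking University) are all assumptions introduced here. The first two functions show the sequence labeling view, where each character of the full form gets one label; the third shows the generation view, where each output character is predicted given the full form and everything generated so far.

```python
from typing import List, Optional, Tuple

KEEP, SKIP = "K", "S"  # hypothetical label names, not taken from the paper


def labels_from_pair(full: str, abbr: str) -> Optional[List[str]]:
    """Sequence labeling view: one keep/skip label per character of `full`.

    Uses a greedy left-to-right alignment and returns None when `abbr`
    is not a subsequence of `full` (labeling cannot express such pairs).
    """
    labels: List[str] = []
    j = 0  # index of the next abbreviation character still to be matched
    for ch in full:
        if j < len(abbr) and ch == abbr[j]:
            labels.append(KEEP)
            j += 1
        else:
            labels.append(SKIP)
    return labels if j == len(abbr) else None


def abbr_from_labels(full: str, labels: List[str]) -> str:
    """Rebuild the abbreviation by keeping the characters labeled KEEP."""
    return "".join(ch for ch, lab in zip(full, labels) if lab == KEEP)


def generation_contexts(full: str, abbr: str) -> List[Tuple[str, str, str]]:
    """Sequence generation view: (source, prefix so far, next token) triples.

    Each target token is conditioned on the whole source and on the
    previously generated prefix, ending with an end-of-sequence marker.
    """
    target = list(abbr) + ["<eos>"]
    return [(full, abbr[:i], tok) for i, tok in enumerate(target)]


if __name__ == "__main__":
    full_form, short_form = "北京大学", "北大"
    print(labels_from_pair(full_form, short_form))
    for step in generation_contexts(full_form, short_form):
        print(step)
```

In the labeling view, a model that scores each position independently never sees its own earlier decisions, whereas the generation triples make the dependence on the previous outputs explicit. That dependence is the motivation the abstract gives for moving to a sequence-to-sequence formulation.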
Pages: 1063-1071
Page count: 9