A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Cited by: 2

Authors:

Wang, Chao [1]

Liu, Jingping [3,5]

Zhuang, Tianyi [1]

Li, Jiahang [1]

Liu, Juntao [1]

Xiao, Yanghua [1,2]

Wang, Wei [1]

Xie, Rui [4]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China

[2] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China

[3] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai, Peoples R China

[4] Meituan, Shanghai, Peoples R China

[5] Fudan Univ, Shanghai, Peoples R China

Source:

WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2022

Keywords:

Chinese abbreviation; Sequence-to-sequence model;

DOI:

10.1145/3488560.3498430

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Abbreviations often used in our daily communication play an important role in natural language processing. Most of the existing studies regard the Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies in the process of abbreviation prediction, and the label prediction of each character should be conditioned on its previous labels. In this paper, we propose to formalize the Chinese abbreviation prediction task as a sequence generation problem, and a novel sequence-to-sequence model is designed. To boost the performance of our deep model, we further propose a multi-level pre-trained model that incorporates character, word, and concept-level embeddings. To evaluate our methods, a new dataset for Chinese abbreviation prediction is automatically built, which contains 81,351 pairs of full forms and abbreviations. Finally, we conduct extensive experiments on a public dataset and the built dataset, and the experimental results on both datasets show that our model outperforms the state-of-the-art methods. More importantly, we build a large-scale database for a specific domain, i.e., life services in Meituan Inc., with high accuracy of about 82.7%, which contains 4,134,142 pairs of full forms and abbreviations. The online A/B testing on Meituan APP and Dianping APP suggests that Click-Through Rate increases by 0.59% and 0.86% respectively when the built database is used in the searching system. We have released our API on http://kw.fudan.edu.cn/ddemos/abbr/ with over 87k API calls in 9 months.
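To make concrete the sequence labeling formulation that the abstract contrasts with sequence generation, here is a minimal sketch. It is not the authors' model: the function names `labels_from_pair` and `apply_labels` are hypothetical, the keep/drop label scheme is an assumption about how such a formulation is commonly set up, and the example pair (北京大学, 北大) is an illustration rather than an entry from the paper's dataset.

```python
from typing import List, Optional


def labels_from_pair(full_form: str, abbreviation: str) -> Optional[List[int]]:
    """Derive per-character labels (1 = keep, 0 = drop) by greedy
    left-to-right alignment. Returns None when the abbreviation is not
    a subsequence of the full form. (Hypothetical helper.)"""
    labels = [0] * len(full_form)
    j = 0  # index of the next abbreviation character to match
    for i, ch in enumerate(full_form):
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels[i] = 1
            j += 1
    return labels if j == len(abbreviation) else None


def apply_labels(full_form: str, labels: List[int]) -> str:
    """Rebuild an abbreviation by keeping the characters labeled 1."""
    return "".join(ch for ch, keep in zip(full_form, labels) if keep)


if __name__ == "__main__":
    labels = labels_from_pair("北京大学", "北大")
    print(labels)
    print(apply_labels("北京大学", labels))
```

Under this scheme a labeling model that scores each character independently has no view of which characters it has already kept. A sequence generation model instead emits the abbreviation one character at a time, conditioning each output on the characters produced so far, which is the dependency the abstract says labeling models usually ignore.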

Pages: 1063-1071

Page count: 9
