A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Cited by: 2

Authors:

Wang, Chao [1]

Liu, Jingping [3,5]

Zhuang, Tianyi [1]

Li, Jiahang [1]

Liu, Juntao [1]

Xiao, Yanghua [1,2]

Wang, Wei [1]

Xie, Rui [4]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China

[2] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China

[3] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai, Peoples R China

[4] Meituan, Shanghai, Peoples R China

[5] Fudan Univ, Shanghai, Peoples R China

Source:

WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2022

Keywords:

Chinese abbreviation; Sequence-to-sequence model;

DOI:

10.1145/3488560.3498430

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Abbreviations often used in our daily communication play an important role in natural language processing. Most of the existing studies regard the Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies in the process of abbreviation prediction, and the label prediction of each character should be conditioned on its previous labels. In this paper, we propose to formalize the Chinese abbreviation prediction task as a sequence generation problem, and a novel sequence-to-sequence model is designed. To boost the performance of our deep model, we further propose a multi-level pre-trained model that incorporates character, word, and concept-level embeddings. To evaluate our methods, a new dataset for Chinese abbreviation prediction is automatically built, which contains 81,351 pairs of full forms and abbreviations. Finally, we conduct extensive experiments on a public dataset and the built dataset, and the experimental results on both datasets show that our model outperforms the state-of-the-art methods. More importantly, we build a large-scale database for a specific domain, i.e., life services in Meituan Inc., with high accuracy of about 82.7%, which contains 4,134,142 pairs of full forms and abbreviations. The online A/B testing on Meituan APP and Dianping APP suggests that Click-Through Rate increases by 0.59% and 0.86% respectively when the built database is used in the searching system. We have released our API on http://kw.fudan.edu.cn/ddemos/abbr/ with over 87k API calls in 9 months.

Pages: 1063-1071

Page count: 9

50 items in total

[21] Accelerated large-scale multiple sequence alignment
Scott Lloyd
Quinn O Snell
BMC Bioinformatics, 12
[22] Accelerated large-scale multiple sequence alignment
Lloyd, Scott
Snell, Quinn O.
BMC BIOINFORMATICS, 2011, 12
[23] Comparing algorithms for large-scale sequence analysis
Nash, H
Blair, D
Grefenstette, J
2ND ANNUAL IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, PROCEEDINGS, 2001: 89 - 96
[24] Effective large-scale sequence similarity searches
Claverie, JM
COMPUTER METHODS FOR MACROMOLECULAR SEQUENCE ANALYSIS, 1996, 266 : 212 - 227
[25] Large-scale homologous analysis of genome sequence
Tang, HX
Ding, DF
ACTA BIOCHIMICA ET BIOPHYSICA SINICA, 1996, 28 (06): 686 - 693
[26] A WORKBENCH FOR LARGE-SCALE SEQUENCE HOMOLOGY ANALYSIS
SONNHAMMER, ELL
DURBIN, R
COMPUTER APPLICATIONS IN THE BIOSCIENCES, 1994, 10 (03): 301 - 307
[27] Large-scale sequence analyses of Atlantic cod
Johansen, Steinar D.
Coucheron, Dag H.
Andreassen, Morten
Karlsen, Bard Ove
Furmanek, Tomasz
Jorgensen, Tor Erik
Emblem, Ase
Breines, Ragna
Nordeide, Jarle T.
Moum, Truls
Nederbragt, Alexander J.
Stenseth, Nils C.
Jakobsen, Kjetill S.
NEW BIOTECHNOLOGY, 2009, 25 (05): 263 - 271
[28] LASH: Large-Scale Sequence Mining with Hierarchies
Beedkar, Kaustubh
Gemulla, Rainer
SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015: 491 - 503
[29] Building a Filipino Colloquialism Translator Using Sequence-to-Sequence Model
Nocon, Nicco
Michelle Kho, Nyssa
Arroyo, Jeniffer
PROCEEDINGS OF TENCON 2018 - 2018 IEEE REGION 10 CONFERENCE, 2018: 2199 - 2204
[30] High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
Thai-Son Nguyen
Ngoc-Quan Pham
Stueker, Sebastian
Waibel, Alex
INTERSPEECH 2020, 2020: 2147 - 2151
