A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check

Authors
Wang, Dingmin [1]
Song, Yan [2]
Li, Jing [2]
Han, Jialong [2]
Zhang, Haisong [2]
Affiliations
[1] Tencent Inc, Shenzhen, Peoples R China
[2] Tencent AI Lab, Shenzhen, Peoples R China
Keywords
CLASSIFICATION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Chinese spelling check (CSC) is a challenging yet meaningful task: it serves as a preprocessing step in many natural language processing (NLP) applications and also facilitates reading and understanding of running text in people's daily lives. However, data-driven approaches to CSC face one major limitation: there are not enough annotated corpora for applying algorithms and building models. In this paper, we propose a novel approach to constructing a CSC corpus with automatically generated spelling errors, which are either visually or phonologically similar characters, corresponding to OCR- and ASR-based methods, respectively. Using the constructed corpus, different models are trained and evaluated for CSC on three standard test sets. Experimental results demonstrate the effectiveness of the corpus and thus confirm the validity of our approach.
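The idea of building training pairs by injecting visually or phonologically similar characters can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract describes OCR- and ASR-based generation, whereas this example swaps in small hand-written confusion tables as a stand-in. The tables `VISUAL` and `PHONETIC`, the function `corrupt`, and the parameter `max_errors` are all hypothetical names introduced here for illustration.

```python
import random

# Toy confusion tables (illustrative assumptions, not taken from the paper).
# VISUAL pairs look alike; PHONETIC pairs share a pronunciation.
VISUAL = {"己": ["已"], "未": ["末"], "人": ["入"], "土": ["士"]}
PHONETIC = {"在": ["再"], "的": ["地", "得"], "他": ["她"], "做": ["作"]}


def corrupt(sentence, rng, max_errors=1):
    """Replace up to max_errors characters with a similar-looking or
    similar-sounding one.

    Returns (corrupted_sentence, errors), where errors is a sorted list of
    (position, original_char, wrong_char, kind) tuples that can serve as
    the annotation for a spelling-check training example.
    """
    chars = list(sentence)
    # Positions whose character has at least one known confusion.
    candidates = [i for i, c in enumerate(chars) if c in VISUAL or c in PHONETIC]
    rng.shuffle(candidates)
    errors = []
    for i in candidates[:max_errors]:
        original = chars[i]
        options = [(w, "visual") for w in VISUAL.get(original, [])]
        options += [(w, "phonetic") for w in PHONETIC.get(original, [])]
        wrong, kind = rng.choice(options)
        chars[i] = wrong
        errors.append((i, original, wrong, kind))
    return "".join(chars), sorted(errors)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    noisy, errors = corrupt("他在家做作业", rng, max_errors=2)
    print(noisy, errors)
```

Each call yields a (noisy sentence, error annotations) pair; the original sentence acts as the gold correction. A real system would presumably derive the substitutions from OCR and ASR output rather than from fixed tables.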
Pages: 2517-2527
Page count: 11