A Seed-Based Method for Generating Chinese Confusion Sets

Cited by: 6
Authors
Liu, Liangliang [1]
Cao, Cungen [2]
Affiliations
[1] Shanghai Univ Int Business & Econ, Sch Business Informat, Shanghai 201620, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Confusion set; pattern matching; context probability; pinyin similarity; shape similarity
DOI
10.1145/2933396
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In natural language, people often misuse a word (called a "confused word") in place of other words (called "confusing words"). Many approaches to finding and correcting spelling errors are based on a simple notion called a "confusion set": the confusion set of a confused word consists of its confusing words. In this article, we propose a new method of building Chinese character confusion sets. Our method has two major phases. In the first phase, we build a list of seed confusion sets for each Chinese character, based on similarity in character pinyin or similarity in character shape. All confusion sets in this phase are constructed manually and organized into a graph, called a "seed confusion graph" (SCG), in which vertices denote characters and edges are pairs of the form (confused character, confusing character). In the second phase, we extend the SCG by acquiring more (confused character, confusing character) pairs from a large Chinese corpus: we use several word patterns to generate new confusion pairs and then verify the pairs before adding them to the SCG. Comprehensive experiments show that our method of extending confusion sets is effective. We also apply the confusion sets to Chinese spelling correction to show the utility of our method.
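The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the article: the class name `SeedConfusionGraph`, the function `extend_graph`, the example characters, and the toy scores are all hypothetical. In particular, the `score` callback only stands in for the verification step, whose actual form the abstract does not specify.

```python
from collections import defaultdict


class SeedConfusionGraph:
    """Directed graph whose edges are (confused, confusing) character pairs.

    Hypothetical structure for illustration; the article's own
    representation is not given in the abstract.
    """

    def __init__(self):
        self._edges = defaultdict(set)

    def add_pair(self, confused, confusing):
        # A character is never its own confusing character.
        if confused != confusing:
            self._edges[confused].add(confusing)

    def confusion_set(self, confused):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._edges.get(confused, set()))


def extend_graph(graph, candidate_pairs, score, threshold):
    """Add candidate pairs whose verification score reaches the threshold.

    `score` is a placeholder for whatever verification is used in practice;
    it takes (confused, confusing) and returns a number.
    """
    added = []
    for confused, confusing in candidate_pairs:
        if confusing in graph.confusion_set(confused):
            continue  # already present in the graph
        if score(confused, confusing) >= threshold:
            graph.add_pair(confused, confusing)
            added.append((confused, confusing))
    return added


# Phase 1 (assumed example): hand-picked seed pairs with the same pinyin.
scg = SeedConfusionGraph()
scg.add_pair("在", "再")
scg.add_pair("再", "在")

# Phase 2 (assumed example): candidates with made-up verification scores.
toy_scores = {("的", "得"): 0.9, ("的", "人"): 0.1}
new_pairs = extend_graph(
    scg,
    toy_scores.keys(),
    score=lambda a, b: toy_scores[(a, b)],
    threshold=0.5,
)
print(new_pairs)
print(scg.confusion_set("的"))
```

Storing edges as a mapping from each confused character to a set of confusing characters makes looking up a confusion set a single dictionary access, which is the operation a spelling corrector would perform most often.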
Pages: 16