Automatic word spacing using probabilistic models based on character n-grams

Cited by: 13
Authors
Lee, Do-Gil [1 ]
Rim, Hae-Chang [1 ]
Yook, Dongsuk [1 ]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
Keywords
DOI
10.1109/MIS.2007.4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Probabilistic models for automatic word spacing, based on hidden Markov models (HMMs) and character n-grams (subsequences of n characters in a given character sequence), are discussed. Automatic word spacing is a preprocessing technique for correcting word boundaries in sentences that contain spacing errors. These models can be effectively applied to a natural language with a small character set, such as English, by using character n-grams larger than trigrams. Because the models are language independent, they can be used effectively for languages with word spacing, and also for word segmentation in languages without explicit word spacing. By generalizing the HMM, the models can consider a broad context and estimate accurate probabilities.
Pages: 28 - 35
Number of pages: 8
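
The abstract describes an HMM-style formulation of word spacing: a hidden tag at each character position marks whether a space follows that character, and character n-gram statistics supply the probabilities. The following is a minimal, hypothetical Python sketch of that general idea, not the authors' actual model: it estimates P(space | character bigram context) from space-annotated text with add-one smoothing and applies greedy per-position decoding, whereas a Viterbi search over whole tag sequences would correspond to the fuller HMM treatment. The class name, method names, and toy data are illustrative assumptions.

# Minimal, hypothetical sketch (not the paper's exact model) of character
# n-gram based word spacing: after each character, decide whether a space
# follows, using probabilities estimated from bigram contexts.
from collections import defaultdict


class SpacingModel:
    def __init__(self):
        # context_tag[((prev_char, curr_char), tag)] counts how often tag
        # (1 = "space follows curr_char", 0 = "no space") occurs in this context.
        self.context_tag = defaultdict(int)
        self.context = defaultdict(int)

    def train(self, sentences):
        # sentences: correctly spaced training strings, e.g. "the cat ran".
        for sent in sentences:
            sent = sent.strip()
            chars = sent.replace(" ", "")
            tags = []
            for ch in sent:
                if ch == " ":
                    tags[-1] = 1        # a space follows the previous character
                else:
                    tags.append(0)
            for j, ch in enumerate(chars):
                prev = chars[j - 1] if j > 0 else "<s>"
                ctx = (prev, ch)
                self.context_tag[(ctx, tags[j])] += 1
                self.context[ctx] += 1

    def prob_space(self, ctx):
        # P(space | bigram context), with add-one smoothing over the two tags.
        return (self.context_tag[(ctx, 1)] + 1) / (self.context[ctx] + 2)

    def space(self, text):
        # Greedy per-position decoding; a Viterbi search over tag sequences
        # would be the HMM generalization described in the abstract.
        out = []
        for j, ch in enumerate(text):
            out.append(ch)
            prev = text[j - 1] if j > 0 else "<s>"
            if j < len(text) - 1 and self.prob_space((prev, ch)) > 0.5:
                out.append(" ")
        return "".join(out)


if __name__ == "__main__":
    model = SpacingModel()
    model.train(["the cat ran", "a dog sat"])
    print(model.space("thedogsat"))   # prints: the dog sat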