Automatic word spacing using probabilistic models based on character n-grams

Cited by: 13
Authors
Lee, Do-Gil [1]
Rim, Hae-Chang [1]
Yook, Dongsuk [1]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
Keywords
Probabilistic logics;
DOI
10.1109/MIS.2007.4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Probabilistic models for automatic word spacing, based on hidden Markov models (HMMs) and character n-grams (subsequences of n characters in a given character sequence), are discussed. Automatic word spacing is a preprocessing technique for correcting word boundaries in a sentence that contains spacing errors. The models can be applied effectively to a natural language with a small character set, such as English, by using character n-grams larger than trigrams. Because they are language independent, the models can be used for any language with word spacing and also for word segmentation in languages without explicit word spacing. By generalizing the HMMs, the models can consider a broad context and estimate accurate probabilities.
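To illustrate the general idea, the sketch below treats word spacing as per-character tagging: each character gets a tag indicating whether a space should precede it, and the tag probability is estimated from character n-gram counts in a space-annotated corpus. This is only a minimal assumption-laden sketch, not the authors' model: it uses bigrams and a greedy threshold decision instead of the paper's HMM generalization with longer n-grams, and the function names and toy corpus are hypothetical.

```python
from collections import defaultdict

def tags_from_spaced(sentence):
    """Return (characters without spaces, per-character space tags).
    Tag 1 means 'a space precedes this character' in the training sentence."""
    chars, tags = [], []
    space_pending = False
    for ch in sentence:
        if ch == " ":
            space_pending = True
            continue
        tags.append(1 if space_pending and chars else 0)
        chars.append(ch)
        space_pending = False
    return chars, tags

def train(spaced_sentences):
    # (prev_char, char) -> [count(tag=0), count(tag=1)]
    counts = defaultdict(lambda: [0, 0])
    for sent in spaced_sentences:
        chars, tags = tags_from_spaced(sent)
        for i, (ch, tag) in enumerate(zip(chars, tags)):
            prev = chars[i - 1] if i > 0 else "<s>"
            counts[(prev, ch)][tag] += 1
    return counts

def insert_spaces(text, counts):
    """Greedy decoding: put a space before c_i when the add-one smoothed
    estimate of P(tag=1 | c_{i-1} c_i) exceeds 0.5."""
    chars = [ch for ch in text if ch != " "]
    out = []
    for i, ch in enumerate(chars):
        prev = chars[i - 1] if i > 0 else "<s>"
        n0, n1 = counts[(prev, ch)]
        if i > 0 and (n1 + 1) / (n0 + n1 + 2) > 0.5:
            out.append(" ")
        out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    toy_corpus = ["the cat sat", "the dog sat", "a cat ran"]
    model = train(toy_corpus)
    print(insert_spaces("thecatran", model))  # -> "the cat ran" on this toy corpus
```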
Pages: 28-35
Number of pages: 8