Automatic word spacing using probabilistic models based on character n-grams

Cited by: 13
Authors
Lee, Do-Gil [1 ]
Rim, Hae-Chang [1 ]
Yook, Dongsuk [1 ]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
DOI
10.1109/MIS.2007.4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Probabilistic models based on hidden Markov models (HMMs) for automatic word spacing are discussed. The models use character n-grams, where a character n-gram is a subsequence of n characters in a given character sequence. Automatic word spacing is a preprocessing technique for correcting boundaries between words in a sentence containing spacing errors. These models can be effectively applied to a natural language with a small character set, such as English, by using character n-grams larger than trigrams. The models are language independent: they can be effectively used for languages that have word spacing, and they can also be used for word segmentation in languages without explicit word spacing. By generalizing HMMs, the models can consider a broad context and estimate accurate probabilities.
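The definition of a character n-gram given in the abstract can be illustrated with a minimal Python sketch. The function name `char_ngrams` and the sample string are assumptions made for illustration only; this shows the n-gram definition, not the paper's HMM-based spacing models.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous run of n characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L has L - n + 1 n-grams (none when L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical input: a sentence whose word spacing has been lost.
trigrams = char_ngrams("wordspacing", 3)
print(trigrams)
```

Larger values of n give each n-gram more surrounding context, which is the property the abstract refers to when it mentions n-grams larger than trigrams.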
Pages: 28-35
Page count: 8
Related papers
50 in total
  • [21] Probabilistic retrieval of OCR degraded text using N-grams
    Harding, SM
    Croft, WB
    Weir, C
    [J]. RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 1997, 1324 : 345 - 359
  • [22] Detection of Opinion Spam with Character n-grams
    Hernandez Fusilier, Donato
    Montes-y-Gomez, Manuel
    Rosso, Paolo
    Guzman Cabrera, Rafael
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 285 - 294
  • [23] Error Classification Using Automatic Measures Based on n-grams and Edit Distance
    Benko, L'ubomir
    Benkova, Lucia
    Munkova, Dasa
    Munk, Michal
    Shulzenko, Danylo
    [J]. ADVANCED RESEARCH IN TECHNOLOGIES, INFORMATION, INNOVATION AND SUSTAINABILITY, ARTIIS 2022, PT I, 2022, 1675 : 345 - 356
  • [24] Diacritics restoration based on word n-grams for Slovak texts
    Toth, Stefan
    Zaymus, Emanuel
    Duracik, Michal
    Hrkut, Patrik
    Mesko, Matej
    [J]. OPEN COMPUTER SCIENCE, 2021, 11 (01): : 180 - 189
  • [25] Using n-grams for the Automated Clustering of Structural Models
    Babur, Onder
    Cleophas, Loek
    [J]. SOFSEM 2017: THEORY AND PRACTICE OF COMPUTER SCIENCE, 2017, 10139 : 510 - 524
  • [26] Predicting Political Donations Using Twitter Hashtags and Character N-Grams
    Conrad, Colin
    Keselj, Vlado
    [J]. 2016 IEEE 18TH CONFERENCE ON BUSINESS INFORMATICS (CBI), VOL. 2, 2016, : 1 - 7
  • [27] Decoding Algorithm of Automatic Stochastic Translation based on N-grams
    Crego, Josep M.
    Marino, Jose B.
    de Gispert, Adria
    [J]. PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 85 - 92
  • [28] Unconstrained Offline Handwriting Recognition using Connectionist Character N-grams
    Zamora-Martinez, F.
    Castro-Bleda, M. J.
    Espana-Boquera, S.
    Gorbe-Moya, J.
    [J]. 2010 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS IJCNN 2010, 2010,
  • [29] Author Assertion of Furtive Write Print Using Character N-Grams
    Hassan, Feryal H.
    Chaurasia, Mousmi A.
    [J]. FUTURE INFORMATION TECHNOLOGY, 2011, 13 : 274 - 278
  • [30] Using character N-grams to explore diachronic change in medieval English
    Buckley, Kevin
    Vogel, Carl
    [J]. FOLIA LINGUISTICA, 2019, 53 : 249 - 299