Automatic word spacing using probabilistic models based on character n-grams

Cited by: 13
Authors
Lee, Do-Gil [1 ]
Rim, Hae-Chang [1 ]
Yook, Dongsuk [1 ]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
Keywords
DOI
10.1109/MIS.2007.4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Probabilistic models for automatic word spacing, based on hidden Markov models (HMMs) and character n-grams (subsequences of n characters in a given character sequence), are discussed. Automatic word spacing is a preprocessing technique for correcting word boundaries in sentences that contain spacing errors. These models can be effectively applied to a natural language with a small character set, such as English, by using character n-grams larger than trigrams. Because the models are language independent, they can be used effectively for languages with word spacing, and also for word segmentation in languages without explicit word spacing. By generalizing the HMM, the models can consider a broad context and estimate accurate probabilities.
Pages: 28 - 35
Number of pages: 8
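
The abstract describes an HMM-style formulation of word spacing: a hidden tag at each character position marks whether a space follows that character, and character n-gram statistics supply the probabilities. The following is a minimal, hypothetical Python sketch of that general idea, not the authors' actual model: it estimates P(space | character bigram context) from space-annotated text with add-one smoothing and applies greedy per-position decoding, whereas a Viterbi search over whole tag sequences would correspond to the fuller HMM treatment. The class name, method names, and toy data are illustrative assumptions.

# Minimal, hypothetical sketch (not the paper's exact model) of character
# n-gram based word spacing: after each character, decide whether a space
# follows, using probabilities estimated from bigram contexts.
from collections import defaultdict


class SpacingModel:
    def __init__(self):
        # context_tag[((prev_char, curr_char), tag)] counts how often tag
        # (1 = "space follows curr_char", 0 = "no space") occurs in this context.
        self.context_tag = defaultdict(int)
        self.context = defaultdict(int)

    def train(self, sentences):
        # sentences: correctly spaced training strings, e.g. "the cat ran".
        for sent in sentences:
            sent = sent.strip()
            chars = sent.replace(" ", "")
            tags = []
            for ch in sent:
                if ch == " ":
                    tags[-1] = 1        # a space follows the previous character
                else:
                    tags.append(0)
            for j, ch in enumerate(chars):
                prev = chars[j - 1] if j > 0 else "<s>"
                ctx = (prev, ch)
                self.context_tag[(ctx, tags[j])] += 1
                self.context[ctx] += 1

    def prob_space(self, ctx):
        # P(space | bigram context), with add-one smoothing over the two tags.
        return (self.context_tag[(ctx, 1)] + 1) / (self.context[ctx] + 2)

    def space(self, text):
        # Greedy per-position decoding; a Viterbi search over tag sequences
        # would be the HMM generalization described in the abstract.
        out = []
        for j, ch in enumerate(text):
            out.append(ch)
            prev = text[j - 1] if j > 0 else "<s>"
            if j < len(text) - 1 and self.prob_space((prev, ch)) > 0.5:
                out.append(" ")
        return "".join(out)


if __name__ == "__main__":
    model = SpacingModel()
    model.train(["the cat ran", "a dog sat"])
    print(model.space("thedogsat"))   # prints: the dog sat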