Automatic restoration of diacritics based on word n-grams for Slovak texts

Citations: 0
Authors
Toth, Stefan [1]
Zaymus, Emanuel [1]
Duracik, Michal [1]
Mesko, Matej [1]
Hrkut, Patrik [1]
Affiliations
[1] University of Zilina, Department of Software Technologies, Faculty of Management Science and Informatics, Zilina, Slovakia
Keywords
diacritic; diacritics restoration; n-gram; Slovak language
DOI
10.1109/informatics47936.2019.9119328
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many people still write texts without diacritics, especially in chat messages, e-mails, or discussion posts. The habit has historical roots: messages once suffered from text-encoding problems, and omitting diacritics also makes typing faster. In this paper, we propose an algorithm based on word n-grams (contiguous sequences of n words) that restores diacritics in text written in the Slovak language. We also evaluate our results against existing algorithms developed for Slovak texts.
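The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea of word n-gram diacritics restoration, not the authors' method. It assumes a small diacritized training corpus, uses only unigram and bigram counts, and picks candidates greedily from left to right; the names `strip_diacritics` and `NgramRestorer` are hypothetical and introduced purely for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Return text with accents removed (e.g. 'čaj' -> 'caj')."""
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping the combining marks leaves the undiacritized form.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class NgramRestorer:
    """Illustrative restorer built from unigram and bigram word counts."""

    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        # Maps an undiacritized word to every diacritized form seen in training.
        self.candidates = defaultdict(set)
        for sentence in corpus_sentences:
            words = sentence.lower().split()
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            for word in words:
                self.candidates[strip_diacritics(word)].add(word)

    def restore(self, text):
        restored = []
        previous = None
        for word in text.lower().split():
            options = self.candidates.get(strip_diacritics(word))
            if not options:
                # Unknown word: leave it unchanged.
                best = word
            else:
                # Prefer the candidate seen most often after the previous
                # restored word; fall back to plain word frequency on ties.
                best = max(
                    sorted(options),
                    key=lambda c: (self.bigrams[(previous, c)], self.unigrams[c]),
                )
            restored.append(best)
            previous = best
        return " ".join(restored)
```

Trained on the two sentences "mám rád čaj" and "dlhý rad ľudí", this sketch restores "mam rad caj" to "mám rád čaj" and "dlhy rad ludi" to "dlhý rad ľudí": the ambiguous form "rad" is resolved by the preceding word. A practical system would need a much larger corpus, longer n-grams, and a proper search over the whole sentence rather than a greedy pass.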
Pages: 243-248
Page count: 6