Automatic restoration of diacritics based on word n-grams for Slovak texts

Cited by: 0
Authors
Toth, Stefan [1 ]
Zaymus, Emanuel [1 ]
Duracik, Michal [1 ]
Mesko, Matej [1 ]
Hrkut, Patrik [1 ]
Affiliation
[1] Univ Zilina, Dept Software Technol, Fac Management Sci & Informat, Zilina, Slovakia
Keywords
diacritic; diacritics restoration; n-gram; Slovak language;
DOI
10.1109/informatics47936.2019.9119328
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many people still write texts without diacritics, especially in chat messages, e-mails, and discussion posts. The habit has historical roots: people either ran into text-encoding problems in messages or wanted to type faster. In this paper, we propose an algorithm based on word n-grams (contiguous sequences of n words) that restores diacritics in text written in the Slovak language. We also evaluate our results and compare them with existing algorithms developed for Slovak texts.
Pages: 243-248
Number of pages: 6
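
The abstract describes restoring diacritics with word n-grams but gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a greedy left-to-right choice that prefers the candidate with the highest bigram count after the previously restored word and falls back to unigram counts. The names `strip_diacritics` and `NgramRestorer`, the lowercasing, and the three-sentence toy corpus are all hypothetical and chosen for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class NgramRestorer:
    """Toy word-bigram restorer (illustrative assumption, not the paper's algorithm)."""

    def __init__(self) -> None:
        self.unigrams = Counter()            # counts of diacritized words
        self.bigrams = Counter()             # counts of adjacent diacritized word pairs
        self.candidates = defaultdict(set)   # stripped form -> diacritized forms

    def train(self, sentences) -> None:
        for sentence in sentences:
            words = sentence.lower().split()
            for i, word in enumerate(words):
                self.unigrams[word] += 1
                self.candidates[strip_diacritics(word)].add(word)
                if i > 0:
                    self.bigrams[(words[i - 1], word)] += 1

    def restore(self, text: str) -> str:
        restored = []
        for word in text.lower().split():
            options = self.candidates.get(strip_diacritics(word))
            if not options:
                restored.append(word)  # unknown word: leave unchanged
                continue
            prev = restored[-1] if restored else None
            # Prefer the best bigram continuation, then the most frequent word.
            best = max(
                sorted(options),
                key=lambda c: (self.bigrams[(prev, c)], self.unigrams[c]),
            )
            restored.append(best)
        return " ".join(restored)


# Hypothetical toy corpus: "súd" (court) and "sud" (barrel) share one stripped form.
restorer = NgramRestorer()
restorer.train(["najvyšší súd rozhodol", "plný sud piva", "súd rozhodol"])
```

With this toy corpus, `restorer.restore("plny sud piva")` keeps `sud` because the bigram (`plný`, `sud`) was observed, while `restorer.restore("najvyssi sud rozhodol")` selects `súd`. A realistic system would need a large corpus, longer n-grams with backoff, and handling of case and punctuation.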