Automatic restoration of diacritics based on word n-grams for Slovak texts

Cited by: 0

Authors:

Toth, Stefan [1]

Zaymus, Emanuel [1]

Duracik, Michal [1]

Mesko, Matej [1]

Hrkut, Patrik [1]

Affiliation:

[1] Univ Zilina, Dept Software Technol, Fac Management Sci & Informat, Zilina, Slovakia

Source:

2019 IEEE 15TH INTERNATIONAL SCIENTIFIC CONFERENCE ON INFORMATICS (INFORMATICS 2019) | 2019

Keywords:

diacritic; diacritics restoration; n-gram; Slovak language;

DOI:

10.1109/informatics47936.2019.9119328

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Many people still write texts without diacritics, especially in chat messages, e-mails, and discussion posts. The habit arose for historical reasons: people had problems with text encoding in messages or wanted to type faster. In this paper, we propose an algorithm based on word n-grams (contiguous sequences of n words) that restores diacritics in text written in the Slovak language. We also evaluate our results and compare them with existing algorithms developed for Slovak texts.
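The abstract does not describe the algorithm in detail, so the following is only a minimal sketch of the general idea of word n-gram diacritics restoration, not the authors' method. It assumes a tiny diacritized training corpus, uses bigrams (n = 2) with a unigram-frequency tie-break, and all function names (`strip_diacritics`, `train`, `restore`) are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'veľmi' -> 'velmi'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Collect candidate forms and bigram counts from diacritized text."""
    candidates = defaultdict(Counter)  # stripped form -> counts of diacritized forms
    bigrams = Counter()                # (previous word, word) -> count
    for sentence in sentences:
        prev = "<s>"  # assumed sentence-start marker
        for word in sentence.lower().split():
            candidates[strip_diacritics(word)][word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return candidates, bigrams


def restore(text, candidates, bigrams):
    """Greedy left-to-right restoration using bigram context."""
    restored = []
    prev = "<s>"
    for word in text.lower().split():
        options = candidates.get(word)
        if not options:
            best = word  # unknown word: leave unchanged
        else:
            # Prefer the form seen most often after the previous restored word;
            # fall back to overall frequency when the bigram counts tie.
            best = max(options, key=lambda c: (bigrams[(prev, c)], options[c]))
        restored.append(best)
        prev = best
    return " ".join(restored)


# Illustrative corpus (an assumption, not data from the paper).
corpus = ["som veľmi rád", "dlhý rad ľudí", "veľmi rád čítam"]
candidates, bigrams = train(corpus)
print(restore("som velmi rad", candidates, bigrams))
print(restore("dlhy rad ludi", candidates, bigrams))
```

In this toy corpus the undiacritized word "rad" is ambiguous between "rád" and "rad"; the preceding word decides which form is chosen. A practical system would need a large corpus, longer n-grams, and a backoff strategy for unseen contexts.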

Pages: 243-248

Page count: 6
