Social media text normalization for Turkish

Cited by: 18
Authors
Eryigit, Gulsen [1]
Torunoglu-Selamet, Dilara [1]
Affiliations
[1] Istanbul Tech Univ, Dept Comp Engn, Istanbul, Turkey
Keywords
RECOGNITION; MODEL
DOI
10.1017/S1351324917000134
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).
Pages: 835-875
Number of pages: 41
Related papers
50 in total
  • [1] Neural Text Normalization for Turkish Social Media
    Goker, Sinan
    Can, Burcu
    [J]. 2018 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2018, : 161 - 166
  • [2] Turkish Normalization Lexicon for Social Media
    Demir, Seniz
    Tan, Murat
    Topcu, Berkay
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, (CICLING 2016), PT II, 2018, 9624 : 418 - 429
  • [3] Lexical Normalization for Social Media Text
    Han, Bo
    Cook, Paul
    Baldwin, Timothy
    [J]. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2013, 4 (01)
  • [5] Roman to Gurmukhi Social Media Text Normalization
    Kaur, Jagroop
    Singh, Jaswinder
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT COMPUTING AND CYBERNETICS, 2020, 13 (04) : 407 - 435
  • [6] A Modular Approach for Social Media Text Normalization
    Rehan, Palak
    Kumar, Mukesh
    Singh, Sarbjeet
    [J]. INFORMATION AND DECISION SCIENCES, 2018, 701 : 187 - 195
  • [7] Text Normalization in Code-Mixed Social Media Text
    Dutta, Sukanya
    Saha, Tista
    Banerjee, Somnath
    Naskar, Sudip Kumar
    [J]. 2015 IEEE 2ND INTERNATIONAL CONFERENCE ON RECENT TRENDS IN INFORMATION SYSTEMS (RETIS), 2015, : 378 - 382
  • [8] Rule-based Text Normalization for Malay Social Media Texts
    Ariffin, Siti Noor Allia Noor
    Tiun, Sabrina
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (10) : 156 - 162
  • [9] A Natural Language Normalization Approach to Enhance Social Media Text Reasoning
    Long Hoang Nguyen
    Salopek, Andrew
    Zhao, Liang
    Jin, Fang
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 2019 - 2026
  • [10] Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text
    Khan, Jebran
    Lee, Sungchang
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (17):