Social media text normalization for Turkish

Cited by: 18
Authors
Eryigit, Gulsen [1]
Torunoglu-Selamet, Dilara [1]
Affiliations
[1] Istanbul Tech Univ, Dept Comp Engn, Istanbul, Turkey
Keywords
RECOGNITION; MODEL
DOI
10.1017/S1351324917000134
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).
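The exact match scores quoted in the abstract, including the case-insensitive variant, can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names `turkish_lower` and `exact_match`, the token-level comparison, and the sample tokens are all assumptions made for illustration. The sketch also assumes that case folding should follow Turkish rules, where "I" lowers to "ı" and "İ" lowers to "i".

```python
from typing import Sequence


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (illustrative helper, not from the paper).

    Python's str.lower() maps "I" to "i", which is wrong for Turkish,
    so the two special letters are handled before the generic call.
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def exact_match(predicted: Sequence[str], gold: Sequence[str],
                case_insensitive: bool = False) -> float:
    """Return the percentage of tokens whose prediction equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    if case_insensitive:
        predicted = [turkish_lower(tok) for tok in predicted]
        gold = [turkish_lower(tok) for tok in gold]
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


# Hypothetical system output and manually normalized gold tokens.
system_output = ["geliyorum", "istanbul", "Kitap", "slm"]
gold_tokens = ["geliyorum", "İstanbul", "kitap", "selam"]

strict_score = exact_match(system_output, gold_tokens)
relaxed_score = exact_match(system_output, gold_tokens, case_insensitive=True)
```

In this toy example the strict score counts only "geliyorum" as correct, while the case-insensitive score also accepts "istanbul" and "Kitap", which mirrors why the abstract reports a higher figure under case-insensitive evaluation.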
Pages: 835-875
Number of pages: 41
Related papers
50 in total
  • [11] An Enhancement of Malay Social Media Text Normalization for Lexicon-Based Sentiment Analysis
    Abu Bakar, Muhammad Fakhrur Razi
    Idris, Norisma
    Shuib, Liyana
    PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 211 - 215
  • [12] Graph-based Turkish text normalization and its impact on noisy text processing
    Demir, Seniz
    Topcu, Berkay
    ENGINEERING SCIENCE AND TECHNOLOGY-AN INTERNATIONAL JOURNAL-JESTECH, 2022, 35
  • [13] Machine Normalization: Bringing Social Media Text from Non-Standard to Standard Form
    Zarnoufi, Randa
    Jaafar, Hamid
    Abik, Mounia
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (04)
  • [14] UTILIZING SOCIAL MEDIA DATA THROUGH SIMILARITY-BASED TEXT NORMALIZATION FOR LVCSR LANGUAGE MODELING
    Chotimongkol, Ananlada
    Thangthai, Kwanchiva
    Wutiwiwatchai, Chai
    2014 17TH ORIENTAL CHAPTER OF THE INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDIZATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (COCOSDA), 2014
  • [15] Text normalization in social media: progress, problems and applications for a pre-processing system of casual English
    Clark, Eleanor
    Araki, Kenji
    COMPUTATIONAL LINGUISTICS AND RELATED FIELDS, 2011, 27 : 2 - 11
  • [16] Context-sensitive normalization of social media text in bahasa Indonesia based on neural word embeddings
    Kusumawardani, Renny Pradina
    Priansya, Stezar
    Atletiko, Faizal Johan
    INNS CONFERENCE ON BIG DATA AND DEEP LEARNING, 2018, 144 : 105 - 117
  • [17] A normalization model for repeated letters in social media hate speech text based on rules and spelling correction
    Mansur, Zainab
    Omar, Nazlia
    Tiun, Sabrina
    Alshari, Eissa M.
    PLOS ONE, 2024, 19 (03)
  • [18] A text typology of social media
    Berber Sardinha, Tony
    REGISTER STUDIES, 2022, 4 (02) : 138 - 170
  • [19] Evaluating text normalization for speech-based media selection
    Pfeil, Martin
    Buehler, Dirk
    Gruhn, Rainer
    Minker, Wolfgang
    PERCEPTION IN MULTIMODAL DIALOGUE SYSTEMS, PROCEEDINGS, 2008, 5078 : 52 - +
  • [20] Parser Adaptation for Social Media by Integrating Normalization
    van der Goot, Rob
    van Noord, Gertjan
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 491 - 497