A customizable pipeline for social media text normalization

Cited by: 12
Authors
Sarker A. [1]
Affiliation
[1] Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
Funding
National Institutes of Health (United States);
Keywords
Lexical normalization; Natural language processing; Social media data preparation; Social media text normalization; Social network mining; Text mining;
DOI
10.1007/s13278-017-0464-z
Abstract
Social networks are persistently generating text-based data that encapsulate vast amounts of knowledge. However, the presence of non-standard terms and misspellings in texts originating from social networks poses a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then utilized to normalize short, high-frequency, ambiguous terms. A supervised learning module is employed to normalize terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations of the system to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables the easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text. © 2017, Springer-Verlag GmbH Austria.
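The sequential, modular design described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the lexicon, the vocabulary, and the names `lexicon_stage`, `collapse_repeats_stage`, and `normalize` are hypothetical, and the language model, supervised, and backoff modules are left out. It only shows how independent stages might be chained so that each one receives the partially normalized output of the previous one.

```python
import re
from typing import Callable, Dict, List, Set

# A stage takes a token list and returns a (partially) normalized token list.
Stage = Callable[[List[str]], List[str]]

# Hypothetical toy resources; a real system would load much larger ones.
LEXICON: Dict[str, str] = {"u": "you", "gr8": "great", "2mrw": "tomorrow"}
VOCAB: Set[str] = {"see", "you", "great", "tomorrow", "are", "the", "best"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexicon_stage(tokens: List[str]) -> List[str]:
    """Replace tokens that have a direct entry in the toy lexicon."""
    return [LEXICON.get(tok, tok) for tok in tokens]


def collapse_repeats_stage(tokens: List[str]) -> List[str]:
    """For out-of-vocabulary tokens, squeeze runs of 3+ identical characters
    to one character, keeping the result only if it is in the vocabulary."""
    out = []
    for tok in tokens:
        if tok not in VOCAB:
            squeezed = re.sub(r"(.)\1{2,}", r"\1", tok)
            if squeezed in VOCAB:
                tok = squeezed
        out.append(tok)
    return out


def normalize(text: str, stages: List[Stage]) -> str:
    """Run the stages in order; each sees the previous stage's output."""
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("see u 2mrw, u are the bessst",
                    [lexicon_stage, collapse_repeats_stage]))
```

Because every stage shares the same signature, a domain-specific stage could be appended to or swapped into the list without changing the others, which is the kind of customization the abstract refers to.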