A Survey on Text Classification Algorithms: From Text to Predictions

Cited by: 52
Authors
Gasparetto, Andrea [1]
Marcuzzo, Matteo [1]
Zangari, Alessandro [1]
Albarelli, Andrea [2]
Affiliations
[1] Ca' Foscari University of Venice, Department of Management, I-30123 Venice, Italy
[2] Ca' Foscari University of Venice, Department of Environmental Sciences, Informatics and Statistics, I-30123 Venice, Italy
Keywords
text classification; tokenisation; topic labelling; news classification; transformer; shallow learning; deep learning; multilabel corpora; logistic regression; networks
DOI
10.3390/info13020083
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.
Pages: 39