COMPARATIVE STUDY OF LONG DOCUMENT CLASSIFICATION

Cited by: 6
Authors
Wagh, Vedangi [1]
Khandve, Snehal [1]
Joshi, Isha [1]
Wani, Apurva [1]
Kale, Geetanjali [1]
Joshi, Raviraj [2]
Affiliations
[1] Pune Inst Comp Technol, Pune, Maharashtra, India
[2] Indian Inst Technol Madras, Chennai, Tamil Nadu, India
Keywords
Transformer; BERT; Recurrent Neural Networks; Topic Identification; Text Categorization; Hierarchical Attention Networks; Deep Learning
DOI
10.1109/TENCON54134.2021.9707465
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The amount of information stored as documents on the internet has been increasing rapidly, so it has become necessary to organize and maintain these documents efficiently. Text classification algorithms study the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years, progressing from simple machine learning algorithms to transformer-based architectures. However, the existing literature has analyzed different approaches on different datasets, making it difficult to compare the performance of machine learning algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets and present an exhaustive comparison of these algorithms across a range of long document datasets. We reiterate that long document classification is a simpler task, and even basic algorithms perform competitively with BERT-based approaches on most of the datasets. The BERT-based models perform consistently well on all the datasets and can be used as a default choice for document classification when computational cost is not a concern. In the shallow-model category, we suggest the raw BiLSTM + Max architecture, which performs decently across all the datasets. The even simpler GloVe + Attention bag-of-words model can be used for simpler use cases. The importance of sophisticated models is clearly visible on the IMDB sentiment dataset, which is a comparatively harder task.
Pages: 732-737
Page count: 6
Related papers
50 in total
  • [1] A comparative study of citations and links in document classification
    Couto, Thierson
    Cristo, Marco
    Goncalves, Marcos Andre
    Calado, Pavel
    Ziviani, Nivio
    Moura, Edleno
    Ribeiro-Neto, Berthier
    [J]. OPENING INFORMATION HORIZONS, 2006, : 75 - +
  • [2] A Comparative Study of Local Detectors and Descriptors for Mobile Document Classification
    Rusinol, Marcal
    Chazalon, Joseph
    Ogier, Jean-Marc
    Llados, Josep
    [J]. 2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2015, : 596 - 600
  • [3] Comparative Document Summarisation via Classification
    Bista, Umanga
    Mathews, Alexander
    Shin, Minjeong
    Menon, Aditya Krishna
    Xie, Lexing
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 20 - 28
  • [4] A comparative study of two automatic document classification methods in a library setting
    Pong, Joanna Yi-Hang
    Kwok, Ron Chi-Wai
    Lau, Raymond Yiu-Keung
    Hao, Jin-Xing
    Wong, Percy Ching-Chi
    [J]. JOURNAL OF INFORMATION SCIENCE, 2008, 34 (02) : 213 - 230
  • [5] HIERARCHICAL TRANSFORMERS FOR LONG DOCUMENT CLASSIFICATION
    Pappagari, Raghavendra
    Zelasko, Piotr
    Villalba, Jesus
    Carmiel, Yishay
    Dehak, Najim
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 838 - 844
  • [6] Machine Learning Algorithms for Document Classification: Comparative Analysis
    Rashid, Faizur
    Gargaare, Suleiman M. A.
    Aden, Abdulkadir H.
    Abdi, Afendi
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (04) : 260 - 265
  • [7] Hierarchical Attention Transformer Networks for Long Document Classification
    Hu, Yongli
    Chen, Puman
    Liu, Tengfei
    Gao, Junbin
    Sun, Yanfeng
    Yin, Baocai
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [8] Automating Document Classification in the Financial Markets: A Comparative Study of Simple Models Versus Complex Models
    Atmani, Houda
    Moutacalli, Mohamed Tarik
    [J]. 2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED SYSTEMS AND EMERGENT TECHNOLOGIES, ICASET 2024, 2024,
  • [9] Comparative Study of Single-task and Multi-task Learning on Research Protocol Document Classification
    Abdillah, Abid Famasya
    Hamidi, Mohammad Zaenuddin
    Anggraeni, Ratih Nur Esti
    Sarno, Riyanarto
    [J]. PROCEEDINGS OF 2021 13TH INTERNATIONAL CONFERENCE ON INFORMATION & COMMUNICATION TECHNOLOGY AND SYSTEM (ICTS), 2021, : 213 - 217
  • [10] Long Length Document Classification by Local Convolutional Feature Aggregation
    Liu, Liu
    Liu, Kaile
    Cong, Zhenghai
    Zhao, Jiali
    Ji, Yefei
    He, Jun
    [J]. ALGORITHMS, 2018, 11 (08)