Text classification framework for short text based on TFIDF-FastText

Cited by: 6
Authors
Chawla, Shrutika [1]
Kaur, Ravreet [1]
Aggarwal, Preeti [1]
Affiliations
[1] Panjab Univ, Univ Inst Engn & Technol UIET, CSE Dept, Chandigarh, India
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing;
DOI
10.1007/s11042-023-15211-5
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text classification is a high-priority problem in text mining and information retrieval, and it requires capturing the semantic information of the text. Although several approaches are used to detect similarity between short sentences, most of them miss the semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features are engineered from the dataset, which play a major role in classification. The redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. A 623-dimensional data model is obtained by combining all the obtained features, and it is then fed to the Light Gradient Boosting Machine for classification. Finally, the hyperparameters are tuned to attain an optimized log_loss. The experimental results show that the proposed framework can achieve 81.47% accuracy, which is on par with other state-of-the-art models.
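The "TF-IDF weighted average" of word vectors mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-dimensional vectors stand in for real FastText embeddings, the smoothed IDF formula is one common variant chosen here as an assumption, and the helper names `idf_weights` and `weighted_sentence_vector` are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency per token (one common variant; an assumption)."""
    n = len(docs)
    # Count in how many documents each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_sentence_vector(tokens, vectors, idf, dim):
    """Average the word vectors of one tokenized sentence, weighted by TF * IDF."""
    tf = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        if tok not in vectors or tok not in idf:
            continue  # skip tokens that have no vector or no IDF weight
        w = (count / len(tokens)) * idf[tok]
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be looked up
    return [x / weight_sum for x in total]


# Toy 2-dimensional vectors standing in for FastText embeddings (assumption).
toy_vectors = {
    "how": [1.0, 0.0],
    "learn": [0.0, 1.0],
    "python": [1.0, 1.0],
    "java": [-1.0, 1.0],
}

# A hypothetical tokenized question pair.
docs = [["how", "learn", "python"], ["how", "learn", "java"]]
idf = idf_weights(docs)
q1 = weighted_sentence_vector(docs[0], toy_vectors, idf, dim=2)
q2 = weighted_sentence_vector(docs[1], toy_vectors, idf, dim=2)
print(q1, q2)
```

In a pipeline like the one described, the two sentence vectors of a question pair would presumably be concatenated with the engineered features before being passed to the classifier; the exact layout of the 623 dimensions is not given in the abstract.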
Pages: 40167 - 40180
Number of pages: 14
Related papers
50 in total
  • [1] Text classification framework for short text based on TFIDF-FastText
    Shrutika Chawla
    Ravreet Kaur
    Preeti Aggarwal
    Multimedia Tools and Applications, 2023, 82 : 40167 - 40180
  • [2] Text Classification Model Based on fastText
    Yao, Tengjun
    Zhai, Zhengang
    Gao, Bingtao
    PROCEEDINGS OF 2020 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS), 2020, : 154 - 157
  • [3] Improved Algorithm Based on TFIDF in Text Classification
    Jiang, Hao
    Li, Wenqiang
    MEMS, NANO AND SMART SYSTEMS, PTS 1-6, 2012, 403-408 : 1791 - 1794
  • [4] Improvement and application of TFIDF method based on text classification
    Zhang, Yufang
    Peng, Shiming
    Lv, Jia
    Jisuanji Gongcheng/Computer Engineering, 2006, 32 (19): : 76 - 78
  • [5] An improved TFIDF Algorithm in text classification
    Xu, Dongdong
    Wu, Shaobo
    MATERIAL SCIENCE, CIVIL ENGINEERING AND ARCHITECTURE SCIENCE, MECHANICAL ENGINEERING AND MANUFACTURING TECHNOLOGY II, 2014, 651-653 : 2258 - 2261
  • [6] A Probabilistic Framework for Short Text Classification
    Ali, Mubashir
    Khalid, Shehzad
    Rana, Mazhar Iqbal
    Azhar, Fizza
    2018 IEEE 8TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC), 2018, : 742 - 747
  • [7] TFIDF based Feature Words Extraction and Topic Modeling for Short Text
    Zhao, Guifen
    Liu, Yanjun
    Zhang, Wei
    Wang, Yiou
    PROCEEDINGS OF THE 2018 2ND INTERNATIONAL CONFERENCE ON MANAGEMENT ENGINEERING, SOFTWARE ENGINEERING AND SERVICE SCIENCES (ICMSS 2018), 2018, : 188 - 191
  • [8] A Comparison of fastText Implementations Using Arabic Text Classification
    Alghamdi, Nuha
    Assiri, Fatmah
    INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 2, 2020, 1038 : 306 - 311
  • [9] Short Text Classification Based on Semantics
    Ma, Chenglong
    Wan, Xin
    Zhang, Zhen
    Li, Taisong
    Zhang, Yan
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, ICIC 2015, PT III, 2015, 9227 : 463 - 470
  • [10] Impact of convolutional neural network and FastText embedding on text classification
    Muhammad Umer
    Zainab Imtiaz
    Muhammad Ahmad
    Michele Nappi
    Carlo Medaglia
    Gyu Sang Choi
    Arif Mehmood
    Multimedia Tools and Applications, 2023, 82 : 5569 - 5585