Text classification framework for short text based on TFIDF-FastText

Times cited: 6
Authors
Chawla, Shrutika [1]
Kaur, Ravreet [1]
Aggarwal, Preeti [1]
Affiliations
[1] Panjab Univ, Univ Inst Engn & Technol UIET, CSE Dept, Chandigarh, India
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing
DOI
10.1007/s11042-023-15211-5
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification is a high-priority problem in text mining and information retrieval, and it requires capturing the semantic information of the text. Although several approaches are used to detect similarity between short sentences, most of them miss the semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features, which play a major role in classification, are engineered from the dataset. Redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. A 623-dimensional data model is obtained by combining all the obtained features and is then fed to the Light Gradient Boosting Machine for classification. Finally, the hyperparameters are tuned to optimize log_loss. The experimental results show that the proposed framework achieves 81.47% accuracy, which is on par with other state-of-the-art models.
Pages: 40167-40180
Page count: 14
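
The abstract mentions a TF-IDF weighted average of FastText word vectors. The sketch below illustrates that general idea only and is not the authors' implementation. The two-dimensional `TOY_VECTORS` stand in for real FastText embeddings, the helper names are hypothetical, and the smoothed IDF formula (the default variant in scikit-learn's `TfidfVectorizer`) is an assumption, since the abstract does not say which variant the paper uses.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    # Assumed variant: ln((1 + N) / (1 + df)) + 1
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def weighted_sentence_vector(tokens, vectors, idf, dim):
    """TF-IDF weighted average of the word vectors of one short text."""
    term_freq = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in term_freq.items():
        if tok not in vectors or tok not in idf:
            continue  # skip tokens without a vector or an IDF weight
        weight = count * idf[tok]
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known tokens: fall back to the zero vector
    return [value / weight_sum for value in total]


# Hypothetical 2-d embeddings standing in for FastText vectors.
TOY_VECTORS = {
    "how": [1.0, 0.0],
    "learn": [0.0, 1.0],
    "python": [1.0, 1.0],
    "cook": [-1.0, 0.0],
}

corpus = [["how", "learn", "python"], ["how", "cook"]]
idf = idf_weights(corpus)
sentence_vec = weighted_sentence_vector(corpus[0], TOY_VECTORS, idf, dim=2)
```

Because "how" occurs in both toy questions, it receives the lowest IDF weight, so the sentence vector leans toward the rarer words "learn" and "python" compared with a plain unweighted average. In a pipeline like the one the abstract describes, vectors of this kind for both questions in a pair would be combined with the engineered features before classification.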