Text classification framework for short text based on TFIDF-FastText

Cited by: 6
Authors
Chawla, Shrutika [1]
Kaur, Ravreet [1]
Aggarwal, Preeti [1]
Affiliations
[1] Panjab Univ, Univ Inst Engn & Technol UIET, CSE Dept, Chandigarh, India
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing
DOI
10.1007/s11042-023-15211-5
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Text classification is a high-priority problem in text mining and information retrieval, and it requires capturing the semantic information of the text. Although several approaches are used to detect similarity between short sentences, most of them miss the semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features, which play a major role in classification, are engineered from the dataset. The redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. A 623-dimensional data model is obtained by combining all the resulting features and is then fed to the Light Gradient Boosting Machine (LGBM) for classification. Finally, the hyperparameters are tuned to optimize log_loss. The experimental results show that the proposed framework achieves 81.47% accuracy, which is on par with other state-of-the-art models.
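The TF-IDF weighted averaging of word vectors mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `embeddings` dictionary is a hypothetical stand-in for pretrained FastText vectors, the function names are invented for illustration, and the smoothed IDF formula is an assumption (the abstract does not specify which TF-IDF variant is used).

```python
import math
from collections import Counter


def idf_weights(corpus_tokens):
    """Smoothed IDF per token: ln((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n_docs = len(corpus_tokens)
    # Count each token once per document to get its document frequency.
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_sentence_vector(tokens, idf, embeddings, dim):
    """Average the word vectors of `tokens`, weighting each by tf * idf."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in Counter(tokens).items():
        if tok not in embeddings or tok not in idf:
            continue  # skip tokens without a vector or an IDF weight
        weight = tf * idf[tok]
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [value / weight_sum for value in total]


# Toy corpus of tokenized questions (hypothetical data).
corpus = [["how", "to", "learn", "python"],
          ["how", "to", "learn", "java"]]
# Hypothetical 2-dimensional vectors standing in for FastText embeddings.
embeddings = {"how": [1.0, 0.0], "python": [0.0, 1.0]}

idf = idf_weights(corpus)
vector = weighted_sentence_vector(["how", "python"], idf, embeddings, dim=2)
```

Because "python" appears in only one of the two toy documents, it receives a larger IDF weight than "how" and pulls the sentence vector toward its own embedding. In a pipeline like the one the abstract describes, such a vector for each question would be concatenated with the engineered features before being passed to the classifier.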
Pages: 40167-40180
Page count: 14