Text classification framework for short text based on TFIDF-FastText

Cited by: 6
Authors
Chawla, Shrutika [1]
Kaur, Ravreet [1]
Aggarwal, Preeti [1]
Affiliations
[1] Panjab Univ, Univ Inst Engn & Technol UIET, CSE Dept, Chandigarh, India
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing
DOI
10.1007/s11042-023-15211-5
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text classification is a high-priority problem in text mining and information retrieval, and it requires capturing the semantic information of the text. Although several approaches are used to detect similarity between short sentences, most of them miss this semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features, which play a major role in classification, are engineered from the dataset. Redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. A 623-dimensional data model is obtained by combining all the obtained features and is then fed to the Light Gradient Boosting Machine (LGBM) for classification. Finally, the hyperparameters are tuned to optimize log_loss. The experimental results show that the proposed framework can achieve 81.47% accuracy, which is on par with other state-of-the-art models.
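The TF-IDF weighted averaging of word vectors mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the smoothed IDF formula, and the 2-dimensional toy embedding table are all assumptions made for illustration. In practice the vectors would come from a pre-trained FastText model rather than a hand-written dictionary.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the tokenized docs."""
    n = len(docs)
    # Document frequency: number of docs in which each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_sentence_vector(tokens, idf, embeddings, dim):
    """TF-IDF weighted average of the word vectors of one tokenized sentence."""
    # Term frequency, counting only tokens that have both an IDF and a vector.
    tf = Counter(t for t in tokens if t in idf and t in embeddings)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = count * idf[tok]
        weight_sum += w
        for i, value in enumerate(embeddings[tok]):
            total[i] += w * value
    if weight_sum == 0.0:
        return total  # no known tokens: fall back to the zero vector
    return [x / weight_sum for x in total]


# Hypothetical toy data standing in for question text and FastText vectors.
docs = [["how", "learn", "python"], ["how", "learn", "java"]]
embeddings = {"how": [1.0, 0.0], "learn": [0.0, 1.0], "python": [1.0, 1.0]}

idf = idf_weights(docs)
vec = weighted_sentence_vector(docs[0], idf, embeddings, dim=2)
print(vec)
```

Tokens that occur in fewer questions (here "python") receive a larger IDF and therefore pull the sentence vector more strongly toward their own embedding than common tokens such as "how".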
Pages: 40167-40180
Page count: 14
Related papers
50 items in total
  • [11] Structural Learning Framework for Binary Short Text Classification
    Liu, Wuying
    Wang, Lin
    2016 12TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), 2016, : 1188 - 1193
  • [12] Impact of convolutional neural network and FastText embedding on text classification
    Umer, Muhammad
    Imtiaz, Zainab
    Ahmad, Muhammad
    Nappi, Michele
    Medaglia, Carlo
    Choi, Gyu Sang
    Mehmood, Arif
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (04) : 5569 - 5585
  • [13] Improvement of Text Feature Selection Method based on TFIDF
    Qu, Shouning
    Wang, Sujuan
    Zou, Yan
    2008 INTERNATIONAL SEMINAR ON FUTURE INFORMATION TECHNOLOGY AND MANAGEMENT ENGINEERING, PROCEEDINGS, 2008, : 79 - 81
  • [14] A Text Feature Selection Algorithm Based on Improved TFIDF
    Chengcheng Yang
    Xingshi He
    PROCEEDINGS OF THE 2008 CHINESE CONFERENCE ON PATTERN RECOGNITION (CCPR 2008), 2008, : 416 - 419
  • [15] Short Text Classification Based on Keywords Extension
    Gu, Yiran
    Shen, Jiajia
    2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 2616 - 2621
  • [16] Wikipedia Based Short Text Classification Method
    Li, Junze
    Cai, Yi
    Cai, Zhiwei
    Leung, Hofung
    Yang, Kai
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2017), 2017, 10179 : 275 - 286
  • [17] Medical-Based Text Classification Using FastText Features and CNN-LSTM Model
    Zeghdaoui, Mohamed Walid
    Boussaid, Omar
    Bentayeb, Fadila
    Joly, Frederik
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2021, PT I, 2021, 12923 : 155 - 167
  • [18] Classification of Proactive Personality: Text Mining Based on Weibo Text and Short-Answer Questions Text
    Wang, Peng
    Yan, Yun
    Si, Yingdong
    Zhu, Gancheng
    Zhan, Xiangping
    Wang, Jun
    Pan, Runsheng
    IEEE ACCESS, 2020, 8 : 97370 - 97382
  • [19] From Text Classification to Keyphrase Extraction for Short Text
    Lee, Song-Eun
    Kim, Kang-Min
    Ryu, Woo-Jong
    Park, Jemin
    Lee, SangKeun
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 1137 - 1142
  • [20] An automated new approach in fast text classification (fastText): A case study for Turkish text classification without pre-processing
    Kuyumcu, Birol
    Aksakalli, Cuneyt
    Delil, Selman
    NLPIR 2019: 2019 3RD INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, 2019, : 1 - 4