Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

Cited by: 12
Authors
Lan, Fei [1]
Affiliations
[1] Chongqing Coll Elect Engn, Sch Elect & Internet Things, Chongqing 400000, Peoples R China
Keywords
Compendex;
DOI
10.1155/2022/7923262
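Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code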
0808 ; 0809 ;
Abstract
TF-IDF (term frequency-inverse document frequency) is a traditional statistical method for calculating text similarity. Because TF-IDF ignores the semantic information of words, it cannot accurately reflect the similarity between texts. Methods enhanced with semantic information, in turn, distinguish poorly between text documents, because extending vectors with semantically similar terms aggravates the curse of dimensionality. To address this problem, this paper proposes a hybrid method that combines semantic understanding with TF-IDF to calculate text similarity. Building on the term similarity weighting tree (TSWT) data structure and the definition of semantic similarity from HowNet, the paper first discusses text preprocessing and filtering, and then uses the semantic information of the key terms to calculate the similarity of text documents, based on the weights of the features whose weight exceeds a given threshold. Experimental results show that the hybrid method outperforms both pure TF-IDF and the purely semantic method in accuracy, recall, and F1 score under different K-means clustering methods.
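The pipeline the abstract outlines (TF-IDF weighting, a weight threshold that selects key terms, then a similarity that also credits semantically related terms) can be illustrated with a minimal Python sketch. It is not the paper's method: the TSWT data structure and the HowNet similarity definition are not reproduced, a soft-cosine style measure is swapped in for the semantic step, and `TOY_SIM`, `term_sim`, the threshold value, and the three toy documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def key_terms(vector, threshold):
    """Keep only the features whose weight is greater than the threshold."""
    return {term: w for term, w in vector.items() if w > threshold}


# Hypothetical term-to-term similarities, standing in for a HowNet lookup.
TOY_SIM = {("car", "automobile"): 0.9, ("engine", "motor"): 0.8}


def term_sim(a, b):
    """Symmetric toy similarity: 1.0 for identical terms, 0.0 if unknown."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


def soft_dot(u, v, sim):
    """Weighted sum over all term pairs, scaled by their similarity."""
    return sum(wu * wv * sim(tu, tv)
               for tu, wu in u.items()
               for tv, wv in v.items())


def hybrid_similarity(u, v, sim):
    """Soft-cosine style similarity between two key-term vectors."""
    denom = math.sqrt(soft_dot(u, u, sim) * soft_dot(v, v, sim))
    return soft_dot(u, v, sim) / denom if denom > 0 else 0.0


docs = [
    ["car", "engine", "repair"],
    ["automobile", "motor", "repair"],
    ["banana", "fruit", "market"],
]
THRESHOLD = 0.2  # assumed value, chosen only for this toy corpus
keys = [key_terms(vec, THRESHOLD) for vec in tfidf_vectors(docs)]

print(hybrid_similarity(keys[0], keys[1], term_sim))
print(hybrid_similarity(keys[0], keys[2], term_sim))
```

In this toy corpus the first two documents share no key term after thresholding, so a plain cosine over the key terms would score them as unrelated, while the similarity table still lets the sketch register them as close. Filtering before the semantic step also keeps the vectors short, which is the dimensionality concern the abstract raises about semantically extended vectors.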
Pages: 11
Related papers
49 in total
  • [1] A text similarity measurement combining word semantic information with TF-IDF method
    Huang C.-H.
    Yin J.
    Hou F.
    Jisuanji Xuebao/Chinese Journal of Computers, 2011, 34 (05): 856 - 864
  • [2] Research of Text Classification Based on Improved TF-IDF Algorithm
    Liu, Cai-zhi
    Sheng, Yan-xiu
    Wei, Zhi-qiang
    Yang, Yong-Quan
    2018 IEEE INTERNATIONAL CONFERENCE OF INTELLIGENT ROBOTICS AND CONTROL ENGINEERING (IRCE), 2018: 218 - 222
  • [3] Improvement and Application of TF-IDF Algorithm in Text Orientation Analysis
    Wang, Wei
    Tang, Yongxin
    PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON ADVANCED MATERIALS SCIENCE AND ENVIRONMENTAL ENGINEERING, 2016, 52 : 230 - 233
  • [4] News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark
    Zhou, Zhuo
    Qin, Jiaohua
    Xiang, Xuyu
    Tan, Yun
    Liu, Qiang
    Xiong, Neal N.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2020, 62 (01): 217 - 231
  • [5] A Method of Text Dimension Reduction Based on CHI and TF-IDF
    Tang, HaiBo
    Zhou, Lei
    Xu, Chengjie
    Zhu, Quanyin
    PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON MECHATRONICS, MATERIALS, CHEMISTRY AND COMPUTER ENGINEERING 2015 (ICMMCCE 2015), 2015, 39 : 1854 - 1857
  • [6] Application of an Improved TF-IDF Method in Literary Text Classification
    Xiang, Lin
    ADVANCES IN MULTIMEDIA, 2022, 2022
  • [7] Research on case reasoning method based on TF-IDF
    Lin Zhang
    International Journal of System Assurance Engineering and Management, 2021, 12 : 608 - 615
  • [8] Research on case reasoning method based on TF-IDF
    Zhang, Lin
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2021, 12 (03) : 608 - 615
  • [9] An improvement to TF-IDF: Term distribution based term weight algorithm
    Xia T.
    Chai Y.
    Journal of Software, 2011, 6 (03) : 413 - 420
  • [10] Turning from TF-IDF to TF-IGM for term weighting in text classification
    Chen, Kewen
    Zhang, Zuping
    Long, Jun
    Zhang, Hao
    EXPERT SYSTEMS WITH APPLICATIONS, 2016, 66 : 245 - 260