A data-driven text similarity measure based on classification algorithms

Authors
[1] Cho, Su Gon
[2] Kim, Seoung Bum
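Source
Kim, Seoung Bum (sbkim1@korea.ac.kr) | International Journal of Industrial Engineering: Theory, Applications and Practice, 2017, Vol. 24, No. 3, pp. 328-339 | University of Cincinnati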
Funding
National Research Foundation, Singapore
Keywords
Application problems - Classification accuracy - Classification algorithm - Comparative experiments - Machine learning repository - Similarity measure - Text similarity - University of California;
DOI
Not available
Abstract
Measuring text similarity is a fundamental task in many text mining applications. This paper proposes a new method, based on classification algorithms, for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of each term across a collection of sentences, is constructed and used to measure the classification accuracy between two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, so a classifier trained to separate them should achieve low accuracy. In comparative experiments against several widely used text similarity measures, results on real data from the Machine Learning Repository at the University of California, Irvine show that the proposed method outperformed the existing similarity measures across the entire range of term selection filters. © International Journal of Industrial Engineering.
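The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, the validation scheme, or how accuracy is mapped to a similarity score. The code below assumes a multinomial naive Bayes classifier with Laplace smoothing, leave-one-out validation, whitespace tokenization, and a hypothetical mapping `similarity = 1 - accuracy`. All function names are made up for illustration.

```python
import math
from collections import Counter


def sentence_term_matrix(sentences):
    """Return (vocab, rows), where rows[i][j] counts term j in sentence i."""
    tokenized = [s.lower().split() for s in sentences]  # assumption: whitespace tokens
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def _nb_predict(train_rows, train_labels, row):
    """Multinomial naive Bayes with Laplace smoothing (assumed classifier)."""
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        if not members:
            continue
        totals = [sum(col) for col in zip(*members)]  # per-term counts in this class
        denom = sum(totals) + len(row)                # add-one smoothing
        score = math.log(len(members) / len(train_rows))
        score += sum(c * math.log((totals[j] + 1) / denom)
                     for j, c in enumerate(row) if c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def classification_accuracy(text_a, text_b):
    """Leave-one-out accuracy of telling text_a sentences from text_b sentences."""
    labels = [0] * len(text_a) + [1] * len(text_b)
    _, rows = sentence_term_matrix(text_a + text_b)
    correct = 0
    for i, row in enumerate(rows):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if _nb_predict(train_rows, train_labels, row) == labels[i]:
            correct += 1
    return correct / len(rows)


def similarity(text_a, text_b):
    """Hypothetical mapping: the easier the texts are to separate, the lower the score."""
    return 1.0 - classification_accuracy(text_a, text_b)


# Each text is given as a list of sentences.
text_a = ["cat chased mouse", "cat ate mouse", "mouse feared cat"]
text_b = ["stocks fell today", "stocks rose today", "today stocks closed"]
print(classification_accuracy(text_a, text_b), similarity(text_a, text_b))
```

Two texts on unrelated topics, as in the example, are easy to separate and so receive a low similarity under this mapping. With very short texts, leave-one-out estimates are noisy, so any real use would need more sentences per text and a validation scheme chosen with care.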
Related papers
  • [1] A DATA-DRIVEN TEXT SIMILARITY MEASURE BASED ON CLASSIFICATION ALGORITHMS
    Cho, Su Gon
    Kim, Seoung Bum
    INTERNATIONAL JOURNAL OF INDUSTRIAL ENGINEERING-THEORY APPLICATIONS AND PRACTICE, 2017, 24 (03): : 328 - 339
  • [2] Learning Data-driven Image Similarity Measure
    Kobayashi, Takumi
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 3679 - 3684
  • [3] A Data-driven Affective Text Classification Analysis
    Ardakani, Saeid Pourroostaei
    Zhou, Can
    Wu, Xuting
    Ma, Yingrui
    Che, Jizhou
    20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021), 2021, : 199 - 204
  • [4] Data-driven Gene Regulatory Network Inference based on Classification Algorithms
    Peignier, Sergio
    Schmitt, Pauline
    Calevro, Federica
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 1065 - 1072
  • [5] Data-driven Gene Regulatory Networks Inference Based on Classification Algorithms
    Peignier, Sergio
    Schmitt, Pauline
    Calevro, Federica
    INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2021, 30 (04)
  • [6] Similarity Measure Development for Case-Based Reasoning-A Data-Driven Approach
    Verma, Deepika
    Bach, Kerstin
    Mork, Paul Jarle
    NORDIC ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT, 2019, 1056 : 143 - 148
  • [7] A Similarity Measure for Text Classification and Clustering
    Lin, Yung-Shen
    Jiang, Jung-Yi
    Lee, Shie-Jue
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (07) : 1575 - 1590
  • [8] A set theory based similarity measure for text clustering and classification
    Amer, Ali A.
    Abdalla, Hassan I.
    JOURNAL OF BIG DATA, 2020, 7 (01)
  • [9] A set theory based similarity measure for text clustering and classification
    Ali A. Amer
    Hassan I. Abdalla
    Journal of Big Data, 7
  • [10] An Improved Similarity Measure for Text Clustering and Classification
    Reddy, G. Suresh
    Kanth, T. V. Rajini
    Rao, A. Ananda
    ADVANCED SCIENCE LETTERS, 2015, 21 (11) : 3583 - 3590