EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

Cited by: 0
Authors
Kumar, Niraj [1 ]
Vemula, Venkata Vinay Babu [1 ]
Srinathan, Kannan [1 ]
Varma, Vasudeva [2 ]
Affiliations
[1] Int Inst Informat Technol, Ctr Secur Theory & Algorithmi Res, Hyderabad, Andhra Pradesh, India
[2] Int Inst Informat Technol, Search & Informat Extract Lab, Hyderabad, Andhra Pradesh, India
Keywords
Document clustering; Group-average agglomerative clustering; Community detection; Similarity measure; N-gram; Wikipedia based additional knowledge;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper addresses the question: "How can we use Wikipedia-based concepts in document clustering with less human involvement, while still achieving effective improvements in results?" We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC-based (group-average agglomerative clustering) document clustering. The importance of an N-gram in a document depends on many features, including but not limited to its frequency, the position of its occurrence in a sentence, and the position in the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over bag-of-words-based state-of-the-art systems in this area.
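The abstract gives no formulas, so the following is only a minimal Python sketch of the general idea, not the authors' method. It assumes a hypothetical weighting in which each N-gram occurrence scores higher when it appears earlier in its sentence and in an earlier sentence, a weighted-Jaccard similarity as a stand-in for the paper's similarity measure, and the standard group-average criterion (mean similarity over all document pairs in the merged cluster). The function names `ngram_weights`, `weighted_similarity`, and `gaac` are illustrative, and the Wikipedia-based filtering step is omitted.

```python
import re
from collections import Counter
from itertools import combinations


def ngram_weights(text, max_n=2):
    """Assumed weighting: sum, over occurrences, of a sentence-position
    bonus plus an in-sentence position bonus (earlier scores higher)."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    weights = Counter()
    for s_idx, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence)
        sentence_bonus = 1.0 / (1 + s_idx)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                position_bonus = 1.0 / (1 + i)
                weights[" ".join(tokens[i:i + n])] += sentence_bonus + position_bonus
    return weights


def weighted_similarity(wa, wb):
    """Weighted Jaccard over N-gram weights (a stand-in measure)."""
    keys = set(wa) | set(wb)
    numerator = sum(min(wa.get(g, 0.0), wb.get(g, 0.0)) for g in keys)
    denominator = sum(max(wa.get(g, 0.0), wb.get(g, 0.0)) for g in keys)
    return numerator / denominator if denominator else 0.0


def gaac(docs, k, max_n=2):
    """Group-average agglomerative clustering down to k clusters.
    Returns clusters as sorted lists of document indices."""
    weights = [ngram_weights(d, max_n) for d in docs]
    sim = {
        (i, j): weighted_similarity(weights[i], weights[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def group_average(members):
        # Mean similarity over all distinct pairs in the merged cluster.
        pairs = list(combinations(sorted(members), 2))
        return sum(sim[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > max(k, 1):
        # Pick the pair of clusters whose merge has the highest group average.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: group_average(clusters[ab[0]] + clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: a < b, so index a is unaffected
    return [sorted(c) for c in clusters]
```

Calling `gaac(docs, k)` on a list of document strings returns `k` lists of document indices. The pairwise similarities are computed once up front, which keeps the sketch short but makes it suitable only for small collections.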
Pages: 182-187
Number of pages: 6