Clustering web documents using hierarchical representation with multi-granularity

Cited by: 11
Authors
Huang, Faliang [1]
Zhang, Shichao [2, 5]
He, Minghua [3]
Wu, Xindong [4]
Affiliations
[1] Fujian Normal Univ, Fac Software, Fuzhou 350007, Peoples R China
[2] Guangxi Normal Univ, Coll Comp Sci & IT, Guilin 541004, Peoples R China
[3] Aston Univ, Birmingham B4 7ET, Aston Triangle, England
[4] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[5] Univ Technol Sydney, Fac Engn & Informat Technol, Broadway, NSW 2007, Australia
Funding
Australian Research Council
Keywords
web document clustering; hierarchical representation; multi-granularity; information granulation
DOI
10.1007/s11280-012-0197-x
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity may lead to unsatisfactory clustering results with "false correlation". To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
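The zero-valued similarity problem mentioned in the abstract can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes binary term sets per paragraph, a toy corpus, and a tolerance-rough-set style expansion as it is commonly described in the literature (a term's tolerance class holds the terms that co-occur with it in at least `theta` paragraphs, and a paragraph is replaced by its upper approximation). All function names, the corpus, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two binary term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def tolerance_classes(paragraphs, theta):
    """Map each term to itself plus every term co-occurring with it
    in at least `theta` paragraphs (assumed tolerance relation)."""
    cooc = Counter()
    for terms in paragraphs:
        for pair in combinations(sorted(terms), 2):
            cooc[pair] += 1
    vocab = set().union(*paragraphs)
    classes = {t: {t} for t in vocab}
    for (u, v), n in cooc.items():
        if n >= theta:
            classes[u].add(v)
            classes[v].add(u)
    return classes


def upper_approximation(terms, classes):
    """All terms whose tolerance class overlaps the paragraph's term set."""
    return {t for t, cls in classes.items() if cls & terms}


# Toy corpus (hypothetical): each paragraph is a set of terms.
paragraphs = [
    {"car", "automobile", "engine"},
    {"car", "automobile", "wheel"},
    {"car", "wheel"},
    {"automobile", "motor"},
]

classes = tolerance_classes(paragraphs, theta=2)

# Paragraphs 2 and 3 share no term, so their raw similarity is zero.
sim_before = cosine(paragraphs[2], paragraphs[3])

# After expansion both contain "car" and "automobile".
sim_after = cosine(
    upper_approximation(paragraphs[2], classes),
    upper_approximation(paragraphs[3], classes),
)

print(sim_before, round(sim_after, 3))
```

In this toy setting the two paragraphs have no term in common, so the sparse representation gives a similarity of exactly zero even though "car" and "automobile" are clearly related; the expansion recovers a non-zero similarity. An ontology-based strategy would play the same role by drawing related terms from an external resource instead of co-occurrence counts.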
Pages: 105-126
Number of pages: 22
Related papers
50 in total
  • [21] A review on network representation learning with multi-granularity perspective
    Fu, Shun
    Wang, Lufeng
    Yang, Jie
    INTELLIGENT DATA ANALYSIS, 2024, 28 (01) : 3 - 32
  • [22] Multi-granularity semantic representation model for relation extraction
    Ming Lei
    Heyan Huang
    Chong Feng
    Neural Computing and Applications, 2021, 33 : 6879 - 6889
  • [23] Multi-granularity Visualization of Trajectory Clusters using Sub-trajectory Clustering
    Chang, Cheng
    Zhou, Baoyao
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 577 - 582
  • [24] Effect of Multi-word Features on the Hierarchical Clustering of Web Documents
    Karthick, S.
    Shalinie, S. Mercy
    Eswarimeena, A. R.
    Madhumitha, P.
    Abhinaya, T. Naga
    2014 INTERNATIONAL CONFERENCE ON RECENT TRENDS IN INFORMATION TECHNOLOGY (ICRTIT), 2014,
  • [25] A Novel Indexing Technique for Web Documents using Hierarchical Clustering
    Gupta, Deepti
    Bhatia, Komal Kumar
    Sharma, A. K.
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2009, 9 (09) : 168 - 175
  • [26] Multi-granularity Hierarchical Attention Siamese Network for Visual Tracking
    Chen, Xing
    Zhang, Xiang
    Tan, Huibin
    Lan, Long
    Luo, Zhigang
    Huang, Xuhui
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [27] Hierarchical multi-granularity classification based on bidirectional knowledge transfer
    Jiang, Juan
    Yang, Jingmin
    Zhang, Wenjie
    Zhang, Hongbin
    MULTIMEDIA SYSTEMS, 2024, 30 (04)
  • [28] Multi-Granularity Ensemble Classification Algorithm Based on Attribute Representation
    Zhang Q.-H.
    Zhi X.-C.
    Wang G.-Y.
    Yang F.
    Xue F.-Z.
    Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45 (08) : 1712 - 1729
  • [29] A Text Vector Representation Model Merging Multi-Granularity Information
    Nie W.
    Chen Y.
    Ma J.
    Data Analysis and Knowledge Discovery, 2019, 3 (09) : 45 - 52
  • [30] A multidimensional approach to the representation of the spatio-temporal multi-granularity
    Gascuena, Concepcion M.
    Cuadra, Dolores
    Martinez, Paloma
    ICEIS 2006: PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATIONAL SYSTEMS: DATABASES AND INFORMATION SYSTEMS INTEGRATION, 2006, : 175 - +