Unsupervised classification of text-centric XML document collections

Authors
Doucet, Antoine [1, 2]
Lehtonen, Miro [2]
Affiliations
[1] INRIA, IRISA, F-35042 Rennes, France
[2] Univ Helsinki, Dept Comp Sci, FIN-00014 Helsinki, Finland
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper addresses the unsupervised classification of text-centric XML documents. In the context of the INEX 2006 mining track, we present methods that exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and found that a promising direction is to use structural information as a preliminary means to detect and set aside structural outliers. This approach improves the semantic quality of the clustering significantly more than combining the structural and textual feature sets does. The paper also discusses the problem of evaluating XML clustering. Currently, in the INEX mining track, XML clustering techniques are evaluated against semantic categories. We believe there is a mismatch between the task (to exploit the document structure) and the evaluation, which disregards structural aspects. As an illustration, among all clustering track submissions, our text-based runs ranked 1st (Wikipedia collection, out of 7) and 2nd (IEEE collection, out of 13).
Pages: 497-509
Page count: 13