Unsupervised classification of text-centric XML document collections

Authors
Doucet, Antoine [1 ,2 ]
Lehtonen, Miro [2 ]
Affiliations
[1] INRIA, IRISA, F-35042 Rennes, France
[2] Univ Helsinki, Dept Comp Sci, FIN-00014 Helsinki, Finland
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper addresses the unsupervised classification of text-centric XML documents. In the context of the INEX mining track 2006, we present methods that exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and found that a promising direction is to use structural information as a preliminary step to detect and set aside structural outliers. This approach improves the semantic quality of the clustering significantly more than combining the structural and textual feature sets does. The paper also discusses the problem of evaluating XML clustering. Currently, in the INEX mining track, XML clustering techniques are evaluated against semantic categories. We believe there is a mismatch between the task (to exploit the document structure) and the evaluation, which disregards structural aspects. As an illustration, over all the clustering track submissions, our text-based runs ranked 1st (Wikipedia collection, out of 7) and 2nd (IEEE collection, out of 13).
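The two-stage idea in the abstract (set structural outliers aside first, then run k-means on textual features) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how outliers are detected or which features are used. The element-count criterion, the `factor` threshold, the term-frequency vectors, the first-k centroid seeding, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def split_structural_outliers(docs, factor=3.0):
    """Separate documents whose element count is far from the median.

    Assumption: a "structural outlier" is approximated by element count only;
    the abstract does not specify the actual criterion.
    """
    med = median(len(d["tags"]) for d in docs)
    inliers, outliers = [], []
    for d in docs:
        n = len(d["tags"])
        is_outlier = n > factor * med or n * factor < med
        (outliers if is_outlier else inliers).append(d)
    return inliers, outliers


def tf_vectors(docs):
    """Build term-frequency vectors over a shared, sorted vocabulary."""
    counts = [Counter(d["text"].lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids are seeded with the first k vectors
    (an assumption that keeps this sketch deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy collection: "tags" lists element names, "text" is the content.
docs = [
    {"tags": ["article", "sec", "p"], "text": "xml retrieval index"},
    {"tags": ["article", "sec", "p"], "text": "protein gene cell"},
    {"tags": ["article", "sec", "p", "p"], "text": "xml index query"},
    {"tags": ["article", "sec", "p"], "text": "gene cell dna"},
    {"tags": ["article"] + ["row"] * 40, "text": "table table table"},
]

inliers, outliers = split_structural_outliers(docs)
labels = kmeans(tf_vectors(inliers), k=2)
print(len(outliers), labels)
```

In this toy collection the table-heavy document is set aside before clustering, so it cannot distort the text-based clusters formed from the remaining four documents.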
Pages: 497-509
Page count: 13