A semi-structured document model for text mining

Authors
Jianwu Yang
Xiaoou Chen
Affiliation
[1] Peking University, National Key Laboratory for Text Processing, Institute of Computer Science and Technology
Keywords
semi-structured document; XML; text mining; vector space model; structured link vector model
DOI
Not available
Abstract
A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can also be exploited. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which each document is represented by a vector whose elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two steps of the K-means procedure: calculating document similarity and calculating the cluster center. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65–0.73 to 0.82–0.86.
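The abstract names two K-means steps, document similarity and cluster center, but does not give the SLVM weighting itself. The sketch below therefore illustrates those two steps only for the conventional vector space model that the paper uses as its baseline, under the assumption of plain term-frequency vectors and cosine similarity. The function names (`term_vector`, `cosine_similarity`, `cluster_center`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (conventional VSM, not SLVM)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def cluster_center(vectors):
    """Cluster center as the element-wise mean of the member vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds the counts term by term
    n = len(vectors)
    return {t: w / n for t, w in total.items()}
```

In a K-means loop, each document would be assigned to the center with the highest `cosine_similarity`, and each center would then be recomputed with `cluster_center`. According to the abstract, SLVM would additionally let document structure and neighboring documents influence the vector elements; that part is not reproduced here.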
Pages: 603-610
Number of pages: 7
Source
Journal of Computer Science and Technology, 2002, 17(5): 603-610