The hybrid representation model for web document classification

Cited by: 10
Authors
Markov, A. [1 ]
Last, M. [1 ]
Kandel, A. [2 ]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
DOI
10.1002/int.20290
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time. (c) 2008 Wiley Periodicals, Inc.
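The abstract's central contrast can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify how graphs are built or which graph features the hybrid models use. The sketch assumes the simplest possible graph, one directed edge per pair of adjacent words, and treats each edge as a binary vector attribute. The function names and the two toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Vector space view: term counts only, word order is discarded."""
    return Counter(text.lower().split())

def adjacency_edges(text):
    """Graph view (assumed for illustration): one directed edge per adjacent word pair."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

def edge_feature_vector(text, vocabulary):
    """Hybrid view (assumed): binary vector marking which graph edges occur in the document."""
    edges = adjacency_edges(text)
    return [1 if edge in edges else 0 for edge in vocabulary]

# Two toy documents with the same words in a different order.
doc_a = "dog bites man"
doc_b = "man bites dog"

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
edges_a = adjacency_edges(doc_a)
edges_b = adjacency_edges(doc_b)

# A shared, ordered edge vocabulary turns graph structure into fixed-length vectors.
vocabulary = sorted(edges_a | edges_b)
vec_a = edge_feature_vector(doc_a, vocabulary)
vec_b = edge_feature_vector(doc_b, vocabulary)

print(same_bag)      # the two bags of words are identical
print(vec_a, vec_b)  # the edge-based vectors differ
```

The resulting fixed-length vectors are the kind of input that model-based classifiers such as decision trees or Naive Bayes accept directly, while still reflecting word order that a plain bag of words loses.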
Pages: 654-679
Page count: 26