The hybrid representation model for web document classification

Cited by: 10

Authors:

Markov, A. [1]

Last, M. [1]

Kandel, A. [2]

Affiliations:

[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel

[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA

Source:

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS | 2008 / Vol. 23 / Issue 06

Keywords:

DOI:

10.1002/int.20290

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that the eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naive Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy, and in addition, achieve a significant reduction in the classification time. (c) 2008 Wiley Periodicals, Inc.
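
The abstract's point that a vector space representation discards word order, while a graph representation can keep it, can be illustrated with a toy example. The sketch below is an illustration only and is not the authors' method: the function names `bag_of_words` and `word_graph` are hypothetical, and the graph is reduced to a set of directed edges between adjacent words, which is far simpler than the graph model the article builds on.

```python
from collections import Counter


def bag_of_words(text):
    """Vector-space view: term -> frequency; word order is discarded."""
    return Counter(text.lower().split())


def word_graph(text):
    """Simplified graph view (an assumption for illustration):
    one directed edge (a, b) for every pair of adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


doc_a = "dog bites man"
doc_b = "man bites dog"

# Same terms with the same counts, so the two term vectors are identical.
same_vector = bag_of_words(doc_a) == bag_of_words(doc_b)

# The adjacency edges point in different directions, so the graphs differ.
same_graph = word_graph(doc_a) == word_graph(doc_b)

print(same_vector, same_graph)
```

Under these assumptions the two documents are indistinguishable as term vectors but distinct as graphs, which is the kind of structural information the abstract says the vector space model loses.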


Pages: 654-679

Page count: 26
