Web page classification based on heterogeneous features and a combination of multiple classifiers

Authors
Li Deng
Xin Du
Ji-zhong Shen
Affiliation
[1] Zhejiang University, College of Information Science & Electronic Engineering
Keywords
Web page classification; Web page features; Combined classifiers
DOI
Not available
CLC number
TP391
Abstract
Web pages can be classified precisely by evaluating their features, and the structural features of web pages are an effective complement to their textual features. Classifiers differ in their characteristics, so several classifiers can be combined to complement one another. In this study, a web page classification method based on heterogeneous features and a combination of multiple classifiers is proposed. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, calculated from the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are then combined based on confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification result. Experimental results show that the accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon, 7-web-genres, and DMOZ datasets, respectively. Fusing the textual features with the proposed structural features gives a more comprehensive representation, and the resulting accuracy is higher than that obtained with textual features alone. Combining multiple classifiers improves the accuracy further, and it exceeds that of related web page classification algorithms.
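The abstract names three decision strategies (voting, confidence comparison, direct output) but does not say how they are ordered or thresholded. The following is only a minimal sketch of one way such a confidence-based combination could look, not the authors' algorithm. The function names `estimate_confidence` and `combine`, the `direct_threshold` value, and the order in which the strategies are tried are all assumptions made for illustration.

```python
from collections import Counter


def estimate_confidence(predict, samples, labels):
    """Confidence of one classifier, taken here as its accuracy on a
    held-out set of samples (an assumed reading of the abstract)."""
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return correct / len(samples)


def combine(predictions, confidences, direct_threshold=0.95):
    """Combine the labels predicted by several classifiers for one page.

    predictions: one predicted label per classifier
    confidences: one confidence value per classifier, same order
    direct_threshold: hypothetical cut-off, not taken from the paper
    """
    # Index of the classifier with the highest confidence.
    best = max(range(len(predictions)), key=lambda i: confidences[i])

    # Direct output (assumed rule): a sufficiently confident classifier
    # decides on its own.
    if confidences[best] >= direct_threshold:
        return predictions[best]

    # Voting (assumed rule): accept a label backed by a strict majority.
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) / 2:
        return label

    # Confidence comparison (assumed rule): otherwise trust the most
    # confident classifier.
    return predictions[best]
```

With three classifiers predicting `["shop", "news", "shop"]` and confidences `[0.8, 0.9, 0.7]`, no classifier reaches the assumed threshold, so the majority label `"shop"` is returned. If all three disagree, the label from the most confident classifier is used.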
Pages: 995-1004
Page count: 9