An Effective and Discriminative Feature Learning for URL based Web Page Classification

Cited by: 5
Authors
Rajalakshmi, R. [1 ]
Aravindan, Chandrabose [2 ]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn, Chennai, Tamil Nadu, India
[2] SSN Coll Engn, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India
DOI
10.1109/SMC.2018.00240
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The ever-growing World Wide Web results in a large volume of web pages covering a variety of topics. Many applications, such as information filtering and focused crawling, demand large-scale topic classification of web pages. A URL-based approach to classifying web pages is proposed, which avoids downloading the contents of a page for classification purposes. In this paper, an automated way of learning a category-specific universal dictionary of discriminating URL features is proposed. With this automatically learnt dictionary, the feature vector dimensionality is independent of the training set, which overcomes the difficulty of handling large-scale data. The dictionary was constructed from the publicly available ODP dataset. The proposed approach was evaluated by applying the automatically learnt URL feature dictionaries to another dataset that contains search results from Google. Experiments show that macro-averaged precision, recall and F1 values of 0.93, 0.85 and 0.88, respectively, were achieved. We observed that the difference is not statistically significant when the universal dictionary is applied instead of a dataset-specific term dictionary.
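The idea of a feature vector whose dimensionality does not depend on the training set can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary terms, the category names, and the helpers `tokenize_url` and `url_to_vector` are hypothetical and chosen only for illustration. The sketch assumes a dictionary of terms per category has already been learnt, and maps any URL to a count vector over that fixed vocabulary.

```python
import re

# Hypothetical category-specific dictionary; the terms are illustrative only
# and are not taken from the paper or from the ODP dataset.
DICTIONARY = {
    "sports": ["football", "league", "score"],
    "health": ["clinic", "diet", "medical"],
}

# Fixed, ordered vocabulary: its length sets the feature dimensionality,
# regardless of which URLs are later classified.
VOCABULARY = sorted({term for terms in DICTIONARY.values() for term in terms})
INDEX = {term: i for i, term in enumerate(VOCABULARY)}


def tokenize_url(url):
    """Split a URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", url.lower())


def url_to_vector(url):
    """Count dictionary terms in the URL; tokens outside the dictionary are ignored."""
    vector = [0] * len(VOCABULARY)
    for token in tokenize_url(url):
        if token in INDEX:
            vector[INDEX[token]] += 1
    return vector
```

Because unseen tokens are simply ignored, every URL yields a vector of the same length, which could then be passed to any standard classifier.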
Pages: 1374-1379
Number of pages: 6