An Effective and Discriminative Feature Learning for URL based Web Page Classification

Cited by: 5
Authors
Rajalakshmi, R. [1]
Aravindan, Chandrabose [2]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn, Chennai, Tamil Nadu, India
[2] SSN Coll Engn, Dept Comp Sci & Engn, Chennai, Tamil Nadu, India
DOI
10.1109/SMC.2018.00240
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The ever-growing World Wide Web produces a large volume of web pages on a variety of topics. Many applications, such as information filtering and focused crawling, demand large-scale topic classification of web pages. To classify web pages, a URL-based approach is proposed, which avoids downloading the contents of a page for classification. In this paper, an automated way of learning a category-specific universal dictionary of discriminating URL features is proposed. With this automatically learnt dictionary, the feature vector dimensionality becomes independent of the training set, which overcomes the difficulty of handling large-scale data. The dictionary was constructed from the publicly available ODP dataset. The proposed approach was evaluated by applying the automatically learnt URL feature dictionaries to another dataset that contains search results from Google. Experiments show that macro-average precision, recall and F1 values of 0.93, 0.85 and 0.88 were achieved. We observed that the difference is not statistically significant when the universal dictionary is applied instead of a dataset-specific term dictionary.
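The idea of a fixed, category-specific dictionary of URL features can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tokenizer, the stop list, the scoring rule (share of a token's occurrences that fall in one category), the `top_k` cutoff, the function names and the toy URLs. It only shows why the feature vector length depends on the dictionary rather than on the training set.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list of URL boilerplate tokens (an assumption, not from the paper).
STOP = {"http", "https", "www", "com", "org", "net", "html", "htm"}

def tokenize(url):
    """Split a URL into lowercase alphabetic tokens, dropping boilerplate and short tokens."""
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in STOP and len(t) > 2]

def learn_dictionary(labeled_urls, top_k=2):
    """Select, per category, the top_k tokens that occur mostly in that category."""
    per_cat = defaultdict(Counter)  # category -> token -> number of URLs containing it
    total = Counter()               # token -> number of URLs containing it overall
    for url, cat in labeled_urls:
        for tok in set(tokenize(url)):
            per_cat[cat][tok] += 1
            total[tok] += 1
    dictionary = []
    for cat in sorted(per_cat):
        counts = per_cat[cat]
        # Rank by in-category share, then by raw count, then alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t] / total[t], -counts[t], t))
        for tok in ranked[:top_k]:
            if tok not in dictionary:
                dictionary.append(tok)
    return dictionary

def featurize(url, dictionary):
    """Map a URL to a binary vector whose length equals the dictionary size."""
    present = set(tokenize(url))
    return [1 if term in present else 0 for term in dictionary]

# Toy training data (hypothetical URLs and labels).
train = [
    ("http://www.example.com/sports/football-news.html", "sports"),
    ("http://sports.example.org/football/scores", "sports"),
    ("http://www.example.com/health/nutrition-tips", "health"),
    ("http://health.example.net/nutrition/diet", "health"),
]
dictionary = learn_dictionary(train, top_k=2)
vector = featurize("https://news.example.com/football/transfer-rumours", dictionary)
print(dictionary, vector)
```

Once the dictionary is learnt, any URL from any dataset maps to a vector of the same length, so a classifier trained on one corpus can in principle be applied to another without rebuilding the vocabulary.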
Pages: 1374-1379
Page count: 6