Quantitative evaluation of web metrics for automatic genre classification of web pages

Cited by: 4
Authors
Malhotra R. [1]
Sharma A. [2]
Affiliations
[1] Department of Computer Science, Delhi Technological University, Bawana Road, Delhi
[2] Department of Planning, Monitoring and Evaluation, CSIR-National Physical Laboratory, Dr K S Krishnan Marg, New Delhi
Keywords
Entertainment websites; Machine learning; Web genre classification; Web metrics
DOI
10.1007/s13198-017-0629-1
Abstract
Introducing a genre class for each web page adds a dimension that helps a web search engine respond swiftly and relevantly. Web genre classification distinguishes between pages by features such as functionality, style, presentation layout, form and meta-content rather than by content. In this work, nineteen web metrics are identified according to the lexical, structural and functionality attributes of the web page rather than its topic. The study determines which of these attributes (lexical, structural and functionality), or which of their combinations, are significant for developing a web genre classification model. We also investigate the best web genre prediction model using parametric (Logistic Regression), non-parametric (Decision Tree) and ensemble (Bagging, Boosting) machine learning algorithms. We built forty-two genre classification models to classify web pages into the Movie, TV or Music genre, using a data sample extracted from Pixel Awards nominated and award-winning websites. The area under the curve analysis of these forty-two models shows that the ensemble algorithms provide better performance. The remaining models perform acceptably only when the lexical and structural attributes are fed in combination. Functionality metrics were found to degrade performance considerably, irrespective of the algorithm used. Overall, the results indicate that machine learning models can predict web genre, provided the input metrics are chosen appropriately. © 2017, The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Luleå University of Technology, Sweden.
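As a rough illustration of the evaluation design described in the abstract, the sketch below enumerates the non-empty combinations of the three attribute groups and computes an area under the ROC curve for a binary case (for example, one genre versus the rest). The function names `group_combinations` and `auc`, the one-versus-rest framing, and the sample scores are assumptions made for illustration; the abstract does not state how the forty-two models were arranged or how the area under the curve was computed for three genres.

```python
from itertools import combinations

# Attribute groups named in the abstract. The assignment of the nineteen
# individual metrics to these groups is not given there, so only the
# group names are used here.
GROUPS = ("lexical", "structural", "functionality")


def group_combinations(groups=GROUPS):
    """Return every non-empty combination of attribute groups (hypothetical helper)."""
    return [combo
            for size in range(1, len(groups) + 1)
            for combo in combinations(groups, size)]


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score, counting ties as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for combo in group_combinations():
        print(" + ".join(combo))
    # Made-up scores for illustration only; these are not results from the paper.
    print(auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2]))
```

Each combination would correspond to one input feature set fed to each learning algorithm, with the area under the curve used to compare the resulting models.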
Pages: 1567-1579
Number of pages: 12