Web Page Genre Classification

Authors
Chen, Guangyu [1 ]
Choi, Ben [1 ]
Affiliation
[1] Louisiana Tech Univ, Ruston, LA 71272 USA
Keywords
Web Ontology; Semantic Web; Knowledge Classification; Web Mining; Information Retrieval;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
In this paper we present an automatic genre-based Web page classification system. Unlike subject- or topic-based classification, genre-based classification focuses on functional purpose and sorts Web pages into categories such as online shopping, technical paper, or discussion forum. Genre classification has so far remained underdeveloped because genres are subjective and because the genres themselves, the features, and even the categories are difficult to define. We define five top-level genre categories, each with several subcategories, and develop new methods to extract 31 features from Web pages to identify the categories. We analyze not only the contents of the Web pages but also their URLs, HTML tags, JavaScript, and VBScript. The resulting genre classification system achieved an average accuracy of 93%. In addition, we combined this genre classification with our subject-based classification to produce a comprehensive Web page classification system.
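As a rough illustration of the kind of non-content signals the abstract mentions (URLs, HTML tags, and embedded scripts), the following minimal sketch computes a handful of structural features from a page. It is not the authors' method: the abstract does not list the 31 features, so the feature names, the `TagCounter` class, and the `extract_features` function are all hypothetical choices made for this example, using only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts start tags and records which script languages are declared."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.script_languages = set()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "script":
            attr_map = dict(attrs)
            # Assumption: a <script> with no language/type attribute is JavaScript.
            declared = (
                attr_map.get("language") or attr_map.get("type") or "javascript"
            ).lower()
            if "vbscript" in declared:
                self.script_languages.add("vbscript")
            else:
                self.script_languages.add("javascript")


def extract_features(url, html):
    """Return a small dict of illustrative (hypothetical) genre features."""
    parser = TagCounter()
    parser.feed(html)
    parsed = urlparse(url)
    path_segments = [s for s in parsed.path.split("/") if s]
    return {
        "url_depth": len(path_segments),         # number of path segments
        "url_has_query": bool(parsed.query),     # dynamic pages often carry a query
        "num_links": parser.tags["a"],
        "num_forms": parser.tags["form"],        # forms hint at shopping or forums
        "num_inputs": parser.tags["input"],
        "num_images": parser.tags["img"],
        "has_javascript": "javascript" in parser.script_languages,
        "has_vbscript": "vbscript" in parser.script_languages,
    }
```

A feature dictionary like this could then be turned into a numeric vector and passed to any standard classifier; which classifier the authors used is not stated in the abstract.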
Pages: 2353 - 2357
Page count: 5
Related papers
  • [1] Web page genre classification
    Computer Science, Louisiana Tech University, LA 71272, United States
    [J]. Proc ACM Symp Appl Computing, : 2353 - 2357
  • [2] The Role of Word String Patterns in Chinese Web Page Genre Classification
    Wu, Yangyang
    Wu, Chukun
    [J]. IMCIC 2010: INTERNATIONAL MULTI-CONFERENCE ON COMPLEXITY, INFORMATICS AND CYBERNETICS, VOL I (POST-CONFERENCE EDITION), 2010, : 204 - 208
  • [3] What type of page is this? Genre as web descriptor
    Rosso, MA
    [J]. PROCEEDINGS OF THE 5TH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, PROCEEDINGS, 2005, : 398 - 398
  • [4] An n-gram Based Approach to Multi-labeled Web Page Genre Classification
    Mason, Jane E.
    Shepherd, Michael
    Duffy, Jack
    Keselj, Vlado
    Watters, Carolyn
    [J]. 43RD HAWAII INTERNATIONAL CONFERENCE ON SYSTEMS SCIENCES VOLS 1-5 (HICSS 2010), 2010, : 1526 - 1535
  • [5] Web page downloading and classification
    Tran, LQ
    Moon, CW
    Le, DX
    Thoma, GR
    [J]. FOURTEENTH IEEE SYMPOSIUM ON COMPUTER-BASED MEDICAL SYSTEMS, PROCEEDINGS, 2001, : 321 - 326
  • [6] Automatic Web Page Classification
    Materna, Jiri
    [J]. RASLAN 2008: RECENT ADVANCES IN SLAVONIC NATURAL LANGUAGE PROCESSING: SECOND WORKSHOP, 2008, : 84 - 93
  • [7] On Chinese web page classification
    Liang, JZ
    [J]. ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING - ICAISC 2004, 2004, 3070 : 634 - 639
  • [9] Exploiting link structure for web page genre identification
    Zhu, Jia
    Xie, Qing
    Yu, Shoou-I
    Wong, Wai Hung
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2016, 30 (03) : 550 - 575
  • [10] Web Page Segmentation with Structured Prediction and its Application in Web Page Classification
    Bing, Lidong
    Guo, Rui
    Lam, Wai
    Niu, Zheng-Yu
    Wang, Haifeng
    [J]. SIGIR'14: PROCEEDINGS OF THE 37TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2014, : 767 - 776