CLASSIFYING WEB PAGES BY GENRE

Cited by: 0
Authors
Mason, Jane E. [1]
Shepherd, Michael [1]
Duffy, Jack [1]
Affiliation
[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada
Keywords
Information retrieval; Web genre classification; Web page genres; Web page representation; n-gram analysis
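DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract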
The research reported in this paper is part of a larger project on the automatic classification of Web pages by their genres, using a distance function classification model. In this paper, we investigate the effect of several commonly used data preprocessing steps, explore the use of byte and word n-grams, and test our classification model on three Web page data sets. Our approach is to represent each Web page by a profile that is composed of fixed-length n-grams and their normalized frequencies within the document. Similarly, each of the genres in a data set is represented by a profile that is constructed by combining the n-gram profiles for each exemplar Web page of that genre, forming a centroid profile for each Web page genre. We use a distance function approach to determine the similarity between two profiles, assigning each Web page the label of the genre profile to which its profile is most similar. Our results compare very favorably to those of other researchers.
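The pipeline described in the abstract (n-gram profiles with normalized frequencies, one centroid profile per genre, nearest-profile labeling) can be illustrated with a minimal sketch. The abstract does not specify the distance function, how exemplar profiles are combined, or the value of n, so the choices below are assumptions made for illustration only: frequencies are averaged to form each centroid, the distance is a sum of squared relative differences, and n defaults to 3. The function names and the toy training pages are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List

Profile = Dict[Hashable, float]


def ngram_profile(text: str, n: int = 3, unit: str = "byte") -> Profile:
    """Map each fixed-length n-gram to its count divided by the total n-gram count."""
    # Byte n-grams slide over the UTF-8 bytes; word n-grams over whitespace tokens.
    seq = text.encode("utf-8") if unit == "byte" else tuple(text.split())
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def centroid_profile(profiles: List[Profile]) -> Profile:
    """Combine exemplar profiles by averaging each n-gram's frequency (assumed rule)."""
    sums = defaultdict(float)
    for profile in profiles:
        for gram, freq in profile.items():
            sums[gram] += freq
    return {gram: s / len(profiles) for gram, s in sums.items()}


def distance(p: Profile, q: Profile) -> float:
    """Sum of squared relative differences over the union of n-grams (assumed measure)."""
    total = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one of the two profiles.
        total += ((a - b) / ((a + b) / 2)) ** 2
    return total


def classify(page: str, centroids: Dict[str, Profile], n: int = 3) -> str:
    """Return the genre whose centroid profile is closest to the page's profile."""
    profile = ngram_profile(page, n)
    return min(centroids, key=lambda genre: distance(profile, centroids[genre]))


# Hypothetical toy exemplars; real experiments would use labeled Web page collections.
training = {
    "faq": [
        "Q: How do I reset my password? A: Click the reset link.",
        "Q: How do I change my email? A: Open the settings page.",
    ],
    "shop": [
        "Add to cart. Price: $10. Free shipping on all orders.",
        "Buy now. Price: $25. Add to cart and check out today.",
    ],
}
centroids = {
    genre: centroid_profile([ngram_profile(page) for page in pages])
    for genre, pages in training.items()
}
print(classify("Q: How do I delete my account? A: Click the delete button.", centroids))
```

Passing `unit="word"` to `ngram_profile` switches the sketch from byte n-grams to word n-grams, the two representations the abstract says were explored.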
Pages: 651-658
Number of pages: 8
Related papers
50 in total
  • [1] Classifying Web Pages by Genre: An n-gram Based Approach
    Mason, Jane E.
    Shepherd, Michael
    Duffy, Jack
    [J]. 2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 1, 2009, : 458 - 465
  • [2] CLASSIFYING WEB PAGES WITH VISUAL FEATURES
    de Boer, Viktor
    van Someren, Maarten
    Lupascu, Tiberiu
    [J]. WEBIST 2010: PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS AND TECHNOLOGY, VOL 1, 2010, : 245 - 252
  • [3] Enhance Web Pages Genre Identification Using Neighboring Pages
    Zhu, Jia
    Zhou, Xiaofang
    Fung, Gabriel
    [J]. WEB INFORMATION SYSTEMS ENGINEERING - WISE 2011, 2011, 6997 : 282 - +
  • [4] Classifying web pages using adaptive ontology
    Noh, S
    Seo, H
    Choi, J
    Choi, K
    Jung, G
    [J]. 2003 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-5, CONFERENCE PROCEEDINGS, 2003, : 2144 - 2149
  • [5] Applying semantic links for classifying Web pages
    Choi, B
    Guo, Q
    [J]. DEVELOPMENTS IN APPLIED ARTIFICIAL INTELLIGENCE, 2003, 2718 : 148 - 153
  • [6] Micro Genre: Building Block of Web Pages
    Kudelka, Milos
    Snasel, Vaclav
    Horak, Zdenek
    Abraham, Ajith
    [J]. NDT: 2009 FIRST INTERNATIONAL CONFERENCE ON NETWORKED DIGITAL TECHNOLOGIES, 2009,
  • [7] Recognition of pornographic web pages by classifying texts and images
    Hu, Weiming
    Wu, Ou
    Chen, Zhouyao
    Fu, Zhouyu
    Maybank, Steve
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2007, 29 (06) : 1019 - 1034
  • [8] Quantitative evaluation of web metrics for automatic genre classification of web pages
    Malhotra R.
    Sharma A.
    [J]. International Journal of System Assurance Engineering and Management, 2017, 8 (Suppl 2) : 1567 - 1579
  • [9] Transposition of the cocitation method with a view to classifying web pages
    Prime-Claverie, C
    Beigbeder, M
    Lafouge, T
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2004, 55 (14) : 1282 - 1289
  • [10] Training the genre classifier for automatic classification of web pages
    Vidulin, Vedrana
    Lustrek, Mitja
    Gams, Matjaz
    [J]. PROCEEDINGS OF THE ITI 2007 29TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY INTERFACES, 2007, : 93 - +