Framework for building a high-quality web page collection considering page group structure

Authors
Wang, Yuxin [1]
Oyama, Keizo [1]
Affiliation
[1] Natl Inst Informat, Chiyoda Ku, 2-1-2 Hitotsubashi, Tokyo 1018430, Japan
Funding
Japan Society for the Promotion of Science
Keywords
web page collection; page group model; three-way classifier; quality assurance; precision and recall
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We propose a framework for building a high-quality web page collection that takes page group structure into account, using a two-step process: rough filtering followed by accurate classification. Both steps apply the idea of local page group structure. The rough filtering comprehensively gathers all potential homepages from the web while admitting as few noise pages as possible. It uses property-based keyword lists according to four page group models that are based on the page group structure. The accurate classification uses a textual feature set for a support vector machine, composed by independently concatenating the feature subsets from the surrounding pages, which are grouped according to the page group structure. By combining a recall-assured classifier and a precision-assured classifier, we build a three-way classifier that accurately selects the pages needing manual assessment to assure the required quality. Experimental results show the effectiveness of the proposed method.
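The abstract does not say how the two classifiers are combined, so the following is only a minimal sketch under one plausible reading: a page accepted by the precision-assured classifier is treated as a confident positive, a page rejected by the recall-assured classifier is treated as a confident negative, and everything in between is routed to manual assessment. The function name `three_way_label`, its parameters, and the label strings are hypothetical and do not come from the paper.

```python
from typing import List


def three_way_label(recall_assured_accepts: bool,
                    precision_assured_accepts: bool) -> str:
    """Combine two binary decisions into one of three outcomes.

    Assumption (not stated in the abstract): the precision-assured
    classifier is checked first, then the recall-assured classifier.
    """
    if precision_assured_accepts:
        # The conservative classifier accepted the page: confident positive.
        return "positive"
    if not recall_assured_accepts:
        # Even the liberal classifier rejected the page: confident negative.
        return "negative"
    # Accepted only by the liberal classifier: needs manual assessment.
    return "manual"


def three_way_classify(recall_decisions: List[bool],
                       precision_decisions: List[bool]) -> List[str]:
    """Apply three_way_label to parallel lists of per-page decisions."""
    if len(recall_decisions) != len(precision_decisions):
        raise ValueError("decision lists must have the same length")
    return [three_way_label(r, p)
            for r, p in zip(recall_decisions, precision_decisions)]
```

Under this reading, only the pages labeled `"manual"` require human effort, which is how a combination of this kind could bound both precision and recall while keeping the assessment workload small.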
Pages: 95 / +
Number of pages: 3