Cleaning web pages for effective web content mining

Authors
Li, Jing [1]
Ezeife, C. I. [1]
Affiliation
[1] Univ Windsor, Sch Comp Sci, Windsor, ON N9B 3P4, Canada
Keywords
web page cleaning; noise block; web content mining; classification; near-duplicate; text similarity
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Classifying and mining noise-free web pages improves the accuracy and speed of search results and may benefit web page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on a web page is content irrelevant to the main content being mined, such as advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, such as navigation bars. This paper proposes a system, WebPageCleaner, that eliminates noise blocks from web pages to improve the accuracy and efficiency of web content mining. A vision-based technique is employed to extract blocks from web pages. Relevant blocks are then identified as those with a high importance level, determined by analyzing physical features of the blocks such as block location, the percentage of web links in the block, and the similarity of the block's contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification than comparable existing approaches.
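
The abstract names three block features (location, share of link content, and similarity to other blocks) but does not give the scoring formula. The following is only a minimal sketch of how such features might be combined, not the WebPageCleaner implementation. The `Block` fields, the word-shingle Jaccard similarity used as the near-duplicate measure, the weights, and the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical page block; these fields are illustrative assumptions."""
    text: str        # visible text of the block
    link_chars: int  # number of characters that sit inside hyperlinks
    position: str    # coarse location label, e.g. "top", "center", "bottom"


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_ratio(block):
    """Fraction of the block's characters that are link text, capped at 1."""
    if not block.text:
        return 1.0
    return min(1.0, block.link_chars / len(block.text))


def max_similarity(block, other_blocks):
    """Highest shingle similarity between this block and blocks of other pages."""
    own = shingles(block.text)
    return max((jaccard(own, shingles(o.text)) for o in other_blocks), default=0.0)


def importance(block, other_blocks):
    """Toy importance score in [0, 1]; the weights are arbitrary assumptions."""
    location_score = 1.0 if block.position == "center" else 0.3
    return (0.4 * location_score
            + 0.3 * (1.0 - link_ratio(block))
            + 0.3 * (1.0 - max_similarity(block, other_blocks)))


def clean_page(blocks, other_blocks, threshold=0.6):
    """Keep only blocks whose importance reaches the (assumed) threshold."""
    return [b for b in blocks if importance(b, other_blocks) >= threshold]
```

Under these assumptions, a navigation bar that is almost entirely link text and closely resembles a bar on another page of the same site scores low and is dropped, while a link-free, centrally placed article block is kept. The surviving blocks would then be passed to a text classifier such as Naive Bayes.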
Pages: 560-571
Page count: 12