Effectual Web Content Mining using Noise Removal from Web Pages

Cited by: 6
Author
Sivakumar, P. [1]
Affiliation
[1] KSR Coll Engn Autonomous, Dept CSE, Namakkal, Tamil Nadu, India
Keywords
Web mining; Web content mining; Web cleaning; Duplicate blocks; Keyword redundancy; Linkword percentage; Titleword relevancy;
DOI
10.1007/s11277-015-2596-7
Chinese Library Classification
TN [Electronic technology, communication technology];
Discipline classification code
0809;
Abstract
Web mining is an emerging research area owing to the rapid growth of websites. It is classified into Web Content Mining (WCM), Web Usage Mining and Web Structure Mining. WCM is the extraction of required information from web page content available on the World Wide Web (WWW). WCM is further divided into two categories: the first mines the content of documents directly, and the second mines content using a search engine. The mining method focuses on information extraction and integration. Web content may be text, image, audio or video. Web pages typically contain a large amount of information that is not part of their main content, such as banner advertisements, navigation bars and copyright notices. Such noise usually leads to poor results in Web mining. This paper addresses noise-free information retrieval on web pages, that is, the automatic pre-processing of Web pages to detect and eliminate noise. It proposes an approach for eliminating noise from web pages in order to improve the accuracy and efficiency of web content mining. The main objective of removing noise from a Web page is to improve search performance. It is essential to differentiate important information from noisy content that may mislead users. The approach removes the following kinds of noise in stages: (1) primary noise: navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements, and other uninteresting data such as audio, video and multiple links; (2) duplicate content; and (3) noise content identified according to block importance. These are removed in three operations. First, the Block Splitting operation removes primary noise and partitions only the useful text content into blocks. Second, the simhash algorithm removes duplicate blocks to obtain the distinct blocks. Then, for each block, three parameters are calculated: Keyword Redundancy (KR), Linkword Percentage (LP) and Titleword Relevancy (TR). From these three parameters the block importance (BI) value is calculated, which the paper also calls the Simhash algorithm. Based on a threshold value, the important blocks are selected using a sketching algorithm, and the keywords are extracted from those important blocks.
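The duplicate-block removal step can be illustrated with a minimal sketch of simhash-based near-duplicate filtering. This is not the paper's implementation: the abstract does not give the tokenization, hash function, fingerprint width or distance threshold, so the word-level tokens, MD5-derived 64-bit hashes, the `max_distance` value of 3, and the function names `simhash`, `hamming_distance` and `distinct_blocks` are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption; not stated in the abstract)


def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint computed over word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def distinct_blocks(blocks, max_distance=3):
    """Keep the first block of each near-duplicate group, in input order.

    max_distance is a hypothetical threshold: blocks whose fingerprints
    differ in at most this many bits are treated as duplicates.
    """
    kept = []  # (fingerprint, block) pairs retained so far
    for block in blocks:
        fp = simhash(block)
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, block))
    return [block for _, block in kept]
```

Passing a list of text blocks to `distinct_blocks` returns the blocks with exact and near duplicates dropped. The pairwise comparison here is quadratic in the number of blocks, which is acceptable for the handful of blocks on a single page; a larger corpus would normally index fingerprints instead.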
Pages: 99-121
Page count: 23