Extraction of core web content from web pages using noise elimination

Cited by: 0
Authors
Saravanan A. [1]
Bama S.S. [2]
Affiliations
[1] School of Computing Science, Sree Saraswathi Thyagaraja College, Tamil Nadu
[2] Coimbatore, Tamil Nadu
Keywords
Modified simhash algorithm; Near duplicates removal; Noise removal; Tag analysis
DOI
10.25103/jestr.134.17
Chinese Library Classification number
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
With the rapid pace of technological development, the Web has become one of the most powerful digital tools available. As the Web grows rapidly, extracting the content of interest from it has become a major challenge. Retrieved web pages also contain many content blocks that are of no use to the user and that degrade the performance of content extraction. These blocks, which include advertisements, banners, copyright notices, and navigation bars, are commonly called web page noise. Removing this noise is considered a primary pre-processing task. This paper presents an approach that eliminates noise and near duplicates in order to extract the significant content of a web page. The proposed method has three steps. First, the web page is divided into blocks, and the blocks considered to be noise are removed using tag analysis and the Document Object Model (DOM) tree. Second, redundant blocks are eliminated by computing fingerprints with a modified simhash algorithm and a proximity measure. Third, parameters such as Titlewords, Linkwords, and Contentwords are extracted from the remaining distinct blocks, a score is computed for each block using a weighted block scoring mechanism, and the blocks with the highest scores are extracted as the core content of the web page. Experimental analysis shows that the proposed method eliminates noise efficiently. © 2020 School of Science.
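The second and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard simhash rather than the paper's modified variant, takes Hamming distance as the proximity measure, and uses a hypothetical linear scoring formula. The function names, the `max_distance` threshold, and the weights `w_title`, `w_content`, and `w_link` are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Standard simhash (not the paper's modified variant).

    Each token votes on every bit position, weighted by its frequency;
    a bit of the fingerprint is set when its votes sum to a positive value.
    """
    votes = [0] * bits
    for tok, weight in Counter(tokens(text)).items():
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3):
    """Keep a block only if its fingerprint is far from every kept one.

    max_distance is an assumed threshold, not a value from the paper.
    """
    kept, fingerprints = [], []
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(block)
            fingerprints.append(fp)
    return kept


def block_score(text, link_text, title,
                w_title=3.0, w_content=1.0, w_link=1.0):
    """Hypothetical weighted score: reward title and content words,
    penalize words that appear inside links."""
    title_set = set(tokens(title))
    words = tokens(text)
    title_hits = sum(1 for w in words if w in title_set)
    link_count = len(tokens(link_text))
    return w_title * title_hits + w_content * len(words) - w_link * link_count
```

In this sketch a navigation-style block whose text consists entirely of link words scores zero, while a block that repeats words from the page title scores higher; the real weighting scheme would have to be taken from the paper itself.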
Pages: 173-187
Number of pages: 14