Extraction of core web content from web pages using noise elimination

Authors
Saravanan A. [1]
Bama S.S. [2]
Affiliations
[1] School of Computing Science, Sree Saraswathi Thyagaraja College, Tamil Nadu
[2] Coimbatore, Tamil Nadu
Keywords
Modified simhash algorithm; Near duplicates removal; Noise removal; Tag analysis
DOI
10.25103/jestr.134.17
Chinese Library Classification number
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
With rapid technological development, the Web has recently evolved into the most powerful digital tool for mankind. As the Web grows rapidly, extracting the content of interest from it has become a major challenge. Retrieved web pages contain many content blocks that are of no use to the user and that also degrade the performance of content extraction. These blocks include advertisements, banners, copyright notices, navigation bars, etc., and are commonly called web page noise. Removing this noise from web pages is considered a primary pre-processing task. This paper presents an approach that eliminates noise and near duplicates in order to extract the significant content from a web page. The proposed method has three steps. First, the web page is divided into blocks, and the blocks considered to be noise are removed using tag analysis and the Document Object Model (DOM) tree. Second, redundant blocks are eliminated by computing fingerprints using a modified simhash algorithm with a proximity measure. Third, several parameters such as Titlewords, Linkwords and Contentwords are extracted from the distinct blocks, and a score is computed for each block using a weighted block scoring mechanism. The blocks with the highest scores are extracted as the core content of the web page. Experimental analysis shows that the proposed method eliminates noise efficiently. © 2020 School of Science.
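The second and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modifications to simhash, the proximity measure, or the scoring weights, so the sketch assumes a standard 64-bit simhash, Hamming distance as the proximity measure, and a hypothetical linear score. The names `simhash`, `hamming`, `drop_near_duplicates` and `score_block`, the threshold `max_distance`, and the weights `w_title`, `w_content` and `w_link` are all illustrative assumptions.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric word tokens of a text block."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Standard simhash fingerprint (assumption: the paper's modified variant differs)."""
    weights = [0] * bits
    for tok in tokens(text):
        # Use the first 8 bytes of MD5 as a stable 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3):
    """Keep the first block of every group whose fingerprints lie within max_distance."""
    kept, fingerprints = [], []
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(block)
            fingerprints.append(fp)
    return kept


def score_block(text, link_text, title, w_title=3.0, w_content=1.0, w_link=2.0):
    """Hypothetical weighted score: reward title and content words, penalise link words."""
    title_set = set(tokens(title))
    content = tokens(text)
    title_hits = sum(1 for t in content if t in title_set)
    link_count = len(tokens(link_text))
    return w_title * title_hits + w_content * len(content) - w_link * link_count


if __name__ == "__main__":
    article = "Breaking news: the council approved the new budget today"
    promo = "Subscribe to our newsletter for weekly offers"
    distinct = drop_near_duplicates([article, article, promo])
    for block in distinct:
        print(score_block(block, "", "Council approves budget"), block)
```

Under these assumptions, a block whose words overlap the page title and contain few link words scores higher than a navigation-style block made up mostly of link text; the weights would need tuning on real pages.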
Pages: 173-187
Page count: 14