Extraction of core web content from web pages using noise elimination

Cited by: 0
Authors
Saravanan A. [1]
Bama S.S. [2]
Affiliations
[1] School of Computing Science, Sree Saraswathi Thyagaraja College, Tamil Nadu
[2] Coimbatore, Tamil Nadu
Keywords
Modified simhash algorithm; Near duplicates removal; Noise removal; Tag analysis
DOI
10.25103/jestr.134.17
Chinese Library Classification (CLC) number
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
With the emergence of new technologies, the Web has recently become one of the most powerful digital tools available to mankind. As the Web grows rapidly, extracting content of interest from it has become a major challenge. Retrieved web pages also contain many uninteresting content blocks that are of no use to the user and that degrade the performance of content extraction. These blocks, which include advertisements, banners, copyright notices, and navigation bars, are commonly called web page noise. Removing this noise is considered the primary pre-processing task. This paper presents an approach that eliminates noise and near duplicates in order to extract the significant content of a web page. The proposed method has three steps. First, the web page is divided into blocks, and the blocks considered to be noise are removed using tag analysis and the Document Object Model (DOM) tree. Second, redundant blocks are eliminated by computing fingerprints with a modified simhash algorithm and a proximity measure. Third, parameters such as Titlewords, Linkwords, and Contentwords are extracted from the remaining distinct blocks, and a score is computed for each block using a weighted block scoring mechanism; the blocks with the highest scores are extracted as the core content of the web page. Experimental analysis shows that the proposed method eliminates noise efficiently. © 2020 School of Science.
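The second and third steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record gives no details of the modified simhash, the proximity measure, or the scoring weights, so the code below uses the standard simhash with a Hamming-distance threshold and a simple linear score. The function names, the threshold `max_distance=3`, and the weights `w_title`, `w_content`, and `w_link` are all assumptions chosen for illustration.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a text block."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Standard simhash (not the paper's modified variant):
    every token votes +1 or -1 on each bit of the fingerprint."""
    votes = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3):
    """Keep a block only if its fingerprint is far from every kept one.
    max_distance is an assumed threshold, not a value from the paper."""
    kept, prints = [], []
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, seen) > max_distance for seen in prints):
            kept.append(block)
            prints.append(fp)
    return kept


def block_score(text, link_text, title,
                w_title=2.0, w_content=1.0, w_link=1.0):
    """Hypothetical weighted score: reward words shared with the page
    title and overall content words, penalize words inside links."""
    title_vocab = set(tokens(title))
    content = tokens(text)
    title_hits = sum(1 for tok in content if tok in title_vocab)
    link_count = len(tokens(link_text))
    return w_title * title_hits + w_content * len(content) - w_link * link_count
```

In this sketch a block would be scored after deduplication, and the highest-scoring blocks would be taken as the core content; how many blocks to keep is left open because the record does not say.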
Pages: 173-187
Number of pages: 14