Extraction of core web content from web pages using noise elimination

Cited by: 0
Authors
Saravanan A. [1]
Bama S.S. [2]
Affiliations
[1] School of Computing Science, Sree Saraswathi Thyagaraja College, Tamil Nadu
[2] Coimbatore, Tamil Nadu
Keywords
Modified simhash algorithm; Near duplicates removal; Noise removal; Tag analysis
DOI
10.25103/jestr.134.17
Chinese Library Classification
TQ [Chemical industry]
Discipline classification code
0817
Abstract
Owing to rapid technological development, the Web has recently evolved into the most powerful digital weapon for mankind. As the size of the Web increases rapidly, extracting the interesting content from it has become a major challenge. Retrieved web pages also contain many uninteresting content blocks that are of no use to the user and that degrade the performance of content extraction. These uninteresting blocks include advertisements, banners, copyright notices, navigation bars and so on, and are commonly referred to as web page noise. Removing this noise from web pages is considered the primary task in pre-processing. This paper presents an approach that eliminates noise and near duplicates in order to extract the significant content from a web page. The proposed method has three steps. First, the web page is divided into blocks, and the blocks considered to be noise are removed using tag analysis and the Document Object Model (DOM) tree. Second, redundant blocks are eliminated by computing fingerprints with a modified simhash algorithm and a proximity measure. Third, several parameters, such as Titlewords, Linkwords and Contentwords, are extracted from the remaining distinct blocks, and a score is computed for each block using a weighted block scoring mechanism. The blocks with the highest scores are selected, yielding the core content of the web page. An experimental analysis has been performed, and the results show that the proposed method eliminates noise efficiently. © 2020 School of Science.
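The second and third steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how simhash is modified, which proximity measure is used, or how the block scores are weighted. The sketch therefore assumes the standard simhash fingerprint, Hamming distance as the proximity measure, and a hypothetical linear score. The function names, the `max_distance` threshold and the weights `w_title`, `w_content` and `w_link` are all illustrative assumptions.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric word tokens of a block's text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Standard (unmodified) simhash: every token votes +1/-1 per bit."""
    votes = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits; assumed here as the proximity measure."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3):
    """Keep the first block of each near-duplicate group.

    max_distance is a hypothetical threshold, not a value from the paper.
    """
    kept, fingerprints = [], []
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(block)
            fingerprints.append(fp)
    return kept


def block_score(text, link_text, title,
                w_title=0.5, w_content=0.3, w_link=0.2):
    """Hypothetical weighted score built from the three parameters.

    Titlewords   -> block words that also occur in the page title
    Linkwords    -> words of the block that sit inside anchor text
    Contentwords -> remaining (non-link) words of the block
    The weights and the linear form are assumptions for illustration.
    """
    words = tokens(text)
    if not words:
        return 0.0
    title_set = set(tokens(title))
    title_words = sum(1 for w in words if w in title_set)
    link_words = len(tokens(link_text))
    content_words = max(len(words) - link_words, 0)
    raw = (w_title * title_words
           + w_content * content_words
           - w_link * link_words)
    return raw / len(words)  # normalise by block length
```

In this sketch a block would be passed to `drop_near_duplicates` first, and the surviving blocks ranked by `block_score`; link-heavy blocks such as navigation bars score low because Linkwords are penalised, while blocks sharing vocabulary with the page title score high.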
Pages: 173-187
Page count: 14