A Novel Approach for Content Extraction from Web Pages

Cited by: 0
Authors
Bhardwaj, Aanshi [1]
Mangat, Veenu [1]
Affiliations
[1] Panjab Univ, UIET, Chandigarh 160014, India
Keywords
Content extraction; Entropy; Document Object Model; hub and authority; ontology generation; template; Content Structure Tree; web page segmentation; Vision Based Page Segmentation; clustering; anchor text; IDENTIFICATION
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The rapid development of the internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also carry a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach for content extraction from web pages using word-to-leaf ratio and density of links.
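The abstract names two signals, word-to-leaf ratio and density of links, without defining them. The sketch below is one plausible reading and not the authors' method. It assumes that word-to-leaf ratio means the words in a block divided by the leaf elements in its DOM subtree, and that link density means the share of those words that sit inside `<a>` elements. The names `Node`, `TreeBuilder`, `score_block`, `looks_informative`, and `score_page`, the sample page, and both thresholds are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node: tag, attributes, child elements, and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree with the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data)


def count_words(node, inside_link=False):
    """Return (all words, words inside <a>) for the subtree rooted at node."""
    inside_link = inside_link or node.tag == "a"
    own = sum(len(chunk.split()) for chunk in node.text)
    total, linked = own, (own if inside_link else 0)
    for child in node.children:
        child_total, child_linked = count_words(child, inside_link)
        total += child_total
        linked += child_linked
    return total, linked


def count_leaves(node):
    """Count elements in the subtree that have no child elements."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def score_block(node):
    """Compute both signals for one block (assumed definitions, see lead-in)."""
    total, linked = count_words(node)
    return {
        "word_to_leaf": total / count_leaves(node),
        "link_density": linked / total if total else 0.0,
    }


def looks_informative(scores, min_ratio=3.0, max_link_density=0.5):
    """Hypothetical decision rule; both thresholds are arbitrary placeholders."""
    return (scores["word_to_leaf"] >= min_ratio
            and scores["link_density"] <= max_link_density)


def score_page(html):
    """Score each top-level block, keyed by its id attribute (or tag name)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return {child.attrs.get("id", child.tag): score_block(child)
            for child in builder.root.children}


SAMPLE = """
<div id="nav"><a href="/">Home</a><a href="/a">About</a><a href="/c">Contact</a></div>
<div id="main"><p>Content extraction keeps the informative text of a page.</p><p>Noise blocks are dropped.</p></div>
"""

scores = score_page(SAMPLE)
for block_id, block_scores in scores.items():
    print(block_id, block_scores, looks_informative(block_scores))
```

On the sample page, the navigation block has few words per leaf and all of its words inside links, while the main block has many words per leaf and no link text, so a rule of this kind would separate them. Real pages would need thresholds tuned on data.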
Pages: 4