A Novel Approach for Content Extraction from Web Pages

Cited by: 0

Authors:

Bhardwaj, Aanshi [1]

Mangat, Veenu [1]

Affiliations:

[1] Panjab Univ, UIET, Chandigarh 160014, India

Source:

2014 RECENT ADVANCES IN ENGINEERING AND COMPUTATIONAL SCIENCES (RAECS) | 2014

Keywords:

Content extraction; Entropy; Document Object Model; hub and authority; ontology generation; template; Content Structure Tree; web page segmentation; Vision Based Page Segmentation; clustering; anchor text; IDENTIFICATION;

DOI:

Not available

Chinese Library Classification:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

The rapid development of the internet and web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also carry a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates various web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach for content extraction from web pages using word to leaf ratio and density of links.


Pages: 4

50 entries in total

[1] Extraction of core web content from web pages using noise elimination
Saravanan, A.
Bama, S. Sathya
[J]. Journal of Engineering Science and Technology Review, 2020, 13 (04) : 173 - 187
[2] A Novel Approach for Extraction and Representation of Main Data from Web Pages to Android Application
Veeraiah, D.
Ramanjaneyulu, Y. V.
Yakobu, D.
Sahithi, T.
[J]. 2016 IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2016, : 1126 - 1130
[3] Content Extraction from Web Pages Based on Chinese Punctuation Number
Song, Mingqiu
Wu, Xintao
[J]. 2007 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-15, 2007, : 5573 - 5575
[4] Authoring of Personalized Web Page from Heterogeneous Web Pages by Content Extraction and Integration
Li, Wei-gang
Sun, Ke
Wang, Shuo-chen
[J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTER NETWORKS AND COMMUNICATION TECHNOLOGY (CNCT 2016), 2016, 54 : 734 - 740
[5] LBDA: A NOVEL FRAMEWORK FOR EXTRACTING CONTENT FROM WEB PAGES
Vijendran, Anna Saro
Deepa, C.
[J]. PROCEEDINGS OF THE 2013 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING & COMMUNICATION SYSTEMS (ICACCS), 2013,
[6] A hybrid approach for extracting informative content from web pages
Uzun, Erdinc
Agun, Hayri Volkan
Yerlikaya, Tarik
[J]. INFORMATION PROCESSING & MANAGEMENT, 2013, 49 (04) : 928 - 944
[7] Extraction of web news from web pages using a ternary tree approach
Laishram, Debina
Sebastian, Merin
[J]. 2015 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING ICACCE 2015, 2015, : 628 - 633
[8] Information Extraction from Web pages
Novotny, Robert
Vojtas, Peter
Maruscak, Dusan
[J]. 2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 3, 2009, : 121 - +
[9] Content Extraction from Web Pages Based on the Row Block Semantics and Punctuations
Song, Anping
Ding, Xuehai
Li, Mingbo
Si, Wulin
Zhang, Wu
[J]. PROCEEDINGS OF THE 2013 ASIA-PACIFIC COMPUTATIONAL INTELLIGENCE AND INFORMATION TECHNOLOGY CONFERENCE, 2013, : 327 - 334
[10] An Approach to Image Extraction and Accurate Skin Detection from Web Pages
Girgis, Moheb R.
Mahmoud, Tarek M.
Abd-El-Hafeez, Tarek
[J]. PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 21, 2007, 21 : 367 - 375
